COMP 170 (Sec. 008) - PROGRAM VIII

Due: Monday, 16 November 1998

Deoxyribonucleic acid (DNA) is composed of a sequence of nucleotide bases paired together to form a double-stranded helix structure. Through a series of complex biochemical processes the nucleotide sequences in an organism's DNA are translated into the proteins it requires for life. The object of this problem is to write a computer program which accepts a DNA strand and reports the protein generated, if any, from the DNA strand.

The nucleotide bases from which DNA is built are adenine, cytosine, guanine, and thymine (hereafter referred to as A, C, G, and T, respectively). These bases bond together in a chain to form half of a DNA strand. The other half of the DNA strand is a similar chain, but each nucleotide is replaced by its complementary base. The bases A and T are complementary, as are the bases C and G. These two "half-strands" of DNA are then bonded by the pairing of the complementary bases to form a strand of DNA.

Typically a DNA strand is listed by simply writing down the bases which form the primary strand (the complementary strand can always be created by writing the complements of the bases in the primary strand). For example, the sequence TACTCGTAATTCACT represents a DNA strand whose complement would be ATGAGCATTAAGTGA. Note that A is always paired with T, and C is always paired with G.

From a primary strand of DNA, a strand of ribonucleic acid (RNA) known as messenger RNA (mRNA for short) is produced in a process known as transcription. The transcribed mRNA is identical to the complementary DNA strand with the exception that thymine is replaced by a nucleotide known as uracil (hereafter referred to as U). For example, the mRNA strand for the DNA in the previous paragraph would be AUGAGCAUUAAGUGA.
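
The transcription rule above can be sketched in Java as follows. This is only an illustration, not a required design: the class name TranscriptionSketch and the method name transcribe are hypothetical, and the sketch assumes the input holds only the letters A, C, G, and T.

```java
// Hypothetical helper class; the names here are illustrative, not part of the assignment.
class TranscriptionSketch {
    // Builds the mRNA for a primary DNA strand: each base is replaced by its
    // complement, with uracil (U) standing in where the complement would be T.
    static String transcribe(String primary) {
        StringBuilder mrna = new StringBuilder(primary.length());
        for (int i = 0; i < primary.length(); i++) {
            char base = primary.charAt(i);
            switch (base) {
                case 'A': mrna.append('U'); break; // complement of A is T, written as U in mRNA
                case 'T': mrna.append('A'); break;
                case 'C': mrna.append('G'); break;
                case 'G': mrna.append('C'); break;
                default:
                    throw new IllegalArgumentException("Not a DNA base: " + base);
            }
        }
        return mrna.toString();
    }

    public static void main(String[] args) {
        System.out.println(transcribe("TACTCGTAATTCACT")); // prints AUGAGCAUUAAGUGA
    }
}
```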

It is the sequence of bases in the mRNA which determines the protein that will be synthesized. The bases in the mRNA can be viewed as a collection of codons, each codon having exactly three bases. The codon AUG marks the start of a protein sequence, and any of the codons UAA, UAG, or UGA marks the end of the sequence. The one or more codons between the start and termination codons represent the sequence of amino acids to be synthesized to form a protein. For example, the mRNA codon AGC corresponds to the amino acid serine (Ser), AUU corresponds to isoleucine (Ile), and AAG corresponds to lysine (Lys). So, the protein formed from the example mRNA in the previous paragraph is, in its abbreviated form, Ser-Ile-Lys.
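
One way to pick out the codons between the start and termination codons is sketched below. It rests on assumptions the text does not spell out: the first AUG fixes the reading frame, the termination codon must fall in that frame, and a strand with nothing between start and stop is treated as yielding no protein. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; names and the null-return convention are illustrative choices.
class CodonScanSketch {
    // Returns the codons strictly between the first AUG and the next in-frame
    // termination codon, or null when there is no AUG, no in-frame termination
    // codon, or no codon between the two.
    static List<String> codingCodons(String mrna) {
        int start = mrna.indexOf("AUG");
        if (start < 0) {
            return null; // no start codon anywhere in the strand
        }
        List<String> codons = new ArrayList<>();
        // Step through the strand three bases at a time, beginning after AUG.
        for (int i = start + 3; i + 3 <= mrna.length(); i += 3) {
            String codon = mrna.substring(i, i + 3);
            if (codon.equals("UAA") || codon.equals("UAG") || codon.equals("UGA")) {
                return codons.isEmpty() ? null : codons;
            }
            codons.add(codon);
        }
        return null; // ran off the end without meeting a termination codon
    }
}
```

For the example mRNA AUGAGCAUUAAGUGA this sketch returns the three codons AGC, AUU, and AAG.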

The complete genetic code from which codons are translated into amino acids is shown in the table below (note that only the amino acid abbreviations are shown). It should also be noted that the sequence AUG, which has already been identified as the start sequence, can also correspond to the amino acid methionine (Met). So, the first AUG in an mRNA strand is the start sequence, but subsequent AUG codons are translated normally into the Met amino acid.

First base          Second base in codon          Third base
in codon           U      C      A      G         in codon
________________________________________________________________________
U                  Phe    Ser    Tyr    Cys       U
                   Phe    Ser    Tyr    Cys       C
                   Leu    Ser    ---    ---       A
                   Leu    Ser    ---    Trp       G
________________________________________________________________________
C                  Leu    Pro    His    Arg       U
                   Leu    Pro    His    Arg       C
                   Leu    Pro    Gln    Arg       A
                   Leu    Pro    Gln    Arg       G
________________________________________________________________________
A                  Ile    Thr    Asn    Ser       U
                   Ile    Thr    Asn    Ser       C
                   Ile    Thr    Lys    Arg       A
                   Met    Thr    Lys    Arg       G
________________________________________________________________________
G                  Val    Ala    Asp    Gly       U
                   Val    Ala    Asp    Gly       C
                   Val    Ala    Glu    Gly       A
                   Val    Ala    Glu    Gly       G
________________________________________________________________________
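
The table above could be stored as a three-dimensional array indexed by the first, second, and third base of a codon, each taken in the table's U, C, A, G order. The sketch below is one possible layout, not the required one; the class name GeneticCodeSketch, the method name aminoAcid, and the choice to return "---" for the codons with no amino acid are all assumptions made for illustration.

```java
// Hypothetical lookup class; the array layout is one illustrative choice among several.
class GeneticCodeSketch {
    // Position of each base in this string gives its array index (U=0, C=1, A=2, G=3).
    private static final String BASES = "UCAG";

    // AMINO[first][second][third]; "---" marks the codons with no amino acid in the table.
    private static final String[][][] AMINO = {
        { // first base U
            {"Phe", "Phe", "Leu", "Leu"}, // second base U
            {"Ser", "Ser", "Ser", "Ser"}, // second base C
            {"Tyr", "Tyr", "---", "---"}, // second base A
            {"Cys", "Cys", "---", "Trp"}  // second base G
        },
        { // first base C
            {"Leu", "Leu", "Leu", "Leu"},
            {"Pro", "Pro", "Pro", "Pro"},
            {"His", "His", "Gln", "Gln"},
            {"Arg", "Arg", "Arg", "Arg"}
        },
        { // first base A
            {"Ile", "Ile", "Ile", "Met"},
            {"Thr", "Thr", "Thr", "Thr"},
            {"Asn", "Asn", "Lys", "Lys"},
            {"Ser", "Ser", "Arg", "Arg"}
        },
        { // first base G
            {"Val", "Val", "Val", "Val"},
            {"Ala", "Ala", "Ala", "Ala"},
            {"Asp", "Asp", "Glu", "Glu"},
            {"Gly", "Gly", "Gly", "Gly"}
        }
    };

    // Looks up the abbreviation for a three-base mRNA codon such as "AGC".
    static String aminoAcid(String codon) {
        int first = BASES.indexOf(codon.charAt(0));
        int second = BASES.indexOf(codon.charAt(1));
        int third = BASES.indexOf(codon.charAt(2));
        if (codon.length() != 3 || first < 0 || second < 0 || third < 0) {
            throw new IllegalArgumentException("Not an mRNA codon: " + codon);
        }
        return AMINO[first][second][third];
    }

    public static void main(String[] args) {
        System.out.println(aminoAcid("AGC") + "-" + aminoAcid("AUU") + "-" + aminoAcid("AAG"));
        // prints Ser-Ile-Lys
    }
}
```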

Design, test, and run a Java project, DNA, which translates a collection of DNA sequences into the protein each generates, if any. Assume that the given DNA strand is primary and that the start and termination sequences need not necessarily appear at the ends of the strand. For example, the DNA strand ATACTCGTAATTCACTCC yields the protein Ser-Ile-Lys. Input will be terminated by a line containing a single asterisk character.
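
The input convention described above (one strand per line, ended by a line holding a single asterisk) might be read with a loop like the following. This is a sketch only: the class name InputLoopSketch, the method name readStrands, and the use of java.util.Scanner are assumptions, and the translation step itself is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical input helper; collecting strands in a list is an illustrative choice.
class InputLoopSketch {
    // Reads one strand per line until a line containing only "*" (or end of input).
    static List<String> readStrands(Scanner in) {
        List<String> strands = new ArrayList<>();
        while (in.hasNextLine()) {
            String line = in.nextLine();
            if (line.equals("*")) {
                break; // the asterisk line ends the input and is not itself a strand
            }
            strands.add(line);
        }
        return strands;
    }

    public static void main(String[] args) {
        for (String strand : readStrands(new Scanner(System.in))) {
            System.out.println("Read a strand of " + strand.length() + " bases");
        }
    }
}
```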

You may assume the input to contain only valid, upper-case DNA nucleotide base letters (A, C, G, and T). No input line will exceed 255 characters in length. There will be no blank lines or spaces in the input. Some sequences, though valid DNA strands, do not produce valid protein sequences; the string "*** No translatable DNA found ***" should be output when an input DNA strand does not translate into a valid protein.

Note: Use the built-in charAt(int n) method to select the character at index n of a character string (indices start at 0). So, for example, if strand = "ATACTCC", then strand.charAt(0) is 'A', strand.charAt(1) is 'T', etc. Use methods and arrays in an appropriate manner. Create a class, DNAStrand, which defines a DNA string object, and appropriate instance methods to perform the translation and printing. As usual, you should try to make your program as robust as possible.

Extra Credit: Assume that the given DNA strand may be either primary or complementary, and that it may appear in either forward or reverse order.
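
One possible reading of the extra credit is that each input strand has four candidate interpretations: as given, reversed, complemented, and both reversed and complemented. Each candidate would then be tried as a primary strand. The sketch below only builds those four candidates; the class and method names are hypothetical, and how to choose among candidates that each yield a protein is left open because the text does not say.

```java
// Hypothetical helper class; the four-candidate approach is an assumed interpretation.
class OrientationSketch {
    // Replaces each base by its complement: A<->T and C<->G.
    static String complement(String strand) {
        StringBuilder result = new StringBuilder(strand.length());
        for (int i = 0; i < strand.length(); i++) {
            char base = strand.charAt(i);
            switch (base) {
                case 'A': result.append('T'); break;
                case 'T': result.append('A'); break;
                case 'C': result.append('G'); break;
                case 'G': result.append('C'); break;
                default:
                    throw new IllegalArgumentException("Not a DNA base: " + base);
            }
        }
        return result.toString();
    }

    // Returns the strand with its bases in the opposite order.
    static String reverse(String strand) {
        return new StringBuilder(strand).reverse().toString();
    }

    // Order of candidates: as given, reversed, complemented, reversed complement.
    static String[] candidates(String strand) {
        String comp = complement(strand);
        return new String[] { strand, reverse(strand), comp, reverse(comp) };
    }
}
```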

Sample dialogue:

Please enter a primary DNA strand.
(Indicate the end of strand by entering a space before your carriage return.):

ATACTCGTAATTCACTCC

The protein generated is:

Ser-Ile-Lys