Lecture Quiz #2 (Part 2-bioinformatics)

This part is also due on April 4th (or 5th). You are on the honor system to work independently on this quiz! Answer all questions (2 and 3) only on the Quiz 2 Answer Sheet provided as a handout in class.

(2) 3. Genes are obviously much longer than the 6 bases used in question 1. Proteins range in size from about 100 to 10,000 amino acids, and since each amino acid is encoded by a 3-base codon, this corresponds to gene sequences of at least 300 to 30,000 base pairs in length. Genes, however, are even longer because, in addition to encoding the information for the protein amino acid sequence, genes also include regions such as promoters and encode the ribosome binding site necessary for initiating translation. I have provided you with the sequence of an actual gene in a file at the end of this assignment called “DNA sequence”. As the DNA sequence of this gene is quite long, determining the complementary strand and translation by hand would be prohibitive. Fortunately, there are programs that make these tasks quite simple; just follow the instructions below.

a. To determine the complementary DNA strand, go to the following web site: http://bioinformatics.org/sms/rev_comp.html
Paste your sequence into the box provided and Submit the sequence. You will get a box saying that there are non-DNA characters which will be omitted; click on OK. The reverse complement sequence will be given (that is, the complementary DNA strand written in the 5’ → 3’ direction). Use this reverse complement sequence to answer question 3.a. on the answer sheet.

b. The program Six-Frame Translation will transcribe and translate your assigned DNA sequence for you. It is named Six-Frame because DNA has six possible reading frames, three in each direction (three reading frames reading 5’ → 3’ on one strand and three reading frames reading 5’ → 3’ on the complementary strand).
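To make the ideas of a reverse complement and of six reading frames concrete, here is a minimal Python sketch. It is not the code behind either web site. The names reverse_complement and six_frames are made up for illustration, and the sketch assumes the sequence uses only A, C, G and T (any other character is dropped, much as the web tool omits non-DNA characters).

```python
# Illustrative sketch only; function names are hypothetical.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna: str) -> str:
    """Return the complementary strand written in the 5' -> 3' direction."""
    # Assumption: keep only A, C, G, T and silently drop everything else.
    cleaned = "".join(base for base in dna if base in "ACGTacgt")
    # Complement every base, then reverse so the result reads 5' -> 3'.
    return cleaned.translate(COMPLEMENT)[::-1]


def six_frames(dna: str) -> dict:
    """Split a sequence into codons for each of the six reading frames."""
    forward = "".join(base for base in dna.upper() if base in "ACGT")
    reverse = reverse_complement(forward)
    frames = {}
    for sign, strand in (("+", forward), ("-", reverse)):
        for offset in range(3):
            # Step through the strand three bases at a time, skipping any
            # incomplete codon left over at the end.
            codons = [strand[i:i + 3]
                      for i in range(offset, len(strand) - 2, 3)]
            frames[f"{sign}{offset + 1}"] = codons
    return frames


if __name__ == "__main__":
    print(reverse_complement("ATGCCA"))   # TGGCAT
    for name, codons in six_frames("ATGCCA").items():
        print(name, codons)
```

Frames +1, +2 and +3 start at the first, second and third base of the given strand; frames -1, -2 and -3 do the same on the reverse complement.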
Since the start of the protein amino acid sequence has not been designated, Six-Frame Translation determines all possible translations. The amino acids are given using either their three-letter or one-letter code designation (refer to Figure 3.12 on page 49 for amino acid abbreviations). Stop codons are designated by an asterisk (*). You can access Six-Frame Translation at: http://molbiol.ru/eng/scripts/01_13.html

You should see the screen shown on the following page. Paste your sequence into the box provided. Use the defaults shown except under Translation set the “output” to 60 amino acids per line, the amino acid “code” to “one letter”, and set the frame number to “all”. Then simply click on “translate” to translate your sequence. You will now get six sets of translations. Open reading frames (ORFs) start with a start codon (check the genetic code to determine this) and end with a stop codon. Long ORFs are unlikely to occur by chance and thus signify potential genes. Examine the results of your Six-Frame translations to determine which reading frame gives the longest ORF. Use this ORF (a potential gene) to answer question 3.b. on your answer sheet. Hint: Can methionine amino acids appear at positions in a protein other than the first amino acid?

[Screenshot: Six-Frame Translation input form. Fields shown: Name (not necessary); Nucleotide sequence without name (case insensitive, all letters except agctuswrymkhbdvn are disregarded), with the sample input agtatcaaataagtaatttatttaggttcttttaagaaaggagcgacttgt; Translate and Clear buttons. Nucleotide sequence options: Output: 60 nucleotides in one line; display the initial sequence; display complementary sequence in 5' --> 3' direction. Translation options: Output: 60 amino acids in one line; code: one letter; Frame number: +1; +2; +3; -1; -2; -3; all. Alignment: align amino acid sequences relative to nucleic acid.]

(8) 4. A wild-type DNA sequence aligned with four mutant DNA sequences, each of which encodes a version of a bacterial protein, is provided on the following pages.
You are also provided with an alignment of the amino acid sequences encoded by each of these DNA sequences on the last page. The order of the sequences in these alignments has been randomized. For example, mutant DNA sequence A does not necessarily correspond to mutant protein sequence #1. The only thing you know for sure is that the wild-type DNA sequence does translate to the wild-type protein sequence. Note that by convention the DNA sequences used in the alignment are those for the “coding strand” of the DNA. That is, the sequence given is not the strand actually used as a template for transcription, but the strand complementary to the template. Thus, the “coding strand” can be directly compared to the sequence of the mRNA (except that T’s occur in the position of U’s). Answer the questions related to the DNA and protein sequences (4.a. through 4.e.) on the answer sheet.

DNA Sequence Alignment using the bioinformatics program CLUSTAL W (1.81). Note that a dash (-) indicates that a gap has been inserted in one or more of the sequences to preserve the optimal alignment. This does NOT mean that there are actual gaps in the DNA or proteins of the organism. It simply means that the computer has “shifted” one sequence relative to others to preserve as much alignment as possible throughout the length of the aligned sequences. The “#” indicates the location of the mutations. Each row of sequence contains 50 nucleotides numbered from left to right.

           1          11, etc.
Mutant A   ATGAAGTTTG GAAATATTTG TTTTTCGTAT CAACCACCAG GTGAAACTCA
Wild-type  ATGAAGTTTG GAAATATTTG TTTTTCGTAT CAACCACCAG GTGAAACTCA
Mutant B   ATGAAGTTTG GAAATATTTG TTTTTCGTAT CAACCACCAG GTGAAACTCA
Mutant C   ATGAAGTTTG GAAATATTTG TTTTTCGTAT CAACCACCAG GTGAAACTCA
Mutant D   ATGAAGTTTG GAAATATTTG TTTTTCGTAT CAACCACCAG GTGAAACTCA

           51
Mutant A   TAAGCTAAGT AATGGATCGC TTTGTTCGGC TTGGTATCGC CTCAGAAGAG
Wild-type  TAAGCTAAGT AATGGATCGC TTTGTTCGGC TTGGTATCGC CTCAGAAGAG
Mutant B   TAAGCTAAGT AATGGATCGC TTTGTTCGGC TTGGTATCGC CTCAGAAGAG
Mutant C   TAAGCTAAGT AATGGATCGC TTTGTTCGGC TTGGTATCGC CTCAGAAGAG
Mutant D   TAAGCTAAGT AATGGATCGC TTTGTTCGGC TTGGTATCGC CTCAGAAGAG

           101
Mutant A   TAGGGTTTGA TACATATTGG ACCTTAGAAC ATCATTTTAC AGAGTTTGGT
Wild-type  TAGGGTTTGA TACATATTGG ACCTTAGAAC ATCATTTTAC AGAGTTTGGT
Mutant B   TAGGGTTTGA TACATATTGG ACCTTAGAAC ATCATTTTAC AGAGTTTGGT
Mutant C   TAGGGTTTGA TACATATTGG ACCTTAGAAC ATCATTTTAC AGAGTTTGGT
Mutant D   TAGGGTTTGA TACATATTGG ACCTTAGAAC ATCATTTTAC AGAGTTTGGT

           151
Mutant A   CTTACGGGAA ATTTATTTGT TGCTGCGGCT AACCTGTTAG GAAGAACTAA
Wild-type  CTTACGGGAA ATTTATTTGT TGCTGCGGCT AACCTGTTAG GAAGAACTAA
Mutant B   CTTACGGGAA ATTTATTTGT TGCTGCGGCT AACCTGTTAG GAAGAACTAA
Mutant C   CTTACGGGAA ATTTATTTGT TGCTGCGGCT AACCTGTTAG GAAGAACTAA
Mutant D   CTTACGGGAA ATTTATTTGT TGCTGCGGCT AACCTGTTAG GAAGAACTAA

           201
Mutant A   AACATTAAAT GTTGGCACTA TGGGGGTTGT TATTCCGACA GCACACCCAG
Wild-type  AACATTAAAT GTTGGCACTA TGGGGGTTGT TATTCCGACA GCACACCCAG
Mutant B   AACATTAAAT GTTGGCACTA TGGGGGTTGT TATTCCGACA GCACACCCAG
Mutant C   AACATTAAAT GTTGGCACTA TGGGGGTTGT TATTCCGACA GCACACCCAG
Mutant D   AACATTAAAT GTTGGCACTA TGGGGGTTGT TATTCCGACA GCACACCCAG

           251
Mutant A   TTCGACAGTT AGAAGACGTT TTATTATTAG ATCAAATGTC GAAAGGTCGT
Wild-type  TTCGACAGTT AGAAGACGTT TTATTATTAG ATCAAATGTC GAAAGGTCGT
Mutant B   TTCGACAGTT AGAAGACGTT TTATTATTAG ATCAAATGTC GAAAGGTCGT
Mutant C   TTCGACAGTT AGAAGACGTT TTATTATTAG ATCAAATGTC GAAAGGTCGT
Mutant D   TTCGACAGTT AGAAGACGTT TTATTATTAG ATCAAATGTC GAAAGGTCGT

           301#
Mutant A   TATAATTTTG GAACCGTTCG AGGGCTATAC CATAAAGATT TTCGAGTATT
Wild-type  TTTAATTTTG GAACCGTTCG AGGGCTATAC CATAAAGATT TTCGAGTATT
Mutant B   TTTAATTTTG GAACCGTTCG AGGGCTATAC CATAAAGATT TTCGAGTATT
Mutant C   TTTAATTTTG GAACCGTTCG AGGGCTATAC CATAAAGATT TTCGAGTATT
Mutant D   TTTAATTTTG GAACCGTTCG AGGGCTATAC CATAAAGATT TTCGAGTATT

           351
Mutant A   TGGTGTTGAT ATGGAAGAGT CTCGAGCAAT TACTCAAAAT TTCTACCAGA
Wild-type  TGGTGTTGAT ATGGAAGAGT CTCGAGCAAT TACTCAAAAT TTCTACCAGA
Mutant B   TGGTGTTGAT ATGGAAGAGT CTCGAGCAAT TACTCAAAAT TTCTACCAGA
Mutant C   TGGTGTTGAT ATGGAAGAGT CTCGAGCAAT TACTCAAAAT TTCTACCAGA
Mutant D   TGGTGTTGAT ATGGAAGAGT CTCGAGCAAT TACTCAAAAT TTCTACCAGA

           401
Mutant A   TGATAATGGA AAGCTTACAG ACAGGAACCA TTAGCTCTGA TAGTGATTAC
Wild-type  TGATAATGGA AAGCTTACAG ACAGGAACCA TTAGCTCTGA TAGTGATTAC
Mutant B   TGATAATGGA AAGCTTACAG ACAGGAACCA TTAGCTCTGA TAGTGATTAC
Mutant C   TGATAATGGA AAGCTTACAG ACAGGAACCA TTAGCTCTGA TAGTGATTAC
Mutant D   TGATAATGGA AAGCTTACAG ACAGGAACCA TTAGCTCTGA TAGTGATTAC

           451
Mutant A   ATTCAATTTC CTAAGGTTGA TGTATATCCC AAAGTGTACT CAAAAAATGT
Wild-type  ATTCAATTTC CTAAGGTTGA TGTATATCCC AAAGTGTACT CAAAAAATGT
Mutant B   ATTCAATTTC CTAAGGTTGA TGTATATCCC AAAGTGTACT CAAAAAATGT
Mutant C   ATTCAATTTC CTAAGGTTGA TGTATATCCC AAAGTGTACT CAAAAAATGT
Mutant D   ATTCAATTTC CTAAGGTTGA TGTATATCCC AAAGTGTACT CAAAAAATGT

           501
Mutant A   ACCAACCTGT ATGACTGCTG AGTCCGCAAG TACGACAGAA TGGCTAGCAA
Wild-type  ACCAACCTGT ATGACTGCTG AGTCCGCAAG TACGACAGAA TGGCTAGCAA
Mutant B   ACCAACCTGT ATGACTGCTG AGTCCGCAAG TACGACAGAA TGGCTAGCAA
Mutant C   ACCAACCTGT ATGACTGCTG AGTCCGCAAG TACGACAGAA TGGCTAGCAA
Mutant D   ACCAACCTGT ATGACTGCTG AGTCCGCAAG TACGACAGAA TGGCTAGCAA

           551
Mutant A   TACAAGGGCT ACCAATGGTT CTTAGTTGGA TTATTGGTAC TAATGAAAAA
Wild-type  TACAAGGGCT ACCAATGGTT CTTAGTTGGA TTATTGGTAC TAATGAAAAA
Mutant B   TACAAGGGCT ACCAATGGTT CTTAGTTGGA TTATTGGTAC TAATGAAAAA
Mutant C   TACAAGGGCT ACCAATGGTT CTTAGTTGGA TTATTGGTAC TAATGAAAAA
Mutant D   TACAAGGGCT ACCAATGGTT CTTAGTTGGA TTATTGGTAC TAATGAAAAA

           601 #
Mutant A   AAA-GCACAGA TGGAACTCTA TAATGAAATT GCGACAGAAT ATGGTCATGA
Wild-type  AAA-GCACAGA TGGAACTCTA TAATGAAATT GCGACAGAAT ATGGTCATGA
Mutant B   AAA-GCACAGA TGGAACTCTA TAATGAAATT GCGACAGAAT ATGGTCATGA
Mutant C   AAAAGCACAGA TGGAACTCTA TAATGAAATT GCGACAGAAT ATGGTCATGA
Mutant D   AAA-GCACAGA TGGAACTCTA TAATGAAATT GCGACAGAAT ATGGTCATGA

           651
Mutant A   TATATCTAAA ATAGATCATT GTATGACTTA TATTTGTTCT GTTGATGATG
Wild-type  TATATCTAAA ATAGATCATT GTATGACTTA TATTTGTTCT GTTGATGATG
Mutant B   TATATCTAAA ATAGATCATT GTATGACTTA TATTTGTTCT GTTGATGATG
Mutant C   TATATCTAAA ATAGATCATT GTATGACTTA TATTTGTTCT GTTGATGATG
Mutant D   TATATCTAAA ATAGATCATT GTATGACTTA TATTTGTTCT GTTGATGATG

           701
Mutant A   ATGCACAAAA GGCGCAAGAT GTTTGTCGGG AGTTTCTGAA AAATTGGTAT
Wild-type  ATGCACAAAA GGCGCAAGAT GTTTGTCGGG AGTTTCTGAA AAATTGGTAT
Mutant B   ATGCACAAAA GGCGCAAGAT GTTTGTCGGG AGTTTCTGAA AAATTGGTAT
Mutant C   ATGCACAAAA GGCGCAAGAT GTTTGTCGGG AGTTTCTGAA AAATTGGTAT
Mutant D   ATGCACAAAA GGCGCAAGAT GTTTGTCGGG AGTTTCTGAA AAATTGGTAT

           751
Mutant A   GACTCATATG TAAATGCGAC CAATATCTTT AATGATAGCA ATCAAACTCG
Wild-type  GACTCATATG TAAATGCGAC CAATATCTTT AATGATAGCA ATCAAACTCG
Mutant B   GACTCATATG TAAATGCGAC CAATATCTTT AATGATAGCA ATCAAACTCG
Mutant C   GACTCATATG TAAATGCGAC CAATATCTTT AATGATAGCA ATCAAACTCG
Mutant D   GACTCATATG TAAATGCGAC CAATATCTTT AATGATAGCA ATCAAACTCG

           801
Mutant A   TGGTTATGAT TATCATAAAG GTCAATGGCG TGATTTTGTT TTACAAGGAC
Wild-type  TGGTTATGAT TATCATAAAG GTCAATGGCG TGATTTTGTT TTACAAGGAC
Mutant B   TGGTTATGAT TATCATAAAG GTCAATGGCG TGATTTTGTT TTACAAGGAC
Mutant C   TGGTTATGAT TATCATAAAG GTCAATGGCG TGATTTTGTT TTACAAGGAC
Mutant D   TGGTTATGAT TATCATAAAG GTCAATGGCG TGATTTTGTT TTACAAGGAC

           851                   #
Mutant A   ATACAAACAC CAATCGACGT GTTGATTATA GCAATGGTAT TAACCCTGTA
Wild-type  ATACAAACAC CAATCGACGT GTTGATTATA GCAATGGTAT TAACCCTGTA
Mutant B   ATACAAACAC CAATCGACGT GTTGATTAGA GCAATGGTAT TAACCCTGTA
Mutant C   ATACAAACAC CAATCGACGT GTTGATTATA GCAATGGTAT TAACCCTGTA
Mutant D   ATACAAACAC CAATCGACGT GTTGATTATA GCAATGGTAT TAACCCTGTA

           901
Mutant A   GGCACTCCTG AGCAGTGTAT TGAAATCATT CAACGTGATA TTGATGCAAC
Wild-type  GGCACTCCTG AGCAGTGTAT TGAAATCATT CAACGTGATA TTGATGCAAC
Mutant B   GGCACTCCTG AGCAGTGTAT TGAAATCATT CAACGTGATA TTGATGCAAC
Mutant C   GGCACTCCTG AGCAGTGTAT TGAAATCATT CAACGTGATA TTGATGCAAC
Mutant D   GGCACTCCTG AGCAGTGTAT TGAAATCATT CAACGTGATA TTGATGCAAC

           951
Mutant A   GGGTATTACA AACATTACAT GCGGATTTGA AGCTAATGGA ACTGAAGATG
Wild-type  GGGTATTACA AACATTACAT GCGGATTTGA AGCTAATGGA ACTGAAGATG
Mutant B   GGGTATTACA AACATTACAT GCGGATTTGA AGCTAATGGA ACTGAAGATG
Mutant C   GGGTATTACA AACATTACAT GCGGATTTGA AGCTAATGGA ACTGAAGATG
Mutant D   GGGTATTACA AACATTACAT GCGGATTTGA AGCTAATGGA ACTGAAGATG

           1001       #
Mutant A   AAATAATTGC TTCCATGCGA CGCTTTATGA CACAAGTCGC TCCTTTCTTA
Wild-type  AAATAATTGC TTCCATGCGA CGCTTTATGA CACAAGTCGC TCCTTTCTTA
Mutant B   AAATAATTGC TTCCATGCGA CGCTTTATGA CACAAGTCGC TCCTTTCTTA
Mutant C   AAATAATTGC TTCCATGCGA CGCTTTATGA CACAAGTCGC TCCTTTCTTA
Mutant D   AAATAATTGC TTCAATGCGA CGCTTTATGA CACAAGTCGC TCCTTTCTTA

           1051
Mutant A   AAAGAACCTA AATAA
Wild-type  AAAGAACCTA AATAA
Mutant B   AAAGAACCTA AATAA
Mutant C   AAAGAACCTA AATAA
Mutant D   AAAGAACCTA AATAA

Protein sequence alignment using one letter amino acid code. The symbol “*” in the protein alignment indicates one of the three stop codons. The “#” indicates the location of the mutations.
CLUSTAL W (1.81) multiple sequence alignment

           1                                                   50
Wild-type  MKFGNICFSY QPPGETHKLS NGSLCSAWYR LRRVGFDTYW TLEHHFTEFG
Mutant #1  MKFGNICFSY QPPGETHKLS NGSLCSAWYR LRRVGFDTYW TLEHHFTEFG
Mutant #2  MKFGNICFSY QPPGETHKLS NGSLCSAWYR LRRVGFDTYW TLEHHFTEFG
Mutant #3  MKFGNICFSY QPPGETHKLS NGSLCSAWYR LRRVGFDTYW TLEHHFTEFG
Mutant #4  MKFGNICFSY QPPGETHKLS NGSLCSAWYR LRRVGFDTYW TLEHHFTEFG

           51                                                 100
Wild-type  LTGNLFVAAA NLLGRTKTLN VGTMGVVIPT AHPVRQLEDV LLLDQMSKGR
Mutant #1  LTGNLFVAAA NLLGRTKTLN VGTMGVVIPT AHPVRQLEDV LLLDQMSKGR
Mutant #2  LTGNLFVAAA NLLGRTKTLN VGTMGVVIPT AHPVRQLEDV LLLDQMSKGR
Mutant #3  LTGNLFVAAA NLLGRTKTLN VGTMGVVIPT AHPVRQLEDV LLLDQMSKGR
Mutant #4  LTGNLFVAAA NLLGRTKTLN VGTMGVVIPT AHPVRQLEDV LLLDQMSKGR

           101 #                                              150
Wild-type  FNFGTVRGLY HKDFRVFGVD MEESRAITQN FYQMIMESLQ TGTISSDSDY
Mutant #1  FNFGTVRGLY HKDFRVFGVD MEESRAITQN FYQMIMESLQ TGTISSDSDY
Mutant #2  FNFGTVRGLY HKDFRVFGVD MEESRAITQN FYQMIMESLQ TGTISSDSDY
Mutant #3  YNFGTVRGLY HKDFRVFGVD MEESRAITQN FYQMIMESLQ TGTISSDSDY
Mutant #4  FNFGTVRGLY HKDFRVFGVD MEESRAITQN FYQMIMESLQ TGTISSDSDY

           151                                                200
Wild-type  IQFPKVDVYP KVYSKNVPTC MTAESASTTE WLAIQGLPMV LSWIIGTNEK
Mutant #1  IQFPKVDVYP KVYSKNVPTC MTAESASTTE WLAIQGLPMV LSWIIGTNEK
Mutant #2  IQFPKVDVYP KVYSKNVPTC MTAESASTTE WLAIQGLPMV LSWIIGTNEK
Mutant #3  IQFPKVDVYP KVYSKNVPTC MTAESASTTE WLAIQGLPMV LSWIIGTNEK
Mutant #4  IQFPKVDVYP KVYSKNVPTC MTAESASTTE WLAIQGLPMV LSWIIGTNEK

           201 #                                              250
Wild-type  KAQMELYNEI ATEYGHDISK IDHCMTYICS VDDDAQKAQD VCREFLKNWY
Mutant #1  KAQMELYNEI ATEYGHDISK IDHCMTYICS VDDDAQKAQD VCREFLKNWY
Mutant #2  KAQMELYNEI ATEYGHDISK IDHCMTYICS VDDDAQKAQD VCREFLKNWY
Mutant #3  KAQMELYNEI ATEYGHDISK IDHCMTYICS VDDDAQKAQD VCREFLKNWY
Mutant #4  KSTDGTL*-- ---------- ---------- ---------- ----------

           251                                         #      300
Wild-type  DSYVNATNIF NDSNQTRGYD YHKGQWRDFV LQGHTNTNRR VDYSNGINPV
Mutant #1  DSYVNATNIF NDSNQTRGYD YHKGQWRDFV LQGHTNTNRR VD*-------
Mutant #2  DSYVNATNIF NDSNQTRGYD YHKGQWRDFV LQGHTNTNRR VDYSNGINPV
Mutant #3  DSYVNATNIF NDSNQTRGYD YHKGQWRDFV LQGHTNTNRR VDYSNGINPV
Mutant #4  ---------- ---------- ---------- ---------- ----------

           301                                                350
Wild-type  GTPEQCIEII QRDIDATGIT NITCGFEANG TEDEIIASMR RFMTQVAPFL
Mutant #1  ---------- ---------- ---------- ---------- ----------
Mutant #2  GTPEQCIEII QRDIDATGIT NITCGFEANG TEDEIIASMR RFMTQVAPFL
Mutant #3  GTPEQCIEII QRDIDATGIT NITCGFEANG TEDEIIASMR RFMTQVAPFL
Mutant #4  ---------- ---------- ---------- ---------- ----------

           351
Wild-type  KEPK*
Mutant #1  -----
Mutant #2  KEPK*
Mutant #3  KEPK*
Mutant #4  -----

DNA Sequence

   1 gagctcattg atagcattga atctagtaat tgtaagctct attcgctaat tggaagttca
  61 catgatcttg gggaaaattt ggttgtatta agaaattttt atcaatcagt gacgaaggca
 121 gccttagcat tagatgatgg tttattggat ttagagatag acattattga acctcgattt
 181 gaggacgtta caagtattac tgttaaggag cgtagattaa aaaatgaaat tgaaaatgaa
 241 ttattagaat tggcttaaat aaacagaatc accaaaaagg aatagagtat gaagtttgga
 301 aatatttgtt tttcgtatca accaccaggt gaaactcata agctaagtaa tggatcgctt
 361 tgttcggctt ggtatcgcct cagaagagta gggtttgata catattggac cttagaacat
 421 cattttacag agtttggtct tacgggaaat ttatttgttg ctgcggctaa cctgttagga
 481 agaactaaaa cattaaatgt tggcactatg ggggttgtta ttccgacagc acacccagtt
 541 cgacagttag aagacgtttt attattagat caaatgtcga aaggtcgttt taattttgga
 601 accgttcgag ggctatacca taaagatttt cgagtatttg gtgttgatat ggaagagtct
 661 cgagcaatta ctcaaaattt ctaccagatg ataatggaaa gcttacagac aggaaccatt
 721 agctctgata gtgattacat tcaatttcct aaggttgatg tatatcccaa agtgtactca
 781 aaaaatgtac caacctgtat gactgctgag tccgcaagta cgacagaatg gctagcaata
 841 caagggctac caatggttct tagttggatt attggtacta atgaaaaaaa agcacagatg
 901 gaactctata atgaaattgc gacagaatat ggtcatgata tatctaaaat agatcattgt
 961 atgacttata tttgttctgt tgatgatgat gcacaaaagg cgcaagatgt ttgtcgggag
1021 tttctgaaaa attggtatga ctcatatgta aatgcgacca atatctttaa tgatagcaat
1081 caaactcgtg gttatgatta tcataaaggt caatggcgtg attttgtttt acaaggacat
1141 acaaacacca atcgacgtgt tgattatagc aatggtatta accctgtagg cactcctgag
1201 cagtgtattg aaatcattca acgtgatatt gatgcaacgg gtattacaaa cattacatgc
1261 ggatttgaag ctaatggaac tgaagatgaa ataattgctt ccatgcgacg ctttatgaca
1321 caagtcgctc ctttcttaaa agaacctaaa taaattactt atttgatact
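If you are curious how a program can pick out the longest open reading frame after a six-frame translation (question 3.b.), the following Python sketch shows one simple way to do it. It is not the method used by the Six-Frame Translation web site. The names translate and longest_orf are hypothetical, and the sketch assumes the standard genetic code, a sequence containing only A, C, G and T, and an ORF defined as the stretch from the first methionine after a stop codon up to the next stop codon.

```python
from itertools import product

# Standard genetic code, built from the usual T, C, A, G codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(strand: str, offset: int = 0) -> str:
    """Translate one reading frame; stop codons appear as '*'."""
    codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def longest_orf(dna: str) -> tuple:
    """Return (frame name, protein) for the longest ORF in all six frames."""
    # Assumption: anything that is not A, C, G or T is dropped.
    forward = "".join(base for base in dna.upper() if base in "ACGT")
    reverse = forward.translate(COMPLEMENT)[::-1]
    best = ("", "")
    for sign, strand in (("+", forward), ("-", reverse)):
        for offset in range(3):
            protein = translate(strand, offset)
            # Pieces between stop codons; the last piece has no stop after
            # it, so it is not counted as a complete ORF.
            for segment in protein.split("*")[:-1]:
                start = segment.find("M")  # first methionine starts the ORF
                if start != -1 and len(segment) - start > len(best[1]):
                    best = (f"{sign}{offset + 1}", segment[start:])
    return best


if __name__ == "__main__":
    # Toy input, not the quiz sequence: frame +3 reads ATG AAA TTT TAG.
    print(longest_orf("CCATGAAATTTTAGGC"))   # ('+3', 'MKF')
```

Because the ORF starts at the first methionine in each stretch, any later methionines simply remain inside the protein, which is the point of the hint in question 3.b.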