BCB 444/544 Lab 2 – NCBI Tools, Pairwise Sequence Alignment & Analysis
Answer Key
30 points possible. Score converted to 10 pt. scale, rounded to nearest tenth.

1 pt. 1. a) Entrez allows text-based searches of all NCBI databases. It can be used to search for nucleotides, proteins, and structures, as well as organism taxonomy and genome features. Biological and genetics publications are also searched. For instance, a user can retrieve sequence data for a particular group of organisms, or find articles related to apoptosis.
1 pt. b) 509
1 pt. c) 37
1 pt. f) 33,455
1 pt. g) 9
1 pt. i)
>gi|5031767|ref|NP_005517.1| heat shock transcription factor 1 [Homo sapiens]
MDLPVGPGAAGPSNVPAFLTKLWTLVSDPDTDALICWSPSGNSFHVFDQGQFAKEVLPKYFKHNNMASFV
RQLNMYGFRKVVHIEQGGLVKPERDDTEFQHPCFLRGQEQLLENIKRKVTSVSTLKSEDIKIRQDSVTKL
LTDVQLMKGKQECMDSKLLAMKHENEALWREVASLRQKHAQQQKVVNKLIQFLISLVQSNRILGVKRKIP
LMLNDSGSAHSMPKYSRQFSLEHVHGSGPYSAPSPAYSSSSLYAPDAVASSGPIISDITELAPASPMASP
GGSIDERPLSSSPLVRVKEEPPSPPQSPRVEEASPGRPSSVDTLLSPTALIDSILRESEPAPASVTALTD
ARGHTDTEGRPPSPPPTSTPEKCLSVACLDKNELSDHLDAMDSNLDNLQTMLSSHGFSVDTSALLDLFSP
SVTVPDMSLPDLDSSLASIQELLSPQEPPRPPEAENSSPDSGKQLVHYTAQPLFLLDPGSVDTGSNDLPV
LFELGEGSYFSEGDGFAEDPTISLLTGSEPPKAKDPTVS
Note: credit was given as long as the top two lines were present
1 pt.
j) >gi|132626772|ref|NM_005526.2| Homo sapiens heat shock transcription factor 1 (HSF1), mRNA GCGGCGGGAGCGCGCCCGTTGCAAGATGGCGGCGGCCATGCTGGGCCCCGGGGCTGTGTGTGCGCAGCGG GCGGCGGCGCGGCCCGGAAGGCTGGCGCGGCGACGGCGTTAGCCCGGCCCTCGGCCCCTCTTTGCGGCCG CTCCCTCCGCCTATTCCCTCCTTGCTCGAGATGGATCTGCCCGTGGGCCCCGGCGCGGCGGGGCCCAGCA ACGTCCCGGCCTTCCTGACCAAGCTGTGGACCCTCGTGAGCGACCCGGACACCGACGCGCTCATCTGCTG GAGCCCGAGCGGGAACAGCTTCCACGTGTTCGACCAGGGCCAGTTTGCCAAGGAGGTGCTGCCCAAGTAC TTCAAGCACAACAACATGGCCAGCTTCGTGCGGCAGCTCAACATGTATGGCTTCCGGAAAGTGGTCCACA TCGAGCAGGGCGGCCTGGTCAAGCCAGAGAGAGACGACACGGAGTTCCAGCACCCATGCTTCCTGCGTGG CCAGGAGCAGCTCCTTGAGAACATCAAGAGGAAAGTGACCAGTGTGTCCACCCTGAAGAGTGAAGACATA AAGATCCGCCAGGACAGCGTCACCAAGCTGCTGACGGACGTGCAGCTGATGAAGGGGAAGCAGGAGTGCA TGGACTCCAAGCTCCTGGCCATGAAGCATGAGAATGAGGCTCTGTGGCGGGAGGTGGCCAGCCTTCGGCA GAAGCATGCCCAGCAACAGAAAGTCGTCAACAAGCTCATTCAGTTCCTGATCTCACTGGTGCAGTCAAAC CGGATCCTGGGGGTGAAGAGAAAGATCCCCCTGATGCTGAACGACAGTGGCTCAGCACATTCCATGCCCA AGTATAGCCGGCAGTTCTCCCTGGAGCACGTCCACGGCTCGGGCCCCTACTCGGCCCCCTCCCCAGCCTA CAGCAGCTCCAGCCTCTACGCCCCTGATGCTGTGGCCAGCTCTGGACCCATCATCTCCGACATCACCGAG CTGGCTCCTGCCAGCCCCATGGCCTCCCCCGGCGGGAGCATAGACGAGAGGCCCCTATCCAGCAGCCCCC TGGTGCGTGTCAAGGAGGAGCCCCCCAGCCCGCCTCAGAGCCCCCGGGTAGAGGAGGCGAGTCCCGGGCG CCCATCTTCCGTGGACACCCTCTTGTCCCCGACCGCCCTCATTGACTCCATCCTGCGGGAGAGTGAACCT GCCCCCGCCTCCGTCACAGCCCTCACGGACGCCAGGGGCCACACGGACACCGAGGGCCGGCCTCCCTCCC CCCCGCCCACCTCCACCCCTGAAAAGTGCCTCAGCGTAGCCTGCCTGGACAAGAATGAGCTCAGTGACCA CTTGGATGCTATGGACTCCAACCTGGATAACCTGCAGACCATGCTGAGCAGCCACGGCTTCAGCGTGGAC ACCAGTGCCCTGCTGGACCTGTTCAGCCCCTCGGTGACCGTGCCCGACATGAGCCTGCCTGACCTTGACA GCAGCCTGGCCAGTATCCAAGAGCTCCTGTCTCCCCAGGAGCCCCCCAGGCCTCCCGAGGCAGAGAACAG CAGCCCGGATTCAGGGAAGCAGCTGGTGCACTACACAGCGCAGCCGCTGTTCCTGCTGGACCCCGGCTCC GTGGACACCGGGAGCAACGACCTGCCGGTGCTGTTTGAGCTGGGAGAGGGCTCCTACTTCTCCGAAGGGG ACGGCTTCGCCGAGGACCCCACCATCTCCCTGCTGACAGGCTCGGAGCCTCCCAAAGCCAAGGACCCCAC TGTCTCCTAGAGGCCCCGGAGGAGCTGGGCCAGCCGCCCACCCCCACCCCCAGTGCAGGGCTGGTCTTGG 
GGAGGCAGGGCAGCCTCGCGGTCTTGGGCACTGGTGGGTCGGCCGCCATAGCCCCAGTAGGACAAACGGG CTCGGGTCTGGGCAGCACCTCTGGTCAGGAGGGTCACCCTGGCCTGCCAGTCTGCCTTCCCCCAACCCCG TGTCCTGTGGTTTGGTTGGGGCTTCACAGCCACACCTGGACTGACCCTGCAGGTTGTTCATAGTCAGAAT TGTATTTTGGATTTTTACACAACTGTCCCGTTCCCCGCTCCACAGAGATACACAGATATATACACACAGT GGATGGACGGACAAGACAGGCAGAGATCTATAAACAGACAGGCTCTATGCTAAAAAAAAAAAAAAA
Note: credit was given as long as the top two lines were present

>gi|37574696|ref|NT_037704.4|Hs8_37708 Homo sapiens chromosome 8 genomic contig, reference assembly
GAATTCTTTAAAAGTTCTGGCCAGGCATGGTGGCACACACCTGTAATCCCAGCACTTTGGGAGGCCAAGG
Note: Only top two lines displayed to save space

3 pts. 2. a) This NCBI database contains information on human genes and diseases, as well as references and links to sequences and other genetics resources. Anyone can search the database for such information, although it is intended primarily for physicians, genetics students, and researchers. The page is not clear on exactly what disease information is contained.
1 pt. b) 117
1 pt. c) 32
1 pt. d) 8
1 pt. e) DYSTROPHIN; DMD
MUSCULAR DYSTROPHY, DUCHENNE TYPE; DMD
Note: The other result is for Dystrophin, the protein involved in DMD. There was no deduction for including this in your answer.
1 pt. f) MUSCULAR DYSTROPHY, DUCHENNE TYPE; DMD
1 pt. g) pseudohypertrophy of the calves, abnormal heartbeat, etc.
1 pt. h) Click the Limits tab, and check Chromosome X. (Other methods were also accepted.)

3 pts. 3. a) UniProt is a comprehensive database of well-annotated and classified protein sequences, resulting from a collaboration between the EBI, the SIB, and the PIR. It places a strong emphasis on cross-referencing the data to other databases and on providing access to the scientific community through a variety of query interfaces.
1 pt. b) Any sequence was accepted here, as long as it was in FASTA format, e.g.
>gi|150036270|ref|NM_004019.2| Homo sapiens dystrophin (muscular dystrophy, Duchenne and Becker types) (DMD), transcript variant Dp40, mRNA ACTTTCGGGGAGCCCGGCGGCTCTGGGAAGCTCACTCCTCCACTCGTACCCACACTCGACCGCGGAGCCC TTGCAGCCATGAGGGAACAGCTCAAAGGCCACGAGACTCAAACAACTTGCTGGGACCATCCCAAAATGAC AGAGCTCTACCAGTCTTTAGCTGACCTGAATAATGTCAGATTCTCAGCTTATAGGACTGCCATGAAACTC CGAAGACTGCAGAAGGCCCTTTGCTTGGATCTCTTGAGCCTGTCAGCTGCATGTGATGCCTTGGACCAGC ACAACCTCAAGCAAAATGACCAGCCCATGGATATCCTGCAGATTATTAATTGTTTGACCACTATTTATGA CCGCCTGGAGCAAGAGCACAACAATTTGGTCAACGTCCCTCTCTGCGTGGATATGTGTCTGAACTGGCTG CTGAATGTTTATGATACGGGACGAACAGGGAGGATCCGTGTCCTGTCTTTTAAAACTGGCATCATTTCCC TGTGTAAAGCACATTTGGAAGACAAGTACAGATACCTTTTCAAGCAAGTGGCAAGTTCAACAGGATTTTG TGACCAGCGCAGGCTGGGCCTCCTTCTGCATGATTCTATCCAAATTCCAAGACAGTTGGGTGAAGTTGCA TCCTTTGGGGGCAGTAACATTGAGCCAAGTGTCCGGAGCTGCTTCCAATTTGCTAATAATAAGCCAGAGA TCGAAGCGGCCCTCTTCCTAGACTGGATGAGACTGGAACCCCAGTCCATGGTGTGGCTGCCCGTCCTGCA CAGAGTGGCTGCTGCAGAAACTGCCAAGCATCAGGCCAAATGTAACATCTGCAAAGAGTGTCCAATCATT GGATTCAGGTACAGGAGTCTAAAGCACTTTAATTATGACATCTGCCAAAGCTGCTTTTTTTCTGGTCGAG TTGCAAAAGGCCATAAAATGCACTATCCCATGGTGGAATATTGCACTCCGACTACATCAGGAGAAGATGT TCGAGACTTTGCCAAGGTACTAAAAAACAAATTTCGAACCAAAAGGTATTTTGCGAAGCATCCCCGAATG GGCTACCTGCCAGTGCAGACTGTCTTAGAGGGGGACAACATGGAAACGTGAGTAGTAGCAAAAGCAGAAC ACACTCTTGTTTGATGTATATTTGAACTCCTCTCAGCTGAACACCCTCCTTCACTCCCAAATGCAAACAG TCTCTTCTATTTCTTTCTTTTTATTTACATTAGCTGAAAAGAGAAAAATAAGCTGATGTCCAGTTGCCAC TTTCCCACGTCACTTGACAATTTCTTTTTCCAAAAGTTAAACTTTATCTCACAGGGGGAAAAAAAAAAAA AAACCACAACACAATACAGCCACTAATTGCCTTACAAGCCTTATAAGAAATATGGGACTGTTTACAAATG AGTGATTCCAGTATTTCATTTTGATTTTCCTCTCTCACAAATCAGTAAATGTGTGTCTTTTTGTATCTCA TTGTGTGGTCATATCTAGTCACTTGTTTCTACTCAAAAGAAAATATAGTCACAGGAAACTACTTCACAGT AAGTAGTAATGATTCTCAAGATCAAAGGGGA 1 pt. 4. 
b) >readseq-65287_tmp_1 3899 bp tccgccgctgctgtctgcggggtctggcgccggggtctgagtctctgctggctaagccgc cgcctcagccgcctcagtcgcctcaatctcgccttccgccctcgctctccctccgcgcca ccagaccccgtagccccgcgcgcccccagccctttaagccagatgatgaacttcctgcgg cgccggctgtcggacagcagcttcatcgccaacctgcccaacggctacatgaccgacctg cagcggcccgagccccagcagccgccgccgccgccgccccccggtccgggcgccgcctcg gcctcggcggcgcccccgaccgcctcgccgggcccggagcggaggccgccgcccgcctcg gcgcccgcgccgcagcccgcgccgacgccgtcggtgggcagcagcttcttcagctcgctg tcccaagccgtgaagcagacggccgcctcggctggcctggtggacgcgcccgctcccgcg cccgcagccgccaggaaggccaaggtgctgctggtggtcgacgagccgcacgccgactgg gccaagtgctttcggggcaaaaaagtccttggagattatgatatcaaggtggaacaggca gaattttcagagctcaacctggtggcccatgcagatggcacctatgctgtggatatgcag gttctccggaatggcacaaaggttgtccggtccttccggccagacttcgtgctcatccgg cagcatgcatttggcatggcggagaatgaggacttccgccacctgatcattggtatgcag tatgcaggcctccccagcatcaactcactggaatccatatacaacttctgtgacaagcca tgggtgtttgcccagctggtcgctatctataagacactgggaggagaaaagttccctctc attgaacagacatactaccccaaccacaaagagatgctgacactgcccacgttccctgtg gtggtgaagattggccacgctcactcaggcatgggcaaggtcaaagtggaaaaccactac gacttccaggacattgccagcgtggtggctctcacccagacctatgccactgcagagcct ttcattgactccaagtatgacatccgggtccagaagattggcaacaactacaaggcttac atgaggacatcgatctcagggaactggaagacgaacactggctctgcgatgctggagcag attgccatgtcagacaggtacaaactgtgggtggacacctgctctgagatgtttggcggc ctggacatctgtgctgtcaaagctgtacatggcaaagatgggaaagactacatttttgag gtcatggactgtagcatgccactgattggggaacatcaggtggaggacaggcaactcatc accgaactagtcatcagcaagatgaaccagctgctgtccaggactcctgccctgtctcct cagagacccctaacaacccagcagccacagagcggaacacttaaggatccggactcaagc aagaccccacctcagcggccaccccctcaaggttgtttacagtatattctcgactgtaat ggcattgcagtagggccaaaacaagtccaagcttcttaaaatgattggtggttaattttt caaagcagaaattttaagccaaaaacaaacgaaaggaaagcggggaggggaaaacagacc ctcccactggtgccgttgctgcgttctttcaatgctgactggactgtgtttttcctatgc agtgtcagctcctctgtctggttgtttacctgttcctgttcgtgcttgtaatgctcactt atgttttctctgtataacttgtgattccagggctgtttgtcaacagtatacaaaagaatt gtgcctctcccaagtccagtgtgactttatcttctgggtggtttgatagtgtttttaaaa 
gtaatatataatgtggggtgaaatgggagtaggggggtggacaggggagaaacgaaaacc acaaaaagaaaacccaactcctctcctccccccaagctcagttaaatcccccacctccaa ctttccctccaccagtgtgcttgggatcttcaatgaactgtgcttttcgctttctttctg catgactattgtaactagatagaacattaagagattttcaagatcaaacttccatagctt catccactgaatttgaaggcatccacctttttctccatttgctaaaatttggtgcagttt gagtttatgtgaataggctggctgtgcctgtagagctcttgtgtttttagtgatgacatg aaatacaaagaacaagctatttccaggaatgtgttctgtattttacatcccagtgtaccc tttattttattattaactaattaactatgagatttttaaaaaatggggccgctgatgtgc aatatcaaagtgaacttgtgagtattttgtgtgtgttgatctcagttgtttcttcattgt tgctgtttctggatccagccatgtgtgcgcttgtgtggacctgaggctgctttctgttcc caaagcttgacctgtgtacagagataattccttggcaatgttggacatagaatgcaggga gctactgaaggtctgtcagggatttgtccattctgctcttggcctctcctgaggcctcat aatgggagaccaaatcaaaaatgtcccatgtcacttgagtgggtacactgcctacagaac cttgaggttgactcctgcttcagttctcagctgtttaccacagccctccagggtccaaag attgaggagctttctctttcctgggaggaactgtctcagatttagcttgtgtgtgttttg gacagaggctccacagcggtggctcttgaggaatcctcaccagtttgttctcttccctct gacaagcagcacctgagcagatgctgaggcagttcattaaaccaggcctcagcttcagtg cctcatcttgccatctcccggccaggctgggaacgggcaccaagcagccgcctctaacaa acaccatggtccgtggaagttcatgccagcagcttgcctttgagaagaaatgctgctggc tctatttttacattcccttccacctctatactgtcatgtcaccgttctgaactcccagat ctgagaaggaactagtgttggtggtatgtaacaagagttacgtatccaggggcttgtgcc ttggtttctcctttgattgctggtaaattctgaggccacagagaaatgcattgagtgtga atgttgtcatctgtaatccctccctcagctgataatggtagttgatctgttgtaaatata tacatatatgcatatttgcacttccagatgggttgcataagaatcaggtccttaaatacc tcccaatctgatgaaacgatagaataaagtaacatttcccagaatggaggaatacattat tttatcgtatatttttgtccaagcgataagctgacggtggtattgcttctctgcatgtta tcagtgtgtacatctggtgcttttcatgtgtcatttgtgagccacaaatgcaaagttgcc atttgaattcagtcaggctacagggtggtgtcagtcaaggtctttcaggtgggggagaaa ttggttagggctcccactgccaaatgcaagcagatagcataacctgactgttatgtgccc tcaggcagcatgcttagggacaactctgtggcctgggggacatctgtgtcacagtatagg attgccattcaggtgttttgtacctatttctttcctgacgttgtcccctttttttgtact gatccaactgggagaacctcagccaatgctggaagtatgattgaagtacctctcttttgt 
gactcttgtacagcttaatgtgcaataaaggaaaagttatatctgaaaaaaaaaaaaaa

1 pt. 5. a) 24
Note: I double-checked this answer on multiple days and had another TA do this problem, who also got this answer. However, all students appear to have gotten 95 hits for this query. Since it is unlikely that every student made the same mistake, the server was probably behaving oddly at some point, so 95 will also be accepted as correct.
3 pts. b ii) The part of this question requiring you to exclude partial sequences was more complicated than expected, so any ClustalW multiple alignment will be accepted. Thanks to Addie Hall for finding the correct method for obtaining the full-length sequences: “To the right of the search box click "Fields", then under Field select "Fragment (yes/no)" and choose no in the next drop down box. This leaves you with only the full length proteins and no fragments.”

4 pts. 6. h) For this exercise, the first step was once again slightly more complicated than expected, with no single result being obviously correct for each of the sequences. Fortunately, your choice of sequences should not have greatly affected your response, which was to briefly discuss the importance of parameter settings and the differences between alignment methods. This was a very open-ended question, so credit was given as long as a reasonable effort was made, with no glaring inaccuracies. A discussion of the differences between global alignments (which must align the entirety of both sequences) and local alignments (which find the optimal alignment of subsequences) would be appropriate. With respect to the gap penalty parameters, the general trend should be that the number of gaps drops as you raise the gap open penalty, and that gaps get longer as you lower the gap extension penalty.
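
The global versus local distinction discussed in 6. h) can be illustrated with a minimal Python sketch. It is not the tool used in the lab: the function names, the scoring values (match +1, mismatch -1, gap -2), and the example sequences are all assumptions made for illustration, and it uses a single linear gap penalty rather than separate gap open and gap extension penalties.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch style score: both sequences are aligned end to end."""
    # Row 0: aligning a prefix of b against nothing costs one gap per residue.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # Column 0: a prefix of a against nothing.
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s,   # align a[i-1] with b[j-1]
                            prev[j] + gap,     # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman style score: best-scoring pair of subsequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere, which is what
            # makes the alignment local instead of global.
            cell = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    # Hypothetical sequences sharing only the block ACGT.
    x, y = "CCCCACGT", "ACGTGGGG"
    print("global:", global_score(x, y))
    print("local: ", local_score(x, y))
```

With these assumed sequences and scores, the local score is 4 (the shared ACGT block alone), while the global score is -4, because the global method must also account for the unrelated flanking residues. Changing the `gap` argument is a simple way to see how the penalty setting shifts the result.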