BCB 444/544 Answer Key

BCB 444/544
Lab 2 – NCBI Tools, Pairwise Sequence Alignment & Analysis
Answer Key
30 points possible. Score converted to 10 pt. scale, rounded to nearest tenth.
1 pt.
1. a) Entrez allows text-based searches of all NCBI databases. It can be used to search
for nucleotides, proteins, and structures, as well as organism taxonomy and genome
features. Biological and genetics publications are also searched. For instance, a user can
retrieve sequence data for a particular group of organisms, or find articles related to
apoptosis.
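As an illustration of the text-based searching described above, the same kinds of queries can also be sent to Entrez programmatically through the NCBI E-utilities ESearch service. The sketch below only builds the query URLs and does not contact the server. The helper name `build_esearch_url` and both search terms are made up for this example; they are not the queries used in the lab.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=20):
    """Return an ESearch URL for a text query against one Entrez database.

    db     -- Entrez database name, e.g. "protein", "nucleotide", "pubmed"
    term   -- the text query, written as it would be typed in the search box
    retmax -- maximum number of record IDs to return
    """
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


# Hypothetical example queries (assumptions, not taken from the lab handout).
protein_url = build_esearch_url(
    "protein", "heat shock transcription factor 1 AND Homo sapiens[Organism]"
)
pubmed_url = build_esearch_url("pubmed", "apoptosis")

print(protein_url)
print(pubmed_url)
```

Only the `db` value changes between a sequence search and a literature search, which mirrors how the Entrez web page lets one query box search every database.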
1 pt.
b) 509
1 pt.
c) 37
1 pt.
f) 33,455
1 pt.
g) 9
1 pt.
i) >gi|5031767|ref|NP_005517.1| heat shock transcription factor 1 [Homo sapiens]
MDLPVGPGAAGPSNVPAFLTKLWTLVSDPDTDALICWSPSGNSFHVFDQGQFAKEVLPKYFKHNNMASFV
RQLNMYGFRKVVHIEQGGLVKPERDDTEFQHPCFLRGQEQLLENIKRKVTSVSTLKSEDIKIRQDSVTKL
LTDVQLMKGKQECMDSKLLAMKHENEALWREVASLRQKHAQQQKVVNKLIQFLISLVQSNRILGVKRKIP
LMLNDSGSAHSMPKYSRQFSLEHVHGSGPYSAPSPAYSSSSLYAPDAVASSGPIISDITELAPASPMASP
GGSIDERPLSSSPLVRVKEEPPSPPQSPRVEEASPGRPSSVDTLLSPTALIDSILRESEPAPASVTALTD
ARGHTDTEGRPPSPPPTSTPEKCLSVACLDKNELSDHLDAMDSNLDNLQTMLSSHGFSVDTSALLDLFSP
SVTVPDMSLPDLDSSLASIQELLSPQEPPRPPEAENSSPDSGKQLVHYTAQPLFLLDPGSVDTGSNDLPV
LFELGEGSYFSEGDGFAEDPTISLLTGSEPPKAKDPTVS
Note: credit was given as long as the top two lines were present
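For reference, the first line of a FASTA record (the definition line) carries the identifiers that were checked when grading. The sketch below splits a gi-style definition line like the one above into its parts. The function name `parse_ncbi_defline` is hypothetical, and the code assumes the `>gi|number|database|accession| description` layout seen in this answer; other header styles would need different handling.

```python
def parse_ncbi_defline(defline):
    """Split a gi-style NCBI FASTA definition line into its fields.

    Assumes the layout '>gi|<gi number>|<db>|<accession>| <description>'.
    """
    if not defline.startswith(">"):
        raise ValueError("a FASTA definition line must start with '>'")
    # Everything up to the first space is the identifier block.
    identifiers, _, description = defline[1:].partition(" ")
    fields = identifiers.split("|")
    # For the header below, fields is
    # ['gi', '5031767', 'ref', 'NP_005517.1', ''].
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError("not a gi-style definition line")
    return {
        "gi": fields[1],
        "db": fields[2],
        "accession": fields[3],
        "description": description.strip(),
    }


header = ">gi|5031767|ref|NP_005517.1| heat shock transcription factor 1 [Homo sapiens]"
record = parse_ncbi_defline(header)
print(record["gi"], record["accession"])
print(record["description"])
```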
1 pt.
j) >gi|132626772|ref|NM_005526.2| Homo sapiens heat shock transcription factor 1 (HSF1), mRNA
GCGGCGGGAGCGCGCCCGTTGCAAGATGGCGGCGGCCATGCTGGGCCCCGGGGCTGTGTGTGCGCAGCGG
GCGGCGGCGCGGCCCGGAAGGCTGGCGCGGCGACGGCGTTAGCCCGGCCCTCGGCCCCTCTTTGCGGCCG
CTCCCTCCGCCTATTCCCTCCTTGCTCGAGATGGATCTGCCCGTGGGCCCCGGCGCGGCGGGGCCCAGCA
ACGTCCCGGCCTTCCTGACCAAGCTGTGGACCCTCGTGAGCGACCCGGACACCGACGCGCTCATCTGCTG
GAGCCCGAGCGGGAACAGCTTCCACGTGTTCGACCAGGGCCAGTTTGCCAAGGAGGTGCTGCCCAAGTAC
TTCAAGCACAACAACATGGCCAGCTTCGTGCGGCAGCTCAACATGTATGGCTTCCGGAAAGTGGTCCACA
TCGAGCAGGGCGGCCTGGTCAAGCCAGAGAGAGACGACACGGAGTTCCAGCACCCATGCTTCCTGCGTGG
CCAGGAGCAGCTCCTTGAGAACATCAAGAGGAAAGTGACCAGTGTGTCCACCCTGAAGAGTGAAGACATA
AAGATCCGCCAGGACAGCGTCACCAAGCTGCTGACGGACGTGCAGCTGATGAAGGGGAAGCAGGAGTGCA
TGGACTCCAAGCTCCTGGCCATGAAGCATGAGAATGAGGCTCTGTGGCGGGAGGTGGCCAGCCTTCGGCA
GAAGCATGCCCAGCAACAGAAAGTCGTCAACAAGCTCATTCAGTTCCTGATCTCACTGGTGCAGTCAAAC
CGGATCCTGGGGGTGAAGAGAAAGATCCCCCTGATGCTGAACGACAGTGGCTCAGCACATTCCATGCCCA
AGTATAGCCGGCAGTTCTCCCTGGAGCACGTCCACGGCTCGGGCCCCTACTCGGCCCCCTCCCCAGCCTA
CAGCAGCTCCAGCCTCTACGCCCCTGATGCTGTGGCCAGCTCTGGACCCATCATCTCCGACATCACCGAG
CTGGCTCCTGCCAGCCCCATGGCCTCCCCCGGCGGGAGCATAGACGAGAGGCCCCTATCCAGCAGCCCCC
TGGTGCGTGTCAAGGAGGAGCCCCCCAGCCCGCCTCAGAGCCCCCGGGTAGAGGAGGCGAGTCCCGGGCG
CCCATCTTCCGTGGACACCCTCTTGTCCCCGACCGCCCTCATTGACTCCATCCTGCGGGAGAGTGAACCT
GCCCCCGCCTCCGTCACAGCCCTCACGGACGCCAGGGGCCACACGGACACCGAGGGCCGGCCTCCCTCCC
CCCCGCCCACCTCCACCCCTGAAAAGTGCCTCAGCGTAGCCTGCCTGGACAAGAATGAGCTCAGTGACCA
CTTGGATGCTATGGACTCCAACCTGGATAACCTGCAGACCATGCTGAGCAGCCACGGCTTCAGCGTGGAC
ACCAGTGCCCTGCTGGACCTGTTCAGCCCCTCGGTGACCGTGCCCGACATGAGCCTGCCTGACCTTGACA
GCAGCCTGGCCAGTATCCAAGAGCTCCTGTCTCCCCAGGAGCCCCCCAGGCCTCCCGAGGCAGAGAACAG
CAGCCCGGATTCAGGGAAGCAGCTGGTGCACTACACAGCGCAGCCGCTGTTCCTGCTGGACCCCGGCTCC
GTGGACACCGGGAGCAACGACCTGCCGGTGCTGTTTGAGCTGGGAGAGGGCTCCTACTTCTCCGAAGGGG
ACGGCTTCGCCGAGGACCCCACCATCTCCCTGCTGACAGGCTCGGAGCCTCCCAAAGCCAAGGACCCCAC
TGTCTCCTAGAGGCCCCGGAGGAGCTGGGCCAGCCGCCCACCCCCACCCCCAGTGCAGGGCTGGTCTTGG
GGAGGCAGGGCAGCCTCGCGGTCTTGGGCACTGGTGGGTCGGCCGCCATAGCCCCAGTAGGACAAACGGG
CTCGGGTCTGGGCAGCACCTCTGGTCAGGAGGGTCACCCTGGCCTGCCAGTCTGCCTTCCCCCAACCCCG
TGTCCTGTGGTTTGGTTGGGGCTTCACAGCCACACCTGGACTGACCCTGCAGGTTGTTCATAGTCAGAAT
TGTATTTTGGATTTTTACACAACTGTCCCGTTCCCCGCTCCACAGAGATACACAGATATATACACACAGT
GGATGGACGGACAAGACAGGCAGAGATCTATAAACAGACAGGCTCTATGCTAAAAAAAAAAAAAAA
Note: credit was given as long as the top two lines were present
>gi|37574696|ref|NT_037704.4|Hs8_37708 Homo sapiens chromosome 8 genomic contig, reference assembly
GAATTCTTTAAAAGTTCTGGCCAGGCATGGTGGCACACACCTGTAATCCCAGCACTTTGGGAGGCCAAGG
Note: Only top two lines displayed to save space
3 pts.
2. a) This NCBI database contains information on human genes and diseases, as well as
references and links to sequences and other genetics resources. Anyone can search the
database for such information, although it is primarily for physicians, genetics students,
and researchers. The page is not clear on exactly what disease information is contained.
1 pt.
b) 117
1 pt.
c) 32
1 pt.
d) 8
1 pt.
e) DYSTROPHIN; DMD
MUSCULAR DYSTROPHY, DUCHENNE TYPE; DMD
Note: The other result is for Dystrophin, the protein involved in DMD. There was
no deduction for including this in your answer.
1 pt.
f) MUSCULAR DYSTROPHY, DUCHENNE TYPE; DMD
1 pt.
g) pseudohypertrophy of the calves, abnormal heartbeat, etc.
1 pt.
h) Click the Limits tab, and check Chromosome X. (Other methods were also accepted.)
3 pts.
3. a) UniProt is a comprehensive database of well-annotated and classified protein
sequences, resulting from a collaboration between the EBI, the SIB, and the PIR. It places a
strong emphasis on cross-referencing the data to other databases and on providing access to
the scientific community through a variety of query interfaces.
1 pt.
b) Any sequence was accepted here, as long as it was in FASTA format
e.g.
>gi|150036270|ref|NM_004019.2| Homo sapiens dystrophin (muscular dystrophy, Duchenne and Becker types) (DMD), transcript variant Dp40, mRNA
ACTTTCGGGGAGCCCGGCGGCTCTGGGAAGCTCACTCCTCCACTCGTACCCACACTCGACCGCGGAGCCC
TTGCAGCCATGAGGGAACAGCTCAAAGGCCACGAGACTCAAACAACTTGCTGGGACCATCCCAAAATGAC
AGAGCTCTACCAGTCTTTAGCTGACCTGAATAATGTCAGATTCTCAGCTTATAGGACTGCCATGAAACTC
CGAAGACTGCAGAAGGCCCTTTGCTTGGATCTCTTGAGCCTGTCAGCTGCATGTGATGCCTTGGACCAGC
ACAACCTCAAGCAAAATGACCAGCCCATGGATATCCTGCAGATTATTAATTGTTTGACCACTATTTATGA
CCGCCTGGAGCAAGAGCACAACAATTTGGTCAACGTCCCTCTCTGCGTGGATATGTGTCTGAACTGGCTG
CTGAATGTTTATGATACGGGACGAACAGGGAGGATCCGTGTCCTGTCTTTTAAAACTGGCATCATTTCCC
TGTGTAAAGCACATTTGGAAGACAAGTACAGATACCTTTTCAAGCAAGTGGCAAGTTCAACAGGATTTTG
TGACCAGCGCAGGCTGGGCCTCCTTCTGCATGATTCTATCCAAATTCCAAGACAGTTGGGTGAAGTTGCA
TCCTTTGGGGGCAGTAACATTGAGCCAAGTGTCCGGAGCTGCTTCCAATTTGCTAATAATAAGCCAGAGA
TCGAAGCGGCCCTCTTCCTAGACTGGATGAGACTGGAACCCCAGTCCATGGTGTGGCTGCCCGTCCTGCA
CAGAGTGGCTGCTGCAGAAACTGCCAAGCATCAGGCCAAATGTAACATCTGCAAAGAGTGTCCAATCATT
GGATTCAGGTACAGGAGTCTAAAGCACTTTAATTATGACATCTGCCAAAGCTGCTTTTTTTCTGGTCGAG
TTGCAAAAGGCCATAAAATGCACTATCCCATGGTGGAATATTGCACTCCGACTACATCAGGAGAAGATGT
TCGAGACTTTGCCAAGGTACTAAAAAACAAATTTCGAACCAAAAGGTATTTTGCGAAGCATCCCCGAATG
GGCTACCTGCCAGTGCAGACTGTCTTAGAGGGGGACAACATGGAAACGTGAGTAGTAGCAAAAGCAGAAC
ACACTCTTGTTTGATGTATATTTGAACTCCTCTCAGCTGAACACCCTCCTTCACTCCCAAATGCAAACAG
TCTCTTCTATTTCTTTCTTTTTATTTACATTAGCTGAAAAGAGAAAAATAAGCTGATGTCCAGTTGCCAC
TTTCCCACGTCACTTGACAATTTCTTTTTCCAAAAGTTAAACTTTATCTCACAGGGGGAAAAAAAAAAAA
AAACCACAACACAATACAGCCACTAATTGCCTTACAAGCCTTATAAGAAATATGGGACTGTTTACAAATG
AGTGATTCCAGTATTTCATTTTGATTTTCCTCTCTCACAAATCAGTAAATGTGTGTCTTTTTGTATCTCA
TTGTGTGGTCATATCTAGTCACTTGTTTCTACTCAAAAGAAAATATAGTCACAGGAAACTACTTCACAGT
AAGTAGTAATGATTCTCAAGATCAAAGGGGA
1 pt.
4. b) >readseq-65287_tmp_1 3899 bp
tccgccgctgctgtctgcggggtctggcgccggggtctgagtctctgctggctaagccgc
cgcctcagccgcctcagtcgcctcaatctcgccttccgccctcgctctccctccgcgcca
ccagaccccgtagccccgcgcgcccccagccctttaagccagatgatgaacttcctgcgg
cgccggctgtcggacagcagcttcatcgccaacctgcccaacggctacatgaccgacctg
cagcggcccgagccccagcagccgccgccgccgccgccccccggtccgggcgccgcctcg
gcctcggcggcgcccccgaccgcctcgccgggcccggagcggaggccgccgcccgcctcg
gcgcccgcgccgcagcccgcgccgacgccgtcggtgggcagcagcttcttcagctcgctg
tcccaagccgtgaagcagacggccgcctcggctggcctggtggacgcgcccgctcccgcg
cccgcagccgccaggaaggccaaggtgctgctggtggtcgacgagccgcacgccgactgg
gccaagtgctttcggggcaaaaaagtccttggagattatgatatcaaggtggaacaggca
gaattttcagagctcaacctggtggcccatgcagatggcacctatgctgtggatatgcag
gttctccggaatggcacaaaggttgtccggtccttccggccagacttcgtgctcatccgg
cagcatgcatttggcatggcggagaatgaggacttccgccacctgatcattggtatgcag
tatgcaggcctccccagcatcaactcactggaatccatatacaacttctgtgacaagcca
tgggtgtttgcccagctggtcgctatctataagacactgggaggagaaaagttccctctc
attgaacagacatactaccccaaccacaaagagatgctgacactgcccacgttccctgtg
gtggtgaagattggccacgctcactcaggcatgggcaaggtcaaagtggaaaaccactac
gacttccaggacattgccagcgtggtggctctcacccagacctatgccactgcagagcct
ttcattgactccaagtatgacatccgggtccagaagattggcaacaactacaaggcttac
atgaggacatcgatctcagggaactggaagacgaacactggctctgcgatgctggagcag
attgccatgtcagacaggtacaaactgtgggtggacacctgctctgagatgtttggcggc
ctggacatctgtgctgtcaaagctgtacatggcaaagatgggaaagactacatttttgag
gtcatggactgtagcatgccactgattggggaacatcaggtggaggacaggcaactcatc
accgaactagtcatcagcaagatgaaccagctgctgtccaggactcctgccctgtctcct
cagagacccctaacaacccagcagccacagagcggaacacttaaggatccggactcaagc
aagaccccacctcagcggccaccccctcaaggttgtttacagtatattctcgactgtaat
ggcattgcagtagggccaaaacaagtccaagcttcttaaaatgattggtggttaattttt
caaagcagaaattttaagccaaaaacaaacgaaaggaaagcggggaggggaaaacagacc
ctcccactggtgccgttgctgcgttctttcaatgctgactggactgtgtttttcctatgc
agtgtcagctcctctgtctggttgtttacctgttcctgttcgtgcttgtaatgctcactt
atgttttctctgtataacttgtgattccagggctgtttgtcaacagtatacaaaagaatt
gtgcctctcccaagtccagtgtgactttatcttctgggtggtttgatagtgtttttaaaa
gtaatatataatgtggggtgaaatgggagtaggggggtggacaggggagaaacgaaaacc
acaaaaagaaaacccaactcctctcctccccccaagctcagttaaatcccccacctccaa
ctttccctccaccagtgtgcttgggatcttcaatgaactgtgcttttcgctttctttctg
catgactattgtaactagatagaacattaagagattttcaagatcaaacttccatagctt
catccactgaatttgaaggcatccacctttttctccatttgctaaaatttggtgcagttt
gagtttatgtgaataggctggctgtgcctgtagagctcttgtgtttttagtgatgacatg
aaatacaaagaacaagctatttccaggaatgtgttctgtattttacatcccagtgtaccc
tttattttattattaactaattaactatgagatttttaaaaaatggggccgctgatgtgc
aatatcaaagtgaacttgtgagtattttgtgtgtgttgatctcagttgtttcttcattgt
tgctgtttctggatccagccatgtgtgcgcttgtgtggacctgaggctgctttctgttcc
caaagcttgacctgtgtacagagataattccttggcaatgttggacatagaatgcaggga
gctactgaaggtctgtcagggatttgtccattctgctcttggcctctcctgaggcctcat
aatgggagaccaaatcaaaaatgtcccatgtcacttgagtgggtacactgcctacagaac
cttgaggttgactcctgcttcagttctcagctgtttaccacagccctccagggtccaaag
attgaggagctttctctttcctgggaggaactgtctcagatttagcttgtgtgtgttttg
gacagaggctccacagcggtggctcttgaggaatcctcaccagtttgttctcttccctct
gacaagcagcacctgagcagatgctgaggcagttcattaaaccaggcctcagcttcagtg
cctcatcttgccatctcccggccaggctgggaacgggcaccaagcagccgcctctaacaa
acaccatggtccgtggaagttcatgccagcagcttgcctttgagaagaaatgctgctggc
tctatttttacattcccttccacctctatactgtcatgtcaccgttctgaactcccagat
ctgagaaggaactagtgttggtggtatgtaacaagagttacgtatccaggggcttgtgcc
ttggtttctcctttgattgctggtaaattctgaggccacagagaaatgcattgagtgtga
atgttgtcatctgtaatccctccctcagctgataatggtagttgatctgttgtaaatata
tacatatatgcatatttgcacttccagatgggttgcataagaatcaggtccttaaatacc
tcccaatctgatgaaacgatagaataaagtaacatttcccagaatggaggaatacattat
tttatcgtatatttttgtccaagcgataagctgacggtggtattgcttctctgcatgtta
tcagtgtgtacatctggtgcttttcatgtgtcatttgtgagccacaaatgcaaagttgcc
atttgaattcagtcaggctacagggtggtgtcagtcaaggtctttcaggtgggggagaaa
ttggttagggctcccactgccaaatgcaagcagatagcataacctgactgttatgtgccc
tcaggcagcatgcttagggacaactctgtggcctgggggacatctgtgtcacagtatagg
attgccattcaggtgttttgtacctatttctttcctgacgttgtcccctttttttgtact
gatccaactgggagaacctcagccaatgctggaagtatgattgaagtacctctcttttgt
gactcttgtacagcttaatgtgcaataaaggaaaagttatatctgaaaaaaaaaaaaaa
1 pt.
5. a) 24 Note: I double-checked this answer on multiple days and had another TA
work the problem, who got the same answer. However, all students appear to have gotten
95 hits for this query. Since it is unlikely that every student made the same mistake, I
believe the server was behaving oddly at some point, so 95 will also be accepted as
correct.
3 pts.
b ii) The part of this question requiring you to exclude partial sequences was more
complicated than expected, so any ClustalW multiple alignment will be accepted. Thanks
to Addie Hall for finding the correct method for obtaining the correct sequences: “To the
right of the search box click "Fields", then under Field select "Fragment (yes/no)" and
choose no in the next drop-down box. This leaves you with only the full length proteins
and no fragments.”
4 pts.
6. h) For this exercise, the first step was once again slightly more complicated than
expected, with no single result being obviously correct for each of the sequences.
Fortunately, your choice of sequences should not have greatly affected your response,
which was to briefly discuss the importance of parameter settings and differences
between alignment methods. This was a very open-ended question, so credit was given as
long as a reasonable effort was made, with no glaring inaccuracies. A discussion of the
differences between global (must align entirety of both sequences) and local alignments
(find optimal alignment of subsequences) would be appropriate. With respect to varying
the gap penalty parameter, in general, the trend should be for the number of gaps to be
lower when you raise the gap open penalty and for the gaps to be longer as the gap
extension penalty is lowered.
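The two points above (global versus local alignment, and the effect of the gap penalties) can be illustrated with a small score-only sketch. It is not the implementation used by the lab's alignment tools. It assumes a simple match/mismatch score instead of a substitution matrix, and it assumes an affine gap model in which a gap of length k costs gap_open + (k - 1) * gap_extend. The function names, the toy sequences, and the parameter values are all chosen for illustration only.

```python
NEG_INF = float("-inf")


def gap_cost(length, gap_open, gap_extend):
    """Affine penalty for one gap: gap_open for the first position,
    gap_extend for each additional position (assumed convention)."""
    return gap_open + (length - 1) * gap_extend


def align_score(a, b, match=1, mismatch=-1, gap_open=10.0, gap_extend=0.5,
                local=False):
    """Best affine-gap alignment score (score only, no traceback).

    local=False: global alignment, both sequences aligned end to end.
    local=True:  local alignment, best-scoring pair of subsequences.
    """
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]; X: a[i-1] opposite a gap;
    # Y: b[j-1] opposite a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    if not local:
        # A global alignment must pay for leading gaps.
        for i in range(1, n + 1):
            X[i][0] = -gap_cost(i, gap_open, gap_extend)
        for j in range(1, m + 1):
            Y[0][j] = -gap_cost(j, gap_open, gap_extend)
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            prev = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            if local:
                prev = max(prev, 0.0)  # a local alignment may start anywhere
            M[i][j] = prev + s
            X[i][j] = max(M[i - 1][j] - gap_open,
                          X[i - 1][j] - gap_extend,
                          Y[i - 1][j] - gap_open)
            Y[i][j] = max(M[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_extend,
                          X[i][j - 1] - gap_open)
            best = max(best, M[i][j])
    if local:
        return best  # a local alignment may also end anywhere
    return max(M[n][m], X[n][m], Y[n][m])


# Toy sequences (made up): the short one is embedded in the long one.
short_seq, long_seq = "ACGT", "GGACGTGG"
global_score = align_score(short_seq, long_seq, gap_open=2, gap_extend=1)
local_score = align_score(short_seq, long_seq, gap_open=2, gap_extend=1,
                          local=True)
print("global:", global_score, "local:", local_score)

# Six gap positions as one long gap versus three short gaps.
one_long_gap = gap_cost(6, gap_open=10, gap_extend=0.5)
three_short_gaps = 3 * gap_cost(2, gap_open=10, gap_extend=0.5)
print("one gap of 6:", one_long_gap, "three gaps of 2:", three_short_gaps)
```

The global score is lower than the local score here because the global alignment has to pay for the unmatched ends of the longer sequence, while the local alignment reports only the matching region. The last two lines show why a high gap open penalty favors fewer gaps and a low gap extension penalty makes long gaps cheap: the same six gap positions cost far less as a single gap than as three separate ones.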