Glossary - christopherking.name

advertisement
Bioinformatics, Part 2
Adapted from a paper (http://www.lifescied.org/cgi/content/full/4/3/207;
http://www.nslc.wustl.edu/elgin/genomics/Bio3055/manual.pdf) by April Bednarski and
Himadri Pakrasi that was funded by a grant from the Howard Hughes Medical Institute of
Washington University.
Glossary
Genome – The entire amount of genetic information for an organism. The human genome
is the set of 46 chromosomes.
Homologous – With regard to amino acids, homologous amino acids have similar chemical
properties and sizes. For example, glutamate can be considered homologous to aspartate
because both residues have similar sizes and both residues contain a carboxylic acid side
chain.
Sequence alignment – a sequence alignment is a way of arranging the sequences present
in DNA, RNA, or proteins so as to identify regions that are similar.
Multiple sequence alignment – a sequence alignment of three or more biological
sequences.
Conserved – the amino acid residues at a position in a multiple sequence alignment are
identical throughout the alignment.
Conservative residue change – the amino acid residues at a position in a multiple
sequence alignment are homologous.
ClustalW – A program for making multiple sequence alignments.
www.ebi.ac.uk/clustalw/index.html
ExPASy – Expert Protein Analysis System - us.expasy.org/ A server maintained by the
Swiss Institute of Bioinformatics. Home of SWISS-PROT, the most extensive and annotated
protein database. The Swiss-Pdb Viewer protein-viewing program is also available at this
site for free download.
FASTA – Fast Alignment Search Tool-All (since it works on both nucleotide and amino acid
sequences). Associated with this software is a way of formatting a nucleic acid or protein
sequence. It is important because many bioinformatics programs require that the
sequence be in FASTA format. The FASTA format has a title line for each sequence that
begins with a “>” followed by any needed text to name the sequence. The end of the
title line is signified by a paragraph mark (hit the return key). Bioinformatics
programs will know that the title line isn’t part of the sequence if you have it formatted
correctly. The sequence itself does NOT have any returns, spaces, or formatting of any
kind. The sequence is given in one-letter code. An example of a protein in correct FASTA
format is shown below:
>K-Ras protein Homo sapiens
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDI
LDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHHYREQIKRVKDSEDVP
1
MVLVGNKCDLPSRTVDTKQAQDLARSYGIPFIETSAKTRQGVDDAFYTLVREIRK
HKEKMSKDGKKKKKKSKTKCVIM
Sequence Manipulation Suite – bioinformatics.org/sms/ a website that contains a
collection of web-based programs for analyzing and formatting DNA and protein
sequences.
Procedure
NCBI – Gene
1. Go (again) to the NCBI homepage: http://www.ncbi.nlm.nih.gov
2. Search in the “Gene” database for “Homo sapiens PTGS2”. Click on the “PTGS2” entry.
The section NCBI Reference Sequences (RefSeq) gives RefSeq accession numbers for
the mRNA sequence of Homo sapiens prostaglandin G/H synthase 2 precursor. (The
number starts with NM_.)
Write one of them here__________________.
3. Open the RefSeq entry by clicking on that number (first link in the section), then click
on “FASTA” (near the top of the page). Copy the nucleotide sequence (including the title
line designated by the “>” symbol) and paste it into a text or Word document.
4. Save the file as PTGS2rna.doc (or .txt) on your desktop. Review the entry for “FASTA”
in the Glossary: understanding the FASTA format will help in working with the
bioinformatics programs.
5. The amino acid sequence is conveniently obtained by first clicking on the “RefSeq
Protein Product” link, which is in the second column of the page, then selecting the
FASTA format again. Follow the steps given above to save the amino acid sequence in
FASTA format as a document called PTGS2prot.doc.
Swiss-Prot Entry
1. Go to the Expasy website (http://us.expasy.org/). Under Databases select “UniProtKB”
(a protein knowledgebase). At the top of the page, click “Fields >>” (to the right of the
search box). For the first field, select “Protein Name”, and enter, for the “Term”,
Phospholipase C gamma 1. Click “Add & Search”, then click “Fields” again, and for the
field, “Organisms”, use the term “Homo sapiens”. Click “Add & Search”, again. Select the
one entry that has been reviewed (the gold star).
2. What is the “accession number” of this protein?
3. Click on the accession number. Write at least three alternate names for this protein.
2
4. In which two areas of the cell is this protein found? (Under “cellular component”)
5. What is its “cofactor” (needed for the enzyme to function)?
6. What is the PLC gamma1 amino acid length and molar mass in daltons of isoform 1
(under “Sequences”)?
7. Return to the home page of the ExPASy Proteomics Server; select the SWISS-2DPAGE
database. Enter the accession number in the search box. Has anyone reported 2-D gel
electrophoresis data?
Sequence Manipulation
1. Go to the Sequence Manipulation Suite (http://bioinformatics.org/sms/).
2. Under from the menu entry, “DNA Analysis”, click on “Translate”.
3. Clear the data entry box by clicking on “Clear”.
4. Copy the mRNA sequence in FASTA format from your file (PTGS2rna.doc) and paste it
into the data entry box on the Sequence Manipulation website.
5. Select “Reading Frame 3” and “direct” from the pull-down menus, then click “Submit”.
6. When the Output window opens with your results, copy and paste the sequence into a
Word document and save it as, “translate.doc” on your desktop.
7. Compare this sequence in the “translate.doc” file with the sequence in the
“PTGS2prot.doc”.
What are the first residues that are the same in the sequences?
Do the sequences look like they are the same? (Note: protein sequences should start
with a methionine, M.)
3
Multiple Sequence Alignment with ClustalW
1. Go to the ClustalW2 website, http://www.ebi.ac.uk/Tools/clustalw2/index.html.
2. The following are 6 FASTA formatted sequences of PTGS2 from different organisms.
Copy and paste all of the FASTA formatted sequences into the data entry box.
>dog [Canis familiaris]
MLARALVLCAALAVVRAANPCCSHPCQNQGICMSTGFDQYKCDCTRTGFYGENCS
TPEFLTRIKLYLKPT
PNTVHYILTHFKGVWNIVNNIPFLRNTIMKYVLTSRSHLIESPPTYNVNYGYKSW
EAFSNLSYYTRALPP
VPDDCPTPMGVKGKKELPDSKEIVEKFLLRRKFIPDPQGTNMMFAFFAQHFTHQF
FKTDHKRGPAFTKGL
GHGVDLNHVYGETLDRQHKLRLFKDGKMKYQVIDGEVYPPTVKDTQVEMIYPPHV
PEHLQFAVGQEVFGL
VPGLMMYATIWLREHNRVCDVLKQEHPEWDDERLFQTSRLILIGETIKIVIEDYV
QHLSGYHFKLKFDPE
LLFNQQFQYQNRIAAEFNTLYHWHPLLPDTLQIDDQEYNFQQFIYNNSILLEHGL
TQFVESFSRQIAGRV
AGGRNVPAAVQQVAKASIDQSRQMKYQSLNEYRKRFRLKPYTSFEELTGEKEMAA
GLEALYGDIDAMELY
PALLVEKPRPDAIFGETMVEMGAPFSLKGLMGNPICSPDYWKPSTFGGEVGFKII
NTASIQSLICNNVKG
CPFTAFSVQDGQLTKTVTINASSSHSGLDDINPTVLLKERSTEL
>cow [Bos taurus]
MLARALLLCAAVALSGAANPCCSHPCQNRGVCMSVGFDQYKCDCTRTGFYGENCT
TPEFLTRIKLLLKPT
PNTVHYILTHFKGVWNIVNKISFLRNMIMRYVLTSRSHLIESPPTYNVHYSYKSW
EAFSNLSYYTRALPP
VPDDCPTPMGVKGRKELPDSKEVVKKVLLRRKFIPDPQGTNLMFAFFAQHFTHQF
FKTDFERGPAFTKGK
NHGVDLSHIYGESLERQHKLRLFKDGKMKYQMINGEMYPPTVKDTQVEMIYPPHV
PEHLKFAVGQEVFGL
VPGLMMYATIWLREHNRVCDVLKQEHPEWGDEQLFQTSRLILIGETIKIVIEDYV
QHLSGYHFKLKFDPE
LLFNQQFQYQNRIAAEFNTLYHWHPLLPDVFQIDGQEYNYQQFIYNNSVLLEHGL
TQFVESFTRQRAGRV
AGGRNLPVAVEKVSKASIDQSREMKYQSFNEYRKRFLVKPYESFEELTGEKEMAA
ELEALYGDIDAMEFY
PALLVEKPRPDAIFGETMVEAGAPFSLKGLMGNPICSPEYWKPSTFGGEVGFKII
NTASIQSLICSNVKG
CPFTSFSVQDTHLTKTVTINASSSHSGLDDINPTVLLKERSTEL
4
>mouse [Mus musculus]
MLFRAVLLCAALGLSQAANPCCSNPCQNRGECMSTGFDQYKCDCTRTGFYGENCT
TPEFLTRIKLLLKPT
PNTVHYILTHFKGVWNIVNNIPFLRSLIMKYVLTSRSYLIDSPPTYNVHYGYKSW
EAFSNLSYYTRALPP
VADDCPTPMGVKGNKELPDSKEVLEKVLLRREFIPDPQGSNMMFAFFAQHFTHQF
FKTDHKRGPGFTRGL
GHGVDLNHIYGETLDRQHKLRLFKDGKLKYQVIGGEVYPPTVKDTQVEMIYPPHI
PENLQFAVGQEVFGL
VPGLMMYATIWLREHNRVCDILKQEHPEWGDEQLFQTSRLILIGETIKIVIEDYV
QHLSGYHFKLKFDPE
LLFNQQFQYQNRIASEFNTLYHWHPLLPDTFNIEDQEYSFKQFLYNNSILLEHGL
TQFVESFTRQIAGRV
AGGRNVPIAVQAVAKASIDQSREMKYQSLNEYRKRFSLKPYTSFEELTGEKEMAA
ELKALYSDIDVMELY
PALLVEKPRPDAIFGETMVELGAPFSLKGLMGNPICSPQYWKPSTFGGEVGFKII
NTASIQSLICNNVKG
CPFTSFNVQDPQPTKTATINASASHSRLDDINPTVLIKRRSTEL
>Rabbit
MLARALLLCAAVALSHAANPCCSNPCQNRGVCMTMGFDQYKCDCTRTGFYGENCS
TPEFLTRIKLLLKPT
PDTVHYILTHFKGVWNIVNSIPFLRNSIMKYVLTSRSHMIDSPPTYNVHYNYKSW
EAFSNLSYYTRALPP
VADDCPTPMGVKGKKELPDSKDVVEKLLLRRKFIPDPQGTNMMFAFFAQHFTHQF
FKTDLKRGPAFTKGL
GHGVDLNHIYGETLDRQHKLRLFKDGKMKYQVIDGEVYPPTVKDTQVEMIYPPHI
PAHLQFAVGQEVFGL
VPGLMMYATIWLREHNRVCDVLKQEHPEWDDEQLFQTSRLILIGETIKIVIEDYV
QHLSGYHFKLKFDPE
LLFNQQFQYQNRIAAEFNTLYHWHPLLPDTFQIDDQQYNYQQFLYNNSILLEHGL
TQFVESFTRQIAGRV
AGGRNVPPAVQKVAKASIDQSRQMKYQSLNEYRKRFLLKPYESFEELTGEKEMAA
ELEALYGDIDAVELY
PALLVERPRPDAIFGESMVEMGAPFSLKGLMGNPICSPNYWKPSTFGGEVGFKIV
NTASIQSLICNNVKG
CPFTSFNVPDPQLTKTVTINASASHSRLEDINPTVLLKGRSTEL
>pig [Sus scrofa]
MLARALLLCAAVSLCTAAKPCCSNPCQNRGICMSVGFDHYKCDCTRTGFYGENCT
TPEFLTRIKLFLKPT
PNTVHYILTHFKGVWNIVNNIPFLRNAIMKYVLISRSHLIDSPPTYNMHYGYKSW
EAFSNLSYYTRALPP
VPDDCPTPMGVKGRKELPDSKEVVEKLLLRRKFIPDPQGTNMMFAFFAQHFTHQF
FKTDQKRGPAFTKGQ
GHGVDLSHVYGESLERQHKLRLFKDGKMKYQIIDGEMYPPTAKDTQVEMIYPPHT
PEHLRFAVGHEVFGL
VPGLMMYATIWLREHNRVCDVLKQEHPEWDDERLFQTSRLILIGETIKIVIEDYV
QHLSGYHFKLKFDPE
5
LLFNQQFQYQNRIAAEFNTLYHWHPLLPDAFQIDGHEYNYQQFLYNNSILLEHGI
TQFVESFSRQIAGRV
AGGRNLPAAVQKVSKASIDQSREMRYQSFNEYRKRFLLKPYRSFEELTGEKEMAA
ELEALYGDIDAMELY
PALLVEKPRPDAIFGETMVEAGAPFSLKGLMGNPICSPEYWKPSTFGGEVGFKII
NTASIQSLICNNVKG
CPFTSFSVQDPQLAKTVTINASSSHSGLDDINPTVLLKERSTEL
>coral [Gersemia fruticosa]
MVAKFVVFLGLQLILCSVVCEAVNPCCSFPCESGAVCVEDGDKYTCDCTRTGHYG
VNCEKPNWSTWFKAL
IAPSEETKHFILTHFKWFWWIVNNVPFIRNTVMKAAYFSRTDFVPVPHAYTSYHD
YATMEAHYNRSYFAR
TLPPVPKNCPTPFGVAGKKELPPAEEVANKFLKRGKFKTDHTSTSWLFMFFAQHF
THEFFKTIYHSPAFT
WGNHGVDVSHIYGQDMERQNKLRSFEDGKLKSQTINGEEWPPYLKDVDNVTMQYP
PNTPEDQKFALGHPF
YSMLPGLFMYASIWLREHNRVCTILRKEHPHWVDERLYQTGKLIITGELIKIVIE
DYVNHLANYNLKLTY
NPELVFDHGYDYDNRIHVEFNHMYHWHPFSPDEYNISGSTYSIQDFMYHPEIVVK
HGMSSFVDSMSKGLC
GQMSHHNHGAYTLDVAVEVIKHQRELRMQSFNNYRKHFALEPYKSFEELTGDPKM
SAELQEVYGDVNAVD
LYVGFFLEKGLTTSPFGITMIAFGAPYSLRGLLSNPVSSPTYWKPSTFGGDVGFD
MVKTASLEKLFCQNI
AGECPLVTFTVPDDIARETRKVLEARDEL
For alignment select “Full”; for output format, select “aln w/numbers” so that particular
residues (amino acids) in the alignment can be found; for the Output order select “input”.
Click the “Run” button located in the lower right.
3. View the output- the SCORES table:
SeqA Name
Len(aa)
SeqB Name
Len(aa)
Score
===================================================
1
dog
604
2
cow
604
90
1
dog
604
3
mouse
604
89
Note that different specific combinations are examined; DOG TO COW for example. You
would expect a higher SCORE (right column; similarity of the gene sequence) between two
mammals than a mouse and the coral. What is the similarity score for the gene found in
mouse and coral? ________
View the cladogram at the bottom of the page. (To learn more about cladograms go to
en.wikipedia.org/wiki/Cladogram.) Switch to the phylogram view. Which two species are
most similar, based on this view? (Or can one even tell?)
6
Now for the most important part of this ClustalW analysis: an amino acid by amino acid
comparison of the same protein from different species. Go a little ways down the web page
and find ALIGNMENT. A button labeled 'Show Colors' will be displayed in the Alignment
section of results page. If you press this button the alignment will be show in color
according to the table below. (This option only works when you have chosen ALN or GCG
as the output format).
AVFPMILW
Red
Small: small or hydrophobic; includes aromatic except Tyr
DE
Blue
Acidic
RHK
Magenta
Basic
STYHCNGQ
Green
Hydroxyl + Amine + Basic - Q
Others
Gray
CONSENSUS SYMBOLS: An alignment will display by default the following symbols
denoting the degree of conservation observed in each column:
Symbol
Meaning
*
The residues in that column are identical in all sequences in the alignment.
:
Conserved substitutions are present, according to the COLOR table above.
.
Semi-conserved substitutions are present.
(space) ?
7
Figure 1. A Venn diagram showing the relationship of the 20 naturally occurring amino
acids to some physio-chemical properties. Exarchos et al. BMC Bioinformatics, 2009,
10:113 (Creative Commons Attribution License)
Copy the alignment of amino acids in various species and paste it into a Word
document. To make this file readable, do the following things:
a) Go to “Page Set-up” under “File” and change the page orientation to landscape.
b) Select all text and change to “Courier” font, size 10. Courier is the best font for
alignments because all the letters are the same width. This is one of the major
secrets of working with FASTA sequences.
c) Save and Print this file to the desktop as “ClustalW.doc” (send the file to yourself by
email or place on a floppy or flash drive). Place a copy in your lab notebook.
4. Review the alignment. What does the presence of a space under a column in the
alignment indicate about the relation of the residues?
5. Find the longest string of conserved (defined in glossary) residues (watch out for
strings at the ends of rows). How many residues does it contain?
8
Download