Glossary - christopherking.name

Bioinformatics, Part 2

Adapted from a paper (http://www.lifescied.org/cgi/content/full/4/3/207; http://www.nslc.wustl.edu/elgin/genomics/Bio3055/manual.pdf) by April Bednarski and Himadri Pakrasi that was funded by a grant from the Howard Hughes Medical Institute of Washington University.

Glossary

Genome – The entire amount of genetic information for an organism. The human genome is carried on 46 chromosomes (23 pairs) in each diploid cell.

Homologous – With regard to amino acids, homologous amino acids have similar chemical properties and sizes. For example, glutamate can be considered homologous to aspartate because both residues have similar sizes and both contain a carboxylic acid side chain.

Sequence alignment – A way of arranging DNA, RNA, or protein sequences so as to identify regions that are similar.

Multiple sequence alignment – A sequence alignment of three or more biological sequences.

Conserved – The amino acid residues at a position in a multiple sequence alignment are identical throughout the alignment.

Conservative residue change – The amino acid residues at a position in a multiple sequence alignment are homologous.

ClustalW – A program for making multiple sequence alignments. www.ebi.ac.uk/clustalw/index.html

ExPASy – Expert Protein Analysis System (us.expasy.org/). A server maintained by the Swiss Institute of Bioinformatics. Home of SWISS-PROT, an extensive, well-annotated protein database. The Swiss-Pdb Viewer protein-viewing program is also available at this site for free download.

FASTA – Short for “FAST-All” (since the program works on both nucleotide and amino acid sequences). Associated with this software is a way of formatting a nucleic acid or protein sequence. It is important because many bioinformatics programs require that the sequence be in FASTA format. The FASTA format has a title line for each sequence that begins with a “>” followed by any needed text to name the sequence. The end of the title line is signified by a paragraph mark (hit the return key). Bioinformatics programs will know that the title line isn’t part of the sequence if you have it formatted correctly. The sequence itself should NOT contain spaces, numbers, or formatting of any kind; line breaks within the sequence are accepted by most programs. The sequence is given in one-letter code. An example of a protein in correct FASTA format is shown below:

>K-Ras protein Homo sapiens
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDI
LDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHHYREQIKRVKDSEDVP
MVLVGNKCDLPSRTVDTKQAQDLARSYGIPFIETSAKTRQGVDDAFYTLVREIRK
HKEKMSKDGKKKKKKSKTKCVIM

Sequence Manipulation Suite – bioinformatics.org/sms/ A website that contains a collection of web-based programs for analyzing and formatting DNA and protein sequences.

Procedure

NCBI – Gene

1. Go (again) to the NCBI homepage: http://www.ncbi.nlm.nih.gov

2. Search in the “Gene” database for “Homo sapiens PTGS2”. Click on the “PTGS2” entry. The section NCBI Reference Sequences (RefSeq) gives RefSeq accession numbers for the mRNA sequence of Homo sapiens prostaglandin G/H synthase 2 precursor. (The number starts with NM_.) Write one of them here: __________________

3. Open the RefSeq entry by clicking on that number (first link in the section), then click on “FASTA” (near the top of the page). Copy the nucleotide sequence (including the title line designated by the “>” symbol) and paste it into a text or Word document.

4. Save the file as PTGS2rna.doc (or .txt) on your desktop. Review the entry for “FASTA” in the Glossary: understanding the FASTA format will help in working with the bioinformatics programs.

5. The amino acid sequence is conveniently obtained by first clicking on the “RefSeq Protein Product” link, which is in the second column of the page, then selecting the FASTA format again. Follow the steps given above to save the amino acid sequence in FASTA format as a document called PTGS2prot.doc.

Swiss-Prot Entry

1. Go to the ExPASy website (http://us.expasy.org/). Under Databases select “UniProtKB” (a protein knowledgebase). At the top of the page, click “Fields >>” (to the right of the search box). For the first field, select “Protein Name”, and for the “Term” enter Phospholipase C gamma 1. Click “Add & Search”, then click “Fields” again, and for the field “Organisms” use the term “Homo sapiens”. Click “Add & Search” again. Select the one entry that has been reviewed (the gold star).

2. What is the “accession number” of this protein?

3. Click on the accession number. Write at least three alternate names for this protein.

4. In which two areas of the cell is this protein found? (Under “cellular component”)

5. What is its “cofactor” (needed for the enzyme to function)?

6. What are the length (in amino acids) and the molar mass (in daltons) of PLC gamma 1 isoform 1 (under “Sequences”)?

7. Return to the home page of the ExPASy Proteomics Server; select the SWISS-2DPAGE database. Enter the accession number in the search box. Has anyone reported 2-D gel electrophoresis data?

Sequence Manipulation

1. Go to the Sequence Manipulation Suite (http://bioinformatics.org/sms/).

2. Under the menu entry “DNA Analysis”, click on “Translate”.

3. Clear the data entry box by clicking on “Clear”.

4. Copy the mRNA sequence in FASTA format from your file (PTGS2rna.doc) and paste it into the data entry box on the Sequence Manipulation website.

5. Select “Reading Frame 3” and “direct” from the pull-down menus, then click “Submit”.

6. When the Output window opens with your results, copy and paste the sequence into a Word document and save it as “translate.doc” on your desktop.

7. Compare the sequence in the “translate.doc” file with the sequence in “PTGS2prot.doc”. What are the first residues that are the same in the sequences? Do the sequences look like they are the same? (Note: protein sequences should start with a methionine, M.)

Multiple Sequence Alignment with ClustalW

1. Go to the ClustalW2 website, http://www.ebi.ac.uk/Tools/clustalw2/index.html.

2. The following are 6 FASTA formatted sequences of PTGS2 from different organisms. Copy and paste all of the FASTA formatted sequences into the data entry box.

>dog [Canis familiaris]
MLARALVLCAALAVVRAANPCCSHPCQNQGICMSTGFDQYKCDCTRTGFYGENCSTPEFLTRIKLYLKPT
PNTVHYILTHFKGVWNIVNNIPFLRNTIMKYVLTSRSHLIESPPTYNVNYGYKSWEAFSNLSYYTRALPP
VPDDCPTPMGVKGKKELPDSKEIVEKFLLRRKFIPDPQGTNMMFAFFAQHFTHQFFKTDHKRGPAFTKGL
GHGVDLNHVYGETLDRQHKLRLFKDGKMKYQVIDGEVYPPTVKDTQVEMIYPPHVPEHLQFAVGQEVFGL
VPGLMMYATIWLREHNRVCDVLKQEHPEWDDERLFQTSRLILIGETIKIVIEDYVQHLSGYHFKLKFDPE
LLFNQQFQYQNRIAAEFNTLYHWHPLLPDTLQIDDQEYNFQQFIYNNSILLEHGLTQFVESFSRQIAGRV
AGGRNVPAAVQQVAKASIDQSRQMKYQSLNEYRKRFRLKPYTSFEELTGEKEMAAGLEALYGDIDAMELY
PALLVEKPRPDAIFGETMVEMGAPFSLKGLMGNPICSPDYWKPSTFGGEVGFKIINTASIQSLICNNVKG
CPFTAFSVQDGQLTKTVTINASSSHSGLDDINPTVLLKERSTEL
>cow [Bos taurus]
MLARALLLCAAVALSGAANPCCSHPCQNRGVCMSVGFDQYKCDCTRTGFYGENCTTPEFLTRIKLLLKPT
PNTVHYILTHFKGVWNIVNKISFLRNMIMRYVLTSRSHLIESPPTYNVHYSYKSWEAFSNLSYYTRALPP
VPDDCPTPMGVKGRKELPDSKEVVKKVLLRRKFIPDPQGTNLMFAFFAQHFTHQFFKTDFERGPAFTKGK
NHGVDLSHIYGESLERQHKLRLFKDGKMKYQMINGEMYPPTVKDTQVEMIYPPHVPEHLKFAVGQEVFGL
VPGLMMYATIWLREHNRVCDVLKQEHPEWGDEQLFQTSRLILIGETIKIVIEDYVQHLSGYHFKLKFDPE
LLFNQQFQYQNRIAAEFNTLYHWHPLLPDVFQIDGQEYNYQQFIYNNSVLLEHGLTQFVESFTRQRAGRV
AGGRNLPVAVEKVSKASIDQSREMKYQSFNEYRKRFLVKPYESFEELTGEKEMAAELEALYGDIDAMEFY
PALLVEKPRPDAIFGETMVEAGAPFSLKGLMGNPICSPEYWKPSTFGGEVGFKIINTASIQSLICSNVKG
CPFTSFSVQDTHLTKTVTINASSSHSGLDDINPTVLLKERSTEL
>mouse [Mus musculus]
MLFRAVLLCAALGLSQAANPCCSNPCQNRGECMSTGFDQYKCDCTRTGFYGENCTTPEFLTRIKLLLKPT
PNTVHYILTHFKGVWNIVNNIPFLRSLIMKYVLTSRSYLIDSPPTYNVHYGYKSWEAFSNLSYYTRALPP
VADDCPTPMGVKGNKELPDSKEVLEKVLLRREFIPDPQGSNMMFAFFAQHFTHQFFKTDHKRGPGFTRGL
GHGVDLNHIYGETLDRQHKLRLFKDGKLKYQVIGGEVYPPTVKDTQVEMIYPPHIPENLQFAVGQEVFGL
VPGLMMYATIWLREHNRVCDILKQEHPEWGDEQLFQTSRLILIGETIKIVIEDYVQHLSGYHFKLKFDPE
LLFNQQFQYQNRIASEFNTLYHWHPLLPDTFNIEDQEYSFKQFLYNNSILLEHGLTQFVESFTRQIAGRV
AGGRNVPIAVQAVAKASIDQSREMKYQSLNEYRKRFSLKPYTSFEELTGEKEMAAELKALYSDIDVMELY
PALLVEKPRPDAIFGETMVELGAPFSLKGLMGNPICSPQYWKPSTFGGEVGFKIINTASIQSLICNNVKG
CPFTSFNVQDPQPTKTATINASASHSRLDDINPTVLIKRRSTEL
>Rabbit
MLARALLLCAAVALSHAANPCCSNPCQNRGVCMTMGFDQYKCDCTRTGFYGENCSTPEFLTRIKLLLKPT
PDTVHYILTHFKGVWNIVNSIPFLRNSIMKYVLTSRSHMIDSPPTYNVHYNYKSWEAFSNLSYYTRALPP
VADDCPTPMGVKGKKELPDSKDVVEKLLLRRKFIPDPQGTNMMFAFFAQHFTHQFFKTDLKRGPAFTKGL
GHGVDLNHIYGETLDRQHKLRLFKDGKMKYQVIDGEVYPPTVKDTQVEMIYPPHIPAHLQFAVGQEVFGL
VPGLMMYATIWLREHNRVCDVLKQEHPEWDDEQLFQTSRLILIGETIKIVIEDYVQHLSGYHFKLKFDPE
LLFNQQFQYQNRIAAEFNTLYHWHPLLPDTFQIDDQQYNYQQFLYNNSILLEHGLTQFVESFTRQIAGRV
AGGRNVPPAVQKVAKASIDQSRQMKYQSLNEYRKRFLLKPYESFEELTGEKEMAAELEALYGDIDAVELY
PALLVERPRPDAIFGESMVEMGAPFSLKGLMGNPICSPNYWKPSTFGGEVGFKIVNTASIQSLICNNVKG
CPFTSFNVPDPQLTKTVTINASASHSRLEDINPTVLLKGRSTEL
>pig [Sus scrofa]
MLARALLLCAAVSLCTAAKPCCSNPCQNRGICMSVGFDHYKCDCTRTGFYGENCTTPEFLTRIKLFLKPT
PNTVHYILTHFKGVWNIVNNIPFLRNAIMKYVLISRSHLIDSPPTYNMHYGYKSWEAFSNLSYYTRALPP
VPDDCPTPMGVKGRKELPDSKEVVEKLLLRRKFIPDPQGTNMMFAFFAQHFTHQFFKTDQKRGPAFTKGQ
GHGVDLSHVYGESLERQHKLRLFKDGKMKYQIIDGEMYPPTAKDTQVEMIYPPHTPEHLRFAVGHEVFGL
VPGLMMYATIWLREHNRVCDVLKQEHPEWDDERLFQTSRLILIGETIKIVIEDYVQHLSGYHFKLKFDPE
LLFNQQFQYQNRIAAEFNTLYHWHPLLPDAFQIDGHEYNYQQFLYNNSILLEHGITQFVESFSRQIAGRV
AGGRNLPAAVQKVSKASIDQSREMRYQSFNEYRKRFLLKPYRSFEELTGEKEMAAELEALYGDIDAMELY
PALLVEKPRPDAIFGETMVEAGAPFSLKGLMGNPICSPEYWKPSTFGGEVGFKIINTASIQSLICNNVKG
CPFTSFSVQDPQLAKTVTINASSSHSGLDDINPTVLLKERSTEL
>coral [Gersemia fruticosa]
MVAKFVVFLGLQLILCSVVCEAVNPCCSFPCESGAVCVEDGDKYTCDCTRTGHYGVNCEKPNWSTWFKAL
IAPSEETKHFILTHFKWFWWIVNNVPFIRNTVMKAAYFSRTDFVPVPHAYTSYHDYATMEAHYNRSYFAR
TLPPVPKNCPTPFGVAGKKELPPAEEVANKFLKRGKFKTDHTSTSWLFMFFAQHFTHEFFKTIYHSPAFT
WGNHGVDVSHIYGQDMERQNKLRSFEDGKLKSQTINGEEWPPYLKDVDNVTMQYPPNTPEDQKFALGHPF
YSMLPGLFMYASIWLREHNRVCTILRKEHPHWVDERLYQTGKLIITGELIKIVIEDYVNHLANYNLKLTY
NPELVFDHGYDYDNRIHVEFNHMYHWHPFSPDEYNISGSTYSIQDFMYHPEIVVKHGMSSFVDSMSKGLC
GQMSHHNHGAYTLDVAVEVIKHQRELRMQSFNNYRKHFALEPYKSFEELTGDPKMSAELQEVYGDVNAVD
LYVGFFLEKGLTTSPFGITMIAFGAPYSLRGLLSNPVSSPTYWKPSTFGGDVGFDMVKTASLEKLFCQNI
AGECPLVTFTVPDDIARETRKVLEARDEL

For alignment select “Full”; for output format, select “aln w/numbers” so that particular residues (amino acids) in the alignment can be found; for the Output order select “input”. Click the “Run” button located in the lower right.

3. View the output, starting with the SCORES table:

SeqA  Name   Len(aa)  SeqB  Name   Len(aa)  Score
===================================================
1     dog    604      2     cow    604      90
1     dog    604      3     mouse  604      89

Note that each specific pair of sequences is compared; dog to cow, for example. You would expect a higher SCORE (right column; similarity of the two sequences) between two mammals than between a mouse and the coral. What is the similarity score for the gene found in mouse and coral? ________

View the cladogram at the bottom of the page. (To learn more about cladograms go to en.wikipedia.org/wiki/Cladogram.) Switch to the phylogram view. Which two species are most similar, based on this view? (Or can one even tell?)

Now for the most important part of this ClustalW analysis: an amino acid by amino acid comparison of the same protein from different species. Scroll down the web page and find ALIGNMENT. A button labeled 'Show Colors' is displayed in the Alignment section of the results page. If you press this button, the alignment will be shown in color according to the table below. (This option only works when you have chosen ALN or GCG as the output format.)

Residues   Color     Property
AVFPMILW   Red       Small: small or hydrophobic; includes aromatic except Tyr
DE         Blue      Acidic
RHK        Magenta   Basic
STYHCNGQ   Green     Hydroxyl + Amine + Basic - Q
Others     Gray

CONSENSUS SYMBOLS: By default, an alignment displays the following symbols denoting the degree of conservation observed in each column:

Symbol    Meaning
*         The residues in that column are identical in all sequences in the alignment.
:         Conserved substitutions are present, according to the COLOR table above.
.         Semi-conserved substitutions are present.
(space)   ?

Figure 1. A Venn diagram showing the relationship of the 20 naturally occurring amino acids to some physicochemical properties. Exarchos et al. BMC Bioinformatics, 2009, 10:113 (Creative Commons Attribution License)

Copy the alignment of amino acids in various species and paste it into a Word document. To make this file readable, do the following things:

a) Go to “Page Set-up” under “File” and change the page orientation to landscape.

b) Select all text and change to “Courier” font, size 10. Courier is the best font for alignments because all the letters are the same width. This is one of the major secrets of working with FASTA sequences.

c) Save this file to the desktop as “ClustalW.doc” and print it (send the file to yourself by email or place it on a floppy or flash drive). Place a copy in your lab notebook.

4. Review the alignment. What does the presence of a space under a column in the alignment indicate about the relation of the residues?

5. Find the longest string of conserved (defined in glossary) residues (watch out for strings that continue across the ends of rows). How many residues does it contain?
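The search for the longest string of conserved residues can also be checked by computer. The following Python sketch is only an illustration, not part of the original exercise: the function names conserved_mask and longest_conserved_run are made up for this example, the three short sequences are toy data rather than the PTGS2 alignment, and it assumes each aligned sequence has been joined into one string of equal length, with "-" marking gaps. A column counts as conserved only when every sequence has the same residue there, which corresponds to the "*" consensus symbol.

```python
def conserved_mask(aligned):
    """Return a list of booleans, True where every sequence has the same residue."""
    lengths = {len(s) for s in aligned}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    mask = []
    for column in zip(*aligned):
        first = column[0]
        # A gap ("-") never counts as conserved, even if all rows have one.
        mask.append(first != "-" and all(res == first for res in column))
    return mask


def longest_conserved_run(aligned):
    """Return (start, length) of the longest run of conserved columns.

    start is 1-based; (0, 0) is returned if no column is conserved.
    """
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    for i, conserved in enumerate(conserved_mask(aligned)):
        if conserved:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start + 1, run_len
        else:
            run_len = 0
    return best_start, best_len


# Toy aligned fragments (hypothetical, not taken from the PTGS2 alignment).
toy = [
    "MLARALLLCAA-PCCS",
    "MLARAVLLCAAVPCCS",
    "MLFRALLLCAAVPCCS",
]
start, length = longest_conserved_run(toy)
print(start, length, toy[0][start - 1:start - 1 + length])  # 7 5 LLCAA
```

Because the sketch works on whole sequences rather than on the wrapped rows of the printed alignment, a conserved string that continues from the end of one row onto the next is counted as a single run.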
