Practical Bioinformatics Tools For Understanding Evolution
Robert Latek, PhD
Bioinformatics and Research Computing
Whitehead Institute for Biomedical Research

Aims
• Examine Techniques For Describing Evolutionary Relationships
• Learn To Apply Bioinformatics Tools To Study Evolution
• Question, Interrupt, Discuss, Suggest

WIBR Bioinformatics, © Whitehead Institute 2004

Bioinformatics ?
• Definition
  – Integration of computational and biological methods to promote biological discovery
  – Combination of Biology, Statistics, CS, Clinical Research
• Purpose
  – Predict, Decipher, Visualize
• Methodology
  – Data Mining and Comparisons
[Figure: example protein sequence MSRKGPRAEVCADCS APDPGWASISRGVLVC DECCSVHRLGRHISIV KHLRHSAWPPTLLQM VHTLASNGANSIWEHS LLDPAQVQSGRRKAN]
[Figure: Data Visualization (G. Bell)]

Bioinformatics :-)
• Biological Comparisons (Evolutionary Analysis)
  – How closely/distantly related are two populations?
• Gene Function Prediction
  – How and why does Gene X function/malfunction?
• Pharmaceutical Design & In Silico Testing

Bioinformatics@WI
• Bioinformatics and Research Computing
• Collaboration, Consultation, Education in Bioinformatics and Graphics
• Provide hardware, commercial/custom software tools, training, and bioinformatics expertise
[Figure labels: Predict, Decipher]

Discussion Map
• Relationships Among Groups Of Genes
  – Comparing Sequences
  – Building Sequence Families
• Sequence Conservation During Evolution
  – Aligning Multiple Sequences
• Evolutionary Diagrams
  – Tracing The Descent From Common Ancestors
  – Growing Phylogenetic Trees

Evolutionary Analysis
• Definition
  – The use of phylogeny to reveal relationships among sets of genes
• Purpose
  – To utilize information about common ancestors to predict gene function and regulation
• Methodology
  – Compare properties between genes/organisms and identify commonalities and differences
  – Organization of genes into evolutionary diagrams
  – Sequence by sequence comparisons

Sequence-Based Comparisons
• Identify sequences that are related to each other within an organism and/or across different species
  – Within: Fetal and adult hemoglobin
  – Across: Human and chimpanzee hemoglobin
• Generate an evolutionary history of related genes
• Locate insertions, deletions, and substitutions that have occurred during evolution

(C) Cysteine
(R) Arginine
(E) Glutamate
(A) Alanine
(T) Threonine
(S) Serine
(L) Leucine
(P) Proline
(G) Glycine

CREATE      [Ancestor]
CREASE
-RELAPSE    [Progenitors]
GREASER

Homology & Similarity
• Homology
  – Conserved sequences arising from a common ancestor
  – Orthologs: homologous genes that share a common ancestor in the absence of any gene duplication (Mouse and Human Hemoglobin)
  – Paralogs: genes related through gene duplication (one gene is a copy of
    another - Fetal and Adult Hemoglobin)
• Similarity
  – Genes that share common sequences but are not necessarily related

Sequences As Modules
• Proteins are derived from a limited number of basic building blocks (Modules)
• Evolution has shuffled these modules, giving rise to a diverse repertoire of protein sequences
• As a result, proteins can share a global relationship or a local relationship specific to a particular DOMAIN
[Figure labels: Global, Local]

Sequence Domains
Modules Define Functional/Structural Domains

Sequence Families
• Definition
  – Group of sequences that share a common function and/or structure and that are potentially derived from a common ancestor (set of homologous sequences)
• Building A Family
  – Domains are used to group different sequences into common families

Defining A Sequence Family
[Figure labels: Family A, Family B, Family C, Family D, Family E]

Sequence Family Resources
• Search and Browse Family Databases
• PFAM
  – http://pfam.wustl.edu/

>src
MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAEPKLFGGFNSSDTVTSPQRAGPLAGG
VTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYVAPSDSIQAEEWYFGKITRRESERLLLNAEN
PRGTFLVRESETTKGAYCLSVSDFDNAKGLNVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTVCPTSKPQT
QGLAKDAWEIPRESLRLEVKLGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIV
TEYMSKGSLLDFLKGETGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYTARQGA
KFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVERGYRMPCPPECPESLHDLMCQCWRKEPEERP
TFEYLQAFLEDYFTSTEPQYQPGENL

Sequence Family Resources
• NCBI Family Database Resources
• Conserved Domain Database
  – http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cdd
• Conserved Domain Architecture Retrieval Tool –
  http://www.ncbi.nlm.nih.gov/BLAST/

Discussion Map
• Relationships Among Groups Of Genes
  – Comparing Sequences
  – Building Sequence Families
• Sequence Conservation During Evolution
  – Aligning Multiple Sequences
• Evolutionary Diagrams
  – Tracing The Descent From Common Ancestors
  – Growing Phylogenetic Trees

Multiple Sequence Alignments
• Place residues in columns that are derived from a common ancestral residue
• Identify Matches, Mismatches, and Gaps
• MSA can reveal sequence patterns
  – Demonstration of homology between >2 sequences
  – Identification of functionally important sites
  – Protein function prediction
  – Structure prediction

Unaligned sequences:
CREASE
CREATE
RELAPSE
GREASER

Aligned sequences:
SeqA  CRE-A-TE-
SeqB  CRE-A-SE-
SeqC  GRE-A-SER
SeqD  -RELAPSE-
      123456789

Global vs. Local Alignments
• Global
  – Search for alignments, matching over entire sequences
• Local
  – Examine regions of sequence for conserved segments
• Both Consider: Matches, Mismatches, Gaps

Global Sequence Alignments
[Figure: Yeast Prion-Like Proteins]

How To Make A Global MSA
• On The Web
  – http://pir.georgetown.edu/pirwww/search/multaln.html
• On Your Computer
  – ClustalX: http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/

MSA Example Sequences
Standard FASTA Sequence Format

>KSYK_HUMAN
FFFGNITREEAEDYLVQGGMSDGLYLLRQSRNYLGGFALSVAHGRKAHHYTIERELNGTYAIAGGRTHASPADLCHYH
>ZA70_HUMAN
WYHSSLTREEAERKLYSGAQTDGKFLLRPRKEQGTYALSLIYGKTVYHYLISQDKAGKYCIPEGTKFDTLWQLVEYL
>KSYK_PIG
WFHGKISRDESEQIVLIGSKTNGKFLIRARDNGSYALGLLHEGKVLHYRIDKDKTGKLSIPGGKNFDTLWQLVEHY
>MATK_HUMAN
WFHGKISGQEAVQQLQPPEDGLFLVRESARHPGDYVLCVSFGRDVIHYRVLHRDGHLTIDEAVFFCNLMDMVEHY
>CSK_CHICK
WFHGKITREQAERLLYPPETGLFLVRESTNYPGDYTLCVSCEGKVEHYRIIYSSSKLSIDEEVYFENLMQLVEHY
>CRKL_HUMAN
WYMGPVSRQEAQTRLQGQRHGMFLVRDSSTCPGDYVLSVSENSRVSHYIINSLPNRRFKIGDQEFDHLPALLEFY
>YES_XIPHE
WYFGKLSRKDTERLLLLPGNERGTFLIRESETTKGAYSLSLRDWDETKGDNCKHYKIRKLDNGGYYITTRTQFMSLQMLVKHY
>FGR_HUMAN
WYFGKIGRKDAERQLLSPGNPQGAFLIRESETTKGAYSLSIRDWDQTRGDHVKHYKIRKLDMGGYYITTRVQFNSVQELVQHY
>SRC_RSVP
WYFGKITRRESERLLLNPENPRGTFLVRKSETAKGAYCLSVSDFDNAKGPNVKHYKIYKLYSGGFYITSRTQFGSLQQLVAYY

MSA Example Result

YES_XIPHE   WYFGKLSRKDTERLLLLPGNERGTFLIRESETTKGAYSLSLRDWDETKGDNCKHYKIRKL
FGR_HUMAN   WYFGKIGRKDAERQLLSPGNPQGAFLIRESETTKGAYSLSIRDWDQTRGDHVKHYKIRKL
SRC_RSVP    WYFGKITRRESERLLLNPENPRGTFLVRKSETAKGAYCLSVSDFDNAKGPNVKHYKIYKL
MATK_HUMAN  WFHGKISGQEAVQQLQPPED--GLFLVRESARHPGDYVLCVS-----FGRDVIHYRVLHR
CSK_CHICK   WFHGKITREQAERLLYPPET--GLFLVRESTNYPGDYTLCVS-----CEGKVEHYRIIYS
CRKL_HUMAN  WYMGPVSRQEAQTRLQGQRH--GMFLVRDSSTCPGDYVLSVS-----ENSRVSHYIINSL
ZA70_HUMAN  WYHSSLTREEAERKLYSGAQTDGKFLLRPRK-EQGTYALSLI-----YGKTVYHYLISQD
KSYK_PIG    WFHGKISRDESEQIVLIGSKTNGKFLIRAR--DNGSYALGLL-----HEGKVLHYRIDKD
KSYK_HUMAN  FFFGNITREEAEDYLVQGGMSDGLYLLRQSRNYLGGFALSVA-----HGRKAHHYTIERE
            :: . :   ::   :       * :*:*      * : * :            ** :

YES_XIPHE   DNGGYYITTRTQFMSLQMLVKHY
FGR_HUMAN   DMGGYYITTRVQFNSVQELVQHY
SRC_RSVP    YSGGFYITSRTQFGSLQQLVAYY
MATK_HUMAN  -DGHLTIDEAVFFCNLMDMVEHY
CSK_CHICK   -SSKLSIDEEVYFENLMQLVEHY
CRKL_HUMAN  PNRRFKIGDQE-FDHLPALLEFY
ZA70_HUMAN  KAGKYCIPEGTKFDTLWQLVEYL
KSYK_PIG    KTGKLSIPGGKNFDTLWQLVEHY
KSYK_HUMAN  LNGTYAIAGGRTHASPADLCHYH
                  *     .     :  .
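The conservation marks under the alignment can be recomputed from the rows themselves. The following Python sketch is illustrative and not part of the original slides: it takes the second block of the MSA Example Result, reports the columns in which every sequence carries the same residue (the columns marked with an asterisk), and counts identical positions for one pair of rows. The function names conserved_columns and identical_positions are hypothetical, and treating a gap as never conserved is an assumption.

```python
# Second block of the MSA Example Result (23 alignment columns per row).
BLOCK2 = {
    "YES_XIPHE":  "DNGGYYITTRTQFMSLQMLVKHY",
    "FGR_HUMAN":  "DMGGYYITTRVQFNSVQELVQHY",
    "SRC_RSVP":   "YSGGFYITSRTQFGSLQQLVAYY",
    "MATK_HUMAN": "-DGHLTIDEAVFFCNLMDMVEHY",
    "CSK_CHICK":  "-SSKLSIDEEVYFENLMQLVEHY",
    "CRKL_HUMAN": "PNRRFKIGDQE-FDHLPALLEFY",
    "ZA70_HUMAN": "KAGKYCIPEGTKFDTLWQLVEYL",
    "KSYK_PIG":   "KTGKLSIPGGKNFDTLWQLVEHY",
    "KSYK_HUMAN": "LNGTYAIAGGRTHASPADLCHYH",
}


def conserved_columns(rows):
    """Return 1-based columns where all rows share one residue (gaps excluded)."""
    conserved = []
    for index, column in enumerate(zip(*rows), start=1):
        residues = set(column)
        # Assumption: a column containing a gap is never reported as conserved.
        if len(residues) == 1 and "-" not in residues:
            conserved.append(index)
    return conserved


def identical_positions(row_a, row_b):
    """Count aligned positions where two rows carry the same residue."""
    return sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")


if __name__ == "__main__":
    print(conserved_columns(list(BLOCK2.values())))
    print(identical_positions(BLOCK2["YES_XIPHE"], BLOCK2["FGR_HUMAN"]))
```

Run on this block, the first function reports column 7 only, the isoleucine shared by all nine sequences. The weaker conservation marks (colon and period) depend on amino acid similarity groups and are not reproduced by this sketch.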

Discussion Map
• Relationships Among Groups Of Genes
  – Comparing Sequences
  – Building Sequence Families
• Sequence Conservation During Evolution
  – Aligning Multiple Sequences
• Evolutionary Diagrams
  – Tracing The Descent From Common Ancestors
  – Growing Phylogenetic Trees

Phylogenetic Trees
• A Graph Representing The Evolutionary History Of Sequences
  – Relationship of sequences to one another (How everything is connected)
  – Dissect the order of appearance of insertions, deletions, and mutations
• Identify Related Sequences, Predict Function, Observe Epidemiology (Analyze changes in viral strains)
[Figure: Simple Tree with leaves A, B, C, D]

Tree Shapes
[Figure: Rooted and Un-rooted trees, each with leaves A, B, C, D]
Branches intersect at Nodes
Leaves are the topmost branches

Tree Characteristics
• Tree Properties
  – Clade: all the descendants of a common ancestor represented by a node
  – Distance: number of changes that have taken place along a branch
• Tree Types
  – Cladogram: shows the branching order of nodes
  – Phylogram: shows branching order and distances
[Figure: Phylogram with leaves A, B, C, D and branch length labels .035, .012, .009, .057, .016, .044]

Tree Building Methods
• Group Most Common Sequences
  – Find the tree that changes one sequence into all of the others by the least number of steps
  – Sequences with the smallest number of differences have the shortest distance between them and are called “related taxa”

Tree Building Methods
[Figure: stepwise grouping of sequences A, B, C, D, E, and F into a tree]

Example Evolutionary Trees
Anthropological and Archeological
• Tree Of Life
  – http://tolweb.org/tree/phylogeny.html
• Theory Of Human Evolution At The SI –
  http://www.mnh.si.edu/anthro/humanorigins/ha/a_tree.html

How To Build A Tree
Sequence Based
• Create Alignment
  – http://pir.georgetown.edu/pirwww/search/multaln.html
• Create Tree
  – http://www.genebee.msu.su/services/phtree_reduced.html
• Draw Tree
  – http://iubio.bio.indiana.edu/treeapp/treeprint-form.html

MSA and Tree Relationship
• “The optimal alignment of several sequences can be thought of as minimizing the number of mutational steps in an evolutionary tree for which the sequences are the leaves” (Mount, 2001)
[Figure: evolutionary tree rooted at CREATE, with intermediate sequences CREATE, CREASE, and GREASE and branch labels "T to S", "C to G", "+R", "+L +P -G", ending in the aligned leaves below]
CRE-A-TE-  SeqA
CRE-A-SE-  SeqB
GRE-A-SER  SeqC
-RELAPSE-  SeqD

Summary Review
• Relationships Among Groups Of Genes
  – Comparing Sequences
  – Building Sequence Families
• Sequence Conservation During Evolution
  – Aligning Multiple Sequences
• Evolutionary Diagrams
  – Tracing The Descent From Common Ancestors
  – Growing Phylogenetic Trees

References
• Bioinformatics: Sequence and Genome Analysis. David W. Mount. CSHL Press, 2001.
• Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. Andreas D. Baxevanis and B.F. Francis Ouellette. Wiley-Interscience, 2001.
• Bioinformatics: Sequence, Structure, and Databanks. Des Higgins and Willie Taylor. Oxford University Press, 2000.
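
Supplementary sketch: the distance idea from the Tree Building Methods slide (sequences with the fewest differences are grouped first) can be tried on the four aligned toy sequences. The Python below is an illustration under stated assumptions, not the method used by any of the web tools listed in these slides: distances are plain counts of differing alignment columns (a gap against a residue counts as one difference), and clusters are merged by average linkage, a UPGMA-style choice. The names column_differences and average_linkage_tree are hypothetical.

```python
from itertools import combinations

# Toy alignment from the MSA and Tree Relationship slide.
ALIGNMENT = {
    "SeqA": "CRE-A-TE-",
    "SeqB": "CRE-A-SE-",
    "SeqC": "GRE-A-SER",
    "SeqD": "-RELAPSE-",
}


def column_differences(row_a, row_b):
    """Count alignment columns in which two rows differ (gaps included)."""
    return sum(1 for a, b in zip(row_a, row_b) if a != b)


def average_linkage_tree(seqs):
    """Merge the closest clusters repeatedly and return a Newick-style string."""
    # Each key is the cluster's label so far; each value lists its member names.
    clusters = {name: [name] for name in seqs}
    while len(clusters) > 1:

        def cluster_distance(pair):
            left, right = pair
            total = sum(
                column_differences(seqs[m], seqs[n])
                for m in clusters[left]
                for n in clusters[right]
            )
            # Average over all member pairs (average linkage).
            return total / (len(clusters[left]) * len(clusters[right]))

        left, right = min(combinations(clusters, 2), key=cluster_distance)
        merged = clusters.pop(left) + clusters.pop(right)
        clusters[f"({left},{right})"] = merged
    return next(iter(clusters)) + ";"


if __name__ == "__main__":
    print(average_linkage_tree(ALIGNMENT))
```

With these four sequences the first merge joins SeqA and SeqB, which differ in a single column (T versus S), in agreement with the CREATE to CREASE step shown in the MSA and Tree Relationship slide.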