Lecture 29
#29_Oct31
BCB 444/544 F07 ISU Terribilini #29- Phylogenetics 10/31/07 1
Required Reading
(before lecture)
Mon Oct 29 - Lecture 28
Promoter & Regulatory Element Prediction
• Chp 9 - pp 113 - 126
Wed Oct 31 - Lecture 29
Phylogenetics Basics
• Chp 10 - pp 127 - 141
Thurs Nov 1 - Lab 9
Gene & Regulatory Element Prediction
Fri Nov 2 - Lecture 30
Phylogenetic Tree Construction Methods & Programs
• Chp 11 - pp 142 - 169
Assignments & Announcements
Mon Oct 29 HW#5
HW#5 = Hands-on exercises with phylogenetics and tree-building software
Due: Mon Nov 5 (not Fri Nov 2 as previously posted)
BCB 544 "Team" Projects
Last week of classes will be devoted to Projects
• Written reports due:
• Mon Dec 3 (no class that day)
• Oral presentations (20-30 min) will be:
• Wed-Fri Dec 5,6,7
• 1 or 2 teams will present during each class period
See Guidelines for Projects posted online
BCB 544 Only:
New Homework Assignment
544 Extra#2
Due: √PART 1 - ASAP
PART 2 - meeting prior to 5 PM Fri Nov 2
Part 1 - Brief outline of project, emailed to Drena & Michael; after response/approval, then:
Part 2 - More detailed outline of project
Read a few papers and summarize status of problem
Schedule meeting with Drena & Michael to discuss ideas
Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics: http://www.bcb.iastate.edu/seminars/index.html
• Nov 1 Thurs - BBMB Seminar 4:10 in 1414 MBB
Todd Yeates, UCLA
TBA - something cool about structure and evolution?
• Nov 2 Fri - BCB Faculty Seminar 2:10 in 102 ScI
Bob Jernigan, BBMB, ISU
Control of Protein Motions by Structure
Chp 10 - Phylogenetics
SECTION IV MOLECULAR PHYLOGENETICS
Xiong: Chp 10 Phylogenetics Basics
• Evolution and Phylogenetics
• Terminology
• Gene Phylogeny vs. Species Phylogeny
• Forms of Tree Representation
• Why Finding a True Tree is Difficult
• Procedure of Building a Phylogenetic Tree
Evolution and Phylogenetics
• Evolution – the development of biological form from other preexisting forms
• Evolution proceeds by natural selection
Natural Selection
• Species can produce more offspring than the environment can support. This leads to competition for resources. Genetic variations exist in a population that give some individuals an advantage, others a disadvantage, leading to differential reproductive success.
Phylogenetics
• Phylogenetics is the study of the evolutionary history of living organisms
• Uses tree-like diagrams to represent the pedigrees of the organisms
• Similarities and differences seen in a multiple sequence alignment are easier to make sense of in a phylogenetic tree
Data Used in Phylogenetics
• Fossil records - morphology and timeline of divergence
• Limitations - not available for all species in all areas, morphology determined by multiple genetic factors, fossils for microorganisms are especially rare
• Molecular data - DNA and protein sequences ("molecular fossils")
• Advantages - lots of data, easy to obtain
• Limitations - can be difficult to get sequences from extinct species
• Physical, behavior, and developmental characteristics can also be used in phylogenetics
Molecular Phylogenetics
• Molecular phylogenetics is the study of evolutionary relationships of genes and other biological macromolecules by analyzing their sequences
• Sequence similarity can be used to infer evolutionary relationships
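One simple way to turn sequence similarity into a number is the p-distance: the fraction of aligned positions at which two sequences differ. The sketch below is only an illustration; the function name `p_distance` and the toy sequences are invented here, and it assumes the sequences are already aligned, with "-" marking gaps.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of differing sites between two aligned sequences.

    Columns where either sequence has a gap ("-") are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # ignore gapped columns
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


# Toy aligned sequences (made up for illustration only).
aligned = {
    "seq1": "ACGTACGTAC",
    "seq2": "ACGTACGTTC",
    "seq3": "ACGAAC-TTG",
}
for name in ("seq2", "seq3"):
    print("seq1 vs", name, p_distance(aligned["seq1"], aligned[name]))
```

A smaller p-distance suggests a closer relationship, but the raw value does not correct for multiple substitutions at the same site.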
Assumptions in Molecular Phylogenetics
• Sequences used are homologous, i.e. share a common ancestor
• Phylogenetic divergence is bifurcating, i.e. parent branch splits into two daughter branches
• Each position in a sequence evolved independently
• Molecular Clock – sequences evolve at constant rates (only used in some methods)
Terminology
• Clade (monophyletic group) = group of taxa comprising a single common ancestor and all of its descendants
• Lineage = branch path depicting an ancestor-descendant relationship
• Paraphyletic group = group of taxa that includes a common ancestor but only some of its descendants, so it is not a clade
[Figure: example tree with taxa A through H]
Tree Topology
• Tree topology is the branching pattern in a tree
• Dichotomy (bifurcation) – a node splits into exactly two descendant branches
• Polytomy (multifurcation) – a node splits into more than two descendant branches
Rooted vs. Unrooted Trees
• Unrooted trees have no root node; they show relationships among taxa without assuming knowledge of a common ancestor
• An unrooted tree can be converted to a rooted one, but the position of the root must first be determined
• Two ways to define the root:
• Use an outgroup
• Midpoint rooting – midpoint of the two most divergent groups is assigned to be the root
Outgroups
• An outgroup is a sequence that is homologous to the sequences being studied (the ingroup) but more distantly related to them than they are to each other
• Must be distinct from the ingroup, but not too distant
• If the outgroup is too distantly related, it can lead to errors in tree construction
• The trick is to find the most closely related sequence that still lies outside the ingroup
Gene Phylogeny vs. Species Phylogeny
• When using molecular data, we are technically building a phylogeny for just that sequence, not for the species from which the sequences came
• Species evolution is the result of mutations in the entire genome
• Your gene may have evolved differently than other genes in the genome
• To obtain a species phylogeny, we need to use a variety of gene families to construct the tree
Forms of Tree Representation
• Phylogram – branch lengths represent the amount of evolutionary divergence
• Cladogram – branch lengths are meaningless; only topology matters
Forms of Tree Representation
• Newick format – text format for use by computer programs
• Example: (((B,C),A),(D,E));
• Can also include branch lengths, written after a colon following each node name or closing parenthesis
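To see how the nested parentheses encode the branching pattern, here is a minimal sketch of a Newick reader. It is not a full parser: it assumes no branch lengths, internal labels, or quoted names, and the function names `parse_newick` and `leaves` are made up for this illustration.

```python
def parse_newick(text: str):
    """Parse a Newick string (topology only) into nested tuples of leaf names."""
    text = text.strip().rstrip(";").strip()
    pos = 0

    def peek() -> str:
        # Return the current character, or "" once the input is exhausted.
        return text[pos] if pos < len(text) else ""

    def parse_node():
        nonlocal pos
        if peek() == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while peek() == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if peek() != ")":
                raise ValueError("unbalanced parentheses at position %d" % pos)
            pos += 1  # consume ")"
            return tuple(children)
        # Otherwise read a leaf name up to the next delimiter.
        start = pos
        while peek() and peek() not in ",()":
            pos += 1
        return text[start:pos].strip()

    tree = parse_node()
    if pos != len(text):
        raise ValueError("unexpected text after tree at position %d" % pos)
    return tree


def leaves(node):
    """List leaf names in left-to-right order."""
    if isinstance(node, str):
        return [node]
    names = []
    for child in node:
        names.extend(leaves(child))
    return names


tree = parse_newick("(((B,C),A),(D,E))")
print(tree)
print(leaves(tree))
```

Each inner tuple corresponds to one internal node, so B and C are siblings, A joins them one level up, and D and E form a separate pair.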
Consensus Trees
Why Finding a True Tree is Difficult
• The number of possible trees grows exponentially with the number of species (or sequences)
• Number of rooted trees for n taxa: $N_R = \dfrac{(2n-3)!}{2^{n-2}\,(n-2)!}$
• Number of unrooted trees for n taxa: $N_U = \dfrac{(2n-5)!}{2^{n-3}\,(n-3)!}$
• To find the best tree, you must explore all possibilities (or must you?)
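To get a feel for how quickly the search space grows, the two counting formulas, N_R = (2n-3)! / (2^(n-2) (n-2)!) for rooted trees and N_U = (2n-5)! / (2^(n-3) (n-3)!) for unrooted trees, can be evaluated directly. This is a small sketch; the function names are chosen here for illustration.

```python
from math import factorial


def num_rooted_trees(n: int) -> int:
    """Number of rooted bifurcating trees for n taxa (n >= 2)."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def num_unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees for n taxa (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (3, 4, 5, 10, 20):
    print(n, num_rooted_trees(n), num_unrooted_trees(n))
```

With only 10 taxa there are already 34,459,425 rooted and 2,027,025 unrooted trees, which is why exhaustive search quickly becomes impractical.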
Tree Building Procedure
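• Choose molecular markers
• Perform multiple sequence alignment
• Choose a model of evolution
• Determine a tree building method
• Assess tree reliability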
Choice of Molecular Markers
• Very closely related organisms - use nucleic acid sequences, which evolve faster than protein sequences and so show more differences
• For individuals within a species - use noncoding regions of mtDNA, which have a faster mutation rate
• More distantly related species - slowly evolving nucleic acid sequences like ribosomal RNA or protein sequences
• Very distantly related species - use highly conserved protein sequences
Advantages of Protein Sequences
• More highly conserved - mutations in DNA may not change amino acid sequence
• The third position in a codon is especially free to vary, which violates our assumption of independent evolution of all positions in a sequence
• DNA sequences can be biased by codon usage differences between species - causes variations in sequence that are not attributable to evolution
• In alignments, unrelated DNA sequences can show a lot of similarity by chance because the alphabet has only 4 letters; proteins have this problem to a much lesser degree
• Introducing gaps in alignments of DNA sequences can cause frameshift errors, making alignment biologically meaningless
Advantages of DNA Sequences
• Better for closely related species
• Show synonymous and non-synonymous mutations, which allows analysis of positive and negative selection events
• A predominance of nonsynonymous mutations may mean positive selection for new protein functions that arise from a changed amino acid sequence
• A predominance of synonymous mutations may mean negative (purifying) selection, because changes to the amino acid sequence are detrimental
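The distinction between synonymous and nonsynonymous changes can be made concrete by comparing two aligned coding sequences codon by codon. The sketch below only counts differing codons of each kind using the standard genetic code; it is not a proper dN/dS estimate, which would also normalize by the number of synonymous and nonsynonymous sites. The function name and the example sequences are invented for illustration.

```python
# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def count_codon_changes(cds_a: str, cds_b: str):
    """Return (synonymous, nonsynonymous) counts of differing codons.

    Both inputs must be in-frame, aligned coding sequences of equal length.
    Codons containing gaps or ambiguous bases are skipped.
    """
    if len(cds_a) != len(cds_b) or len(cds_a) % 3 != 0:
        raise ValueError("sequences must be equal length and a multiple of 3")
    synonymous = 0
    nonsynonymous = 0
    for i in range(0, len(cds_a), 3):
        codon_a = cds_a[i:i + 3].upper()
        codon_b = cds_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # gap or ambiguous base
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


# Toy example: GCT -> GCC keeps Ala (synonymous); AAA -> AGA changes Lys to Arg.
print(count_codon_changes("ATGGCTAAA", "ATGGCCAGA"))
```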
Multiple Sequence Alignment
• Most critical step in tree building - cannot build correct tree without correct alignment
• Should build alignments with multiple programs, then inspect and compare to identify the most reasonable one
• Most alignments need manual editing
• Make sure important functional residues align
• Align secondary structure elements
• Decide whether to use the full alignment or just parts of it
Automatic Editing of Alignments
• Rascal and NorMD – correct alignment errors, remove potentially unrelated or highly divergent sequences
• Gblocks – detect and eliminate poorly aligned positions and divergent regions
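As a rough picture of what removing poorly aligned positions means, the sketch below drops alignment columns in which too many sequences have a gap. This is a deliberately crude stand-in and not the algorithm used by Gblocks, Rascal, or NorMD; the function name and the 0.5 threshold are assumptions chosen for illustration.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5):
    """Remove columns whose fraction of gaps ("-") exceeds max_gap_fraction.

    alignment: list of equal-length aligned sequences (strings).
    Returns a new list of trimmed sequences in the same order.
    """
    lengths = {len(seq) for seq in alignment}
    if len(lengths) != 1:
        raise ValueError("alignment must be non-empty with equal-length rows")
    n_seqs = len(alignment)
    n_cols = lengths.pop()
    keep = [
        col
        for col in range(n_cols)
        if sum(seq[col] == "-" for seq in alignment) / n_seqs <= max_gap_fraction
    ]
    return ["".join(seq[col] for col in keep) for seq in alignment]


# Toy alignment (made up): the third column is a gap in 2 of 3 sequences.
toy = ["AC-GT", "A--GT", "ACTG-"]
print(trim_gappy_columns(toy))
```

Real tools also weigh conservation and the context of neighboring columns, so their output will generally differ from this simple filter.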