Practical Bioinformatics Tools For Understanding Evolution
Robert Latek, PhD
Bioinformatics and Research Computing
Whitehead Institute for Biomedical Research
Aims
• Examine Techniques For Describing
Evolutionary Relationships
• Learn To Apply Bioinformatics Tools To
Study Evolution
• Question, Interrupt, Discuss, Suggest
WIBR Bioinformatics, © Whitehead Institute 2004
Bioinformatics ?
• Definition
– Integration of computational and biological methods to
promote biological discovery
– Combination of Biology, Statistics, CS, Clinical Research
• Purpose
– Predict, Decipher, Visualize
• Methodology
– Data Mining and Comparisons
[Figure: a protein sequence in one-letter code beside a data visualization example (credit: G. Bell)]
Bioinformatics :-)
• Biological Comparisons (Evolutionary Analysis)
– How closely/distantly related are two populations?
• Gene Function Prediction
– How and why does Gene X function/malfunction?
• Pharmaceutical Design & In Silico Testing
Bioinformatics@WI
• Bioinformatics and Research Computing
• Collaboration, Consultation, Education
in Bioinformatics and Graphics
• Provide hardware, commercial/custom
software tools, training, and
bioinformatics expertise
Discussion Map
• Relationships Among Groups Of Genes
– Comparing Sequences
– Building Sequence Families
• Sequence Conservation During Evolution
– Aligning Multiple Sequences
• Evolutionary Diagrams
– Tracing The Descent From Common Ancestors
– Growing Phylogenetic Trees
Evolutionary Analysis
• Definition
– The use of phylogeny to reveal relationships
among sets of genes
• Purpose
– To utilize information about common ancestors to
predict gene function and regulation
• Methodology
– Compare properties between genes/organisms
and identify commonalities and differences
– Organize genes into evolutionary diagrams
– Sequence by sequence comparisons
Sequence-Based Comparisons
• Identify sequences that are related to each other, within
an organism and/or across different species
– Within: Fetal and adult hemoglobin
– Across: Human and chimpanzee hemoglobin
• Generate an evolutionary history of related genes
• Locate insertions, deletions, and substitutions that
have occurred during evolution
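[Figure: an ancestral sequence and related sequences, written in one-letter amino acid codes]
C = Cysteine, R = Arginine, E = Glutamate, A = Alanine, T = Threonine,
S = Serine, L = Leucine, P = Proline, G = Glycine
CREATE [Ancestor]
CREASE, -RELAPSE, GREASER [Progenitors]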
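Locating substitutions between two related sequences can be sketched in a few lines. This is a minimal illustration using the CREATE and CREASE example; the function name `substitutions` is hypothetical, and the sketch assumes both sequences have the same length, so insertions and deletions are out of scope here.

```python
def substitutions(a: str, b: str) -> list[tuple[int, str, str]]:
    """Return (position, residue_in_a, residue_in_b) for every mismatch.

    Positions are 1-based. Both sequences must have the same length,
    so only substitutions (not insertions or deletions) are detected.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [
        (pos, x, y)
        for pos, (x, y) in enumerate(zip(a, b), start=1)
        if x != y
    ]


# Threonine (T) at position 5 of CREATE is replaced by Serine (S) in CREASE.
print(substitutions("CREATE", "CREASE"))  # [(5, 'T', 'S')]
```

Sequences of different lengths, such as RELAPSE and GREASER, need an alignment step first so that gaps can mark insertions and deletions.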
Homology & Similarity
• Homology
– Conserved sequences arising from a common
ancestor
– Orthologs: homologous genes that share a
common ancestor in the absence of any gene
duplication (Mouse and Human Hemoglobin)
– Paralogs: genes related through gene duplication
(one gene is a copy of another - Fetal and Adult
Hemoglobin)
• Similarity
– Genes that share common sequences but are not
necessarily related
Sequences As Modules
• Proteins are derived from a limited
number of basic building blocks
(Modules)
• Evolution has shuffled these modules
giving rise to a diverse repertoire of
protein sequences
• As a result, proteins can share a global
relationship or a local relationship
specific to a particular DOMAIN
[Figure: Global vs. Local relationships]
Sequence Domains
Modules Define Functional/Structural Domains
Sequence Families
• Definition
– Group of sequences that share a common function
and/or structure, that are potentially derived from a
common ancestor (set of homologous sequences)
• Building A Family
– Domains are used to group different sequences into
common families
Defining A Sequence Family
[Figure: sequences grouped into Family A, Family B, Family C, Family D, and Family E]
Sequence Family Resources
• Search and Browse Family Databases
• PFAM
– http://pfam.wustl.edu/
>src
MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAEPKLFGGFNSSDTVTSPQRAGPLAGG
VTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYVAPSDSIQAEEWYFGKITRRESERLLLNAEN
PRGTFLVRESETTKGAYCLSVSDFDNAKGLNVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTVCPTSKPQT
QGLAKDAWEIPRESLRLEVKLGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIV
TEYMSKGSLLDFLKGETGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYTARQGA
KFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVERGYRMPCPPECPESLHDLMCQCWRKEPEERP
TFEYLQAFLEDYFTSTEPQYQPGENL
Sequence Family Resources
• NCBI Family Database Resources
• Conserved Domain Database
– http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cdd
• Conserved Domain Architecture Retrieval Tool
– http://www.ncbi.nlm.nih.gov/BLAST/
Multiple Sequence Alignments
• Place residues in columns that are derived from a common
ancestral residue
• Identify Matches, Mismatches, and Gaps
• MSA can reveal sequence patterns
– Demonstration of homology between >2 sequences
– Identification of functionally important sites
– Protein function prediction
– Structure prediction

Unaligned: CREASE, CREATE, RELAPSE, GREASER

      123456789
SeqA  CRE-A-TE-
SeqB  CRE-A-SE-
SeqC  GRE-A-SER
SeqD  -RELAPSE-
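One way to see how an MSA reveals sequence patterns is to scan it column by column. The sketch below uses the four aligned toy sequences from this slide (SeqA to SeqD, nine columns) and reports the columns in which every sequence carries the same residue. The function name `conserved_columns` is hypothetical, and "conserved" is taken here to mean identical in all rows with no gaps, which is a simplifying assumption.

```python
# Aligned toy sequences from the slide; "-" marks a gap.
alignment = {
    "SeqA": "CRE-A-TE-",
    "SeqB": "CRE-A-SE-",
    "SeqC": "GRE-A-SER",
    "SeqD": "-RELAPSE-",
}


def conserved_columns(rows: list[str]) -> list[int]:
    """Return 1-based indices of columns where all rows share one residue."""
    return [
        index
        for index, column in enumerate(zip(*rows), start=1)
        if len(set(column)) == 1 and column[0] != "-"
    ]


print(conserved_columns(list(alignment.values())))  # [2, 3, 5, 8]
```

Columns 2, 3, 5, and 8 hold R, E, A, and E in every sequence, which is the kind of pattern that points to functionally important sites.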
Global vs. Local Alignments
• Global
– Search for alignments that match over
entire sequences
• Local
– Examine regions of sequence for
conserved segments
• Both Consider: Matches, Mismatches,
Gaps
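The difference between the two approaches can be sketched with a small dynamic-programming scorer. This is a minimal illustration, not the method used by any particular tool: the global mode follows the Needleman-Wunsch recurrence and the local mode follows Smith-Waterman, and the scoring values (match +1, mismatch -1, gap -1) and the function name `alignment_score` are assumptions chosen for the example.

```python
def alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                    gap: int = -1, local: bool = False) -> int:
    """Best pairwise alignment score using a linear gap penalty.

    local=False: global alignment, every residue of both sequences is used.
    local=True:  local alignment, the best-scoring pair of segments is used.
    """
    # Row 0 of the score matrix: leading gaps are free in local mode.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)   # a local alignment may restart anywhere
                best = max(best, score)
            curr.append(score)
        prev = curr
    return best if local else prev[-1]


# A conserved segment embedded in unrelated flanking residues.
print(alignment_score("XXXCREATEYYY", "CREATE"))              # 0
print(alignment_score("XXXCREATEYYY", "CREATE", local=True))  # 6
```

The global score is dragged down by the six gaps needed to span the flanking residues, while the local score reflects only the conserved segment.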
Global Sequence Alignments
Yeast Prion-Like Proteins
How To Make A Global MSA
• On The Web
– http://pir.georgetown.edu/pirwww/search/multaln.html
• On Your Computer
– ClustalX: http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/
MSA Example Sequences
Standard FASTA Sequence Format
>KSYK_HUMAN
FFFGNITREEAEDYLVQGGMSDGLYLLRQSRNYLGGFALSVAHGRKAHHYTIERELNGTYAIAGGRTHASPADLCHYH
>ZA70_HUMAN
WYHSSLTREEAERKLYSGAQTDGKFLLRPRKEQGTYALSLIYGKTVYHYLISQDKAGKYCIPEGTKFDTLWQLVEYL
>KSYK_PIG
WFHGKISRDESEQIVLIGSKTNGKFLIRARDNGSYALGLLHEGKVLHYRIDKDKTGKLSIPGGKNFDTLWQLVEHY
>MATK_HUMAN
WFHGKISGQEAVQQLQPPEDGLFLVRESARHPGDYVLCVSFGRDVIHYRVLHRDGHLTIDEAVFFCNLMDMVEHY
>CSK_CHICK
WFHGKITREQAERLLYPPETGLFLVRESTNYPGDYTLCVSCEGKVEHYRIIYSSSKLSIDEEVYFENLMQLVEHY
>CRKL_HUMAN
WYMGPVSRQEAQTRLQGQRHGMFLVRDSSTCPGDYVLSVSENSRVSHYIINSLPNRRFKIGDQEFDHLPALLEFY
>YES_XIPHE
WYFGKLSRKDTERLLLLPGNERGTFLIRESETTKGAYSLSLRDWDETKGDNCKHYKIRKLDNGGYYITTRTQFMSLQMLVKHY
>FGR_HUMAN
WYFGKIGRKDAERQLLSPGNPQGAFLIRESETTKGAYSLSIRDWDQTRGDHVKHYKIRKLDMGGYYITTRVQFNSVQELVQHY
>SRC_RSVP
WYFGKITRRESERLLLNPENPRGTFLVRKSETAKGAYCLSVSDFDNAKGPNVKHYKIYKLYSGGFYITSRTQFGSLQQLVAYY
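Reading FASTA text into a program takes only a few lines. The sketch below is a minimal parser, assuming that the record name is the first whitespace-delimited token after ">" and that a sequence may be wrapped over several lines; the function name `parse_fasta` is hypothetical, and the two short records are toy data rather than the sequences listed above.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA-formatted text into a {name: sequence} dictionary."""
    records: dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = ""            # start a new record
        elif name is not None:
            records[name] += line         # append wrapped sequence lines
    return records


example = """>SeqA toy record
CRE
ATE
>SeqB
CREASE
"""
print(parse_fasta(example))  # {'SeqA': 'CREATE', 'SeqB': 'CREASE'}
```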
MSA Example Result
YES_XIPHE   WYFGKLSRKDTERLLLLPGNERGTFLIRESETTKGAYSLSLRDWDETKGDNCKHYKIRKL
FGR_HUMAN   WYFGKIGRKDAERQLLSPGNPQGAFLIRESETTKGAYSLSIRDWDQTRGDHVKHYKIRKL
SRC_RSVP    WYFGKITRRESERLLLNPENPRGTFLVRKSETAKGAYCLSVSDFDNAKGPNVKHYKIYKL
MATK_HUMAN  WFHGKISGQEAVQQLQPPED--GLFLVRESARHPGDYVLCVS-----FGRDVIHYRVLHR
CSK_CHICK   WFHGKITREQAERLLYPPET--GLFLVRESTNYPGDYTLCVS-----CEGKVEHYRIIYS
CRKL_HUMAN  WYMGPVSRQEAQTRLQGQRH--GMFLVRDSSTCPGDYVLSVS-----ENSRVSHYIINSL
ZA70_HUMAN  WYHSSLTREEAERKLYSGAQTDGKFLLRPRK-EQGTYALSLI-----YGKTVYHYLISQD
KSYK_PIG    WFHGKISRDESEQIVLIGSKTNGKFLIRAR--DNGSYALGLL-----HEGKVLHYRIDKD
KSYK_HUMAN  FFFGNITREEAEDYLVQGGMSDGLYLLRQSRNYLGGFALSVA-----HGRKAHHYTIERE

YES_XIPHE   DNGGYYITTRTQFMSLQMLVKHY
FGR_HUMAN   DMGGYYITTRVQFNSVQELVQHY
SRC_RSVP    YSGGFYITSRTQFGSLQQLVAYY
MATK_HUMAN  -DGHLTIDEAVFFCNLMDMVEHY
CSK_CHICK   -SSKLSIDEEVYFENLMQLVEHY
CRKL_HUMAN  PNRRFKIGDQE-FDHLPALLEFY
ZA70_HUMAN  KAGKYCIPEGTKFDTLWQLVEYL
KSYK_PIG    KTGKLSIPGGKNFDTLWQLVEHY
KSYK_HUMAN  LNGTYAIAGGRTHASPADLCHYH

[Conservation markers (*, :, .) appear beneath each alignment block in the original figure]
Phylogenetic Trees
• A Graph Representing The
Evolutionary History Of Sequences
– Relationship of sequences to one
another (How everything is connected)
– Dissect the order of appearance of
insertions, deletions, and mutations
• Identify Related Sequences, Predict
Function, Observe Epidemiology
(Analyze changes in viral strains)
[Figure: simple tree with leaves A, B, C, D]
Tree Shapes
[Figure: a rooted tree and an un-rooted tree, each with leaves A, B, C, D]
• Branches intersect at Nodes
• Leaves are the topmost branches
Tree Characteristics
• Tree Properties
– Clade: all the descendants of a common
ancestor represented by a node
– Distance: number of changes that have taken
place along a branch
• Tree Types
– Cladogram: shows the branching order of
nodes
– Phylogram: shows branching order and
distances
[Figure: phylogram with leaves A, B, C, D and branch lengths .035, .012, .009, .057, .016, .044]
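These properties map directly onto a small data structure. The sketch below is illustrative only: the topology ((A,B),(C,D)) and the integer branch lengths are hypothetical and are not taken from the figure, and the helper names `clade`, `root_distances`, and `to_newick` are made up for this example. Each node is a (label_or_children, branch_length) pair, and the output uses the Newick text format, with lengths for a phylogram and without them for a cladogram.

```python
# Hypothetical rooted tree: node = (leaf_name or list_of_children, branch_length).
# Branch lengths count the changes along each branch (assumed values).
tree = (
    [
        ([("A", 1), ("B", 2)], 3),
        ([("C", 1), ("D", 4)], 2),
    ],
    0,
)


def clade(node):
    """All leaves descended from the ancestor represented by this node."""
    label, _ = node
    if isinstance(label, str):
        return [label]
    return [leaf for child in label for leaf in clade(child)]


def root_distances(node, so_far=0):
    """Total number of changes from the root to each leaf."""
    label, length = node
    total = so_far + length
    if isinstance(label, str):
        return {label: total}
    distances = {}
    for child in label:
        distances.update(root_distances(child, total))
    return distances


def to_newick(node, with_lengths=True):
    """Newick text: with lengths (phylogram) or without (cladogram)."""
    label, length = node
    if isinstance(label, str):
        text = label
    else:
        text = "(" + ",".join(to_newick(c, with_lengths) for c in label) + ")"
    return f"{text}:{length}" if with_lengths else text


print(clade(tree[0][0]))                         # ['A', 'B']
print(root_distances(tree))                      # {'A': 4, 'B': 5, 'C': 3, 'D': 6}
print(to_newick(tree, with_lengths=False) + ";") # ((A,B),(C,D));
print(to_newick(tree) + ";")                     # ((A:1,B:2):3,(C:1,D:4):2):0;
```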
Tree Building Methods
• Group Most Common Sequences
– Find the tree that changes one sequence into all of the
others by the least number of steps
– Sequences with the smallest number of differences have the
shortest distance between them and are called “related taxa”
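The second idea, grouping the sequences with the fewest differences first, can be sketched as a distance-based clustering. This is a minimal illustration and not the specific method behind any tool mentioned in these slides: it uses average-linkage (UPGMA-style) merging, counts each differing alignment column (gaps included) as one difference, and returns only the branching order. The function names and the reuse of the aligned toy sequences SeqA to SeqD are assumptions for the example.

```python
from itertools import combinations

# Aligned toy sequences (same length); "-" marks a gap.
aligned = {
    "SeqA": "CRE-A-TE-",
    "SeqB": "CRE-A-SE-",
    "SeqC": "GRE-A-SER",
    "SeqD": "-RELAPSE-",
}


def column_distance(a: str, b: str) -> int:
    """Number of alignment columns in which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_topology(seqs: dict[str, str]):
    """Repeatedly merge the two closest clusters; return nested tuples."""
    # Each cluster is (subtree, list_of_leaf_names).
    clusters = [(name, [name]) for name in seqs]

    def average_distance(c1, c2):
        pairs = [(p, q) for p in c1[1] for q in c2[1]]
        return sum(column_distance(seqs[p], seqs[q]) for p, q in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


print(column_distance(aligned["SeqA"], aligned["SeqB"]))  # 1
print(cluster_topology(aligned))
# ('SeqD', ('SeqC', ('SeqA', 'SeqB')))
```

SeqA and SeqB differ in a single column, so they are joined first; SeqC joins next and SeqD last. The first idea on the slide, finding the tree with the fewest total steps, is a parsimony search and is not what this sketch implements.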
Tree Building Methods
[Figure: sequences A through F grouped step by step into a tree]
Example Evolutionary Trees
Anthropological and Archeological
• Tree Of Life
– http://tolweb.org/tree/phylogeny.html
• Theory Of Human Evolution At The SI
– http://www.mnh.si.edu/anthro/humanorigins/ha/a_tree.html
How To Build A Tree
Sequence Based
• Create Alignment
– http://pir.georgetown.edu/pirwww/search/multaln.html
• Create Tree
– http://www.genebee.msu.su/services/phtree_reduced.html
• Draw Tree
– http://iubio.bio.indiana.edu/treeapp/treeprint-form.html
MSA and Tree Relationship
• “The optimal alignment of several sequences can
be thought of as minimizing the number of
mutational steps in an evolutionary tree for which
the sequences are the leaves” (Mount, 2001)
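[Figure: evolutionary tree descending from CREATE through CREASE and GREASE to the aligned leaves, with branch changes labeled "T to S", "C to G", "+R", "+L +P", and "-G"]
CRE-A-TE- SeqA
CRE-A-SE- SeqB
GRE-A-SER SeqC
-RELAPSE- SeqD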
Summary Review
• Relationships Among Groups Of Genes
– Comparing Sequences
– Building Sequence Families
• Sequence Conservation During Evolution
– Aligning Multiple Sequences
• Evolutionary Diagrams
– Tracing The Descent From Common Ancestors
– Growing Phylogenetic Trees
References
• Bioinformatics: Sequence and Genome
Analysis. David W. Mount. CSHL Press, 2001.
• Bioinformatics: A Practical Guide to the
Analysis of Genes and Proteins. Andreas D.
Baxevanis and B.F. Francis Ouellette. Wiley
Interscience, 2001.
• Bioinformatics: Sequence, Structure, and
Databanks. Des Higgins and Willie Taylor.
Oxford University Press, 2000.