#4 - Sequence Alignment 8/27/07 Lecture 4 #4_Aug27

BCB 444/544
Lecture 4
Finish: Lecture 2- Biological Databases
Sequence Alignment
#4_Aug27

Required Reading (before lecture)
Mon Aug 27 - for Lecture #4
Pairwise Sequence Alignment
• Chp 3 - pp 31-41, Xiong Textbook
Wed Aug 29 - for Lecture #5
Dynamic Programming
• Eddy: What is Dynamic Programming?
Thurs Aug 30 - Lab #2:
Databases, ISU Resources, & Pairwise Sequence Alignment
Fri Aug 31 - for Lecture #6
Scoring Matrices and Alignment Statistics
• Chp 3 - pp 41-49
BCB 444/544 F07 ISU
Dobbs #4 - Sequence Alignment
HW#2:

Back to: Chp 2- Biological Databases
• Xiong: Chp 2
Introduction to Biological Databases
• What is a Database?
• Types of Databases
• Biological Databases
• Pitfalls of Biological Databases
• Information Retrieval from Biological Databases

What is a Database?
Duh!!
OK: we'll skip that!

Types of Databases
3 Major types of electronic databases:
1. Flat files - simple text files
• no organization to facilitate retrieval
2. Relational - data organized as tables ("relations")
• shared features among tables allow rapid search
3. Object-oriented - data organized as "objects"
• objects associated hierarchically
BCB 444/544 Fall 07 Dobbs
Biological Databases
Currently - all 3 types, but MANY flat files
What are goals of biological databases?
1. Information retrieval
2. Knowledge discovery
Important issue: Interconnectivity

Types of Biological Databases
1- Primary
• "simple" archives of sequences, structures, images, etc.
• raw data, minimal annotations, not always well curated!
2- Secondary
• enhanced with more complete annotation of sequences, structures, images, etc.
• usually curated!
3- Specialized
• focused on a particular research interest or organism
• usually - not always - highly curated
Examples of Biological Databases
1- Primary
• DNA sequences
  • GenBank - US
  • European Molecular Biology Lab - EMBL
  • DNA Data Bank of Japan - DDBJ
• Structures (Protein, DNA, RNA)
  • PDB - Protein Data Bank
  • NDB - Nucleic Acid Data Bank
2- Secondary
• Protein sequences
  • Swiss-Prot, TrEMBL, PIR
  • these recently combined into UniProt
3- Specialized
• Species-specific (or "taxonomic" specific)
  • FlyBase, WormBase, AceDB, PlantDB
• Molecule-specific, disease-specific
Pitfalls of Biological Databases
• Errors! &
• Lack of documentation re: quality or reliability of data
• Limited mechanisms for "data checking" or preventing propagation of errors (esp. annotation errors!!)
• Redundancy
• Inconsistency
• Incompatibility (format, terminology, data types, etc.)

Information Retrieval from Biological Databases
2 most popular retrieval systems:
• ENTREZ - NCBI
  • will use a LOT - was introduced in Lab 1
• SRS - Sequence Retrieval System - EBI
  • will use less, similar to ENTREZ
Both:
• Provide access to multiple databases
• Allow complex queries
Web Resources: Bioinformatics & Computational Biology
• NCBI - National Center for Biotechnology Information
• ISCB - International Society for Computational Biology
• JCB - Jena Center for Bioinformatics
• Pitt - OBRC Online Bioinformatics Resources Collection
• UBC - Bioinformatics Links Directory
• UWash - BioMolecules
• ISU - Bioinformatics Resources - Andrea Dinkelman
• ISU - YABI = "Yet Another Bioinformatics Index" (from BCB Lab at ISU)
• Wikipedia: Bioinformatics

ISU Resources & Experts
ISU Research Centers & Graduate Training Programs:
• LH Baker Center - Bioinformatics & Biological Statistics
• BCB - Bioinformatics & Computational Biology
• BCB Lab - (Student-Led Consulting & Resources)
• CIAG - Center for Integrated Animal Genomics
• CCILD - Computational Intelligence, Learning & Discovery
• IGERT Training Grant - Computational Molecular Biology
ISU Facilities:
• Biotechnology - Instrumentation Facilities
• PSI - Plant Sciences Institute
• PSI Centers
SUMMARY: #2- Biological Databases
BEWARE!

Chp 3- Sequence Alignment
SECTION II
SEQUENCE ALIGNMENT
Xiong: Chp 3
Pairwise Sequence Alignment
• Motivation for Sequence Alignment
• Evolutionary Basis
• Sequence Homology versus Sequence Similarity
• Sequence Similarity versus Sequence Identity
• Methods
• Scoring Matrices
• Statistical Significance of Sequence Alignment
Adapted from Brown and Caragea, 2007, with some slides from:
Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.
Why Align Sequences?
"Sequence comparison lies at the heart of bioinformatics analysis." (Jin Xiong)
Sequence comparison is important for drawing functional & evolutionary inferences re: new genes/proteins
Pairwise sequence alignment is fundamental; it is used to:
• Search for common patterns of characters
• Establish pair-wise correspondence between related sequences
Pairwise sequence alignment is the basis for:
• Database searching (e.g., BLAST)
• Multiple sequence alignment (MSA)

Databases contain many sequences with known functions & many sequences with unknown functions.
Genes (or proteins) with similar sequences may have similar structures and/or functions.
Sequence alignment can provide important clues to the function of a novel gene or protein
Examples of Bioinformatics Tasks that Rely on Sequence Alignment
• Genomic sequencing (> 500 complete genomes sequenced!)
• Assembling multiple sequence reads into contigs, scaffolds
• Aligning sequences with chromosomes
• Finding genes and regulatory regions
• Identifying gene products
• Identifying function of gene products
• Studying the structural organization of genomes
• Comparative genomics

Evolutionary Basis
• DNA, RNA and proteins are "molecular fossils"
  • they encode the history of millions of years of evolution
• During evolution, molecular sequences accumulate random changes (mutations/variants)
  • some of which provide a selective advantage or disadvantage, and some of which are neutral
• Sequences that are structurally and/or functionally important tend to be conserved
  • (e.g., chromosomal telomeric sequences; enzyme active sites)
• Significant sequence conservation allows inference of evolutionary relatedness
Homology
Homology has a very specific meaning in evolutionary & computational biology - & the term is often used incorrectly
For us:
Homology = similarity due to descent from a common evolutionary ancestor
But,
HOMOLOGY ≠ SIMILARITY
When 2 sequences share a sufficiently high degree of sequence similarity (or identity), we may infer that they are homologous
We can infer homology from similarity (can't prove it!)

Orthologs vs Paralogs
2 types of homologous sequences:
• Orthologs - "same genes" in different species; result of common ancestry; corresponding proteins have "same" functions
  (e.g., human α-globin & mouse α-globin)
• Paralogs - "similar genes" within a species; result of gene duplication events; corresponding proteins may (or may not) have similar functions
  (e.g., human α-globin & human β-globin)
[Figure: gene tree with nodes A, B, C, C'; branches labeled Speciation and Duplication]
A is the parent gene
Speciation leads to B & C
Duplication leads to C'
B and C are Orthologous
C and C' are Paralogous

Sequence Homology vs Similarity
• Homologous sequences - sequences that share a common evolutionary ancestry
• Similar sequences - sequences that have a high percentage of aligned residues with similar physicochemical properties (e.g., size, hydrophobicity, charge)
IMPORTANT:
• Sequence homology:
  • An inference about a common ancestral relationship, drawn when two sequences share a high enough degree of sequence similarity
  • Homology is qualitative
• Sequence similarity:
  • The direct result of observation from a sequence alignment
  • Similarity is quantitative; can be described using percentages

Sequence Similarity vs Identity
For nucleotide sequences (DNA & RNA), sequence similarity and identity have the "same" meaning:
• Two DNA sequences can share a high degree of sequence identity (or similarity) -- means the same thing
• Drena's opinion: Always use "identity" when making quantitative comparisons re: DNA or RNA sequences (to avoid confusion!)
For protein sequences, sequence similarity and identity have different meanings:
• Identity = % of exact matches between two aligned sequences
• Similarity = % of aligned residues that share similar characteristics (e.g., physicochemical characteristics, structural propensities, evolutionary profiles)
• Drena's opinion: Always use "identity" when making quantitative comparisons re: protein sequences (to avoid confusion!)
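To make the identity/similarity distinction for proteins concrete, here is a minimal Python sketch. The function name, the residue groupings in `SIMILAR_GROUPS`, and the choice of the full alignment length as the denominator are all assumptions made for illustration; real tools define "similar" through a scoring matrix and may use a different denominator.

```python
# Hypothetical grouping of amino acids by physicochemical class.
# This grouping is an assumption for illustration, not a standard.
SIMILAR_GROUPS = [
    set("ILVM"),  # aliphatic / hydrophobic
    set("FYW"),   # aromatic
    set("KRH"),   # positively charged
    set("DE"),    # negatively charged
    set("ST"),    # small hydroxyl
    set("NQ"),    # amide
]


def percent_identity_and_similarity(row1, row2, gap="-"):
    """Return (% identity, % similarity) for two rows of an alignment.

    Both rows must already be aligned (same length, gaps as '-').
    Percentages are taken over the full alignment length (an assumption).
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    identical = 0
    similar = 0
    for a, b in zip(row1, row2):
        if a == gap or b == gap:
            continue  # a gap column is neither identical nor similar
        if a == b:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            similar += 1
    length = len(row1)
    return 100.0 * identical / length, 100.0 * similar / length


identity, similarity = percent_identity_and_similarity("RKVA-GMA", "RKIAVAMA")
print(identity, similarity)
```

For the protein pair used elsewhere in this lecture (RKVA-GMA / RKIAVAMA), the V/I column counts toward similarity but not identity under the assumed grouping, so similarity comes out higher than identity.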
What is Sequence Alignment?
Given 2 sequences of letters, and a scoring scheme for evaluating matching letters, find an optimal pairing of letters in one sequence to letters of other sequence.
Align:
1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
2: THIS IS A SHORT SENTENCE.

1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
2: THIS IS A ######SHORT## SENTENCE##############.
OR
1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
2: THIS IS A ##SHORT###SENT#EN###CE##############.
Is one of these alignments "optimal"?
Which is better?

Goal of Sequence Alignment
Find the best pairing of 2 sequences, such that there is maximum correspondence between residues
• DNA
4 letter alphabet (+ gap)
TTGACAC
TTTACAC
• Proteins
20 letter alphabet (+ gap)
RKVA-GMA
RKIAVAMA
Statement of Problem
Given:
• 2 sequences
• Scoring system for evaluating match (or mismatch) of two characters
• Penalty function for gaps in sequences
Find: Optimal pairing of sequences that
• Retains the order of characters
• Introduces gaps where needed
• Maximizes total score

Types of Sequence Variation
• Sequences can diverge from a common ancestor through various types of mutations:
  • Substitutions   ACGA → AGGA
  • Insertions      ACGA → ACCGA
  • Deletions       ACGA → AGA
• Insertions or deletions ("indels") result in gaps in alignments
• Substitutions result in mismatches
• No change? match
Gaps
Indels of various sizes can occur in one sequence relative to the other
e.g., corresponding to a shortening of the polypeptide chain in a protein

Avoiding Random Alignments with a Scoring Function
• Introducing too many gaps generates nonsense alignments:
s--e-----qu---en--ce
sometimesquipsentice
• Need to distinguish between alignments that occur due to homology and those that occur by chance
• Define a scoring function that accounts for mismatches and gaps
Scoring Function (F):
Match:     + m    e.g. +1
Mismatch:  - s    e.g. -1
Gap:       - d    e.g. -2
F = m(#matches) - s(#mismatches) - d(#gaps)
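The scoring function F can be sketched in a few lines of Python. This is only an illustration: the function name `alignment_score` is hypothetical, and m, s and d are treated as positive magnitudes (defaults 1, 1, 2, matching the example values on the slide) so that mismatches and gaps subtract from the score.

```python
def alignment_score(row1, row2, m=1, s=1, d=2, gap="-"):
    """Score two aligned rows: +m per match, -s per mismatch, -d per gap.

    m, s and d are positive magnitudes (an assumption of this sketch).
    Each gapped column is charged d, independent of gap length.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(row1, row2):
        if a == gap or b == gap:
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return m * matches - s * mismatches - d * gaps


print(alignment_score("TTGACAC", "TTTACAC"))    # 6 matches, 1 mismatch
print(alignment_score("RKVA-GMA", "RKIAVAMA"))  # 5 matches, 2 mismatches, 1 gap
print(alignment_score("s--e-----qu---en--ce", "sometimesquipsentice"))
```

With these example values the DNA pair scores 5 and the protein pair scores 1, while the heavily gapped "nonsense" alignment scores -16 (8 matches, 12 gaps), which is the behavior the gap penalty is meant to produce.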
Not All Mismatches are the Same
• Some amino acids are more "exchangeable" than others; e.g., Ser and Thr are more similar than Trp and Ala
• A substitution matrix can be used to introduce "mismatch costs" for handling different types of substitutions
• Mismatch costs are not usually used in aligning DNA or RNA sequences, because no substitution is "better" than any other (in general)

Substitution Matrix
s(a,b) corresponds to score of aligning character a with character b
Match scores are often calculated based on frequency of mutations in very similar sequences (more details later)
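As a rough illustration of how s(a,b) replaces a single mismatch cost, the sketch below looks up pair scores in a small table. All numbers in `TOY_SCORES`, the default values, and the function names are invented for this example; they are not taken from PAM, BLOSUM, or any other published matrix.

```python
# Toy substitution scores, invented for illustration only.
# Keys are unordered residue pairs; the lookup below is symmetric.
TOY_SCORES = {
    ("S", "T"): 1,   # Ser/Thr: chemically similar, mild positive score
    ("A", "W"): -3,  # Ala/Trp: very different, strongly negative score
}
TOY_MATCH = 4      # assumed score for any identical pair
TOY_MISMATCH = -1  # assumed score for pairs not listed above


def s(a, b):
    """Return the toy score for aligning residue a with residue b."""
    if a == b:
        return TOY_MATCH
    if (a, b) in TOY_SCORES:
        return TOY_SCORES[(a, b)]
    if (b, a) in TOY_SCORES:
        return TOY_SCORES[(b, a)]
    return TOY_MISMATCH


def ungapped_score(seq1, seq2):
    """Sum s(a, b) over the columns of two equal-length, ungapped sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(s(a, b) for a, b in zip(seq1, seq2))


print(s("S", "T"), s("W", "A"))
print(ungapped_score("AST", "ATT"))
```

The point of the table is that a Ser/Thr column is penalized less than a Trp/Ala column, which a single mismatch cost cannot express.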
Methods
• Global and Local Alignment
• Alignment Algorithms
  • Dot Matrix Method
  • Dynamic Programming Method
    • Gap penalties
    • DP for Global Alignment
    • DP for Local Alignment
• Scoring Matrices
  • Amino acid scoring matrices
    • PAM
    • BLOSUM
    • Comparisons between PAM & BLOSUM
• Statistical Significance of Sequence Alignment

Global vs Local Alignment
Global alignment
• Finds best possible alignment across entire length of 2 sequences
• Aligned sequences assumed to be generally similar over entire length
Local alignment
• Finds local regions with highest similarity between 2 sequences
• Aligns these without regard for rest of sequence
• Sequences are not assumed to be similar over entire length
Global vs Local Alignment - example
S = CTGTCGCTGCACG
T = TGCCGTG
Global alignment
CTGTCG-CTGCACG
-TGC-CG-TG----
OR
CTGTCG-CTGCACG
-TGCCG--TG----
Local alignment
CTGTCGCTGCACG
--------TGC-CGTG
Which is better?

Global vs Local Alignment
When use which?
Both are important but it is critical to use right method for a given task!
Global alignment:
• Good for: aligning closely related sequences of approx. same length
• Not good for: divergent sequences or sequences with different lengths
Local Alignment:
• Good for: searching for conserved patterns (domains or motifs) in DNA or protein sequences
• Not good for: generating alignment of closely related sequences
Global and local alignments are fundamentally similar and differ only in optimization strategy used in aligning similar residues
Alignment Algorithms
3 major methods for alignment:
1. Dot matrix analysis
2. Dynamic Programming
3. Word or k-tuple methods (later, in Chp 4)

Dot Matrix Method (Dot Plots)
• Place 1 sequence along top row of matrix
• Place 2nd sequence along left column of matrix
• Plot a dot each time there is a match between an element of row sequence and an element of column sequence
• For proteins, usually use more sophisticated scoring schemes than "identical match"
• Diagonal lines indicate areas of match
• Reverse diagonals (perpendicular to diagonal) indicate inversions
[Figure: dot plot with A C G C G along the top row and A C A C G down the left column]
Exploring Dot Plots
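The dot matrix procedure can be sketched in Python as follows. This is a minimal illustration using exact character matches only (no window or threshold filtering, which practical dot plot tools usually add); the function names `dot_plot` and `render` are hypothetical.

```python
def dot_plot(top, left):
    """Return a matrix of booleans: True where left[i] == top[j].

    Rows follow the sequence placed down the left column,
    columns follow the sequence placed along the top row.
    """
    return [[row_char == col_char for col_char in top] for row_char in left]


def render(top, left):
    """Draw the dot plot as text, using '*' for a match and '.' otherwise."""
    matrix = dot_plot(top, left)
    lines = ["  " + " ".join(top)]
    for row_char, row in zip(left, matrix):
        cells = " ".join("*" if hit else "." for hit in row)
        lines.append(row_char + " " + cells)
    return "\n".join(lines)


# Sequences from the slide: ACGCG along the top, ACACG down the left.
print(render("ACGCG", "ACACG"))
```

Runs of `*` along a diagonal correspond to stretches where the two sequences match position by position; here the plot contains 8 dots in total.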