Lecture 4 #4_Aug27 Sequence Alignment BCB 444/544

BCB 444/544
Finish:
Lecture 2- Biological Databases
Lecture 4
Sequence Alignment
#4_Aug27
BCB 444/544 F07 ISU
Dobbs #4 - Sequence Alignment
8/27/07
Required Reading
(before lecture)
Mon Aug 27 - for Lecture #4
Pairwise Sequence Alignment
• Chp 3 - pp 31-41
Xiong Textbook
Wed Aug 29 - for Lecture #5
Dynamic Programming
• Eddy: What is Dynamic Programming?
Thurs Aug 30 - Lab #2:
Databases, ISU Resources, & Pairwise Sequence Alignment
Fri Aug 31 - for Lecture #6
Scoring Matrices and Alignment Statistics
• Chp 3 - pp 41-49
HW#2:
Back to:
Chp 2- Biological Databases
• Xiong: Chp 2
Introduction to Biological Databases
• What is a Database?
• Types of Databases
• Biological Databases
• Pitfalls of Biological Databases
• Information Retrieval from Biological Databases
What is a Database?
Duh!!
OK: we'll skip that!
Types of Databases
3 Major types of electronic databases:
1. Flat files - simple text files
• no organization to facilitate retrieval
2. Relational - data organized as tables ("relations")
• shared features among tables allow rapid search
3. Object-oriented - data organized as "objects"
• objects associated hierarchically
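A minimal sketch of the relational idea, using Python's built-in sqlite3 module. The table names, column names, and the three example records are assumptions made up for illustration; they do not describe the schema of any real biological database. The shared org_id column is what lets one table be searched through the other.

```python
import sqlite3

# In-memory relational database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE organism (org_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT, org_id INTEGER);
""")
conn.executemany("INSERT INTO organism VALUES (?, ?)",
                 [(1, "Homo sapiens"), (2, "Mus musculus")])
conn.executemany("INSERT INTO gene VALUES (?, ?, ?)",
                 [(10, "HBA1", 1), (11, "HBB", 1), (12, "Hba-a1", 2)])

def genes_for(organism_name):
    """Return gene symbols for one organism by joining on the shared org_id."""
    rows = conn.execute(
        "SELECT gene.symbol FROM gene "
        "JOIN organism ON gene.org_id = organism.org_id "
        "WHERE organism.name = ? ORDER BY gene.symbol",
        (organism_name,),
    )
    return [symbol for (symbol,) in rows]

print(genes_for("Homo sapiens"))  # ['HBA1', 'HBB']
```

A flat file would need a line-by-line scan to answer the same question; here the join on org_id does the lookup.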
Biological Databases
Currently - all 3 types, but MANY flat files
What are goals of biological databases?
1. Information retrieval
2. Knowledge discovery
Important issue:
Interconnectivity
Types of Biological Databases
1- Primary
• "simple" archives of sequences, structures, images, etc.
• raw data, minimal annotations, not always well curated!
2- Secondary
• enhanced with more complete annotation of sequences,
structures, images, etc.
• usually curated!
3- Specialized
• focused on a particular research interest or organism
• usually - not always - highly curated
Examples of Biological Databases
1- Primary
• DNA sequences
• GenBank - US
• European Molecular Biology Lab - EMBL
• DNA Data Bank of Japan - DDBJ
• Structures (Protein, DNA, RNA)
• PDB - Protein Data Bank
• NDB - Nucleic Acid Database
Examples of Biological Databases
2- Secondary
• Protein sequences
• Swiss-Prot, TrEMBL, PIR
• these recently combined into UniProt
3- Specialized
• Species-specific (or "taxonomic" specific)
• Flybase, WormBase, AceDB, PlantDB
• Molecule-specific, disease-specific
Pitfalls of Biological Databases
• Errors!
• Lack of documentation re: quality or reliability of data
• Limited mechanisms for "data checking" or preventing propagation of errors (esp. annotation errors!!)
• Redundancy
• Inconsistency
• Incompatibility (format, terminology, data types, etc.)
Information Retrieval from
Biological Databases
2 most popular retrieval systems:
• ENTREZ - NCBI
• will use a LOT - was introduced in Lab 1
• SRS - Sequence Retrieval System - EBI
• will use less, similar to ENTREZ
Both:
• Provide access to multiple databases
• Allow complex queries
Web Resources:
Bioinformatics & Computational Biology
• NCBI - National Center for Biotechnology Information
• ISCB - International Society for Computational Biology
• JCB - Jena Center for Bioinformatics
• Pitt - OBRC Online Bioinformatics Resources Collection
• UBC - Bioinformatics Links Directory
• UWash - BioMolecules
• ISU - Bioinformatics Resources - Andrea Dinkelman
• ISU - YABI = "Yet Another Bioinformatics Index" (from BCB Lab at ISU)
• Wikipedia: Bioinformatics
ISU Resources & Experts
ISU Research Centers & Graduate Training Programs:
• LH Baker Center - Bioinformatics & Biological Statistics
• BCB - Bioinformatics & Computational Biology
• BCB Lab - (Student-Led Consulting & Resources)
• CIAG - Center for Integrated Animal Genomics
• CCILD - Computational Intelligence, Learning & Discovery
• IGERT Training Grant - Computational Molecular Biology
ISU Facilities:
• Biotechnology - Instrumentation Facilities
• PSI - Plant Sciences Institute
• PSI Centers
SUMMARY:
#2- Biological Databases
BEWARE!
Chp 3- Sequence Alignment
SECTION II
SEQUENCE ALIGNMENT
Xiong: Chp 3
Pairwise Sequence Alignment
• Evolutionary Basis
• Sequence Homology versus Sequence Similarity
• Sequence Similarity versus Sequence Identity
• Methods
• Scoring Matrices
• Statistical Significance of Sequence Alignment
Adapted from Brown and Caragea, 2007, with some slides from:
Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.
Motivation for Sequence Alignment
"Sequence comparison lies at the heart of bioinformatics
analysis."
Jin Xiong
Sequence comparison is important for drawing functional
& evolutionary inferences re: new genes/proteins
Pairwise sequence alignment is fundamental; it is used to:
• Search for common patterns of characters
• Establish pair-wise correspondence between related sequences
Pairwise sequence alignment is basis for:
• Database searching (e.g., BLAST)
• Multiple sequence alignment (MSA)
Why Align Sequences?
Databases contain many sequences with known functions
& many sequences with unknown functions.
Genes (or proteins) with similar sequences may have
similar structures and/or functions.
Sequence alignment can provide important clues to the
function of a novel gene or protein
Examples of Bioinformatics Tasks that
Rely on Sequence Alignment
• Genomic sequencing
(> 500 complete genomes sequenced!)
• Assembling multiple sequence reads
into contigs, scaffolds
• Aligning sequences with chromosomes
• Finding genes and regulatory regions
• Identifying gene products
• Identifying function of gene products
• Studying the structural organization of genomes
• Comparative genomics
Evolutionary Basis
• DNA, RNA and proteins are "molecular fossils"
• they encode the history of millions of years of evolution
• During evolution, molecular sequences accumulate
random changes (mutations/variants)
• some of which provide a selective advantage or disadvantage,
and some of which are neutral
• Sequences that are structurally and/or functionally
important tend to be conserved
• (e.g., chromosomal telomeric sequences; enzyme active sites)
• Significant sequence conservation allows inference of
evolutionary relatedness
Homology
Homology has a very specific meaning in evolutionary & computational
biology - & the term is often used incorrectly
For us:
Homology = similarity due to descent from a common evolutionary
ancestor
But,
HOMOLOGY ≠ SIMILARITY
When 2 sequences share a sufficiently high degree of sequence
similarity (or identity), we may infer that they are homologous
We can infer homology from similarity (can't prove it!)
Orthologs vs Paralogs
2 types of homologous sequences:
• Orthologs - "same genes" in different species;
result of common ancestry; corresponding proteins
have "same" functions
(e.g., human α-globin & mouse α-globin)
• Paralogs - "similar genes" within a species; result of
gene duplication events; corresponding proteins may
(or may not) have similar functions
(e.g., human α-globin & human β-globin)
[Figure: gene tree. A is the parent gene; speciation leads to B & C; duplication leads to C'.]
B and C are orthologous
C and C' are paralogous
Sequence Homology vs Similarity
• Homologous sequences - sequences that share a common
evolutionary ancestry
• Similar sequences - sequences that have a high percentage of
aligned residues with similar physicochemical properties
(e.g., size, hydrophobicity, charge)
IMPORTANT:
• Sequence homology:
• An inference about a common ancestral relationship, drawn when
two sequences share a high enough degree of sequence similarity
• Homology is qualitative
• Sequence similarity:
• The direct result of observation from a sequence alignment
• Similarity is quantitative; can be described using percentages
Sequence Similarity vs Identity
For nucleotide sequences (DNA & RNA), sequence
similarity and identity have the "same" meaning:
• Two DNA sequences can share a high degree of sequence
identity (or similarity) -- means the same thing
• Drena's opinion: Always use "identity" when making quantitative
comparisons re: DNA or RNA sequences (to avoid confusion!)
For protein sequences, sequence similarity and identity
have different meanings:
• Identity = % of exact matches between two aligned sequences
• Similarity = % of aligned residues that share similar
characteristics (e.g., physicochemical characteristics,
structural propensities, evolutionary profiles)
• Drena's opinion: Always use "identity" when making quantitative
comparisons re: protein sequences (to avoid confusion!)
What is Sequence Alignment?
Given 2 sequences of letters, and a scoring scheme for
evaluating matching letters, find an optimal pairing of
letters in one sequence to letters of other sequence.
Align:
1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
2: THIS IS A SHORT SENTENCE.
1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
2: THIS IS A ######SHORT## SENTENCE##############.
OR
1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
2: THIS IS A ##SHORT###SENT#EN###CE##############.
Is one of these alignments "optimal"?
Which is better?
Goal of Sequence Alignment
Find the best pairing of 2 sequences, such that there
is maximum correspondence between residues
• DNA
4 letter alphabet (+ gap)
TTGACAC
TTTACAC
• Proteins
20 letter alphabet (+ gap)
RKVA-GMA
RKIAVAMA
Statement of Problem
Given:
• 2 sequences
• Scoring system for evaluating match (or
mismatch) of two characters
• Penalty function for gaps in sequences
Find: Optimal pairing of sequences that
• Retains the order of characters
• Introduces gaps where needed
• Maximizes total score
Types of Sequence Variation
• Sequences can diverge from a common ancestor through
various types of mutations:
• Substitutions: ACGA → AGGA
• Insertions: ACGA → ACCGA
• Deletions: ACGA → AGA
• Insertions or deletions ("indels") result in gaps in alignments
• Substitutions result in mismatches
• No change results in a match
Gaps
Indels of various sizes can occur in one sequence relative
to the other
e.g., corresponding to a shortening of the polypeptide
chain in a protein
Avoiding Random Alignments with a
Scoring Function
• Introducing too many gaps generates nonsense alignments:
s--e-----qu---en--ce
sometimesquipsentice
• Need to distinguish between alignments that occur due to
homology and those that occur by chance
• Define a scoring function that accounts for mismatches
and gaps
Scoring Function (F):
Match: + m (e.g., +1)
Mismatch: - s (e.g., -1)
Gap: - d (e.g., -2)
F = m(#matches) - s(#mismatches) - d(#gaps)
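A minimal sketch of this scoring function applied to an existing alignment. It treats s and d as positive penalty magnitudes (match +m, mismatch -s, gap -d), with the example values m = 1, s = 1, d = 2 as defaults. Charging every gap character separately is an assumption; other gap penalty schemes exist.

```python
def alignment_score(aligned_a, aligned_b, m=1, s=1, d=2, gap="-"):
    """Score = m*(#matches) - s*(#mismatches) - d*(#gap columns)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gaps = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return m * matches - s * mismatches - d * gaps

# 6 matches and 1 mismatch.
print(alignment_score("TTGACAC", "TTTACAC"))  # 5

# The gap-heavy "nonsense" alignment: 8 matches but 12 gap columns.
print(alignment_score("s--e-----qu---en--ce", "sometimesquipsentice"))  # -16
```

The second alignment matches all 8 letters of "sequence", yet the gap penalties push its score well below zero, which is the point of penalizing gaps.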
Not All Mismatches are the Same
• Some amino acids are more "exchangeable" than
others; e.g., Ser and Thr are more similar than Trp and Ala
• A substitution matrix can be used to introduce
"mismatch costs" for handling different types of
substitutions
• Mismatch costs are not usually used in aligning
DNA or RNA sequences, because no substitution is
"better" than any other (in general)
Substitution Matrix
s(a,b) corresponds to score of
aligning character a with
character b
Match scores are often calculated
based on frequency of mutations
in very similar sequences
(more details later)
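A minimal sketch of how s(a,b) can be stored and looked up. All numbers below are invented for illustration and are NOT taken from PAM, BLOSUM, or any published matrix; the names TOY_SCORES, MATCH_SCORE, and DEFAULT_MISMATCH are hypothetical.

```python
# Toy substitution scores; values are made up for illustration only.
TOY_SCORES = {
    ("S", "T"): 1,   # Ser/Thr: chemically similar, mild reward
    ("W", "A"): -3,  # Trp/Ala: very different, strong penalty
}
MATCH_SCORE = 4        # assumed score for any identical pair
DEFAULT_MISMATCH = -1  # assumed score for pairs not listed above

def s(a, b):
    """Score of aligning character a with character b (symmetric)."""
    if a == b:
        return MATCH_SCORE
    if (a, b) in TOY_SCORES:
        return TOY_SCORES[(a, b)]
    return TOY_SCORES.get((b, a), DEFAULT_MISMATCH)

def ungapped_score(seq_a, seq_b):
    """Sum s(a, b) over the columns of an ungapped, equal-length pairing."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(s(a, b) for a, b in zip(seq_a, seq_b))

print(s("S", "T"), s("W", "A"))      # 1 -3
print(ungapped_score("SWA", "TAA"))  # 1 + (-3) + 4 = 2
```

Storing each pair once and checking both orders keeps the lookup symmetric, so s(a,b) = s(b,a).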
Methods
• Global and Local Alignment
• Alignment Algorithms
• Dot Matrix Method
• Dynamic Programming Method
• Gap penalties
• DP for Global Alignment
• DP for Local Alignment
• Scoring Matrices
• Amino acid scoring matrices
• PAM
• BLOSUM
• Comparisons between PAM & BLOSUM
• Statistical Significance of Sequence Alignment
Global vs Local Alignment
Global alignment
• Finds best possible alignment across entire length of 2 sequences
• Aligned sequences assumed to be generally similar over entire length
Local alignment
• Finds local regions with highest similarity between 2 sequences
• Aligns these without regard for rest of sequence
• Sequences are not assumed to be similar over entire length
Global vs Local Alignment - example
S = CTGTCGCTGCACG
T = TGCCGTG
Global alignment
CTGTCG-CTGCACG
-TGC-CG-TG----
Local alignment
CTGTCGCTGCACG-
-------TGC-CGTG
CTGTCG-CTGCACG
-TGCCG--TG---
Which is better?
Global vs Local Alignment
When to use which?
Both are important
but it is critical to use right method for a given task!
Global alignment:
• Good for: aligning closely related sequences of approx. same length
• Not good for: divergent sequences or sequences with different
lengths
Local Alignment:
• Good for: searching for conserved patterns (domains or motifs) in
DNA or protein sequences
• Not good for: generating alignment of closely related sequences
Global and local alignments are fundamentally similar and differ only in
optimization strategy used in aligning similar residues
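A minimal sketch of that last point, assuming a simple scoring scheme (match +1, mismatch -1, gap -2 per gap character). One dynamic programming routine yields either score: the global version (Needleman-Wunsch style) penalizes leading gaps and reads the final cell, while the local version (Smith-Waterman style) never lets a cell drop below 0 and reads the best cell anywhere. The function name and defaults are illustrative choices, and only scores are computed, not the alignments themselves.

```python
def best_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal alignment score by dynamic programming.

    local=False: global score over the full length of both sequences.
    local=True:  local score of the best-matching pair of subsequences.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: a prefix aligned against nothing costs one gap per character.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score = max(table[i - 1][j - 1] + pair,  # align the two characters
                        table[i - 1][j] + gap,       # gap in seq_b
                        table[i][j - 1] + gap)       # gap in seq_a
            if local:
                score = max(score, 0)  # local: restart rather than go negative
                best = max(best, score)
            table[i][j] = score
    return best if local else table[-1][-1]

S, T = "CTGTCGCTGCACG", "TGCCGTG"
print("global:", best_alignment_score(S, T))
print("local: ", best_alignment_score(S, T, local=True))
```

With the same scoring scheme, the local score can never be lower than the global score, because the full-length alignment is itself one of the candidates a local search considers.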
Alignment Algorithms
3 major methods for alignment:
1. Dot matrix analysis
2. Dynamic Programming
3. Word or k-tuple methods (later, in Chp 4)
Dot Matrix Method (Dot Plots)
• Place 1 sequence along top row of matrix
• Place 2nd sequence along left column of
matrix
• Plot a dot each time there is a match
between an element of row sequence and
an element of column sequence
• For proteins, usually use more
sophisticated scoring schemes than
"identical match"
• Diagonal lines indicate areas of match
[Figure: dot plot with sequence ACGCG along the top row and sequence ACACG along the left column]
• Reverse diagonals (perpendicular to
diagonal) indicate inversions
Exploring Dot Plots
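A minimal sketch of the dot matrix method for the two short sequences on this slide (ACGCG along the top, ACACG down the left), using the simplest "identical match" rule. Real dot plot tools usually add a sliding window and a threshold to suppress noise; that filtering is omitted here.

```python
def dot_plot(top_seq, left_seq, dot="*", blank="."):
    """Return one string per character of left_seq; dot marks a match."""
    return [
        "".join(dot if row_char == col_char else blank for col_char in top_seq)
        for row_char in left_seq
    ]

def print_dot_plot(top_seq, left_seq):
    """Print the matrix with the sequences as column and row labels."""
    print("  " + " ".join(top_seq))
    for row_char, row in zip(left_seq, dot_plot(top_seq, left_seq)):
        print(row_char + " " + " ".join(row))

print_dot_plot("ACGCG", "ACACG")
#   A C G C G
# A * . . . .
# C . * . * .
# A * . . . .
# C . * . * .
# G . . * . *
```

The run of dots from the top-left corner (A, C) and the run ending at the bottom-right corner (C, G) are the diagonal segments that indicate matching regions.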