#11 - Multiple Sequence Alignment 9/14/07 Multiple Sequence Alignment (MSA)

BCB 444/544
Lecture 11
First: BLAST vs FASTA
Plus some Gene Jargon
Multiple Sequence Alignment (MSA)
#11_Sept14

Required Reading (before lecture)
√ Mon Sept 10 - for Lecture 9/10
  BLAST variations; BLAST vs FASTA, SW
  • Chp 4 - pp 51-62
√ Wed Sept 12 - for Lecture 11 & Lab 4
  Multiple Sequence Alignment (MSA)
  • Chp 5 - pp 63-74
Fri Sept 14 - for Lecture 12
  Position Specific Scoring Matrices & Profiles
  • Chp 6 - pp 75-78 (but not HMMs)
• Good Additional Resource re: Sequence Alignment?
  • Wikipedia: http://en.wikipedia.org/wiki/Sequence_alignment

BCB 444/544 F07 ISU Dobbs #11 - Multiple Sequence Alignment
9/14/07

Assignments & Announcements - #1
Revised Grading Policy has been sent via email. Please review!
√ Mon Sept 10 - Lab 3 Exercise due 5 PM: to: terrible@iastate.edu
? Thu Sept 13 - Graded Labs 2 & 3 will be returned at beginning of Lab 4
Fri Sept 14 - HW#2 due by 5 PM (106 MBB)
Study Guide for Exam 1 will be posted by 5 PM

Review: Gene Jargon #1 (for HW2, 1c)
Exons = "protein-encoding" (or "kept" parts) of eukaryotic genes
vs
Introns = "intervening sequences"
= segments of eukaryotic genes that "interrupt" exons
• Introns are transcribed into pre-RNA
• but are later removed by RNA processing
• & do not appear in mature mRNA
• so are not translated into protein

Assignments & Announcements - #2
Mon Sept 17 - Answers to HW#2 will be posted by 5 PM
Thu Sept 20 - Lab = Optional Review Session for Exam
Fri Sept 21 - Exam 1 - Will cover:
• Lectures 2-12 (thru Mon Sept 17)
• Labs 1-4
• HW2
• All assigned reading:
  Chps 2-6 (but not HMMs)
  Eddy: What is Dynamic Programming

Chp 4 - Database Similarity Searching
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 4 Database Similarity Searching
• √ Unique Requirements of Database Searching
• √ Heuristic Database Searching
• √ Basic Local Alignment Search Tool (BLAST)
• FASTA
• Comparison of FASTA and BLAST
• Database Searching with Smith-Waterman Method

Why search a database?
• Given a newly discovered gene,
  • Does it occur in other species?
  • Is its function known in another species?
• Given a newly sequenced genome, which regions align with genomes of other organisms?
  • Identification of potential genes
  • Identification of other functional parts of chromosomes
• Find members of a multigene family

FASTA and BLAST
• Both FASTA, BLAST are based on heuristics
• Tradeoff: Sensitivity vs Speed
  • DP is slower, but more sensitive
• FASTA
  • user defines value for k = word length
  • Slower, but more sensitive than BLAST at lower values of k
    (preferred for searches involving a very short query sequence)
• BLAST family
  • Family of different algorithms optimized for particular types of queries, such as searching for distantly related sequence matches
  • BLAST was developed to provide a faster alternative to FASTA without sacrificing much accuracy

BLAST algorithms can generate both "global" and "local" alignments
[Figure: global alignment vs. local alignment]

BLAST - a Family of Programs: Different BLAST "flavors"
• BLASTP - protein sequence query against protein DB
• BLASTN - DNA/RNA seq query against DNA DB (GenBank)
• BLASTX - 6-frame translated DNA seq query against protein DB
• TBLASTN - protein query against 6-frame DNA translation
• TBLASTX - 6-frame DNA query to 6-frame DNA translation
• PSI-BLAST - protein "profile" query against protein DB
• PHI-BLAST - protein pattern against protein DB
• Newest: MEGA-BLAST - optimized for highly similar sequences
• Which tool should you use?
  http://www.ncbi.nlm.nih.gov/blast/producttable.shtml

Detailed Steps in BLAST algorithm
1. Remove low-complexity regions (LCRs)
2. Make a list (dictionary): all words of length 3aa or 11 nt
3. Augment list to include similar words
4. Store list in a search tree (data structure)
5. Scan database for occurrences of words in search tree
6. Connect nearby occurrences
7. Extend matches (words) in both directions
8. Prune list of matches using a score threshold
9. Evaluate significance of each remaining match
10. Perform Smith-Waterman to get alignment

1: Filter low-complexity regions (LCRs)
(This slide has been changed!)
• Low complexity sequences can yield false positives.
• Screen them out of your query sequences! When appropriate!
• Low complexity regions, transmembrane regions and coiled-coil regions often display significant similarity without homology.
K = computational complexity; varies from 0 (very low complexity) to 1 (high complexity)
$K = \frac{1}{L} \log_N \left( \frac{L!}{\prod_i n_i!} \right)$
  L = window length (usually 12)
  N = alphabet size (4 or 20)
  n_i = frequency of ith letter in the window
e.g., for GGGG:
  L! = 4! = 4x3x2x1 = 24
  nG = 4, nT = nA = nC = 0
  Π ni! = 4! x 0! x 0! x 0! = 24
  K = 1/4 log4 (24/24) = 0
For CGTA: K = 1/4 log4 (24/1) = 0.57

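
The complexity measure K from the low-complexity filtering slide can be checked with a few lines of Python. This is a minimal sketch, not the filter BLAST itself runs; the function name `complexity` and its arguments are illustrative assumptions.

```python
from collections import Counter
from math import factorial, log, prod

def complexity(window, alphabet_size):
    """K = (1/L) * log_N(L! / prod(n_i!)) for one sequence window.

    L is the window length, N the alphabet size (4 for DNA, 20 for
    protein) and n_i the count of each letter in the window.
    """
    L = len(window)
    denominator = prod(factorial(n) for n in Counter(window).values())
    # L! / prod(n_i!) is a multinomial coefficient, so integer division is exact.
    return log(factorial(L) // denominator, alphabet_size) / L

print(round(complexity("GGGG", 4), 2))  # homopolymer window from the slide
print(round(complexity("CGTA", 4), 2))  # all-different window from the slide
```

The two calls correspond to the GGGG and CGTA examples on the slide; letters that do not occur contribute 0! = 1 and can be left out of the product.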
2: List all words in query
YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ
  YGG
  GGF
  GFM
  FMT
  MTS
  TSE
  SEK
  …

3: Augment word list
YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ
Each query word (YGG, GGF, GFM, FMT, MTS, TSE, SEK, …) could be compared with every 3-letter word (AAA, AAB, AAC, …, YYY):
20^3 = 8000 possible matches

3: Augment word list
BLOSUM62 scores
  G G F
  A A A
  0 + 0 + -2 = -2     Non-match

  G G F
  G G Y
  6 + 6 + 3 = 15      Match
A user-specified threshold, T, determines which 3-letter words are considered matches and non-matches

3: Augment word list
YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ
  YGG
  GGF  ->  GGI, GGL, GGM, GGF, GGW, GGY, …
  GFM
  FMT
  MTS
  TSE
  SEK
  …

3: Augment word list
Observation:
Selecting only words with score > T greatly reduces number of possible matches;
otherwise, 20^3 for 3-letter words from amino acid sequences!

3: Augment word list
Example
Find all words that match EAM with a score greater than or equal to 11

BLOSUM62
     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A    4  -1  -2  -2   0  -1  -1   0  -2  -1  -1  -1  -1  -2  -1   1   0  -3  -2   0
R   -1   5   0  -2  -3   1   0  -2   0  -3  -2   2  -1  -3  -2  -1  -1  -3  -2  -3
N   -2   0   6   1  -3   0   0   0   1  -3  -3   0  -2  -3  -2   1   0  -4  -2  -3
D   -2  -2   1   6  -3   0   2  -1  -1  -3  -4  -1  -3  -3  -1   0  -1  -4  -3  -3
C    0  -3  -3  -3   9  -3  -4  -3  -3  -1  -1  -3  -1  -2  -3  -1  -1  -2  -2  -1
Q   -1   1   0   0  -3   5   2  -2   0  -3  -2   1   0  -3  -1   0  -1  -2  -1  -2
E   -1   0   0   2  -4   2   5  -2   0  -3  -3   1  -2  -3  -1   0  -1  -3  -2  -2
G    0  -2   0  -1  -3  -2  -2   6  -2  -4  -4  -2  -3  -3  -2   0  -2  -2  -3  -3
H   -2   0   1  -1  -3   0   0  -2   8  -3  -3  -1  -2  -1  -2  -1  -2  -2   2  -3
I   -1  -3  -3  -3  -1  -3  -3  -4  -3   4   2  -3   1   0  -3  -2  -1  -3  -1   3
L   -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4  -2   2   0  -3  -2  -1  -2  -1   1
K   -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5  -1  -3  -1   0  -1  -3  -2  -2
M   -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5   0  -2  -1  -1  -1  -1   1
F   -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6  -4  -2  -2   1   3  -1
P   -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7  -1  -1  -4  -3  -2
S    1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4   1  -3  -2  -2
T    0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5  -2  -2   0
W   -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11   2  -3
Y   -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7  -1
V    0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4

EAM   5 + 4 + 5 = 14
DAM   2 + 4 + 5 = 11
QAM   2 + 4 + 5 = 11
ESM   5 + 1 + 5 = 11
EAL   5 + 4 + 2 = 11

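
Steps 2 and 3 (list the query words, then augment the list with similar words) can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `query_words` and `neighborhood` are hypothetical, the candidate set is enumerated by brute force rather than with BLAST's actual method, and only the three BLOSUM62 rows needed for the EAM example are included.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # BLOSUM62 column order

# Only the BLOSUM62 rows for E, A and M (enough for the query word EAM).
BLOSUM62_ROWS = {
    "E": [-1, 0, 0, 2, -4, 2, 5, -2, 0, -3, -3, 1, -2, -3, -1, 0, -1, -3, -2, -2],
    "A": [4, -1, -2, -2, 0, -1, -1, 0, -2, -1, -1, -1, -1, -2, -1, 1, 0, -3, -2, 0],
    "M": [-1, -1, -2, -3, -1, 0, -2, -3, -2, 1, 2, -1, 5, 0, -2, -1, -1, -1, -1, 1],
}

def query_words(query, w=3):
    """Step 2: every overlapping word of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]

def neighborhood(word, threshold):
    """Step 3: all words whose score against `word` is >= threshold."""
    hits = {}
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(
            BLOSUM62_ROWS[q][AMINO_ACIDS.index(c)]
            for q, c in zip(word, candidate)
        )
        if score >= threshold:
            hits["".join(candidate)] = score
    return hits

print(query_words("YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ")[:7])
print(neighborhood("EAM", 11))
```

With a threshold of 11 the sketch returns the five words scored in the example (EAM, DAM, QAM, ESM, EAL). Lowering the threshold to 10 yields twelve words, which illustrates how strongly T controls the size of the word list.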
4: Store words in search tree
Augmented list of query words: GGF, GGL, GGM, GGW, GGY
Search tree: G -> G -> {F, L, M, W, Y}
"Does this query contain GGF?"
"Yes, at position 2."

Example
Put this word list into a search tree:
DAM, QAM, EAM, KAM, ECM, EGM, ESM, ETM, EVM, EAI, EAL, EAV
[Figure: search tree built from the word list]

5: Scan the database sequences
[Figure: dot plot of query sequence vs. database sequence; each dot marks an occurrence of a query word]

Example
Scan this "database" for occurrences of your words
MKFLILLFNILCLDAMLAADNHGVGPQGASGVDPITFDINSNQTGPAFLTAVEAIGVKYLQVQHGSNVNIHRLVEGNVKAMENA

6: Connect nearby occurrences
(diagonal matches in Gapped BLAST)
[Figure: dot plot of query sequence vs. database sequence]
Two dots are connected if and only if they are less than A letters apart & are on a diagonal

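
Steps 4 and 5 (store the word list in a search tree, then scan the database) can be sketched with a nested-dictionary trie. This is a minimal illustration; the names `build_trie` and `scan`, and the `"$"` end-of-word marker, are assumptions made for the example and not how BLAST is implemented.

```python
def build_trie(words):
    """Step 4: store words in a search tree of nested dicts."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node["$"] = word  # marks the end of a complete word
    return root

def scan(database, trie, w=3):
    """Step 5: report (position, word) for each database word found in the trie."""
    hits = []
    for i in range(len(database) - w + 1):
        node = trie
        for letter in database[i:i + w]:
            node = node.get(letter)
            if node is None:
                break  # no stored word starts with these letters
        else:
            if "$" in node:
                hits.append((i, node["$"]))
    return hits

WORDS = ["DAM", "QAM", "EAM", "KAM", "ECM", "EGM",
         "ESM", "ETM", "EVM", "EAI", "EAL", "EAV"]
DATABASE = ("MKFLILLFNILCLDAMLAADNHGVGPQGASGVDPITFDINSNQTGPAFLTAV"
            "EAIGVKYLQVQHGSNVNIHRLVEGNVKAMENA")

for position, word in scan(DATABASE, build_trie(WORDS)):
    print(position, word)
```

Positions are 0-based. Each lookup walks at most w levels of the tree, so the scan cost does not grow with the number of words in the list.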
7: Extend matches in both directions
[Figure: scan DB; each word match is extended in both directions along the query and database sequences]

7: Extend matches, calculating score at each step
Query sequence:      L P P Q G L L
Database sequence:   M P P E G L L
BLOSUM62 scores:     2 7 7 2 6 4 4
                         <word>
                     <----------->
word score = 15 (7 + 2 + 6)
HSP SCORE = 32 (HSP = High Scoring Pair)
• Each match is extended to left & right until a negative BLOSUM62 score is encountered
• Extension step typically accounts for > 90% of execution time

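
The extension rule stated on the slide (extend left and right until a negative score is met) can be sketched as below. This follows the simplified rule as worded in the lecture, not the drop-off criterion of production BLAST; the function names and the small `PAIR_SCORES` table (only the BLOSUM62 pairs the example needs) are assumptions for illustration.

```python
# BLOSUM62 scores for just the residue pairs in the slide's example.
PAIR_SCORES = {("L", "M"): 2, ("P", "P"): 7, ("Q", "E"): 2,
               ("G", "G"): 6, ("L", "L"): 4}

def pair_score(a, b):
    """Symmetric lookup in the partial score table."""
    return PAIR_SCORES[(a, b)] if (a, b) in PAIR_SCORES else PAIR_SCORES[(b, a)]

def extend(query, database, q_start, d_start, w=3):
    """Extend an ungapped word match in both directions.

    Returns (score, left_extension, right_extension). Extension stops at a
    sequence end or at the first negative pair score.
    """
    score = sum(pair_score(query[q_start + k], database[d_start + k])
                for k in range(w))
    left = 0
    while q_start - left - 1 >= 0 and d_start - left - 1 >= 0:
        s = pair_score(query[q_start - left - 1], database[d_start - left - 1])
        if s < 0:
            break
        score += s
        left += 1
    right = 0
    while q_start + w + right < len(query) and d_start + w + right < len(database):
        s = pair_score(query[q_start + w + right], database[d_start + w + right])
        if s < 0:
            break
        score += s
        right += 1
    return score, left, right

# Seed word PQG / PEG starts at index 2 in both sequences.
print(extend("LPPQGLL", "MPPEGLL", 2, 2))
```

For the slide's example the seed scores 15 and the extension adds two residues on each side, giving the HSP score of 32.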
8: Prune matches
• Discard all matches that score below defined threshold

9: Evaluate significance
(This slide has been changed!)
• BLAST uses an analytical statistical significance calculation
RECALL:
1. E-value: E = m x n x P
   m = total number of residues in database
   n = number of residues in query sequence
   P = probability that an HSP is result of random chance
   lower E-value, less likely to result from random chance, thus higher significance
2. Bit Score: S' = normalized score, to account for differences in size of database (m) & sequence length (n)
   S' = (λ x S - ln K) / ln 2, where:
   λ = Gumbel distribution constant
   S = raw alignment score
   K = constant associated with scoring matrix
   Note that bit score is linearly related to raw alignment score, so: higher S' means alignment has higher significance
For more details - see text & BLAST tutorial

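
The two significance formulas from the slide translate directly into code. This is a small sketch; the function names are hypothetical, and the numeric values for λ, K, m, n and P below are placeholders chosen for illustration, not values from the lecture.

```python
import math

def bit_score(raw_score, lam, k):
    """S' = (lambda * S - ln K) / ln 2, as given on the slide."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value(m, n, p):
    """E = m * n * P, as given on the slide."""
    return m * n * p

# Placeholder parameters (assumptions for illustration only).
LAMBDA, K = 0.3176, 0.134
print(bit_score(32, LAMBDA, K))
print(e_value(m=1_000_000, n=100, p=1e-9))
```

Because S' is linear in S, each additional point of raw score adds the same amount, λ / ln 2, to the bit score.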
10: Use Smith-Waterman algorithm (DP) to generate alignment
• ONLY significant matches are re-analyzed using Smith-Waterman DP algorithm.
• Alignments reported by BLAST are produced by dynamic programming

BLAST: What is a "Hit"?
• A hit is a w-length word in database that aligns with a word from query sequence with score > T
• BLAST looks for hits instead of exact matches
  • Allows word size to be kept larger for speed, without sacrificing sensitivity
• Typically, w = 3-5 for amino acids, w = 11-12 for DNA
• T is the most critical parameter:
  • ↑T ⇒ ↓ "background" hits (faster)
  • ↓T ⇒ ↑ ability to detect more distant relationships (at cost of increased noise)

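
For reference, the Smith-Waterman recurrence used in step 10 can be sketched in a few lines. This version returns only the best local alignment score, with a simple match/mismatch/linear-gap scheme; those scoring values are assumptions for the example, whereas a real protein search would use a substitution matrix such as BLOSUM62 and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (score only, no traceback)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best

print(smith_waterman_score("TTACGTT", "GGACGGG"))
```

The table has len(a) x len(b) cells, which is why full DP is reserved for the small set of matches that survive the earlier heuristic steps.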
Tips for BLAST Similarity Searches
• If you don't know, use default parameters first
• Try several programs & several parameter settings
• If possible, search on protein sequence level
• Scoring matrices:
  PAM1 / BLOSUM80: if expect/want less divergent proteins
  PAM120 / BLOSUM62: "average" proteins
  PAM250 / BLOSUM45: if need to find more divergent proteins
• Proteins:
  >25-30% identity (and >100aa) -> likely related
  15-25% identity -> twilight zone
  <15% identity -> likely unrelated

Practical Issues
Searching on DNA or protein level?
In general, protein-encoding DNA should be translated!
• DNA yields more random matches:
  • 25% for DNA vs. 5% for proteins
• DNA databases are larger and grow faster
• Selection (generally) acts on protein level
  • Synonymous mutations are usually neutral
  • DNA sequence similarity decays faster

BLAST vs FASTA
• Seeding:
  • BLAST integrates scoring matrix into first phase
  • FASTA requires exact matches (uses hashing)
  • BLAST increases search speed by finding fewer, but better, words during initial screening phase
  • FASTA uses shorter word sizes - so can be more sensitive
• Results:
  • BLAST can return multiple best scoring alignments
  • FASTA returns only one final alignment

BLAST & FASTA References
• FASTA - developed first
  • Pearson & Lipman (1988) Improved Tools for Biological Sequence Comparison. PNAS 85:2444-2448
• BLAST
  • Altschul, Gish, Miller, Myers, Lipman, J. Mol. Biol. 215 (1990)
  • Altschul, Madden, Schaffer, Zhang, Zhang, Miller, Lipman (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-402

BLAST Notes - & DP Alternatives
• BLAST uses heuristics: it may miss some good matches
• But, it's fast: 50 - 100X faster than Smith-Waterman (SW) DP
• Large impact:
  • NCBI's BLAST server handles more than 100,000 queries/day
  • Most used bioinformatics program in the world!
• But - Xiong says: "It has been estimated that for some families of protein sequences BLAST can miss 30% of truly significant matches."
• Increased availability of parallel processing has made DP-based approaches feasible:
  • 2 DP-based web servers: both more sensitive than BLAST
  • Scan Protein Sequence: http://www.ebi.ac.uk/scanps/index.html
    Implements modified SW optimized for parallel processing
  • ParAlign www.paralign.org - parallel SW or heuristics

NCBI - BLAST Programs, Glossary & Tutorials
• BLAST: http://www.ncbi.nlm.nih.gov/BLAST/
• http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html
• http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html

Chp 5 - Multiple Sequence Alignment
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 5 Multiple Sequence Alignment
• Scoring Function
• Exhaustive Algorithms
• Heuristic Algorithms
• Practical Issues

Multiple Sequence Alignments
Credits for slides: Caragea & Brown, 2007; Fernandez-Baca, Heber & Hunter

Overview
1. What is a multiple sequence alignment (MSA)?
2. Where/why do we need MSA?
3. What is a good MSA?
4. Algorithms to compute a MSA

Multiple Sequence Alignment
• Generalize pairwise alignment of sequences to include > 2 homologous sequences
• Analyzing more than 2 sequences gives us much more information:
  • Which amino acids are required? Correlated?
  • Evolutionary/phylogenetic relationships
• Similar to PSI-BLAST idea (not yet covered in lecture): use a set of homologous sequences to provide more "sensitivity"

What is a MSA?

  ATT-GC        AT-TGC        AT-T-GC
  ATTTGC        ATTTGC        ATTT-GC
  ATTTG         ATTTG-        ATTT-G-
  Not a MSA     MSA           Not a MSA

Why?

Definition: MSA
Given a set of sequences, a multiple sequence alignment is an assignment of gap characters, such that
• resulting sequences have same length
• no column contains only gaps

  ATT-GC        AT-TGC        AT-T-GC
  ATTTGC        ATTTGC        ATTT-GC
  ATTTG         ATTTG-        ATTT-G-
  NO            YES           NO

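
The two conditions in the MSA definition are easy to test mechanically. The sketch below applies them to the three candidate arrangements from the slides; the function name `is_msa` is an assumption for illustration.

```python
def is_msa(rows):
    """True if all rows have the same length and no column is all gaps."""
    if len({len(row) for row in rows}) != 1:
        return False  # rows of unequal length
    return all(any(ch != "-" for ch in column) for column in zip(*rows))

print(is_msa(["ATT-GC", "ATTTGC", "ATTTG"]))      # rows differ in length
print(is_msa(["AT-TGC", "ATTTGC", "ATTTG-"]))     # both conditions hold
print(is_msa(["AT-T-GC", "ATTT-GC", "ATTT-G-"]))  # fifth column is all gaps
```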
Displaying MSAs: using CLUSTAL W
RED:      AVFPMILW (small)
BLUE:     DE (acidic, negative chg)
MAGENTA:  RHK (basic, positive chg)
GREEN:    STYHCNGQ (hydroxyl + amine + basic)
*   entirely conserved column
:   all residues have ~ same size AND hydropathy
.   all residues have ~ same size OR hydropathy

What is a Consensus Sequence?
A single sequence that represents most common residue of each column in a MSA
Example:
  FGGHL-GF
  F-GHLPGF
  FGGHP-FG
  FGGHL-GF
Steiner consensus sequence: Given sequences s1,…, sk, find a sequence s* that maximizes Σi S(s*, si)

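
A majority-vote consensus (the first definition above, not the Steiner consensus) can be computed column by column. This sketch assumes that the first three rows of the example are the aligned sequences and that the fourth row is their consensus, which is how the computation works out; the function name is hypothetical, and ties are broken by first occurrence in the column.

```python
from collections import Counter

def consensus(rows):
    """Most common character in each column of an alignment."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*rows))

print(consensus(["FGGHL-GF", "F-GHLPGF", "FGGHP-FG"]))
```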
Applications of MSA
• Building phylogenetic trees
• Finding conserved patterns, e.g.:
  • Regulatory motifs (TF binding sites)
  • Splice sites
  • Protein domains
• Identifying and characterizing protein families
  • Find out which protein domains have same function
• Finding SNPs (single nucleotide polymorphisms) & mRNA isoforms (alternatively spliced forms)
• DNA fragment assembly (in genomic sequencing)

Application: Recover Phylogenetic Tree
What was series of events that led to current species?
[Figure: phylogenetic tree with sequences NYLS, NFLS, NYLS]

Application: Discover Conserved Patterns
Is there a conserved cis-acting regulatory sequence?
TATA box = transcriptional promoter element

Goal: Characterize Protein Families
Which parts of globin sequences are most highly conserved?
Rationale: if they are homologous (derived from a common ancestor), they may be structurally equivalent