Multiple Sequence Alignment (MSA) BCB 444/544 Lecture 11

BCB 444/544
Lecture 11
First
BLAST vs FASTA
Plus some Gene Jargon
Multiple Sequence Alignment
(MSA)
#11_Sept14
BCB 444/544 F07 ISU Dobbs #11 - Multiple Sequence Alignment
9/14/07
1
Required Reading
(before lecture)
√Mon Sept 10 - for Lecture 9/10
BLAST variations; BLAST vs FASTA, SW
• Chp 4 - pp 51-62
√Wed Sept 12 - for Lecture 11 & Lab 4
Multiple Sequence Alignment (MSA)
• Chp 5 - pp 63-74
Fri Sept 14 - for Lecture 12
Position Specific Scoring Matrices & Profiles
• Chp 6 - pp 75-78 (but not HMMs)
• Good Additional Resource re: Sequence Alignment?
• Wikipedia: http://en.wikipedia.org/wiki/Sequence_alignment
Assignments & Announcements - #1
Revised Grading Policy has been sent via email
Please review!
√Mon Sept 10 - Lab 3 Exercise due 5 PM:
to: terrible@iastate.edu
?Thu Sept 13 - Graded Labs 2 & 3
will be returned at beginning of Lab 4
Fri Sept 14 - HW#2 due by 5 PM (106 MBB)
Study Guide for Exam 1 will be posted by 5 PM
Review: Gene Jargon #1
(for HW2, 1c)
Exons = "protein-encoding" (or "kept" parts) of eukaryotic genes
vs
Introns = "intervening sequences"
= segments of eukaryotic genes that "interrupt" exons
• Introns are transcribed into pre-RNA
• but are later removed by RNA processing
• & do not appear in mature mRNA
• so are not translated into protein
Assignments & Announcements - #2
Mon Sept 17 - Answers to HW#2
will be posted by 5 PM
Thu Sept 20 - Lab = Optional Review Session for Exam
Fri Sept 21 - Exam 1 - Will cover:
• Lectures 2-12 (thru Mon Sept 17)
• Labs 1-4
• HW2
• All assigned reading:
Chps 2-6 (but not HMMs)
Eddy: What is Dynamic Programming
Chp 4- Database Similarity Searching
SECTION II
SEQUENCE ALIGNMENT
Xiong: Chp 4
Database Similarity Searching
• √Unique Requirements of Database Searching
• √Heuristic Database Searching
• √Basic Local Alignment Search Tool (BLAST)
• FASTA
• Comparison of FASTA and BLAST
• Database Searching with Smith-Waterman Method
Why search a database?
• Given a newly discovered gene,
• Does it occur in other species?
• Is its function known in another species?
• Given a newly sequenced genome, which regions align
with genomes of other organisms?
• Identification of potential genes
• Identification of other functional parts of chromosomes
• Find members of a multigene family
FASTA and BLAST
• Both FASTA, BLAST are based on heuristics
• Tradeoff:
Sensitivity vs Speed
• DP is slower, but more sensitive
• FASTA
• user defines value for k = word length
• Slower, but more sensitive than BLAST at lower values of k
(preferred for searches involving a very short query sequence)
• BLAST family
• Family of different algorithms optimized for particular types of
queries, such as searching for distantly related sequence matches
• BLAST was developed to provide a faster alternative to FASTA
without sacrificing much accuracy
BLAST algorithms can generate both
"global" and "local" alignments
[Figure: global alignment vs. local alignment]
BLAST - a Family of Programs:
Different BLAST "flavors"
• BLASTP - protein sequence query against protein DB
• BLASTN - DNA/RNA seq query against DNA DB (GenBank)
• BLASTX - 6-frame translated DNA seq query against protein DB
• TBLASTN - protein query against 6-frame DNA translation
• TBLASTX - 6-frame DNA query to 6-frame DNA translation
• PSI-BLAST - protein "profile" query against protein DB
• PHI-BLAST - protein pattern against protein DB
• Newest: MEGA-BLAST - optimized for highly similar sequences
Which tool should you use?
http://www.ncbi.nlm.nih.gov/blast/producttable.shtml
Detailed Steps in BLAST algorithm
1. Remove low-complexity regions (LCRs)
2. Make a list (dictionary): all words of length 3aa or 11 nt
3. Augment list to include similar words
4. Store list in a search tree (data structure)
5. Scan database for occurrences of words in search tree
6. Connect nearby occurrences
7. Extend matches (words) in both directions
8. Prune list of matches using a score threshold
9. Evaluate significance of each remaining match
10. Perform Smith-Waterman to get alignment
1: Filter low-complexity regions (LCRs)
This slide has been changed!
• Low complexity regions, transmembrane regions and coiled-coil regions often display significant similarity without homology.
• Low complexity sequences can yield false positives.
• Screen them out of your query sequences! When appropriate!
K = computational complexity; varies from 0 (very low complexity) to 1 (high complexity)
$K = \frac{1}{L} \log_N \left( \frac{L!}{\prod_i n_i!} \right)$
N = alphabet size (4 or 20)
L = window length (usually 12)
n_i = frequency of ith letter in the window
e.g., for GGGG:
L! = 4! = 4x3x2x1 = 24
n_G = 4, n_T = n_A = n_C = 0
$\prod_i n_i!$ = 4! x 0! x 0! x 0! = 24
K = 1/4 log_4 (24/24) = 0
For CGTA: K = 1/4 log_4 (24/1) = 0.57
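The complexity measure K can be checked numerically. The sketch below is a minimal Python version of the formula on this slide; the function name `complexity` is an illustrative choice, not part of any BLAST code, and it scores a single window rather than sliding along a sequence.

```python
import math
from collections import Counter

def complexity(window: str, alphabet_size: int) -> float:
    """K = (1/L) * log_N( L! / prod_i n_i! ) for one window of length L."""
    length = len(window)
    denominator = 1
    for count in Counter(window).values():   # n_i for every letter present
        denominator *= math.factorial(count)
    # The ratio is a multinomial coefficient, so integer division is exact.
    ratio = math.factorial(length) // denominator
    return math.log(ratio, alphabet_size) / length

k_gggg = complexity("GGGG", 4)   # one repeated letter: lowest complexity
k_cgta = complexity("CGTA", 4)   # four different letters
print(round(k_gggg, 2), round(k_cgta, 2))
```

Letters absent from the window contribute 0! = 1, so they can be skipped in the product.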
2: List all words in query
YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ
YGG
GGF
GFM
FMT
MTS
TSE
SEK
…
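Listing the words is a sliding window over the query. A minimal sketch, assuming 1-based positions and a word length of 3 for protein queries (`query_words` is a hypothetical helper name):

```python
def query_words(query: str, w: int = 3) -> list[tuple[int, str]]:
    """Return (1-based position, word) for every overlapping w-letter word."""
    return [(i + 1, query[i:i + w]) for i in range(len(query) - w + 1)]

query = "YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ"
words = query_words(query)
for position, word in words[:7]:
    print(position, word)
```

A query of length n yields n - w + 1 overlapping words.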
3: Augment word list
YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ
Words from the query: YGG, GGF, GFM, FMT, MTS, TSE, SEK, …
All possible 3-letter words: AAA, AAB, AAC, …, YYY
20^3 = 8000 possible matches
3: Augment word list
BLOSUM62 scores
GGF vs AAA: 0 + 0 + -2 = -2 -> Non-match
GGF vs GGY: 6 + 6 + 3 = 15 -> Match
A user-specified threshold, T, determines which 3-letter
words are considered matches and non-matches
3: Augment word list
YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ
Words from the query: YGG, GGF, GFM, FMT, MTS, TSE, SEK, …
Words similar to GGF: GGI, GGL, GGM, GGF, GGW, GGY, …
3: Augment word list
Observation:
Selecting only words with score > T greatly reduces
number of possible matches
otherwise, 20^3 = 8000 for 3-letter words from amino acid sequences!
Example
Find all words that match EAM with a score greater
than or equal to 11
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
Matching words and their scores against EAM:
EAM: 5 + 4 + 5 = 14
DAM: 2 + 4 + 5 = 11
QAM: 2 + 4 + 5 = 11
ESM: 5 + 1 + 5 = 11
EAL: 5 + 4 + 2 = 11
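The EAM example can be reproduced by brute force over all 20^3 candidate words. This is only a sketch: the three score rows are copied from the BLOSUM62 table in this example, and `neighborhood` is a hypothetical helper name. Real BLAST builds this list more efficiently than an exhaustive loop.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"   # column order of the table
# Rows for the three query residues E, A, M, copied from the table.
BLOSUM62_ROWS = {
    "E": [-1, 0, 0, 2, -4, 2, 5, -2, 0, -3, -3, 1, -2, -3, -1, 0, -1, -3, -2, -2],
    "A": [4, -1, -2, -2, 0, -1, -1, 0, -2, -1, -1, -1, -1, -2, -1, 1, 0, -3, -2, 0],
    "M": [-1, -1, -2, -3, -1, 0, -2, -3, -2, 1, 2, -1, 5, 0, -2, -1, -1, -1, -1, 1],
}

def neighborhood(word: str, threshold: int) -> dict[str, int]:
    """Score every candidate word against `word`; keep those scoring >= threshold."""
    hits = {}
    for letters in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(BLOSUM62_ROWS[q][AMINO_ACIDS.index(c)]
                    for q, c in zip(word, letters))
        if score >= threshold:
            hits["".join(letters)] = score
    return hits

neighbors = neighborhood("EAM", 11)
for candidate, score in sorted(neighbors.items(), key=lambda kv: -kv[1]):
    print(candidate, score)
```

With these rows, lowering the threshold to 10 yields 12 words, which appears to be the list used in the search-tree example.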
4: Store words in search tree
Augmented list of
query words
“Does this query contain GGF?”
Search tree
“Yes, at position 2.”
Search tree
Words: GGF, GGL, GGM, GGW, GGY
[Figure: search tree with a single path G -> G that branches into the leaves F, L, M, W, Y]
Example
Put this word list into a search tree
DAM
QAM
EAM
KAM
ECM
EGM
ESM
ETM
EVM
EAI
EAL
EAV
[Figure: search tree for the word list. First letters: D, Q, E, K. D -> A -> M; Q -> A -> M; K -> A -> M; E -> A -> {I, L, M, V}; E -> C, G, S, T, V, each followed by M]
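One simple way to build such a search tree is a trie of nested dictionaries. The sketch below is an illustration under that assumption, not the data structure BLAST actually uses; the `"$"` end-of-word marker and the helper names are hypothetical.

```python
WORDS = ["DAM", "QAM", "EAM", "KAM", "ECM", "EGM",
         "ESM", "ETM", "EVM", "EAI", "EAL", "EAV"]

def build_trie(words):
    """Insert each word letter by letter; shared prefixes share a path."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node["$"] = word          # marks the end of a complete word
    return root

def contains(trie, word):
    """Walk the trie one letter at a time; fail on the first missing branch."""
    node = trie
    for letter in word:
        if letter not in node:
            return False
        node = node[letter]
    return "$" in node

trie = build_trie(WORDS)
print(sorted(trie))                # first-level branches
print(sorted(trie["E"]["A"]))      # third letters below E -> A
print(contains(trie, "EAL"), contains(trie, "EAK"))
```

Lookup cost depends on the word length, not on how many words are stored, which is the point of using a tree for the scan step.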
5: Scan the database sequences
[Figure: dot plot with the query sequence on one axis and the database sequence on the other]
Example
Scan this "database" for occurrences of your words
MKFLILLFNILCLDAMLAADNHGVGPQGASGVDPITFDINSNQTGPAFLTAVEAIGVKYLQVQHGSNVNIHRLVEGNVKAMENA
[Figure: the search tree is used to scan the database sequence; the match DAM is marked]
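The scan in this example can be sketched in a few lines. For brevity this version looks each 3-letter window up in a Python set rather than a search tree (an assumption made for the sketch); positions are 1-based.

```python
WORDS = ["DAM", "QAM", "EAM", "KAM", "ECM", "EGM",
         "ESM", "ETM", "EVM", "EAI", "EAL", "EAV"]
DATABASE = ("MKFLILLFNILCLDAMLAADNHGVGPQGASGVDPITFDINSNQTGPAFLT"
            "AVEAIGVKYLQVQHGSNVNIHRLVEGNVKAMENA")

def scan(database, words, w=3):
    """Slide a w-letter window along the database; report (position, word) hits."""
    word_set = set(words)
    hits = []
    for start in range(len(database) - w + 1):
        window = database[start:start + w]
        if window in word_set:
            hits.append((start + 1, window))
    return hits

hits = scan(DATABASE, WORDS)
for position, word in hits:
    print(position, word)
```

Each hit found here would then be passed on to the connection and extension steps.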
6: Connect nearby occurrences
(diagonal matches in Gapped BLAST)
Query sequence
Database sequence
Two dots are connected if and only if they are less
than A letters apart & are on the same diagonal
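The connection rule can be sketched as follows. A hit is written as (query position, database position), so two hits lie on the same diagonal when database position minus query position is equal. The hit coordinates and the value A = 10 below are made up for illustration.

```python
from collections import defaultdict

def connect_hits(hits, max_gap):
    """Pair up consecutive hits that share a diagonal and are < max_gap apart."""
    by_diagonal = defaultdict(list)
    for q_pos, db_pos in hits:
        by_diagonal[db_pos - q_pos].append(q_pos)
    pairs = []
    for diagonal, q_positions in by_diagonal.items():
        q_positions.sort()
        for first, second in zip(q_positions, q_positions[1:]):
            if second - first < max_gap:
                pairs.append(((first, first + diagonal),
                              (second, second + diagonal)))
    return pairs

# Hypothetical hits: three on diagonal 8, one on diagonal 35.
example_hits = [(2, 10), (8, 16), (30, 38), (5, 40)]
pairs = connect_hits(example_hits, max_gap=10)
print(pairs)
```

Only the first two hits are joined: the third is on the same diagonal but too far away, and the fourth is on a different diagonal.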
7: Extend matches in both directions
Scan
DB
7: Extend matches,
calculating score at each step
Query sequence:    L P P Q G L L
Database sequence: M P P E G L L
BLOSUM62 scores:   2 7 7 2 6 4 4
Word: P Q G vs P E G, scores 7 2 6, word score = 15
HSP SCORE = 32
(High Scoring Pair)
• Each match is extended to left & right until a
negative BLOSUM62 score is encountered
• Extension step typically accounts for > 90% of
execution time
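The extension step for this example can be sketched as below. It follows the simplified stopping rule stated on this slide (stop at the first negative pair score); production BLAST uses a score drop-off threshold instead. Only the BLOSUM62 pairs needed for the example are included, and `extend_hit` is a hypothetical name.

```python
# BLOSUM62 scores for just the residue pairs that occur in this example.
PAIR_SCORES = {
    ("L", "M"): 2, ("P", "P"): 7, ("Q", "E"): 2, ("G", "G"): 6, ("L", "L"): 4,
}

def pair_score(a, b):
    """Symmetric lookup; raises KeyError for pairs not listed above."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]

def extend_hit(query, database, q_start, db_start, w=3):
    """Ungapped extension of a w-letter hit; returns (score, start, end) in the query."""
    score = sum(pair_score(query[q_start + i], database[db_start + i])
                for i in range(w))
    # Extend left until a negative score or the start of either sequence.
    q, d = q_start - 1, db_start - 1
    while q >= 0 and d >= 0 and pair_score(query[q], database[d]) >= 0:
        score += pair_score(query[q], database[d])
        q, d = q - 1, d - 1
    left = q + 1
    # Extend right in the same way.
    q, d = q_start + w, db_start + w
    while (q < len(query) and d < len(database)
           and pair_score(query[q], database[d]) >= 0):
        score += pair_score(query[q], database[d])
        q, d = q + 1, d + 1
    return score, left, q

word_score = sum(pair_score(a, b) for a, b in zip("PQG", "PEG"))
hsp = extend_hit("LPPQGLL", "MPPEGLL", q_start=2, db_start=2)
print(word_score, hsp)
```

The returned span is half-open and 0-based, so (0, 7) means the whole 7-letter query is covered.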
8: Prune matches
• Discard all matches that score below defined
threshold
9: Evaluate significance
This slide has
been changed!
• BLAST uses an analytical statistical significance
calculation
RECALL:
1. E-value: E = m x n x P
m = total number of residues in database
n = number of residues in query sequence
P = probability that an HSP is result of random chance
lower E-value, less likely to result from random chance,
thus higher significance
2. Bit Score: S' = normalized score, to account for differences in size of database (m) & sequence
length (n); Note (below) that bit score is linearly related to raw alignment
score, so: higher S' means alignment has higher significance
S' = (λ x S - ln K)/ln 2 where:
λ = Gumbel distribution constant
S = raw alignment score
K = constant associated with scoring matrix
For more details - see text & BLAST tutorial
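Both formulas on this slide are easy to evaluate directly. In the sketch below the values chosen for λ and K, and for m, n and P, are illustrative assumptions, not values given in the lecture; actual values depend on the scoring matrix and gap penalties.

```python
import math

def bit_score(raw_score: float, lam: float, k: float) -> float:
    """S' = (lambda * S - ln K) / ln 2"""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value(m: int, n: int, p: float) -> float:
    """E = m x n x P, as defined on this slide."""
    return m * n * p

LAMBDA, K = 0.267, 0.041          # assumed constants, for illustration only
s32 = bit_score(32, LAMBDA, K)
s40 = bit_score(40, LAMBDA, K)
print(round(s32, 1), round(s40, 1))

# Hypothetical search: 1,000,000 database residues, 100-residue query.
expected = e_value(1_000_000, 100, 1e-10)
print(expected)
```

Because S' is linear in S, a higher raw score always gives a higher bit score for fixed λ and K.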
10: Use Smith-Waterman algorithm
(DP) to generate alignment
• ONLY significant matches are re-analyzed using
Smith-Waterman DP algorithm.
• Alignments reported by BLAST are produced by
dynamic programming
BLAST: What is a "Hit"?
• A hit is a w-length word in database that aligns with a
word from query sequence with score > T
• BLAST looks for hits instead of exact matches
• Allows word size to be kept larger for speed, without
sacrificing sensitivity
• Typically, w = 3-5 for amino acids,
w = 11-12 for DNA
• T is the most critical parameter:
• ↑T -> ↓ “background” hits (faster)
• ↓T -> ↑ ability to detect more distant relationships
(at cost of increased noise)
Tips for BLAST Similarity Searches
• If you don’t know, use default parameters first
• Try several programs & several parameter settings
• If possible, search on protein sequence level
• Scoring matrices:
PAM1 / BLOSUM80:
if expect/want less divergent proteins
PAM120 / BLOSUM62: "average" proteins
PAM250 / BLOSUM45: if need to find more divergent proteins
• Proteins:
>25-30% identity (and >100aa) -> likely related
15-25% identity -> twilight zone
<15% identity -> likely unrelated
Practical Issues
Searching on DNA or protein level?
In general,
protein-encoding DNA should be translated!
• DNA yields more random matches:
• 25% for DNA vs. 5% for proteins
• DNA databases are larger and grow faster
• Selection (generally) acts on protein level
• Synonymous mutations are usually neutral
• DNA sequence similarity decays faster
BLAST vs FASTA
• Seeding:
• BLAST integrates scoring matrix into first phase
• FASTA requires exact matches (uses hashing)
• BLAST increases search speed by finding fewer, but
better, words during initial screening phase
• FASTA uses shorter word sizes - so can be more
sensitive
• Results:
• BLAST can return multiple best scoring alignments
• FASTA returns only one final alignment
BLAST & FASTA References
• FASTA - developed first
• Pearson & Lipman (1988) Improved Tools for Biological
Sequence Comparison. PNAS 85:2444-2448
• BLAST
• Altschul, Gish, Miller, Myers, Lipman (1990) Basic local
alignment search tool. J. Mol. Biol. 215:403-410
• Altschul, Madden, Schaffer, Zhang, Zhang, Miller, Lipman
(1997) Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs. Nucleic Acids Res.
25:3389-3402
BLAST Notes - & DP Alternatives
• BLAST uses heuristics: it may miss some good matches
• But, it’s fast: 50 - 100X faster than Smith-Waterman (SW) DP
• Large impact:
• NCBI’s BLAST server handles more than 100,000 queries/day
• Most used bioinformatics program in the world!
• But - Xiong says: "It has been estimated that for some families of
protein sequences BLAST can miss 30% of truly significant
matches."
• Increased availability of parallel processing has made DP-based
approaches feasible:
• 2 DP-based web servers: both more sensitive than BLAST
• Scan Protein Sequence: http://www.ebi.ac.uk/scanps/index.html
Implements modified SW optimized for parallel processing
• ParAlign www.paralign.org - parallel SW or heuristics
NCBI - BLAST Programs
Glossary & Tutorials
BLAST
• http://www.ncbi.nlm.nih.gov/BLAST/
• http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html
• http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html
Chp 5- Multiple Sequence Alignment
SECTION II
SEQUENCE ALIGNMENT
Xiong: Chp 5
Multiple Sequence Alignment
• Scoring Function
• Exhaustive Algorithms
• Heuristic Algorithms
• Practical Issues
Multiple Sequence Alignments
Credits for slides: Caragea & Brown, 2007;
Fernandez-Baca, Heber & Hunter
Overview
1. What is a multiple sequence alignment (MSA)?
2. Where/why do we need MSA?
3. What is a good MSA?
4. Algorithms to compute a MSA
Multiple Sequence Alignment
• Generalize pairwise alignment of sequences to
include > 2 homologous sequences
• Analyzing more than 2 sequences gives us much more
information:
• Which amino acids are required? Correlated?
• Evolutionary/phylogenetic relationships
• Similar to PSI-BLAST idea (not yet covered in lecture):
use a set of homologous sequences to provide
more "sensitivity"
What is a MSA?
Not a MSA:
ATT-GC
ATTTGC
ATTTG
MSA:
AT-TGC
ATTTGC
ATTTG-
Not a MSA:
AT-T-GC
ATTT-GC
ATTT-G-
Why?
Definition: MSA
Given a set of sequences, a multiple sequence
alignment is an assignment of gap characters, such
that
• resulting sequences have same length
• no column contains only gaps
NO (sequences differ in length):
ATT-GC
ATTTGC
ATTTG
YES:
AT-TGC
ATTTGC
ATTTG-
NO (column 5 contains only gaps):
AT-T-GC
ATTT-GC
ATTT-G-
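The two conditions in the definition translate directly into a check. A minimal sketch, assuming "-" is the gap character and each row is given as a string (`is_msa` is a hypothetical name):

```python
def is_msa(rows):
    """True if all rows have the same length and no column is all gaps."""
    if len({len(row) for row in rows}) != 1:
        return False                      # rows differ in length (or no rows)
    # zip(*rows) walks the alignment column by column.
    return all(any(ch != "-" for ch in column) for column in zip(*rows))

first = is_msa(["ATT-GC", "ATTTGC", "ATTTG"])
second = is_msa(["AT-TGC", "ATTTGC", "ATTTG-"])
third = is_msa(["AT-T-GC", "ATTT-GC", "ATTT-G-"])
print(first, second, third)
```

The first candidate fails the length condition and the third fails the all-gap column condition, matching the NO / YES / NO labels on this slide.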
Displaying MSAs: using CLUSTAL W
RED: AVFPMILW (small)
BLUE: DE (acidic, negative chg)
MAGENTA: RHK (basic, positive chg)
GREEN: STYHCNGQ (hydroxyl + amine + basic)
* = entirely conserved column
: = all residues have ~ same size AND hydropathy
. = all residues have ~ same size OR hydropathy
What is a Consensus Sequence?
A single sequence that represents most common
residue of each column in a MSA
Example:
FGGHL-GF
F-GHLPGF
FGGHP-FG
Consensus: FGGHL-GF
Steiner consensus sequence: Given sequences s1,…, sk,
find a sequence s* that maximizes Σi S(s*,si)
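The "most common residue per column" consensus can be sketched with a column-wise majority vote over the first three sequences of the example. This is the simple majority consensus only, not the Steiner consensus, and ties are broken by whichever residue appears first in the column (an arbitrary choice made for this sketch).

```python
from collections import Counter

def consensus(rows):
    """Most common character in each column of an alignment (gaps count too)."""
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*rows))

alignment = ["FGGHL-GF",
             "F-GHLPGF",
             "FGGHP-FG"]
result = consensus(alignment)
print(result)
```

Here a gap is treated like any other symbol, which is why column 6 of the result is a gap.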
Applications of MSA
• Building phylogenetic trees
• Finding conserved patterns, e.g.:
• Regulatory motifs (TF binding sites)
• Splice sites
• Protein domains
• Identifying and characterizing protein families
• Find out which protein domains have same function
• Finding SNPs (single nucleotide polymorphisms) &
mRNA isoforms (alternatively spliced forms)
• DNA fragment assembly (in genomic sequencing)
Application: Recover Phylogenetic Tree
What was series of events that led to current species?
NYLS
NFLS
NYLS
Application: Discover Conserved Patterns
Is there a conserved cis-acting regulatory sequence?
Rationale: if they are homologous (derived from a common ancestor),
they may be structurally equivalent
TATA box = transcriptional
promoter element
Goal: Characterize Protein Families
Which parts of globin sequences are most highly conserved?