How does BLAST work?

advertisement
BLAST Workshop
Maya Schushan
November 2011
Workshop OUTLINE
Part 1:
Part 2:
• Introduction and motivation
• Work Steps
• How does BLAST work?
• HANDS-ON
• BLAST programs
• Applications: ConSurf
• Sequence databases
• MSA programs
Why BLAST?
Finding homologous
• Homology- similarity between sequences that result from a
common ancestor.
• Sequences look alikeMore
 probably
have the same function
then:
and structure.
•
25% for proteins
70%
nucleotides
Use a sequence
as a for
search
query in order to find
homologous
sequences
in a dataas
base.
will be
considered
homologous
• Save time! – exploit the knowledge you have about your
homologues, and conclude about your query.
Why BLAST?
Finding homologous
Identify sequence motifs
Why BLAST?
Finding homologous
Find out which region are evolutionary conserved
 important for function and\or structure
Why BLAST?
Finding homologous
Construct phylogenetic trees  understand the
evolution of the sequence’s family
Why BLAST?
Finding homologous
Finding out if your protein sequence has a
structure (or a close homologue has one….)
Workshop OUTLINE
Part 1:
Part 2:
• Introduction and motivation
• Work Steps
• How does BLAST work?
• HANDS-ON
• BLAST programs
• Applications: ConSurf
• Sequence databases
• MSA programs
How does BLAST work?
What Is An Alignment?
Before we can understand how
BLAST works, we first have to
understand the principles of
sequence alignment….
How does BLAST work?
What Is An Alignment?
• Comparing 2 (pairwise) or more (multiple) sequences.
• Searching for a series of identical or similar
characters in the sequences.
VLSPADKTNVKAAWAKVGAHAAGHG
||| |
|
|||| | ||||
VLSEAEWQLVLHVWAKVEADVAGHG
How does BLAST work?
What Is An Alignment?
A process of lining-up 2 or more sequences to achieve
maximum level of identity, in order to find homologies.
TCATG
CATTG
?
TCATG
CATTG
or
TCATG
CATTG
How does BLAST work?
What Is An Alignment?
S = ACTG
T = AGT
S’ = AC_TG S’ = ACTG S’ = ACTG
T’ = A_GT_ T’ = AGT_ T’ = _AGT
Good: Identical characters- match.
Bad: Different characters- mismatch; gap (InDel).
• Each pair of characters gets a value, depending on its identity.
•The similarity score of the alignment is the sum of pair values.
General
Alignment
Methodology
How
does
BLAST
work?
What Is An Alignment?
Example: Aligning Two Globins
Human Hemoglobin (HH):
VLSPADKTNVKAAWGKVGAHAGYEG
Sperm Whale Myoglobin (SWM):
VLSEGEWQLVLHVWAKVEADVAGHG
How does BLAST work?
What Is An Alignment?
Example: Aligning Two Globins
• Percent identity: 36
• Percent similarity: 40
(HH)
No Gaps:
VLSPADKTNVKAAWGKVGAHAGYEG
(SWM) VLSEGEWQLVLHVWAKVEADVAGHG
How does BLAST work?
What Is An Alignment?
Example: Aligning Two Globins
With Gaps:
Gaps: 2
• Percent identity: 45.833 (instead of 36 without gaps)
• Percent similarity: 54.167 (instead of 40 without gaps)
•
(HH)
VLSPADKTNVKAAWGKVGAH-AGYEG
(SWM) VLSEGEWQLVLHVWAKVEADVAGH-G
How does BLAST work?
What Is An Alignment?
Alignment Scoring
1. Assume independent mutation model
2. Score at each position
– Positive if the same/similar
– Negative if different or gap
3. Score of an alignment is sum of position score
How does BLAST work?
What Is An Alignment?
Scoring Matrix
• A matrix n  n : n=4 for DNA, n=20 for proteins
• Each entry matrix defines the score for observing the
two letters in the alignment
A G C T
– Positive if likely to change
– Negative otherwise
A 1
G -5 1
C -5 -5 1
T -5 -5 -5 1
How does BLAST work?
What Is An Alignment?
DNA scoring matrices
From
To
A
G
A
G
2
-4
2
C
T
-6
-6
-6
-6
Transversion
C
T
2
-4
2
Transition
Match
How does BLAST work?
What Is An Alignment?
Proteins scoring matrices
• Observation: some substitutions
are more frequent than others,
e.g., chemically similar amino acids
• As for DNA, protein matrices
define the probabilities of change
between the different amino acids
• Popular matrices are based on
empirical data: PAM & BLOSUM
How does BLAST work?
What Is An Alignment?
PAM Matrices
• PAM matrices are based on sequences with 85% identity.
• The changes are “accepted” by natural selection
• 1 PAM unit:
the probability of 1 point mutation per 100 residues.
• Multiplying PAM1 by itself gives higher PAMs matrices that
are suitable for larger evolutionary distance.
How does BLAST work?
What Is An Alignment?
BLOSUM Matrices
• Based on BLOCKS database:
• Low BLUSOM numbers for distant sequences,
High BLUSOM numbers for similar sequence
• BLOSUMn is based on sequences that shared at least n
percent identity, generally:
BLOSUM62 for general use
BLOSUM80 for close relations
BLOSUM45 for distant relations
How does BLAST work?
What Is An Alignment?
Proteins scoring matrices
Closer sequences
PAM100
PAM120
PAM160
PAM200
PAM250
Distant sequences
=
=
=
=
=
BLOSUM90
BLOSUM80
BLOSUM60
BLOSUM52
BLOSUM45
How does BLAST work?
What Is An Alignment?
Scoring
• The final score of the alignment is the sum of the
positive scores and penalty scores:
Scoring
Matrix
+ Number of Identities
+ Number of Similarities
- Number of Gap insertions
- Number of Gap extensions
Alignment score
Gap
penalties
How does BLAST work?
BLAST
(Basic Local Alignment Search Tool)
• Goal: A fast search for homologues in a huge database
• The underlying hypothesis: when two sequences are similar
there are short ungapped regions of high similarity
between them
• The heuristic:
1. Discard irrelevant sequences
2. Perform exact local alignment only with the remaining
sequences
Altschul, S.F.,Gish, W., Miller, W., Myers, E.W., and Lipman,D.J(1990) “basic local alignment search
tool” J. Mol. Biol. 215: 403-410
How does BLAST work?
Searching a sequence database
•Idea:
In order to find homologous sequences to a sequence of
interest, one should compute its pairwise alignment against
all known sequences in a database, and detect the best
scoring significant homologs
•Query sequence - the sequence with which we are searching
•Hit – a sequence found in the database, suspected as
homologous
25
How does BLAST work?
The parametersW : Word size – find W-mers in target/query
2-3 for aa, 6-11 for nucleotides.
T : Threshold – focus on pairs scoring >T
usually 11-13
X : Drop-off – stop extending when loss >X
S : Score – the final score of segment pair
How does BLAST work?
The algorithm:
1.
Align a query sequence with the database.
2.
Find “hits”: short word pairs of length W with an
ungapped alignment score of at least T.
3.
Extend alignments until score drops more than X below
hitherto best score
s
Consumes most of the processing
time (>90%)
t
How does BLAST work?
How do we discard irrelevant
sequences quickly?
• Divide the database into words of length w (default:
w = 3 for protein and w = 7 for DNA)
• Save the words in a look-up table that can be
searched quickly
WTDFGYPAILKGGTAC
WTD
TDF
DFG
FGY
GYP …
How does BLAST work?
BLAST: discarding sequences
• When the user enters a query sequence, it is also divided
into words
• Search the database for consecutive neighboring words
• neighbor words are defined according to a scoring matrix
(e.g., BLOSUM62 for proteins) with a certain cutoff level
GFC (20)
GFB
GPC (11)
WAC (5)
How does BLAST work?
Look for a seed: hits on
the same diagonal which
can be connected
Neighbor word
Database
record
At least 2 hits on the same
diagonal with distance which
is smaller than a
predetermined cutoff
This is the filtering stage –
many unrelated hits are
filtered, saving lots of time!
Query
How does BLAST work?
Try to extend the alignment
• Stop extending when the score of the alignment
drops X beneath the maximal score obtained so far
• Discard segments with score < S
ASKIOPLLWLAASFLHNEQAPALSDAN
JWQEOPLWPLAASOIHLFACNSIFYAS
Score=15
Score=17
Score=14
How does BLAST work?
The result – local alignment
• The result of BLAST will be a series of local alignments
between the query and the different hits found
How does BLAST work?
E-value
•
To asses the bits score we calculate E-value:
E-value = The expected number of HSP’s with a score
of at least S:
 s
E  KMNe
•
For each score S there is a specific E-value.
Small E-value  better score
How does BLAST work?
E-value
Theoretically, we could trust any result with an
E-value ≤ 1
In practice – BLAST uses estimations.
•E-values of 10-4 and lower indicate a significant homology.
•E-values between 10-4 and 10-2 should be checked (similar
domains, maybe non-homologous).
•E-values between 10-2 and 1 do not indicate a good homology
How does BLAST work?
PSI-BLAST
Step 1:
1. Set a standard protein-protein BLAST search (BLOSUM62)
2. Build a position specific scoring matrix (PSSM)
according to MSA of the alignment results with low Evalue.
Step 2:
1. Set a BLAST search using the PSSM to evaluate the
alignment. PSSM vs. DB instead of seq vs. DB
2. Update the PSSM according to the new result
3. Go back to the beginning of step two or stop.
How does BLAST work?
The power of PSI-BLAST:
1.
A much sensitive scoring system .
each position has its own pattern probabilities .
2.
Important motifs are bounded
3.
Lowers the level of random noise.
4. Finding distant relatives.
How does BLAST work?
Lets sum up…
-
Blast is a fast way to find homologues
-
No analytic theory that estimates the
statistical significance of gapped alignments
-
Gap scores have been selected by trial and
error.
applying different scoring matrix  No grantee
for gap scores
-
PSI-BLAST finds weak homologues fast
Workshop OUTLINE
Part 1:
Part 2:
• Introduction and motivation
• Work Steps
• How does BLAST work?
• HANDS-ON
• BLAST programs
• Applications: ConSurf
• Sequence databases
• MSA programs
BLAST programs
• All types of searches are possible
Query:
DNA
Protein
Database:
DNA
Protein
blastn – nuc vs. nuc
blastp – prot vs. prot
blastx – translated query vs. protein database
tblastn – protein vs. translated nuc. DB
tblastx – translated query vs. translated database
39
BLAST programs
Amino acid sequence –
most suitable for homology search
• Amino acid sequence is more conserved
• 20 letter alphabet. Two random hits share 5% identity in
average (comparing to 25% in DNA seq)
• Protein comparison matrices more sensitive
• Protein databases smaller – less random hits
• Proteins are much more relevant for inferring structural
data.
Workshop OUTLINE
Part 1:
Part 2:
• Introduction and motivation
• Work Steps
• How does BLAST work?
• HANDS-ON
• BLAST programs
• Applications: ConSurf
• Sequence databases
• MSA programs
Sequence databases
Where do we want to search?
DNA sequences
• NR- All GenBank + EMBL + DDBJ + PDB sequences. No longer
"non-redundant" due to computational cost.
• Genomes a specific organisms
• RefSeq- mRna or genomic- an annotated collection from NCBI
Reference Sequence Project.
• EMBL- Europe's primary nucleotide sequence resource (EBI)
• ….
Sequence databases
Where do we want to search?
Protein databases:
• Uniprot –swissprot or trembl
• UniRef90- clustered UniRef using 90% identity.
• PDB- the sequences of proteins for which structures are available
• NR (non-redundant)- Non-redundant GenBank CDS translations +
PDB + SwissProt + PIR + PRF, excluding those in env_nr
• RefSeq- sequences from NCBI Reference Sequence project.
• Proteins of a specific organisms
Sequence databases
Where do we want to search?
UniProt
• UniProt is a collaboration between the
European Bioinformatics Institute (EBI), the
Swiss Institute of Bioinformatics (SIB) and
the Protein Information Resource (PIR).
• In 2002, the three institutes decided to pool
their resources and expertise and formed the
UniProt Consortium.
Sequence databases
Where do we want to search?
UniProt (UniRef90)
The world's most comprehensive catalog of information on
proteins- Sequence, function & more… Comprised mainly of
the databases:
• SwissProt ??? protein entries–
high quality annotation, non-redundant & cross-referenced to
other databases.
• TrEMBL - ??? protein entries–
translation of genetic information from the EMBL Nucleotide
Database  most proteins are poorly annotated since only
annotated automatically
After BLAST
We ran BLAST & found homologues (HSTs).
What next?
Workshop OUTLINE
Part 1:
Part 2:
• Introduction and motivation
• Work Steps
• How does BLAST work?
• HANDS-ON
• BLAST programs
• Applications: ConSurf
• Sequence databases
• MSA programs
MSA
Multiple Sequence Alignment (MSA)
• Perform alignment of a large collection of sequences
• Many algorithms, leading ones:
1. ClustalW
2. MUSCLE
3. Mafft
MSA
Pairwise Vs. Multiple Sequence Alignment
Alignments help to analyze sequence data:
organize, visualize.
Pairwise:
For 2 sequences
F G K  G K G
F G K F G K G
MSA:
For more than 2 sequences
F
F
-
G
G
G
-
K
K
K
K

F
Q
F
G
G
G
G
K
K
K
K
G
G
G
G
MSA
ClustalW- Introduction
• This heuristic approach works because it uses the biological
meaning of MSA
• Based on the idea that the sequences we usually want to align
are phylogenetically related
• The first program to implement progressive MSA
• Was introduced in 1994 and still used today.
Thompson, J.D. et al, 1994
MSA
ClustalW- Progressive Alignment
Hbb_Human 1
Hbb_Horse 2
Hba_Human 3
Hba_Horse 4
Myg_Whale 5
17
-
59
60
-
59
59
13
-
77
77
75
75
1. Quick pairwise alignment
calculate distance matrix
-
Hbb_Human
Hbb_Horse
Hba_Human
Hba_Horse
2. Build a guide tree using the
NJ phylogenetic method
Myg_Whale
3. Progressive alignment
following guide tree
MSA
ClustalW- Progressive Alignment
A
B
C
D
A
-
-
-
-
B
1
-
-
-
C
7
8
-
-
D
11
5
2
-
A
B
C
D
MSA
ClustalW- Problems
• Sequences that are similar only in some smaller regions
 ClustalW tries to find global alignments, not local.
• Sequence that contains a large insertion compared to the rest
 global not local
• Sequence that contains a repetitive element, while another sequence only
contains one copy.
Vs
MSA
MUSCLE- Introduction
• The most recent popular MSA software
• Considered to be the most accurate MSA software available
today
• The basic idea: Progressive Alignment
Edgar, R.C., 2004
MSA
MUSCLE Innovations
• Faster distance estimation between the input sequences
• Faster construction of an evolutionary tree
(UPGMA instead of NJ in ClustalW )
•Applying new score function to the profile alignments
• Refinement of the initial results
Edgar R.C., 2004
faster
more
accurate
MSA
MUSCLEIt’s Even More Complicated…
MSA
MAFFT
• Implements the Fast Fourier Transform (FFT) to optimize
protein alignments based on physical properties of the
amino acids
• Iterative progressive alignment and iterative refinement
• Useful for hard-to-align sequences such as those containing
large gaps.
• Different algorithms for different problems:
• Number and length of sequences
• RNA vs. protein
MSA
Comparison of MSA methods
No method is superior for all cases: trial and error!
Essoussi et al 2008
But MAFFT and ProbCons are very good…
Nuin et al 2006
Download