BLAST Introduction - CSE - University of South Carolina

Bioinformatics Algorithms and
Data Structures
BLAST
Lecturer: Dr. Rose
BLAST Slides: Adaptation of Nir Friedman’s slides from the
Computational Methods in Molecular Biology course
(Spring 2001) at Hebrew University, Jerusalem, Israel
February 21, 2007
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
BLAST
Q: What is BLAST?
A: BLAST is an acronym:
Basic Local Alignment Search Tool
- a set of similarity search programs designed to
explore all of the available sequence databases
regardless of whether the query is protein or DNA
You can find it at:
http://www.ncbi.nlm.nih.gov/BLAST/
BLAST
• Q: Why do you care?
• A: Because you are going to do a project.
• U51112: Membrane protein that transports sodium and hydrogen
• J03581: Tyrosinase (people lacking this are albino)
• NM_000245: MET, an oncogene (mutations in this cause cancer)
• NM_010849: MYC, another oncogene
• NM_007409: Alcohol dehydrogenase (good to have when drinking)
• NM_002475: Myosin, one of the muscle proteins
• XM_086788: Crystallin, the major protein in the lens
• M30047: Myelin basic protein (protects the neurons)
• NM_000518: Hemoglobin, oxygen-carrying protein in RBC
• NM_000477: Albumin, major serum protein (does a lot of things)
• NM_008476: Keratin, skin and integument protein
BLAST
• BLAST is designed to efficiently find alignments
of a target string s against large databases
– Motivation: increase speed by finding fewer and
better hotspots.
– Idea: Find high scoring matches using a substitution
matrix rather than exact matches.
– We are still searching only for gapless matches.
High-Scoring Pair
• Two strings s and t are a high scoring pair (HSP)
if d(s,t) > T
• Given a query s[1..n], BLAST constructs all words
(fixed-length substrings) w, such that w scores > t
with a k-substring of s
– Each match to such a word in the database is called
a hit
• Typical k: 12 for nucleotides, 3-5 for amino acids.
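The word-construction step can be sketched as follows. This is a minimal illustration, assuming a toy DNA scoring scheme (+5 match, -4 mismatch) in place of a real substitution matrix; the function names and the threshold values are hypothetical and not taken from BLAST itself.

```python
from itertools import product

ALPHABET = "ACGT"  # toy alphabet (assumption); proteins would use 20 amino acids


def score(a: str, b: str) -> int:
    """Toy substitution score (assumed): +5 for a match, -4 for a mismatch."""
    return 5 if a == b else -4


def word_score(w: str, u: str) -> int:
    """Gapless score of two words of equal length."""
    return sum(score(a, b) for a, b in zip(w, u))


def neighborhood(query: str, k: int, threshold: int) -> dict[str, list[int]]:
    """Map every word scoring > threshold against some k-substring of the
    query to the query positions of those substrings."""
    words: dict[str, list[int]] = {}
    for i in range(len(query) - k + 1):
        sub = query[i:i + k]
        # Brute-force enumeration of all |alphabet|^k candidate words.
        for cand in product(ALPHABET, repeat=k):
            w = "".join(cand)
            if word_score(w, sub) > threshold:
                words.setdefault(w, []).append(i)
    return words


seeds = neighborhood("ACGTAC", k=3, threshold=5)
print(len(seeds), "seed words")
```

With this toy scoring and k = 3, a threshold of 5 admits exact matches and words with one mismatch, while a threshold of 14 admits exact matches only. Enumerating every candidate word is only feasible for small k, which fits the short word lengths quoted above for amino acids.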
High-Scoring Pair
• Try to extend each such hit to an alignment with
maximal score (still with no gaps). Keep all HSPs
– The threshold is chosen so that a random match with such a
score is unlikely.
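A gapless extension of a single hit might look like the sketch below. The toy +5/-4 scoring and the `x_drop` stopping rule (give up once the running score falls too far below the best seen) are assumptions made for illustration; the slides only state that the hit is extended to a maximal-scoring gapless alignment.

```python
def score(a: str, b: str) -> int:
    """Toy substitution score (assumed): +5 for a match, -4 for a mismatch."""
    return 5 if a == b else -4


def extend_one_side(s, t, i, j, step, x_drop):
    """Walk along the diagonal from (i, j) in direction step (+1 or -1).
    Return (best extra score, number of positions kept)."""
    run = best = kept = 0
    d = 0
    while 0 <= i + d < len(s) and 0 <= j + d < len(t):
        run += score(s[i + d], t[j + d])
        if run > best:
            best, kept = run, abs(d) + 1
        if best - run > x_drop:  # assumed stopping rule
            break
        d += step
    return best, kept


def extend_hit(s, t, i, j, k, x_drop=10):
    """Extend the hit s[i:i+k] / t[j:j+k] in both directions without gaps.
    Returns (score, start in s, start in t, length)."""
    seed = sum(score(s[i + d], t[j + d]) for d in range(k))
    right, n_right = extend_one_side(s, t, i + k, j + k, +1, x_drop)
    left, n_left = extend_one_side(s, t, i - 1, j - 1, -1, x_drop)
    return seed + left + right, i - n_left, j - n_left, k + n_left + n_right


print(extend_hit("TTACGTACGTT", "GGACGTACGAA", 4, 4, 3))
```

In the example, the 3-letter seed `GTA` grows to the 7-letter match `ACGTACG`; the extension stops growing where the flanking mismatches lower the score.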
Finding Potential Matches
We can locate seed words in a large database in a
single pass
• Construct a FSA that recognizes seed words
• Use hashing techniques to locate matching words
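The hashing variant can be sketched as a single scan over the database with one hash lookup per position. The data layout (a dict of seed words mapped to query positions, and a dict of named database sequences) is an assumption chosen for illustration.

```python
def find_hits(seed_words: dict[str, list[int]],
              database: dict[str, str],
              k: int) -> list[tuple[int, str, int]]:
    """Scan each database sequence once and report (query_pos, name, db_pos)
    for every k-mer that is one of the seed words."""
    hits = []
    for name, seq in database.items():
        for j in range(len(seq) - k + 1):
            # Expected O(1) hash lookup per database position.
            for i in seed_words.get(seq[j:j + k], ()):
                hits.append((i, name, j))
    return hits


seed_words = {"ACG": [0], "CGT": [1]}          # hypothetical seed words
database = {"seq1": "TTACGTT", "seq2": "GGGG"}  # hypothetical database
for i, name, j in find_hits(seed_words, database, k=3):
    print(name, "query pos", i, "db pos", j, "diagonal", j - i)
```

Both hits in the example fall on the same diagonal (db position minus query position), which is the information needed later when seeds are combined.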
Extending Potential Matches
• Once a seed is found, BLAST attempts to find a
local alignment that extends the seed
• Seeds on the same diagonal
are combined (as in FASTA)
Which programs are used?
• Originally BLAST did not allow gaps.
– Now people use gapped BLAST
– Gapped BLAST joins different diagonals.
• For proteins BLAST is superior
• For nucleotides FASTA is better.
Review: Unrelated Sequences
• Our model of unrelated sequences is simple
– Each position is sampled independently from a
distribution over the alphabet Σ
– We assume there is a distribution q(·) that describes the
probability of letters in such positions
• Then:
$$P(s[1..n], t[1..n] \mid R) = \prod_i q(s[i])\, q(t[i])$$
• R denotes the assumption that s and t are random
unrelated strings
Review: Related Sequences
• We assume that each pair of aligned positions
(s[i],t[i]) evolved from a common ancestor
• Let p(a,b) be a distribution over pairs of letters.
• p(a,b) is the probability that some ancestral letter
evolved into this particular pair of letters
$$P(s[1..n], t[1..n] \mid M) = \prod_i p(s[i], t[i])$$
• Here M denotes the assumption that s and t are
related strings.
Review: Ratio Test for Alignment
• Taking the logarithm of the ratio of the two likelihoods, we get
$$\log \frac{P(s,t \mid M)}{P(s,t \mid R)} = \log \prod_i \frac{p(s[i],t[i])}{q(s[i])\, q(t[i])} = \sum_i \log \frac{p(s[i],t[i])}{q(s[i])\, q(t[i])}$$
Review: Probabilistic
Interpretation of Scoring Rule
• If we take
$$\sigma(a,b) = \log \frac{p(a,b)}{q(a)\, q(b)}$$
• then the score of an alignment is the log-ratio
between the two models:
– Score > 0 ⇒ M (related) is more “probable”
– Score < 0 ⇒ R (unrelated) is more “probable”
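As a small worked illustration of the log-odds scoring rule, the sketch below uses a made-up two-letter alphabet. The distributions `q` and `p` are illustrative assumptions, not values from any real substitution matrix.

```python
import math

# Assumed background letter distribution q(a) (unrelated model R).
q = {"A": 0.5, "B": 0.5}

# Assumed joint distribution p(a, b) over aligned pairs (related model M).
p = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}


def sigma(a: str, b: str) -> float:
    """Log-odds substitution score: log p(a,b) / (q(a) q(b))."""
    return math.log(p[(a, b)] / (q[a] * q[b]))


def alignment_score(s: str, t: str) -> float:
    """Score of a gapless alignment = log P(s,t | M) - log P(s,t | R)."""
    return sum(sigma(a, b) for a, b in zip(s, t))


print(alignment_score("AAB", "AAB"))  # positive: related model M fits better
print(alignment_score("AB", "BA"))    # negative: unrelated model R fits better
```

Because identical pairs are more likely under M than under independent sampling (0.4 versus 0.25 here), matches score positively and mismatches negatively.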
Problems with Scoring Rule
When searching for an optimal alignment in a big database,
there are a number of problems that arise with this simple
scheme.
• We are assuming P(M) = P(R), that is, that there are an
equal number of related and unrelated sequences in the
database.
• When searching through a big database, there is high
probability that an unrelated sequence will receive a high
score
• When searching for an optimal local alignment, we have
many possible starting points, heavily biasing the score
towards being a related sequence.
Prior Probability on the models
• What we really wish to calculate is:
$$P(M \mid s,t) = \frac{P(s,t \mid M)\, P(M)}{P(s,t)}$$
• The log score being:
$$\log \frac{P(M \mid s,t)}{P(R \mid s,t)} = \log \frac{P(s,t \mid M)\, P(M)}{P(s,t \mid R)\, P(R)} = \log \frac{P(s,t \mid M)}{P(s,t \mid R)} + \log \frac{P(M)}{P(R)}$$
Prior Probability on the models
• Our threshold should be based on the prior log-odds:
$$\log \frac{P(M)}{P(R)}$$
• The posterior favors M only when the alignment score exceeds $-\log \frac{P(M)}{P(R)} = \log \frac{P(R)}{P(M)}$.
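To see how much the prior term log P(M)/P(R) matters, here is a minimal numeric sketch. It assumes the two models are exhaustive, so P(R) = 1 - P(M); the prior value 0.001 is an arbitrary illustration.

```python
import math


def posterior_log_odds(score: float, prior_m: float) -> float:
    """log P(M|s,t)/P(R|s,t) = score + log P(M)/P(R), assuming P(R) = 1 - P(M)."""
    return score + math.log(prior_m / (1.0 - prior_m))


def posterior_m(score: float, prior_m: float) -> float:
    """P(M | s, t), recovered from the posterior log-odds."""
    return 1.0 / (1.0 + math.exp(-posterior_log_odds(score, prior_m)))


prior = 0.001  # assumed: one related sequence per thousand
needed = math.log((1.0 - prior) / prior)
print("score needed before M is favored:", needed)
print("posterior at score 5:", posterior_m(5.0, prior))
```

With equal priors the break-even score is 0; with a small prior on M, the log-likelihood score has to be correspondingly larger before the related model wins.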
The Hazard of Large Databases
• Define $p = P(d(s,t) > T \mid R)$
• This is the probability that two unrelated
sequences will match with score > T by chance
• Assume there are N strings in our database
• Assuming that they are independent of each other,
and all are unrelated to s, we have
$$P(\max_t d(s,t) > T) = 1 - (1-p)^N \approx 1 - e^{-Np}$$
The Hazard of Large Databases
[Figure: curves f(x, 0.001), f(x, 0.0001), f(x, 0.00001) and f(x, 0.000001) for x from 0 to 100000; vertical axis from 0 to 1]
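The effect of database size can be checked numerically. The sketch below evaluates the probability that at least one of N independent unrelated sequences exceeds the threshold, both exactly and with the exponential approximation; the sample values of N and p are arbitrary.

```python
import math


def p_false_hit(n: int, p: float) -> float:
    """Exact: probability that at least one of n unrelated sequences scores high."""
    return 1.0 - (1.0 - p) ** n


def p_false_hit_approx(n: int, p: float) -> float:
    """Approximation 1 - e^{-np}, accurate when p is small."""
    return 1.0 - math.exp(-n * p)


for p in (1e-3, 1e-4, 1e-5, 1e-6):
    for n in (20_000, 100_000):
        print(f"p={p:g} N={n}: exact={p_false_hit(n, p):.4f} "
              f"approx={p_false_hit_approx(n, p):.4f}")
```

Even a per-sequence false-positive probability that looks negligible leads to an almost certain chance hit once N is a large multiple of 1/p.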
Local Matching
• Question: Which local alignment query is expected to give
a higher score:
– To a short sequence
– To a long sequence?
• A local match can begin at any of the nm entries in the DP
matrix.
• The score is the optimum over all these starting points.
• If all starting points were independent we would need to
calculate the probability of attaining such a score in nm
trials.
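The dependence of the best local score on sequence length can be explored with a small experiment. This is a sketch using a standard local-alignment DP with an assumed toy scoring scheme (+5 match, -4 mismatch, -6 per gap position) and seeded random DNA strings; none of these parameters come from the slides.

```python
import random


def local_score(s: str, t: str, match=5, mismatch=-4, gap=-6) -> int:
    """Best local alignment score: the maximum over all cells of the DP matrix."""
    prev = [0] * (len(t) + 1)
    best = 0
    for a in s:
        cur = [0]
        for j, b in enumerate(t, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


def mean_best_score(length: int, trials: int = 20, seed: int = 0) -> float:
    """Average best local score between pairs of random, unrelated strings."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        s = "".join(rng.choice("ACGT") for _ in range(length))
        t = "".join(rng.choice("ACGT") for _ in range(length))
        total += local_score(s, t)
    return total / trials


for length in (25, 100):
    print(length, mean_best_score(length))
```

Appending letters to either sequence can only add candidate starting points, so the best local score never decreases as sequences grow; this is why a raw score has to be judged relative to n and m.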
Score Significance-Fasta
• How meaningful is a score?
• Calculate distribution of scores and related scores
• Under reasonable assumptions the scores for un-gapped
alignment behave according to the Extreme Value
Distribution.
Extreme Value Distribution (BLAST)
• We ask the following question: given a database
of size n and a sequence of size m,
what is the expected number of hits with score at
least S? This number is called an E-score
$$E(S) = K m n e^{-\lambda S}$$
• The number of such hits is approximately Poisson distributed with mean E(S).
• K corrects for the dependencies
• λ depends on the scoring matrix
• Doubling n, the size of the database (or m, the length of the sequence), doubles the expectation
• Doubling S, the score, causes E() to decrease exponentially
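The scaling behavior of the E-score can be verified directly from the formula. In the sketch below, the values of K and λ are placeholders chosen for illustration only; real values depend on the scoring matrix and are not given in the slides.

```python
import math


def e_value(score: float, m: int, n: int, K: float, lam: float) -> float:
    """Expected number of chance hits with score at least `score`."""
    return K * m * n * math.exp(-lam * score)


K, LAM = 0.1, 0.3          # placeholder parameters (assumed)
m, n, S = 250, 1_000_000, 60

base = e_value(S, m, n, K, LAM)
print("E(S)            =", base)
print("doubling n      =", e_value(S, m, 2 * n, K, LAM))
print("doubling S      =", e_value(2 * S, m, n, K, LAM))
```

Doubling n (or m) multiplies E by exactly 2, while doubling S multiplies it by e^{-λS}, which is why modest score increases translate into large gains in significance.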
Blast P-value
• Recall the Poisson distribution:
– Probability of finding no hits with a score ≥ S:
$$e^{-E}$$
– Therefore the probability of finding at least one hit with
score ≥ S is
$$1 - e^{-E}$$
– This is called the P-value.
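Converting an expectation E into a P-value is a one-line computation. The sketch below uses `math.expm1` so that very small E values do not lose precision; the sample E values are arbitrary.

```python
import math


def p_value(e: float) -> float:
    """Probability of at least one chance hit, 1 - e^{-E}, for a Poisson count
    with mean E."""
    return -math.expm1(-e)


for e in (1e-6, 0.01, 1.0, 10.0):
    print(f"E={e:g}  P={p_value(e):.6g}")
```

For small E the P-value is nearly equal to E itself, and for large E it saturates at 1, so the two measures only differ noticeably for hits that are not significant anyway.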
A Typical GenBank entry
Sequence Information
The Sequence
BLAST programs
• BLASTN - Nucleotide query searching a nucleotide
database.
• BLASTP - Protein query searching a protein database.
• BLASTX - Translated nucleotide query sequence (6
frames) searching a protein database.
• TBLASTN - Protein query searching a translated
nucleotide (6 frames) database.
• TBLASTX - Translated nucleotide query (6 frames)
searching a translated nucleotide (6 frames) database
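The "6 frames" used by the translated searches are the three reading frames of the sequence plus the three of its reverse complement. A minimal sketch of that translation step, assuming the standard genetic code and uppercase DNA without ambiguity codes (the function names are illustrative):

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def reverse_complement(dna: str) -> str:
    """Complement each base and reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str) -> dict[str, str]:
    """Translations of frames +1..+3 (forward) and -1..-3 (reverse complement)."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```

Each of the six translations can then be treated as a protein sequence, which is what lets a nucleotide query or database take part in a protein-level comparison.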
BLAST Search
BLAST Output
• List of hits
– Database accession codes, name, description.
– Score in bits (Usually >30 bits is significant )
– Expectation value E()
• For each hit
– A header including hit name, description, length
– Each hit may contain several HSPs
– Score and expectation value
– how many identical residues
– how many residues contributing positively to the score
– The local alignment itself
PSI-BLAST (Position Specific Iterated)
• BLAST provides a new automatic “profile like” search.
• Iterative procedure:
– Perform BLAST on database.
– Use Significant alignments to construct a “position specific” score
matrix.
– This matrix replaces the query sequence in the next round of
database searching.
• The program may be iterated until no new significant
alignments are found.
• Most commonly used search method today.
Multiple Alignment
• Proteins can be classified into families:
– Common structure.
– Common function.
– Common evolutionary origin.
• For a set of sequences belonging to some family
– Each pair has some differences
– But, there are some common motifs in almost all
sequences of the family
• A multiple alignment carries more information
than pairwise alignment
Protein Families
• Consider Zinc Fingers:
• All have the same function:
– Bind to DNA
• All have similar structure
• They constitute a Protein Family
• In a protein family some parts of the sequence (the
functional parts) are more conserved than others.
Definition
A multiple alignment of strings S1,S2,…,Sk is
a series of strings with blanks S’1,S’2,…,S’k
such that:
– |S’1|=|S’2|=…=|S’k|
– S’j is an extension of Sj obtained by insertion of
blanks.
Example
AGT..CTT.ACGCG
AGTAGCTT...GCG
..TAGC.T..GGCG
.CTA.C.TAACCCG
ACTA...TAAC...
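The two conditions of the definition can be checked mechanically. The sketch below treats "." as the blank symbol, as in the example, and recovers the original strings by deleting blanks; the function name is illustrative.

```python
def is_multiple_alignment(originals, aligned, blank="."):
    """Check that all rows have equal length and that each row is the
    corresponding original string with blanks inserted."""
    if not aligned or len(originals) != len(aligned):
        return False
    same_length = len({len(row) for row in aligned}) == 1
    restores = all(row.replace(blank, "") == seq
                   for seq, row in zip(originals, aligned))
    return same_length and restores


ALIGNED = [
    "AGT..CTT.ACGCG",
    "AGTAGCTT...GCG",
    "..TAGC.T..GGCG",
    ".CTA.C.TAACCCG",
    "ACTA...TAAC...",
]
ORIGINALS = [row.replace(".", "") for row in ALIGNED]

print(ORIGINALS)
print(is_multiple_alignment(ORIGINALS, ALIGNED))
```

Dropping a blank from any single row breaks the equal-length condition, and changing a letter breaks the extension condition.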
Sum of Pairs
• The sum of pairwise distances between all pairs of
sequences for some scoring matrix
$$S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$$
• Not only assumes that alignment of each column
is independent, but also each pair of sequences.
– Each sequence is scored as if descended from k-1
sequences instead of one common ancestor.
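A sum-of-pairs score is straightforward to compute column by column. The pairwise scores used below (+1 match, -1 mismatch, -2 letter against blank, 0 blank against blank) are assumptions for illustration; any scoring matrix s could be substituted.

```python
from itertools import combinations


def pair_score(a: str, b: str, blank: str = ".") -> int:
    """Assumed pairwise score s(a, b) for two symbols in the same column."""
    if a == blank and b == blank:
        return 0
    if a == blank or b == blank:
        return -2
    return 1 if a == b else -1


def column_score(column) -> int:
    """S(m_i): sum of s over all unordered pairs of rows in column i."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sum_of_pairs(aligned) -> int:
    """Total score: columns are scored independently and added up."""
    return sum(column_score(col) for col in zip(*aligned))


print(column_score("AAA"), column_score("AA."))
print(sum_of_pairs(["AC", "AC", "A."]))
```

A column with k rows contributes k(k-1)/2 pair terms, which makes visible how each sequence is compared against every other one rather than against a single common ancestor.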
Calculation of Multiple Alignment
• The optimal alignment can be calculated exactly
using k-dimensional dynamic programming.
– Space complexity O(n^k)
– Time complexity O(2^k n^k)
• A Heuristic Program called ClustalW quickly
finds a good multiple alignment.
Creating a PSSM
• After aligning the sequences we see that there are
some conserved regions.
• We use the multiple alignment of Blast results to
create a Position Specific Scoring Matrix.
• This matrix represents information from a whole
family; it is stricter in highly conserved
regions.
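One simple way to turn a multiple alignment into a position specific scoring matrix is to take per-column log-odds of letter frequencies against a background distribution. The sketch below is deliberately simplified: the uniform background, the flat pseudocount of 1, and the equal weighting of sequences are all assumptions, and it is not a description of how PSI-BLAST itself weights sequences.

```python
import math
from collections import Counter


def build_pssm(aligned, alphabet="ACGT", pseudocount=1.0, blank="."):
    """Return one dict per column mapping letter -> log-odds score."""
    background = 1.0 / len(alphabet)  # assumed uniform background
    pssm = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c != blank)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({a: math.log((counts[a] + pseudocount) / total / background)
                     for a in alphabet})
    return pssm


def score_with_pssm(pssm, seq: str) -> float:
    """Gapless score of seq placed against the first len(seq) columns."""
    return sum(col[a] for col, a in zip(pssm, seq))


ALIGNED = [
    "AGT..CTT.ACGCG",
    "AGTAGCTT...GCG",
    "..TAGC.T..GGCG",
    ".CTA.C.TAACCCG",
    "ACTA...TAAC...",
]
pssm = build_pssm(ALIGNED)
print({a: round(v, 2) for a, v in pssm[-1].items()})
```

In the last column, where four of the five rows carry G, the matrix rewards G and penalizes every other letter; in mixed columns the scores are closer to zero, which is the sense in which the matrix is stricter in conserved regions.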