Bioinformatics Tutorial I
BLAST and Sequence Alignment
What is BLAST?
• Online tool from the National Center for
Biotechnology Information (NCBI)
• “Google” for proteins and nucleotide sequences
What can you use BLAST for?
• Identify an unknown sequence
• Characterize the gene/protein of interest
– Function/activity (gene and protein)
– Structure or shape (new protein)
– Location or preferred location (protein)
– Stability (gene/transcript or protein)
• Origin of a gene or protein
Sequence alignment approaches
1. Global alignment
– Needleman and Wunsch, 1970
2. Local alignment (used in BLAST)
– Smith and Waterman, 1981
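To make the difference between the two approaches concrete, here is a minimal dynamic-programming sketch that computes either a global or a local alignment score. It is a simplification, not the published algorithms in full: it assumes a linear gap penalty, reuses the +1/-2 nucleotide scores from later in this tutorial, and the function name and `gap` value are illustrative choices.

```python
def dp_alignment_score(a, b, match=1, mismatch=-2, gap=-2, local=False):
    """Best alignment score of sequences a and b by dynamic programming.

    local=False: global style, both sequences are aligned end to end.
    local=True:  local style, only the best-matching substrings count.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment has to pay for gaps at the start of either sequence.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so scores never drop below 0.
                cell = max(cell, 0)
            score[i][j] = cell
            best = max(best, cell)
    return best if local else score[-1][-1]


# The shared "ACG" is found by the local mode even though the flanks differ.
print(dp_alignment_score("TTTACGTTTT", "GGACGGG", local=True))
print(dp_alignment_score("TTTACGTTTT", "GGACGGG", local=False))
```

The local mode rewards the short shared region, while the global mode is penalized for everything around it. Both fill a full table for every pair of sequences, which is why neither scales to a whole database.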
Global alignment
• One approach for searching a query sequence
is to align the entire sequence against all
sequences in a database
• This approach is very slow and hence
impractical
BLAST
• A much faster approach
• Divides your search query into short
sequences (“words”) and initially looks for
exact matches. Once found, these words are
then extended
• BLAST stands for Basic Local Alignment Search Tool
• Altschul, S.F. et al. Basic local alignment search
tool. J Mol Biol. 215(3):403-10(1990).
BLAST algorithm
• Query sequences are usually split into words
• Each word is then searched in database
• Word hits are extended in either direction to
generate alignment with score greater than
the threshold score
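The three steps above can be sketched as a toy "seed and extend" search. This is an illustration of the idea only, not NCBI's implementation: it searches a single subject sequence instead of a database, extends hits without gaps, and the function names, word size `w`, `drop` cutoff, and `min_score` threshold are all assumed values chosen for the example.

```python
def find_word_hits(query, subject, w=4):
    """Return (query_pos, subject_pos) pairs where a length-w word matches exactly."""
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    return [(i, j)
            for i in range(len(query) - w + 1)
            for j in index.get(query[i:i + w], [])]


def extend_one_side(query, subject, qi, sj, step, match, mismatch, drop):
    """Walk from (qi, sj) in direction step (+1 or -1) without gaps.

    Returns (best_gain, best_length): the highest running score reached and
    how many positions it covers. Stops once the running score falls more
    than `drop` below that best.
    """
    best_gain = best_len = run = length = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        run += match if query[qi] == subject[sj] else mismatch
        length += 1
        if run > best_gain:
            best_gain, best_len = run, length
        elif best_gain - run > drop:
            break
        qi += step
        sj += step
    return best_gain, best_len


def seed_and_extend(query, subject, w=4, match=1, mismatch=-2, drop=3, min_score=6):
    """Find word hits, extend each one, and keep segments scoring >= min_score.

    Each result is (score, query_start, subject_start, length).
    """
    found = set()
    for i, j in find_word_hits(query, subject, w):
        right_gain, right_len = extend_one_side(
            query, subject, i + w, j + w, 1, match, mismatch, drop)
        left_gain, left_len = extend_one_side(
            query, subject, i - 1, j - 1, -1, match, mismatch, drop)
        score = w * match + left_gain + right_gain
        if score >= min_score:
            # Several overlapping words extend to the same segment; the set removes repeats.
            found.add((score, i - left_len, j - left_len, left_len + w + right_len))
    return sorted(found, reverse=True)


print(seed_and_extend("TTACGTACGGA", "CCACGTACGCC"))
```

Building an index of words once and looking up each query word in it is what makes this faster than aligning the full query against every sequence.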
BLAST
“The central idea of the BLAST algorithm is to
confine attention to segment pairs that contain
a word pair of length w with a score of at least
T”
- Altschul et al., 1990
How does BLAST work?
Step 1: Get your sequence
• NCBI, UCSC, etc.
• Sequencing facility (unknown gene)
Step 2: Choose BLAST program
The different BLAST programs
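• blastn (nucleotide BLAST): nucleotide query vs. nucleotide database
• blastp (protein BLAST): protein query vs. protein database
• blastx (translated BLAST): translated nucleotide query vs. protein database
• tblastn (translated BLAST): protein query vs. translated nucleotide database
• tblastx (translated BLAST): translated nucleotide query vs. translated nucleotide database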
Simplified visualization
Why translate in 6 reading frames?
• A double-stranded DNA sequence can be read in six
reading frames (three on each strand), so it can code
for up to six different proteins
5’ CAT CAA
5’ ATC AAC
5’ TCA ACT
5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
5’ GTG GGT
5’ TGG GTA
5’ GGG TAG
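The six frames shown above can be reproduced with a short sketch: three codon offsets on the given strand and three on its reverse complement. The function names and the "+1" to "-3" frame labels are illustrative choices, and the codons are left untranslated (turning them into amino acids would additionally need a codon table).

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, written 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def codons(dna, offset):
    """Split dna into complete codons, starting at offset 0, 1 or 2."""
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]


def six_frames(dna):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to lists of codons."""
    opposite = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = codons(dna, offset)
        frames[f"-{offset + 1}"] = codons(opposite, offset)
    return frames


sequence = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
for label, frame in six_frames(sequence).items():
    # Print only the first two codons, as on the slide.
    print(label, " ".join(frame[:2]))
```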
Step 3: Search parameters
Step 4: Search results
Important: Tabular output
Score
• Sequence similarity score is calculated based
on the pair-wise alignment quality
• Alignment score is the sum of scores for each
position
Score
• Nucleotides
– Match: +1
– Mismatch: -2
• Peptides
– Each amino acid substitution is given a score
Example
AACGTTTCCAGTCCAAATAGCTAGGC
===--===   =-===-==-======
AACCGTTC---TACAATTACCTAGGC
Hits(+1): 18
Misses (-2): 5
Gaps (existence -2, extension -1): 1 gap of length 3
Score = 18 * 1 + 5 * (-2) - 2 - 2 = 4
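The arithmetic in this example can be checked with a small sketch that scores an already-aligned pair of sequences. One assumption is needed to match the "- 2 - 2" gap term: the first position of a gap is charged the existence penalty and each further position the extension penalty, so a gap of length 3 costs -2 - 1 - 1 = -4. Other tools may charge gaps differently, and the function name is hypothetical.

```python
def score_alignment(query, subject, match=1, mismatch=-2, gap_open=-2, gap_extend=-1):
    """Score two equal-length aligned strings in which '-' marks a gap position.

    Returns (score, hits, misses).
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    score = hits = misses = 0
    in_gap = False
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            # Assumption: the first gap position pays gap_open, later ones gap_extend.
            score += gap_extend if in_gap else gap_open
            in_gap = True
            continue
        in_gap = False
        if q == s:
            hits += 1
            score += match
        else:
            misses += 1
            score += mismatch
    return score, hits, misses


print(score_alignment("AACGTTTCCAGTCCAAATAGCTAGGC",
                      "AACCGTTC---TACAATTACCTAGGC"))
```

With 18 hits, 5 misses, and one gap of length 3, the expression 18 - 10 - 2 - 2 evaluates to 4 under this convention.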
David Fristrom, Introduction to BLAST
E-value
• E-value (expectation value): the number of
different alignments with a similar or better
score that would be expected to occur by chance
alone in a search of the database.
• Low E-value – sequences may be homologous
• Statistical significance depends on:
– Length of the query sequence
– Size of the sequence database
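The dependence on query length and database size can be seen in the Karlin-Altschul relation commonly used to describe BLAST statistics, E = K * m * n * exp(-lambda * S), where m is the query length, n the database size, and S the raw score. The sketch below is only illustrative: K and lambda depend on the scoring system, and the values used here are made up for the demonstration rather than taken from any real search.

```python
import math


def expect_value(raw_score, query_len, db_len, lam, k):
    """E = K * m * n * exp(-lambda * S) for raw score S, query length m, database size n."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Hypothetical parameters, for illustration only.
LAM, K = 1.0, 0.5

small_db = expect_value(raw_score=20, query_len=100, db_len=1_000_000, lam=LAM, k=K)
large_db = expect_value(raw_score=20, query_len=100, db_len=2_000_000, lam=LAM, k=K)
better_hit = expect_value(raw_score=30, query_len=100, db_len=1_000_000, lam=LAM, k=K)

print(small_db, large_db, better_hit)
```

Doubling the database size doubles the E-value for the same score, while a higher score lowers it exponentially. This is why the same alignment can look more or less significant depending on which database was searched.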
Graphical output
Taxonomy Results
Graphical output
References
• Figures and text adapted from the following sources:
– David Fristrom, Introduction to BLAST
– Jonathan Pevsner, BLAST: Basic local alignment search tool
– Joanne Fox, BLAST: Finding function by sequence similarity