BLAST
Objectives
• Gain familiarity with sequence searches and
comparisons via web-based BLAST
• To understand the BLAST algorithm
• To understand the principles of BLAST scoring and
BLAST statistics
• To understand scoring matrices
• To become aware of other BLAST services and
applications
BLAST
• Basic Local Alignment Search Tool
• Developed in 1990 and extended in 1997 with gapped BLAST
and PSI-BLAST (Altschul et al.)
• A heuristic method for performing local
alignments by searching for high-scoring
segment pairs (HSPs)
• First to use statistics to predict significance
of initial matches – saves on false leads
• Offers both sensitivity and speed
BLAST
• Looks for clusters of nearby or locally dense
“similar or homologous” words/k-tuples
• Uses look-up tables to shorten the search time
• Uses a larger “word size” than FASTA to accelerate
the search process
• Performs local alignment; when similarity spans the
whole sequence, the local alignment can cover its full length
• Fastest and most frequently used sequence
alignment tool – de facto standard
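The word look-up idea above can be sketched in a few lines of Python. This is a simplified illustration, not the actual BLAST implementation: it indexes only exact words, whereas BLAST also considers similar ("neighborhood") words for proteins and then extends each seed into an HSP. The function names and the example sequences are made up for this sketch.

```python
from collections import defaultdict


def build_word_index(sequence, word_size):
    """Map every word of length word_size to the positions where it starts."""
    index = defaultdict(list)
    for start in range(len(sequence) - word_size + 1):
        index[sequence[start:start + word_size]].append(start)
    return index


def find_seed_hits(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where an identical word occurs."""
    index = build_word_index(query, word_size)
    hits = []
    for s_pos in range(len(subject) - word_size + 1):
        word = subject[s_pos:s_pos + word_size]
        # One dictionary look-up replaces a scan of the whole query.
        for q_pos in index.get(word, []):
            hits.append((q_pos, s_pos))
    return hits


# Hypothetical toy sequences: the shared region "KIQIY" yields three
# overlapping word hits on the same diagonal (subject_pos - query_pos).
query = "KIQIYTGT"
subject = "AAKIQIYAA"
seeds = find_seed_hits(query, subject, word_size=3)
for q_pos, s_pos in seeds:
    print(q_pos, s_pos, "diagonal", s_pos - q_pos)
```

Several hits close together on one diagonal are the "locally dense" clusters the slide refers to; a larger word size gives fewer hits to follow up, which is faster but less sensitive.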
BLAST
• NCBI BLAST
http://www.ncbi.nih.gov/BLAST/
• European Bioinformatics Institute
• NCBI BLAST
http://www.ebi.ac.uk/Tools/sss/ncbiblast/
• WUBLAST
http://www.ebi.ac.uk/Tools/sss/wublast/
• Rosaceae BLAST (www.rosaceae.org)
• Legume BLAST (http://lis.comparativelegumes.org; www.gabcsfl.org)
• Grasses BLAST (www.gramene.org)
Types of BLAST
• BLASTP – protein query against protein DB
• BLASTN – DNA/RNA query against DNA DB
• BLASTX – 6 frame translation of DNA query against
protein DB
• TBLASTN – protein query against 6 frame translation of
DNA DB
• TBLASTX – 6 frame translation of DNA query against 6
frame translation of DNA DB
• BLAST2SEQ – for performing pairwise alignments for 2
chosen sequences
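To make "6 frame translation" concrete, here is a minimal Python sketch of what BLASTX, TBLASTN and TBLASTX do conceptually before comparing at the protein level. It assumes the standard genetic code and plain A/C/G/T input; the function names and the example sequence are illustrative, not part of BLAST.

```python
# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon; unknown codons (e.g. containing N) become X."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


def six_frame_translation(dna):
    """Return the three forward and three reverse-strand translations."""
    dna = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


# Hypothetical 9-base example; "*" marks a stop codon.
frames = six_frame_translation("ATGGCCTAA")
for name, protein in sorted(frames.items()):
    print(name, protein)
```

Because every DNA sequence is expanded into six protein sequences, the translated searches do roughly six times the work per sequence, and TBLASTX expands both the query and the database.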
Types of BLAST
• PSI-BLAST - protein “profile” query against protein DB
• PHI-BLAST – protein pattern against protein DB
• RPS-BLAST – Conserved Domain Detection
• MEGABLAST – for comparison of large sets of long DNA
sequences
• Primer BLAST – uses Primer3 to design PCR primers
• Genomic BLAST – for alignments against completed
genomes
• VecScreen – for detecting cloning vector contamination
in sequenced data.
See last week's handout for the rest of them.
Types of Comparison
• What program will best suit your query and desired
output?
• DNA sequences contain less information with which to
deduce homology than the proteins they encode when
compared using simple nucleotide substitution scores:
an alphabet of 4 nt versus 20 aa!
• Protein comparisons give more meaningful results
• Moderately similar nt sequences often encode highly
similar protein sequences
NCBI WEB BLAST
Step 1: Select a BLAST
program
In this example we will
choose nucleotide blast
Basic BLAST Options
Using Nucleotide BLAST
Step 2:
Type in your sequence in FASTA
format or type in a GI or accession
number or upload a file
>my protein MT08976
KIQIYTGTCANGTCKIQIYTGTCANGTCKIQIY
GTCANGTCKIQIYTGTCANGTC
Step 3:
Give your search a name/title
Step 4:
Choose a database to search
MEGABLAST is specifically
designed to efficiently find
long alignments between
very similar sequences
Note you can also restrict the range of your
query to be searched
Step 4 (continued):
You can also restrict the database selection by organism,
e.g. entering Viridiplantae [ORGN] when searching nr
restricts the search to plant proteins only.
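Before pasting a query into the form, it can help to check that it really is valid FASTA (one ">" header line followed by sequence lines). The following is a minimal Python sketch of such a check; the parser is illustrative and is not part of BLAST, and the input is the example record from Step 2.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before a '>' header line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>my protein MT08976
KIQIYTGTCANGTCKIQIYTGTCANGTCKIQIY
GTCANGTCKIQIYTGTCANGTC
"""
records = parse_fasta(example)
header, sequence = records[0]
print(header, len(sequence))
```

The sequence lines are joined into one string, so line breaks inside a record do not matter to the search.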
Basic BLAST Options
Using Nucleotide BLAST
Step 5:
Choose a BLAST program
Megablast is very fast. It is intended for
comparing a query to closely related sequences
and works best if the target percent identity
is 95% or more.
Discontiguous megablast uses an initial
seed that ignores some bases (allowing
mismatches) and is intended for
cross-species comparisons.
BlastN is slow, but allows a word-size down
to seven bases.
Basic BLAST Options
Using Amino Acid BLAST
Steps 2 – 4 are the same as in nucleotide BLAST
Step 5:
Choose a BLAST program
BlastP simply compares a protein query
to a protein database. It is used for finding
similar sequences in protein databases. It
is designed to find local regions of
similarity but when sequence similarity
spans the whole sequence, blastp will also
report a global alignment.
PSI-BLAST is the most sensitive BLAST
program, making it useful for finding very
distantly related proteins or new members
of a protein family.
PHI-BLAST performs the search but limits
alignments to those that match a pattern
in the query.
BLASTP RESULTS for NP_001031578.1
BLAST OUTPUT by Column
• The sequence accession name: this takes you to the database entry
that contains the sequence.
• Description of the match organism and any assigned/putative
function
• The alignment score. Higher scoring hits are at the top
• Query coverage is how much of your sequence aligned to the match
• The expectation value (E value), which estimates statistical
significance: the number of matches scoring at least this well that
you would expect to find by chance alone in a database of this size.
The lower the E value, the more significant the match. The E value
is the most important measure of statistical significance.
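The relationship between the alignment score and the E value can be sketched with the standard relation E = m × n × 2^(-S'), where S' is the bit score, m the query length and n the total database length. The Python below is a rough illustration under that assumption: the lengths are made-up numbers, and real BLAST applies corrections (effective lengths) that are ignored here.

```python
import math


def expect_value(bit_score, query_length, database_length):
    """E = m * n * 2^(-S'): chance hits expected at this bit score or better."""
    return query_length * database_length * 2.0 ** (-bit_score)


def bit_score_for_expect(e_value, query_length, database_length):
    """Invert the relation: bit score needed to reach a given E value."""
    return math.log2(query_length * database_length / e_value)


# Hypothetical search: a 300-residue query against 100 million residues.
query_length = 300
database_length = 100_000_000

for bit_score in (30, 40, 50, 60):
    e_value = expect_value(bit_score, query_length, database_length)
    print(f"bit score {bit_score}: E = {e_value:.2e}")

needed = bit_score_for_expect(1e-4, query_length, database_length)
print(f"bit score needed for E = 1e-4: {needed:.1f}")
```

Two things follow from the formula: each additional bit of score halves the E value, and the same score becomes less significant as the database grows.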
Interpreting Significance of BLAST Results
• In general:
DNA to DNA alignment
For nucleotide sequences at least 100 bp long, if at least 70% of
your nucleotides are identical to your match sequence, then the two
can be considered homologous
AA to AA alignment
For amino acid sequences at least 100 aa long, if at least 25% of
your aa are identical to your match sequence, then the two can be
considered homologous
Below these values, the alignments are considered to be in the twilight zone!
However, how do you tell the difference between 60 matched residues spread
over a 100-residue segment and 120 matches spread over a 200-residue
segment? Both are 60% identical. The longer alignment is probably the more
meaningful, but the percent identity says nothing about this!
Interpreting Significance of BLAST Results
• So we use E values. In theory, any match with an E value
below 1 should be trustworthy. In practice this is NOT true,
because BLAST uses an approximate formula for computing E
values and strongly underestimates them.
• Rule of thumb: look for E values below 1e-4 (0.0001). So if
you want to be confident of homology, your E value must be
lower than 0.0001
• Caveat – if you are doing a BLAST search with thousands of
query sequences, you need to take into account the number of
queries and lower the E value cutoff further.
Interpreting Significance of BLAST Results
Example: 15,000 EST query sequences
• A 1e-3 E-value cutoff means that you should expect
one false positive in 1000 searches.
• Thus with 15,000 searches, we should expect 15 false
positives with a cutoff of 1e-3.
• To reduce the chances of identifying a false positive,
set the E-value cutoff lower.
• For 15,000 searches, an E-value cutoff of 1e-5 will
mean that you should expect 0.15 false positives.
Most of the time we make it even lower: < 1e-6
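The arithmetic on this slide is simply the E-value cutoff multiplied by the number of searches. A small Python sketch reproduces it, using the slide's own numbers; the function names are made up for illustration, and the second helper just turns the calculation around to find a cutoff for a chosen false-positive budget.

```python
def expected_false_positives(e_value_cutoff, number_of_queries):
    """Chance hits expected across all searches at a given E-value cutoff."""
    return e_value_cutoff * number_of_queries


def cutoff_for_target(target_false_positives, number_of_queries):
    """E-value cutoff that keeps the expected false positives at the target."""
    return target_false_positives / number_of_queries


number_of_queries = 15_000  # the EST example from the slide

for cutoff in (1e-3, 1e-5, 1e-6):
    expected = expected_false_positives(cutoff, number_of_queries)
    print(f"cutoff {cutoff:g}: about {expected:.3g} false positives expected")

# Cutoff needed to expect at most 0.01 false positives over all queries.
strict_cutoff = cutoff_for_target(0.01, number_of_queries)
print(f"cutoff for 0.01 expected false positives: {strict_cutoff:.1e}")
```

The more queries you run, the lower the per-search cutoff has to be to keep the total number of chance hits under control.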
Sequence Analysis I
Database Searching Questions
• What database should I search?
• What kind of sequences should I search with?
• What E-value is significant?
• What can I reliably infer about the function of my sequence based
on homology?
Databases
• Bigger databases have more sequences.
• Bigger databases are also more redundant, which can
skew the statistics.
• Bigger databases are often more poorly annotated (homology
with an "unidentified sequence" doesn't really tell you
much)
• Bigger databases take lots of time to search.
Databases Cont.
• Smaller databases (like Swiss-Prot) are often better
curated and annotated.
• Smaller databases are much less redundant.
• Smaller databases can be limited to phylogenetically
relevant sequences (e.g. all plant)
• Smaller databases are much faster to search.