Biology and computers

Local alignment, BLAST and
Psi-BLAST
October 25, 2012
Local alignment
Quiz 2
Learning objectives-Learn the basics of
BLAST and Psi-BLAST
Workshop-Use BLAST2 to determine local
sequence similarities.
Homework #6 due Nov 1
Chapter 5, Problem 8
 Chapter 6, Problems 1 and 4.

Local Alignment
1. Initialize the i-1 row and j-1 column with
zeros.
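2. Fill in each cell with the maximum of zero and the scores
reached from the diagonal, upper, and left neighboring cells.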
3. For traceback, start with highest value and
traceback to zero.
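The local alignment procedure can be sketched in a few lines of Python. This is a minimal Smith-Waterman illustration, not a production aligner: the match, mismatch, and linear gap scores are assumed values chosen for readability, and the two example sequences are made up around the RCPHH segment of p53.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned a, aligned b) for a local alignment.

    Scoring values are illustrative assumptions, not BLOSUM scores.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # Row 0 and column 0 stay at zero (initialization step).
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap
            left = H[i][j - 1] + gap
            # The zero floor is what makes the alignment local.
            H[i][j] = max(0, diag, up, left)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback: start at the highest value and stop at zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = smith_waterman("AAARCPHHTTT", "GGGRCPHHCCC")
print(score, top, bottom)
```

Only the shared RCPHH segment is returned; the unrelated flanking residues are dropped because the score would fall to zero there.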
Local Alignment (continued)
Which software program should
one use for local alignment?
Most researchers use methods for
determining local similarities:
 Smith-Waterman (gold standard)
 FASTA
 BLAST
FASTA and BLAST do not find every possible alignment
of query with database sequence. These
are used because they run faster than S-W.
BLAST
Three phases:
1) List of high scoring words
2) Scan the sequence database
3) Extend hits
The threshold and word size
The program declares a hit if the word taken from
the query sequence has a score >= T when a
scoring matrix is used.
This allows the word size (W) to be kept high (for
speed) without sacrificing sensitivity.
If T is increased, the number of background hits is
reduced and the program will run faster.
Phase 1: Compile a list of high-scoring words at or above threshold T.
Query sequence is human p53: . . . RCPHHERCSD. . .
Words derived from query sequence: RCP, CPH, PHH, HHE, …
Threshold T (T = 17):
Word | Scores from BLOSUM scoring matrix | Total score
RCP  | 5+9+7 | 21
KCP  | 2+9+7 | 18
QCP  | 1+9+7 | 17
---- threshold cutoff (T = 17) ----
ECP  | 0+9+7 | 16
...
Note: The line is located at the threshold cutoff.
Word size is 3.
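Phase 1 can be illustrated with the numbers from the table. The sketch below uses only the substitution scores shown on the slide (it is not a full BLOSUM matrix), and the function names are hypothetical helpers introduced for this example.

```python
# Substitution scores copied from the slide's table (query residue, candidate residue).
# Only the pairs shown there are included; a real search uses the whole matrix.
SLIDE_SCORES = {
    ("R", "R"): 5, ("R", "K"): 2, ("R", "Q"): 1, ("R", "E"): 0,
    ("C", "C"): 9,
    ("P", "P"): 7,
}


def query_words(query, w=3):
    """Split the query into overlapping words of size w."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def word_score(query_word, candidate, scores=SLIDE_SCORES):
    """Sum the position-by-position substitution scores."""
    return sum(scores[(q, c)] for q, c in zip(query_word, candidate))


def words_above_threshold(query_word, candidates, t=17):
    """Keep the candidate words scoring at or above threshold T."""
    return [c for c in candidates if word_score(query_word, c) >= t]


words = query_words("RCPHHERCSD")
hits = words_above_threshold("RCP", ["RCP", "KCP", "QCP", "ECP"])
print(words[:4])
print(hits)
```

With T = 17, ECP (score 16) falls just below the cutoff and is dropped from the word list.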
Phase 2: Scan the database for short segments that
match the list of acceptable words/scores above
or equal to threshold T. These are potential hits.
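A rough sketch of the scanning phase, assuming the word list has already been compiled. The tiny in-memory "database" is hypothetical: one entry is the hamster p53 fragment used in these slides and the other is a made-up unrelated sequence. Real BLAST implementations use indexed lookups rather than this linear scan.

```python
def scan_database(word_list, database, w=3):
    """Report (sequence name, position, word) for every exact word match."""
    words = set(word_list)  # set membership keeps each lookup cheap
    hits = []
    for name, seq in database.items():
        for pos in range(len(seq) - w + 1):
            if seq[pos:pos + w] in words:
                hits.append((name, pos, seq[pos:pos + w]))
    return hits


# Hypothetical two-sequence database for illustration only.
database = {
    "hamster_p53_fragment": "EVVRRCPHHERSSE",
    "unrelated_example": "MKTAYIAKQR",
}
potential_hits = scan_database(["RCP", "KCP", "QCP"], database)
print(potential_hits)
```

Each reported position is only a potential hit; it becomes a seed for the extension phase.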
Phase 3: Extend the potential hits to the left and to the right and
terminate when the tabulated score drops below a cutoff score.
Query  EVVRRCPHHERCSD
       EVVRRCPHHER S+
Sbjct  EVVRRCPHHERSSE  (Ch. hamster p53 O09185)
If the sequence alignment is extended far enough and its score
is higher than the cutoff score, the query/sbjct segment pair
is called a hit (a high-scoring segment pair, HSP).
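The extension phase can be sketched as an ungapped extension that stops once the running score falls too far below the best score seen so far. This is a simplified illustration: the match and mismatch scores and the drop-off value `x_drop` are assumed, and a real search would score residue pairs with a substitution matrix such as BLOSUM.

```python
def extend_hit(query, subject, q_pos, s_pos, w=3, match=5, mismatch=-4, x_drop=10):
    """Extend a word hit left and right without gaps.

    Returns (best score, query segment, subject segment).
    Scores and x_drop are illustrative assumptions.
    """
    def pair_score(a, b):
        return match if a == b else mismatch

    # Score of the seed word itself.
    best = sum(pair_score(query[q_pos + k], subject[s_pos + k]) for k in range(w))
    q_start, q_end = q_pos, q_pos + w  # half-open interval on the query

    # Extend to the right.
    running, k = best, 0
    while q_pos + w + k < len(query) and s_pos + w + k < len(subject):
        running += pair_score(query[q_pos + w + k], subject[s_pos + w + k])
        if running > best:
            best, q_end = running, q_pos + w + k + 1
        elif best - running > x_drop:
            break  # score dropped too far below the best; stop extending
        k += 1

    # Extend to the left, starting from the best score so far.
    running, k = best, 1
    while q_pos - k >= 0 and s_pos - k >= 0:
        running += pair_score(query[q_pos - k], subject[s_pos - k])
        if running > best:
            best, q_start = running, q_pos - k
        elif best - running > x_drop:
            break
        k += 1

    s_start = s_pos - (q_pos - q_start)
    length = q_end - q_start
    return best, query[q_start:q_end], subject[s_start:s_start + length]


result = extend_hit("EVVRRCPHHERCSD", "EVVRRCPHHERSSE", q_pos=4, s_pos=4)
print(result)
```

With a tighter drop-off (for example `x_drop=3`), the extension stops at the C/S mismatch and the reported segment is shorter.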
The relationship between extension length and cumulative score
The steps to a Gapped BLAST search.
What are the different BLAST
programs?
blastp
 compares an amino acid query sequence against a protein sequence
database
blastn
 compares a nucleotide query sequence against a nucleotide
sequence database
blastx
 compares a nucleotide query sequence translated in all reading
frames against a protein sequence database
tblastn
 compares a protein query sequence against a nucleotide sequence
database dynamically translated in all reading frames
tblastx
 compares the six-frame translations of a nucleotide query sequence
against the six-frame translations of a nucleotide sequence
database. Note that the tblastx program cannot be used with the
nr database on the BLAST Web page.
What are the different BLAST
programs? (continued)
psi-blast
 Compares a protein sequence to a protein database. Performs the
comparison in an iterative fashion in order to detect homologs that
are evolutionarily distant.
blast2
 Compares two protein or two nucleotide sequences.
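The five core programs differ only in the type of query and database they accept, which can be summarized as a lookup table. The labels and the `choose_program` helper below are hypothetical names for this sketch; psi-blast and blast2 are left out because they differ in procedure rather than in sequence type.

```python
# (query type, database type) -> program, following the descriptions above.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("translated nucleotide", "protein"): "blastx",
    ("protein", "translated nucleotide"): "tblastn",
    ("translated nucleotide", "translated nucleotide"): "tblastx",
}


def choose_program(query_type, database_type):
    """Look up the program for a query/database combination."""
    return BLAST_PROGRAMS[(query_type, database_type)]


print(choose_program("protein", "translated nucleotide"))
```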
The E value
(false positive expectation value)
The Expect value (E) is a parameter that describes the number
of “hits” one can "expect" to see just by chance when
searching a database of a particular size. It decreases
exponentially as the Similarity Score (S) increases (inverse
relationship). The higher the Similarity Score, the lower
the E value. Essentially, the E value describes the random
background noise that exists for matches between two
sequences. The E value is used as a convenient way to
create a “significance” threshold for reporting results.
When the E value threshold is raised above the default before
a sequence search, a longer list that includes more low-scoring
hits can be reported. An E value of 1 assigned to a
hit can be interpreted as meaning that in a database of the
current size you might expect to see 1 match with a similar
score simply by chance.
E value (Karlin-Altschul statistics)
$$E = K \cdot m \cdot n \cdot e^{-\lambda S}$$
where K is a scaling factor (constant), m is the
length of the query sequence, n is the total length of the
database sequences, λ is the decay constant, and S is the
similarity score.
If S increases, E decreases exponentially.
If the decay constant increases, E decreases
exponentially.
If m•n increases, the “search space” increases. Then
there is a greater chance for a random “hit” and E
increases. A larger database will increase E.
However, a longer query sequence often results in a
lower E value. Why?
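The behavior of the formula is easy to check numerically. In the sketch below, the default values of K and λ are assumed, illustrative constants (they depend on the scoring system in a real search), and the query length, database sizes, and scores are made up.

```python
import math


def expect_value(score, m, n, k=0.134, lam=0.318):
    """E = K * m * n * exp(-lambda * S).

    k and lam are assumed example constants, not values from the slides.
    """
    return k * m * n * math.exp(-lam * score)


small_db = expect_value(score=50, m=300, n=1_000_000)
large_db = expect_value(score=50, m=300, n=2_000_000)
higher_score = expect_value(score=60, m=300, n=1_000_000)

print(large_db / small_db)      # E scales linearly with database size
print(higher_score < small_db)  # a higher score gives a lower E
```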
Thought problem
A homolog to a query sequence resides in two
databases. One is the UniProt database and the
other is the PDB database. After performing
BLAST search against the UniProt database you
obtain an E value of 1. After performing the
BLAST search against the PDB database you
obtain an E value of 0.0625. What is the ratio of
the sizes of the two databases?
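One way to set up the thought problem, assuming the same query, the same alignment score S, and the same K and λ in both searches, so that only the database length n differs:

$$\frac{E_{\text{UniProt}}}{E_{\text{PDB}}} = \frac{K \cdot m \cdot n_{\text{UniProt}} \cdot e^{-\lambda S}}{K \cdot m \cdot n_{\text{PDB}} \cdot e^{-\lambda S}} = \frac{n_{\text{UniProt}}}{n_{\text{PDB}}} = \frac{1}{0.0625} = 16$$

Under those assumptions the UniProt database would be about 16 times the size of the PDB database.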
Using BLAST to get quick answers
to bioinformatics problems
Task | BLAST method | Trad. Method
Predict protein function (1) | Perform blastp on PIR or Swiss-Prot database | Perform wet-lab experiment
Predict protein function (2) | Perform tblastn on NR database | Perform wet-lab experiment
Predict protein structure | Perform blastp against PDB | Structure prediction software, x-ray crystal., NMR
Using BLAST to get quick answers
to bioinformatics problems (cont.)
Task | BLAST method | Trad. Method
Locate genes in a genome | Divide genome into 2-5 kb sequences. Perform blastx against NR protein database | Run gene prediction software. Perform microarray analysis or RNAs
Find distantly related proteins | Perform psi-blast | No traditional method
Identify DNA sequence | Perform blastn | Screen genomic DNA library
Filtering Repetitive Sequences
Over 50% of genomic DNA is repetitive.
This is due to:
 retrotransposons
 ALU region
 microsatellites
 centromeric sequences, telomeric sequences
 5’ Untranslated Region of ESTs
Example of EST with simple low complexity region:
T27311
GGGTGCAGGAATTCGGCACGAGTCTCTCTCTCTCTCTCTCTCTCTCTC
TCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTC
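A simple way to see what masking does to a sequence like this one is to replace long tandem repeats with N. The sketch below is a regular-expression stand-in, not the filter that BLAST itself applies; the unit length (1 to 3 bases) and the minimum of six copies are assumed thresholds.

```python
import re

# A unit of 1-3 bases followed by at least five more copies of itself.
TANDEM_REPEAT = re.compile(r"([ACGT]{1,3})\1{5,}")


def mask_simple_repeats(seq):
    """Replace each tandem-repeat run with N's of the same length."""
    return TANDEM_REPEAT.sub(lambda m: "N" * len(m.group(0)), seq)


prefix = "GGGTGCAGGAATTCGGCACGAG"   # start of EST T27311
example = prefix + "TC" * 10        # shortened version of its TC repeat
masked = mask_simple_repeats(example)
print(masked)
```

The non-repetitive start of the EST is left untouched, so it can still seed word hits, while the TC run no longer matches every other TC repeat in the database.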
Filtering Repetitive Sequences
and Masking
Options available for user.
PSI-BLAST
PSI: position-specific iterative
A position-specific scoring matrix (PSSM) is
constructed automatically from the multiple HSPs of the
initial BLAST search. The normal E value threshold is
used.
The PSSM then serves as the new scoring matrix for
a second BLAST search. A low E value threshold
is used (E = 0.001).
Results:
1) obtains distantly related sequences
2) finds the important residues that provide
function or structure.
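The idea of a PSSM can be sketched by counting residues column by column in a set of aligned, ungapped segments and converting the counts to log-odds scores. This is a deliberately simplified illustration: it assumes a uniform background frequency and a flat pseudocount, whereas PSI-BLAST itself weights sequences and estimates pseudocounts in a more elaborate way. The three input segments are made up.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(segments, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    Assumes equal-length, ungapped segments and a uniform background.
    """
    length = len(segments[0])
    if any(len(s) != length for s in segments):
        raise ValueError("all segments must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(segments) + pseudocount * len(AMINO_ACIDS)

    pssm = []
    for col in range(length):
        counts = Counter(seg[col] for seg in segments)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def score_with_pssm(pssm, segment):
    """Sum the column-specific scores for a candidate segment."""
    return sum(pssm[i][aa] for i, aa in enumerate(segment))


pssm = build_pssm(["RCPHH", "RCPHH", "KCPHH"])
print(score_with_pssm(pssm, "RCPHH"), score_with_pssm(pssm, "KCPHH"))
```

Positions that are conserved across the aligned hits receive high scores for the conserved residue, which is how the profile points to residues likely to matter for function or structure.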
Workshop
Is the American crocodile (Crocodylus
acutus) more closely related to the sea turtle
(Cheloniidae) or to the turkey (Meleagris
gallopavo)? Choose two genes from each
species and compare using blast2. Record
bit score, E-value, percent nucleotide
identities, percent similarities and lengths of
coverage query/sbjct sequences in your
answer.