Pairwise alignments 2 - Bioinformatics Centre

advertisement
Pairwise alignment 2:
Scoring matrices and gaps BLAST,
BLAT and FASTA
Introduction to bioinformatics
Stinus Lindgreen
stinus@binf.ku.dk
Bioinformatics Centre, University of Copenhagen
1
Outline of the lecture








Scoring an alignment: BLOSUM and PAM
Scoring matrices for nucleotides
Treatment of gaps
The BLAST algorithm
Position Specific Scoring Matrices
PSI-BLAST and other variants
BLAT
FASTA
2
Definition of scoring system
To build an alignment we need a scoring system
 Aligning two different residues indicates a
substitution
 How likely is it to replace A with B?
 A likely alignment should receive a large score
Create a substitution matrix
 A 20×20 symmetrical table (matrix)
 Look up the score for aligning two amino acids
 The diagonal represents conservation
3
Considerations
Conservative substitutions
 Prefer changes that respect structure and function
 E.g. hydrophobicity, size, chemical properties etc.
Frequencies
 How often do specific residues occur?
 Scores should weigh the rare ones higher
Evolution
 How distant are the sequences?
 Ancient divergence  more changes
4
The creation
Approach 1:
 Make an evolutionary model
 Use closely related sequences
 Extrapolate to greater distances
 Done in the PAM family
Approach 2:
 Look at related sequences
 Observe actual mutations in motifs
 Use sets of different overall identity
 Done in the BLOSUM family
5
The PAM250 matrix[1]
[1]
M.O. Dayhoff: Survey of new data and computer methods of analysis (1978),
Atlas of protein sequence and structure, 5:3
6
The PAM family
Defining characteristics
 PAM: Point (or percent) accepted mutations
 One PAM unit: 1 change per 100 residues (~ 10M ys)
 Not the probability that a residue A is something else
 What would PAM250 mean in that case?
 But: The number of expected changes overall
 Including changes A  B  A
7
Construction of PAM matrices
The general idea:
 Assume evolution is independent
 Describe ”one step of evolution”
 The next step will follow
PAMx: Scoring matrix for x evolutionary steps
 Very similar sequences: Low x
 Very divergent sequences: Large x
What is an evolutionary step?
8
Evolutionary model
Assume independence

between positions

over time

P(AB)=P(BA)
Consequences:

How a sequence looks tomorrow depends





ONLY on how it looks today
Not on how it looked yesterday
Known as a Markov chain
The context does not matter
Evolution ”reversible”
9
Building PAM1
1.
2.
3.
4.
Find set of very similar sequences (1% ID)
Make global alignment (71 groups)
Count the substitutions (1572 changes)
Calculate weights
10
From counts to scores
The simplified version
 Count mutation frequencies


Normalize by mutability of respective amino acid



Makes comparisons possible
Multiply to wanted evolutionary distance x


The respective probabilities
Still probabilities
Calculate log-odds score for (AB+BA)/2
You have the PAMx matrix!
11
Why log-odds?
Si , j  log




qi , j
pi  p j
Summation instead of multiplication
 log(A·B)=log(A)+log(B)
S>0: More often than per chance
S=0: Number expected by chance
S<0: Less often than per chance
12
The PAM1 matrix
Highly identical sequences  1 evolutionary step
 PAM1 describes likelihood of changes
 Diagonal close to 1, off-diagonal close to 0
 PAM2=PAM1×PAM1
 PAM3=PAM1×PAM2 …
 Markov chain property (independence)
 Greater divergence through matrix multiplication
 Large PAM: More even values
13
PAM problems
Good and bad things
 Based on evolution 
 Assumes independence 
 Global data  evolution of proteins 
 Based on small dataset 
 Extrapolates to greater divergence 
information loss & error growth 
 Evolutionary model partly verified 
 But still simplified… 
 Mutations happen at the nucleotide level 
14
The BLOSUM matrix[2]
[2]
S. Henikoff and J.G. Henikoff: Amino acid substitution matrices from
protein blocks (1992), Proc. Natl. Acad, Sci., 89:10915–10919s
15
The BLOSUM family
Look at conserved motifs in proteins
 Scan databases for motifs
 No gaps in alignment  a BLOCK


Make groups of motifs



One protein can contain many motifs
~2000 blocks, >500 protein families
The BLOCKS database
BLOSUM: BLOCKS substitution matrices
16
The BLOSUM idea
Don’t extrapolate data
 Collect divergent dataset
 Observe mutations directly
 Local alignment of sequences
 Evolution treated implicitly
 Still assumes independence between sites
17
Building BLOSUMx
Substitution scores for sequences of x %ID
1.
Create sets of blocks of x %ID
2.
Very similar sequences are grouped and weighted

3.
4.
5.
Avoid bias due to numbers
Count the substitutions
Calculate log-odds scores for all pairs
You have the BLOSUMx matrix!
18
Some general notes



BLOSUM62 has proved to work for a range of
similarities
Other values (e.g. 30, 45 and 80) also used
Today: BLOSUM used more than PAM
19
BLOSUM problems
Good and bad things

Evolution not considered explicitly 

Known related proteins 

Assumes star-shaped phylogeny 

Based on conserved blocks  /

Local alignment 
20
The matrix matters

Example: BLAST using BLOSUM45 and BLOSUM80
 Note number of hits
 Note difference in scores
NCBI BLAST
>GLBP_CHITH 152 P11582 GLOBIN CTT-E/E' PRECURSOR.
MKFIILALCVAAASALSGDQIGLVQSTYGKVKGDSVGILYAVFKADPTIQAAFPQ
FVGKDLDAIKGGAEFSTHAGRIVGFLGGVIDDLPNIGKHVDALVATHKPRGVT
HAQFNNFRAAFIAYLKGHVDYTAAVEAAWGATFDAFFGAVFAKM
21
Nucleotide scoring matrices
Normally just simple 1/-1 cost scheme
 Problematic simplification
 Especially for structural sequences (e.g. ncRNA)
 Consider evolution (PAM-like matrix)
 Use actual frequencies of nucleotides as background
 Weigh transitions and transversions differently
22
Gap penalties
We can score pairs of nucleotides/amino acids
 What about indels and gaps?
Two main strategies
Linear gap penalty: G(n)=a·n
 Penalize each gap the same (cf. last lecture)
Affine gap penalty: G(n)=O+E·n
 More evolutionary sound
 Opening a gap costs more than extending it
 Standard today
23
Gaps: Things to consider
How large should O and E be?
 Depends on the scoring matrix used
 Normally good values are given
 Too large: No gaps created
 Too small: Too many gaps created
Should end gaps be penalized?
 Local alignment: No
 Global: Probably
24
Heuristics
So, now we know all about scoring an alignment
 Let us use Smith-Waterman or Needleman-Wunsch
 Against a database or a full-length genome?
Bad idea!
Instead: Heuristic method
 Guaranteed to give a good solution fast
 Not necessarily the optimal alignment
25
The BLAST algorithm
Basic Local Alignment Search Tool
You have all used it, but how does it work?
 Used to find statistically significant local alignments
between a query and a database
 Many versions (protein-protein, nucleotidenucleotide,nucleotide-protein …)
 Fast, widely used, good reason to know the algorithm
26
BLAST overview
The six steps
1.
Filtering of low complexity regions (optional)
2.
Compile list of relevant words
3.
Scan database sequences
4.
Extend hits to HSPs
5.
Calculate E-value for significant hits
6.
Report hits and alignments
27
BLAST step 1
Filtering
 Some sequences contain low complexity regions
 Give rise to many random hits
 Remember dotplots
 Filter out by replacing with Xs
28
BLAST step 2
Compiling word list
 Word length L=3 (protein) or L=11 (nucleotides)
 Choose score threshold T (usually 11)
 Find all words of length L in query sequence




One for each position
Compare to all possible length L words (8000/~4.2M)
For each position in query, remove words below T
Limited to appr. 50 words per position
29
The word list
APLSADQASLVKSTWAQVRNSEVEILAAVFTAYPDIQARF…
APL
PLS
LSA
SAD
ADQ
DQA…
Word list, position 1 APL: APC,APS,APT,APE…
Remove words w with score(APL,w)<T
30
BLAST step 3
Scan database
 Store words for each position in efficient search tree
 Scan each sequence in database
 Remember exact word hits
 If query length is 100, scan for appr. 50·100 hits
31
Scanning
APL 
APA 
LPL 
IIAPLIGNESNAPAVQTLVGQLPLSHKARG…
Perfect match between word and database is a hit
32
BLAST step 4
Extend hits (BLAST2)
 HSP: High-scoring Segment Pair
 Find two hits on same diagonal with distance ≤A
 Connect them (ungapped alignment)
 Extend using gaps, matches and mismatches
 Extension continues while score increases
 Stop when score drops X below highest score
Original BLAST
 Extend all single word hits (higher T needed)
33
BLAST extension


Find diagonal hits
Extend alignment
34
BLAST step 5
Calculate E-scores
 Compile list of HSPs scoring more than S
 Let x be the score of a HSP
 The probability of seeing this score by chance
P(score≥x)
 The expectation of seeing this score in the database
E≈1-e-P(score≥x)D
 Rather complicated calculations
 The Karlin-Altschul equation (written in many ways)
35
BLAST step 6
Report the results
 List the hits (sorted by E-value)
 Graphical representation
 Show Smith-Waterman alignment of the HSPs
36
Using BLAST
Settings
 Query sequence






Sub-sequence
Database  important
Conserved Domain search: Additional info
What organisms to search
Scoring matrix
Normally leave the rest as defaults

But you can change, well, anything
37
Understanding BLAST
Output
 Database searched and query used
 Number of hits
 Color-coded diagram





Magnitude of score
Relation between hits (if any)
Hits sorted by E-value
Alignments
Additional info by clicking a link
38
BLAST summary








Heuristic local alignment
Finds word matches
Extends locally
Might miss optimal solutions
Fast
Lower E-value  better result
Many hits between query and sequence possible
Remember: Use a proper scoring matrix!
39
Position Specific Scoring Matrix
Another brief detour: PSSM or Profile
 Useful to represent (part of) a multiple alignment
 Represents the information in a motif
 For each position: What are the frequencies of the
characters? (Nucleotides or amino acids)
 Frequencies can give the probabilities
 Find most probable hit to a PSSM in a sequence
 Compare to the background probability (i.e. the
overall frequencies)
 Pseudocounts to avoid 0 probabilities
40
PSSM example
123456789
--------AAGGTAAGT
TGTGTGAGT
CAGGTATAC
ATGGTAACT
TAGGTACTG
TAGGTCATT
ACAGTCAGT
CAGGTTGGA
TCCGTAAGT
GAGGTAAAC
| 1 2 3 4 5 6 7 8 9
-|-------------------A| 3 6 1 0 0 6 7 2 1
C| 2 2 1 0 0 2 1 1 2
G| 1 1 7 10 0 1 1 5 1
T| 4 1 1 0 10 1 1 2 6
Just divide by 10 to get probabilities
Assume equal background distribution
P(A)=P(C)=P(G)=P(T)=0.25
41
PSSM score
Comparing sequence CAGGTAGTC to the PSSM



Probability given the model
0,2·0,6·0,7·1,0·1,0·0,6·0,1·0,2·0,2=0,0002016
Probability given the background
0,259=0,000003815
log-odds score (bits)
Pmodel
0,0002016
log
 log
 5,72
Pbckg
0,000003815
42
PSI-BLAST
Position-Specific-Iterated BLAST
 Designed to find distant homologs
 More sensitive than BlastP
 Use PSSM instead of just sequence comparison
43
The PSI-BLAST algorithm
1.
2.
3.
4.
5.


Perform standard BlastP search
Create PSSM based on query and best hits
Search database again for hits to the PSSM
Incorporate new hits (i.e. update frequencies)
Iterate (repeat) until convergence
End result: More distant homologs found
Slower than standard BLAST (of course)
44
PSI-BLASTing a protein


Output at first iteration, 2nd iteration, …
When to stop?

In this case: 7 iterations
PSI-BLAST
>GLBP_CHITH 152 P11582 GLOBIN CTT-E/E' PRECURSOR.
MKFIILALCVAAASALSGDQIGLVQSTYGKVKGDSVGILYAVFKADPTIQAAFPQ
FVGKDLDAIKGGAEFSTHAGRIVGFLGGVIDDLPNIGKHVDALVATHKPRGVT
HAQFNNFRAAFIAYLKGHVDYTAAVEAAWGATFDAFFGAVFAKM
45
Other BLAST versions

MegaBLAST


BLAST2SEQUENCES


Blast on only two sequences (local alignments)
PHI-BLAST


Assumes high similarity, longer sequences
Pattern Hit Initiated BLAST – searches for motif
WU-BLAST

BLAST from Washington University, not NCBI
46
BLAST variants






Nucleotide vs nucleotide: blastn
Protein vs protein: blastp
Translated sequence vs protein database: blastx
Protein sequence vs translated database: tblastn
Translated sequence vs translated database: tblastx
Find divergent protein sequences: PSI-BLAST
47
The BLAT algorithm
BLAST-Like Alignment Tool
 Designed for the genome projects
 Local alignments between long sequences
 Speed important!
 BLAST turned upside-down
48
What BLAT does
How does it work?
1.
Index the database (length 11 words)
2.
Find hits in the query

3.


Opposite the BLAST strategy
Extend hits to HSPs
Useful when the database does not often change
Too time (and space) consuming for normal BLAST
49
The FASTA algorithm
Similar to BLAST … but different
 Fast All (since it works with all alphabets)
 Not as widely used as BLAST – but one of the first
 Works in a step-wise fashion
 Locate word hits, extend using Smith-Waterman
50
The steps in FASTA
1.
2.
3.
4.
Find word hits
Score the hits and trim results
Join regions of similarity
Find the best alignment
51
FASTA step 1
Word hits
 Choose word size ktup (2 for protein, 4 or 6 for DNA)
 Create two word lists: Query and database
 Find all words that occur in both
 Connect nearby hits directly (i.e. no gaps)
52
Word hits


Find all hits
Connect hits on same diagonal
53
FASTA step 2
Score and trim
 Only keep the best 10 segments from step 1
 Re-evaluate all hits using PAM250
 For each hit: Note the best score
54
Score and trim


Keep 10 best hits
Recalculate scores
55
FASTA step 3
Join regions
 All regions scoring over a threshold are kept
 Crudely join regions
 Add linear gap penalty for joining to diagonals
 Keep the best scoring rough alignments
 Removes unlikely similar regions
56
Join and penalize


Keep the good regions
Connect using gaps. Remove the low scoring
57
FASTA step 4
Optimize alignment
 Consider the good but crude alignments
 Reoptimize using Smith-Waterman
 Only a small part of the matrix is needed
 Banded version (much faster)
58
Optimize alignment


Optimize crude alignment
Use banded Smith-Waterman
59
FASTA results




Histogram of scores
Actual scores versus
expected scores
Optimal alignments
E-value representing the
probability of the hit
60
BLAST vs. FASTA
Differences between the two
 BLAST searches the neighbourhood
 FASTA looks for exact matches
 BLAST returns all the best hits in a sequence
 FASTA returns one hit per sequence
 BLAST is faster than FASTA
 FASTA produces better final alignment
61
Summary


Introduction to Markov chains and HMMs
Scoring matrices: PAM and BLOSUM






Larger PAM  Greater distance
Larger BLOSUM  Greater identity
Gap penalties
BLAST, PSI-BLAST etc.
BLAT
FASTA
Next time: Multiple alignment
62
Download