Pairwise alignment 2: Scoring matrices and gaps BLAST, BLAT and FASTA Introduction to bioinformatics Stinus Lindgreen stinus@binf.ku.dk Bioinformatics Centre, University of Copenhagen 1 Outline of the lecture Scoring an alignment: BLOSUM and PAM Scoring matrices for nucleotides Treatment of gaps The BLAST algorithm Position Specific Scoring Matrices PSI-BLAST and other variants BLAT FASTA 2 Definition of scoring system To build an alignment we need a scoring system Aligning two different residues indicates a substitution How likely is it to replace A with B? A likely alignment should receive a large score Create a substitution matrix A 20×20 symmetrical table (matrix) Look up the score for aligning two amino acids The diagonal represents conservation 3 Considerations Conservative substitutions Prefer changes that respect structure and function E.g. hydrophobicity, size, chemical properties etc. Frequencies How often do specific residues occur? Scores should weigh the rare ones higher Evolution How distant are the sequences? Ancient divergence more changes 4 The creation Approach 1: Make an evolutionary model Use closely related sequences Extrapolate to greater distances Done in the PAM family Approach 2: Look at related sequences Observe actual mutations in motifs Use sets of different overall identity Done in the BLOSUM family 5 The PAM250 matrix[1] [1] M.O. Dayhoff: Survey of new data and computer methods of analysis (1978), Atlas of protein sequence and structure, 5:3 6 The PAM family Defining characteristics PAM: Point (or percent) accepted mutations One PAM unit: 1 change per 100 residues (~ 10M ys) Not the probability that a residue A is something else What would PAM250 mean in that case? But: The number of expected changes overall Including changes A B A 7 Construction of PAM matrices The general idea: Assume evolution is independent Describe ”one step of evolution” The next step will follow PAMx: Scoring matrix for x evolutionary steps Very similar sequences: Low x Very divergent sequences: Large x What is an evolutionary step? 8 Evolutionary model Assume independence between positions over time P(AB)=P(BA) Consequences: How a sequence looks tomorrow depends ONLY on how it looks today Not on how it looked yesterday Known as a Markov chain The context does not matter Evolution ”reversible” 9 Building PAM1 1. 2. 3. 4. Find set of very similar sequences (1% ID) Make global alignment (71 groups) Count the substitutions (1572 changes) Calculate weights 10 From counts to scores The simplified version Count mutation frequencies Normalize by mutability of respective amino acid Makes comparisons possible Multiply to wanted evolutionary distance x The respective probabilities Still probabilities Calculate log-odds score for (AB+BA)/2 You have the PAMx matrix! 11 Why log-odds? Si , j log qi , j pi p j Summation instead of multiplication log(A·B)=log(A)+log(B) S>0: More often than per chance S=0: Number expected by chance S<0: Less often than per chance 12 The PAM1 matrix Highly identical sequences 1 evolutionary step PAM1 describes likelihood of changes Diagonal close to 1, off-diagonal close to 0 PAM2=PAM1×PAM1 PAM3=PAM1×PAM2 … Markov chain property (independence) Greater divergence through matrix multiplication Large PAM: More even values 13 PAM problems Good and bad things Based on evolution Assumes independence Global data evolution of proteins Based on small dataset Extrapolates to greater divergence information loss & error growth Evolutionary model partly verified But still simplified… Mutations happen at the nucleotide level 14 The BLOSUM matrix[2] [2] S. Henikoff and J.G. Henikoff: Amino acid substitution matrices from protein blocks (1992), Proc. Natl. Acad, Sci., 89:10915–10919s 15 The BLOSUM family Look at conserved motifs in proteins Scan databases for motifs No gaps in alignment a BLOCK Make groups of motifs One protein can contain many motifs ~2000 blocks, >500 protein families The BLOCKS database BLOSUM: BLOCKS substitution matrices 16 The BLOSUM idea Don’t extrapolate data Collect divergent dataset Observe mutations directly Local alignment of sequences Evolution treated implicitly Still assumes independence between sites 17 Building BLOSUMx Substitution scores for sequences of x %ID 1. Create sets of blocks of x %ID 2. Very similar sequences are grouped and weighted 3. 4. 5. Avoid bias due to numbers Count the substitutions Calculate log-odds scores for all pairs You have the BLOSUMx matrix! 18 Some general notes BLOSUM62 has proved to work for a range of similarities Other values (e.g. 30, 45 and 80) also used Today: BLOSUM used more than PAM 19 BLOSUM problems Good and bad things Evolution not considered explicitly Known related proteins Assumes star-shaped phylogeny Based on conserved blocks / Local alignment 20 The matrix matters Example: BLAST using BLOSUM45 and BLOSUM80 Note number of hits Note difference in scores NCBI BLAST >GLBP_CHITH 152 P11582 GLOBIN CTT-E/E' PRECURSOR. MKFIILALCVAAASALSGDQIGLVQSTYGKVKGDSVGILYAVFKADPTIQAAFPQ FVGKDLDAIKGGAEFSTHAGRIVGFLGGVIDDLPNIGKHVDALVATHKPRGVT HAQFNNFRAAFIAYLKGHVDYTAAVEAAWGATFDAFFGAVFAKM 21 Nucleotide scoring matrices Normally just simple 1/-1 cost scheme Problematic simplification Especially for structural sequences (e.g. ncRNA) Consider evolution (PAM-like matrix) Use actual frequencies of nucleotides as background Weigh transitions and transversions differently 22 Gap penalties We can score pairs of nucleotides/amino acids What about indels and gaps? Two main strategies Linear gap penalty: G(n)=a·n Penalize each gap the same (cf. last lecture) Affine gap penalty: G(n)=O+E·n More evolutionary sound Opening a gap costs more than extending it Standard today 23 Gaps: Things to consider How large should O and E be? Depends on the scoring matrix used Normally good values are given Too large: No gaps created Too small: Too many gaps created Should end gaps be penalized? Local alignment: No Global: Probably 24 Heuristics So, now we know all about scoring an alignment Let us use Smith-Waterman or Needleman-Wunsch Against a database or a full-length genome? Bad idea! Instead: Heuristic method Guaranteed to give a good solution fast Not necessarily the optimal alignment 25 The BLAST algorithm Basic Local Alignment Search Tool You have all used it, but how does it work? Used to find statistically significant local alignments between a query and a database Many versions (protein-protein, nucleotidenucleotide,nucleotide-protein …) Fast, widely used, good reason to know the algorithm 26 BLAST overview The six steps 1. Filtering of low complexity regions (optional) 2. Compile list of relevant words 3. Scan database sequences 4. Extend hits to HSPs 5. Calculate E-value for significant hits 6. Report hits and alignments 27 BLAST step 1 Filtering Some sequences contain low complexity regions Give rise to many random hits Remember dotplots Filter out by replacing with Xs 28 BLAST step 2 Compiling word list Word length L=3 (protein) or L=11 (nucleotides) Choose score threshold T (usually 11) Find all words of length L in query sequence One for each position Compare to all possible length L words (8000/~4.2M) For each position in query, remove words below T Limited to appr. 50 words per position 29 The word list APLSADQASLVKSTWAQVRNSEVEILAAVFTAYPDIQARF… APL PLS LSA SAD ADQ DQA… Word list, position 1 APL: APC,APS,APT,APE… Remove words w with score(APL,w)<T 30 BLAST step 3 Scan database Store words for each position in efficient search tree Scan each sequence in database Remember exact word hits If query length is 100, scan for appr. 50·100 hits 31 Scanning APL APA LPL IIAPLIGNESNAPAVQTLVGQLPLSHKARG… Perfect match between word and database is a hit 32 BLAST step 4 Extend hits (BLAST2) HSP: High-scoring Segment Pair Find two hits on same diagonal with distance ≤A Connect them (ungapped alignment) Extend using gaps, matches and mismatches Extension continues while score increases Stop when score drops X below highest score Original BLAST Extend all single word hits (higher T needed) 33 BLAST extension Find diagonal hits Extend alignment 34 BLAST step 5 Calculate E-scores Compile list of HSPs scoring more than S Let x be the score of a HSP The probability of seeing this score by chance P(score≥x) The expectation of seeing this score in the database E≈1-e-P(score≥x)D Rather complicated calculations The Karlin-Altschul equation (written in many ways) 35 BLAST step 6 Report the results List the hits (sorted by E-value) Graphical representation Show Smith-Waterman alignment of the HSPs 36 Using BLAST Settings Query sequence Sub-sequence Database important Conserved Domain search: Additional info What organisms to search Scoring matrix Normally leave the rest as defaults But you can change, well, anything 37 Understanding BLAST Output Database searched and query used Number of hits Color-coded diagram Magnitude of score Relation between hits (if any) Hits sorted by E-value Alignments Additional info by clicking a link 38 BLAST summary Heuristic local alignment Finds word matches Extends locally Might miss optimal solutions Fast Lower E-value better result Many hits between query and sequence possible Remember: Use a proper scoring matrix! 39 Position Specific Scoring Matrix Another brief detour: PSSM or Profile Useful to represent (part of) a multiple alignment Represents the information in a motif For each position: What are the frequencies of the characters? (Nucleotides or amino acids) Frequencies can give the probabilities Find most probable hit to a PSSM in a sequence Compare to the background probability (i.e. the overall frequencies) Pseudocounts to avoid 0 probabilities 40 PSSM example 123456789 --------AAGGTAAGT TGTGTGAGT CAGGTATAC ATGGTAACT TAGGTACTG TAGGTCATT ACAGTCAGT CAGGTTGGA TCCGTAAGT GAGGTAAAC | 1 2 3 4 5 6 7 8 9 -|-------------------A| 3 6 1 0 0 6 7 2 1 C| 2 2 1 0 0 2 1 1 2 G| 1 1 7 10 0 1 1 5 1 T| 4 1 1 0 10 1 1 2 6 Just divide by 10 to get probabilities Assume equal background distribution P(A)=P(C)=P(G)=P(T)=0.25 41 PSSM score Comparing sequence CAGGTAGTC to the PSSM Probability given the model 0,2·0,6·0,7·1,0·1,0·0,6·0,1·0,2·0,2=0,0002016 Probability given the background 0,259=0,000003815 log-odds score (bits) Pmodel 0,0002016 log log 5,72 Pbckg 0,000003815 42 PSI-BLAST Position-Specific-Iterated BLAST Designed to find distant homologs More sensitive than BlastP Use PSSM instead of just sequence comparison 43 The PSI-BLAST algorithm 1. 2. 3. 4. 5. Perform standard BlastP search Create PSSM based on query and best hits Search database again for hits to the PSSM Incorporate new hits (i.e. update frequencies) Iterate (repeat) until convergence End result: More distant homologs found Slower than standard BLAST (of course) 44 PSI-BLASTing a protein Output at first iteration, 2nd iteration, … When to stop? In this case: 7 iterations PSI-BLAST >GLBP_CHITH 152 P11582 GLOBIN CTT-E/E' PRECURSOR. MKFIILALCVAAASALSGDQIGLVQSTYGKVKGDSVGILYAVFKADPTIQAAFPQ FVGKDLDAIKGGAEFSTHAGRIVGFLGGVIDDLPNIGKHVDALVATHKPRGVT HAQFNNFRAAFIAYLKGHVDYTAAVEAAWGATFDAFFGAVFAKM 45 Other BLAST versions MegaBLAST BLAST2SEQUENCES Blast on only two sequences (local alignments) PHI-BLAST Assumes high similarity, longer sequences Pattern Hit Initiated BLAST – searches for motif WU-BLAST BLAST from Washington University, not NCBI 46 BLAST variants Nucleotide vs nucleotide: blastn Protein vs protein: blastp Translated sequence vs protein database: blastx Protein sequence vs translated database: tblastn Translated sequence vs translated database: tblastx Find divergent protein sequences: PSI-BLAST 47 The BLAT algorithm BLAST-Like Alignment Tool Designed for the genome projects Local alignments between long sequences Speed important! BLAST turned upside-down 48 What BLAT does How does it work? 1. Index the database (length 11 words) 2. Find hits in the query 3. Opposite the BLAST strategy Extend hits to HSPs Useful when the database does not often change Too time (and space) consuming for normal BLAST 49 The FASTA algorithm Similar to BLAST … but different Fast All (since it works with all alphabets) Not as widely used as BLAST – but one of the first Works in a step-wise fashion Locate word hits, extend using Smith-Waterman 50 The steps in FASTA 1. 2. 3. 4. Find word hits Score the hits and trim results Join regions of similarity Find the best alignment 51 FASTA step 1 Word hits Choose word size ktup (2 for protein, 4 or 6 for DNA) Create two word lists: Query and database Find all words that occur in both Connect nearby hits directly (i.e. no gaps) 52 Word hits Find all hits Connect hits on same diagonal 53 FASTA step 2 Score and trim Only keep the best 10 segments from step 1 Re-evaluate all hits using PAM250 For each hit: Note the best score 54 Score and trim Keep 10 best hits Recalculate scores 55 FASTA step 3 Join regions All regions scoring over a threshold are kept Crudely join regions Add linear gap penalty for joining to diagonals Keep the best scoring rough alignments Removes unlikely similar regions 56 Join and penalize Keep the good regions Connect using gaps. Remove the low scoring 57 FASTA step 4 Optimize alignment Consider the good but crude alignments Reoptimize using Smith-Waterman Only a small part of the matrix is needed Banded version (much faster) 58 Optimize alignment Optimize crude alignment Use banded Smith-Waterman 59 FASTA results Histogram of scores Actual scores versus expected scores Optimal alignments E-value representing the probability of the hit 60 BLAST vs. FASTA Differences between the two BLAST searches the neighbourhood FASTA looks for exact matches BLAST returns all the best hits in a sequence FASTA returns one hit per sequence BLAST is faster than FASTA FASTA produces better final alignment 61 Summary Introduction to Markov chains and HMMs Scoring matrices: PAM and BLOSUM Larger PAM Greater distance Larger BLOSUM Greater identity Gap penalties BLAST, PSI-BLAST etc. BLAT FASTA Next time: Multiple alignment 62