BLAST, PSI-BLAST and position-specific scoring matrices

Prof. William Stafford Noble
Department of Genome Sciences
Department of Computer Science and Engineering
University of Washington
thabangh@gmail.com

Outline
• Responses from last class
• Revision
• BLAST
• PSI-BLAST
• Position-specific scoring matrices (PSSMs)
• Python

One-minute responses
• Please explain the null and alternative hypothesis again.
• Liked giving examples on the statistical concepts.
• Sometimes the class is boring because you are using only the projector.
• For Python, we learn more by practicing than just looking at your code.
• Python session was good, but too fast.
• More Python examples, please.
• The Python is difficult because it is different from what we learned before.
• The problem is how to use sys in Python. I hope you give lots of examples for the sys command.
• Please be available for consultation over the weekend on the assignment.
• Does BLAST use p-values to decide which alignments to consider?

Revision
• What is a distribution?
  – A mathematical function whose values sum to 1.
• If you roll a single die many times and make a histogram of the resulting values, what kind of distribution will you observe?
  – Uniform
• If you compare a protein sequence to many randomly shuffled protein sequences and make a histogram of the resulting scores, what kind of distribution will you observe?
  – Extreme value distribution
• What is the definition of “null hypothesis”?
  – A statistical model of the situation that we are not interested in.
• What is the opposite of the null hypothesis?
  – The alternative hypothesis.
• What is the name of the estimated probability of observing the data, assuming that it was generated according to the null hypothesis?
  – p-value
• How do you decide what p-value threshold to use?
  – Consider the costs associated with making a mistake.
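The revision questions above about shuffled sequences, the null hypothesis and p-values can be illustrated with a small simulation. This is a toy sketch under stated assumptions, not how BLAST computes significance: `toy_score` (the number of identical positions, with no alignment) stands in for a real alignment score, and `empirical_p_value` is a hypothetical helper that estimates the p-value as the fraction of shuffled targets that score at least as high as the real target. The two example sequences are the ones shown on the “Significance of scores” slide.

```python
import random


def toy_score(a, b):
    """Toy similarity score (an assumption): number of identical positions."""
    return sum(1 for x, y in zip(a, b) if x == y)


def empirical_p_value(query, target, num_shuffles=1000, seed=0):
    """Estimate P(score >= observed) under the null model of shuffled targets."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = toy_score(query, target)
    letters = list(target)
    at_least_as_high = 0
    for _ in range(num_shuffles):
        rng.shuffle(letters)  # one random sequence from the null model
        if toy_score(query, letters) >= observed:
            at_least_as_high += 1
    return observed, at_least_as_high / num_shuffles


if __name__ == "__main__":
    query = "HPDKKAHSIHAWILSKSKVLEGNTKEVVDNVLKT"
    target = "LENENQGKCTIAEYKYDGKKASVYNSFVSNGVKE"
    score, p = empirical_p_value(query, target)
    print(f"score = {score}, estimated p-value = {p:.3f}")
```

A histogram of the shuffled scores collected inside the loop would show the null distribution that the observed score is compared against.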
Significance of scores
[Figure: two sequences, HPDKKAHSIHAWILSKSKVLEGNTKEVVDNVLKT and LENENQGKCTIAEYKYDGKKASVYNSFVSNGVKE, are fed to a sequence alignment algorithm, which returns a score of 45.]
Low score = unrelated
High score = homologs
How high is high enough?

Database searching
[Figure: a query and a sequence database are fed to a sequence comparison algorithm, which returns targets ranked by score.]

How long does DP take?
[Figure: dynamic programming matrix for a query sequence of length n and a target sequence of length m.]
There are nm entries in the matrix.
Each entry requires a constant number c of operations.
The total number of required operations is approximately nmc.
We say that the algorithm is “order nm” or “O(nm).”

How long does DP take?
• Say that your query is 200 amino acids long.
• You are searching a database that contains a million proteins.
• If their average length is 200, then you have to fill in 200 × 200 × 1,000,000 = 4 × 10^10 DP entries.
• If it takes only 10 operations to fill in each cell, then you still have to do 4 × 10^11 operations.

BLAST
• DP is O(nm); BLAST is O(m).
• Fundamental innovation: employ a data structure to index the query sequence.
• The data structure allows you to look up entries in a table in O(1) time.

Does my length-n sequence contain the subsequence “GTR”?
Naive method: scan the sequence, O(n)
Improved method: hash table or search tree lookup, O(1)

BLAST
[Figure: the query sequence is indexed as a list of words in the query and similar words. Each word in the target sequence is looked up in this list: “Does this target word appear in the query word list?” “Yes, at position 34 in the query sequence.” Each such hit is marked with an x in the query-by-target matrix.]

BLAST
[Figure: the query-by-target matrix with many hits marked. “These two hits are on the diagonal and close to each other, so let’s try to connect them.”]
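The query-indexing idea from the BLAST slides above can be sketched in a few lines. This is a simplified illustration, not the actual BLAST implementation: it indexes only exact words of length 3 (the word list on the slides also contains similar words), and the function names `build_word_index` and `find_hits` and the example sequences are made up for this sketch.

```python
from collections import defaultdict


def build_word_index(query, word_len=3):
    """Map every word of length word_len in the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - word_len + 1):
        index[query[i:i + word_len]].append(i)
    return index


def find_hits(index, target, word_len=3):
    """Scan the target once; each dictionary lookup is O(1) on average.

    Hits are grouped by diagonal (target position minus query position),
    because hits on the same diagonal are candidates for being connected.
    """
    hits_by_diagonal = defaultdict(list)
    for j in range(len(target) - word_len + 1):
        # .get avoids adding empty entries to the index for absent words
        for i in index.get(target[j:j + word_len], []):
            hits_by_diagonal[j - i].append((i, j))
    return hits_by_diagonal


if __name__ == "__main__":
    query = "AGTRKLMGTRW"    # hypothetical query
    target = "CCGTRKLPP"     # hypothetical target
    index = build_word_index(query)
    print("Query contains GTR:", "GTR" in index)
    for diagonal, hits in sorted(find_hits(index, target).items()):
        print("diagonal", diagonal, "hits", hits)
```

Building the index costs time proportional to the query length, after which the work grows with the length of the target rather than with the product of the two lengths.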
BLAST
[Figure: query-by-target matrix of hits. “Assign a score to each hit”; the example hits are labeled 0.005 and 0.27.]

BLAST
• “The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words.”
• The initial word threshold T is the most important parameter.
• Low T = high sensitivity, long compute.
• High T = low sensitivity, quick compute.

When does BLAST fail?
ERDCRVSSFRVKENFDKARFAGTWYAMAKKDPEGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDT
ECEIRQFLFIQRESARKEACATGTYREKKMDPELIVLVIWICPQFEQLEMRAMWIHAKJEVIUENAQCVIYTMQEPFCII
[Figure: alignment of the two sequences above; the identical residues are scattered along its length: E R F E K A Y K E L I F E M A V N V M F.]
• BLAST works by joining together short regions of high similarity.
• Therefore, BLAST will fail to detect long regions of low similarity.

Summary of BLAST
• Dynamic programming is O(nm), where n is the length of the query and m is the size of the database.
• BLAST is O(m).
• BLAST produces an index of the query sequence that allows fast matching to the database.
• Relative to Smith-Waterman, BLAST can produce false negatives; i.e., homologs that BLAST fails to detect.

BLAST
[Figure: the query is searched against the sequence database with BLAST, which returns homologs.]

Position-specific iterated BLAST
[Figure: the query and its homologs are used to build a position-specific scoring matrix (PSSM), a statistical model of the protein family, which is then searched against the sequence database with BLAST.]

Position-specific scoring matrix
• A PSSM is an n by m matrix, where n is the size of the alphabet, and m is the length of the sequence.
• The entry at (i, j) is the score assigned by the PSSM to letter i at the jth position.
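The PSSM definition above translates directly into a lookup-and-sum. The sketch below stores the example PSSM shown on these slides as a dictionary from amino acid to a list of per-position scores; the name `score_sequence` and the dictionary layout are choices made for this illustration, and the scoring assumes an ungapped sequence exactly as long as the PSSM.

```python
# Example PSSM from the slides: one row per amino acid, one column per position.
PSSM = {
    "A": [-1, -2, -1,  0, -1, -2,  0, -2],
    "R": [ 5,  0,  5, -2,  1, -3, -2,  0],
    "N": [ 0,  6,  0,  0,  0, -3,  0,  1],
    "D": [-2,  1, -2, -1,  0, -3, -1, -1],
    "C": [-3, -3, -3, -3, -3, -2, -3, -3],
    "Q": [ 1,  0,  1, -2,  5, -3, -2,  0],
    "E": [ 0,  0,  0, -2,  2, -3, -2,  0],
    "G": [-2,  0, -2,  6, -2, -3,  6, -2],
    "H": [ 0,  1,  0, -2,  0, -1, -2,  8],
    "I": [-3, -3, -3, -4, -3,  0, -4, -3],
    "L": [-2, -3, -2, -4, -2,  0, -4, -3],
    "K": [ 2,  0,  2, -2,  1, -3, -2, -1],
    "M": [-1, -2, -1, -3,  0,  0, -3, -2],
    "F": [-3, -3, -3, -3, -3,  6, -3, -1],
    "P": [-2, -2, -2, -2, -1, -4, -2, -2],
    "S": [-1,  1, -1,  0,  0, -2,  0, -1],
    "T": [-1,  0, -1, -2, -1, -2, -2, -2],
    "W": [-3, -4, -3, -2, -2,  1, -2, -2],
    "Y": [-2, -2, -2, -3, -1,  3, -3,  2],
    "V": [-3, -3, -3, -3, -2, -1, -3, -3],
}


def score_sequence(pssm, sequence):
    """Sum the PSSM entry for each letter at its position (no gaps allowed)."""
    num_positions = len(next(iter(pssm.values())))
    if len(sequence) != num_positions:
        raise ValueError("sequence length must equal the number of PSSM columns")
    return sum(pssm[letter][j] for j, letter in enumerate(sequence))


if __name__ == "__main__":
    for seq in ("NMFWAFGH", "KRPGHFLA"):
        print(seq, score_sequence(PSSM, seq))
```

Note that Python lists are indexed from 0, so the slide's “position 3” is `PSSM["K"][2]`.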
[Table: rows are amino acids; columns are positions 1 to 8 in the query sequence.]
     1   2   3   4   5   6   7   8
A   -1  -2  -1   0  -1  -2   0  -2
R    5   0   5  -2   1  -3  -2   0
N    0   6   0   0   0  -3   0   1
D   -2   1  -2  -1   0  -3  -1  -1
C   -3  -3  -3  -3  -3  -2  -3  -3
Q    1   0   1  -2   5  -3  -2   0
E    0   0   0  -2   2  -3  -2   0
G   -2   0  -2   6  -2  -3   6  -2
H    0   1   0  -2   0  -1  -2   8
I   -3  -3  -3  -4  -3   0  -4  -3
L   -2  -3  -2  -4  -2   0  -4  -3
K    2   0   2  -2   1  -3  -2  -1
M   -1  -2  -1  -3   0   0  -3  -2
F   -3  -3  -3  -3  -3   6  -3  -1
P   -2  -2  -2  -2  -1  -4  -2  -2
S   -1   1  -1   0   0  -2   0  -1
T   -1   0  -1  -2  -1  -2  -2  -2
W   -3  -4  -3  -2  -2   1  -2  -2
Y   -2  -2  -2  -3  -1   3  -3   2
V   -3  -3  -3  -3  -2  -1  -3  -3
“K” at position 3 gets a score of 2.

Position-specific scoring matrix
• This PSSM assigns the sequence NMFWAFGH a score of 0 + -2 + -3 + -2 + -1 + 6 + 6 + 8 = 12.
• What score does this PSSM assign to KRPGHFLA?
• 2 + 0 + -2 + 6 + 0 + 6 + -4 + -2 = 6

How PSI-BLAST makes PSSMs

Position-specific iterated BLAST
[Figure: the PSI-BLAST cycle linking the query, the PSSM, the BLAST search of the sequence database, and the multiple alignment, with one step marked “?”.]
Creating a PSSM from 1 sequence
[Figure: for each residue of the query RNRGQFGH (length L), for example R, the corresponding column of the 20 by 20 BLOSUM62 matrix is copied into the PSSM, giving a 20 by L matrix.]
     R   N   R   G   Q   F   G   H
A   -1  -2  -1   0  -1  -2   0  -2
R    5   0   5  -2   1  -3  -2   0
N    0   6   0   0   0  -3   0   1
D   -2   1  -2  -1   0  -3  -1  -1
C   -3  -3  -3  -3  -3  -2  -3  -3
Q    1   0   1  -2   5  -3  -2   0
E    0   0   0  -2   2  -3  -2   0
G   -2   0  -2   6  -2  -3   6  -2
H    0   1   0  -2   0  -1  -2   8
I   -3  -3  -3  -4  -3   0  -4  -3
L   -2  -3  -2  -4  -2   0  -4  -3
K    2   0   2  -2   1  -3  -2  -1
M   -1  -2  -1  -3   0   0  -3  -2
F   -3  -3  -3  -3  -3   6  -3  -1
P   -2  -2  -2  -2  -1  -4  -2  -2
S   -1   1  -1   0   0  -2   0  -1
T   -1   0  -1  -2  -1  -2  -2  -2
W   -3  -4  -3  -2  -2   1  -2  -2
Y   -2  -2  -2  -3  -1   3  -3   2
V   -3  -3  -3  -3  -2  -1  -3  -3

Position-specific iterated BLAST
[Figure: the PSI-BLAST cycle (query, PSSM, BLAST, sequence database, multiple alignment), with one step marked “?”.]

Creating a PSSM from multiple sequences
• Discard columns that contain gaps in the query.
• For each column C
  – Compute relative sequence weights
  – Compute PSSM entries, taking into account
    • Observed residues in this column
    • Sequence weights
    • Substitution matrix

Discard query gap columns
Before:
EEFG----SVDGLVNNA
QKYG----RLDVMINNA
RRLG----TLNVLVNNA
GGIG----PVD-LVNNA
KALG----GFNVIVNNA
ARFG----KID-LIPNA
FEPEGPEKGMWGLVNNA
AQLK----TVDVLINGA

After:
EEFGSVDGLVNNA
QKYGRLDVMINNA
RRLGTLNVLVNNA
GGIGPVD-LVNNA
KALGGFNVIVNNA
ARFGKID-LIPNA
FEPEGMWGLVNNA
AQLKTVDVLINGA

Compute sequence weights
EEFGSVDGLVNNA   1.2
QKYGRLDVMINNA   1.2
RRLGTLNVLVNNA   0.8
GGIGPVDLLVNNA   0.8
KALGGFNVIVNNA   1.1
ARFGKIDTLIPNA   0.9
FEPEGMWGLVNNA   1.1
AQLKTVDVLINGA   1.3
• Low weights are assigned to redundant sequences.
• High weights are assigned to unique sequences.

Compute PSSM entries
[Figure: the weighted alignment from the previous slide and the BLOSUM62 matrix are combined to produce the PSSM.]

Position-specific iterated BLAST
[Figure: the complete PSI-BLAST cycle: query, PSSM, BLAST search of the sequence database, multiple alignment.]

Summary of PSI-BLAST
• PSI-BLAST builds a model of the query sequence and its close homologs.
• Instead of comparing a target sequence to the query, each target is compared to the model.
• The PSI-BLAST model is called a position-specific scoring matrix (PSSM).
• The PSSM can be constructed from a collection of targets aligned to the query sequence.
• PSI-BLAST is more accurate than BLAST.

Sample problem #1
• Given:
  – a file containing a sequence of amino acids
• Return:
  – the amino acid counts

    ./compute-counts.py seq1.txt
    Read 68 amino acids from seq1.txt.
    A 5
    C 2
    D 3
    E 1
    F 6
    G 0
    H 0
    I 2
    K 2
    L 8
    M 1
    N 5
    P 7
    Q 1
    R 1
    S 2
    T 5
    V 6
    W 3
    Y 8

Sample problem #2
• Given:
  – a pseudocount weight
  – a file containing amino acid frequencies
  – a file containing a sequence of amino acids
• Return:
  – the summed amino acid counts and pseudocounts

Sample problem #3
• Given:
  – a pseudocount weight
  – a file containing amino acid frequencies
  – a file containing a sequence of amino acids
• Return:
  – the normalized summed amino acid counts and pseudocounts
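One possible approach to the three sample problems is sketched below. Several details are assumptions, since the slides do not specify them: the pseudocount for an amino acid is taken to be the weight times its frequency, the frequency file is assumed to hold one "letter frequency" pair per line, and the script name `pseudocounts.py` and all function names are hypothetical. The command-line arguments are read with `sys.argv`.

```python
import sys

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def count_amino_acids(sequence):
    """Sample problem #1: count each of the 20 amino acids (zero if absent)."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for letter in sequence:
        if letter in counts:
            counts[letter] += 1
    return counts


def add_pseudocounts(counts, frequencies, weight):
    """Sample problem #2 (assumed formula): count + weight * frequency."""
    return {aa: counts[aa] + weight * frequencies[aa] for aa in AMINO_ACIDS}


def normalize(values):
    """Sample problem #3: divide by the total so that the values sum to 1."""
    total = sum(values.values())
    return {aa: value / total for aa, value in values.items()}


def read_sequence(path):
    """Read a sequence file, joining lines and dropping whitespace."""
    with open(path) as handle:
        return "".join(handle.read().split())


def read_frequencies(path):
    """Read lines of the assumed form '<amino acid> <frequency>'."""
    frequencies = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) == 2:
                frequencies[fields[0]] = float(fields[1])
    return frequencies


def main(argv):
    # argv[0] is the script name; the three arguments follow it.
    if len(argv) != 4:
        print("Usage: pseudocounts.py <weight> <freq-file> <seq-file>",
              file=sys.stderr)
        return
    weight = float(argv[1])
    frequencies = read_frequencies(argv[2])
    sequence = read_sequence(argv[3])
    print(f"Read {len(sequence)} amino acids from {argv[3]}.", file=sys.stderr)
    summed = add_pseudocounts(count_amino_acids(sequence), frequencies, weight)
    for aa, value in normalize(summed).items():
        print(aa, f"{value:.4f}")


if __name__ == "__main__":
    main(sys.argv)
```

Keeping the counting, pseudocount and normalization steps in separate functions means each sample problem can be solved by reusing the function from the previous one.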