Bioinformatics Algorithms and Data Structures
BLAST
Lecturer: Dr. Rose
BLAST Slides: Adaptation of Nir Friedman’s slides from the Computational Methods in Molecular Biology course (Spring 2001) at Hebrew University, Jerusalem, Israel
February 21, 2007
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology

BLAST
Q: What is BLAST?
A: Uhmmm, actually no, BLAST is an acronym:
A: Basic Local Alignment Search Tool - a set of similarity search programs designed to explore all of the available sequence databases regardless of whether the query is protein or DNA.
You can find it at: http://www.ncbi.nlm.nih.gov/BLAST/

BLAST
• Q: Why do you care?
• A: Because you are going to do a project.
• U51112: Membrane protein that transports sodium and hydrogen
• J03581: Tyrosinase; people lacking this are albino
• NM_000245: MET, an oncogene; mutations in this cause cancer
• NM_010849: MYC, another oncogene
• NM_007409: Alcohol dehydrogenase; good to have when drinking
• NM_002475: Myosin, one of the muscle proteins
• XM_086788: Crystallin, the major protein in the lens
• M30047: Myelin basic protein; protects the neurons
• NM_000518: Hemoglobin, oxygen-carrying protein in RBC
• NM_000477: Albumin, major serum protein; does lots of things
• NM_008476: Keratin, skin and integument protein

BLAST
• BLAST is designed to efficiently find alignments of a query string s against large databases
– Motivation: increase speed by finding fewer and better hot spots.
– Idea: Find high-scoring matches using a substitution matrix rather than exact matches.
– We are still searching only for gapless matches.

High-Scoring Pair
• Two strings s and t are a high-scoring pair (HSP) if d(s,t) > T
• Given a query s[1..n], BLAST constructs all words w (fixed-length strings of length k) such that w scores > T against some k-substring of s
– Each match to such a word in the database is called a hit
• Typical k: 12 for nucleotides, 3-5 for amino acids.

High-Scoring Pair
• Try to extend each such hit to an alignment with maximal score (still with no gaps). Keep all HSPs
– The threshold is chosen so that a random match with such a score is unlikely.

Finding Potential Matches
We can locate seed words in a large database in a single pass
• Construct an FSA (finite-state automaton) that recognizes seed words
• Use hashing techniques to locate matching words

Extending Potential Matches
• Once a seed is found, BLAST attempts to find a local alignment that extends the seed
• Seeds on the same diagonal are combined (as in FASTA)

Which programs are used?
• Originally BLAST did not allow gaps.
– Now people use gapped BLAST
– Gapped BLAST joins different diagonals.
• For proteins, BLAST is superior
• For nucleotides, FASTA is better.
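
The seed-and-extend steps described in these slides can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not BLAST's actual implementation: the +1/-1 scoring, the function names, the brute-force enumeration of the word neighborhood, and the drop-off cutoff `x_drop` that stops an extension are all choices made for this example.

```python
from itertools import product

ALPHABET = "ACGT"  # toy nucleotide alphabet (assumption)


def score(a, b):
    """Toy substitution score (assumption): +1 for a match, -1 otherwise."""
    return 1 if a == b else -1


def word_score(u, v):
    """Gapless score of two equal-length words."""
    return sum(score(a, b) for a, b in zip(u, v))


def neighborhood(query, k, threshold):
    """Hash table: word w -> query offsets i where w scores > threshold
    against the k-substring query[i:i+k]."""
    seeds = {}
    for i in range(len(query) - k + 1):
        sub = query[i:i + k]
        for letters in product(ALPHABET, repeat=k):  # brute force, small k only
            w = "".join(letters)
            if word_score(w, sub) > threshold:
                seeds.setdefault(w, []).append(i)
    return seeds


def find_hits(seeds, db, k):
    """Single pass over the database; each lookup is a hash-table probe."""
    hits = []
    for j in range(len(db) - k + 1):
        for i in seeds.get(db[j:j + k], ()):
            hits.append((i, j))  # seed at query offset i, database offset j
    return hits


def extend(query, db, i, j, k, x_drop=2):
    """Ungapped extension of the seed at (i, j) along its diagonal.
    Stops in each direction once the running score falls more than
    x_drop below the best score seen (hypothetical cutoff).
    Returns (best score, query start, query end exclusive)."""
    best = word_score(query[i:i + k], db[j:j + k])
    left, right = 0, 0  # reach of the best extension in each direction
    run, step = best, 0
    while i + k + step < len(query) and j + k + step < len(db):
        run += score(query[i + k + step], db[j + k + step])
        step += 1
        if run > best:
            best, right = run, step
        elif best - run > x_drop:
            break
    run, step = best, 0
    while i - step - 1 >= 0 and j - step - 1 >= 0:
        step += 1
        run += score(query[i - step], db[j - step])
        if run > best:
            best, left = run, step
        elif best - run > x_drop:
            break
    return best, i - left, i + k + right


query, db, k = "GATTACA", "CCGATTACACC", 3
seeds = neighborhood(query, k, threshold=2)  # threshold 2 keeps exact words only
hits = find_hits(seeds, db, k)
print(hits)
print(extend(query, db, *hits[0], k))  # -> (7, 0, 7)
```

Lowering `threshold` admits inexact neighbor words as seeds, which is the point of using a substitution matrix instead of exact matches. All hits in this toy run lie on the same diagonal (j - i = 2), so a fuller version would merge them before extending.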

Review: Unrelated Sequences
• Our model of unrelated sequences is simple
– Each position is sampled independently from a distribution over the alphabet
– We assume there is a distribution q(·) that describes the probability of letters in such positions
• Then:
$P(s[1..n], t[1..n] \mid R) = \prod_i q(s[i])\, q(t[i])$
• R denotes the assumption that s and t are random unrelated strings

Review: Related Sequences
• We assume that each pair of aligned positions (s[i],t[i]) evolved from a common ancestor
• Let p(a,b) be a distribution over pairs of letters.
• p(a,b) is the probability that some ancestral letter evolved into this particular pair of letters
$P(s[1..n], t[1..n] \mid M) = \prod_i p(s[i], t[i])$
• Here M denotes the assumption that s and t are related strings.

Review: Ratio Test for Alignment
• Taking the logarithm of the likelihood ratio P(s,t | M) / P(s,t | R), we get
$\log \frac{P(s,t \mid M)}{P(s,t \mid R)} = \log \prod_i \frac{p(s[i], t[i])}{q(s[i])\, q(t[i])} = \sum_i \log \frac{p(s[i], t[i])}{q(s[i])\, q(t[i])}$

Review: Probabilistic Interpretation of Scoring Rule
• If we take
$\sigma(a,b) = \log \frac{p(a,b)}{q(a)\, q(b)}$
• then the score of an alignment is the log-ratio between the two models:
– Score > 0: M is more “probable”
– Score < 0: R is more “probable”

Problems with Scoring Rule
When searching for an optimal alignment in a big database, a number of problems arise with this simple scheme.
• We are assuming P(M) = P(R), that is, that the database contains equal numbers of related and unrelated sequences.
• When searching through a big database, there is a high probability that some unrelated sequence will receive a high score
• When searching for an optimal local alignment, we have many possible starting points, which biases the best score upward and so favors the conclusion that the sequences are related.

Prior Probability on the models
• What we really wish to calculate is:
$P(M \mid s,t) = \frac{P(s,t \mid M)\, P(M)}{P(s,t)}$
• The log score being:
$\log \frac{P(M \mid s,t)}{P(R \mid s,t)} = \log \frac{P(s,t \mid M)\, P(M)}{P(s,t \mid R)\, P(R)} = \log \frac{P(s,t \mid M)}{P(s,t \mid R)} + \log \frac{P(M)}{P(R)}$

Prior Probability on the models
• The posterior log-odds is positive exactly when the alignment score exceeds the threshold:
$-\log \frac{P(M)}{P(R)} = \log \frac{P(R)}{P(M)}$

The Hazard of Large Databases
• Define $p = P(d(s,t) > T \mid R)$
• This is the probability that two unrelated sequences will match with score > T by chance
• Assume there are N strings in our database
• Assuming that they are independent of each other, and all are unrelated to s, we have
$P(\max_t d(s,t) > T) = 1 - (1 - p)^N \approx 1 - e^{-Np}$

The Hazard of Large Databases
[Plot: curves f(x, 0.001), f(x, 0.0001), f(x, 0.00001), f(x, 0.000001) for x from 0 to 100000; vertical axis from 0 to 1]

Local Matching
• Question: Which local alignment query is expected to give a higher score:
– To a short sequence
– To a long sequence?
• A local match can begin at any of the nm entries in the DP matrix.
• The score is the optimum over all these starting points.
• If all starting points were independent, we would need to calculate the probability of attaining such a score in nm trials.

Score Significance (FASTA)
• How meaningful is a score?
• Calculate the distribution of scores for unrelated sequences and for related sequences
• Under reasonable assumptions, the scores for ungapped alignments behave according to the Extreme Value Distribution.

Extreme Value Distribution (BLAST)
• We ask the following question: given a database of size n and a query sequence of size m, what is the expected number of hits with score at least S?
• This number is called an E-score:
$E(S) = K m n e^{-\lambda S}$
• The number of hits with score at least S follows a Poisson distribution with mean E(S).
• K corrects for the dependencies
• λ depends on the scoring matrix
• Doubling n, the size of the database, doubles the expectation
• Increasing S, the score, causes E(S) to decrease exponentially (doubling S squares the factor $e^{-\lambda S}$)

Blast P-value
• Recall the Poisson distribution:
– Probability of finding no hits with a score ≥ S is $e^{-E}$
– Therefore the probability of finding at least one hit with score ≥ S is $1 - e^{-E}$
– This is called the P-value.

[Screenshot slides: A Typical GenBank Entry; Sequence Information; The Sequence]

BLAST programs
• BLASTN - Nucleotide query searching a nucleotide database.
• BLASTP - Protein query searching a protein database.
• BLASTX - Translated nucleotide query sequence (6 frames) searching a protein database.
• TBLASTN - Protein query searching a translated nucleotide (6 frames) database.
• TBLASTX - Translated nucleotide query (6 frames) searching a translated nucleotide (6 frames) database.

BLAST Search
[Screenshot slide]

BLAST Output
• List of hits
– Database accession codes, name, description.
– Score in bits (usually > 30 bits is significant)
– Expectation value E()
• For each hit
– A header including hit name, description, length
– Each hit may contain several HSPs
– Score and expectation value
– How many identical residues
– How many residues contributing positively to the score
• The local alignment itself

BLAST Output
[Three screenshot slides of example BLAST output]

PSI-BLAST (Position Specific Iterated)
• PSI-BLAST provides a new automatic “profile-like” search.
• Iterative procedure:
– Perform BLAST on database.
– Use significant alignments to construct a “position specific” score matrix.
– This matrix replaces the query sequence in the next round of database searching.
• The program may be iterated until no new significant alignments are found.
• Most commonly used search method today.

Multiple Alignment
• Proteins can be classified into families:
– Common structure.
– Common function.
– Common evolutionary origin.
• For a set of sequences belonging to some family
– Each pair has some differences
– But there are some common motifs in almost all sequences of the family
• A multiple alignment carries more information than a pairwise alignment

Protein Families
• Consider Zinc Fingers:
• All have the same function:
– Bind to DNA
• All have similar structure
• They constitute a Protein Family
• In a protein family some parts of the sequence (the functional parts) are more conserved than others.

Definition
A multiple alignment of strings S1,S2,…,Sk is a series of strings with blanks S’1,S’2,…,S’k such that:
– |S’1|=|S’2|=…=|S’k|
– S’j is an extension of Sj obtained by insertion of blanks.

Example
AGT..CTT.ACGCG
AGTAGCTT...GCG
..TAGC.T..GGCG
.CTA.C.TAACCCG
ACTA...TAAC...

Example
[Figure slide]

Sum of Pairs
• The score of an alignment column $m_i$ is the sum of pairwise distances between all pairs of sequences, for some scoring matrix s:
$S(m_i) = \sum_{k < l} s(m_i^k, m_i^l)$
• This assumes not only that the alignment of each column is independent, but also that each pair of sequences is.
– Each sequence is scored as if descended from k-1 sequences instead of one common ancestor.

Calculation of Multiple Alignment
• The optimal alignment can be calculated exactly using k-dimensional dynamic programming.
– Space complexity $O(n^k)$
– Time complexity $O(2^k n^k)$
• A heuristic program called ClustalW quickly finds a good multiple alignment.

Creating a PSSM
• After aligning the sequences we see that there are some conserved regions.
• We use the multiple alignment of BLAST results to create a Position Specific Scoring Matrix (PSSM).
• This matrix represents information from a whole family; it is stricter in highly conserved regions.
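
To make the position-specific scoring matrix concrete, here is a minimal Python sketch that builds one from the example multiple alignment shown earlier, using column-wise log-odds scores. It is a simplified illustration under stated assumptions, not PSI-BLAST's actual procedure: the uniform background distribution, the fixed pseudocount of 1, and the names `build_pssm` and `score_window` are all choices made for this example, and sequence weighting is omitted.

```python
import math
from collections import Counter


def build_pssm(alignment, alphabet="ACGT", background=None,
               pseudocount=1.0, gap="."):
    """One dict of log-odds scores per alignment column.
    Assumptions: uniform background unless given, fixed pseudocount,
    gap characters are ignored when counting."""
    if background is None:
        background = {a: 1.0 / len(alphabet) for a in alphabet}
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    pssm = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment if row[col] != gap)
        total = sum(counts[a] for a in alphabet) + pseudocount * len(alphabet)
        pssm.append({
            a: math.log2(((counts[a] + pseudocount) / total) / background[a])
            for a in alphabet
        })
    return pssm


def score_window(pssm, window):
    """Gapless score of a sequence window against the matrix."""
    return sum(column[ch] for column, ch in zip(pssm, window))


alignment = [
    "AGT..CTT.ACGCG",
    "AGTAGCTT...GCG",
    "..TAGC.T..GGCG",
    ".CTA.C.TAACCCG",
    "ACTA...TAAC...",
]
pssm = build_pssm(alignment)
print(round(pssm[2]["T"], 3), round(pssm[2]["A"], 3))  # fully conserved T column
print(score_window(pssm, "AGTAGCTTAACGCG"))
```

In the fully conserved third column, T receives a positive score and every other letter a negative one, which is the sense in which the matrix is stricter in highly conserved regions. In an iterated search, a matrix like this would stand in for the query in the next round.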