BLAST Introduction - CSE - University of South Carolina

Bioinformatics Algorithms and Data Structures
BLAST
Lecturer: Dr. Rose
BLAST Slides: Adaptation of Nir Friedman’s slides from the Computational Methods in Molecular Biology course (Spring 2001) at Hebrew University, Jerusalem, Israel
February 21, 2007
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology

BLAST
Q: What is BLAST?
A: Uhmmm, actually no, BLAST is an acronym:
A: Basic Local Alignment Search Tool - a set of similarity search programs designed to explore all of the available sequence databases, regardless of whether the query is protein or DNA.
You can find it at: http://www.ncbi.nlm.nih.gov/BLAST/

BLAST
• Q: Why do you care?
• A: Because you are going to do a project.
• U51112: Membrane protein that transports sodium and hydrogen
• J03581: Tyrosinase... people lacking this are albino
• NM_000245: MET, an oncogene... mutations in this cause cancer
• NM_010849: MYC, another oncogene
• NM_007409: Alcohol dehydrogenase... good to have when drinking
• NM_002475: Myosin... one of the muscle proteins
• XM_086788: Crystallin, the major protein in the lens
• M30047: Myelin basic protein... protects the neurons
• NM_000518: Hemoglobin, oxygen-carrying protein in RBC
• NM_000477: Albumin, major serum protein... does a lot of things
• NM_008476: Keratin, skin and integument protein

BLAST
• BLAST is designed to efficiently find alignments of a target string s against large databases
– Motivation: increase speed by finding fewer and better hot spots.
– Idea: Find high-scoring matches using a substitution matrix rather than exact matches.
– We are still searching only for gapless matches.

High-Scoring Pair
• Two strings s and t are a high-scoring pair (HSP) if d(s,t) > T, where d(s,t) is the alignment score and T is a threshold.
• Given a query s[1..n], BLAST constructs all words (fixed-length substrings) w such that w scores > T against some k-substring of s
– Each match to such a word in the database is called a hit
• Typical k: 12 for nucleotides, 3-5 for amino acids.

High-Scoring Pair
• Try to extend each such hit to an alignment with maximal score (still with no gaps). Keep all HSPs.
– The threshold is chosen so that a random match with such a score is unlikely.

Finding Potential Matches
We can locate seed words in a large database in a single pass:
• Construct an FSA (finite-state automaton) that recognizes seed words
• Use hashing techniques to locate matching words

Extending Potential Matches
• Once a seed is found, BLAST attempts to find a local alignment that extends the seed
• Seeds on the same diagonal are combined (as in FASTA)
[Figure with axes s and t]

Which programs are used?
• Originally BLAST did not allow gaps.
– Now people use gapped BLAST.
– Gapped BLAST joins different diagonals.
• For proteins BLAST is superior.
• For nucleotides FASTA is better.
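
The seed-and-extend idea from the High-Scoring Pair, Finding Potential Matches, and Extending Potential Matches slides can be sketched in a few lines of Python. This is a toy illustration, not BLAST's actual implementation: the +5/-4 match/mismatch scores, the `drop` cutoff used to stop extension, and all function names are assumptions made for the example, and a Python dict stands in for the hashing step.

```python
from itertools import product

# Toy scoring scheme (an assumption for illustration): +5 match, -4 mismatch.
def score(a, b):
    return 5 if a == b else -4

def word_score(u, v):
    """Gapless score of two equal-length words."""
    return sum(score(a, b) for a, b in zip(u, v))

def neighborhood(query, k, T, alphabet="ACGT"):
    """Map every word w that scores > T against some k-substring of the
    query to the list of query offsets where that happens."""
    words = {}
    for i in range(len(query) - k + 1):
        sub = query[i:i + k]
        for letters in product(alphabet, repeat=k):
            w = "".join(letters)
            if word_score(w, sub) > T:
                words.setdefault(w, []).append(i)
    return words

def find_hits(query, target, k, T):
    """Single pass over the target: look each k-mer up in the word table
    (a dict, i.e. hashing) and report (query_pos, target_pos) hits."""
    words = neighborhood(query, k, T)
    hits = []
    for j in range(len(target) - k + 1):
        for i in words.get(target[j:j + k], []):
            hits.append((i, j))
    return hits

def extend(query, target, i, j, k, drop=10):
    """Extend a hit along its diagonal without gaps. Extension in each
    direction stops once the running score falls more than `drop` below
    the best score seen. Returns (best_score, query_start, target_start, length)."""
    best = cur = word_score(query[i:i + k], target[j:j + k])
    right = step = 0
    qi, tj = i + k, j + k
    while qi < len(query) and tj < len(target) and best - cur <= drop:
        cur += score(query[qi], target[tj])
        step += 1
        if cur > best:
            best, right = cur, step
        qi += 1
        tj += 1
    cur = best
    left = step = 0
    qi, tj = i - 1, j - 1
    while qi >= 0 and tj >= 0 and best - cur <= drop:
        cur += score(query[qi], target[tj])
        step += 1
        if cur > best:
            best, left = cur, step
        qi -= 1
        tj -= 1
    return best, i - left, j - left, k + left + right

if __name__ == "__main__":
    q, t = "ACGTACGT", "TTACGTACGTTT"
    for i, j in find_hits(q, t, k=4, T=19):
        print((i, j), extend(q, t, i, j, k=4))
```

Enumerating all |alphabet|^k candidate words per query position is only practical for small k; it is used here to keep the sketch short.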

Review: Unrelated Sequences
• Our model of unrelated sequences is simple
– Each position is sampled independently from a distribution over the alphabet Σ
– We assume there is a distribution q(·) that describes the probability of letters in such positions
• Then:
$$ P(s[1..n], t[1..n] \mid R) = \prod_i q(s[i])\, q(t[i]) $$
• R denotes the assumption that s and t are random unrelated strings

Review: Related Sequences
• We assume that each pair of aligned positions (s[i],t[i]) evolved from a common ancestor
• Let p(a,b) be a distribution over pairs of letters.
• p(a,b) is the probability that some ancestral letter evolved into this particular pair of letters
$$ P(s[1..n], t[1..n] \mid M) = \prod_i p(s[i], t[i]) $$
• Here M denotes the assumption that s and t are related strings.

Review: Ratio Test for Alignment
• Taking the logarithm of the likelihood ratio, we get
$$ \log\frac{P(s,t \mid M)}{P(s,t \mid R)} = \log\prod_i \frac{p(s[i], t[i])}{q(s[i])\, q(t[i])} = \sum_i \log\frac{p(s[i], t[i])}{q(s[i])\, q(t[i])} $$

Review: Probabilistic Interpretation of Scoring Rule
• If we take
$$ \sigma(a,b) = \log\frac{p(a,b)}{q(a)\, q(b)} $$
• then the score of an alignment is the log-ratio between the two models:
– Score > 0 ⇒ M (related) is more “probable”
– Score < 0 ⇒ R (unrelated) is more “probable”

Problems with Scoring Rule
When searching for an optimal alignment in a big database, a number of problems arise with this simple scheme.
• We are assuming P(M)=P(R), which assumes there are equal numbers of related and unrelated sequences in the database.
• When searching through a big database, there is a high probability that some unrelated sequence will receive a high score
• When searching for an optimal local alignment, we have many possible starting points, which heavily biases the score towards indicating a related sequence.

Prior Probability on the models
• What we really wish to calculate is:
$$ P(M \mid s,t) = \frac{P(s,t \mid M)\, P(M)}{P(s,t)} $$
• The log score being:
$$ \log\frac{P(M \mid s,t)}{P(R \mid s,t)} = \log\frac{P(s,t \mid M)\, P(M)}{P(s,t \mid R)\, P(R)} = \log\frac{P(s,t \mid M)}{P(s,t \mid R)} + \log\frac{P(M)}{P(R)} $$

Prior Probability on the models
• Our threshold should be set by the prior term:
$$ \log\frac{P(M)}{P(R)} $$
• M is the more probable model when the alignment score plus this term is positive, that is, when the score exceeds −log(P(M)/P(R)) = log(P(R)/P(M)).

The Hazard of Large Databases
• Define
$$ p = P(d(s,t) > \theta \mid R) $$
• This is the probability that two unrelated sequences will match with score > θ by chance
• Assume there are N strings in our database
• Assuming that they are independent of each other, and all are unrelated to s, we have
$$ P(\max_t d(s,t) > \theta) = 1 - (1-p)^N \approx 1 - e^{-Np} $$

The Hazard of Large Databases
[Figure: curves f(x, p) for p = 0.001, 0.0001, 0.00001, 0.000001; x axis from 0 to 100000, y axis from 0 to 1]

Local Matching
• Question: Which local alignment query is expected to give a higher score:
– To a short sequence?
– To a long sequence?
• A local match can begin at any of the nm entries in the DP matrix.
• The score is the maximum over all these starting points.
• If all starting points were independent, we would need to calculate the probability of attaining such a score in nm trials.
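
To get a feel for the formula on the Hazard of Large Databases slide, the following Python sketch evaluates the exact expression 1 - (1 - p)^N and the approximation 1 - e^{-Np} for a few values of p and N. The function names and the chosen database sizes are assumptions for illustration only.

```python
import math

def prob_false_hit(p, N):
    """Exact probability that at least one of N independent unrelated
    sequences scores above the threshold: 1 - (1 - p)^N."""
    return 1.0 - (1.0 - p) ** N

def prob_false_hit_approx(p, N):
    """Approximation 1 - exp(-N * p), accurate when p is small."""
    return 1.0 - math.exp(-N * p)

if __name__ == "__main__":
    for p in (1e-3, 1e-4, 1e-5, 1e-6):
        row = [round(prob_false_hit(p, N), 3) for N in (1_000, 10_000, 100_000)]
        print(f"p = {p:g}: {row}")
```

Even a per-sequence chance of 1 in 10,000 gives a better-than-even probability of at least one spurious high score once the database holds 10,000 unrelated sequences, which is why the threshold has to grow with database size.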

Score Significance - FASTA
• How meaningful is a score?
• Calculate the distributions of unrelated scores and related scores
• Under reasonable assumptions, the scores for ungapped alignments behave according to the Extreme Value Distribution.

Extreme Value Distribution (BLAST)
• We ask the following question: given a database of size n and a query sequence of size m, what is the expected number of hits with score at least S? This number is called the E-score (E-value):
$$ E(S) = K\, m\, n\, e^{-\lambda S} $$
• The number of such hits is approximately Poisson distributed with mean E(S).
• K corrects for the dependencies
• λ depends on the scoring matrix
• Doubling n (the database size) or m (the query length) doubles the expectation
• Increasing S, the score, causes E() to decrease exponentially

Blast P-value
• Recall the Poisson distribution:
– The probability of finding no hits with a score ≥ S is
$$ e^{-E} $$
– Therefore the probability of finding at least one hit with score ≥ S is
$$ 1 - e^{-E} $$
– This is called the P-value.

A Typical GenBank entry

Sequence Information

The Sequence

BLAST programs
• BLASTN - Nucleotide query searching a nucleotide database.
• BLASTP - Protein query searching a protein database.
• BLASTX - Translated nucleotide query sequence (6 frames) searching a protein database.
• TBLASTN - Protein query searching a translated nucleotide (6 frames) database.
• TBLASTX - Translated nucleotide query (6 frames) searching a translated nucleotide (6 frames) database.

BLAST Search

BLAST Output
• List of hits
– Database accession codes, name, description.
– Score in bits (usually >30 bits is significant)
– Expectation value E()
• For each hit
– A header including hit name, description, length
– Each hit may contain several HSPs
– Score and expectation value
– How many identical residues
– How many residues contributing positively to the score
• The local alignment itself

BLAST Output

PSI-BLAST (Position Specific Iterated)
• BLAST provides a new automatic “profile like” search.
• Iterative procedure:
– Perform BLAST on database.
– Use significant alignments to construct a “position specific” score matrix.
– This matrix replaces the query sequence in the next round of database searching.
• The program may be iterated until no new significant alignments are found.
• Most commonly used search method today.

Multiple Alignment
• Proteins can be classified into families:
– Common structure.
– Common function.
– Common evolutionary origin.
• For a set of sequences belonging to some family
– Each pair has some differences
– But there are some common motifs in almost all sequences of the family
• A multiple alignment carries more information than a pairwise alignment

Protein Families
• Consider Zinc Fingers:
• All have the same function:
– Bind to DNA
• All have similar structure
• They constitute a Protein Family
• In a protein family some parts of the sequence (the functional parts) are more conserved than others.

Definition
A multiple alignment of strings S1,S2,…,Sk is a series of strings with blanks S’1,S’2,…,S’k such that:
– |S’1|=|S’2|=…=|S’k|
– S’j is an extension of Sj obtained by insertion of blanks.

Example
AGT..CTT.ACGCG
AGTAGCTT...GCG
..TAGC.T..GGCG
.CTA.C.TAACCCG
ACTA...TAAC...

Example

Sum of Pairs
• The sum of pairwise distances between all pairs of sequences for some scoring matrix, where m_i is column i of the alignment and m_i^k is the symbol of sequence k in that column:
$$ S(m_i) = \sum_{k<l} s(m_i^k, m_i^l) $$
• This assumes not only that the columns of the alignment are independent, but also that each pair of sequences is independent.
– Each sequence is scored as if descended from k-1 sequences instead of one common ancestor.

Calculation of Multiple Alignment
• The optimal alignment can be calculated exactly using k-dimensional dynamic programming.
– Space complexity O(n^k)
– Time complexity O(2^k n^k)
• A heuristic program called ClustalW quickly finds a good multiple alignment.

Creating a PSSM
• After aligning the sequences we see that there are some conserved regions.
• We use the multiple alignment of BLAST results to create a Position Specific Scoring Matrix (PSSM).
• This matrix represents information from a whole family; it is stricter in highly conserved regions.
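
As a rough illustration of the Creating a PSSM slide, the sketch below builds a position-specific log-odds matrix from the columns of a multiple alignment and scores a window against it. It is a simplified toy, not PSI-BLAST's actual procedure: the uniform background distribution, the add-one pseudocount, the treatment of gaps (ignored), and the function names `build_pssm` and `score_window` are all assumptions for the example.

```python
import math

def build_pssm(alignment, alphabet="ACGT", pseudocount=1.0, background=None):
    """Column-wise log-odds matrix from equal-length aligned rows.
    Characters outside the alphabet (such as '.' or '-') are ignored."""
    if background is None:
        # Assumed uniform background; a real search would use observed frequencies.
        background = {a: 1.0 / len(alphabet) for a in alphabet}
    length = len(alignment[0])
    assert all(len(row) == length for row in alignment), "rows must be aligned"
    pssm = []
    for col in range(length):
        counts = {a: pseudocount for a in alphabet}
        for row in alignment:
            c = row[col]
            if c in counts:
                counts[c] += 1
        total = sum(counts.values())
        pssm.append({a: math.log2((counts[a] / total) / background[a])
                     for a in alphabet})
    return pssm

def score_window(pssm, window):
    """Sum of per-column log-odds scores for a gapless window."""
    return sum(col[c] for col, c in zip(pssm, window))

if __name__ == "__main__":
    rows = [
        "AGT..CTT.ACGCG",
        "AGTAGCTT...GCG",
        "..TAGC.T..GGCG",
        ".CTA.C.TAACCCG",
        "ACTA...TAAC...",
    ]
    pssm = build_pssm(rows)
    # Column 3 (index 2) is dominated by T, so T scores highest there.
    print({a: round(v, 2) for a, v in pssm[2].items()})
```

Columns where one letter dominates give that letter a high score and every other letter a strongly negative one, which is the sense in which the matrix is stricter in conserved regions.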
