BLAST Workshop
Maya Schushan
November 2011

Workshop OUTLINE
Part 1:
• Introduction and motivation
• How does BLAST work?
• BLAST programs
• Sequence databases
• MSA programs
Part 2:
• Work Steps
• HANDS-ON
• Applications: ConSurf

Why BLAST? Finding homologous sequences
• Homology: similarity between sequences that results from a common ancestor.
• Sequences that look alike probably have the same function and structure.
• Sequences that are more than 25% alike for proteins, or 70% for nucleotides, will be considered homologous.
• Use a sequence as a search query in order to find homologous sequences in a database.
• Save time! Exploit the knowledge you have about your homologues to draw conclusions about your query.
• Identify sequence motifs.
• Find out which regions are evolutionarily conserved, and therefore important for function and/or structure.
• Construct phylogenetic trees to understand the evolution of the sequence's family.
• Find out if your protein sequence has a structure (or a close homologue has one).

How does BLAST work?

What Is An Alignment?
Before we can understand how BLAST works, we first have to understand the principles of sequence alignment.
• Comparing 2 (pairwise) or more (multiple) sequences.
• Searching for a series of identical or similar characters in the sequences.

VLSPADKTNVKAAWAKVGAHAAGHG
||| |    |   |||| |  ||||
VLSEAEWQLVLHVWAKVEADVAGHG

• A process of lining up 2 or more sequences to achieve the maximum level of identity, in order to find homologies.

TCATG
CATTG   ?

TCATG          TCATG
CATTG    or     CATTG
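As a rough illustration of "lining up to achieve the maximum level of identity", the sketch below slides one sequence along the other without gaps and counts identical characters at each offset. The function names are hypothetical, and real alignment programs also allow gaps and use scoring matrices, so this is only a toy version of the idea.

```python
def identities_at_offset(top, bottom, offset):
    """Count identical characters when `bottom` is shifted right by `offset`
    positions relative to `top` (a negative offset shifts it left)."""
    count = 0
    for i, ch in enumerate(bottom):
        j = i + offset  # position in `top` that faces bottom[i]
        if 0 <= j < len(top) and top[j] == ch:
            count += 1
    return count


def best_offset(top, bottom):
    """Return the ungapped offset that gives the most identities."""
    offsets = range(-(len(bottom) - 1), len(top))
    return max(offsets, key=lambda off: identities_at_offset(top, bottom, off))


if __name__ == "__main__":
    for off in (0, 1):
        print(off, identities_at_offset("TCATG", "CATTG", off))
    print("best offset:", best_offset("TCATG", "CATTG"))
```

For the TCATG / CATTG example, placing the sequences directly under each other gives fewer identities than shifting the second sequence by one position.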
General Alignment Methodology
S = ACTG
T = AGT

S' = AC_TG     S' = ACTG     S' = ACTG
T' = A_GT_     T' = AGT_     T' = _AGT

Good: identical characters (match).
Bad: different characters (mismatch); gap (InDel).
• Each pair of characters gets a value, depending on its identity.
• The similarity score of the alignment is the sum of the pair values.

Example: Aligning Two Globins
Human Hemoglobin (HH):       VLSPADKTNVKAAWGKVGAHAGYEG
Sperm Whale Myoglobin (SWM): VLSEGEWQLVLHVWAKVEADVAGHG

No gaps:
(HH)  VLSPADKTNVKAAWGKVGAHAGYEG
(SWM) VLSEGEWQLVLHVWAKVEADVAGHG
• Percent identity: 36
• Percent similarity: 40

With gaps (gaps: 2):
(HH)  VLSPADKTNVKAAWGKVGAH-AGYEG
(SWM) VLSEGEWQLVLHVWAKVEADVAGH-G
• Percent identity: 45.833 (instead of 36 without gaps)
• Percent similarity: 54.167 (instead of 40 without gaps)

Alignment Scoring
1. Assume an independent mutation model
2. Score at each position
   – Positive if the same/similar
   – Negative if different or gap
3. The score of an alignment is the sum of the position scores

Scoring Matrix
• An n × n matrix: n = 4 for DNA, n = 20 for proteins
• Each matrix entry defines the score for observing the two letters in the alignment
   – Positive if the pair is likely to be aligned
   – Negative otherwise

       A    G    C    T
  A    1
  G   -5    1
  C   -5   -5    1
  T   -5   -5   -5    1

DNA scoring matrices

  From \ To    A    G    C    T
  A            2   -4   -6   -6
  G           -4    2   -6   -6
  C           -6   -6    2   -4
  T           -6   -6   -4    2

Match: 2. Transition (A<->G, C<->T): -4. Transversion: -6.

Protein scoring matrices
• Observation: some substitutions are more frequent than others, e.g., between chemically similar amino acids
• As for DNA, protein matrices define the probabilities of change between the different amino acids
• Popular matrices are based on empirical data: PAM & BLOSUM
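The percent-identity figures for the two globins and the "sum of pair values" scoring described above can be sketched as follows. The function names are hypothetical; the pair scores follow the DNA match/transition/transversion matrix from the slides, while the linear gap penalty of -6 is an assumption made only for this illustration (the slides give no gap value at this point).

```python
PURINES = {"A", "G"}  # the remaining DNA letters, C and T, are pyrimidines


def percent_identity(a, b, gap="-"):
    """Percent of identical pairs among the columns that contain no gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def dna_pair_score(x, y, match=2, transition=-4, transversion=-6):
    """Score one DNA pair: match, transition (A<->G, C<->T) or transversion."""
    if x == y:
        return match
    same_class = (x in PURINES) == (y in PURINES)
    return transition if same_class else transversion


def alignment_score(a, b, gap="-", gap_penalty=-6):
    """Sum of pair values; gap_penalty is an assumed, illustrative value."""
    total = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            total += gap_penalty
        else:
            total += dna_pair_score(x, y)
    return total


if __name__ == "__main__":
    print(percent_identity("VLSPADKTNVKAAWGKVGAHAGYEG",
                           "VLSEGEWQLVLHVWAKVEADVAGHG"))
    print(percent_identity("VLSPADKTNVKAAWGKVGAH-AGYEG",
                           "VLSEGEWQLVLHVWAKVEADVAGH-G"))
    print(alignment_score("ACTG", "AGT-"))
```

Gap columns are excluded from the percent-identity denominator here, which is the convention that reproduces the 36 and 45.833 figures on the slide.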
PAM Matrices
• PAM matrices are based on sequences with at least 85% identity.
• The changes are "accepted" by natural selection.
• 1 PAM unit: the probability of 1 point mutation per 100 residues.
• Multiplying PAM1 by itself gives higher PAM matrices that are suitable for larger evolutionary distances.

BLOSUM Matrices
• Based on the BLOCKS database.
• Low BLOSUM numbers for distant sequences, high BLOSUM numbers for similar sequences.
• BLOSUMn is based on sequences that share at least n percent identity. Generally:
   – BLOSUM62 for general use
   – BLOSUM80 for close relations
   – BLOSUM45 for distant relations

Protein scoring matrices

  Closer sequences     PAM100  =  BLOSUM90
                       PAM120  =  BLOSUM80
                       PAM160  =  BLOSUM60
                       PAM200  =  BLOSUM52
  Distant sequences    PAM250  =  BLOSUM45

Scoring
• The final score of the alignment is the sum of the positive scores and the penalty scores:

  + Number of identities        (scoring matrix)
  + Number of similarities      (scoring matrix)
  - Number of gap insertions    (gap penalties)
  - Number of gap extensions    (gap penalties)
  = Alignment score

BLAST (Basic Local Alignment Search Tool)
• Goal: a fast search for homologues in a huge database.
• The underlying hypothesis: when two sequences are similar, there are short ungapped regions of high similarity between them.
• The heuristic:
   1. Discard irrelevant sequences.
   2. Perform exact local alignment only with the remaining sequences.

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215: 403-410
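The scoring slide above separates gap insertions from gap extensions. A minimal sketch of such a score is given below, assuming a toy +1/-1 pair score in place of a real PAM or BLOSUM matrix and illustrative penalty values; the function names and the convention that the first gap position is charged `gap_open` and every further position `gap_extend` are assumptions of this example, not values taken from the slides.

```python
def simple_pair_score(x, y):
    """Toy stand-in for a scoring matrix: +1 for identity, -1 otherwise."""
    return 1 if x == y else -1


def affine_gap_score(a, b, pair_score=simple_pair_score,
                     gap_open=-11, gap_extend=-1, gap="-"):
    """Score an existing alignment with separate penalties for opening a gap
    and for each additional position that extends it (assumed values)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    prev_gap_in = None  # which sequence held the gap in the previous column
    for x, y in zip(a, b):
        if x == gap and y == gap:
            raise ValueError("a column cannot contain two gaps")
        if x == gap or y == gap:
            where = "a" if x == gap else "b"
            total += gap_extend if where == prev_gap_in else gap_open
            prev_gap_in = where
        else:
            total += pair_score(x, y)
            prev_gap_in = None
    return total


if __name__ == "__main__":
    print(affine_gap_score("ACGTTA", "AC--TA"))  # one gap of length 2
    print(affine_gap_score("ACGTTA", "A-G-TA"))  # two gaps of length 1
```

With these assumed penalties, one gap of length two costs less than two separate gaps of length one, which is the behaviour that distinguishing insertions from extensions is meant to produce.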
Searching a sequence database
• Idea: in order to find sequences homologous to a sequence of interest, one should compute its pairwise alignment against all known sequences in a database, and detect the best-scoring significant homologs.
• Query sequence: the sequence with which we are searching.
• Hit: a sequence found in the database, suspected to be homologous.

The parameters
• W: Word size. Find W-mers in target/query; 2-3 for amino acids, 6-11 for nucleotides.
• T: Threshold. Focus on pairs scoring > T; usually 11-13.
• X: Drop-off. Stop extending when the loss is > X.
• S: Score. The final score of the segment pair.

The algorithm:
1. Align a query sequence with the database.
2. Find "hits": short word pairs of length W with an ungapped alignment score of at least T.
3. Extend the alignments until the score drops more than X below the best score so far. This step consumes most of the processing time (>90%).

How do we discard irrelevant sequences quickly?
• Divide the database into words of length w (default: w = 3 for protein and w = 7 for DNA).
• Save the words in a look-up table that can be searched quickly.

WTDFGYPAILKGGTAC
WTD TDF DFG FGY GYP …

BLAST: discarding sequences
• When the user enters a query sequence, it is also divided into words.
• Search the database for consecutive neighboring words.
• Neighbor words are defined according to a scoring matrix (e.g., BLOSUM62 for proteins) with a certain cutoff level.

GFC (20)   GFB   GPC (11)   WAC (5)

Look for a seed: hits on the same diagonal which can be connected
• At least 2 hits on the same diagonal, with a distance smaller than a predetermined cutoff.
• This is the filtering stage: many unrelated hits are filtered out, saving lots of time!
[Figure: neighbor-word hits plotted with the query on one axis and the database record on the other]
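The word look-up, same-diagonal filtering and drop-off extension described above can be sketched roughly as follows. This is a simplified illustration rather than the actual BLAST implementation: it matches exact words only (no neighbor words or threshold T), extends only to the right, uses a toy +1/-1 score instead of BLOSUM62, and all function names and parameter values are hypothetical.

```python
from collections import defaultdict


def words(seq, w):
    """All overlapping words of length w, with their start positions."""
    return [(i, seq[i:i + w]) for i in range(len(seq) - w + 1)]


def build_lookup(db_seq, w):
    """Look-up table: word -> list of start positions in the database record."""
    table = defaultdict(list)
    for pos, word in words(db_seq, w):
        table[word].append(pos)
    return table


def find_hits(query, table, w):
    """Exact word hits as (query position, database position) pairs."""
    return [(qpos, dpos)
            for qpos, word in words(query, w)
            for dpos in table.get(word, [])]


def two_hit_seeds(hits, max_distance):
    """Pairs of hits on the same diagonal that lie within max_distance."""
    by_diag = defaultdict(list)
    for qpos, dpos in hits:
        by_diag[dpos - qpos].append(qpos)
    seeds = []
    for diag, positions in by_diag.items():
        positions.sort()
        for first, second in zip(positions, positions[1:]):
            if 0 < second - first <= max_distance:
                seeds.append((diag, first, second))
    return seeds


def extend_right(query, db_seq, qpos, dpos, w, x_drop, match=1, mismatch=-1):
    """Ungapped extension to the right of a word hit; stops once the running
    score falls more than x_drop below the best score seen so far.
    Returns (best score, length of the best segment)."""
    score = best = w * match  # an exact word hit contributes w matches
    best_len = w
    i = w
    while qpos + i < len(query) and dpos + i < len(db_seq):
        score += match if query[qpos + i] == db_seq[dpos + i] else mismatch
        if score > best:
            best, best_len = score, i + 1
        elif best - score > x_drop:
            break
        i += 1
    return best, best_len


if __name__ == "__main__":
    db = "WTDFGYPAILKGGTAC"
    table = build_lookup(db, 3)
    hits = find_hits("DFGYPA", table, 3)
    print(hits)
    print(two_hit_seeds(hits, max_distance=3))
    print(extend_right("DFGYPAILQQQQ", db, 0, 2, w=3, x_drop=2))
```

Grouping hits by the difference between database and query positions is what places them on a common diagonal; only diagonals that collect at least two nearby hits are passed on to the more expensive extension step.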
Try to extend the alignment
• Stop extending when the score of the alignment drops X beneath the maximal score obtained so far.
• Discard segments with score < S.

ASKIOPLLWLAASFLHNEQAPALSDAN
JWQEOPLWPLAASOIHLFACNSIFYAS
Score=15   Score=17   Score=14

The result: local alignment
• The result of BLAST will be a series of local alignments between the query and the different hits found.

E-value
• To assess the bit score we calculate the E-value: the expected number of HSPs with a score of at least S:

  $E = K M N e^{-\lambda S}$

  where M and N are the lengths of the query and of the database, and K and λ are statistical parameters of the scoring system.
• For each score S there is a specific E-value. The smaller the E-value, the better the score.

Theoretically, we could trust any result with an E-value ≤ 1. In practice, BLAST uses estimations.
• E-values of 10^-4 and lower indicate a significant homology.
• E-values between 10^-4 and 10^-2 should be checked (similar domains, maybe non-homologous).
• E-values between 10^-2 and 1 do not indicate a good homology.

PSI-BLAST
Step 1:
1. Run a standard protein-protein BLAST search (BLOSUM62).
2. Build a position-specific scoring matrix (PSSM) according to the MSA of the alignment results with a low E-value.
Step 2:
1. Run a BLAST search using the PSSM to evaluate the alignment: PSSM vs. DB instead of seq vs. DB.
2. Update the PSSM according to the new results.
3. Go back to the beginning of step 2, or stop.

The power of PSI-BLAST:
1. A much more sensitive scoring system: each position has its own pattern of probabilities.
2. Important motifs are bounded.
3. Lowers the level of random noise.
4. Finds distant relatives.

Let's sum up…
- BLAST is a fast way to find homologues.
- There is no analytic theory that estimates the statistical significance of gapped alignments.
- Gap scores have been selected by trial and error.
- There is no guarantee for the gap scores when applying a different scoring matrix.
- PSI-BLAST finds weak homologues fast.

BLAST programs
• All types of searches are possible (query: DNA or protein; database: DNA or protein):
   – blastn: nuc vs. nuc
   – blastp: prot vs. prot
   – blastx: translated query vs. protein database
   – tblastn: protein vs. translated nuc. DB
   – tblastx: translated query vs. translated database

Amino acid sequence: most suitable for homology search
• The amino acid sequence is more conserved.
• 20-letter alphabet: two random hits share 5% identity on average (compared with 25% for DNA sequences).
• Protein comparison matrices are more sensitive.
• Protein databases are smaller: fewer random hits.
• Proteins are much more relevant for inferring structural data.

Sequence databases
Where do we want to search?

DNA sequences:
• NR: all GenBank + EMBL + DDBJ + PDB sequences. No longer "non-redundant" due to computational cost.
• Genomes of specific organisms.
• RefSeq: mRNA or genomic; an annotated collection from the NCBI Reference Sequence Project.
• EMBL: Europe's primary nucleotide sequence resource (EBI).
• ….

Protein databases:
• UniProt: SwissProt or TrEMBL.
• UniRef90: clustered UniRef using 90% identity.
• PDB: the sequences of proteins for which structures are available.
• NR (non-redundant): non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF, excluding those in env_nr.
• RefSeq: sequences from the NCBI Reference Sequence Project.
• Proteins of specific organisms.
UniProt
• UniProt is a collaboration between the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR).
• In 2002, the three institutes decided to pool their resources and expertise and formed the UniProt Consortium.

UniProt (UniRef90)
The world's most comprehensive catalog of information on proteins: sequence, function & more…
Comprised mainly of the databases:
• SwissProt: ??? protein entries. High-quality annotation, non-redundant & cross-referenced to other databases.
• TrEMBL: ??? protein entries. Translation of genetic information from the EMBL Nucleotide Database; most proteins are poorly annotated, since they are annotated only automatically.

After BLAST
We ran BLAST & found homologues (HSTs). What next?

MSA
Multiple Sequence Alignment (MSA)
• Perform alignment of a large collection of sequences.
• Many algorithms; the leading ones:
   1. ClustalW
   2. MUSCLE
   3. MAFFT

Pairwise vs. Multiple Sequence Alignment
Alignments help to analyze sequence data: organize, visualize.
• Pairwise: for 2 sequences
• MSA: for more than 2 sequences
[Figure: an example pairwise alignment next to an example multiple alignment]

ClustalW: Introduction
• This heuristic approach works because it uses the biological meaning of MSA.
• Based on the idea that the sequences we usually want to align are phylogenetically related.
• The first program to implement progressive MSA.
• Was introduced in 1994 and is still used today.
Thompson, J.D. et al., 1994

ClustalW: Progressive Alignment
1. Quick pairwise alignment: calculate the distance matrix.

                   1    2    3    4    5
   1  Hbb_Human    -
   2  Hbb_Horse   17    -
   3  Hba_Human   59   60    -
   4  Hba_Horse   59   59   13    -
   5  Myg_Whale   77   77   75   75    -

2. Build a guide tree using the NJ phylogenetic method.
   [Figure: guide tree with leaves Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse, Myg_Whale]
3. Progressive alignment following the guide tree.

ClustalW: Progressive Alignment

        A    B    C    D
   A    -    -    -    -
   B    1    -    -    -
   C    7    8    -    -
   D   11    5    2    -

[Figure: guide tree with leaves A, B, C, D]

ClustalW: Problems
ClustalW tries to find global alignments, not local ones. This causes problems with:
• Sequences that are similar only in some smaller regions.
• A sequence that contains a large insertion compared to the rest.
• A sequence that contains a repetitive element, while another sequence contains only one copy.

MUSCLE: Introduction
• The most recent popular MSA software.
• Considered to be the most accurate MSA software available today.
• The basic idea: progressive alignment.
Edgar, R.C., 2004

MUSCLE: Innovations
Faster:
• Faster distance estimation between the input sequences.
• Faster construction of an evolutionary tree (UPGMA instead of the NJ used in ClustalW).
More accurate:
• Applying a new score function to the profile alignments.
• Refinement of the initial results.
Edgar, R.C., 2004

MUSCLE: It's Even More Complicated…

MAFFT
• Implements the Fast Fourier Transform (FFT) to optimize protein alignments based on physical properties of the amino acids.
• Iterative progressive alignment and iterative refinement.
• Useful for hard-to-align sequences, such as those containing large gaps.
• Different algorithms for different problems:
   – Number and length of sequences
   – RNA vs. protein

Comparison of MSA methods
• No method is superior for all cases: trial and error! (Essoussi et al., 2008)
• But MAFFT and ProbCons are very good… (Nuin et al., 2006)
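The guide-tree step of progressive alignment can be sketched on the small A, B, C, D distance matrix from the ClustalW slides. Note the substitution: this example uses UPGMA (average linkage), which the slides attribute to MUSCLE, rather than the NJ method used by ClustalW, because it is shorter to write down. The function names and the nested-tuple tree representation are choices made for this illustration.

```python
from itertools import combinations


def average_distance(members_1, members_2, dist):
    """Mean pairwise distance between the leaves of two clusters."""
    total = sum(dist[frozenset((a, b))] for a in members_1 for b in members_2)
    return total / (len(members_1) * len(members_2))


def upgma_guide_tree(names, dist):
    """Repeatedly merge the two closest clusters; returns a nested tuple.
    `dist` maps frozenset({a, b}) to the distance between leaves a and b."""
    clusters = [(name, [name]) for name in names]  # (subtree, leaf members)
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_distance(
                clusters[ij[0]][1], clusters[ij[1]][1], dist),
        )
        tree_i, members_i = clusters[i]
        tree_j, members_j = clusters[j]
        merged = ((tree_i, tree_j), members_i + members_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


if __name__ == "__main__":
    # Distances taken from the A, B, C, D matrix on the slide.
    dist = {
        frozenset("AB"): 1, frozenset("AC"): 7, frozenset("BC"): 8,
        frozenset("AD"): 11, frozenset("BD"): 5, frozenset("CD"): 2,
    }
    print(upgma_guide_tree(["A", "B", "C", "D"], dist))
```

A progressive aligner would then follow this tree from the leaves upward: align A with B, align C with D, and finally align the two resulting profiles with each other.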