BCB 444/544 F07 ISU Dobbs #11 - MSAs; PSSMs & Psi-BLAST 9/17/07

BCB 444/544
Lecture 12
Multiple Sequence Alignment (MSA)
PSSMs & Psi-BLAST
#12_Sept17

Required Reading (before lecture)
√ Mon Sept 17 - Lecture 12
    Position Specific Scoring Matrices & PSI-BLAST
    • Chp 6 - pp 75-78 (but not HMMs)
Wed Sept 19 - Lecture 13 (not covered on Exam 1)
    Hidden Markov Models
    • Chp 6 - pp 79-84
    • Eddy: What is a hidden Markov Model? 2004 Nature Biotechnol 22:1315
      http://www.nature.com/nbt/journal/v22/n10/abs/nbt1004-1315.html
Fri Sept 21 - EXAM 1

Assignments & Announcements
Sun Sept 16 - Study Guide for Exam 1 was posted
Mon Sept 17 - Answers to HW#2 will be posted ~ Noon
Thu Sept 20 - Lab = Optional Review Session for Exam
Fri Sept 21 - Exam 1 - Will cover:
• Lectures 2-12 (thru Mon Sept 17)
• Labs 1-4
• HW2
• All assigned reading: Chps 2-6 (but not HMMs); Eddy: What is Dynamic Programming?

SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 5 - Multiple Sequence Alignment
• Scoring Function
• Exhaustive Algorithms
• Heuristic Algorithms
• Practical Issues

Overview: Multiple Sequence Alignments
1. What is a multiple sequence alignment (MSA)?
2. Where/why do we need MSA?
3. What is a good MSA?
4. Algorithms to compute an MSA
Credits for slides: Caragea & Brown, 2007; Fernandez-Baca, Heber & Hunter

Multiple Sequence Alignment
• Generalize pairwise alignment of sequences to include > 2 homologous sequences
• Analyzing more than 2 sequences gives us much more information:
    • Which amino acids are required? Correlated?
    • Evolutionary/phylogenetic relationships
• Similar to PSI-BLAST idea (not yet covered in lecture): use a set of homologous sequences to provide more "sensitivity"

Definition: MSA
Given a set of sequences, a multiple sequence alignment is an assignment of gap characters, such that
• resulting sequences have same length
• no column contains only gaps

    ATT-GC      AT-TGC      AT-T-GC
    ATTTGC      ATTTGC      ATTT-GC
    ATTTG       ATTTG-      ATTT-G-
    NO          YES         NO
    (unequal                (column 5 contains
    lengths)                only gaps)

Displaying MSAs: using CLUSTAL W
Residue colors:
    RED:      AVFPMILW  (small)
    BLUE:     DE        (acidic, negative chg)
    MAGENTA:  RHK       (basic, positive chg)
    GREEN:    STYHCNGQ  (hydroxyl + amine + basic)
Conservation symbols:
    *   entirely conserved column
    :   all residues have ~ same size AND hydropathy
    .   all residues have ~ same size OR hydropathy

What is a Consensus Sequence?
A single sequence that represents the most common residue of each column in an MSA
Example:
    FGGHL-GF
    F-GHLPGF
    FGGHP-FG
    FGGHL-GF   (consensus)
Steiner consensus sequence: Given sequences s_1, ..., s_k, find a sequence s* that maximizes \sum_i S(s*, s_i)
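The MSA definition and the majority-residue consensus above can be checked with a short Python sketch. The function names `is_valid_msa` and `consensus` are illustrative choices, not part of the lecture, and the consensus here is the simple column-majority version, not the Steiner consensus.

```python
from collections import Counter

GAP = "-"


def is_valid_msa(rows):
    """Check the two conditions of the MSA definition:
    all rows have the same length, and no column contains only gaps."""
    if len({len(row) for row in rows}) != 1:
        return False
    return all(any(ch != GAP for ch in column) for column in zip(*rows))


def consensus(rows):
    """Return the most common symbol of each column (gaps count as symbols).
    Ties are broken by first occurrence, which is an arbitrary choice."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*rows))


# The three candidate alignments from the definition slide.
print(is_valid_msa(["ATT-GC", "ATTTGC", "ATTTG"]))       # unequal lengths
print(is_valid_msa(["AT-TGC", "ATTTGC", "ATTTG-"]))      # valid
print(is_valid_msa(["AT-T-GC", "ATTT-GC", "ATTT-G-"]))   # all-gap column
print(consensus(["FGGHL-GF", "F-GHLPGF", "FGGHP-FG"]))
```

Counting gaps as ordinary symbols is what lets column 6 of the example come out as "-"; a different convention (ignoring gaps) would report "P" there instead.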
Applications of MSA
• Building phylogenetic trees
• Finding conserved patterns, e.g.:
    • Regulatory motifs (TF binding sites)
    • Splice sites
    • Protein domains
• Identifying and characterizing protein families
    • Find out which protein domains have same function
• Finding SNPs (single nucleotide polymorphisms) & mRNA isoforms (alternatively spliced forms)
• DNA fragment assembly (in genomic sequencing)

Application: Recover Phylogenetic Tree
What was the series of events that led to current species?
[Figure: phylogenetic tree relating the sequences NYLS, NFLS, NYLS]

Application: Discover Conserved Patterns
Is there a conserved cis-acting regulatory sequence?
TATA box = transcriptional promoter element

Goal: Characterize Protein Families
Which parts of globin sequences are most highly conserved?
Rationale: if they are homologous (derived from a common ancestor), they may be structurally equivalent

Databases of Multiple Alignments
• Pfam (Protein Domain Families database)
    • Contains alignments and HMMs of protein families
• InterPro
    • Integrates: Prosite, Prints, ProDom, Pfam, and SMART
• BLOCKS
    • Segments of highly conserved multiple alignments
• Hovergen (Homologous Vertebrate Genes Database)
• COGs (Clusters of Orthologous Groups)
• BaliBASE (Benchmark alignments database)

Scoring an Alignment
Goal: Align homologous positions.
But: Without knowledge of the phylogenetic tree this is very hard (sometimes impossible) to achieve!
[Figure: tree relating the sequences NYLS, NFLS, NYLS]

In practice, simple scoring functions are used: usually, columns are scored independently, i.e.

    S(m) = \sum_i S(m_i) + G

where G is the gap penalty and m_i is the ith column of alignment m.
[Figure: example protein alignment with one column m_i highlighted]

Sum of Pairs (SP) Score
• SP = sum of scores of all possible pairs of sequences in an MSA based on a particular scoring matrix
• Compute for each column:

    S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)

where m_i^l is residue l of column i and s is the PAM or BLOSUM score.

How Score Gaps in MSAs?
Want to align gaps with each other over all sequences. A gap in a pairwise alignment that "matches" a gap in another pairwise alignment should cost less than introducing a totally new gap.
• Possible that a new gap could be made to "match" an older one by adjusting older pairwise alignment
• Change gap penalty near conserved domains of various kinds (e.g. secondary structure elements, hydrophobic regions)

Example: SP Score
Scoring matrix: BLOSUM 60 (excerpt for F, Y, G, D); entries used here: s(F,F) = 5, s(G,G) = 4, s(G,D) = -3
Gap penalty: -8
s(-,-) = 0

         F-G
    m =  F-G
         FYD

    S(m) = S(m_1) + S(m_2) + S(m_3)
         = 3s(F,F) + 2s(-,Y) + s(-,-) + s(G,G) + 2s(G,D)
         = 15 - 16 + 0 + 4 - 6
         = -3
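The SP example above can be reproduced with a few lines of Python. This is a minimal sketch: the names `pair_score`, `sp_score`, and `SLIDE_SCORES` are made up for illustration, and the matrix holds only the three entries the example actually uses (a real run would load a full PAM or BLOSUM matrix).

```python
from itertools import combinations

GAP = "-"

# Only the substitution scores needed for the slide example (assumption:
# the rest of the matrix is not reproduced here).
SLIDE_SCORES = {("F", "F"): 5, ("G", "G"): 4, ("G", "D"): -3}


def pair_score(a, b, matrix, gap_penalty):
    """Score one pair of symbols from the same column."""
    if a == GAP and b == GAP:
        return 0                      # s(-,-) = 0
    if a == GAP or b == GAP:
        return gap_penalty            # residue aligned to a gap
    return matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]


def sp_score(rows, matrix, gap_penalty=-8):
    """Sum-of-pairs score: for every column, add the scores of all pairs k < l."""
    return sum(
        pair_score(a, b, matrix, gap_penalty)
        for column in zip(*rows)
        for a, b in combinations(column, 2)
    )


print(sp_score(["F-G", "F-G", "FYD"], SLIDE_SCORES))  # -3, as on the slide
```

Here the gap cost is folded into the pairwise scores, as in the worked example, rather than kept as a separate G term.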
Overcoming problems with SP scoring
• Use weights to incorporate evolution in sum of pairs scoring:
    • Some pairwise alignments are more important than others
    • e.g., more important to have a good alignment between mouse & human sequences than between mouse & bird
• Assign different weights to different pairwise alignments
• Weight decreases with evolutionary distance

How Compute a Multiple Alignment?
Algorithms for MSA:
• Multidimensional dynamic programming
    • Optimal global alignment (time & space intensive!!!)
• Progressive alignments (Star alignment, ClustalW)
    • Match closely-related sequences first using a guide tree
• Iterative methods
    • Multiple re-building attempts to find best alignment
• Combined local alignments (Dialign)
• Partial order alignment (POA)
• Local alignments
    • Profiles, Blocks, Patterns

Dynamic Programming for MSA
• As with pairwise alignments, multiple sequence alignments can be computed by dynamic programming
[Figure: the DP matrix F grows from 2D (two sequences) to 3D (three sequences)]

Generalized Needleman-Wunsch Algorithm
Given 3 sequences x, y, and z:
Main iteration loop:

    F(i,j,k) = max ( F(i-1, j-1, k-1) + S(x_i, y_j, z_k),
                     F(i-1, j-1, k  ) + S(x_i, y_j, -  ),
                     F(i-1, j  , k-1) + S(x_i, -,   z_k),
                     F(i-1, j  , k  ) + S(x_i, -,   -  ),
                     F(i  , j-1, k-1) + S(-,   y_j, z_k),
                     F(i  , j-1, k  ) + S(-,   y_j, -  ),
                     F(i  , j  , k-1) + S(-,   -,   z_k) )

What Happens to Computational Complexity?
Given k sequences of length n:
• Space for matrix: O(n^k)
• Neighbors/cell: 2^k - 1
• Time to compute SP score: O(k^2)
• Overall runtime: O(k^2 2^k n^k)
Ouch!!!

What's so bad about those exponents? An example: Running Time of DP
• Overall runtime: O(k^2 2^k n^k)

    # sequences    running time
    2              1 second
    3              2 minutes
    4              5 hours
    5              3 weeks
    6              9 years

Sequences: globins (≈ 150 aa)
But: There are fast heuristics.
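The three-sequence recurrence above can be written out as a small Python sketch. It returns only the optimal score (no traceback), and the column score uses toy values (match +1, mismatch -1, residue-gap -2, gap-gap 0) that are assumptions for illustration, not the lecture's scoring matrix. The names `column_score` and `align3_score` are likewise hypothetical.

```python
from itertools import combinations, product

GAP = "-"


def column_score(column, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one column under toy scores (assumed values)."""
    total = 0
    for a, b in combinations(column, 2):
        if a == GAP and b == GAP:
            continue                       # gap-gap pairs score 0
        if a == GAP or b == GAP:
            total += gap
        else:
            total += match if a == b else mismatch
    return total


def align3_score(x, y, z):
    """Optimal global alignment score of three sequences by 3-D dynamic programming."""
    # The 2^3 - 1 = 7 ways to advance in at least one sequence.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    F = {(0, 0, 0): 0}
    cells = product(range(len(x) + 1), range(len(y) + 1), range(len(z) + 1))
    for i, j, k in cells:                  # lexicographic order: predecessors come first
        if (i, j, k) == (0, 0, 0):
            continue
        candidates = []
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue                   # move would leave the matrix
            column = (
                x[pi] if di else GAP,
                y[pj] if dj else GAP,
                z[pk] if dk else GAP,
            )
            candidates.append(F[(pi, pj, pk)] + column_score(column))
        F[(i, j, k)] = max(candidates)
    return F[(len(x), len(y), len(z))]


print(align3_score("AC", "AC", "AC"))  # 6: two columns of three matching pairs each
```

The dictionary `F` has (n+1)^3 entries and each entry inspects up to 7 neighbors, which is the k = 3 case of the O(n^k) space and 2^k - 1 neighbors listed above.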
Progressive Alignment
Heuristic procedure:
1. Align most similar sequences first
2. Add sequences progressively
Often: use guide tree to determine order of alignments
Examples:
• Star alignment
• ClustalW

Guide Tree
Binary tree
• Leaves correspond to sequences
• Internal nodes represent alignments
• Root corresponds to final MSA
[Figure: multiple alignment by adding sequences; guide tree over the sequences ATC, ATG, TCG, TCC, with pairwise alignments at the internal nodes merged into the final MSA at the root]

Star Alignment - will skip for now, come back to this on Wed
Star alignment will NOT be covered on Exam 1

Chp 6 - Profiles & Hidden Markov Models
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 6 Profiles & HMMs
• Position Specific Scoring Matrices (PSSMs)
• PSI-BLAST
• Profiles
• Markov Model & Hidden Markov Model

PSI-BLAST
• Position Specific Iterated BLAST
• Intuition: substitution matrices should be specific to a particular site: penalize alanine→glycine more in a helix
• Basic idea:
    • Use BLAST with high stringency to get a set of closely related sequences
    • Align those sequences to create a new substitution matrix for each position
    • Then use that matrix (iteratively) to find additional sequences
[Figure: PSI-BLAST cycle: query, PSSM, BLAST, sequence database, multiple alignment]

PSI-BLAST pseudocode

    Convert query to PSSM
    do {
        BLAST database with PSSM
        Stop if no new homologs are found
        Add new homologs to PSSM
    }
    Print current set of homologs

Deciding which hits count as new homologs requires a user-defined threshold.

Position-specific scoring matrix - PSSM
• A PSSM is an n by m matrix, where n is the size of the alphabet, and m is the length of the sequence.
• The entry at (i, j) is the score assigned by the PSSM to letter i at the jth position.
Position-specific scoring matrix

         1   2   3   4   5   6   7   8
    A   -1  -2  -1   0  -1  -2   0  -2
    R    5   0   5  -2   1  -3  -2   0
    N    0   6   0   0   0  -3   0   1
    D   -2   1  -2  -1   0  -3  -1  -1
    C   -3  -3  -3  -3  -3  -2  -3  -3
    Q    1   0   1  -2   5  -3  -2   0
    E    0   0   0  -2   2  -3  -2   0
    G   -2   0  -2   6  -2  -3   6  -2
    H    0   1   0  -2   0  -1  -2   8
    I   -3  -3  -3  -4  -3   0  -4  -3
    L   -2  -3  -2  -4  -2   0  -4  -3
    K    2   0   2  -2   1  -3  -2  -1
    M   -1  -2  -1  -3   0   0  -3  -2
    F   -3  -3  -3  -3  -3   6  -3  -1
    P   -2  -2  -2  -2  -1  -4  -2  -2
    S   -1   1  -1   0   0  -2   0  -1
    T   -1   0  -1  -2  -1  -2  -2  -2
    W   -3  -4  -3  -2  -2   1  -2  -2
    Y   -2  -2  -2  -3  -1   3  -3   2
    V   -3  -3  -3  -3  -2  -1  -3  -3

• "K" at position 3 gets a score of 2
• This PSSM assigns sequence NMFWAFGH a score of:
    0 + -2 + -3 + -2 + -1 + 6 + 6 + 8 = 12
• What score does this PSSM assign to KRPGHFLA?
    2 + 0 + -2 + 6 + 0 + 6 + -4 + -2 = 6
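Scoring a sequence against this PSSM is a column-by-column table lookup, which the following Python sketch makes explicit. The dictionary layout and the function name `pssm_score` are illustrative assumptions; the numbers are the PSSM values from the slides.

```python
# PSSM from the slides: one row per residue, one entry per position 1..8.
PSSM = {
    "A": [-1, -2, -1, 0, -1, -2, 0, -2],
    "R": [5, 0, 5, -2, 1, -3, -2, 0],
    "N": [0, 6, 0, 0, 0, -3, 0, 1],
    "D": [-2, 1, -2, -1, 0, -3, -1, -1],
    "C": [-3, -3, -3, -3, -3, -2, -3, -3],
    "Q": [1, 0, 1, -2, 5, -3, -2, 0],
    "E": [0, 0, 0, -2, 2, -3, -2, 0],
    "G": [-2, 0, -2, 6, -2, -3, 6, -2],
    "H": [0, 1, 0, -2, 0, -1, -2, 8],
    "I": [-3, -3, -3, -4, -3, 0, -4, -3],
    "L": [-2, -3, -2, -4, -2, 0, -4, -3],
    "K": [2, 0, 2, -2, 1, -3, -2, -1],
    "M": [-1, -2, -1, -3, 0, 0, -3, -2],
    "F": [-3, -3, -3, -3, -3, 6, -3, -1],
    "P": [-2, -2, -2, -2, -1, -4, -2, -2],
    "S": [-1, 1, -1, 0, 0, -2, 0, -1],
    "T": [-1, 0, -1, -2, -1, -2, -2, -2],
    "W": [-3, -4, -3, -2, -2, 1, -2, -2],
    "Y": [-2, -2, -2, -3, -1, 3, -3, 2],
    "V": [-3, -3, -3, -3, -2, -1, -3, -3],
}


def pssm_score(sequence, pssm):
    """Sum the PSSM entries for each residue at its position (ungapped)."""
    width = len(next(iter(pssm.values())))
    if len(sequence) != width:
        raise ValueError("sequence length must equal the number of PSSM columns")
    return sum(pssm[residue][pos] for pos, residue in enumerate(sequence))


print(PSSM["K"][2])                  # "K" at position 3 -> 2
print(pssm_score("NMFWAFGH", PSSM))  # 12
print(pssm_score("KRPGHFLA", PSSM))  # 6
```

Positions are 1-based on the slides and 0-based in the list index, hence `PSSM["K"][2]` for position 3.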
Position-specific iterated BLAST
[Figure: PSI-BLAST cycle: query, PSSM, BLAST, sequence database, multiple alignment; one step is marked "?"]

Creating a PSSM from 1 sequence
• Query: RNRGQFGH (length L)
• BLOSUM62 matrix: 20 by 20
• PSSM: 20 by L
• For each query position, the PSSM column is the BLOSUM62 column of the query residue at that position (e.g., R at position 1)

Creating a PSSM from multiple sequences
• Discard columns that contain gaps in query
• For each column C
    • Compute relative sequence weights
    • Compute PSSM entries, taking into account
        • Observed residues in this column
        • Sequence weights
        • Substitution matrix

Discard query gap columns

    Before:                  After:
    EEFG----SVDGLVNNA        EEFGSVDGLVNNA
    QKYG----RLDVMINNA        QKYGRLDVMINNA
    RRLG----TLNVLVNNA        RRLGTLNVLVNNA
    GGIG----PVD-LVNNA        GGIGPVD-LVNNA
    KALG----GFNVIVNNA        KALGGFNVIVNNA
    ARFG----KID-LIPNA        ARFGKID-LIPNA
    FEPEGPEKGMWGLVNNA        FEPEGMWGLVNNA
    AQLK----TVDVLINGA        AQLKTVDVLINGA

Compute sequence weights

    EEFGSVDGLVNNA
    QKYGRLDVMINNA
    RRLGTLNVLVNNA
    GGIGPVDLLVNNA
    KALGGFNVIVNNA
    ARFGKIDTLIPNA
    FEPEGMWGLVNNA
    AQLKTVDVLINGA

Weights (in slide order): 1.2, 1.2, 0.8, 0.8, 1.1, 0.9, 1.1, 1.3
• Low weights are assigned to redundant sequences
• High weights are assigned to unique sequences

Compute PSSM entries (simplified version)
[Figure: observed residues + background frequencies = PSSM column]
Observed residues (first alignment column): E, Q, R, G, K, A, F, A
Background frequencies (N is not listed on the slide):
    A 0.085   C 0.019   D 0.054   E 0.065   F 0.040
    G 0.072   H 0.023   I 0.058   K 0.056   L 0.096
    M 0.024   P 0.053   Q 0.042   R 0.054   S 0.072
    T 0.063   V 0.073   W 0.016   Y 0.034
These are usually derived from a large sequence database

Log-odds score
1. Estimate the probability of observing each residue
2. Divide by the background probability of observing the same residue
3. Take log so scores will be additive

    \log_2 \left( \frac{\Pr(A \mid M)}{\Pr(A \mid B)} \right)

• Residue "A" is observed
• Numerator: residue was generated by the foreground model M (i.e., the PSSM)
• Denominator: residue was generated by the background model B (i.e., randomly selected)
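The three log-odds steps can be sketched in Python for a single alignment column. This is a simplified illustration under stated assumptions: a flat pseudocount stands in for the substitution-matrix-based smoothing mentioned on the slides (it keeps unobserved residues from getting log of zero), and the function name `pssm_column` and the two-letter toy alphabet are hypothetical.

```python
import math


def pssm_column(residues, weights, background, pseudocount=1.0):
    """Log-odds scores for one alignment column.

    residues    : observed residues in the column, one per sequence
    weights     : sequence weights, same order as residues
    background  : dict mapping each residue to its background probability
    pseudocount : flat pseudocount per residue (an assumed simplification)
    """
    # Step 1: estimate the foreground probability from weighted counts.
    counts = {residue: pseudocount for residue in background}
    for residue, weight in zip(residues, weights):
        counts[residue] += weight
    total = sum(counts.values())
    # Steps 2 and 3: divide by the background probability, then take log2.
    return {
        residue: math.log2((counts[residue] / total) / background[residue])
        for residue in background
    }


# Toy two-letter alphabet: "A" observed twice, "C" never.
column = pssm_column(["A", "A"], [1.0, 1.0], {"A": 0.5, "C": 0.5})
print(column["A"])  # positive: "A" is more frequent than background
print(column["C"])  # negative: "C" is less frequent than background
```

Because the entries are logarithms of ratios, adding the scores of a sequence across columns corresponds to multiplying the per-position probability ratios.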
Why (not) PSI-BLAST
• Weights sequences according to observed diversity specific to family of interest
• Advantage: If sequences used to construct Position Specific Scoring Matrices (PSSMs) are all homologous, sensitivity at a given specificity improves significantly
• Disadvantage: However, if any non-homologous sequences are included in PSSMs, they are "corrupted." Then they "pull in" additional non-homologous sequences, and become worse than a generic substitution matrix

How to use PSI-BLAST
• Set initial thresholds high (i.e., stringent)
• Inspect each iteration's result for suspicious sequences
• Do several iterations (~5), or until no new sequences are found
• Even if only looking for a small set of sequences, make initial search very broad
    • First, use NR (large, inclusive database) with up to 5 iterations to set PSSM
    • Then use that PSSM to search in restricted domain

PSI-BLAST caveats
• Goal: Increased ability to find distant homologs
• Cost? Additional care to prevent non-homologous sequences from being included in PSSM calculation
• When in doubt, leave it out!
• Examine sequences with moderate similarity carefully
• Be particularly cautious about matches to sequences with highly biased amino acid content

PSI-BLAST example
Query is human NF-Kappa-B sequence
[Figure: search results, first iteration]
[Figure: search results, second iteration]

Summary
• Dynamic programming is O(NM)
• BLAST is O(M)
    • BLAST produces an index of query sequence that allows fast matching to the database
    • Target database is pre-indexed to indicate positions in all database sequences that match each possible search word above some score threshold
• PSI-BLAST iterates BLAST, adding new homologs at each iteration
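The control flow of the PSI-BLAST iteration (search, stop when nothing new is found, otherwise add the new homologs and repeat) can be sketched in Python. This shows only the loop, not BLAST or PSSM construction: the `search` callable is a hypothetical stand-in for "BLAST database with PSSM", the set of homologs stands in for the PSSM built from them, and `TOY_NEIGHBORS` is invented toy data.

```python
def iterate_search(query, search, max_iterations=5):
    """Repeat a search until no new homologs appear or the iteration cap is hit.

    search : hypothetical callable taking the current set of homologs and
             returning the set of hits found with the model built from them.
    """
    homologs = {query}
    for _ in range(max_iterations):
        new_hits = search(homologs) - homologs
        if not new_hits:          # stop if no new homologs are found
            break
        homologs |= new_hits      # add new homologs to the model
    return homologs


# Toy stand-in for a database search: each sequence "finds" its listed neighbors.
TOY_NEIGHBORS = {"q": {"a"}, "a": {"b"}, "b": set()}


def toy_search(homologs):
    hits = set()
    for name in homologs:
        hits |= TOY_NEIGHBORS.get(name, set())
    return hits


print(sorted(iterate_search("q", toy_search)))                    # ['a', 'b', 'q']
print(sorted(iterate_search("q", toy_search, max_iterations=1)))  # ['a', 'q']
```

In the toy data, "b" is only reachable through "a", which mirrors how a distant homolog may be found only after closer homologs have been added in an earlier iteration.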