Estimating statistical significance with reverse-sequence null models:
Why it works and why it fails

Kevin Karplus, Rachel Karchin, Richard Hughey
karplus@soe.ucsc.edu
Center for Biomolecular Science and Engineering
University of California, Santa Cruz

Outline of Talk

What is a null model?
Why use the reverse-sequence null?
Two approaches to statistical significance.
What distribution do we expect for scores?
Fitting the distribution.
Does calibrating the E-values help?
When do reverse-sequence null models fail?

Scoring HMMs and Bayes' Rule

The model M is a computable function that assigns a probability Prob(A | M) to each string A. When given a string A, we want to know how likely the model is; that is, we want to compute something like Prob(M | A).

Bayes' rule:

    Prob(M | A) = Prob(A | M) Prob(M) / Prob(A) .

Problem: Prob(A) and Prob(M) are inherently unknowable.

Null models

Standard solution: ask how much more likely M is than some null hypothesis (represented by a null model N):

    Prob(M | A) / Prob(N | A) = [ Prob(A | M) / Prob(A | N) ] · [ Prob(M) / Prob(N) ] .

Prob(M)/Prob(N) is the prior odds ratio, and represents our belief in the likelihood of the model before seeing any data.

Prob(M | A)/Prob(N | A) is the posterior odds ratio, and represents our belief in the likelihood of the model after seeing the data.

Standard Null Model

The standard null model is an i.i.d. (independent, identically distributed) model:

    Prob(A | N, len(A)) = ∏_{i=1}^{len(A)} Prob(A_i) ,

    Prob(A | N) = Prob(string of length len(A)) · ∏_{i=1}^{len(A)} Prob(A_i) .

The length modeling is often omitted, but one must then be careful to normalize the probabilities correctly.

Problems with the standard null

When using the standard null model, certain sequences and HMMs have anomalous behavior. Many of the problems are due to unusual composition, that is, a large number of some usually rare amino acid. For example, metallothionein, with 24 cysteines in only 61 amino acids, scores well on any model with multiple highly conserved cysteines.

Reversed model for null

We avoid composition bias (and several other problems) by using a reversed model M^r as the null model. The probability of a sequence under M^r is exactly the probability of the reversal of the sequence under M. If we assume that M and M^r have equal prior likelihood, then

    Prob(M | S) / Prob(M^r | S) = Prob(S | M) / Prob(S | M^r) .

This method corrects for composition biases, length biases, and several subtler biases.
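As a concrete illustration, here is a minimal Python sketch of reverse-sequence-null scoring. `log_prob` is a hypothetical callable returning ln Prob(seq | M) for the HMM of interest (not the SAM API); the only point being made is that the null term is the same model applied to the reversed sequence.

```python
def reverse_null_score(log_prob, seq):
    """Log-odds score (in nats) of seq under model M versus the
    reverse-sequence null M^r.

    Because Prob(S | M^r) = Prob(reverse(S) | M), the null term is
    just the same model scored on the reversed sequence.
    """
    forward = log_prob(seq)           # ln Prob(S | M)
    backward = log_prob(seq[::-1])    # ln Prob(S | M^r)
    return forward - backward         # ln [ Prob(S | M) / Prob(S | M^r) ]


def reverse_null_cost(log_prob, seq):
    """Same quantity in the cost convention used in these slides:
    more negative is better."""
    return -reverse_null_score(log_prob, seq)
```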
Composition as a source of error

A cysteine-rich protein, such as metallothionein, can match any HMM that has several highly conserved cysteines, even if the two structures are quite different:

    Cost in nats:
    HMM      sequence    model − standard null    model − reversed-model
    1kst     4mt2        -21.15                    0.01
    1kst     1tabI       -15.04                   -0.93
    4mt2     1kst        -15.14                   -0.10
    4mt2     1tabI       -21.44                   -1.44
    1tabI    1kst        -17.79                   -7.72
    1tabI    4mt2        -19.63                   -1.79

Composition examples

[Structure pictures: Metallothionein Isoform II (4mt2) and Kistrin (1kst).]

Composition examples

[Structure pictures: Kistrin (1kst) and the trypsin-binding domain of Bowman-Birk Inhibitor (1tabI).]

Helix examples

[Structure pictures: Tropomyosin (2tmaA), Colicin Ia (1cii), and a flavodoxin mutant (1vsgA).]

Helix examples

[Structure pictures: Apolipophorin III (1aep) and Apolipoprotein A-I (1av1A).]

Fold Recognition Performance

[Plot: fraction of true positives found (same fold) versus false positives per query, comparing "1-track AA rev-null" with "1-track AA geo-null".]

What is Statistical Significance?

The statistical significance of a hit, P_1, is the probability of getting a score as good as the hit "by chance" when scoring a single "random" sequence.

When searching a database of N sequences, the significance is best reported as an E-value, the expected number of sequences that would score that well by chance:

    E = P_1 N .

Some people prefer the p-value:

    P_N = 1 − (1 − P_1)^N .

For large N and small E, P_N ≈ 1 − e^{−E} ≈ E.

I prefer E-values, because our best scores are often not significant, and it is easier to distinguish between E-values of 10, 100, and 1000 than between p-values of 0.999955, 1.0 − 4E-44, and 1.0 − 5E-435.

Approaches to Statistical Significance

Markov's inequality: For any scoring scheme that uses

    ln [ Prob(seq | M_1) / Prob(seq | M_2) ] ,

the probability of a score better than T is less than e^{−T} for sequences distributed according to M_2 (apply Markov's inequality to exp(score), whose expected value under M_2 is ∑_seq Prob(seq | M_1) ≤ 1). This bound is independent of the actual probability distributions.

Classical parameter fitting: If the "random" sequences are not drawn from the distribution M_2, but from some other distribution, then we can try to fit some parameterized family of distributions to scores from a random sample, and use the fitted parameters to compute P_1 and E-values for the scores of real sequences.

Our Assumptions

Bad assumption 1: The sequence and the reversed sequence come from the same underlying distribution.

Bad assumption 2: The costs c_M with a standard null model (more negative is better) are distributed according to an extreme-value distribution:

    P(c_M < T) ≈ G_{k,λ}(T) = 1 − exp(−k e^{λT}) .

Bad assumption 3: The scores with the model and the reverse-model are independent of each other.

Result: The scores using a reverse-sequence null model are distributed according to a sigmoidal function:

    P(score > T) = (1 + e^{λT})^{−1} ,

or equivalently, in cost units, P(cost < T) = (1 + e^{−λT})^{−1}.
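To show how this sigmoid is used in practice, here is a small Python sketch that turns a reverse-null cost into a P-value and an E-value; `lam` is the fitted λ and `n_db` the database size (hypothetical names, not SAM's interface).

```python
import math


def pvalue_from_cost(cost, lam):
    """P(cost < T) under the one-parameter sigmoid
    P(c < T) = 1 / (1 + exp(-lam * T)).
    More negative costs are better, so this is the chance that a single
    random sequence does at least this well."""
    z = -lam * cost
    if z > 0:  # rewrite to avoid overflow for very good (negative) costs
        return math.exp(-z) / (1.0 + math.exp(-z))
    return 1.0 / (1.0 + math.exp(z))


def evalue_from_cost(cost, lam, n_db):
    """Expected number of sequences in an n_db-sequence database that
    would score this well by chance: E = P_1 * N."""
    return n_db * pvalue_from_cost(cost, lam)


# Example: a cost of -10 nats with lam = 1.0, searched against 100,000
# sequences, gives E ≈ 100000 / (1 + e^10) ≈ 4.5 -- not significant.
```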
Derivation of the sigmoidal distribution

(The derivation is for costs, not scores, so more negative is better. Here c_M and c_{M^r} are the costs of the sequence under the model and under the reversed model, the reverse-null cost is c_M − c_{M^r}, and the independence assumption lets the joint density factor.)

    P(cost < T) = ∫_{−∞}^{∞} P(c_M = x) ∫_{x−T}^{∞} P(c_{M^r} = y) dy dx
                = ∫_{−∞}^{∞} P(c_M = x) P(c_{M^r} > x − T) dx
                = ∫_{−∞}^{∞} kλ e^{λx} exp(−k e^{λx}) exp(−k e^{λ(x−T)}) dx
                = ∫_{−∞}^{∞} kλ e^{λx} exp(−k (1 + e^{−λT}) e^{λx}) dx

Derivation of the sigmoid (cont.)

If we introduce a temporary variable to simplify the formulas, K_T = k (1 + exp(−λT)), then

    P(cost < T) = (1 + e^{−λT})^{−1} ∫_{−∞}^{∞} K_T λ e^{λx} exp(−K_T e^{λx}) dx
                = (1 + e^{−λT})^{−1} ∫_{−∞}^{∞} g_{K_T,λ}(x) dx
                = (1 + e^{−λT})^{−1} ,

since g_{K_T,λ} is an extreme-value density and integrates to 1.

Fitting λ

The λ parameter simply scales the scores (or costs) fed into the sigmoidal distribution, so λ can be set by matching the observed variance to the theoretically expected variance. The mean is theoretically (and experimentally) zero. The variance is easily computed, though the derivation is messy:

    E(c²) = (π²/3) λ^{−2} .

λ is then easily fit by matching the variance:

    λ ≈ π sqrt( N / (3 ∑_{i=0}^{N−1} c_i²) ) .

Two-parameter family

We made three dangerous assumptions: reversibility, extreme-value, and independence. To give ourselves some room to compensate for deviations from the extreme-value assumption, we can add another parameter to the family. We can replace −λT with any strictly decreasing odd function. Somewhat arbitrarily, we chose −sign(T) |λT|^τ so that we could match a "stretched exponential" tail.

Fitting the two-parameter family

For the two-parameter symmetric distribution, we can fit using the 2nd and 4th moments:

    E(c²) = λ^{−2/τ} K_{2/τ} ,
    E(c⁴) = λ^{−4/τ} K_{4/τ} ,

where K_x is a constant:

    K_x = ∫_{−∞}^{∞} y^x (1 + e^y)^{−1} (1 + e^{−y})^{−1} dy = −Γ(x+1) ∑_{k=1}^{∞} (−1)^k / k^x .

Fitting the two-parameter family (cont.)

The ratio E(c⁴)/(E(c²))² = K_{4/τ}/K_{2/τ}² is independent of λ and monotonic in τ, so we can fit τ by binary search. Once τ is chosen, we can fit λ using E(c²) = λ^{−2/τ} K_{2/τ}.
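A rough Python sketch of both moment-matching fits is below. The stretched sigmoid is parameterized here as P(cost < T) = 1/(1 + exp(−sign(T) λ |T|^τ)), which is one way of writing the family that matches the moment formulas above; the numerical evaluation of K_x and all function names are assumptions of this sketch, not the SAM calibration code.

```python
import math


def fit_lambda_one_param(costs):
    """One-parameter fit: set lambda from the observed variance,
    using E(c^2) = (pi^2/3) * lambda**-2 (the mean is taken to be 0)."""
    n = len(costs)
    sum_sq = sum(c * c for c in costs)
    return math.pi * math.sqrt(n / (3.0 * sum_sq))


def _k_const(x, upper=60.0, steps=20000):
    """K_x = integral over the real line of |y|**x / ((1+e**y)(1+e**-y)) dy,
    computed by symmetry as 2 * the integral over [0, upper] with the
    trapezoid rule.  The integrand decays like e**-y, so truncating at
    y = 60 loses essentially nothing for the x values used here."""
    h = upper / steps
    total = 0.0
    for i in range(steps + 1):
        y = i * h
        f = (y ** x) * math.exp(-y) / (1.0 + math.exp(-y)) ** 2
        total += 0.5 * f if i in (0, steps) else f
    return 2.0 * h * total


def fit_two_param(costs, tau_lo=0.3, tau_hi=3.0, iters=40):
    """Two-parameter fit by matching the 2nd and 4th moments, for the
    stretched sigmoid P(cost < T) = 1 / (1 + exp(-sign(T)*lam*|T|**tau)).

    With that parameterization, E(c^2) = lam**(-2/tau) * K_{2/tau} and
    E(c^4) = lam**(-4/tau) * K_{4/tau}, so E(c^4)/E(c^2)**2 =
    K_{4/tau}/K_{2/tau}**2 depends only on tau and decreases as tau
    grows.  Fit tau by binary search on that ratio, then solve for lam."""
    n = len(costs)
    m2 = sum(c ** 2 for c in costs) / n
    m4 = sum(c ** 4 for c in costs) / n
    target = m4 / (m2 * m2)

    for _ in range(iters):
        tau = 0.5 * (tau_lo + tau_hi)
        ratio = _k_const(4.0 / tau) / _k_const(2.0 / tau) ** 2
        if ratio > target:
            tau_lo = tau   # model tail too heavy at this tau: increase tau
        else:
            tau_hi = tau   # model tail too light: decrease tau
    tau = 0.5 * (tau_lo + tau_hi)
    lam = (_k_const(2.0 / tau) / m2) ** (tau / 2.0)
    return lam, tau
```

For τ = 1 this reduces to the one-parameter fit, since K_2 = π²/3 gives λ = π sqrt(1/(3 E(c²))).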
Student's t-distribution

On the advice of statistician David Draper, we tried maximum-likelihood fits of Student's t-distribution to our heavy-tailed symmetric data. We couldn't do moment matching, because the degrees-of-freedom parameter for the best fits turned out to be less than 4, where the 4th moment of Student's t is infinite. The maximum-likelihood fit of Student's t seemed to produce too heavy a tail for our data. We plan to investigate other heavy-tailed distributions.

What is the single-track HMM looking for?

[HMM logo for the single-track (amino-acid) model nostruct-align/3chy.t2k (w0.5), residues 1-128 of 3chy.]

Example for a single-track HMM

[Plot: database calibration for the 3chy.t2k-w0.5 HMM. E-value versus sequence rank, comparing the desired fit with the unfitted sigmoidal, fitted 1-parameter sigmoidal, fitted 2-parameter sigmoidal, and fitted Student-t curves.]

What is the second track looking for?

[HMM logo for the secondary-structure (EBGHTL) track of nostruct-align/3chy.t2k, residues 1-128.]

Example for a two-track HMM

[Plot: database calibration for the 3chy.t2k-100-30-ebghtl HMM. E-value versus sequence rank, comparing the desired fit with the unfitted sigmoidal, fitted 1-parameter sigmoidal, fitted 2-parameter sigmoidal, and fitted Student-t curves.]

Fold recognition results

[Plot: fraction of true positives found (same fold) versus false positives per query, for AA and AA-STRIDE HMMs with no calibration, database calibration, and random calibration.]

What went wrong?

Why did random-calibrated fold recognition fail for 2-track HMMs? "Random" secondary-structure sequences (from an i.i.d. model) are not representative of real sequences.

Fixes:
- Build a better secondary-structure decoy generator.
- Use a real database, but avoid problems with contamination by true positives by taking only costs > 0 to estimate E(cost²) and E(cost⁴), as sketched below.
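A minimal sketch of the second fix, assuming the calibration distribution really is symmetric about zero (the helper name is hypothetical, not from SAM):

```python
def censored_even_moments(costs):
    """Estimate E(cost^2) and E(cost^4) from the costs > 0 only.

    The calibration distribution is assumed symmetric about zero, and
    contamination by true positives shows up as unusually good
    (negative) costs.  For a symmetric distribution,
    E(c^2 | c > 0) = E(c^2) and E(c^4 | c > 0) = E(c^4), so averaging
    over the positive half alone gives contamination-resistant
    estimates of the even moments.
    """
    pos = [c for c in costs if c > 0]
    if not pos:
        raise ValueError("no positive costs to estimate moments from")
    m2 = sum(c ** 2 for c in pos) / len(pos)
    m4 = sum(c ** 4 for c in pos) / len(pos)
    return m2, m4
```

These moment estimates plug directly into the fits above (for the one-parameter sigmoid, λ ≈ π sqrt(1/(3 E(c²)))); the possibly contaminated negative half of the data never enters the calibration.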
Fold recognition results

[Plot: fraction of true positives found (same fold) versus false positives per query, for Dunbrack t2k+0.3*EBGHTL HMMs (db calib, no calib) and Dunbrack t2k+0.3*PB HMMs (no calib, simple, db calib).]

What went wrong with Protein Blocks?

The HMMs using de Brevern's protein blocks did much worse after calibration. Why? The protein-blocks alphabet strongly violates the reversibility assumption.

Encoding cost in bits for secondary-structure strings:

    alphabet      0-order    1st-order    reverse − forward
    amino acid    4.1896     4.1759       0.0153
    stride        2.3330     1.0455       0.0042
    dssp          2.5494     1.3387       0.0590
    pb            3.3935     1.4876       3.0551

Web sites

UCSC bioinformatics info:
    http://www.soe.ucsc.edu/research/compbio/

SAM tool suite info:
    http://www.soe.ucsc.edu/research/compbio/sam.html

HMM servers:
    http://www.soe.ucsc.edu/research/compbio/HMM-apps/

SAM-T02 prediction server:
    http://www.soe.ucsc.edu/research/compbio/HMM-apps/T02-query.html

These slides:
    http://www.soe.ucsc.edu/~karplus/papers/e-value-germany02.pdf