Comparing Database Search Methods & Improving the Performance of PSI-BLAST
Stephen Altschul

“Gold standards” for protein classification
Traditional curated sequence databases with family and superfamily classifications:
  PIR
  SWISS-PROT
Structure-based protein domain classification:
  SCOP

Measuring retrieval accuracy

  Search      Related sequence       Unrelated sequence     Total
  Positive    TP (true positive)     FP (false positive)    P = TP + FP
  Negative    FN (false negative)    TN (true negative)     N = FN + TN
  Total       R = TP + FN            U = FP + TN

Sensitivity: TP/R
Specificity: TP/P

Receiver Operating Characteristic curve
[Figure: ROC plot. x-axis: fraction unrelated accepted (false positives; the remainder are true negatives). y-axis: fraction related accepted (true positives; the remainder are false negatives). Both axes run from 0 to 1.]

Random retrieval on a ROC plot
[Figure: ROC plot, same axes.]

Line of fixed sensitivity
[Figure: ROC plot, same axes.]

Line of fixed specificity
[Figure: ROC plot, same axes.]

Line of fixed crossover ratio
[Figure: ROC plot, same axes.]

ROC score: area under the ROC curve
[Figure: ROC plot, same axes.]

Region of interest in ROC analysis
[Figure: ROC plot, same axes.]

Truncated ROC, or ROCn curve
[Figure: ROCn plot. x-axis: ROCn scale from 0 to 1, corresponding to fraction unrelated accepted from 0 to 10^-3. y-axis: fraction related accepted.]

ROCn score: area under the ROCn curve
[Figure: ROCn plot, same axes.]

Questions concerning ROC analysis
  What false-positive cutoff value should be used?
  When does it make sense to pool the results of database searches?
  When are the ROC scores for two different methods significantly different?

Marginal ratio of true to false positives
[Figure: ROCn plot. x-axis: ROCn scale. y-axis: fraction related accepted.]

Definition of the ROCn score
  $t_i$: number of related sequences (true positives) returned before the $i$th false positive
  $t$: total number of related sequences

  $$\mathrm{ROC}_n = \frac{1}{nt} \sum_{i=1}^{n} t_i$$

“Random distribution” of ROCn scores
Bootstrap resampling can be used to assign a statistical significance to differences in ROCn scores.
Under reasonable assumptions, the distribution of bootstrapped ROCn scores is approximately normal.
Resampling a small subset in a large database is equivalent to resampling the subset with independent Poisson distributions with mean 1.

Bootstrap resampling of false positives
[Figure: retrieval ranking of the database, shown before and after resampling.]
The true records are well characterized.
The false records are the noise.
Only false records are resampled with replacement.

Mean and variance for the normal distribution of ROCn scores yielded by resampling only the false positives

  $$\mu = \frac{1}{nt} \sum_{i=1}^{n} t_i$$

  $$\sigma^2 = \frac{1}{n^2 t^2} \sum_{i=1}^{n} (t_n - t_i)^2$$

Mean and variance for the normal distribution of the difference of two ROCn scores, yielded by resampling only the false positives

  $$\mu - \mu'$$

  $$\sigma^2 + \sigma'^2 - \frac{2}{n^2 t^2} \sum_{i,j:\, S_i = S'_j} (t_n - t_i)(t'_n - t'_j)$$

Primed quantities refer to the second method. The sum runs over pairs $(i, j)$ for which the $i$th false positive of the first method, $S_i$, and the $j$th false positive of the second method, $S'_j$, are the same sequence.

PSI-BLAST in a nutshell
1. With a protein sequence as query, use BLAST to search a protein sequence database.
2. Collapse significant local alignments (those with E-value less than or equal to a set threshold h) into a multiple alignment, using the residues of the query sequence as alignment-column placeholders.
3. Abstract a position-specific score matrix from the multiple alignment.
4. Search the database with the score matrix as query.
5. Iterate a fixed number of times, or until convergence.
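The ROCn definition and the false-positive bootstrap described earlier in this deck can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the code used for the results in this talk: the function names `rocn_score` and `bootstrap_rocn`, the boolean ranking format, and the choice to pad with the final true-positive count when a ranking contains fewer than n false positives are all hypothetical.

```python
import random


def rocn_score(labels, n):
    """ROCn = (1 / (n * t)) * sum of t_i for i = 1..n.

    labels: ranked hits, best first; True = related, False = unrelated.
    Assumption: if fewer than n false positives appear, the missing
    t_i are taken to be the final true-positive count.
    """
    t = sum(labels)          # total number of related sequences
    tp = fp = total = 0
    for is_related in labels:
        if is_related:
            tp += 1
        else:
            fp += 1
            total += tp      # t_i: true positives seen before this false positive
            if fp == n:
                break
    total += (n - fp) * tp   # padding (adds 0 when n false positives were seen)
    return total / (n * t)


def bootstrap_rocn(labels, n, num_resamples=1000, seed=0):
    """Resample only the false records with replacement; true records stay fixed."""
    rng = random.Random(seed)
    t = sum(labels)
    t_before = []            # true positives ranked above each false record
    tp = 0
    for is_related in labels:
        if is_related:
            tp += 1
        else:
            t_before.append(tp)
    m = len(t_before)
    scores = []
    for _ in range(num_resamples):
        # Draw m false records with replacement and keep them in rank order.
        sample = sorted(rng.choices(range(m), k=m))
        first_n = [t_before[k] for k in sample[:n]]
        first_n += [tp] * (n - len(first_n))
        scores.append(sum(first_n) / (n * t))
    return scores


# Toy ranking (made up for illustration): t_1 = 2, t_2 = 3, t = 3, so ROC2 = 5/6.
toy = [True, True, False, True, False, False]
print(rocn_score(toy, n=2))
print(sum(bootstrap_rocn(toy, n=2, num_resamples=200)) / 200)
```

The spread of the values returned by `bootstrap_rocn` gives an empirical counterpart to the normal approximation stated on the slides; comparing two methods would require resampling the same false records for both rankings.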
Protocol for evaluating PSI-BLAST
1. For each query sequence, search a comprehensive protein sequence database (e.g. NCBI’s nr) through a fixed number of PSI-BLAST iterations, or until convergence.
2. Use the resulting position-specific score matrix to search the “gold standard” database.
3. Pool the search results for ROC analysis.

The effect of acceptance threshold h on PSI-BLAST accuracy

Some ideas for improving PSI-BLAST
1. New statistical parameters
2. Smith-Waterman alignment
3. Substitution matrix frequency ratios
4. Apply SEG to database sequences
5. Composition-based statistics
6. “Concentrated” accounting of gaps
7. “Dispersed” accounting of gaps
8. Exponentiate Henikoff weights
9. Reverse sequence normalization
10. Window for amino acid composition
11. Use pseudocounts with composition window
12. Vary gap costs
13. Generalized affine gap costs
14. Substitution score offset
15. Information-dependent pseudocount parameter
16. Database-sequence length normalization
17. Restricted score rescaling
18. Adjust purging percentage
19. Adjust pseudocount parameter
20. Adjust acceptance threshold

The effect of composition-based statistics on PSI-BLAST accuracy

Composition-based statistics
Statistics based on “standard” amino acid frequencies can differ by orders of magnitude from those based upon the peculiar composition of two proteins.

  Standard protein:                          4.5 % N
  DNA pol III, β chain [M. genitalium]:     12.1 % N
  DNA pol III, β chain [C. jejuni]:          7.6 % N

Depending upon the composition assumed, a search of nr with M. genitalium DNA pol III as query yields different E-values for C. jejuni DNA pol III, as well as for the highest-scoring false positive:

                                    C. jejuni DNA pol III    Highest-scoring false positive
  “Standard” statistics:            10^-10                   0.001
  Composition-based statistics:     0.0002                   0.2

At a threshold of 0.0001, “standard” statistics yield 54 true positives, while at 0.1, composition-based statistics yield 55 true positives.
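The threshold comparison above comes down to counting the hits at or below an E-value cutoff and splitting them into true and false positives. The Python sketch below shows that bookkeeping, using the deck's own definitions of sensitivity (TP/R) and specificity (TP/P). The function name `count_accepted`, the `(e_value, is_related)` tuple format, and the toy hit list are assumptions made for illustration; the numbers are invented and are not the search results quoted on the slide.

```python
def count_accepted(hits, threshold):
    """Count true and false positives with E-value <= threshold.

    hits: iterable of (e_value, is_related) pairs (hypothetical format).
    Returns (true_positives, false_positives).
    """
    tp = fp = 0
    for e_value, is_related in hits:
        if e_value <= threshold:
            if is_related:
                tp += 1
            else:
                fp += 1
    return tp, fp


# Toy hit list, invented for illustration only.
toy_hits = [
    (1e-10, True),
    (2e-4, True),
    (1e-3, False),
    (0.05, True),
    (0.2, False),
]
total_related = 3  # R: related sequences in the toy database

for h in (1e-4, 0.1):
    tp, fp = count_accepted(toy_hits, h)
    sensitivity = tp / total_related                      # TP / R
    specificity = tp / (tp + fp) if tp + fp else None     # TP / P, undefined if nothing accepted
    print(h, tp, fp, sensitivity, specificity)
```

A looser threshold can only be compared fairly across two scoring schemes when each threshold is chosen relative to where that scheme places its false positives, which is the point of the slide's example.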
The effect of dispersed accounting of gaps on PSI-BLAST accuracy

The effect of restricted score rescaling and parameter tuning on PSI-BLAST accuracy

Accuracy of PSI-BLAST

  Program version                   Parameters          ROC100 score
  Original                          h = 10^-6           0.758 ± 0.005
  + Composition-based statistics    h = 0.002           0.879 ± 0.003
  + “Dispersed” gap accounting      h = 0.005           0.884 ± 0.002
  + Restricted score rescaling      b = 9; p = 0.94     0.895 ± 0.003

PSI-BLAST accuracy as a function of the number of iterations

Literature
ROC analysis
  Swets, J.A. (1988) Science 240:1285-1293
  Gribskov, M. & Robinson, N.L. (1996) Comput. Chem. 20:25-33
PSI-BLAST
  Altschul, S.F. et al. (1997) Nucl. Acids Res. 25:3389-3402
Composition-based statistics
  Karplus, K. et al. (1998) Bioinformatics 14:846-856
  Schäffer, A.A. et al. (1999) Bioinformatics 15:1000-1011
  Mott, R. (2000) J. Mol. Biol. 300:649-659
Statistics of ROCn resampling
  Schäffer, A.A. et al. (2001) Nucl. Acids Res. 29:2994-3005
  Spouge, J.L. & Czabarka, E. (2002) ISMB Poster 133A

Acknowledgements
Analysis of ROCn score distribution:
  John Spouge
  Eva Czabarka
Improvements to PSI-BLAST:
  Alejandro Schäffer
  L. Aravind
  Thomas Madden
  Sergei Shavirin
  John Spouge
  Yuri Wolf
  Eugene Koonin