Assessing the accuracy of database search methods, and improving the performance of PSI-BLAST

Comparing Database Search Methods & Improving the Performance of PSI-BLAST
Stephen Altschul
“Gold standards” for protein classification
Traditional curated sequence databases with family and superfamily classifications:
PIR
SWISS-PROT
Structure-based protein domain classification:
SCOP
Measuring retrieval accuracy

                   Sequence related       Sequence unrelated     Total
Search positive    TP (True Positive)     FP (False Positive)    P = TP + FP
Search negative    FN (False Negative)    TN (True Negative)     N = FN + TN
Total              R = TP + FN            U = FP + TN

Sensitivity: TP/R
Specificity: TP/P
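The two ratios can be computed directly from the four cells of the table. The following is a minimal sketch; the function name and the counts in the usage line are made up for illustration. Note that the slide's "specificity" (TP/P) is the quantity often called precision elsewhere.

```python
def retrieval_metrics(tp, fp, fn, tn):
    """Return (sensitivity, specificity) using the slide's definitions."""
    r = tp + fn  # R: all related sequences in the database
    p = tp + fp  # P: all sequences the search reported as positive
    sensitivity = tp / r if r else 0.0  # TP/R
    specificity = tp / p if p else 0.0  # TP/P, as defined on the slide
    return sensitivity, specificity


# Illustrative counts only (not taken from the talk).
sens, spec = retrieval_metrics(tp=40, fp=10, fn=10, tn=940)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

The true-negative count is accepted for completeness but does not enter either ratio.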
Receiver Operating Characteristic curve
[Figure: ROC plot. x-axis: fraction unrelated accepted (false positives; the remainder are true negatives). y-axis: fraction related accepted (true positives; the remainder are false negatives). Both axes run from 0 to 1.]
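A ROC curve of this kind can be traced by walking down the ranked search output and recording, after each record, the fraction of unrelated and of related sequences accepted so far. The sketch below assumes the ranking is given as a list of booleans (best hit first, True for a related sequence) and contains at least one record of each kind; the function name is hypothetical.

```python
def roc_points(labels):
    """Return ROC points (fraction unrelated accepted, fraction related accepted).

    labels: ranked search results, best first; True marks a related sequence.
    """
    r = sum(labels)       # total related records
    u = len(labels) - r   # total unrelated records
    tp = fp = 0
    points = [(0.0, 0.0)]  # nothing accepted yet
    for is_related in labels:
        if is_related:
            tp += 1
        else:
            fp += 1
        points.append((fp / u, tp / r))
    return points


# Toy ranking, for illustration only.
for x, y in roc_points([True, True, False, True, False]):
    print(f"{x:.2f} {y:.2f}")
```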
Random retrieval on a ROC plot
[Figure: ROC plot. x-axis: fraction unrelated accepted; y-axis: fraction related accepted.]
Line of fixed sensitivity
[Figure: ROC plot. x-axis: fraction unrelated accepted; y-axis: fraction related accepted.]
Line of fixed specificity
[Figure: ROC plot. x-axis: fraction unrelated accepted; y-axis: fraction related accepted.]
Line of fixed crossover ratio
[Figure: ROC plot. x-axis: fraction unrelated accepted; y-axis: fraction related accepted.]
ROC score: area under the ROC curve
[Figure: ROC plot. x-axis: fraction unrelated accepted; y-axis: fraction related accepted.]
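For a ranked list, the area under the ROC curve can be accumulated without drawing the curve: each unrelated record contributes a strip whose height is the fraction of related records ranked above it. The sketch below assumes a ranking given as booleans (best first, True for related) with at least one record of each kind; the function name is hypothetical.

```python
def roc_score(labels):
    """Area under the ROC curve for a ranked list of booleans (best first)."""
    r = sum(labels)       # total related records
    u = len(labels) - r   # total unrelated records
    tp = 0
    area = 0
    for is_related in labels:
        if is_related:
            tp += 1
        else:
            area += tp    # related records ranked above this unrelated one
    return area / (r * u)


print(roc_score([True, False, True, False]))
```

A perfect ranking gives 1.0, and a ranking with every unrelated record ahead of every related one gives 0.0.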
Region of interest in ROC analysis
[Figure: ROC plot. x-axis: fraction unrelated accepted; y-axis: fraction related accepted.]
Truncated ROC, or ROCn curve
[Figure: ROC curve truncated at a small fraction of unrelated sequences accepted (10^-3 on the slide), with the truncated x-axis rescaled to run from 0 to 1 (the ROCn scale). y-axis: fraction related accepted.]
ROCn score: area under the ROCn curve
[Figure: ROCn plot. x-axis: ROCn scale, from 0 to 1; y-axis: fraction related accepted.]
Questions concerning ROC analysis

What false-positive cutoff value should be used?

When does it make sense to pool the results of database searches?

When are the ROC scores for two different methods significantly different?
Marginal ratio of true to false positives
[Figure: ROCn plot. x-axis: ROCn scale, from 0 to 1; y-axis: fraction related accepted.]
Definition of the ROCn score
t_i: number of related sequences (true positives) returned before the i-th false positive
t: total number of related sequences

\[ \mathrm{ROC}_n = \frac{1}{nt} \sum_{i=1}^{n} t_i \]
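The definition translates into a single pass over the ranked output. The sketch below is illustrative: the function name is hypothetical, the ranking is assumed to be a list of booleans (best first, True for a related sequence), and, as a labeled assumption not stated on the slide, a list that ends before the n-th false positive is padded by counting every related sequence found so far for each missing false positive.

```python
def rocn_score(labels, n, t=None):
    """ROCn = (1 / (n * t)) * sum of t_i over the first n false positives."""
    if t is None:
        t = sum(labels)  # assume every related sequence appears in the list
    tp = fp = total = 0
    for is_related in labels:
        if is_related:
            tp += 1
        else:
            fp += 1
            total += tp  # t_i: related sequences seen before the i-th false positive
            if fp == n:
                break
    # Assumption: pad missing false positives with the final true-positive count.
    total += (n - fp) * tp
    return total / (n * t)


# Toy ranking: t_1 = 2 and t_2 = 3, so ROC2 = (2 + 3) / (2 * 3).
print(rocn_score([True, True, False, True, False, False], n=2))
```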
“Random distribution” of ROCn scores

Bootstrap resampling can be used to assign a statistical significance to differences in ROCn scores.

Under reasonable assumptions, the distribution of bootstrapped ROCn scores is approximately normal.

Resampling a small subset in a large database is equivalent to resampling the subset with independent Poisson distributions with mean 1.
Bootstrap resampling of false positives

[Figure: the retrieval ranking of the database (records 1 to 10) shown alongside bootstrap resamples of it.]

The true records are well characterized. Only false records are resampled with replacement. The false records are the noise.
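Resampling only the false records can be sketched as follows. The true records stay in place exactly once, while each false record appears as many times as it was drawn in a with-replacement sample of the false records. All names are hypothetical, the ranking is a toy list of booleans, and the list is assumed to contain at least n false records.

```python
import random
from collections import Counter
from statistics import mean, stdev


def rocn(labels, n, t):
    """ROCn of a ranked list (best first); True marks a related record."""
    tp = fp = total = 0
    for is_related in labels:
        if is_related:
            tp += 1
        else:
            fp += 1
            total += tp  # t_i for the i-th false positive
            if fp == n:
                break
    return total / (n * t)


def bootstrap_rocn(labels, n, reps=1000, seed=0):
    """Mean and standard deviation of ROCn over bootstrap resamples."""
    rng = random.Random(seed)
    t = sum(labels)
    false_idx = [i for i, rel in enumerate(labels) if not rel]
    scores = []
    for _ in range(reps):
        # Draw false records with replacement; count copies of each one.
        copies = Counter(rng.choices(false_idx, k=len(false_idx)))
        resampled = []
        for i, rel in enumerate(labels):
            if rel:
                resampled.append(True)                  # true records kept once
            else:
                resampled.extend([False] * copies[i])   # 0, 1, 2, ... copies
        scores.append(rocn(resampled, n, t))
    return mean(scores), stdev(scores)


ranking = [True, True, False, True, False, True, False, False, True, False]
print(bootstrap_rocn(ranking, n=3, reps=500))
```

Keeping the relative order of the records fixed means a resample only changes how many false positives sit between consecutive true positives.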
Mean and variance for the normal distribution of ROCn scores yielded by resampling only the false positives

\[ \mu = \frac{1}{nt} \sum_{i=1}^{n} t_i \]

\[ \sigma^2 = \frac{1}{n^2 t^2} \sum_{i=1}^{n} (t_n - t_i)^2 \]
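A small sketch of these two quantities, assuming the mean is (1/(nt)) Σ t_i and the variance is (1/(n²t²)) Σ (t_n − t_i)², which is how the slide's formulas read. The function name and the input values are made up for illustration.

```python
import math


def rocn_mean_sd(t_counts, t):
    """Mean and standard deviation of the resampled ROCn score.

    t_counts: [t_1, ..., t_n], related sequences seen before each of the
              first n false positives.
    t:        total number of related sequences.
    """
    n = len(t_counts)
    t_n = t_counts[-1]
    mu = sum(t_counts) / (n * t)
    var = sum((t_n - t_i) ** 2 for t_i in t_counts) / (n * t) ** 2
    return mu, math.sqrt(var)


# Illustrative values only: n = 4 false positives, t = 5 related sequences.
mu, sd = rocn_mean_sd([2, 3, 3, 4], t=5)
print(f"mean={mu:.3f} sd={sd:.4f}")
```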
Mean and variance for the normal distribution of the difference of two ROCn scores, yielded by resampling only the false positives

\[ \mu_\Delta = \mu - \mu' \]

\[ \sigma_\Delta^2 = \sigma^2 + \sigma'^2 - \frac{2}{n^2 t^2} \sum_{i,j:\, S_i = S_j} (t_n - t_i)(t'_n - t'_j) \]
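The comparison of two methods can be sketched as below, assuming the variance of the difference is the sum of the two variances minus twice a covariance term taken over false positives that both methods return among their first n. All names are hypothetical; each ranking is a list of (sequence id, is_related) pairs, best first, with unique ids and at least n false positives.

```python
def false_positive_profile(ranking, n):
    """Map each of the first n false positives to its t_i; also return t_n."""
    tp = 0
    profile = {}
    for seq_id, is_related in ranking:
        if is_related:
            tp += 1
        else:
            profile[seq_id] = tp
            if len(profile) == n:
                break
    return profile, tp


def rocn_difference(ranking_a, ranking_b, n, t):
    """Return (mean difference, variance of the difference) of two ROCn scores."""
    prof_a, tn_a = false_positive_profile(ranking_a, n)
    prof_b, tn_b = false_positive_profile(ranking_b, n)
    scale = (n * t) ** 2
    mu_a = sum(prof_a.values()) / (n * t)
    mu_b = sum(prof_b.values()) / (n * t)
    var_a = sum((tn_a - ti) ** 2 for ti in prof_a.values()) / scale
    var_b = sum((tn_b - tj) ** 2 for tj in prof_b.values()) / scale
    shared = prof_a.keys() & prof_b.keys()  # false positives found by both methods
    cov = sum((tn_a - prof_a[s]) * (tn_b - prof_b[s]) for s in shared) / scale
    return mu_a - mu_b, var_a + var_b - 2 * cov


# Toy rankings, for illustration only.
a = [("r1", True), ("f1", False), ("r2", True), ("f2", False)]
b = [("r1", True), ("r2", True), ("f1", False), ("f2", False)]
print(rocn_difference(a, b, n=2, t=2))
```

Dividing the mean difference by the square root of the variance gives a z-score that can be compared against the normal distribution, provided the variance is positive.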
PSI-BLAST in a nutshell

With a protein sequence as query, use BLAST to search a protein sequence database.

Collapse significant local alignments (those with E-value less than or equal to a set threshold h) into a multiple alignment, using the residues of the query sequence as alignment-column placeholders.

Abstract a position-specific score matrix from the multiple alignment.

Search the database with the score matrix as query.

Iterate a fixed number of times, or until convergence.
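The control flow of these steps can be sketched as a loop. This is not the PSI-BLAST implementation: `search` and `build_pssm` are hypothetical caller-supplied functions standing in for the BLAST search and the matrix construction, the default values of `h` and `max_iterations` are illustrative, and "convergence" is assumed here to mean that the set of included sequences stops changing.

```python
def psi_blast(query, database, search, build_pssm, h=0.005, max_iterations=5):
    """Iterative profile search (control flow only).

    search(profile_or_query, database) -> list of (seq_id, evalue, alignment)
    build_pssm(query, alignments)      -> position-specific score matrix
    """
    profile = query   # the first round searches with the raw query
    included = set()
    for _ in range(max_iterations):
        hits = search(profile, database)
        # Keep only alignments whose E-value passes the acceptance threshold h.
        significant = [(sid, aln) for sid, evalue, aln in hits if evalue <= h]
        new_included = {sid for sid, _ in significant}
        if new_included == included:  # assumed convergence test
            break
        included = new_included
        profile = build_pssm(query, [aln for _, aln in significant])
    return profile, included


# Stub functions so the sketch runs; real searches would replace them.
def toy_search(profile, database):
    return [("seqA", 1e-5, "ALN_A"), ("seqB", 3.0, "ALN_B")]


def toy_pssm(query, alignments):
    return ("pssm", tuple(alignments))


print(psi_blast("MKV", None, toy_search, toy_pssm))
```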
Protocol for evaluating PSI-BLAST

For each query sequence, search a comprehensive protein sequence database (e.g. NCBI’s nr) through a fixed number of PSI-BLAST iterations, or until convergence.

Use the resulting position-specific score matrix to search the “gold standard” database.

Pool the search results for ROC analysis.
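The pooling step can be sketched as merging the hits of every query into one list ordered by E-value and scoring that single list. The names and data below are hypothetical; `t` is taken to be the total number of related query-sequence pairs across all queries, and the pooled list is assumed to contain at least n false positives.

```python
def pooled_rocn(results_per_query, n, t):
    """ROCn of the pooled, E-value-sorted hits of several queries.

    results_per_query: {query_id: [(evalue, is_related), ...]}
    """
    pooled = sorted(
        (hit for hits in results_per_query.values() for hit in hits),
        key=lambda hit: hit[0],  # most significant (smallest E-value) first
    )
    tp = fp = total = 0
    for _, is_related in pooled:
        if is_related:
            tp += 1
        else:
            fp += 1
            total += tp  # related hits ranked above this false positive
            if fp == n:
                break
    return total / (n * t)


# Made-up E-values for two queries.
results = {
    "q1": [(1e-30, True), (1e-3, False)],
    "q2": [(1e-10, True), (0.5, True), (2.0, False)],
}
print(pooled_rocn(results, n=2, t=3))
```

Pooling in this way presumes that E-values are comparable across queries, since one query with poorly calibrated statistics can place its false positives ahead of every other query's true positives.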
The effect of acceptance threshold h on PSI-BLAST accuracy
Some ideas for improving PSI-BLAST
1. New statistical parameters
2. Smith-Waterman alignment
3. Substitution matrix frequency ratios
4. Apply SEG to database sequences
5. Composition-based statistics
6. “Concentrated” accounting of gaps
7. “Dispersed” accounting of gaps
8. Exponentiate Henikoff weights
9. Reverse sequence normalization
10. Window for amino acid composition
11. Use pseudocounts with composition window
12. Vary gap costs
13. Generalized affine gap costs
14. Substitution score offset
15. Information-dependent pseudocount parameter
16. Database-sequence length normalization
17. Restricted score rescaling
18. Adjust purging percentage
19. Adjust pseudocount parameter
20. Adjust acceptance threshold
The effect of composition-based statistics on PSI-BLAST accuracy
Composition-based statistics
Statistics based on “standard” amino acid frequencies can differ by orders of magnitude from those based upon the peculiar composition of two proteins.

Standard protein:                        4.5 % N
DNA pol III, β chain [M. genitalium]:    12.1 % N
DNA pol III, β chain [C. jejuni]:        7.6 % N

Depending upon the composition assumed, a search of nr with M. genitalium DNA pol III as query yields different E-values for C. jejuni DNA pol III, as well as for the highest-scoring false positive:

                                 C. jejuni DNA pol III    Highest-scoring false positive
“Standard” statistics:           10^-10                   0.0002
Composition-based statistics:    0.001                    0.2

At a threshold of 0.0001, “standard” statistics yield 54 true positives, while at 0.1, composition-based statistics yield 55 true positives.
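The composition figures quoted above are simple residue fractions. As a minimal sketch, the fraction of a given residue in a sequence can be computed as follows; the function name is hypothetical and the sequence is a made-up toy string, not one of the proteins named on the slide.

```python
from collections import Counter


def residue_fraction(sequence, residue="N"):
    """Fraction of positions in `sequence` occupied by `residue`."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return Counter(seq)[residue] / len(seq)


toy_sequence = "MNNKLANNVSNE"  # invented for illustration
print(f"{100 * residue_fraction(toy_sequence):.1f} % N")
```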
The effect of dispersed accounting of gaps on PSI-BLAST accuracy
The effect of restricted score rescaling and parameter tuning on PSI-BLAST accuracy
Accuracy of PSI-BLAST
Program version                                   ROC100 score
Original (h = 10^-6)                              0.758 ± 0.005
+ Composition-based statistics (h = 0.002)        0.879 ± 0.003
+ “Dispersed” gap accounting (h = 0.005)          0.884 ± 0.002
+ Restricted score rescaling (b = 9; p = 0.94)    0.895 ± 0.003
PSI-BLAST accuracy as a function of the number of iterations
Literature
ROC analysis
Swets, J.A. (1988) Science 240:1285-1293
Gribskov, M. & Robinson, N.L. (1996) Comput. Chem. 20:25-33
PSI-BLAST
Altschul, S.F. et al. (1997) Nucl. Acids Res. 25:3389-3402
Composition-based statistics
Karplus, K. et al. (1998) Bioinformatics 14:846-856
Schäffer, A.A. et al. (1999) Bioinformatics 15:1000-1011
Mott, R. (2000) J. Mol. Biol. 300:649-659
Statistics of ROCn resampling
Schäffer, A.A. et al. (2001) Nucl. Acids Res. 29:2994-3005
Spouge, J.L. & Czabarka, E. (2002) ISMB Poster 133A
Acknowledgements
Analysis of ROCn score distribution
John Spouge
Eva Czabarka

Improvements to PSI-BLAST
Alejandro Schäffer
L. Aravind
Thomas Madden
Sergei Shavirin
John Spouge
Yuri Wolf
Eugene Koonin