Biostatistics Series at U Toronto, 10/7/2008 (Word File for Outline)

Title Slide: Mountain background, “Compound Poisson Approximation of Palindrome Length
Score in Herpesviruses”
Old slides from UTEP Math talk (15 slides)
Slide: Palindrome Length Score (PLS)
Chew et al. (2005, Nucleic Acids Res.):
(1) Identify all palindromes of length at least 2L (using EMBOSS)
(2) Score each fully extended palindrome: a palindrome of length 2s is given a score of s.
(3) Window score: W_i, the score of the ith window, is the sum of the scores of all the
palindromes whose centers lie in that window.
Nonparametric approach with PLS to predict replication origins in herpesviruses:
Sensitivity = 67%
Positive predictive value (PPV) = 15%
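The three scoring steps above can be sketched in Python. This is a minimal illustration only: it uses a naive scan in place of EMBOSS, treats a palindrome as a stretch that equals its own reverse complement, and assumes non-overlapping windows with a center assigned by its 0-based position. The names `palindrome_half_lengths`, `window_scores`, `min_half_len`, and `window_len` are hypothetical and do not come from the original slides.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def palindrome_half_lengths(seq):
    """Half-length s of the fully extended palindrome at every center.

    Center c is the gap between seq[c - 1] and seq[c]. The palindrome is
    extended outward while the left base is the complement of the right base.
    """
    n = len(seq)
    half = {}
    for c in range(1, n):
        s = 0
        while (c - 1 - s >= 0 and c + s < n
               and COMPLEMENT.get(seq[c - 1 - s]) == seq[c + s]):
            s += 1
        half[c] = s
    return half


def window_scores(seq, min_half_len, window_len):
    """Sum the scores s of palindromes with s >= min_half_len, per window.

    Assumption: windows are non-overlapping and center c belongs to
    window number c // window_len.
    """
    n_windows = -(-len(seq) // window_len)  # ceiling division
    scores = [0] * n_windows
    for c, s in palindrome_half_lengths(seq).items():
        if s >= min_half_len:
            scores[c // window_len] += s
    return scores


# GAATTC is a palindrome of length 6 (score 3); CCGG has length 4 (score 2).
print(window_scores("GAATTCAAAACCGG", min_half_len=2, window_len=7))  # prints [3, 2]
```

Raising `min_half_len` to 3 drops the CCGG palindrome, so the second window scores 0.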
Slide: Can PPV be improved?
• Consider windows with significantly high scores.
• When should a window be considered high scoring?
• Need to know the probability distribution of the PLS.
Slide: Compound Poisson random variable
Let N be a Poisson random variable with mean
$$\lambda = w\,\theta^{L},$$
where
w is the window length,
$\theta = 2(p_A p_T + p_C p_G)$, and L sets the minimal palindrome length considered for that
sequence (palindromes of length at least 2L).
Define the PLS for a window to be
$$Z = \sum_{j=1}^{N} Y_j,$$
where
N represents the number of (fully extended) palindromes in the window, and
$Y_j$ is the score given to the jth palindrome.
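As a small worked example of the two parameters on this slide, the sketch below reads the Poisson mean as lambda = w * theta^L with theta = 2(pA pT + pC pG). The base frequencies, window length, and function names are hypothetical values chosen for illustration, not figures from the talk.

```python
def palindrome_theta(p_a, p_c, p_g, p_t):
    """theta = 2 (pA pT + pC pG): chance that two independent bases are complementary."""
    return 2.0 * (p_a * p_t + p_c * p_g)


def poisson_mean(window_len, theta, min_half_len):
    """lambda = w * theta^L for window length w and minimal half-length L."""
    return window_len * theta ** min_half_len


# Hypothetical inputs: equal base frequencies, w = 1000, L = 5.
theta = palindrome_theta(0.25, 0.25, 0.25, 0.25)
lam = poisson_mean(1000, theta, 5)
print(theta, lam)  # prints 0.25 0.9765625
```

With equal base frequencies, theta is 0.25, so each extra matching pair required by a larger L cuts the expected palindrome count by a factor of four.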
Slide: Compound Poisson random variable (cont.)
With an i.i.d. model for the nucleotide sequence, the probability mass function of $Y_j$ can be
written as
$$p_Y(l) = \begin{cases} (1-\theta)\,\theta^{\,l-L} & \text{if } L \le l < M, \\ \theta^{\,M-L} & \text{if } l = M. \end{cases}$$
Then the probability mass function of Z is computed using the recursive formula, which follows
from Stein’s identity (Barbour et al. 1992):
$$P(Z = k) = \frac{\lambda}{k} \sum_{l=1}^{k} l\, p_Y(l)\, P(Z = k-l).$$
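The recursion on this slide can be sketched as follows, assuming the score distribution is geometric on L <= l < M with the remaining mass theta^(M-L) placed at l = M, and the recursion is P(Z = k) = (lambda / k) * sum over l of l * p_Y(l) * P(Z = k - l). The starting value P(Z = 0) = exp(-lambda) is not stated on the slide; it holds because every score is at least L >= 1, so Z = 0 only when N = 0. The function names and the helper `critical_score` (one possible way to turn the distribution into a cutoff for "significantly high" scores) are hypothetical.

```python
import math


def score_pmf(theta, min_half_len, max_half_len):
    """p_Y(l): geometric for L <= l < M, with the remaining mass at l = M."""
    L, M = min_half_len, max_half_len
    pmf = {l: (1.0 - theta) * theta ** (l - L) for l in range(L, M)}
    pmf[M] = theta ** (M - L)
    return pmf


def compound_poisson_pmf(lam, p_y, k_max):
    """P(Z = k) for k = 0..k_max via the recursion, starting from exp(-lam)."""
    p_z = [math.exp(-lam)] + [0.0] * k_max
    for k in range(1, k_max + 1):
        total = sum(l * p_l * p_z[k - l] for l, p_l in p_y.items() if l <= k)
        p_z[k] = lam / k * total
    return p_z


def critical_score(p_z, alpha):
    """Smallest k with P(Z >= k) <= alpha, or None if not reached in the table."""
    tail = 1.0  # P(Z >= 0)
    for k, p in enumerate(p_z):
        if tail <= alpha:
            return k
        tail -= p
    return None


# Hypothetical parameters: theta = 0.25, L = 2, M = 5, lambda = 1.
p_y = score_pmf(0.25, 2, 5)
p_z = compound_poisson_pmf(1.0, p_y, 100)
print(critical_score(p_z, 0.05))
```

A quick check on the recursion: P(Z = 2) must equal P(N = 1) * p_Y(2), since a score of 2 can only come from a single palindrome of minimal length.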
Slide: How good is the compound Poisson approximation (CPA)?
Kolmogorov distance between two random variables X and Y:
$$d_K = \sup_{l} \left| P(X \le l) - P(Y \le l) \right|$$
$d_K$ between the compound Poisson and empirical (simulated sequences) distributions:
        minimum   maximum   mean      std. dev.
M0      0.00246   0.01868   0.00799   0.00362
M1      0.00251   0.02683   0.00939   0.00457
M2      0.00103   0.01996   0.00878   0.00504
M3      0.00190   0.02743   0.00778   0.00578
BIC     0.00275   0.02743   0.00788   0.00569
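For integer-valued scores, the Kolmogorov distance defined on this slide reduces to the largest absolute gap between two cumulative distribution functions. The sketch below assumes both distributions are given as lists indexed by score 0, 1, 2, ..., and that the empirical one is built from simulated window scores. The function names are hypothetical and the sample values are made up for illustration.

```python
from collections import Counter
from itertools import accumulate


def empirical_pmf(samples):
    """Relative frequency of each integer score from 0 to max(samples)."""
    counts = Counter(samples)
    n = len(samples)
    return [counts[k] / n for k in range(max(samples) + 1)]


def kolmogorov_distance(pmf_x, pmf_y):
    """max over l of |P(X <= l) - P(Y <= l)| for PMFs indexed by l = 0, 1, 2, ..."""
    n = max(len(pmf_x), len(pmf_y))
    px = list(pmf_x) + [0.0] * (n - len(pmf_x))  # pad the shorter PMF with zeros
    py = list(pmf_y) + [0.0] * (n - len(pmf_y))
    return max(abs(a - b) for a, b in zip(accumulate(px), accumulate(py)))


# Made-up example: a model PMF against four simulated scores.
model = [0.5, 0.5]
simulated = empirical_pmf([0, 2, 2, 3])
print(kolmogorov_distance(model, simulated))
```

Padding with zeros keeps the comparison valid when one distribution has a longer tail than the other, because its CDF has already reached its total mass there.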
Slide: The herpesvirus dataset
Use Table 1 of manuscript
Slide: Location of windows with significantly high PLS scores
Use Table 5 of manuscript
Slide: Prediction Performance of PLS with CPA
Method              Level   sensitivity   PPV
PLS (10 windows)            67%           15%
CPA                 0.01    48%           37%
CPA                 0.05    52%           31%
CPA+                0.01    65%           37%
CPA+                0.05    65%           31%
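For reference, the two performance measures in this table follow the standard definitions: sensitivity = TP / (TP + FN) and PPV = TP / (TP + FP). The sketch below uses hypothetical counts and, as a simplifying assumption, a single true-positive count for both measures. It does not reproduce how the figures on this slide were tallied.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of known replication origins that were predicted."""
    return true_pos / (true_pos + false_neg)


def ppv(true_pos, false_pos):
    """Fraction of predictions that correspond to a known replication origin."""
    return true_pos / (true_pos + false_pos)


# Hypothetical counts, not data from the talk.
tp, fn, fp = 8, 2, 12
print(sensitivity(tp, fn), ppv(tp, fp))  # prints 0.8 0.4
```

Tightening the significance level removes some false positives, which raises PPV, but it can also drop true positives, which lowers sensitivity.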
Slide: Further questions
• Can the compound Poisson approximation be generalized to other scoring schemes?
E.g., the base weighted scheme (BWS), which gives a higher score to palindromes that
are less likely to occur at random.
• Replication origin prediction for other DNA (viral, bacterial, and eukaryotic) genomes?
• Other sequence features important to replication origin prediction?
Slide: Acknowledgements
Collaborators:
David Chew
Kwok Pui Choi
Raul Cruz-Cano
Deepak Chandran
Funding:
Texas Advanced Research Program 003661-0013-2007
National Science Foundation DMS0800272.
National Institutes of Health: S06GM08012-35, 5G12RR008124-11, 1R01AI077413,
1T36GM078000-01.