Title Slide: Mountain background, “Compound Poisson Approximation of Palindrome Length Score in Herpesviruses”

Old slides from UTEP Math talk (15 slides)

Slide: Palindrome Length Score (PLS)
Chew et al. (2005, Nucleic Acids Res.):
(1) Identify all palindromes of length at least 2L (using EMBOSS).
(2) Score each fully extended palindrome: a palindrome of length 2s is given the score s.
(3) Window score: W_i is the sum of the scores of all palindromes whose centers lie in window i.
Nonparametric approach with PLS to predict replication origins in herpesviruses:
Sensitivity = 67%
Positive predictive value (PPV) = 15%

Slide: Can PPV be improved?
Consider only windows with significantly high scores.
When should a window be considered high scoring?
Need to know the probability distribution of the PLS.

Slide: Compound Poisson random variable
Let N be a Poisson random variable with mean $\lambda = w\theta^{L}$, where w is the window length, $\theta = 2(p_A p_T + p_C p_G)$, and L is the minimal palindrome half-length considered for that sequence (palindromes have length at least 2L).
Define the PLS for a window to be
$$Z = \sum_{j=1}^{N} Y_j,$$
where N represents the number of (fully extended) palindromes in the window, and $Y_j$ is the score given to the j-th palindrome.

Slide: Compound Poisson random variable (cont.)
With an i.i.d. model for the nucleotide sequence, the probability mass function of $Y_j$ can be written as
$$p_Y(l) = \begin{cases} (1-\theta)\,\theta^{\,l-L} & \text{if } L \le l < M, \\ \theta^{\,M-L} & \text{if } l = M. \end{cases}$$
Then the probability mass function of Z is computed using the recursive formula
$$P(Z = k) = \frac{\lambda}{k} \sum_{l=1}^{k} l\, p_Y(l)\, P(Z = k - l),$$
which follows from Stein’s identity (Barbour et al. 1992).

Slide: How good is the compound Poisson approximation (CPA)?
Kolmogorov distance between two random variables:
$$d_K = \sup_{l} \left| P(X \le l) - P(Y \le l) \right|$$
$d_K$ between the compound Poisson and empirical (simulated sequences) distributions:

Model   minimum   maximum   mean      std. dev.
M0      0.00246   0.01868   0.00799   0.00362
M1      0.00251   0.02683   0.00939   0.00457
M2      0.00103   0.01996   0.00878   0.00504
M3      0.00190   0.02743   0.00778   0.00578
BIC     0.00275   0.02743   0.00788   0.00569

Slide: The herpesvirus dataset
Use Table 1 of manuscript

Slide: Location of windows with significantly high PLS scores
Use Table 5 of manuscript

Slide: Prediction Performance of PLS with CPA

Method             Significance level   Sensitivity   PPV
PLS (10 windows)                        67%           15%
CPA                0.01                 48%           37%
CPA                0.05                 52%           31%
CPA+               0.01                 65%           37%
CPA+               0.05                 65%           31%

Slide: Further questions
Can the compound Poisson approximation be generalized to other scoring schemes? E.g., the base weighted scheme (BWS), which gives a higher score to palindromes that are less likely to occur at random.
Replication origin prediction for other DNA (viral, bacterial, and eukaryotic) genomes?
Other sequence features important to replication origin prediction?

Slide: Acknowledgements
Collaborators: David Chew, Kwok Pui Choi, Raul Cruz-Cano, Deepak Chandran
Funding:
Texas Advanced Research Program 003661-0013-2007
National Science Foundation DMS0800272
National Institutes of Health: S06GM08012-35, 5G12RR008124-11, 1R01AI077413, 1T36GM078000-01
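
Supplementary sketch (not part of the original slides): the compound Poisson slides describe a truncated geometric score distribution, a recursion for the distribution of the window score Z, and a Kolmogorov distance check. The Python below is a minimal illustration of those three pieces, writing theta for 2(pA pT + pC pG) and lam for the Poisson mean w * theta^L. All function names are hypothetical, and the numeric inputs (uniform base frequencies, w = 1000, L = 5, M = 20) are illustrative assumptions, not values from the talk. The recursion is started from P(Z = 0) = exp(-lam), which holds because every score is at least L >= 1, so Z = 0 only when the window contains no palindromes.

```python
import math


def complement_match_prob(p_a, p_c, p_g, p_t):
    """theta = 2(pA pT + pC pG): chance that two independent bases are complementary."""
    return 2.0 * (p_a * p_t + p_c * p_g)


def score_pmf(theta, L, M):
    """Truncated geometric pmf of a palindrome score Y on L..M (list index = score)."""
    p = [0.0] * (M + 1)
    for l in range(L, M):
        p[l] = (1.0 - theta) * theta ** (l - L)
    p[M] = theta ** (M - L)  # all remaining mass sits at the cap M
    return p


def compound_poisson_pmf(lam, p_y, k_max):
    """P(Z = k), k = 0..k_max, via P(Z=k) = (lam/k) * sum_l l p_Y(l) P(Z=k-l)."""
    pz = [0.0] * (k_max + 1)
    pz[0] = math.exp(-lam)  # Z = 0 exactly when N = 0
    for k in range(1, k_max + 1):
        top = min(k, len(p_y) - 1)  # p_Y(l) = 0 beyond the cap
        total = sum(l * p_y[l] * pz[k - l] for l in range(1, top + 1))
        pz[k] = lam / k * total
    return pz


def upper_tail_threshold(pz, alpha):
    """Smallest score t with P(Z >= t) <= alpha, or None if not reached by k_max."""
    tail = 1.0  # P(Z >= 0)
    for t, prob in enumerate(pz):
        if tail <= alpha:
            return t
        tail -= prob  # now P(Z >= t + 1)
    return None


def kolmogorov_distance(pmf_a, pmf_b):
    """sup over l of |P(A <= l) - P(B <= l)| for two pmfs on 0, 1, 2, ..."""
    n = max(len(pmf_a), len(pmf_b))
    cdf_a = cdf_b = dist = 0.0
    for k in range(n):
        cdf_a += pmf_a[k] if k < len(pmf_a) else 0.0
        cdf_b += pmf_b[k] if k < len(pmf_b) else 0.0
        dist = max(dist, abs(cdf_a - cdf_b))
    return dist


# Illustrative (assumed) inputs: uniform base frequencies, w = 1000, L = 5, M = 20.
theta = complement_match_prob(0.25, 0.25, 0.25, 0.25)
lam = 1000 * theta ** 5
pz = compound_poisson_pmf(lam, score_pmf(theta, 5, 20), k_max=400)
cutoff = upper_tail_threshold(pz, alpha=0.01)  # score a window must reach at the 0.01 level
```

A window would then be flagged as significantly high scoring when its PLS is at least `cutoff`. Comparing `pz` against an empirical pmf from simulated sequences with `kolmogorov_distance` mirrors the check on the "How good is the CPA?" slide, although the simulation step itself is not shown here.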