A generic C++ implementation of the Pruned DPA for segmentation A. Cleynen, M. Koskas, G. Rigaill A. Cleynen (AgroParisTech, UCBerkeley) March 27th, 2012 March 2012 1 / 25 Pruned DPA Guillem Rigaill [4] Pruned dynamic programming for optimal multiple change-point detection Algorithm for the segmentation of datasets: Exact with respect to given loss Fast: empirically in n log(n) A. Cleynen (AgroParisTech, UCBerkeley) Returns optimal segmentation in 1 to Kmax segments allows for a vast range of methods for the choice of K March 2012 2 / 25 CGH prole, Pruned DPA 1 to Kmax = 50 Runtime = 10.834s A. Cleynen (AgroParisTech, UCBerkeley) Lebarbier: [3], Lavielle: n [2] : pen (K ) = β K log K = 28 K March 2012 3 / 25 CGH prole, Pruned DPA 1 to Kmax = 50 Runtime = 10.834s A. Cleynen (AgroParisTech, UCBerkeley) mBIC (Zhang andSiegmund [5]): P n pen (K ) = β K log +g log n r r ∈mĚ‚(K ) K = 42 K March 2012 4 / 25 CGH prole, CART 1 to Kmax = 50 Runtime = 0.171s A. Cleynen (AgroParisTech, UCBerkeley) SIC: K pen (K = 12 + 1) = 1 K log n 2 March 2012 5 / 25 CGH prole, PELT Runtime = 14.651s A. Cleynen (AgroParisTech, UCBerkeley) SIC: K pen (K = 17 + 1) = 1 K log n 2 March 2012 6 / 25 Comparison for K=17 Runtimes: PDPA = 3.621s CART = 0.062s PELT = 14.651s A. Cleynen (AgroParisTech, UCBerkeley) Breakpoints PELT and PDPA: same breakpoints CART: only 9 out of 16 are identical March 2012 7 / 25 An example Y1 =0 Y2 Four-point signal Y3 = 0.4 = 0.5 Y4 = −0.5 Contrast γ(Y , µ) = (Y − µ)2 Segmentation in K=2 segments A. Cleynen (AgroParisTech, UCBerkeley) March 2012 8 / 25 Dynamic Programming approach: ∀t ∈ {1, . . . , n} compute cost of segmentation with last breakpoint t as a function of last-segment parameter µ ∀t ∈ {1, . . . , n} nd minimum of cost function in µ identify the minimum of those minimums A. Cleynen (AgroParisTech, UCBerkeley) March 2012 9 / 25 Main idea: If we add a new point, the values of µ change, but not the best candidates for last breakpoint ⇒ A beaten candidate can never become optimal again A. Cleynen (AgroParisTech, UCBerkeley) March 2012 10 / 25 Pruned DPA: The algorithm Initialization: ∀t ∈ {1, n} compute C1,t τ =K −1=1 (t = 1 Signal: Y1 = 0) Cost function: 1 h2,1 (µ) = C1,1 = 0 Set of intervals: 1 S2,1 = R(= [−0.5; 0.5]) A. Cleynen (AgroParisTech, UCBerkeley) March 2012 11 / 25 New entry: =2 Y1 = 0 t Signal: Y2 = 0.5 Cost functions: 1 h2,2 (µ) 2 h2,2 (µ) = h21,1 + (Y2 − µ)2 = 0.25 − µ + µ2 = C1,2 = 0.125 Set of intervals: 1 S2,2 = [0.146; 0.5] 2 S2,2 = [−0.5; 0.146] A. Cleynen (AgroParisTech, UCBerkeley) March 2012 12 / 25 New entry: Signal: Y1 =0 Y2 = 0.5 Y3 = 0.4 Cost functions: 1 h2,3 (µ) 2 h2,3 (µ) 3 h2,3 (µ) = h21,2 + (Y3 − µ)2 = 0.41 − 1.8µ + 2µ2 = h22,2 + (Y3 − µ)2 = 0.285 − 0.8µ + µ2 = C1,3 = 0.14 Set of intervals: 1 S2,3 = [0.190; 0.5] 2 S2,3 3 S2,3 =∅ = [−0.5; 0.190] τ = 2 is discarded A. Cleynen (AgroParisTech, UCBerkeley) March 2012 13 / 25 Last entry: Signal: Y1 =0 Y2 = 0.5 Y3 = 0.4 Y4 = −0.5 Cost functions: 1 h2,4 (µ) 3 h2,4 (µ) 4 h2,4 (µ) = h21,3 + (Y4 − µ)2 = 0.66 − 0.8µ + 3µ2 = h23,3 + (Y4 − µ)2 = 0.39 + µ + µ2 = C1,4 = 0.62 Set of intervals: 1 S2,4 =∅ S2,4 = [−0.5; 0.190] 3 4 S2,4 = [0.190; 0.5] τ = 1 is discarded A. Cleynen (AgroParisTech, UCBerkeley) March 2012 14 / 25 Last step: minimization n o τ (µ) = min{K −1<τ <t } hK ,t 2 0.39 + µ + µ for µ ∈ [−0.5; 0.190] HK ,t (µ) = 0.62 for µ ∈ [0.190; 0.5] CK ,t = minµ {HK ,t (µ)} HK ,t (µ) CK ,n (= C2,4 ) = 0.14 τ =3 µ = −0.5 A. Cleynen (AgroParisTech, UCBerkeley) March 2012 15 / 25 From the original DPA to the Pruned DPA Original DPA: segment additivity Θ(Kn2 ) ( Ck ,t = min {k −1<τ <t } t X + min{ Ck −1,τ µ i ) γ(Yi , µ)} =τ +1 Pruned DPA: point additivity ( Ck ,t = min µ min {Ck −1,τ + {k −1<τ <t } i t −1 X ( = min µ min {k −1<τ <t } A. Cleynen (AgroParisTech, UCBerkeley) t X {Ck −1,τ + i ) γ(Yi , µ)} =τ +1 ) γ(Yi , µ) + γ(Yt , µ)} =τ +1 March 2012 16 / 25 Generic C++ implementation A. Cleynen (AgroParisTech, UCBerkeley) March 2012 17 / 25 Performances on real datasets Figure: Chromosome 1, positive strand of S. cerevisiae (yeast) A. Cleynen (AgroParisTech, UCBerkeley) March 2012 18 / 25 Performances on real datasets A. Cleynen (AgroParisTech, UCBerkeley) March 2012 19 / 25 Performances on real datasets A. Cleynen (AgroParisTech, UCBerkeley) March 2012 20 / 25 Performances on real datasets Neuroblastoma copy number data Result from Hocking et al. [1] PDPA-Lavielle Fused Lasso (λ = f (K )) Circular Binary Segmentation (SD) Fused Lasso (λ = cste ) Circular Binary Segmentation (default) PDPA-mBIC errors 2.2 6.7 11.5 16.0 40.5 40.9 FP 0.6 3.6 7.6 12.7 49.3 49.4 FN 11.6 18.5 32.2 36.6 0.5 0.0 Timing(s) 2.10 0.08 51.62 0.04 1.78 1.47 Table: Comparison of a few segmentation methods on a real data set. Tuning parameters are learned by Leave-one-out. More methods are compared in [1] A. Cleynen (AgroParisTech, UCBerkeley) March 2012 21 / 25 Conclusions and Perspectives Conclusion: PDPA is a fast and exact algorithm that allows the use of: I a large range of data type (CGH, Seq-data, etc) I a large range of possible contrasts (Quadratic, Poisson, etc) I a large range of methods for the choice of K (mBIC, Lavielle, AIC, etc) Perspectives: I Application to real datasets for the discovery of new transcripts, etc. I Theoretical proof of the complexity of the algorithm I Implementation of Ridge-type penalties A. Cleynen (AgroParisTech, UCBerkeley) March 2012 22 / 25 The End A. Cleynen (AgroParisTech, UCBerkeley) Thank you! March 2012 23 / 25 References Toby Dylan Hocking. Learning smoothing models of copy number proles using breakpoint annotations. http://hal.inria.fr/hal-00663790. Marc Lavielle. Using penalized contrasts for the change-point problem. Signal Processing, 85(8):15011510, August 2005. E. Lebarbier. Detecting multiple change-points in the mean of gaussian process by model selection. Signal Processing, 85(4):717736, April 2005. Guillem Rigaill. Pruned dynamic programming for optimal multiple change-point detection. Arxiv:1004.0887, April 2010. Nancy R Zhang and David O Siegmund. A modied bayes information criterion with applications to the analysis of comparative genomic hybridization data. Biometrics, 63(1):2232, March 2007. PMID: 17447926. A. Cleynen (AgroParisTech, UCBerkeley) March 2012 24 / 25 Runtime Comparisons, PELT-PDPA A. Cleynen (AgroParisTech, UCBerkeley) March 2012 25 / 25