A generic C++ implementation of the Pruned DPA for segmentation

advertisement
A generic C++ implementation of the Pruned DPA
for segmentation
A. Cleynen, M. Koskas, G. Rigaill
A. Cleynen (AgroParisTech, UCBerkeley)
March 27th, 2012
March 2012
1 / 25
Pruned DPA
Guillem Rigaill [4]
Pruned dynamic programming for optimal multiple change-point detection
Algorithm for the segmentation of datasets:
Exact with respect to given loss
Fast: empirically in n log(n)
A. Cleynen (AgroParisTech, UCBerkeley)
Returns optimal segmentation in
1 to Kmax segments
allows for a vast range of
methods for the choice of K
March 2012
2 / 25
CGH prole, Pruned DPA
1 to Kmax = 50
Runtime = 10.834s
A. Cleynen (AgroParisTech, UCBerkeley)
Lebarbier: [3], Lavielle:
n [2] :
pen (K ) = β K log
K
= 28
K
March 2012
3 / 25
CGH prole, Pruned DPA
1 to Kmax = 50
Runtime = 10.834s
A. Cleynen (AgroParisTech, UCBerkeley)
mBIC (Zhang andSiegmund
[5]):
P
n
pen (K ) = β K log
+g
log
n
r
r ∈mĚ‚(K )
K
= 42
K
March 2012
4 / 25
CGH prole, CART
1 to Kmax = 50
Runtime = 0.171s
A. Cleynen (AgroParisTech, UCBerkeley)
SIC:
K
pen (K
= 12
+ 1) =
1
K log n
2
March 2012
5 / 25
CGH prole, PELT
Runtime = 14.651s
A. Cleynen (AgroParisTech, UCBerkeley)
SIC:
K
pen (K
= 17
+ 1) =
1
K log n
2
March 2012
6 / 25
Comparison for K=17
Runtimes:
PDPA = 3.621s
CART = 0.062s
PELT = 14.651s
A. Cleynen (AgroParisTech, UCBerkeley)
Breakpoints
PELT and PDPA: same breakpoints
CART: only 9 out of 16 are identical
March 2012
7 / 25
An example
Y1
=0
Y2
Four-point signal
Y3 = 0.4
= 0.5
Y4
= −0.5
Contrast γ(Y , µ) = (Y − µ)2
Segmentation in K=2 segments
A. Cleynen (AgroParisTech, UCBerkeley)
March 2012
8 / 25
Dynamic Programming approach:
∀t ∈ {1, . . . , n} compute
cost of segmentation with
last breakpoint t as a
function of last-segment
parameter µ
∀t ∈ {1, . . . , n} nd
minimum of cost function
in µ
identify the minimum of
those minimums
A. Cleynen (AgroParisTech, UCBerkeley)
March 2012
9 / 25
Main idea:
If we add a new point, the
values of µ change, but
not the best candidates
for last breakpoint
⇒ A beaten candidate
can never become optimal
again
A. Cleynen (AgroParisTech, UCBerkeley)
March 2012
10 / 25
Pruned DPA: The algorithm
Initialization:
∀t ∈ {1, n} compute
C1,t
τ =K −1=1
(t = 1 Signal: Y1 = 0)
Cost function:
1
h2,1 (µ)
= C1,1 = 0
Set of intervals:
1
S2,1
= R(= [−0.5; 0.5])
A. Cleynen (AgroParisTech, UCBerkeley)
March 2012
11 / 25
New entry:
=2
Y1 = 0
t
Signal:
Y2
= 0.5
Cost functions:
1
h2,2 (µ)
2
h2,2 (µ)
= h21,1 + (Y2 − µ)2 = 0.25 − µ + µ2
= C1,2 = 0.125
Set of intervals:
1
S2,2 = [0.146; 0.5]
2
S2,2
= [−0.5; 0.146]
A. Cleynen (AgroParisTech, UCBerkeley)
March 2012
12 / 25
New entry:
Signal:
Y1
=0
Y2
= 0.5
Y3
= 0.4
Cost functions:
1
h2,3 (µ)
2
h2,3 (µ)
3
h2,3 (µ)
= h21,2 + (Y3 − µ)2 = 0.41 − 1.8µ + 2µ2
= h22,2 + (Y3 − µ)2 = 0.285 − 0.8µ + µ2
= C1,3 = 0.14
Set of intervals:
1
S2,3 = [0.190; 0.5]
2
S2,3
3
S2,3
=∅
= [−0.5; 0.190]
τ = 2 is discarded
A. Cleynen (AgroParisTech, UCBerkeley)
March 2012
13 / 25
Last entry:
Signal:
Y1
=0
Y2
= 0.5
Y3
= 0.4
Y4
= −0.5
Cost functions:
1
h2,4 (µ)
3
h2,4 (µ)
4
h2,4 (µ)
= h21,3 + (Y4 − µ)2 = 0.66 − 0.8µ + 3µ2
= h23,3 + (Y4 − µ)2 = 0.39 + µ + µ2
= C1,4 = 0.62
Set of intervals:
1
S2,4
=∅
S2,4
= [−0.5; 0.190]
3
4
S2,4
= [0.190; 0.5]
τ = 1 is discarded
A. Cleynen (AgroParisTech, UCBerkeley)
March 2012
14 / 25
Last step: minimization
n
o
τ (µ)
= min{K −1<τ <t } hK
,t
2
0.39 + µ + µ for µ ∈ [−0.5; 0.190]
HK ,t (µ) =
0.62
for µ ∈ [0.190; 0.5]
CK ,t = minµ {HK ,t (µ)}
HK ,t (µ)
CK ,n (= C2,4 )
= 0.14
τ =3
µ = −0.5
A. Cleynen (AgroParisTech, UCBerkeley)
March 2012
15 / 25
From the original DPA to the Pruned DPA
Original DPA: segment additivity
Θ(Kn2 )
(
Ck ,t
=
min
{k −1<τ <t }
t
X
+ min{
Ck −1,τ
µ
i
)
γ(Yi , µ)}
=τ +1
Pruned DPA: point additivity
(
Ck ,t
= min
µ
min
{Ck −1,τ +
{k −1<τ <t }
i
t −1
X
(
= min
µ
min
{k −1<τ <t }
A. Cleynen (AgroParisTech, UCBerkeley)
t
X
{Ck −1,τ +
i
)
γ(Yi , µ)}
=τ +1
)
γ(Yi , µ) + γ(Yt , µ)}
=τ +1
March 2012
16 / 25
Generic C++ implementation
A. Cleynen (AgroParisTech, UCBerkeley)
March 2012
17 / 25
Performances on real datasets
Figure: Chromosome 1, positive strand of S. cerevisiae (yeast)
A. Cleynen (AgroParisTech, UCBerkeley)
March 2012
18 / 25
Performances on real datasets
A. Cleynen (AgroParisTech, UCBerkeley)
March 2012
19 / 25
Performances on real datasets
A. Cleynen (AgroParisTech, UCBerkeley)
March 2012
20 / 25
Performances on real datasets
Neuroblastoma copy number data
Result from Hocking et al. [1]
PDPA-Lavielle
Fused Lasso (λ = f (K ))
Circular Binary Segmentation (SD)
Fused Lasso (λ = cste )
Circular Binary Segmentation (default)
PDPA-mBIC
errors
2.2
6.7
11.5
16.0
40.5
40.9
FP
0.6
3.6
7.6
12.7
49.3
49.4
FN
11.6
18.5
32.2
36.6
0.5
0.0
Timing(s)
2.10
0.08
51.62
0.04
1.78
1.47
Table: Comparison of a few segmentation methods on a real data set. Tuning
parameters are learned by Leave-one-out.
More methods are compared in [1]
A. Cleynen (AgroParisTech, UCBerkeley)
March 2012
21 / 25
Conclusions and Perspectives
Conclusion: PDPA is a fast and exact algorithm that allows the use of:
I
a large range of data type (CGH, Seq-data, etc)
I
a large range of possible contrasts (Quadratic, Poisson, etc)
I
a large range of methods for the choice of K (mBIC, Lavielle, AIC, etc)
Perspectives:
I
Application to real datasets for the discovery of new transcripts, etc.
I
Theoretical proof of the complexity of the algorithm
I
Implementation of Ridge-type penalties
A. Cleynen (AgroParisTech, UCBerkeley)
March 2012
22 / 25
The End
A. Cleynen (AgroParisTech, UCBerkeley)
Thank you!
March 2012
23 / 25
References
Toby Dylan Hocking.
Learning smoothing models of copy number proles using breakpoint annotations.
http://hal.inria.fr/hal-00663790.
Marc Lavielle.
Using penalized contrasts for the change-point problem.
Signal Processing, 85(8):15011510, August 2005.
E. Lebarbier.
Detecting multiple change-points in the mean of gaussian process by model selection.
Signal Processing, 85(4):717736, April 2005.
Guillem Rigaill.
Pruned dynamic programming for optimal multiple change-point detection.
Arxiv:1004.0887, April 2010.
Nancy R Zhang and David O Siegmund.
A modied bayes information criterion with applications to the analysis of comparative
genomic hybridization data.
Biometrics, 63(1):2232, March 2007.
PMID: 17447926.
A. Cleynen (AgroParisTech, UCBerkeley)
March 2012
24 / 25
Runtime Comparisons, PELT-PDPA
A. Cleynen (AgroParisTech, UCBerkeley)
March 2012
25 / 25
Download