Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment Kernel change-point detection Alain Celisse 1 UMR 8524 CNRS - Université Lille 1 2 Modal INRIA team-project 3 SSB Group, Paris joint work with Sylvain Arlot and Zaı̈d Harchaoui Workshop: “Recent advances in changepoints analysis” Warwick University, March 28, 2012 1/24 Kernel change-point detection Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment 1-D signal (example) 1.2 1 0.8 0.6 Signal 0.4 0.2 0 −0.2 −0.4 −0.6 −0.8 0 10 20 30 40 50 60 70 80 90 100 Position t 2/24 Kernel change-point detection Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment 1-D signal (example): Find abrupt changes in the mean 1.2 Signal Reg. func. 1 0.8 0.6 Signal 0.4 0.2 0 −0.2 ? ? −0.4 −0.6 −0.8 0 10 20 30 40 50 60 70 80 90 100 Position t 2/24 Kernel change-point detection Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment Estimation rather than identification Signal: Y Reg. func. s 1 0.8 Fact: With finite sample, it is impossible to recover change-point in noisy regions. 0.6 0.4 0.2 0 −0.2 55 60 65 70 75 80 85 90 95 100 Purpose: Estimate the regression function. Idea: −→ Without too strong noise, recover true change-points. Kernel change-point detection 3/24 Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment Example 1: Changes in the distribution Description: Observations generated along the time. Observation distribution is piecewise constant along the time. Observations have the same mean and variance. −→ Detecting changes in the mean and variance is useless. 4/24 Kernel change-point detection Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment Example 2: Working with non-vectorial objects Description: Video sequences from “Le grand échiquier”, 70s-80s French talk show. At each time, one observes an image (high-dimensional). Each image is summarized by a histogram. Kernel change-point detection 5/24 Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment Example 2: Working with non-vectorial objects Preprocessing images (patches in yellow). Each histogram bin corresponds to a patch. Non-vectorial object: Histograms with D bins belong to n o PD (p1 , . . . , pD ) ∈ [0, 1]D , i=1 pi = 1 . −→ Algorithms for vectorial data are useless. 5/24 Kernel change-point detection Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment Detect abrupt changes. . . General purposes: 1 Detect changes in the whole distribution (not only in the mean) 2 High-dimensional data of different nature: Vectorial: measures in Rd , curves (sound recordings,. . . ) Non vectorial: phenotypic data, graphs, DNA sequence,. . . Both vectorial and non vectorial data. 3 Efficient algorithm allowing to deal with large data sets 6/24 Kernel change-point detection Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment Kernel and Reproducing Kernel Hilbert Space (RKHS) X1 , . . . , Xn ∈ X : initial observations. k(·, ·) : X × X → R: reproducing kernel (H: RKHS). φ(·) : X → H s.t. φ(x) = k(x, ·): canonical feature map. < ·, · >H : inner-product in H. Asset: Enables to work with high-dimensional heterogeneous data. 7/24 Kernel change-point detection Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment Model Mapping of the initial data ∀1 ≤ i ≤ n, Yi = φ(Xi ) ∈ H . −→ (t1 , Y1 ), . . . , (tn , Yn ) ∈ [0, 1] × H : independent . 8/24 Kernel change-point detection Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment Model Mapping of the initial data ∀1 ≤ i ≤ n, Yi = φ(Xi ) ∈ H . −→ (t1 , Y1 ), . . . , (tn , Yn ) ∈ [0, 1] × H : independent . Mean element The mean element of PXi (distribution of Xi ) is µ?i : < µ?i , f >H = EXi [ < φ(Xi ), f >H ] , ∀f ∈ H . Fact: With characteristic kernels, PXi 6= PXj ⇒ µ?i 6= µ?j . 8/24 Kernel change-point detection Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment Model ∀1 ≤ i ≤ n, Yi = µ?i + εi ∈H , where µ?i ∈ H: mean element of PXi (distribution of Xi ) ∀i, εi := Yi − µ?i , with Eεi = 0, h i vi := E kεi k2H . 8/24 Kernel change-point detection Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment Model ∀1 ≤ i ≤ n, Yi = µ?i + εi ∈H , where µ?i ∈ H: mean element of PXi (distribution of Xi ) ∀i, εi := Yi − µ?i , with Eεi = 0, h i vi := E kεi k2H . Assumptions 1 maxi kYi kH ≤ M 2 maxi vi ≤ vmax 3 a.s. (Db) . (Vmax) . µ? = (µ?1 , . . . , µ?n )0 ∈ Hn : piecewise constant. P kµ? − µk2 := ni=1 kµ?i − µi k2H . Goal: −→ Estimate µ? to recover change-points. 8/24 Kernel change-point detection Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment Model selection Models: Mn = {m, segmentation of {1, . . . , n}}, Dm = Card(m). Sm = {µ : (t1 , . . . , tn ) → H, piecewise const. on (Iλ )λ∈m }, (Iλ )λ∈m : I1 = [0, tλ1 ], I2 =]tλ1 , tλ2 ], . . . , IDm =]tλDm −1 , 1]. Strategy: (Sm )m∈Mn −→ (b µm )m∈Mn −→ µ bm b ??? 9/24 Kernel change-point detection Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment Model selection Models: Mn = {m, segmentation of {1, . . . , n}}, Dm = Card(m). Sm = {µ : (t1 , . . . , tn ) → H, piecewise const. on (Iλ )λ∈m }, (Iλ )λ∈m : I1 = [0, tλ1 ], I2 =]tλ1 , tλ2 ], . . . , IDm =]tλDm −1 , 1]. Strategy: (Sm )m∈Mn −→ (b µm )m∈Mn −→ µ bm b ??? Goal: Oracle inequality (in expectation, or with large probability): n o 2 2 ? kµ? − µ bm k ≤ C inf kµ − µ b k + rn . m b m∈Mn 9/24 Kernel change-point detection Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment Least-squares estimator Empirical risk minimizer over Sm (= model): µ bm ∈ arg min u∈Sm n 1X ku(ti ) − Yi k2H =: arg min Pn γ(u) . u∈Sm n i=1 Regressogram: µ bm = X λ∈m βbλ 1Iλ βbλ = X 1 Yi . Card {ti ∈ Iλ } ti ∈Iλ 10/24 Kernel change-point detection Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment Choose D − 1 change-points. . . Assumption: (Harchaoui, Cappé (2007)) The number D − 1 of change-points is known. Strategy: b Choose m(D) among {m ∈ Mn , Dm = D}. ERM algorithm: b b ERM (D) = Argminm|Dm =D kY − µ m(D) =m bm k2 . (dynamic programming). 11/24 Kernel change-point detection Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment Quality of the segmentations 12/24 Kernel change-point detection Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment Elementary calculations Expectations (vλ = 1 Card(λ) P i∈λ vi ) h i X E kµ? − µ bm k2 = kµ? − Πm µ? k2 + vλ , λ∈m h 2 E kY − µ bm k i ? 2 = kµ? − Πm µ k − X vλ + Cste , λ∈m (Πm : orthog. proj. operator onto Sm ). Conclusion: −→ ERM prefers models with large P λ∈m vλ (overfitting). 13/24 Kernel change-point detection Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment Overfitting illustration 14/24 Kernel change-point detection Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment Choose the number of change-points From µ bm b D D , choose D amounts to choose the “best model”. Ideal penalty: m∗ ∈ Argminm∈M kµ? − µ bm k2 (oracle segmentation) n o = Argminm∈M kY − µ bm k2 + penid (m) , with penid (m) := 2 kΠm εk2 − 2 < (I − Πm )µ? , ε >. Strategy 1 Concentration inequalities for linear and quadratic terms. 2 Derive a tight upper bound pen ≥ penid with high probability. Previous work: Birgé, Massart (2001): Gaussian assump. + real valued functions. −→ cannot be extended to Hilbert framework. 15/24 Kernel change-point detection Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment Concentration of the linear term Theorem (Linear term) Let us consider a segmentation m and assume (Db) − (Vmax) hold true. Then for every x > 0 with probability at least 1 − 2e −x , vmax 4M 2 ? ? ? ? 2 + |< Πm µ − µ , ε >| ≤ θ kΠm µ − µ k + x , θ 3 for every θ > 0. 16/24 Kernel change-point detection Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment Concentration of the quadratic term Theorem (Quadratic term) Assuming (Db)-(Vmax), and ∃κ ≥ 1, 0< M2 ≤ min vi i κ (Vmin) for every m ∈ Mn , x > 0, and θ ∈ (0, 1], h i h i 2 2 2 bm k + θ−1 L(κ)vmax x , kΠm εk − E kΠm εk ≤ θE kΠm µ? − µ with probability at least 1 − 2e −x , where L(κ) is a constant. Idea of the proof: P Pinelis-Sakhanenko’s inequality ( i∈λ εi H ). Bernstein’s inequality (upper bounding moments) Kernel change-point detection 17/24 Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment Oracle inequality Theorem Let us assume (Db)-(Vmin)-(Vmax) and for every x > 0, define n o b ∈ Argminm kY − µ m bm k2 + pen(m) , i h where pen(m) = vmax Dm C1 ln Dnm + C2 for constants C1 , C2 > 0. Then, there exists an event of probability larger than 1 − 2e −x on which n o 2 ? kµ? − µ bm bm k2 + pen(m) + ∆2 , b k ≤ ∆1 inf kµ − µ m where ∆1 ≥ 1 and ∆2 > 0 is a remainder term. Rk: h i In Birgé, Massart (2001), pen(m) = σ 2 Dm c1 ln Dnm + c2 . Kernel change-point detection 18/24 Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment Model selection procedure Penalty: pen(m) = vmax Dm C1 ln n Dm + C2 = pen(Dm ) . Algorithm 1 For every 1 ≤ D ≤ D max , b D ∈ Argminm, m 2 Dm =D n o kY − µ bm k2 , Define io n h n 2 b = Argmin Y − µ . D b + v D C ln + C 2 max 1 bD m D D 3 where C1 , C2 : computed by simulation experiments. Final estimator: . µ bm bm b := µ bD b Kernel change-point detection 19/24 Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment Changes in the distribution (synthetic data) Description: 1 n = 1 000, D ∗ = 4, Nrep = 100. 2 In each segment, observations generated according to one distribution within a pool of 10 distributions with same mean and variance. 3 Our kernel based approach enables to distinguish them (higher order moments) 20/24 Kernel change-point detection Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment Changes in the distribution (synthetic data) (Cont’.) Results 21/24 Kernel change-point detection Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment “Le grand échiquier”, 70s-80s French talk show Audio and video recordings. Audio: different situations can be distinguished from sound recordings (music, applause, speech,. . . ). Video: different video scenes can be distinguished by their backgrounds or specific actions of people (clapping hands, discussing,. . . ). Kernel change-point detection 22/24 Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment Video sequence Description: n = 10 000, D ∗ = 4. Each image summarized by a histogram with 1 024 bins. χ2 kernel: kd (x, y ) = Pd i=1 (xi −yi )2 xi +yi · Results: 23/24 Kernel change-point detection Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment Concluding remarks Open questions: 1 Relax the assumption on the variance. 2 Use resampling strategies (hetoroscedasticity). 3 Influence of the choice of kernel. Preprint: ArXiv http://www.math.univ-lille1.fr/~celisse/ 24/24 Kernel change-point detection Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment Concluding remarks Open questions: 1 Relax the assumption on the variance. 2 Use resampling strategies (hetoroscedasticity). 3 Influence of the choice of kernel. Preprint: ArXiv http://www.math.univ-lille1.fr/~celisse/ Thank you! 24/24 Kernel change-point detection Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment 25/24 Kernel change-point detection Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment Sketch of proof 1 kΠm εk2 = 2 nP 1 λ∈m nλ P 2 o ε i i∈λ H λ∈m 2 i∈λ εi H P = P λ∈m are independent r.v. . 3 Bernstein’s inequality to kΠm εk2 4 For every q ≥ 2, upper bound of E Tλq . 5 Tλ . (?). P Pinelis-Sakhanenko’s inequality on i∈λ εi H : # " X 2 x ∀x > 0, P , εi > x ≤ 2 exp − 2 σλ2 + bλ x i∈λ H with bλ = 2M/3 and σλ2 = Kernel change-point detection P i∈λ vi . 26/24 Alain Celisse Intro. Framework Which change-points? (D known) How many change-points? Empirical assessment Bernstein rather than Talagrand Talagrand’s inequality P kΠm εk = supf ∈Bn < f , Πm ε >= supf ∈Bn ni=1 < fi , (Πm ε)i >H √ b P kΠm εk ≤ E [ kΠm εk ] + 2vx + x , 3 P with v = ni=1 supf E < fi , (Πm ε)i >2H + 16bE [ kΠm εk ]. Bernstein’s inequality σ 2 = sup f n X h i E < fi , (Πm ε)i >2H = E kΠm εk2 . i=1 27/24 Kernel change-point detection Alain Celisse