Kernel change-point detection

advertisement
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
Kernel change-point detection
Alain Celisse
1 UMR
8524 CNRS - Université Lille 1
2 Modal
INRIA team-project
3 SSB
Group, Paris
joint work with Sylvain Arlot and Zaı̈d Harchaoui
Workshop: “Recent advances in changepoints analysis”
Warwick University, March 28, 2012
1/24
Kernel change-point detection
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
1-D signal (example)
1.2
1
0.8
0.6
Signal
0.4
0.2
0
−0.2
−0.4
−0.6
−0.8
0
10
20
30
40
50
60
70
80
90
100
Position t
2/24
Kernel change-point detection
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
1-D signal (example): Find abrupt changes in the mean
1.2
Signal
Reg. func.
1
0.8
0.6
Signal
0.4
0.2
0
−0.2
?
?
−0.4
−0.6
−0.8
0
10
20
30
40
50
60
70
80
90
100
Position t
2/24
Kernel change-point detection
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
Estimation rather than identification
Signal: Y
Reg. func. s
1
0.8
Fact:
With finite sample, it is
impossible to recover
change-point in noisy regions.
0.6
0.4
0.2
0
−0.2
55
60
65
70
75
80
85
90
95
100
Purpose:
Estimate the regression function.
Idea:
−→ Without too strong noise, recover true change-points.
Kernel change-point detection
3/24
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
Example 1: Changes in the distribution
Description:
Observations generated along the time.
Observation distribution is piecewise constant along the time.
Observations have the same mean and variance.
−→ Detecting changes in the mean and variance is useless.
4/24
Kernel change-point detection
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
Example 2: Working with non-vectorial objects
Description:
Video sequences from “Le grand échiquier”, 70s-80s French
talk show.
At each time, one observes an image (high-dimensional).
Each image is summarized by a histogram.
Kernel change-point detection
5/24
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
Example 2: Working with non-vectorial objects
Preprocessing images
(patches in yellow).
Each histogram bin
corresponds to a patch.
Non-vectorial object:
Histograms with D bins belong to
n
o
PD
(p1 , . . . , pD ) ∈ [0, 1]D ,
i=1 pi = 1 .
−→ Algorithms for vectorial data are useless.
5/24
Kernel change-point detection
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
Detect abrupt changes. . .
General purposes:
1
Detect changes in the whole distribution (not only in the
mean)
2
High-dimensional data of different nature:
Vectorial: measures in Rd , curves (sound recordings,. . . )
Non vectorial: phenotypic data, graphs, DNA sequence,. . .
Both vectorial and non vectorial data.
3
Efficient algorithm allowing to deal with large data sets
6/24
Kernel change-point detection
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
Kernel and Reproducing Kernel Hilbert Space (RKHS)
X1 , . . . , Xn ∈ X : initial observations.
k(·, ·) : X × X → R: reproducing kernel (H: RKHS).
φ(·) : X → H s.t. φ(x) = k(x, ·): canonical feature map.
< ·, · >H : inner-product in H.
Asset:
Enables to work with high-dimensional heterogeneous data.
7/24
Kernel change-point detection
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
Model
Mapping of the initial data
∀1 ≤ i ≤ n,
Yi = φ(Xi ) ∈ H .
−→ (t1 , Y1 ), . . . , (tn , Yn ) ∈ [0, 1] × H :
independent .
8/24
Kernel change-point detection
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
Model
Mapping of the initial data
∀1 ≤ i ≤ n,
Yi = φ(Xi ) ∈ H .
−→ (t1 , Y1 ), . . . , (tn , Yn ) ∈ [0, 1] × H :
independent .
Mean element
The mean element of PXi (distribution of Xi ) is µ?i :
< µ?i , f >H = EXi [ < φ(Xi ), f >H ] ,
∀f ∈ H .
Fact:
With characteristic kernels,
PXi 6= PXj
⇒
µ?i 6= µ?j .
8/24
Kernel change-point detection
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
Model
∀1 ≤ i ≤ n,
Yi = µ?i + εi
∈H ,
where
µ?i ∈ H: mean element of PXi (distribution of Xi )
∀i,
εi := Yi − µ?i ,
with
Eεi = 0,
h
i
vi := E kεi k2H .
8/24
Kernel change-point detection
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
Model
∀1 ≤ i ≤ n,
Yi = µ?i + εi
∈H ,
where
µ?i ∈ H: mean element of PXi (distribution of Xi )
∀i,
εi := Yi − µ?i ,
with
Eεi = 0,
h
i
vi := E kεi k2H .
Assumptions
1
maxi kYi kH ≤ M
2
maxi vi ≤ vmax
3
a.s.
(Db) .
(Vmax) .
µ? = (µ?1 , . . . , µ?n )0 ∈ Hn : piecewise constant.
P
kµ? − µk2 := ni=1 kµ?i − µi k2H .
Goal:
−→ Estimate µ? to recover change-points.
8/24
Kernel change-point detection
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
Model selection
Models:
Mn = {m, segmentation of {1, . . . , n}},
Dm = Card(m).
Sm = {µ : (t1 , . . . , tn ) → H, piecewise const. on (Iλ )λ∈m },
(Iλ )λ∈m : I1 = [0, tλ1 ], I2 =]tλ1 , tλ2 ], . . . , IDm =]tλDm −1 , 1].
Strategy:
(Sm )m∈Mn
−→
(b
µm )m∈Mn
−→
µ
bm
b
???
9/24
Kernel change-point detection
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
Model selection
Models:
Mn = {m, segmentation of {1, . . . , n}},
Dm = Card(m).
Sm = {µ : (t1 , . . . , tn ) → H, piecewise const. on (Iλ )λ∈m },
(Iλ )λ∈m : I1 = [0, tλ1 ], I2 =]tλ1 , tλ2 ], . . . , IDm =]tλDm −1 , 1].
Strategy:
(Sm )m∈Mn
−→
(b
µm )m∈Mn
−→
µ
bm
b
???
Goal:
Oracle inequality (in expectation, or with large probability):
n
o
2
2
?
kµ? − µ
bm
k
≤
C
inf
kµ
−
µ
b
k
+ rn .
m
b
m∈Mn
9/24
Kernel change-point detection
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
Least-squares estimator
Empirical risk minimizer over Sm (= model):
µ
bm ∈ arg min
u∈Sm
n
1X
ku(ti ) − Yi k2H =: arg min Pn γ(u) .
u∈Sm
n
i=1
Regressogram:
µ
bm =
X
λ∈m
βbλ 1Iλ
βbλ =
X
1
Yi .
Card {ti ∈ Iλ }
ti ∈Iλ
10/24
Kernel change-point detection
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
Choose D − 1 change-points. . .
Assumption:
(Harchaoui, Cappé (2007))
The number D − 1 of change-points is known.
Strategy:
b
Choose m(D)
among {m ∈ Mn , Dm = D}.
ERM algorithm:
b
b ERM (D) = Argminm|Dm =D kY − µ
m(D)
=m
bm k2 .
(dynamic programming).
11/24
Kernel change-point detection
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
Quality of the segmentations
12/24
Kernel change-point detection
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
Elementary calculations
Expectations
(vλ =
1
Card(λ)
P
i∈λ vi )
h
i
X
E kµ? − µ
bm k2 = kµ? − Πm µ? k2 +
vλ ,
λ∈m
h
2
E kY − µ
bm k
i
? 2
= kµ? − Πm µ k −
X
vλ + Cste ,
λ∈m
(Πm : orthog. proj. operator onto Sm ).
Conclusion:
−→ ERM prefers models with large
P
λ∈m vλ
(overfitting).
13/24
Kernel change-point detection
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
Overfitting illustration
14/24
Kernel change-point detection
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
Choose the number of change-points
From µ
bm
b D D , choose D amounts to choose the “best model”.
Ideal penalty:
m∗ ∈ Argminm∈M kµ? − µ
bm k2 (oracle segmentation)
n
o
= Argminm∈M kY − µ
bm k2 + penid (m) ,
with penid (m) := 2 kΠm εk2 − 2 < (I − Πm )µ? , ε >.
Strategy
1
Concentration inequalities for linear and quadratic terms.
2
Derive a tight upper bound pen ≥ penid with high probability.
Previous work:
Birgé, Massart (2001): Gaussian assump. + real valued functions.
−→ cannot be extended to Hilbert framework.
15/24
Kernel change-point detection
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
Concentration of the linear term
Theorem (Linear term)
Let us consider a segmentation m and assume (Db) − (Vmax)
hold true. Then for every x > 0 with probability at least 1 − 2e −x ,
vmax
4M 2
?
?
?
? 2
+
|< Πm µ − µ , ε >| ≤ θ kΠm µ − µ k +
x ,
θ
3
for every θ > 0.
16/24
Kernel change-point detection
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
Concentration of the quadratic term
Theorem (Quadratic term)
Assuming (Db)-(Vmax), and
∃κ ≥ 1,
0<
M2
≤ min vi
i
κ
(Vmin)
for every m ∈ Mn , x > 0, and θ ∈ (0, 1],
h
i
h
i
2
2 2
bm k + θ−1 L(κ)vmax x ,
kΠm εk − E kΠm εk ≤ θE kΠm µ? − µ
with probability at least 1 − 2e −x , where L(κ) is a constant.
Idea of the proof:
P
Pinelis-Sakhanenko’s inequality ( i∈λ εi H ).
Bernstein’s inequality (upper bounding moments)
Kernel change-point detection
17/24
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
Oracle inequality
Theorem
Let us assume (Db)-(Vmin)-(Vmax) and for every x > 0, define
n
o
b ∈ Argminm kY − µ
m
bm k2 + pen(m) ,
i
h
where pen(m) = vmax Dm C1 ln Dnm + C2 for constants
C1 , C2 > 0. Then, there exists an event of probability larger than
1 − 2e −x on which
n
o
2
?
kµ? − µ
bm
bm k2 + pen(m) + ∆2 ,
b k ≤ ∆1 inf kµ − µ
m
where ∆1 ≥ 1 and ∆2 > 0 is a remainder term.
Rk:
h
i
In Birgé, Massart (2001), pen(m) = σ 2 Dm c1 ln Dnm + c2 .
Kernel change-point detection
18/24
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
Model selection procedure
Penalty:
pen(m) = vmax Dm C1 ln
n
Dm
+ C2
= pen(Dm ) .
Algorithm
1 For every 1 ≤ D ≤ D
max ,
b D ∈ Argminm,
m
2
Dm =D
n
o
kY − µ
bm k2 ,
Define
io
n
h
n
2
b = Argmin Y − µ
.
D
b
+
v
D
C
ln
+
C
2
max
1
bD
m
D
D
3
where C1 , C2 : computed by simulation experiments.
Final estimator:
.
µ
bm
bm
b := µ
bD
b
Kernel change-point detection
19/24
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
Changes in the distribution (synthetic data)
Description:
1
n = 1 000, D ∗ = 4, Nrep = 100.
2
In each segment, observations generated according to one
distribution within a pool of 10 distributions with same mean
and variance.
3
Our kernel based approach enables to distinguish them (higher
order moments)
20/24
Kernel change-point detection
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
Changes in the distribution (synthetic data) (Cont’.)
Results
21/24
Kernel change-point detection
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
“Le grand échiquier”, 70s-80s French talk show
Audio and video recordings.
Audio: different situations can be distinguished from sound
recordings (music, applause, speech,. . . ).
Video: different video scenes can be distinguished by their
backgrounds or specific actions of people (clapping hands,
discussing,. . . ).
Kernel change-point detection
22/24
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
Video sequence
Description:
n = 10 000, D ∗ = 4.
Each image summarized by a histogram with 1 024 bins.
χ2 kernel:
kd (x, y ) =
Pd
i=1
(xi −yi )2
xi +yi ·
Results:
23/24
Kernel change-point detection
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
Concluding remarks
Open questions:
1
Relax the assumption on the variance.
2
Use resampling strategies (hetoroscedasticity).
3
Influence of the choice of kernel.
Preprint:
ArXiv
http://www.math.univ-lille1.fr/~celisse/
24/24
Kernel change-point detection
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
Concluding remarks
Open questions:
1
Relax the assumption on the variance.
2
Use resampling strategies (hetoroscedasticity).
3
Influence of the choice of kernel.
Preprint:
ArXiv
http://www.math.univ-lille1.fr/~celisse/
Thank you!
24/24
Kernel change-point detection
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
25/24
Kernel change-point detection
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
Sketch of proof
1
kΠm εk2 =
2
nP
1
λ∈m nλ
P
2 o
ε
i
i∈λ
H
λ∈m
2
i∈λ εi H
P
=
P
λ∈m
are independent r.v. .
3
Bernstein’s inequality to kΠm εk2
4
For every q ≥ 2, upper bound of E Tλq .
5
Tλ .
(?).
P
Pinelis-Sakhanenko’s inequality on i∈λ εi H :


#
"
X 2
x
∀x > 0, P  ,
εi > x  ≤ 2 exp −
2 σλ2 + bλ x
i∈λ
H
with bλ = 2M/3 and σλ2 =
Kernel change-point detection
P
i∈λ vi .
26/24
Alain Celisse
Intro.
Framework
Which change-points? (D known)
How many change-points?
Empirical assessment
Bernstein rather than Talagrand
Talagrand’s inequality
P
kΠm εk = supf ∈Bn < f , Πm ε >= supf ∈Bn ni=1 < fi , (Πm ε)i >H
√
b
P kΠm εk ≤ E [ kΠm εk ] + 2vx + x ,
3
P
with v = ni=1 supf E < fi , (Πm ε)i >2H + 16bE [ kΠm εk ].
Bernstein’s inequality
σ 2 = sup
f
n
X
h
i
E < fi , (Πm ε)i >2H = E kΠm εk2 .
i=1
27/24
Kernel change-point detection
Alain Celisse
Download