Approaches to Function-free Local Smoothing
with Application to Speed-flow Diagrams

Jochen Einbeck
joint work with Gerhard Tutz, University of Munich
10th of February 2005

• Local principal curves
• Multi-valued nonparametric regression


What is "Smoothing"?

Typical definition:
"Given observations from an explanatory variable X and a response variable Y, construct a function, a 'smoother', which at point x estimates the average value of Y given that X = x. Also called nonparametric regression." (Holmström et al., 2003)
Nonparametric regression

Given n data points (Xi, Yi), i = 1, ..., n, one seeks a smooth function m: R → R relating X and Y in a proper way, which can generally be expressed in the form

    m(x) = Φ(Y | X = x).

Possible choices of the operator Φ(·) are obtained as solutions of the minimization problem

    m_l(x) = arg min_a E( l(Y − a) | X = x ),

yielding

    Φ(·)         l(z)
    ---------    ------
    E(·)         z²
    Median(·)    |z|
    Mode(·)      −δ(z)
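The three operators in the table can all be realized as kernel-weighted estimates at a single point x. A minimal sketch (the function names, toy data and bandwidth are mine, not from the talk; the mode is found by maximizing a weighted density estimate over a grid):

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u ** 2)

def local_estimates(X, Y, x, h):
    """Kernel-weighted mean, median and (grid-searched) mode of Y near X = x."""
    w = gauss((X - x) / h)
    w = w / w.sum()
    mean = np.sum(w * Y)                        # Phi = E,      l(z) = z^2
    order = np.argsort(Y)                       # Phi = Median, l(z) = |z|
    median = Y[order][np.searchsorted(np.cumsum(w[order]), 0.5)]
    grid = np.linspace(Y.min(), Y.max(), 200)   # Phi = Mode,   l(z) = -delta(z)
    dens = np.array([np.sum(w * gauss((g - Y) / h)) for g in grid])
    mode = grid[np.argmax(dens)]
    return mean, median, mode

# toy bimodal response: 70% of the mass near y = 1, 30% near y = 3
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, 400)
Y = np.where(rng.uniform(size=400) < 0.7, 1.0, 3.0) + 0.1 * rng.normal(size=400)
mean, median, mode = local_estimates(X, Y, x=0.5, h=0.1)
```

On such bimodal data the local mean lands between the branches, while the local median and mode settle on the heavier branch, which is why the choice of Φ matters.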
Limits of ordinary nonparametric regression

Speed-flow diagram:
X: traffic flow in cars/hour,
Y: speed in miles/hour,
recorded on a Californian "freeway".

[Figure: scatter plot of speed (0–70 miles/hour) against flow (0–2000 cars/hour).]
Attempt: Ordinary nonparametric regression

    Y = m(X) + ε ;  with a function m: R → R

[Figure: speed-flow data with local mean and local median fits.]

−→ Obviously, some information is discarded!
1st alternative approach: Principal curves

    (X, Y)ᵀ = f(t) + ε ;  f: R → R²

Intuitive definition:
Principal curves are smooth functions passing through the middle of the data cloud.
Local principal curves (LPC)

Idea: Calculate alternately a local center of mass and a first local principal component.

[Figure: LPC construction on a two-dimensional data cloud (axis X_1); m: points of the LPC, 0: starting point, 1, 2, 3: enumeration of steps.]
Algorithm: Local principal curves

Assume a data cloud X = (X1, ..., Xn)ᵀ, where Xi = (Xi1, ..., Xid)ᵀ.

1. Choose a starting point x0. Set x = x0.
2. At x, calculate the local center of mass μ^x = Σ_{i=1}^n wi Xi, where wi = K_H(Xi − x) / Σ_{j=1}^n K_H(Xj − x).
3. Perform a PCA locally at x, i.e. compute the first eigenvector γ_1^x of the local covariance matrix Σ^x = (σ_jk^x)_{1≤j,k≤d}, where σ_jk^x = Σ_{i=1}^n wi (Xij − μ_j^x)(Xik − μ_k^x).
4. Step from μ^x in direction of the first principal component to x := μ^x + t0 γ_1^x.
5. Repeat steps 2. to 4. until the μ^x remain constant.
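The five steps above translate almost line by line into code. A minimal sketch (Gaussian kernel, fixed bandwidth h and step length t0; the function names and the half-circle toy data are illustrative choices, not the talk's):

```python
import numpy as np

def lpc(X, x0, h=0.1, t0=0.05, steps=50):
    """Follow a local principal curve through the data cloud X from x0."""
    x = np.asarray(x0, float)
    gamma_prev, curve = None, []
    for _ in range(steps):
        # step 2: local centre of mass mu^x
        w = np.exp(-0.5 * np.sum(((X - x) / h) ** 2, axis=1))
        w = w / w.sum()
        mu = w @ X
        # step 3: local PCA -- first eigenvector of the weighted covariance
        C = (X - mu).T @ (w[:, None] * (X - mu))
        vals, vecs = np.linalg.eigh(C)
        gamma = vecs[:, -1]                   # eigenvector of the largest eigenvalue
        if gamma_prev is not None and gamma @ gamma_prev < 0:
            gamma = -gamma                    # keep a consistent walking direction
        gamma_prev = gamma
        curve.append(mu)
        # step 4: move along the first principal component
        x = mu + t0 * gamma
    return np.array(curve)

# half-circle shaped data cloud; the LPC should trace the circle
rng = np.random.default_rng(0)
t = rng.uniform(0, np.pi, 500)
X = np.c_[np.cos(t), np.sin(t)] + 0.05 * rng.normal(size=(500, 2))
curve = lpc(X, X[0])
```

The sign flip on gamma replaces the talk's convergence criterion in step 5 with a fixed number of steps, which keeps the sketch short.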
A central tool of the LPC algorithm is the Mean Shift,

    M(x) = μ(x) − x,

i.e. the movement from the current point to the local center of mass. For a kernel density estimate f̂_K(x), Comaniciu & Meer (2002) show that

(A) M(x) ∼ ∇f̂_K(x), and

(B) the sequence

    m^(0) = x,  m^(k+1) = μ(m^(k))    (1)

converges to a neighboring critical point of f̂_K(·).
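Sequence (1) can be sketched as a plain fixed-point iteration. With a Gaussian kernel (kernel, bandwidth and the two-cluster toy data are my choices, not the talk's), each starting point climbs to the nearest mode of the kernel density estimate:

```python
import numpy as np

def mean_shift(X, x, h=0.3, tol=1e-8, max_iter=500):
    """Iterate m(k+1) = mu(m(k)) until the mean shift M(x) vanishes."""
    x = np.asarray(x, float)
    for _ in range(max_iter):
        w = np.exp(-0.5 * np.sum(((X - x) / h) ** 2, axis=1))
        mu = (w @ X) / w.sum()            # local centre of mass mu(x)
        if np.linalg.norm(mu - x) < tol:  # M(x) = mu(x) - x is ~0 at a critical point
            break
        x = mu
    return x

# two well-separated clusters: each start converges to its own density mode
rng = np.random.default_rng(0)
X = np.r_[rng.normal(0.0, 0.2, (200, 2)), rng.normal(3.0, 0.2, (200, 2))]
m1 = mean_shift(X, [0.5, 0.5])
m2 = mean_shift(X, [2.5, 2.5])
```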
Kernel density estimate of speed-flow data and LPC (+).

[Figure: density surface over the (X, Y) plane with the LPC points (+) running along its ridge.]

Conclusion: The LPC algorithm approximates the crest line of a kernel density estimate.
LPC's of higher order

Consider the second local eigenvalue λ_2^x, i.e. the second largest eigenvalue of Σ^x: If this value is large at a certain point of the original LPC, a new LPC is launched in direction of the second local eigenvector γ_2^x.

Example: Simulated "E" and flow diagram of the quantity λ_2^x / λ_1^x.

[Figure: LPC through a simulated "E", and the eigenvalue ratio λ_2^x / λ_1^x along the curve.]
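The branching criterion above is just the ratio of the two leading eigenvalues of the local covariance matrix. A sketch (Gaussian kernel; the data and names are mine) that evaluates it on a straight segment versus the junction of a "T"-shaped cloud:

```python
import numpy as np

def local_eigenvalue_ratio(X, x, h=0.1):
    """lambda_2^x / lambda_1^x of the local covariance matrix Sigma^x at x."""
    w = np.exp(-0.5 * np.sum(((X - x) / h) ** 2, axis=1))
    w = w / w.sum()
    mu = w @ X                                      # local centre of mass
    C = (X - mu).T @ (w[:, None] * (X - mu))        # weighted covariance
    vals = np.sort(np.linalg.eigvalsh(C))[::-1]     # eigenvalues, descending
    return vals[1] / vals[0]

# "T"-shaped cloud: a horizontal line plus a vertical bar meeting at (0, 0)
rng = np.random.default_rng(4)
line = np.c_[rng.uniform(-1, 1, 400), 0.02 * rng.normal(size=400)]
bar = np.c_[0.02 * rng.normal(size=400), rng.uniform(0, 1, 400)]
Xj = np.r_[line, bar]
r_straight = local_eigenvalue_ratio(Xj, np.array([0.7, 0.0]))  # on the line
r_junction = local_eigenvalue_ratio(Xj, np.array([0.0, 0.0]))  # at the junction
```

On the straight segment the ratio stays small; at the junction both eigenvalues are of comparable size, signalling that a second LPC should be launched along γ_2^x.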
LPC's through simulated letters (C, E, K)

[Figure: fitted LPC's, with numbered higher-order branches, through three simulated letter-shaped data clouds.]
A further example: Floodplains in Pennsylvania

[Figure: floodplain data cloud with fitted LPC's.]
A further example: Coastal resorts in Europe

[Figure: coastal resort locations with fitted LPC's.]
Concluding remarks

• LPC's work well in a variety of data situations, and show (in particular for complicated data structures) a better performance than the "global" algorithms from Hastie & Stuetzle (1989) and Kégl et al. (2000).
• However, the theoretical grounding is somewhat shaky.
• Bandwidth selection works by means of a Coverage measure.
• (Local) Principal Curves are not suitable for prediction of Y from a given X = x.
2nd alternative approach: Multi-valued nonparametric regression

    Y = r(X) + ε, with a multifunction r: R → R

The data cloud is assumed to consist of several (smooth) branches. For every X = x, more than one predicted value is possible.

[Figure: speed-flow data with a multi-valued fit through both branches.]
Basic idea: Consider the conditional densities.

• For estimation of r(x), compute the modes of the estimated conditional densities f̂(y|x).
• The area between a mode and the neighboring 'antimode' serves as estimated probability that, given x, a value on the corresponding branch is attained.

For instance, the conditional density at flow = 1400 is bimodal, with estimated branch probabilities of 92.3% and 7.7%.

[Figure: f(speed | 1400) over speed from 0 to 70.]
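The two bullets above can be sketched numerically: estimate f̂(y|x) on a grid, take its local maxima as branch predictions, and sum the mass between neighboring antimodes to get branch probabilities. A toy sketch with Gaussian kernels (data, bandwidths and names are mine, not the speed-flow data):

```python
import numpy as np

def conditional_branches(X, Y, x, h1=0.1, h2=0.3, ngrid=400):
    """Modes of f^(y|x) and the probability mass of each branch."""
    w = np.exp(-0.5 * ((X - x) / h1) ** 2)
    w = w / w.sum()
    grid = np.linspace(Y.min() - 1, Y.max() + 1, ngrid)
    dg = grid[1] - grid[0]
    dens = np.array([np.sum(w * np.exp(-0.5 * ((g - Y) / h2) ** 2)) for g in grid])
    dens = dens / (dens.sum() * dg)              # normalised conditional density
    inner = np.arange(1, ngrid - 1)
    modes = inner[(dens[inner] > dens[inner - 1]) & (dens[inner] > dens[inner + 1])]
    antis = inner[(dens[inner] < dens[inner - 1]) & (dens[inner] < dens[inner + 1])]
    cuts = np.r_[0, antis, ngrid - 1]            # branch borders: the antimodes
    probs = np.array([dens[a:b + 1].sum() * dg for a, b in zip(cuts[:-1], cuts[1:])])
    return grid[modes], probs

# two branches: 80% of observations near y = 5, 20% near y = 1
rng = np.random.default_rng(2)
n = 1000
upper = rng.uniform(size=n) < 0.8
X = rng.uniform(0, 1, n)
Y = np.where(upper, 5.0, 1.0) + 0.3 * rng.normal(size=n)
modes, probs = conditional_branches(X, Y, x=0.5)
```

The two recovered modes sit near the branch centers, and the areas on either side of the antimode recover the 20/80 branch split.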
Relevance assessment

The relevance of the branches varies smoothly for varying x, as long as the branches can be separated.

[Figure: P(Car in uncongested traffic) and P(Car in congested traffic), plotted against flow from 0 to 2000.]
Estimation of conditional modes

We are interested in all local maxima of the estimated conditional densities

    f̂(y|x) = f̂(x, y) / f̂(x)
            = Σ_{i=1}^n K1((x − Xi)/h1) K2((y − Yi)/h2) / [ h2 Σ_{i=1}^n K1((x − Xi)/h1) ],

with kernels K1, K2 and bandwidths h1, h2. We assume that a profile k(·) for kernel K2 exists such that

    K2(·) = c_k k((·)²)

holds. One calculates

    ∂f̂(y|x)/∂y = (2 c_k / h2³) · Σ_{i=1}^n K1((x − Xi)/h1) k′(((y − Yi)/h2)²) (y − Yi) / Σ_{i=1}^n K1((x − Xi)/h1) = 0

and obtains

    y = Σ_{i=1}^n K1((x − Xi)/h1) G((y − Yi)/h2) Yi / Σ_{i=1}^n K1((x − Xi)/h1) G((y − Yi)/h2),    (2)

with G(·) = −k′((·)²).

Remarks:

• Applying (2) recursively, this can be seen as a conditional mean shift! Unlike for LPC's, this iteration has to be run until convergence.
• The right side of (2) corresponds to the "Sigma-Filter" used in digital image smoothing. Thus, the sigma filter is a one-step approximation to the conditional mode.
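Iterating (2) gives the conditional mean shift in y for fixed x. A minimal sketch with Gaussian kernels, for which G is itself proportional to a Gaussian (bandwidths, toy data and names are illustrative choices):

```python
import numpy as np

def conditional_mean_shift(X, Y, x, y0, h1=0.1, h2=0.3, tol=1e-8, max_iter=500):
    """Fixed-point iteration of (2): climb f^(y|x) in y for fixed x."""
    wx = np.exp(-0.5 * ((X - x) / h1) ** 2)   # K1((x - Xi)/h1), fixed over the iteration
    y = float(y0)
    for _ in range(max_iter):
        g = wx * np.exp(-0.5 * ((y - Y) / h2) ** 2)   # K1 * G((y - Yi)/h2)
        y_new = np.sum(g * Y) / np.sum(g)             # right-hand side of (2)
        if abs(y_new - y) < tol:
            break
        y = y_new
    return y

# bimodal response: branches near y = 1 and y = 5
rng = np.random.default_rng(3)
n = 1000
X = rng.uniform(0, 1, n)
Y = np.where(rng.uniform(size=n) < 0.5, 1.0, 5.0) + 0.3 * rng.normal(size=n)
low = conditional_mean_shift(X, Y, x=0.5, y0=0.0)   # started below -> lower mode
high = conditional_mean_shift(X, Y, x=0.5, y0=6.0)  # started above -> upper mode
```

Stopping after the first update of y would give the sigma-filter approximation mentioned in the second remark.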
Outlook: Antiregression and Classification

If one plots the antimodes, which are obtained as a by-product of the computation of the relevances, one obtains an "antiprediction" or antiregression curve. This curve serves as a separator between the branches, and thus as a tool to classify observations to the uncongested or congested regime.

[Figure: speed-flow data with the conditional antimode curves separating UNCONGESTED TRAFFIC from CONGESTION.]
Summary:

• The conditional mode is more useful for the analysis of multimodal data than the conditional mean or median.
• The mode is robust to outliers and is edge-preserving.
• Maxima of the conditional density can be calculated fast and easily via a conditional mean shift procedure.

To do:

• Bandwidth selection (for conditional densities already available: Fan et al., 1996)
• Bias, variance? Asymptotics? (for conditional modes: Berlinet et al., 1998)
• Relation to mixture models?
Literature

• Principal Curves:

Hastie, T. & Stuetzle, W. (1989): Principal curves. JASA, 84, 502–516.
Kégl, B., Krzyzak, A., Linder, T. & Zeger, K. (2000): Learning and design of principal curves. IEEE Transactions Patt. Anal. Mach. Intell., 24, 59–74.
Einbeck, J., Tutz, G. & Evers, L. (2003): Local principal curves. SFB 386 Discussion Paper No. 320, accepted in Statistics and Computing.
Einbeck, J., Tutz, G. & Evers, L. (2005): Exploring multivariate data structures with local principal curves. To appear in Proceedings GFKL 2004.

• Multi-valued nonparametric regression:

Einbeck, J. & Tutz, G. (2004): Modelling beyond regression functions: an application of multimodal regression to speed-flow data. SFB 386 Discussion Paper No. 395.

Related literature:

Berlinet, A., Gannoun, A. & Matzner-Løber, E. (1998): Normalité asymptotique d'estimateurs convergents du mode conditionnel. Canadian Journal of Statistics, 26, 365–380.
Fan, J., Yao, Q. & Tong, H. (1996): Estimation of conditional densities and sensitivity measures in nonlinear dynamical systems. Biometrika, 83, 189–206.