Local Principal Curves

Jochen Einbeck
Department of Mathematics, NUI Galway
jochen.einbeck@nuigalway.ie

joint work with
Gerhard Tutz, University of Munich, and Ludger Evers, University of Oxford.

Dublin - 22nd November 2005
Descriptive Definition

Principal curves are smooth curves passing through the 'middle' of a multidimensional data cloud X = (X1, . . . , Xn), where Xi ∈ R^d.

Example: Speed-flow diagram. X: traffic flow in cars/hour, Y: speed in miles/hour, recorded on a Californian freeway.
[Figure: speed-flow data.]
Types of principal curves

There exist a variety of definitions of principal curves, which essentially vary in what is understood by the "middle" of a data cloud. The algorithms associated with these definitions can be divided into two major groups:

• Global ('top-down') algorithms start with an initial line and try to extend this line, or concatenate other lines to the initial line, until the resulting curve fits well through the data cloud.
• Local ('bottom-up') algorithms estimate the principal curve locally, moving step by step through the data cloud.
Principal curve definitions associated to 'top-down' approaches

Hastie & Stuetzle (HS, 1989) define a point on the principal curve as the average of all points which project there ('self-consistency'). Self-consistent curves m : I −→ R^d are obtained as critical points of the distance function
$$\Delta(m) = E\left(\inf_t \|X - m(t)\|^2\right) \qquad (1)$$
and generalize linear principal components in a natural way.
(Picture from: Hastie & Stuetzle, 1989)

Kégl, Krzyzak, Linder & Zeger (KKLZ, 2000) define a principal curve as the curve minimizing the average squared distance (1) over all curves with bounded length L.
Tibshirani (1992) defines principal curves such that, for data generated as
$$X = m(t) + \varepsilon, \quad E(\varepsilon) = 0,$$
the curve m is also a principal curve of the data cloud X.

Properties of 'top-down' algorithms:
• Starting from the first principal component line of the whole data set, the principal curve is estimated iteratively with EM-like algorithms.
• Dependence on an initial line leads to a lack of flexibility, as an initially unsuitable assignment of projection indices can often not be corrected in the further run of the algorithm.
• Estimation of branched or disconnected data clouds is not (directly) possible.
Alternative: 'Bottom-up' algorithms

Delicado (2001) defines principal curves as a sequence of fixed points of the function µ*(x) = E(X | X ∈ H), where H is the hyperplane through x that locally minimizes the variance of the data points projected onto it. He estimates 'PCOPs' using a fixed-point algorithm moving smoothly through the data cloud.
(Picture from: Delicado, 2001)

• Mathematically elegant.
• Works fine for most (not too complex) data sets.
• However, quite complicated and computationally demanding.
• Requires a cluster analysis at every point of the principal curve.
A simple alternative 'bottom-up' approach: Local principal curves (LPC)

Idea: Calculate alternately a local center of mass and a first local principal component.

[Figure: construction of an LPC on a two-dimensional data cloud. 0: starting point, m: points of the LPC, 1, 2, 3: enumeration of steps.]
Algorithm for LPCs

Given: A data cloud X = (X1, . . . , Xn), where Xi = (Xi1, . . . , Xid).

1. Choose a starting point x0. Set x = x0.
2. At x, calculate the local center of mass
   $$\mu^x = \sum_{i=1}^{n} w_i X_i, \quad \text{where } w_i = K_H(X_i - x) \Big/ \sum_{i=1}^{n} K_H(X_i - x).$$
3. Compute the first local eigenvector γ^x of Σ^x = (σ^x_{jk})_{1≤j,k≤d}, where
   $$\sigma^x_{jk} = \sum_{i=1}^{n} w_i (X_{ij} - \mu^x_j)(X_{ik} - \mu^x_k).$$
4. Step from µ^x to x := µ^x + t0 γ^x.
5. Repeat steps 2. to 4. until the µ^x remain constant. Then set x = x0, set γ^x := −γ^x, and continue with 4.

The sequence of the local centers of mass µ^x makes up the local principal curve (LPC); a code sketch follows below.
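As a concrete illustration of steps 1-5, here is a minimal numpy sketch under some assumptions not fixed by the slides: a Gaussian kernel K_H with bandwidth matrix H = h²I, a fixed step length t0, a fixed number of steps instead of the convergence check, and only the forward branch (the restart from x0 with reversed γ^x in step 5 is omitted). The function name lpc and the parameter defaults are illustrative, not taken from the original software.

import numpy as np

def lpc(X, x0, h=0.1, t0=0.05, n_steps=100):
    """Sketch of the LPC iteration: alternate between a local centre of mass
    and a step along the first local eigenvector (Gaussian kernel, bandwidth h,
    step length t0). Returns the sequence of local centres of mass."""
    x = np.asarray(x0, dtype=float).copy()
    gamma_prev = None
    path = []
    for _ in range(n_steps):
        # Step 2: kernel weights and local centre of mass mu^x
        w = np.exp(-0.5 * np.sum(((X - x) / h) ** 2, axis=1))
        w /= w.sum()
        mu = w @ X
        # Step 3: local covariance Sigma^x and its first eigenvector gamma^x
        Xc = X - mu
        Sigma = (w[:, None] * Xc).T @ Xc
        _, eigvec = np.linalg.eigh(Sigma)
        gamma = eigvec[:, -1]                 # eigenvector of the largest eigenvalue
        # signum flipping (see "Technical Details"): keep a consistent direction
        if gamma_prev is not None and gamma @ gamma_prev < 0:
            gamma = -gamma
        gamma_prev = gamma
        path.append(mu)
        # Step 4: step from mu^x along the first local eigenvector
        x = mu + t0 * gamma
    return np.array(path)

For the speed-flow data one could call, e.g., lpc(X, x0=X[0]), or start from several randomly chosen data points.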
Background

Kernel density estimate:
$$\hat f_K(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K\left(\frac{X_i - x}{h}\right) \qquad (2)$$

Comaniciu & Meer (2002): 'Mean Shift'    µ^x − x ∼ ∇\hat f_K(x)

A local principal curve approximates the density ridge. For instance, speed-flow data:
[Figure: kernel density estimate of the speed-flow data; the LPC follows the density ridge.]
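The following small numpy check illustrates the cited relation µ^x − x ∼ ∇fˆ_K(x): for a Gaussian kernel the local mean shift equals h²∇fˆ_K(x)/fˆ_K(x), so it points in the direction of the density gradient and hence towards the ridge. The Gaussian kernel and the simulated data are assumptions made only for this illustration.

import numpy as np

def gauss_kde(X, x, h):
    """Kernel density estimate (2) with a d-dimensional Gaussian kernel."""
    n, d = X.shape
    u = (X - x) / h
    K = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2 * np.pi) ** (d / 2)
    return K.sum() / (n * h ** d)

def mean_shift(X, x, h):
    """Local mean shift mu^x - x with Gaussian weights."""
    K = np.exp(-0.5 * np.sum(((X - x) / h) ** 2, axis=1))
    return (K[:, None] * (X - x)).sum(axis=0) / K.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
x, h, eps = np.array([0.5, -0.3]), 0.3, 1e-5
grad = np.array([(gauss_kde(X, x + eps * e, h) - gauss_kde(X, x - eps * e, h)) / (2 * eps)
                 for e in np.eye(2)])
print(mean_shift(X, x, h))                 # local mean shift
print(h ** 2 * grad / gauss_kde(X, x, h))  # h^2 * grad f / f, numerically equal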
Technical Details

• "Signum flipping": Check in every cycle whether γ_x^(i−1) ◦ γ_x^(i) > 0. Otherwise, set γ_x^(i) := −γ_x^(i).
• Angle penalization, to keep the principal curve from bending off at crossings.
• Use multiple initializations if the data cloud consists of several branches (e.g. using a random generator).
Simulated Examples

[Figure: Spirals with small noise — true curve and estimates by HS, KKLZ, Delicado and LPC.]
[Figure: Spirals with large noise — true curve and estimates by HS, KKLZ, Delicado and LPC.]
Measuring performance: Coverage

The coverage of a principal curve is the fraction of all data points found in a certain neighborhood of the principal curve. Formally, for a principal curve m consisting of a set Pm of points, the coverage is given by
$$C_m(\tau) = \#\{x \in X \mid \exists\, p \in P_m \text{ with } \|x - p\| \le \tau\}/n.$$
• The coverage can also be interpreted as the empirical distribution function of the residuals.
• The area between C_m(τ) and the constant 1 corresponds to the mean length of the observed residuals.
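The coverage is straightforward to compute; the sketch below assumes that the principal curve is represented by an array Pm of curve points (e.g. the local centres of mass returned by the lpc sketch above).

import numpy as np

def coverage(X, Pm, tau):
    """C_m(tau) = #{x in X : there is p in P_m with ||x - p|| <= tau} / n."""
    # distance of every data point to its nearest point on the curve
    dists = np.min(np.linalg.norm(X[:, None, :] - Pm[None, :, :], axis=2), axis=1)
    return np.mean(dists <= tau)

Evaluating coverage over a grid of τ values yields coverage curves like those on the next slide.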
Coverage for spiral data

[Figure: coverage curves C(τ) for the small and big spiral, each with small and large noise, comparing PCA, HS, KKLZ, Delicado and LPC.]
Residual mean length relative to principal components (A_C):

[Table: A_C for HS, KKLZ, Delicado and LPC on the small and the big spiral, each with small and large noise.]

• The closer to 0, the better the performance.
• The quantity R_C = 1 − A_C can be interpreted in analogy to R² used in regression analysis.
Bandwidth selection with self-coverage

Idea: A bandwidth suitable for the computation of a principal curve m should also cover the data cloud adequately. This motivates the definition of the self-coverage
$$S(\tau) = C_{m(\tau)}(\tau) = \frac{\#\{x \in X \mid \exists\, p \in P_{m(\tau)} \text{ with } \|x - p\| \le \tau\}}{n},$$
where P_{m(τ)} is the set of points belonging to a principal curve m(τ) calculated with bandwidth τ. Then
h = first local maximum of S(τ)
is a suitable bandwidth.
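A sketch of this selection rule, reusing the lpc and coverage helpers sketched above; the grid of candidate bandwidths, the single starting point and the fallback to the global maximum are assumptions for the illustration.

import numpy as np

def select_bandwidth(X, x0, taus, lpc_fn, coverage_fn):
    """Compute S(tau) = C_{m(tau)}(tau) over a grid of candidate bandwidths
    and return the first local maximum of S as the chosen bandwidth h."""
    S = np.array([coverage_fn(X, lpc_fn(X, x0, h=tau), tau) for tau in taus])
    for i in range(1, len(taus) - 1):
        if S[i] >= S[i - 1] and S[i] > S[i + 1]:
            return taus[i]
    return taus[np.argmax(S)]          # no interior local maximum found

Example call: select_bandwidth(X, X[0], np.linspace(0.02, 0.3, 15), lpc, coverage).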
Self-coverage for spiral data

[Figure: self-coverage curves S(τ) for the four spiral settings; selected bandwidths h = 0.05, 0.06, 0.12 and 0.15.]
Real data example: Floodplains in Pennsylvania

LPC with multiple (50) initializations.
[Figure: floodplain data with the fitted local principal curves.]
Further example: Coastal Resorts in Europe

[Figure: coastal resort locations in Europe with fitted LPCs.]
3D example: Phillips curves

Dependence between inflation (price index) and unemployment rate over time. Usually just seen as a two-dimensional problem (infl/rate):
(Picture from: Prof. Eisen, University of Frankfurt)

Price index and unemployment in the USA, 1995-2005, with LPC:
[Figure: three-dimensional plot of infl, rate and time with the fitted LPC.]
Higher-order LPCs

Consider the second local eigenvalue λ^x_2, i.e. the second-largest eigenvalue of Σ^x: if this value is large at a certain point of the original LPC, a new LPC is launched in the direction of the second local eigenvector γ^x_2. Every bifurcation raises the depth of the LPC tree.

Example
[Figure: simulated letter E and flow diagram of the relation λ^x_2/λ^x_1.]
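A hedged sketch of the bifurcation test described above: scan an already fitted LPC, and wherever the ratio of the second to the first local eigenvalue exceeds a threshold, record a new starting point and direction along γ^x_2. The threshold value rho and the exact way new branches are started are assumptions, not prescribed by the slides.

import numpy as np

def branch_candidates(X, path, h, rho=0.4):
    """Return (starting point, direction) pairs where lambda_2 / lambda_1 > rho,
    i.e. where a higher-order LPC of depth + 1 would be launched along gamma_2."""
    out = []
    for mu in path:                                  # path: local means of the fitted LPC
        w = np.exp(-0.5 * np.sum(((X - mu) / h) ** 2, axis=1))
        w /= w.sum()
        Xc = X - mu
        Sigma = (w[:, None] * Xc).T @ Xc
        lam, vec = np.linalg.eigh(Sigma)             # eigenvalues in ascending order
        if lam[-2] / lam[-1] > rho:
            out.append((mu, vec[:, -2]))             # second local eigenvector gamma_2
    return out

Each candidate can then be fed back into the lpc sketch to grow the next level of the LPC tree.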
LPCs through simulated letters (C, E, K)

[Figure: fitted LPCs and corresponding starting points with depth 1, 2, 3.]
Example: Scallops

[Figure: Top left: Scallops. Top right: Water depth. Bottom left, right: Two LPCs. 1, 2: branches of depth 1, 2.]
Summary

• LPCs work well in a variety of data situations and, particularly for noisy complex structures, are more suitable than the "global" algorithms developed by Hastie & Stuetzle (1989) and Kégl et al. (2000).
• LPCs can be seen as a simplified version of Delicado's 'PCOPs', but seem to work better than Delicado's algorithm for complex or branched data.
• Bandwidth selection works by means of a coverage measure.
• Drawbacks of LPCs: no statistical model and hence no 'true' principal curve; the estimated principal curves depend strongly on the selected starting point(s).
• General drawback of principal curves: principal curves are not suitable for prediction of Y for given X = x.
Outlook: Multi-valued regression

Goal: Estimate a multifunction r : R −→ R rather than a regression function.
Idea: Consider the conditional densities, e.g. for speed-flow data:
[Figure: speed-flow data (flow vs. speed).]
For instance, the conditional density at flow = 1400:
[Figure: estimated conditional density f(speed | 1400), with two branches carrying 92.3% and 7.7% of the probability mass.]
Relevance assessment

• For estimation of r(x), compute the modes of the estimated conditional densities fˆ(y|x).
• The area between a mode and the neighboring 'antimode' serves as the estimated probability that, given x, a value on the corresponding branch is attained.

[Figure: multi-valued regression curve for the speed-flow data, together with the estimated relevances P(Car in uncongested traffic) and P(Car in congested traffic) as functions of flow.]
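A grid-based sketch of how the modes, antimodes and relevances could be computed for one value of x: estimate fˆ(y|x) on a grid of y values with Gaussian kernels, take the local maxima as branch values r(x), and integrate the mass between neighbouring antimodes as the branch probabilities. Grid resolution, kernel choice and variable names are assumptions for the illustration.

import numpy as np

def branch_relevances(X, Y, x, h1, h2, y_grid):
    """Modes of the estimated conditional density f(y|x) and the probability mass
    (relevance) of each branch, i.e. the area between neighbouring antimodes."""
    Kx = np.exp(-0.5 * ((x - X) / h1) ** 2)                       # K1 weights
    dens = np.array([np.sum(Kx * np.exp(-0.5 * ((y - Y) / h2) ** 2)) for y in y_grid])
    dens /= np.trapz(dens, y_grid)                                # normalise on the grid
    inner = range(1, len(y_grid) - 1)
    modes = [i for i in inner if dens[i] > dens[i - 1] and dens[i] >= dens[i + 1]]
    antimodes = [i for i in inner if dens[i] < dens[i - 1] and dens[i] <= dens[i + 1]]
    cuts = [0] + antimodes + [len(y_grid) - 1]                    # split the grid at the antimodes
    masses = [np.trapz(dens[a:b + 1], y_grid[a:b + 1]) for a, b in zip(cuts[:-1], cuts[1:])]
    return y_grid[modes], masses

For the speed-flow data, branch_relevances(flow, speed, 1400, h1, h2, np.linspace(0, 70, 200)) would return the branch speeds and their masses, as in the 92.3% / 7.7% split shown above.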
Estimation of conditional modes

We are interested in all local maxima of the estimated conditional densities
$$\hat f(y|x) = \frac{\hat f(x,y)}{\hat f(x)} = \frac{\sum_{i=1}^{n} K_1\left(\frac{x-X_i}{h_1}\right)\frac{1}{h_2}K_2\left(\frac{y-Y_i}{h_2}\right)}{\sum_{i=1}^{n} K_1\left(\frac{x-X_i}{h_1}\right)}$$
with kernels K_1, K_2 and bandwidths h_1, h_2. We assume that a profile k(·) for kernel K_2 exists such that K_2(·) = c_k k((·)²) holds. One calculates
$$\frac{\partial \hat f(y|x)}{\partial y} = \frac{2c_k}{h_2^3}\sum_{i=1}^{n} K_1\left(\frac{x-X_i}{h_1}\right) k'\left(\left(\frac{y-Y_i}{h_2}\right)^{2}\right)(y-Y_i) = 0$$
and obtains
$$y = \frac{\sum_{i=1}^{n} K_1\left(\frac{x-X_i}{h_1}\right) G\left(\frac{y-Y_i}{h_2}\right) Y_i}{\sum_{i=1}^{n} K_1\left(\frac{x-X_i}{h_1}\right) G\left(\frac{y-Y_i}{h_2}\right)} \qquad (3)$$
with G(·) = −k'((·)²).

• Gives a conditional mean shift procedure.
• The right side of (3) is just the "Sigma-Filter" used in digital image smoothing.
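A minimal fixed-point iteration of (3), assuming Gaussian kernels: for a Gaussian K_2 the profile is k(u) = exp(−u/2), so G(·) = −k'((·)²) is proportional to exp(−(·)²/2) and the constant cancels in (3). Function and argument names are illustrative.

import numpy as np

def conditional_mode(X, Y, x, y_start, h1, h2, n_iter=200, tol=1e-8):
    """Conditional mean shift: iterate equation (3),
        y <- sum_i K1((x - X_i)/h1) G((y - Y_i)/h2) Y_i / sum_i K1(...) G(...),
    with Gaussian K1 and G, starting from y_start."""
    Kx = np.exp(-0.5 * ((x - X) / h1) ** 2)           # K1 weights, independent of y
    y = float(y_start)
    for _ in range(n_iter):
        G = np.exp(-0.5 * ((y - Y) / h2) ** 2)        # G((y - Y_i)/h2), up to a constant
        y_new = np.sum(Kx * G * Y) / np.sum(Kx * G)
        if abs(y_new - y) < tol:
            break
        y = y_new
    return y

Starting the iteration from several y_start values (e.g. the observed Y_i of points with X_i close to x) recovers the different local maxima of fˆ(y|x).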
Antiregression and Classification

If one plots the antimodes, which are obtained as a by-product of the computation of the relevances, one obtains an antiprediction or antiregression curve.
This curve serves as a separator between the branches, and thus as a tool to classify observations to the uncongested or congested regime.

[Figure: speed-flow data with the conditional antimodes (top, bottom) separating UNCONGESTED TRAFFIC from CONGESTION.]
Literature on Principal Curves

Hastie, T. & Stuetzle, W. (1989): Principal Curves. JASA 84, 502–516.
Tibshirani, R. (1992): Principal Curves Revisited. Statistics and Computing 2, 183–190.
Kégl, B., Krzyzak, A., Linder, T. & Zeger, K. (2000): Learning and Design of Principal Curves. IEEE Transactions Patt. Anal. Mach. Intell. 24, 59–74.
Delicado, P. (2001): Another Look at Principal Curves and Surfaces. Journal of Multivariate Analysis 77, 84–116.
Einbeck, J., Tutz, G. & Evers, L. (2005): Local Principal Curves. Statistics and Computing 15, 301–313.
Einbeck, J., Tutz, G. & Evers, L. (2005): Exploring Multivariate Data Structures with Local Principal Curves. In: Weihs, C. and Gaul, W. (Eds.): Classification - The Ubiquitous Challenge. Springer, Heidelberg.
Literature on Multi-valued Regression

Einbeck, J. & Tutz, G. (2004): Modelling Beyond Regression Functions: An Application of Multimodal Regression to Speed-Flow Data. SFB 386 Discussion Paper No. 395, LMU Munich.
Download