Local Principal Curves

Jochen Einbeck
Department of Mathematics, NUI Galway
jochen.einbeck@nuigalway.ie
Dublin - 22nd November 2005
Joint work with Gerhard Tutz, University of Munich, and Ludger Evers, University of Oxford.


Descriptive Definition

Principal curves are smooth curves passing through the 'middle' of a multidimensional data cloud $X = (X_1, \ldots, X_n)$, where $X_i \in \mathbb{R}^d$.

Example: Speed-flow diagram.
X: traffic flow in cars/hour,
Y: speed in miles/hour,
recorded on a Californian "freeway".

[Figure: scatterplot of speed against flow.]


Types of principal curves

There exist a variety of definitions of principal curves, which essentially vary in what is understood by the "middle" of a data cloud. The algorithms associated with these definitions can be divided into two major groups:

• Global ('top-down') algorithms start with an initial line and try to bend this line, or concatenate other lines to the initial line, until the resulting curve fits well through the data cloud.
• Local ('bottom-up') algorithms estimate the principal curve locally, moving step by step through the data cloud.


Principal curve definitions associated with 'top-down' approaches

Hastie & Stuetzle (HS, 1989) define a point on the principal curve as the average of all points which project there ('self-consistency'). Self-consistent curves $m : I \to \mathbb{R}^d$ are obtained as critical points of the distance function

\[ \Delta(m) = E\left( \inf_t \| X - m(t) \|^2 \right) \qquad (1) \]

and generalize linear principal components in a natural way.

Kégl, Krzyzak, Linder & Zeger (KKLZ, 2000) define a principal curve as the curve minimizing the average squared distance (1) over all curves with bounded length $L$.

[Figure: picture from Hastie & Stuetzle, 1989.]

Tibshirani (1992) defines principal curves such that for data generated as

\[ X = m(t) + \epsilon \quad \text{with } E(\epsilon) = 0, \]

the curve $m$ is also the principal curve of the data cloud $X$.

Properties of 'top-down' algorithms:

• Starting from the first principal component line of the whole data set, the principal curve is estimated iteratively with EM-like algorithms.
• Dependence on an initial line leads to a lack of flexibility, as an unsuitable initial assignment of projection indices often cannot be corrected in the further run of the algorithm.
• Estimation of branched or disconnected data clouds is not (directly) possible.


Alternative: 'Bottom-up' algorithms

Delicado (2001) defines principal curves as a sequence of fixed points of the function $\mu^*(x) = E(X \mid X \in H)$, where $H$ is the hyperplane through $x$ minimizing locally the variance of the data points projected on it. He estimates 'PCOPs' using a fixed-point algorithm moving smoothly through the data cloud.

[Figure: picture from Delicado, 2001.]

• Works fine for most (not too complex) data sets.
• Mathematically elegant.
• However, quite complicated and computationally demanding.
• Requires a cluster analysis at every point of the principal curve.


A simple alternative 'bottom-up' approach: Local principal curves (LPC)

Idea: Calculate alternately a local center of mass and a first local principal component.

[Figure: first steps of an LPC through a point cloud. 0: starting point; m: points of the LPC; 1, 2, 3: enumeration of steps.]


Algorithm for LPCs

Given: A data cloud $X = (X_1, \ldots, X_n)$, where $X_i = (X_{i1}, \ldots, X_{id})$.

1. Choose a starting point $x_0$. Set $x = x_0$.
2. At $x$, calculate the local center of mass $\mu^x = \sum_{i=1}^n w_i X_i$, where $w_i = K_H(X_i - x) / \sum_{i=1}^n K_H(X_i - x)$.
3. Compute the first local eigenvector $\gamma^x$ of $\Sigma^x = (\sigma^x_{jk})_{1 \le j,k \le d}$, where $\sigma^x_{jk} = \sum_{i=1}^n w_i (X_{ij} - \mu^x_j)(X_{ik} - \mu^x_k)$.
4. Step from $\mu^x$ to $x := \mu^x + t_0 \gamma^x$.
5. Repeat steps 2 to 4 until the $\mu^x$ remain constant. Then set $x = x_0$, set $\gamma^x := -\gamma^x$ and continue with step 4.

The sequence of the local centers of mass $\mu^x$ makes up the local principal curve (LPC).


Background

Kernel density estimate:

\[ \hat{f}_K(x) = \frac{1}{n h^d} \sum_{i=1}^n K\left( \frac{X_i - x}{h} \right) \qquad (2) \]

For instance, speed-flow data:

[Figure: kernel density estimate of the speed-flow data, shown as a surface over the (X, Y) plane.]

A local principal curve approximates the density ridge.

Comaniciu & Meer (2002): 'Mean Shift'

\[ \mu^x - x \sim \nabla \hat{f}_K(x) \]


Technical Details

• "Signum flipping": Check in every cycle whether $\gamma^x_{(i-1)} \circ \gamma^x_{(i)} > 0$. Otherwise, set $\gamma^x_{(i)} := -\gamma^x_{(i)}$.
• Angle penalization, to keep the principal curve from bending off at crossings.
• Use multiple initializations if the data cloud consists of several branches (e.g. using a random generator).


Simulated Examples

[Figure: spirals with small noise. Curves shown: true curve, HS, KKLZ, Delicado, LPC.]

[Figure: spirals with large noise. Curves shown: true curve, HS, KKLZ, Delicado, LPC.]


Measuring performance: Coverage

The coverage of a principal curve is the fraction of all data points found in a certain neighborhood of the principal curve. Formally, for a principal curve $m$ consisting of a set $P_m$ of points, the coverage is given by

\[ C_m(\tau) = \#\{ x \in X \mid \exists\, p \in P_m \text{ with } \| x - p \| \le \tau \} / n . \]

• The coverage can also be interpreted as the empirical distribution function of the residuals.
• The area between $C_m(\tau)$ and the constant 1 corresponds to the mean length of the observed residuals.


Coverage for spiral data

[Figure: coverage $C$ against $\tau$ in four panels: small spiral with small noise, big spiral with small noise, small spiral with large noise, big spiral with large noise. Curves shown: PCA, HS, KKLZ, Delicado, LPC.]

Residual mean length relative to principal components ($A_C$):

[Table: $A_C$ for HS, KKLZ, Delicado and LPC on the small and the big spiral, each with small and large noise. Reported values (row and column assignment not recoverable): 0.03, 0.05, 0.05, 0.24, 0.85, 0.20, 0.77, 0.08, 0.87, 0.50, 0.92, 0.29, 0.92, 0.65, 0.92, 0.72.]

• The closer to 0, the better the performance.
• The quantity $R_C = 1 - A_C$ can be interpreted in analogy to the $R^2$ used in regression analysis.


Bandwidth selection with self-coverage

Idea: A bandwidth suitable for the computation of a principal curve $m$ should also be able to cover the data cloud adequately. This motivates the definition of the self-coverage,

\[ S(\tau) = C_{m(\tau)}(\tau) = \frac{\#\{ x \in X \mid \exists\, p \in P_{m(\tau)} \text{ with } \| x - p \| \le \tau \}}{n}, \]

where $P_{m(\tau)}$ is the set of points belonging to a principal curve $m(\tau)$ calculated with bandwidth $\tau$. Then

h = first local maximum of $S(\tau)$

is a suitable bandwidth.


Self-coverage for spiral data

[Figure: self-coverage $S$ against $\tau$ in four panels, with selected bandwidths h = 0.05, h = 0.06, h = 0.12 and h = 0.15.]


Real data example: Floodplains in Pennsylvania

[Figure: floodplain data with fitted curves.]

LPC with multiple (50) initializations.


Further example: Coastal Resorts in Europe

[Figure: coastal resort locations with fitted LPC.]


3D example: Phillips curves

Dependence between inflation (price index) and unemployment rate over time.

Usually just seen as a two-dimensional problem (infl/rate):

[Figure: picture from Prof. Eisen, University of Frankfurt.]

Price index and unemployment in the USA, 1995-2005, with LPC:

[Figure: three-dimensional plot with axes time, infl and rate.]


Higher-order LPCs

Consider the second local eigenvalue $\lambda^x_2$, i.e. the second largest eigenvalue of $\Sigma^x$: If this value is large at a certain point of the original LPC, a new LPC is launched in the direction of the second local eigenvector $\gamma^x_2$. Every bifurcation raises the depth of the LPC tree.

Example

[Figure: simulated "E" and plot of the ratio $\lambda^x_2 / \lambda^x_1$.]


LPCs through simulated letters (C, E, K)

[Figure: LPCs and corresponding starting points with depth 1, 2, 3.]


Example: Scallops

[Figure in four panels. Top left: scallops. Top right: water depth. Bottom left and right: two LPCs. 1, 2: branches of depth 1, 2.]


Summary

• LPCs work well in a variety of data situations and are, particularly for noisy complex structures, more suitable than the "global" algorithms developed by Hastie & Stuetzle (1989) and Kégl et al. (2000).
• LPCs can be seen as a simplified version of Delicado's 'PCOPs', but seem to work better than Delicado's algorithm for complex or branched data.
• Bandwidth selection works by means of a coverage measure.
• Drawbacks of LPCs: No statistical model and hence no 'true' principal curve; the estimated principal curve depends strongly on the selected starting point(s).
• General drawback of principal curves: Principal curves are not suitable for prediction of Y for given X = x.


Outlook: Multi-valued regression

Goal: Estimate a multifunction $r : \mathbb{R} \to \mathbb{R}$ rather than a regression function.

Idea: Consider the conditional densities, e.g. for speed-flow data:

[Figure: speed against flow.]

For instance, the conditional density at flow = 1400:

[Figure: estimated conditional density f(speed | 1400) against speed at flow = 1400, with areas of 7.7% and 92.3% on the two sides of the antimode.]

• For estimation of $r(x)$, compute the modes of the estimated conditional densities $\hat{f}(y|x)$.
• The area between a mode and the neighboring 'antimode' serves as the estimated probability that, given $x$, a value on the corresponding branch is attained.


Multi-valued regression curve

[Figure: speed against flow with the multi-valued regression curve.]

Relevance assessment

[Figure: probability P against flow, with curves P(Car in uncongested traffic) and P(Car in congested traffic).]


Estimation of conditional modes

We are interested in all local maxima of the estimated conditional densities

\[ \hat{f}(y|x) = \frac{\hat{f}(x,y)}{\hat{f}(x)} = \frac{\sum_{i=1}^n K_1\left(\frac{x - X_i}{h_1}\right) K_2\left(\frac{y - Y_i}{h_2}\right)}{h_2 \sum_{i=1}^n K_1\left(\frac{x - X_i}{h_1}\right)} \]

with kernels $K_1$, $K_2$ and bandwidths $h_1$, $h_2$. We assume that a profile $k(\cdot)$ for kernel $K_2$ exists such that $K_2(\cdot) = c_k\, k((\cdot)^2)$ holds. One calculates

\[ \frac{\partial \hat{f}(y|x)}{\partial y} = \frac{2 c_k}{h_2^3} \, \frac{\sum_{i=1}^n K_1\left(\frac{x - X_i}{h_1}\right) k'\left(\left(\frac{y - Y_i}{h_2}\right)^2\right) (y - Y_i)}{\sum_{i=1}^n K_1\left(\frac{x - X_i}{h_1}\right)} \overset{!}{=} 0 \]

and obtains

\[ y = \frac{\sum_{i=1}^n K_1\left(\frac{x - X_i}{h_1}\right) G\left(\frac{y - Y_i}{h_2}\right) Y_i}{\sum_{i=1}^n K_1\left(\frac{x - X_i}{h_1}\right) G\left(\frac{y - Y_i}{h_2}\right)} \qquad (3) \]

with $G(\cdot) = -k'((\cdot)^2)$.

• Gives a conditional mean shift procedure.
• The right side of (3) is just the "Sigma-Filter" used in digital image smoothing.


Antiregression and Classification

If one plots the antimodes, which are obtained as a by-product of the computation of the relevances, one obtains an antiprediction or antiregression curve.

[Figure: speed against flow with regions UNCONGESTED TRAFFIC and CONGESTION, and curves Cond. antimode (top) and Cond. antimode (bottom).]

This curve serves as a separator between the branches, and thus as a tool to classify observations to the uncongested or congested regime.


Literature on Multi-valued Regression

Einbeck, J. & Tutz, G. (2004): Modelling Beyond Regression Functions: An Application of Multimodal Regression to Speed-Flow Data. SFB386 Discussion Paper No. 395, LMU Munich.


Literature on Principal Curves

Hastie, T. & Stuetzle, L. (1989): Principal Curves. JASA 84, 502–516.
Tibshirani, R. (1992): Principal Curves Revisited. Statistics and Computing 2, 183–190.
Kégl, B., Krzyzak, A., Linder, T. & Zeger, K. (2000): Learning and Design of Principal Curves. IEEE Transactions Patt. Anal. Mach. Intell. 24, 59–74.
Delicado, P. (2001): Another Look at Principal Curves and Surfaces. Journal of Multivariate Analysis 77, 84–116.
...
Einbeck, J., Tutz, G. & Evers, L. (2005): Local Principal Curves. Statistics and Computing 15, 301–313.
Einbeck, J., Tutz, G. & Evers, L. (2005): Exploring Multivariate Data Structures with Local Principal Curves. In: Weihs, C. and Gaul, W. (Eds.): Classification - The Ubiquitous Challenge. Springer, Heidelberg.
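

Sketch: LPC algorithm in Python

The five steps of the LPC algorithm, together with the "signum flipping" rule from the technical details, can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes a Gaussian kernel with $H = h^2 I$, an absolute tolerance `tol` as the reading of "until the $\mu^x$ remain constant", and a cap `max_steps`; the function names and the simulated half-circle data are hypothetical. Angle penalization and multiple initializations are left out.

```python
import numpy as np


def local_mean_and_direction(X, x, h):
    """Steps 2 and 3: local center of mass and first local eigenvector at x."""
    d2 = np.sum((X - x) ** 2, axis=1)
    # Gaussian kernel weights (assumption); subtracting d2.min() only guards
    # against underflow and cancels in the normalisation.
    w = np.exp(-0.5 * (d2 - d2.min()) / h ** 2)
    w /= w.sum()
    mu = w @ X                                # local center of mass
    Xc = X - mu
    sigma = (Xc * w[:, None]).T @ Xc          # local covariance matrix
    _, vecs = np.linalg.eigh(sigma)           # eigenvalues in ascending order
    return mu, vecs[:, -1]                    # eigenvector of the largest one


def lpc(X, x0, h, t0, max_steps=200, tol=1e-4):
    """Local principal curve through X, started at x0 in both directions."""
    x0 = np.asarray(x0, dtype=float)
    branches = []
    for sign in (1.0, -1.0):                  # step 5: second pass with -gamma
        x, prev_mu, prev_gamma, points = x0, None, None, []
        for _ in range(max_steps):
            mu, gamma = local_mean_and_direction(X, x, h)
            if prev_gamma is None:
                gamma = sign * gamma
            elif gamma @ prev_gamma < 0:      # signum flipping
                gamma = -gamma
            if prev_mu is not None and np.linalg.norm(mu - prev_mu) < tol:
                break                         # centers of mass stay constant
            points.append(mu)
            x = mu + t0 * gamma               # step 4
            prev_mu, prev_gamma = mu, gamma
        branches.append(np.array(points))
    forward, backward = branches
    return np.vstack([backward[::-1], forward[1:]])


# Simulated half circle with small noise (illustrative data only).
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, np.pi, 400)
X = np.column_stack([np.cos(theta), np.sin(theta)])
X += rng.normal(scale=0.02, size=X.shape)
curve = lpc(X, x0=[0.0, 1.0], h=0.1, t0=0.1)
print(curve.shape)
```

Choosing `t0` equal to `h` is a convenience of this sketch, not a recommendation taken from the slides.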
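

Worked step: mean shift and the density gradient

The relation $\mu^x - x \sim \nabla \hat{f}_K(x)$ from the background slide can be made explicit in one special case. The lines below assume a Gaussian kernel $K(u) \propto \exp(-\|u\|^2/2)$ with $H = h^2 I$, which the slides do not fix; for other kernels the gradient involves a second, derived kernel.

```latex
\begin{align*}
\nabla \hat{f}_K(x)
  &= \frac{1}{n h^d} \sum_{i=1}^n K\!\left(\frac{X_i - x}{h}\right) \frac{X_i - x}{h^2}
   = \frac{\hat{f}_K(x)}{h^2} \left( \sum_{i=1}^n w_i X_i - x \right)
   = \frac{\hat{f}_K(x)}{h^2} \left( \mu^x - x \right), \\
\mu^x - x &= h^2 \, \frac{\nabla \hat{f}_K(x)}{\hat{f}_K(x)} ,
\qquad
w_i = \frac{K\!\left(\frac{X_i - x}{h}\right)}{\sum_{j=1}^n K\!\left(\frac{X_j - x}{h}\right)} .
\end{align*}
```

Under this assumption the local center of mass moves uphill on the estimated density, which is why the sequence of $\mu^x$ follows the density ridge.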
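

Sketch: coverage and residual mean length in Python

The coverage $C_m(\tau)$ and the statement that the area between $C_m(\tau)$ and 1 equals the mean residual length can be checked numerically. The sketch below is hedged in two ways: the curve is represented by a finite point set `P`, and $A_C$ is read as the area for the curve divided by the area for a discretized first principal component line, which is an interpretation of the slide title rather than a confirmed definition. All function names and the simulated circle data are hypothetical.

```python
import numpy as np


def residual_lengths(X, P):
    """Distance from each data point to the nearest curve point in P."""
    diff = X[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)


def coverage(X, P, tau):
    """C_m(tau): fraction of data points within distance tau of the curve."""
    return float(np.mean(residual_lengths(X, P) <= tau))


def area_above_coverage(X, P, n_grid=2001):
    """Area between C_m(tau) and the constant 1 (left Riemann sum)."""
    res = residual_lengths(X, P)
    taus = np.linspace(0.0, res.max(), n_grid)
    C = np.array([np.mean(res <= t) for t in taus])
    return float(np.sum((1.0 - C[:-1]) * np.diff(taus)))


def pca_line(X, n_points=200):
    """Points along the first principal component line (reference curve)."""
    center = X.mean(axis=0)
    _, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    v = vecs[:, -1]
    scores = (X - center) @ v
    grid = np.linspace(scores.min(), scores.max(), n_points)
    return center + grid[:, None] * v


# Simulated noisy circle and the true circle as candidate curve (illustrative).
rng = np.random.default_rng(1)
angle = rng.uniform(0.0, 2.0 * np.pi, 300)
X = np.column_stack([np.cos(angle), np.sin(angle)])
X += rng.normal(scale=0.05, size=X.shape)
grid = np.linspace(0.0, 2.0 * np.pi, 400)
P_circle = np.column_stack([np.cos(grid), np.sin(grid)])

area_curve = area_above_coverage(X, P_circle)
mean_residual = residual_lengths(X, P_circle).mean()
A_C = area_curve / area_above_coverage(X, pca_line(X))   # assumed reading of A_C
print(area_curve, mean_residual, A_C)
```

The two printed quantities `area_curve` and `mean_residual` agree up to the grid resolution, and `A_C` is small because the circle fits these data far better than a straight line.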
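

Sketch: conditional mean shift in Python

Equation (3) defines a fixed-point iteration for the conditional modes. The sketch below iterates it from a given starting value, assuming Gaussian kernels for both $K_1$ and $K_2$; with the profile $k(t) = \exp(-t/2)$ one gets $G(u) = -k'(u^2) = \tfrac{1}{2}\exp(-u^2/2)$, and the factor $\tfrac{1}{2}$ cancels in the ratio. The function name, the stopping rule and the simulated two-branch data are assumptions made for illustration; the speed-flow data are not used here.

```python
import numpy as np


def gauss(u):
    """Unnormalised Gaussian; constants cancel in the ratio of equation (3)."""
    return np.exp(-0.5 * u ** 2)


def conditional_mode(x, y_start, X, Y, h1, h2, max_iter=500, tol=1e-8):
    """Iterate the right side of (3) from y_start until it stops moving."""
    wx = gauss((x - X) / h1)                  # K1 weights, fixed for given x
    y = float(y_start)
    for _ in range(max_iter):
        w = wx * gauss((y - Y) / h2)          # K1 * G, with G Gaussian here
        y_new = float(np.sum(w * Y) / np.sum(w))
        if abs(y_new - y) < tol:
            return y_new
        y = y_new
    return y


# Simulated data with two branches near y = 1 and y = 3 (illustrative only).
rng = np.random.default_rng(2)
n = 1000
X = rng.uniform(0.0, 1.0, n)
upper = rng.random(n) < 0.5
Y = np.where(upper, 3.0, 1.0) + rng.normal(scale=0.1, size=n)

low_mode = conditional_mode(0.5, 0.8, X, Y, h1=0.1, h2=0.2)
high_mode = conditional_mode(0.5, 3.3, X, Y, h1=0.1, h2=0.2)
print(low_mode, high_mode)
```

Starting the iteration from several values of $y$ at each $x$ and collecting the distinct limits would give the branches of the multi-valued regression curve; the starting values must lie within a few bandwidths of the data so that the weights do not all vanish.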