Approaches to Function-free Local Smoothing
• Local principal curves
• Multi-valued nonparametric regression
with Application to Speed-Flow Diagrams

Jochen Einbeck, joint work with Gerhard Tutz, University of Munich
10th of February 2005


What is "Smoothing"?

Typical definition: "Given observations from an explanatory variable X and a response variable Y, construct a function, a 'smoother', which at point x estimates the average value of Y given that X = x." (Holmström et al., 2003)
Also called nonparametric regression.


Nonparametric regression

Given n data points (X_i, Y_i), i = 1, ..., n, one seeks a smooth function m: R → R relating X and Y in a proper way, which can generally be expressed in the form

    m(x) = \Phi(Y \mid X = x).

Possible choices of the operator \Phi(\cdot) are obtained as solutions of the minimization problem

    m_l(x) = \arg\min_a E\bigl(l(Y - a) \mid X = x\bigr),

yielding

    \Phi(\cdot)        l(z)
    E(\cdot)           z^2
    Median(\cdot)      |z|
    Mode(\cdot)        -\delta(z)


Limits of ordinary nonparametric regression

Speed-flow diagram: X: traffic flow in cars/hour; Y: speed in miles/hour, recorded on a Californian freeway.
[Figure: scatterplot of speed (Y, 0-70 miles/hour) against flow (X, 0-2000 cars/hour)]


Attempt: Ordinary nonparametric regression

    Y = m(X) + \epsilon, with a function m: R → R.

[Figure: speed-flow data with local mean and local median fits]
→ Obviously some information is discarded!


1st alternative approach: Principal curves

Intuitive definition: Principal curves are smooth functions passing through the middle of the data cloud.

    (X, Y)^T = f(t) + \epsilon,  with f: R → R^2.


Local principal curves (LPC)

Idea: Calculate alternately a local center of mass and a first local principal component.
[Figure: construction of an LPC; 0: starting point, 1, 2, 3: enumeration of steps; marked points: points of the LPC]


Algorithm: Local principal curves (1)

Assume a data cloud X = (X_1, ..., X_n)^T, where X_i = (X_{i1}, ..., X_{id})^T.

1. Choose a starting point x_0. Set x = x_0.
2. At x, calculate the local center of mass
       \mu^x = \sum_{i=1}^n w_i X_i,   where   w_i = K_H(X_i - x) / \sum_{j=1}^n K_H(X_j - x).
3. Perform a PCA locally at x, i.e. compute the first eigenvector \gamma_1^x of the local covariance matrix \Sigma^x = (\sigma_{jk}^x)_{1 \le j,k \le d}, where
       \sigma_{jk}^x = \sum_{i=1}^n w_i (X_{ij} - \mu_j^x)(X_{ik} - \mu_k^x).
4. Step from \mu^x in the direction of the first principal component to x := \mu^x + t_0 \gamma_1^x.
5. Repeat steps 2. to 4. until the \mu^x remain constant.


Mean shift

A central tool of the LPC algorithm is the mean shift,

    M(x) = \mu(x) - x,

i.e. the movement from the current point to the local center of mass. For a kernel density estimate \hat f_K(x), Comaniciu & Meer (2002) show that

(A) M(x) \sim \nabla \hat f_K(x), and
(B) the sequence m^{(0)} = x, m^{(k+1)} = \mu(m^{(k)}) converges to a neighboring critical point of \hat f_K(\cdot).

[Figure: kernel density estimate of the speed-flow data with the LPC marked by "+"]

Conclusion: The LPC algorithm approximates the crest line of a kernel density estimate.
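As a concrete illustration (not part of the original slides), the following Python sketch implements steps 1-5 above for a d-dimensional data cloud. It is a minimal sketch under ad-hoc assumptions: a spherical Gaussian kernel with a scalar bandwidth h stands in for the general bandwidth matrix H, the step length t0 and the stopping tolerance are arbitrary choices, and the curve is followed in one direction only from the starting point.

```python
import numpy as np

def local_principal_curve(X, x0, h=0.3, t0=0.2, max_steps=200, tol=1e-6):
    """Minimal sketch of the LPC algorithm (steps 1-5 of the slides).

    X  : (n, d) data cloud
    x0 : starting point, shape (d,)
    h  : scalar Gaussian bandwidth (stands in for the bandwidth matrix H)
    t0 : step length along the first local principal component
    """
    x = np.asarray(x0, dtype=float)
    curve, prev_mu, direction = [], None, None

    for _ in range(max_steps):
        # 2. Local center of mass mu^x with kernel weights w_i
        w = np.exp(-0.5 * np.sum((X - x) ** 2, axis=1) / h ** 2)
        w /= w.sum()
        mu = w @ X

        # 3. Local PCA: weighted covariance Sigma^x and its first eigenvector gamma_1^x
        Xc = X - mu
        Sigma = (w[:, None] * Xc).T @ Xc
        _, eigvec = np.linalg.eigh(Sigma)
        gamma1 = eigvec[:, -1]              # eigenvector of the largest eigenvalue

        # keep a consistent orientation so consecutive steps do not flip back
        if direction is not None and gamma1 @ direction < 0:
            gamma1 = -gamma1
        direction = gamma1

        curve.append(mu)

        # 5. Stop once the centers of mass mu^x remain (essentially) constant
        if prev_mu is not None and np.linalg.norm(mu - prev_mu) < tol:
            break
        prev_mu = mu

        # 4. Step from mu^x along the first local principal component
        x = mu + t0 * gamma1

    return np.array(curve)

# Hypothetical usage on simulated two-dimensional data
rng = np.random.default_rng(0)
t = rng.uniform(0, 3, 500)
data = np.column_stack([t, np.sin(t)]) + rng.normal(scale=0.1, size=(500, 2))
lpc = local_principal_curve(data, x0=data[rng.integers(500)])
```

A complete implementation would typically also run the same loop in the opposite direction from x0 and join the two half-curves; only the forward direction is sketched here.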
LPCs of higher order

Consider the second local eigenvalue \lambda_2^x, i.e. the second largest eigenvalue of \Sigma^x: if this value is large at a certain point of the original LPC, a new LPC is launched in the direction of the second local eigenvector \gamma_2^x.

Example: simulated "E" and flow diagram of the quantity \lambda_2^x / \lambda_1^x.
[Figure: simulated "E" data with higher-order LPCs, and the ratio \lambda_2^x / \lambda_1^x along the curve]


LPCs through simulated letters (C, E, K)

[Figure: LPCs fitted to simulated letters "C", "E" and "K"]


A further example: Floodplains in Pennsylvania

[Figure: LPC fitted to floodplain data from Pennsylvania]


A further example: Coastal resorts in Europe

[Figure: LPC fitted to the locations of coastal resorts in Europe]


Concluding remarks

• LPCs work well in a variety of data situations, and show (in particular for complicated data structures) a better performance than the "global" algorithms from Hastie & Stuetzle (1989) and Kégl et al. (2000).
• However, the theoretical grounding is somewhat shaky.
• Bandwidth selection works by means of a coverage measure.
• (Local) principal curves are not suitable for prediction of Y from a given X = x.


2nd alternative approach: Multi-valued nonparametric regression

    Y = r(X) + \epsilon, with a multifunction r: R → R.

The data cloud is assumed to consist of several (smooth) branches. For every X = x, more than one predicted value is possible.
[Figure: speed-flow data, regarded as consisting of two branches]


Basic idea: Consider the conditional densities.

[Figure: speed-flow data with a vertical slice at flow = 1400]

For instance, the conditional density at flow = 1400:
[Figure: estimated conditional density f(speed | 1400); the two humps carry 92.3% and 7.7% of the probability mass]

• For estimation of r(x), compute the modes of the estimated conditional densities \hat f(y|x).
• The area between a mode and the neighboring "antimode" serves as estimated probability that, given x, a value on the corresponding branch is attained.


Relevance assessment

[Figure: P(car in uncongested traffic) and P(car in congested traffic), plotted against flow]

The relevance of the branches varies smoothly for varying x, as long as the branches can be separated.


Estimation of conditional modes

We are interested in all local maxima of the estimated conditional densities

    \hat f(y|x) = \frac{\hat f(x,y)}{\hat f(x)}
                = \frac{\sum_{i=1}^n K_1\left(\frac{x - X_i}{h_1}\right) K_2\left(\frac{y - Y_i}{h_2}\right)}
                       {h_2 \sum_{i=1}^n K_1\left(\frac{x - X_i}{h_1}\right)}

with kernels K_1, K_2 and bandwidths h_1, h_2. We assume that a profile k(\cdot) for kernel K_2 exists such that K_2(\cdot) = c_k\, k((\cdot)^2) holds. One calculates

    \frac{\partial \hat f(y|x)}{\partial y}
        = \frac{2 c_k}{h_2^3}\,
          \frac{\sum_{i=1}^n K_1\left(\frac{x - X_i}{h_1}\right) k'\!\left(\left(\frac{y - Y_i}{h_2}\right)^2\right)(y - Y_i)}
               {\sum_{i=1}^n K_1\left(\frac{x - X_i}{h_1}\right)} = 0

and obtains, with G(\cdot) = -k'((\cdot)^2),

    y = \frac{\sum_{i=1}^n K_1\left(\frac{x - X_i}{h_1}\right) G\left(\frac{y - Y_i}{h_2}\right) Y_i}
             {\sum_{i=1}^n K_1\left(\frac{x - X_i}{h_1}\right) G\left(\frac{y - Y_i}{h_2}\right)}.        (2)

Remarks:
• Applying (2) recursively, this can be seen as a conditional mean shift! Unlike for LPCs, this iteration has to be run until convergence.
• The right side of (2) corresponds to the "sigma filter" used in digital image smoothing. Thus, the sigma filter is a one-step approximation to the conditional mode.
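A minimal Python sketch of the conditional mean shift (2), not part of the original slides, is given below. It assumes Gaussian kernels for both K_1 and K_2; for the Gaussian profile k(u) ∝ exp(-u/2) one has G(u) = -k'(u²) ∝ exp(-u²/2), so G is again proportional to a Gaussian kernel. Starting the iteration from every observed Y_i and merging the limits yields the conditional modes at x; the example data, bandwidths and merging tolerance are ad-hoc choices.

```python
import numpy as np

def gauss(u):
    """Gaussian kernel up to a constant; also serves as G for the Gaussian profile."""
    return np.exp(-0.5 * u ** 2)

def conditional_modes(X, Y, x, h1, h2, max_iter=200, tol=1e-8):
    """Conditional modes of f_hat(y|x) via the conditional mean shift (2)."""
    wx = gauss((x - X) / h1)            # K_1 weights in x; fixed during the iteration
    limits = []
    for y in Y:                         # start the iteration from every observed Y_i
        for _ in range(max_iter):
            wy = gauss((y - Y) / h2)    # G((y - Y_i)/h_2), up to a constant
            y_new = np.sum(wx * wy * Y) / np.sum(wx * wy)
            if abs(y_new - y) < tol:
                break
            y = y_new
        limits.append(y)
    # merge limits that coincide up to a small tolerance -> distinct conditional modes
    limits = np.sort(limits)
    keep = np.r_[True, np.diff(limits) > 1e-3 * (Y.max() - Y.min())]
    return limits[keep]

# Hypothetical usage on simulated data with two branches (mimicking the speed-flow setting)
rng = np.random.default_rng(1)
n = 400
X = rng.uniform(0, 2000, n)
uncongested = rng.random(n) < 0.8
Y = np.where(uncongested, 60 - X / 200, 20 + X / 400) + rng.normal(scale=3, size=n)
print(conditional_modes(X, Y, x=1400.0, h1=150.0, h2=4.0))
```

The antimodes needed for the relevance assessment and the antiregression curve are the local minima of \hat f(y|x); they could, for instance, be located by evaluating \hat f(y|x) on a grid between adjacent modes, which is not shown here.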
Outlook: Antiregression and classification

If one plots the antimodes, which are obtained as a by-product of the computation of the relevances, one obtains an antiprediction or antiregression curve. This curve serves as a separator between the branches, and thus as a tool to classify observations into the uncongested or congested regime.
[Figure: speed-flow data with the conditional antimode curve separating UNCONGESTED TRAFFIC (top) from CONGESTION (bottom)]


Summary:

• The conditional mode is more useful for the analysis of multimodal data than the conditional mean or median.
• The mode is robust to outliers and is edge-preserving.
• Maxima of the conditional density can be calculated fast and easily via a conditional mean shift procedure.


To do:

• Bandwidth selection (for conditional densities already available: Fan et al., 1996)
• Bias, variance? Asymptotics? (for conditional modes: Berlinet et al., 1998)
• Relation to mixture models?


Literature

• Principal curves
  Hastie, T. & Stuetzle, W. (1989): Principal Curves. JASA, 84, 502–516.
  Kégl, B., Krzyzak, A., Linder, T. & Zeger, K. (2000): Learning and design of principal curves. IEEE Transactions Patt. Anal. Mach. Intell., 24, 59–74.
  Einbeck, J., Tutz, G. & Evers, L. (2003): Local Principal Curves. SFB 386 Discussion Paper No. 320, accepted in Statistics and Computing.
  Einbeck, J., Tutz, G. & Evers, L. (2005): Exploring Multivariate Data Structures with Local Principal Curves. To appear in Proceedings GFKL 2004.

• Multi-valued nonparametric regression
  Einbeck, J. & Tutz, G. (2004): Modelling beyond regression functions: An application of multimodal regression to speed-flow data. SFB 386 Discussion Paper No. 395.

• "Related Literature"
  Berlinet, A., Gannoun, A. & Matzner-Løber, E. (1998): Normalité asymptotique d'estimateurs convergents du mode conditionnel. Canadian Journal of Statistics, 26, 365–380.
  Fan, J., Yao, Q. & Tong, H. (1996): Estimation of conditional densities and sensitivity measures in nonlinear dynamical systems. Biometrika, 83, 189–206.