Projection Pursuit Regression/Classification

• Motivation: Consider the following example:
  Xi = (Xi1, Xi2) ∼ Unif([−1, 1] × [−1, 1]);  Yi = Xi1 Xi2,  i = 1, 2, . . . , n.

[Figure: the simulated toy data — scatterplots of y against x1 and against x2.]

Using CART

[Figure: CART fit to the toy data — deviance against tree size, the fitted tree (splits on x1 and x2), and the induced partition of the predictor space with the mean response (times 0.1) in each cell.]

• Partitioning is difficult because interaction effects are not addressed by CART: to detect the interaction, we have to look along directions in predictor space rather than only at the marginals.

• Marginal variables alone are not adequate to split the data informatively.

Toy Example Continued

[Figure: scatterplots of y against (x1 + x2) and against (x1 − x2).]

• Plotting Y against X1 + X2 and against X1 − X2 reveals a dependence.

• Building an additive model seems better than splitting: while Y cannot be written as a weighted sum of X1 and X2, it can be written as a sum of functions of linear combinations of the predictors: Y = (1/4)(X1 + X2)² − (1/4)(X1 − X2)².

Projection Pursuit Classification/Regression

• Fitting the response as a linear combination of smooth functions of projections of our predictors is therefore our goal.

• Our objective will be to find a direction (unit vector) a and a linear function l(z) = b0 + b1 z, i.e. s(x) = b0 + b1 a′x, that minimizes RSS = Σ_{i=1}^n (yi − b0 − b1 a′xi)² (regression) or the misclassification rate (as before).

• Note: A linear function s(x) is a ridge function. It varies only along the direction a and is constant along all orthogonal directions; i.e., if a′x = a′x∗, then s(x) = s(x∗).

Projection Pursuit Classification/Regression

• Generalizations: Replace linear functions by more general ridge functions, i.e., allow for more than one term to get
  s(x) = Σ_{i=1}^M si(ai′x).
  (The toy example above is exactly such a two-term model; see the sketch below.)

• Note that if ai = ei, i.e., the ai are restricted to be the coordinate directions, then s(x) = Σ_{i=1}^M si(xi) is the same as an additive model.

• We now address the issue of fitting the model for known M.
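To make the ridge-function representation concrete, here is a minimal R sketch (the object names n, toy, a1, a2 are ours, not from the slides) that simulates the toy data and verifies numerically that Y = X1X2 is exactly a two-term ridge model, with unit directions proportional to (1, 1) and (1, −1) and quadratic ridge functions:

## Minimal sketch: the toy response x1*x2 written as a sum of two ridge
## functions, s1(a1'x) = (a1'x)^2/2 and s2(a2'x) = -(a2'x)^2/2, with unit
## directions a1 = (1, 1)/sqrt(2) and a2 = (1, -1)/sqrt(2).
set.seed(1)
n   <- 100
toy <- data.frame(x1 = runif(n, -1, 1), x2 = runif(n, -1, 1))
toy$y <- toy$x1 * toy$x2

a1 <- c(1,  1) / sqrt(2)                 # first projection direction
a2 <- c(1, -1) / sqrt(2)                 # second projection direction
z1 <- as.matrix(toy[, c("x1", "x2")]) %*% a1
z2 <- as.matrix(toy[, c("x1", "x2")]) %*% a2

## (1/4)(x1 + x2)^2 - (1/4)(x1 - x2)^2 = x1 * x2, i.e. z1^2/2 - z2^2/2 = y
ridge.sum <- 0.5 * z1^2 - 0.5 * z2^2
max(abs(ridge.sum - toy$y))              # numerically zero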
Fitting Projection Pursuit Classification/Regression Models

• The general model is set as: Y ∼ Σ_{m=1}^M sm(am′x).

• As written, sm(·) is not uniquely determined, because we can add and subtract constants. We get around this by centering the responses so that Ȳ = 0.

• Additionally, we require the sm's to have mean zero, i.e., we specify Σ_{i=1}^n sm(am′xi) = 0 for all m.

• So, we need to find a1, a2, . . . , aM and s1(·), s2(·), . . . , sM(·) minimizing the RSS or the misclassification error rate.

Fitting PPR Models

• Suppose we know a1, a2, . . . , aM and s2(·), s3(·), . . . , sM(·).

• Then the only unknown is s1(·).

• Draw a scatterplot of the residuals ri = Yi − Σ_{m=2}^M sm(am′xi) against a1′xi, and estimate s1(·) with a scatterplot smoother (supersmoother, moving average, etc.) applied to the pairs (a1′xi, ri).

• We thus get an iterative procedure using backfitting.

Fitting PPR Models

• Suppose we know a1, a2, . . . , aM and that at any time we have current guesses for s1(·), s2(·), . . . , sM(·).

• Initially, all si(·) ≡ 0.
  – Cycle over m = 1, 2, . . . , M in the following manner:
    ∗ Get a new sm(·) in the same manner as described for the first direction; keep going until there is no improvement.
  – Thus, we get an algorithm for finding s1(·), s2(·), . . . , sM(·) given a1, a2, . . . , aM.

• The question now arises: how do we find a1, a2, . . . , aM?

• Note that for any choice of a1, a2, . . . , aM, we can find the optimal s1(·), s2(·), . . . , sM(·) and compute the corresponding RSS.

• We optimize over a1, a2, . . . , aM: a numerical optimization with M(p − 1) unknowns, since each am is a unit vector in R^p.

• Alternating this optimization with the backfitting of the sm(·) gives the iterative backfitting algorithm.

Projection Pursuit Examples

[Figure: estimated ridge terms (term 1, term 2) and pp.toy$fitted.values from projection pursuit fits.]

• Fit for the toy data and the iris data (a sketch of the toy-data fit in R follows).
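The object name pp.toy in the figure suggests the fit was obtained with R's ppr() from the stats package. Below is a sketch of a call that could produce such a fit for the toy data; the simulation setup, the seed, and the choice nterms = 2 are our assumptions rather than taken from the slides.

## Sketch: projection pursuit regression on the simulated toy data using
## stats::ppr(), with two ridge terms and the supersmoother as the
## scatterplot smoother (the default sm.method).
set.seed(1)
n   <- 100
toy <- data.frame(x1 = runif(n, -1, 1), x2 = runif(n, -1, 1))
toy$y <- toy$x1 * toy$x2

pp.toy <- ppr(y ~ x1 + x2, data = toy, nterms = 2, sm.method = "supsmu")

pp.toy$alpha                    # estimated projection directions a_1, a_2
summary(pp.toy)                 # directions, coefficients, goodness of fit

par(mfrow = c(1, 3))
plot(pp.toy)                    # the two estimated ridge functions
plot(toy$y, pp.toy$fitted.values,
     xlab = "y", ylab = "pp.toy$fitted.values")

For the toy data the two estimated directions should come out close to (1, 1)/√2 and (1, −1)/√2, matching the exact two-term ridge decomposition of Y = X1X2 shown earlier.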