Stat 602X Outline (2013)

Steve Vardeman
Iowa State University

May 27, 2013

Abstract

This outline is the 2013 revision of a Modern Multivariate Statistical Learning course outline developed in 2009 and 2011-12 summarizing ISU Stat 648 (2009), Stat 602X (2011) and Stat 690 (2012). That earlier outline was based mostly on the topics and organization of The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman, though very substantial parts of its technical development benefited from Izenman's Modern Multivariate Statistical Techniques and from Principles and Theory for Data Mining and Machine Learning by Clarke, Fokoué, and Zhang. The present revision of the earlier outline benefits from a set of thoughtful comments on that document kindly provided by Ken Ryan and Mark Culp. It intends to reflect the organization of the 2013 course as it is presented. This document will be in flux as the course unfolds, potentially deviating from the earlier organization and reflecting corrections and intended improvements in exposition. As in the earlier outline, the notation here is meant to accord to the extent possible with that commonly used in ISU Stat 511, 542, and 543, and the background assumed here is the content of those courses and Stat 579.

Contents

1 Introduction
1.1 Some Very Simple Probability Theory and Decision Theory and Prediction Rules
1.2 Two Conceptually Simple Ways One Might Approximate a Conditional Mean
1.2.1 Rules Based on a Linear Assumption About the Form of the Conditional Mean
1.2.2 Nearest Neighbor Rules
1.3 The Curse of Dimensionality
1.4 Variance-Bias Considerations in Prediction
1.5 Complexity, Fit, and Choice of Predictor
1.6 Plotting to Portray the Effects of Particular Inputs

2 Some Linear Algebra and Simple Principal Components
2.1 The Gram-Schmidt Process and the QR Decomposition of a rank = p Matrix X
2.2 The Singular Value Decomposition of X
2.3 Matrices of Centered Columns and Principal Components
2.3.1 "Ordinary" Principal Components
2.3.2 "Kernel" Principal Components

3 (Non-OLS) Linear Predictors
3.1 Ridge Regression, the Lasso, and Some Other Shrinking Methods
3.1.1 Ridge Regression
3.1.2 The Lasso, Etc.
3.1.3 Least Angle Regression (LAR)
3.2 Two Methods With Derived Input Variables
3.2.1 Principal Components Regression
3.2.2 Partial Least Squares Regression

4 Linear Prediction Using Basis Functions
4.1 p = 1 Wavelet Bases
4.2 p = 1 Piecewise Polynomials and Regression Splines
4.3 Basis Functions and p-Dimensional Inputs
4.3.1 Multi-Dimensional Regression Splines (Tensor Bases)
4.3.2 MARS (Multivariate Adaptive Regression Splines)

5 Smoothing Splines
5.1 p = 1 Smoothing Splines
5.2 Multi-Dimensional Smoothing Splines
5.3 An Abstraction of the Smoothing Spline Material and Penalized Fitting in $\mathbb{R}^N$

6 Kernel Smoothing Methods
6.1 One-dimensional Kernel Smoothers
6.2 Kernel Smoothing in p Dimensions

7 High-Dimensional Use of Low-Dimensional Smoothers
7.1 Additive Models and Other Structured Regression Functions
7.2 Projection Pursuit Regression

8 Highly Flexible Non-Linear Parametric Regression Methods
8.1 Neural Network Regression
8.1.1 The Back-Propagation Algorithm
8.1.2 Formal Regularization of Fitting
8.2 Radial Basis Function Networks

9 Trees and Related Methods
9.1 Regression and Classification Trees
9.1.1 Comments on Forward Selection
9.1.2 Measuring the Importance of Inputs
9.2 Random Forests
9.3 PRIM (Patient Rule Induction Method)

10 Predictors Built on Bootstrap Samples, Model Averaging, and Stacking
10.1 Predictors Built on Bootstrap Samples
10.1.1 Bagging
10.1.2 Bumping
10.2 Model Averaging and Stacking
10.2.1 Bayesian Model Averaging
10.2.2 Stacking

11 Reproducing Kernel Hilbert Spaces: Penalized/Regularized and Bayes Prediction
11.1 RKHS's and p = 1 Cubic Smoothing Splines
11.2 What Seems to be Possible Beginning from Linear Functionals and Linear Differential Operators for p = 1
11.3 What Is Common Beginning Directly From a Kernel
11.3.1 Some Special Cases
11.3.2 Addendum Regarding the Structures of the Spaces Related to a Kernel
11.4 Gaussian Process "Priors," Bayes Predictors, and RKHS's

12 More on Understanding and Predicting Predictor Performance
12.1 Optimism of the Training Error
12.2 Cp, AIC and BIC
12.2.1 Cp and AIC
12.2.2 BIC
12.3 Cross-Validation Estimation of Err
12.4 Bootstrap Estimation of Err

13 Linear (and a Bit on Quadratic) Methods of Classification
13.1 Linear (and a bit on Quadratic) Discriminant Analysis
13.2 Logistic Regression
13.3 Separating Hyperplanes

14 Support Vector Machines
14.1 The Linearly Separable Case
14.2 The Linearly Non-separable Case
14.3 SVM's and Kernels
14.3.1 Heuristics
14.3.2 An Optimality Argument
14.4 Other Support Vector Stuff

15 Boosting
15.1 The AdaBoost Algorithm for 2-Class Classification
15.2 Why AdaBoost Might Be Expected to Work
15.2.1 Expected "Exponential Loss" for a Function of x
15.2.2 Sequential Search for a g(x) With Small Expected Exponential Loss and the AdaBoost Algorithm
15.3 Other Forms of Boosting
15.3.1 SEL
15.3.2 AEL (and Binary Regression Trees)
15.3.3 K-Class Classification
15.4 Some Issues (More or Less) Related to Boosting and Use of Trees as Base Predictors

16 Prototype and Nearest Neighbor Methods of Classification

17 Some Methods of Unsupervised Learning
17.1 Association Rules/Market Basket Analysis
17.1.1 The "Apriori Algorithm" for the Case of Singleton Sets Sj
17.1.2 Using 2-Class Classification Methods to Create Rectangles
17.2 Clustering
17.2.1 Partitioning Methods ("Centroid"-Based Methods)
17.2.2 Hierarchical Methods
17.2.3 (Mixture) Model-Based Methods
17.2.4 Self-Organizing Maps
17.2.5 Spectral Clustering
17.3 Multi-Dimensional Scaling
17.4 More on Principal Components and Related Ideas
17.4.1 "Sparse" Principal Components
17.4.2 Non-negative Matrix Factorization
17.4.3 Archetypal Analysis
17.4.4 Independent Component Analysis
17.4.5 Principal Curves and Surfaces
17.5 Density Estimation
17.6 Google PageRanks

18 Document Features and String Kernels for Text Processing

19 Kernel Mechanics

20 Undirected Graphical Models and Machine Learning

1 Introduction

The course is mostly about "supervised" machine learning for statisticians. The fundamental issue here is prediction, usually with a fairly large number of predictors and a large number of cases upon which to build a predictor. For $x \in \mathbb{R}^p$ an input vector (a vector of predictors) with $j$th coordinate $x_j$ and $y$ some output value (usually univariate, and when it's not, we'll indicate so with bold face), the object is to use $N$ observed $(x, y)$ pairs
\[
(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)
\]
called training data to create an effective prediction rule
\[
\hat{y} = \hat{f}(x)
\]
"Supervised" learning means that the training data include values of the output variable. "Unsupervised" learning means that there is no particular variable designated as the output variable, and one is looking for patterns in the training data $x_1, x_2, \ldots, x_N$, like clusters or low-dimensional structure.
Here we will write
\[
\underset{N \times p}{X} = \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_N' \end{pmatrix}, \quad
\underset{N \times 1}{Y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}, \quad \text{and} \quad T = (X, Y)
\]
for the training data, and when helpful (to remind of the size of the training set) add "$N$" subscripts.

A primary difference between a "machine learning" point of view and that common in basic graduate statistics courses is an ambivalence toward making statements concerning parameters of any models used in the invention of prediction rules (including those that would describe the amount of variation in observables one can expect the model to generate). Standard statistics courses often conceive of those parameters as encoding scientific knowledge about some underlying data-generating mechanism or real-world phenomenon. Machine learners don't seem to care much about these issues, but rather are concerned with "how good" a prediction rule is.

An output variable might take values in $\mathbb{R}$ or in some finite set $\mathcal{G} = \{1, 2, \ldots, K\}$ or $\mathcal{G} = \{0, 1, \ldots, K-1\}$. In this latter case, it can be notationally convenient to replace a $\mathcal{G}$-valued output $y$ with $K$ indicator variables
\[
y_j = I[y = j]
\]
(of course) each taking real values. This replaces the original $y$ with the vector
\[
\boldsymbol{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_K \end{pmatrix} \quad \text{or} \quad \boldsymbol{y} = \begin{pmatrix} y_0 \\ y_1 \\ \vdots \\ y_{K-1} \end{pmatrix} \tag{1}
\]
taking values in $\{0,1\}^K \subset \mathbb{R}^K$.

1.1 Some Very Simple Probability Theory and Decision Theory and Prediction Rules

For purposes of appealing to general theory that can suggest sensible forms for prediction rules, suppose that the random pair $(x', y)$ has ($(p+1)$-dimensional joint) distribution $P$. (Until further notice all expectations refer to distributions and conditional distributions derived from $P$. In general, if we need to remind ourselves what variables are being treated as random in probability, expectation, conditional probability, or conditional expectation computations, we will superscript $E$ or $P$ with their names.) Suppose further that $L(\hat{y}, y) \ge 0$ is some loss function quantifying how unhappy I am with prediction $\hat{y}$ when the output is $y$.

Purely as a thought experiment (not yet as anything based on the training data) I might consider trying to choose a functional form $f$ (NOT yet $\hat{f}$) to minimize
\[
E\, L(f(x), y)
\]
This is (in theory) easily done, since
\[
E\, L(f(x), y) = E\, E[L(f(x), y) \mid x]
\]
and the minimization can be accomplished by minimizing the inner conditional expectation separately for each possible value of $x$. That is, an optimal choice of $f$ is defined by
\[
f(u) = \arg\min_a E[L(a, y) \mid x = u] \tag{2}
\]
In the simple case where $L(\hat{y}, y) = (\hat{y} - y)^2$, the optimal $f$ in (2) is then well-known to be
\[
f(u) = E[y \mid x = u]
\]
In a classification context, where $y$ takes values in $\mathcal{G} = \{1, 2, \ldots, K\}$ or $\mathcal{G} = \{0, 1, \ldots, K-1\}$, I might use the (0-1) loss function $L(\hat{y}, y) = I[\hat{y} \ne y]$. An optimal $f$ corresponding to (2) is then
\[
f(u) = \arg\min_a \sum_{v \ne a} P[y = v \mid x = u] = \arg\max_a P[y = a \mid x = u]
\]
Further, if I define $\boldsymbol{y}$ as in (1), this is
\[
f(u) = \text{the index of the largest entry of } E[\boldsymbol{y} \mid x = u]
\]
So in both the squared error loss context and in the 0-1 loss classification context, ($P$-)optimal choice of $f$ involves the conditional mean of an output variable (albeit the second is a vector of 0-1 indicator variables). These strongly suggest the importance of finding ways to approximate conditional means $E[y \mid x = u]$ in order to (either directly or indirectly) produce good prediction rules.
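As a small concrete illustration of display (2) (not part of the original exposition), the following sketch computes both optimal rules for a hypothetical discrete joint distribution $P$; the particular probability table is invented purely for illustration.

```python
import numpy as np

# Hypothetical toy joint distribution P for (x, y): x takes values 0, 1, 2 and
# y takes values in G = {1, 2, 3}; p_xy[u, j] = P[x = u, y = j + 1].
p_xy = np.array([[0.10, 0.05, 0.05],
                 [0.05, 0.20, 0.05],
                 [0.02, 0.08, 0.40]])

for u in range(3):
    cond = p_xy[u] / p_xy[u].sum()                 # P[y = . | x = u]
    sel_opt = float(cond @ np.array([1, 2, 3]))    # E[y | x = u]: optimal under squared error loss
    zero_one_opt = 1 + int(np.argmax(cond))        # argmax_a P[y = a | x = u]: optimal under 0-1 loss
    print(f"x = {u}: SEL-optimal prediction {sel_opt:.3f}, 0-1-optimal class {zero_one_opt}")
```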
1.2 Two Conceptually Simple Ways One Might Approximate a Conditional Mean

It is standard to point to two conceptually appealing but in some sense polar-opposite ways of approximating $E[y \mid x = u]$ (for $(x, y)$ with distribution $P$). The tacit assumption here is that the training data are realizations of $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$ iid $P$.

1.2.1 Rules Based on a Linear Assumption About the Form of the Conditional Mean

If one is willing to adopt fairly strong assumptions about the form of $E[y \mid x = u]$ then Stat 511-type computations might be used to guide choice of a prediction rule. In particular, if one assumes that for some $\beta \in \mathbb{R}^p$
\[
E[y \mid x = u] \approx u'\beta \tag{3}
\]
(if we wish, we are free to restrict $x_1 = u_1 \equiv 1$ so that this form has an intercept) then an OLS computation with the training data produces
\[
\hat{\beta} = (X'X)^{-1} X'Y
\]
and suggests the predictor
\[
\hat{f}(x) = x'\hat{\beta}
\]
Probably "the biggest problems" with this possibility in situations involving large $p$ are the potential inflexibility of the linear form (3) working directly with the training data, and the "non-automatic" flavor of traditional regression model-building that tries to both take account of/mitigate problems of multicollinearity and extend the flexibility of forms of conditional means by constructing new predictors from "given" ones (further exploding the size of "$p$" and the associated difficulties).

1.2.2 Nearest Neighbor Rules

One idea (that might automatically allow very flexible forms for $E[y \mid x = u]$) is to operate completely non-parametrically and to think that if $N$ is "big enough," perhaps something like
\[
\frac{1}{\#\{i : x_i = u\}} \sum_{i \text{ with } x_i = u} y_i \tag{4}
\]
might work as an approximation to $E[y \mid x = u]$. But almost always (unless $N$ is huge and the distribution of $x$ is discrete) the number of $x_i = u$ is 0 or at most 1 (and absolutely no extrapolation beyond the set of training inputs is possible). So some modification is surely required. Perhaps the condition $x_i = u$ might be replaced with $x_i \approx u$ in (4). One form of this is to define for each $x$ the $k$-neighborhood
\[
n_k(x) = \text{the set of the } k \text{ inputs } x_i \text{ in the training set closest to } x \text{ in } \mathbb{R}^p
\]
A $k$-nearest neighbor approximation to $E[y \mid x = u]$ is then
\[
m_k(u) = \frac{1}{k} \sum_{i \text{ with } x_i \in n_k(u)} y_i
\]
suggesting the prediction rule
\[
\hat{f}(x) = m_k(x)
\]
One might hope that upon allowing $k$ to increase with $N$, provided that $P$ is not so bizarre (one is counting, for example, on the continuity of $E[y \mid x = u]$ in $u$), this could be an effective predictor. It is surely a highly flexible predictor, and automatically so. But as a practical matter, it and things like it often fail to be effective, basically because for $p$ anything but very small, $\mathbb{R}^p$ is far bigger than we naively realize.

1.3 The Curse of Dimensionality

Roughly speaking, one wants to allow the most flexible kinds of prediction rules possible. Naively, one might hope that with what seem like the "large" sample sizes common in machine learning contexts, even if $p$ is fairly large, prediction rules like $k$-nearest neighbor rules might be practically useful (as points "fill up" $\mathbb{R}^p$ and allow one to figure out how $y$ behaves as a function of $x$ operating quite locally). But our intuition about how sparse data sets in $\mathbb{R}^p$ actually must be is pretty weak and needs some education. In popular jargon, there is the "curse of dimensionality" to be reckoned with. There are many ways of framing this inescapable sparsity.
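Before cataloguing analytical framings of that sparsity, here is a minimal sketch (not from the original notes) of the $k$-nearest neighbor rule $m_k(u)$ of Section 1.2.2 on simulated data; the uniform input distribution and the choice $E[y \mid x] = x_1$ are purely illustrative. Notice how, for fixed $N$, the average distance to the $k$ "nearest" neighbors grows with $p$, previewing the discussion that follows.

```python
import numpy as np

def knn_predict(x_train, y_train, u, k):
    """k-nearest-neighbor approximation m_k(u) to E[y | x = u]
    (Euclidean distance; a minimal sketch, not a production implementation)."""
    d = np.linalg.norm(x_train - u, axis=1)
    nearest = np.argsort(d)[:k]
    return y_train[nearest].mean(), d[nearest].mean()

rng = np.random.default_rng(0)
N, k = 1000, 10
for p in (2, 10, 100):                                # same N, increasing input dimension
    X = rng.uniform(-1, 1, size=(N, p))
    y = X[:, 0] + rng.normal(scale=0.1, size=N)       # hypothetical E[y|x] = x_1
    pred, avg_dist = knn_predict(X, y, np.zeros(p), k)
    print(f"p = {p:3d}: prediction at 0 is {pred:6.3f}, mean neighbor distance {avg_dist:5.3f}")
```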
Some simple ones involve facts about uniform distributions on the unit ball in $\mathbb{R}^p$,
\[
\{x \in \mathbb{R}^p \mid \|x\| \le 1\},
\]
and on a unit cube centered at the origin, $[-.5, .5]^p$. For one thing, "most" of these distributions are very near the surface of the solids. The cube $[-r, r]^p$ capturing (for example) half the volume of the cube (half the probability mass of the distribution) has
\[
r = .5\,(.5)^{1/p}
\]
which converges to $.5$ as $p$ increases. Essentially the same story holds for the uniform distribution on the unit ball. The radius capturing half the probability mass has
\[
r = (.5)^{1/p}
\]
which converges to 1 as $p$ increases. Points uniformly distributed in these regions are mostly near the surface or boundary of the spaces.

Another interesting calculation concerns how large a sample must be in order for points generated from the uniform distribution on the ball or cube in an iid fashion to tend to "pile up" anywhere. Consider the problem of describing the distance from the origin to the closest of $N$ iid points uniform in the $p$-dimensional unit ball. With
\[
R = \text{the distance from the origin to a single random point},
\]
$R$ has cdf
\[
F(r) = \begin{cases} 0 & r < 0 \\ r^p & 0 \le r \le 1 \\ 1 & r > 1 \end{cases}
\]
So if $R_1, R_2, \ldots, R_N$ are iid with this distribution, $M = \min\{R_1, R_2, \ldots, R_N\}$ has cdf
\[
F_M(m) = \begin{cases} 0 & m < 0 \\ 1 - (1 - m^p)^N & 0 \le m \le 1 \\ 1 & m > 1 \end{cases}
\]
This distribution has, for example, median
\[
F_M^{-1}(.5) = \left(1 - \left(\tfrac{1}{2}\right)^{1/N}\right)^{1/p}
\]
For, say, $p = 100$ and $N = 10^6$, the median of the distribution of $M$ is $.87$.

In retrospect, it's not really all that hard to understand that even "large" numbers of points in $\mathbb{R}^p$ must be sparse. After all, it's perfectly possible for $p$-vectors $u$ and $v$ to agree perfectly on all but one coordinate and be far apart in $p$-space. There simply are a lot of ways for two $p$-vectors to differ!

In addition to these kinds of considerations of sparsity, there is the fact that the potential complexity of functions of $p$ variables explodes exponentially in $p$. CFZ identify this as a second manifestation of "the curse," and go on to expound a third that says "for large $p$, all data sets exhibit multicollinearity (or its generalization to non-parametric fitting)" with its accompanying problems of reliable fitting and extrapolation. In any case, the point is that even "large" $N$ doesn't save us and make practical large-$p$ prediction a more or less trivial application of standard parametric or non-parametric regression methodology.

1.4 Variance-Bias Considerations in Prediction

Just as in model parameter estimation, the "variance-bias trade-off" notion is important in a prediction context. Here we consider decompositions of what HTF in their Section 2.9 and Chapter 7 call various forms of "test/generalization/prediction error" that can help us think about that trade-off, and what issues come into play as we try to approximate a conditional mean function.

In what follows, assume that $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$ are iid $P$ and independent of $(x, y)$ that is also $P$-distributed. Write $E_T$ for expectation (and $\operatorname{Var}_T$ for variance) computed with respect to the joint distribution of the training data, and $E[\,\cdot \mid x = u]$ for conditional expectation (and $\operatorname{Var}[\,\cdot \mid x = u]$ for conditional variance) based on the conditional distribution of $y \mid x = u$. A measure of the effectiveness of the predictor $\hat{f}(x)$ at $x = u$ (under squared error loss) is (the function of $u$, an input vector at which one is predicting)
\[
\operatorname{Err}(u) \equiv E_T\, E\!\left[\big(\hat{f}(x) - y\big)^2 \,\middle|\, x = u\right]
\]
For some purposes, other versions of this might be useful and appropriate.
For example,
\[
\operatorname{Err}_T \equiv E^{(x,y)} \big(\hat{f}(x) - y\big)^2
\]
is another kind of prediction or test error (that is a function of the training data). (What one would surely like, but almost surely cannot have, is a guarantee that $\operatorname{Err}_T$ is small uniformly in $T$.) Note then that
\[
\operatorname{Err} \equiv E^{x} \operatorname{Err}(x) = E_T \operatorname{Err}_T
\]
is a number, an expected prediction error.

In any case, a useful decomposition of $\operatorname{Err}(u)$ is
\[
\begin{aligned}
\operatorname{Err}(u) &= E_T\big(\hat{f}(u) - E[y \mid x = u]\big)^2 + E\big[(y - E[y \mid x = u])^2 \mid x = u\big] \\
&= E_T\big(\hat{f}(u) - E_T \hat{f}(u)\big)^2 + \big(E_T \hat{f}(u) - E[y \mid x = u]\big)^2 + \operatorname{Var}[y \mid x = u] \\
&= \operatorname{Var}_T \hat{f}(u) + \big(E_T \hat{f}(u) - E[y \mid x = u]\big)^2 + \operatorname{Var}[y \mid x = u]
\end{aligned}
\]
The first quantity in this decomposition, $\operatorname{Var}_T \hat{f}(u)$, is the variance of the prediction at $u$. The second term, $\big(E_T \hat{f}(u) - E[y \mid x = u]\big)^2$, is a kind of squared bias of prediction at $u$. And $\operatorname{Var}[y \mid x = u]$ is an unavoidable variance in outputs at $x = u$. Highly flexible prediction forms may give small prediction biases at the expense of large prediction variances. One may need to balance the two off against each other when looking for a good predictor.

Some additional insight into this can perhaps be gained by considering
\[
\operatorname{Err} \equiv E^x \operatorname{Err}(x) = E^x \operatorname{Var}_T \hat{f}(x) + E^x\big(E_T \hat{f}(x) - E[y \mid x]\big)^2 + E^x \operatorname{Var}[y \mid x]
\]
The first term on the right here is the average (according to the marginal of $x$) of the prediction variance at $x$. The second is the average squared prediction bias. And the third is the average conditional variance of $y$ (and is not under the control of the analyst choosing $\hat{f}(x)$).

Consider here a further decomposition of the second term. Suppose that $T$ is used to select a function (say $g_T$) from some linear subspace, say $\mathcal{S} = \{g\}$, of the space of functions $h$ with $E^x (h(x))^2 < \infty$, and that ultimately one uses as a predictor
\[
\hat{f}(x) = g_T(x)
\]
Since linear subspaces are convex,
\[
E_T g_T = E_T \hat{f} \in \mathcal{S}
\]
Further, suppose that
\[
g^* \equiv \arg\min_{g \in \mathcal{S}} E^x\big(g(x) - E[y \mid x]\big)^2
\]
is the projection of (the function of $u$) $E[y \mid x = u]$ onto the space $\mathcal{S}$. Then write
\[
h^*(u) = E[y \mid x = u] - g^*(u)
\]
so that
\[
E[y \mid x] = g^*(x) + h^*(x)
\]
Then, it's a consequence of the facts that $E^x(h^*(x)\, g(x)) = 0$ for all $g \in \mathcal{S}$, and therefore that $E^x(h^*(x)\, g^*(x)) = 0$ and $E^x\big(h^*(x)\, E_T\hat{f}(x)\big) = 0$, that
\[
\begin{aligned}
E^x\big(E_T \hat{f}(x) - E[y \mid x]\big)^2 &= E^x\big(E_T\hat{f}(x) - (g^*(x) + h^*(x))\big)^2 \\
&= E^x\big(E_T\hat{f}(x) - g^*(x)\big)^2 + E^x\big(h^*(x)\big)^2 - 2 E^x\big((E_T\hat{f}(x) - g^*(x))\, h^*(x)\big) \\
&= E^x\big(E_T \hat{f}(x) - g^*(x)\big)^2 + E^x\big(E[y \mid x] - g^*(x)\big)^2
\end{aligned}
\]
The first term on the right above is an average squared estimation bias, measuring how well the average (over $T$) predictor function approximates the element of $\mathcal{S}$ that best approximates the conditional mean function. This is a measure of how appropriately the training data are used to pick out elements of $\mathcal{S}$. The second term on the right is an average squared model bias, measuring how well it is possible to approximate the conditional mean function $E[y \mid x = u]$ by an element of $\mathcal{S}$. This is controlled by the size of $\mathcal{S}$, or effectively the flexibility allowed in the form of $\hat{f}$. Average squared prediction bias can thus be large because the form fit is not flexible enough, or because a poor fitting method is employed.

1.5 Complexity, Fit, and Choice of Predictor

In rough terms, standard methods of constructing predictors all have associated "complexity parameters" (like numbers and types of predictors/basis functions or "ridge parameters" used in regression methods, and penalty weights/bandwidths/neighborhood sizes applied in smoothing methods) that are at the choice of a user. Depending upon the choice of complexity, one gets more or less flexibility in the form $\hat{f}$.
If a choice of complexity doesn't allow enough flexibility in the form of a predictor, under-fit occurs and there is large bias in prediction. On the other hand, if the choice allows too much flexibility, bias may be reduced, but the price paid is large variance of prediction and over-fit. It is a theme that runs through this material that complexity must be chosen in a way that balances variance and bias for the particular combination of $N$ and $p$ and general circumstance one faces. Specifics of methods for doing this are presented in some detail in Chapter 7 of HTF, but in broad terms the story is as follows.

The most obvious/elementary means of selecting a "complexity parameter" is to (for all options under consideration) compute and compare values of the so-called "training error"
\[
\overline{\operatorname{err}} = \frac{1}{N} \sum_{i=1}^N \big(\hat{f}(x_i) - y_i\big)^2
\]
(or more generally, $\overline{\operatorname{err}} = \frac{1}{N}\sum_{i=1}^N L(\hat{f}(x_i), y_i)$) and look for predictors with small $\overline{\operatorname{err}}$. The problem is that $\overline{\operatorname{err}}$ is no good estimator of $\operatorname{Err}$ (or any other sensible quantification of predictor performance). It typically decreases with increased complexity (without turning back up for large complexity), and fails to reliably indicate performance outside the training sample.

The existing practical options for evaluating likely performance of a predictor (and guiding choice of complexity) are then three.

1. One might employ other (besides $\overline{\operatorname{err}}$) training-data-based indicators of likely predictor performance, like Mallows' $C_p$, "AIC," and "BIC."

2. In genuinely large $N$ contexts, one might hold back some random sample of the training data to serve as a "test set," fit to produce $\hat{f}$ on the remainder, and use
\[
\frac{1}{\text{size of the test set}} \sum_{i \in \text{the test set}} \big(\hat{f}(x_i) - y_i\big)^2
\]
to indicate likely predictor performance.

3. One might employ sample re-use methods to estimate $\operatorname{Err}$ and guide choice of complexity. Cross-validation and bootstrap ideas are used here.

We'll say more about these possibilities later, but at the moment outline the cross-validation idea. $K$-fold cross-validation consists of

1. randomly breaking the training set into $K$ disjoint, roughly equal-sized pieces, say $T_1, T_2, \ldots, T_K$,

2. training on each of the reduced training sets $T \setminus T_k$ to produce $K$ predictors $\hat{f}^k$, and

3. letting $k(i)$ be the index of the piece $T_k$ containing training case $i$, and computing the cross-validation error
\[
CV\big(\hat{f}\big) = \frac{1}{N} \sum_{i=1}^N \big(\hat{f}^{k(i)}(x_i) - y_i\big)^2
\]
(more generally, $CV(\hat{f}) = \frac{1}{N}\sum_{i=1}^N L\big(y_i, \hat{f}^{k(i)}(x_i)\big)$).

This is roughly the same as fitting on each $T \setminus T_k$ and correspondingly evaluating on $T_k$, and then averaging. (When $N$ is a multiple of $K$, these are exactly the same.) One hopes that $CV(\hat{f})$ approximates $\operatorname{Err}$. The choice $K = N$ is called "leave-one-out cross-validation," and sometimes in this case there are slick computational ways of evaluating $CV(\hat{f})$.
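A minimal sketch of the $K$-fold recipe just described follows (not from the original notes); the `fit` and `predict` arguments are hypothetical user-supplied functions, and OLS is used only as a convenient example of a fitting method.

```python
import numpy as np

def k_fold_cv(X, Y, fit, predict, K=5, seed=None):
    """Sketch of the K-fold cross-validation error CV(f_hat) under squared error loss."""
    rng = np.random.default_rng(seed)
    N = len(Y)
    fold = rng.permutation(N) % K              # randomly assign cases to K roughly equal pieces
    sq_errors = np.empty(N)
    for k in range(K):
        train, test = fold != k, fold == k
        model = fit(X[train], Y[train])        # train on T \ T_k
        sq_errors[test] = (predict(model, X[test]) - Y[test]) ** 2
    return sq_errors.mean()

# Example use, with OLS as the (hypothetical) fitting method:
ols_fit = lambda X, Y: np.linalg.lstsq(X, Y, rcond=None)[0]
ols_pred = lambda beta, X: X @ beta
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
Y = X @ np.array([1.0, 0.0, 2.0, 0.0, 0.0]) + rng.normal(size=200)
print(k_fold_cv(X, Y, ols_fit, ols_pred, K=10, seed=0))
```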
1.6 Plotting to Portray the Effects of Particular Inputs

An issue briefly discussed in HTF Ch 10 is the making and plotting of functions of a few of the coordinates of $x$ in an attempt to understand the nature of the influence of these in a predictor. (What they say is really perfectly general, not at all special to the particular form of predictors discussed in that chapter.) If, for example, I want to understand the influence the first two coordinates of $x$ have on a form $f(x)$, I might think of somehow averaging out the remaining coordinates of $x$. One theoretical implementation of this idea would be
\[
\bar{f}_{12}(u_1, u_2) = E^{(x_3, x_4, \ldots, x_p)} f(u_1, u_2, x_3, x_4, \ldots, x_p)
\]
i.e. averaging according to the marginal distribution of the excluded input variables. An empirical version of this is
\[
\frac{1}{N} \sum_{i=1}^N f(u_1, u_2, x_{3i}, x_{4i}, \ldots, x_{pi})
\]
This might be plotted (e.g. in contour plot fashion) and the plot called a partial dependence plot for the variables $x_1$ and $x_2$. HTF's language is that this function details the dependence of the predictor on $(u_1, u_2)$ "after accounting for the average effects of the other variables." This thinking amounts to a version of the kind of thing one does in ordinary factorial linear models, where main effects are defined in terms of average (across all levels of all other factors) means for individual levels of a factor, two-factor interactions are defined in terms of average (again across all levels of all other factors) means for pairs of levels of two factors, etc.

Something different raised by HTF is consideration of
\[
\tilde{f}_{12}(u_1, u_2) = E[f(x) \mid x_1 = u_1, x_2 = u_2]
\]
(This is, by the way, the function of $(u_1, u_2)$ closest to $f(x)$ in $L_2(P)$.) This is obtained by averaging not against the marginal of the excluded variables, but against the conditional distribution of the excluded variables given $x_1 = u_1, x_2 = u_2$. No workable empirical version of $\tilde{f}_{12}(u_1, u_2)$ can typically be defined. And it should be clear that this is not the same as $\bar{f}_{12}(u_1, u_2)$. HTF say this in some sense describes the effects of $(x_1, x_2)$ on the prediction "ignoring the impact of the other variables." (In fact, if $f(x)$ is a good approximation for $E[y \mid x]$, this conditioning produces essentially $E[y \mid x_1 = u_1, x_2 = u_2]$, and $\tilde{f}_{12}$ is just a predictor of $y$ based on $(x_1, x_2)$.)

The difference between $\bar{f}_{12}$ and $\tilde{f}_{12}$ is easily seen through resort to a simple example. If, for example, $f$ is additive of the form
\[
f(x) = h_1(x_1, x_2) + h_2(x_3, \ldots, x_p)
\]
then
\[
\bar{f}_{12}(u_1, u_2) = h_1(u_1, u_2) + E\, h_2(x_3, \ldots, x_p) = h_1(u_1, u_2) + \text{constant}
\]
while
\[
\tilde{f}_{12}(u_1, u_2) = h_1(u_1, u_2) + E[h_2(x_3, \ldots, x_p) \mid x_1 = u_1, x_2 = u_2]
\]
and the $\tilde{f}_{12}$ "correction" to $h_1(u_1, u_2)$ is not necessarily constant in $(u_1, u_2)$. The upshot of all this is that partial dependence plots are potentially helpful, but one needs to remember that they are produced by averaging according to the marginal distribution of the set of variables not under consideration.

2 Some Linear Algebra and Simple Principal Components

Methods of modern multivariate statistical learning often involve more linear algebra background than is assumed or used in Stat 511. So here we provide some relevant linear theory. We then apply that to the unsupervised learning problem of "principal components analysis." We continue to use the notation
\[
\underset{N \times p}{X} = \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_N' \end{pmatrix} \quad \text{and potentially} \quad \underset{N \times 1}{Y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}
\]
and recall that it is standard Stat 511 fare that ordinary least squares projects $Y$ onto $C(X)$, the column space of $X$, in order to produce the vector of fitted values
\[
\underset{N \times 1}{\hat{Y}} = \begin{pmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_N \end{pmatrix}
\]

2.1 The Gram-Schmidt Process and the QR Decomposition of a rank = p Matrix X

For many purposes it would be convenient if the columns of a full rank (rank $= p$) matrix $X$ were orthogonal. In fact, it would be useful to replace the $N \times p$ matrix $X$ with an $N \times p$ matrix $Z$ with orthogonal columns having the property that for each $l$, if $X_l$ and $Z_l$ are the $N \times l$ matrices consisting of the first $l$ columns of $X$ and $Z$ respectively, then $C(Z_l) = C(X_l)$. Such a matrix can in fact be constructed using the so-called Gram-Schmidt process.
This process generalizes beyond the present application to $\mathbb{R}^N$ to general Hilbert spaces, and in recognition of that important fact, we'll adopt standard Hilbert space notation for the inner product in $\mathbb{R}^N$, namely
\[
\langle u, v \rangle \equiv \sum_{i=1}^N u_i v_i = u'v
\]
and describe the process in these terms. In what follows, vectors are $N$-vectors and in particular, $x_j$ is the $j$th column of $X$. Notice that this is in potential conflict with earlier notation that made $x_i$ the $p$-vector of inputs for the $i$th case in the training data. We will simply have to read the following in context and keep in mind the local convention.

With this notation, the Gram-Schmidt process proceeds as follows:

1. Set the first column of $Z$ to be $z_1 = x_1$.

2. Having constructed $Z_{l-1} = (z_1, z_2, \ldots, z_{l-1})$, let
\[
z_l = x_l - \sum_{j=1}^{l-1} \frac{\langle x_l, z_j \rangle}{\langle z_j, z_j \rangle} z_j
\]

It is easy enough to see that $\langle z_l, z_j \rangle = 0$ for all $j < l$ (building up the orthogonality of $z_1, z_2, \ldots, z_{l-1}$ by induction), since
\[
\langle z_l, z_j \rangle = \langle x_l, z_j \rangle - \langle x_l, z_j \rangle = 0
\]
as at most one term of the sum in step 2 above has a non-zero inner product with $z_j$. Further, assume that $C(Z_{l-1}) = C(X_{l-1})$. Then $z_l \in C(X_l)$ and so $C(Z_l) \subset C(X_l)$. And since any element of $C(X_l)$ can be written as a linear combination of an element of $C(X_{l-1}) = C(Z_{l-1})$ and $x_l$, we also have $C(X_l) \subset C(Z_l)$. Thus, as advertised, $C(Z_l) = C(X_l)$.

Since the $z_j$ are perpendicular, for any $N$-vector $w$,
\[
\sum_{j=1}^{l} \frac{\langle w, z_j \rangle}{\langle z_j, z_j \rangle} z_j
\]
is the projection of $w$ onto $C(Z_l) = C(X_l)$. (To see this, consider minimization of the quantity $\big\|w - \sum_{j=1}^l c_j z_j\big\|^2 = \big\langle w - \sum_{j=1}^l c_j z_j,\; w - \sum_{j=1}^l c_j z_j \big\rangle$ by choice of the constants $c_j$.) In particular, the sum on the right side of the equality in step 2 of the Gram-Schmidt process is the projection of $x_l$ onto $C(X_{l-1})$. And the projection of a vector of outputs $Y$ onto $C(X_l)$ is
\[
\sum_{j=1}^{l} \frac{\langle Y, z_j \rangle}{\langle z_j, z_j \rangle} z_j
\]
This means that for a full $p$-variable regression,
\[
\frac{\langle Y, z_p \rangle}{\langle z_p, z_p \rangle}
\]
is the regression coefficient for $z_p$ and (since only $z_p$ involves it) the last variable in $X$, $x_p$. So, in constructing a vector of fitted values, fitted regression coefficients in multiple regression can be interpreted as weights to be applied to that part of the input vector that remains after projecting the predictor onto the space spanned by all the others.

The construction of the orthogonal variables $z_j$ can be represented in matrix form as
\[
\underset{N \times p}{X} = \underset{N \times p}{Z}\;\underset{p \times p}{\Gamma}
\]
where $\Gamma$ is upper triangular with
\[
\Gamma_{kj} = \text{the value in the } k\text{th row and } j\text{th column of } \Gamma = \begin{cases} 1 & \text{if } j = k \\ \dfrac{\langle z_k, x_j \rangle}{\langle z_k, z_k \rangle} & \text{if } j > k \end{cases}
\]
(and 0 if $j < k$). Defining
\[
D = \operatorname{diag}\big(\langle z_1, z_1\rangle^{1/2}, \ldots, \langle z_p, z_p\rangle^{1/2}\big) = \operatorname{diag}(\|z_1\|, \ldots, \|z_p\|)
\]
and letting
\[
Q = Z D^{-1} \quad \text{and} \quad R = D\,\Gamma
\]
one may write
\[
X = QR \tag{5}
\]
that is the so-called QR decomposition of $X$. In (5), $Q$ is $N \times p$ with
\[
Q'Q = D^{-1} Z'Z D^{-1} = D^{-1} \operatorname{diag}\big(\langle z_1, z_1\rangle, \ldots, \langle z_p, z_p\rangle\big) D^{-1} = I
\]
That is, $Q$ has for columns perpendicular unit vectors that form a basis for $C(X)$. $R$ is upper triangular, and that says that only the first $l$ of these unit vectors are needed to create $x_l$. The decomposition is computationally useful in that (for $q_j$ the $j$th column of $Q$, namely $q_j = \langle z_j, z_j\rangle^{-1/2} z_j$) the projection of a response vector $Y$ onto $C(X)$ is (since the $q_j$ comprise an orthonormal basis for $C(X)$)
\[
\hat{Y} = \sum_{j=1}^{p} \langle Y, q_j \rangle q_j = QQ'Y \tag{6}
\]
and
\[
\hat{\beta}^{\text{OLS}} = R^{-1} Q'Y
\]
(The fact that $R$ is upper triangular implies that there are efficient ways to compute its inverse.)
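As a minimal numerical sketch of displays (5) and (6) (not from the original notes), the following uses NumPy's built-in QR routine, which is numerically preferable to raw Gram-Schmidt, on a simulated full-column-rank $X$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 50, 4
X = rng.normal(size=(N, p))                      # assumed full column rank
Y = rng.normal(size=N)

Q, R = np.linalg.qr(X)                           # Q: N x p orthonormal columns, R: p x p upper triangular
Y_hat = Q @ (Q.T @ Y)                            # projection of Y onto C(X), as in (6)
b_ols = np.linalg.solve(R, Q.T @ Y)              # back-solve R b = Q'Y rather than explicitly invert R

# Agrees (to rounding) with the usual normal-equations solution:
print(np.allclose(b_ols, np.linalg.solve(X.T @ X, X.T @ Y)))
```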
2.2 The Singular Value Decomposition of X

If the matrix $X$ has rank $r$, then it has a so-called singular value decomposition as
\[
\underset{N \times p}{X} = \underset{N \times r}{U}\;\underset{r \times r}{D}\;\underset{r \times p}{V'}
\]
where $U$ has orthonormal columns spanning $C(X)$, $V$ has orthonormal columns spanning $C(X')$ (the row space of $X$), and
\[
D = \operatorname{diag}(d_1, d_2, \ldots, d_r) \quad \text{for} \quad d_1 \ge d_2 \ge \cdots \ge d_r > 0
\]
The $d_j$ are the "singular values" of $X$. (It is worth noting that for a real non-negative definite square matrix (a covariance matrix), the singular value decomposition is the eigen decomposition, $U = V$, the columns of these matrices are unit eigenvectors, and the SVD singular values are the corresponding eigenvalues.)

An interesting property of the singular value decomposition is this. If $U_l$ and $V_l$ are matrices consisting of the first $l \le r$ columns of $U$ and $V$ respectively, then
\[
X_l = U_l \operatorname{diag}(d_1, d_2, \ldots, d_l)\, V_l'
\]
is the best (in the sense of squared distance from $X$ in $\mathbb{R}^{N \times p}$) rank $= l$ approximation to $X$. (In passing, note that application of this kind of argument to covariance matrices provides low-rank approximations to complicated covariance matrices.)

Since the columns of $U$ are an orthonormal basis for $C(X)$, the projection of an output vector $Y$ onto $C(X)$ is
\[
\hat{Y}^{\text{OLS}} = \sum_{j=1}^{r} \langle Y, u_j \rangle u_j = UU'Y \tag{7}
\]
In the full rank (rank $= p$) $X$ case, this is of course completely parallel to (6) and is a consequence of the fact that the columns of both $U$ and $Q$ (from the QR decomposition of $X$) form orthonormal bases for $C(X)$. In general, the two bases are not the same.

Now using the singular value decomposition of a full rank (rank $= p$) $X$,
\[
X'X = VD'U'UDV' = VD^2V' \tag{8}
\]
which is the eigen (or spectral) decomposition of the symmetric and positive definite $X'X$. The vector
\[
z_1 \equiv Xv_1
\]
is the product $Xw$ with the largest squared length in $\mathbb{R}^N$ subject to the constraint that $\|w\| = 1$. A second representation of $z_1$ is
\[
z_1 = Xv_1 = UDV'v_1 = UD\begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} = d_1 u_1
\]
In general,
\[
z_j \equiv Xv_j = \begin{pmatrix} \langle x_1, v_j \rangle \\ \vdots \\ \langle x_N, v_j \rangle \end{pmatrix} = d_j u_j \tag{9}
\]
is the vector of the form $Xw$ with the largest squared length in $\mathbb{R}^N$ subject to the constraints that $\|w\| = 1$ and $\langle z_j, z_l \rangle = 0$ for all $l < j$.

2.3 Matrices of Centered Columns and Principal Components

In the event that all the columns of $X$ have been centered (each $\mathbf{1}'x_j = 0$ for $x_j$ the $j$th column of $X$), there is additional terminology and insight associated with singular value decompositions as describing the structure of $X$. Note that centering is often sensible in unsupervised learning contexts because the object is to understand the internal structure of the data cases $x_i \in \mathbb{R}^p$, not the location of the data cloud (that is easily represented by the sample mean vector). So accordingly, we first translate the data cloud to the origin. Principal components ideas are then based on the singular value decomposition of $X$,
\[
\underset{N \times p}{X} = \underset{N \times r}{U}\;\underset{r \times r}{D}\;\underset{r \times p}{V'}
\]
(and related spectral/eigen decompositions of $X'X$ and $XX'$).

2.3.1 "Ordinary" Principal Components

The columns of $V$ (namely $v_1, v_2, \ldots, v_r$) are called the principal component directions in $\mathbb{R}^p$ of the $x_i$, and the elements of the vectors $z_j \in \mathbb{R}^N$ from display (9), namely the inner products $\langle x_i, v_j \rangle$, are called the principal components of the $x_i$. (The $i$th element of $z_j$, $\langle x_i, v_j \rangle$, is the value of the $j$th principal component for case $i$, or the corresponding principal component score.) Notice that $\langle x_i, v_j \rangle v_j$ is the projection of $x_i$ onto the 1-dimensional space spanned by $v_j$.
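To make these definitions concrete, here is a minimal sketch (simulated data, not from the original notes) that column-centers an $X$, computes its SVD, and reads off principal component directions and scores, checking relationship (9) and the eigen-analysis of $X'X$ from display (8).

```python
import numpy as np

rng = np.random.default_rng(3)
X_raw = rng.normal(size=(100, 3)) @ np.diag([2.0, 1.0, 0.1])   # hypothetical data with unequal spread
X = X_raw - X_raw.mean(axis=0)                                  # centered columns

U, d, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt.T                                 # principal component scores z_j = X v_j = d_j u_j
print(np.allclose(Z, U * d))                 # check relationship (9)
print(d**2 / X.shape[0])                     # (N-divisor) sample variances of the components

# The same directions come from an eigen analysis of X'X, as in display (8):
evals, evecs = np.linalg.eigh(X.T @ X)       # eigenvalues returned in ascending order
print(np.allclose(np.sort(d**2), evals))
```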
It is worth thinking a bit more about the form of the product $X_l = U_l \operatorname{diag}(d_1, d_2, \ldots, d_l) V_l'$ that we've already said is the best rank $l$ approximation to $X$. In fact it is
\[
(z_1, z_2, \ldots, z_l) \begin{pmatrix} v_1' \\ v_2' \\ \vdots \\ v_l' \end{pmatrix} = \sum_{j=1}^{l} z_j v_j'
\]
and its $i$th row is $\sum_{j=1}^{l} \langle x_i, v_j \rangle v_j'$, which (since the $v_j$ are orthonormal) is the transpose of the projection of $x_i$ onto $C(V_l)$. That is,
\[
X_l = \begin{pmatrix} \langle x_1, v_1 \rangle \\ \vdots \\ \langle x_N, v_1 \rangle \end{pmatrix} v_1' + \begin{pmatrix} \langle x_1, v_2 \rangle \\ \vdots \\ \langle x_N, v_2 \rangle \end{pmatrix} v_2' + \cdots + \begin{pmatrix} \langle x_1, v_l \rangle \\ \vdots \\ \langle x_N, v_l \rangle \end{pmatrix} v_l' = z_1 v_1' + z_2 v_2' + \cdots + z_l v_l'
\]
a sum of rank 1 summands, producing for $X_l$ a matrix with each $x_i$ in $X$ replaced by the transpose of its projection onto $C(V_l)$. The values of the principal components are the coefficients applied to the $v_j'$ for a given case to produce the summands.

Note that in light of relationships (9) and the fact that the $u_j$'s are unit vectors, the squared length of $z_j$ is $d_j^2$, non-increasing in $j$. So on the whole, the summands $z_j v_j'$ that make up $X_l$ decrease in size with $j$. In this sense, the directions $v_1, v_2, \ldots, v_r$ are successively less important in terms of describing variation in the $x_i$. This agrees with typical statistical interpretation of instances where a few early singular values are much bigger than the others. In such cases the "simple structure" discovered in the data is that observations can be more or less reconstructed as linear combinations (with principal components as weights) of a few orthonormal vectors.

Izenman, in his discussion of "polynomial principal components," points out that in some circumstances the existence of a few very small singular values can also identify important simple structure in a data set. Suppose, for example, that all singular values except $d_p \approx 0$ are of appreciable size. One simple feature of the data set is then that all $\langle x_i, v_p \rangle \approx 0$, i.e. there is one linear combination of the $p$ coordinates $x_j$ that is essentially constant (namely $\langle x, v_p \rangle$). The data fall nearly on a $(p-1)$-dimensional hyperplane in $\mathbb{R}^p$. In cases where the $p$ coordinates $x_j$ are not functionally independent (for example consisting of centered versions of 1) all values, 2) all squares of values, and 3) all cross products of values of a smaller number of functionally independent variables), a single "nearly 0" singular value identifies a quadratic function of the functionally independent variables that must be essentially constant, a potentially useful insight about the data set.

X'X and XX' and Principal Components

The singular value decomposition of $X$ means that both $X'X$ and $XX'$ have useful representations in terms of singular vectors and singular values. Consider first $X'X$ (that is most of the sample covariance matrix). As noted in display (8), the SVD of $X$ means that
\[
X'X = VD^2V'
\]
and it's then clear that the columns of $V$ are eigenvectors of $X'X$ and the squares of the diagonal elements of $D$ are the corresponding eigenvalues. An eigen analysis of $X'X$ then directly yields the principal component directions of the data, and through the further computation of the inner products in (9), the principal components $z_j$ (and hence the singular vectors $u_j$) are available.

Note that $\frac{1}{N} X'X$ is the ($N$-divisor) sample covariance matrix for the $p$ input variables $x_1, x_2, \ldots, x_p$. (Notice that when $X$ has standardized columns, i.e. each column $x_j$ of $X$ has $\langle \mathbf{1}, x_j \rangle = 0$ and $\langle x_j, x_j \rangle = N$, the matrix $\frac{1}{N} X'X$ is the sample correlation matrix for the $p$ input variables $x_1, x_2, \ldots, x_p$.) The principal component directions of $X$ in $\mathbb{R}^p$, namely $v_1, v_2, \ldots, v_r$, are also unit eigenvectors of the sample covariance matrix. The squared lengths of the principal components $z_j$ in $\mathbb{R}^N$ divided by $N$ are the ($N$-divisor) sample variances of the entries of the $z_j$, and their values are
\[
\frac{1}{N} z_j' z_j = \frac{1}{N} d_j u_j' u_j d_j = \frac{d_j^2}{N}
\]
The SVD of $X$ also implies that
\[
XX' = UDV'VDU' = UD^2U'
\]
and it's then clear that the columns of $U$ are eigenvectors of $XX'$ and the squares of the diagonal elements of $D$ are the corresponding eigenvalues. $UD$ then produces the $N \times r$ matrix of principal components of the data. It does not appear that the principal component directions are available even indirectly based only on this second eigen analysis.

2.3.2 "Kernel" Principal Components

This topic concerns the possibility of using a nonlinear function $\phi : \mathbb{R}^p \to \mathbb{R}^M$ to map data vectors $x$ to (a usually higher-dimensional) vector of features $\phi(x)$. Of course, a perfectly straightforward way of proceeding is to simply create a new $N \times M$ data matrix
\[
\Phi = \begin{pmatrix} \phi(x_1)' \\ \phi(x_2)' \\ \vdots \\ \phi(x_N)' \end{pmatrix}
\]
and, after centering via
\[
\tilde{\Phi} = \left(I - \frac{1}{N} J\right) \Phi \tag{10}
\]
for $J$ an $N \times N$ matrix of 1's, do an SVD of $\tilde{\Phi}$, producing singular values and both sets of singular vectors. But this doesn't seem to be the most common approach.

Rather than specify the transform $\phi$ (or even the space $\mathbb{R}^M$), it seems more common to instead specify $\Phi\Phi'$ (the so-called Gram matrix) by appeal to the fact that it is of necessity non-negative definite and might thus be generated by an appropriate "kernel" function. That is, for some choice of a non-negative definite function $K(x, z) : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$, one assumes that $\Phi\Phi'$ is of the form
\[
\Phi\Phi' = \big(\phi(x_i)'\phi(x_j)\big)_{\substack{i = 1, \ldots, N \\ j = 1, \ldots, N}} = \big(K(x_i, x_j)\big)_{\substack{i = 1, \ldots, N \\ j = 1, \ldots, N}}
\]
and doesn't much bother about what $\phi$ might produce such a $\Phi$ (and then) $\Phi\Phi'$. (See Section 19 concerning the construction of kernel functions and note in passing that a choice of $K(x, z)$ essentially defines a new inner product on $\mathbb{R}^p$ and corresponding new norm $\|x\|_K = \sqrt{K(x, x)}$.) But one must then somehow handle the fact that it is $\Phi\Phi'$ that is available, not the $\tilde{\Phi}\tilde{\Phi}'$ that is needed to do a principal components analysis in feature space. (Without $\phi(x)$ one can't make use of relationship (10).)

A bit of linear algebra shows that for an $N \times M$ matrix $M$ and $\tilde{M} = \big(I - \frac{1}{N} J\big) M$,
\[
\tilde{M}\tilde{M}' = MM' - \frac{1}{N} JMM' - \frac{1}{N} MM'J + \frac{1}{N^2} JMM'J
\]
so that $\tilde{\Phi}\tilde{\Phi}'$ is computable from $\Phi\Phi'$ only. Then (as indicated in Section 2.3.1) an eigen analysis of $\tilde{M}\tilde{M}'$ will produce the principal components for the data expressed in feature space. (This analysis produces the $z_j$'s, but not the principal component directions in the feature space. There is the representation $\tilde{M}'\tilde{M} = M'M - \frac{1}{N} M'JM$, but that doesn't help one find $\tilde{\Phi}'\tilde{\Phi}$ or its eigenvectors using only $\Phi\Phi'$.)
3 (Non-OLS) Linear Predictors

There is more to say about the development of a linear predictor $\hat{f}(x) = x'\hat{\beta}$ for an appropriate $\hat{\beta} \in \mathbb{R}^p$ than what is said in books and courses on ordinary linear models (where ordinary least squares is used to fit the linear form to all $p$ input variables or to some subset of $M$ of them). We continue the basic notation of Section 2, where the (supervised learning) problem is prediction, and there is a vector of continuous outputs, $Y$, of interest.

3.1 Ridge Regression, the Lasso, and Some Other Shrinking Methods

An alternative to seeking a suitable level of complexity in a linear prediction rule through subset selection and least squares fitting of a linear form to the selected variables is to employ a shrinkage method based on a penalized version of least squares to choose a vector $\hat{\beta} \in \mathbb{R}^p$ to employ in a linear prediction rule. Here we consider several such methods. The implementation of these methods is not equivariant to the scaling used to express the input variables $x_j$. So that we can talk about properties of the methods that are associated with a well-defined scaling, we assume here that the output variable has been centered (i.e. that $\langle Y, \mathbf{1} \rangle = 0$) and that the columns of $X$ have been standardized (and if originally $X$ had a constant column, it has been removed).

3.1.1 Ridge Regression

For a $\lambda > 0$, the ridge regression coefficient vector $\hat{\beta}^{\text{ridge}}_\lambda \in \mathbb{R}^p$ is
\[
\hat{\beta}^{\text{ridge}}_\lambda = \arg\min_{\beta \in \mathbb{R}^p} \left\{ (Y - X\beta)'(Y - X\beta) + \lambda \beta'\beta \right\} \tag{11}
\]
Here $\lambda$ is a penalty/complexity parameter that controls how much $\hat{\beta}^{\text{OLS}}$ is shrunken towards $\mathbf{0}$. The unconstrained minimization problem expressed in (11) has an equivalent constrained minimization description as
\[
\hat{\beta}^{\text{ridge}}_t = \arg\min_{\beta \text{ with } \|\beta\|^2 \le t} (Y - X\beta)'(Y - X\beta) \tag{12}
\]
for an appropriate $t > 0$. (Corresponding to $\lambda$ used in (11) is $t = \big\|\hat{\beta}^{\text{ridge}}_\lambda\big\|^2$ used in (12). Conversely, corresponding to $t$ used in (12), one may use a value of $\lambda$ in (11) producing the same error sum of squares.)

The unconstrained form (11) calls upon one to minimize $(Y - X\beta)'(Y - X\beta) + \lambda \beta'\beta$, and some vector calculus leads directly to
\[
\hat{\beta}^{\text{ridge}}_\lambda = (X'X + \lambda I)^{-1} X'Y
\]
So then, using the singular value decomposition of $X$ (with rank $= r$),
\[
\begin{aligned}
\hat{Y}^{\text{ridge}}_\lambda = X\hat{\beta}^{\text{ridge}}_\lambda &= UDV'\big(VDU'UDV' + \lambda I\big)^{-1} VDU'Y \\
&= UD\big(V'(VDU'UDV' + \lambda I)V\big)^{-1} DU'Y \\
&= UD\big(D^2 + \lambda I\big)^{-1} DU'Y \\
&= \sum_{j=1}^{r} \left(\frac{d_j^2}{d_j^2 + \lambda}\right) \langle Y, u_j \rangle u_j
\end{aligned} \tag{13}
\]
Comparing to (7) and recognizing that
\[
0 < \frac{d_{j+1}^2}{d_{j+1}^2 + \lambda} \le \frac{d_j^2}{d_j^2 + \lambda} < 1
\]
we see that the coefficients of the orthonormal basis vectors $u_j$ employed to get $\hat{Y}^{\text{ridge}}_\lambda$ are shrunken versions of the coefficients applied to get $\hat{Y}^{\text{OLS}}$. The most severe shrinking is enforced in the directions of the smallest principal components of $X$ (the $u_j$ least important in making up low rank approximations to $X$).

The function
\[
\operatorname{df}(\lambda) = \operatorname{tr}\big(X(X'X + \lambda I)^{-1}X'\big) = \operatorname{tr}\big(UD(D^2 + \lambda I)^{-1}DU'\big) = \operatorname{tr}\left(\sum_{j=1}^{r} \frac{d_j^2}{d_j^2 + \lambda} u_j u_j'\right) = \sum_{j=1}^{r} \frac{d_j^2}{d_j^2 + \lambda}\, u_j'u_j = \sum_{j=1}^{r} \frac{d_j^2}{d_j^2 + \lambda}
\]
is called the "effective degrees of freedom" associated with the ridge regression. In regard to this choice of nomenclature, note that if $\lambda = 0$, ridge regression is ordinary least squares and this is $r$, the usual degrees of freedom associated with projection onto $C(X)$, i.e. the trace of the projection matrix onto this column space. As $\lambda \to \infty$, the effective degrees of freedom goes to 0 and (the centered) $\hat{Y}^{\text{ridge}}_\lambda$ goes to $\mathbf{0}$ (corresponding to a constant predictor).

Notice also (for future reference) that since $\hat{Y}^{\text{ridge}}_\lambda = X\hat{\beta}^{\text{ridge}}_\lambda = X(X'X + \lambda I)^{-1}X'Y \equiv M_\lambda Y$ for
\[
M_\lambda = X(X'X + \lambda I)^{-1}X',
\]
if one assumes that $\operatorname{Cov} Y = \sigma^2 I$ (conditioned on the $x_i$ in the training data, the outputs are uncorrelated and have constant variance $\sigma^2$), then
\[
\text{effective degrees of freedom} = \operatorname{tr}(M_\lambda) = \frac{1}{\sigma^2} \sum_{i=1}^{N} \operatorname{Cov}(\hat{y}_i, y_i) \tag{14}
\]
This follows since $\hat{Y} = M_\lambda Y$ and $\operatorname{Cov} Y = \sigma^2 I$ imply that
\[
\operatorname{Cov}\begin{pmatrix} \hat{Y} \\ Y \end{pmatrix} = \operatorname{Cov}\left(\begin{pmatrix} M_\lambda \\ I \end{pmatrix} Y\right) = \sigma^2 \begin{pmatrix} M_\lambda \\ I \end{pmatrix} \begin{pmatrix} M_\lambda' & I \end{pmatrix} = \sigma^2 \begin{pmatrix} M_\lambda M_\lambda' & M_\lambda \\ M_\lambda' & I \end{pmatrix}
\]
and the terms $\operatorname{Cov}(\hat{y}_i, y_i)$ are the diagonal elements of the upper right block of this covariance matrix. This suggests that $\operatorname{tr}(M)$ is a plausible general definition of effective degrees of freedom for any linear fitting method $\hat{Y} = MY$, and that more generally, the last form in (14) might be used in situations where $\hat{Y}$ is other than a linear form in $Y$. Further (reasonably enough), the last form is a measure of how strongly the outputs in the training set can be expected to be related to their predictions.
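The following minimal sketch (simulated, standardized data; not from the original notes) computes ridge coefficients, checks the SVD representation of fitted values in display (13), and evaluates the effective degrees of freedom of display (14).

```python
import numpy as np

rng = np.random.default_rng(5)
N, p = 80, 6
X = rng.normal(size=(N, p))
X = (X - X.mean(0)) / X.std(0)                    # standardized columns
Y = X @ rng.normal(size=p) + rng.normal(size=N)
Y = Y - Y.mean()                                  # centered output

lam = 10.0
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrink = d**2 / (d**2 + lam)                      # shrinkage factors d_j^2 / (d_j^2 + lambda)
Y_hat_svd = U @ (shrink * (U.T @ Y))              # display (13)
print(np.allclose(X @ b_ridge, Y_hat_svd))

df = shrink.sum()                                 # effective degrees of freedom, display (14)
print(round(df, 3), "effective degrees of freedom out of", p)
```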
This suggests that tr(M ) is a plausible general de…nition for e¤ective degrees of freedom for any linear …tting method Yb = M Y , and that more generally, the last form in (14) might be used in situations 24 where Yb is other than a linear form in Y . Further (reasonably enough) the last form is a measure of how strongly the outputs in the training set can be expected to be related to their predictions. 3.1.2 The Lasso, Etc. The "lasso" (least absolute selection and shrinkage operator) and some other ¯ ¯ ¯ ¯ ¯ relatives of ridge regression are the result of P generalizing the criteria Poptimization 2 p p q 0 2 (11) and (12) by replacing = k k = j=1 j with j=1 j j j for a q > 0. That produces 8 9 p < = X q q b = arg min (Y X )0 (Y X ) + j jj (15) ; 2<p : j=1 generalizing (11) and bq = t (Y arg min P p q j=1 j j j with 0 X ) (Y X ) (16) t generalizing (12). The so called "lasso" is the q = 1 case of (15) or (16). That is, for t > 0 b lasso = t argPmin with (Y p j=1 j j j 0 X ) (Y X ) (17) t Because of the shape of the constraint region 8 p < X 2 <p j j jj : t j=1 9 = ; lasso (in particular its sharp corners at coordinate axes) some coordinates of b t OLS are often 0, and the lasso automatically provides simultaneous shrinking of b toward 0 and rational subset selection. (The same is true of cases of (16) with q < 1.) In Table 3.4, HTF provide explicit formulas for …tted coe¢ cients for the case of X with orthonormal columns. Method of Fitting Fitted Coe¢ cient for xj OLS bOLS j h bOLS I rank bOLS j j 1 bOLS j 1+ bOLS sign bOLS Best Subset (of Size M ) Ridge Regression Lasso j 25 j M 2 i + Best subset regression provides a kind of "hard thresholding" of the least squares coe¢ cients (setting all but the M largest to 0), ridge regression provides shrinking of all coe¢ cients toward 0, and the lasso provides a kind of "soft thresholding" of the coe¢ cients. There are a number of modi…cations of the ridge/lasso idea. One is the "elastic net" idea. This is a compromise between the ridge and lasso methods. For an 2 (0; 1) and some t > 0, this is de…ned by b elastic net ;t = with arg min )j j=1 ((1 0 (Y Pp j j+ 2 j ) X ) (Y X ) t (The constraint is a compromise between the ridge and lasso constraints.) The constraint regions have "corners" like the lasso regions but are otherwise more rounded than the lasso regions. The equivalent unconstrained optimization speci…cation of elastic net …tted coe¢ cient vectors is for 1 > 0 and 2 > 0 b elastic 1; 2 net 0 = arg min (Y X ) (Y X )+ 2<p 1 p X j=1 j jj + 2 p X 2 j j=1 Several sources (including a 2005 JRSSB paper of Zhou and Hastie) suggest that a modi…cation of the elastic net idea, namely (1 + 2) b elastic net 1; 2 performs better than the original version. (Note however, that when X has orthonormal columns, it’s possible to argue that belastic net 1 ; 2 ;j = 1 1+ 2 sign bjOLS bOLS 1 j 2 + so that modi…cation of the elastic net coe¢ cient vector in this way simply reduces it to a corresponding lasso coe¢ cient vector.) Breiman proposed a di¤erent shrinkage methodology he called the nonnegative garotte that attempts to …nd "optimal" re-weightings of the elements of b OLS . That is, for > 0 Breiman considered the vector optimization problem de…ned by c = arg min c2<p with cj 0; j=1;:::;p Y Xdiag (c) b OLS 0 Y and the corresponding …tted coe¢ cient vector 0 b nng = diag (c ) b OLS = B @ 26 c bOLS 1 1 Xdiag (c) b 1 C .. A . 
3.1.3 Least Angle Regression (LAR)

Another class of shrinkage methods is defined algorithmically, rather than directly algebraically or in terms of solutions to optimization problems. This includes LAR (least angle regression). A description of the whole set of LAR regression parameters $\hat{\beta}^{\text{LAR}}$ (for the case of $X$ with each column centered and with norm 1, and centered $Y$) follows. (This is some kind of amalgam of the descriptions of Izenman, CFZ, and the presentation in the 2003 paper Least Angle Regression by Efron, Hastie, Johnstone, and Tibshirani.)

Note that for $\hat{Y}$ a vector of predictions, the vector
\[
\hat{c} \equiv X'\big(Y - \hat{Y}\big)
\]
has elements that are proportional to the correlations between the columns of $X$ (the $x_j$) and the residual vector $R = Y - \hat{Y}$. We'll let
\[
\hat{C} = \max_j |\hat{c}_j| \quad \text{and} \quad s_j = \operatorname{sign}(\hat{c}_j)
\]
Notice also that if $X_l$ is some matrix made up of $l$ linearly independent columns of $X$, then if $b$ is an $l$-vector and $W = X_l b$, $b$ can be recovered from $W$ as $(X_l'X_l)^{-1}X_l'W$. (This latter means that if we define a path for $\hat{Y}$ vectors in $C(X)$ and know for each $\hat{Y}$ which linearly independent set of columns of $X$ is used in its creation, we can recover the corresponding path $\hat{\beta}$ takes through $\mathbb{R}^p$.)

1. Begin with $\hat{Y}_0 = \mathbf{0}$, $\hat{\beta}_0 = \mathbf{0}$, and $R_0 = Y - \hat{Y} = Y$, and find
\[
j_1 = \arg\max_j |\langle x_j, Y \rangle|
\]
(the index of the predictor $x_j$ most strongly correlated with $y$) and add $j_1$ to an (initially empty) "active set" of indices, $\mathcal{A}$.

2. Move $\hat{Y}$ from $\hat{Y}_0$ in the direction of the projection of $Y$ onto the space spanned by $x_{j_1}$ (namely $\langle x_{j_1}, Y \rangle x_{j_1}$) until there is another index $j_2 \ne j_1$ with
\[
|\hat{c}_{j_2}| = \big|\big\langle x_{j_2}, Y - \hat{Y} \big\rangle\big| = \big|\big\langle x_{j_1}, Y - \hat{Y} \big\rangle\big| = |\hat{c}_{j_1}|
\]
At that point, call the current vector of predictions $\hat{Y}_1$ and the corresponding current parameter vector $\hat{\beta}_1$, and add index $j_2$ to the active set $\mathcal{A}$. As it turns out, for
\[
\gamma_1 = \min_{j \ne j_1}{}^{+} \left\{ \frac{\hat{C}_0 - \hat{c}_{0j}}{1 - \langle x_j, x_{j_1} \rangle},\; \frac{\hat{C}_0 + \hat{c}_{0j}}{1 + \langle x_j, x_{j_1} \rangle} \right\}
\]
(where the "$+$" indicates that only positive values are included in the minimization),
\[
\hat{Y}_1 = \hat{Y}_0 + s_{j_1} \gamma_1 x_{j_1} = s_{j_1} \gamma_1 x_{j_1}
\]
and $\hat{\beta}_1$ is a vector of all 0's except for $s_{j_1}\gamma_1$ in the $j_1$ position. Let $R_1 = Y - \hat{Y}_1$.

3. At stage $l$, with $\mathcal{A}$ of size $l$, $\hat{\beta}_{l-1}$ (with only $l-1$ non-zero entries), $\hat{Y}_{l-1}$, $R_{l-1} = Y - \hat{Y}_{l-1}$, and $\hat{c}_{l-1} = X'R_{l-1}$ in hand, move from $\hat{Y}_{l-1}$ toward the projection of $Y$ onto the subspace of $\mathbb{R}^N$ spanned by $\{x_{j_1}, \ldots, x_{j_l}\}$. This is (as it turns out) in the direction of a unit vector $u_l$ "making equal angles less than 90 degrees with all $x_j$ with $j \in \mathcal{A}$," until there is an index $j_{l+1} \notin \mathcal{A}$ whose absolute inner product with the current residual matches the common value of $|\langle x_j, Y - \hat{Y}\rangle|$ for $j \in \mathcal{A}$. At that point, with $\hat{Y}_l$ the current vector of predictions, let $\hat{\beta}_l$ (with only $l$ non-zero entries) be the corresponding coefficient vector, take $R_l = Y - \hat{Y}_l$ and $\hat{c}_l = X'R_l$. It can be argued that with
\[
\gamma_l = \min_{j \notin \mathcal{A}}{}^{+} \left\{ \frac{\hat{C}_{l-1} - \hat{c}_{l-1, j}}{1 - \langle x_j, u_l \rangle},\; \frac{\hat{C}_{l-1} + \hat{c}_{l-1, j}}{1 + \langle x_j, u_l \rangle} \right\},
\]
$\hat{Y}_l = \hat{Y}_{l-1} + \gamma_l u_l$. Add the index $j_{l+1}$ to the set of active indices, $\mathcal{A}$, and repeat.

This continues until there are $r = \operatorname{rank}(X)$ indices in $\mathcal{A}$, and at that point $\hat{Y}$ moves from $\hat{Y}_{r-1}$ to $\hat{Y}^{\text{OLS}}$ and $\hat{\beta}$ moves from $\hat{\beta}_{r-1}$ to $\hat{\beta}^{\text{OLS}}$ (the version of an OLS coefficient vector with non-zero elements only in positions with indices in $\mathcal{A}$). This defines a piecewise linear path for $\hat{Y}$ (and therefore $\hat{\beta}$) that could, for example, be parameterized by $\|\hat{Y}\|$ or $\|Y - \hat{Y}\|$.

There are several issues raised by the description above. For one, the standard exposition of this method seems to be that the direction vector $u_l$ is prescribed by letting
\[
W_l = (s_{j_1} x_{j_1}, \ldots, s_{j_l} x_{j_l})
\]
and taking
\[
u_l = W_l \big(W_l'W_l\big)^{-1} \mathbf{1}
\]
It's clear that
\[
W_l'u_l = W_l'W_l\big(W_l'W_l\big)^{-1}\mathbf{1} = \mathbf{1},
\]
so that each of $s_{j_1}x_{j_1}, \ldots, s_{j_l}x_{j_l}$ has the same inner product with $u_l$.
This de…nes a piecewise linear path for Y^ (and therefore b ) that could, for example, be parameterized by Y^ or Y Y^ . There are several issues raised by the description above. For one, the standard exposition of this method seems to be that the direction vector ul is prescribed by letting W l = (sj1 xj1 ; : : : ; sjl xjl ) and taking ul = 1 1 W l W 0l W l It’s clear that W 0l ul = W l W 0l W l has the same inner product with ul . 1 1 W l W 0l W l 1 1 1 1 1, so that each of sj1 xj1 ; : : : ; sjl xj1 What is not immediately clear (but is 28 argued in Efron, Hastie, Johnstone, and Tibshirani) is why one knows that this prescription agrees with a prescription of a unit vector giving the direction from Y^ l 1 to the projection of Y onto the sub-space of <N spanned by fxj1 ; : : : ; xjl g, namely (for P l the projection matrix onto that subspace) 1 P lY Y^ l P lY Y^ l 1 1 Further, the arguments that establish that j^ cl 1;j1 j = j^ cl 1;j2 j = = j^ cl 1;jl j are not so obvious, nor are those that show that l has the form in 3. above. And …nally, HTF actually state their LAR algorithm directly in terms of a path for b , saying at stage l one moves from b l 1 in the direction of a vector with all 0’s except at those indices in A where there are the joint least squares coe¢ cients based on the predictor columns fxj1 ; : : : ; xjl g. The correspondence between the two points of view is probably correct, but is again not absolutely obvious. OLS At any rate, the LAR algorithm traces out a path in <p from 0 to b . One might think of the point one has reached along that path (perhaps parameterized by Y^ ) as being a complexity parameter governing how ‡exible a …t this algorithm has allowed, and be in the business of choosing it (by cross-validation or some other method) in exactly the same way one might, for example, choose a ridge parameter. What is not at all obvious but true, is that a very slight modi…cation of this LAR algorithm produces the whole set of lasso coe¢ cients (17) as its path. One simply needs to enforce the requirement that if a non-zero coe¢ cient hits 0, its index is removed from the active set and a new direction of movement is set based on one less input At any point along the modi…ed LAR Pvariable. p path, one can compute t = j=1 j j j, and think of the modi…ed-LAR path as parameterized by t. (While it’s not completely obvious, this turns out to be monotone non-decreasing in "progress along the path," or Y^ ). A useful graphical representation of the lasso path is one in which all coe¢ lasso cients ^tj are plotted against t on the same set of axes. Something similar is often done for the LAR coe¢ cients (where the plotting is against some measure of progress along the path de…ned by the algorithm). 3.2 Two Methods With Derived Input Variables Another possible approach to the problem of …nding an appropriate level of complexity in a …tted linear prediction rule is to consider regression on some number M < p of predictors derived from the original inputs xj . Two such methods discussed by HTF are the methods of Principal Components Regression and Partial Least Squares. Here we continue to assume that the columns of X have been standardized and Y has been centered. 
29 3.2.1 Principal Components Regression The idea here is to replace the p columns of predictors in X with the …rst few (M ) principal components of X (from the singular value decomposition of X) z j = Xv j = dj uj Correspondingly, the vector of …tted predictions for the training data is PCR Yb = = M X hY ; z j i zj hz j ; zj i j=1 M X j=1 hY ; uj i uj (18) Comparing this to (7) and (13) we see that ridge regression shrinks the coe¢ cients of the principal components uj according to their importance in making up X, while principal components regression "zeros out" those least important in making up X. PCR Notice too, that Yb can be written in terms of the original inputs as PCR Yb = so that M X j=1 hY ; uj i 1 Xv j dj 0 1 M X 1 hY ; uj i v j A =X@ dj j=1 0 1 M X 1 A =X@ 2 hY ; Xv j i v j d j j=1 b PCR = M X 1 hY ; Xv j i v j d2 j=1 j (19) OLS where (obviously) the sum above has only M terms. b is the M = rank (X) version of (19). As the v j are orthonormal, it is therefore clear that b PCR b OLS OLS So principal components regression shrinks both Yb toward 0 in <N and OLS b toward 0 in <p . 30 3.2.2 Partial Least Squares Regression The shrinking methods mentioned thus far have taken no account of Y in determining directions or amounts of shrinkage. Partial least squares speci…cally employs Y . In what follows, we continue to suppose that the columns of X have been standardized and that Y has been centered. The logic of partial least squares is this. Suppose that z1 = p X j=1 hY ; xj i xj = XX 0 Y It is possible to argue that for w1 = X 0 Y = X 0 Y ; Xw1 = z 1 = X 0 Y linear combination of the columns of X maximizing is a jhY ; Xwij (which is essentially the absolute sample covariance between the variables y and x0 w) subject to the constraint that kwk = 1.3 This follows because 2 hY ; Xwi = wX 0 Y Y 0 Xw and a maximizer of this quadratic form subject to the constraint is the eigenvector of X 0 Y Y 0 X corresponding to its single non-zero eigenvalue. It’s then easy to verify that w1 is such an eigenvector corresponding to the non-zero eigenvalue Y 0 XX 0 Y . Then de…ne X 1 by orthogonalizing the columns of X with respect to z 1 . That is, de…ne the jth column of X 1 by hxj ; z 1 i z1 hz 1 ; z 1 i x1j = xj and take z2 = p X Y ; x1j x1j j=1 = X 1 X 10 Y For w2 = X 10 Y = X 10 Y ; X 1 w2 = z 2 = X 10 Y the columns of X 1 maximizing is the linear combination of Y ; X 1w subject to the constraint that kwk = 1. 3 Note that upon replacing jhY ; Xwij with jhXw; Xwij one has the kind of optimization problem solved by the …rst principal component of X. 31 Then for l > 1, de…ne X l by orthogonalizing the columns of X l respect to z l . That is, de…ne the jth column of X l by xlj = xlj 1 with xlj 1 ; z l zl hz l ; z l i 1 and let z l+1 = p X Y ; xlj xlj j=1 = X l X l0 Y Partial least squares regression uses the …rst M of these variables z j as input variables. The PLS predictors z j are orthogonal by construction. Using the …rst M of these as regressors, one has the vector of …tted output values PLS Yb = M X hY ; z j i zj hz j ; zj i j=1 Since the PLS predictors are (albeit recursively-computed data-dependent) linPLS (namely ear combinations of columns of X, it is possible to …nd a p-vector b X 0X 1 X 0 Yb PLS M ) such that PLS PLS Yb = X bM and thus produce the corresponding linear prediction rule PLS fb(x) = x0 b M (20) It seems like in (20), the number of components, M , should function as a complexity parameter. But then again there is the following. 
When the xj PLS OLS are orthonormal, it’s fairly easy to see that z 1 is a multiple of Yb , b = 1 PLS OLS b PLS = = bp = b , and all steps of partial least squares after the …rst 2 are simply providing a basis for the orthogonal complement of the 1-dimensional OLS subspace of C (X) generated by Yb (without improving …tting at all). That is, here changing M doesn’t change ‡exibility of the …t at all. (Presumably, when the xj are nearly orthogonal, something similar happens.) PLS, PCR, and OLS Partial least squares is a kind of compromise between principal components regression and ordinary least squares. To see this, note that maximizing jhY ; Xwij 32 subject to the constraint that kwk = 1 is equivalent maximizing the sample covariance between Y and Xw i.e. sample standard deviation of y sample standard deviation of x0 w sample correlation between y and x0 w or equivalently sample variance of x0 w sample correlation between y and x0 w 2 (21) subject to the constraint. Now if only the …rst term (the sample variance of x0 w) were involved in (21), a …rst principal component direction would be an optimizing w1 , and z 1 = X 0 Y Xw1 a multiple of the …rst principal component OLS OLS of X. On the other hand, if only the second term were involved, ^ = ^ OLS OLS a multiple of the X 0Y = ^ would be an optimizing w1 , and z 1 = Y^ vector of ordinary least squares …tted values. The use of the product of two terms can be expected to produce a compromise between these two. Note further that this logic applied at later steps in the PLS algorithm should then produce for z l some sort of compromise between an lth principal component of X and a vector of suitably constrained least squares …tted values. 4 Linear Prediction Using Basis Functions A way of moving beyond prediction rules that are functions of a linear form in x, i.e. depend upon x through x0 ^, is to consider some set of (basis) functions fhm g and predictors of the form or depending upon the form f^ (x) = p X ^m hm (x) = h (x)0 ^ (22) m=1 0 (for h (x) = (h1 (x) ; : : : ; hp (x))). We next consider some ‡exible methods employing this idea. Notice that …tting of form (22) can be done using any of the methods just discussed based on the N p matrix of inputs 0 0 1 h (x1 ) B h (x2 )0 C B C X = (hj (xi )) = B C .. A @ . 0 h (xN ) (i indexing rows and j indexing columns). 4.1 p = 1 Wavelet Bases Consider …rst the case of a one-dimensional input variable x, and in fact here suppose that x takes values in [0; 1]. One might consider a set of basis function 33 for use in the form (22) that is big enough and rich enough to approximate essentially any function on [0; 1]. In particular, various orthonormal bases for the square integrable functions on this interval (the space of functions L2 [0; 1]) come to mind. One might, for example, consider using some number of functions from the Fourier basis for L2 [0; 1] o1 np o1 np 2 sin (j2 x) [ 2 cos (j2 x) [ f1g j=1 For example, using M the …tting the forms f (x) = j=1 N=2 sin-cos pairs and the constant, one could consider 0 + M X 1m sin (m2 x) + m=1 M X 2m cos (m2 x) (23) m=1 If one has training xi on an appropriate regular grid, the use of form (23) leads to orthogonality in the N (2M + 1) matrix of values of the basis functions X and simple/fast calculations. Unless, however, one believes that E[yjx = u] is periodic in u, form (23) has its serious limitations. 
In particular, unless M is very very large, a trigonometric series like (23) will typically provide a poor approximation for a function that varies at di¤erent scales on di¤erent parts of [0; 1], and in any case, the coe¢ cients necessary to provide such localized variation at di¤erent scales have no obvious simple interpretations/connections to the irregular pattern of variation being described. So-called "wavelet bases" are much more useful in providing parsimonious and interpretable approximations to such functions. The simplest wavelet basis for L2 [0; 1] is the Haar basis that we proceed to describe. De…ne the so-called Haar "father" wavelet ' (x) = I [0 < x 1] and the so-called Haar "mother" wavelet (x) = ' (2x) =I 0<x ' (2x 1) 1 2 I 1 <x 2 1 Linear combinations of these functions provide all elements of L2 [0; 1] that are constant on 0; 21 and on 12 ; 1 . Write 0 = f'; g Next, de…ne 1;0 (x) = p 2 I 0<x 1;1 (x) = p 2 I 1 <x 2 1 4 3 4 34 1 <x 4 3 I <x 4 I 1 2 1 and and let 1 =f 1;0 ; 1;1 g Using the set of functions 0 [ 1 one can build (as linear combinations) all elements of L2 [0; 1] that are constant on 0; 41 and on 14 ; 12 and on 21 ; 34 and on 34 ; 1 . The story then goes on as one should expect. One de…nes 2;0 1 8 3 8 5 8 7 8 (x) = 2 I 0 < x 1 <x 4 1 <x 2;2 (x) = 2 I 2 3 <x 2;3 (x) = 2 I 4 2;1 (x) = 2 I 1 <x 8 3 I <x 8 5 <x I 8 7 I <x 8 I 1 4 1 2 3 4 and and and 1 and lets 2 =f 2;0 ; 2;1 ; 2;2 ; 2;3 g In general, m;j (x) = p j 2m 2m x 2m for j = 0; 1; 2; : : : ; 2m 1 and m =f m;0 ; m;1 ; : : : ; m;2m 1 g The Haar basis of L2 [0; 1] is then [1 m=0 m Then, one might entertain use of the Haar basis functions through order M in constructing a form m f (x) = 0 + M 2X1 X mj m;j (x) (24) m=0 j=0 (with the understanding that 0;0 = ), a form that in general allows building of functions that are constant on consecutive intervals of length 1=2M +1 . This form can be …t by any of the various regression methods (especially involving thresholding/selection, as a typically very large number, 2M +1 , of basis functions is employed in (24)). (See HTF Section 5.9.2 for some discussion of using the lasso with wavelets.) Large absolute values of coe¢ cients mj encode scales at which important variation in the value of the index m, and location in [0; 1] where that variation occurs in the value j=2m . Where (perhaps after model selection/ thresholding) only a relatively few …tted coe¢ cients are important, the 35 corresponding scales and locations provide an informative and compact summary of the …t. A nice visual summary of the results of the …t can be made by plotting for each m (plots arranged vertically, from M through 0, aligned and to the same scale) spikes of length j mj j pointed in the direction of sign( mj ) along an "x" axis at positions (say) (j=2m ) + 1=2m+1 . In special situations where N = 2K and xi = i 1 2K for i = 1; 2; : : : ; 2K and one uses the Haar basis functions through order K is computationally clean, since the vectors 0 1 m;j (x1 ) B C .. @ A . m;j 1, the …tting of (24) (xN ) (together with the column vector p of 1’s) are orthogonal. (So, upon proper normalization, i.e. division by N = 2K=2 , they form an orthonormal basis for <N .) The Haar wavelet basis functions are easy to describe and understand. But they are discontinuous, and from some points of view that is unappealing. Other sets of wavelet basis functions have been developed that are smooth. The construction begins with a smooth "mother wavelet" in place of the step function used above. 
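Returning to the Haar case, a small sketch may help fix ideas. The code below builds the $N\times N$ matrix of values of the constant function and the $\psi_{m,j}$ for $m=0,\ldots,K-1$ at an equally spaced grid of $N=2^K$ points and exploits the resulting orthogonality to fit form (24) by inner products, with a crude hard threshold standing in for a more careful selection/thresholding method. Names, the example target function, and the threshold are illustrative only, and the grid $x_i=i/2^K$ is a slight variant of the one above, chosen so that the columns are exactly orthogonal.

```python
import numpy as np

def haar_design(K):
    """Values of the constant function and the Haar wavelets psi_{m,j},
    m = 0,...,K-1, j = 0,...,2^m - 1, at the grid x_i = i/2^K, i = 1,...,2^K.
    Returns (x, X) with X an N x N orthogonal matrix (X'X = N I)."""
    N = 2 ** K
    x = np.arange(1, N + 1) / N
    cols = [np.ones(N)]                           # the "father" / constant term
    for m in range(K):
        for j in range(2 ** m):
            lo, mid, hi = j / 2**m, (j + 0.5) / 2**m, (j + 1) / 2**m
            psi = 2 ** (m / 2) * (((x > lo) & (x <= mid)).astype(float)
                                  - ((x > mid) & (x <= hi)).astype(float))
            cols.append(psi)
    return x, np.column_stack(cols)

# fitting form (24) is computationally clean here: coefficients are inner
# products divided by N, since the columns are orthogonal with squared norm N
rng = np.random.default_rng(0)
K = 6
x, X = haar_design(K)
y = np.where(x < 0.3, np.sin(20 * x), 0.2) + 0.1 * rng.standard_normal(len(x))
coef = X.T @ y / len(x)
coef_thresh = np.where(np.abs(coef) > 0.05, coef, 0.0)   # crude hard threshold
fit = X @ coef_thresh
```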
HTF make some discussion of the smooth "symmlet" wavelet basis at the end of their Chapter 5. 4.2 p = 1 Piecewise Polynomials and Regression Splines Continue consideration of the case of a one-dimensional input variable x, and now K "knots" < K 1 < 2 < and forms for f (x) that are 1. polynomials of order M (or less) on all intervals ( at least) j 1 ; j ), and (potentially, 2. have derivatives of some speci…ed order at the knots, and (potentially, at least) 3. are linear outside ( 1 ; If we let I1 (x) = I [x < and de…ne IK+1 (x) = I [ K K ). 1 ], for j = 2; : : : ; K let Ij (x) = I [ j 1 x < j ], x], one can have 1. in the list above using basis 36 functions I1 (x) ; I2 (x) ; : : : ; IK+1 (x) xI1 (x) ; xI2 (x) ; : : : ; xIK+1 (x) x2 I1 (x) ; x2 I2 (x) ; : : : ; x2 IK+1 (x) .. . xM I1 (x) ; xM I2 (x) ; : : : ; xM IK+1 (x) Further, one can enforce continuity and di¤erentiability (at the knots) conditions P(M +1)(K+1) on a form f (x) = m hm (x) by enforcing some linear relations m=1 between appropriate ones of the m . While this is conceptually simple, it is messy. It is much cleaner to simply begin with a set of basis functions that are tailored to have the desired continuity/di¤erentiability properties. A set of M + 1 + K basis functions for piecewise polynomials of degree M with derivatives of order M 1 at all knots is easily seen to be M 1 )+ 1; x; x2 ; : : : ; xM ; (x M 2 )+ ; (x M K )+ ; : : : ; (x M (since the value of and …rst j derivatives of (x The j )+ at j are all 0). choice of M = 3 is fairly standard. Since extrapolation with polynomials typically gets worse with order, it is common to impose a restriction that outside ( 1 ; K ) a form f (x) be linear. For the case of M = 3 this can be accomplished by beginning with basis functions 3 3 3 1; x; (x 1 )+ ; (x 2 )+ ; : : : ; (x K )+ and imposing restrictions necessary to force 2nd and 3rd derivatives to the right of K to be 0: Notice that (considering x > K) 1 0 K K X X d2 @ 3A =6 (25) j (x j) 0 + 1x + j (x j )+ dx2 j=1 j=1 and 0 d3 @ dx3 0 + 1x + K X j 1 3A j )+ (x j=1 So, linearity for large x requires (from (26)) that =6 K X j (26) j=1 PK = 0. Further, PK substituting this into (25) means that linearity also requires that j=1 j j = 0. PK 1 Using the …rst of these to conclude that K = j and substituting into j=1 the second yields K X2 K j K 1 = j K j=1 and then K = K X2 j=1 K j j K 37 K 1 j=1 K 1 K X2 j=1 j j These then suggest the set of basis functions consisting of 1; x and for j = 1; 2; : : : ; K 2 (x K 3 j )+ K = (x j 3 K 1 )+ (x K 1 3 j )+ K K j K 1 (x K + K 3 K 1 )+ j 3 K )+ (x K 1 + K 1 K j 3 K )+ (x (x K 1 3 K )+ (These are essentially the basis functions that HTF call their Nj .) Their use produces so-called "natural" (linear outside ( 1 ; K )) cubic regression splines. There are other (harder to motivate, but in the end more pleasing and computationally more attractive) sets of basis functions for natural polynomial splines. See the B-spline material at the end of HTF Chapter 5. 4.3 4.3.1 Basis Functions and p-Dimensional Inputs Multi-Dimensional Regression Splines (Tensor Bases) If p = 2 and the vector of inputs, x, takes values in <2 , one might proceed as follows. 
If fh11 ; h12 ; : : : ; h1M1 g is a set of spline basis functions based on x1 and fh21 ; h22 ; : : : ; h2M2 g is a set of spline basis functions based on x2 one might consider the set of M1 M2 basis functions based on x de…ned by gjk (x) = h1j (x1 ) h2k (x2 ) and corresponding forms for regression splines X f (x) = jk gjk (x) j;k The biggest problem with this potential method is the explosion in the size of a tensor product basis as p increases. For example, using K knots for cubic p regression splines in each of p dimensions produces (4 + K) basis functions for the p-dimensional problem. Some kind of forward selection algorithm or shrinking of coe¢ cients will be needed to produce any kind of workable …ts with such large numbers of basis functions. 4.3.2 MARS (Mulitvariate Adapative Regression Splines) This is a high-dimensional regression-like methodology based on the use of "hockey-stick functions" and their products as data-dependent "basis functions." That is, with input space <p we consider data-dependent basis functions built on the N p pairs of functions hij1 (x) = (xj xij )+ and hij2 (x) = (xij xj )+ (27) (xij is the jth coordinate of the ith input training vector and both hij1 (x) and hij2 (x) depend on x only through the jth coordinate of x). MARS builds predictors sequentially, making use of these "re‡ected pairs" roughly as follows. 38 1. Identify a pair (27) so that 0 + 11 hij1 (x) + 12 hij2 (x) has the best SSE possible. Call the selected functions g11 = hij1 and g12 = hij2 and set f^1 (x) = ^0 + ^11 g11 (x) + ^12 g12 (x) 2. At stage l of the predictor-building process, with predictor f^l ^ 1 (x) = 0 + l 1 X ^m1 gm1 (x) + ^m2 gm2 (x) m=1 in hand, consider for addition to the model, pairs of basis functions that are either of the basic form (27) or of the form hij1 (x) gm1 (x) and hij2 (x) gm1 (x) or of the form hij1 (x) gm2 (x) and hij2 (x) gm2 (x) for some m < l, subject to the constraints that no xj appears in any candidate product more than once (maintaining the piece-wise linearity of sections of the predictor). Additionally, one may decide to put an upper limit on the order of the products considered for inclusion in the predictor. The best candidate pair in terms of reducing SSE gets called, say, gl1 and gl2 and one sets f^l (x) = ^0 + l X ^m1 gm1 (x) + ^m2 gm2 (x) m=1 One might pick the complexity parameter l by cross-validation, but the standard implementation of MARS apparently uses instead a kind of generalized cross validation error 2 PN f^l (xi ) i=1 yi GCV (l) = 2 1 MN(l) where M (l) is some kind of degrees of freedom …gure. One must take account of both the …tting of the coe¢ cients in this and the fact that knots (values xij ) have been chosen. The HTF recommendation is to use M (l) = 2l + (2 or 3) (the number of di¤erent knots chosen) (where presumably the knot count refers to di¤erent xij appearing in at least one gm1 (x) or gm2 (x)). 39 5 Smoothing Splines 5.1 p = 1 Smoothing Splines A way of avoiding the direct selection of knots for a regression spline is to instead, for a smoothing parameter > 0, consider the problem of …nding (for a min fxi g and max fxi g b) ! Z b N X 2 2 00 f^ = arg min (yi h (xi )) + (h (x)) dx functions h with 2 derivatives a i=1 Amazingly enough, this optimization problem has a solution that can be fairly simply described. f^ is a natural cubic spline with knots at the distinct values xi in the training set. 
That is, for a set of (now data-dependent, as the knots come from the training data) basis functions for such splines h1 ; h2 ; : : : ; hN (here we’re tacitly assuming that the N values of the input variable in the training set are all di¤erent) N X f^ (x) = b j hj (x) j=1 where the b j are yet to be identi…ed. So consider the function N X g (x) = j hj (x) (28) j=1 This has second derivative g 00 (x) = N X 00 j hj (x) j=1 and so 2 (g 00 (x)) = N X N X 00 j l hj (x) h00l (x) j=1 l=1 Then, for 0 = ( 1; 2; : : : ; N ) N it is the case that N Z and = Z ! b h00j (t) h00l (t) dt a b 2 (g 00 (x)) dx = a 40 0 In fact, with the notation H = (hj (xi )) N N (i indexing rows and j indexing columns) the criterion to be optimized in order to …nd f^ can be written for functions of the form (28) as (Y 0 H ) (Y 0 H )+ and some vector calculus shows that the optimizing 1 b = H 0H + H 0Y is (29) which can be thought of as some kind of vector of generalized ridge regression coe¢ cients. Corresponding to (29) is a vector of smoothed output values and the matrix 1 Yb = H H 0 H + S H 0Y 1 H H 0H + H0 is called a smoother matrix. Contrast this to a situation where some fairly small number, p, of …xed basis functions are employed in a regression context. That is, for basis functions b1 ; b2 ; : : : ; bp suppose B = (bj (xi )) N p Then OLS produces the vector of …tted values 1 Yb = B B 0 B B0Y and the projection matrix onto the column space of B, C (B), is P B = B B 0 B S and P B are both N N symmetric non-negative de…nite matrices. While P BP B = P B i.e. P B is idempotent, S S S meaning that S S S is non-negative de…nite. P B is of rank p = tr(P B ), while S is of rank N . In a manner similar to what is done in ridge regression we might de…ne an "e¤ective degrees of freedom" for S (or for smoothing) as df ( ) = tr (S ) (30) We proceed to develop motivation and a formula for this quantity and for Y^ . Notice that it is possible to write S 1 =I+ K 41 1 B0. for 1 K = H0 1 H so that S = (I + K) 1 (31) This is the so-called Reinsch form. Some vector calculus shows that Yb = S Y is a solution to the minimization problem 0 minimize (Y v) (Y v) + v 0 Kv (32) v2<N so that this matrix K can be thought of as de…ning a "penalty" in …tting a smoothed version of Y : Then, since S is symmetric non-negative de…nite, it has an eigen decomposition as N X S = U DU 0 = dj uj u0j (33) j=1 where columns of U (the eigenvectors uj ) comprise an orthonormal basis for <N and D = diag (d1 ; d2 ; : : : ; dN ) for eigenvalues of S d1 d2 dN > 0 It turns out to be guaranteed that d1 = d2 = 1. Consider how the eigenvalues and eigenvectors of S are related to those for K. An eigenvalue for K, say , solves det (K Now det (K I) = det I) = 0 1 [(I + K) (1 + So 1+ must be an eigenvalue of I + K and 1= (1 + 1 of S = (I + K) . So for some j we must have ) must be an eigenvalue 1 1+ dj = and observing that 1= (1 + ) I] ) is decreasing in , we may conclude that dj = 1 1+ N (34) j+1 for 1 2 N 2 42 N 1 = N =0 the eigenvalues of K (that themselves do not depend upon ). So, for example, in light of (30), (33), and (34), the smoothing e¤ective degrees of freedom are df ( ) = tr (S ) = N X dj = 2 + j=1 N X2 j=1 1 1+ j which is clearly decreasing in (with minimum value 2 in light of the fact that S has two eigenvalues that are 1). Further, consider uj , the eigenvector of S corresponding to eigenvalue dj . 
S uj = dj uj so that uj = S 1 dj uj = (I + K) dj uj so that uj = dj uj + dj Kuj and thus Kuj = 1 dj dj uj = N j+1 uj That is, uj is an eigenvector of K corresponding to the (N j + 1)st largest eigenvalue. That is, for all the eigenvectors of S are eigenvectors of K and thus do not depend upon . Then, for any 0 1 N X Yb = S Y = @ dj uj u0j A Y j=1 = N X j=1 dj huj ; Y i uj = hu1 ; Y i u1 + hu2 ; Y i u2 + N X j=3 huj ; Y i uj 1 + N j+1 (35) and we see that Yb is a shrunken version of Y (that progresses from Y to the projection of Y onto the span of fu1 ; u2 g as runs from 0 to 1)4 . The larger is , the more severe the shrinking overall. Further, the larger is j, the smaller is dj and the more severe is the shrinking in the uj direction. In the context of cubic smoothing splines, large j correspond to "wiggly" (as a functions of coordinate i or value of the input xi ) uj , and the prescription (35) calls for suppression of "wiggly" components of Y . Notice that large j correspond to early/large eigenvalues of the penalty matrix K in (32). Letting uj = uN j+1 so that U = (uN ; uN 1 ; : : : ; u1 ) = (u1 ; u2 ; : : : ; uN ) 4 It is possible to argue that the span of fu ; u g is the set of vectors of the form c1 + dx, 1 2 as is consistent with the integral penalty in original function optimization problem. 43 the eigen decomposition of K is K = U diag ( 1 ; 2; : : : ; N ) U 0 and the criterion (32) can be written as 0 minimize (Y v) + v 0 U diag ( 1 ; v) (Y v2<N 2; : : : ; N ) U 0 v or equivalently as 0 minimize @(Y v2<N 0 v) (Y v) + N X2 j=1 2 j 1 uj ; v A (36) (since N 1 = N = 0) and we see that eigenvalues of K function as penalty PN coe¢ cients applied to the N orthogonal components of v = j=1 uj ; v uj in the choice of optimizing v. From this point of view, the uj (or uj ) provide the natural alternative (to the columns of H) basis (for <N ) for representing or approximating Y , and the last equality in (35) provides an explicit form for the optimizing smoothed vector Yb . In this development, K has had a speci…c meaning derived from the H and matrices connected speci…cally with smoothing splines and the particular values of x in the training data set. But in the end, an interesting possibility brought up by the whole development is that of forgetting the origins (from K) of the j and uj and beginning with any interesting/intuitively appealing orthonormal basis fuj g and set of non-negative penalties f j g for use in (36). Working backwards through (35) and (34) one is then led to the corresponding smoothed vector Yb and smoothing matrix S . (More detail on this matter is in Section 5.3.) It is also worth remarking that since Yb = S Y the rows of S provide weights to be applied to the elements of Y in order to produce predictions/smoothed values corresponding to Y . These can for each i be thought of as de…ning a corresponding "equivalent kernel" (for an appropriate "kernelweighted average" of the training output values). (See Figure 5.8 of HTF2 in this regard.) 5.2 Multi-Dimensional Smoothing Splines If p = 2 and the vector of inputs, x, takes values in <2 , one might propose to seek ! N X 2 ^ f = arg min (yi h (xi )) + J [h] functions h with 2 derivatives for J [h] ZZ <2 @2h @x21 2 +2 i=1 @2h @x1 @x2 44 2 + @2h @x22 2 dx1 dx2 An optimizing f^ : <2 ! < can be identi…ed and is called a "thin plate spline." As ! 0, f^ becomes an interpolator, as ! 1 it de…nes the OLS plane through the data in 3-space. 
In general, it can be shown to be of the form f (x) = + 0 0 x+ N X i gi (x) (37) i=1 where gi (x) = (kx xi k) for (z) = z 2 ln z 2 . The gi (x) are "radial basis functions" (radially symmetric basis functions) and …tting is accomplished much as for the p = 1 case. The form (37) is plugged into the optimization criterion and a discrete penalized least squares problem emerges (after taking account of some linear constraints that are required to keep J [f ] < 1). HTF seem to indicate that in order to keep computations from exploding with N , it usually su¢ ces to replace the N functions gi (x) in (37) with K N functions gi (x) = (kx xi k) for K potential input vectors xi placed on a rectangular grid covering the convex hull of the N training data input vectors xi . For large p, one might simply declare that attention is going to be limited to predictors of some restricted form, and for h in that restricted class, seek to optimize N X 2 (yi h (xi )) + J [h] i=1 for J [h] some appropriate penalty on h intended to regularize/restrict its wiggling. For example, one might assume that a form g (x) = p X gj (xj ) j=1 will be used and set J [g] = p Z X gj00 (x) 2 dx j=1 and be led to additive splines. Or, one might assume that g (x) = p X gj (xj ) + j=1 X gjk (xj ; xk ) (38) j;k and invent an appropriate penalty function. It seems like a sum of 1-D smoothing spline penalties on the gj and 2-D thin plate spline penalties on the gjk is the most obvious starting point. Details of …tting are a bit murky (though I am sure that they can be found in book on generalized additive models). Presumably one cycles through the summands in (38) iteratively …tting functions to sets of residuals de…ned by the original yi minus the sums of all other current versions of the components until some convergence criterion is satis…ed. (38) is a kind of "main e¤ects plus 2-factor interactions" decomposition, but it is (at least in theory) possible to also consider higher order terms in this kind of expansion. 45 5.3 An Abstraction of the Smoothing Spline Material and Penalized Fitting in <N In abstraction of the smoothing spline development, suppose that fuj g is a set of M N orthonormal N -vectors, 0; j 0 for j = 1; 2; : : : ; M; and consider the optimization problem 0 1 M X 0 2 A minimize @(Y v) (Y v) + j huj ; vi v2 spanfuj g j=1 PM PM PM 2 2 For v = j=1 cj uj 2 spanfuj g, the penalty is j=1 j huj ; vi = j=1 j cj and in this penalty, j is a multiplier of the squared length of the component of v in the direction of uj . The optimization criterion is then (Y 0 v) (Y v) + M X j=1 2 j huj ; vi = M X j=1 (huj ; Y i 2 cj ) + M X 2 j cj j=1 and it is then easy to see (via simple calculus) that copt = j i.e. Y^ = v opt = huj ; Y i 1+ j M X huj ; Y i j=1 1+ uj j From this it’s clear how the penalty structure dictates optimally shrinking the components of the projection of Y onto spanfuj g. It is further worth noting that for a given set of penalty coe¢ cients, Y^ can be represented as SY for S= M X j=1 dj uj u0j = U diag 1 1+ ;:::; 1 1 1+ U0 M for U = (u1 ; u2 ; : : : ; uM ). Then it’s easy to see that smoother matrix S is a rank M matrix for which Y^ = SY . One context in which this material might …nd immediate application is where some set of basis functions fhj g are increasingly "wiggly" with increasing j and the vectors uj come from applying the Gram-Schmidt process to the vectors hj = (hj (x1 ) ; : : : ; hj (xN )) 0 In this context, it would be very natural to penalize the later uj more severely than the early ones. 
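A minimal sketch of this abstraction (with illustrative names and data): given orthonormal $N$-vectors $u_j$ and penalty coefficients $\rho_j$, the optimally smoothed vector and the corresponding rank-$M$ smoother matrix are computed directly from the shrinkage factors $1/(1+\rho_j)$. The final lines illustrate the Gram-Schmidt construction just mentioned (here via a QR decomposition), with increasingly "wiggly" polynomial columns penalized increasingly severely.

```python
import numpy as np

def penalized_smooth(Y, U, rho):
    """Optimally smoothed vector for the Section 5.3 criterion:  U is N x M
    with orthonormal columns u_j and rho holds the penalty coefficients
    rho_j.  Returns (Y_hat, S) with S the rank-M smoother matrix."""
    d = 1.0 / (1.0 + np.asarray(rho, dtype=float))   # shrinkage factors d_j
    Y_hat = U @ (d * (U.T @ Y))                      # sum_j d_j <u_j, Y> u_j
    S = (U * d) @ U.T                                # S = U diag(d) U'
    return Y_hat, S

# illustrative use: u_j from Gram-Schmidt (QR) applied to increasingly wiggly
# polynomial columns h_j; an overall smoothing parameter can be absorbed
# into the rho_j
rng = np.random.default_rng(0)
N, M = 50, 8
x = np.linspace(0, 1, N)
H = np.column_stack([x ** j for j in range(M)])      # h_j(x) = x^j
U, _ = np.linalg.qr(H)                               # orthonormalized columns
rho = 5.0 * np.arange(M) ** 2                        # later u_j penalized more
Y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(N)
Y_hat, S = penalized_smooth(Y, U, rho)
df = np.trace(S)                                     # = sum_j 1 / (1 + rho_j)
```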
6 Kernel Smoothing Methods

The central idea of this material is that when finding $\hat f(x_0)$ I might weight points in the training set according to how close they are to $x_0$, do some kind of fitting around $x_0$, and ultimately read off the value of the fit at $x_0$.

6.1 One-dimensional Kernel Smoothers

For the time being, suppose that $x$ takes values in $[0,1]$. Invent weighting schemes for points in the training set by defining a (usually, symmetric about 0) non-negative, real-valued function $D(t)$ that is non-increasing for $t \ge 0$ and non-decreasing for $t \le 0$. Often $D(t)$ is taken to have value 0 unless $|t| \le 1$. Then, a kernel function is
$$K_\lambda(x_0, x) = D\!\left(\frac{x - x_0}{\lambda}\right) \qquad (39)$$
where $\lambda$ is a "bandwidth" parameter that controls the rate at which weights drop off as one moves away from $x_0$ (and indeed, in the case that $D(t) = 0$ for $|t| > 1$, how far one moves away from $x_0$ before no weight is assigned). Common choices for $D$ are

1. the Epanechnikov quadratic kernel, $D(t) = \frac{3}{4}\left(1 - t^2\right) I[|t| \le 1]$,
2. the "tri-cube" kernel, $D(t) = \left(1 - |t|^3\right)^3 I[|t| \le 1]$, and
3. the standard normal density, $D(t) = \phi(t)$.

Using weights (39) to make a weighted average of training responses, one arrives at the Nadaraya-Watson kernel-weighted prediction at $x_0$
$$\hat f(x_0) = \frac{\sum_{i=1}^N K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^N K_\lambda(x_0, x_i)} \qquad (40)$$
This typically smooths training outputs $y_i$ in a more pleasing way than does a $k$-nearest neighbor average, but it has obvious problems at the ends of the interval $[0,1]$ and at places in the interior of the interval where training data are dense to one side of $x_0$ and sparse to the other, if the target $\mathrm{E}[y|x=z]$ has non-zero derivative at $z = x_0$. For example, at $x_0 = 1$ only $x_i \le 1$ get weight, and if $\mathrm{E}[y|x=z]$ is decreasing at $z = x_0 = 1$, $\hat f(1)$ will be positively biased. That is, with usual symmetric kernels, (40) will fail to adequately follow an obvious trend at 0 or 1 (or at any point between where there is a sharp change in the density of input values in the training set).

A way to fix this problem with the Nadaraya-Watson predictor is to replace the locally-fitted constant with a locally-fitted line. That is, at $x_0$ one might choose $\alpha(x_0)$ and $\beta(x_0)$ to solve the optimization problem
$$\underset{\alpha \text{ and } \beta}{\text{minimize}}\ \sum_{i=1}^N K_\lambda(x_0, x_i)\left(y_i - (\alpha + \beta x_i)\right)^2 \qquad (41)$$
and then employ the prediction
$$\hat f(x_0) = \alpha(x_0) + \beta(x_0)\, x_0 \qquad (42)$$
Now the weighted least squares problem (41) has an explicit solution. Let
$$\underset{N \times 2}{B} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{pmatrix}$$
and take
$$\underset{N \times N}{W(x_0)} = \mathrm{diag}\left(K_\lambda(x_0, x_1), \ldots, K_\lambda(x_0, x_N)\right)$$
Then (42) is
$$\hat f(x_0) = (1, x_0)\left(B'W(x_0)B\right)^{-1} B'W(x_0)\, Y = l'(x_0)\, Y \qquad (43)$$
for the $1 \times N$ vector $l'(x_0) = (1, x_0)\left(B'W(x_0)B\right)^{-1} B'W(x_0)$. It is thus obvious that locally weighted linear regression is (an albeit $x_0$-dependent) linear operation on the vector of outputs. The weights in $l'(x_0)$ combine the original kernel values and the least squares fitting operation to produce a kind of "equivalent kernel" (for a Nadaraya-Watson type weighted average).

Recall that for smoothing splines, smoothed values are $\hat Y = S_\lambda Y$ where the parameter $\lambda$ is the penalty weight, and $df(\lambda) = \mathrm{tr}(S_\lambda)$. We may do something parallel in the present context. We may take
$$\underset{N \times N}{L_\lambda} = \begin{pmatrix} l'(x_1) \\ l'(x_2) \\ \vdots \\ l'(x_N) \end{pmatrix}$$
where now the parameter $\lambda$ is the bandwidth, write $\hat Y = L_\lambda Y$, and define $df(\lambda) = \mathrm{tr}(L_\lambda)$.

HTF suggest that matching degrees of freedom for a smoothing spline and a kernel smoother produces very similar equivalent kernels, smoothers, and predictions. There is a famous theorem of Silverman that adds technical credence to this notion.
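Before turning to that result, here is a minimal sketch of the Nadaraya-Watson average (40) and the locally weighted linear fit (43), using the tri-cube $D$ and simulated data; all names, the data, and the bandwidth are illustrative only.

```python
import numpy as np

def tricube(t):
    """The tri-cube weight function D(t)."""
    return np.where(np.abs(t) <= 1, (1 - np.abs(t) ** 3) ** 3, 0.0)

def nadaraya_watson(x0, x, y, lam):
    """Kernel-weighted average (40) at the point x0."""
    w = tricube((x - x0) / lam)
    return np.sum(w * y) / np.sum(w)

def local_linear(x0, x, y, lam):
    """Locally weighted linear fit (41)-(43) evaluated at x0; also returns
    the 'equivalent kernel' weight vector l(x0)."""
    w = tricube((x - x0) / lam)
    B = np.column_stack([np.ones_like(x), x])           # the N x 2 matrix B
    BtWB = B.T @ (w[:, None] * B)
    l_x0 = np.array([1.0, x0]) @ np.linalg.solve(BtWB, B.T * w)
    return l_x0 @ y, l_x0

# illustrative use on simulated data
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(3 * x) + 0.2 * rng.standard_normal(100)
f_nw = np.array([nadaraya_watson(xi, x, y, 0.15) for xi in x])
L = np.vstack([local_linear(xi, x, y, 0.15)[1] for xi in x])   # rows l'(x_i)
df = np.trace(L)                                               # df(lambda) = tr(L)
```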
Roughly the theorem says that for large N , if in the case p = 1 the inputs x1 ; x2 ; : : : ; xN are iid with density p (x) on [a; b], is neither too big nor too small, juj 1 juj p sin p + DS (u) = exp 2 4 2 2 1=4 (x) = and G (z; x) = N p (x) 1 DS (x) p (x) z x (x) then for xi not too close to either a or b, (S )ij 1 G (xi ; xj ) N (in some appropriate probabilistic sense) and the smoother matrix for cubic spline smoothing has entries like those that would come from an appropriate kernel smoothing. 6.2 Kernel Smoothing in p Dimensions A direct generalization of 1-dimensional kernel smoothing to p dimensions might go roughly as follows. For D as before, and x 2 <p , I might set K (x0 ; x) = D and …t linear forms locally by choosing optimization problem minimize and N X kx x0 k (44) (x0 ) 2 <p to solve the (x0 ) 2 < and K (x0 ; xi ) yi + 0 xi 2 i=1 and predicting as f^ (x0 ) = (x0 ) + 0 (x0 ) x0 This seems typically to be done only after standardizing the coordinates of x and can be e¤ective as longs as N is not too small and p is not more than 2 or 3. However for p > 3, the curse of dimensionality comes into play and N points usually just aren’t dense enough in p-space to make direct use of kernel smoothing e¤ective. If the method is going to be successfully used in <p it will need to be applied under appropriate structure assumptions. One way to apply additional structure to the p-dimensional kernel smoothing problem is to essentially reduce input variable dimension by replacing the kernel 49 (44) with the "structured kernel" K ;A 0q (x0 ; x) = D @ (x 0 x0 ) A (x x0 ) 1 A for an appropriate non-negative de…nite matrix A. For the eigen decomposition of A, A = V DV 0 write (x 0 x0 ) A (x 1 x0 ) = D 2 V 0 (x x0 ) 0 1 D 2 V 0 (x x0 ) This amounts to using not x and <p distance from x to x0 to de…ne weights, 1 1 1 but rather D 2 V 0 x and <p distance from D 2 V 0 x to D 2 V 0 x0 . In the event that some entries of D are 0 (or are nearly so), we basically reduce dimension from p to the number of large eigenvalues of A and de…ne weights in a space of that dimension (spanned by eigenvectors corresponding to non-zero eigenvalues) where the curse of dimensionality may not preclude e¤ective use of kernel smoothing. The "trick" is, of course, identifying the right directions into which to project. (Searching for such directions is part of the Friedman "projection pursuit" ideas discussed below.) 7 High-Dimensional Use of Low-Dimensional Smoothers There are several ways that have been suggested for making use of fairly lowdimensional (and thus, potentially e¤ective) smoothing in large p problems. One of them is the "structured kernels" idea just discussed. Two more follow. 7.1 Additive Models and Other Structured Regression Functions A way to apply structure to the p-dimensional smoothing problem is through assumptions on the form of the predictor …t. For example, one might assume additivity in a form p X f (x) = + gj (xj ) (45) j=1 and try to do …tting of the p functions gj and constant . Fitting is possible via the so-called "back-…tting algorithm." That is (to generalize form (45) slightly) if I am to …t (under SEL) f (x) = + L X l=1 50 gl xl for xl some part of x, I might set b = l = 1; 2; : : : ; L; 1; 2; : : : 1 N PN i=1 yi , and then cycle through 1. …tting via some appropriate (often linear) operation (e.g., spline or kernel smoothing) gl xl to "data" xli ; yil i=1;2;:::;N for yil = yi 0 @b + X m6=l 1 A gm (xm i ) where the gm are the current versions of the …tted summands, 2. 
setting gl = the newly …tted version the sample mean of this newly …tted version across all xli (in theory this is not necessary, but it is here to prevent numerical/roundo¤ errors from causing the gm to drift up and down by additive constants summing to 0 across m), 3. iterating until convergence to, say, f^ (x). The simplest version of this, based on form (45), might be termed …tting of a "main e¤ects model." But the algorithm might as well be applied to …t a "main e¤ects and two factor interactions model," at some iterations doing smoothing of appropriate "residuals" as functions of 1-dimensional xj and at other iterations doing smoothing as functions of 2-dimensional (xj ; xj 0 ). One may mix types of predictors (continuous, categorical) and types of functions of them in the additive form to produce all sorts of interesting models (including semi-parametric ones and ones with low order interactions). Another possibility for introducing structure assumptions and making use of low-dimensional smoothing in a large p situation, is by making strong global assumptions on the forms of the in‡uence of some input variables on the output, but allowing parameters of those forms to vary in a ‡exible fashion with the values of some small number of coordinates of x. For sake of example, suppose that p = 4. One might consider predictor forms f (x) = (x3 ; x4 ) + 1 (x3 ; x4 ) x1 + 2 (x3 ; x4 ) x2 That is, one might assume that for …xed (x3 ; x4 ), the form of the predictor is linear in (x1 ; x2 ), but that the coe¢ cients of that form may change in a ‡exible way with (x3 ; x4 ). Fitting might then be approached by locally weighted least squares, with only (x3 ; x4 ) involved in the setting of the weights. That is, one 51 might for each (x30 ; x40 ), minimize over choices of 2 (x30 ; x40 ) the weighted sum of squares N X K ((x30 ; x40 ) ; (x3i ; x4i )) (yi ( (x30 ; x40 ) + (x30 ; x40 ) ; 1 1 (x30 ; x40 ) and (x30 ; x40 ) x1i + 2 (x30 ; x40 ) x2i )) i=1 and then employ the predictor f^ (x0 ) = ^ (x30 ; x40 ) + ^1 (x30 ; x40 ) x10 + ^2 (x30 ; x40 ) x20 This kind of device keeps the dimension of the space where one is doing smoothing down to something manageable. But note that nothing here does any thresholding or automatic variable selection. 7.2 Projection Pursuit Regression For w1 ; w2 ; : : : ; wM unit p-vectors of parameters, we might consider as predictors …tted versions of the form f (x) = M X gm (w0m x) (46) m=1 This is an additive form in the derived variables vm = w0m x. The functions gm and the directions wm are to be …t from the training data. The M = 1 case of this form is the "single index model" of econometrics. How does one …t a predictor of this form (46)? Consider …rst the M = 1 case. Given w, we simply have pairs (vi ; yi ) for vi = w0 xi and a 1-dimensional smoothing method can be used to estimate g. On the other hand, given g, we might seek to optimize w via an iterative search. A Gauss-Newton algorithm can be based on the …rst order Taylor approximation g (w0 xi ) g (w0old xi ) + g 0 (w0old xi ) (w0 w0old ) xi so that N X i=1 (yi g (w0 xi )) 2 N X 2 (g 0 (w0old xi )) w0old xi + i=1 yi g (w0old xi ) g 0 (w0old xi ) 2 w0 xi 2 We may then update wold to w using the closed form for weighted (by (g 0 (w0old xi )) ) no-intercept regression of w0old xi + yi g (w0old xi ) g 0 (w0old xi ) on xi . (Presumably one must normalize the updated w in order to preserve unit length property of the w in order to maintain a stable scaling in the …tting.) The g and w steps are iterated until convergence. 
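A minimal sketch of this alternating scheme for the $M = 1$ ("single index") case follows. For brevity a cubic polynomial stands in for the 1-dimensional smoother used to estimate $g$ (so that $g'$ is available in closed form); a smoothing spline or locally weighted linear smoother would be used in practice, and all names and settings here are illustrative.

```python
import numpy as np

def fit_single_index(X, y, n_iter=20):
    """Alternating fit of the M = 1 model y ~ g(w'x).  A cubic polynomial
    plays the role of the 1-d smoother for g (illustration only)."""
    N, p = X.shape
    w = np.ones(p) / np.sqrt(p)                  # starting unit direction
    for _ in range(n_iter):
        v = X @ w
        g = np.poly1d(np.polyfit(v, y, 3))       # "smooth" y against v = w'x
        gp = g.deriv()                           # g', needed for the Taylor step
        gp_v = gp(v)
        gp_v = np.where(np.abs(gp_v) < 1e-6, 1e-6, gp_v)   # guard tiny derivatives
        # Gauss-Newton step: weighted (by g'(w_old'x_i)^2), no-intercept
        # regression of the working response on x_i
        target = v + (y - g(v)) / gp_v
        wts = gp_v ** 2
        A = X * wts[:, None]
        w_new = np.linalg.solve(X.T @ A, A.T @ target)
        w = w_new / np.linalg.norm(w_new)        # re-normalize to unit length
    return w, g

# illustrative use
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
w_true = np.array([0.8, -0.6, 0.0])
y = np.sin(X @ w_true) + 0.1 * rng.standard_normal(200)
w_hat, g_hat = fit_single_index(X, y)
```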
Note that in the case where cubic 52 2 smoothing spline smoothing is used in projection pursuit, g 0 will be evaluated as some explicit quadratic and in the case of locally weighted linear smoothing, form (43) will need to be di¤erentiated in order to evaluate the derivative g 0 . When M > 1, terms gm (w0m x) are added to a sum of such in a forward stage-wise fashion. HTF provide some discussion of details like readjusting previous g’s (and perhaps w’s) upon adding gm (w0m x) to a …t, and the choice of M . 8 8.1 Highly Flexible Non-Linear Parametric Regression Methods Neural Network Regression A multi-layer feed-forward neural network is a nonlinear map of x 2 <p to one or more real-valued y’s through the use of functions of linear combinations of functions of linear combinations of ... of functions of linear combinations of coordinates of x. Figure 1 is a network diagram representation of a single hidden layer feed-forward neural net with 3 inputs 2 hidden nodes and 2 outputs. Figure 1: A Network Diagram Representation of a Single Hidden Layer Feedfoward Neural Net With 3 Inputs, 2 Hidden Nodes and 2 Ouputs This diagram stands for a function of x de…ned by setting z1 = ( 01 1+ 11 x1 + 21 x2 + 31 x3 ) z2 = ( 02 1+ 12 x1 + 22 x2 + 32 x3 ) 53 and then y1 = g1 ( 01 1+ 11 z1 + 21 z2 ) y2 = g2 ( 02 1+ 12 z1 + 22 z2 ) Of course, much more complicated networks are possible, particularly ones with multiple hidden layers and many nodes on all layers. The common choices of functional form at hidden nodes are ones with sigmoidal shape like 1 (u) = 1 + exp ( u) or exp (u) exp ( u) (u) = tanh (u) = exp (u) + exp ( u) (statisticians seem to favor the former, probably because of familiarity with logistic forms, while computer scientists seem to favor the latter). Such functions are really quite linear near u = 0, so that for small ’s the functions of x entering the g’s in a single hidden layer network are nearly linear. For large ’s the functions are nearly step functions. In light of the latter, it is not surprising that there are universal approximation theorems that guarantee that any continuous function on a compact subset of <p can be approximated to any degree of …delity with a single layer feed-forward neural net with enough nodes in the hidden layer. This is both a blessing and a curse. It promises that these forms are quite ‡exible It also promises that there must be both over-…tting and identi…ability issues inherent in their use (the latter in addition to the identi…ability issues already inherent in the symmetric nature of the functional forms assumed for the predictors). In regression contexts, identity functions are common for the functions g, while in classi…cation problems, in order to make estimates of class probabilities, the functions g often exponentiate one z and divide by a sum of such terms for all z’s. There are various possibilities for regularization of the ill-posed …tting problem for neural nets, ranging from the fairly formal and rational to the very informal and ad hoc. Possibly the most common is to simply use an iterative …tting algorithm and "stop it before it converges." 8.1.1 The Back-Propagation Algorithm The most common …tting algorithm for neural networks is something called the back-propagation algorithm or the delta rule. In the case where one is doing SEL …tting for a single layer feed-forward neural net with M hidden nodes and K output nodes, the logic goes as follows. There are M (p + 1) parameters to be …t. 
In addition to the notation 0m use the notation m =( There are K (M + 1) parameters use the notation k =( 1m ; 2m ; : : : ; pm ) 0 to be …t. In addition to the notation 1k ; 2k ; : : : ; 54 Mk) 0 0k , Let stand for the whole set of parameters and consider the SEL …tting criterion K R( ) = N 1 XX (yik 2 i=1 fk (xi )) 2 (47) k=1 Let z0i = 1 and for m = 1; 2; : : : ; M de…ne zmi = ( 0 m xi ) + 0m and 0 z i = (z1i ; z2i ; : : : ; zM i ) and write K Ri = 1X (yik 2 fk (xi )) 2 k=1 PN (so that R ( ) = i=1 Ri ). Partial derivatives with respect to the parameters are (by the chain rule) @Ri = @ mk fk (xi )) gk0 (yik 0k + 0 k zi zmi (for m = 0; 1; 2; : : : ; M and k = 1; 2; : : : ; K) and we will use the abbreviation = ki fk (xi )) gk0 (yik so that @Ri = @ mk 0k + 0 k zi (48) ki zmi (49) Similarly, letting xi0 = 1 @Ri = @ lm = K X fk (xi )) gk0 (yik 0k + 0 k zi mk 0 ( 0m + 0 m xi ) xil k=1 K X 0 ki mk ( 0m + 0 m xi ) xil k=1 (for l = 0; 1; 2; : : : ; p and m = 1; 2; : : : ; M ) and we will use the abbreviations smi = K X fk (xi )) gk0 (yik 0k + 0 k zi mk 0 ( 0m + 0 m xi ) k=1 = K X mk 0 ( 0m + 0 m xi ) ki (50) k=1 (equations (50) are called the "back-propagation" equations) so that @Ri = smi xil @ lm 55 (51) An iterative search to make R ( ) small may then proceed by setting (r+1) mk = (r) mk r N X @Ri @ mk i=1 = (r) 0k (r) k and (r) mk r N X (r) (r) ki zmi i=1 8m; k (52) and (r+1) lm = = (r) lm r (r) lm r N X @Ri @ lm i=1 N X (r) 0m ; (r) m and all (r) 0k and (r) k (r) smi xil (53) i=1 for some "learning rate" So operationally r. 1. in what is called the "forward pass" through the network, using rth iterates (r) of ’s and ’s, one computes rth iterates of …tted values fk (xi ), 2. in what is called the "backward pass" through the network, the (r)th (r) iterates ki are computed using the rth iterates of the z’s, ’s and the (r) (r) fk (xi ) in equations (48) and then the (r)th iterates sml are computed using the (r)th ’s, ’s, and ’s in equations (50), 3. the (r) ki (r) and sml then provide partials using (49) and (51), and 4. updates for the parameters 8.1.2 and come from (52) and (53). Formal Regularization of Fitting Suppose that the various coordinates of the input vectors in the training set have been standardized and one wants to regularize the …tting of a neural net. One possible way of proceeding is to de…ne a penalty function like ! p M M X K X 1 XX 2 2 J( )= (54) lm + mk 2 m=1 m=0 l=0 k=1 (it is not absolutely clear whether one really wants to include l = 0 and m = 0 in the sums (54)) and seek not to partially optimize R ( ) in (47), but rather to fully optimize R( )+ J ( ) (55) (r) (r) for a > 0. By simply adding r r lm mk on the right side of (52) and on the right side of (53), one arrives at an appropriate gradient descent algorithm for trying to optimize the penalized error sum of squares (55). (Potentially, an appropriate value for might be chosen based on cross-validation.) 56 Something that may at …rst seem quite di¤erent would be to take a Bayesian point of view. For example, with a model yik = fk (xi j ) + for the ik iid N 0; 2 l ik , a likelihood is simply ; 2 = N Y K Y h yik jfk (xi j ) ; i=1 k=1 2 for h j ; 2 the normal pdf. 
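For concreteness, the following is a minimal batch-matrix sketch of gradient descent on the penalized criterion (55), for a single-hidden-layer network with identity output functions $g_k$ and sigmoid $\sigma$ (so that $\sigma' = \sigma(1 - \sigma)$). It is illustrative only: the names, starting values, learning rate, and the choice to penalize the intercept terms $\alpha_{0m}$ and $\beta_{0k}$ are all arbitrary here.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def fit_one_hidden_layer(X, Y, M, lam=1e-3, rate=0.01, n_iter=2000, seed=0):
    """Gradient descent on R(theta) + lam * J(theta) for a single-hidden-layer
    network with M hidden nodes, K outputs, and identity g_k."""
    N, p = X.shape
    K = Y.shape[1]
    rng = np.random.default_rng(seed)
    A = 0.1 * rng.standard_normal((p + 1, M))    # alphas (row 0 holds alpha_0m)
    B = 0.1 * rng.standard_normal((M + 1, K))    # betas  (row 0 holds beta_0k)
    X1 = np.column_stack([np.ones(N), X])        # inputs with the constant 1
    for _ in range(n_iter):
        # forward pass
        Z = sigmoid(X1 @ A)                      # N x M hidden values z_mi
        Z1 = np.column_stack([np.ones(N), Z])
        F = Z1 @ B                               # fitted values f_k(x_i)
        # backward pass: negatives of the delta and s quantities in (48), (50)
        delta = F - Y                            # N x K
        S = (delta @ B[1:].T) * Z * (1 - Z)      # N x M, using sigma' = z(1 - z)
        # gradients of the penalized criterion (weight-decay terms added)
        grad_B = Z1.T @ delta + lam * B
        grad_A = X1.T @ S + lam * A
        A -= rate * grad_A
        B -= rate * grad_B
    return A, B
```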
If then g ; 2 speci…es a prior distribution for and 2 , a posterior for ; 2 has density proportional to l ; 2 g 2 ; For example, one might well assume that a priori the ’s and ’s are iid N 0; 2 (where small 2 will provide regularization and it is again unclear whether one wants to include the 0m and 0k in such an assumption or to instead provide more di¤use priors for them, like improper "Uniform( 1; 1)" or at least large variance normal ones). And a standard (improper) prior for 2 would be to take log to be Uniform( 1; 1). In any case, whether improper or proper, let’s abuse notation and write g 2 for the prior density for 2 . Then with independent mean 0 variance 2 priors for all the weights (except possibly the 0m and 0k that might be given Uniform( 1; 1) priors) one has ln l ; 2 g ; 2 1 R( ) / N K ln ( ) = N K ln ( ) + ln g 2 2 1 2 1 2 2 J ( ) + ln g 2 R( )+ 2 J( ) (56) (‡at improper priors for the 0m and 0k correspond to the absence of terms for them in the sums for J ( ) in (54)). This recalls (55) and suggests that appropriate for regularization can be thought of as a variance ratio of "observation variance" and prior variance for the weights. It’s fairly clear how to de…ne Metropolis-Hastings-within-Gibbs algorithms for sampling from l ; 2 g ; 2 . But it seems that typically the high dimensionality of the parameter space combined with the symmetry-derived multimodality of the posterior will prevent one from running an MCMC algorithm long enough to fully detail the posterior It also seems unlikely however, that detailing the posterior is really necessary or even desirable. Rather, one might simply run the MCMC algorithm, monitoring the values of l ; 2 g ; 2 corresponding to the successively randomly generated MCMC iterates. An MCMC algorithm will spend much of its time where the corresponding posterior density is large and we can expect that a long MCMC 57 run will identify a nearly modal value for the posterior. Rather than averaging neural nets according to the posterior, one might instead use as a predictor a neural net corresponding to a parameter vector (at least locally) maximizing the posterior. Notice that one might even take the parameter vector in an MCMC run with the largest l ; 2 g ; 2 value and for a grid of 2 values around the empirical maximizer use the back-propagation algorithm modi…ed to fully optimize 2 R( )+ 2 J( ) over choices of . This, in turn, could be used with (56) to perhaps improve somewhat the result of the MCMC "search." 8.2 Radial Basis Function Networks Section 6.7 of HTF considers the use of the kind of kernels applied in kernel smoothing as basis functions. That is, for kx K (x; ) = D k one might consider …tting nonlinear predictors of the form f (x) = 0 + M X jK j x; (57) j j=1 where each basis element has prototype parameter and scale parameter . A common choice of D for this purpose is the standard normal pdf. A version of this with fewer parameters is obtained by restricting to cases where 1 = 2 = = M = : This restriction, however, has the potentially unattractive e¤ect of forcing "holes" or regions of <p where (in each) f (x) 0, including all "large" x. A way to replace this behavior with potentially di¤ering values in the former "holes" and directions of "large" x, is to replace the basis functions ! 
x j K x; j = D with normalized versions D hj (x) = PM x k=1 to produce a form f (x) = j D (kx 0+ M X j=1 58 j hj = kk = (x) ) (58) The …tting of form (57) by choice of 0 ; 1 ; : : : ; M ; 1 ; 2 ; : : : ; M ; 1 ; 2 ; : : : ; is fraught with all M or form (58) by choice of 0 ; 1 ; : : : ; M ; 1 ; 2 ; : : : ; M ; the problems of over-parameterization and lack of identi…ability associated with neural networks. Another way to use radial basis functions to produce ‡exible functional forms is to replace the forms ( 0m + 0m x) in a neural network with forms K m ( 0m x; m ) or hm ( 0m x). 9 9.1 Trees and Related Methods Regression and Classi…cation Trees This is something genuinely new to our discussion. We’ll …rst consider the SEL/regression version. This begins with a forward-selection/"greedy" algorithm for inventing predictions constant on p-dimensional rectangles, by successively looking for an optimal binary split of a single one of an existing set of rectangles. For example, with p = 2, one begins with a rectangle [a; b] [c; d] de…ned by a = min xi1 x1 max xi1 = b i=1;2;:::;N i=1;2;:::;N and c= min i=1;2;:::;N xi2 x2 max xi2 = d i=1;2;:::;N and looks for a way to split it at x1 = s1 or x2 = s1 so that the resulting two rectangles minimize the training error X X 2 yi y rectangle SSE = rectangles i with xi in the rectangle One then splits (optimally) one of the (now) two rectangles at x1 = s2 or x2 = s2 , etc. Where l rectangles R1 ; R2 ; : : : ; Rl in <p have been created, and m (x) = the index of the rectangle to which x belongs the corresponding predictor is f^l (x) = 1 # training input vectors xi in Rm(x) and in this notation the training error is SSE = N X yi i=1 59 f^l (xi ) 2 X i with xi in Rm(x) yi If one is to continue beyond l rectangles, one then looks for a value sl to split one of the existing rectangles R1 ; R2 ; : : : ; Rl on x1 or x2 or . . . or xp and thereby produce the greatest reduction in SSE. (We note that there is no guarantee that after l splits one will have the best (in terms of SSE) possible set of l + 1 rectangles.) Any series of binary splits of rectangles can be represented graphically as a binary tree, each split represented by a node where there is a fork and each …nal rectangle by an end node. It is convenient to discuss rectangle-splitting and "unsplitting" in terms of operations on corresponding binary trees, and henceforth we adopt this language. It is, of course, possible to continue splitting rectangles/adding branches to a tree until every distinct x in the training set has its own rectangle. But that is not helpful in a practical sense, in that it corresponds to a very low bias/high variance predictor. So how does one …nd a tree of appropriate size? The standard recommendation using K-fold cross-validation seems to be as follows. For each of the K subsets of the training set (say S k = T T k in the notation of Section 1.5) 1. grow a tree (on a given data set) until the cell with the fewest training xi contains 5 or less such points, then 2. "prune" the tree in 1. back by for each > 0 (a complexity parameter, weighting SSEk (for S k ) against complexity de…ned in terms of tree size) minimizing over choices of sub-trees, the quantity C k (T ) = jT j + SSEk (T ) (for, in the obvious way, jT j the number of …nal nodes in the candidate tree and SSEk (T ) the error sum of squares for the corresponding predictor). Write Tk ( ) = arg min C k (T ) subtrees T and let f^k be the corresponding predictor. 
Then (as in Section 1.5), letting k (i) be the index of the piece of the training set T k containing training case i, and compute the cross-validation error CV ( ) = N 1 X ^k(i) f (xi ) N i=1 2 yi For b a minimizer of CV ( ), one then operates on the entire training set, growing a tree until the cell with the fewest training xi contains 5 or less such points, then …nding the sub-tree, say T b , optimizing C (T ) = jT j + ^ SSE (T ), and using the corresponding predictor f^^ . The question of how to …nd a subtree optimizing a C k (T ) or C (T ) without making an exhaustive search over sub-trees for every di¤erent value of has a 60 workable answer. There is a relatively small number of nested candidate subtrees of an original tree that are the only ones that are possible minimizers of a C k (T ) or C (T ), and as decreases one moves through that nested sequence of sub-trees from the largest/original tree to the smallest. Suppose that one begins with a large tree, T0 (with jT0 j …nal nodes). One may quickly search over all "pruned" versions of T0 (sub-trees T created by removing a node where there is a fork and all branches that follow below it) and …nd the one with minimum SSE (T ) SSE (T0 ) jT0 j jT j (This IS the per node (of the lopped o¤ branch of the …rst tree) increase in SSE.) Call that sub-tree T1 . T0 is the optimizer of C (T ) over subtrees of T0 for every (jT0 j jT1 j) = (SSE (T1 ) SSE (T0 )), but at = (jT0 j jT1 j) = (SSE (T1 ) SSE (T0 )), the optimizing sub-tree switches to T1 . One then may search over all "pruned" versions of T1 for the one with minimum SSE (T ) SSE (T1 ) jT1 j jT j and call it T2 . T1 is the optimizer of C (T ) over sub-trees of T0 for every (jT1 j jT2 j) = (SSE (T2 ) SSE (T1 )) (jT0 j jT1 j) = (SSE (T1 ) SSE (T0 )), but at = (jT1 j jT2 j) = (SSE (T2 ) SSE (T1 )) the optimizing sub-tree switches to T2 ; and so on. (In this process, if there ever happens to be a tie among subtrees in terms of a minimizing a ratio of increase in error sum of squares per decrease in number of nodes, one chooses the sub-tree with the smaller jT j.) For T optimizing C (T ), the function of , C (T ) = minC (T ) T is piecewise linear in , and both it and the optimizing nested sequence of sub-trees can be computed very e¢ ciently in this fashion. The continuous y, SEL version of this material is known as the making of "regression trees." The "classi…cation trees" version of it is very similar. One needs only to de…ne an empirical loss to associate with a given tree parallel to SSE used above. To that end, we …rst note that in a K class problem (where y takes values in G = f1; 2; : : : ; Kg) corresponding to a particular rectangle Rm is the fraction of training vectors with classi…cation k, pd mk = 1 # training input vectors in Rm X i with xi in Rm and a plausible G-valued predictor based on l rectangles is f^l (x) = arg max p\ m(x)k k2G 61 I [yi = k] the class that is most heavily represented in the rectangle to which x belongs. The empirical mis-classi…cation rate for this predictor is err = N l i 1 X h 1 X I yi 6= f^l (xi ) = Nm 1 N i=1 N m=1 p\ mk(m) where Nm = # training input vectors in Rm , and k (m) = arg max pd mk . Two k2G related alternative forms for a training error are "the Gini index" ! l K X 1 X err = Nm pd pd mk (1 mk ) N m=1 k=1 and the so-called "cross entropy" err = l 1 X Nm N m=1 K X k=1 ! pd d mk ln (p mk ) Upon adopting one of these forms for err and using it to replace SSE in the regression tree discussion, one has a classi…cation tree methodology. 
HTF suggest using the Gini index or cross entropy for tree growing and any of the indices (but most typically the empirical mis-classi…cation rate) for tree pruning according to cost-complexity. 9.1.1 Comments on Forward Selection The basic formulation of the tree-growing and tree-pruning methods here employs "one-step-at-a-time"/"greedy" (unable to defer immediate reward for the possibility of later success) methods. Like all such methods, they are not guaranteed to follow paths through the set of trees that ever get to "best" ones since they are "myopic," never considering what might be later in a search, if a current step were taken that provides little immediate payo¤. An example (essentially suggested by Mark Culp) dramatically illustrates this limitation. Consider a p = 3 case where x1 2 f0; 1g ; x2 2 f0; 1g ; and x3 2 [0; 1]. In fact, suppose that x1 and x2 are iid Bernoulli(:5) independent of x3 that is Uniform(0; 1). Then suppose that conditional on (x1 ; x2 ; x3 ) the output y is N( (x1 ; x2 ; x3 ) ; 1) for (x1 ; x2 ; x3 ) = 1000 I [x1 = x2 ] + x3 For a big training sample iid from this joint distribution, all branching will typically be done on the continuous variable x3 , completely missing the fundamental fact that it is the (joint) behavior of (x1 ; x2 ) that drives the size of y. (This example also supports the conventional wisdom that as presented the splitting algorithm "favors" splitting on continuous variables over splitting on values of discrete ones.) 62 9.1.2 Measuring the Importance of Inputs Consider the matter of assigning measures of "importance" of input variables for a regression tree predictor. In the spirit of ordinary linear models assessment of the importance of a predictor in terms of some reduction it provides in some error sum of squares, Breiman suggested the following. Suppose that in a regression tree, predictor xj provides the rectangle splitting criterion for nodes node1j ; : : : ; nodem(j)j and that before splitting at nodelj , the relevant rectangle Rnodelj has associated sum of squares X SSEnodelj = 2 y nodelj yi i with xi 2Rnodelj and that after splitting Rnodelj on variable xj to create rectangles Rnodelj1 and Rnodelj2 one has sums of squares associated with those two rectangles X SSEnodelj1 = 2 yi y nodelj1 yi y nodelj2 i with xi 2Rnodelj1 X and SSEnodelj2 = 2 i with xi 2Rnodelj2 The reduction in error sum of squares provided by the split on xj at nodelj is thus D (nodelj ) = SSEnodelj SSEnodelj1 + SSEnodelj2 One might then call m(j) Ij = X D (nodelj ) l=1 a measure of the importance of xp j in …tting the tree and compare the various Ij ’s (or perhaps the square roots, Ij ’s). Further, if a predictor is a (weighted) sum of regression trees (e.g. produced by "boosting" or in a "random forest") and Ijm measures the importance of xj in the mth tree, then M 1 X Ij: = Ijm M m=1 is perhaps a sensible measure of the importance of xj in the overall predictor. One can then compare the various Ij: (or square roots) as a means of comparing the importance of the input variables. 9.2 Random Forests This is a version of the "bagging" (bootstrap aggregation) idea of Section 10.1.1 applied speci…cally to (regression and classi…cation) trees. Suppose that one makes B bootstrap samples of N (random samples with replacement of size N ) 63 from the training set T , say T 1 ; T 2 ; : : : ; T B . For each one of these samples, T b , develop a corresponding regression or classi…cation tree by 1. 
1. at each node, randomly selecting $m$ of the $p$ input variables and finding an optimal single split of the corresponding rectangle over the selected input variables, splitting the rectangle, and

2. repeating 1. at each node until a small maximum number of training cases, $n_{\max}$, is reached in each rectangle.

(Note that no pruning is applied in this development.) Then let $\hat f^{*b}(x)$ be the corresponding tree-based predictor (taking values in $\Re$ in the regression case or in $\mathcal{G} = \{1, 2, \ldots, K\}$ in the classification case). The random forest predictor is then
$$\hat f_B(x) = \frac{1}{B}\sum_{b=1}^B \hat f^{*b}(x)$$
in the regression case and
$$\hat f_B(x) = \arg\max_k \sum_{b=1}^B I\left[\hat f^{*b}(x) = k\right]$$
in the classification case.

The basic tuning parameters in the development of $\hat f_B(x)$ are then $m$ and $n_{\max}$. HTF suggest default values of these parameters
$$m = \lfloor p/3\rfloor \quad\text{and}\quad n_{\max} = 5$$
for regression problems, and
$$m = \sqrt{p} \quad\text{and}\quad n_{\max} = 1$$
for classification problems. (The $n_{\max} = 1$ default in classification problems is equivalent to splitting rectangles in the input space $\Re^p$ until all rectangles are completely homogeneous in terms of output values, $y$.)

It is common practice to make a kind of running cross-validation estimate of error as one builds a random forest predictor, based on "out-of-bag" (OOB) samples. That is, suppose that for each $b$ I keep track of the set of (OOB) indices $I(b) \subset \{1, 2, \ldots, N\}$ for which the corresponding training vector does not get included in the bootstrap training set $\mathcal{T}^{*b}$, and let
$$\hat y_i^B = \frac{1}{\#\text{ of indices } b \le B \text{ such that } i \in I(b)}\ \sum_{b \le B \text{ such that } i \in I(b)} \hat f^{*b}(x_i)$$
in regression contexts and
$$\hat y_i^B = \arg\max_k \sum_{b \le B \text{ such that } i \in I(b)} I\left[\hat f^{*b}(x_i) = k\right]$$
in classification problems. Then a running cross-validation type of estimate of Err is
$$OOB\left(\hat f_B\right) = \frac{1}{N}\sum_{i=1}^N \left(y_i - \hat y_i^B\right)^2$$
in SEL regression contexts and
$$OOB\left(\hat f_B\right) = \frac{1}{N}\sum_{i=1}^N I\left[y_i \ne \hat y_i^B\right]$$
in 0-1 loss classification contexts.

Notice that (as in any bagging context, and) as indicated in Section 10.1.1, one expects LLN-type behavior and convergence of the (random) predictor $\hat f_B$ to some fixed (for fixed training set $\mathcal{T}$) predictor, say $\hat f_{\mathrm{rf}}$, that is
$$\hat f_B(x) \to \hat f_{\mathrm{rf}}(x)\ \text{ for all } x\ \text{ as } B\to\infty$$
It is not too surprising that one then also expects the convergence of $OOB(\hat f_B)$, and plotting of $OOB(\hat f_B)$ versus $B$ is a standard way of trying to assess whether enough bootstrap trees have been made to adequately represent the forest and predictor $\hat f_{\mathrm{rf}}$.

In spite of the fact that for small $B$ the (random) predictor $\hat f_B$ is built on a small number of trees and is fairly simple, $B$ is not really a complexity parameter, but is rather a convergence parameter. There is a fair amount of confusing discussion in the literature about the impossibility of a random forest "over-fitting" with increasing $B$. This seems to be related to test error not initially-decreasing-but-then-increasing-in-$B$ (which is perhaps loosely related to $OOB(\hat f_B)$ converging to a positive value associated with the limiting predictor $\hat f_{\mathrm{rf}}$, and not to 0). But as HTF point out on their page 596, it is an entirely different question whether $\hat f_{\mathrm{rf}}$ itself is "too complex" to be adequately supported by the training data, $\mathcal{T}$. (And the whole discussion seems very odd in light of the fact that for any finite $B$, a different choice of bootstrap samples would produce a different $\hat f_B$ as a new randomized approximation to $\hat f_{\mathrm{rf}}$. Even for fixed $x$, the value $\hat f_B(x)$ is a random variable. Only $\hat f_{\mathrm{rf}}(x)$ is fixed.)
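As an aside, and assuming the scikit-learn library is available, the following sketch (my own illustration, with simulated data) fits a random forest with roughly the regression defaults described above ($m \approx p/3$ candidate inputs per split, small terminal rectangles) and computes the OOB squared error estimate from the library's out-of-bag predictions.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
N, p = 500, 6
X = rng.uniform(size=(N, p))
y = 10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + rng.normal(size=N)

# m = floor(p/3) randomly selected candidate inputs per split, small leaves
rf = RandomForestRegressor(
    n_estimators=500, max_features=p // 3, min_samples_leaf=5,
    bootstrap=True, oob_score=True, random_state=0
).fit(X, y)

# running-CV-style OOB error, analogous to OOB(f_hat_B) above
oob_mse = np.mean((y - rf.oob_prediction_) ** 2)
print(oob_mse)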
There is also a fair amount of confusing discussion in the literature about the role of the random selection of the $m$ predictors to use at each node-splitting (and the choice of $m$) in reducing "correlation between trees in the forest." The Breiman/Cutler web site

http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

says that the "forest error rate" (presumably the error rate for $\hat f_{\mathrm{rf}}$) depends upon "the correlation between any two trees in the forest" and the "strength of each tree in the forest." The meaning of these is not clear. One possible meaning of the first is some version of correlation between values of $\hat f^{*1}(x)$ and $\hat f^{*2}(x)$ as one repeatedly selects the whole training set $\mathcal{T}$ in iid fashion from $P$ and then makes two bootstrap samples; Section 15.4 of HTF seems to use this meaning. (A second possibility concerns "bootstrap randomization distribution" correlation, for a fixed training set and a fixed $x$, between values $\hat f^{*1}(x)$ and $\hat f^{*2}(x)$.) A meaning of the second is presumably some measure of average effectiveness of a single $\hat f^{*b}$. HTF Section 15.4 goes on to suggest that increasing $m$ increases both "correlation" and "strength" of the trees, the first degrading error rate and the second improving it, and that the OOB estimate of error can be used to guide choice of $m$ (usually in a broad range of values that are about equally attractive) if something besides the default is to be used.

9.3 PRIM (Patient Rule Induction Method)

This is another rectangle-based method of making a predictor on $\Re^p$. (The language seems to be "patient" as opposed to "rash," and "rule induction" is perhaps "predictor development" or, more likely, "conjunctive rule development" from the context of "market basket analysis"; see Section 17.1 in regard to this latter usage.) PRIM could be termed a type of "bump-hunting." For a series of rectangles (or boxes) in $p$-space $R_1, R_2, \ldots, R_l$ one defines a predictor
$$\hat f_l(x) = \begin{cases} \bar y_{R_1} & \text{if } x \in R_1\\ \bar y_{R_2 - R_1} & \text{if } x \in R_2 - R_1\\ \ \ \vdots & \ \ \vdots\\ \bar y_{R_m - \cup_{k=1}^{m-1}R_k} & \text{if } x \in R_m - \cup_{k=1}^{m-1}R_k\\ \ \ \vdots & \ \ \vdots\\ \bar y_{(\cup_{k=1}^{l}R_k)^c} & \text{if } x \notin \cup_{k=1}^{l}R_k\end{cases}$$
The boxes or rectangles are defined recursively in a way intended to catch "the remaining part of the input space with the largest output values." That is, to find $R_1$:

1. identify a rectangle
$$l_1 \le x_1 \le u_1,\quad l_2 \le x_2 \le u_2,\quad \ldots,\quad l_p \le x_p \le u_p$$
that includes all input vectors in the training set,

2. identify a dimension, $j$, and either $l_j$ or $u_j$, so that by reducing $u_j$ or increasing $l_j$ just enough to remove a fraction $\alpha$ (say $\alpha = .1$) of the training vectors currently in the rectangle, the largest value of $\bar y_{\text{rectangle}}$ possible is produced, and update that boundary of the rectangle,

3. repeat 2. until some minimum number of training inputs $x_i$ remain in the rectangle (say, at least 10),

4. expand the rectangle in any direction (increase a $u_j$ or decrease an $l_j$), adding a training input vector that provides a maximal increase in $\bar y_{\text{rectangle}}$, and

5. repeat 4. until no increase is possible by adding a single training input vector.

This produces $R_1$. For what it is worth, step 2. is called "peeling" and step 4. is called "pasting." Upon producing $R_1$, one removes from consideration all training vectors with $x_i \in R_1$ and repeats 1. through 5. to produce $R_2$. This continues until a desired number of rectangles has been created.
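The following is a minimal numpy sketch (my own illustration, not code from the outline) of a single "peeling" step 2.: each of the $2p$ candidate boundary moves that would drop a fraction $\alpha$ of the points currently in the box is tried, and the move leaving the largest $\bar y_{\text{rectangle}}$ is kept.

import numpy as np

def peel_once(X, y, lower, upper, alpha=0.10):
    """One PRIM peeling step: try shrinking each face of the current box
    just enough to drop a fraction alpha of the points now inside it, and
    keep the move that gives the largest mean response in what remains."""
    inside = np.all((X >= lower) & (X <= upper), axis=1)
    Xin, yin = X[inside], y[inside]
    best = None
    for j in range(X.shape[1]):
        lo_cut = np.quantile(Xin[:, j], alpha)       # candidate: raise lower_j
        hi_cut = np.quantile(Xin[:, j], 1 - alpha)   # candidate: lower upper_j
        for side, cut in (("lower", lo_cut), ("upper", hi_cut)):
            keep = Xin[:, j] >= cut if side == "lower" else Xin[:, j] <= cut
            if keep.sum() == 0:
                continue
            m = yin[keep].mean()
            if best is None or m > best[0]:
                best = (m, j, side, cut)
    _, j, side, cut = best
    lower, upper = lower.copy(), upper.copy()
    if side == "lower":
        lower[j] = cut
    else:
        upper[j] = cut
    return lower, upper

# usage: start from the box containing all training inputs and peel repeatedly
# lower, upper = X.min(axis=0), X.max(axis=0)
# lower, upper = peel_once(X, y, lower, upper, alpha=0.10)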
One may pick an appropriate number of rectangles ($l$ is a complexity parameter) by cross-validation and then apply the procedure to the whole training set to produce a set of rectangles and a predictor on $p$-space that is piecewise constant on regions built from boolean operations on rectangles.

10 Predictors Built on Bootstrap Samples, Model Averaging, and Stacking

10.1 Predictors Built on Bootstrap Samples

One might make $B$ bootstrap samples of $N$ (random samples with replacement of size $N$) from the training set $\mathcal{T}$, say $\mathcal{T}^{*1}, \mathcal{T}^{*2}, \ldots, \mathcal{T}^{*B}$, and train on these bootstrap samples to produce, say,
$$\text{predictor } \hat f^{*b} \text{ based on } \mathcal{T}^{*b}$$
Now (rather than using these to estimate the prediction error as in Section 12.4) consider a couple of ways of using these to build a predictor (under squared error loss).

10.1.1 Bagging

The possibility considered in Section 8.7 of HTF is "bootstrap aggregation," or "bagging." This is use of the predictor
$$\hat f_{\text{bag}}(x) \equiv \frac{1}{B}\sum_{b=1}^B \hat f^{*b}(x)$$
Notice that even for fixed training set $\mathcal{T}$ and input $x$, this is random (varying with the selection of the bootstrap samples). One might let $E^*$ denote averaging over the creation of a single bootstrap sample and $\hat f^*$ be the predictor derived from such a bootstrap sample, and think of $E^*\hat f^*(x)$ as the "true" bagging predictor (that has the simulation-based approximation $\hat f_{\text{bag}}(x)$). One is counting on a law of large numbers to conclude that $\hat f_{\text{bag}}(x) \to E^*\hat f^*(x)$ as $B \to \infty$. Note too that unless the operations applied to a training set to produce $\hat f$ are linear, $E^*\hat f^*(x)$ will differ from the predictor computed from the training data, $\hat f(x)$. Where a loss other than squared error is involved, exactly how to "bag" bootstrapped versions of a predictor is not altogether obvious, and apparently even what might look like sensible possibilities can do poorly.

10.1.2 Bumping

Another/different thing one might do with bootstrap versions of a predictor is to "pick a winner" based on performance on the training data. This is the "bumping"/stochastic perturbation idea of HTF's Section 8.9. That is, let $\hat f^{*0} = \hat f$ be the predictor computed from the training data, define
$$\hat b = \arg\min_{b = 0, 1, \ldots, B}\ \sum_{i=1}^N \left(y_i - \hat f^{*b}(x_i)\right)^2$$
and take
$$\hat f_{\text{bump}}(x) = \hat f^{*\hat b}(x)$$
The idea here is that if a few cases in the training data are responsible for making a basically good method of predictor construction perform poorly, eventually a bootstrap sample will miss those cases and produce an effective predictor.

10.2 Model Averaging and Stacking

The bagging idea averages versions of a single predictor computed from different bootstrap samples. An alternative might be to somehow weight together different predictors (potentially even based on different models).

10.2.1 Bayesian Model Averaging

One theoretically straightforward way to justify this kind of enterprise and identify candidate predictors is through the Bayes "multiple model" scenario (also used in Section 12.2.2). That is, suppose that $M$ models are under consideration, the $m$th of which has parameter vector $\theta_m$ and corresponding density for the training data
$$f_m(\mathcal{T}\,|\,\theta_m)$$
with prior density for $\theta_m$
$$g_m(\theta_m)$$
and prior probability for model $m$
$$\pi(m)$$
The posterior distribution of the model index is
$$\pi(m\,|\,\mathcal{T}) \propto \pi(m)\int f_m(\mathcal{T}\,|\,\theta_m)\,g_m(\theta_m)\,d\theta_m$$
i.e. is specified by
$$\pi(m\,|\,\mathcal{T}) = \frac{\pi(m)\int f_m(\mathcal{T}\,|\,\theta_m)\,g_m(\theta_m)\,d\theta_m}{\sum_{m=1}^M \pi(m)\int f_m(\mathcal{T}\,|\,\theta_m)\,g_m(\theta_m)\,d\theta_m}$$
Now suppose that for each model there is a quantity $\zeta_m(\theta_m)$ that is of interest, these having a common interpretation across models.
Notice that under the current assumption that $P$ depends upon the parameter $\theta_m$ in model $m$, $\zeta_m(\theta_m)$ could be the conditional mean of $y$ given the value of $x$. That is, one important instance of this structure is
$$\zeta_m(\theta_m; u) = E_{\theta_m}[y\,|\,x = u]$$
Since in model $m$ the density of $\theta_m$ conditional on $\mathcal{T}$ is
$$g(\theta_m\,|\,\mathcal{T}) = \frac{f_m(\mathcal{T}\,|\,\theta_m)\,g_m(\theta_m)}{\int f_m(\mathcal{T}\,|\,\theta_m)\,g_m(\theta_m)\,d\theta_m}$$
the conditional mean of $\zeta_m(\theta_m)$ (in model $m$) is
$$E[\zeta_m(\theta_m)\,|\,m, \mathcal{T}] = \frac{\int \zeta_m(\theta_m)\,f_m(\mathcal{T}\,|\,\theta_m)\,g_m(\theta_m)\,d\theta_m}{\int f_m(\mathcal{T}\,|\,\theta_m)\,g_m(\theta_m)\,d\theta_m}$$
and a sensible predictor of $\zeta = \zeta_m(\theta_m)$ is
$$E[\zeta\,|\,\mathcal{T}] = \sum_{m=1}^M \pi(m\,|\,\mathcal{T})\,E[\zeta_m(\theta_m)\,|\,m, \mathcal{T}]$$
a (model posterior probability) weighted average of the predictors $E[\zeta_m(\theta_m)\,|\,m, \mathcal{T}]$ individually appropriate under the $M$ different models. In the important special case where $\zeta_m(\theta_m; u) = E_{\theta_m}[y\,|\,x = u]$, this is
$$E[\zeta_m(\theta_m; u)\,|\,\mathcal{T}] = \sum_{m=1}^M \pi(m\,|\,\mathcal{T})\,E\left[E_{\theta_m}[y\,|\,x = u]\,\big|\,m, \mathcal{T}\right]$$

10.2.2 Stacking

What might be suggested like this, but with a less Bayesian flavor? Suppose that $M$ predictors are available (all based on the same training data), $\hat f_1, \hat f_2, \ldots, \hat f_M$. Under squared error loss, I might seek a weight vector $w$ for which the predictor
$$\hat f(x) = \sum_{m=1}^M w_m\hat f_m(x)$$
is effective. I might even let $w$ depend upon $x$, producing
$$\hat f(x) = \sum_{m=1}^M w_m(x)\,\hat f_m(x)$$
HTF pose the problem of perhaps finding
$$\hat w = \arg\min_w E^{\mathcal{T}}E^{(x,y)}\left(y - \sum_{m=1}^M w_m\hat f_m(x)\right)^2 \tag{59}$$
or even
$$\hat w(u) = \arg\min_w E^{\mathcal{T}}E\left[\left(y - \sum_{m=1}^M w_m\hat f_m(x)\right)^2\,\Bigg|\,x = u\right]$$
Why might this program improve on any single one of the $\hat f_m$'s? In some sense, this is "obvious," since the $w$ over which minimization (59) is done include vectors with one entry 1 and all others 0. But to indicate in a concrete setting why this might work, consider a case where $M = 2$ and, according to the $P^N \times P$ joint distribution of $(\mathcal{T}, (x, y))$,
$$E\left(y - \hat f_1(x)\right) = 0\quad\text{and}\quad E\left(y - \hat f_2(x)\right) = 0$$
Define
$$\hat f = \alpha\hat f_1 + (1 - \alpha)\hat f_2$$
Then
$$E\left(y - \hat f(x)\right)^2 = E\left(\alpha\left(y - \hat f_1(x)\right) + (1 - \alpha)\left(y - \hat f_2(x)\right)\right)^2 = \operatorname{Var}\left(\alpha\left(y - \hat f_1(x)\right) + (1 - \alpha)\left(y - \hat f_2(x)\right)\right) = (\alpha,\ 1 - \alpha)\operatorname{Cov}\begin{pmatrix}y - \hat f_1(x)\\ y - \hat f_2(x)\end{pmatrix}\begin{pmatrix}\alpha\\ 1 - \alpha\end{pmatrix}$$
This is a quadratic function of $\alpha$ that (since covariance matrices are non-negative definite) has a minimum. Thus there is a minimizing $\alpha$ that typically (is not 0 or 1 and thus) produces better expected loss than either $\hat f_1(x)$ or $\hat f_2(x)$.

More generally, using either the $P^N \times P$ joint distribution of $(\mathcal{T}, (x, y))$ or the $P^N \times P_{y|x=u}$ conditional distribution of $(\mathcal{T}, y)$ given $x = u$, one may consider the random vector $(\hat f_1(x), \hat f_2(x), \ldots, \hat f_M(x), y)'$ or the random vector $(\hat f_1(u), \hat f_2(u), \ldots, \hat f_M(u), y)'$. Let $(\hat f', y)'$ stand for either one and
$$E\hat f\hat f'_{\ (M\times M)}\quad\text{and}\quad Ey\hat f_{\ (M\times 1)}$$
be respectively the matrix of expected products of the predictions and the vector of expected products of $y$ and the elements of $\hat f$. Upon writing out the expected square to be minimized and doing some matrix calculus, it's possible to see that optimal weights are of the form
$$\hat w = \left(E\hat f\hat f'\right)^{-1}Ey\hat f$$
(or $\hat w(x)$ is of this form). (We observe that HTF's notation in (8.57) could be interpreted to incorrectly place the inverse sign inside the expectation. They call this "the population linear regression of $y$ on $\hat f$.") Of course, none of this is usable in practice, as typically the mean vector and expected cross product matrix are unknown. And apparently, the simplest empirical approximations to these do not work well. What, instead, HTF seem to recommend is to define
$$w^{\text{stack}} = \arg\min_w \sum_{i=1}^N \left(y_i - \sum_{m=1}^M w_m\hat f_m^{-i}(x_i)\right)^2$$
for $\hat f_m^{-i}$ the $m$th predictor fit to $\mathcal{T} - \{(x_i, y_i)\}$, the training set with the $i$th case removed. ($w^{\text{stack}}$ optimizes a kind of leave-one-out cross-validation error for a linear combination of $\hat f_m$'s.)
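A minimal numpy sketch of computing $w^{\text{stack}}$ by unconstrained least squares follows (my own illustration, not code from the outline). It assumes the user has already assembled the $N\times M$ matrix of leave-one-out predictions $\hat f_m^{-i}(x_i)$; in practice nonnegativity or sum-to-one constraints on $w$ are often added, which this sketch omits.

import numpy as np

def stacking_weights(loo_preds, y):
    """loo_preds is an N x M matrix whose (i, m) entry is the prediction for
    case i from the m-th predictor fit with case i removed.  Returns the
    unconstrained least squares stacking weights w_stack."""
    w, *_ = np.linalg.lstsq(loo_preds, y, rcond=None)
    return w

# the stacked prediction at a new x is then sum_m w[m] * f_hat_m(x), with the
# f_hat_m refit to the full training set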
Then ultimately the "stacked" predictor of $y$ is
$$\hat f(x) = \sum_{m=1}^M w_m^{\text{stack}}\hat f_m(x)$$

11 Reproducing Kernel Hilbert Spaces: Penalized/Regularized and Bayes Prediction

A general framework that unifies many interesting regularized fitting methods is that of reproducing kernel Hilbert spaces. There is a very nice 2012 Statistics Surveys paper by Nancy Heckman (related to an older UBC Statistics Department technical report (#216)) entitled "The Theory and Application of Penalized Least Squares Methods or Reproducing Kernel Hilbert Spaces Made Easy," which is the nicest exposition I know of the connection of this material to splines. Parts of what follows are borrowed shamelessly from her paper. There is also some very helpful material on RKHSs in CFZ Section 3.5 and scattered through Izenman. (See also "Penalized Splines and Reproducing Kernels" by Pearce and Wand in The American Statistician (2006) and "Kernel Methods in Machine Learning" by Hofmann, Schölkopf, and Smola in The Annals of Statistics (2008).)

11.1 RKHS's and p = 1 Cubic Smoothing Splines

To provide motivation for a somewhat more general discussion, consider again the smoothing spline problem. We consider here the function space
$$\mathcal{H} = \left\{h : [0,1]\to\Re\ \Big|\ h \text{ and } h' \text{ are absolutely continuous and } \int_0^1\left(h''(x)\right)^2 dx < \infty\right\}$$
as a Hilbert space (a linear space with inner product where Cauchy sequences have limits) with inner product
$$\langle f, g\rangle_{\mathcal{H}} \equiv f(0)g(0) + f'(0)g'(0) + \int_0^1 f''(x)g''(x)\,dx$$
(and corresponding norm $\|h\|_{\mathcal{H}} = \langle h, h\rangle_{\mathcal{H}}^{1/2}$). With this definition of inner product and norm, for $x \in [0,1]$ the (linear) functional (a mapping $\mathcal{H}\to\Re$)
$$F_x[f] \equiv f(x)$$
is continuous. Thus the so-called "Riesz representation theorem" says that there is an $R_x \in \mathcal{H}$ such that
$$F_x[f] = \langle R_x, f\rangle_{\mathcal{H}} = f(x)\quad\forall f \in \mathcal{H}$$
($R_x$ is called the representer of evaluation at $x$.) It is in fact possible to verify that for $z \in [0,1]$
$$R_x(z) = 1 + xz + R_{1x}(z)\quad\text{for}\quad R_{1x}(z) = xz\min(x,z) - \frac{x + z}{2}\left(\min(x,z)\right)^2 + \frac{1}{3}\left(\min(x,z)\right)^3$$
does this job. Then the function of two variables defined by $R(x,z) \equiv R_x(z)$ is called the reproducing kernel for $\mathcal{H}$, and $\mathcal{H}$ is called a reproducing kernel Hilbert space (RKHS) (of functions).

Define a linear differential operator on elements of $\mathcal{H}$ (carrying $\mathcal{H}$ into $L_2[0,1]$) by
$$L[f](x) \equiv f''(x)$$
Then the optimization problem solved by the cubic smoothing spline is minimization of
$$\sum_{i=1}^N\left(y_i - F_{x_i}[h]\right)^2 + \lambda\int_0^1\left(L[h](x)\right)^2 dx \tag{60}$$
over choices of $h \in \mathcal{H}$. It is possible to show that the minimizer of the quantity (60) is necessarily of the form
$$h(x) = \beta_0 + \beta_1 x + \sum_{i=1}^N \alpha_i R_{1x_i}(x)$$
and that for such $h$, the criterion (60) is of the form
$$(Y - T\beta - K\alpha)'(Y - T\beta - K\alpha) + \lambda\,\alpha'K\alpha$$
for
$$T_{N\times 2} = (\mathbf{1}, X),\quad \beta = \begin{pmatrix}\beta_0\\ \beta_1\end{pmatrix},\quad\text{and}\quad K_{N\times N} = \left(R_{1x_i}(x_j)\right)$$
So this has ultimately produced a matrix calculus problem.

11.2 What Seems to be Possible Beginning from Linear Functionals and Linear Differential Operators for p = 1

For constants $d_i > 0$, functionals $F_i$, and a linear differential operator $L$ defined, for continuous functions $w_k(x)$, by
$$L[h](x) \equiv h^{(m)}(x) + \sum_{k=1}^{m-1} w_k(x)\,h^{(k)}(x)$$
Heckman considers the minimization of
$$\sum_{i=1}^N d_i\left(y_i - F_i[h]\right)^2 + \lambda\int_a^b\left(L[h](x)\right)^2 dx \tag{61}$$
in the space of functions
$$\mathcal{H} = \left\{h : [a,b]\to\Re\ \Big|\ \text{derivatives of } h \text{ up to order } m - 1 \text{ are absolutely continuous and } \int_a^b\left(h^{(m)}(x)\right)^2 dx < \infty\right\}$$
One may adopt an inner product for $\mathcal{H}$ of the form
$$\langle f, g\rangle_{\mathcal{H}} \equiv \sum_{k=0}^{m-1} f^{(k)}(a)\,g^{(k)}(a) + \int_a^b L[f](x)\,L[g](x)\,dx$$
and have an RKHS. The assumption is made that the functionals $F_i$ are continuous and linear, and thus that they are representable as $F_i[h] = \langle f_i, h\rangle_{\mathcal{H}}$ for some $f_i \in \mathcal{H}$.
An important special case is that where $F_i[h] = h(x_i)$, but other linear functionals have been used, for example $F_i[h] = \int_a^b H_i(x)h(x)\,dx$ for known $H_i$.

The form of the reproducing kernel implied by the choice of this inner product is derivable as follows. First, there is a linearly independent set of functions $\{u_1, \ldots, u_m\}$ that is a basis for the subspace of $\mathcal{H}$ consisting of those elements $h$ for which $L[h] = 0$ (the zero function). Call this subspace $\mathcal{H}_0$. The so-called Wronskian matrix associated with these functions is then
$$W(x) = \left(u_i^{(j-1)}(x)\right)_{m\times m}$$
With
$$C = W(a)'\,W(a)$$
let
$$R_0(x,z) = \sum_{i,j}\left(C^{-1}\right)_{ij}u_i(x)\,u_j(z)$$
Further, there is a so-called Green's function associated with the operator $L$, a function $G(x,z)$ such that for all $h \in \mathcal{H}$ satisfying $h^{(k)}(a) = 0$ for $k = 0, 1, \ldots, m-1$,
$$h(x) = \int_a^b G(x,z)\,L[h](z)\,dz$$
Let
$$R_1(x,z) = \int_a^b G(x,t)\,G(z,t)\,dt$$
The reproducing kernel associated with the inner product and $L$ is then
$$R(x,z) = R_0(x,z) + R_1(x,z)$$
As it turns out, $\mathcal{H}_0$ is an RKHS with reproducing kernel $R_0$ under the inner product $\langle f, g\rangle_0 = \sum_{k=0}^{m-1}f^{(k)}(a)g^{(k)}(a)$. Further, the subspace of $\mathcal{H}$ consisting of those $h$ with $h^{(k)}(a) = 0$ for $k = 0, 1, \ldots, m-1$ (call it $\mathcal{H}_1$) is an RKHS with reproducing kernel $R_1(x,z)$ under the inner product $\langle f, g\rangle_1 = \int_a^b L[f](x)L[g](x)\,dx$. Every element of $\mathcal{H}_0$ is $\perp$ to every element of $\mathcal{H}_1$ in $\mathcal{H}$, and every $h \in \mathcal{H}$ can be written uniquely as $h_0 + h_1$ for an $h_0 \in \mathcal{H}_0$ and an $h_1 \in \mathcal{H}_1$. These facts can be used to show that a minimizer of (61) exists and is of the form
$$h(x) = \sum_{k=1}^m\beta_k u_k(x) + \sum_{i=1}^N\alpha_i R_1(x_i, x)$$
For $h(x)$ of this form, (61) is of the form
$$(Y - T\beta - K\alpha)'\,D\,(Y - T\beta - K\alpha) + \lambda\,\alpha'K\alpha$$
for
$$T = (F_i[u_j]),\quad \beta = \begin{pmatrix}\beta_1\\ \beta_2\\ \vdots\\ \beta_m\end{pmatrix},\quad D = \operatorname{diag}(d_1, \ldots, d_N),\quad\text{and}\quad K = \left(F_i[R_1(\cdot, x_j)]\right)$$
and its optimization is a matrix calculus problem. In the important special case where the $F_i$ are function evaluation at the $x_i$, above
$$F_i[u_j] = u_j(x_i)\quad\text{and}\quad F_i[R_1(\cdot, x_j)] = R_1(x_i, x_j)$$

11.3 What Is Common Beginning Directly From a Kernel

Another way that it is common to make use of RKHSs is to begin with a kernel and its implied inner product (rather than deriving one as appropriate to a particular well-formed optimization problem in function spaces). This seems less satisfying/principled and more arbitrary/mechanical than developments beginning from linear functionals and differential operators, but is nevertheless popular. The 2006 paper of Pearce and Wand in The American Statistician is a readable account of this thinking that is more complete than what follows here (and provides some references).

This development begins with $\mathcal{C}$, a compact subset of $\Re^p$, and a symmetric kernel function
$$K : \mathcal{C}\times\mathcal{C}\to\Re$$
Ultimately, we will consider as predictors for $x \in \mathcal{C}$ linear combinations of sections of the kernel function, $\sum_{i=1}^N b_i K(x, x_i)$ (where the $x_i$ are the input vectors in the training set). But to get there in a rational way, and to incorporate use of a complexity penalty into the fitting, we will restrict attention to kernels that have nice properties. In particular, we require that $K$ be continuous and non-negative definite. We mean by this latter that for any set of $M$ elements of $\Re^p$, say $\{z_j\}$, the $M\times M$ matrix $(K(z_i, z_j))$ is non-negative definite.
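As a quick numerical illustration of the non-negative definiteness requirement (a sketch under the assumption that the Gaussian kernel discussed later in Section 11.3.1 is used), one can build the Gram matrix of the kernel on an arbitrary finite set of points and confirm that its eigenvalues are nonnegative up to roundoff.

import numpy as np

def gaussian_kernel(z, x, c=1.0):
    """K(z, x) = exp(-c * ||z - x||^2), a continuous non-negative definite kernel."""
    return np.exp(-c * np.sum((z - x) ** 2))

rng = np.random.default_rng(1)
Z = rng.normal(size=(20, 3))                       # arbitrary points in R^p
K = np.array([[gaussian_kernel(zi, zj) for zj in Z] for zi in Z])
print(np.linalg.eigvalsh(K).min())                 # >= 0 up to roundoff error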
Then according to what is known as Mercer's Theorem (look it up in Wikipedia), $K$ may be written in the form
$$K(z,x) = \sum_{i=1}^\infty\gamma_i\,\phi_i(z)\,\phi_i(x) \tag{62}$$
for linearly independent (in $L_2(\mathcal{C})$) functions $\{\phi_i\}$ and constants $\gamma_i \ge 0$, where the $\phi_i$ comprise an orthonormal basis for $L_2(\mathcal{C})$ (so any function $f \in L_2(\mathcal{C})$ has an expansion in terms of the $\phi_i$ as $\sum_{i=1}^\infty\langle f,\phi_i\rangle_2\,\phi_i$ for $\langle f, h\rangle_2 \equiv \int_{\mathcal{C}}f(x)h(x)\,dx$), and the ones corresponding to positive $\gamma_i$ may be taken to be continuous on $\mathcal{C}$. Further, the $\phi_i$ are "eigenfunctions" of the kernel $K$ corresponding to the "eigenvalues" $\gamma_i$, in the sense that in $L_2(\mathcal{C})$
$$\int\phi_i(z)\,K(z,\cdot)\,dz = \gamma_i\,\phi_i(\cdot)$$
Our present interest is in a function space $\mathcal{H}_K$ (a subset of $L_2(\mathcal{C})$ with a different inner product and norm) with members of the form
$$f(x) = \sum_{i=1}^\infty c_i\phi_i(x)\quad\text{for } c_i \text{ with } \sum_{i=1}^\infty\frac{c_i^2}{\gamma_i} < \infty \tag{63}$$
(called the "primal form" of functions in the space). (Notice that all elements of $L_2(\mathcal{C})$ are of the form $f(x) = \sum_{i=1}^\infty c_i\phi_i(x)$ with $\sum_{i=1}^\infty c_i^2 < \infty$.) More naturally, our interest centers on
$$f(x) = \sum_i b_i K(x, z_i)\quad\text{for some set of } z_i \tag{64}$$
supposing that the series converges appropriately (called the "dual form" of functions in the space). The former is most useful for producing simple proofs, while the second is most natural for application, since how to obtain the $\phi_i$ and corresponding $\gamma_i$ for a given $K$ is not so obvious.

Notice that
$$K(z,x) = \sum_{i=1}^\infty\gamma_i\phi_i(z)\phi_i(x) = \sum_{i=1}^\infty\left(\gamma_i\phi_i(x)\right)\phi_i(z)$$
and letting $\gamma_i\phi_i(x) = c_i(x)$, since $\sum_{i=1}^\infty c_i^2(x)/\gamma_i = \sum_{i=1}^\infty\gamma_i\phi_i^2(x) = K(x,x) < \infty$, the function $K(\cdot, x)$ is of the form (63), so that we can expect functions of the form (64) with absolutely convergent $\sum_i b_i$ to be of form (63).

In the space of functions (63), we define an inner product (for our Hilbert space)
$$\left\langle\sum_{i=1}^\infty c_i\phi_i,\ \sum_{i=1}^\infty d_i\phi_i\right\rangle_{\mathcal{H}_K} \equiv \sum_{i=1}^\infty\frac{c_i d_i}{\gamma_i}\quad\text{so that}\quad\left\|\sum_{i=1}^\infty c_i\phi_i\right\|^2_{\mathcal{H}_K} = \sum_{i=1}^\infty\frac{c_i^2}{\gamma_i}$$
Note then that for $f = \sum_{i=1}^\infty c_i\phi_i$ belonging to the Hilbert space $\mathcal{H}_K$,
$$\langle f, K(\cdot, x)\rangle_{\mathcal{H}_K} = \sum_{i=1}^\infty\frac{c_i\,\gamma_i\phi_i(x)}{\gamma_i} = \sum_{i=1}^\infty c_i\phi_i(x) = f(x)$$
and so $K(\cdot, x)$ is the representer of evaluation at $x$. Further,
$$\langle K(\cdot, z), K(\cdot, x)\rangle_{\mathcal{H}_K} = K(z,x)$$
which is the reproducing property of the RKHS.

Notice also that for two linear combinations of slices of the kernel function at some set of (say, $M$) inputs $\{z_i\}$ (two functions in $\mathcal{H}_K$ represented in dual form)
$$f(\cdot) = \sum_{i=1}^M c_i K(\cdot, z_i)\quad\text{and}\quad g(\cdot) = \sum_{i=1}^M d_i K(\cdot, z_i)$$
the corresponding $\mathcal{H}_K$ inner product is
$$\langle f, g\rangle_{\mathcal{H}_K} = \left\langle\sum_{i=1}^M c_i K(\cdot, z_i),\ \sum_{j=1}^M d_j K(\cdot, z_j)\right\rangle_{\mathcal{H}_K} = \sum_{i=1}^M\sum_{j=1}^M c_i d_j K(z_i, z_j) = (c_1, \ldots, c_M)\left(K(z_i, z_j)\right)_{\substack{i=1,\ldots,M\\ j=1,\ldots,M}}\begin{pmatrix}d_1\\ \vdots\\ d_M\end{pmatrix}$$
This is a kind of $\Re^M$ inner product of the coefficient vectors $c' = (c_1, \ldots, c_M)$ and $d' = (d_1, \ldots, d_M)$, defined by the non-negative definite matrix $(K(z_i, z_j))$. Further, if a random $M$-vector $Y$ has covariance matrix $(K(z_i, z_j))$, this is $\operatorname{Cov}(c'Y, d'Y)$. So, in particular, for $f$ of this form $\|f\|^2_{\mathcal{H}_K} = \langle f, f\rangle_{\mathcal{H}_K} = \operatorname{Var}(c'Y)$.

For applying this material to the fitting of training data, for $\lambda > 0$ and a loss function $L(y, \hat y) \ge 0$, define an optimization criterion
$$\underset{f\in\mathcal{H}_K}{\text{minimize}}\ \left(\sum_{i=1}^N L(y_i, f(x_i)) + \lambda\,\|f\|^2_{\mathcal{H}_K}\right) \tag{65}$$
As it turns out, an optimizer of this criterion must, for the training vectors $\{x_i\}$, be of the form
$$\hat f(x) = \sum_{i=1}^N b_i K(x, x_i) \tag{66}$$
and the corresponding squared norm of $\hat f \in \mathcal{H}_K$ is then
$$\left\langle\hat f, \hat f\right\rangle_{\mathcal{H}_K} = \sum_{i=1}^N\sum_{j=1}^N b_i b_j K(x_i, x_j)$$
The criterion (65) is thus
$$\underset{b\in\Re^N}{\text{minimize}}\ \left(\sum_{i=1}^N L\Big(y_i,\ \sum_{j=1}^N b_j K(x_i, x_j)\Big) + \lambda\,b'\big(K(x_i, x_j)\big)b\right) \tag{67}$$
Letting $K = (K(x_i, x_j))$ and $P = K^{-}$ (a symmetric generalized inverse of $K$) and defining
$$L_N(Y, Kb) \equiv \sum_{i=1}^N L\Big(y_i,\ \sum_{j=1}^N b_j K(x_i, x_j)\Big)$$
the optimization criterion (67) is
$$\underset{b\in\Re^N}{\text{minimize}}\ \big(L_N(Y, Kb) + \lambda\,b'Kb\big)\quad\text{i.e.}\quad\underset{b\in\Re^N}{\text{minimize}}\ \big(L_N(Y, Kb) + \lambda\,b'K'PKb\big)\quad\text{i.e.}\quad\underset{v\in C(K)}{\text{minimize}}\ \big(L_N(Y, v) + \lambda\,v'Pv\big) \tag{68}$$
That is, the function space optimization problem (65) reduces to the $N$-dimensional optimization problem (68). A $v \in C(K)$ (the column space of $K$) minimizing $L_N(Y, v) + \lambda v'Pv$ corresponds to a $b$ minimizing $L_N(Y, Kb) + \lambda b'Kb$ via
$$Kb = v \tag{69}$$
For the particular special case of squared error loss, $L(y, \hat y) = (y - \hat y)^2$, this development has a very explicit punch line. That is,
$$L_N(Y, Kb) + \lambda\,b'Kb = (Y - Kb)'(Y - Kb) + \lambda\,b'Kb$$
Some vector calculus shows that this is minimized over choices of $b$ by
$$b = (K + \lambda I)^{-1}Y \tag{70}$$
and corresponding fitted values are
$$\hat Y = v = K(K + \lambda I)^{-1}Y$$
Then, using (70) under squared error loss, the solution to (65) is, from (66),
$$\hat f(x) = \sum_{i=1}^N b_i K(x, x_i) \tag{71}$$
To better understand the nature of (68), consider the eigen decomposition of $K$, in a case where it is non-singular, as
$$K = U\operatorname{diag}(d_1, d_2, \ldots, d_N)\,U'$$
for eigenvalues $d_1 \ge d_2 \ge \cdots \ge d_N > 0$, where the eigenvector columns of $U$ comprise an orthonormal basis for $\Re^N$. The penalty in (68) is
$$v'Pv = v'U\operatorname{diag}(1/d_1, 1/d_2, \ldots, 1/d_N)U'v = \sum_{j=1}^N\frac{1}{d_j}\langle v, u_j\rangle^2$$
Now (remembering that the $u_j$ comprise an orthonormal basis for $\Re^N$)
$$v = \sum_{j=1}^N\langle v, u_j\rangle u_j\quad\text{and}\quad\|v\|^2 = \langle v, v\rangle = \sum_{j=1}^N\langle v, u_j\rangle^2$$
so we see that in choosing a $v$ to optimize $L_N(Y, v) + \lambda v'Pv$, we penalize highly those $v$ with large components in the directions of the late eigenvectors of $K$ (the ones corresponding to its small eigenvalues), thereby tending to suppress those features of a potential $v$. HTF seem to say that the eigenvalues and eigenvectors of the data-dependent $N\times N$ matrix $K$ are somehow related respectively to the constants $\gamma_i$ and functions $\phi_i$ in the representation (62) of $K$ as a weighted series of products. That seems hard to understand and (even if true) certainly not obvious.

CFZ provide a result summarizing the most general available version of this development, known as "the representer theorem." It says that if $\Omega : [0,\infty)\to\Re$ is strictly increasing and
$$L\big((x_1, y_1, h(x_1)), \ldots, (x_N, y_N, h(x_N))\big) \ge 0$$
is an arbitrary loss function associated with the prediction of each $y_i$ as $h(x_i)$, then an $h \in \mathcal{H}_K$ minimizing
$$L\big((x_1, y_1, h(x_1)), (x_2, y_2, h(x_2)), \ldots, (x_N, y_N, h(x_N))\big) + \Omega\left(\|h\|_{\mathcal{H}_K}\right)$$
has a representation as
$$h(x) = \sum_{i=1}^N\alpha_i K(x, x_i)$$
Further, if $\{\psi_1, \psi_2, \ldots, \psi_M\}$ is a set of real-valued functions and the $N\times M$ matrix $(\psi_j(x_i))$ is of rank $M$, then for $h_0 \in \operatorname{span}\{\psi_1, \psi_2, \ldots, \psi_M\}$ and $h_1 \in \mathcal{H}_K$, an $h = h_0 + h_1$ minimizing
$$L\big((x_1, y_1, h(x_1)), (x_2, y_2, h(x_2)), \ldots, (x_N, y_N, h(x_N))\big) + \Omega\left(\|h_1\|_{\mathcal{H}_K}\right)$$
has a representation as
$$h(x) = \sum_{i=1}^M\beta_i\psi_i(x) + \sum_{i=1}^N\alpha_i K(x, x_i)$$
The extra generality provided by this theorem for the squared error loss case treated above is that it provides for linear combinations of the functions $\psi_i(x)$ to be unpenalized in fitting.
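Returning for a moment to the squared error loss case, formulas (70) and (71) are easy to implement directly. The following numpy sketch (my own illustration, using a Gaussian kernel and simulated data) computes $b = (K + \lambda I)^{-1}Y$ and evaluates $\hat f(x) = \sum_i b_i K(x, x_i)$ at new inputs.

import numpy as np

def gaussian_kernel_matrix(A, B, c=1.0):
    """Matrix of K(a_i, b_j) = exp(-c * ||a_i - b_j||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-c * d2)

def rkhs_fit(X, y, lam, c=1.0):
    """Squared error loss version of (65): b = (K + lam I)^{-1} y, as in (70)."""
    K = gaussian_kernel_matrix(X, X, c)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def rkhs_predict(Xnew, X, b, c=1.0):
    """f_hat(x) = sum_i b_i K(x, x_i), as in (71)."""
    return gaussian_kernel_matrix(Xnew, X, c) @ b

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(100, 1))
y = np.sin(3 * X[:, 0]) + 0.2 * rng.normal(size=100)
b = rkhs_fit(X, y, lam=0.1, c=5.0)
print(rkhs_predict(X[:5], X, b, c=5.0))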
Then for $\Psi_{N\times M} = (\psi_j(x_i))$ and
$$R = I - \Psi(\Psi'\Psi)^{-1}\Psi'$$
an optimizing $\beta$ is
$$\hat\beta = (\Psi'\Psi)^{-1}\Psi'(Y - K\hat\alpha)$$
where $\hat\alpha$ optimizes
$$(RY - RK\alpha)'(RY - RK\alpha) + \lambda\,\alpha'K\alpha$$
and an argument parallel to the earlier one implies that $\hat\alpha = (RK + \lambda I)^{-1}RY$.

11.3.1 Some Special Cases

One possible kernel function in $p$ dimensions is
$$K(z,x) = (1 + \langle z, x\rangle)^d$$
For fixed $x_i$ the basis functions $K(x, x_i)$ are $d$th order polynomials in the entries of $x$. So the fitting is in terms of such polynomials. Note that since $K(z,x)$ is relatively simple here, there seems to be a good chance of explicitly deriving a representation (62) and perhaps working out all the details of what is above in a very concrete setting.

Another standard kernel function in $p$ dimensions is
$$K(z,x) = \exp\left(-c\|z - x\|^2\right)$$
and the basis functions $K(x, x_i)$ are essentially spherically symmetric normal pdfs with mean vectors $x_i$. (These are "Gaussian radial basis functions," and for $p = 2$, functions (66) produce prediction surfaces in 3-space that have smooth symmetric "mountains" or "craters" at each $x_i$, of elevation or depth relative to the rest of the surface governed by $b_i$ and extent governed by $c$.) Section 19 provides a number of insights that enable the creation of a wide variety of kernels beyond the few mentioned here.

The standard development of so-called "support vector classifiers" in a 2-class context with $y$ taking values $\pm 1$ uses some kernel $K(z,x)$ and predictors
$$\hat f(x) = \hat b_0 + \sum_{i=1}^N \hat b_i K(x, x_i)$$
in combination with
$$L\left(y, \hat f(x)\right) = \left[1 - y\hat f(x)\right]_+$$
(the sign of $\hat f(x)$ providing the classification associated with $x$).

11.3.2 Addendum Regarding the Structures of the Spaces Related to a Kernel

An amplification of some aspects of the basic description provided above for the RKHS corresponding to a kernel $K : \mathcal{C}\times\mathcal{C}\to\Re$ is as follows. From the representation
$$K(z,x) = \sum_{i=1}^\infty\gamma_i\,\phi_i(z)\,\phi_i(x)$$
write $\varphi_i \equiv \sqrt{\gamma_i}\,\phi_i$, so that the kernel is
$$K(z,x) = \sum_{i=1}^\infty\varphi_i(z)\,\varphi_i(x) \tag{72}$$
(Note that, considered as functions in $L_2(\mathcal{C})$, the $\varphi_i$ are orthogonal but not generally orthonormal, since $\langle\varphi_i, \varphi_i\rangle_2 = \gamma_i\int_{\mathcal{C}}\phi_i^2(x)\,dx = \gamma_i$, which is typically not 1.) Representation (72) suggests that one think about the inner product for inputs provided by the kernel in terms of a transform of an input vector $x \in \Re^p$ to an infinite-dimensional feature vector
$$\varphi(x) = (\varphi_1(x), \varphi_2(x), \ldots)$$
and then "ordinary $\Re^\infty$ inner products" defined on those feature vectors.

The function space $\mathcal{H}_K$ has members of the (primal) form
$$f(x) = \sum_{i=1}^\infty c_i\phi_i(x)\quad\text{for } c_i \text{ with } \sum_{i=1}^\infty\frac{c_i^2}{\gamma_i} < \infty$$
This is perhaps more naturally written in terms of the $\varphi_i$ as
$$f(x) = \sum_{i=1}^\infty\frac{c_i}{\sqrt{\gamma_i}}\,\varphi_i(x)\quad\text{with}\quad\sum_{i=1}^\infty\left(\frac{c_i}{\sqrt{\gamma_i}}\right)^2 = \sum_{i=1}^\infty\frac{c_i^2}{\gamma_i} < \infty$$
(Again, all elements of $L_2(\mathcal{C})$ are of the form $f(x) = \sum_{i=1}^\infty c_i\phi_i(x)$ with $\sum_{i=1}^\infty c_i^2 < \infty$.) The $\mathcal{H}_K$ inner product of two functions of this primal form has been defined as
$$\left\langle\sum_{i=1}^\infty c_i\phi_i,\ \sum_{i=1}^\infty d_i\phi_i\right\rangle_{\mathcal{H}_K} = \sum_{i=1}^\infty\frac{c_i d_i}{\gamma_i} = \left\langle\sum_{i=1}^\infty\frac{c_i}{\sqrt{\gamma_i}}\varphi_i,\ \sum_{i=1}^\infty\frac{d_i}{\sqrt{\gamma_i}}\varphi_i\right\rangle_{\mathcal{H}_K} = \left\langle\left(\frac{c_1}{\sqrt{\gamma_1}}, \frac{c_2}{\sqrt{\gamma_2}}, \ldots\right),\ \left(\frac{d_1}{\sqrt{\gamma_1}}, \frac{d_2}{\sqrt{\gamma_2}}, \ldots\right)\right\rangle_{\Re^\infty}$$
So two elements of $\mathcal{H}_K$ written in terms of the $\varphi_i$ (instead of their multiples $\phi_i$) have $\mathcal{H}_K$ inner product that is the ordinary $\Re^\infty$ inner product of their vectors of coefficients.

Now consider the function
$$K(\cdot, x) = \sum_{i=1}^\infty\varphi_i(\cdot)\,\varphi_i(x)$$
For $f = \sum_{i=1}^\infty c_i\phi_i \in \mathcal{H}_K$,
$$\langle f, K(\cdot, x)\rangle_{\mathcal{H}_K} = \left\langle\sum_{i=1}^\infty c_i\phi_i,\ K(\cdot, x)\right\rangle_{\mathcal{H}_K} = \sum_{i=1}^\infty c_i\phi_i(x) = f(x)$$
and (perhaps more clearly than above) indeed $K(\cdot, x)$ is the representer of evaluation at $x$ in the function space $\mathcal{H}_K$.

11.4 Gaussian Process "Priors," Bayes Predictors, and RKHS's

The RKHS material has an interesting connection to Bayes prediction. It's our purpose here to show that connection.
Consider first a very simple context where $P$ has parameter $\theta$, and therefore the conditional mean of the output is a parametric (in $\theta$) function of $x$,
$$E[y\,|\,x = u] = m(u\,|\,\theta)$$
Suppose that the conditional distribution of $y\,|\,x = u$ has density $f_\theta(y\,|\,u)$. Then for a prior distribution on $\theta$ with density $g(\theta)$, the posterior has density
$$g(\theta\,|\,\mathcal{T}) \propto \prod_{i=1}^N f_\theta(y_i\,|\,x_i)\,g(\theta)$$
For any fixed $x$, this posterior distribution induces a corresponding distribution on $m(x\,|\,\theta)$. Then a sensible predictor based on the Bayes structure is
$$\hat f(x) = E[m(x\,|\,\theta)\,|\,\mathcal{T}] = \int m(x\,|\,\theta)\,g(\theta\,|\,\mathcal{T})\,d\theta$$
With this thinking in mind, let's now consider a more complicated application of essentially Bayesian thinking to the development of a predictor, based on the use of a Gaussian process as a more or less non-parametric "prior distribution" for the function (of $u$) $E[y\,|\,x = u]$. That is, for purposes of developing a predictor, suppose that one assumes that
$$y = \delta(x) + \epsilon\quad\text{where}\quad\delta(x) = \mu(x) + W(x),\quad E\epsilon = 0,\ \operatorname{Var}\epsilon = \sigma_\epsilon^2$$
the function $\mu(x)$ is known (it could be taken to be identically 0) and plays the role of a prior mean for the function (of $u$)
$$\delta(u) = E[y\,|\,x = u]$$
and (independent of the errors $\epsilon$) $W(x)$ is a realization of a mean 0 stationary Gaussian process on $\Re^p$, this Gaussian process describing the prior uncertainty for $\delta(x)$ around $\mu(x)$. More explicitly, the assumption on $W(x)$ is that $EW(x) = 0$ and $\operatorname{Var}W(x) = \sigma^2$ for all $x$, and that for some appropriate (correlation) function $\rho$,
$$\operatorname{Cov}(W(x), W(z)) = \sigma^2\rho(x - z)\quad\text{for all } x \text{ and } z$$
($\rho(0) = 1$ and the function of two variables $\rho(x - z)$ must be positive definite). The "Gaussian" assumption is then that for any finite set of elements of $\Re^p$, say $z_1, z_2, \ldots, z_M$, the vector of corresponding values $W(z_i)$ is multivariate normal.

There are a number of standard forms that have been suggested for the correlation function $\rho$. The simplest ones are of a product form, i.e. if $\rho_j$ is a valid one-dimensional correlation function, then the product
$$\rho(x - z) = \prod_{j=1}^p\rho_j(x_j - z_j)$$
is a valid correlation function for a Gaussian process on $\Re^p$. Standard forms for correlation functions in one dimension are $\rho(\tau) = \exp(-c\tau^2)$ and $\rho(\tau) = \exp(-c|\tau|)$. The first produces "smoother" realizations than does the second, and in both cases the constant $c$ governs how fast realizations vary.

One may then consider the joint distribution (conditional on the $x_i$, and assuming that for the training values $y_i$ the $\epsilon_i$ are iid, independent of the $W(x_i)$) of the training output values and a value of $\delta(x)$. From this, one can find the conditional mean for $\delta(x)$ given the training data. To that end, let
$$\Gamma_{N\times N} = \left(\sigma^2\rho(x_i - x_j)\right)_{\substack{i=1,2,\ldots,N\\ j=1,2,\ldots,N}}\quad\text{and}\quad\gamma(x) = \begin{pmatrix}\sigma^2\rho(x - x_1)\\ \sigma^2\rho(x - x_2)\\ \vdots\\ \sigma^2\rho(x - x_N)\end{pmatrix}$$
Then for a single value of $x$,
$$\begin{pmatrix}y_1\\ y_2\\ \vdots\\ y_N\\ \delta(x)\end{pmatrix} \sim \text{MVN}_{N+1}\left(\begin{pmatrix}\mu(x_1)\\ \mu(x_2)\\ \vdots\\ \mu(x_N)\\ \mu(x)\end{pmatrix},\ \begin{pmatrix}\Gamma + \sigma_\epsilon^2 I & \gamma(x)\\ \gamma(x)' & \sigma^2\end{pmatrix}\right)$$
Then standard multivariate normal theory says that the conditional mean of $\delta(x)$ given $Y$ is
$$\hat f(x) = \mu(x) + \gamma(x)'\left(\Gamma + \sigma_\epsilon^2 I\right)^{-1}\begin{pmatrix}y_1 - \mu(x_1)\\ y_2 - \mu(x_2)\\ \vdots\\ y_N - \mu(x_N)\end{pmatrix} \tag{73}$$
Write
$$w = \left(\Gamma + \sigma_\epsilon^2 I\right)^{-1}\begin{pmatrix}y_1 - \mu(x_1)\\ y_2 - \mu(x_2)\\ \vdots\\ y_N - \mu(x_N)\end{pmatrix} \tag{74}$$
and then note that (73) implies that
$$\hat f(x) = \mu(x) + \sum_{i=1}^N w_i\,\sigma^2\rho(x - x_i) \tag{75}$$
and we see that this development ultimately produces $\mu(x)$ plus a linear combination of the "basis functions" $\sigma^2\rho(x - x_i)$ as a predictor. Remembering that $\sigma^2\rho(x - z)$ must be positive definite and seeing the ultimate form of the predictor, we are reminded of the RKHS material. In fact, consider the case where $\mu(x) \equiv 0$.
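A minimal numpy sketch of the $\mu(x)\equiv 0$ predictor (73)-(75) follows (my own illustration with a Gaussian product correlation function and simulated data; here sigma2 is the process variance $\sigma^2$ and sigma2_eps is the error variance $\sigma_\epsilon^2$ in the notation above).

import numpy as np

def rho(d, c=1.0):
    """Product Gaussian correlation: exp(-c * sum_j (d_j)^2) applied to lag vectors d."""
    return np.exp(-c * np.sum(d ** 2, axis=-1))

def gp_posterior_mean(Xnew, X, y, sigma2=1.0, sigma2_eps=0.1, c=1.0):
    """Conditional mean (73)/(75) of delta(x) given the training outputs,
    under the Gaussian process model with prior mean mu(x) = 0."""
    Gamma = sigma2 * rho(X[:, None, :] - X[None, :, :], c)          # N x N
    gamma_new = sigma2 * rho(Xnew[:, None, :] - X[None, :, :], c)   # n_new x N
    w = np.linalg.solve(Gamma + sigma2_eps * np.eye(len(y)), y)     # (74)
    return gamma_new @ w

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(60, 1))
y = np.sin(3 * X[:, 0]) + 0.3 * rng.normal(size=60)
print(gp_posterior_mean(X[:5], X, y, sigma2=1.0, sigma2_eps=0.09, c=4.0))

Numerically, this agrees with the RKHS fit of the previous sketch when the kernel there is taken to be $\sigma^2\rho(x - z)$ and the penalty weight $\lambda = \sigma_\epsilon^2$, which is the equivalence noted next.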
(If one has some non-zero prior mean for $\delta(x)$, arguably that mean function should be subtracted from the raw training outputs before beginning the development of a predictor. At a minimum, output values should probably be centered before attempting development of a predictor.) Compare (74) and (75) to (70) and (71) for the $\mu(x)\equiv 0$ case. What is then clear is that the present "Bayes" Gaussian process development of a predictor under squared error loss, based on a covariance function $\sigma^2\rho(x - z)$ and error variance $\sigma_\epsilon^2$, is equivalent to an RKHS regularized fit of a function to training data based on the kernel $K(x,z) = \sigma^2\rho(x - z)$ and penalty weight $\lambda = \sigma_\epsilon^2$.

12 More on Understanding and Predicting Predictor Performance

There are a variety of theoretical and empirical quantities that might be computed to quantify predictor performance. Those that are empirical and reliably track important theoretical ones might potentially be used to select an effective predictor. We'll here consider some of those (theoretical and empirical) measures. We will do so in the by-now-familiar setting where training data $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$ are assumed to be iid $P$ and independent of $(x, y)$ that is also $P$-distributed, and are used to pick a prediction rule $\hat f(x)$ to be used under a loss $L(\hat y, y) \ge 0$.

What is very easy to think about and compute is the training error
$$\overline{err} = \frac{1}{N}\sum_{i=1}^N L\left(\hat f(x_i), y_i\right)$$
This typically decreases with increased complexity in the form of $\hat f$, and is no reliable indicator of predictor performance off the training set. Measures of prediction rule performance off the training set must have a theoretical basis (or be somehow based on data held back from the process of prediction rule development).

General loss function versions of the (squared error loss) quantities related to $\overline{err}$ defined in Section 1.4 are
$$\operatorname{Err}(u) \equiv E_{\mathcal{T}}E\left[L\left(\hat f(x), y\right)\,\Big|\,x = u\right]\quad\text{and}\quad\operatorname{Err}_{\mathcal{T}} \equiv E_{(x,y)}L\left(\hat f(x), y\right) \tag{76}$$
and
$$\operatorname{Err} \equiv E_x\operatorname{Err}(x) = E_{\mathcal{T}}\operatorname{Err}_{\mathcal{T}} \tag{77}$$
A slightly different and semi-empirical version of this expected prediction error (77) is the "in-sample" test error (7.12) of HTF,
$$\frac{1}{N}\sum_{i=1}^N\operatorname{Err}(x_i)$$

12.1 Optimism of the Training Error

Typically, $\overline{err}$ is less than $\operatorname{Err}_{\mathcal{T}}$. Part of the difference between these is potentially due to the fact that $\operatorname{Err}_{\mathcal{T}}$ is an "extra-sample" error, in that the averaging in (76) is potentially over values $x$ outside the set of values in the training data. We might consider instead
$$\operatorname{Err}_{\mathcal{T}\,\text{in}} \equiv \frac{1}{N}\sum_{i=1}^N E^{y_i}L\left(y_i, \hat f(x_i)\right) \tag{78}$$
where the expectations indicated in (78) are over $y_i \sim P_{y|x=x_i}$ (the entire training sample used to choose $\hat f$, both inputs and outputs, is being held constant in the averaging in (78)). The difference
$$op \equiv \operatorname{Err}_{\mathcal{T}\,\text{in}} - \overline{err}$$
is called the "optimism of the training error." HTF use the notation
$$\omega \equiv E^{\mathbf{Y}}op = E^{\mathbf{Y}}\left(\operatorname{Err}_{\mathcal{T}\,\text{in}} - \overline{err}\right) \tag{79}$$
where the averaging indicated by $E^{\mathbf{Y}}$ is over the outputs in the training set (using the conditionally independent $y_i$'s, $y_i \sim P_{y|x=x_i}$). HTF say that for many losses
$$\omega = \frac{2}{N}\sum_{i=1}^N\operatorname{Cov}^{\mathbf{Y}}(\hat y_i, y_i) \tag{80}$$
For example, consider the case of squared error loss. There
$$\omega = E^{\mathbf{Y}}\left(\operatorname{Err}_{\mathcal{T}\,\text{in}} - \overline{err}\right) = E^{\mathbf{Y}}\left[\frac{1}{N}\sum_{i=1}^N E^{y_i}\left(y_i - \hat f(x_i)\right)^2 - \frac{1}{N}\sum_{i=1}^N\left(y_i - \hat f(x_i)\right)^2\right] = \frac{2}{N}\sum_{i=1}^N E^{\mathbf{Y}}\left[\hat f(x_i)\left(y_i - E[y\,|\,x = x_i]\right)\right] = \frac{2}{N}\sum_{i=1}^N\operatorname{Cov}^{\mathbf{Y}}(\hat y_i, y_i)$$
We note that in this context, assuming that given the $x_i$ in the training data the outputs are uncorrelated and have constant variance $\sigma^2$, by (14)
$$\omega = \frac{2\sigma^2}{N}\operatorname{df}\left(\hat Y\right)$$

12.2 Cp, AIC and BIC

The fact that $\operatorname{Err}_{\mathcal{T}\,\text{in}} = \overline{err} + op$ suggests the making of estimates of $\omega = E^{\mathbf{Y}}op$ and the use of
$$\overline{err} + \hat\omega \tag{81}$$
as a guide in model selection. This idea produces consideration of the model selection criteria $C_p$/AIC and BIC.

12.2.1 Cp and AIC

For the situation of least squares fitting with $p$ predictors or basis functions and squared error loss,
$$\operatorname{df}\left(\hat Y\right) = p = \operatorname{tr}\left(X(X'X)^{-1}X'\right)$$
so that $\sum_{i=1}^N\operatorname{Cov}^{\mathbf{Y}}(\hat y_i, y_i) = p\sigma^2$. Then, if $\widehat{\sigma^2}$ is an estimated error variance based on a low-bias/high-number-of-predictors fit, a version of (81) suitable for this context is Mallows' $C_p$,
$$C_p \equiv \overline{err} + \frac{2p\,\widehat{\sigma^2}}{N}$$
In a more general setting, if one can appropriately evaluate or estimate $\sum_{i=1}^N\operatorname{Cov}^{\mathbf{Y}}(\hat y_i, y_i) = \sigma^2\operatorname{df}(\hat Y)$, a general version of (81) becomes the Akaike information criterion
$$AIC = \overline{err} + \frac{2}{N}\sum_{i=1}^N\operatorname{Cov}^{\mathbf{Y}}(\hat y_i, y_i) = \overline{err} + \frac{2\sigma^2}{N}\operatorname{df}\left(\hat Y\right)$$

12.2.2 BIC

For situations where fitting is done by maximum likelihood, the Bayesian Information Criterion of Schwarz is an alternative to AIC. That is, where the joint distribution $P$ produces density $P(y\,|\,\theta, u)$ for the conditional distribution of $y\,|\,x = u$ and $\hat\theta$ is the maximum likelihood estimator of $\theta$, a (maximized) log-likelihood is
$$\text{loglik} = \sum_{i=1}^N\log P\left(y_i\,|\,\hat\theta, x_i\right)$$
and the so-called Bayesian information criterion is
$$BIC = -2\,\text{loglik} + (\log N)\operatorname{df}\left(\hat Y\right)$$
For $y\,|\,x$ normal with variance $\sigma^2$, up to a constant this is
$$BIC = \frac{N}{\sigma^2}\left[\overline{err} + (\log N)\frac{\sigma^2}{N}\operatorname{df}\left(\hat Y\right)\right]$$
and after switching 2 for $\log N$, BIC is a multiple of AIC. The replacement of 2 with $\log N$ means that when used to guide model/predictor selections, BIC will typically favor simpler models/predictors than will AIC.

The Bayesian origins of BIC can be developed as follows. Suppose (as in Section 10.2.1) that $M$ models are under consideration, the $m$th of which has parameter vector $\theta_m$ and corresponding density $f_m(\mathcal{T}\,|\,\theta_m)$ for the training data, with prior density $g_m(\theta_m)$ for $\theta_m$ and prior probability $\pi(m)$ for model $m$. With this structure, the posterior distribution of the model index is
$$\pi(m\,|\,\mathcal{T}) \propto \pi(m)\int f_m(\mathcal{T}\,|\,\theta_m)\,g_m(\theta_m)\,d\theta_m$$
Under 0-1 loss and uniform $\pi(\cdot)$, one wants to choose the model $m$ maximizing
$$\int f_m(\mathcal{T}\,|\,\theta_m)\,g_m(\theta_m)\,d\theta_m = f_m(\mathcal{T}) = \text{the } m\text{th marginal of } \mathcal{T}$$
Apparently, the so-called Laplace approximation says that
$$\log f_m(\mathcal{T}) \approx \log f_m\left(\mathcal{T}\,|\,\hat\theta_m\right) - \frac{d_m}{2}\log N + O(1)$$
where $d_m$ is the (real) dimension of $\theta_m$. Assuming that the marginal of $x$ doesn't change model-to-model or parameter-to-parameter, $\log f_m(\mathcal{T}\,|\,\hat\theta_m)$ is $\text{loglik} + C_N$, where $C_N$ is a function of only the input values in the training set. Then
$$-2\log f_m(\mathcal{T}) \approx -2\,\text{loglik} + (\log N)\,d_m + O(1) - 2C_N = BIC + O(1) - 2C_N$$
and (at least approximately) choosing $m$ to maximize $f_m(\mathcal{T})$ is choosing $m$ to minimize BIC.

12.3 Cross-Validation Estimation of Err

$K$-fold cross-validation is described in Section 1.5. One hopes that
$$CV\left(\hat f\right) = \frac{1}{N}\sum_{i=1}^N L\left(y_i, \hat f^{\,k(i)}(x_i)\right)$$
estimates Err. In predictor selection, say where predictor $\hat f_\alpha$ has a complexity parameter $\alpha$, we look at $CV(\hat f_\alpha)$ as a function of $\alpha$, try to optimize, and then refit (with the optimizing $\alpha$) to the whole training set.

$K$-fold cross-validation can be expected to estimate Err for
$$\text{"}N\text{"} = \left(1 - \frac{1}{K}\right)N$$
The question of how cross-validation might be expected to perform is thus related to how Err changes with $N$ (the size of the training sample). Typically, Err decreases monotonically in $N$, approaching some limiting value as $N$ goes to infinity. The "early" (small $N$) part of the "Err vs. $N$ curve" is steep and the "late" part (large $N$) is relatively flat. If $(1 - 1/K)N$ is large enough that at such a training set size the curve is flat, then the effectiveness of cross-validation is limited only by the noise inherent in estimating it, and not by the fact that training sets of size $(1 - 1/K)N$ are not of size $N$.
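A minimal numpy sketch of this use of $K$-fold cross-validation for choosing a complexity parameter follows (my own illustration; the ridge penalty lam stands in for a generic $\alpha$, and the data are simulated).

import numpy as np

def cv_error(X, y, lam, K=10, seed=0):
    """K-fold CV(f_hat) under squared error loss for ridge regression with
    penalty lam (a stand-in for a generic complexity parameter alpha)."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % K              # the fold labels k(i)
    sq_errors = np.empty(len(y))
    for k in range(K):
        train, test = folds != k, folds == k
        Xtr, ytr = X[train], y[train]
        beta = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(X.shape[1]), Xtr.T @ ytr)
        sq_errors[test] = (y[test] - X[test] @ beta) ** 2
    return sq_errors.mean()

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)
lams = [0.01, 0.1, 1.0, 10.0, 100.0]
cv = [cv_error(X, y, lam) for lam in lams]
best = lams[int(np.argmin(cv))]    # then refit with this lam on the whole training set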
Operationally, $K = 5$ or $10$ seem standard, and a common rule of thumb is to select for use the predictor of minimum complexity whose cross-validation error is within one standard error of the minimum $CV(\hat f)$ for the predictors $\hat f$ under consideration. Presumably, "standard error" here refers to $K^{-1/2}$ times a sample standard deviation of the $K$ fold values
$$\frac{K}{N}\sum_{i\ \ni\ k(i) = k} L\left(y_i, \hat f^{\,k}(x_i)\right)$$
(though it's not completely obvious which complexity's standard error is under discussion when one says this).

HTF say that for many linear fitting methods (that produce $\hat Y = MY$), including least squares projection and cubic smoothing splines, the $N = K$ (leave-one-out) cross-validation error is (for $\hat f^{-i}$ produced by training on $\mathcal{T} - \{(x_i, y_i)\}$)
$$CV\left(\hat f\right) = \frac{1}{N}\sum_{i=1}^N\left(y_i - \hat f^{-i}(x_i)\right)^2 = \frac{1}{N}\sum_{i=1}^N\left(\frac{y_i - \hat f(x_i)}{1 - M_{ii}}\right)^2$$
(for $M_{ii}$ the $i$th diagonal element of $M$). The so-called generalized cross-validation approximation to this is the much more easily computed
$$GCV\left(\hat f\right) = \frac{1}{N}\sum_{i=1}^N\left(\frac{y_i - \hat f(x_i)}{1 - \operatorname{tr}(M)/N}\right)^2 = \frac{\overline{err}}{\left(1 - \operatorname{tr}(M)/N\right)^2}$$
It is worth noting (per HTF Exercise 7.7) that since $1/(1 - x)^2 \approx 1 + 2x$ for $x$ near 0,
$$GCV\left(\hat f\right) \approx \frac{1}{N}\sum_{i=1}^N\left(y_i - \hat f(x_i)\right)^2 + \frac{2\operatorname{tr}(M)}{N}\left(\frac{1}{N}\sum_{i=1}^N\left(y_i - \hat f(x_i)\right)^2\right)$$
which is close to AIC, the difference being that here $\sigma^2$ is being estimated based on the model being fit, as opposed to being estimated based on a low-bias/large model.

12.4 Bootstrap Estimation of Err

Suppose that the values of the input vectors in the training set are unique. One might make $B$ bootstrap samples of $N$ (random samples with replacement of size $N$) from the training set $\mathcal{T}$, say $\mathcal{T}^{*1}, \mathcal{T}^{*2}, \ldots, \mathcal{T}^{*B}$, and train on these bootstrap samples to produce predictors, say,
$$\text{predictor } \hat f^{*b} \text{ based on } \mathcal{T}^{*b}$$
Let $C^i$ be the set of indices $b = 1, 2, \ldots, B$ for which $(x_i, y_i) \notin \mathcal{T}^{*b}$. A possible bootstrap estimate of Err is then
$$\widehat{\operatorname{Err}}^{(1)} \equiv \frac{1}{N}\sum_{i=1}^N\left[\frac{1}{\#C^i}\sum_{b\in C^i} L\left(y_i, \hat f^{*b}(x_i)\right)\right]$$
It's not completely clear what to make of this. For one thing, the $\mathcal{T}^{*b}$ do not have $N$ distinct elements. In fact, it's fairly easy to see that the expected number of distinct cases in a bootstrap sample is about $.632N$. That is, note that
$$P[x_1 \text{ is in a particular bootstrap sample } \mathcal{T}^*] = 1 - \left(1 - \frac{1}{N}\right)^N \approx 1 - \exp(-1) = .632$$
Then
$$E(\text{number of distinct cases in } \mathcal{T}^*) = E\sum_{i=1}^N I[x_i \text{ is in } \mathcal{T}^*] = \sum_{i=1}^N E\,I[x_i \text{ is in } \mathcal{T}^*] \approx \sum_{i=1}^N .632 = .632N$$
So, roughly speaking, we might expect $\widehat{\operatorname{Err}}^{(1)}$ to estimate Err at $.632N$, not at $N$. So unless Err as a function of training set size is fairly flat to the right of $.632N$, one might expect substantial positive bias in it as an estimate of Err (at $N$). HTF argue for
$$\widehat{\operatorname{Err}}^{(.632)} \equiv .368\,\overline{err} + .632\,\widehat{\operatorname{Err}}^{(1)}$$
as a first order correction on the biased bootstrap estimate, but admit that this is not perfect either, and propose a more complicated fix (that they call $\widehat{\operatorname{Err}}^{(.632+)}$) for classification problems. It is hard to see what real advantage the bootstrap has over $K$-fold cross-validation, especially since one will typically want $B$ to be substantially larger than usual values for $K$.

13 Linear (and a Bit on Quadratic) Methods of Classification

Suppose now that an output variable $y$ takes values in $\mathcal{G} = \{1, 2, \ldots, K\}$, or equivalently that $y$ is a $K$-variate set of indicators, $y_k = I[y = k]$.
The object here is to consider methods of producing prediction/classification rules $\hat f(x)$ taking values in $\mathcal{G}$ (and mostly ones) that have sets $\{x \in \Re^p\,|\,\hat f(x) = k\}$ with boundaries that are defined (at least piece-wise) by linear equalities
$$x'\beta = c \tag{82}$$
The most obvious/naive potential method here, namely to regress the $K$ indicator variables $y_k$ onto $x$ (producing least squares regression coefficient vectors $\hat\beta_k$) and then to employ
$$\hat f(x) = \arg\max_k\hat f_k(x) = \arg\max_k x'\hat\beta_k$$
fails miserably because of the possibility of "masking" if $K > 2$. One must be smarter than this. Three kinds of smarter alternatives are Linear (and Quadratic) Discriminant Analysis, Logistic Regression, and direct searches for separating hyperplanes. The first two of these are "statistical" in origin, with long histories in the field.

13.1 Linear (and a bit on Quadratic) Discriminant Analysis

Suppose that for $(x, y) \sim P$, $\pi_k = P[y = k]$ and the conditional distribution of $x$ on $\Re^p$ given that $y = k$ is $\text{MVN}_p(\mu_k, \Sigma)$, i.e. the conditional pdf is
$$g_k(x) = (2\pi)^{-p/2}(\det\Sigma)^{-1/2}\exp\left(-\frac{1}{2}(x - \mu_k)'\Sigma^{-1}(x - \mu_k)\right)$$
Then it follows that
$$\ln\left(\frac{P[y = k\,|\,x = u]}{P[y = l\,|\,x = u]}\right) = \ln\left(\frac{\pi_k}{\pi_l}\right) - \frac{1}{2}\mu_k'\Sigma^{-1}\mu_k + \frac{1}{2}\mu_l'\Sigma^{-1}\mu_l + u'\Sigma^{-1}(\mu_k - \mu_l) \tag{83}$$
so that a theoretically optimal predictor/decision rule is
$$f(x) = \arg\max_k\left(\ln(\pi_k) - \frac{1}{2}\mu_k'\Sigma^{-1}\mu_k + x'\Sigma^{-1}\mu_k\right)$$
and boundaries between regions in $\Re^p$ where $f(x) = k$ and $f(x) = l$ are subsets of the sets
$$\left\{x \in \Re^p\ \Big|\ x'\Sigma^{-1}(\mu_k - \mu_l) = \ln\left(\frac{\pi_l}{\pi_k}\right) + \frac{1}{2}\mu_k'\Sigma^{-1}\mu_k - \frac{1}{2}\mu_l'\Sigma^{-1}\mu_l\right\}$$
i.e. are defined by equalities of the form (82).

This is dependent upon all $K$ conditional normal distributions having the same covariance matrix, $\Sigma$. In the event these are allowed to vary, conditional distribution $k$ having covariance matrix $\Sigma_k$, a theoretically optimal predictor/decision rule is
$$f(x) = \arg\max_k\left(\ln(\pi_k) - \frac{1}{2}\ln(\det\Sigma_k) - \frac{1}{2}(x - \mu_k)'\Sigma_k^{-1}(x - \mu_k)\right)$$
and boundaries between regions in $\Re^p$ where $f(x) = k$ and $f(x) = l$ are subsets of the sets
$$\left\{x \in \Re^p\ \Big|\ -\frac{1}{2}(x - \mu_k)'\Sigma_k^{-1}(x - \mu_k) + \frac{1}{2}(x - \mu_l)'\Sigma_l^{-1}(x - \mu_l) = \ln\left(\frac{\pi_l}{\pi_k}\right) + \frac{1}{2}\ln(\det\Sigma_k) - \frac{1}{2}\ln(\det\Sigma_l)\right\}$$
Unless $\Sigma_k = \Sigma_l$, this kind of set is a quadratic surface in $\Re^p$, not a hyperplane. One gets (not linear, but) Quadratic Discriminant Analysis.

Of course, in order to use LDA or QDA, one must estimate the vectors $\mu_k$ and the covariance matrix or matrices $\Sigma_k$ from the training data. Estimating $K$ potentially different matrices $\Sigma_k$ requires estimation of a very large number of parameters. So, thinking about QDA versus LDA, one is again in the situation of needing to find the level of predictor complexity that a given data set will support. QDA is a more flexible/complex method than LDA, but using it in preference to LDA increases the likelihood of over-fit and poor prediction. One idea that has been offered as a kind of continuous compromise between LDA and QDA is, for $\alpha \in (0,1)$, to use
$$\hat\Sigma_k(\alpha) = \alpha\hat\Sigma_k + (1 - \alpha)\hat\Sigma_{\text{pooled}}$$
in place of $\hat\Sigma_k$ in QDA. This kind of thinking even suggests, as an estimate of a covariance matrix common across $k$,
$$\hat\Sigma(\gamma) = \gamma\hat\Sigma_{\text{pooled}} + (1 - \gamma)\widehat{\sigma^2}I$$
for $\gamma \in (0,1)$ and $\widehat{\sigma^2}$ an estimate of variance pooled across groups $k$ and then across coordinates $j$ of $x$, in LDA. Combining these two ideas, one might even invent a two-parameter set of fitted covariance matrices
$$\hat\Sigma_k(\alpha, \gamma) = \alpha\hat\Sigma_k + (1 - \alpha)\left(\gamma\hat\Sigma_{\text{pooled}} + (1 - \gamma)\widehat{\sigma^2}I\right)$$
for use in QDA. Employing these in LDA or QDA provides the flexibility of choosing a complexity parameter or parameters and potentially improving prediction performance.
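A minimal numpy sketch of the two-parameter compromise $\hat\Sigma_k(\alpha, \gamma)$ follows (my own illustration; it simply builds the per-class, pooled, and spherical pieces from training data and mixes them, with labels coded 1 through K).

import numpy as np

def regularized_covariances(X, y, alpha, gamma):
    """Sigma_hat_k(alpha, gamma) = alpha * Sigma_hat_k
       + (1 - alpha) * (gamma * Sigma_hat_pooled + (1 - gamma) * sigma2_hat * I)."""
    classes = np.unique(y)
    N, p = X.shape
    pooled = np.zeros((p, p))
    per_class = {}
    for k in classes:
        Xk = X[y == k]
        Sk = np.cov(Xk, rowvar=False, bias=False)
        per_class[k] = Sk
        pooled += (len(Xk) - 1) * Sk
    pooled /= (N - len(classes))
    sigma2_hat = np.trace(pooled) / p            # variance pooled over coordinates
    common = gamma * pooled + (1 - gamma) * sigma2_hat * np.eye(p)
    return {k: alpha * per_class[k] + (1 - alpha) * common for k in classes}

Using alpha = 0 and gamma = 1 recovers LDA's pooled estimate, while alpha = 1 recovers the per-class QDA estimates.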
Returning specifically to LDA, let
$$\bar\mu = \frac{1}{K}\sum_{k=1}^K\mu_k$$
and note that one is free to replace $x$ and all $K$ means $\mu_k$ with respectively
$$x^* = \Sigma^{-1/2}(x - \bar\mu)\quad\text{and}\quad\mu_k^* = \Sigma^{-1/2}(\mu_k - \bar\mu)$$
This produces
$$\ln\left(\frac{P[y = k\,|\,x^* = u^*]}{P[y = l\,|\,x^* = u^*]}\right) = \ln\left(\frac{\pi_k}{\pi_l}\right) - \frac{1}{2}\|u^* - \mu_k^*\|^2 + \frac{1}{2}\|u^* - \mu_l^*\|^2$$
and (in "sphered" form) the theoretically optimal predictor/decision rule can be described as
$$f(x) = \arg\max_k\left(\ln(\pi_k) - \frac{1}{2}\|x^* - \mu_k^*\|^2\right)$$
That is, in terms of $x^*$, optimal decisions are based on ordinary Euclidean distances to the transformed means $\mu_k^*$.

Further, this form can often be made even simpler/be seen to depend upon a lower-dimensional (than $p$) distance. The $\mu_k^*$ typically span a subspace of $\Re^p$ of dimension $\min(p, K - 1)$. For
$$M_{p\times K} = (\mu_1^*, \mu_2^*, \ldots, \mu_K^*)$$
let $P_M$ be the $p\times p$ projection matrix projecting onto the column space of $M$ in $\Re^p$ ($C(M)$). Then
$$\|x^* - \mu_k^*\|^2 = \|[P_M + (I - P_M)](x^* - \mu_k^*)\|^2 = \|(P_M x^* - \mu_k^*) + (I - P_M)x^*\|^2 = \|P_M x^* - \mu_k^*\|^2 + \|(I - P_M)x^*\|^2$$
the last equality coming because $(P_M x^* - \mu_k^*) \in C(M)$ and $(I - P_M)x^* \in C(M)^\perp$. Since $\|(I - P_M)x^*\|^2$ doesn't depend upon $k$, the theoretically optimal predictor/decision rule can be described as
$$f(x) = \arg\max_k\left(\ln(\pi_k) - \frac{1}{2}\|P_M x^* - \mu_k^*\|^2\right)$$
and theoretically optimal decision rules can be described in terms of the projection of $x^*$ onto $C(M)$ and its distances to the $\mu_k^*$.

Now,
$$\frac{1}{K}MM'$$
is the (typically rank $\min(p, K - 1)$) sample covariance matrix of the $\mu_k^*$ and has an eigen decomposition as
$$\frac{1}{K}MM' = VDV'$$
for $D = \operatorname{diag}(d_1, d_2, \ldots, d_p)$, where
$$d_1 \ge d_2 \ge \cdots \ge d_p$$
are the eigenvalues and the columns of $V$ are orthonormal eigenvectors corresponding in order to the successively smaller eigenvalues of $\frac{1}{K}MM'$. The $v_k$ with $d_k > 0$ specify linear combinations of the coordinates of the $\mu_l^*$, namely $\langle v_k, \mu_l^*\rangle$, with the largest possible sample variances, subject to the constraints that $\|v_k\| = 1$ and $\langle v_l, v_k\rangle = 0$ for all $l < k$. These $v_k$ are $\perp$ vectors in successive directions of most important unaccounted-for spread of the $\mu_k^*$.

This suggests the possibility of "reduced rank LDA." That is, for $l \le \operatorname{rank}(MM')$ define
$$V_l = (v_1, v_2, \ldots, v_l)$$
and let $P_l = V_l V_l'$ be the matrix projecting onto $C(V_l)$ in $\Re^p$. A possible "reduced rank" approximation to the theoretically optimal LDA classification rule is
$$f_l(x) = \arg\max_k\left(\ln(\pi_k) - \frac{1}{2}\|P_l x^* - P_l\mu_k^*\|^2\right)$$
and $l$ becomes a complexity parameter that one might optimize to tune or regularize the method. Note also that for $w \in \Re^p$
$$P_l w = \sum_{k=1}^l\langle v_k, w\rangle v_k$$
For purposes of graphical representation of what is going on in these computations, one might replace the $p$ coordinates of $x^*$ and the means $\mu_k^*$ with the $l$ coordinates of
$$(\langle v_1, x^*\rangle, \langle v_2, x^*\rangle, \ldots, \langle v_l, x^*\rangle)' \tag{84}$$
and of the
$$(\langle v_1, \mu_k^*\rangle, \langle v_2, \mu_k^*\rangle, \ldots, \langle v_l, \mu_k^*\rangle)' \tag{85}$$
It seems to be essentially ordered pairs of these coordinates that are plotted by HTF in their Figures 4.8 and 4.11. In this regard, we need to point out that since any eigenvector $v_k$ could be replaced by $-v_k$ without any fundamental effect in the above development, the vector (84) and all of the vectors (85) could be altered by multiplication of any particular set of coordinates by $-1$. (Whether a particular algorithm for finding eigenvectors produces $v_k$ or $-v_k$ is not fundamental, and there seems to be no standard convention in this regard.) It appears that the pictures in HTF might have been made using the R function lda and its choice of signs for eigenvectors.

The form $x'\beta$ is (of course and by design) linear in the coordinates of $x$.
An obvious natural generalization of this discussion is to consider discriminants that are linear in some (non-linear) functions of the coordinates of $x$. This is simply choosing some $M$ basis functions/transforms/features $h_m(x)$ and replacing the $p$ coordinates of $x$ with the $M$ coordinates of $(h_1(x), h_2(x), \ldots, h_M(x))$ in the development of LDA. Of course, upon choosing basis functions that are all coordinates, squares of coordinates, and products of coordinates of $x$, one produces linear (in the basis functions) discriminants that are general quadratic functions of $x$. Or, data-dependent basis functions that are $K(\cdot, x_i)$ for some kernel function $K$ produce "LDA" that depends upon the value of some "basis function network." The possibilities opened here are myriad and (as always) "the devil is in the details."

13.2 Logistic Regression

A generalization of the MVN conditional distribution result (83) is an assumption that for all $k < K$
$$\ln\left(\frac{P[y = k\,|\,x = u]}{P[y = K\,|\,x = u]}\right) = \beta_{k0} + u'\beta_k \tag{86}$$
Here there are $K - 1$ constants $\beta_{k0}$ and $K - 1$ $p$-vectors $\beta_k$ to be specified, not necessarily tied to class mean vectors or a common within-class covariance matrix for $x$. In fact, the set of relationships (86) does not fully specify a joint distribution for $(x, y)$. Rather, they only specify the nature of the conditional distributions of $y\,|\,x = u$. (In this regard, the situation is exactly analogous to that in ordinary simple linear regression. A bivariate normal distribution for $(x, y)$ gets one normal conditional distributions for $y$ with a constant variance and mean linear in $x$. But one may make those assumptions conditionally on $x$ without assuming anything about the marginal distribution of $x$, which in the bivariate normal model is univariate normal.)

Using $\theta$ as shorthand for a vector containing all the constants $\beta_{k0}$ and the vectors $\beta_k$, the linear log odds assumption (86) produces the forms
$$P[y = k\,|\,x = u] = p_k(u, \theta) = \frac{\exp(\beta_{k0} + u'\beta_k)}{1 + \sum_{\ell=1}^{K-1}\exp(\beta_{\ell 0} + u'\beta_\ell)}\quad\text{for } k < K$$
and
$$P[y = K\,|\,x = u] = p_K(u, \theta) = \frac{1}{1 + \sum_{\ell=1}^{K-1}\exp(\beta_{\ell 0} + u'\beta_\ell)}$$
and a theoretically optimal (under 0-1 loss) predictor/classification rule is
$$f(x) = \arg\max_k p_k(x, \theta)$$

13.3 Separating Hyperplanes

In the $K = 2$ group case, if there is a $\beta \in \Re^p$ and real number $\beta_0$ such that in the training data $y = 2$ exactly when $x'\beta + \beta_0 > 0$, a "separating hyperplane"
$$\{x \in \Re^p\,|\,x'\beta + \beta_0 = 0\}$$
can be found via logistic regression. The (conditional) likelihood will not have a maximum, but if one follows a search path far enough toward a limiting value of 0 for the log-likelihood (or 1 for the likelihood), satisfactory $\beta \in \Re^p$ and $\beta_0$ from an iteration of the search algorithm will produce separation.

A famous older algorithm for finding a separating hyperplane is the so-called "perceptron" algorithm. It can be defined as follows. Temporarily treat $y$ taking values in $\mathcal{G} = \{1, 2\}$ as real-valued, and define $y^* = 2y - 3$ (so that $y^*$ takes real values $\pm 1$). From some starting points $\beta^0$ and $\beta_0^0$, cycle through the training data cases in order (repeatedly as needed). At any iteration $l$, take
$$\beta^l = \beta^{l-1}\ \text{and}\ \beta_0^l = \beta_0^{l-1}\quad\text{if}\quad\begin{cases}y_i = 2\ \text{and}\ x_i'\beta^{l-1} + \beta_0^{l-1} > 0,\ \text{or}\\ y_i = 1\ \text{and}\ x_i'\beta^{l-1} + \beta_0^{l-1} \le 0\end{cases}$$
and
$$\beta^l = \beta^{l-1} + y_i^* x_i\ \text{and}\ \beta_0^l = \beta_0^{l-1} + y_i^*\quad\text{otherwise}$$
This will eventually identify a separating hyperplane: one has arrived at such a hyperplane when a series of $N$ successive iterations fails to change the values of $\beta$ and $\beta_0$. If there is a separating hyperplane, it will typically not be unique. One can attempt to define and search for "optimal" such hyperplanes that, e.g., maximize the distance from the plane to the closest training vector. See HTF Section 4.5.2 in this regard.
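A minimal numpy implementation sketch of the perceptron updates just described (my own illustration, with labels coded 1 and 2 as above) is:

import numpy as np

def perceptron(X, y, max_passes=1000):
    """Perceptron updates of Section 13.3, with y* = 2y - 3 in {-1, +1} for
    class labels y in {1, 2}; returns (beta, beta0), which define a separating
    hyperplane when the training data are linearly separable."""
    ystar = 2 * y - 3
    beta = np.zeros(X.shape[1])
    beta0 = 0.0
    for _ in range(max_passes):
        changed = False
        for xi, yi in zip(X, ystar):
            g = xi @ beta + beta0
            if (yi == 1 and g <= 0) or (yi == -1 and g > 0):   # case currently misclassified
                beta += yi * xi
                beta0 += yi
                changed = True
        if not changed:                 # a full pass with no updates: done
            break
    return beta, beta0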
14 Support Vector Machines

Consider a 2-class classification problem. For notational convenience, we'll suppose that output $y$ takes values in $\mathcal{G} = \{-1, 1\}$. Our present concern is a further development of linear classification methodology beyond that provided in Section 13. For $\beta \in \Re^p$ and $\beta_0 \in \Re$ we'll consider the form
$$g(x) = x'\beta + \beta_0 \tag{87}$$
and a theoretical predictor/classifier
$$f(x) = \operatorname{sign}(g(x)) \tag{88}$$
We will approach the problem of choosing $\beta$ and $\beta_0$ so as to, in some sense, provide a maximal cushion around a hyperplane separating the $x_i$ with corresponding $y_i = 1$ from the $x_i$ with corresponding $y_i = -1$.

14.1 The Linearly Separable Case

In the case that there is a classifier of form (88) with 0 training error rate, we consider the optimization problem
$$\underset{u\ \text{with}\ \|u\| = 1\ \text{and}\ \beta_0\in\Re}{\text{maximize}}\ M\quad\text{subject to}\quad y_i(x_i'u + \beta_0) \ge M\ \forall i \tag{89}$$
This can be thought of in terms of choosing a unit vector $u$ (or direction) in $\Re^p$ so that, upon projecting the training input vectors $x_i$ onto the subspace of multiples of $u$, there is maximum separation between the convex hull of projections of the $x_i$ with $y_i = -1$ and the convex hull of projections of the $x_i$ with $y_i = 1$. (The sign on $u$ is chosen to give the latter larger $x_i'u$ than the former.) Notice that if $u$ and $\beta_0$ solve this optimization problem,
$$M = \frac{1}{2}\left(\min_{x_i\ \text{with}\ y_i = 1}x_i'u - \max_{x_i\ \text{with}\ y_i = -1}x_i'u\right)\quad\text{and}\quad\beta_0 = -\frac{1}{2}\left(\min_{x_i\ \text{with}\ y_i = 1}x_i'u + \max_{x_i\ \text{with}\ y_i = -1}x_i'u\right)$$
For purposes of applying standard optimization theory and software, it is useful to reformulate the basic problem (89) several ways. First, note that (89) may be rewritten as
$$\underset{u\ \text{with}\ \|u\| = 1\ \text{and}\ \beta_0\in\Re}{\text{maximize}}\ M\quad\text{subject to}\quad y_i\left(x_i'\frac{u}{M} + \frac{\beta_0}{M}\right) \ge 1\ \forall i \tag{90}$$
Then if we let
$$\beta = \frac{u}{M}$$
it's the case that
$$\|\beta\| = \frac{1}{M}\quad\text{or}\quad M = \frac{1}{\|\beta\|}$$
so that (90) can be rewritten (with the intercept redefined accordingly) as
$$\underset{\beta\in\Re^p\ \text{and}\ \beta_0\in\Re}{\text{minimize}}\ \frac{1}{2}\|\beta\|^2\quad\text{subject to}\quad y_i(x_i'\beta + \beta_0) \ge 1\ \forall i \tag{91}$$
This formulation (91) is that of a convex (quadratic criterion, linear inequality constraints) optimization problem for which there exists standard theory and algorithms. The so-called primal functional corresponding to (91) is (for $\alpha \in \Re^N$)
$$F_P(\beta, \beta_0, \alpha) \equiv \frac{1}{2}\|\beta\|^2 - \sum_{i=1}^N\alpha_i\big(y_i(x_i'\beta + \beta_0) - 1\big)\quad\text{for } \alpha \ge 0$$
To solve (91), one may for each $\alpha \ge 0$ choose $(\beta(\alpha), \beta_0(\alpha))$ to minimize $F_P(\beta, \beta_0, \alpha)$ and then choose $\alpha$ to maximize $F_P(\beta(\alpha), \beta_0(\alpha), \alpha)$.

The Karush-Kuhn-Tucker conditions are necessary and sufficient for solution of a constrained optimization problem. In the present context they are the gradient conditions
$$\frac{\partial F_P(\beta, \beta_0, \alpha)}{\partial\beta_0} = -\sum_{i=1}^N\alpha_i y_i = 0 \tag{92}$$
and
$$\frac{\partial F_P(\beta, \beta_0, \alpha)}{\partial\beta} = \beta - \sum_{i=1}^N\alpha_i y_i x_i = 0 \tag{93}$$
the feasibility conditions
$$y_i(x_i'\beta + \beta_0) - 1 \ge 0\ \forall i \tag{94}$$
the non-negativity conditions
$$\alpha \ge 0 \tag{95}$$
and the orthogonality conditions
$$\alpha_i\big(y_i(x_i'\beta + \beta_0) - 1\big) = 0\ \forall i \tag{96}$$
Now (92) and (93) are
$$\sum_{i=1}^N\alpha_i y_i = 0\quad\text{and}\quad\beta = \sum_{i=1}^N\alpha_i y_i x_i \equiv \beta(\alpha) \tag{97}$$
and plugging these into $F_P(\beta, \beta_0, \alpha)$ gives a function of $\alpha$ only,
$$F_D(\alpha) = \frac{1}{2}\|\beta(\alpha)\|^2 - \sum_{i=1}^N\alpha_i\big(y_i x_i'\beta(\alpha) - 1\big) = \sum_i\alpha_i - \frac{1}{2}\sum_i\sum_j\alpha_i\alpha_j y_i y_j x_i'x_j = \mathbf{1}'\alpha - \frac{1}{2}\alpha'H\alpha$$
for
$$H = (y_i y_j x_i'x_j)_{N\times N} \tag{98}$$
Then the "dual" problem is
$$\text{maximize}\ \ \mathbf{1}'\alpha - \frac{1}{2}\alpha'H\alpha\quad\text{subject to}\quad\alpha \ge 0\ \text{and}\ \alpha'y = 0 \tag{99}$$
and apparently this problem is easily solved. Now condition (96) implies that if $\alpha_i^{\text{opt}} > 0$,
$$y_i\left(x_i'\beta\left(\alpha^{\text{opt}}\right) + \beta_0^{\text{opt}}\right) = 1$$
so that

1. by (94) the corresponding $x_i$ has minimum $x_i'\beta(\alpha^{\text{opt}})$ for training vectors with $y_i = 1$, or maximum $x_i'\beta(\alpha^{\text{opt}})$ for training vectors with $y_i = -1$ (so that $x_i$ is a support vector for the "slab" of thickness $2M$ around a separating hyperplane),

2. $\beta_0^{\text{opt}}$ may be determined using the corresponding $x_i$ from
$$y_i\left(x_i'\beta\left(\alpha^{\text{opt}}\right) + \beta_0^{\text{opt}}\right) = 1$$
i.e.
0 opt = yi x0i opt (apparently for reasons of numerical stability it is common practice to average values yi x0i ( opt ) for support vectors in order to evaluate opt )) and, 0( 3. 1 = yi = yi 0 0 opt opt 0 + yi @ + N X j=1 99 N X j=1 10 opt A j yj xj opt 0 j yj yi xj xi xi PN The fact (97) that ( ) i=1 i yi xi implies that only the training cases with i > 0 (typically corresponding to a relatively few support vectors) determine the nature of the solution to this optimization problem. Further, for SV the set indices of support vectors in the problem, X X opt opt 2 0 opt = j yi yj xi xj i i2SV j2SV = X opt i i2SV = X opt 0 j yi yj xj xi j2SV X opt i 1 yi opt 0 i2SV = X opt i i2SV the next to last of these following from 3. above, and the last following from the gradient condition (92). Then the "margin" for this problem is simply M= 14.2 1 k ( = qP opt )k 1 i2SV opt i The Linearly Non-separable Case In a linearly non-separable case, the convex optimization problem (91) does not have a solution (no pair 2 <p and 0 2 < provides yi (x0i + 0 ) 1 8i). We might, therefore (in looking for good choices of 2 <p and 0 2 <) try to relax the constraints of the problem slightly. That is, suppose that i 0 and consider the set of constraints yi (x0i + 0) + i 1 8i (the i are called "slack" variables and provide some "wiggle room" in search for a hyperplane that "nearly" separates the two classes with a good margin). We might try to control the total amount of slack allowed by setting a limit N X i C i=1 for some positive C. Note that if yi (x0i + 0 ) 0 case i is correctly classi…ed in the training set, and so if for some pair 2 <p and 0 2 < this holds for all i, we have a separable problem. So any non-separable problem must have at least one negative yi (x0i + 0 ) for any 2 <p and 0 2 < pair. This in turn requires that the budget C must be at least 1 for a non-separable problem to have a solution even with the addition of slack variables. In fact, this reasoning implies that a budget C allows for at most C mis-classi…cations in the training set. And in a non-separable case, C must be allowed to be large enough so that 100 some choice of 2 <p and 0 2 < produces a classi…er with training error rate no larger than C=N . In any event, we consider the optimization problem 1 2 k k minimize p 2 2< and 0 2 < yi (x0i + 0 ) + i 1 8i PN for some i 0 with i=1 i subject to C (100) that can be thought of as generalizing the problem (91). HTF prefer to motivate (100) as equivalent to maximize M u with kuk = 1 and 0 2 < yi (x0i u + for some i subject to 0) M (1 ) 8i PN i 0 with C i=1 i generalizing the original problem (89), but it’s not so clear that this is any more natural than simply beginning with (100). A more convenient version of (100) is N X 1 2 minimize k k +C 2 2 <p i=1 and 0 2 < i subject to yi (x0i + 0 ) + for some i i 0 1 8i (101) A nice development on pages 376-378 of Izenman’s book provides the following solution to this problem (101) parallel to the development in Section 14.1. Generalizing (99), the "dual" problem boils down to maximize 10 1 2 0 subject to 0 H C 1 and 0 y=0 (102) for H = (yi yj x0i xj ) N (103) N The constraint 0 C 1 is known as a "box constraint" and the "feasible region" prescribed in (102) is the intersection of a hyperplane de…ned by 0 y = 0 and a "box" in the positive orthant. The C = 1 version of this reduces to the "hard margin" separable case. 
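A minimal numerical sketch of the dual problem (102), assuming Python with numpy and the cvxopt quadratic-programming solver (neither is used in this outline, and the small ridge added to $H$ is only a numerical convenience, not part of the formulation). The recovery of $\beta$ and $\beta_0$ at the end anticipates the formulas developed next.

    import numpy as np
    from cvxopt import matrix, solvers

    rng = np.random.default_rng(3)
    # two overlapping Gaussian classes, y in {-1, +1}
    X = np.r_[rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))]
    y = np.r_[np.full(50, -1.0), np.full(50, 1.0)]
    N, C = len(y), 1.0

    H = (y[:, None] * X) @ (y[:, None] * X).T        # H = (y_i y_j x_i'x_j)
    P = matrix(H + 1e-8 * np.eye(N))                 # tiny ridge for numerical stability only
    q = matrix(-np.ones(N))                          # minimize (1/2)a'Ha - 1'a
    G = matrix(np.r_[-np.eye(N), np.eye(N)])         # box constraint 0 <= alpha_i <= C
    h = matrix(np.r_[np.zeros(N), C * np.ones(N)])
    A = matrix(y.reshape(1, -1))                     # alpha'y = 0
    b = matrix(0.0)

    solvers.options["show_progress"] = False
    alpha = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()

    sv = alpha > 1e-6                                # support vectors (alpha_i > 0)
    beta = (alpha * y)[sv] @ X[sv]                   # beta = sum_i alpha_i y_i x_i
    edge = sv & (alpha < C - 1e-6)                   # support vectors on the edge of the margin
    beta0 = np.mean(y[edge] - X[edge] @ beta)        # average, as suggested for stability
    print("number of support vectors:", int(sv.sum()))
    print("training error rate:", float(np.mean(np.sign(X @ beta + beta0) != y)))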
Upon solving (102) for opt , the optimal 2 <p is of the form X opt opt = (104) i yi xi i2SV for SV the indices of set of "support vectors" xi which have iopt > 0. The points with 0 < iopt < C will lie on the edge of the margin (have i = 0) and the ones with iopt = C have i > 0. Any of the support vectors on the edge of the margin (with 0 < iopt < C ) may be used to solve for 0 2 < as 0 opt = yi 101 x0i opt (105) and again, apparently for reasons of numerical stability it is common practice to average values yi x0i ( opt ) for such support vectors in order to evaluate opt ). Clearly, in this process the constant C functions as a regular0( ization/complexity parameter and large C in (101) correspond to small C in (100). Identi…cation of a classi…er requires only solution of the dual problem (102), evaluation of (104) and (105) to produce linear form (87) and classi…er (88). 14.3 SVM’s and Kernels The form (87) is (of course and by design) linear in the coordinates of x. A natural generalization of this development would be to consider forms that are linear in some (non-linear) functions of the coordinates of x. There is nothing really new or special to SVM classi…ers associated with this possibility if it is applied by simply de…ning some basis functions hm (x) and considering form 0 B B g (x) = B @ h1 (x) h2 (x) .. . hM (x) 10 C C C A + 0 for use in a classi…er. Indeed, HTF raise this kind of possibility already in their Chapter 4, where they introduce linear classi…cation methods. However, the fact that in both linearly separable and linearly non-separable cases, optimal SVM’s depend upon the training input vectors xi only through their inner products (see again (98) and (103)) and our experience with RKHS’s and the computation of inner products in function spaces in terms of kernel values suggests another way in which one might employ linear forms of nonlinear functions in classi…cation. 14.3.1 Heuristics Let K be a non-negative de…nite kernel and consider the possibility of using functions K (x; x1 ) ; K (x; x2 ) ; : : : ; K (x; xN ) to build new (N -dimensional) feature vectors 0 1 K (x; x1 ) B K (x; x2 ) C B C k (x) = B C .. @ A . K (x; xN ) for any input vector x (including the xi in the training set) and rather than de…ning inner products for new feature vectors (for input vectors x and z) in terms of <N inner products 0 k (x) k (z) = N X K (x; xk ) K (z; xk ) k=1 102 we instead consider using the RKHS inner products of corresponding functions hK (x; ) ; K (z; )iH = K (x; z) K Then, in place of (98) or (103) de…ne H = (yi yj K (xi ; xj )) N and let opt (106) N solve either (99) or (102). Then with opt = N X opt i yi k (xi ) i=1 as in the developments of the previous sections, we replace the <N inner product of ( opt ) and a feature vector k (x) with *N + N X opt X opt = i yi K (xi ; ) ; K (x; ) i yi hK (xi ; ) ; K (x; )iH K i=1 i=1 HK = N X opt i yi K (x; xi ) i=1 Then for any i for which iopt > 0 (an index corresponding to a support feature vector in this context) we set 0 opt = yi N X opt j yj K (xi ; xj ) j=1 and have an empirical analogue of (87) (for the kernel case) g^ (x) = N X opt i yi K (x; xi ) + 0 opt (107) i=1 with corresponding classi…er f^ (x) = sign (^ g (x)) (108) as an empirical analogue of (88). It remains to argue that this classi…er developed simply by analogy is a solution to any well-posed optimization problem. 
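The heuristic can at least be checked numerically. A minimal sketch assuming scikit-learn's SVC (an assumption about tooling): after fitting a radial-basis-kernel SVM, the stored dual coefficients (which hold $\alpha_i y_i$ for the support vectors), the support vectors themselves, and the intercept are enough to rebuild $\hat{g}(x)$ of form (107) by hand and match the library's own decision function.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    # two "ring" classes that are not linearly separable in R^2
    r = np.r_[rng.uniform(0.0, 1.0, 100), rng.uniform(1.5, 2.5, 100)]
    th = rng.uniform(0, 2 * np.pi, 200)
    X = np.c_[r * np.cos(th), r * np.sin(th)]
    y = np.r_[np.full(100, -1), np.full(100, 1)]

    gam = 0.5
    svc = SVC(kernel="rbf", gamma=gam, C=10.0).fit(X, y)

    # reconstruct g(x) = sum_i alpha_i y_i K(x, x_i) + beta0 from the fitted pieces
    def g_hat(x):
        K = np.exp(-gam * np.sum((svc.support_vectors_ - x) ** 2, axis=1))
        return (svc.dual_coef_ @ K + svc.intercept_)[0]   # dual_coef_ holds alpha_i * y_i

    x_new = np.array([0.2, -0.3])
    print(g_hat(x_new), svc.decision_function(x_new.reshape(1, -1))[0])  # these agree
    print("training error rate:", float(np.mean(svc.predict(X) != y)))

The agreement of the two printed values is just a check that the fitted classifier really is of the form (107)/(108); it says nothing yet about optimality, which is the subject of the next subsection.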
14.3.2 An Optimality Argument The heuristic argument for the use of kernels in the SVM context to produce form (107) and classi…er (108) is clever enough that some authors simply let it stand on its own as "justi…cation" for using "the kernel trick" of replacing 103 <N inner products of feature vectors with HK inner products of basis functions. A more satisfying argument would be based on an appeal to optimality/regularization considerations. HTF’s treatment of this possibility is not helpful. The following is based on a 2002 Machine Learning paper of Lin, Wahba, Zhang, and Lee. As in Section 11 consider the RKHS associated with the non-negative de…nite kernel K, and the optimization problem involving the "hinge loss"8 N X minimize g 2 HK and 0 2 < (1 yi ( 0 1 2 kgkHK 2 + g (xi )))+ + i=1 (109) Exactly as noted in Section 11 an optimizing g 2 HK above must be of the form g (x) = = N X jK (x; xj ) j=1 0 k (x) so the minimization problem is minimize 2 <N and 0 2 < N X 1 yi 0 + i=1 0 2 N 1 X k (xi ) + + 2 j=1 jK (x; xj ) HK that is, minimize 2 <N and 0 2 < N X 1 yi 0 + 0 k (xi ) + i=1 + 1 2 0 K for K = (K (xi ; xj )) Now apparently, this is equivalent to the optimization problem minimize 2 <N and 0 2 < N X i + i=1 1 2 0 subject to K which (also apparently) for H N N yi 0 k (xi ) + 0 + i for some i 0 1 8i = (yi yj K (xi ; xj )) as in (106) has the dual problem of the form maximize 10 1 2 0 H subject to 0 1 and 0 y=0 (110) 8 The so-called hinge loss is convenient to use here in making optimality arguments. For (x; y) P it is reasonably straighforward to argue that the function h minimizing the expected hinge loss E[1 yh (x)]+ is ho p t (u) =sign P [y = 1jx = u] 21 , namely the minimum risk classi…er under 0-1 loss. 104 or maximize 10 1 2 0 1 2 subject to 0 H 1 and 0 y = 0 (111) That is, the function space optimization problem (109) has a dual that is the same as (102) for the choice of C = and kernel 12 K (x; z) produced by the heuristic argument in Section 14.3.1. Then, if opt is a solution to (110), Lin et al. say that an optimal 2 <N is 1 diag (y1 ; : : : ; yN ) opt this producing coe¢ cients to be applied to the functions K ( ; xi ). On the other hand, the heuristic of Section 14.3.1 prescribes that for opt the solution to (111) coe¢ cients in the vector diag (y1 ; : : : ; yN ) opt get applied to the functions 12 K ( ; xi ). Upon recognizing that opt = 1 opt it becomes evident that for the choice of C = and kernel 12 K ( ; ), the heuristic in Section (14.3.1) produces a solution to the optimization problem (109). 14.4 Other Support Vector Stu¤ Several other issues related to the kind of arguments used in the development of SVM classi…ers are discussed in HTF (and Izenman). One is the matter of multi-class problems. That is, where G = f1; 2; : : : ; Kg how might one employ machinery of this kind? There are both heuristic and optimality-based methods in the literature. A heuristic one-against-all strategy might be the following. Invent 2-class problems (K of them), the kth based on yki = 1 1 if yi = k otherwise Then for (a single) C and k = 1; 2; : : : ; K solve the (possibly linearly nonseparable) 2-class optimization problems to produce functions g^k (x) (that would lead to one-versus-all classi…ers f^k (x) = sign(^ gk (x)). A possible overall classi…er is then f^ (x) = arg max g^k (x) k2G A second heuristic strategy is to develop a voting scheme based on pair-wise comparisons. 
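A minimal sketch of the one-against-all heuristic just described, assuming Python with numpy and scikit-learn's SVC (tooling chosen for illustration only; the helper names here are not standard). The library's decision_function supplies the fitted $\hat{g}_k$, and classification is to the largest of these.

    import numpy as np
    from sklearn.svm import SVC

    def one_vs_all_svm(X, y, classes, C=1.0, **kernel_args):
        """Fit K two-class SVMs, the k-th coding class k as +1 and all others as -1."""
        machines = {}
        for k in classes:
            yk = np.where(y == k, 1, -1)
            machines[k] = SVC(C=C, **kernel_args).fit(X, yk)
        return machines

    def one_vs_all_predict(X, machines):
        # classify to argmax_k g_k(x); decision_function plays the role of g_k
        G = np.column_stack([m.decision_function(X) for m in machines.values()])
        keys = list(machines.keys())
        return np.array([keys[i] for i in np.argmax(G, axis=1)])

    # small demonstration with three Gaussian classes
    rng = np.random.default_rng(4)
    X = np.r_[rng.normal((0, 0), 1, (60, 2)),
              rng.normal((4, 0), 1, (60, 2)),
              rng.normal((0, 4), 1, (60, 2))]
    y = np.repeat([1, 2, 3], 60)
    machines = one_vs_all_svm(X, y, classes=[1, 2, 3], kernel="linear", C=1.0)
    print("training error rate:", float(np.mean(one_vs_all_predict(X, machines) != y)))

The pair-wise (one-versus-one) voting strategy mentioned above can be assembled from the same two-class pieces.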
That is, one might invent K problems of classifying class l 2 versus class m for l < m, choose a single C and solve the (possibly linearly non-separable) 2-class optimization problems to produce functions g^lm (x) and 105 corresponding classi…ers f^lm (x) = sign(^ glm (x)). For m > l de…ne f^ml (x) = f^lm (x) and de…ne an overall classi…er by 0 1 X f^ (x) = arg max @ f^km (x)A k2G m6=k or, equivalently 0 f^ (x) = arg max @ k2G X m6=k 1 h i I f^km (x) = 1 A In addition to these fairly ad hoc methods of trying to extend 2-class SVM technology to K-class problems, there are developments that directly address the problem (from an overall optimization point of view). Pages 391-397 of Izenman provide a nice summary of a 2004 paper of Lee, Lin, and Wabha in this direction. Another type of question related to the support vector material is the extent to which similar methods might be relevant in regression-type prediction problems. As a matter of fact, there are loss functions alternative to squared error or absolute error that lead naturally to the use of the kind of technology needed to produce the SVM classi…ers. That is, one might consider so called " insensitive" losses for prediction like L1 (y; yb) = max (0; jy or L2 (y; yb) = max 0; (y ybj ) 2 yb) and be led to the kind of optimization methods employed in the SVM classi…cation context. See Izenman pages 398-401 in this regard. 15 Boosting The basic idea here is to take even a fairly simple (and poorly performing form of) predictor and by sequentially modifying/perturbing it and re-weighting (or modifying) the training data set, to creep toward an e¤ective predictor. 15.1 The AdaBoost Algorithm for 2-Class Classi…cation Consider a 2-class 0-1 loss classi…cation problem. We’ll suppose that output y takes values in G = f 1; 1g (this coding of the states is again more convenient than f0; 1g or f1; 2g codings for our present purposes). The AdaBoost.M1 algorithm is built on some base classi…er form f (that could be, for example, the simplest possible tree classi…er, possessing just 2 …nal nodes). It proceeds as follows. 106 1. Initialize weights on the training data (xi ; yi ) at 1 for i = 1; 2; : : : ; N N wi1 2. Fit a G-valued predictor/classi…er f^1 to the training data to optimize N h i X I yi 6= f^ (xi ) i=1 let err1 = and de…ne 1 N i 1 X h I yi 6= f^1 (xi ) N i=1 = ln 1 err1 err1 3. Set new weights on the training data h i 1 exp 1 I yi 6= f^1 (xi ) wi2 = N for i = 1; 2; : : : ; N (This up-weights mis-classi…ed observations by a factor of (1 err1 ) =err1 ).) 4. For m = 2; 3; : : : ; M (a) Fit a G-valued predictor/classi…er f^m to the training data to optimize N X i=1 (b) Let errm = h i wim I yi 6= f^ (xi ) PN i=1 (c) Set m h i wim I yi 6= f^m (xi ) PN i=1 wim = ln 1 errm errm (d) Update weights as wi(m+1) = wim exp mI h i yi 6= f^m (xi ) for i = 1; 2; : : : ; N 5. Output a resulting classi…er that is based on "weighted voting" by the classi…ers f^m as ! " M # M X X ^ ^ f^ (x) = sign = 1 2I m fm (x) m fm (x) < 0 m=1 m=1 (Classi…ers with small errm get big positive weights in the "voting.") 107 This apparently turns out to be a very e¤ective way of developing a classi…er, and has been called the best existing "o¤-the-shelf" methodology for the problem. 15.2 Why AdaBoost Might Be Expected to Work HTF o¤er some motivation for the e¤ectiveness of AdaBoost. This is related to the optimization of an expected "exponential loss." 
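A minimal sketch of the AdaBoost.M1 algorithm of Section 15.1, assuming Python with numpy and scikit-learn depth-1 trees ("stumps") as the base classifiers (tooling chosen for illustration only); the weights, $\mathrm{err}_m$, $\alpha_m$, and the final weighted vote follow steps 1 through 5 above.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_m1(X, y, M=50):
        """y must be coded as +/-1.  Returns (alphas, stumps)."""
        N = len(y)
        w = np.full(N, 1.0 / N)                        # step 1: w_i1 = 1/N
        alphas, stumps = [], []
        for m in range(M):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)           # fit f_m to the weighted training data
            miss = (stump.predict(X) != y).astype(float)
            err = np.sum(w * miss) / np.sum(w)         # weighted error rate err_m
            err = min(max(err, 1e-10), 1 - 1e-10)      # guard against err_m of exactly 0 or 1
            alpha = np.log((1 - err) / err)            # alpha_m
            w = w * np.exp(alpha * miss)               # up-weight misclassified cases
            alphas.append(alpha)
            stumps.append(stump)
        return np.array(alphas), stumps

    def adaboost_predict(X, alphas, stumps):
        votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
        return np.sign(votes)                          # weighted "vote" of the f_m

    # tiny demonstration on synthetic data
    rng = np.random.default_rng(2)
    X = rng.normal(size=(300, 2))
    y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.2, 1, -1)
    alphas, stumps = adaboost_m1(X, y, M=100)
    print("training error:", float(np.mean(adaboost_predict(X, alphas, stumps) != y)))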
15.2.1 Expected "Exponential Loss" for a Function of x For g an arbitrary function of x, consider the quantity E exp ( yg (x)) = EE [exp ( yg (x)) jx] (112) If g (x) were typically very large positively when y = 1 and very large negatively when y = 1, this quantity (112) could be expected to be small. So if one were in possession of such a function of x, one might expect to produce a sensible classi…er as f (x) = sign (g (x)) = 1 2I [g (x) < 0] (113) A bit di¤erently put, a g constructed to make (112) small should also produce a good classi…er as (113). It’s a bit of a digression, but HTF point out that it’s actually fairly easy to see what g (x) optimizes (112). One needs only identify for each u what value a optimizes E [exp ( ay) jx = u] That is, E [exp ( ay) jx = u] = exp ( a) P [y = 1jx = u] + exp (a) P [y = 1jx = u] and an optimal a is easily seen to be half the log odds ratio, i.e. the g optimizing (112) is P [y = 1jx = u] 1 g (u) = ln 2 P [y = 1jx = u] 15.2.2 Sequential Search for a g (x) With Small Expected Exponential Loss and the AdaBoost Algorithm Consider "base classi…ers" hl (x; l ) (taking values in G = f 1; 1g) with parameters l and functions built from them of the form gm (x) = m X l hl (x; l) l=1 (for training-data-dependent l and gm (x) = gm 1 l ). Obviously, with this notation (x) + 108 m hm (x; m) We might think of successive ones of these as perturbations of the previous ones (by addition of the term m hm (x; m ) to gm 1 (x)). We might further think about how such functions might be successively de…ned in an e¤ort to make them e¤ective as producing small values of (112). Of course, since we typically do not have available a complete probability model for (x; y) we will only be able to seek to optimize/make small an empirical version of (112). As it turns out, taking thesePpoints of view produces a positive multiple of the AdaBoost M voting function m=1 m f^m (x) as gM (x). (It then follows that AdaBoost might well produce e¤ective classi…ers by the logic that g producing small (112) ought to translate to a good classi…er via (113).) So consider an empirical loss that amounts to an empirical version of N times (112) for gm N X Em = exp ( yi gm (xi )) i=1 This is Em = N X exp ( yi gm 1 (xi ) yi m hm exp ( yi gm 1 (xi )) exp ( yi (xi ; m )) i=1 = N X m hm (x; m )) i=1 Let vim = exp ( yi gm 1 (xi )) and consider optimal choice of m and m > 0 (for purposes of making gm the best possible perturbation of gm 1 in terms of minimizing Em ). (We may assume that l > 0 provided the set of classi…ers de…ned by hl (x; l ) as l varies includes both h and h for any version of the base classi…er, h.) Write X X Em = vim exp ( m ) + vim exp ( m ) i with hm (xi ; m )=yi = (exp ( m) i with hm (xi ; m )6=yi exp ( m )) N X i=1 + exp ( m) N X vim I [hm (xi ; m) 6= yi ] vim i=1 So independent of the value of m > 0, m needs to be chosen to optimize the vim -weighted error rate of hm (x; m ). Call the optimized (using weights vim ) version of hm (x; m ) by the name hm (x), and notice that the prescription for choice of this base classi…er is exactly the AdaBoost prescription of how to choose f^m (based on weights wim ). 109 So now consider minimization of N X exp ( yi hm (xi )) i=1 over choices of . This is 0 B )B @ exp ( X X vim + i with hm (xi ; m )6=yi i with hm (xi ; m )=yi = exp ( ) N X vim + i=1 vim exp (2 N X vim (exp (2 m) i=1 1 C C m )A ! 
1) I [hm (xi ) 6= yi ] and minimization of it is equivalent to minimization of exp ( ) 1 + (exp (2 Let errhmm and see that choice of = m) PN i=1 1) PN i=1 vim I [hm (xi ) 6= yi ] PN i=1 vim ! vim I [hm (xi ) 6= yi ] PN i=1 vim requires minimization of exp ( 1) errhmm ) 1 + (exp (2 ) A bit of calculus shows that the optimizing m = 1 ln 2 1 is errhmm errhmm (114) Notice that the prescription (114) calls for a coe¢ cient that is derived as exactly half m , the coe¢ cient of f^m in the AdaBoost voting function (starting from weights wim ) prescribed in 4b. of the AdaBoost algorithm. (And note that as regards sign, the 1=2 is completely irrelevant.) So to …nish the argument that successive choosing of gm ’s amounts to AdaBoost creation of a voting function, it remains to consider how the weights vim get updated. For moving from gm to gm+1 one uses weights vi(m+1) = exp ( yi gm (xi )) = exp ( yi (gm 1 = vim exp ( yi m hm = vim exp ( = vim exp (2 (xi ) + m hm (xi ))) (xi )) m (2I [hm (xi ) 6= yi ] mI 1)) [hm (xi ) 6= yi ]) exp ( 110 m) Since exp ( m ) is constant across i, it is irrelevant to weighting, and since the prescription for m produces half what AdaBoost prescribes in 4b. for m the weights used in the choice of m+1 and hm+1 x; m+1 are derived exactly according to the AdaBoost algorithm for updating weights. So, at the end of the day, gm ’s are built from base classi…ers to creep towards a function optimizing an empirical version of Eexp ( yg (x)). Since it is easy to see that g1 corresponds to the …rst AdaBoost step, gM is further 1=2 of the AdaBoost voting function. The gm ’s generate the same classi…er as the AdaBoost algorithm. 15.3 Other Forms of Boosting One might, in general, think about strategies for sequentially modifying/perturbing a predictor (perhaps based on a modi…ed version of a data set) as a way of creeping towards a good predictor. Here is a general "gradient boosting algorithm" (taken from Izenman and apparently originally due to Friedman and implemented in the R package gbm) aimed in this direction for a general loss function and general class of potential updates for a predictor. PN 1. Start with f^0 (x) = arg min i=1 L (yi ; yb) : y b 2. For m = 1; 2; : : : ; M @ b) @y b L (yi ; y (a) let yeim = y b=f^m 1 (xi ) (b) …t, for some class of functions hm ( ; ), parameters ( m; m) = arg min ( ; ) N X (e yim m hm (xi ; )) and m as 2 i=1 (c) let m = arg min N X L yi ; f^m 1 (xi ) + hm (xi ; m) i=1 (d) set f^m (x) = f^m 1 (x) + m hm (x; m) 3. and …nally, set f^ (x) = f^M (x). In this algorithm, the set of N values m hm (xi ; m ) in 2b. functions as an approximate negative gradient of the total loss at the set of values f^m 1 (xi ) and the least squares criterion at 2b. might be replaced by others (like least absolute deviations). The "minimizations" need to be taken with a grain of salt, as they are typically only the "approximate" minimizations (like those associated with greedy algorithms used to …t binary trees, e.g.). The step 2c. is some kind of line search in a direction of steepest descent for the total loss. 111 Notice that there is nothing here that requires that the functions hm (x; ) come from classi…cation or regression trees, rather the development allows for fairly arbitrary base predictors. When the problem is a regression problem with squared error loss some substantial simpli…cations of this occur. 15.3.1 SEL 1 2 Suppose for now that L (y; yb) = algorithm above 2 yb) . (y 1. f^0 (x) = y, 2. 
yeim = @ @y b 1 2 yb) (yi 2 y b=fm 1 (xi ) Then, in the gradient boosting = yi f^m 1 (xi ) so that step 2b. of the algorithm says "…t, using SEL, a model for the residuals from the previous iteration," and 3. the optimization of in step 2c of the algorithm produces that step 2d of the algorithm sets f^m (x) = f^m 1 (x) + m hm (x; m = m so m) and ultimately one is left with the prescription "…t, using, SEL a model for the residuals from the previous iteration and add it to the previous predictor to produce the mth one," a very intuitively appealing algorithm. Again, there is nothing here that requires that the hm (x; ) come from regression trees. This notion of gradient boosting could be applied to any kind of base regression predictor. 15.3.2 AEL (and Binary Regression Trees) Suppose now that L (y; yb) = jy 1. f^0 (x) = median fyi g, and 2. yeim = @ @y b (jyi ybj) y b=f^m ybj : Then, in the gradient boosting algorithm 1 (xi ) = sign yi of the algorithm says "…t, a model for from the previous iteration." f^m 1 (xi ) so that step 2b. 1’s coding the signs of the residuals In the event that the base predictors being …t are regression trees, the (xi ; m ) …t at step 2b. will be values between 1 and 1 (constant on rectangles) if the …tting is done using least squares (exactly as indicated in 2b.). If the alternative of least absolute error were used in place of least squares in 2b., one would have the median of a set of 1’s and 1’s, and thus values either 1 or 1 constant on rectangles. In this latter case, the optimization for m is of the form N X ( 1)i y f^m 1 (xi ) m = arg min m hm m i=1 112 and while there is no obvious closed form for the optimizer, it is at least the case that one is minimizing a piecewise linear continuous function of . K-Class Classi…cation 15.3.3 HTF allude to a procedure that uses gradient boosting in …tting a (K-dimensional) form f = (f1 ; : : : ; fK ) PK under the constraint that k=1 fk = 0, sets corresponding class probabilities (for k 2 G = f1; 2; : : : ; Kg) to exp (fk (x)) pk (x) = PK l=1 exp (fl (x)) and has the implied theoretical form of classi…er f (x) = arg max pk (x) k2G An appropriate (and convenient) loss is L (y; p) = K X I [y = k] ln (pk ) k=1 = K X I [y = k] fk + ln k=1 K X ! exp (fk ) k=1 This leads to @ L (y; p) = I [y = k] @fk = I [y = k] exp (fk ) PK l=1 exp (fl ) pk and thus yeikm = I [yi = k] p^k(m 1) (xi ) (Algorithm 10.4 on page 387 of HTF provides a version of the gradient boosting algorithm for this loss and K-class classi…cation.) 15.4 Some Issues (More or Less) Related to Boosting and Use of Trees as Base Predictors Here we consider several issues that might come up in the use of boosting, particularly if regression or classi…cation trees are the base predictors. One issue is the question of how large regression or classi…cation trees should be allowed to grow if they are used as the functions hm (x; ) in a boosting algorithm. The answer seems to be "Not too large, maybe to about 6 or so terminal nodes." 113 Section 10.12 of HTF concerns control of complexity parameters/avoiding over…t/regularization in boosting. Where trees are used as the base predictors, limiting their size (as above) provides some regularization. Further, there is the parameter M in the boosting algorithm that can/should be limited in size (very large values surely producing over…t). Cross-validation here looks typically pretty unworkable because of the amount of computing involved. HTF seem to indicate that (for better or worse?) 
holding back a part of the training sample and watching performance of a predictor on that single test set as M increases is a commonly used method of choosing M . "Shrinkage" is an idea that has been suggested for regularization in boosting. The basic idea is that instead of setting f^m (x) = f^m 1 (x) + m hm (x; m) as in 2d. of the gradient boosting algorithm, one chooses f^m (x) = f^m 1 (x) + m hm (x; 2 (0; 1) and sets m) i.e. one doesn’t make the "full correction" to f^m 1 in producing f^m . This will typically require larger M for good …nal performance of a predictor than the non-shrunken version, but apparently can (in combination with larger M ) improve performance. "Subsampling" is some kind of bootstrap/bagging idea. The notion is that at each iteration of the boosting algorithm, instead of choosing an update based on the whole training set, one chooses a fraction of the training set at random, and …ts to it (using a new random selection at each iteration). This reduces the computation time per iteration and apparently can also improve performance. HTF show its performance on an example in combination with shrinkage. 16 Prototype and Nearest Neighbor Methods of Classi…cation We saw when looking at "linear" methods of classi…cation in Section 13 that these can reduce to classi…cation to the class with …tted mean "closest" in some appropriate sense to an input vector x. A related notion is to represent classes each by several "prototype" vectors of inputs, and to classify to the class with closest prototype. In their Chapter 13, HTF consider such classi…ers and related nearest neighbor classi…ers. So consider a K-class classi…cation problem (where y takes values in G = f1; 2; : : : ; Kg) and suppose that the coordinates of input x have been standardized according to training means and standard deviations. For each class k = 1; 2; : : : ; K, represent the class by prototypes z k1 ; z k2 ; : : : ; z kR p belonging to < and consider a classi…er/predictor of the form f (x) = arg min min kx l k 114 z kl k (that is, one classi…es to the class that has a prototype closest to x). The most obvious question in using such a rule is "How does one choose the prototypes?" One standard (admittedly ad hoc, but not unreasonable) method is to use the so-called "K-means (clustering) algorithm" one class at a time. (The "K" in the name of this algorithm has nothing to do with the number of classes in the present context. In fact, here the "K" naming the clustering algorithm is our present R, the number of prototypes used per class. And the point in applying the algorithm is not so much to see exactly how training vectors aggregate into "homogeneous" groups/clusters as it is to …nd a few vectors to represent them.) For Tk = fxi with corresponding yi = kg an "R "-means algorithm might proceed by 1. randomly selecting R di¤erent elements from Tk say (1) (1) (1) z k1 ; z k2 ; : : : ; z kR 2. then for m = 2; 3; : : : letting ( the mean of all xi 2 Tk with (m) z kl = (m 1) (m 1) xi z kl < xi z kl0 for all l 6= l0 iterating until convergence. This way of choosing prototypes for class k ignores the "location" of the other classes and the eventual use to which the prototypes will be put. A potential improvement on this is to employ some kind of algorithm (again ad hoc, but reasonable) that moves prototypes in the direction of training input vectors in their own class and away from training input vectors from other classes. One such method is known by the name "LVQ"/"learning vector quantization." 
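A minimal sketch of the prototype classifier just described, assuming Python with numpy and scikit-learn's KMeans used (purely as a convenience) in place of the hand-rolled within-class "R-means" iteration above; the inputs are assumed to be already standardized and the helper names are illustrative only.

    import numpy as np
    from sklearn.cluster import KMeans

    def class_prototypes(X, y, classes, R=3, seed=0):
        """R prototypes per class from a within-class K-means (the 'R-means' step)."""
        return {k: KMeans(n_clusters=R, n_init=10, random_state=seed)
                   .fit(X[y == k]).cluster_centers_ for k in classes}

    def nearest_prototype_classify(X, protos):
        keys = list(protos.keys())
        # squared distance from each x to the closest prototype of each class
        D = np.column_stack([np.min(((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1), axis=1)
                             for Z in protos.values()])
        return np.array([keys[i] for i in np.argmin(D, axis=1)])

    rng = np.random.default_rng(5)
    X = np.r_[rng.normal((0, 0), 1, (80, 2)), rng.normal((3, 3), 1, (80, 2))]
    y = np.repeat([1, 2], 80)
    protos = class_prototypes(X, y, classes=[1, 2], R=3)
    print("training error rate:", float(np.mean(nearest_prototype_classify(X, protos) != y)))

The LVQ refinement mentioned above starts from prototypes like these and then uses the class labels to move them.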
This proceeds as follows. With a set of prototypes (chosen randomly or from an R-means algorithm or some other way) z kl k = 1; 2; : : : ; K and l = 1; 2; : : : ; R in hand, at each iteration m = 1; 2; : : : for some sequence of "learning rates" f m g with m 0 and m & 0 1. sample an xi at random from the training set and …nd k; l minimizing kxi z kl k 2. if yi = k (from 1.), update z kl as z new kl = z kl + m (xi z kl ) m (xi z kl ) and if yi 6= k (from 1.), update z kl as z new kl = z kl 115 iterating until convergence. As early as Section 1, we considered nearest neighbor methods, mostly for regression-type prediction. Consider here their use in classi…cation problems. As before, de…ne for each x the l-neighborhood nl (x) = the set of l inputs xi in the training set closest to x in <p A nearest neighbor method is to classify x to the class with the largest representation in nl (x) (possibly breaking ties at random). That is, de…ne X f^ (x) = arg max I [yi = k] (115) k xi 2nl (x) l is a complexity parameter that might be chosen by cross-validation. Apparently, properly implemented this kind of classi…er can be highly e¤ective in spite of the curse of dimensionality. This depends upon clever/appropriate/applicationspeci…c choice of feature vectors/functions, de…nition of appropriate "distance" in order to de…ne "closeness" and the neighborhoods, and appropriate local or global dimension reduction. HTF provide several attractive examples (particularly the LandSat example and the handwriting character recognition example) that concern feature selection and specialized distance measures. They then go on to consider more or less general approaches to dimension reduction for classi…cation in high-dimensional Euclidean space. A possibility for "local" dimension reduction is this. At x 2 <p one might use regular Euclidean distance to …nd, say, 50 neighbors of x in the training inputs to use to identify an appropriate local distance to employ in actually de…ning the neighborhood nl (x) to be employed in (115). The following is (what I think is a correction of) what HTF write in describing a DANN (discriminant adaptive nearest neighbor) (squared) metric at x 2 <p . Let D2 (z; x) = (z 0 x) Q (z x) for Q=W =W for some 1 2 1 2 W (B ) 1 2 1 BW 1 2 + I W 1 + I W 1 2 1 2 (116) > 0 where W is a pooled within-class sample covariance matrix W = K X ^k W k k=1 = K X k=1 ^k 1 nk 1 X 116 (xi xk ) (xi 0 xk ) (xk is the average xi from class k in the 50 used to create the local metric), B is a weighted between-class covariance matrix of sample means B= K X x) (xk ^k (xk 0 x) k=1 (x is a (probably weighted) average of the xk ) and B =W 1 2 BW 1 2 1 Notice that in (116), the "outside" W 2 factors "sphere" (z x) di¤erences relative to the within-class covariance structure. B is then the between-class covariance matrix of sphered sample means. Without the I, the distance would then discount di¤erences in the directions of the eigenvectors corresponding to large eigenvalues of B (allowing the neighborhood de…ned in terms of D to be severely elongated in those directions). The e¤ect of adding the I term is to limit this elongation to some degree, preventing xi "too far in terms of Euclidean distance from x" from being included in nl (x). A global use of the DANN kind of thinking might be to do the following. At each training input vector xi 2 <p , one might again use regular Euclidean distance to …nd, say, 50 neighbors and compute a weighted between-class-mean covariance matrix B i as above (for that xi ). 
These might be averaged to produce N 1 X Bi B= N i=1 Then for eigenvalues of B, say 1 0 with corresponding 2 p (unit) eigenvectors u1 ; u2 ; : : : ; up one might do nearest neighbor classi…cation based on the …rst few features vj = u0j x and ordinary Euclidean distance. I’m fairly certain that this is a nearest neighbor version of the reduced rank classi…cation idea …rst met in the discussion of linear classi…cation in Section 13. 17 Some Methods of Unsupervised Learning As we said in Section 1, "supervised learning" is basically prediction of y belonging to < or some …nite index set from a p-dimensional x with coordinates each individually in < or some …nite index set, using training data pairs (x1 ; y1 ) ; (x2 ; y2 ) ; : : : ; (xN ; yN ) to create an e¤ective prediction rule yb = fb(x) 117 This is one kind of discovery and exploitation of structure in the training data. As we also said in Section 1, "unsupervised learning" is discovery and quanti…cation of structure in 0 0 1 x1 B x02 C B C X =B . C . N p @ . A x0N without reference to some particular coordinate of a p-dimensional x as an object of prediction. There are a number of versions of this problem in Ch 14 of HTF that we will outline here. 17.1 Association Rules/Market Basket Analysis Suppose for the present that x takes values in X = p Y j=1 Xj where each Xj is …nite, Xj = sj1 ; sj2 ; : : : ; sjkj For some set of indices J f1; 2; : : : ; pg and subsets of the coordinates spaces Sj Xj for j 2 J we suppose that the rectangle R = fx j xj 2 Sj 8j 2 J g is selected as potentially of interest. (More below on how this might come about.) Apparently, the statement x 2 R (or "xj 2 Sj 8j 2 J ") is called a conjunctive rule. In application of this formalism to "market basket analysis" it is common to call the collection of indices j 2 J and corresponding sets Sj "item sets" corresponding to the rectangle.9 For an item set I = f(j1 ; Sj1 ) ; (j2 ; Sj2 ) ; : : : ; (jr ; Sjr )g 9 In this application, an X will typically specify a product type and the correpsonding j sjk ’s speci…c brands and/or models of the product type. As we probably want to allow the possibility that a given transaction involves no product of type j, there needs to be the understanding that one sjk for a given j stands for "no purchase in this category." Further, since a given coordinate, j, of x can take only one value from Xj , it seems that we must somehow restrict attention to transactions that involve at most a single product of each type. 118 consider a (non-trivial proper) subset J1 J and the items sets I1 = f(j; Sj ) jj 2 J1 g and I2 = InI1 It is then common to talk about "association rules" of the form I1 =) I2 (117) and to consider quantitative measures associated with them. In framework (117), I1 is called the antecedent and I2 is called the consequent in the rule. For an association rule I1 =) I2 , 1. the support of the rule (also the support of the item set or rectangle) is 1 1 # fxi jxi 2 Rg = # fxi jxij 2 Sj 8j 2 J g N N (the relative frequency with which the full item set is seen in the training cases, i.e. that the training vectors fall into R), 2. the con…dence or predictability of the rule is # fxi jxij 2 Sj 8j 2 J g # fxi jxij 2 Sj 8j 2 J1 g (the relative frequency with which the full item set I is seen in the training cases that exhibit the smaller item set I1 ), 3. the "expected con…dence" of the rule is 1 # fxi jxij 2 Sj 8j 2 J2 g N (the relative frequency with which item set I2 is seen in the training cases), and 4. 
the lift of the rule is N # fxi jxij 2 Sj 8j 2 J g conf idence = expected conf idence # fxi jxij 2 Sj 8j 2 J1 g # fxi jxij 2 Sj 8j 2 J2 g (a measure of association). If one thinks of the cases in the training set as a random sample from some distribution, then roughly, the support of the rule is an estimate of "P (I)," the con…dence is an estimate of "P (I2 jI1 )," the expected con…dence is an estimate of "P (I2 )," and the lift is an estimate of the ratio "P (I1 and I2 ) = (P (I1 ) P (I2 ))." The basic thinking about association rules seems to be that usually (but perhaps not always) one wants rules with large support (so that the estimates can be reasonably expected to be reliable). Further, one then wants large con…dence or lift, as these indicate that the corresponding rule will be useful in terms of understanding how the coordinates of x are related in the training data. Apparently, standard practice is to identify a large number of promising item sets and association rules, and make a database of association rules than can be queried in searches like: 119 "Find all rules in which YYY is the consequent that have con…dence over 70% and support more than 1%." Basic questions that we have to this point not addressed are where one gets appropriate rectangles R (or equivalently, item sets I) and how one uses them to produce association rules. In answer to the second of these questions, one might say "consider all 2r 2 association rules that can be associated with a given rectangle." But what then are "interesting" rectangles or how does one …nd a potentially useful set of rectangles? We proceed to consider two answers to this question. 17.1.1 The "Apriori Algorithm" for the Case of Singleton Sets Sj One standard way of generating a set of rectangles to process into association rules is to use the so-called "apriori algorithm." This produces potentially interesting rectangles where each Sj is a singleton set. That is, for each j 2 J , there is lj 2 f1; 2; : : : ; kj g so that Sj = sjlj . For such cases, one ends with item sets of the form I= j1 ; sj1 lj1 ; j2 ; sj2 lj2 ; : : : ; jr ; sjr ljr or more simply I= j1 ; sj1 lj1 ; j2 ; sj2 lj2 ; : : : ; jr ; sjr ljr or even just I = sj1 lj1 ; sj2 lj2 ; : : : ; sjr ljr The "apriori algorithm" operates to produce such rectangles as follows. Pp 1. Pass through all K = j=1 kj single items s11 ; s12 ; : : : s1k1 ; s21 ; : : : ; s2k2 ; : : : ; sp1 ; : : : ; spkp identifying those sjl that individually have support/prevalence 1 # fxi j xij = sjl g N at least t and place them in the set I1t = fitem sets of size 1 with support at least tg 2. For each sj 0 l0 2 I1t check to see which two-element item sets fsj 0 l0 ; sj 00 l00 gj 00 6=j 0 and sj 00 l00 2I1t have support/prevalence 1 # fxi j xij 0 = sj 0 l0 and xij 00 = sj 00 l00 g N 120 at least t and place them in the set I2t = fitem sets of size 2 with support at least tg .. . 8 9 entries = <z m 1}| { t m. For each sj 0 l0 ; sj 00 l00 ; : : : 2 Im : ; 1 check to see which m-element item sets fsj 0 l0 ; sj 00 l00 ; : : :g [ fsj l g for j 2 = fj 0 ; j 00 ; : : :g and sj l 2 I1t have support/prevalence 1 # xi j xij 0 = sj 0 l0 ; xij 00 = sj 00 l00 ; : : : ; xij N =sj l at least t and place them in the set t Im = fitem sets of size m with support at least tg t is empty. 
Then This algorithm terminates when at some stage m the set Im a sensible set of rectangles (to consider for making association rules) is I t = t , the set of all item sets with singleton coordinate sets Sj and prevalence in [Im the training data of at least t. Apparently for commercial databases of "typical size," unless t is very small it is feasible to use this algorithm to to …nd I t . It is apparently also possible to use a variant of the apriori algorithm to …nd all association rules based on rectangles/item sets in I t with con…dence at least c. This then produces a database of association rules that can be queried by a user wishing to identify useful structure in the training data set. 17.1.2 Using 2-Class Classi…cation Methods to Create Rectangles What the apriori algorithm does is clear. What "should" be done is far less clear. The apriori algorithm seems to search for rectangles in X that are simplydescribed and have large relative frequency content in the training data. But it obviously does not allow for rectangles that have coordinate sets Xj containing more than one element, it’s not the only thing that could possibly be done in this direction, it won’t …nd practically important "small-relative-frequency-content" rectangles, and there is nothing in the algorithm that directly points it toward rectangles that will produce association rules with large lift. HTF o¤er (without completely ‡eshing out the idea) another (fairly clever) path to generating a set of rectangles. That involves making use of simulated data and a rectangle-based 2-class classi…cation algorithm like CART or PRIM. (BTW, it’s perhaps the present context where association rules are in view that gives PRIM its name.) The idea is that if one augments the training set by assigning the value y = 1 to all cases in the original data set, and subsequently 121 generates N additional …ctitious data cases from some distribution P on X , assigning y = 0 to those simulated cases, a classi…cation algorithm can be used to search for rectangles that distinguish between the real and simulated data. One will want N to be big enough so that the simulated data can be expected to represent P , and in order achieve an appropriate balance between real and simulated values in the combined data set presented to the classi…cation algorithm, each real data case may need to be weighted with a weight larger than 1. (One, of course, uses those rectangles that the classi…cation algorithm assigns to the decision "y = 1.") Two possibilities for P are 1. a distribution of independence for the coordinates of x, with the marginal of xj uniform on Xj , and 2. a distribution of independence for the coordinates of x, with the marginal of xj the relative frequency distribution of fxij g on Xj . One could expect the …rst of these to produce rectangles with large relative frequencies in the data set and thus results similar in spirit to the those produced by the apriori algorithm (except for restriction to singleton Sj ’s). The second possibility seems more intriguing. At least on the face of it, seems like it might provide a more or less automatic way to look for parts of the original training set where one can expect large values of lift for association rules. It is worth noting that HTF link this method of using classi…cation to search for sets of high disparity between the training set relative frequencies and P probability content to the use of classi…cation methods with simulated data in the problem multivariate density estimation. 
This is possible because knowing a set of optimal classi…ers is essentially equivalent to knowing data density. The "…nd optimal classi…ers" problem and the "estimate the density" problem are essentially duals. 17.2 Clustering Typically (but not always) the object in "clustering" is to …nd natural groups of rows or columns of 0 0 1 x1 B x02 C B C X =B . C N p @ .. A x0N (in some contexts one may want to somehow …nd homogenous "blocks" in a properly rearranged X). Sometimes all columns of X represent values of continuous variables (so that ordinary arithmetic applied to all its elements is meaningful). But sometimes some columns correspond to ordinal or even categorical variables. In light of all this, we will let xi i = 1; 2; : : : ; r stand for "items" to be clustered (that might be rows or columns of X) with entries that need not necessarily be continuous variables. 122 In developing and describing clustering methods, it is often useful to have a dissimilarity measure d (x; z) that (at least for the items to be clustered and perhaps for other possible items) quanti…es how "unalike" items are. This measure is usually chosen to satisfy 1. d (x; z) 0 8x; z 2. d (x; x) = 0 8x, and 3. d (x; z) = d (z; x) 8x; z. It may be chosen to further satisfy 4. d (x; z) d (x; w) + d (z; w) 8x; z; and w, or 40 . d (x; z) max [d (x; w) ; d (z; w)] 8x; z; and w. Where 1-4 hold, d is a "metric." Where 1-3 hold and the stronger condition 40 holds, d is an "ultrametric." In a case where one is clustering rows of X and each column of X contains values of a continuous variable, a squared Euclidean distance is a natural choice for a dissimilarity measure d (xi ; x ) = kxi i0 2 x k = i0 p X (xij x i0 j ) 2 j=1 In a case where one is clustering columns of X and each column of X contains values of a continuous variable, with rjj 0 the sample correlation between values in columns j and j 0 , a plausible dissimilarity measure is d (xj ; xj 0 ) = 1 jrjj 0 j When dissimilarities between r items are organized into a (non-negative symmetric) r r matrix D = (dij ) = (d (xi ; xj )) with 0’s down its diagonal, the terminology "proximity matrix" is often used. For some clustering algorithms and for some purposes, the proximity matrix encodes all one needs to know about the items to do clustering. One seeks a partition of the index set f1; 2; : : : ; rg into subsets such that the dij for indices within a subset are small (and the dij for indices i and j from di¤erent subsets are large). 17.2.1 Partitioning Methods ("Centroid"-Based Methods) By far the most commonly used clustering methods are based on partitioning related to "centroids," particularly the so called "K-Means" clustering algorithm for the rows of X in cases where the columns contain values of continuous 123 variables xj . The version of the method we will describe here uses squared Euclidean distance to measure dissimilarity. Somewhat fancier versions (like that found in CFZ) are built on Mahalanobis distance. The algorithms begins with some set of K distinct "centers" c01 ; c02 ; : : : ; c0K . They might, for example, be a random selection of the rows of X (subject to the constraint that they are distinct). 
One then assigns each xi to that center c0k0 (i) minimizing d xi ; c0l over choice of l (creating K clusters around the centers) and replaces all of the c0k with the corresponding cluster means c1k = X 1 I k 0 (i) = k xi 0 # of i with k (i) = k At stage m with all cm k minimizing 1 cm km 1 (i) 1 available, one then assigns each xi to that center d xi ; clm 1 over choice of l (creating K clusters around the centers) and replaces all of the 1 cm with the corresponding cluster means k cm k = 1 # of i with k m 1 (i) =k X I km 1 (i) = k xi This iteration goes on to convergence. One compares multiple random starts for a given K (and then minimum values found for each K) in terms of Error Sum of Squares (K) = K X k=1 X d (xi ; ck ) xi in cluster k for c1 ; c2 ; : : : ; cK the …nal means produced by the iterations. One may then consider the monotone sequence of Error Sums of Squares and try to identify a value K beyond which there seem to be diminishing returns for increased K. A more general version of this algorithm (that might be termed a K-medoid algorithm) doesn’t require that the entries of the xi be values of continuous variables, but (since it is then unclear that one can even evaluate, let alone …nd a general minimizer of, d (xi ; )) restricts the "centers" to be original items. This algorithm begins with some set of K distinct "medoids" c01 ; c02 ; : : : ; c0K that are a random selection from the r items xi (subject to the constraint that they are distinct). One then assigns each xi to that medoid c0k0 (i) minimizing d xi ; c0l over choice of l (creating K clusters associated with the medoids) and replaces all of the c0k with c1k the corresponding minimizers over the xi0 belonging to cluster k of the sums X d (xi ; xi0 ) i with k0 (i)=k 124 At stage m with all cm k 1 cm km 1 (i) minimizing 1 available, one then assigns each xi to that medoid d x i ; cm l 1 over choice of l (creating K clusters around the medoids) and replaces all of the 1 cm with cm k the corresponding minimizers over the xi0 belonging to cluster k k of the sums X d (xi ; xi0 ) i with km 1 (i)=k This iteration goes on to convergence. One compares multiple random starts for a given K (and then minimum values found for each K) in terms of K X k=1 X d (xi ; ck ) xi in cluster k for c1 ; c2 ; : : : ; cK the …nal medoids produced by the iterations. 17.2.2 Hierarchical Methods There are both agglomerative/bottom-up methods and divisive/top-down methods of hierarchical clustering. To apply an agglomerative method, one must …rst choose a method of using dissimilarities for items to de…ne dissimilarities for clusters. Three common (and somewhat obvious) possibilities in this regard are as follows. For C1 and C2 di¤erent elements of a partition of the set of items, or equivalently their r indices, one might de…ne dissimilarity of C1 and C2 as 1. D (C1 ; C2 ) = min fdij ji 2 C1 and j 2 C2 g (this is the "single linkage" or "nearest neighbor" choice), 2. D (C1 ; C2 ) = max fdij ji 2 C1 and j 2 C2 g (this is the "complete linkage" choice), or P 3. D (C1 ; C2 ) = #C11#C2 i2C1 ; j2C2 dij (this is the "average linkage" choice). An agglomerative hierarchical clustering algorithm then operates as follows. One begins with every item xi ; i = 1; 2; : : : ; r functioning as a singleton cluster. Then one …nds the minimum dij for i 6= j and puts the corresponding two items into a single cluster (of size 2). 
Then when one is at a stage where there are $m$ clusters, one finds the two clusters with minimum dissimilarity and merges them to make a single cluster, leaving $m-1$ clusters overall. This continues until there is only a single cluster. The sequence of $r$ different clusterings (with $r$ through $1$ clusters) serves as a menu of potentially interesting solutions to the clustering problem. These are apparently often displayed in the form of a dendrogram, where cutting the dendrogram at a given level picks out one of the (increasingly coarse as the level rises) clusterings. Those items clustered together "deep" in the tree/dendrogram are presumably interpreted to be potentially "more alike" than ones clustered together only at a high level.

A divisive hierarchical algorithm operates as follows. Starting with a single "cluster" consisting of all items, one finds the maximum $d_{ij}$ and uses the two corresponding items as seeds for two clusters. One then assigns each $x_l$ for $l \ne i$ and $l \ne j$ to the cluster represented by $x_i$ if $d(x_i, x_l) < d(x_j, x_l)$ and to the cluster represented by $x_j$ otherwise. When one is at a stage where there are $m$ clusters, one identifies the cluster with the largest $d_{ij}$ corresponding to a pair of elements in the cluster, splitting it using the method applied to split the original "single large cluster" (to produce an $(m+1)$-cluster clustering). This, like the agglomerative algorithm, produces a sequence of $r$ different clusterings (with $1$ through $r$ clusters) that serves as a menu of potentially interesting solutions to the clustering problem. And like the sequence produced by the agglomerative algorithm, this sequence can be represented using a dendrogram.

One may modify either the agglomerative or divisive algorithm by fixing a threshold $t > 0$ for use in deciding whether or not to merge two clusters or to split a cluster. The agglomerative version would terminate when all pairs of existing clusters have dissimilarities more than $t$. The divisive version would terminate when all dissimilarities for pairs of items in all clusters are below $t$. Fairly obviously, employing a threshold has the potential to shorten the menu of clusterings produced by either of the methods to fewer than $r$ clusterings. (Thresholding the agglomerative method cuts off the top of the corresponding full dendrogram, and thresholding the divisive method cuts off the bottom of it.)

17.2.3 (Mixture) Model-Based Methods

A completely different approach to clustering into $K$ clusters is based on the use of mixture models. That is, for purposes of producing a clustering, I might consider acting as if items $x_1, x_2, \ldots, x_r$ are realizations of $r$ iid random variables with parametric marginal density

$$g(x \mid \pi, \theta_1, \ldots, \theta_K) = \sum_{k=1}^{K} \pi_k f(x \mid \theta_k) \qquad (118)$$

for probabilities $\pi_k > 0$ with $\sum_{k=1}^{K}\pi_k = 1$, a fixed parametric density $f(x \mid \theta)$, and parameters $\theta_1, \ldots, \theta_K$. (Without further restrictions the family of mixture distributions specified by density (118) is not identifiable, but we'll ignore that fact for the moment.) One might think of the ratio

$$\frac{\pi_k f(x \mid \theta_k)}{\sum_{l=1}^{K} \pi_l f(x \mid \theta_l)}$$

as a kind of "probability" that $x$ was generated by component $k$ of the mixture and define cluster $k$ to be the set of $x_i$ for which

$$k = \arg\max_l \; \pi_l f(x_i \mid \theta_l)$$

In practice, $\pi, \theta_1, \ldots, \theta_K$ must be estimated and the estimates used in place of the parameters in defining clusters, as sketched below.
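A minimal sketch of that estimation, assuming scikit-learn's GaussianMixture (an assumption about tooling), which fits a normal-components version of (118) by maximum likelihood via the EM algorithm discussed below:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(6)
    # r = 300 items drawn from three bivariate normal "components"
    X = np.r_[rng.normal((0, 0), 0.7, (100, 2)),
              rng.normal((3, 0), 0.7, (100, 2)),
              rng.normal((1.5, 3), 0.7, (100, 2))]

    K = 3
    gm = GaussianMixture(n_components=K, random_state=0).fit(X)   # likelihood maximized by EM

    # cluster k = { x_i : k = argmax_l  pi_hat_l * f(x_i | theta_hat_l) }
    labels = gm.predict(X)          # predict() is exactly this argmax assignment
    print("estimated mixing weights:", np.round(gm.weights_, 3))
    print("cluster sizes:", np.bincount(labels))

With estimates $\hat{\pi}_k$ and $\hat{\theta}_k$ in hand, each item is simply assigned to the component maximizing $\hat{\pi}_l f(x_i \mid \hat{\theta}_l)$.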
That is, an implementable clustering method is to de…ne cluster k (say, Ck ) to be Ck = xi jk = arg max bl f xi j bl (119) l Given the lack of identi…ability in the unrestricted mixture model, it might appear that prescription (119) could be problematic. But such is not really the case. While the likelihood L( ; 1; : : : ; K) = r Y g (xi j ; 1; : : : ; K) i=1 will have multiple maxima, using any maximizer for an estimate of the parameter vector will produce the same set of clusters (119). It is common to employ the "EM algorithm" in the maximization of L ( ; 1 ; : : : ; K ) (the …nding of one of many maximizers) and to include details of that algorithm in expositions of model-based clustering. However, strictly speaking, that algorithm is not intrinsic to the basic notion here, namely the use of the clusters in display (119). 17.2.4 Self-Organizing Maps The basic motivation here seems to be that for items x1 ; x2 ; : : : ; xr belonging to <p , the object is to …nd L M cluster centers/prototypes that adequately represent the items, where one wishes to think of those cluster centers/prototypes as indexed on an L M regular grid in 2 dimensions (that we might take to be f1; 2; : : : ; Lg f1; 2; : : : ; M g) with cluster centers/prototypes whose index vectors are close on the grid being close in <p . (There could, of course, be 3-dimensional versions of this for L M N cluster centers/prototypes, and so on.) I believe that the object is both production of the set of centers/prototypes and ?also? assignment of data points to centers/prototypes. In this way, this amounts to some kind of modi…ed/constrained K = L M group clustering problem. One probably will typically P want to begin with standardization of the p coordinate variables xj (so that xi = 0 and the sample variance of each set of values fxij gi=1;2;:::;r is 1). This puts all of the xj on the same scale and doesn’t allow one coordinate of an xi to dominate a Euclidean norm. This topic seems to be driven by two somewhat ad hoc algorithms and lack a wellthought-out (or at least well-articulated) rationale (for both its existence and its methodology). Nevertheless, it seems popular and we’ll describe here the algorithms, saying what we can about its motivation and usefulness. 127 One begins with some set of initial cluster centers z 0lm l=1;:::;L and m=1;:::;M . This might be a random selection (without replacement or the possibility of duplication) from the set of items. It might be a set of grid points in the 2-dimensional plane in <p de…ned by the …rst two principal components of the items fxi gi=1;:::;r . And there are surely other sensible possibilities. Then de…ne neighborhoods on the L M grid, n (l; m), that are subsets of the grid "close" in some kind of distance (like regular Euclidean distance) to the various elements of the L M grid. n (l; m) could be all of the grid, (l; m) alone, all grid points (l0 ; m0 ) within some constant 2-dimensional Euclidean distance, c, of (l; m), etc. Then de…ne a weighting function on <p , say w (kxk), so that w (0) = 1 and w (kxk) 0 is monotone non-increasing in kxk. For some schedule of non-increasing positive constants 1 > 1 , the 2 3 n o SOM algorithms de…ne iteratively sets of cluster centers/prototypes z jlm for j = 1; 2; : : :. At iteration j, an "online" version of SOM selects (randomly or perhaps in turn from an initially randomly set ordering of the items) an item xj and j 1 1. 
identi…es the center/prototype z lm closest to xj in to <p , call it bj and its j grid coordinates (l; m) (Izenman calls bj the "BMU" or best-matchingunit), j 2. adjusts those z jlm1 with index vectors belonging n (l; m) (close to the j BMU index vector on the 2-dimensional grid) toward x by the prescription j 1 z jlm = z jlm1 + j w z lm bj xj z jlm1 (adjusting those centers di¤erent from the BMU potentially less dramatically than the BMU), and 3. for those z jlm1 with index pairs (l; m) not belonging n (l; m) j sets z jlm = z jlm1 iterating to convergence. j, n At oiteration n o a "batch" version of SOM updates all centers/prototypes j 1 z jlm1 to z jlm as follows. For each z jlm1 , let Xlm be the set of items for n o which the closest element of z jlm1 has index pair (l; m). Then update z jlm1 j 1 as some kind of (weighted) average of the elements of [(l;m)0 2n(l;m) X(l;m) 0 (the set of xi closest to prototypes with labels that are 2-dimensional grid neighbors 1 the obvious sample mean of (l; m)). A natural form of this is to set (with xj(l;m) j 1 of the elements of Xlm ) P z jlm = (l;m)0 2n(l;m) P j 1 z lm w (l;m)0 2n(l;m) w 128 j 1 z (l;m) 0 z jlm1 j 1 x(l;m) 0 1 z j(l;m) 0 It is fairly obvious that even if these algorithms converge, di¤erent starting sets z 0lm will produce di¤erent limits (symmetries alone mean, for example, that the choices z 0lm = ulm and z 0lm = uL l;M m produce what might look like di¤erent limits, but are really completely equivalent). Beyond this, what is provided by the 2-dimensional layout of indices of prototypes is not immediately obvious. It seems to be fairly common to compare an error sum of squares for a SOM to that of a K = L M means clustering and to declare victory if the SOM sum is not much worse than the K-means value. 17.2.5 Spectral Clustering This material is a bit hard to place in an outline like this one. It is about clustering, but requires the use of principal components ideas, and has emphases in common with the local version of multi-dimensional scaling treated in Section 17.3. The basic motivation is to do clustering in a way that doesn’t necessarily look for "convex" clusters of points in p-space, but rather for "roughly connected"/"contiguous" sets of points of any shape in p-space. Consider then N points x1 ; x2 ; : : : ; xN in <p . We consider weights wij = w (kxi xj k) for a decreasing function w : [0; 1) ! [0; 1] and use them to de…ne similarities/adjacencies sij . (For example, we might use w (d) = exp d2 =c for some c > 0.) Similarities can be exactly sij = wij , but can be even more "locally" de…ned as follows. For …xed K consider the symmetric set of index pairs NK = (i; j) j the number of j 0 with wij 0 > wij is less than K or the number of i0 with wi0 j > wij is less than K (an index pair is in the set if one of the items is in the K-nearest neighbor neighborhood of the other). One might then de…ne sij = wij I [(i; j) 2 NK ]. We’ll call the matrix S = (sij ) i=1;:::;N j=1;:::;N the adjacency matrix, and use the notation gi = N X sij j=1 and G = diag (g1 ; g2 ; : : : ; gN ) It is common to think of the points x1 ; x2 ; : : : ; xN in <p as nodes on a graph, with edges between nodes weighted by similarities sij , and the gi so-called node degrees, i.e. sums of weights of the edges connected to nodes i. 
The matrix
\[
L=G-S
\]
is called the (unnormalized) graph Laplacian, and one standardized (with respect to the node degrees) version of this is
\[
\tilde{L}=I-G^{-1}S \tag{120}
\]
Note that for any vector $u$,
\begin{align*}
u'Lu &= \sum_{i=1}^{N}g_iu_i^2-\sum_{i=1}^{N}\sum_{j=1}^{N}u_iu_js_{ij}\\
&= \frac{1}{2}\left(\sum_{i=1}^{N}\sum_{j=1}^{N}s_{ij}u_i^2+\sum_{j=1}^{N}\sum_{i=1}^{N}s_{ij}u_j^2\right)-\sum_{i=1}^{N}\sum_{j=1}^{N}u_iu_js_{ij}\\
&= \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}s_{ij}\left(u_i-u_j\right)^2
\end{align*}
so that the $N\times N$ symmetric $L$ is nonnegative definite. We then consider the spectral/eigen decomposition of $L$ and focus on the small eigenvalues. For $v_1,\ldots,v_m$ eigenvectors corresponding to the 2nd through $(m+1)$st smallest eigenvalues (since $L\mathbf{1}=\mathbf{0}$ there is an uninteresting 0 eigenvalue), let
\[
V=\left(v_1,\ldots,v_m\right)
\]
and cluster the rows of $V$ as a way of clustering the $N$ cases. (As we noted in the discussion in Section 2.3.1, small eigenvalues are associated with linear combinations of columns of $L$ that are close to $\mathbf{0}$.)

Why should this work? For $v_l$ a column of $V$ (that is, an eigenvector of $L$ corresponding to a small eigenvalue),
\[
v_l'Lv_l=\sum_{i=1}^{N}\sum_{j=1}^{N}s_{ij}\left(v_{li}-v_{lj}\right)^2\approx 0
\]
and points $x_i$ and $x_j$ with large adjacencies must have similar corresponding coordinates of the eigenvectors. HTF (at the bottom of their page 545) essentially argue that the number of "0 or nearly 0" eigenvalues of $L$ is indicative of the number of clusters in the original $N$ data vectors. Notice that a series of points could be (in sequence) close to successive elements of the sequence but have very small adjacencies for points separated in the sequence. "Clusters" by this methodology need NOT be "clumps" of points, but could also be serpentine "chains" of points in $\Re^p$.

It is easy to see that $P\equiv G^{-1}S$ is a stochastic matrix (thus specifying an $N$-state stationary Markov Chain). One can expect spectral clustering with the normalized graph Laplacian (120) to yield groups of states such that transition by such a chain between the groups is relatively infrequent (the MC typically moves within the groups).

17.3 Multi-Dimensional Scaling

This material begins (as in Section 17.2) with dissimilarities among $N$ items, $d_{ij}$, that might be collected in an $N\times N$ proximity matrix $D=\left(d_{ij}\right)$. (These might, but do not necessarily, come from Euclidean distances among $N$ data vectors $x_1,x_2,\ldots,x_N$ in $\Re^p$.) The object of multi-dimensional scaling is to (to the extent possible) represent the $N$ items as points $z_1,z_2,\ldots,z_N$ in $\Re^k$ with
\[
\|z_i-z_j\|\approx d_{ij}
\]
This is phrased precisely in terms of one of several optimization problems, where one seeks to minimize a "stress function" $S\left(z_1,z_2,\ldots,z_N\right)$. The least squares (or Kruskal-Shepard) stress function (optimization criterion) is
\[
S_{LS}\left(z_1,z_2,\ldots,z_N\right)=\sum_{i<j}\left(d_{ij}-\|z_i-z_j\|\right)^2
\]
This criterion treats errors in reproducing big dissimilarities exactly like it treats errors in reproducing small ones. A different point of view would make faithfulness to small dissimilarities more important than the exact reproduction of big ones. The so-called Sammon mapping criterion
\[
S_{SM}\left(z_1,z_2,\ldots,z_N\right)=\sum_{i<j}\frac{\left(d_{ij}-\|z_i-z_j\|\right)^2}{d_{ij}}
\]
reflects this point of view. Another approach to MDS that emphasizes the importance of small dissimilarities is discussed in HTF under the name of "local multi-dimensional scaling." Here one begins for fixed $K$ with the symmetric set of index pairs
\[
\mathcal{N}_K=\left\{(i,j)\ \middle|\ \text{the number of } j' \text{ with } d_{ij'}<d_{ij} \text{ is less than } K \text{ or the number of } i' \text{ with } d_{i'j}<d_{ij} \text{ is less than } K\right\}
\]
(an index pair is in the set if one of the items is in the $K$-nearest-neighbor neighborhood of the other).
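Before turning to the local criterion, a crude computational sketch may help make the optimization problem concrete. The following Python/numpy code (hypothetical names; the step size and iteration count are arbitrary choices) does plain gradient descent on the Kruskal-Shepard stress $S_{LS}$ above; it illustrates the criterion only and is not any standard MDS implementation.

import numpy as np

def mds_least_squares(D, k=2, steps=2000, eta=0.01, seed=0):
    """Gradient descent on S_LS(z_1,...,z_N) = sum_{i<j} (d_ij - ||z_i - z_j||)^2."""
    rng = np.random.default_rng(seed)
    N = D.shape[0]
    Z = rng.normal(size=(N, k))
    for _ in range(steps):
        diff = Z[:, None, :] - Z[None, :, :]                  # N x N x k array of z_i - z_j
        dist = np.sqrt((diff ** 2).sum(axis=2)) + np.eye(N)   # +I avoids divide-by-zero
        # dS_LS/dz_i = -2 * sum_j (d_ij - dist_ij) * (z_i - z_j) / dist_ij
        coef = -2.0 * (D - dist) / dist
        np.fill_diagonal(coef, 0.0)
        grad = (coef[:, :, None] * diff).sum(axis=1)
        Z -= eta * grad
    return Z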
Then a stress function that emphasizes the matching of small dissimilarities and not large ones is (for some choice of $\lambda>0$)
\[
S_{L}\left(z_1,z_2,\ldots,z_N\right)=\sum_{\substack{i<j \text{ and}\\ (i,j)\in\mathcal{N}_K}}\left(d_{ij}-\|z_i-z_j\|\right)^2-\lambda\sum_{\substack{i<j \text{ and}\\ (i,j)\notin\mathcal{N}_K}}\|z_i-z_j\|
\]
Another version of MDS begins with similarities $s_{ij}$ (rather than with dissimilarities $d_{ij}$). (One important special case of similarities derives from vectors $x_1,x_2,\ldots,x_N$ in $\Re^p$ through the centered inner products $s_{ij}=\left\langle x_i-\bar{x},x_j-\bar{x}\right\rangle$.) A "classical scaling" criterion is
\[
S_{C}\left(z_1,z_2,\ldots,z_N\right)=\sum_{i<j}\left(s_{ij}-\left\langle z_i-\bar{z},z_j-\bar{z}\right\rangle\right)^2
\]
HTF claim that if in fact the similarities are centered inner products, classical scaling is exactly equivalent to principal components analysis.

The four scaling criteria above are all "metric" scaling criteria in that the distances $\|z_i-z_j\|$ are meant to approximate the $d_{ij}$ directly. An alternative is to attempt minimization of a non-metric stress function like
\[
S_{NM}\left(z_1,z_2,\ldots,z_N\right)=\frac{\sum_{i<j}\left(\theta\left(d_{ij}\right)-\|z_i-z_j\|\right)^2}{\sum_{i<j}\|z_i-z_j\|^2}
\]
over vectors $z_1,z_2,\ldots,z_N$ and increasing functions $\theta(\cdot)$. $\theta(\cdot)$ will preserve/enforce the natural ordering of the dissimilarities without attaching importance to their precise values. Iterative algorithms for optimization of this stress function alternate between isotonic regression to choose $\theta(\cdot)$ and gradient descent to choose the $z_i$.

In general, if one can produce a small value of stress in MDS, one has discovered a $k$-dimensional representation of the $N$ items, and for small $k$, this is a form of "simple structure."

17.4 More on Principal Components and Related Ideas

Here we extend the principal components ideas first raised in Section 2.3 based on the SVD ideas of Section 2.2, still with the motivation of using them as a means of identifying simple structure in an $N$-case $p$-variable data set, where as earlier, we write
\[
\underset{N\times p}{X}=\begin{pmatrix}x_1'\\ x_2'\\ \vdots\\ x_N'\end{pmatrix}
\]

17.4.1 "Sparse" Principal Components

In standard principal components analysis, the $v_j$ are sometimes called "loadings" because (in light of the fact that $z_j=Xv_j$) they specify what linear combinations of the variables $x_j$ are used in making the various principal component vectors. If the $v_j$ were "sparse" (had lots of 0's in them), interpretation of these loadings would be easier. So people have made proposals of alternative methods of defining "principal components" that will tend to produce sparse results. One due to Zou is as follows. One might call a $v\in\Re^p$ a first sparse principal component direction if it is part of a minimizer (over choices of $v\in\Re^p$ and $\theta\in\Re^p$ with $\|\theta\|=1$) of the criterion
\[
\sum_{i=1}^{N}\left\|x_i-\theta v'x_i\right\|^2+\lambda\|v\|^2+\lambda_1\|v\|_1 \tag{121}
\]
for $\|\cdot\|_1$ the $l_1$ norm on $\Re^p$ and constants $\lambda\geq 0$ and $\lambda_1\geq 0$. The last term in this expression is analogous to the lasso penalty on a vector of regression coefficients as considered in Section 3.1.2, and produces the same kind of tendency to "0 out" entries that we saw in that context. If $\lambda_1=0$, the first sparse principal component direction, $v$, is proportional to the ordinary first principal component direction. In fact, if $\lambda=\lambda_1=0$ and $N>p$, $v=$ the ordinary first principal component direction is the optimizer. For multiple components, an analogue of the first case is a set of $K$ vectors $v_k\in\Re^p$ organized into a $p\times K$ matrix $V$ that is part of a minimizer (over choices of $p\times K$ matrices $V$ and $p\times K$ matrices $\Theta$ with $\Theta'\Theta=I$) of the criterion
\[
\sum_{i=1}^{N}\left\|x_i-\Theta V'x_i\right\|^2+\lambda\sum_{k=1}^{K}\|v_k\|^2+\sum_{k=1}^{K}\lambda_{1k}\|v_k\|_1 \tag{122}
\]
for constants $\lambda\geq 0$ and $\lambda_{1k}\geq 0$. Zou has apparently provided effective algorithms for optimizing criteria (121) and (122).
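As a hedged illustration of the structure of criterion (121) (this is not Zou's algorithm), one can alternate a closed-form update of $\theta$ with a proximal-gradient (soft-thresholding) update of $v$: for fixed $\theta$, the $v$-part of (121) is an elastic-net-type problem, and for fixed $v$, the optimal unit-length $\theta$ is proportional to $X'Xv$. A minimal Python/numpy sketch with hypothetical names:

import numpy as np

def soft_threshold(a, t):
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def first_sparse_pc(X, lam=1.0, lam1=1.0, iters=200):
    """Alternating minimization of criterion (121):
    sum_i ||x_i - theta v'x_i||^2 + lam ||v||^2 + lam1 ||v||_1, with ||theta|| = 1."""
    XtX = X.T @ X
    eta = 1.0 / (2 * np.linalg.eigvalsh(XtX)[-1] + 2 * lam)   # step for the smooth part
    v = np.linalg.eigh(XtX)[1][:, -1]                         # start at ordinary first PC
    theta = v.copy()
    for _ in range(iters):
        # theta-step: maximize theta'X'Xv subject to ||theta|| = 1
        theta = XtX @ v
        theta /= np.linalg.norm(theta)
        # v-step: one proximal-gradient step on v'X'Xv - 2 v'X'X theta + lam ||v||^2
        grad = 2 * (XtX @ v - XtX @ theta) + 2 * lam * v
        v = soft_threshold(v - eta * grad, eta * lam1)
    return v, theta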
17.4.2 Non-negative Matrix Factorization

There are contexts (for example, when data are counts) where it may not make intuitive sense to center inherently non-negative variables, so that $X$ is naturally non-negative, and one might want to find non-negative matrices $W$ and $H$ such that
\[
\underset{N\times p}{X}\approx \underset{N\times r}{W}\ \underset{r\times p}{H}
\]
Here the emphasis might be on the columns of $W$ as representing "positive components" of the (positive) $X$, just as the columns of the matrix $UD$ in SVDs provide the principal components of $X$. Various optimization criteria could be set to guide the choice of $W$ and $H$. One might try to minimize
\[
\sum_{i=1}^{N}\sum_{j=1}^{p}\left(x_{ij}-(WH)_{ij}\right)^2
\]
or maximize
\[
\sum_{i=1}^{N}\sum_{j=1}^{p}\left(x_{ij}\ln\left((WH)_{ij}\right)-(WH)_{ij}\right)
\]
over non-negative choices of $W$ and $H$, and various algorithms for doing these have been proposed. (Notice that the second of these criteria is an extension of a loglikelihood for independent Poisson variables with means the entries of $WH$ to cases where the $x_{ij}$ need only be non-negative, not necessarily integer.)

While at first blush this enterprise seems sensible, there is a lack of uniqueness in a factorization producing a product $WH$, and therefore how to interpret the columns of one of the many possible $W$'s is not clear. (An easy way to see the lack of uniqueness is this. Suppose that all entries of the product $WH$ are positive. Then for $E$ a small enough (but not $0$) matrix, all entries of $\widetilde{W}\equiv W(I+E)\neq W$ and $\widetilde{H}\equiv (I+E)^{-1}H\neq H$ are positive, and $\widetilde{W}\widetilde{H}=WH$.) Lacking some natural further restriction on the factors $W$ and $H$ (beyond non-negativity), it seems the practical usefulness of this basic idea is also lacking.

17.4.3 Archetypal Analysis

Another approach to finding an interpretable factorization of $X$ was provided by Cutler and Breiman in their "archetypal analysis." Again one means to write
\[
\underset{N\times p}{X}\approx \underset{N\times r}{W}\ \underset{r\times p}{H}
\]
for appropriate $W$ and $H$. But here two restrictions are imposed, namely that

1. the rows of $W$ are probability vectors (so that the approximation to $X$ is in terms of convex combinations/weighted averages of the rows of $H$), and

2. $\underset{r\times p}{H}=\underset{r\times N}{B}\ \underset{N\times p}{X}$ where the rows of $B$ are probability vectors (so that the rows of $H$ are in turn convex combinations/weighted averages of the rows of $X$).

The $r$ rows of $H=BX$ are the "prototypes" (the "archetypes") used to represent the data matrix $X$. With this notation and these restrictions, the (stochastic) matrices $W$ and $B$ are chosen to minimize
\[
\left\|X-WBX\right\|^2
\]
It's clearly possible to rearrange the rows of a minimizing $B$ and make corresponding changes in $W$ without changing $\left\|X-WBX\right\|^2$. So strictly speaking, the optimization problem has multiple solutions. But in terms of the set of rows of $H$ (a set of prototypes of size $r$) it's possible that this optimization problem often has a unique solution. (Symmetries induced in the set of $N$ rows of $X$ can be used to produce examples where it's clear that genuinely different sets of prototypes produce the same minimal value of $\left\|X-WBX\right\|^2$. But it seems likely that real data sets will usually lack such symmetries and lead to a single optimizing set of prototypes.)

Emphasis in this version of the "approximate $X$" problem is on the set of prototypes as "representative data cases." This has to be taken with a grain of salt, since the prototypes are nearly always near the "edges" of the data set. This should be no surprise, as line segments between extreme cases in $\Re^p$ can be made to run close to cases in the "middle" of the data set, while line segments between interior cases in the data set will never run close to extreme cases.
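Returning to the squared-error NMF criterion above, one standard computational device (not discussed in the outline) is the multiplicative update scheme of Lee and Seung, which preserves non-negativity at every step. A minimal Python/numpy sketch with hypothetical names:

import numpy as np

def nmf(X, r, iters=500, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for min ||X - W H||_F^2
    over nonnegative W (N x r) and H (r x p)."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    W = rng.uniform(0.1, 1.0, size=(N, r))
    H = rng.uniform(0.1, 1.0, size=(r, p))
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # elementwise update keeps H >= 0
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # elementwise update keeps W >= 0
    return W, H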
17.4.4 Independent Component Analysis

We begin by supposing (as before) that $X$ has been centered. For simplicity, suppose also that it is of full rank $p$. With the SVD as before,
\[
\underset{N\times p}{X}=\underset{N\times p}{U}\ \underset{p\times p}{D}\ \underset{p\times p}{V'}
\]
we consider the "sphered" version of the data matrix
\[
X^*=\sqrt{N}\,XVD^{-1}
\]
so that the sample covariance matrix of the data is
\[
\frac{1}{N}X^{*\prime}X^*=I
\]
Note that the columns of $X^*$ are then scaled principal components of the (centered) data matrix, and we operate with and on $X^*$. (For simplicity of notation, we'll henceforth drop the "$*$" on $X$.)

This methodology seems to be an attempt to find latent probabilistic structure in terms of independent variables to account for the principal components. In particular, in its linear form, ICA attempts to model the $N$ (transposed) rows of $X$ as iid of the form
\[
\underset{p\times 1}{x_i}=\underset{p\times p}{A}\ \underset{p\times 1}{s_i} \tag{123}
\]
for iid vectors $s_i$, where the (marginal) distribution of the vectors $s_i$ is one of independence of the $p$ coordinates/components and the matrix $A$ is an unknown parameter. Consistent with our sphering of the data matrix, we'll assume that $\mathrm{Cov}\,x=I$, and without any loss of generality assume that the covariance matrix for $s$ is not only diagonal but that $\mathrm{Cov}\,s=I$. Since then $I=\mathrm{Cov}\,x=A\left(\mathrm{Cov}\,s\right)A'=AA'$, $A$ must be orthogonal, and so
\[
A'x=s
\]
Obviously, if one can estimate $A$ with an orthogonal $\widehat{A}$, then $\widehat{s}_i\equiv\widehat{A}'x_i$ serves as an estimate of what vector of independent components led to the $i$th row of $X$, and indeed
\[
\widehat{S}\equiv X\widehat{A}
\]
has columns that provide predictions of the $N$ (row) $p$-vectors $s_i'$, and we might thus call those the "independent components" of $X$ (just as we term the columns of $XV$ the principal components of $X$). There is a bit of arbitrariness in the representation (123) because the ordering of the coordinates of $s$ and the corresponding columns of $A$ is arbitrary. But this is no serious concern.

So then, the question is what one might use as a method to estimate $A$ in (123). There are several possibilities. The one discussed in HTF is related to entropy and Kullback-Leibler distance. If one assumes that a ($p$-dimensional) random vector $Y$ has a density $f$, with marginal densities $f_1,f_2,\ldots,f_p$, then an "independence version" of the distribution of $Y$ has density $\prod_{j=1}^{p}f_j$ and the K-L divergence of the distribution of $Y$ from its independence version is
\begin{align*}
K\!\left(f,\prod_{j=1}^{p}f_j\right)&=\int f(y)\ln\!\left(\frac{f(y)}{\prod_{j=1}^{p}f_j(y_j)}\right)dy\\
&=\int f(y)\ln f(y)\,dy-\sum_{j=1}^{p}\int f(y)\ln f_j(y_j)\,dy\\
&=\int f(y)\ln f(y)\,dy-\sum_{j=1}^{p}\int f_j(y_j)\ln f_j(y_j)\,dy_j\\
&=\sum_{j=1}^{p}H\left(Y_j\right)-H\left(Y\right)
\end{align*}
for $H$ the entropy function for a random argument, and one might think of this K-L divergence as the difference between the sum across the components of their individual information contents and the information carried by $Y$ (jointly).

If one then thinks of $s$ as random and of the form $A'x$ for random $x$, it is perhaps sensible to seek an orthogonal $A$ to minimize (for $a_j$ the $j$th column of $A$)
\begin{align*}
\sum_{j=1}^{p}H\left(s_j\right)-H\left(s\right)&=\sum_{j=1}^{p}H\left(a_j'x\right)-H\left(A'x\right)\\
&=\sum_{j=1}^{p}H\left(a_j'x\right)-H\left(x\right)-\ln\left|\det A\right|\\
&=\sum_{j=1}^{p}H\left(a_j'x\right)-H\left(x\right)
\end{align*}
(the last equality holding because $A$ is orthogonal). As it turns out, this is equivalent (for orthogonal $A$) to maximization of
\[
C(A)=\sum_{j=1}^{p}\left(H(z)-H\left(a_j'x\right)\right) \tag{124}
\]
for $z$ standard normal. Then a common approximation is apparently
\[
H(z)-H\left(a_j'x\right)\approx\left(EG(z)-EG\left(a_j'x\right)\right)^2
\]
for $G(u)\equiv\frac{1}{c}\ln\cosh(cu)$ for a $c\in[1,2]$. Then criterion (124) has the empirical approximation
\[
\widehat{C}(A)=\sum_{j=1}^{p}\left(EG(z)-\frac{1}{N}\sum_{i=1}^{N}G\left(a_j'x_i\right)\right)^2
\]
where, as usual, $x_i'$ is the $i$th row of $X$. $\widehat{A}$ can be taken to be a maximizer of $\widehat{C}(A)$.
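To make the empirical criterion concrete, here is a toy Python/numpy sketch for the special case $p=2$, where an orthogonal $A$ is (up to sign and ordering of the recovered components) a rotation by a single angle, so $\widehat{C}(A)$ can simply be maximized over a grid of angles; $EG(z)$ is approximated by Monte Carlo. This is an illustration only (hypothetical names), not a practical ICA routine.

import numpy as np

def G(u, c=1.0):
    return np.log(np.cosh(c * u)) / c

def ica_p2(Xs, c=1.0, n_grid=360, n_mc=100000, seed=0):
    """For sphered data Xs (N x 2), grid-search a rotation angle maximizing
    C_hat(A) = sum_j ( E G(z) - mean_i G(a_j'x_i) )^2, with z ~ N(0,1)."""
    rng = np.random.default_rng(seed)
    EGz = G(rng.standard_normal(n_mc), c).mean()       # Monte Carlo E G(z)
    best, best_A = -np.inf, None
    for phi in np.linspace(0.0, np.pi / 2, n_grid):    # rotations up to sign/order
        A = np.array([[np.cos(phi), -np.sin(phi)],
                      [np.sin(phi),  np.cos(phi)]])
        proj = Xs @ A                                   # column j holds the a_j'x_i
        Chat = ((EGz - G(proj, c).mean(axis=0)) ** 2).sum()
        if Chat > best:
            best, best_A = Chat, A
    return best_A, Xs @ best_A                          # A_hat and the fitted S_hat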
Ultimately, this development produces a rotation matrix that makes the $p$ entries of the rotated and scaled principal component score vectors "look as independent as possible." This is thought of as resolution of a data matrix into its "independent sources" and as a technique for "blind source separation."

17.4.5 Principal Curves and Surfaces

In the context of Section 2.3, the line in $\Re^p$ defined by $\left\{cv_1\mid c\in\Re\right\}$ serves as a best straight-line representative of the data set. Similarly, the "plane" in $\Re^p$ defined by $\left\{c_1v_1+c_2v_2\mid c_1\in\Re,\ c_2\in\Re\right\}$ serves as a best "planar" representative of the data set. The notions of "principal curve" and "principal surface" are attempts to generalize these ideas to 1-dimensional and 2-dimensional structures in $\Re^p$ that may not be "straight" or "flat" but rather have some curvature.

A parametric curve in $\Re^p$ is represented by a vector-valued function
\[
f(t)=\begin{pmatrix}f_1(t)\\ f_2(t)\\ \vdots\\ f_p(t)\end{pmatrix}
\]
defined on some interval $[0,T]$, where we assume that the coordinate functions $f_j(t)$ are smooth. With
\[
f'(t)=\begin{pmatrix}f_1'(t)\\ f_2'(t)\\ \vdots\\ f_p'(t)\end{pmatrix}
\]
the "velocity vector" for the curve, $\left\|f'(t)\right\|$ is then the "speed" of the curve, and the arc length (distance) along $f(t)$ from $t=0$ to $t=t^0$ is
\[
L_f\left(t^0\right)=\int_{0}^{t^0}\left\|f'(t)\right\|dt
\]
In order to set an unambiguous representation of a curve, it will be useful to assume that it is parameterized so that it has unit speed, i.e. that $L_f\left(t^0\right)=t^0$ for all $t^0\in[0,T]$. Notice that if it does not, then for $L_f^{-1}$ the inverse of the arc length function, the parametric curve
\[
g\left(\lambda\right)=f\left(L_f^{-1}\left(\lambda\right)\right) \quad\text{for } \lambda\in\left[0,L_f(T)\right] \tag{125}
\]
does have unit speed and traces out the same set of points in $\Re^p$ that are traced out by $f(t)$. So there is no loss of generality in assuming that the parametric curves we consider here are parameterized by arc length, and we'll henceforth write $f\left(\lambda\right)$.

Then, for a unit-speed parametric curve $f\left(\lambda\right)$ and point $x\in\Re^p$, we'll define the projection index
\[
\lambda_f\left(x\right)=\sup\left\{\lambda\ \middle|\ \left\|x-f\left(\lambda\right)\right\|=\inf_{\lambda'}\left\|x-f\left(\lambda'\right)\right\|\right\} \tag{126}
\]
This is roughly the last arc length at which the distance from $x$ to the curve is minimum. If one thinks of $x$ as random, the "reconstruction error"
\[
E\left\|x-f\left(\lambda_f\left(x\right)\right)\right\|^2
\]
(the expected squared distance between $x$ and the curve) might be thought of as a measure of how well the curve represents the distribution. Of course, for a data set containing $N$ cases $x_i$, an empirical analogue of this is
\[
\frac{1}{N}\sum_{i=1}^{N}\left\|x_i-f\left(\lambda_f\left(x_i\right)\right)\right\|^2 \tag{127}
\]
and a "good" curve representing the data set should have a small value of this empirical reconstruction error. Notice, however, that this can't be the only consideration. If it were, there would surely be no real difficulty in running a very wiggly (and perhaps very long) curve through every element of a data set to produce a curve with 0 empirical reconstruction error. This suggests that with
\[
f''\left(\lambda\right)=\begin{pmatrix}f_1''\left(\lambda\right)\\ f_2''\left(\lambda\right)\\ \vdots\\ f_p''\left(\lambda\right)\end{pmatrix}
\]
the curve's "acceleration vector," there must be some kind of control exercised on the curvature, $\left\|f''\left(\lambda\right)\right\|$, in the search for a good curve. We'll note below where this control is implicitly applied in standard algorithms for producing principal curves for a data set.

Returning for a moment to the case where we think of $x$ as random, we'll say that $f\left(\lambda\right)$ is a principal curve for the distribution of $x$ if it satisfies a so-called self-consistency property, namely that
\[
f\left(\lambda\right)=E\left[x\mid\lambda_f\left(x\right)=\lambda\right] \tag{128}
\]
This provides motivation for an iterative 2-step algorithm to produce a "principal curve" for a data set.
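A small computational sketch of the projection index (126) and the empirical reconstruction error (127) may be useful before describing the algorithm. Here the curve is assumed to be stored as a fine grid of points curve_pts together with their (approximate) arc lengths, so that curve_pts[k] roughly equals $f(\lambda_k)$; the "sup" in (126) is implemented as the largest minimizing grid arc length (Python/numpy, hypothetical names).

import numpy as np

def projection_index(x, curve_pts, arc_lengths):
    """lambda_f(x) from display (126), over a discretized curve."""
    d = np.linalg.norm(curve_pts - x, axis=1)
    idx = np.flatnonzero(np.isclose(d, d.min()))[-1]   # last (largest) minimizer
    return idx, arc_lengths[idx]

def reconstruction_error(X, curve_pts, arc_lengths):
    """Empirical reconstruction error (127): average squared distance from
    each x_i to the curve point at its projection index."""
    err = 0.0
    for x in X:
        idx, _ = projection_index(x, curve_pts, arc_lengths)
        err += np.sum((x - curve_pts[idx]) ** 2)
    return err / X.shape[0]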
Begin iteration with $f^0\left(\lambda\right)$ the ordinary first principal component line for the data set. Specifically, do something like the following. For some choice of $T>2\max_{i=1,\ldots,N}\left|\left\langle x_i,v_1\right\rangle\right|$, begin with
\[
f^0\left(\lambda\right)=\left(\lambda-\frac{T}{2}\right)v_1
\]
to create a unit-speed curve that extends past the data set in both directions along the first principal component direction in $\Re^p$. Then project the $x_i$ onto the line to get $N$ values
\[
\lambda_i^1=\lambda_{f^0}\left(x_i\right)=\left\langle x_i,v_1\right\rangle+\frac{T}{2}
\]
and, in light of the criterion (128), more or less average the $x_i$ with $\lambda_{f^0}\left(x_i\right)$ near $\lambda$ to get $f^1\left(\lambda\right)$. A specific possible version of this is to consider, for each coordinate $j$, the $N$ pairs $\left(\lambda_i^1,x_{ij}\right)$ and to take as $f_j^1\left(\lambda\right)$ a function on $[0,T]$ that is a 1-dimensional cubic smoothing spline fit to those pairs. One may then assemble these into a vector function to create $f^1\left(\lambda\right)$. NOTICE that implicit in this prescription is control over the second derivatives of the component functions through the stiffness parameter/weight in the smoothing spline optimization. Notice also that for this prescription, the unit-speed property of $f^0\left(\lambda\right)$ will not carry over to $f^1\left(\lambda\right)$, and it seems that one must use the idea (125) to assure that $f^1\left(\lambda\right)$ is parameterized in terms of arc length on $\left[0,\int_0^T\left\|f^{1\prime}\left(\lambda\right)\right\|d\lambda\right]$.

With iterate $f^{m-1}\left(\lambda\right)$ in hand, one projects the $x_i$ onto the curve to get $N$ values
\[
\lambda_i^m=\lambda_{f^{m-1}}\left(x_i\right)
\]
and for each coordinate $j$ considers the $N$ pairs $\left\{\left(\lambda_i^m,x_{ij}\right)\right\}$. Fitting a 1-dimensional cubic smoothing spline to these produces $f_j^m\left(\lambda\right)$, and these functions are assembled into a vector to create $f^m\left(\lambda\right)$ (which may need some adjustment via (125) to assure that the curve is parameterized in terms of arc length). One iterates until the empirical reconstruction error (127) converges, and one takes the corresponding $f^m\left(\lambda\right)$ to be a principal curve. Notice that the stiffness parameter applied in the smoothing steps will govern what one gets for such a curve.

There have been efforts to extend principal curves technology to the creation of 2-dimensional principal surfaces. Parts of the extension are more or less clear. A parametric surface in $\Re^p$ is represented by a vector-valued function
\[
f(t)=\begin{pmatrix}f_1(t)\\ f_2(t)\\ \vdots\\ f_p(t)\end{pmatrix}
\]
for $t\in S\subset\Re^2$. A projection index parallel to (126) and a self-consistency property parallel to (128) can clearly be defined for the surface case, and thin plate splines can replace 1-dimensional cubic smoothing splines for producing iterates of the coordinate functions. But ideas of unit speed don't have obvious translations to $\Re^2$, and methods here seem fundamentally more complicated than what is required for the 1-dimensional case.

17.5 Density Estimation

One might phrase the problem of describing structure for $x$ in terms of estimating a pdf for the variable. We briefly consider the problem of "kernel density estimation." This is the problem: given $x_1,x_2,\ldots,x_N$ iid with (unknown) pdf $f$, how does one estimate $f$? For the time being, suppose that $p=1$. For $g\left(\cdot\right)$ some fixed pdf (like, for example, the standard normal pdf), invent a location-scale family of densities on $\Re$ by defining (for $\lambda>0$)
\[
h\left(x\mid\mu,\lambda\right)=\frac{1}{\lambda}\,g\!\left(\frac{x-\mu}{\lambda}\right)
\]
One may, if one then wishes, think of a corresponding "kernel"
\[
K_\lambda\left(x,\mu\right)\equiv h\left(x\mid\mu,\lambda\right)=\frac{1}{\lambda}\,g\!\left(\frac{x-\mu}{\lambda}\right)
\]
The Parzen estimate of $f\left(x_0\right)$ is then
\[
\hat{f}\left(x_0\right)=\frac{1}{N}\sum_{i=1}^{N}h\left(x_0\mid x_i,\lambda\right)=\frac{1}{N}\sum_{i=1}^{N}K_\lambda\left(x_0,x_i\right)
\]
an average of kernel values. A standard choice of univariate density $g\left(\cdot\right)$ is $\varphi\left(\cdot\right)$, the standard normal pdf. The natural generalization of this to $p$ dimensions is to let $h\left(\cdot\mid\mu,\lambda\right)$ be the $\mathrm{MVN}_p\!\left(\mu,\lambda^2I\right)$ density.
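A minimal sketch of the $p=1$ Parzen estimate with standard normal $g$ (Python/numpy; the function and variable names are hypothetical):

import numpy as np

def parzen(x0, data, lam):
    """f_hat(x0) = (1/N) sum_i (1/lam) * phi((x0 - x_i)/lam), phi the N(0,1) pdf."""
    u = (x0 - data) / lam
    return np.mean(np.exp(-0.5 * u ** 2) / (lam * np.sqrt(2 * np.pi)))

# usage: estimate the density over a grid of points
# data = np.random.default_rng(0).standard_normal(200)
# grid = np.linspace(-3, 3, 61)
# fhat = np.array([parzen(x0, data, lam=0.4) for x0 in grid])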
One should expect that unless $N$ is huge, this methodology will be reliable only for fairly small $p$ (say 3 at most) as a means of estimating a general $p$-dimensional pdf. There is some discussion in HTF of classification based on estimated multivariate densities, and in particular on "naive" estimators. That is, for purposes of defining classifiers (not for the purpose of real density estimation), one might treat the elements $x_j$ of $x$ as if they were independent and multiply together kernel estimates of the marginal densities. Apparently, even where the assumption of independence is clearly "wrong," this naive methodology can produce classifiers that work pretty well.

17.6 Google PageRanks

This might be thought of as of some general interest beyond the particular application of ranking Web pages, if one abstracts the general notion of summarizing features of a directed graph with $N$ nodes ($N$ Web pages in the motivating application) where edges point from some nodes to other nodes (there are links on some Web pages to other Web pages). The basic idea is that one wishes to rank the nodes (Web pages) by some measure of importance. If $i\neq j$, define
\[
L_{ij}=\begin{cases}1 & \text{if there is a directed edge pointing from node } j \text{ to node } i\\ 0 & \text{otherwise}\end{cases}
\]
and define
\[
c_j=\sum_{i=1}^{N}L_{ij}=\text{the number of directed edges pointed away from node } j
\]
(There is the question of how we are going to define $L_{jj}$. We may either declare that there is an implicit edge pointed from each node $j$ to itself and adopt the convention that $L_{jj}=1$, or we may declare that all $L_{jj}=0$.) A node (Web page) might be more important if many other (particularly important) nodes have edges (links) pointing to it. The Google PageRanks $r_i>0$ are chosen to satisfy
\[
r_i=\left(1-d\right)+d\sum_{j}\left(\frac{L_{ij}}{c_j}\right)r_j \tag{129}
\]
for some $d\in(0,1)$ (producing minimum rank $1-d$). (Apparently, a standard choice is $d=.85$.) (There is the question of what $L_{ij}/c_j$ should mean in case $c_j=0$; we'll presume that the meaning is that the ratio is $0$.) The question is how one can identify the $r_i$. (These are, of course, simply $N$ linear equations in the $N$ unknowns $r_i$, and for small $N$ one might ignore special structure and simply solve them numerically with a generic solver. In what follows we exploit some special structure.) Without loss of generality, with
\[
r=\begin{pmatrix}r_1\\ \vdots\\ r_N\end{pmatrix}
\]
we'll assume that $r'\mathbf{1}=N$, so that the average rank is 1. Then, for
\[
d_j=\begin{cases}\dfrac{1}{c_j} & \text{if } c_j\neq 0\\ 0 & \text{if } c_j=0\end{cases}
\]
define the $N\times N$ diagonal matrix $D=\mathrm{diag}\left(d_1,d_2,\ldots,d_N\right)$. (Clearly, if we use the $L_{jj}=1$ convention, then $d_j=1/c_j$ for all $j$.) Then in matrix form, the $N$ equations (129) are (for $L=\left(L_{ij}\right)_{\substack{i=1,\ldots,N\\ j=1,\ldots,N}}$)
\begin{align*}
r&=\left(1-d\right)\mathbf{1}+dLDr\\
&=\left(\frac{1}{N}\left(1-d\right)\mathbf{1}\mathbf{1}'+dLD\right)r
\end{align*}
(using the assumption that $r'\mathbf{1}=N$). Let
\[
T=\frac{1}{N}\left(1-d\right)\mathbf{1}\mathbf{1}'+dLD
\]
so that $r=Tr$, i.e. $r'T'=r'$. Note that all entries of $T$ are non-negative and that
\begin{align*}
T'\mathbf{1}&=\left(\frac{1}{N}\left(1-d\right)\mathbf{1}\mathbf{1}'+dDL'\right)\mathbf{1}\\
&=\left(1-d\right)\mathbf{1}+dD\begin{pmatrix}c_1\\ c_2\\ \vdots\\ c_N\end{pmatrix}
\end{align*}
so that if all $c_j>0$, then $T'\mathbf{1}=\mathbf{1}$. We have this condition as long as we either limit application to sets of nodes (Web pages) where each node has an outgoing edge (an outgoing link) or decide to count every node as pointing to itself (every page as linking to itself) using the $L_{jj}=1$ convention. Henceforth suppose that indeed all $c_j>0$. Under this assumption, $T'$ is a stochastic matrix (with rows that are probability vectors), the transition matrix for an irreducible aperiodic finite-state Markov Chain.
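A minimal Python/numpy sketch (hypothetical names; not necessarily how any production system computes ranks) that assembles $T$ as just defined and solves the fixed-point equation $r=Tr$ by simple iteration, assuming every $c_j>0$:

import numpy as np

def pagerank(L, d=0.85, iters=200):
    """Ranks r solving (129), via T = ((1-d)/N) 1 1' + d L D.
    L[i, j] = 1 if node j points to node i; assumes every column sum c_j > 0."""
    N = L.shape[0]
    c = L.sum(axis=0)                      # c_j = number of edges out of node j
    D = np.diag(1.0 / c)
    T = (1.0 - d) * np.ones((N, N)) / N + d * (L @ D)
    r = np.ones(N)                         # start with r'1 = N
    for _ in range(iters):
        r = T @ r                          # sum of r is preserved since T'1 = 1
    return r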
Defining the probability vector
\[
p=\frac{1}{N}r
\]
it then follows (since $p'T'=p'$) that the PageRank vector is $N$ times the stationary probability vector for the Markov Chain. This stationary probability vector can then be found as the limit of any row of $\left(T'\right)^n$ as $n\rightarrow\infty$.

18 Document Features and String Kernels for Text Processing

An important application of various kinds of both supervised and unsupervised learning methods is that of text processing. The object is to quantify structure and commonalities in text documents. Patterns in characters, character strings, and words are used to characterize documents, group them into clusters, and classify them into types. We here say a bit about some simple methods that have been applied.

Suppose that $N$ documents in a collection (or corpus) are under study. One needs to define "features" for these, or at least some kind of "kernel" functions for computing the inner products required for producing principal components in an implicit feature space (and subsequently clustering or deriving classifiers, and so on). If one treats documents as simply sets of words (ignoring spaces and punctuation and any kind of order of words), one simple set of features for documents $d_1,d_2,\ldots,d_N$ is a set of counts of word frequencies. That is, for a set of $p$ words appearing in at least one document, one might take
\[
x_{ij}=\text{the number of occurrences of word } j \text{ in document } i
\]
and operate on a representation of the documents in terms of an $N\times p$ data matrix $X$.

These raw counts $x_{ij}$ are often transformed before processing. One popular idea is the use of a "tf-idf" (term frequency-inverse document frequency) weighting of the elements of $X$. This replaces $x_{ij}$ with
\[
t_{ij}=x_{ij}\ln\left(\frac{N}{\sum_{i=1}^{N}I\left[x_{ij}>0\right]}\right)
\]
or variants thereof. (This up-weights non-zero counts of words that occur in few documents. The logarithm is there to prevent this up-weighting from overwhelming all other aspects of the counts.) One might also decide that document length is a feature that is not really of primary interest and determine to normalize the vectors $x_i$ (or $t_i$) in one way or another. That is, one might begin with the values
\[
\frac{x_{ij}}{\sum_{j=1}^{p}x_{ij}} \quad\text{or}\quad \frac{x_{ij}}{\left\|x_i\right\|}
\]
rather than the values $x_{ij}$. This latter consideration is, of course, not relevant if the documents in the corpus all have roughly the same length.

Processing methods that are based only on variants of the word counts $x_{ij}$ are usually said to be based on the "Bag-of-Words." They obviously ignore potentially important word order. (The instructions "turn right then left" and "turn left then right" are obviously quite different instructions.) One could then consider ordered pairs or $n$-tuples of words. So, with some "alphabet" $\mathcal{A}$ (that might consist of English words, Roman letters, amino acids in protein sequencing, base pairs in DNA sequencing, etc.) consider strings of elements of the alphabet, say
\[
s=b_1b_2\cdots b_{|s|} \quad\text{where each } b_i\in\mathcal{A}
\]
A document (or protein sequence, or DNA sequence) might be idealized as such a string of elements of $\mathcal{A}$. An $n$-gram in this context is simply a string of $n$ elements of $\mathcal{A}$, say
\[
u=b_1b_2\cdots b_n \quad\text{where each } b_i\in\mathcal{A}
\]
Frequencies of unigrams (1-grams) in documents are (depending upon the alphabet) "bag-of-words" statistics for words or letters or amino acids, etc. Use of features of documents that are counts of occurrences of all possible $n$-grams appears to often be problematic, because unless $\left|\mathcal{A}\right|$ and $n$ are fairly small, $p=\left|\mathcal{A}\right|^n$ will be huge and then $X$ huge and sparse (for ordinary $N$ and $|s|$).
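As a small computational aside on the word-count features above, the following is a minimal Python/numpy sketch (hypothetical names; variants of the weighting exist) of the tf-idf transformation and the Euclidean row normalization described earlier in this section:

import numpy as np

def tf_idf(X):
    """t_ij = x_ij * ln( N / #{documents containing word j} )."""
    N = X.shape[0]
    doc_freq = (X > 0).sum(axis=0)               # number of documents with word j
    idf = np.log(N / np.maximum(doc_freq, 1))    # guard against an all-zero column
    return X * idf

def normalize_rows(T):
    """Divide each row by its Euclidean norm (for documents of unequal length)."""
    norms = np.linalg.norm(T, axis=1, keepdims=True)
    return T / np.maximum(norms, 1e-12)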
And in many contexts, sequence/order structure is not so "local" as to be effectively expressed by only frequencies of $n$-grams for small $n$. One idea that seems to be currently popular is to define a set of interesting strings, say $\mathcal{U}=\left\{u_i\right\}_{i=1}^{p}$, and look for their occurrence anywhere in a document, with the understanding that they may be realized as (possibly non-contiguous) substrings of longer strings. That is, when looking for string $u$ (of length $n$) in a document $s$, we count every different substring of $s$ (say $s'=s_{i_1}s_{i_2}\cdots s_{i_n}$ with $i_1<i_2<\cdots<i_n$) for which
\[
s'=u
\]
But we discount those substrings of $s$ matching $u$ according to length as follows. For some $\lambda>0$ (the choice $\lambda=.5$ seems pretty common), give matching substring $s'=s_{i_1}s_{i_2}\cdots s_{i_n}$ the weight
\[
\lambda^{i_n-i_1+1}
\]
so that document $i$ (represented by string $s_i$) gets value of feature $j$
\[
x_{ij}=\sum_{s_{il_1}s_{il_2}\cdots s_{il_n}=u_j}\lambda^{l_n-l_1+1} \tag{130}
\]
It further seems that it's common to normalize the rows of $X$ by the usual Euclidean norm, producing in place of $x_{ij}$ the value
\[
\frac{x_{ij}}{\left\|x_i\right\|} \tag{131}
\]
This notion of using features (130) or normalized features (131) looks attractive, but potentially computationally prohibitive, particularly since the "interesting set" of strings $\mathcal{U}$ is often taken to be $\mathcal{A}^n$. One doesn't want to have to compute all features (130) directly and then operate with the very large matrix $X$. But just as we were reminded in Section 2.3.2, it is only $XX'$ that is required to find principal components of the features (or to define SVM classifiers or any other classifiers or clustering algorithms based on principal components). So if there is a way to efficiently compute or approximate the inner products for rows of $X$ defined by (130), namely
\begin{align*}
\left\langle x_i,x_{i'}\right\rangle&=\sum_{u\in\mathcal{A}^n}\left(\sum_{s_{il_1}s_{il_2}\cdots s_{il_n}=u}\lambda^{l_n-l_1+1}\right)\left(\sum_{s_{i'm_1}s_{i'm_2}\cdots s_{i'm_n}=u}\lambda^{m_n-m_1+1}\right)\\
&=\sum_{u\in\mathcal{A}^n}\ \sum_{s_{il_1}s_{il_2}\cdots s_{il_n}=u}\ \sum_{s_{i'm_1}s_{i'm_2}\cdots s_{i'm_n}=u}\lambda^{l_n-l_1+m_n-m_1+2}
\end{align*}
it might be possible to employ this idea. And if the inner products $\left\langle x_i,x_{i'}\right\rangle$ can be computed efficiently, then so can the inner products
\[
\left\langle \frac{x_i}{\left\|x_i\right\|},\frac{x_{i'}}{\left\|x_{i'}\right\|}\right\rangle=\frac{\left\langle x_i,x_{i'}\right\rangle}{\sqrt{\left\langle x_i,x_i\right\rangle\left\langle x_{i'},x_{i'}\right\rangle}}
\]
needed to employ $XX'$ for the normalized features (131). For what it is worth, it is in vogue to call the function of documents $s$ and $t$ defined by
\[
K\left(s,t\right)=\sum_{u\in\mathcal{A}^n}\ \sum_{s_{l_1}s_{l_2}\cdots s_{l_n}=u}\ \sum_{t_{m_1}t_{m_2}\cdots t_{m_n}=u}\lambda^{l_n-l_1+m_n-m_1+2}
\]
the String Subsequence Kernel, and then to call the matrix $XX'=\left(\left\langle x_i,x_j\right\rangle\right)=\left(K\left(s_i,s_j\right)\right)$ the Gram matrix for that "kernel." The good news is that there are fairly simple recursive methods for computing $K\left(s,t\right)$ exactly in $O\left(n\left|s\right|\left|t\right|\right)$ time and that there are approximations that are even faster (see the 2002 Journal of Machine Learning Research paper of Lodhi et al.). That makes the implicit use of features (130) or normalized features (131) possible in many text processing problems.

19 Kernel Mechanics

We have noted a number of times (particularly in the context of RKHSs) the potential usefulness of non-negative definite "kernels" $K\left(s,t\right)$. In the RKHS context, these map $\mathcal{C}\times\mathcal{C}\rightarrow\Re$ continuously, for $\mathcal{C}$ a compact subset of $\Re^p$. But the basic requirement is that for any set of points $\left\{x_i\right\}_{i=1}^{n}$, $x_i\in\mathcal{C}$, the Gram matrix
\[
\left(K\left(x_i,x_j\right)\right)_{\substack{i=1,\ldots,n\\ j=1,\ldots,n}}
\]
is non-negative definite. For most purposes in machine learning, one can give up continuity and the requirement that the domain of $K\left(s,t\right)$ be a subset of $\Re^p\times\Re^p$, replacing it with an arbitrary $\mathcal{X}\times\mathcal{X}$ and requiring only that the Gram matrix be non-negative definite for any set $\left\{x_i\right\}_{i=1}^{n}$, $x_i\in\mathcal{X}$. It is in this context that the "string kernels" of Section 18 can be called "kernels," and here we adopt this usage/understanding.
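For illustration, here is a brute-force Python sketch of the string subsequence kernel computed directly from its definition (enumerating all length-$n$ index tuples, so feasible only for quite short strings; in practice one would use the recursions of Lodhi et al.); the names are hypothetical.

from itertools import combinations
from collections import defaultdict

def subsequence_weights(s, n, lam=0.5):
    """For each n-gram u, the sum of lam**(i_n - i_1 + 1) over index tuples
    i_1 < ... < i_n with s[i_1]...s[i_n] = u (the features in display (130))."""
    w = defaultdict(float)
    for idx in combinations(range(len(s)), n):
        u = "".join(s[i] for i in idx)
        w[u] += lam ** (idx[-1] - idx[0] + 1)
    return w

def string_kernel(s, t, n, lam=0.5):
    """K(s, t) = sum over u of (weight of u in s) * (weight of u in t)."""
    ws, wt = subsequence_weights(s, n, lam), subsequence_weights(t, n, lam)
    return sum(ws[u] * wt[u] for u in ws if u in wt)

# usage: string_kernel("cat", "cart", 2) sums over the shared 2-grams "ca", "ct", "at"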
A direct way of producing a kernel function is through a Euclidean inner product of vectors of "features." That is, if $\phi:\mathcal{X}\rightarrow\Re^m$ (so that component $j$ of $\phi$, $\phi_j$, maps $\mathcal{X}$ to a univariate real feature), then
\[
K\left(s,t\right)=\left\langle \phi\left(s\right),\phi\left(t\right)\right\rangle \tag{132}
\]
is a kernel function. (This is, of course, the basic idea used in Section 2.3.2.)

Section 6.2 of the book Pattern Recognition and Machine Learning by Bishop notes that it is very easy to make new kernel functions from known ones. In particular, for $c>0$, $K_1\left(\cdot,\cdot\right)$ and $K_2\left(\cdot,\cdot\right)$ kernel functions on $\mathcal{X}\times\mathcal{X}$, $f\left(\cdot\right)$ arbitrary, $p\left(\cdot\right)$ a polynomial with non-negative coefficients, $\phi:\mathcal{X}\rightarrow\Re^m$, $K_3\left(\cdot,\cdot\right)$ a kernel on $\Re^m\times\Re^m$, and $M$ a non-negative definite matrix, all of the following are kernel functions:

1. $K\left(s,t\right)=cK_1\left(s,t\right)$ on $\mathcal{X}\times\mathcal{X}$,

2. $K\left(s,t\right)=f\left(s\right)K_1\left(s,t\right)f\left(t\right)$ on $\mathcal{X}\times\mathcal{X}$,

3. $K\left(s,t\right)=p\left(K_1\left(s,t\right)\right)$ on $\mathcal{X}\times\mathcal{X}$,

4. $K\left(s,t\right)=\exp\left(K_1\left(s,t\right)\right)$ on $\mathcal{X}\times\mathcal{X}$,

5. $K\left(s,t\right)=K_1\left(s,t\right)+K_2\left(s,t\right)$ on $\mathcal{X}\times\mathcal{X}$,

6. $K\left(s,t\right)=K_1\left(s,t\right)K_2\left(s,t\right)$ on $\mathcal{X}\times\mathcal{X}$,

7. $K\left(s,t\right)=K_3\left(\phi\left(s\right),\phi\left(t\right)\right)$ on $\mathcal{X}\times\mathcal{X}$, and

8. $K\left(s,t\right)=s'Mt$ on $\Re^m\times\Re^m$.

(Fact 7 generalizes the basic insight of display (132).) Further, if $\mathcal{X}=\mathcal{X}_A\times\mathcal{X}_B$, $K_A\left(\cdot,\cdot\right)$ is a kernel on $\mathcal{X}_A\times\mathcal{X}_A$, and $K_B\left(\cdot,\cdot\right)$ is a kernel on $\mathcal{X}_B\times\mathcal{X}_B$, then the following are both kernel functions:

9. $K\left(\left(s_A,s_B\right),\left(t_A,t_B\right)\right)=K_A\left(s_A,t_A\right)+K_B\left(s_B,t_B\right)$ on $\mathcal{X}\times\mathcal{X}$, and

10. $K\left(\left(s_A,s_B\right),\left(t_A,t_B\right)\right)=K_A\left(s_A,t_A\right)K_B\left(s_B,t_B\right)$ on $\mathcal{X}\times\mathcal{X}$.

An example of a kernel on a somewhat abstract (but finite) space is this. For a finite set $\mathcal{A}$ consider $\mathcal{X}=2^{\mathcal{A}}$, the set of all subsets of $\mathcal{A}$. A kernel on $\mathcal{X}\times\mathcal{X}$ can then be defined by
\[
K\left(A_1,A_2\right)=2^{\left|A_1\cap A_2\right|} \quad\text{for } A_1\subset\mathcal{A} \text{ and } A_2\subset\mathcal{A}
\]
There are several probabilistic and statistical arguments that can lead to forms for kernel functions. For example, a useful fact from probability theory (Bochner's Theorem) says that characteristic functions for $p$-dimensional distributions are non-negative definite complex-valued functions of $s\in\Re^p$. So if $\psi\left(s\right)$ is a real-valued characteristic function, then
\[
K\left(s,t\right)=\psi\left(s-t\right)
\]
is a kernel function on $\Re^p\times\Re^p$. Related to this line of thinking are lists of standard characteristic functions (that in turn produce kernel functions) and theorems about conditions sufficient to guarantee that a real-valued function is a characteristic function. For example, each of the following is a real characteristic function for a univariate random variable (and so can lead to a kernel on $\Re^1\times\Re^1$):

1. $\psi\left(t\right)=\cos at$ for some $a>0$,

2. $\psi\left(t\right)=\dfrac{\sin at}{at}$ for some $a>0$,

3. $\psi\left(t\right)=\exp\left(-at^2\right)$ for some $a>0$, and

4. $\psi\left(t\right)=\exp\left(-a\left|t\right|\right)$ for some $a>0$.

And one theorem about sufficient conditions for a real-valued function on $\Re^1$ to be a characteristic function says that if $f$ is symmetric ($f\left(-t\right)=f\left(t\right)$), $f\left(0\right)=1$, and $f$ is decreasing and convex on $[0,\infty)$, then $f$ is the characteristic function of some distribution on $\Re^1$. (See Chung, page 191.)

Bishop points out two constructions motivated by statistical modeling that yield kernels that have been used in the machine learning literature. One is this. For a parametric model on (a potentially completely abstract) $\mathcal{X}$, consider densities $f\left(x\mid\theta\right)$ that, when treated as functions of $\theta$, are likelihood functions (for various possible observed $x$). Then for a distribution $G$ for $\theta\in\Theta$,
\[
K\left(s,t\right)=\int f\left(s\mid\theta\right)f\left(t\mid\theta\right)dG\left(\theta\right)
\]
is a kernel. This is the inner product, in the space of square integrable functions on the probability space with measure $G$, of the two likelihood functions.
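A Monte Carlo illustration of this last construction may be helpful (Python, with numpy and scipy assumed; the toy model and all names are mine, purely for illustration): take $f\left(x\mid\theta\right)$ to be the $N\left(\theta,1\right)$ density and $G=N\left(0,1\right)$, approximate the integral by averaging over draws from $G$, and numerically confirm that the resulting Gram matrix is non-negative definite.

import numpy as np
from scipy.stats import norm

def likelihood_kernel(s, t, thetas):
    """Monte Carlo approximation of K(s,t) = int f(s|theta) f(t|theta) dG(theta)
    for f(x|theta) the N(theta,1) density, with thetas an iid sample from G."""
    return np.mean(norm.pdf(s, loc=thetas) * norm.pdf(t, loc=thetas))

rng = np.random.default_rng(0)
thetas = rng.standard_normal(20000)            # draws from G = N(0,1)
x = np.linspace(-2, 2, 15)                     # a few "observations"
K = np.array([[likelihood_kernel(s, t, thetas) for t in x] for s in x])
print(np.linalg.eigvalsh(K).min() >= -1e-8)    # the Gram matrix is nonnegative definite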
In this space, the distance between the functions (of $\theta$) $f\left(s\mid\theta\right)$ and $f\left(t\mid\theta\right)$ is
\[
\sqrt{\int\left(f\left(s\mid\theta\right)-f\left(t\mid\theta\right)\right)^2dG\left(\theta\right)}
\]
and what is going on here is the implicit use of (infinite-dimensional) features that are likelihood functions for the "observations" $x$. Once one starts down this path, other possibilities come to mind. One is to replace likelihoods with loglikelihoods and consider the issue of "centering" and even "standardization." That is, one might define a feature (a function of $\theta$) corresponding to $x$ as
\[
\phi_x\left(\theta\right)=\ln f\left(x\mid\theta\right) \quad\text{or}\quad \phi_x'\left(\theta\right)=\ln f\left(x\mid\theta\right)-\int\ln f\left(x\mid\theta\right)dG\left(\theta\right)
\]
or even
\[
\phi_x''\left(\theta\right)=\frac{\ln f\left(x\mid\theta\right)-\int\ln f\left(x\mid\theta\right)dG\left(\theta\right)}{\sqrt{\int\left(\ln f\left(x\mid\theta\right)-\int\ln f\left(x\mid\theta\right)dG\left(\theta\right)\right)^2dG\left(\theta\right)}}
\]
Then obviously, the corresponding kernel functions are
\[
K\left(s,t\right)=\int\phi_s\left(\theta\right)\phi_t\left(\theta\right)dG\left(\theta\right) \quad\text{or}\quad K'\left(s,t\right)=\int\phi_s'\left(\theta\right)\phi_t'\left(\theta\right)dG\left(\theta\right) \quad\text{or}\quad K''\left(s,t\right)=\int\phi_s''\left(\theta\right)\phi_t''\left(\theta\right)dG\left(\theta\right)
\]
(Of these three possibilities, centering alone is probably the most natural from a statistical point of view. It is the "shape" of a loglikelihood that is important in a statistical context, not its absolute level. Two loglikelihoods that differ by a constant are equivalent for most statistical purposes, and centering perfectly lines up two loglikelihoods that differ by a constant.)

In a regular statistical model for $x$ taking values in $\mathcal{X}$ with parameter vector $\theta=\left(\theta_1,\theta_2,\ldots,\theta_k\right)$, the $k\times k$ Fisher information matrix, say $I\left(\theta\right)$, is non-negative definite. Then with score function
\[
\nabla\ln f\left(x\mid\theta\right)=\begin{pmatrix}\dfrac{\partial}{\partial\theta_1}\ln f\left(x\mid\theta\right)\\ \dfrac{\partial}{\partial\theta_2}\ln f\left(x\mid\theta\right)\\ \vdots\\ \dfrac{\partial}{\partial\theta_k}\ln f\left(x\mid\theta\right)\end{pmatrix}
\]
(for any fixed $\theta$) the function
\[
K\left(s,t\right)=\left(\nabla\ln f\left(s\mid\theta\right)\right)'\left(I\left(\theta\right)\right)^{-1}\nabla\ln f\left(t\mid\theta\right)
\]
has been called the "Fisher kernel" in the machine learning literature. (It follows from Bishop's facts 7 and 8 that this is indeed a kernel function.) Note that $K\left(x,x\right)$ is essentially the score test statistic for a point null hypothesis about $\theta$. The implicit feature vector here is the $k$-dimensional score function (evaluated at some fixed $\theta$, a basis for testing about $\theta$), and rather than the Euclidean norm, the norm $\left\|u\right\|\equiv\sqrt{u'\left(I\left(\theta\right)\right)^{-1}u}$ is implicitly in force for judging the size of differences in feature vectors.

20 Undirected Graphical Models and Machine Learning

This is a very brief look at an area of considerable current interest in machine learning related to what is known as "deep learning." HTF's Section 17.4 and Chapter 27 of Murphy's Machine Learning can be consulted for more details.

Suppose that some set of $q$ binary variables (variables each taking values in $\{0,1\}$) $x_1,x_2,\ldots,x_q$ is of interest. One way to build a tractable coherent parametric model for these variables is to first adopt a corresponding undirected graph with $q$ nodes and some set of "edges" $\mathcal{E}$ connecting some pairs of nodes as representing the variables. Figure 2 is a cartoon of one such graph for a $q=8$ case.

Figure 2: A particular undirected graph for $q=8$ nodes.

One then posits a very specific and convenient parametric form for probabilities corresponding to the undirected graph. For every pair $(j,k)$ such that $0<j<k\leq q$ and there is an edge in $\mathcal{E}$ connecting node $j$ and node $k$, let $\theta_{jk}$ be a corresponding real parameter. Further, for $k=1,\ldots,q$ let $\theta_{0k}$ be an additional real parameter. Then a probability mass function with
\[
f\left(x\right)=\frac{\exp\left(\displaystyle\sum_{\substack{j<k \text{ such that } x_j \text{ and}\\ x_k \text{ share an edge}}}\theta_{jk}x_jx_k+\sum_{k=1}^{q}\theta_{0k}x_k\right)}{\displaystyle\sum_{x\in\{0,1\}^q}\exp\left(\displaystyle\sum_{\substack{j<k \text{ such that } x_j \text{ and}\\ x_k \text{ share an edge}}}\theta_{jk}x_jx_k+\sum_{k=1}^{q}\theta_{0k}x_k\right)} \tag{133}
\]
for $x\in\{0,1\}^q$ specifies a particular kind of "Markov network," an "Ising model" or "Boltzmann Machine" for these variables.
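For very small $q$, the pmf (133) and its normalizing constant can be computed by brute-force enumeration of $\{0,1\}^q$, which at least makes the form of the model concrete. A Python sketch (with hypothetical parameter containers; practical only for roughly $q\leq 20$):

import numpy as np
from itertools import product

def ising_pmf(q, theta_edge, theta_node):
    """Brute-force evaluation of (133).  theta_edge: dict {(j,k): theta_jk} with
    j < k (zero-based) for the edges in E; theta_node: length-q array of theta_0k."""
    def energy(x):
        s = sum(th * x[j] * x[k] for (j, k), th in theta_edge.items())
        return s + sum(theta_node[k] * x[k] for k in range(q))
    states = list(product([0, 1], repeat=q))
    unnorm = np.array([np.exp(energy(x)) for x in states])
    psi = unnorm.sum()                                 # the normalizing constant
    return {x: u / psi for x, u in zip(states, unnorm)}

# usage: a 3-node chain 0 - 1 - 2
# pmf = ising_pmf(3, {(0, 1): 1.2, (1, 2): -0.7}, np.array([0.1, 0.0, -0.3]))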
Let the normalizing constant that is the denominator on the right of display (133) be called $\Psi\left(\theta\right)$, and note the obvious fact that for these models
\[
\ln\left(f\left(x\right)\right)=\sum_{\substack{j<k \text{ such that } x_j \text{ and}\\ x_k \text{ share an edge}}}\theta_{jk}x_jx_k+\sum_{k=1}^{q}\theta_{0k}x_k-\ln\left(\Psi\left(\theta\right)\right) \tag{134}
\]
Further, observe that such a model can have as many as
\[
\binom{q}{2}+q=\frac{q\left(q+1\right)}{2}
\]
different real parameters. If we let $x_{-j}$ stand for the $(q-1)$-dimensional vector consisting of all entries of $x$ except $x_j$, it follows directly from relationship (134) that
\[
\ln\left(\frac{P\left[x_j=1\mid x_{-j}\right]}{P\left[x_j=0\mid x_{-j}\right]}\right)=\sum_{\substack{l<j \text{ such that } x_l \text{ and}\\ x_j \text{ share an edge}}}\theta_{lj}x_l+\sum_{\substack{l>j \text{ such that } x_j \text{ and}\\ x_l \text{ share an edge}}}\theta_{jl}x_l+\theta_{0j}
\]
and then that
\[
P\left[x_j=1\mid x_{-j}\right]=\frac{\exp\left(\displaystyle\sum_{\substack{l<j \text{ such that } x_l \text{ and}\\ x_j \text{ share an edge}}}\theta_{lj}x_l+\sum_{\substack{l>j \text{ such that } x_j \text{ and}\\ x_l \text{ share an edge}}}\theta_{jl}x_l+\theta_{0j}\right)}{1+\exp\left(\displaystyle\sum_{\substack{l<j \text{ such that } x_l \text{ and}\\ x_j \text{ share an edge}}}\theta_{lj}x_l+\sum_{\substack{l>j \text{ such that } x_j \text{ and}\\ x_l \text{ share an edge}}}\theta_{jl}x_l+\theta_{0j}\right)}
\]
One important observation about this model form is then that for fixed $\theta$, Gibbs sampling to produce realizations from the distribution (133) is obvious (using these simple conditional probabilities). Further, conditioning on any set of entries of $x$, how to simulate (again via Gibbs sampling) from a conditional distribution for the rest of the entries (again for fixed $\theta$) is equally obvious (as the conditioning variables simply provide some entries of $x_{-j}$ above that are fixed at their conditioning values throughout the Gibbs iterations).

A version of this general framework that is particularly popular in machine learning applications is one where the nodes are arranged in two layers and there are edges only between nodes in different layers, not between nodes in the same layer. Such a model is often called a restricted Boltzmann machine (an RBM). In machine learning contexts, one layer of nodes is called the "hidden layer" and the other is called the "visible layer." Typically the nodes in the visible layer correspond to (digital versions of) variables that are, at least at some cost, empirically observable, while the variables corresponding to hidden nodes are completely latent/unobservable and somehow represent some stochastic physical or mental mechanism. In addition, it is convenient in some contexts to think of visible nodes as being of two types, say belonging to a set $\mathcal{V}_1$ or a set $\mathcal{V}_2$. For example, in a prediction context, the nodes in $\mathcal{V}_1$ might encode "$x$"/inputs and the nodes in $\mathcal{V}_2$ might encode $y$/outputs. We'll use the naming conventions indicated in Figure 3.

Figure 3: An undirected graph corresponding to a restricted Boltzmann machine with $q=l+m+n$ total nodes; $l$ nodes are hidden and the $m+n$ visible nodes break into the two sets $\mathcal{V}_1$ and $\mathcal{V}_2$.

(As we have already suggested for general Boltzmann machines) given the parameter vector $\theta$, simulation (via Gibbs sampling) from an RBM is easy. In particular, where the nodes in $\mathcal{V}_1$ encode an input vector $x$ and the nodes in $\mathcal{V}_2$ encode an output $y$, simulation from the conditional distribution of the whole network given a set of values for $v_{l+1},\ldots,v_{l+m}$ provides a predictive distribution for the response. So the fundamental difficulty one faces in using these models is the matter of fitting a vector of coefficients $\theta$. What will be available as training data for an RBM is some set of (potentially incomplete) vectors of values for visible nodes, say $v_i$ for $i=1,\ldots,N$ (that one will typically assume are independent from some appropriate marginal distributions for visible vectors derived from the overall joint distribution of values associated with all nodes, visible and hidden). Specializing form (133)
to the RBM case, one has
\[
f\left(h,v\right)=\frac{\exp\left(\displaystyle\sum_{\substack{j=1,\ldots,l\\ k=l+1,\ldots,l+m+n}}\theta_{jk}h_jv_k+\sum_{k=1}^{l}\theta_{0k}h_k+\sum_{k=l+1}^{l+m+n}\theta_{0k}v_k\right)}{\Psi\left(\theta\right)} \tag{135}
\]
for
\[
\Psi\left(\theta\right)=\sum_{\left(h,v\right)\in\{0,1\}^{l+m+n}}\exp\left(\sum_{\substack{j=1,\ldots,l\\ k=l+1,\ldots,l+m+n}}\theta_{jk}h_jv_k+\sum_{k=1}^{l}\theta_{0k}h_k+\sum_{k=l+1}^{l+m+n}\theta_{0k}v_k\right)
\]
Notice now that even in a hypothetical case where one has "data" consisting of complete $\left(h,v\right)$ pairs, the existence of the unpleasant normalizing constant $\Psi\left(\theta\right)$ (of a sum form in the denominator) in form (135) would make optimization of a likelihood $\prod_{i=1}^{N}f\left(h_i,v_i\right)$ or loglikelihood $\sum_{i=1}^{N}\ln f\left(h_i,v_i\right)$ not obvious. But the fact that one must sum out over (at least) all hidden nodes in order to get contributions to a likelihood or loglikelihood makes the problem even more computationally difficult. That is, if an $i$th training case provides a complete visible vector $v_i\in\{0,1\}^{m+n}$, the corresponding likelihood term is the marginal of that visible configuration
\[
f\left(v_i\right)=\sum_{h\in\{0,1\}^{l}}f\left(h,v_i\right)
\]
And the situation becomes even more unpleasant if an $i$th training case provides, for example, only values for variables corresponding to nodes in $\mathcal{V}_1$ (say $v_{1i}\in\{0,1\}^{m}$), since the corresponding likelihood term is the marginal of only that visible configuration
\[
f\left(v_{1i}\right)=\sum_{\substack{h\in\{0,1\}^{l}\\ v_2\in\{0,1\}^{n}}}f\left(h,\left(v_{1i},v_2\right)\right)
\]
Substantial effort in computer science circles has gone into the search for "learning" algorithms aimed at finding parameter vectors $\theta$ that produce large values of loglikelihoods based on $N$ training cases (each term based on some marginal of $f\left(h,v\right)$ corresponding to a set of visible nodes). These seem to be mostly based on approximate stochastic gradient descent ideas and approximations of appropriate expectations based on short Gibbs sampling runs. Hinton's notion of "contrastive divergence" appears to be central to the most well known of these.

The area of "deep learning" then seems to be based on the notion of generalizing RBMs to more than a single hidden layer, potentially employing visible layers on both the top and the bottom of an undirected graph. A cartoon of a deep learning network is given in Figure 4. The fundamental feature of the graph architecture here is that there are edges only between nodes in successive layers.

Figure 4: A hypothetical "deep" generalization of an RBM.

If the problem of fitting an Ising model/Boltzmann machine in the form of an RBM is a practically difficult one, fitting a deep network model (based on some training cases consisting of values for variables associated with some of the visible nodes) is clearly going to be doubly difficult. What seems to be currently popular is some kind of "greedy"/"forward" sequential fitting of one set of parameters connecting two successive layers at a time, followed by generation of simulated values for variables corresponding to nodes in the "newest" layer and treating those as "data" for fitting a next set of parameters connecting two layers (and so on).

If the fitting problem can be solved for a given architecture and form of training data, some very interesting possibilities for using a good "deep" model emerge. For one, in a classification/prediction context, one might treat the bottom visible layer as associated with an input vector $x$, and the top layer as also visible and associated with an output $y$ (or a vector of outputs). In theory, training cases could consist of complete $\left(x,y\right)$ pairs, but they could also involve some incomplete cases, where $y$ is missing, or part of $x$ is missing, and so on.
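A minimal Python/numpy sketch of block Gibbs sampling for an RBM with pmf (135), using the fact that given the visible values the hidden variables are conditionally independent Bernoullis (and vice versa); Theta, b, c, and the function names are hypothetical. Holding some entries of v fixed at conditioning values after each sweep gives the conditional simulations referred to in the surrounding discussion.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rbm_gibbs(Theta, b, c, v0, sweeps=1000, seed=0):
    """Block Gibbs for an RBM.  Theta: l x (m+n) interaction matrix (theta_jk),
    b: length-l hidden biases (theta_0k for hidden nodes), c: length-(m+n)
    visible biases.  Starts from visible configuration v0; returns the last draw."""
    rng = np.random.default_rng(seed)
    v = np.array(v0, dtype=float)
    for _ in range(sweeps):
        p_h = sigmoid(Theta @ v + b)          # P(h_j = 1 | v), independent across j
        h = (rng.random(p_h.shape) < p_h) * 1.0
        p_v = sigmoid(Theta.T @ h + c)        # P(v_k = 1 | h), independent across k
        v = (rng.random(p_v.shape) < p_v) * 1.0
    return h, v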
(Once the training issue is handled, simulation of the top layer variables from inputs of interest used as bottom layer values again enables classification/prediction.)

As an interesting second possibility for using deep restricted Boltzmann machines, one could consider architectures where the top and bottom layers are both visible and encode essentially (or even exactly) the same information, and between them are several hidden layers, one of which has very few nodes (like, say, two nodes). If one can fit such a thing to training cases consisting of sets of values corresponding to visible nodes, one can simulate (again via Gibbs sampling), for a fixed set of "visible values," corresponding values at the few nodes serving as the narrow layer of the network. The vector of estimated marginal probabilities of 1 at those nodes might then in turn serve as a kind of pair of generalized principal component values for the set of input visible values. (Notice that in theory these are possible to compute even for incomplete sets of visible values!)

(Regarding the "exactly the same information" case: I am not altogether sure what the practical implications are of essentially turning the kind of "tower" in Figure 4 into a "band" or flat bracelet that connects back on itself, the top hidden layer having edges directly to the bottom visible layer, but I see no obvious theoretical issues with this architecture.)