Stat 502X Summary/Main Points … What Do We Now Know (That We May Not Have Known Before)?

1. Up-front processing of available information into an N × p data array (N cases and p "features"/variables) is a big deal. If it fails to encode all available information, what can be learned is limited unnecessarily. If p is unnecessarily large, one will "look for a needle in a haystack" and suffer in the finding of structure.

2. Essentially ALL large-p data sets are sparse.

3. "Supervised Learning" concerns prediction and classification. There is a target/output variable y in view.

4. "Unsupervised Learning" concerns quantifying patterns in "features" alone (there is no "output" to be predicted based on all other variables).

5. The SVD (and related eigen decompositions) and PCs are cool, useful dimension-reduction tools.

6. "Kernels" can, for some methods (ones depending upon the data only through computation of inner products), allow efficient use of implicitly defined high-dimensional feature sets (and conveniently matched inner products).

7. Decision theory matters.

8. Prediction can be poor because of 1) inherent noise, 2) inadequate model/method flexibility, and/or 3) fitting methodology that is inherently poor or not adequately supported by N cases. Big p tends to mitigate 2) but exacerbate 3).

9. With anything but very small p, cross-validation is essential/central in predictive analytics for choosing appropriate model/method flexibility.

10. "Nearest neighbor" methods of prediction are approximately (as N goes to infinity) optimal (i.e., Bayes) methods of prediction (for both measurement-scale and qualitative targets y, i.e., for SEL prediction and classification).

11. Some form of "linearity in predictors" method is typically a well-studied/well-known and simple/low-complexity/low-flexibility prediction tool for "small" N problems.

12. For SEL prediction:
   a. An optimal (unrealizable) SEL predictor is the conditional mean of y given the vector of inputs. All predictors based on training data are "good" only insofar as they approximate the conditional mean function.
   b. Linear predictors based on large p typically need to be subjected to some form of "shrinkage toward 0" (thinking of all inputs and y centered, and inputs possibly standardized) to avoid over-fit (CV guiding choice of a shrinkage parameter). Variable selection, ridge, lasso, non-negative garrote, PCR, and PLS are ultimately all shrinkage methods. (A small code sketch of ridge and lasso with CV-chosen penalties appears after this list.)
   c. The flexibility of predictors available using a fixed set of input variables is greatly increased by creating new "features"/"transforms" of them using sets of basis functions. These can come from mathematics that promises that any continuous function of a fixed number of arguments can be uniformly approximated on a compact set by linear combinations of elements of such a class of functions. The "problem" with using such sets of functions of inputs is that the effective size of p explodes, and one must again apply some kind of shrinkage/variable selection to prevent over-fit.
   d. Splines are predictors that solve a penalized function-fitting problem. Some interesting theory (not presented in the course) implies that the optimizing function is a particular linear combination of data-dependent basis functions (that amount to data-chosen slices of an appropriate kernel function). A penalty weight governs the smoothness of the optimizing function and serves as a complexity parameter.
   e. Kernel smoothers are locally weighted averages or locally weighted regressions and ultimately behave much like smoothing splines. A bandwidth parameter governs smoothness and serves as a complexity parameter.
   f. Additive models and other high-dimensional uses of (low-dimensional) smoothing methods extend their usefulness beyond the very low-dimensional cases where they have a chance of working directly.
   g. Neural (and radial basis function) networks are highly flexible parametric forms that can be used to produce predictors. They involve many parameters and nonlinear least squares fitting problems. Finding sensible fitting methods is an issue. A penalization (of the size of a parameter vector) method is one way of handling the possibility of over-fitting, with the penalty weight serving as a complexity parameter.
   h. Binary regression trees provide one completely non-parametric prediction method. Cost-complexity pruning is a way (using cross-validation) to control over-fitting.
   i. Boosting for SEL amounts to predicting and correcting successive sets of residuals. Especially when applied with a "learning rate" less than 1.0, it is an effective way of successively correcting predictors to produce a good one.
   j. Bagging is the averaging of predictors fit to bootstrap samples from a training set (and can, e.g., effectively smooth out regression trees). Bagging has the interesting feature of providing its own test set for each bootstrap sample (the "out of bag" sample) and thus doesn't require CV.
   k. A random forest is a version of bagged binary regression trees, where a different random selection of input variables is made for each split.
   l. PRIM is a rectangle-based prediction method that is based more or less on "bump-hunting."
   m. Linear combinations of predictors can improve on any individual predictor (as a means of approximating the ideal predictor, the conditional mean function). Boosting is one kind of linear combination. Other "ensembles" range from the very ad hoc to ones derived from Bayes model averaging.
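The following is a minimal sketch, assuming Python with scikit-learn (not part of the course notes), of the shrinkage-plus-CV idea in 12.b and point 9: ridge and lasso fits with the penalty weight chosen by cross-validation. The synthetic data and all names are illustrative only.

```python
# Sketch of 12.b: centered/standardized linear predictors with ridge and
# lasso shrinkage; cross-validation chooses the penalty (shrinkage) weight.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an N x p training array (illustrative only).
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

# Ridge: shrink all coefficients toward 0; CV over a grid of penalty weights.
ridge = make_pipeline(StandardScaler(),
                      RidgeCV(alphas=np.logspace(-3, 3, 25), cv=10))
ridge.fit(X, y)

# Lasso: shrinkage that can set coefficients exactly to 0 (variable selection).
lasso = make_pipeline(StandardScaler(), LassoCV(cv=10, random_state=0))
lasso.fit(X, y)

print("ridge penalty chosen by CV:", ridge.named_steps["ridgecv"].alpha_)
print("lasso penalty chosen by CV:", lasso.named_steps["lassocv"].alpha_)
print("nonzero lasso coefficients:",
      np.sum(lasso.named_steps["lassocv"].coef_ != 0))
```

The zeroed lasso coefficients display the variable-selection flavor of that form of shrinkage, while ridge shrinks all coefficients without setting any exactly to 0.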
13. For 0-1 loss classification problems:
   a. An optimal (unrealizable) classifier is the value of y with maximum conditional probability given the inputs. All classifiers based on training data are good only to the extent that they approximate this classifier.
   b. Basic statistical theory connects classification to Neyman-Pearson testing and promises that in a K-class problem, a vector of K − 1 likelihood ratios is minimal sufficient. This latter observation can be applied to turn many qualitative inputs for classification into a typically small number (K − 1) of quantitative inputs that are equivalent in terms of information content.
   c. Perhaps the simplest kind of classifier one might envision developing is one that partitions an input feature space ℝ^p using (p − 1)-dimensional hyperplanes. These might be called linear classifiers. Linear classifiers follow from probability models
      i. with MVN class-conditional distributions with a common (across classes) covariance matrix but different class means, or
      ii. with logistic regression structure.
      (Dropping the common covariance matrix assumption in i. produces quadratic surfaces partitioning an input feature space.)
   d. Use of polynomial (or other) basis functions with basic input variables extends the usefulness of linear classifiers.
   e. Support vector technology is based on optimization of a "margin" around a (p − 1)-dimensional hyperplane decision boundary. As the basic version of the technology is based on inner products of feature vectors, a natural extension of the methodology uses data-defined slices of a kernel function as basis functions and kernel values as inner products. (A code sketch of linear and kernel classifiers compared by CV appears after this list.)
   f. The AdaBoost.M1 algorithm (for the 2-class classification problem) is based on application of a general gradient boosting method to an exponential loss, with successive perturbations of a predictor based on trees with a single split. As the form of an optimizer of expected exponential loss is the same as that of the optimal classifier, the algorithm tends to produce good classifiers.
   g. Neural networks with logistic output functions can produce classifiers with highly flexible decision boundaries.
   h. Binary decision/classification trees (exactly parallel to regression trees in that the most prevalent class in the training set for a given rectangle provides the classification value for the rectangle) provide one completely non-parametric classification method. Cost-complexity pruning is a way (using cross-validation) to control over-fitting.
   i. A random forest classifier is a version of bagged binary classification trees, where a different random selection of input variables is made for each split.
   j. Prototype classifiers divide an input space according to proximity to a few input vectors chosen to represent K class-conditional distributions of inputs.
   k. Some form of input space dimension-reduction (local or global) is typically required in order to make nearest neighbor classifiers practically effective.
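Below is a minimal sketch, again assuming Python with scikit-learn, contrasting the linear classifiers of 13.c (LDA and logistic regression) with the kernel support vector classifier of 13.e, with cross-validation used both to tune the SVM and to compare the three. The synthetic data and parameter grids are illustrative assumptions, not course material.

```python
# Sketch of 13.c and 13.e: linear classifiers versus an RBF-kernel support
# vector classifier, all judged by cross-validated accuracy.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 2-class training data (illustrative stand-in for real features).
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)

lda = LinearDiscriminantAnalysis()                                # 13.c.i
logit = make_pipeline(StandardScaler(),
                      LogisticRegression(max_iter=1000))          # 13.c.ii

# 13.e: kernel SVM; CV chooses the margin penalty C and kernel bandwidth gamma.
svm = GridSearchCV(make_pipeline(StandardScaler(), SVC(kernel="rbf")),
                   param_grid={"svc__C": [0.1, 1, 10],
                               "svc__gamma": [0.01, 0.1, 1.0]},
                   cv=5)

for name, clf in [("LDA", lda), ("logistic", logit), ("RBF SVM", svm)]:
    acc = cross_val_score(clf, X, y, cv=10).mean()
    print(f"{name}: CV accuracy ~ {acc:.3f}")
```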
14. Clustering aims to divide a set of items into homogeneous groups or clusters. Methods of clustering begin with matrices of dissimilarities between items (that could be cases or features in an N × p data matrix). Standard clustering methods include:
   a. partitioning methods (K-means or K-medoids methods),
   b. hierarchical methods (agglomerative or divisive), and
   c. model-based methods.
   (A code sketch of K-means and agglomerative clustering appears at the end of these notes.)

15. When clustering is applied to graph-defined "spectral" features of an N × p data matrix, one gets a methodology that can find groups of points in ℝ^p that make up contiguous but not necessarily cloud-shaped structures.

16. Association rules and corresponding measures of dependence are used in searches for interesting features in large databases of transactions.

17. Specialized kinds of features are developed for particular application areas. Standard text-processing features involve "word" counts and scores for the occurrences of interesting n-grams in bigger strings.
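Finally, a minimal sketch, assuming Python with scikit-learn and SciPy, of the partitioning and agglomerative hierarchical methods of point 14. The simulated data, the average-linkage choice, and K = 3 are illustrative only.

```python
# Sketch of 14.a and 14.b: K-means partitioning and agglomerative hierarchical
# clustering (from pairwise dissimilarities) of the rows of an N x p matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Simulated N x p data with 3 "true" groups (illustrative only).
X, _ = make_blobs(n_samples=150, n_features=4, centers=3, random_state=0)

# 14.a: K-means partitioning with K = 3.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("K-means cluster sizes:", np.bincount(km.labels_))

# 14.b: agglomerative clustering starting from pairwise dissimilarities.
d = pdist(X, metric="euclidean")        # the N(N-1)/2 dissimilarities
tree = linkage(d, method="average")     # average-linkage hierarchy
labels = fcluster(tree, t=3, criterion="maxclust")  # cut tree into 3 clusters
print("agglomerative cluster sizes:", np.bincount(labels)[1:])
```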