Stat 602 Summary/Main Points … What Do We Now Know (That We May Not Have Known Before)?

1. Up-front processing of available information into an N × p data array (N cases and p "features"/variables) is a big deal. If it fails to encode all available information, what can be learned is limited unnecessarily. If p is unnecessarily large, one will "look for a needle in a haystack" and suffer in the finding of structure.

2. Essentially ALL large-p data sets are sparse.

3. "Supervised Learning" concerns prediction and classification. There is a target/output variable y in view.

4. "Unsupervised Learning" concerns quantifying patterns in "features" alone (there is no "output" to be predicted based on all other variables).

5. The SVD (and related eigen decompositions) and PCs are cool, useful dimension-reduction tools.

6. "Kernels" can, for some methods (ones depending upon data only through computation of inner products), allow efficient use of implicitly defined high-dimensional feature sets (and conveniently matched inner products).

7. Decision theory matters.

8. Prediction can be poor because of 1) inherent noise, 2) inadequate model/method flexibility, and/or 3) fitting methodology that is inherently poor or not adequately supported by N cases. Big p tends to mitigate 2) but exacerbate 3).

9. With anything but very small p, cross-validation is essential/central in predictive analytics for choosing appropriate model/method flexibility.

10. "Nearest neighbor" methods of prediction are approximately (as N goes to infinity) optimal (i.e. Bayes) methods for prediction (for both measurement-scale and qualitative targets y, i.e. for SEL prediction and classification).

11. Some form of "linearity in predictors" method is typically a well-studied/well-known and simple/low-complexity/low-flexibility prediction tool for "small" N problems.

12. For SEL prediction:
   a. An (unrealizable) optimal SEL predictor is the conditional mean of y given the vector of inputs. All predictors based on training data are "good" only insofar as they approximate the conditional mean function.
   b. Linear predictors based on large p typically need to be subjected to some form of "shrinkage toward 0" (thinking of all inputs and y centered, and inputs possibly standardized) to avoid over-fit (CV guiding choice of a shrinkage parameter). Variable selection, ridge, lasso, non-negative garrote, PCR, and PLS are ultimately all shrinkage methods. (A minimal numerical sketch of CV-guided ridge shrinkage appears after item d below.)
   c. The flexibility of predictors available using a fixed set of input variables is greatly increased by creating new "features"/"transforms" of them using sets of basis functions. These can come from mathematics that promises that any continuous function of a fixed number of arguments can be uniformly approximated on a compact set by linear combinations of elements of such a class of functions. The "problem" with using such sets of functions of inputs is that the effective size of p explodes, and one must again apply some kind of shrinkage/variable selection to prevent over-fit.
   d. Splines are predictors that solve a penalized function fitting problem. Some interesting theory implies that the optimizing function is a particular linear combination of data-dependent basis functions (that amount to data-chosen slices of an appropriate kernel function). A penalty weight governs the smoothness of the optimizing function and serves as a complexity parameter.
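As a companion to items 9 and 12b, here is a minimal sketch of ridge shrinkage with the penalty weight chosen by K-fold cross-validation. It assumes numpy and simulated data; the function names (ridge_coef, cv_mse) and the penalty grid are invented for illustration and are not part of the course material.

```python
# Hypothetical sketch: ridge shrinkage with a CV-chosen penalty (items 9 and 12b).
import numpy as np

def ridge_coef(X, y, lam):
    """beta_hat = (X'X + lam I)^{-1} X'y for centered/standardized data."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_mse(X, y, lam, K=5, seed=0):
    """Average held-out squared error over K folds for a given penalty lam."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % K
    errs = []
    for k in range(K):
        tr, te = folds != k, folds == k
        beta = ridge_coef(X[tr], y[tr], lam)
        errs.append(np.mean((y[te] - X[te] @ beta) ** 2))
    return np.mean(errs)

# Simulated example: N = 60 cases, p = 30 features, only a few truly active.
rng = np.random.default_rng(1)
X = rng.standard_normal((60, 30))
beta_true = np.r_[np.ones(3), np.zeros(27)]
y = X @ beta_true + rng.standard_normal(60)
X = (X - X.mean(0)) / X.std(0)          # standardize inputs
y = y - y.mean()                        # center the target
lams = np.logspace(-2, 3, 20)
best = min(lams, key=lambda lam: cv_mse(X, y, lam))
print("CV-chosen penalty:", best)
```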
   e. Kernel smoothers are locally weighted averages or locally weighted regressions and ultimately behave much like smoothing splines. A bandwidth parameter governs smoothness and serves as a complexity parameter.
   f. Additive models and other high-dimensional uses of (low-dimensional) smoothing methods extend their usefulness beyond the very low-dimensional cases where they have a chance of working directly.
   g. Neural (and radial basis function) networks are highly flexible parametric forms that can be used to produce predictors. They involve many parameters and nonlinear least squares fitting problems. Finding sensible fitting methods is an issue. A penalization (of the size of a parameter vector) method is one way of handling the possibility of over-fitting, with the penalty weight serving as a complexity parameter. Lack of identifiability is inescapable (but perhaps in some sense doesn't matter so much, in that it is a predictor that is sought, not a parameter vector).
   h. Binary regression trees provide one completely non-parametric prediction method. Cost-complexity pruning is a way (using cross-validation) to control over-fitting.
   i. Boosting for SEL amounts to predicting and correcting successive sets of residuals. Especially when applied with a "learning rate" less than 1.0, it is an effective way of successively correcting predictors to produce a good one.
   j. Bagging is the averaging of predictors fit to bootstrap samples from a training set (and can, e.g., effectively smooth out regression trees). Bagging has the interesting feature of providing its own test set for each bootstrap sample (the "out of bag" sample) and thus doesn't require CV to compare predictors.
   k. A random forest is a version of bagged binary regression trees, where a different random selection of input variables is made for each split.
   l. PRIM is a rectangle-based prediction method that is based more or less on "bump-hunting."
   m. Linear combinations of predictors can improve on any individual predictor (as a means of approximating the ideal predictor, the conditional mean function). Boosting is one kind of linear combination. Other "ensembles" range from the very ad hoc to ones derived from Bayes model averaging.

13. For 0-1 loss classification problems:
   a. An (unrealizable) optimal classifier is the value of y with maximum conditional probability given the inputs. All classifiers based on training data are good only to the extent that they approximate this classifier.
   b. Basic statistical theory connects classification to Neyman-Pearson testing and promises that in a K-class problem, a vector of K − 1 likelihood ratios is minimal sufficient. This latter observation can be applied to turn many qualitative inputs for classification into a typically small number (K − 1) of (equivalent in terms of information content) quantitative inputs with no loss of classification potential.
   c. Perhaps the simplest kind of classifier one might envision developing is one that partitions an input feature space ℝ^p using (p − 1)-dimensional hyperplanes. These might be called linear classifiers. Linear classifiers follow from probability models
      i. with MVN class-conditional distributions with common (across classes) covariance matrix but different class means, or
      ii. with logistic regression structure.
      (Dropping the common covariance matrix assumption in i. produces quadratic surfaces partitioning an input feature space.) (A small numerical sketch of classifier i. appears after item d below.)
   d. Use of polynomial (or other) basis functions with basic input variables extends the usefulness of linear classifiers.
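As a companion to item 13c.i, here is a minimal sketch of the linear (LDA-style) classifier that arises from Gaussian class-conditional distributions sharing a covariance matrix. It assumes numpy and simulated two-class data; the function names are invented for illustration.

```python
# Hypothetical sketch: linear discriminant classifier from common-covariance Gaussians (item 13c.i).
import numpy as np

def lda_fit(X, y):
    """Estimate class means, priors, and the pooled (common) covariance."""
    classes = np.unique(y)
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    priors = np.array([np.mean(y == k) for k in classes])
    pooled = sum((X[y == k] - means[i]).T @ (X[y == k] - means[i])
                 for i, k in enumerate(classes)) / (len(y) - len(classes))
    return classes, means, priors, np.linalg.inv(pooled)

def lda_predict(x, classes, means, priors, Sinv):
    """delta_k(x) = x'S^{-1}mu_k - 0.5 mu_k'S^{-1}mu_k + log(pi_k); classify to the max."""
    scores = (means @ Sinv @ x
              - 0.5 * np.einsum("kj,jl,kl->k", means, Sinv, means)
              + np.log(priors))
    return classes[np.argmax(scores)]

# Two-class example: the decision boundary {x : delta_0(x) = delta_1(x)} is a hyperplane.
rng = np.random.default_rng(0)
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), 50),
               rng.multivariate_normal([2, 1], np.eye(2), 50)])
y = np.r_[np.zeros(50), np.ones(50)]
fit = lda_fit(X, y)
print(lda_predict(np.array([1.5, 0.5]), *fit))
```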
   e. Support vector technology is based on optimization of a "margin" around a (p − 1)-dimensional hyperplane decision boundary. As the basic version of the technology is based on inner products of feature vectors, a natural extension of the methodology uses data-defined slices of a kernel function as basis functions and kernel values as inner products.
   f. The AdaBoost.M1 algorithm (for the 2-class classification problem) is based on application of a general gradient boosting method to an exponential loss, with successive perturbations of a predictor based on trees with a single split. As the form of an optimizer of expected exponential loss is the same as that of the optimal classifier, the algorithm tends to produce good classifiers.
   g. Neural networks with logistic output functions can produce classifiers with highly flexible decision boundaries.
   h. Binary decision/classification trees (exactly parallel to regression trees in that the most prevalent class in the training set for a given rectangle provides the classification value for the rectangle) provide one completely non-parametric classification method. Cost-complexity pruning is a way (using cross-validation) to control over-fitting.
   i. A random forest classifier is a version of a bagged binary classification tree, where a different random selection of input variables is made for each split.
   j. Prototype classifiers divide an input space according to proximity to a few input vectors chosen to represent K class-conditional distributions of inputs.
   k. Some form of input space dimension-reduction (local or global) is typically required in order to make nearest neighbor classifiers practically effective.

14. Reproducing Kernel Hilbert Spaces are spaces of functions built essentially from linear combinations of slices of a (positive definite) kernel function, with the inner product between two slices defined as the kernel evaluated at the pair of locations of those slices. This provides a useful norm on linear combinations of slices and allows the reduction of many penalized prediction/fitting problems to linear algebra problems.
   a. Beginning from a function space, a differential operator on the space, a related inner product, and a linear functional on the space, the "Heckman route" to the use of RKHS theory fashions a kernel consistent with the inner product. This methodology generalizes "standard" theory of cubic splines.
   b. Beginning from a kernel function (and appealing to Mercer's theorem) one can define a related RKHS. The "Representer Theorem" is a powerful result that promises a convenient form for the solution to a very general function optimization problem, and provides reduction to linear algebra for the solution. (A minimal numerical sketch of this reduction appears after item 16 below.)
   c. Bayes prediction using Gaussian process priors for tractable correlation functions turns out to be equivalent to use of RKHSs based on the correlation function.

15. Clustering aims to divide a set of items into homogeneous groups or clusters. Methods of clustering begin with matrices of dissimilarities between items (that could be cases or features in an N × p data matrix). Standard clustering methods include:
   a. partitioning methods (K-means or medoids methods),
   b. hierarchical methods (agglomerative or divisive), and
   c. model-based methods.

16. Self-organizing maps are a kind of simultaneous clustering and dimension-reduction method that associate data in ℝ^p with points on a (smallish relative to N) regular grid in ℝ².
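As a companion to item 14b, here is a minimal sketch of the Representer-Theorem reduction to linear algebra, using squared-error loss with a ridge-type penalty and a Gaussian kernel as a stand-in for the general penalized problem. It assumes numpy; the function names and the particular kernel/penalty values are invented for illustration.

```python
# Hypothetical sketch: the penalized fit is a linear combination of kernel slices k(., x_i),
# and its coefficients come from a single linear solve (item 14b).
import numpy as np

def gauss_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_ridge_fit(X, y, lam=0.1, gamma=1.0):
    """Solve (K + lam*N*I) alpha = y; the fitted function is f(x) = sum_i alpha_i k(x, x_i)."""
    K = gauss_kernel(X, X, gamma)
    N = len(y)
    return np.linalg.solve(K + lam * N * np.eye(N), y)

def kernel_ridge_predict(Xnew, X, alpha, gamma=1.0):
    return gauss_kernel(Xnew, X, gamma) @ alpha

# Smooth a noisy sine curve; the penalty weight lam plays the same complexity-parameter
# role as the smoothing-spline penalty in item 12d.
rng = np.random.default_rng(2)
X = np.sort(rng.uniform(0, 6, 80))[:, None]
y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(80)
alpha = kernel_ridge_fit(X, y, lam=0.01, gamma=2.0)
yhat = kernel_ridge_predict(X, X, alpha, gamma=2.0)
print("training RMSE:", np.sqrt(np.mean((y - yhat) ** 2)))
```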
Multi-dimensional scaling is a kind of mapping technique that intends to represent data in ℝ^p by corresponding points in ℝ^k for k < p in a way that more or less preserves distances between points and their mapped images.

17. Sparse principal components, non-negative matrix factorization, archetypal analysis, and independent component analysis can all be thought of as variants on ordinary principal components and attempts to produce simple/interpretable (low-dimensional) descriptions of structure in a set of data vectors in ℝ^p.

18. When clustering is applied to (graphically-defined) spectral features of an N × p data matrix, one gets a methodology that can find groups of points in ℝ^p that make up contiguous but not necessarily "cloud-shaped" structures.

19. Association rules and corresponding measures of dependence are used in searches for interesting features in large databases of retail transactions.

20. Specialized kinds of features (and kernels) are developed for particular application areas. Standard text processing features involve word counts and scores for the occurrences of interesting n-grams in bigger strings and have a related so-called "string kernel."

21. Ising Models/Boltzmann Machines are parametric models for random vectors of 0's and 1's (or −1's and 1's) based on an undirected graphical structure and linear models for log probabilities involving main effects for nodes and interaction effects between nodes sharing edges. While very popular and reportedly quite effective, these models 1) present serious fitting difficulties and 2) are very often essentially degenerate (and thus fitted versions often fail to be able to generate data like those used to produce them). (A minimal numerical sketch of this parametric form appears at the end of this summary.)

22. Relevance Vector Machines are associated with Bayes methods in prediction problems based on linear combinations of slices of kernel functions (basis functions that are data-chosen). Appropriate priors for coefficients in GLMs can promote sparsity in posteriors for coefficient vectors (posteriors that say that with large probability many coefficients are nearly 0). Then those data vectors whose kernel-slice coefficients have small posterior probability of being nearly 0 are termed "relevance vectors."
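As a companion to item 21, here is a minimal sketch of the Ising/Boltzmann parametric form, with probabilities computed by brute-force enumeration (feasible only for tiny p, which is exactly the source of the fitting difficulties the notes mention). It assumes numpy; the parameter values and function name are invented for illustration.

```python
# Hypothetical sketch: log-probabilities linear in node main effects and edge interactions,
# normalized by summing over all 2^p configurations (item 21).
import itertools
import numpy as np

def ising_probs(theta_node, theta_edge):
    """Exact probabilities over all {-1, +1}^p configurations.

    theta_node : (p,) array of main effects
    theta_edge : (p, p) symmetric array; theta_edge[i, j] != 0 only for graph edges
    """
    p = len(theta_node)
    configs = np.array(list(itertools.product([-1, 1], repeat=p)))
    # Unnormalized log-probability: sum_i theta_i x_i + sum_{i<j} theta_ij x_i x_j
    log_unnorm = configs @ theta_node + 0.5 * np.einsum("ci,ij,cj->c", configs, theta_edge, configs)
    log_unnorm -= log_unnorm.max()              # stabilize before exponentiating
    probs = np.exp(log_unnorm)
    return configs, probs / probs.sum()         # the denominator is the partition function

# Example: a 3-node chain graph 1-2-3; for realistic p the 2^p sum is intractable.
theta_node = np.array([0.2, -0.1, 0.0])
theta_edge = np.zeros((3, 3))
theta_edge[0, 1] = theta_edge[1, 0] = 0.8
theta_edge[1, 2] = theta_edge[2, 1] = -0.5
configs, probs = ising_probs(theta_node, theta_edge)
print(configs[np.argmax(probs)], probs.max())
```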