Stat 602 Summary/Main Points … What Do We Now Know (That We May Not Have
Known Before)?
1. Up-front processing of available information into an N × p data array (N cases and p
"features"/variables) is a big deal. If it fails to encode all available information, what can
be learned is limited unnecessarily. If p is unnecessarily large, one will "look for a
needle in a haystack" and suffer in the finding of structure.
2. Essentially ALL large p data sets are sparse.
3. "Supervised Learning" concerns prediction and classification. There is a target/output
variable y in view.
4. "Unsupervised Learning" concerns quantifying patterns in "features" alone (there is no
"output" to be predicted based on all other variables).
5. The SVD (and related eigen decompositions) and PC's are cool, useful dimension-reduction tools.
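As a minimal illustration of point 5 (numpy is an assumption here, not something the notes prescribe), the SVD of a centered N × p data matrix gives principal component directions, scores, and low-rank reconstructions directly:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))      # N = 100 cases, p = 10 features (synthetic)
Xc = X - X.mean(axis=0)             # center the columns

# Thin SVD: Xc = U diag(d) Vt; rows of Vt are principal component directions
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = Xc @ Vt.T                                  # principal component scores
k = 2
X_rank_k = U[:, :k] @ np.diag(d[:k]) @ Vt[:k, :]    # best rank-k approximation
explained = d**2 / np.sum(d**2)                     # variance share per component
```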
6. "Kernels" can for some methods (ones depending upon data only through computation of
inner products) allow efficient use of implicitly defined high-dimensional feature sets
(and conveniently matched inner products).
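A small sketch of point 6 (the degree-2 polynomial kernel and its explicit feature map are standard examples; the specific numbers are arbitrary): the kernel value equals the inner product of implicitly defined quadratic features, so methods written in terms of inner products never need to build those features.

```python
import numpy as np

def poly2_kernel(x, z):
    # degree-2 polynomial kernel: K(x, z) = (1 + <x, z>)^2
    return (1.0 + x @ z) ** 2

def phi(x):
    # explicit feature map matching this kernel when x is in R^2
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.5])
print(poly2_kernel(x, z))   # kernel value ...
print(phi(x) @ phi(z))      # ... equals the inner product of explicit features
```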
7. Decision theory matters.
8. Prediction can be poor because of 1) inherent noise, 2) inadequate model/method
flexibility, and/or 3) fitting methodology that is inherently poor or not adequately
supported by N cases. Big p tends to mitigate 2) but exacerbate 3).
9. With anything but very small p , cross-validation is essential/central in predictive
analytics for choosing appropriate model/method flexibility.
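A hedged sketch of point 9 using scikit-learn (a library choice of mine, not the notes'): K-fold cross-validation compares held-out squared-error across a grid of complexity values, here a ridge penalty on synthetic data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(1)
N, p = 80, 20
X = rng.normal(size=(N, p))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=N)   # synthetic target

# 5-fold CV error for each candidate penalty; pick the minimizer
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_mse = [-cross_val_score(Ridge(alpha=a), X, y, cv=cv,
                           scoring="neg_mean_squared_error").mean()
          for a in alphas]
best_alpha = alphas[int(np.argmin(cv_mse))]
```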
10. "Nearest neighbor" methods of prediction are approximately (as N goes to infinity)
optimal (i.e. Bayes) methods for prediction (for both measurement-scale and qualitative
targets y , i.e. for SEL prediction and classification).
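For point 10, a minimal nearest-neighbor sketch (scikit-learn and the synthetic data are my assumptions): prediction averages the k nearest training responses, classification takes a majority vote.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(200, 2))

# SEL prediction: average the y's of the k nearest training inputs
y_num = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
knn_reg = KNeighborsRegressor(n_neighbors=10).fit(X, y_num)

# Classification: majority vote among the k nearest neighbors
y_cls = (X[:, 0] + X[:, 1] > 0).astype(int)
knn_cls = KNeighborsClassifier(n_neighbors=10).fit(X, y_cls)

print(knn_reg.predict([[0.5, 0.5]]), knn_cls.predict([[0.5, 0.5]]))
```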
11. Some form of "linearity in predictors" method is typically a well-studied/well-known and
simple/low-complexity/low-flexibility prediction tool for "small" N problems.
12. For SEL prediction:
a. An (unrealizable) optimal SEL predictor is the conditional mean of y given the
vector of inputs. All predictors based on training data are "good" only insofar as
they approximate the conditional mean function.
b. Linear predictors based on large p typically need to be subjected to some form of "shrinkage toward 0" (thinking of all inputs and y centered, and inputs possibly standardized) to avoid over-fit (CV guiding choice of a shrinkage parameter). Variable selection, ridge, lasso, non-negative garrote, PCR, and PLS are ultimately all shrinkage methods. (A minimal ridge/lasso sketch appears after this list.)
c. The flexibility of predictors available using a fixed set of input variables is greatly increased by creating new "features"/"transforms" of them using sets of basis functions. These can come from mathematics that promises that any continuous function of a fixed number of arguments can be uniformly approximated on a compact set by linear combinations of elements of such a class of functions. The "problem" with using such sets of functions of inputs is that the effective size of p explodes, and one must again apply some kind of shrinkage/variable selection to prevent over-fit.
d. Splines are predictors that solve a penalized function fitting problem. Some interesting theory implies that the optimizing function is a particular linear combination of data-dependent basis functions (that amount to data-chosen slices of an appropriate kernel function). A penalty weight governs the smoothness of the optimizing function and serves as a complexity parameter.
e. Kernel smoothers are locally weighted averages or locally weighted regressions and ultimately behave much like smoothing splines. A bandwidth parameter governs smoothness and serves as a complexity parameter.
f. Additive models and other high-dimensional uses of (low-dimensional) smoothing methods extend their usefulness beyond the very low-dimensional cases where they have a chance of working directly.
g. Neural (and radial basis function) networks are highly flexible parametric forms that can be used to produce predictors. They involve many parameters and nonlinear least squares fitting problems. Finding sensible fitting methods is an issue. A penalization (of the size of a parameter vector) method is one way of handling the possibility of over-fitting, with the penalty weight serving as a complexity parameter. Lack of identifiability is inescapable (but perhaps in some sense doesn't matter so much in that it is a predictor that is sought, not a parameter vector).
h. Binary regression trees provide one completely non-parametric prediction method. Cost-complexity pruning is a way (using cross-validation) to control over-fitting.
i. Boosting for SEL amounts to predicting and correcting successive sets of residuals. Especially when applied with a "learning rate" less than 1.0, it is an effective way of successively correcting predictors to produce a good one. (A small residual-fitting sketch appears after this list.)
j. Bagging is the averaging of predictors fit to bootstrap samples from a training set (and can, e.g., effectively smooth out regression trees). Bagging has the interesting feature of providing its own test set for each bootstrap sample (the "out of bag" sample) and thus doesn't require CV to compare predictors.
k. A random forest is a version of bagged binary regression trees, where a different random selection of input variables is made for each split.
l. PRIM is a rectangle-based prediction method that is based more or less on "bump hunting."
m. Linear combinations of predictors can improve on any individual predictor (as a
means of approximating the ideal predictor, the conditional mean function).
Boosting is one kind of linear combination. Other "ensembles" range from the
very ad hoc to ones derived from Bayes model averaging.
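The ridge/lasso sketch promised in 12.b (scikit-learn, the synthetic data, and the penalty grids are my assumptions): inputs are standardized, y is centered, and CV chooses how hard to shrink toward 0.

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
N, p = 60, 40                    # p large relative to N
X = rng.normal(size=(N, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]      # only a few real signals
y = X @ beta + rng.normal(size=N)

Xs = StandardScaler().fit_transform(X)   # standardize inputs
yc = y - y.mean()                        # center y

ridge = RidgeCV(alphas=np.logspace(-2, 3, 30)).fit(Xs, yc)   # CV-chosen penalty
lasso = LassoCV(cv=5).fit(Xs, yc)                            # CV-chosen penalty
print(ridge.alpha_, lasso.alpha_, np.sum(lasso.coef_ != 0))
```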
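The residual-fitting sketch promised in 12.i (a hand-rolled version of SEL boosting; stumps come from scikit-learn, and the data, learning rate, and number of rounds are illustrative choices): each round fits the current residuals and makes a small correction to the running predictor.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, size=(300, 1))
y = np.sin(4 * X[:, 0]) + 0.2 * rng.normal(size=300)

nu = 0.1                              # "learning rate" < 1.0
M = 200                               # boosting rounds
pred = np.full_like(y, y.mean())      # start from the constant predictor
trees = []
for _ in range(M):
    resid = y - pred                              # current residuals
    stump = DecisionTreeRegressor(max_depth=1).fit(X, resid)
    pred += nu * stump.predict(X)                 # correct the predictor a little
    trees.append(stump)

def boosted_predict(Xnew, base=y.mean()):
    out = np.full(len(Xnew), base)
    for t in trees:
        out += nu * t.predict(Xnew)
    return out
```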
13. For 0-1 loss classification problems:
a. An (unrealizable) optimal classifier is the value of y with maximum conditional
probability given the inputs. All classifiers based on training data are good only
to the extent that they approximate this classifier.
b. Basic statistical theory connects classification to Neyman-Pearson testing and
promises that in a K-class problem, a vector of K − 1 likelihood ratios is minimal sufficient. This latter observation can be applied to turn many qualitative inputs for classification into a typically small number (K − 1) of (equivalent in terms of
information content) quantitative inputs with no loss of classification potential.
c. Perhaps the simplest kind of classifier one might envision developing is one that
partitions an input features space ℝ^p using (p − 1)-dimensional hyperplanes.
These might be called linear classifiers. Linear classifiers follow from
probability models
i. with MVN class-conditional distributions with common (across classes)
covariance matrix but different class means, or
ii. with logistic regression structure.
(Dropping the common covariance matrix assumption in i. produces quadratic
surfaces partitioning an input features space.)
d. Use of polynomial (or other) basis functions with basic input variables extends the
usefulness of linear classifiers.
e. Support vector technology is based on optimization of a "margin" around a (p − 1)-dimensional hyperplane decision boundary. As the basic version of the technology is based on inner products of feature vectors, a natural extension of the methodology uses data-defined slices of a kernel function as basis functions and kernel values as inner products. (A minimal linear-vs-kernel SVM sketch appears after this list.)
f. The AdaBoost.M1 algorithm (for the 2-class classification problem) is based on application of a general gradient boosting method to an exponential loss, with successive perturbations of a predictor based on trees with a single split. As the form of an optimizer of expected exponential loss is the same as that of the optimal classifier, the algorithm tends to produce good classifiers.
g. Neural networks with logistic output functions can produce classifiers with highly
flexible decision boundaries.
h. Binary decision/classification trees (exactly parallel to regression trees in that the most prevalent class in the training set for a given rectangle provides the classification value for the rectangle) provide one completely non-parametric classification method. Cost-complexity pruning is a way (using cross-validation) to control over-fitting. (A small pruning sketch appears after this list.)
i. A random forest classifier is a version of a bagged binary classification tree,
where a different random selection of input variables is made for each split.
j. Prototype classifiers divide an input space according to proximity to a few input
vectors chosen to represent K class-conditional distributions of inputs.
k. Some form of input space dimension-reduction (local or global) is typically
required in order to make nearest neighbor classifiers practically effective.
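The SVM sketch promised in 13.e (scikit-learn and the synthetic circular boundary are my assumptions): the same margin machinery runs with ordinary inner products (a linear boundary) or with RBF kernel values standing in for inner products (a flexible boundary).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.2).astype(int)   # non-linear class boundary

# Linear SVM: maximum-margin hyperplane in the original input space
linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)

# Kernel SVM: inner products replaced by RBF kernel values, so the decision
# boundary is built from data-defined slices of the kernel
rbf_svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
print(linear_svm.score(X, y), rbf_svm.score(X, y))
```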
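The pruning sketch promised in 13.h (again scikit-learn and synthetic data; the CV folds are arbitrary): grow a large classification tree, compute its cost-complexity pruning path, and let cross-validation pick the pruning level.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 5))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)   # a tree-friendly pattern

# Candidate pruning penalties for a fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Cross-validated accuracy at each penalty; keep the best
cv_scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                             X, y, cv=5).mean()
             for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(cv_scores))]
pruned_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
```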
14. Reproducing Kernel Hilbert Spaces are spaces of functions built essentially from linear
combinations of slices of a (positive definite) kernel function with inner product between
two slices defined as the kernel evaluated at the pair of locations of those slices. This
provides a useful norm on linear combinations of slices and allows the reduction of many
penalized prediction/fitting problems to linear algebra problems.
a. Beginning from a function space, a differential operator on the space, a related
inner product, and a linear functional on the space, the "Heckman route" to the
use of RKHS theory fashions a kernel consistent with the inner product. This
methodology generalizes "standard" theory of cubic splines.
b. Beginning from a kernel function (and appealing to Mercer's theorem) one can
define a related RKHS. The "Representer Theorem" is a powerful result that
promises a convenient form for the solution to a very general function
optimization problem, and provides reduction to linear algebra for the solution (see the kernel ridge sketch following this item).
c. Bayes prediction using Gaussian process priors for tractable correlation functions
turns out to be equivalent to use of RKHSs based on the correlation function.
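The kernel ridge sketch referenced in 14.b (numpy only; the Gaussian kernel, λ, and data are illustrative choices): by the Representer Theorem the penalized fit is a linear combination of kernel slices at the training inputs, and its coefficients solve one linear system.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian kernel matrix between the rows of A and the rows of B
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(7)
X = rng.uniform(0, 3, size=(50, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=50)

lam = 0.1
K = rbf_kernel(X, X)
# f(.) = sum_i alpha_i K(., x_i); coefficients from (K + lam I) alpha = y
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

Xnew = np.linspace(0, 3, 5).reshape(-1, 1)
f_new = rbf_kernel(Xnew, X) @ alpha      # predictions at new inputs
```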
15. Clustering aims to divide a set of items into homogeneous groups or clusters. Methods of clustering begin with matrices of dissimilarities between items (that could be cases or features in an N × p data matrix). Standard clustering methods include:
a. partitioning methods (K-means or K-medoids methods),
b. hierarchical methods (agglomerative or divisive), and
c. model-based methods.
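A minimal sketch of the three families in point 15 (scikit-learn and the three synthetic "clouds" are my assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 3, 6)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)   # partitioning
agg = AgglomerativeClustering(n_clusters=3).fit(X)            # hierarchical
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)  # model-based
print(km.labels_[:5], agg.labels_[:5], gmm.predict(X)[:5])
```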
16. Self-organizing maps are a kind of simultaneous clustering and dimension-reduction method that associate data in ℝ^p with points on a (smallish relative to N) regular grid in ℝ^2. Multi-dimensional scaling is a kind of mapping technique that intends to represent data in ℝ^p by corresponding points in ℝ^k for k < p in a way that more or less preserves distances between points and their mapped images.
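A small multi-dimensional scaling sketch (scikit-learn's MDS on synthetic data; both are assumptions of mine): map points in ℝ^10 to ℝ^2 while approximately preserving pairwise distances.

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(9)
X = rng.normal(size=(60, 10))        # data in R^10

mds = MDS(n_components=2, random_state=0)
X2 = mds.fit_transform(X)            # 60 corresponding points in R^2
print(X2.shape, mds.stress_)         # stress_ measures the distance distortion
```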
17. Sparse principal components, non-negative matrix factorization, archetypal analysis, and
independent component analysis can all be thought of as variants on ordinary principal
components and attempts to produce simple/interpretable (low-dimensional) descriptions of structure in a set of data vectors in ℝ^p.
18. When clustering is applied to (graphically-defined) spectral features of an N × p data matrix, one gets a methodology that can find groups of points in ℝ^p that make up contiguous but not necessarily "cloud-shaped" structures.
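A hedged sketch of point 18 (scikit-learn's SpectralClustering on two synthetic concentric rings, which are contiguous but not "cloud-shaped"; the kernel and its gamma are arbitrary choices):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(10)
theta = rng.uniform(0, 2 * np.pi, size=200)
r = np.where(np.arange(200) < 100, 1.0, 3.0)          # rings of radius 1 and 3
X = np.c_[r * np.cos(theta), r * np.sin(theta)] + 0.1 * rng.normal(size=(200, 2))

labels = SpectralClustering(n_clusters=2, affinity="rbf",
                            gamma=2.0, random_state=0).fit_predict(X)
```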
19. Association rules and corresponding measures of dependence are used in searches for
interesting features in large databases of retail transactions.
20. Specialized kinds of features (and kernels) are developed for particular application areas.
Standard text processing features involve word counts and scores for the occurrences of
interesting n-grams in bigger strings and have a related so-called "string kernel."
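A minimal sketch of point 20's word-count and n-gram features (scikit-learn's CountVectorizer and the toy documents are my choices; character n-grams are shown as the kind of feature behind "string kernel" ideas):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log"]

# Word counts plus word bigrams as features
word_counts = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs)

# Character 3-grams, the raw material for string-kernel style comparisons
char_ngrams = CountVectorizer(analyzer="char", ngram_range=(3, 3)).fit_transform(docs)
print(word_counts.shape, char_ngrams.shape)
```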
21. Ising Models/Boltzmann Machines are parametric models for random vectors of 0's and 1's (or −1's and 1's) based on an undirected graphical structure and linear models for log probabilities involving main effects for nodes and interaction effects between nodes sharing edges. While very popular and reportedly quite effective, these models 1) present serious fitting difficulties and 2) are very often essentially degenerate (and thus fitted versions often fail to be able to generate data like those used to produce them).
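To make point 21's parametric form concrete (the notation here is mine): with binary x_1, ..., x_p and an undirected graph with edge set E, such a model takes

P(x) \propto \exp\Big( \sum_j \theta_j x_j + \sum_{(j,k) \in E} \theta_{jk} x_j x_k \Big),

so log probabilities are linear in node main effects and edge interaction effects (up to a normalizing constant, whose intractability is one source of the fitting difficulties noted above).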
22. Relevance Vector Machines are associated with Bayes methods in prediction problems
based on linear combinations of slices of kernel functions (basis functions that are data-chosen). Appropriate priors for coefficients in GLMs can promote sparsity in posteriors
for coefficient vectors (posteriors that say that with large probability many coefficients
are nearly 0). Then those data vectors with kernel slices with small posterior probability
of being nearly 0 are termed "relevance vectors."