
Stat 502X Summary/Main Points … What Do We Now Know (That We May Not Have
Known Before)?
1. Up-front processing of available information into an N × p data array (N cases and p
"features"/variables) is a big deal. If it fails to encode all available information, what can
be learned is limited unnecessarily. If p is unnecessarily large, one will "look for a
needle in a haystack" and suffer in the finding of structure.
2. Essentially ALL large p data sets are sparse.
3. "Supervised Learning" concerns prediction and classification. There is a target/output
variable y in view.
4. "Unsupervised Learning" concerns quantifying patterns in "features" alone (there is no
"output" to be predicted based on all other variables).
5. The SVD (and related eigen decompositions) and principal components (PCs) are cool and useful dimension-reduction tools.
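For concreteness, here is a minimal Python/numpy sketch (not from the course notes; the toy data matrix is invented) of how the SVD of a centered N × p data matrix yields principal component directions, scores, and a reduced-dimension representation:

    import numpy as np

    # A toy N x p data matrix (N = 100 cases, p = 5 features), invented for illustration
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))

    # Center the columns, then take the SVD: X_c = U diag(d) V^T
    Xc = X - X.mean(axis=0)
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)

    # Rows of Vt are the principal component directions; the principal component
    # scores are the projections of the cases onto those directions.
    scores = Xc @ Vt.T            # equivalently U * d
    var_explained = d**2 / np.sum(d**2)

    # Dimension reduction: keep only the first k components
    k = 2
    X_reduced = scores[:, :k]
    print(var_explained.round(3))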
6. "Kernels" can for some methods (ones depending upon data only through computation of
inner products) allow efficient use of implicitly defined high-dimensional feature sets
(and conveniently matched inner products).
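As a small illustrative sketch of the idea (the Gaussian/RBF kernel, the penalty value lam, and the toy data are arbitrary choices, not anything prescribed by the course), a ridge-type predictor can be fit using only the N × N matrix of kernel values, the kernel supplying the inner products of implicitly defined features:

    import numpy as np

    def rbf_kernel(A, B, gamma=1.0):
        # K[i, j] = exp(-gamma * ||A_i - B_j||^2): inner products of implicit features
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-gamma * sq)

    # Toy 1-D training data, invented for illustration
    rng = np.random.default_rng(1)
    x = np.sort(rng.uniform(-3, 3, size=60))[:, None]
    y = np.sin(2 * x[:, 0]) + 0.3 * rng.normal(size=60)

    # Kernel ridge regression: alpha = (K + lam I)^{-1} y, f(x0) = sum_i alpha_i k(x0, x_i)
    K = rbf_kernel(x, x, gamma=2.0)        # gamma and lam are illustrative values
    lam = 0.1
    alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)

    x_new = np.linspace(-3, 3, 5)[:, None]
    y_hat = rbf_kernel(x_new, x, gamma=2.0) @ alpha
    print(y_hat.round(2))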
7. Decision theory matters.
8. Prediction can be poor because of 1) inherent noise, 2) inadequate model/method
flexibility, and/or 3) fitting methodology that is inherently poor or not adequately
supported by N cases. Big p tends to mitigate 2) but exacerbate 3).
9. With anything but very small p , cross-validation is essential/central in predictive
analytics for choosing appropriate model/method flexibility.
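As one possible illustration (the toy data and the hand-rolled fold splitting are invented for the example), K-fold cross-validation can be used to choose a flexibility parameter, here the degree of a polynomial fit:

    import numpy as np

    # Toy data from a quadratic trend plus noise, invented for illustration
    rng = np.random.default_rng(2)
    x = rng.uniform(-1, 1, size=80)
    y = 1 + 2 * x - 3 * x**2 + 0.4 * rng.normal(size=80)

    def cv_error(degree, n_folds=10):
        # K-fold CV estimate of prediction error for a degree-'degree' polynomial fit
        idx = rng.permutation(len(x))
        folds = np.array_split(idx, n_folds)
        errs = []
        for test in folds:
            train = np.setdiff1d(idx, test)
            coefs = np.polyfit(x[train], y[train], degree)
            pred = np.polyval(coefs, x[test])
            errs.append(np.mean((y[test] - pred) ** 2))
        return np.mean(errs)

    # Pick the flexibility (degree) minimizing the CV error estimate
    cv = {d: cv_error(d) for d in range(1, 9)}
    best = min(cv, key=cv.get)
    print(best, round(cv[best], 3))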
10. "Nearest neighbor" methods of prediction are approximately (as N goes to infinity)
optimal (i.e. Bayes) methods for prediction (for both measurement-scale and qualitative
targets y, i.e. for squared error loss (SEL) prediction and classification).
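A bare-bones sketch of k-nearest-neighbor SEL prediction (average the responses of the k nearest training cases); the toy data and the choice k = 10 are for illustration only:

    import numpy as np

    def knn_predict(X_train, y_train, X_new, k=5):
        # For each new case, average the responses of its k nearest training cases
        # (a majority vote over the neighbors would give the classification analogue)
        preds = []
        for x0 in X_new:
            dists = np.sqrt(((X_train - x0) ** 2).sum(axis=1))
            nearest = np.argsort(dists)[:k]
            preds.append(y_train[nearest].mean())
        return np.array(preds)

    # Invented toy training data with p = 2 inputs
    rng = np.random.default_rng(3)
    X = rng.uniform(-2, 2, size=(200, 2))
    y = X[:, 0] ** 2 + X[:, 1] + 0.2 * rng.normal(size=200)
    print(knn_predict(X, y, np.array([[0.0, 0.0], [1.0, 1.0]]), k=10).round(2))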
11. Some form of "linearity in predictors" method is typically a well-studied/well-known and
simple/low-complexity/low-flexibility prediction tool for "small" N problems.
12. For SEL prediction:
a. An optimal (unrealizable) SEL predictor is the conditional mean of y given the
vector of inputs. All predictors based on training data are "good" only insofar as
they approximate the conditional mean function.
b. Linear predictors based on large p typically need to be subjected to some form of
"shrinkage toward 0 " (thinking of all inputs and y centered, and inputs possibly
standardized) to avoid over-fit (CV guiding choice of a shrinkage parameter).
Variable selection, ridge, lasso, non-negative garrote, PCR, and PLS are
ultimately all shrinkage methods.
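A minimal numpy illustration of shrinkage toward 0, using ridge regression on invented data with centered/standardized inputs; in practice the shrinkage parameter lam would be chosen by cross-validation rather than fixed as below:

    import numpy as np

    # Invented data: modest N, fairly large p, only 3 truly active coefficients
    rng = np.random.default_rng(4)
    N, p = 50, 30
    X = rng.normal(size=(N, p))
    beta_true = np.concatenate([np.array([3.0, -2.0, 1.5]), np.zeros(p - 3)])
    y = X @ beta_true + rng.normal(size=N)

    # Center y and standardize the columns of X (the convention mentioned above)
    Xs = (X - X.mean(0)) / X.std(0)
    yc = y - y.mean()

    def ridge_coef(lam):
        # Ridge shrinks coefficients toward 0; lam is the shrinkage (complexity) parameter
        return np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)

    for lam in (0.01, 1.0, 100.0):       # illustrative values only
        b = ridge_coef(lam)
        print(lam, round(np.sum(b ** 2), 2))   # coefficient "size" shrinks as lam grows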
c. The flexibility of predictors available using a fixed set of input variables is greatly
increased by creating new "features"/"transforms" of them using sets of basis
functions. These can come from mathematics that promises that any continuous
function of a fixed number of arguments can be uniformly approximated on a
compact set by linear combinations of elements of such a class of functions. The
"problem" with using such sets of functions of inputs is that the effective size of
p explodes, and one must again apply some kind of shrinkage/variable selection
to prevent over-fit.
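A small sketch of this basis-expansion idea: a single input x is expanded into polynomial plus truncated-power spline features, so the effective p explodes, and some shrinkage (here a small, arbitrarily chosen ridge penalty on the expanded fit) is needed; all numbers below are illustrative choices:

    import numpy as np

    # Invented toy data with a single input x
    rng = np.random.default_rng(5)
    x = np.sort(rng.uniform(0, 1, size=40))
    y = np.sin(6 * x) + 0.2 * rng.normal(size=40)

    # Expand x into many "features": polynomial terms plus truncated-power
    # spline terms (x - knot)_+^3 at a grid of knots; p grows from 1 to 13 here
    knots = np.linspace(0.1, 0.9, 9)
    basis = [np.ones_like(x), x, x**2, x**3] + [np.maximum(x - t, 0.0) ** 3 for t in knots]
    B = np.column_stack(basis)

    # With this many columns a plain least-squares fit can over-fit; a small ridge
    # penalty (which CV would choose in practice) provides the needed shrinkage
    lam = 1e-3
    coef = np.linalg.solve(B.T @ B + lam * np.eye(B.shape[1]), B.T @ y)
    print(B.shape, coef.round(2)[:4])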
d. Splines are predictors that solve a penalized function fitting problem. Some
interesting theory (not presented in the course) implies that the optimizing
function is a particular linear combination of data-dependent basis functions (that
amount to data-chosen slices of an appropriate kernel function). A penalty weight
governs the smoothness of the optimizing function and serves as a complexity
parameter.
e. Kernel smoothers are locally weighted averages or locally weighted regressions
and ultimately behave much like smoothing splines. A bandwidth parameter
governs smoothness and serves as a complexity parameter.
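An illustrative Nadaraya-Watson kernel smoother (a locally weighted average) written in numpy; the Gaussian weight function, the bandwidth value, and the toy data are choices made only for the example:

    import numpy as np

    def kernel_smooth(x_train, y_train, x_eval, bandwidth=0.3):
        # Locally weighted average with Gaussian weights; the bandwidth governs
        # smoothness and plays the role of a complexity parameter
        preds = []
        for x0 in x_eval:
            w = np.exp(-0.5 * ((x_train - x0) / bandwidth) ** 2)
            preds.append(np.sum(w * y_train) / np.sum(w))
        return np.array(preds)

    # Invented toy data
    rng = np.random.default_rng(6)
    x = np.sort(rng.uniform(0, 1, size=100))
    y = np.cos(4 * np.pi * x) + 0.3 * rng.normal(size=100)
    print(kernel_smooth(x, y, np.array([0.25, 0.5, 0.75]), bandwidth=0.05).round(2))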
f. Additive models and other high-dimensional uses of (low-dimensional) smoothing
methods extend their usefulness beyond the very low-dimensional cases where
they have a chance of working directly.
g. Neural (and radial basis function) networks are highly flexible parametric forms
that can be used to produce predictors. They involve many parameters and nonlinear least squares fitting problems. Finding sensible fitting methods is an issue.
A penalization (of the size of a parameter vector) method is one way of handling
the possibility of over-fitting, with the penalty weight serving as a complexity
parameter.
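As a sketch of the penalization idea (using scikit-learn's MLPRegressor rather than anything specific to the course), a single-hidden-layer network is fit with an L2 penalty on the parameter vector; the architecture, the alpha values tried, and the toy data are all arbitrary illustrative choices:

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    # Invented toy regression data with p = 2 inputs
    rng = np.random.default_rng(7)
    X = rng.uniform(-2, 2, size=(300, 2))
    y = np.sin(X[:, 0]) * X[:, 1] + 0.2 * rng.normal(size=300)

    # alpha weights the penalty on the squared size of the parameter vector;
    # larger alpha means more shrinkage / less flexibility (CV would guide the choice)
    for alpha in (1e-4, 1e-1, 10.0):
        net = MLPRegressor(hidden_layer_sizes=(10,), alpha=alpha,
                           max_iter=5000, random_state=0)
        net.fit(X, y)
        print(alpha, round(net.score(X, y), 3))   # in-sample R^2 for each penalty weight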
h. Binary regression trees provide one completely non-parametric prediction
method. Cost-complexity pruning is a way (using cross-validation) to control
over-fitting.
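An illustrative use of scikit-learn's cost-complexity pruning machinery for a regression tree, with cross-validation guiding the choice of the pruning penalty (the toy data are invented, and the coarse grid of candidate alphas is only to keep the example short):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import cross_val_score

    # Invented toy data with a split-like structure plus noise
    rng = np.random.default_rng(8)
    X = rng.uniform(-2, 2, size=(300, 3))
    y = np.where(X[:, 0] > 0, 2.0, -1.0) + X[:, 1] ** 2 + 0.3 * rng.normal(size=300)

    # cost_complexity_pruning_path gives the candidate penalty values (ccp_alphas);
    # cross-validation then picks how hard to prune the large tree
    big_tree = DecisionTreeRegressor(random_state=0).fit(X, y)
    alphas = big_tree.cost_complexity_pruning_path(X, y).ccp_alphas

    grid = alphas[::10]          # thinned grid, purely to limit computation here
    cv_scores = [cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                                 X, y, cv=5).mean() for a in grid]
    best_alpha = grid[int(np.argmax(cv_scores))]
    print(round(best_alpha, 4))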
i. Boosting for SEL amounts to predicting and correcting successive sets of
residuals. Especially when applied with a "learning rate" less than 1.0, it is an
effective way of successively correcting predictors to produce a good one.
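A hand-coded sketch of SEL boosting with single-split trees ("stumps") and a learning rate nu < 1; the number of rounds, the value of nu, and the toy data are illustrative choices only:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    # Invented toy data
    rng = np.random.default_rng(9)
    X = rng.uniform(-2, 2, size=(300, 2))
    y = X[:, 0] ** 2 + np.sin(3 * X[:, 1]) + 0.2 * rng.normal(size=300)

    # SEL boosting: repeatedly fit a small tree to the current residuals and add a
    # down-weighted (learning rate nu < 1) copy of its fit to the running predictor
    nu, n_rounds = 0.1, 200
    pred = np.full(len(y), y.mean())
    for _ in range(n_rounds):
        resid = y - pred
        stump = DecisionTreeRegressor(max_depth=1).fit(X, resid)
        pred += nu * stump.predict(X)

    print(round(np.mean((y - pred) ** 2), 3))   # training MSE after boosting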
j. Bagging is the averaging of predictors fit to bootstrap samples from a training set
(and can, e.g., effectively smooth out regression trees). Bagging has the
interesting feature of providing its own test set for each bootstrap sample (the "out
of bag" sample) and thus doesn't require CV.
k. A random forest is a version of bagged binary regression trees, where a different
random selection of input variables is made for each split.
l. PRIM is a rectangle-based prediction method that is based more or less on "bump-hunting."
m. Linear combinations of predictors can improve on any individual predictor (as a
means of approximating the ideal predictor, the conditional mean function).
Boosting is one kind of linear combination. Other "ensembles" range from the
very ad hoc to ones derived from Bayes model averaging.
13. For 0-1 loss classification problems:
a. An optimal (unrealizable) classifier is the value of y with maximum conditional
probability given the inputs. All classifiers based on training data are good only
to the extent that they approximate this classifier.
b. Basic statistical theory connects classification to Neyman-Pearson testing and
promises that in a K-class problem, a vector of K − 1 likelihood ratios is minimal
sufficient. This latter observation can be applied to turn many qualitative inputs
for classification into a typically small number (K − 1) of (equivalent in terms of
information content) quantitative inputs.
c. Perhaps the simplest kind of classifier one might envision developing is one that
partitions an input features space ℜ^p using (p − 1)-dimensional hyperplanes.
These might be called linear classifiers. Linear classifiers follow from
probability models
i. with MVN class-conditional distributions with common (across classes)
covariance matrix but different class means, or
ii. with logistic regression structure.
(Dropping the common covariance matrix assumption in i. produces quadratic
surfaces partitioning an input features space.)
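A short scikit-learn illustration of the two routes to linear classifiers noted in i. and ii.: LDA under a common-covariance MVN model and logistic regression both produce hyperplane decision boundaries (the simulated two-class data below are invented for the example):

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.linear_model import LogisticRegression

    # Two classes with MVN inputs sharing a covariance matrix (the model in i.),
    # with means and covariance chosen arbitrarily for illustration
    rng = np.random.default_rng(11)
    X0 = rng.multivariate_normal([0, 0], [[1, 0.3], [0.3, 1]], size=150)
    X1 = rng.multivariate_normal([2, 1], [[1, 0.3], [0.3, 1]], size=150)
    X = np.vstack([X0, X1])
    y = np.array([0] * 150 + [1] * 150)

    # Both fits define linear ((p - 1)-dimensional hyperplane) decision boundaries
    lda = LinearDiscriminantAnalysis().fit(X, y)
    logit = LogisticRegression().fit(X, y)
    print(lda.coef_.round(2), logit.coef_.round(2))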
d. Use of polynomial (or other) basis functions with basic input variables extends the
usefulness of linear classifiers.
e. Support vector technology is based on optimization of a "margin" around a
(p − 1)-dimensional hyperplane decision boundary. As the basic version of the
technology is based on inner products of feature vectors, a natural extension of the
methodology uses data-defined slices of a kernel function as basis functions and
kernel values as inner products.
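A brief scikit-learn sketch contrasting a linear-kernel support vector classifier with an RBF-kernel one on data whose class boundary is not linear in the original features (the data set and tuning values are chosen only for illustration):

    from sklearn.svm import SVC
    from sklearn.datasets import make_circles

    # A class boundary that no hyperplane in the original 2-D feature space can capture
    X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)

    # Linear-kernel SVM vs. an RBF-kernel SVM (kernel values playing the role of
    # inner products of implicitly defined features)
    for kernel in ("linear", "rbf"):
        clf = SVC(kernel=kernel, C=1.0).fit(X, y)
        print(kernel, round(clf.score(X, y), 3))   # training accuracy of each classifier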
f. The AdaBoost.M1 algorithm (for the 2-class classification problem) is based on
application of a general gradient boosting method to an exponential loss, with
successive perturbations of a predictor based on trees with a single split. As the
form of an optimizer of expected exponential loss is the same as that of the optimal
classifier, the algorithm tends to produce good classifiers.
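An illustrative scikit-learn call (not the course's own implementation) of AdaBoost with its default single-split tree ("stump") base learner; the simulated data and the number of boosting rounds are arbitrary:

    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.datasets import make_classification

    # Invented 2-class data with 10 input features
    X, y = make_classification(n_samples=400, n_features=10, random_state=0)

    # AdaBoost built on successive reweighted fits of depth-1 trees
    clf = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X, y)
    print(round(clf.score(X, y), 3))   # training accuracy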
g. Neural networks with logistic output functions can produce classifiers with highly
flexible decision boundaries.
h. Binary decision/classification trees (exactly parallel to regression trees in that the
most prevalent class in the training set for a given rectangle provides the
classification value for the rectangle) provide one completely non-parametric
classification method. Cost-complexity pruning is a way (using cross-validation)
to control over-fitting.
i. A random forest classifier is a version of bagged binary classification trees, where
a different random selection of input variables is made for each split.
j. Prototype classifiers divide an input space according to proximity to a few input
vectors chosen to represent K class-conditional distributions of inputs.
k. Some form of input space dimension-reduction (local or global) is typically
required in order to make nearest neighbor classifiers practically effective.
14. Clustering aims to divide a set of items into homogeneous groups or clusters. Methods of
clustering begin with matrices of dissimilarities between items (that could be cases or
features in an N × p data matrix). Standard clustering methods (see the sketch after this list) include:
a. partitioning methods (K-means or K-medoids methods),
b. hierarchical methods (agglomerative or divisive), and
c. model-based methods.
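A small sketch of (a) and (b) above using scikit-learn and scipy: K-means partitioning of the cases, and average-linkage agglomerative clustering started from a matrix of pairwise dissimilarities (the two-group toy data are invented, and K = 2 is assumed for illustration):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist
    from sklearn.cluster import KMeans

    # Toy items: two well-separated groups of cases in a small N x p data matrix
    rng = np.random.default_rng(12)
    X = np.vstack([rng.normal(0, 0.5, size=(25, 3)), rng.normal(3, 0.5, size=(25, 3))])

    # (a) Partitioning: K-means with K = 2
    km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # (b) Agglomerative hierarchical clustering from pairwise dissimilarities
    d = pdist(X)                                 # condensed dissimilarity matrix
    hc_labels = fcluster(linkage(d, method="average"), t=2, criterion="maxclust")
    print(np.bincount(km_labels), np.bincount(hc_labels)[1:])   # cluster sizes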
15. When clustering is applied to graph-derived "spectral" features of an N × p data
matrix (eigenvectors of a graph Laplacian), one gets a methodology that can find groups of points in ℜ^p that make up
contiguous but not necessarily cloud-shaped structures.
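An illustrative scikit-learn call showing the kind of structure spectral clustering can recover, applied to the classic "two moons" toy data (the data set and all settings here are example choices, not course prescriptions):

    from sklearn.cluster import SpectralClustering
    from sklearn.datasets import make_moons

    # Two interlocking half-moons: contiguous but not cloud-shaped groups
    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

    # Clustering applied to spectral (graph-Laplacian eigenvector) features of the data
    labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                                n_neighbors=10, random_state=0).fit_predict(X)
    print(labels[:10])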
16. Association rules and corresponding measures of dependence are used in searches for
interesting features in large databases of transactions.
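A tiny plain-Python illustration of the basic association-rule quantities (support of an itemset and confidence of a rule) computed on an invented list of transactions:

    # Toy "market basket" transactions, invented for illustration
    transactions = [
        {"bread", "milk"}, {"bread", "diapers", "beer"}, {"milk", "diapers", "beer"},
        {"bread", "milk", "diapers"}, {"bread", "milk", "diapers", "beer"},
    ]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions containing every item in the itemset
        return sum(itemset <= t for t in transactions) / n

    # Confidence of the rule A -> B is support(A union B) / support(A)
    A, B = {"diapers"}, {"beer"}
    print("support:", support(A | B), "confidence:", support(A | B) / support(A))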
17. Specialized kinds of features are developed for particular application areas. Standard text
processing features involve "word" counts and scores for the occurrences of interesting n-grams in bigger strings.
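For example (an illustrative scikit-learn call, not a prescription from the course), word-count and bigram-count features can be built from raw strings as follows; the three tiny "documents" are invented:

    from sklearn.feature_extraction.text import CountVectorizer

    # Three tiny documents; the features are counts of word 1-grams and 2-grams
    docs = ["the cat sat on the mat",
            "the dog sat on the log",
            "cats and dogs"]

    vec = CountVectorizer(ngram_range=(1, 2))    # unigrams and bigrams of words
    X = vec.fit_transform(docs)                  # a 3 x p document-term count matrix
    print(X.shape)
    print(vec.get_feature_names_out()[:6])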