Stat 602X Outline (2013)
Steve Vardeman
Iowa State University
May 27, 2013
Abstract

This outline is the 2013 revision of a Modern Multivariate Statistical Learning course outline developed in 2009 and 2011-12 summarizing ISU Stat 648 (2009), Stat 602X (2011) and Stat 690 (2012). That earlier outline was based mostly on the topics and organization of The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman, though very substantial parts of its technical development benefited from Izenman's Modern Multivariate Statistical Techniques and from Principles and Theory for Data Mining and Machine Learning by Clarke, Fokoué, and Zhang. The present revision of the earlier outline benefits from a set of thoughtful comments on that document kindly provided by Ken Ryan and Mark Culp. It is intended to reflect the organization of the 2013 course as it is presented. This document will be in flux as the course unfolds, potentially deviating from the earlier organization and reflecting corrections and intended improvements in exposition. As in the earlier outline, the notation here is meant to accord to the extent possible with that commonly used in ISU Stat 511, 542, and 543, and the background assumed here is the content of those courses and Stat 579.
Contents

1 Introduction
   1.1 Some Very Simple Probability Theory and Decision Theory and Prediction Rules
   1.2 Two Conceptually Simple Ways One Might Approximate a Conditional Mean
       1.2.1 Rules Based on a Linear Assumption About the Form of the Conditional Mean
       1.2.2 Nearest Neighbor Rules
   1.3 The Curse of Dimensionality
   1.4 Variance-Bias Considerations in Prediction
   1.5 Complexity, Fit, and Choice of Predictor
   1.6 Plotting to Portray the Effects of Particular Inputs
2 Some Linear Algebra and Simple Principal Components
   2.1 The Gram-Schmidt Process and the QR Decomposition of a rank = p Matrix X
   2.2 The Singular Value Decomposition of X
   2.3 Matrices of Centered Columns and Principal Components
       2.3.1 "Ordinary" Principal Components
       2.3.2 "Kernel" Principal Components
3 (Non-OLS) Linear Predictors
   3.1 Ridge Regression, the Lasso, and Some Other Shrinking Methods
       3.1.1 Ridge Regression
       3.1.2 The Lasso, Etc.
       3.1.3 Least Angle Regression (LAR)
   3.2 Two Methods With Derived Input Variables
       3.2.1 Principal Components Regression
       3.2.2 Partial Least Squares Regression
4 Linear Prediction Using Basis Functions
   4.1 p = 1 Wavelet Bases
   4.2 p = 1 Piecewise Polynomials and Regression Splines
   4.3 Basis Functions and p-Dimensional Inputs
       4.3.1 Multi-Dimensional Regression Splines (Tensor Bases)
       4.3.2 MARS (Multivariate Adaptive Regression Splines)
5 Smoothing Splines
   5.1 p = 1 Smoothing Splines
   5.2 Multi-Dimensional Smoothing Splines
   5.3 An Abstraction of the Smoothing Spline Material and Penalized Fitting in $\Re^N$
6 Kernel Smoothing Methods
   6.1 One-dimensional Kernel Smoothers
   6.2 Kernel Smoothing in p Dimensions
7 High-Dimensional Use of Low-Dimensional Smoothers
   7.1 Additive Models and Other Structured Regression Functions
   7.2 Projection Pursuit Regression
8 Highly Flexible Non-Linear Parametric Regression Methods
   8.1 Neural Network Regression
       8.1.1 The Back-Propagation Algorithm
       8.1.2 Formal Regularization of Fitting
   8.2 Radial Basis Function Networks
9 Trees and Related Methods
   9.1 Regression and Classification Trees
       9.1.1 Comments on Forward Selection
       9.1.2 Measuring the Importance of Inputs
   9.2 Random Forests
   9.3 PRIM (Patient Rule Induction Method)
10 Predictors Built on Bootstrap Samples, Model Averaging, and Stacking
   10.1 Predictors Built on Bootstrap Samples
       10.1.1 Bagging
       10.1.2 Bumping
   10.2 Model Averaging and Stacking
       10.2.1 Bayesian Model Averaging
       10.2.2 Stacking
11 Reproducing Kernel Hilbert Spaces: Penalized/Regularized and Bayes Prediction
   11.1 RKHS's and p = 1 Cubic Smoothing Splines
   11.2 What Seems to be Possible Beginning from Linear Functionals and Linear Differential Operators for p = 1
   11.3 What Is Common Beginning Directly From a Kernel
       11.3.1 Some Special Cases
       11.3.2 Addendum Regarding the Structures of the Spaces Related to a Kernel
   11.4 Gaussian Process "Priors," Bayes Predictors, and RKHS's
12 More on Understanding and Predicting Predictor Performance
   12.1 Optimism of the Training Error
   12.2 Cp, AIC and BIC
       12.2.1 Cp and AIC
       12.2.2 BIC
   12.3 Cross-Validation Estimation of Err
   12.4 Bootstrap Estimation of Err
13 Linear (and a Bit on Quadratic) Methods of Classification
   13.1 Linear (and a bit on Quadratic) Discriminant Analysis
   13.2 Logistic Regression
   13.3 Separating Hyperplanes
14 Support Vector Machines
   14.1 The Linearly Separable Case
   14.2 The Linearly Non-separable Case
   14.3 SVM's and Kernels
       14.3.1 Heuristics
       14.3.2 An Optimality Argument
   14.4 Other Support Vector Stuff
15 Boosting
   15.1 The AdaBoost Algorithm for 2-Class Classification
   15.2 Why AdaBoost Might Be Expected to Work
       15.2.1 Expected "Exponential Loss" for a Function of x
       15.2.2 Sequential Search for a g(x) With Small Expected Exponential Loss and the AdaBoost Algorithm
   15.3 Other Forms of Boosting
       15.3.1 SEL
       15.3.2 AEL (and Binary Regression Trees)
       15.3.3 K-Class Classification
   15.4 Some Issues (More or Less) Related to Boosting and Use of Trees as Base Predictors
16 Prototype and Nearest Neighbor Methods of Classification
17 Some Methods of Unsupervised Learning
   17.1 Association Rules/Market Basket Analysis
       17.1.1 The "Apriori Algorithm" for the Case of Singleton Sets S_j
       17.1.2 Using 2-Class Classification Methods to Create Rectangles
   17.2 Clustering
       17.2.1 Partitioning Methods ("Centroid"-Based Methods)
       17.2.2 Hierarchical Methods
       17.2.3 (Mixture) Model-Based Methods
       17.2.4 Self-Organizing Maps
       17.2.5 Spectral Clustering
   17.3 Multi-Dimensional Scaling
   17.4 More on Principal Components and Related Ideas
       17.4.1 "Sparse" Principal Components
       17.4.2 Non-negative Matrix Factorization
       17.4.3 Archetypal Analysis
       17.4.4 Independent Component Analysis
       17.4.5 Principal Curves and Surfaces
   17.5 Density Estimation
   17.6 Google PageRanks
18 Document Features and String Kernels for Text Processing
19 Kernel Mechanics
20 Undirected Graphical Models and Machine Learning
1 Introduction

The course is mostly about "supervised" machine learning for statisticians. The fundamental issue here is prediction, usually with a fairly large number of predictors and a large number of cases upon which to build a predictor. For $x \in \Re^p$ an input vector (a vector of predictors) with $j$th coordinate $x_j$ and $y$ some output value (usually univariate, and when it's not, we'll indicate so with bold face) the object is to use $N$ observed $(x, y)$ pairs
$$(x_1, y_1),\, (x_2, y_2),\, \ldots,\, (x_N, y_N)$$
called training data to create an effective prediction rule
$$\hat{y} = \hat{f}(x)$$
"Supervised" learning means that the training data include values of the output variable. "Unsupervised" learning means that there is no particular variable designated as the output variable, and one is looking for patterns in the training data $x_1, x_2, \ldots, x_N$, like clusters or low-dimensional structure.
Here we will write
$$\underset{N \times p}{X} = \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_N' \end{pmatrix}, \quad \underset{N \times 1}{Y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}, \quad \text{and} \quad T = (X, Y)$$
for the training data, and when helpful (to remind of the size of the training set) add "$N$" subscripts.
A primary difference between a "machine learning" point of view and that common in basic graduate statistics courses is an ambivalence toward making statements concerning parameters of any models used in the invention of prediction rules (including those that would describe the amount of variation in observables one can expect the model to generate). Standard statistics courses often conceive of those parameters as encoding scientific knowledge about some underlying data-generating mechanism or real-world phenomenon. Machine learners don't seem to care much about these issues, but rather are concerned about "how good" a prediction rule is.
An output variable might take values in $\Re$ or in some finite set $\mathcal{G} = \{1, 2, \ldots, K\}$ or $\mathcal{G} = \{0, 1, \ldots, K-1\}$. In this latter case, it can be notationally convenient to replace a $\mathcal{G}$-valued output $y$ with $K$ indicator variables
$$y_j = I[y = j]$$
(of course) each taking real values. This replaces the original $y$ with the vector output
$$\boldsymbol{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_K \end{pmatrix} \quad \text{or} \quad \boldsymbol{y} = \begin{pmatrix} y_0 \\ y_1 \\ \vdots \\ y_{K-1} \end{pmatrix} \tag{1}$$
taking values in $\{0, 1\}^K \subset \Re^K$.

1.1 Some Very Simple Probability Theory and Decision Theory and Prediction Rules
For purposes of appealing to general theory that can suggest sensible forms for prediction rules, suppose that the random pair $(x', y)$ has ($(p+1)$-dimensional joint) distribution $P$. (Until further notice all expectations refer to distributions and conditional distributions derived from $P$. In general, if we need to remind ourselves what variables are being treated as random in probability, expectation, conditional probability, or conditional expectation computations, we will superscript $E$ or $P$ with their names.) Suppose further that $L(\hat{y}, y) \geq 0$ is some loss function quantifying how unhappy I am with prediction $\hat{y}$ when the output is $y$. Purely as a thought experiment (not yet as anything based on the training data) I might consider trying to choose a functional form $f$ (NOT yet $\hat{f}$) to minimize
$$E\,L(f(x), y)$$
This is (in theory) easily done, since
$$E\,L(f(x), y) = E\,E[L(f(x), y) \mid x]$$
and the minimization can be accomplished by minimizing the inner conditional expectation separately for each possible value of $x$. That is, an optimal choice of $f$ is defined by
$$f(u) = \arg\min_a E[L(a, y) \mid x = u] \tag{2}$$
In the simple case where $L(\hat{y}, y) = (\hat{y} - y)^2$, the optimal $f$ in (2) is then well-known to be
$$f(u) = E[y \mid x = u]$$
In a classification context, where $y$ takes values in $\mathcal{G} = \{1, 2, \ldots, K\}$ or $\mathcal{G} = \{0, 1, \ldots, K-1\}$, I might use the (0-1) loss function
$$L(\hat{y}, y) = I[\hat{y} \neq y]$$
An optimal $f$ corresponding to (2) is then
$$f(u) = \arg\min_a \sum_{v \neq a} P[y = v \mid x = u] = \arg\max_a P[y = a \mid x = u]$$
Further, if I define $\boldsymbol{y}$ as in (1), this is
$$f(u) = \text{the index of the largest entry of } E[\boldsymbol{y} \mid x = u]$$
So in both the squared error loss context and in the 0-1 loss classification context, the ($P$) optimal choice of $f$ involves the conditional mean of an output variable (albeit the second is a vector of 0-1 indicator variables). These facts strongly suggest the importance of finding ways to approximate conditional means $E[y \mid x = u]$ in order to (either directly or indirectly) produce good prediction rules.
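As a small concrete check on these optimality facts (not part of the original outline), the following Python sketch builds a hypothetical discrete joint distribution $P$ for $(x, y)$ and computes, for each value $u$ of $x$, the squared-error-optimal prediction $E[y \mid x = u]$ and the 0-1-loss-optimal classification $\arg\max_a P[y = a \mid x = u]$. The particular numbers are made up purely for illustration.

```python
import numpy as np

# Toy joint distribution P for (x, y) with x in {0, 1, 2} and y in {1, 2, 3}.
# Rows index x values, columns index y values; entries are P[x = u, y = v].
P = np.array([[0.10, 0.05, 0.05],
              [0.05, 0.20, 0.05],
              [0.02, 0.08, 0.40]])
x_vals = np.array([0, 1, 2])
y_vals = np.array([1, 2, 3])

for i, u in enumerate(x_vals):
    cond = P[i] / P[i].sum()           # conditional distribution of y given x = u
    f_sel = np.dot(cond, y_vals)       # squared error loss optimum: E[y | x = u]
    f_01 = y_vals[np.argmax(cond)]     # 0-1 loss optimum: arg max_a P[y = a | x = u]
    print(f"x = {u}: E[y|x] = {f_sel:.3f}, 0-1-loss-optimal class = {f_01}")
```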
1.2 Two Conceptually Simple Ways One Might Approximate a Conditional Mean

It is standard to point to two conceptually appealing but in some sense polar-opposite ways of approximating $E[y \mid x = u]$ (for $(x, y)$ with distribution $P$). The tacit assumption here is that the training data are realizations of $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$ iid $P$.
1.2.1 Rules Based on a Linear Assumption About the Form of the Conditional Mean

If one is willing to adopt fairly strong assumptions about the form of $E[y \mid x = u]$ then Stat 511-type computations might be used to guide choice of a prediction rule. In particular, if one assumes that for some $\beta \in \Re^p$
$$E[y \mid x = u] = u'\beta \tag{3}$$
(if we wish, we are free to restrict $x_1 = u_1 \equiv 1$ so that this form has an intercept) then an OLS computation with the training data produces
$$\hat{\beta} = (X'X)^{-1} X'Y$$
and suggests the predictor
$$\hat{f}(x) = x'\hat{\beta}$$
Probably "the biggest problems" with this possibility in situations involving large $p$ are the potential inflexibility of the linear form (3) working directly with the training data, and the "non-automatic" flavor of traditional regression model-building that tries to both take account of/mitigate problems of multicollinearity and extend the flexibility of forms of conditional means by constructing new predictors from "given" ones (further exploding the size of "$p$" and the associated difficulties).
1.2.2 Nearest Neighbor Rules

One idea (that might automatically allow very flexible forms for $E[y \mid x = u]$) is to operate completely non-parametrically and to think that if $N$ is "big enough," perhaps something like
$$\frac{1}{\#\text{ of } x_i = u} \sum_{i \text{ with } x_i = u} y_i \tag{4}$$
might work as an approximation to $E[y \mid x = u]$. But almost always (unless $N$ is huge and the distribution of $x$ is discrete) the number of $x_i = u$ is 0 or at most 1 (and absolutely no extrapolation beyond the set of training inputs is possible). So some modification is surely required. Perhaps the condition
$$x_i = u$$
might be replaced with
$$x_i \approx u$$
in (4).

One form of this is to define for each $x$ the $k$-neighborhood
$$n_k(x) = \text{the set of } k \text{ inputs } x_i \text{ in the training set closest to } x \text{ in } \Re^p$$
A $k$-nearest neighbor approximation to $E[y \mid x = u]$ is then
$$m_k(u) = \frac{1}{k} \sum_{i \text{ with } x_i \in n_k(u)} y_i$$
suggesting the prediction rule
$$\hat{f}(x) = m_k(x)$$
One might hope that upon allowing $k$ to increase with $N$, provided that $P$ is not so bizarre (one is counting, for example, on the continuity of $E[y \mid x = u]$ in $u$) this could be an effective predictor. It is surely a highly flexible predictor, and automatically so. But as a practical matter, it and things like it often fail to be effective ... basically because for $p$ anything but very small, $\Re^p$ is far bigger than we naively realize.
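A minimal sketch of the two rules just described (the OLS linear rule of Section 1.2.1 and the $k$-nearest neighbor rule of Section 1.2.2) follows, assuming simulated training data and plain Euclidean distance; everything here is hypothetical illustration, not code referenced by the outline.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 500, 3
X = rng.uniform(-1, 1, size=(N, p))
y = 1.0 + 2.0 * X[:, 0] - X[:, 1] ** 2 + rng.normal(scale=0.2, size=N)

# Rule 1 (Section 1.2.1): OLS fit of a linear form, with a constant column for an intercept.
X1 = np.hstack([np.ones((N, 1)), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

def f_linear(x):
    """Linear prediction rule x' beta_hat (with intercept)."""
    return np.concatenate(([1.0], x)) @ beta_hat

# Rule 2 (Section 1.2.2): k-nearest neighbor average m_k(x).
def f_knn(x, k=20):
    """Average the y_i for the k training inputs closest to x."""
    nearest = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    return y[nearest].mean()

x_new = np.array([0.3, -0.4, 0.1])
print(f_linear(x_new), f_knn(x_new))
```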
1.3 The Curse of Dimensionality

Roughly speaking, one wants to allow the most flexible kinds of prediction rules possible. Naively, one might hope that with what seem like the "large" sample sizes common in machine learning contexts, even if $p$ is fairly large, prediction rules like $k$-nearest neighbor rules might be practically useful (as points "fill up" $\Re^p$ and allow one to figure out how $y$ behaves as a function of $x$ operating quite locally). But our intuition about how sparse data sets in $\Re^p$ actually must be is pretty weak and needs some education. In popular jargon, there is the "curse of dimensionality" to be reckoned with.

There are many ways of framing this inescapable sparsity. Some simple ones involve facts about uniform distributions on the unit ball in $\Re^p$
$$\{x \in \Re^p \mid \|x\| \leq 1\}$$
and on a unit cube centered at the origin
$$[-.5, .5]^p$$
For one thing, "most" of these distributions are very near the surface of the solids. The cube $[-r, r]^p$ capturing (for example) half the volume of the cube (half the probability mass of the distribution) has
$$r = .5(.5)^{1/p}$$
which converges to $.5$ as $p$ increases. Essentially the same story holds for the uniform distribution on the unit ball. The radius capturing half the probability mass has
$$r = (.5)^{1/p}$$
which converges to 1 as $p$ increases. Points uniformly distributed in these regions are mostly near the surface or boundary of the spaces.

Another interesting calculation concerns how large a sample must be in order for points generated from the uniform distribution on the ball or cube in an iid fashion to tend to "pile up" anywhere. Consider the problem of describing the distance from the origin to the closest of $N$ iid points uniform in the $p$-dimensional unit ball. With
$$R = \text{the distance from the origin to a single random point}$$
$R$ has cdf
$$F(r) = \begin{cases} 0 & r < 0 \\ r^p & 0 \leq r \leq 1 \\ 1 & r > 1 \end{cases}$$
So if $R_1, R_2, \ldots, R_N$ are iid with this distribution, $M = \min\{R_1, R_2, \ldots, R_N\}$ has cdf
$$F_M(m) = \begin{cases} 0 & m < 0 \\ 1 - (1 - m^p)^N & 0 \leq m \leq 1 \\ 1 & m > 1 \end{cases}$$
This distribution has, for example, median
$$F_M^{-1}(.5) = \left(1 - \left(\tfrac{1}{2}\right)^{1/N}\right)^{1/p}$$
For, say, $p = 100$ and $N = 10^6$, the median of the distribution of $M$ is $.87$.
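The closed-form median above is easy to check numerically. The short sketch below (an illustration only, with parameters chosen to match the text) evaluates $\left(1 - (1/2)^{1/N}\right)^{1/p}$ and compares it with a small Monte Carlo simulation for a more modest $p$ and $N$.

```python
import numpy as np

def median_min_distance(p, N):
    """Median distance from the origin to the nearest of N iid uniform points
    in the p-dimensional unit ball: (1 - (1/2)^(1/N))^(1/p)."""
    return (1.0 - 0.5 ** (1.0 / N)) ** (1.0 / p)

print(median_min_distance(100, 10 ** 6))   # approximately 0.87, as in the text

# Monte Carlo check for smaller p and N: the radius R of a uniform point in the
# unit ball has cdf r^p, so R can be simulated as U^(1/p) for U uniform on (0,1).
rng = np.random.default_rng(2)
p, N, reps = 10, 1000, 2000
mins = np.array([(rng.uniform(size=N) ** (1.0 / p)).min() for _ in range(reps)])
print(np.median(mins), median_min_distance(p, N))
```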
In retrospect, it's not really all that hard to understand that even "large" numbers of points in $\Re^p$ must be sparse. After all, it's perfectly possible for $p$-vectors $u$ and $v$ to agree perfectly on all but one coordinate and still be far apart in $p$-space. There simply are a lot of ways for two $p$-vectors to differ!

In addition to these kinds of considerations of sparsity, there is the fact that the potential complexity of functions of $p$ variables explodes exponentially in $p$. CFZ identify this as a second manifestation of "the curse," and go on to expound a third that says "for large $p$, all data sets exhibit multicollinearity (or its generalization to non-parametric fitting)" and its accompanying problems of reliable fitting and extrapolation. In any case, the point is that even "large" $N$ doesn't save us and make practical large-$p$ prediction a more or less trivial application of standard parametric or non-parametric regression methodology.
1.4 Variance-Bias Considerations in Prediction

Just as in model parameter estimation, the "variance-bias trade-off" notion is important in a prediction context. Here we consider decompositions of what HTF in their Section 2.9 and Chapter 7 call various forms of "test/generalization/prediction error" that can help us think about that trade-off, and what issues come into play as we try to approximate a conditional mean function.

In what follows, assume that $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$ are iid $P$ and independent of $(x, y)$ that is also $P$ distributed. Write $E_T$ for expectation (and $\mathrm{Var}_T$ for variance) computed with respect to the joint distribution of the training data, and $E[\,\cdot \mid x = u]$ for conditional expectation (and $\mathrm{Var}[\,\cdot \mid x = u]$ for conditional variance) based on the conditional distribution of $y \mid x = u$. A measure of the effectiveness of the predictor $\hat{f}(x)$ at $x = u$ (under squared error loss) is (the function of $u$, an input vector at which one is predicting)
$$\mathrm{Err}(u) \equiv E_T E\left[\left(\hat{f}(x) - y\right)^2 \,\middle|\, x = u\right]$$
For some purposes, other versions of this might be useful and appropriate. For example
$$\mathrm{Err}_T \equiv E^{(x,y)}\left(\hat{f}(x) - y\right)^2$$
is another kind of prediction or test error (that is a function of the training data). (What one would surely like, but almost surely cannot have, is a guarantee that $\mathrm{Err}_T$ is small uniformly in $T$.) Note then that
$$\mathrm{Err} \equiv E_x \mathrm{Err}(x) = E_T \mathrm{Err}_T$$
is a number, an expected prediction error.
In any case, a useful decomposition of $\mathrm{Err}(u)$ is
$$\begin{aligned}
\mathrm{Err}(u) &= E_T\left(\hat{f}(u) - E[y \mid x = u]\right)^2 + E\left[\left(y - E[y \mid x = u]\right)^2 \,\middle|\, x = u\right] \\
&= E_T\left(\hat{f}(u) - E_T \hat{f}(u)\right)^2 + \left(E_T \hat{f}(u) - E[y \mid x = u]\right)^2 + \mathrm{Var}[y \mid x = u] \\
&= \mathrm{Var}_T \hat{f}(u) + \left(E_T \hat{f}(u) - E[y \mid x = u]\right)^2 + \mathrm{Var}[y \mid x = u]
\end{aligned}$$
The first quantity in this decomposition, $\mathrm{Var}_T \hat{f}(u)$, is the variance of the prediction at $u$. The second term, $\left(E_T \hat{f}(u) - E[y \mid x = u]\right)^2$, is a kind of squared bias of prediction at $u$. And $\mathrm{Var}[y \mid x = u]$ is an unavoidable variance in outputs at $x = u$. Highly flexible prediction forms may give small prediction biases at the expense of large prediction variances. One may need to balance the two off against each other when looking for a good predictor.

Some additional insight into this can perhaps be gained by considering
$$\mathrm{Err} \equiv E_x \mathrm{Err}(x) = E_x \mathrm{Var}_T \hat{f}(x) + E_x\left(E_T \hat{f}(x) - E[y \mid x]\right)^2 + E_x \mathrm{Var}[y \mid x]$$
The first term on the right here is the average (according to the marginal of $x$) of the prediction variance at $x$. The second is the average squared prediction bias. And the third is the average conditional variance of $y$ (and is not under the control of the analyst choosing $\hat{f}(x)$). Consider here a further decomposition of the second term.

Suppose that $T$ is used to select a function (say $g_T$) from some linear subspace, say $\mathcal{S} = \{g\}$, of the space of functions $h$ with $E_x(h(x))^2 < \infty$, and that ultimately one uses as a predictor
$$\hat{f}(x) = g_T(x)$$
Since linear subspaces are convex,
$$E_T g_T = E_T \hat{f} \in \mathcal{S}$$
Further, suppose that
$$g^* \equiv \arg\min_{g \in \mathcal{S}} E_x\left(g(x) - E[y \mid x]\right)^2$$
the projection of (the function of $u$) $E[y \mid x = u]$ onto the space $\mathcal{S}$. Then write
$$h^*(u) = E[y \mid x = u] - g^*(u)$$
so that
$$E[y \mid x] = g^*(x) + h^*(x)$$
Then, it's a consequence of the facts that $E_x(h^*(x)\,g(x)) = 0$ for all $g \in \mathcal{S}$, and therefore that $E_x(h^*(x)\,g^*(x)) = 0$ and $E_x\left(h^*(x)\,E_T\hat{f}(x)\right) = 0$, that
$$\begin{aligned}
E_x\left(E_T \hat{f}(x) - E[y \mid x]\right)^2 &= E_x\left(E_T \hat{f}(x) - (g^*(x) + h^*(x))\right)^2 \\
&= E_x\left(E_T \hat{f}(x) - g^*(x)\right)^2 + E_x(h^*(x))^2 - 2E_x\left(\left(E_T \hat{f}(x) - g^*(x)\right) h^*(x)\right) \\
&= E_x\left(E_T \hat{f}(x) - g^*(x)\right)^2 + E_x\left(E[y \mid x] - g^*(x)\right)^2
\end{aligned}$$
The first term on the right above is an average squared estimation bias, measuring how well the average (over $T$) predictor function approximates the element of $\mathcal{S}$ that best approximates the conditional mean function. This is a measure of how appropriately the training data are used to pick out elements of $\mathcal{S}$. The second term on the right is an average squared model bias, measuring how well it is possible to approximate the conditional mean function $E[y \mid x = u]$ by an element of $\mathcal{S}$. This is controlled by the size of $\mathcal{S}$, or effectively the flexibility allowed in the form of $\hat{f}$. Average squared prediction bias can thus be large because the form fit is not flexible enough, or because a poor fitting method is employed.
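The three-term decomposition of $\mathrm{Err}(u)$ can be checked by brute force in a simulated setting. The sketch below (a hypothetical data-generating mechanism and a $k$-nearest neighbor predictor, chosen only for illustration) repeatedly draws training sets, records $\hat{f}(u)$ at one fixed $u$, and estimates the prediction variance, squared prediction bias, and unavoidable variance separately.

```python
import numpy as np

rng = np.random.default_rng(3)
N, k, reps = 100, 5, 2000
sigma = 0.3
u = np.array([0.5])                      # the fixed input at which Err(u) is examined
true_mean = np.sin(3 * u[0])             # E[y | x = u] in this simulated world

def fit_and_predict(u):
    """Draw one training set of size N and return the k-NN prediction at u."""
    X = rng.uniform(-1, 1, size=(N, 1))
    y = np.sin(3 * X[:, 0]) + rng.normal(scale=sigma, size=N)
    nearest = np.argsort(np.abs(X[:, 0] - u[0]))[:k]
    return y[nearest].mean()

preds = np.array([fit_and_predict(u) for _ in range(reps)])

var_T = preds.var()                          # Var_T f_hat(u)
sq_bias = (preds.mean() - true_mean) ** 2    # (E_T f_hat(u) - E[y | x = u])^2
noise = sigma ** 2                           # Var[y | x = u]
print("Err(u) approx:", var_T + sq_bias + noise)
print("variance:", var_T, "squared bias:", sq_bias, "noise:", noise)
```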
1.5 Complexity, Fit, and Choice of Predictor

In rough terms, standard methods of constructing predictors all have associated "complexity parameters" (like numbers and types of predictors/basis functions or "ridge parameters" used in regression methods, and penalty weights/bandwidths/neighborhood sizes applied in smoothing methods) that are at the choice of a user. Depending upon the choice of complexity, one gets more or less flexibility in the form $\hat{f}$. If a choice of complexity doesn't allow enough flexibility in the form of a predictor, under-fit occurs and there is large bias in prediction. On the other hand, if the choice allows too much flexibility, bias may be reduced, but the price paid is large variance of prediction and over-fit. It is a theme that runs through this material that complexity must be chosen in a way that balances variance and bias for the particular combination of $N$ and $p$ and general circumstance one faces. Specifics of methods for doing this are presented in some detail in Chapter 7 of HTF, but in broad terms, the story is as follows.

The most obvious/elementary means of selecting a "complexity parameter" is to (for all options under consideration) compute and compare values of the so-called "training error"
$$\mathrm{err} = \frac{1}{N} \sum_{i=1}^N \left(\hat{f}(x_i) - y_i\right)^2$$
(or more generally, $\mathrm{err} = \frac{1}{N}\sum_{i=1}^N L\left(\hat{f}(x_i), y_i\right)$) and look for predictors with small $\mathrm{err}$. The problem is that $\mathrm{err}$ is no good as an estimator of $\mathrm{Err}$ (or any other sensible quantification of predictor performance). It typically decreases with increased complexity (and does not turn back up for large complexity), and so fails to reliably indicate performance outside the training sample.
The existing practical options for evaluating likely performance of a predictor (and guiding choice of complexity) are then three.

1. One might employ other (besides $\mathrm{err}$) training-data based indicators of likely predictor performance, like Mallows' $C_p$, "AIC," and "BIC."

2. In genuinely large $N$ contexts, one might hold back some random sample of the training data to serve as a "test set," fit to produce $\hat{f}$ on the remainder, and use
$$\frac{1}{\text{size of the test set}} \sum_{i \in \text{ the test set}} \left(\hat{f}(x_i) - y_i\right)^2$$
to indicate likely predictor performance.

3. One might employ sample re-use methods to estimate $\mathrm{Err}$ and guide choice of complexity. Cross-validation and bootstrap ideas are used here.
We'll say more about these possibilities later, but at the moment outline the cross-validation idea. $K$-fold cross-validation consists of

1. randomly breaking the training set into $K$ disjoint roughly equal-sized pieces, say $T^1, T^2, \ldots, T^K$,

2. training on each of the reduced training sets $T - T^k$, to produce $K$ predictors $\hat{f}^k$,

3. letting $k(i)$ be the index of the piece $T^k$ containing training case $i$, and computing the cross-validation error
$$CV\left(\hat{f}\right) = \frac{1}{N} \sum_{i=1}^N \left(\hat{f}^{k(i)}(x_i) - y_i\right)^2$$
(more generally, $CV(\hat{f}) = \frac{1}{N}\sum_{i=1}^N L\left(y_i, \hat{f}^{k(i)}(x_i)\right)$).

This is roughly the same as fitting on each $T - T^k$ and correspondingly evaluating on $T^k$, and then averaging. (When $N$ is a multiple of $K$, these are exactly the same.) One hopes that $CV(\hat{f})$ approximates $\mathrm{Err}$. The choice of $K = N$ is called "leave one out cross-validation" and sometimes in this case there are slick computational ways of evaluating $CV(\hat{f})$.
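A generic sketch of the $K$-fold cross-validation computation just outlined follows; the `fit` and `predict` arguments are hypothetical placeholders supplied by the user, and the OLS example data are simulated purely for illustration.

```python
import numpy as np

def k_fold_cv(X, Y, fit, predict, K=10, rng=None):
    """K-fold cross-validation estimate of Err under squared error loss.
    `fit(X, Y)` returns a fitted object; `predict(obj, X)` returns predictions."""
    rng = np.random.default_rng() if rng is None else rng
    N = len(Y)
    folds = rng.permutation(N) % K            # random assignment of cases to K pieces
    sq_errors = np.empty(N)
    for k in range(K):
        test = folds == k
        obj = fit(X[~test], Y[~test])         # train on T minus T_k
        sq_errors[test] = (Y[test] - predict(obj, X[test])) ** 2
    return sq_errors.mean()

# Example use with an OLS fit on hypothetical data.
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)
ols_fit = lambda X, Y: np.linalg.lstsq(X, Y, rcond=None)[0]
ols_pred = lambda beta, X: X @ beta
print(k_fold_cv(X, Y, ols_fit, ols_pred, K=10, rng=rng))
```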
1.6 Plotting to Portray the Effects of Particular Inputs

An issue briefly discussed in HTF Ch 10 is the making and plotting of functions of a few of the coordinates of $x$ in an attempt to understand the nature of the influence of these in a predictor. (What they say is really perfectly general, not at all special to the particular form of predictors discussed in that chapter.) If, for example, I want to understand the influence the first two coordinates of $x$ have on a form $f(x)$, I might think of somehow averaging out the remaining coordinates of $x$. One theoretical implementation of this idea would be
$$\bar{f}_{12}(u_1, u_2) = E^{(x_3, x_4, \ldots, x_p)} f(u_1, u_2, x_3, x_4, \ldots, x_p)$$
i.e. averaging according to the marginal distribution of the excluded input variables. An empirical version of this is
$$\frac{1}{N} \sum_{i=1}^N f(u_1, u_2, x_{3i}, x_{4i}, \ldots, x_{pi})$$
This might be plotted (e.g. in contour plot fashion) and the plot called a partial dependence plot for the variables $x_1$ and $x_2$. HTF's language is that this function details the dependence of the predictor on $(u_1, u_2)$ "after accounting for the average effects of the other variables." This thinking amounts to a version of the kind of thing one does in ordinary factorial linear models, where main effects are defined in terms of average (across all levels of all other factors) means for individual levels of a factor, two-factor interactions are defined in terms of average (again across all levels of all other factors) means for pairs of levels of two factors, etc.

Something different raised by HTF is consideration of
$$\tilde{f}_{12}(u_1, u_2) = E[f(x) \mid x_1 = u_1, x_2 = u_2]$$
(This is, by the way, the function of $(u_1, u_2)$ closest to $f(x)$ in $L_2(P)$.) This is obtained by averaging not against the marginal of the excluded variables, but against the conditional distribution of the excluded variables given $x_1 = u_1, x_2 = u_2$. No workable empirical version of $\tilde{f}_{12}(u_1, u_2)$ can typically be defined. And it should be clear that this is not the same as $\bar{f}_{12}(u_1, u_2)$. HTF say this in some sense describes the effects of $(x_1, x_2)$ on the prediction "ignoring the impact of the other variables." (In fact, if $f(x)$ is a good approximation for $E[y \mid x]$, this conditioning produces essentially $E[y \mid x_1 = u_1, x_2 = u_2]$ and $\tilde{f}_{12}$ is just a predictor of $y$ based on $(x_1, x_2)$.)

The difference between $\bar{f}_{12}$ and $\tilde{f}_{12}$ is easily seen through resort to a simple example. If, for example, $f$ is additive of the form
$$f(x) = h_1(x_1, x_2) + h_2(x_3, \ldots, x_p)$$
then
$$\bar{f}_{12}(u_1, u_2) = h_1(u_1, u_2) + E\,h_2(x_3, \ldots, x_p) = h_1(u_1, u_2) + \text{constant}$$
while
$$\tilde{f}_{12}(u_1, u_2) = h_1(u_1, u_2) + E[h_2(x_3, \ldots, x_p) \mid x_1 = u_1, x_2 = u_2]$$
and the $\tilde{f}_{12}$ "correction" to $h_1(u_1, u_2)$ is not necessarily constant in $(u_1, u_2)$.

The upshot of all this is that partial dependence plots are potentially helpful, but one needs to remember that they are produced by averaging according to the marginal of the set of variables not under consideration.
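An empirical partial dependence computation of the $\bar{f}_{12}$ type is only a few lines. The sketch below uses a made-up predictor $f$ and made-up data purely to show the averaging over training values of the excluded coordinates; the resulting grid is what one would display as a contour plot.

```python
import numpy as np

def partial_dependence(f, X, u1, u2):
    """Empirical version of f_bar_12(u1, u2): fix the first two coordinates at
    (u1, u2) and average f over the training values of the remaining coordinates."""
    X_mod = X.copy()
    X_mod[:, 0] = u1
    X_mod[:, 1] = u2
    return np.mean([f(x) for x in X_mod])

# Hypothetical predictor and data, purely for illustration.
rng = np.random.default_rng(5)
X = rng.normal(size=(300, 4))
f = lambda x: x[0] * x[1] + np.sin(x[2]) + 0.5 * x[3]

grid = np.linspace(-2, 2, 5)
pd = np.array([[partial_dependence(f, X, u1, u2) for u2 in grid] for u1 in grid])
print(pd.round(2))
```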
2 Some Linear Algebra and Simple Principal Components

Methods of modern multivariate statistical learning often involve more linear algebra background than is assumed or used in Stat 511. So here we provide some relevant linear theory. We then apply that to the unsupervised learning problem of "principal components analysis."
We continue to use the notation
$$\underset{N \times p}{X} = \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_N' \end{pmatrix} \quad \text{and potentially} \quad \underset{N \times 1}{Y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$$
and recall that it is standard Stat 511 fare that ordinary least squares projects $Y$ onto $C(X)$, the column space of $X$, in order to produce the vector of fitted values
$$\underset{N \times 1}{\hat{Y}} = \begin{pmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_N \end{pmatrix}$$

2.1 The Gram-Schmidt Process and the QR Decomposition of a rank = p Matrix X
For many purposes it would be convenient if the columns of a full rank ($\mathrm{rank} = p$) matrix $X$ were orthogonal. In fact, it would be useful to replace the $N \times p$ matrix $X$ with an $N \times p$ matrix $Z$ with orthogonal columns and having the property that for each $l$, if $X_l$ and $Z_l$ are the $N \times l$ matrices consisting of the first $l$ columns of respectively $X$ and $Z$, then $C(Z_l) = C(X_l)$. Such a matrix can in fact be constructed using the so-called Gram-Schmidt process. This process generalizes beyond the present application to $\Re^N$ to general Hilbert spaces, and in recognition of that important fact, we'll adopt standard Hilbert space notation for the inner product in $\Re^N$, namely
$$\langle u, v \rangle \equiv \sum_{i=1}^N u_i v_i = u'v$$
and describe the process in these terms.

In what follows, vectors are $N$-vectors and in particular, $x_j$ is the $j$th column of $X$. Notice that this is in potential conflict with earlier notation that made $x_i$ the $p$-vector of inputs for the $i$th case in the training data. We will simply have to read the following in context and keep in mind the local convention. With this notation, the Gram-Schmidt process proceeds as follows:

1. Set the first column of $Z$ to be
$$z_1 = x_1$$

2. Having constructed $Z_{l-1} = (z_1, z_2, \ldots, z_{l-1})$, let
$$z_l = x_l - \sum_{j=1}^{l-1} \frac{\langle x_l, z_j \rangle}{\langle z_j, z_j \rangle} z_j$$
It is easy enough to see that $\langle z_l, z_j \rangle = 0$ for all $j < l$ (building up the orthogonality of $z_1, z_2, \ldots, z_{l-1}$ by induction), since
$$\langle z_l, z_j \rangle = \langle x_l, z_j \rangle - \langle x_l, z_j \rangle$$
as at most one term of the sum in step 2 above is non-zero. Further, assume that $C(Z_{l-1}) = C(X_{l-1})$. $z_l \in C(X_l)$ and so $C(Z_l) \subset C(X_l)$. And since any element of $C(X_l)$ can be written as a linear combination of an element of $C(X_{l-1}) = C(Z_{l-1})$ and $x_l$, we also have $C(X_l) \subset C(Z_l)$. Thus, as advertised, $C(Z_l) = C(X_l)$.

Since the $z_j$ are perpendicular, for any $N$-vector $w$,
$$\sum_{j=1}^l \frac{\langle w, z_j \rangle}{\langle z_j, z_j \rangle} z_j$$
is the projection of $w$ onto $C(Z_l) = C(X_l)$. (To see this, consider minimization of the quantity $\left\langle w - \sum_{j=1}^l c_j z_j,\ w - \sum_{j=1}^l c_j z_j \right\rangle = \left\| w - \sum_{j=1}^l c_j z_j \right\|^2$ by choice of the constants $c_j$.) In particular, the sum on the right side of the equality in step 2 of the Gram-Schmidt process is the projection of $x_l$ onto $C(X_{l-1})$. And the projection of a vector of outputs $Y$ onto $C(X_l)$ is
$$\sum_{j=1}^l \frac{\langle Y, z_j \rangle}{\langle z_j, z_j \rangle} z_j$$
This means that for a full $p$-variable regression,
$$\frac{\langle Y, z_p \rangle}{\langle z_p, z_p \rangle}$$
is the regression coefficient for $z_p$ and (since only $z_p$ involves it) the last variable in $X$, $x_p$. So, in constructing a vector of fitted values, fitted regression coefficients in multiple regression can be interpreted as weights to be applied to that part of the input vector that remains after projecting the predictor onto the space spanned by all the others.
The construction of the orthogonal variables $z_j$ can be represented in matrix form as
$$\underset{N \times p}{X} = \underset{N \times p}{Z}\ \underset{p \times p}{\Gamma}$$
where $\Gamma$ is upper triangular with
$$\gamma_{kj} = \text{the value in the } k\text{th row and } j\text{th column of } \Gamma = \begin{cases} 1 & \text{if } j = k \\[4pt] \dfrac{\langle z_k, x_j \rangle}{\langle z_k, z_k \rangle} & \text{if } j > k \end{cases}$$
(and $\gamma_{kj} = 0$ if $j < k$). Defining
$$D = \mathrm{diag}\left(\langle z_1, z_1 \rangle^{1/2}, \ldots, \langle z_p, z_p \rangle^{1/2}\right) = \mathrm{diag}\left(\|z_1\|, \ldots, \|z_p\|\right)$$
and letting
$$Q = ZD^{-1} \quad \text{and} \quad R = D\Gamma$$
one may write
$$X = QR \tag{5}$$
that is the so-called QR decomposition of $X$.
In (5), $Q$ is $N \times p$ with
$$Q'Q = D^{-1} Z'Z D^{-1} = D^{-1}\,\mathrm{diag}\left(\langle z_1, z_1 \rangle, \ldots, \langle z_p, z_p \rangle\right) D^{-1} = I$$
That is, $Q$ has for columns perpendicular unit vectors that form a basis for $C(X)$. $R$ is upper triangular, and that says that only the first $l$ of these unit vectors are needed to create $x_l$.

The decomposition is computationally useful in that (for $q_j$ the $j$th column of $Q$, namely $q_j = \langle z_j, z_j \rangle^{-1/2} z_j$) the projection of a response vector $Y$ onto $C(X)$ is (since the $q_j$ comprise an orthonormal basis for $C(X)$)
$$\hat{Y} = \sum_{j=1}^p \langle Y, q_j \rangle q_j = QQ'Y \tag{6}$$
and
$$\hat{\beta}^{OLS} = R^{-1}Q'Y$$
(The fact that $R$ is upper triangular implies that there are efficient ways to compute its inverse.)
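The whole construction above is easy to carry out numerically. The following sketch (illustration only, with randomly generated $X$ and $Y$) builds $Z$ and $\Gamma$ by the Gram-Schmidt steps, forms $Q = ZD^{-1}$ and $R = D\Gamma$, and checks $X = QR$ and $\hat{\beta}^{OLS} = R^{-1}Q'Y$.

```python
import numpy as np

def gram_schmidt_qr(X):
    """Gram-Schmidt as described above: build Z with orthogonal columns and the
    upper triangular Gamma with X = Z Gamma, then Q = Z D^{-1} and R = D Gamma."""
    N, p = X.shape
    Z = np.zeros((N, p))
    Gamma = np.eye(p)
    for l in range(p):
        Z[:, l] = X[:, l]
        for j in range(l):
            Gamma[j, l] = Z[:, j] @ X[:, l] / (Z[:, j] @ Z[:, j])
            Z[:, l] -= Gamma[j, l] * Z[:, j]
    D = np.diag(np.linalg.norm(Z, axis=0))
    Q = Z @ np.linalg.inv(D)
    R = D @ Gamma
    return Q, R

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 4))
Y = rng.normal(size=50)
Q, R = gram_schmidt_qr(X)

print(np.allclose(X, Q @ R))                     # X = QR
print(np.allclose(Q.T @ Q, np.eye(4)))           # Q has orthonormal columns
beta_qr = np.linalg.solve(R, Q.T @ Y)            # beta_hat = R^{-1} Q' Y
beta_ols = np.linalg.lstsq(X, Y, rcond=None)[0]
print(np.allclose(beta_qr, beta_ols))
```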
2.2 The Singular Value Decomposition of X

If the matrix $X$ has rank $r$ then it has a so-called singular value decomposition as
$$\underset{N \times p}{X} = \underset{N \times r}{U}\ \underset{r \times r}{D}\ \underset{r \times p}{V'}$$
where $U$ has orthonormal columns spanning $C(X)$, $V$ has orthonormal columns spanning $C(X')$ (the row space of $X$), and $D = \mathrm{diag}(d_1, d_2, \ldots, d_r)$ for
$$d_1 \geq d_2 \geq \cdots \geq d_r > 0$$
The $d_j$ are the "singular values" of $X$.¹

¹It is worth noting that for a real non-negative definite square matrix (a covariance matrix), the singular value decomposition is the eigen decomposition, $U = V$, columns of these matrices are unit eigenvectors, and the SVD singular values are the corresponding eigenvalues.
An interesting property of the singular value decomposition is this. If $U_l$ and $V_l$ are matrices consisting of the first $l \leq r$ columns of $U$ and $V$ respectively, then
$$X_l = U_l\,\mathrm{diag}(d_1, d_2, \ldots, d_l)\,V_l'$$
is the best (in the sense of squared distance from $X$ in $\Re^{N \times p}$) $\mathrm{rank} = l$ approximation to $X$. (In passing, note that application of this kind of argument to covariance matrices provides low-rank approximations to complicated covariance matrices.)

Since the columns of $U$ are an orthonormal basis for $C(X)$, the projection of an output vector $Y$ onto $C(X)$ is
$$\hat{Y}^{OLS} = \sum_{j=1}^r \langle Y, u_j \rangle u_j = UU'Y \tag{7}$$
In the full rank ($\mathrm{rank} = p$) $X$ case, this is of course completely parallel to (6) and is a consequence of the fact that the columns of both $U$ and $Q$ (from the QR decomposition of $X$) form orthonormal bases for $C(X)$. In general, the two bases are not the same.

Now using the singular value decomposition of a full rank ($\mathrm{rank}\ p$) $X$,
$$X'X = VD'U'UDV' = VD^2V' \tag{8}$$
which is the eigen (or spectral) decomposition of the symmetric and positive definite $X'X$.

The vector
$$z_1 \equiv Xv_1$$
is the product $Xw$ with the largest squared length in $\Re^N$ subject to the constraint that $\|w\| = 1$. A second representation of $z_1$ is
$$z_1 = Xv_1 = UDV'v_1 = UD\begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} = d_1 u_1$$
In general,
$$z_j = Xv_j = \begin{pmatrix} \langle x_1, v_j \rangle \\ \vdots \\ \langle x_N, v_j \rangle \end{pmatrix} = d_j u_j \tag{9}$$
is the vector of the form $Xw$ with the largest squared length in $\Re^N$ subject to the constraints that $\|w\| = 1$ and $\langle z_j, z_l \rangle = 0$ for all $l < j$.
2.3 Matrices of Centered Columns and Principal Components

In the event that all the columns of $X$ have been centered (each $\mathbf{1}'x_j = 0$ for $x_j$ the $j$th column of $X$), there is additional terminology and insight associated with singular value decompositions as describing the structure of $X$. Note that centering is often sensible in unsupervised learning contexts because the object is to understand the internal structure of the data cases $x_i \in \Re^p$, not the location of the data cloud (that is easily represented by the sample mean vector). So accordingly, we first translate the data cloud to the origin. Principal components ideas are then based on the singular value decomposition of $X$
$$\underset{N \times p}{X} = \underset{N \times r}{U}\ \underset{r \times r}{D}\ \underset{r \times p}{V'}$$
(and related spectral/eigen decompositions of $X'X$ and $XX'$).
2.3.1 "Ordinary" Principal Components

The columns of $V$ (namely $v_1, v_2, \ldots, v_r$) are called the principal component directions in $\Re^p$ of the $x_i$, and the elements of the vectors $z_j \in \Re^N$ from display (9), namely the inner products $\langle x_i, v_j \rangle$, are called the principal components of the $x_i$. (The $i$th element of $z_j$, $\langle x_i, v_j \rangle$, is the value of the $j$th principal component for case $i$, or the corresponding principal component score.) Notice that $\langle x_i, v_j \rangle v_j$ is the projection of $x_i$ onto the 1-dimensional space spanned by $v_j$.

It is worth thinking a bit more about the form of the product
$$X_l = U_l\,\mathrm{diag}(d_1, d_2, \ldots, d_l)\,V_l'$$
that we've already said is the best $\mathrm{rank} = l$ approximation to $X$. In fact it is
$$(z_1, z_2, \ldots, z_l)\begin{pmatrix} v_1' \\ v_2' \\ \vdots \\ v_l' \end{pmatrix} = \sum_{j=1}^l z_j v_j'$$
and its $i$th row is $\sum_{j=1}^l \langle x_i, v_j \rangle v_j'$, which (since the $v_j$ are orthonormal) is the transpose of the projection of $x_i$ onto $C(V_l)$. That is,
$$X_l = \begin{pmatrix} \langle x_1, v_1 \rangle \\ \vdots \\ \langle x_N, v_1 \rangle \end{pmatrix} v_1' + \begin{pmatrix} \langle x_1, v_2 \rangle \\ \vdots \\ \langle x_N, v_2 \rangle \end{pmatrix} v_2' + \cdots + \begin{pmatrix} \langle x_1, v_l \rangle \\ \vdots \\ \langle x_N, v_l \rangle \end{pmatrix} v_l' = z_1 v_1' + z_2 v_2' + \cdots + z_l v_l'$$
a sum of rank 1 summands, producing for $X_l$ a matrix with each $x_i$ in $X$ replaced by the transpose of its projection onto $C(V_l)$. The values of the principal components are the coefficients applied to the $v_j'$ for a given case to produce the summands. Note that in light of relationships (9) and the fact that the $u_j$'s are unit vectors, the squared length of $z_j$ is $d_j^2$ and non-increasing in $j$. So on the whole, the summands $z_j v_j'$ that make up $X_l$ decrease in size with $j$. In this sense, the directions $v_1, v_2, \ldots, v_r$ are successively less important in terms of describing variation in the $x_i$. This agrees with typical statistical interpretation of instances where a few early singular values are much bigger than the others. In such cases the "simple structure" discovered in the data is that observations can be more or less reconstructed as linear combinations (with principal components as weights) of a few orthonormal vectors.
Izenman, in his discussion of "polynomial principal components," points out that in some circumstances the existence of a few very small singular values can also identify important simple structure in a data set. Suppose, for example, that all singular values except $d_p \approx 0$ are of appreciable size. One simple feature of the data set is then that all $\langle x_i, v_p \rangle \approx 0$, i.e. there is one linear combination of the $p$ coordinates $x_j$ that is essentially constant (namely $\langle x, v_p \rangle$). The data fall nearly on a $(p-1)$-dimensional hyperplane in $\Re^p$. In cases where the $p$ coordinates $x_j$ are not functionally independent (for example consisting of centered versions of 1) all values, 2) all squares of values, and 3) all cross products of values of a smaller number of functionally independent variables), a single "nearly 0" singular value identifies a quadratic function of the functionally independent variables that must be essentially constant, a potentially useful insight about the data set.
X'X and XX' and Principal Components.  The singular value decomposition of $X$ means that both $X'X$ and $XX'$ have useful representations in terms of singular vectors and singular values. Consider first $X'X$ (that is most of the sample covariance matrix). As noted in display (8), the SVD of $X$ means that
$$X'X = VD^2V'$$
and it's then clear that the columns of $V$ are eigenvectors of $X'X$ and the squares of the diagonal elements of $D$ are the corresponding eigenvalues. An eigen analysis of $X'X$ then directly yields the principal component directions of the data, and through the further computation of the inner products in (9), the principal components $z_j$ (and hence the singular vectors $u_j$) are available.

Note that
$$\frac{1}{N}X'X$$
is the ($N$-divisor) sample covariance matrix² for the $p$ input variables $x_1, x_2, \ldots, x_p$. The principal component directions of $X$ in $\Re^p$, namely $v_1, v_2, \ldots, v_r$, are also unit eigenvectors of the sample covariance matrix. The squared lengths of the principal components $z_j$ in $\Re^N$ divided by $N$ are the ($N$-divisor) sample variances of the entries of the $z_j$, and their values are
$$\frac{1}{N}z_j'z_j = \frac{1}{N}d_j u_j'u_j d_j = \frac{d_j^2}{N}$$
The SVD of $X$ also implies that
$$XX' = UDV'VDU' = UD^2U'$$
and it's then clear that the columns of $U$ are eigenvectors of $XX'$ and the squares of the diagonal elements of $D$ are the corresponding eigenvalues. $UD$ then produces the $N \times r$ matrix of principal components of the data. It does not appear that the principal component directions are available even indirectly based only on this second eigen analysis.

²Notice that when $X$ has standardized columns (i.e. each column of $X$, $x_j$, has $\langle \mathbf{1}, x_j \rangle = 0$ and $\langle x_j, x_j \rangle = N$), the matrix $\frac{1}{N}X'X$ is the sample correlation matrix for the $p$ input variables $x_1, x_2, \ldots, x_p$.
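These relationships are easy to verify numerically. The sketch below (hypothetical data) centers the columns of $X$, computes the SVD, and checks that an eigen analysis of $X'X$ reproduces the directions and eigenvalues $d_j^2$, and that the sample variances of the principal components are $d_j^2/N$.

```python
import numpy as np

rng = np.random.default_rng(7)
X_raw = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.0, 0.0],
                                              [0.5, 1.0, 0.0],
                                              [0.0, 0.3, 0.2]])
X = X_raw - X_raw.mean(axis=0)            # centered columns

# SVD X = U D V'; columns of V are the principal component directions,
# Z = X V = U D holds the principal component scores.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt.T
print(np.allclose(Z, U * d))

# Eigen analysis of X'X recovers V and d^2 (up to column signs and ordering).
evals, evecs = np.linalg.eigh(X.T @ X)
order = np.argsort(evals)[::-1]
print(np.allclose(evals[order], d ** 2))
print(np.allclose(np.abs(evecs[:, order]), np.abs(Vt.T)))

# Sample variances of the principal components are d_j^2 / N.
print(np.allclose(Z.var(axis=0), d ** 2 / X.shape[0]))
```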
2.3.2 "Kernel" Principal Components

This topic concerns the possibility of using a nonlinear function $\phi: \Re^p \to \Re^M$ to map data vectors $x$ to (a usually higher-dimensional) vector of features $\phi(x)$. Of course, a perfectly straightforward way of proceeding is to simply create a new $N \times M$ data matrix
$$\Phi = \begin{pmatrix} \phi'(x_1) \\ \phi'(x_2) \\ \vdots \\ \phi'(x_N) \end{pmatrix}$$
and, after centering via
$$\tilde{\Phi} = \Phi - \frac{1}{N}J\Phi = \left(I - \frac{1}{N}J\right)\Phi \tag{10}$$
for $J$ an $N \times N$ matrix of 1's, do an SVD of $\tilde{\Phi}$, producing singular values and both sets of singular vectors. But this doesn't seem to be the most common approach.
Rather than specify the transform $\phi$ (or even the space $\Re^M$), it seems more common to instead specify the so-called Gram matrix $\Phi\Phi'$ by appeal to the fact that it is of necessity non-negative definite and might thus be generated by an appropriate "kernel" function. That is, for some choice of a non-negative definite function $K(x, z): \Re^p \times \Re^p \to \Re$, one assumes that $\Phi\Phi'$ is of the form
$$\Phi\Phi' = \left(\phi'(x_i)\phi(x_j)\right)_{\substack{i=1,\ldots,N \\ j=1,\ldots,N}} = \left(K(x_i, x_j)\right)_{\substack{i=1,\ldots,N \\ j=1,\ldots,N}}$$
and doesn't much bother about what $\phi$ might produce such a ($\Phi$ and then) $\Phi\Phi'$. (See Section 19 concerning the construction of kernel functions and note in passing that a choice of $K(x, z)$ essentially defines a new inner product on $\Re^p$ and corresponding new norm $\|x\|_K = \sqrt{K(x, x)}$.) But one must then somehow handle the fact that it is $\Phi\Phi'$ that is specified, and not the $\tilde{\Phi}\tilde{\Phi}'$ that is needed to do a principal components analysis in feature space. (Without $\phi(x)$ one can't make use of relationship (10).)

A bit of linear algebra shows that for an $N \times M$ matrix $M$ and $\tilde{M} = \left(I - \frac{1}{N}J\right)M$,
$$\tilde{M}\tilde{M}' = MM' - \frac{1}{N}JMM' - \frac{1}{N}MM'J + \frac{1}{N^2}JMM'J$$
so that $\tilde{\Phi}\tilde{\Phi}'$ is computable from only $\Phi\Phi'$. Then (as indicated in Section 2.3.1) an eigen analysis for $\tilde{M}\tilde{M}'$ will produce the principal components for the data expressed in feature space. (This analysis produces the $z_j$'s, but not the principal component directions in the feature space. There is the representation $\tilde{M}'\tilde{M} = M'M - \frac{1}{N}M'JM$, but that doesn't help one find $\tilde{\Phi}'\tilde{\Phi}$ or its eigenvectors using only $\Phi\Phi'$.)
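A short sketch of this "kernel" version of principal components follows; the radial basis kernel here is one possible choice of $K$ (an assumption for illustration, not one prescribed by the outline). The uncentered Gram matrix is double-centered exactly as in the identity above and then eigen-decomposed to produce the feature-space principal components.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gram matrix K with K_ij = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

def kernel_pca_scores(K, n_components=2):
    """Feature-space principal components from the uncentered Gram matrix K,
    using the double-centering identity for (I - J/N) Phi Phi' (I - J/N)."""
    N = K.shape[0]
    J = np.ones((N, N))
    K_tilde = K - J @ K / N - K @ J / N + J @ K @ J / N ** 2
    evals, evecs = np.linalg.eigh(K_tilde)
    order = np.argsort(evals)[::-1][:n_components]
    # K_tilde = U D^2 U', so the scores Z = U D come from the top eigenpairs.
    return evecs[:, order] * np.sqrt(np.maximum(evals[order], 0.0))

rng = np.random.default_rng(8)
X = rng.normal(size=(60, 2))
Z = kernel_pca_scores(rbf_kernel(X, gamma=0.5), n_components=2)
print(Z.shape)
```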
3 (Non-OLS) Linear Predictors

There is more to say about the development of a linear predictor
$$\hat{f}(x) = x'\hat{\beta}$$
for an appropriate $\hat{\beta} \in \Re^p$ than what is said in books and courses on ordinary linear models (where ordinary least squares is used to fit the linear form to all $p$ input variables or to some subset of $M$ of them). We continue the basic notation of Section 2, where the (supervised learning) problem is prediction, and there is a vector of continuous outputs, $Y$, of interest.

3.1 Ridge Regression, the Lasso, and Some Other Shrinking Methods

An alternative to seeking to find a suitable level of complexity in a linear prediction rule through subset selection and least squares fitting of a linear form to the selected variables is to employ a shrinkage method based on a penalized version of least squares to choose a vector $\hat{\beta} \in \Re^p$ to employ in a linear prediction rule. Here we consider several such methods. The implementation of these methods is not equivariant to the scaling used to express the input variables $x_j$. So that we can talk about properties of the methods that are associated with a well-defined scaling, we assume here that the output variable has been centered (i.e. that $\langle Y, \mathbf{1} \rangle = 0$) and that the columns of $X$ have been standardized (and if originally $X$ had a constant column, it has been removed).
3.1.1 Ridge Regression

For a $\lambda > 0$ the ridge regression coefficient vector $\hat{\beta}^{ridge}_\lambda \in \Re^p$ is
$$\hat{\beta}^{ridge}_\lambda = \arg\min_{\beta \in \Re^p}\left[(Y - X\beta)'(Y - X\beta) + \lambda\beta'\beta\right] \tag{11}$$
Here $\lambda$ is a penalty/complexity parameter that controls how much $\hat{\beta}^{OLS}$ is shrunken towards $0$. The unconstrained minimization problem expressed in (11) has an equivalent constrained minimization description as
$$\hat{\beta}^{ridge}_t = \arg\min_{\beta \text{ with } \|\beta\|^2 \leq t} (Y - X\beta)'(Y - X\beta) \tag{12}$$
for an appropriate $t > 0$. (Corresponding to $\lambda$ used in (11) is $t = \left\|\hat{\beta}^{ridge}_\lambda\right\|^2$ used in (12). Conversely, corresponding to $t$ used in (12), one may use a value of $\lambda$ in (11) producing the same error sum of squares.)

The unconstrained form (11) calls upon one to minimize
$$(Y - X\beta)'(Y - X\beta) + \lambda\beta'\beta$$
and some vector calculus leads directly to
$$\hat{\beta}^{ridge}_\lambda = (X'X + \lambda I)^{-1}X'Y$$
So then, using the singular value decomposition of $X$ (with $\mathrm{rank} = r$),
$$\begin{aligned}
\hat{Y}^{ridge}_\lambda = X\hat{\beta}^{ridge}_\lambda &= UDV'\left(VDU'UDV' + \lambda I\right)^{-1}VDU'Y \\
&= UD\left(V'\left(VDU'UDV' + \lambda I\right)V\right)^{-1}DU'Y \\
&= UD\left(D^2 + \lambda I\right)^{-1}DU'Y \\
&= \sum_{j=1}^r\left(\frac{d_j^2}{d_j^2 + \lambda}\right)\langle Y, u_j \rangle u_j
\end{aligned} \tag{13}$$
Comparing to (7) and recognizing that
$$0 < \frac{d_{j+1}^2}{d_{j+1}^2 + \lambda} < \frac{d_j^2}{d_j^2 + \lambda} < 1$$
we see that the coefficients of the orthonormal basis vectors $u_j$ employed to get $\hat{Y}^{ridge}_\lambda$ are shrunken versions of the coefficients applied to get $\hat{Y}^{OLS}$. The most severe shrinking is enforced in the directions of the smallest principal components of $X$ (the $u_j$ least important in making up low rank approximations to $X$).
The function
$$\begin{aligned}
df(\lambda) &= \mathrm{tr}\left(X(X'X + \lambda I)^{-1}X'\right) \\
&= \mathrm{tr}\left(UD(D^2 + \lambda I)^{-1}DU'\right) \\
&= \mathrm{tr}\left(\sum_{j=1}^r\left(\frac{d_j^2}{d_j^2 + \lambda}\right)u_j u_j'\right) \\
&= \mathrm{tr}\left(\sum_{j=1}^r\left(\frac{d_j^2}{d_j^2 + \lambda}\right)u_j'u_j\right) \\
&= \sum_{j=1}^r\frac{d_j^2}{d_j^2 + \lambda}
\end{aligned}$$
is called the "effective degrees of freedom" associated with the ridge regression. In regard to this choice of nomenclature, note that if $\lambda = 0$ ridge regression is ordinary least squares and this is $r$, the usual degrees of freedom associated with projection onto $C(X)$, i.e. the trace of the projection matrix onto this column space.

As $\lambda \to \infty$, the effective degrees of freedom goes to $0$ and (the centered) $\hat{Y}^{ridge}_\lambda$ goes to $0$ (corresponding to a constant predictor). Notice also (for future reference) that since $\hat{Y}^{ridge}_\lambda = X\hat{\beta}^{ridge}_\lambda = X(X'X + \lambda I)^{-1}X'Y = MY$ for
$$M = X(X'X + \lambda I)^{-1}X'$$
if one assumes that
$$\mathrm{Cov}\,Y = \sigma^2 I$$
(conditioned on the $x_i$ in the training data, the outputs are uncorrelated and have constant variance $\sigma^2$) then
$$\text{effective degrees of freedom} = \mathrm{tr}(M) = \frac{1}{\sigma^2}\sum_{i=1}^N\mathrm{Cov}(\hat{y}_i, y_i) \tag{14}$$
This follows since $\hat{Y} = MY$ and $\mathrm{Cov}\,Y = \sigma^2 I$ imply that
$$\mathrm{Cov}\begin{pmatrix} \hat{Y} \\ Y \end{pmatrix} = \sigma^2\begin{pmatrix} M \\ I \end{pmatrix}\begin{pmatrix} M' & I \end{pmatrix} = \sigma^2\begin{pmatrix} MM' & M \\ M' & I \end{pmatrix}$$
and the terms $\mathrm{Cov}(\hat{y}_i, y_i)$ are the diagonal elements of the upper right block of this covariance matrix. This suggests that $\mathrm{tr}(M)$ is a plausible general definition of effective degrees of freedom for any linear fitting method $\hat{Y} = MY$, and that more generally, the last form in (14) might be used in situations where $\hat{Y}$ is other than a linear form in $Y$. Further (reasonably enough) the last form is a measure of how strongly the outputs in the training set can be expected to be related to their predictions.
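The SVD representation (13) and the effective degrees of freedom are easy to check numerically; a sketch with simulated standardized/centered data (all values hypothetical) follows.

```python
import numpy as np

rng = np.random.default_rng(9)
N, p, lam = 100, 6, 5.0
X = rng.normal(size=(N, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)     # standardized columns
Y = rng.normal(size=N)
Y = Y - Y.mean()                             # centered output

# Ridge coefficients: (X'X + lambda I)^{-1} X'Y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Display (13): fitted values as shrunken coefficients on the u_j.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrink = d ** 2 / (d ** 2 + lam)
Y_hat_svd = U @ (shrink * (U.T @ Y))
print(np.allclose(X @ beta_ridge, Y_hat_svd))

# Effective degrees of freedom df(lambda) = sum_j d_j^2 / (d_j^2 + lambda).
print(shrink.sum())
```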
3.1.2 The Lasso, Etc.

The "lasso" (least absolute selection and shrinkage operator) and some other relatives of ridge regression are the result of generalizing the optimization criteria (11) and (12) by replacing $\beta'\beta = \|\beta\|^2 = \sum_{j=1}^p\beta_j^2$ with $\sum_{j=1}^p|\beta_j|^q$ for a $q > 0$. That produces
$$\hat{\beta}^q_\lambda = \arg\min_{\beta \in \Re^p}\left\{(Y - X\beta)'(Y - X\beta) + \lambda\sum_{j=1}^p|\beta_j|^q\right\} \tag{15}$$
generalizing (11) and
$$\hat{\beta}^q_t = \arg\min_{\beta \text{ with } \sum_{j=1}^p|\beta_j|^q \leq t}(Y - X\beta)'(Y - X\beta) \tag{16}$$
generalizing (12). The so-called "lasso" is the $q = 1$ case of (15) or (16). That is, for $t > 0$
$$\hat{\beta}^{lasso}_t = \arg\min_{\beta \text{ with } \sum_{j=1}^p|\beta_j| \leq t}(Y - X\beta)'(Y - X\beta) \tag{17}$$
Because of the shape of the constraint region
$$\left\{\beta \in \Re^p \,\middle|\, \sum_{j=1}^p|\beta_j| \leq t\right\}$$
(in particular its sharp corners at the coordinate axes) some coordinates of $\hat{\beta}^{lasso}_t$ are often $0$, and the lasso automatically provides simultaneous shrinking of $\hat{\beta}^{OLS}$ toward $0$ and rational subset selection. (The same is true of cases of (16) with $q < 1$.)

In Table 3.4, HTF provide explicit formulas for fitted coefficients for the case of $X$ with orthonormal columns.
Method of Fitting             Fitted Coefficient for $x_j$
OLS                           $\hat{\beta}_j^{OLS}$
Best Subset (of Size $M$)     $\hat{\beta}_j^{OLS}\cdot I\left[\mathrm{rank}\left(|\hat{\beta}_j^{OLS}|\right) \leq M\right]$
Ridge Regression              $\hat{\beta}_j^{OLS}\big/(1 + \lambda)$
Lasso                         $\mathrm{sign}\left(\hat{\beta}_j^{OLS}\right)\left(|\hat{\beta}_j^{OLS}| - \lambda\right)_+$

Best subset regression provides a kind of "hard thresholding" of the least squares coefficients (setting all but the $M$ largest to $0$), ridge regression provides shrinking of all coefficients toward $0$, and the lasso provides a kind of "soft thresholding" of the coefficients.
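The three formulas in the table are contrasted in the short sketch below on a hypothetical vector of OLS coefficients (recall that the formulas are exact only for $X$ with orthonormal columns).

```python
import numpy as np

beta_ols = np.array([3.0, -1.5, 0.8, -0.3, 0.1])   # hypothetical OLS coefficients
lam, M = 1.0, 2

# Best subset of size M: keep the M largest |coefficients| (hard thresholding).
keep = np.argsort(-np.abs(beta_ols))[:M]
best_subset = np.where(np.isin(np.arange(len(beta_ols)), keep), beta_ols, 0.0)

# Ridge: shrink every coefficient by the factor 1/(1 + lambda).
ridge = beta_ols / (1.0 + lam)

# Lasso: soft thresholding sign(b)(|b| - lambda)_+.
lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

print(best_subset)
print(ridge)
print(lasso)
```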
There are a number of modifications of the ridge/lasso idea. One is the "elastic net" idea. This is a compromise between the ridge and lasso methods. For an $\alpha \in (0, 1)$ and some $t > 0$, this is defined by
$$\hat{\beta}^{elastic\ net}_{\alpha, t} = \arg\min_{\beta \text{ with } \sum_{j=1}^p\left((1-\alpha)|\beta_j| + \alpha\beta_j^2\right) \leq t}(Y - X\beta)'(Y - X\beta)$$
(The constraint is a compromise between the ridge and lasso constraints.) The constraint regions have "corners" like the lasso regions but are otherwise more rounded than the lasso regions. The equivalent unconstrained optimization specification of elastic net fitted coefficient vectors is, for $\lambda_1 > 0$ and $\lambda_2 > 0$,
$$\hat{\beta}^{elastic\ net}_{\lambda_1, \lambda_2} = \arg\min_{\beta \in \Re^p}\left[(Y - X\beta)'(Y - X\beta) + \lambda_1\sum_{j=1}^p|\beta_j| + \lambda_2\sum_{j=1}^p\beta_j^2\right]$$
Several sources (including a 2005 JRSSB paper of Zou and Hastie) suggest that a modification of the elastic net idea, namely
$$(1 + \lambda_2)\,\hat{\beta}^{elastic\ net}_{\lambda_1, \lambda_2}$$
performs better than the original version. (Note however, that when $X$ has orthonormal columns, it's possible to argue that
$$\hat{\beta}^{elastic\ net}_{\lambda_1, \lambda_2, j} = \frac{1}{1 + \lambda_2}\,\mathrm{sign}\left(\hat{\beta}_j^{OLS}\right)\left(|\hat{\beta}_j^{OLS}| - \frac{\lambda_1}{2}\right)_+$$
so that modification of the elastic net coefficient vector in this way simply reduces it to a corresponding lasso coefficient vector.)
Breiman proposed a different shrinkage methodology he called the nonnegative garrote that attempts to find "optimal" re-weightings of the elements of $\hat{\beta}^{OLS}$. That is, for $\lambda > 0$ Breiman considered the vector optimization problem defined by
$$c_\lambda = \arg\min_{c \in \Re^p \text{ with } c_j \geq 0,\ j=1,\ldots,p}\left[\left(Y - X\mathrm{diag}(c)\hat{\beta}^{OLS}\right)'\left(Y - X\mathrm{diag}(c)\hat{\beta}^{OLS}\right) + \lambda\sum_{j=1}^p c_j\right]$$
and the corresponding fitted coefficient vector
$$\hat{\beta}^{nng}_\lambda = \mathrm{diag}(c_\lambda)\hat{\beta}^{OLS} = \begin{pmatrix} c_{\lambda 1}\hat{\beta}_1^{OLS} \\ \vdots \\ c_{\lambda p}\hat{\beta}_p^{OLS} \end{pmatrix}$$
Again looking to the case where $X$ has orthonormal columns to get an explicit formula to compare to others, it turns out that in this context
$$\hat{\beta}^{nng}_{\lambda, j} = \hat{\beta}_j^{OLS}\left(1 - \frac{\lambda}{2\left(\hat{\beta}_j^{OLS}\right)^2}\right)_+$$
and the nonnegative garrote is again a combined shrinkage and variable selection method.
3.1.3 Least Angle Regression (LAR)

Another class of shrinkage methods is defined algorithmically, rather than directly algebraically or in terms of solutions to optimization problems. This includes LAR (least angle regression). A description of the whole set of LAR regression parameters $\hat{\beta}^{LAR}$ (for the case of $X$ with each column centered and with norm 1, and centered $Y$) follows. (This is some kind of amalgam of the descriptions of Izenman, CFZ, and the presentation in the 2003 paper "Least Angle Regression" by Efron, Hastie, Johnstone, and Tibshirani.)

Note that for $\hat{Y}$ a vector of predictions, the vector
$$\hat{c} \equiv X'\left(Y - \hat{Y}\right)$$
has elements that are proportional to the correlations between the columns of $X$ (the $x_j$) and the residual vector $R = Y - \hat{Y}$. We'll let
$$\hat{C} = \max_j|\hat{c}_j| \quad \text{and} \quad s_j = \mathrm{sign}(\hat{c}_j)$$
Notice also that if $X_l$ is some matrix made up of $l$ linearly independent columns of $X$, then if $b$ is an $l$-vector and $W = X_l b$, $b$ can be recovered from $W$ as $(X_l'X_l)^{-1}X_l'W$. (This latter means that if we define a path for $\hat{Y}$ vectors in $C(X)$ and know for each $\hat{Y}$ which linearly independent set of columns of $X$ is used in its creation, we can recover the corresponding path $\hat{\beta}$ takes through $\Re^p$.)

1. Begin with $\hat{Y}_0 = 0$, $\hat{\beta}_0 = 0$, and $R_0 = Y - \hat{Y} = Y$, and find
$$j_1 = \arg\max_j|\langle x_j, Y \rangle|$$
(the index of the predictor $x_j$ most strongly correlated with $y$) and add $j_1$ to an (initially empty) "active set" of indices, $\mathcal{A}$.
2. Move $\hat{Y}$ from $\hat{Y}_0$ in the direction of the projection of $Y$ onto the space spanned by $x_{j_1}$ (namely $\langle x_{j_1}, Y \rangle x_{j_1}$) until there is another index $j_2 \neq j_1$ with
$$|\hat{c}_{j_2}| = \left|\left\langle x_{j_2}, Y - \hat{Y}\right\rangle\right| = \left|\left\langle x_{j_1}, Y - \hat{Y}\right\rangle\right| = |\hat{c}_{j_1}|$$
At that point, call the current vector of predictions $\hat{Y}_1$ and the corresponding current parameter vector $\hat{\beta}_1$, and add index $j_2$ to the active set $\mathcal{A}$. As it turns out, for
$$\gamma_1 = \min_{j \neq j_1}{}^{+}\left\{\frac{\hat{C}_0 - \hat{c}_{0j}}{1 - \langle x_j, x_{j_1}\rangle},\ \frac{\hat{C}_0 + \hat{c}_{0j}}{1 + \langle x_j, x_{j_1}\rangle}\right\}$$
(where the "$+$" indicates that only positive values are included in the minimization) $\hat{Y}_1 = \hat{Y}_0 + s_{j_1}\gamma_1 x_{j_1} = s_{j_1}\gamma_1 x_{j_1}$ and $\hat{\beta}_1$ is a vector of all 0's except for $s_{j_1}\gamma_1$ in the $j_1$ position. Let $R_1 = Y - \hat{Y}_1$.
3. At stage $l$, with $\mathcal{A}$ of size $l$, $\hat{\beta}_{l-1}$ (with only $l-1$ non-zero entries), $\hat{Y}_{l-1}$, $R_{l-1} = Y - \hat{Y}_{l-1}$, and $\hat{c}_{l-1} = X'R_{l-1}$ in hand, move from $\hat{Y}_{l-1}$ toward the projection of $Y$ onto the sub-space of $\Re^N$ spanned by $\{x_{j_1}, \ldots, x_{j_l}\}$. This is (as it turns out) in the direction of a unit vector $u_l$ "making equal angles less than 90 degrees with all $x_j$ with $j \in \mathcal{A}$" until there is an index $j_{l+1} \notin \mathcal{A}$ with
$$|\hat{c}_{l-1, j_{l+1}}| = |\hat{c}_{l-1, j_1}|\ \left(= |\hat{c}_{l-1, j_2}| = \cdots = |\hat{c}_{l-1, j_l}|\right)$$
At that point, with $\hat{Y}_l$ the current vector of predictions, let $\hat{\beta}_l$ (with only $l$ non-zero entries) be the corresponding coefficient vector, take $R_l = Y - \hat{Y}_l$ and $\hat{c}_l = X'R_l$. It can be argued that with
$$\gamma_l = \min_{j \notin \mathcal{A}}{}^{+}\left\{\frac{\hat{C}_{l-1} - \hat{c}_{l-1,j}}{1 - \langle x_j, u_l\rangle},\ \frac{\hat{C}_{l-1} + \hat{c}_{l-1,j}}{1 + \langle x_j, u_l\rangle}\right\}$$
$\hat{Y}_l = \hat{Y}_{l-1} + s_{j_{l+1}}\gamma_l u_l$. Add the index $j_{l+1}$ to the set of active indices, $\mathcal{A}$, and repeat.
This continues until there are r = rank (X) indices in A, and at that point Y^
OLS
OLS
moves from Y^ r 1 to Y^
and b moves from b r 1 to b
(the version of an
OLS coe¢ cient vector with non-zero elements only in positions with indices in
A). This de…nes a piecewise linear path for Y^ (and therefore b ) that could,
for example, be parameterized by Y^ or Y Y^ .
There are several issues raised by the description above. For one, the standard exposition of this method seems to be that the direction vector ul is prescribed by letting W l = (sj1 xj1 ; : : : ; sjl xjl ) and taking
ul =
1
1
W l W 0l W l
It’s clear that W 0l ul = W l W 0l W l
has the same inner product with ul .
1
1
W l W 0l W l
1
1
1
1
1, so that each of sj1 xj1 ; : : : ; sjl xj1
What is not immediately clear (but is
28
argued in Efron, Hastie, Johnstone, and Tibshirani) is why one knows that this
prescription agrees with a prescription of a unit vector giving the direction from
Y^ l 1 to the projection of Y onto the sub-space of <N spanned by fxj1 ; : : : ; xjl g,
namely (for P l the projection matrix onto that subspace)
1
P lY
Y^ l
P lY
Y^ l
1
1
Further, the arguments that establish that j^
cl 1;j1 j = j^
cl 1;j2 j =
=
j^
cl 1;jl j are not so obvious, nor are those that show that l has the form in
3. above. And …nally, HTF actually state their LAR algorithm directly in
terms of a path for b , saying at stage l one moves from b l 1 in the direction
of a vector with all 0’s except at those indices in A where there are the joint
least squares coe¢ cients based on the predictor columns fxj1 ; : : : ; xjl g. The
correspondence between the two points of view is probably correct, but is again
not absolutely obvious.
At any rate, the LAR algorithm traces out a path in $\Re^p$ from $0$ to $\beta^{\mathrm{OLS}}$. One might think of the point one has reached along that path (perhaps parameterized by $\|\hat{Y}\|$) as a complexity parameter governing how flexible a fit this algorithm has allowed, and be in the business of choosing it (by cross-validation or some other method) in exactly the same way one might, for example, choose a ridge parameter.

What is not at all obvious, but true, is that a very slight modification of this LAR algorithm produces the whole set of lasso coefficients (17) as its path. One simply needs to enforce the requirement that if a non-zero coefficient hits 0, its index is removed from the active set and a new direction of movement is set based on one less input variable. At any point along the modified LAR path, one can compute $t = \sum_{j=1}^{p}|\beta_j|$, and think of the modified-LAR path as parameterized by $t$. (While it's not completely obvious, this turns out to be monotone non-decreasing in "progress along the path," or $\|\hat{Y}\|$.)

A useful graphical representation of the lasso path is one in which all coefficients $\hat{\beta}^{\mathrm{lasso}}_{tj}$ are plotted against $t$ on the same set of axes. Something similar is often done for the LAR coefficients (where the plotting is against some measure of progress along the path defined by the algorithm).
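As an aside not in the original outline, the following minimal sketch shows how such LAR and lasso coefficient paths might be computed and plotted in practice. It assumes scikit-learn (whose lars_path implements the Efron, Hastie, Johnstone, and Tibshirani algorithm) and matplotlib are available; the simulated data are purely illustrative.

```python
# A minimal sketch (not part of the outline) of computing LAR and lasso paths.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import lars_path

# simulated training data; standardize the columns of X and center Y as assumed above
X, Y = make_regression(n_samples=100, n_features=8, noise=5.0, random_state=0)
X = (X - X.mean(axis=0)) / X.std(axis=0)
Y = Y - Y.mean()

# method="lar" gives the LAR path; method="lasso" gives the modified-LAR (lasso) path
_, _, coefs_lar = lars_path(X, Y, method="lar")
_, _, coefs_lasso = lars_path(X, Y, method="lasso")

# parameterize the lasso path by t = sum_j |beta_j| and plot all coefficients against t
t = np.abs(coefs_lasso).sum(axis=0)
for j in range(X.shape[1]):
    plt.plot(t, coefs_lasso[j, :], marker=".")
plt.xlabel(r"$t=\sum_j|\beta_j|$")
plt.ylabel("coefficient value")
plt.title("lasso path traced by the modified LAR algorithm")
plt.show()
```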
3.2 Two Methods With Derived Input Variables

Another possible approach to the problem of finding an appropriate level of complexity in a fitted linear prediction rule is to consider regression on some number $M < p$ of predictors derived from the original inputs $x_j$. Two such methods discussed by HTF are the methods of Principal Components Regression and Partial Least Squares. Here we continue to assume that the columns of $X$ have been standardized and $Y$ has been centered.
3.2.1 Principal Components Regression

The idea here is to replace the $p$ columns of predictors in $X$ with the first few ($M$) principal components of $X$ (from the singular value decomposition of $X$)
$$z_j = Xv_j = d_ju_j$$
Correspondingly, the vector of fitted predictions for the training data is
$$\hat{Y}^{\mathrm{PCR}} = \sum_{j=1}^{M}\frac{\langle Y, z_j\rangle}{\langle z_j, z_j\rangle}z_j = \sum_{j=1}^{M}\langle Y, u_j\rangle u_j \qquad (18)$$
Comparing this to (7) and (13) we see that ridge regression shrinks the coefficients of the principal components $u_j$ according to their importance in making up $X$, while principal components regression "zeros out" those least important in making up $X$.

Notice too, that $\hat{Y}^{\mathrm{PCR}}$ can be written in terms of the original inputs as
$$\hat{Y}^{\mathrm{PCR}} = \sum_{j=1}^{M}\langle Y, u_j\rangle\frac{1}{d_j}Xv_j = X\left(\sum_{j=1}^{M}\frac{1}{d_j}\langle Y, u_j\rangle v_j\right) = X\left(\sum_{j=1}^{M}\frac{1}{d_j^2}\langle Y, Xv_j\rangle v_j\right)$$
so that
$$\hat{\beta}^{\mathrm{PCR}} = \sum_{j=1}^{M}\frac{1}{d_j^2}\langle Y, Xv_j\rangle v_j \qquad (19)$$
where (obviously) the sum above has only $M$ terms. $\hat{\beta}^{\mathrm{OLS}}$ is the $M = \operatorname{rank}(X)$ version of (19). As the $v_j$ are orthonormal, it is therefore clear that
$$\left\|\hat{\beta}^{\mathrm{PCR}}\right\| \leq \left\|\hat{\beta}^{\mathrm{OLS}}\right\|$$
So principal components regression shrinks both $\hat{Y}^{\mathrm{OLS}}$ toward $0$ in $\Re^N$ and $\hat{\beta}^{\mathrm{OLS}}$ toward $0$ in $\Re^p$.
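A small numpy illustration of (18)-(19) (an addition to the outline, not code from HTF): the PCR coefficient vector computed directly from the singular value decomposition, with $X$ standardized and $Y$ centered as above.

```python
# A minimal numpy sketch of principal components regression, per (18)-(19).
import numpy as np

def pcr_coefficients(X, Y, M):
    """Return beta_hat^PCR built from the first M singular triples of X."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(d) V'
    beta = np.zeros(X.shape[1])
    for j in range(M):
        v_j = Vt[j, :]
        beta += (Y @ (X @ v_j)) / d[j] ** 2 * v_j      # (1/d_j^2) <Y, X v_j> v_j
    return beta

# usage: Yhat_PCR = X @ pcr_coefficients(X, Y, M=3); with M = rank(X) this is OLS
```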
3.2.2 Partial Least Squares Regression

The shrinking methods mentioned thus far have taken no account of $Y$ in determining directions or amounts of shrinkage. Partial least squares specifically employs $Y$. In what follows, we continue to suppose that the columns of $X$ have been standardized and that $Y$ has been centered.

The logic of partial least squares is this. Suppose that
$$z_1 = \sum_{j=1}^{p}\langle Y, x_j\rangle x_j = XX'Y$$
It is possible to argue that for $w_1 = X'Y/\|X'Y\|$, $Xw_1 = z_1/\|X'Y\|$ is a linear combination of the columns of $X$ maximizing
$$|\langle Y, Xw\rangle|$$
(which is essentially the absolute sample covariance between the variables $y$ and $x'w$) subject to the constraint that $\|w\| = 1$.³ This follows because
$$\langle Y, Xw\rangle^2 = w'X'YY'Xw$$
and a maximizer of this quadratic form subject to the constraint is the eigenvector of $X'YY'X$ corresponding to its single non-zero eigenvalue. It's then easy to verify that $w_1$ is such an eigenvector corresponding to the non-zero eigenvalue $Y'XX'Y$.

Then define $X^1$ by orthogonalizing the columns of $X$ with respect to $z_1$. That is, define the $j$th column of $X^1$ by
$$x^1_j = x_j - \frac{\langle x_j, z_1\rangle}{\langle z_1, z_1\rangle}z_1$$
and take
$$z_2 = \sum_{j=1}^{p}\langle Y, x^1_j\rangle x^1_j = X^1X^{1\prime}Y$$
For $w_2 = X^{1\prime}Y/\|X^{1\prime}Y\|$, $X^1w_2 = z_2/\|X^{1\prime}Y\|$ is the linear combination of the columns of $X^1$ maximizing $|\langle Y, X^1w\rangle|$ subject to the constraint that $\|w\| = 1$.

³Note that upon replacing $|\langle Y, Xw\rangle|$ with $|\langle Xw, Xw\rangle|$ one has the kind of optimization problem solved by the first principal component of $X$.

Then for $l > 1$, define $X^l$ by orthogonalizing the columns of $X^{l-1}$ with respect to $z_l$. That is, define the $j$th column of $X^l$ by
$$x^l_j = x^{l-1}_j - \frac{\langle x^{l-1}_j, z_l\rangle}{\langle z_l, z_l\rangle}z_l$$
and let
$$z_{l+1} = \sum_{j=1}^{p}\langle Y, x^l_j\rangle x^l_j = X^lX^{l\prime}Y$$
Partial least squares regression uses the first $M$ of these variables $z_j$ as input variables.

The PLS predictors $z_j$ are orthogonal by construction. Using the first $M$ of these as regressors, one has the vector of fitted output values
$$\hat{Y}^{\mathrm{PLS}} = \sum_{j=1}^{M}\frac{\langle Y, z_j\rangle}{\langle z_j, z_j\rangle}z_j$$
Since the PLS predictors are (albeit recursively-computed, data-dependent) linear combinations of columns of $X$, it is possible to find a $p$-vector $\hat{\beta}^{\mathrm{PLS}}_M$ (namely $(X'X)^{-1}X'\hat{Y}^{\mathrm{PLS}}$) such that
$$\hat{Y}^{\mathrm{PLS}} = X\hat{\beta}^{\mathrm{PLS}}_M$$
and thus produce the corresponding linear prediction rule
$$\hat{f}(x) = x'\hat{\beta}^{\mathrm{PLS}}_M \qquad (20)$$
It seems like in (20) the number of components, $M$, should function as a complexity parameter. But then again there is the following. When the $x_j$ are orthonormal, it's fairly easy to see that $z_1$ is a multiple of $\hat{Y}^{\mathrm{OLS}}$, that $\hat{\beta}^{\mathrm{PLS}}_1 = \hat{\beta}^{\mathrm{PLS}}_2 = \cdots = \hat{\beta}^{\mathrm{PLS}}_p = \hat{\beta}^{\mathrm{OLS}}$, and that all steps of partial least squares after the first are simply providing a basis for the orthogonal complement of the 1-dimensional subspace of $C(X)$ generated by $\hat{Y}^{\mathrm{OLS}}$ (without improving fitting at all). That is, here changing $M$ doesn't change flexibility of the fit at all. (Presumably, when the $x_j$ are nearly orthogonal, something similar happens.)

PLS, PCR, and OLS Partial least squares is a kind of compromise between principal components regression and ordinary least squares. To see this, note that maximizing
$$|\langle Y, Xw\rangle|$$
subject to the constraint that $\|w\| = 1$ is equivalent to maximizing the sample covariance between $Y$ and $Xw$, i.e.
$$\left(\begin{array}{c}\text{sample standard}\\ \text{deviation of }y\end{array}\right)\left(\begin{array}{c}\text{sample standard}\\ \text{deviation of }x'w\end{array}\right)\left(\begin{array}{c}\text{sample correlation}\\ \text{between }y\text{ and }x'w\end{array}\right)$$
or equivalently
$$\left(\begin{array}{c}\text{sample variance}\\ \text{of }x'w\end{array}\right)\left(\begin{array}{c}\text{sample correlation}\\ \text{between }y\text{ and }x'w\end{array}\right)^2 \qquad (21)$$
subject to the constraint. Now if only the first term (the sample variance of $x'w$) were involved in (21), a first principal component direction would be an optimizing $w_1$, and $z_1 = \|X'Y\|\,Xw_1$ a multiple of the first principal component of $X$. On the other hand, if only the second term were involved, $\hat{\beta}^{\mathrm{OLS}}/\|\hat{\beta}^{\mathrm{OLS}}\|$ would be an optimizing $w_1$, and $z_1 = \|X'Y\|\,\hat{Y}^{\mathrm{OLS}}/\|\hat{\beta}^{\mathrm{OLS}}\|$ a multiple of the vector of ordinary least squares fitted values. The use of the product of two terms can be expected to produce a compromise between these two.

Note further that this logic applied at later steps in the PLS algorithm should then produce for $z_l$ some sort of compromise between an $l$th principal component of $X$ and a vector of suitably constrained least squares fitted values.
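The recursion just described is easy to carry out directly. The following is a minimal numpy sketch (an addition to the outline, with standardization of $X$ and centering of $Y$ assumed done beforehand) of building $z_1,\ldots,z_M$ by repeated orthogonalization and forming $\hat{Y}^{\mathrm{PLS}}$.

```python
# A minimal numpy sketch of the PLS construction described above.
import numpy as np

def pls_fit(X, Y, M):
    """Return the N x M matrix Z of PLS components and the fitted vector Yhat^PLS."""
    Xl = X.copy()                              # X^0 = X (standardized); Y centered
    Z = np.empty((X.shape[0], M))
    for l in range(M):
        z = Xl @ (Xl.T @ Y)                    # z_{l+1} = X^l X^l' Y
        Z[:, l] = z
        # orthogonalize each column of X^l with respect to z
        Xl = Xl - np.outer(z, Xl.T @ z) / (z @ z)
    # fitted values: sum_j <Y, z_j>/<z_j, z_j> z_j  (the z_j are orthogonal)
    Yhat = sum((Y @ Z[:, j]) / (Z[:, j] @ Z[:, j]) * Z[:, j] for j in range(M))
    return Z, Yhat
```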
4 Linear Prediction Using Basis Functions

A way of moving beyond prediction rules that are functions of a linear form in $x$, i.e. depend upon $x$ through $x'\hat{\beta}$, is to consider some set of (basis) functions $\{h_m\}$ and predictors of the form, or depending upon the form,
$$\hat{f}(x) = \sum_{m=1}^{p}\hat{\beta}_mh_m(x) = h(x)'\hat{\beta} \qquad (22)$$
(for $h(x) = (h_1(x),\ldots,h_p(x))'$). We next consider some flexible methods employing this idea. Notice that fitting of form (22) can be done using any of the methods just discussed based on the $N\times p$ matrix of inputs
$$X = (h_j(x_i)) = \begin{pmatrix} h(x_1)'\\ h(x_2)'\\ \vdots\\ h(x_N)'\end{pmatrix}$$
($i$ indexing rows and $j$ indexing columns).
4.1 $p = 1$ Wavelet Bases

Consider first the case of a one-dimensional input variable $x$, and in fact here suppose that $x$ takes values in $[0,1]$. One might consider a set of basis functions for use in the form (22) that is big enough and rich enough to approximate essentially any function on $[0,1]$. In particular, various orthonormal bases for the square integrable functions on this interval (the space of functions $L^2[0,1]$) come to mind. One might, for example, consider using some number of functions from the Fourier basis for $L^2[0,1]$
$$\left\{\sqrt{2}\sin(j2\pi x)\right\}_{j=1}^{\infty}\cup\left\{\sqrt{2}\cos(j2\pi x)\right\}_{j=1}^{\infty}\cup\{1\}$$
For example, using $M \leq N/2$ sin-cos pairs and the constant, one could consider fitting the forms
$$f(x) = \beta_0 + \sum_{m=1}^{M}\beta_{1m}\sin(m2\pi x) + \sum_{m=1}^{M}\beta_{2m}\cos(m2\pi x) \qquad (23)$$
If one has training $x_i$ on an appropriate regular grid, the use of form (23) leads to orthogonality in the $N\times(2M+1)$ matrix of values of the basis functions $X$ and simple/fast calculations.

Unless, however, one believes that $\mathrm{E}[y|x=u]$ is periodic in $u$, form (23) has its serious limitations. In particular, unless $M$ is very very large, a trigonometric series like (23) will typically provide a poor approximation for a function that varies at different scales on different parts of $[0,1]$, and in any case, the coefficients necessary to provide such localized variation at different scales have no obvious simple interpretations/connections to the irregular pattern of variation being described. So-called "wavelet bases" are much more useful in providing parsimonious and interpretable approximations to such functions. The simplest wavelet basis for $L^2[0,1]$ is the Haar basis that we proceed to describe.

Define the so-called Haar "father" wavelet
$$\varphi(x) = I[0 < x \leq 1]$$
and the so-called Haar "mother" wavelet
$$\psi(x) = \varphi(2x) - \varphi(2x-1) = I\left[0 < x \leq \tfrac{1}{2}\right] - I\left[\tfrac{1}{2} < x \leq 1\right]$$
Linear combinations of these functions provide all elements of $L^2[0,1]$ that are constant on $\left(0,\tfrac{1}{2}\right]$ and on $\left(\tfrac{1}{2},1\right]$. Write
$$\Psi_0 = \{\varphi, \psi\}$$
Next, define
$$\psi_{1,0}(x) = \sqrt{2}\left(I\left[0 < x \leq \tfrac{1}{4}\right] - I\left[\tfrac{1}{4} < x \leq \tfrac{1}{2}\right]\right) \quad\text{and}\quad \psi_{1,1}(x) = \sqrt{2}\left(I\left[\tfrac{1}{2} < x \leq \tfrac{3}{4}\right] - I\left[\tfrac{3}{4} < x \leq 1\right]\right)$$
and let
$$\Psi_1 = \{\psi_{1,0}, \psi_{1,1}\}$$
Using the set of functions $\Psi_0\cup\Psi_1$ one can build (as linear combinations) all elements of $L^2[0,1]$ that are constant on $\left(0,\tfrac{1}{4}\right]$ and on $\left(\tfrac{1}{4},\tfrac{1}{2}\right]$ and on $\left(\tfrac{1}{2},\tfrac{3}{4}\right]$ and on $\left(\tfrac{3}{4},1\right]$.

The story then goes on as one should expect. One defines
$$\psi_{2,0}(x) = 2\left(I\left[0 < x \leq \tfrac{1}{8}\right] - I\left[\tfrac{1}{8} < x \leq \tfrac{1}{4}\right]\right)\quad\text{and}$$
$$\psi_{2,1}(x) = 2\left(I\left[\tfrac{1}{4} < x \leq \tfrac{3}{8}\right] - I\left[\tfrac{3}{8} < x \leq \tfrac{1}{2}\right]\right)\quad\text{and}$$
$$\psi_{2,2}(x) = 2\left(I\left[\tfrac{1}{2} < x \leq \tfrac{5}{8}\right] - I\left[\tfrac{5}{8} < x \leq \tfrac{3}{4}\right]\right)\quad\text{and}$$
$$\psi_{2,3}(x) = 2\left(I\left[\tfrac{3}{4} < x \leq \tfrac{7}{8}\right] - I\left[\tfrac{7}{8} < x \leq 1\right]\right)$$
and lets
$$\Psi_2 = \{\psi_{2,0}, \psi_{2,1}, \psi_{2,2}, \psi_{2,3}\}$$
In general,
$$\psi_{m,j}(x) = \sqrt{2^m}\,\psi\left(2^mx - j\right)\quad\text{for }j = 0, 1, 2, \ldots, 2^m - 1$$
and
$$\Psi_m = \{\psi_{m,0}, \psi_{m,1}, \ldots, \psi_{m,2^m-1}\}$$
The Haar basis of $L^2[0,1]$ is then
$$\bigcup_{m=0}^{\infty}\Psi_m$$

Then, one might entertain use of the Haar basis functions through order $M$ in constructing a form
$$f(x) = \beta_0 + \sum_{m=0}^{M}\sum_{j=0}^{2^m-1}\beta_{mj}\psi_{m,j}(x) \qquad (24)$$
(with the understanding that $\psi_{0,0} = \psi$), a form that in general allows building of functions that are constant on consecutive intervals of length $1/2^{M+1}$. This form can be fit by any of the various regression methods (especially involving thresholding/selection, as a typically very large number, $2^{M+1}$, of basis functions is employed in (24)). (See HTF Section 5.9.2 for some discussion of using the lasso with wavelets.) Large absolute values of coefficients $\beta_{mj}$ encode scales at which important variation occurs (in the value of the index $m$) and locations in $[0,1]$ where that variation occurs (in the value $j/2^m$). Where (perhaps after model selection/thresholding) only a relatively few fitted coefficients are important, the corresponding scales and locations provide an informative and compact summary of the fit. A nice visual summary of the results of the fit can be made by plotting for each $m$ (plots arranged vertically, from $M$ through 0, aligned and to the same scale) spikes of length $|\beta_{mj}|$ pointed in the direction of $\operatorname{sign}(\beta_{mj})$ along an "$x$" axis at positions (say) $(j/2^m) + 1/2^{m+1}$.

In special situations where $N = 2^K$ and
$$x_i = \frac{i}{2^K}\quad\text{for }i = 1, 2, \ldots, 2^K$$
and one uses the Haar basis functions through order $K - 1$, the fitting of (24) is computationally clean, since the vectors
$$\begin{pmatrix}\psi_{m,j}(x_1)\\ \vdots\\ \psi_{m,j}(x_N)\end{pmatrix}$$
(together with the column vector of 1's) are orthogonal. (So, upon proper normalization, i.e. division by $\sqrt{N} = 2^{K/2}$, they form an orthonormal basis for $\Re^N$.)

The Haar wavelet basis functions are easy to describe and understand. But they are discontinuous, and from some points of view that is unappealing. Other sets of wavelet basis functions have been developed that are smooth. The construction begins with a smooth "mother wavelet" in place of the step function used above. HTF make some discussion of the smooth "symmlet" wavelet basis at the end of their Chapter 5.
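A small numpy sketch (an addition to the outline) of the computationally clean special case just described: on the grid $x_i = i/2^K$ the sampled Haar basis vectors are orthogonal, so the coefficients of (24) are simple per-column projections. The sine response below is made up purely for illustration.

```python
# A minimal numpy sketch of fitting the Haar expansion (24) on the grid x_i = i/2^K.
import numpy as np

def haar_psi(m, j, x):
    """psi_{m,j}(x) = 2^{m/2} psi(2^m x - j) for the Haar mother wavelet psi."""
    u = 2.0 ** m * x - j
    return 2.0 ** (m / 2.0) * (((u > 0) & (u <= 0.5)).astype(float)
                               - ((u > 0.5) & (u <= 1.0)).astype(float))

K = 5
N = 2 ** K
x = np.arange(1, N + 1) / N                               # x_i = i / 2^K
y = np.sin(6 * np.pi * x) + 0.3 * np.random.randn(N)      # made-up training responses

# design matrix: constant plus psi_{m,j} for m = 0,...,K-1 and j = 0,...,2^m - 1
cols = [np.ones(N)] + [haar_psi(m, j, x) for m in range(K) for j in range(2 ** m)]
H = np.column_stack(cols)                                 # N x 2^K, orthogonal columns
beta = H.T @ y / np.sum(H ** 2, axis=0)                   # projections; no matrix inversion needed
yhat = H @ beta
```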
4.2 $p = 1$ Piecewise Polynomials and Regression Splines

Continue consideration of the case of a one-dimensional input variable $x$, and now $K$ "knots"
$$\xi_1 < \xi_2 < \cdots < \xi_K$$
and forms for $f(x)$ that are

1. polynomials of order $M$ (or less) on all intervals $(\xi_{j-1}, \xi_j)$, and (potentially, at least)

2. have derivatives of some specified order at the knots, and (potentially, at least)

3. are linear outside $(\xi_1, \xi_K)$.

If we let $I_1(x) = I[x < \xi_1]$, for $j = 2, \ldots, K$ let $I_j(x) = I[\xi_{j-1} \leq x < \xi_j]$, and define $I_{K+1}(x) = I[\xi_K \leq x]$, one can have 1. in the list above using basis functions
$$\begin{array}{c}I_1(x), I_2(x), \ldots, I_{K+1}(x)\\ xI_1(x), xI_2(x), \ldots, xI_{K+1}(x)\\ x^2I_1(x), x^2I_2(x), \ldots, x^2I_{K+1}(x)\\ \vdots\\ x^MI_1(x), x^MI_2(x), \ldots, x^MI_{K+1}(x)\end{array}$$
Further, one can enforce continuity and differentiability (at the knots) conditions on a form $f(x) = \sum_{m=1}^{(M+1)(K+1)}\beta_mh_m(x)$ by enforcing some linear relations between appropriate ones of the $\beta_m$. While this is conceptually simple, it is messy. It is much cleaner to simply begin with a set of basis functions that are tailored to have the desired continuity/differentiability properties.

A set of $M + 1 + K$ basis functions for piecewise polynomials of degree $M$ with derivatives of order $M - 1$ at all knots is easily seen to be
$$1, x, x^2, \ldots, x^M, (x-\xi_1)_+^M, (x-\xi_2)_+^M, \ldots, (x-\xi_K)_+^M$$
(since the value of and first $M - 1$ derivatives of $(x-\xi_j)_+^M$ at $\xi_j$ are all 0). The choice of $M = 3$ is fairly standard.

Since extrapolation with polynomials typically gets worse with order, it is common to impose a restriction that outside $(\xi_1, \xi_K)$ a form $f(x)$ be linear. For the case of $M = 3$ this can be accomplished by beginning with basis functions $1, x, (x-\xi_1)_+^3, (x-\xi_2)_+^3, \ldots, (x-\xi_K)_+^3$ and imposing restrictions necessary to force 2nd and 3rd derivatives to the right of $\xi_K$ to be 0. Notice that (considering $x > \xi_K$)
$$\frac{d^2}{dx^2}\left(\beta_0 + \beta_1x + \sum_{j=1}^{K}\theta_j(x-\xi_j)_+^3\right) = 6\sum_{j=1}^{K}\theta_j(x-\xi_j) \qquad (25)$$
and
$$\frac{d^3}{dx^3}\left(\beta_0 + \beta_1x + \sum_{j=1}^{K}\theta_j(x-\xi_j)_+^3\right) = 6\sum_{j=1}^{K}\theta_j \qquad (26)$$
So, linearity for large $x$ requires (from (26)) that $\sum_{j=1}^{K}\theta_j = 0$. Further, substituting this into (25) means that linearity also requires that $\sum_{j=1}^{K}\theta_j\xi_j = 0$. Using the first of these to conclude that $\theta_K = -\sum_{j=1}^{K-1}\theta_j$ and substituting into the second yields
$$\theta_{K-1} = -\sum_{j=1}^{K-2}\theta_j\frac{\xi_K - \xi_j}{\xi_K - \xi_{K-1}}$$
and then
$$\theta_K = \sum_{j=1}^{K-2}\theta_j\frac{\xi_K - \xi_j}{\xi_K - \xi_{K-1}} - \sum_{j=1}^{K-2}\theta_j$$
These then suggest the set of basis functions consisting of $1$, $x$ and, for $j = 1, 2, \ldots, K-2$,
$$(x-\xi_j)_+^3 - \frac{\xi_K - \xi_j}{\xi_K - \xi_{K-1}}(x-\xi_{K-1})_+^3 + \left(\frac{\xi_K - \xi_j}{\xi_K - \xi_{K-1}} - 1\right)(x-\xi_K)_+^3$$
(These are essentially the basis functions that HTF call their $N_j$.) Their use produces so-called "natural" (linear outside $(\xi_1, \xi_K)$) cubic regression splines.

There are other (harder to motivate, but in the end more pleasing and computationally more attractive) sets of basis functions for natural polynomial splines. See the B-spline material at the end of HTF Chapter 5.
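The truncated-power construction just derived is straightforward to code. The following is a minimal numpy sketch (an addition to the outline); the choice of knots at quantiles in the usage comment is just one common convention, not a prescription from the text.

```python
# A minimal numpy sketch of the natural cubic spline basis built from truncated cubics.
import numpy as np

def truncated_cube(x, knot):
    return np.clip(x - knot, 0.0, None) ** 3

def natural_spline_basis(x, knots):
    """Columns: 1, x, and for j = 1,...,K-2 the constrained combinations derived above."""
    cols = [np.ones_like(x), x]
    for j in range(len(knots) - 2):
        a = (knots[-1] - knots[j]) / (knots[-1] - knots[-2])   # (xi_K - xi_j)/(xi_K - xi_{K-1})
        cols.append(truncated_cube(x, knots[j])
                    - a * truncated_cube(x, knots[-2])
                    + (a - 1.0) * truncated_cube(x, knots[-1]))
    return np.column_stack(cols)

# usage: B = natural_spline_basis(x, np.quantile(x, np.linspace(0.05, 0.95, 6)))
#        beta = np.linalg.lstsq(B, y, rcond=None)[0]; fitted values are B @ beta
```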
4.3 Basis Functions and $p$-Dimensional Inputs

4.3.1 Multi-Dimensional Regression Splines (Tensor Bases)

If $p = 2$ and the vector of inputs, $x$, takes values in $\Re^2$, one might proceed as follows. If $\{h_{11}, h_{12}, \ldots, h_{1M_1}\}$ is a set of spline basis functions based on $x_1$ and $\{h_{21}, h_{22}, \ldots, h_{2M_2}\}$ is a set of spline basis functions based on $x_2$, one might consider the set of $M_1M_2$ basis functions based on $x$ defined by
$$g_{jk}(x) = h_{1j}(x_1)h_{2k}(x_2)$$
and corresponding forms for regression splines
$$f(x) = \sum_{j,k}\beta_{jk}g_{jk}(x)$$
The biggest problem with this potential method is the explosion in the size of a tensor product basis as $p$ increases. For example, using $K$ knots for cubic regression splines in each of $p$ dimensions produces $(4+K)^p$ basis functions for the $p$-dimensional problem. Some kind of forward selection algorithm or shrinking of coefficients will be needed to produce any kind of workable fit with such large numbers of basis functions.
4.3.2 MARS (Multivariate Adaptive Regression Splines)

This is a high-dimensional regression-like methodology based on the use of "hockey-stick functions" and their products as data-dependent "basis functions." That is, with input space $\Re^p$ we consider data-dependent basis functions built on the $Np$ pairs of functions
$$h_{ij1}(x) = (x_j - x_{ij})_+ \quad\text{and}\quad h_{ij2}(x) = (x_{ij} - x_j)_+ \qquad (27)$$
($x_{ij}$ is the $j$th coordinate of the $i$th input training vector, and both $h_{ij1}(x)$ and $h_{ij2}(x)$ depend on $x$ only through the $j$th coordinate of $x$). MARS builds predictors sequentially, making use of these "reflected pairs" roughly as follows.

1. Identify a pair (27) so that
$$\beta_0 + \beta_{11}h_{ij1}(x) + \beta_{12}h_{ij2}(x)$$
has the best SSE possible. Call the selected functions
$$g_{11} = h_{ij1} \quad\text{and}\quad g_{12} = h_{ij2}$$
and set
$$\hat{f}_1(x) = \hat{\beta}_0 + \hat{\beta}_{11}g_{11}(x) + \hat{\beta}_{12}g_{12}(x)$$

2. At stage $l$ of the predictor-building process, with predictor
$$\hat{f}_{l-1}(x) = \hat{\beta}_0 + \sum_{m=1}^{l-1}\left(\hat{\beta}_{m1}g_{m1}(x) + \hat{\beta}_{m2}g_{m2}(x)\right)$$
in hand, consider for addition to the model pairs of basis functions that are either of the basic form (27) or of the form
$$h_{ij1}(x)g_{m1}(x) \quad\text{and}\quad h_{ij2}(x)g_{m1}(x)$$
or of the form
$$h_{ij1}(x)g_{m2}(x) \quad\text{and}\quad h_{ij2}(x)g_{m2}(x)$$
for some $m < l$, subject to the constraint that no $x_j$ appears in any candidate product more than once (maintaining the piece-wise linearity of sections of the predictor). Additionally, one may decide to put an upper limit on the order of the products considered for inclusion in the predictor. The best candidate pair in terms of reducing SSE gets called, say, $g_{l1}$ and $g_{l2}$, and one sets
$$\hat{f}_l(x) = \hat{\beta}_0 + \sum_{m=1}^{l}\left(\hat{\beta}_{m1}g_{m1}(x) + \hat{\beta}_{m2}g_{m2}(x)\right)$$

One might pick the complexity parameter $l$ by cross-validation, but the standard implementation of MARS apparently uses instead a kind of generalized cross-validation error
$$GCV(l) = \frac{\sum_{i=1}^{N}\left(y_i - \hat{f}_l(x_i)\right)^2}{\left(1 - \frac{M(l)}{N}\right)^2}$$
where $M(l)$ is some kind of degrees of freedom figure. One must take account of both the fitting of the coefficients and the fact that knots (values $x_{ij}$) have been chosen. The HTF recommendation is to use
$$M(l) = 2l + (2\text{ or }3)\cdot(\text{the number of different knots chosen})$$
(where presumably the knot count refers to different $x_{ij}$ appearing in at least one $g_{m1}(x)$ or $g_{m2}(x)$).
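To make the forward pass concrete, here is a deliberately simplified numpy sketch (an addition to the outline). It searches only over the basic reflected pairs (27), omitting the product terms and the GCV-based choice of $l$ that a real MARS implementation would include.

```python
# A much-simplified sketch of the MARS forward pass: greedy search over reflected pairs only.
import numpy as np

def reflected_pair(X, i, j):
    xj, knot = X[:, j], X[i, j]
    return np.clip(xj - knot, 0.0, None), np.clip(knot - xj, 0.0, None)

def mars_forward(X, y, n_pairs):
    N, p = X.shape
    cols = [np.ones(N)]                       # current design: intercept only
    for _ in range(n_pairs):
        best = None
        for i in range(N):                    # candidate knots are the training values x_{ij}
            for j in range(p):
                h1, h2 = reflected_pair(X, i, j)
                B = np.column_stack(cols + [h1, h2])
                beta, *_ = np.linalg.lstsq(B, y, rcond=None)
                sse = np.sum((y - B @ beta) ** 2)
                if best is None or sse < best[0]:
                    best = (sse, h1, h2)
        cols += [best[1], best[2]]            # add the best reflected pair to the model
    B = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(B, y, rcond=None)
    return B, beta
```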
5 Smoothing Splines

5.1 $p = 1$ Smoothing Splines

A way of avoiding the direct selection of knots for a regression spline is to instead, for a smoothing parameter $\lambda > 0$, consider the problem of finding (for $a \leq \min\{x_i\}$ and $\max\{x_i\} \leq b$)
$$\hat{f}_{\lambda} = \mathop{\arg\min}_{\text{functions }h\text{ with 2 derivatives}}\left(\sum_{i=1}^{N}(y_i - h(x_i))^2 + \lambda\int_a^b\left(h''(x)\right)^2dx\right)$$
Amazingly enough, this optimization problem has a solution that can be fairly simply described. $\hat{f}_{\lambda}$ is a natural cubic spline with knots at the distinct values $x_i$ in the training set. That is, for a set of (now data-dependent, as the knots come from the training data) basis functions for such splines
$$h_1, h_2, \ldots, h_N$$
(here we're tacitly assuming that the $N$ values of the input variable in the training set are all different)
$$\hat{f}_{\lambda}(x) = \sum_{j=1}^{N}\hat{\beta}_jh_j(x)$$
where the $\hat{\beta}_j$ are yet to be identified.

So consider the function
$$g(x) = \sum_{j=1}^{N}\beta_jh_j(x) \qquad (28)$$
This has second derivative
$$g''(x) = \sum_{j=1}^{N}\beta_jh_j''(x)$$
and so
$$\left(g''(x)\right)^2 = \sum_{j=1}^{N}\sum_{l=1}^{N}\beta_j\beta_lh_j''(x)h_l''(x)$$
Then, for $\beta' = (\beta_1, \beta_2, \ldots, \beta_N)$ and
$$\Omega_{N\times N} = \left(\int_a^bh_j''(t)h_l''(t)\,dt\right)$$
it is the case that
$$\int_a^b\left(g''(x)\right)^2dx = \beta'\Omega\beta$$
In fact, with the notation
$$H_{N\times N} = (h_j(x_i))$$
($i$ indexing rows and $j$ indexing columns) the criterion to be optimized in order to find $\hat{f}_{\lambda}$ can be written for functions of the form (28) as
$$(Y - H\beta)'(Y - H\beta) + \lambda\beta'\Omega\beta$$
and some vector calculus shows that the optimizing $\beta$ is
$$\hat{\beta} = \left(H'H + \lambda\Omega\right)^{-1}H'Y \qquad (29)$$
which can be thought of as some kind of vector of generalized ridge regression coefficients.

Corresponding to (29) is a vector of smoothed output values
$$\hat{Y} = H\left(H'H + \lambda\Omega\right)^{-1}H'Y$$
and the matrix
$$S_{\lambda} = H\left(H'H + \lambda\Omega\right)^{-1}H'$$
is called a smoother matrix.
Contrast this to a situation where some fairly small number, $p$, of fixed basis functions are employed in a regression context. That is, for basis functions $b_1, b_2, \ldots, b_p$ suppose
$$B_{N\times p} = (b_j(x_i))$$
Then OLS produces the vector of fitted values
$$\hat{Y} = B\left(B'B\right)^{-1}B'Y$$
and the projection matrix onto the column space of $B$, $C(B)$, is $P_B = B\left(B'B\right)^{-1}B'$. $S_{\lambda}$ and $P_B$ are both $N\times N$ symmetric non-negative definite matrices. While
$$P_BP_B = P_B$$
i.e. $P_B$ is idempotent,
$$S_{\lambda}S_{\lambda} \preceq S_{\lambda}$$
meaning that $S_{\lambda} - S_{\lambda}S_{\lambda}$ is non-negative definite. $P_B$ is of rank $p = \operatorname{tr}(P_B)$, while $S_{\lambda}$ is of rank $N$.

In a manner similar to what is done in ridge regression, we might define an "effective degrees of freedom" for $S_{\lambda}$ (or for $\lambda$ smoothing) as
$$df(\lambda) = \operatorname{tr}(S_{\lambda}) \qquad (30)$$
We proceed to develop motivation and a formula for this quantity and for $\hat{Y}$.

Notice that it is possible to write
$$S_{\lambda}^{-1} = I + \lambda K$$
for
$$K = H'^{-1}\Omega H^{-1}$$
so that
$$S_{\lambda} = (I + \lambda K)^{-1} \qquad (31)$$
This is the so-called Reinsch form. Some vector calculus shows that $\hat{Y} = S_{\lambda}Y$ is a solution to the minimization problem
$$\mathop{\text{minimize}}_{v\in\Re^N}\ (Y - v)'(Y - v) + \lambda v'Kv \qquad (32)$$
so that this matrix $K$ can be thought of as defining a "penalty" in fitting a smoothed version of $Y$.

Then, since $S_{\lambda}$ is symmetric non-negative definite, it has an eigen decomposition as
$$S_{\lambda} = UDU' = \sum_{j=1}^{N}d_ju_ju_j' \qquad (33)$$
where columns of $U$ (the eigenvectors $u_j$) comprise an orthonormal basis for $\Re^N$ and
$$D = \operatorname{diag}(d_1, d_2, \ldots, d_N)$$
for eigenvalues of $S_{\lambda}$
$$d_1 \geq d_2 \geq \cdots \geq d_N > 0$$
It turns out to be guaranteed that $d_1 = d_2 = 1$.

Consider how the eigenvalues and eigenvectors of $S_{\lambda}$ are related to those for $K$. An eigenvalue for $K$, say $\gamma$, solves
$$\det(K - \gamma I) = 0$$
Now
$$\det(K - \gamma I) = \det\left(\frac{1}{\lambda}\left[(I + \lambda K) - (1 + \lambda\gamma)I\right]\right)$$
So $1 + \lambda\gamma$ must be an eigenvalue of $I + \lambda K$ and $1/(1 + \lambda\gamma)$ must be an eigenvalue of $S_{\lambda} = (I + \lambda K)^{-1}$. So for some $j$ we must have
$$d_j = \frac{1}{1 + \lambda\gamma}$$
and observing that $1/(1 + \lambda\gamma)$ is decreasing in $\gamma$, we may conclude that
$$d_j = \frac{1}{1 + \lambda\gamma_{N-j+1}} \qquad (34)$$
for
$$\gamma_1 \geq \gamma_2 \geq \cdots \geq \gamma_{N-2} \geq \gamma_{N-1} = \gamma_N = 0$$
the eigenvalues of $K$ (that themselves do not depend upon $\lambda$). So, for example, in light of (30), (33), and (34), the smoothing effective degrees of freedom are
$$df(\lambda) = \operatorname{tr}(S_{\lambda}) = \sum_{j=1}^{N}d_j = 2 + \sum_{j=1}^{N-2}\frac{1}{1 + \lambda\gamma_j}$$
which is clearly decreasing in $\lambda$ (with minimum value 2 in light of the fact that $S_{\lambda}$ has two eigenvalues that are 1).
Further, consider $u_j$, the eigenvector of $S_{\lambda}$ corresponding to eigenvalue $d_j$. $S_{\lambda}u_j = d_ju_j$, so that
$$u_j = S_{\lambda}^{-1}d_ju_j = (I + \lambda K)d_ju_j$$
so that
$$u_j = d_ju_j + \lambda d_jKu_j$$
and thus
$$Ku_j = \frac{1 - d_j}{\lambda d_j}u_j = \gamma_{N-j+1}u_j$$
That is, $u_j$ is an eigenvector of $K$ corresponding to the $(N-j+1)$st largest eigenvalue. That is, for all $\lambda$ the eigenvectors of $S_{\lambda}$ are eigenvectors of $K$ and thus do not depend upon $\lambda$.

Then, for any $\lambda$
$$\hat{Y} = S_{\lambda}Y = \left(\sum_{j=1}^{N}d_ju_ju_j'\right)Y = \sum_{j=1}^{N}d_j\langle u_j, Y\rangle u_j = \langle u_1, Y\rangle u_1 + \langle u_2, Y\rangle u_2 + \sum_{j=3}^{N}\frac{\langle u_j, Y\rangle}{1 + \lambda\gamma_{N-j+1}}u_j \qquad (35)$$
and we see that $\hat{Y}$ is a shrunken version of $Y$ (that progresses from $Y$ to the projection of $Y$ onto the span of $\{u_1, u_2\}$ as $\lambda$ runs from 0 to $\infty$).⁴ The larger is $\lambda$, the more severe the shrinking overall. Further, the larger is $j$, the smaller is $d_j$ and the more severe is the shrinking in the $u_j$ direction. In the context of cubic smoothing splines, large $j$ correspond to "wiggly" (as functions of coordinate $i$ or value of the input $x_i$) $u_j$, and the prescription (35) calls for suppression of "wiggly" components of $Y$.

⁴It is possible to argue that the span of $\{u_1, u_2\}$ is the set of vectors of the form $c\mathbf{1} + dx$, as is consistent with the integral penalty in the original function optimization problem.

Notice that large $j$ correspond to early/large eigenvalues of the penalty matrix $K$ in (32). Letting $u_j^* = u_{N-j+1}$ so that
$$U^* = (u_N, u_{N-1}, \ldots, u_1) = (u_1^*, u_2^*, \ldots, u_N^*)$$
the eigen decomposition of $K$ is
$$K = U^*\operatorname{diag}(\gamma_1, \gamma_2, \ldots, \gamma_N)U^{*\prime}$$
and the criterion (32) can be written as
$$\mathop{\text{minimize}}_{v\in\Re^N}\ (Y - v)'(Y - v) + \lambda v'U^*\operatorname{diag}(\gamma_1, \gamma_2, \ldots, \gamma_N)U^{*\prime}v$$
or equivalently as
$$\mathop{\text{minimize}}_{v\in\Re^N}\ \left((Y - v)'(Y - v) + \lambda\sum_{j=1}^{N-2}\gamma_j\langle u_j^*, v\rangle^2\right) \qquad (36)$$
(since $\gamma_{N-1} = \gamma_N = 0$) and we see that the eigenvalues of $K$ function as penalty coefficients applied to the $N$ orthogonal components of $v = \sum_{j=1}^{N}\langle u_j^*, v\rangle u_j^*$ in the choice of optimizing $v$. From this point of view, the $u_j$ (or $u_j^*$) provide the natural alternative (to the columns of $H$) basis (for $\Re^N$) for representing or approximating $Y$, and the last equality in (35) provides an explicit form for the optimizing smoothed vector $\hat{Y}$.

In this development, $K$ has had a specific meaning derived from the $H$ and $\Omega$ matrices connected specifically with smoothing splines and the particular values of $x$ in the training data set. But in the end, an interesting possibility brought up by the whole development is that of forgetting the origins (from $K$) of the $\gamma_j$ and $u_j^*$ and beginning with any interesting/intuitively appealing orthonormal basis $\{u_j\}$ and set of non-negative penalties $\{\gamma_j\}$ for use in (36). Working backwards through (35) and (34), one is then led to the corresponding smoothed vector $\hat{Y}$ and smoothing matrix $S_{\lambda}$. (More detail on this matter is in Section 5.3.)

It is also worth remarking that since $\hat{Y} = S_{\lambda}Y$, the rows of $S_{\lambda}$ provide weights to be applied to the elements of $Y$ in order to produce predictions/smoothed values corresponding to $Y$. These can for each $i$ be thought of as defining a corresponding "equivalent kernel" (for an appropriate "kernel-weighted average" of the training output values). (See Figure 5.8 of HTF2 in this regard.)
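A numerical illustration of the smoother matrix and the Reinsch form (31) is easy to produce. The sketch below (an addition to the outline) uses the banded $Q$, $R$ construction usually attributed to Reinsch and to Green and Silverman in place of the $H$ and $\Omega$ notation above; this is an assumed, standard equivalent formulation with $K = QR^{-1}Q'$.

```python
# A minimal numpy sketch of the cubic smoothing spline smoother matrix S_lambda.
import numpy as np

def smoother_matrix(x, lam):
    """x: sorted, distinct inputs; returns S_lambda with Yhat = S_lambda @ Y."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    h = np.diff(x)
    Q = np.zeros((n, n - 2))
    R = np.zeros((n - 2, n - 2))
    for k in range(n - 2):
        Q[k, k] = 1.0 / h[k]
        Q[k + 1, k] = -1.0 / h[k] - 1.0 / h[k + 1]
        Q[k + 2, k] = 1.0 / h[k + 1]
        R[k, k] = (h[k] + h[k + 1]) / 3.0
        if k < n - 3:
            R[k, k + 1] = R[k + 1, k] = h[k + 1] / 6.0
    K = Q @ np.linalg.solve(R, Q.T)             # the penalty matrix of (32)
    return np.linalg.inv(np.eye(n) + lam * K)   # Reinsch form (31)

# usage: S = smoother_matrix(np.sort(x), lam=1.0); Yhat = S @ Y; df = np.trace(S)
# two eigenvalues of S are 1 (the constant and linear directions); the rest are 1/(1 + lam*gamma_j)
```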
5.2 Multi-Dimensional Smoothing Splines

If $p = 2$ and the vector of inputs, $x$, takes values in $\Re^2$, one might propose to seek
$$\hat{f}_{\lambda} = \mathop{\arg\min}_{\text{functions }h\text{ with 2 derivatives}}\left(\sum_{i=1}^{N}(y_i - h(x_i))^2 + \lambda J[h]\right)$$
for
$$J[h] = \iint_{\Re^2}\left[\left(\frac{\partial^2h}{\partial x_1^2}\right)^2 + 2\left(\frac{\partial^2h}{\partial x_1\partial x_2}\right)^2 + \left(\frac{\partial^2h}{\partial x_2^2}\right)^2\right]dx_1\,dx_2$$
An optimizing $\hat{f}_{\lambda}:\Re^2\rightarrow\Re$ can be identified and is called a "thin plate spline." As $\lambda\rightarrow 0$, $\hat{f}_{\lambda}$ becomes an interpolator; as $\lambda\rightarrow\infty$ it defines the OLS plane through the data in 3-space. In general, it can be shown to be of the form
$$f(x) = \beta_0 + \beta'x + \sum_{i=1}^{N}\alpha_ig_i(x) \qquad (37)$$
where $g_i(x) = \eta(\|x - x_i\|)$ for $\eta(z) = z^2\ln z^2$. The $g_i(x)$ are "radial basis functions" (radially symmetric basis functions) and fitting is accomplished much as for the $p = 1$ case. The form (37) is plugged into the optimization criterion and a discrete penalized least squares problem emerges (after taking account of some linear constraints that are required to keep $J[f] < \infty$). HTF seem to indicate that in order to keep computations from exploding with $N$, it usually suffices to replace the $N$ functions $g_i(x)$ in (37) with $K \ll N$ functions $g_i^*(x) = \eta(\|x - x_i^*\|)$ for $K$ potential input vectors $x_i^*$ placed on a rectangular grid covering the convex hull of the $N$ training data input vectors $x_i$.

For large $p$, one might simply declare that attention is going to be limited to predictors of some restricted form, and for $h$ in that restricted class, seek to optimize
$$\sum_{i=1}^{N}(y_i - h(x_i))^2 + \lambda J[h]$$
for $J[h]$ some appropriate penalty on $h$ intended to regularize/restrict its wiggling. For example, one might assume that a form
$$g(x) = \sum_{j=1}^{p}g_j(x_j)$$
will be used and set
$$J[g] = \sum_{j=1}^{p}\int\left(g_j''(x)\right)^2dx$$
and be led to additive splines.

Or, one might assume that
$$g(x) = \sum_{j=1}^{p}g_j(x_j) + \sum_{j,k}g_{jk}(x_j, x_k) \qquad (38)$$
and invent an appropriate penalty function. It seems like a sum of 1-D smoothing spline penalties on the $g_j$ and 2-D thin plate spline penalties on the $g_{jk}$ is the most obvious starting point. Details of fitting are a bit murky (though I am sure that they can be found in books on generalized additive models). Presumably one cycles through the summands in (38), iteratively fitting functions to sets of residuals defined by the original $y_i$ minus the sums of all other current versions of the components, until some convergence criterion is satisfied. (38) is a kind of "main effects plus 2-factor interactions" decomposition, but it is (at least in theory) possible to also consider higher order terms in this kind of expansion.
5.3 An Abstraction of the Smoothing Spline Material and Penalized Fitting in $\Re^N$

In abstraction of the smoothing spline development, suppose that $\{u_j\}$ is a set of $M \leq N$ orthonormal $N$-vectors, that $\omega_j \geq 0$ for $j = 1, 2, \ldots, M$, and consider the optimization problem
$$\mathop{\text{minimize}}_{v\in\operatorname{span}\{u_j\}}\left((Y - v)'(Y - v) + \sum_{j=1}^{M}\omega_j\langle u_j, v\rangle^2\right)$$
For $v = \sum_{j=1}^{M}c_ju_j \in \operatorname{span}\{u_j\}$, the penalty is $\sum_{j=1}^{M}\omega_j\langle u_j, v\rangle^2 = \sum_{j=1}^{M}\omega_jc_j^2$, and in this penalty, $\omega_j$ is a multiplier of the squared length of the component of $v$ in the direction of $u_j$. The optimization criterion is then (up to an additive constant that does not involve the $c_j$)
$$(Y - v)'(Y - v) + \sum_{j=1}^{M}\omega_j\langle u_j, v\rangle^2 = \sum_{j=1}^{M}\left(\langle u_j, Y\rangle - c_j\right)^2 + \sum_{j=1}^{M}\omega_jc_j^2$$
and it is then easy to see (via simple calculus) that
$$c_j^{\text{opt}} = \frac{\langle u_j, Y\rangle}{1 + \omega_j}$$
i.e.
$$\hat{Y} = v^{\text{opt}} = \sum_{j=1}^{M}\frac{\langle u_j, Y\rangle}{1 + \omega_j}u_j$$
From this it's clear how the penalty structure dictates optimally shrinking the components of the projection of $Y$ onto $\operatorname{span}\{u_j\}$.

It is further worth noting that for a given set of penalty coefficients, $\hat{Y}$ can be represented as $SY$ for
$$S = \sum_{j=1}^{M}d_ju_ju_j' = U\operatorname{diag}\left(\frac{1}{1 + \omega_1}, \ldots, \frac{1}{1 + \omega_M}\right)U'$$
for $U = (u_1, u_2, \ldots, u_M)$. Then it's easy to see that the smoother matrix $S$ is a rank $M$ matrix for which $\hat{Y} = SY$.

One context in which this material might find immediate application is where some set of basis functions $\{h_j\}$ are increasingly "wiggly" with increasing $j$ and the vectors $u_j$ come from applying the Gram-Schmidt process to the vectors
$$h_j = (h_j(x_1), \ldots, h_j(x_N))'$$
In this context, it would be very natural to penalize the later $u_j$ more severely than the early ones.
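The prescription just given is a two-line computation. Here is a minimal numpy sketch (an addition to the outline) of shrinking $Y$ along an orthonormal set of directions with user-supplied penalties.

```python
# A minimal numpy sketch of penalized fitting along an orthonormal basis {u_j}.
import numpy as np

def shrink_along_basis(Y, U, omega):
    """U: N x M with orthonormal columns u_j; omega: M nonnegative penalties."""
    c = U.T @ Y / (1.0 + omega)          # c_j^opt = <u_j, Y>/(1 + omega_j)
    return U @ c                          # Yhat = sum_j c_j^opt u_j

# the associated rank-M smoother matrix is
#   S = U @ np.diag(1.0 / (1.0 + omega)) @ U.T,   so that Yhat = S @ Y
```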
6 Kernel Smoothing Methods

The central idea of this material is that when finding $\hat{f}(x_0)$ I might weight points in the training set according to how close they are to $x_0$, do some kind of fitting around $x_0$, and ultimately read off the value of the fit at $x_0$.

6.1 One-dimensional Kernel Smoothers

For the time being, suppose that $x$ takes values in $[0,1]$. Invent weighting schemes for points in the training set by defining a (usually, symmetric about 0) non-negative, real-valued function $D(t)$ that is non-increasing for $t \geq 0$ and non-decreasing for $t \leq 0$. Often $D(t)$ is taken to have value 0 unless $|t| \leq 1$. Then, a kernel function is
$$K_{\lambda}(x_0, x) = D\left(\frac{x - x_0}{\lambda}\right) \qquad (39)$$
where $\lambda$ is a "bandwidth" parameter that controls the rate at which weights drop off as one moves away from $x_0$ (and indeed, in the case that $D(t) = 0$ for $|t| > 1$, how far one moves away from $x_0$ before no weight is assigned). Common choices for $D$ are

1. the Epanechnikov quadratic kernel, $D(t) = \frac{3}{4}\left(1 - t^2\right)I[|t| \leq 1]$,

2. the "tri-cube" kernel, $D(t) = \left(1 - |t|^3\right)^3I[|t| \leq 1]$, and

3. the standard normal density, $D(t) = \phi(t)$.

Using weights (39) to make a weighted average of training responses, one arrives at the Nadaraya-Watson kernel-weighted prediction at $x_0$
$$\hat{f}(x_0) = \frac{\sum_{i=1}^{N}K_{\lambda}(x_0, x_i)y_i}{\sum_{i=1}^{N}K_{\lambda}(x_0, x_i)} \qquad (40)$$
This typically smooths training outputs $y_i$ in a more pleasing way than does a $k$-nearest neighbor average, but it has obvious problems at the ends of the interval $[0,1]$ and at places in the interior of the interval where training data are dense to one side of $x_0$ and sparse to the other, if the target $\mathrm{E}[y|x=z]$ has non-zero derivative at $z = x_0$. For example, at $x_0 = 1$ only $x_i \leq 1$ get weight, and if $\mathrm{E}[y|x=z]$ is decreasing at $z = x_0 = 1$, $\hat{f}(1)$ will be positively biased. That is, with usual symmetric kernels, (40) will fail to adequately follow an obvious trend at 0 or 1 (or at any point between where there is a sharp change in the density of input values in the training set).

A way to fix this problem with the Nadaraya-Watson predictor is to replace the locally-fitted constant with a locally-fitted line. That is, at $x_0$ one might choose $\alpha(x_0)$ and $\beta(x_0)$ to solve the optimization problem
$$\mathop{\text{minimize}}_{\alpha\text{ and }\beta}\ \sum_{i=1}^{N}K_{\lambda}(x_0, x_i)\left(y_i - (\alpha + \beta x_i)\right)^2 \qquad (41)$$
and then employ the prediction
$$\hat{f}(x_0) = \hat{\alpha}(x_0) + \hat{\beta}(x_0)x_0 \qquad (42)$$
Now the weighted least squares problem (41) has an explicit solution. Let
$$B_{N\times 2} = \begin{pmatrix}1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_N\end{pmatrix}$$
and take
$$W(x_0) = \operatorname{diag}\left(K_{\lambda}(x_0, x_1), \ldots, K_{\lambda}(x_0, x_N)\right)_{N\times N}$$
Then (42) is
$$\hat{f}(x_0) = (1, x_0)\left(B'W(x_0)B\right)^{-1}B'W(x_0)Y = l'(x_0)Y \qquad (43)$$
for the $1\times N$ vector $l'(x_0) = (1, x_0)\left(B'W(x_0)B\right)^{-1}B'W(x_0)$. It is thus obvious that locally weighted linear regression is (an albeit $x_0$-dependent) linear operation on the vector of outputs. The weights in $l'(x_0)$ combine the original kernel values and the least squares fitting operation to produce a kind of "equivalent kernel" (for a Nadaraya-Watson type weighted average).

Recall that for smoothing splines, smoothed values are
$$\hat{Y} = S_{\lambda}Y$$
where the parameter $\lambda$ is the penalty weight, and
$$df(\lambda) = \operatorname{tr}(S_{\lambda})$$
We may do something parallel in the present context. We may take
$$L_{\lambda\,(N\times N)} = \begin{pmatrix}l'(x_1)\\ l'(x_2)\\ \vdots\\ l'(x_N)\end{pmatrix}$$
where now the parameter $\lambda$ is the bandwidth, write
$$\hat{Y} = L_{\lambda}Y$$
and define
$$df(\lambda) = \operatorname{tr}(L_{\lambda})$$
HTF suggest that matching degrees of freedom for a smoothing spline and a kernel smoother produces very similar equivalent kernels, smoothers, and predictions.
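The equivalent-kernel weights $l'(x_0)$ and the matrix $L_\lambda$ are easy to compute explicitly. The following minimal numpy sketch (an addition to the outline) uses the Epanechnikov kernel; any of the other choices of $D$ above could be substituted.

```python
# A minimal numpy sketch of locally weighted linear regression, per (41)-(43).
import numpy as np

def epanechnikov(t):
    return 0.75 * (1.0 - t ** 2) * (np.abs(t) <= 1.0)

def local_linear_weights(x0, x, lam):
    """Row vector l(x0)' with f_hat(x0) = l(x0)' Y, per (43)."""
    w = epanechnikov((x - x0) / lam)                   # kernel weights K_lambda(x0, x_i)
    B = np.column_stack([np.ones_like(x), x])          # N x 2 design matrix
    M = B.T @ (w[:, None] * B)                         # B' W(x0) B
    return np.array([1.0, x0]) @ np.linalg.solve(M, (w[:, None] * B).T)

def smoother_matrix_L(x, lam):
    return np.vstack([local_linear_weights(x0, x, lam) for x0 in x])

# usage: L = smoother_matrix_L(x, lam=0.2); Yhat = L @ Y; df = np.trace(L)
```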
There is a famous theorem of Silverman that adds technical credence to this notion. Roughly, the theorem says that for large $N$, if in the case $p = 1$ the inputs $x_1, x_2, \ldots, x_N$ are iid with density $p(x)$ on $[a,b]$, $\lambda$ is neither too big nor too small,
$$D_S(u) = \frac{1}{2}\exp\left(-\frac{|u|}{\sqrt{2}}\right)\sin\left(\frac{|u|}{\sqrt{2}} + \frac{\pi}{4}\right),$$
$$\lambda(x) = \left(\frac{\lambda}{Np(x)}\right)^{1/4},$$
and
$$G(z, x) = \frac{1}{\lambda(x)p(x)}D_S\left(\frac{z - x}{\lambda(x)}\right)$$
then for $x_i$ not too close to either $a$ or $b$,
$$\left(S_{\lambda}\right)_{ij} \approx \frac{1}{N}G(x_i, x_j)$$
(in some appropriate probabilistic sense) and the smoother matrix for cubic spline smoothing has entries like those that would come from an appropriate kernel smoothing.
6.2 Kernel Smoothing in $p$ Dimensions

A direct generalization of 1-dimensional kernel smoothing to $p$ dimensions might go roughly as follows. For $D$ as before, and $x \in \Re^p$, I might set
$$K_{\lambda}(x_0, x) = D\left(\frac{\|x - x_0\|}{\lambda}\right) \qquad (44)$$
and fit linear forms locally by choosing $\alpha(x_0) \in \Re$ and $\beta(x_0) \in \Re^p$ to solve the optimization problem
$$\mathop{\text{minimize}}_{\alpha\text{ and }\beta}\ \sum_{i=1}^{N}K_{\lambda}(x_0, x_i)\left(y_i - \left(\alpha + \beta'x_i\right)\right)^2$$
and predicting as
$$\hat{f}(x_0) = \hat{\alpha}(x_0) + \hat{\beta}(x_0)'x_0$$
This seems typically to be done only after standardizing the coordinates of $x$, and can be effective as long as $N$ is not too small and $p$ is not more than 2 or 3. However, for $p > 3$ the curse of dimensionality comes into play and $N$ points usually just aren't dense enough in $p$-space to make direct use of kernel smoothing effective. If the method is going to be successfully used in $\Re^p$, it will need to be applied under appropriate structure assumptions.

One way to apply additional structure to the $p$-dimensional kernel smoothing problem is to essentially reduce input variable dimension by replacing the kernel (44) with the "structured kernel"
$$K_{\lambda, A}(x_0, x) = D\left(\frac{\sqrt{(x - x_0)'A(x - x_0)}}{\lambda}\right)$$
for an appropriate non-negative definite matrix $A$. For the eigen decomposition of $A$,
$$A = VDV'$$
write
$$(x - x_0)'A(x - x_0) = \left(D^{1/2}V'(x - x_0)\right)'\left(D^{1/2}V'(x - x_0)\right)$$
This amounts to using not $x$ and $\Re^p$ distance from $x$ to $x_0$ to define weights, but rather $D^{1/2}V'x$ and $\Re^p$ distance from $D^{1/2}V'x$ to $D^{1/2}V'x_0$. In the event that some entries of $D$ are 0 (or are nearly so), we basically reduce dimension from $p$ to the number of large eigenvalues of $A$ and define weights in a space of that dimension (spanned by eigenvectors corresponding to non-zero eigenvalues) where the curse of dimensionality may not preclude effective use of kernel smoothing. The "trick" is, of course, identifying the right directions into which to project. (Searching for such directions is part of the Friedman "projection pursuit" ideas discussed below.)
7 High-Dimensional Use of Low-Dimensional Smoothers

There are several ways that have been suggested for making use of fairly low-dimensional (and thus potentially effective) smoothing in large $p$ problems. One of them is the "structured kernels" idea just discussed. Two more follow.

7.1 Additive Models and Other Structured Regression Functions

A way to apply structure to the $p$-dimensional smoothing problem is through assumptions on the form of the predictor fit. For example, one might assume additivity in a form
$$f(x) = \alpha + \sum_{j=1}^{p}g_j(x_j) \qquad (45)$$
and try to do fitting of the $p$ functions $g_j$ and constant $\alpha$. Fitting is possible via the so-called "back-fitting algorithm."

That is (to generalize form (45) slightly), if I am to fit (under SEL)
$$f(x) = \alpha + \sum_{l=1}^{L}g_l\left(x^l\right)$$
for $x^l$ some part of $x$, I might set $\hat{\alpha} = \frac{1}{N}\sum_{i=1}^{N}y_i$, and then cycle through $l = 1, 2, \ldots, L, 1, 2, \ldots$

1. fitting via some appropriate (often linear) operation (e.g., spline or kernel smoothing) $g_l\left(x^l\right)$ to "data"
$$\left(x_i^l, y_i^l\right)_{i=1,2,\ldots,N}$$
for
$$y_i^l = y_i - \left(\hat{\alpha} + \sum_{m\neq l}g_m\left(x_i^m\right)\right)$$
where the $g_m$ are the current versions of the fitted summands,

2. setting
$$g_l = \text{(the newly fitted version)} - \text{(the sample mean of this newly fitted version across all }x_i^l\text{)}$$
(in theory this is not necessary, but it is here to prevent numerical/round-off errors from causing the $g_m$ to drift up and down by additive constants summing to 0 across $m$),

3. iterating until convergence to, say, $\hat{f}(x)$.

The simplest version of this, based on form (45), might be termed fitting of a "main effects model." But the algorithm might as well be applied to fit a "main effects and two factor interactions model," at some iterations doing smoothing of appropriate "residuals" as functions of 1-dimensional $x_j$ and at other iterations doing smoothing as functions of 2-dimensional $(x_j, x_{j'})$. One may mix types of predictors (continuous, categorical) and types of functions of them in the additive form to produce all sorts of interesting models (including semi-parametric ones and ones with low order interactions).
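The back-fitting cycle itself is short to code. Here is a minimal numpy sketch (an addition to the outline) for the main-effects form (45); any 1-D linear smoother can be plugged in, and the usage comment pairing it with the kernel smoother sketched in Section 6.1 is just one convenient choice.

```python
# A minimal numpy sketch of the back-fitting algorithm for an additive model.
import numpy as np

def backfit(X, y, smooth, n_iter=20):
    """smooth(x_col, r) returns fitted values of a 1-D smoother of residuals r on x_col."""
    N, p = X.shape
    alpha = y.mean()
    g = np.zeros((N, p))                              # current fitted values g_j(x_{ij})
    for _ in range(n_iter):
        for j in range(p):
            r = y - alpha - g.sum(axis=1) + g[:, j]   # partial residuals for term j
            gj = smooth(X[:, j], r)
            g[:, j] = gj - gj.mean()                  # recenter to keep the terms identified
    return alpha, g

# usage with the local linear smoother sketched earlier (an assumption, not a requirement):
#   smooth = lambda xcol, r: smoother_matrix_L(xcol, lam=0.3) @ r
#   alpha, g = backfit(X, y, smooth)
```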
Another possibility for introducing structure assumptions and making use of low-dimensional smoothing in a large $p$ situation is making strong global assumptions on the forms of the influence of some input variables on the output, but allowing parameters of those forms to vary in a flexible fashion with the values of some small number of coordinates of $x$. For sake of example, suppose that $p = 4$. One might consider predictor forms
$$f(x) = \alpha(x_3, x_4) + \beta_1(x_3, x_4)x_1 + \beta_2(x_3, x_4)x_2$$
That is, one might assume that for fixed $(x_3, x_4)$, the form of the predictor is linear in $(x_1, x_2)$, but that the coefficients of that form may change in a flexible way with $(x_3, x_4)$. Fitting might then be approached by locally weighted least squares, with only $(x_3, x_4)$ involved in the setting of the weights. That is, one might for each $(x_{30}, x_{40})$ minimize over choices of $\alpha(x_{30}, x_{40})$, $\beta_1(x_{30}, x_{40})$, and $\beta_2(x_{30}, x_{40})$ the weighted sum of squares
$$\sum_{i=1}^{N}K_{\lambda}\left((x_{30}, x_{40}), (x_{3i}, x_{4i})\right)\left(y_i - \left(\alpha(x_{30}, x_{40}) + \beta_1(x_{30}, x_{40})x_{1i} + \beta_2(x_{30}, x_{40})x_{2i}\right)\right)^2$$
and then employ the predictor
$$\hat{f}(x_0) = \hat{\alpha}(x_{30}, x_{40}) + \hat{\beta}_1(x_{30}, x_{40})x_{10} + \hat{\beta}_2(x_{30}, x_{40})x_{20}$$
This kind of device keeps the dimension of the space where one is doing smoothing down to something manageable. But note that nothing here does any thresholding or automatic variable selection.
7.2 Projection Pursuit Regression

For $w_1, w_2, \ldots, w_M$ unit $p$-vectors of parameters, we might consider as predictors fitted versions of the form
$$f(x) = \sum_{m=1}^{M}g_m\left(w_m'x\right) \qquad (46)$$
This is an additive form in the derived variables $v_m = w_m'x$. The functions $g_m$ and the directions $w_m$ are to be fit from the training data. The $M = 1$ case of this form is the "single index model" of econometrics.

How does one fit a predictor of this form (46)? Consider first the $M = 1$ case. Given $w$, we simply have pairs $(v_i, y_i)$ for $v_i = w'x_i$, and a 1-dimensional smoothing method can be used to estimate $g$. On the other hand, given $g$, we might seek to optimize $w$ via an iterative search. A Gauss-Newton algorithm can be based on the first order Taylor approximation
$$g\left(w'x_i\right) \approx g\left(w_{\text{old}}'x_i\right) + g'\left(w_{\text{old}}'x_i\right)\left(w - w_{\text{old}}\right)'x_i$$
so that
$$\sum_{i=1}^{N}\left(y_i - g\left(w'x_i\right)\right)^2 \approx \sum_{i=1}^{N}\left(g'\left(w_{\text{old}}'x_i\right)\right)^2\left(\left[w_{\text{old}}'x_i + \frac{y_i - g\left(w_{\text{old}}'x_i\right)}{g'\left(w_{\text{old}}'x_i\right)}\right] - w'x_i\right)^2$$
We may then update $w_{\text{old}}$ to $w$ using the closed form for weighted (by $\left(g'\left(w_{\text{old}}'x_i\right)\right)^2$) no-intercept regression of
$$w_{\text{old}}'x_i + \frac{y_i - g\left(w_{\text{old}}'x_i\right)}{g'\left(w_{\text{old}}'x_i\right)}$$
on $x_i$. (Presumably one must normalize the updated $w$ to preserve the unit length property of the $w$'s and so maintain a stable scaling in the fitting.) The $g$ and $w$ steps are iterated until convergence. Note that in the case where cubic smoothing spline smoothing is used in projection pursuit, $g'$ will be evaluated as some explicit quadratic, and in the case of locally weighted linear smoothing, form (43) will need to be differentiated in order to evaluate the derivative $g'$.

When $M > 1$, terms $g_m\left(w_m'x\right)$ are added to a sum of such in a forward stage-wise fashion. HTF provide some discussion of details like readjusting previous $g$'s (and perhaps $w$'s) upon adding $g_m\left(w_m'x\right)$ to a fit, and the choice of $M$.
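For concreteness, the following is a minimal numpy sketch (an addition to the outline) of the alternating scheme for the $M = 1$ single index case. A cubic polynomial fit stands in for the 1-D smoother purely for simplicity; a spline or kernel smoother, as described above, would be the more serious choice.

```python
# A minimal sketch of the M = 1 projection pursuit fit: smooth g, then Gauss-Newton update of w.
import numpy as np

def fit_single_index(X, y, n_iter=20):
    w = np.ones(X.shape[1]) / np.sqrt(X.shape[1])     # starting unit direction
    for _ in range(n_iter):
        v = X @ w
        coef = np.polyfit(v, y, deg=3)                # stand-in for a spline/kernel smoother
        g = np.polyval(coef, v)
        gprime = np.polyval(np.polyder(coef), v)
        # weighted no-intercept regression of w'x_i + (y_i - g)/g' on x_i, weights (g')^2
        target = v + (y - g) / gprime
        W = gprime ** 2
        w = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * target))
        w = w / np.linalg.norm(w)                     # renormalize to unit length
    return w, coef

# the fitted predictor is x -> np.polyval(coef, x @ w)
```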
8 Highly Flexible Non-Linear Parametric Regression Methods

8.1 Neural Network Regression

A multi-layer feed-forward neural network is a nonlinear map of $x \in \Re^p$ to one or more real-valued $y$'s through the use of functions of linear combinations of functions of linear combinations of ... of functions of linear combinations of coordinates of $x$. Figure 1 is a network diagram representation of a single hidden layer feed-forward neural net with 3 inputs, 2 hidden nodes, and 2 outputs.

Figure 1: A Network Diagram Representation of a Single Hidden Layer Feed-forward Neural Net With 3 Inputs, 2 Hidden Nodes, and 2 Outputs

This diagram stands for a function of $x$ defined by setting
$$z_1 = \sigma\left(\alpha_{01} + \alpha_{11}x_1 + \alpha_{21}x_2 + \alpha_{31}x_3\right)$$
$$z_2 = \sigma\left(\alpha_{02} + \alpha_{12}x_1 + \alpha_{22}x_2 + \alpha_{32}x_3\right)$$
and then
$$y_1 = g_1\left(\beta_{01} + \beta_{11}z_1 + \beta_{21}z_2\right)$$
$$y_2 = g_2\left(\beta_{02} + \beta_{12}z_1 + \beta_{22}z_2\right)$$
Of course, much more complicated networks are possible, particularly ones with multiple hidden layers and many nodes on all layers.

The common choices of functional form $\sigma$ at hidden nodes are ones with sigmoidal shape like
$$\sigma(u) = \frac{1}{1 + \exp(-u)}$$
or
$$\sigma(u) = \tanh(u) = \frac{\exp(u) - \exp(-u)}{\exp(u) + \exp(-u)}$$
(statisticians seem to favor the former, probably because of familiarity with logistic forms, while computer scientists seem to favor the latter). Such functions are really quite linear near $u = 0$, so that for small $\alpha$'s the functions of $x$ entering the $g$'s in a single hidden layer network are nearly linear. For large $\alpha$'s the functions are nearly step functions. In light of the latter, it is not surprising that there are universal approximation theorems that guarantee that any continuous function on a compact subset of $\Re^p$ can be approximated to any degree of fidelity with a single layer feed-forward neural net with enough nodes in the hidden layer. This is both a blessing and a curse. It promises that these forms are quite flexible. It also promises that there must be both over-fitting and identifiability issues inherent in their use (the latter in addition to the identifiability issues already inherent in the symmetric nature of the functional forms assumed for the predictors). In regression contexts, identity functions are common for the functions $g$, while in classification problems, in order to make estimates of class probabilities, the functions $g$ often exponentiate one $z$ and divide by a sum of such terms for all $z$'s.

There are various possibilities for regularization of the ill-posed fitting problem for neural nets, ranging from the fairly formal and rational to the very informal and ad hoc. Possibly the most common is to simply use an iterative fitting algorithm and "stop it before it converges."
8.1.1 The Back-Propagation Algorithm

The most common fitting algorithm for neural networks is something called the back-propagation algorithm or the delta rule. In the case where one is doing SEL fitting for a single layer feed-forward neural net with $M$ hidden nodes and $K$ output nodes, the logic goes as follows. There are $M(p+1)$ parameters $\alpha$ to be fit. In addition to the notation $\alpha_{0m}$, use the notation
$$\alpha_m = (\alpha_{1m}, \alpha_{2m}, \ldots, \alpha_{pm})'$$
There are $K(M+1)$ parameters $\beta$ to be fit. In addition to the notation $\beta_{0k}$, use the notation
$$\beta_k = (\beta_{1k}, \beta_{2k}, \ldots, \beta_{Mk})'$$
Let $\theta$ stand for the whole set of parameters and consider the SEL fitting criterion
$$R(\theta) = \frac{1}{2}\sum_{i=1}^{N}\sum_{k=1}^{K}\left(y_{ik} - f_k(x_i)\right)^2 \qquad (47)$$
Let $z_{0i} = 1$ and for $m = 1, 2, \ldots, M$ define
$$z_{mi} = \sigma\left(\alpha_{0m} + \alpha_m'x_i\right)$$
and
$$z_i = (z_{1i}, z_{2i}, \ldots, z_{Mi})'$$
and write
$$R_i = \frac{1}{2}\sum_{k=1}^{K}\left(y_{ik} - f_k(x_i)\right)^2$$
(so that $R(\theta) = \sum_{i=1}^{N}R_i$).

Partial derivatives with respect to the $\beta$ parameters are (by the chain rule)
$$\frac{\partial R_i}{\partial\beta_{mk}} = -\left(y_{ik} - f_k(x_i)\right)g_k'\left(\beta_{0k} + \beta_k'z_i\right)z_{mi}$$
(for $m = 0, 1, 2, \ldots, M$ and $k = 1, 2, \ldots, K$) and we will use the abbreviation
$$\delta_{ki} = -\left(y_{ik} - f_k(x_i)\right)g_k'\left(\beta_{0k} + \beta_k'z_i\right) \qquad (48)$$
so that
$$\frac{\partial R_i}{\partial\beta_{mk}} = \delta_{ki}z_{mi} \qquad (49)$$
Similarly, letting $x_{i0} = 1$,
$$\frac{\partial R_i}{\partial\alpha_{lm}} = -\sum_{k=1}^{K}\left(y_{ik} - f_k(x_i)\right)g_k'\left(\beta_{0k} + \beta_k'z_i\right)\beta_{mk}\,\sigma'\left(\alpha_{0m} + \alpha_m'x_i\right)x_{il} = \sum_{k=1}^{K}\delta_{ki}\beta_{mk}\,\sigma'\left(\alpha_{0m} + \alpha_m'x_i\right)x_{il}$$
(for $l = 0, 1, 2, \ldots, p$ and $m = 1, 2, \ldots, M$) and we will use the abbreviations
$$s_{mi} = -\sum_{k=1}^{K}\left(y_{ik} - f_k(x_i)\right)g_k'\left(\beta_{0k} + \beta_k'z_i\right)\beta_{mk}\,\sigma'\left(\alpha_{0m} + \alpha_m'x_i\right) = \sigma'\left(\alpha_{0m} + \alpha_m'x_i\right)\sum_{k=1}^{K}\beta_{mk}\delta_{ki} \qquad (50)$$
(equations (50) are called the "back-propagation" equations) so that
$$\frac{\partial R_i}{\partial\alpha_{lm}} = s_{mi}x_{il} \qquad (51)$$
An iterative search to make $R(\theta)$ small may then proceed by setting
$$\beta_{mk}^{(r+1)} = \beta_{mk}^{(r)} - \gamma_r\sum_{i=1}^{N}\frac{\partial R_i}{\partial\beta_{mk}} = \beta_{mk}^{(r)} - \gamma_r\sum_{i=1}^{N}\delta_{ki}^{(r)}z_{mi}^{(r)} \quad\forall m, k \qquad (52)$$
and
$$\alpha_{lm}^{(r+1)} = \alpha_{lm}^{(r)} - \gamma_r\sum_{i=1}^{N}\frac{\partial R_i}{\partial\alpha_{lm}} = \alpha_{lm}^{(r)} - \gamma_r\sum_{i=1}^{N}s_{mi}^{(r)}x_{il} \qquad (53)$$
(the partial derivatives being evaluated at the $r$th iterates $\alpha_{0m}^{(r)}$, $\alpha_m^{(r)}$ and all $\beta_{0k}^{(r)}$ and $\beta_k^{(r)}$) for some "learning rate" $\gamma_r$.

So operationally,

1. in what is called the "forward pass" through the network, using $r$th iterates of $\alpha$'s and $\beta$'s, one computes $r$th iterates of fitted values $f_k^{(r)}(x_i)$,

2. in what is called the "backward pass" through the network, the $r$th iterates $\delta_{ki}^{(r)}$ are computed using the $r$th iterates of the $z$'s, $\beta$'s and the $f_k^{(r)}(x_i)$ in equations (48), and then the $r$th iterates $s_{mi}^{(r)}$ are computed using the $r$th $\alpha$'s, $\beta$'s, and $\delta$'s in equations (50),

3. the $\delta_{ki}^{(r)}$ and $s_{mi}^{(r)}$ then provide partials using (49) and (51), and

4. updates for the parameters $\alpha$ and $\beta$ come from (52) and (53).
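The forward and backward passes translate into a few lines of array arithmetic. The following minimal numpy sketch (an addition to the outline) takes identity output functions $g_k$ and full-gradient steps, which is just one convenient set of choices.

```python
# A minimal numpy sketch of the gradient computations (48)-(51) and updates (52)-(53)
# for a single-hidden-layer net with identity output functions g_k.
import numpy as np

def sigma(u):
    return 1.0 / (1.0 + np.exp(-u))

def backprop_epoch(X, Y, a0, A, b0, B, rate):
    """X: N x p, Y: N x K, A: p x M, a0: M-vector, B: M x K, b0: K-vector."""
    # forward pass
    lin_hidden = a0 + X @ A                 # alpha_{0m} + alpha_m' x_i, N x M
    Z = sigma(lin_hidden)                   # z_{mi}
    F = b0 + Z @ B                          # fitted values f_k(x_i), N x K (identity g_k)
    # backward pass: delta_{ki} from (48), s_{mi} from (50)
    delta = -(Y - F)                        # g_k' = 1 for identity outputs
    S = Z * (1 - Z) * (delta @ B.T)         # sigma'(.) times sum_k beta_{mk} delta_{ki}
    # gradient-descent updates (52) and (53), including the intercepts
    B -= rate * Z.T @ delta
    b0 -= rate * delta.sum(axis=0)
    A -= rate * X.T @ S
    a0 -= rate * S.sum(axis=0)
    return a0, A, b0, B

# repeated calls with a small rate carry out the iterative search; stopping early, or also
# shrinking A and B by rate*lam*A and rate*lam*B each step, regularizes the fit per (54)-(55).
```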
8.1.2 Formal Regularization of Fitting

Suppose that the various coordinates of the input vectors in the training set have been standardized and one wants to regularize the fitting of a neural net. One possible way of proceeding is to define a penalty function like
$$J(\theta) = \frac{1}{2}\left(\sum_{m=1}^{M}\sum_{l=0}^{p}\alpha_{lm}^2 + \sum_{m=0}^{M}\sum_{k=1}^{K}\beta_{mk}^2\right) \qquad (54)$$
(it is not absolutely clear whether one really wants to include $l = 0$ and $m = 0$ in the sums (54)) and seek not to partially optimize $R(\theta)$ in (47), but rather to fully optimize
$$R(\theta) + \lambda J(\theta) \qquad (55)$$
for a $\lambda > 0$. By simply adding the terms $\lambda\gamma_r\beta_{mk}^{(r)}$ and $\lambda\gamma_r\alpha_{lm}^{(r)}$ to the amounts subtracted on the right sides of (52) and (53) respectively, one arrives at an appropriate gradient descent algorithm for trying to optimize the penalized error sum of squares (55). (Potentially, an appropriate value for $\lambda$ might be chosen based on cross-validation.)
Something that may at first seem quite different would be to take a Bayesian point of view. For example, with a model
$$y_{ik} = f_k(x_i|\theta) + \epsilon_{ik}$$
for the $\epsilon_{ik}$ iid $\mathrm{N}\left(0, \sigma^2\right)$, a likelihood is simply
$$l\left(\theta, \sigma^2\right) = \prod_{i=1}^{N}\prod_{k=1}^{K}h\left(y_{ik}\,|\,f_k(x_i|\theta), \sigma^2\right)$$
for $h\left(\cdot\,|\,\mu, \sigma^2\right)$ the normal pdf. If then $g\left(\theta, \sigma^2\right)$ specifies a prior distribution for $\theta$ and $\sigma^2$, a posterior for $\left(\theta, \sigma^2\right)$ has density proportional to
$$l\left(\theta, \sigma^2\right)g\left(\theta, \sigma^2\right)$$
For example, one might well assume that a priori the $\alpha$'s and $\beta$'s are iid $\mathrm{N}\left(0, \tau^2\right)$ (where small $\tau^2$ will provide regularization, and it is again unclear whether one wants to include the $\alpha_{0m}$ and $\beta_{0k}$ in such an assumption or to instead provide more diffuse priors for them, like improper "Uniform$(-\infty, \infty)$" or at least large variance normal ones). And a standard (improper) prior for $\sigma^2$ would be to take $\log\sigma$ to be Uniform$(-\infty, \infty)$. In any case, whether improper or proper, let's abuse notation and write $g\left(\sigma^2\right)$ for the prior density for $\sigma^2$. Then with independent mean 0 variance $\tau^2$ priors for all the weights (except possibly the $\alpha_{0m}$ and $\beta_{0k}$ that might be given Uniform$(-\infty, \infty)$ priors) one has
$$\ln\left(l\left(\theta, \sigma^2\right)g\left(\theta, \sigma^2\right)\right) \propto -NK\ln(\sigma) - \frac{1}{\sigma^2}R(\theta) - \frac{1}{\tau^2}J(\theta) + \ln g\left(\sigma^2\right) = -NK\ln(\sigma) - \frac{1}{\sigma^2}\left(R(\theta) + \frac{\sigma^2}{\tau^2}J(\theta)\right) + \ln g\left(\sigma^2\right) \qquad (56)$$
(flat improper priors for the $\alpha_{0m}$ and $\beta_{0k}$ correspond to the absence of terms for them in the sums for $J(\theta)$ in (54)). This recalls (55) and suggests that an appropriate $\lambda$ for regularization can be thought of as a variance ratio $\sigma^2/\tau^2$ of "observation variance" and prior variance for the weights.

It's fairly clear how to define Metropolis-Hastings-within-Gibbs algorithms for sampling from the posterior with density proportional to $l\left(\theta, \sigma^2\right)g\left(\theta, \sigma^2\right)$. But it seems that typically the high dimensionality of the parameter space combined with the symmetry-derived multimodality of the posterior will prevent one from running an MCMC algorithm long enough to fully detail the posterior. It also seems unlikely, however, that detailing the posterior is really necessary or even desirable.

Rather, one might simply run the MCMC algorithm, monitoring the values of $l\left(\theta, \sigma^2\right)g\left(\theta, \sigma^2\right)$ corresponding to the successively randomly generated MCMC iterates. An MCMC algorithm will spend much of its time where the corresponding posterior density is large and we can expect that a long MCMC run will identify a nearly modal value for the posterior. Rather than averaging neural nets according to the posterior, one might instead use as a predictor a neural net corresponding to a parameter vector (at least locally) maximizing the posterior.

Notice that one might even take the parameter vector in an MCMC run with the largest $l\left(\theta, \sigma^2\right)g\left(\theta, \sigma^2\right)$ value and, for a grid of $\sigma^2/\tau^2$ values around the empirical maximizer, use the back-propagation algorithm modified to fully optimize
$$R(\theta) + \frac{\sigma^2}{\tau^2}J(\theta)$$
over choices of $\theta$. This, in turn, could be used with (56) to perhaps improve somewhat the result of the MCMC "search."
8.2 Radial Basis Function Networks

Section 6.7 of HTF considers the use of the kind of kernels applied in kernel smoothing as basis functions. That is, for
$$K_{\lambda}(x, \xi) = D\left(\frac{\|x - \xi\|}{\lambda}\right)$$
one might consider fitting nonlinear predictors of the form
$$f(x) = \beta_0 + \sum_{j=1}^{M}\beta_jK_{\lambda_j}\left(x, \xi_j\right) \qquad (57)$$
where each basis element has prototype parameter $\xi_j$ and scale parameter $\lambda_j$. A common choice of $D$ for this purpose is the standard normal pdf.

A version of this with fewer parameters is obtained by restricting to cases where $\lambda_1 = \lambda_2 = \cdots = \lambda_M = \lambda$. This restriction, however, has the potentially unattractive effect of forcing "holes" or regions of $\Re^p$ where (in each) $f(x) \approx 0$, including all "large" $x$. A way to replace this behavior with potentially differing values in the former "holes" and directions of "large" $x$ is to replace the basis functions
$$K_{\lambda}\left(x, \xi_j\right) = D\left(\frac{\|x - \xi_j\|}{\lambda}\right)$$
with normalized versions
$$h_j(x) = \frac{D\left(\|x - \xi_j\|/\lambda\right)}{\sum_{k=1}^{M}D\left(\|x - \xi_k\|/\lambda\right)}$$
to produce a form
$$f(x) = \beta_0 + \sum_{j=1}^{M}\beta_jh_j(x) \qquad (58)$$
The fitting of form (57) by choice of $\beta_0, \beta_1, \ldots, \beta_M$, $\xi_1, \xi_2, \ldots, \xi_M$, $\lambda_1, \lambda_2, \ldots, \lambda_M$, or of form (58) by choice of $\beta_0, \beta_1, \ldots, \beta_M$, $\xi_1, \xi_2, \ldots, \xi_M$, $\lambda$, is fraught with all the problems of over-parameterization and lack of identifiability associated with neural networks.

Another way to use radial basis functions to produce flexible functional forms is to replace the forms $\sigma\left(\alpha_{0m} + \alpha_m'x\right)$ in a neural network with forms $K_{\lambda_m}\left(\alpha_m'x, \xi_m\right)$ or $h_m\left(\alpha_m'x\right)$.
9 Trees and Related Methods

9.1 Regression and Classification Trees

This is something genuinely new to our discussion. We'll first consider the SEL/regression version. This begins with a forward-selection/"greedy" algorithm for inventing predictions constant on $p$-dimensional rectangles, by successively looking for an optimal binary split of a single one of an existing set of rectangles. For example, with $p = 2$, one begins with a rectangle $[a,b]\times[c,d]$ defined by
$$a = \min_{i=1,2,\ldots,N}x_{i1} \leq x_1 \leq \max_{i=1,2,\ldots,N}x_{i1} = b$$
and
$$c = \min_{i=1,2,\ldots,N}x_{i2} \leq x_2 \leq \max_{i=1,2,\ldots,N}x_{i2} = d$$
and looks for a way to split it at $x_1 = s_1$ or $x_2 = s_1$ so that the resulting two rectangles minimize the training error
$$SSE = \sum_{\text{rectangles}}\ \sum_{i\text{ with }x_i\text{ in the rectangle}}\left(y_i - \bar{y}_{\text{rectangle}}\right)^2$$
One then splits (optimally) one of the (now) two rectangles at $x_1 = s_2$ or $x_2 = s_2$, etc.

Where $l$ rectangles $R_1, R_2, \ldots, R_l$ in $\Re^p$ have been created, and
$$m(x) = \text{the index of the rectangle to which }x\text{ belongs}$$
the corresponding predictor is
$$\hat{f}_l(x) = \frac{1}{\#\text{ training input vectors }x_i\text{ in }R_{m(x)}}\sum_{i\text{ with }x_i\text{ in }R_{m(x)}}y_i$$
and in this notation the training error is
$$SSE = \sum_{i=1}^{N}\left(y_i - \hat{f}_l(x_i)\right)^2$$
If one is to continue beyond $l$ rectangles, one then looks for a value $s_l$ to split one of the existing rectangles $R_1, R_2, \ldots, R_l$ on $x_1$ or $x_2$ or ... or $x_p$ and thereby produce the greatest reduction in $SSE$. (We note that there is no guarantee that after $l$ splits one will have the best (in terms of $SSE$) possible set of $l + 1$ rectangles.)

Any series of binary splits of rectangles can be represented graphically as a binary tree, each split represented by a node where there is a fork and each final rectangle by an end node. It is convenient to discuss rectangle-splitting and "unsplitting" in terms of operations on corresponding binary trees, and henceforth we adopt this language.

It is, of course, possible to continue splitting rectangles/adding branches to a tree until every distinct $x$ in the training set has its own rectangle. But that is not helpful in a practical sense, in that it corresponds to a very low bias/high variance predictor. So how does one find a tree of appropriate size? The standard recommendation using $K$-fold cross-validation seems to be as follows. For each of the $K$ subsets of the training set (say $S^k = T - T^k$ in the notation of Section 1.5)

1. grow a tree (on a given data set) until the cell with the fewest training $x_i$ contains 5 or fewer such points, then

2. "prune" the tree in 1. back by, for each $\alpha > 0$ (a complexity parameter, weighting $SSE^k$ (for $S^k$) against complexity defined in terms of tree size), minimizing over choices of sub-trees the quantity
$$C_{\alpha}^k(T) = |T| + \alpha\,SSE^k(T)$$
(for, in the obvious way, $|T|$ the number of final nodes in the candidate tree and $SSE^k(T)$ the error sum of squares for the corresponding predictor).

Write
$$T^k(\alpha) = \mathop{\arg\min}_{\text{subtrees }T}C_{\alpha}^k(T)$$
and let $\hat{f}_{\alpha}^k$ be the corresponding predictor.

Then (as in Section 1.5), letting $k(i)$ be the index of the piece of the training set $T^k$ containing training case $i$, compute the cross-validation error
$$CV(\alpha) = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{f}_{\alpha}^{k(i)}(x_i)\right)^2$$
For $\hat{\alpha}$ a minimizer of $CV(\alpha)$, one then operates on the entire training set, growing a tree until the cell with the fewest training $x_i$ contains 5 or fewer such points, then finding the sub-tree, say $T_{\hat{\alpha}}$, optimizing $C_{\hat{\alpha}}(T) = |T| + \hat{\alpha}\,SSE(T)$, and using the corresponding predictor $\hat{f}_{\hat{\alpha}}$.

The question of how to find a subtree optimizing a $C_{\alpha}^k(T)$ or $C_{\alpha}(T)$ without making an exhaustive search over sub-trees for every different value of $\alpha$ has a workable answer. There is a relatively small number of nested candidate sub-trees of an original tree that are the only ones that are possible minimizers of a $C_{\alpha}^k(T)$ or $C_{\alpha}(T)$, and as $\alpha$ decreases one moves through that nested sequence of sub-trees from the largest/original tree to the smallest.

Suppose that one begins with a large tree, $T_0$ (with $|T_0|$ final nodes). One may quickly search over all "pruned" versions of $T_0$ (sub-trees $T$ created by removing a node where there is a fork and all branches that follow below it) and find the one with minimum
$$\frac{SSE(T) - SSE(T_0)}{|T_0| - |T|}$$
(This IS the per node (of the lopped-off branch of the first tree) increase in $SSE$.) Call that sub-tree $T_1$. $T_0$ is the optimizer of $C_{\alpha}(T)$ over sub-trees of $T_0$ for every $\alpha \geq (|T_0| - |T_1|)/(SSE(T_1) - SSE(T_0))$, but at $\alpha = (|T_0| - |T_1|)/(SSE(T_1) - SSE(T_0))$ the optimizing sub-tree switches to $T_1$.

One then may search over all "pruned" versions of $T_1$ for the one with minimum
$$\frac{SSE(T) - SSE(T_1)}{|T_1| - |T|}$$
and call it $T_2$. $T_1$ is the optimizer of $C_{\alpha}(T)$ over sub-trees of $T_0$ for every $(|T_1| - |T_2|)/(SSE(T_2) - SSE(T_1)) \leq \alpha \leq (|T_0| - |T_1|)/(SSE(T_1) - SSE(T_0))$, but at $\alpha = (|T_1| - |T_2|)/(SSE(T_2) - SSE(T_1))$ the optimizing sub-tree switches to $T_2$; and so on. (In this process, if there ever happens to be a tie among sub-trees in terms of minimizing a ratio of increase in error sum of squares per decrease in number of nodes, one chooses the sub-tree with the smaller $|T|$.)

For $T_{\alpha}$ optimizing $C_{\alpha}(T)$, the function of $\alpha$,
$$C_{\alpha}(T_{\alpha}) = \min_{T}C_{\alpha}(T)$$
is piecewise linear in $\alpha$, and both it and the optimizing nested sequence of sub-trees can be computed very efficiently in this fashion.
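In practice the nested pruning sequence and a cross-validated choice along it can be obtained from standard software. The sketch below (an addition to the outline) assumes scikit-learn is available; note that scikit-learn's cost-complexity parameterization penalizes the number of final nodes rather than the error sum of squares, so its ccp_alpha plays the role of (a rescaled) $1/\alpha$ relative to the $C_\alpha(T)$ above. The simulated data are purely illustrative.

```python
# A minimal sketch of growing a large regression tree, computing its nested
# cost-complexity pruning path, and choosing a pruning level by cross-validation.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

X = np.random.rand(500, 3)
y = 10 * (X[:, 0] > 0.5) + X[:, 2] + 0.5 * np.random.randn(500)

big_tree = DecisionTreeRegressor(min_samples_leaf=5).fit(X, y)      # grow until cells are small
path = big_tree.cost_complexity_pruning_path(X, y)                   # nested sequence of subtrees

cv_errors = []
for ccp_alpha in path.ccp_alphas:
    tree = DecisionTreeRegressor(min_samples_leaf=5, ccp_alpha=ccp_alpha)
    cv_errors.append(-cross_val_score(tree, X, y, cv=10,
                                      scoring="neg_mean_squared_error").mean())
best_alpha = path.ccp_alphas[int(np.argmin(cv_errors))]
final_tree = DecisionTreeRegressor(min_samples_leaf=5, ccp_alpha=best_alpha).fit(X, y)
```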
The continuous y, SEL version of this material is known as the making of
"regression trees." The "classi…cation trees" version of it is very similar. One
needs only to de…ne an empirical loss to associate with a given tree parallel to
SSE used above. To that end, we …rst note that in a K class problem (where
y takes values in G = f1; 2; : : : ; Kg) corresponding to a particular rectangle Rm
is the fraction of training vectors with classi…cation k,
pd
mk =
1
# training input vectors in Rm
X
i with xi
in Rm
and a plausible G-valued predictor based on l rectangles is
f^l (x) = arg max p\
m(x)k
k2G
61
I [yi = k]
the class that is most heavily represented in the rectangle to which x belongs.
The empirical mis-classi…cation rate for this predictor is
err =
N
l
i
1 X h
1 X
I yi 6= f^l (xi ) =
Nm 1
N i=1
N m=1
p\
mk(m)
where Nm = # training input vectors in Rm , and k (m) = arg max pd
mk . Two
k2G
related alternative forms for a training error are "the Gini index"
!
l
K
X
1 X
err =
Nm
pd
pd
mk (1
mk )
N m=1
k=1
and the so-called "cross entropy"
err =
l
1 X
Nm
N m=1
K
X
k=1
!
pd
d
mk ln (p
mk )
Upon adopting one of these forms for err and using it to replace SSE in the
regression tree discussion, one has a classi…cation tree methodology. HTF
suggest using the Gini index or cross entropy for tree growing and any of the
indices (but most typically the empirical mis-classi…cation rate) for tree pruning
according to cost-complexity.
9.1.1 Comments on Forward Selection

The basic formulation of the tree-growing and tree-pruning methods here employs "one-step-at-a-time"/"greedy" (unable to defer immediate reward for the possibility of later success) methods. Like all such methods, they are not guaranteed to follow paths through the set of trees that ever get to "best" ones, since they are "myopic," never considering what might come later in a search if a current step were taken that provides little immediate payoff.

An example (essentially suggested by Mark Culp) dramatically illustrates this limitation. Consider a $p=3$ case where $x_1\in\{0,1\}$, $x_2\in\{0,1\}$, and $x_3\in[0,1]$. In fact, suppose that $x_1$ and $x_2$ are iid Bernoulli$(.5)$ independent of $x_3$ that is Uniform$(0,1)$. Then suppose that conditional on $(x_1,x_2,x_3)$ the output $y$ is $\mathrm{N}\left(\mu(x_1,x_2,x_3),1\right)$ for
$$\mu(x_1,x_2,x_3)=1000\,I\left[x_1=x_2\right]+x_3$$
For a big training sample iid from this joint distribution, all branching will typically be done on the continuous variable $x_3$, completely missing the fundamental fact that it is the (joint) behavior of $(x_1,x_2)$ that drives the size of $y$. (This example also supports the conventional wisdom that, as presented, the splitting algorithm "favors" splitting on continuous variables over splitting on values of discrete ones.)
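A base R sketch (mine) for examining this example empirically: it simulates training data from the structure above and computes the single-split SSE reduction that a greedy first split would achieve on each of $x_1$, $x_2$, and $x_3$, so one can see which variable the root split would actually use. (The sample size and seed are my choices.)

```r
set.seed(602)
N  <- 10000
x1 <- rbinom(N, 1, 0.5); x2 <- rbinom(N, 1, 0.5); x3 <- runif(N)
y  <- rnorm(N, mean = 1000 * (x1 == x2) + x3, sd = 1)

sse <- function(v) sum((v - mean(v))^2)
best_split_reduction <- function(x, y) {
  cuts <- unique(quantile(x, probs = seq(0.05, 0.95, by = 0.05)))
  reds <- sapply(cuts, function(c) {
    left <- x <= c
    if (!any(left) || all(left)) return(0)
    sse(y) - (sse(y[left]) + sse(y[!left]))   # SSE reduction of this split
  })
  max(reds)
}

sapply(list(x1 = x1, x2 = x2, x3 = x3), best_split_reduction, y = y)
```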
9.1.2 Measuring the Importance of Inputs

Consider the matter of assigning measures of "importance" of input variables for a regression tree predictor. In the spirit of ordinary linear models assessment of the importance of a predictor in terms of some reduction it provides in some error sum of squares, Breiman suggested the following. Suppose that in a regression tree, predictor $x_j$ provides the rectangle splitting criterion for nodes $\mathrm{node}_{1j},\ldots,\mathrm{node}_{m(j)j}$ and that before splitting at $\mathrm{node}_{lj}$, the relevant rectangle $R_{\mathrm{node}_{lj}}$ has associated sum of squares
$$\mathrm{SSE}_{\mathrm{node}_{lj}}=\sum_{i\ \text{with}\ x_i\in R_{\mathrm{node}_{lj}}}\left(y_i-\bar{y}_{\mathrm{node}_{lj}}\right)^2$$
and that after splitting $R_{\mathrm{node}_{lj}}$ on variable $x_j$ to create rectangles $R_{\mathrm{node}_{lj}1}$ and $R_{\mathrm{node}_{lj}2}$ one has sums of squares associated with those two rectangles
$$\mathrm{SSE}_{\mathrm{node}_{lj}1}=\sum_{i\ \text{with}\ x_i\in R_{\mathrm{node}_{lj}1}}\left(y_i-\bar{y}_{\mathrm{node}_{lj}1}\right)^2\ \ \text{and}\ \ \mathrm{SSE}_{\mathrm{node}_{lj}2}=\sum_{i\ \text{with}\ x_i\in R_{\mathrm{node}_{lj}2}}\left(y_i-\bar{y}_{\mathrm{node}_{lj}2}\right)^2$$
The reduction in error sum of squares provided by the split on $x_j$ at $\mathrm{node}_{lj}$ is thus
$$D\left(\mathrm{node}_{lj}\right)=\mathrm{SSE}_{\mathrm{node}_{lj}}-\left(\mathrm{SSE}_{\mathrm{node}_{lj}1}+\mathrm{SSE}_{\mathrm{node}_{lj}2}\right)$$
One might then call
$$I_j=\sum_{l=1}^{m(j)}D\left(\mathrm{node}_{lj}\right)$$
a measure of the importance of $x_j$ in fitting the tree and compare the various $I_j$'s (or perhaps their square roots, $\sqrt{I_j}$'s).

Further, if a predictor is a (weighted) sum of regression trees (e.g. produced by "boosting" or in a "random forest") and $I_{jm}$ measures the importance of $x_j$ in the $m$th tree, then
$$I_{j\cdot}=\frac{1}{M}\sum_{m=1}^M I_{jm}$$
is perhaps a sensible measure of the importance of $x_j$ in the overall predictor. One can then compare the various $I_{j\cdot}$ (or square roots) as a means of comparing the importance of the input variables.
9.2 Random Forests

This is a version of the "bagging" (bootstrap aggregation) idea of Section 10.1.1 applied specifically to (regression and classification) trees. Suppose that one makes $B$ bootstrap samples of $N$ (random samples with replacement of size $N$) from the training set $T$, say $T^1,T^2,\ldots,T^B$. For each one of these samples, $T^b$, develop a corresponding regression or classification tree by

1. at each node, randomly selecting $m$ of the $p$ input variables and finding an optimal single split of the corresponding rectangle over the selected input variables, splitting the rectangle, and

2. repeating 1. at each node until a small maximum number of training cases, $n_{\max}$, is reached in each rectangle.

(Note that no pruning is applied in this development.) Then let $\hat{f}^b(x)$ be the corresponding tree-based predictor (taking values in $\Re$ in the regression case or in $G=\{1,2,\ldots,K\}$ in the classification case). The random forest predictor is then
$$\hat{f}_B(x)=\frac{1}{B}\sum_{b=1}^B\hat{f}^b(x)$$
in the regression case and
$$\hat{f}_B(x)=\arg\max_k\sum_{b=1}^B I\left[\hat{f}^b(x)=k\right]$$
in the classification case.

The basic tuning parameters in the development of $\hat{f}_B(x)$ are then $m$ and $n_{\max}$. HTF suggest default values of these parameters

$m=\lfloor p/3\rfloor$ and $n_{\max}=5$ for regression problems, and

$m=\lfloor\sqrt{p}\rfloor$ and $n_{\max}=1$ for classification problems.

(The $n_{\max}=1$ default in classification problems is equivalent to splitting rectangles in the input space $\Re^p$ until all rectangles are completely homogeneous in terms of output values, $y$.)
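A hedged illustration (mine, not part of the outline) using the contributed R package randomForest; its mtry and nodesize arguments play the roles of $m$ and $n_{\max}$ above, and ntree plays the role of $B$. The simulated data are my own.

```r
library(randomForest)

set.seed(602)
N <- 200; p <- 6
X <- as.data.frame(matrix(runif(N * p), N, p))
y <- sin(2 * pi * X[, 1]) + X[, 2]^2 + rnorm(N, sd = 0.2)

rf <- randomForest(x = X, y = y,
                   ntree = 500,
                   mtry = floor(p / 3),   # m for regression
                   nodesize = 5)          # n_max for regression
rf                                        # printed summary includes OOB-based error
```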
It is common practice to make a kind of running cross-validation estimate of error as one builds a random forest predictor based on "out-of-bag" (OOB) samples. That is, suppose that for each $b$ I keep track of the set of (OOB) indices $I(b)\subset\{1,2,\ldots,N\}$ for which the corresponding training vector does not get included in the bootstrap training set $T^b$, and let
$$\hat{y}_i^B=\frac{1}{\#\ \text{of indices}\ b\leq B\ \text{such that}\ i\in I(b)}\sum_{b\leq B\ \text{such that}\ i\in I(b)}\hat{f}^b(x_i)$$
in regression contexts and
$$\hat{y}_i^B=\arg\max_k\sum_{b\leq B\ \text{such that}\ i\in I(b)}I\left[\hat{f}^b(x_i)=k\right]$$
in classification problems. Then a running cross-validation type of estimate of Err is
$$\mathrm{OOB}\left(\hat{f}_B\right)=\frac{1}{N}\sum_{i=1}^N\left(y_i-\hat{y}_i^B\right)^2$$
in SEL regression contexts and
$$\mathrm{OOB}\left(\hat{f}_B\right)=\frac{1}{N}\sum_{i=1}^N I\left[y_i\neq\hat{y}_i^B\right]$$
in 0-1 loss classification contexts.
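A base R skeleton (mine) for the running OOB computation above in the SEL regression case. The functions fit_fn and pred_fn are hypothetical placeholders for whatever tree-growing (or other) routine is being bagged; they are not part of the outline.

```r
oob_error <- function(X, y, B, fit_fn, pred_fn) {
  N <- nrow(X)
  oob_sum <- numeric(N)   # running sums of OOB predictions for each case
  oob_cnt <- numeric(N)   # number of times each case has been OOB
  for (b in 1:B) {
    idx <- sample(N, N, replace = TRUE)          # bootstrap sample T^b
    oob <- setdiff(1:N, idx)                     # the index set I(b)
    fit <- fit_fn(X[idx, , drop = FALSE], y[idx])
    oob_sum[oob] <- oob_sum[oob] + pred_fn(fit, X[oob, , drop = FALSE])
    oob_cnt[oob] <- oob_cnt[oob] + 1
  }
  y_hat <- ifelse(oob_cnt > 0, oob_sum / oob_cnt, NA)   # the \hat{y}_i^B
  mean((y - y_hat)^2, na.rm = TRUE)                     # OOB(\hat{f}_B)
}
```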
Notice that (as in any bagging context, and) as indicated in Section 10.1.1, one expects LLN type behavior and convergence of the (random) predictor $\hat{f}_B$ to some fixed (for fixed training set, $T$) predictor, say $\hat{f}_{\mathrm{rf}}$, that is
$$\hat{f}_B(x)\underset{B\rightarrow\infty}{\longrightarrow}\hat{f}_{\mathrm{rf}}(x)\ \ \text{for all}\ x$$
It is not too surprising that one then also expects the convergence of $\mathrm{OOB}(\hat{f}_B)$, and that plotting of $\mathrm{OOB}(\hat{f}_B)$ versus $B$ is a standard way of trying to assess whether enough bootstrap trees have been made to adequately represent the forest and predictor $\hat{f}_{\mathrm{rf}}$. In spite of the fact that for small $B$ the (random) predictor $\hat{f}_B$ is built on a small number of trees and is fairly simple, $B$ is not really a complexity parameter, but is rather a convergence parameter.
There is a fair amount of confusing discussion in the literature about the impossibility of a random forest "over-fitting" with increasing $B$. This seems to be related to test error not initially-decreasing-but-then-increasing-in-$B$ (which is perhaps loosely related to $\mathrm{OOB}(\hat{f}_B)$ converging to a positive value associated with the limiting predictor $\hat{f}_{\mathrm{rf}}$, and not to 0). But as HTF point out on their page 596, it is an entirely different question as to whether $\hat{f}_{\mathrm{rf}}$ itself is "too complex" to be adequately supported by the training data, $T$. (And the whole discussion seems very odd in light of the fact that for any finite $B$, a different choice of bootstrap samples would produce a different $\hat{f}_B$ as a new randomized approximation to $\hat{f}_{\mathrm{rf}}$. Even for fixed $x$, the value $\hat{f}_B(x)$ is a random variable. Only $\hat{f}_{\mathrm{rf}}(x)$ is fixed.)

There is also a fair amount of confusing discussion in the literature about the role of the random selection of the $m$ predictors to use at each node-splitting (and the choice of $m$) in reducing "correlation between trees in the forest." The Breiman/Cutler web site http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm says that the "forest error rate" (presumably the error rate for $\hat{f}_{\mathrm{rf}}$) depends upon "the correlation between any two trees in the forest" and the "strength of each tree in the forest." The meaning of these is not clear. One possible meaning of the first is some version of correlation between values of $\hat{f}^1(x)$ and $\hat{f}^2(x)$ as one repeatedly selects the whole training set $T$ in iid fashion from $P$ and then makes two bootstrap samples; Section 15.4 of HTF seems to use this meaning.5 A meaning of the second is presumably some measure of average effectiveness of a single $\hat{f}^b$. HTF Section 15.4 goes on to suggest that increasing $m$ increases both "correlation" and "strength" of the trees, the first degrading error rate and the second improving it, and that the OOB estimate of error can be used to guide choice of $m$ (usually in a broad range of values that are about equally attractive) if something besides the default is to be used.

5 A second possibility concerns "bootstrap randomization distribution" correlation (for a fixed training set and a fixed $x$) between values $\hat{f}^1(x)$ and $\hat{f}^2(x)$.
9.3 PRIM (Patient Rule Induction Method)

This is another rectangle-based method of making a predictor on $\Re^p$. (The language seems to be "patient" as opposed to "rash" and "rule induction" as in "predictor development.") PRIM6 could be termed a type of "bump-hunting." For a series of rectangles (or boxes) in $p$-space
$$R_1,R_2,\ldots,R_l$$
one defines a predictor
$$\hat{f}_l(x)=\begin{cases}
\bar{y}_{R_1} & \text{if}\ x\in R_1\\
\bar{y}_{R_2\setminus R_1} & \text{if}\ x\in R_2\setminus R_1\\
\quad\vdots & \quad\vdots\\
\bar{y}_{R_m\setminus\cup_{k=1}^{m-1}R_k} & \text{if}\ x\in R_m\setminus\cup_{k=1}^{m-1}R_k\\
\quad\vdots & \quad\vdots\\
\bar{y}_{\left(\cup_{k=1}^{l}R_k\right)^c} & \text{if}\ x\notin\cup_{k=1}^{l}R_k
\end{cases}$$

6 The language seems to be "patient" as opposed to "rash." "Rule induction" is perhaps "predictor development" or more likely "conjunctive rule development" from the context of "market basket analysis." See Section 17.1 in regard to this latter usage.
The boxes or rectangles are defined recursively in a way intended to catch "the remaining part of the input space with the largest output values." That is, to find $R_1$,

1. identify a rectangle
$$l_1\leq x_1\leq u_1,\ \ l_2\leq x_2\leq u_2,\ \ \ldots,\ \ l_p\leq x_p\leq u_p$$
that includes all input vectors in the training set,

2. identify a dimension, $j$, and either $l_j$ or $u_j$ so that by reducing $u_j$ or increasing $l_j$ just enough to remove a fraction $\alpha$ (say $\alpha=.1$) of the training vectors currently in the rectangle, the largest value of $\bar{y}_{\mathrm{rectangle}}$ possible is produced, and update that boundary of the rectangle,

3. repeat 2. until some minimum number of training inputs $x_i$ remain in the rectangle (say, at least 10),

4. expand the rectangle in any direction (increase a $u$ or decrease an $l$) adding a training input vector that provides a maximal increase in $\bar{y}_{\mathrm{rectangle}}$, and

5. repeat 4. until no increase is possible by adding a single training input vector.

This produces $R_1$. For what it is worth, step 2. is called "peeling" and step 4. is called "pasting"; a minimal sketch of peeling is given at the end of this subsection.

Upon producing $R_1$, one removes from consideration all training vectors with $x_i\in R_1$ and repeats 1. through 5. to produce $R_2$. This continues until a desired number of rectangles has been created. One may pick an appropriate number of rectangles ($l$ is a complexity parameter) by cross-validation and then apply the procedure to the whole training set to produce a set of rectangles and a predictor on $p$-space that is piece-wise constant on regions built from boolean operations on rectangles.
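The promised sketch of the peeling step, in base R (mine, not from the outline): starting from the full-data box, repeatedly trim a fraction $\alpha$ off one end of one coordinate, choosing the trim that leaves the largest mean response, until at most n_min training cases remain in the box. The defaults mirror the suggestions above.

```r
prim_peel <- function(X, y, alpha = 0.1, n_min = 10) {
  inside <- rep(TRUE, nrow(X))
  lower <- apply(X, 2, min); upper <- apply(X, 2, max)     # step 1: the full box
  while (sum(inside) > n_min) {
    best <- NULL
    for (j in seq_len(ncol(X))) {
      xj <- X[inside, j]
      for (side in c("low", "high")) {
        cut  <- if (side == "low") quantile(xj, alpha) else quantile(xj, 1 - alpha)
        keep <- inside & (if (side == "low") X[, j] >= cut else X[, j] <= cut)
        if (sum(keep) >= 1 && (is.null(best) || mean(y[keep]) > best$mean))
          best <- list(j = j, side = side, cut = cut, keep = keep, mean = mean(y[keep]))
      }
    }
    inside <- best$keep
    if (best$side == "low") lower[best$j] <- best$cut else upper[best$j] <- best$cut
  }
  list(lower = lower, upper = upper, box_mean = mean(y[inside]), n = sum(inside))
}
```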
10 Predictors Built on Bootstrap Samples, Model Averaging, and Stacking

10.1 Predictors Built on Bootstrap Samples

One might make $B$ bootstrap samples of $N$ (random samples with replacement of size $N$) from the training set $T$, say $T^1,T^2,\ldots,T^B$, and train on these bootstrap samples to produce, say,
$$\text{predictor}\ \hat{f}^b\ \text{based on}\ T^b$$
Now (rather than using these to estimate the prediction error as in Section 12.4) consider a couple of ways of using these to build a predictor (under squared error loss).

10.1.1 Bagging

The possibility considered in Section 8.7 of HTF is "bootstrap aggregation," or "bagging." This is use of the predictor
$$\hat{f}_{\mathrm{bag}}(x)\equiv\frac{1}{B}\sum_{b=1}^B\hat{f}^b(x)$$
Notice that even for fixed training set $T$ and input $x$, this is random (varying with the selection of the bootstrap samples). One might let $E^*$ denote averaging over the creation of a single bootstrap sample and $\hat{f}^*$ be the predictor derived from such a bootstrap sample and think of
$$E^*\hat{f}^*(x)$$
as the "true" bagging predictor (that has the simulation-based approximation $\hat{f}_{\mathrm{bag}}(x)$). One is counting on a law of large numbers to conclude that $\hat{f}_{\mathrm{bag}}(x)\rightarrow E^*\hat{f}^*(x)$ as $B\rightarrow\infty$. Note too, that unless the operations applied to a training set to produce $\hat{f}$ are linear, $E^*\hat{f}^*(x)$ will differ from the predictor computed from the training data, $\hat{f}(x)$.

Where a loss other than squared error is involved, exactly how to "bag" bootstrapped versions of a predictor is not altogether obvious, and apparently even what might look like sensible possibilities can do poorly.
10.1.2 Bumping

Another/different thing one might do with bootstrap versions of a predictor is to "pick a winner" based on performance on the training data. This is the "bumping"/stochastic perturbation idea of HTF's Section 8.9. That is, let $\hat{f}^0=\hat{f}$ be the predictor computed from the training data, and define
$$\hat{b}=\arg\min_{b=0,1,\ldots,B}\sum_{i=1}^N\left(y_i-\hat{f}^b(x_i)\right)^2$$
and take
$$\hat{f}_{\mathrm{bump}}(x)=\hat{f}^{\hat{b}}(x)$$
The idea here is that if a few cases in the training data are responsible for making a basically good method of predictor construction perform poorly, eventually a bootstrap sample will miss those cases and produce an effective predictor.
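A compact base R sketch (mine) of bumping. Purely for illustration, each candidate predictor is an ordinary least squares line; any method of predictor construction could be substituted. (For OLS the original fit already minimizes training SSE, so $\hat{b}=0$ here; bumping is interesting for greedier, less stable fitting methods such as trees.)

```r
set.seed(602)
N <- 100
x <- runif(N); y <- 2 * x + rnorm(N, sd = 0.3)
B <- 25

fits <- vector("list", B + 1)
fits[[1]] <- lm(y ~ x)                          # b = 0: fit to the training data itself
for (b in 1:B) {
  idx <- sample(N, N, replace = TRUE)           # bootstrap sample T^b
  fits[[b + 1]] <- lm(y[idx] ~ x[idx])
}

# training-set SSE of each candidate; pick the winner
train_sse <- sapply(fits, function(f) sum((y - coef(f)[1] - coef(f)[2] * x)^2))
b_hat <- which.min(train_sse) - 1               # 0 means the original fit wins
b_hat
```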
10.2 Model Averaging and Stacking

The bagging idea averages versions of a single predictor computed from different bootstrap samples. An alternative might be to somehow weight together different predictors (potentially even based on different models).

10.2.1 Bayesian Model Averaging

One theoretically straightforward way to justify this kind of enterprise and identify candidate predictors is through the Bayes "multiple model" scenario (also used in Section 12.2.2).
That is, suppose that $M$ models are under consideration, the $m$th of which has parameter vector $\theta_m$ and corresponding density for training data
$$f_m\left(T|\theta_m\right)$$
with prior density for $\theta_m$
$$g_m\left(\theta_m\right)$$
and prior probability for model $m$
$$\pi(m)$$
The posterior distribution of the model index is
$$\pi(m|T)\propto\pi(m)\int f_m\left(T|\theta_m\right)g_m\left(\theta_m\right)d\theta_m$$
i.e. is specified by
$$\pi(m|T)=\frac{\pi(m)\int f_m\left(T|\theta_m\right)g_m\left(\theta_m\right)d\theta_m}{\sum_{m'=1}^M\pi(m')\int f_{m'}\left(T|\theta_{m'}\right)g_{m'}\left(\theta_{m'}\right)d\theta_{m'}}$$
Now suppose that for each model there is a quantity $\gamma_m\left(\theta_m\right)$ that is of interest, these having a common interpretation across models. Notice that under the current assumption that $P$ depends upon the parameter $\theta_m$ in model $m$, $\gamma_m\left(\theta_m\right)$ could be the conditional mean of $y$ given the value of $x$. That is, one important instance of this structure is $\gamma_m\left(\theta_m;u\right)=E_{\theta_m}\left[y|x=u\right]$. Since in model $m$ the density of $\theta_m$ conditional on $T$ is
$$g\left(\theta_m|T\right)=\frac{f_m\left(T|\theta_m\right)g_m\left(\theta_m\right)}{\int f_m\left(T|\theta_m\right)g_m\left(\theta_m\right)d\theta_m}$$
the conditional mean of $\gamma_m\left(\theta_m\right)$ (in model $m$) is
$$E\left[\gamma_m\left(\theta_m\right)|m,T\right]=\frac{\int\gamma_m\left(\theta_m\right)f_m\left(T|\theta_m\right)g_m\left(\theta_m\right)d\theta_m}{\int f_m\left(T|\theta_m\right)g_m\left(\theta_m\right)d\theta_m}$$
and a sensible predictor of $\gamma_m\left(\theta_m\right)$ is
$$E\left[\gamma\left(\theta_m\right)|T\right]=\sum_{m=1}^M\pi(m|T)\,E\left[\gamma_m\left(\theta_m\right)|m,T\right]$$
a (model posterior probability) weighted average of the predictors $E\left[\gamma_m\left(\theta_m\right)|m,T\right]$ individually appropriate under the $M$ different models. In the important special case where $\gamma_m\left(\theta_m;u\right)=E_{\theta_m}\left[y|x=u\right]$, this is
$$E\left[\gamma_m\left(\theta_m;u\right)|T\right]=\sum_{m=1}^M\pi(m|T)\,E\left[E_{\theta_m}\left[y|x=u\right]|m,T\right]$$

10.2.2 Stacking
What might be suggested like this, but with a less Bayesian flavor? Suppose that $M$ predictors are available (all based on the same training data), $\hat{f}_1,\hat{f}_2,\ldots,\hat{f}_M$. Under squared error loss, I might seek a weight vector $w$ for which the predictor
$$\hat{f}(x)=\sum_{m=1}^M w_m\hat{f}_m(x)$$
is effective. I might even let $w$ depend upon $x$, producing
$$\hat{f}(x)=\sum_{m=1}^M w_m(x)\hat{f}_m(x)$$
HTF pose the problem of perhaps finding
$$\hat{w}=\arg\min_w E_T E_{(x,y)}\left(y-\sum_{m=1}^M w_m\hat{f}_m(x)\right)^2 \tag{59}$$
or even
$$\hat{w}(u)=\arg\min_w E_T E\left[\left(y-\sum_{m=1}^M w_m\hat{f}_m(x)\right)^2\Bigg|\,x=u\right]$$
Why might this program improve on any single one of the $\hat{f}_m$'s? In some sense, this is "obvious" since the $w$ over which minimization (59) is done include vectors with one entry 1 and all others 0. But to indicate in a concrete setting why this might work, consider a case where $M=2$ and according to the $P^N\times P$ joint distribution of $(T,(x,y))$
$$E\left(y-\hat{f}_1(x)\right)=0\ \ \text{and}\ \ E\left(y-\hat{f}_2(x)\right)=0$$
Define
$$\hat{f}=\alpha\hat{f}_1+(1-\alpha)\hat{f}_2$$
Then
$$E\left(y-\hat{f}(x)\right)^2=E\left(\alpha\left(y-\hat{f}_1(x)\right)+(1-\alpha)\left(y-\hat{f}_2(x)\right)\right)^2=\mathrm{Var}\left(\alpha\left(y-\hat{f}_1(x)\right)+(1-\alpha)\left(y-\hat{f}_2(x)\right)\right)$$
$$=\left(\alpha,1-\alpha\right)\mathrm{Cov}\left(\begin{pmatrix}y-\hat{f}_1(x)\\ y-\hat{f}_2(x)\end{pmatrix}\right)\begin{pmatrix}\alpha\\ 1-\alpha\end{pmatrix}$$
This is a quadratic function of $\alpha$, that (since covariance matrices are non-negative definite) has a minimum. Thus there is a minimizing $\alpha$ that typically (is not 0 or 1 and thus) produces better expected loss than either $\hat{f}_1(x)$ or $\hat{f}_2(x)$.

More generally, using either the $P^N\times P$ joint distribution of $(T,(x,y))$ or the $P^N\times P_{y|x=u}$ conditional distribution of $(T,y)$ given $x=u$, one may consider the random vector $\left(\hat{f}_1(x),\hat{f}_2(x),\ldots,\hat{f}_M(x),y\right)'$ or the random vector $\left(\hat{f}_1(u),\hat{f}_2(u),\ldots,\hat{f}_M(u),y\right)'$. Let $\left(\hat{f},y\right)$ stand for either one and
$$E\hat{f}\hat{f}'\ \ \left(M\times M\right)\ \ \text{and}\ \ Ey\hat{f}\ \ \left(M\times 1\right)$$
be respectively the matrix of expected products of the predictions and vector of expected products of $y$ and elements of $\hat{f}$. Upon writing out the expected square to be minimized and doing some matrix calculus, it's possible to see that optimal weights are of the form
$$\hat{w}=\left(E\hat{f}\hat{f}'\right)^{-1}Ey\hat{f}$$
(or $\hat{w}(x)$ is of this form). (We observe that HTF's notation in (8.57) could be interpreted to incorrectly place the inverse sign inside the expectation. They call this "the population linear regression of $y$ on $\hat{f}$.") Of course, none of this is usable in practice, as typically the mean vector and expected cross product matrix are unknown. And apparently, the simplest empirical approximations to these do not work well.

What, instead, HTF seem to recommend is to define
$$w^{\mathrm{stack}}=\arg\min_w\sum_{i=1}^N\left(y_i-\sum_{m=1}^M w_m\hat{f}_m^{-i}(x_i)\right)^2$$
for $\hat{f}_m^{-i}$ the $m$th predictor fit to $T-\{(x_i,y_i)\}$, the training set with the $i$th case removed. ($w^{\mathrm{stack}}$ optimizes a kind of leave-one-out cross-validation error for a linear combination of $\hat{f}_m$'s.) Then ultimately the "stacked" predictor of $y$ is
$$\hat{f}(x)=\sum_{m=1}^M w_m^{\mathrm{stack}}\hat{f}_m(x)$$
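A base R sketch (mine, not from the outline) of computing stacking weights by the leave-one-out recipe above, for two illustrative predictors: a straight-line fit and a quadratic fit. The simulated data and choice of component predictors are my own.

```r
set.seed(602)
N <- 60
x <- runif(N); y <- sin(2 * pi * x) + rnorm(N, sd = 0.3)
dat <- data.frame(x = x, y = y)

loo_preds <- function(formula, data) {
  sapply(1:nrow(data), function(i) {
    fit <- lm(formula, data = data[-i, ])            # fit with case i removed
    predict(fit, newdata = data[i, , drop = FALSE])
  })
}

F_loo <- cbind(f1 = loo_preds(y ~ x, dat),
               f2 = loo_preds(y ~ poly(x, 2), dat))

# w^stack minimizes sum_i (y_i - sum_m w_m fhat_m^{-i}(x_i))^2 :
# a least squares problem with no intercept
w_stack <- coef(lm(y ~ F_loo - 1))
w_stack
```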
11 Reproducing Kernel Hilbert Spaces: Penalized/Regularized and Bayes Prediction

A general framework that unifies many interesting regularized fitting methods is that of reproducing kernel Hilbert spaces. There is a very nice 2012 Statistics Surveys paper by Nancy Heckman (related to an older UBC Statistics Department technical report (#216)) entitled "The Theory and Application of Penalized Least Squares Methods or Reproducing Kernel Hilbert Spaces Made Easy" that is the nicest exposition I know about of the connection of this material to splines. Parts of what follows are borrowed shamelessly from her paper. There is also some very helpful stuff in CFZ Section 3.5 and scattered through Izenman about RKHSs.7

7 See also "Penalized Splines and Reproducing Kernels" by Pearce and Wand in The American Statistician (2006) and "Kernel Methods in Machine Learning" by Hofmann, Schölkopf, and Smola in The Annals of Statistics (2008).
11.1 RKHS's and p = 1 Cubic Smoothing Splines

To provide motivation for a somewhat more general discussion, consider again the smoothing spline problem. We consider here the function space
$$\mathcal{H}=\left\{h:[0,1]\rightarrow\Re\ \Big|\ h\ \text{and}\ h'\ \text{are absolutely continuous and}\ \int_0^1\left(h''(x)\right)^2dx<\infty\right\}$$
as a Hilbert space (a linear space with inner product where Cauchy sequences have limits) with inner product
$$\langle f,g\rangle_{\mathcal{H}}\equiv f(0)g(0)+f'(0)g'(0)+\int_0^1 f''(x)g''(x)\,dx$$
(and corresponding norm $\|h\|_{\mathcal{H}}=\langle h,h\rangle_{\mathcal{H}}^{1/2}$). With this definition of inner product and norm, for $x\in[0,1]$ the (linear) functional (a mapping $\mathcal{H}\rightarrow\Re$)
$$F_x[f]\equiv f(x)$$
is continuous. Thus the so-called "Riesz representation theorem" says that there is an $R_x\in\mathcal{H}$ such that
$$F_x[f]=\langle R_x,f\rangle_{\mathcal{H}}=f(x)\ \ \forall f\in\mathcal{H}$$
($R_x$ is called the representer of evaluation at $x$.) It is in fact possible to verify that for $z\in[0,1]$
$$R_x(z)=1+xz+R_{1x}(z)$$
for
$$R_{1x}(z)=xz\min(x,z)-\frac{x+z}{2}\left(\min(x,z)\right)^2+\frac{\left(\min(x,z)\right)^3}{3}$$
does this job.

Then the function of two variables defined by
$$R(x,z)\equiv R_x(z)$$
is called the reproducing kernel for $\mathcal{H}$, and $\mathcal{H}$ is called a reproducing kernel Hilbert space (RKHS) (of functions).

Define a linear differential operator on elements of $\mathcal{H}$ (a map from $\mathcal{H}$ to $\mathcal{H}$) by
$$L[f](x)=f''(x)$$
Then the optimization problem solved by the cubic smoothing spline is minimization of
$$\sum_{i=1}^N\left(y_i-F_{x_i}[h]\right)^2+\lambda\int_0^1\left(L[h](x)\right)^2dx \tag{60}$$
over choices of $h\in\mathcal{H}$.

It is possible to show that the minimizer of the quantity (60) is necessarily of the form
$$h(x)=\beta_0+\beta_1x+\sum_{i=1}^N\alpha_iR_{1x_i}(x)$$
and that for such $h$, the criterion (60) is of the form
$$\left(Y-T\beta-K\alpha\right)'\left(Y-T\beta-K\alpha\right)+\lambda\,\alpha'K\alpha$$
for
$$T_{N\times 2}=(1,X)\,,\ \ \beta=\begin{pmatrix}\beta_0\\ \beta_1\end{pmatrix},\ \ \text{and}\ \ K=\left(R_{1x_i}(x_j)\right)$$
So this has ultimately produced a matrix calculus problem.
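As a hedged illustration (mine), R's built-in stats::smooth.spline fits the cubic smoothing spline that solves a criterion of the form (60); here the amount of smoothing is chosen by generalized cross-validation. The simulated data are my own.

```r
set.seed(602)
x <- sort(runif(100)); y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)

fit <- smooth.spline(x, y, cv = FALSE)   # cv = FALSE selects the penalty by GCV
plot(x, y)
lines(predict(fit, seq(0, 1, length = 200)), col = "red")
```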
11.2 What Seems to be Possible Beginning from Linear Functionals and Linear Differential Operators for p = 1

For constants $d_i>0$, functionals $F_i$, and a linear differential operator $L$ defined for continuous functions $w_k(x)$ by
$$L[h](x)=h^{(m)}(x)+\sum_{k=1}^{m-1}w_k(x)h^{(k)}(x)$$
Heckman considers the minimization of
$$\sum_{i=1}^N d_i\left(y_i-F_i[h]\right)^2+\lambda\int_a^b\left(L[h](x)\right)^2dx \tag{61}$$
in the space of functions
$$\mathcal{H}=\left\{h:[a,b]\rightarrow\Re\ \Big|\ \text{derivatives of}\ h\ \text{up to order}\ m-1\ \text{are absolutely continuous and}\ \int_a^b\left(h^{(m)}(x)\right)^2dx<\infty\right\}$$
One may adopt an inner product for $\mathcal{H}$ of the form
$$\langle f,g\rangle_{\mathcal{H}}\equiv\sum_{k=0}^{m-1}f^{(k)}(a)g^{(k)}(a)+\int_a^b L[f](x)L[g](x)\,dx$$
and have a RKHS. The assumption is made that the functionals $F_i$ are continuous and linear, and thus that they are representable as $F_i[h]=\langle f_i,h\rangle_{\mathcal{H}}$ for some $f_i\in\mathcal{H}$. An important special case is that where $F_i[h]=h(x_i)$, but other linear functionals have been used, for example $F_i[h]=\int_a^b H_i(x)h(x)\,dx$ for known $H_i$.

The form of the reproducing kernel implied by the choice of this inner product is derivable as follows. First, there is a linearly independent set of functions $\{u_1,\ldots,u_m\}$ that is a basis for the subspace of $\mathcal{H}$ consisting of those elements $h$ for which $L[h]=0$ (the zero function). Call this subspace $\mathcal{H}_0$. The so-called Wronskian matrix associated with these functions is then
$$W(x)=\left(u_i^{(j-1)}(x)\right)_{m\times m}$$
With
$$C=W(a)W(a)'$$
let
$$R_0(x,z)=\sum_{i,j}C^{-1}_{ij}u_i(x)u_j(z)$$
Further, there is a so-called Green's function associated with the operator $L$, a function $G(x,z)$ such that for all $h\in\mathcal{H}$ satisfying $h^{(k)}(a)=0$ for $k=0,1,\ldots,m-1$,
$$h(x)=\int_a^b G(x,z)L[h](z)\,dz$$
Let
$$R_1(x,z)=\int_a^b G(x,t)G(z,t)\,dt$$
The reproducing kernel associated with the inner product and $L$ is then
$$R(x,z)=R_0(x,z)+R_1(x,z)$$
As it turns out, $\mathcal{H}_0$ is a RKHS with reproducing kernel $R_0$ under the inner product $\langle f,g\rangle_0=\sum_{k=0}^{m-1}f^{(k)}(a)g^{(k)}(a)$. Further, the subspace of $\mathcal{H}$ consisting of those $h$ with $h^{(k)}(a)=0$ for $k=0,1,\ldots,m-1$ (call it $\mathcal{H}_1$) is a RKHS with reproducing kernel $R_1(x,z)$ under the inner product $\langle f,g\rangle_1=\int_a^b L[f](x)L[g](x)\,dx$. Every element of $\mathcal{H}_0$ is $\perp$ to every element of $\mathcal{H}_1$ in $\mathcal{H}$ and every $h\in\mathcal{H}$ can be written uniquely as $h_0+h_1$ for an $h_0\in\mathcal{H}_0$ and an $h_1\in\mathcal{H}_1$.

These facts can be used to show that a minimizer of (61) exists and is of the form
$$h(x)=\sum_{k=1}^m\beta_ku_k(x)+\sum_{i=1}^N\alpha_iR_1(x_i,x)$$
For $h(x)$ of this form, (61) is of the form
$$\left(Y-T\beta-K\alpha\right)'D\left(Y-T\beta-K\alpha\right)+\lambda\,\alpha'K\alpha$$
for
$$T=\left(F_i[u_j]\right),\ \ \beta=\begin{pmatrix}\beta_1\\ \beta_2\\ \vdots\\ \beta_m\end{pmatrix},\ \ D=\mathrm{diag}\left(d_1,\ldots,d_N\right),\ \ \text{and}\ \ K=\left(F_i\left[R_1(\cdot,x_j)\right]\right)$$
and its optimization is a matrix calculus problem. In the important special case where the $F_i$ are function evaluation at $x_i$, above
$$F_i[u_j]=u_j(x_i)\ \ \text{and}\ \ F_i\left[R_1(\cdot,x_j)\right]=R_1(x_i,x_j)$$
11.3 What Is Common Beginning Directly From a Kernel

Another way that it is common to make use of RKHSs is to begin with a kernel and its implied inner product (rather than deriving one as appropriate to a particular well-formed optimization problem in function spaces). This seems less satisfying/principled and more arbitrary/mechanical than developments beginning from linear functionals and differential operators, but is nevertheless popular. The 2006 paper of Pearce and Wand in The American Statistician is a readable account of this thinking that is more complete than what follows here (and provides some references).

This development begins with $\mathcal{C}$ a compact subset of $\Re^p$ and a symmetric kernel function
$$K:\mathcal{C}\times\mathcal{C}\rightarrow\Re$$
Ultimately, we will consider as predictors for $x\in\mathcal{C}$ linear combinations of sections of the kernel function, $\sum_{i=1}^N b_iK(x,x_i)$ (where the $x_i$ are the input vectors in the training set). But to get there in a rational way, and to incorporate use of a complexity penalty into the fitting, we will restrict attention to those kernels that have nice properties. In particular, we require that $K$ be continuous and non-negative definite. We mean by this latter that for any set of $M$ elements of $\Re^p$, say $\{z_j\}$, the $M\times M$ matrix
$$\left(K(z_i,z_j)\right)$$
is non-negative definite.

Then according to what is known as Mercer's Theorem (look it up in Wikipedia) $K$ may then be written in the form
$$K(z,x)=\sum_{i=1}^{\infty}\gamma_i\phi_i(z)\phi_i(x) \tag{62}$$
for linearly independent (in $L_2(\mathcal{C})$) functions $\{\phi_i\}$, and constants $\gamma_i\geq 0$, where the $\phi_i$ comprise an orthonormal basis for $L_2(\mathcal{C})$ (so any function $f\in L_2(\mathcal{C})$ has an expansion in terms of the $\phi_i$ as $\sum_{i=1}^{\infty}\langle f,\phi_i\rangle_2\,\phi_i$ for $\langle f,h\rangle_2\equiv\int_{\mathcal{C}}f(x)h(x)\,dx$), and the ones corresponding to positive $\gamma_i$ may be taken to be continuous on $\mathcal{C}$. Further, the $\phi_i$ are "eigenfunctions" of the kernel $K$ corresponding to the "eigenvalues $\gamma_i$" in the sense that in $L_2(\mathcal{C})$
$$\int\phi_i(z)K(z,\cdot)\,dz=\gamma_i\phi_i(\cdot)$$
Our present interest is in a function space $\mathcal{H}_K$ (that is a subset of $L_2(\mathcal{C})$ with a different inner product and norm) with members of the form
$$f(x)=\sum_{i=1}^{\infty}c_i\phi_i(x)\ \ \text{for}\ c_i\ \text{with}\ \sum_{i=1}^{\infty}\frac{c_i^2}{\gamma_i}<\infty \tag{63}$$
(called the "primal form" of functions in the space). (Notice that all elements of $L_2(\mathcal{C})$ are of the form $f(x)=\sum_{i=1}^{\infty}c_i\phi_i(x)$ with $\sum_{i=1}^{\infty}c_i^2<\infty$.) More naturally, our interest centers on
$$f(x)=\sum_{i=1}^{\infty}b_iK(x,z_i)\ \ \text{for some set of}\ z_i \tag{64}$$
supposing that the series converges appropriately (called the "dual form" of functions in the space). The former is most useful for producing simple proofs, while the second is most natural for application, since how to obtain the $\phi_i$ and corresponding $\gamma_i$ for a given $K$ is not so obvious. Notice that
$$K(z,x)=\sum_{i=1}^{\infty}\gamma_i\phi_i(z)\phi_i(x)=\sum_{i=1}^{\infty}\left(\gamma_i\phi_i(x)\right)\phi_i(z)$$
and letting $\gamma_i\phi_i(x)=c_i(x)$, since $\sum_{i=1}^{\infty}c_i^2(x)/\gamma_i=\sum_{i=1}^{\infty}\gamma_i\phi_i^2(x)=K(x,x)<\infty$, the function $K(\cdot,x)$ is of the form (63), so that we can expect functions of the form (64) with absolutely convergent $\sum_{i=1}^{\infty}b_i$ to be of form (63).

In the space of functions (63), we define an inner product (for our Hilbert space)
$$\left\langle\sum_{i=1}^{\infty}c_i\phi_i,\sum_{i=1}^{\infty}d_i\phi_i\right\rangle_{\mathcal{H}_K}\equiv\sum_{i=1}^{\infty}\frac{c_id_i}{\gamma_i}$$
so that
$$\left\|\sum_{i=1}^{\infty}c_i\phi_i\right\|^2_{\mathcal{H}_K}=\sum_{i=1}^{\infty}\frac{c_i^2}{\gamma_i}$$
Note then that for $f=\sum_{i=1}^{\infty}c_i\phi_i$ belonging to the Hilbert space $\mathcal{H}_K$,
$$\langle f,K(\cdot,x)\rangle_{\mathcal{H}_K}=\sum_{i=1}^{\infty}\frac{c_i\,\gamma_i\phi_i(x)}{\gamma_i}=\sum_{i=1}^{\infty}c_i\phi_i(x)=f(x)$$
and so $K(\cdot,x)$ is the representer of evaluation at $x$. Further,
$$\langle K(\cdot,z),K(\cdot,x)\rangle_{\mathcal{H}_K}=K(z,x)$$
which is the reproducing property of the RKHS.

Notice also that for two linear combinations of slices of the kernel function at some set of (say, $M$) inputs $\{z_i\}$ (two functions in $\mathcal{H}_K$ represented in dual form)
$$f(\cdot)=\sum_{i=1}^M c_iK(\cdot,z_i)\ \ \text{and}\ \ g(\cdot)=\sum_{i=1}^M d_iK(\cdot,z_i)$$
the corresponding $\mathcal{H}_K$ inner product is
$$\langle f,g\rangle_{\mathcal{H}_K}=\left\langle\sum_{i=1}^M c_iK(\cdot,z_i),\sum_{j=1}^M d_jK(\cdot,z_j)\right\rangle_{\mathcal{H}_K}=\sum_{i=1}^M\sum_{j=1}^M c_id_jK(z_i,z_j)=\left(c_1,\ldots,c_M\right)\left(K(z_i,z_j)\right)\begin{pmatrix}d_1\\ \vdots\\ d_M\end{pmatrix}$$
This is a kind of $\Re^M$ inner product of the coefficient vectors $c'=(c_1,\ldots,c_M)$ and $d'=(d_1,\ldots,d_M)$ defined by the nonnegative definite matrix $\left(K(z_i,z_j)\right)$. Further, if a random $M$-vector $Y$ has covariance matrix $\left(K(z_i,z_j)\right)$, this is $\mathrm{Cov}\left(c'Y,d'Y\right)$. So, in particular, for $f$ of this form $\|f\|^2_{\mathcal{H}_K}=\langle f,f\rangle_{\mathcal{H}_K}=\mathrm{Var}\left(c'Y\right)$.
For applying this material to the fitting of training data, for $\lambda>0$ and a loss function $L(y,\hat{y})\geq 0$ define an optimization criterion
$$\underset{f\in\mathcal{H}_K}{\text{minimize}}\left(\sum_{i=1}^N L\left(y_i,f(x_i)\right)+\lambda\|f\|^2_{\mathcal{H}_K}\right) \tag{65}$$
As it turns out, an optimizer of this criterion must, for the training vectors $\{x_i\}$, be of the form
$$\hat{f}(x)=\sum_{i=1}^N b_iK(x,x_i) \tag{66}$$
and the corresponding $\|\hat{f}\|^2_{\mathcal{H}_K}=\langle\hat{f},\hat{f}\rangle_{\mathcal{H}_K}$ is then
$$\sum_{i=1}^N\sum_{j=1}^N b_ib_jK(x_i,x_j)$$
The criterion (65) is thus
$$\underset{b\in\Re^N}{\text{minimize}}\left(\sum_{i=1}^N L\left(y_i,\sum_{j=1}^N b_jK(x_i,x_j)\right)+\lambda\, b'\left(K(x_i,x_j)\right)b\right) \tag{67}$$
Letting
$$K=\left(K(x_i,x_j)\right)\ \ \text{and}\ \ P=K^{-}\ \ \text{(a symmetric generalized inverse of}\ K\text{)}$$
and defining
$$L_N(Y,Kb)\equiv\sum_{i=1}^N L\left(y_i,\sum_{j=1}^N b_jK(x_i,x_j)\right)$$
the optimization criterion (67) is
$$\underset{b\in\Re^N}{\text{minimize}}\ L_N(Y,Kb)+\lambda\,b'Kb$$
i.e.
$$\underset{b\in\Re^N}{\text{minimize}}\ L_N(Y,Kb)+\lambda\,b'K'PKb$$
i.e.
$$\underset{v\in C(K)}{\text{minimize}}\left(L_N(Y,v)+\lambda\,v'Pv\right) \tag{68}$$
That is, the function space optimization problem (65) reduces to the $N$-dimensional optimization problem (68). A $v\in C(K)$ (the column space of $K$) minimizing $L_N(Y,v)+\lambda v'Pv$ corresponds to $b$ minimizing $L_N(Y,Kb)+\lambda b'Kb$ via
$$Kb=v \tag{69}$$
For the particular special case of squared error loss, $L(y,\hat{y})=(y-\hat{y})^2$, this development has a very explicit punch line. That is,
$$L_N(Y,Kb)+\lambda\,b'Kb=\left(Y-Kb\right)'\left(Y-Kb\right)+\lambda\,b'Kb$$
Some vector calculus shows that this is minimized over choices of $b$ by
$$b=\left(K+\lambda I\right)^{-1}Y \tag{70}$$
and corresponding fitted values are
$$\hat{Y}=v=K\left(K+\lambda I\right)^{-1}Y$$
Then using (70) under squared error loss, the solution to (65) is from (66)
$$\hat{f}(x)=\sum_{i=1}^N b_iK(x,x_i) \tag{71}$$
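A base R sketch (mine) of the squared error loss punch line: kernel fitting with a Gaussian radial basis kernel, $b=(K+\lambda I)^{-1}Y$ and $\hat{f}(x)=\sum_i b_iK(x,x_i)$. The kernel, its bandwidth, and $\lambda$ are my choices here.

```r
set.seed(602)
N <- 100
x <- sort(runif(N)); y <- sin(2 * pi * x) + rnorm(N, sd = 0.3)

gauss_kernel <- function(u, v, gamma = 50) exp(-gamma * outer(u, v, "-")^2)

lambda <- 0.1
K <- gauss_kernel(x, x)
b <- solve(K + lambda * diag(N), y)                       # equation (70)

f_hat <- function(x_new) as.vector(gauss_kernel(x_new, x) %*% b)   # equation (71)

grid <- seq(0, 1, length = 200)
plot(x, y); lines(grid, f_hat(grid), col = "red")
```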
To better understand the nature of (68), consider the eigen decomposition of $K$ in a case where it is non-singular as
$$K=U\,\mathrm{diag}\left(d_1,d_2,\ldots,d_N\right)U'$$
for eigenvalues $d_1\geq d_2\geq\cdots\geq d_N>0$, where the eigenvector columns of $U$ comprise an orthonormal basis for $\Re^N$. The penalty in (68) is
$$\lambda\,v'Pv=\lambda\,v'U\,\mathrm{diag}\left(1/d_1,1/d_2,\ldots,1/d_N\right)U'v=\lambda\sum_{j=1}^N\frac{1}{d_j}\langle v,u_j\rangle^2$$
Now (remembering that the $u_j$ comprise an orthonormal basis for $\Re^N$)
$$v=\sum_{j=1}^N\langle v,u_j\rangle u_j\ \ \text{and}\ \ \|v\|^2=\langle v,v\rangle=\sum_{j=1}^N\langle v,u_j\rangle^2$$
so we see that in choosing a $v$ to optimize $\left(L_N(Y,v)+\lambda v'Pv\right)$, we penalize highly those $v$ with large components in the directions of the late eigenvectors of $K$ (the ones corresponding to its small eigenvalues), thereby tending to suppress those features of a potential $v$.

HTF seem to say that the eigenvalues and eigenvectors of the data-dependent $N\times N$ matrix $K$ are somehow related respectively to the constants $\gamma_i$ and functions $\phi_i$ in the representation (62) of $K$ as a weighted series of products. That seems hard to understand and (even if true) certainly not obvious.
CFZ provide a result summarizing the most general available version of this development, known as "the representer theorem." It says that if $\Omega:[0,\infty)\rightarrow\Re$ is strictly increasing and $L\left((x_1,y_1,h(x_1)),\ldots,(x_N,y_N,h(x_N))\right)\geq 0$ is an arbitrary loss function associated with the prediction of each $y_i$ as $h(x_i)$, then an $h\in\mathcal{H}_K$ minimizing
$$L\left((x_1,y_1,h(x_1)),(x_2,y_2,h(x_2)),\ldots,(x_N,y_N,h(x_N))\right)+\Omega\left(\|h\|_{\mathcal{H}_K}\right)$$
has a representation as
$$h(x)=\sum_{i=1}^N\alpha_iK(x,x_i)$$
Further, if $\{\psi_1,\psi_2,\ldots,\psi_M\}$ is a set of real-valued functions and the $N\times M$ matrix $\left(\psi_j(x_i)\right)$ is of rank $M$, then for $h_0\in\mathrm{span}\{\psi_1,\psi_2,\ldots,\psi_M\}$ and $h_1\in\mathcal{H}_K$, an $h=h_0+h_1$ minimizing
$$L\left((x_1,y_1,h(x_1)),(x_2,y_2,h(x_2)),\ldots,(x_N,y_N,h(x_N))\right)+\Omega\left(\|h_1\|_{\mathcal{H}_K}\right)$$
has a representation as
$$h(x)=\sum_{i=1}^M\beta_i\psi_i(x)+\sum_{i=1}^N\alpha_iK(x,x_i)$$
The extra generality provided by this theorem for the squared error loss case treated above is that it provides for linear combinations of the functions $\psi_i(x)$ to be unpenalized in fitting. Then for
$$\Phi_{N\times M}=\left(\psi_j(x_i)\right)$$
and
$$R=I-\Phi\left(\Phi'\Phi\right)^{-1}\Phi'$$
an optimizing $\beta$ is $\hat{\beta}=\left(\Phi'\Phi\right)^{-1}\Phi'\left(Y-K\hat{\alpha}\right)$, where $\hat{\alpha}$ optimizes
$$\left(RY-RK\alpha\right)'\left(RY-RK\alpha\right)+\lambda\,\alpha'K\alpha$$
and the earlier argument implies that $\hat{\alpha}=\left(K+\lambda I\right)^{-1}RY$.

11.3.1 Some Special Cases
One possible kernel function in $p$ dimensions is
$$K(z,x)=\left(1+\langle z,x\rangle\right)^d$$
For fixed $x_i$ the basis functions $K(x,x_i)$ are $d$th order polynomials in the entries of $x$. So the fitting is in terms of such polynomials. Note that since $K(z,x)$ is relatively simple here, there seems to be a good chance of explicitly deriving a representation (62) and perhaps working out all the details of what is above in a very concrete setting.

Another standard kernel function in $p$ dimensions is
$$K(z,x)=\exp\left(-c\|z-x\|^2\right)$$
and the basis functions $K(x,x_i)$ are essentially spherically symmetric normal pdfs with mean vectors $x_i$. (These are "Gaussian radial basis functions" and for $p=2$, functions (66) produce prediction surfaces in 3-space that have smooth symmetric "mountains" or "craters" at each $x_i$, of elevation or depth relative to the rest of the surface governed by $b_i$ and extent governed by $c$.)

Section 19 provides a number of insights that enable the creation of a wide variety of kernels beyond the few mentioned here.

The standard development of so-called "support vector classifiers" in a 2-class context with $y$ taking values $\pm 1$ uses some kernel $K(z,x)$ and predictors
$$\hat{f}(x)=b_0+\sum_{i=1}^N b_iK(x,x_i)$$
in combination with
$$L\left(y,\hat{f}(x)\right)=\left[1-y\hat{f}(x)\right]_+$$
(the sign of $\hat{f}(x)$ providing the classification associated with $x$).
11.3.2 Addendum Regarding the Structures of the Spaces Related to a Kernel

An amplification of some aspects of the basic description provided above for the RKHS corresponding to a kernel $K:\mathcal{C}\times\mathcal{C}\rightarrow\Re$ is as follows.

From the representation
$$K(z,x)=\sum_{i=1}^{\infty}\gamma_i\phi_i(z)\phi_i(x)$$
write
$$\varphi_i\equiv\sqrt{\gamma_i}\,\phi_i$$
so that the kernel is
$$K(z,x)=\sum_{i=1}^{\infty}\varphi_i(z)\varphi_i(x) \tag{72}$$
(Note that considered as functions in $L_2(\mathcal{C})$ the $\varphi_i$ are orthogonal, but not generally orthonormal, since $\langle\varphi_i,\varphi_i\rangle_2=\int_{\mathcal{C}}\gamma_i\phi_i^2(x)\,dx=\gamma_i$ which is typically not 1.) Representation (72) suggests that one think about the inner product for inputs provided by the kernel in terms of a transform of an input vector $x\in\Re^p$ to an infinite-dimensional feature vector
$$\varphi(x)=\left(\varphi_1(x),\varphi_2(x),\ldots\right)$$
and then "ordinary $\Re^{\infty}$ inner products" defined on those feature vectors.

The function space $\mathcal{H}_K$ has members of the (primal) form
$$f(x)=\sum_{i=1}^{\infty}c_i\phi_i(x)\ \ \text{for}\ c_i\ \text{with}\ \sum_{i=1}^{\infty}\frac{c_i^2}{\gamma_i}<\infty$$
This is perhaps more naturally
$$f(x)=\sum_{i=1}^{\infty}c_i\varphi_i(x)\ \ \text{for}\ c_i\ \text{with}\ \sum_{i=1}^{\infty}\frac{\left(c_i\sqrt{\gamma_i}\right)^2}{\gamma_i}=\sum_{i=1}^{\infty}c_i^2<\infty$$
(Again, all elements of $L_2(\mathcal{C})$ are of the form $f(x)=\sum_{i=1}^{\infty}c_i\phi_i(x)$ with $\sum_{i=1}^{\infty}c_i^2<\infty$.) The $\mathcal{H}_K$ inner product of two functions of this primal form has been defined as
$$\left\langle\sum_{i=1}^{\infty}c_i\phi_i,\sum_{i=1}^{\infty}d_i\phi_i\right\rangle_{\mathcal{H}_K}\equiv\sum_{i=1}^{\infty}\frac{c_id_i}{\gamma_i}=\left\langle\sum_{i=1}^{\infty}\frac{c_i}{\sqrt{\gamma_i}}\varphi_i,\sum_{i=1}^{\infty}\frac{d_i}{\sqrt{\gamma_i}}\varphi_i\right\rangle_{\mathcal{H}_K}=\left\langle\left(\frac{c_1}{\sqrt{\gamma_1}},\frac{c_2}{\sqrt{\gamma_2}},\ldots\right),\left(\frac{d_1}{\sqrt{\gamma_1}},\frac{d_2}{\sqrt{\gamma_2}},\ldots\right)\right\rangle_{\Re^{\infty}}<\infty$$
So, two elements of $\mathcal{H}_K$ written in terms of the $\varphi_i$ (instead of their multiples $\phi_i$), say $\sum_{i=1}^{\infty}c_i\varphi_i$ and $\sum_{i=1}^{\infty}d_i\varphi_i$, have $\mathcal{H}_K$ inner product that is the ordinary $\Re^{\infty}$ inner product of their vectors of coefficients.

Now consider the function
$$K(\cdot,x)=\sum_{i=1}^{\infty}\varphi_i(\cdot)\varphi_i(x)$$
For $f=\sum_{i=1}^{\infty}c_i\varphi_i\in\mathcal{H}_K$,
$$\langle f,K(\cdot,x)\rangle_{\mathcal{H}_K}=\left\langle\sum_{i=1}^{\infty}c_i\varphi_i,K(\cdot,x)\right\rangle_{\mathcal{H}_K}=\sum_{i=1}^{\infty}c_i\varphi_i(x)=f(x)$$
and (perhaps more clearly than above) indeed $K(\cdot,x)$ is the representer of evaluation at $x$ in the function space $\mathcal{H}_K$.
11.4 Gaussian Process "Priors," Bayes Predictors, and RKHS's

The RKHS material has an interesting connection to Bayes prediction. It's our purpose here to show that connection. Consider first a very simple context where $P$ has parameter $\theta$ and therefore the conditional mean of the output is a parametric (in $\theta$) function of $x$
$$E_{\theta}\left[y|x=u\right]=m(u|\theta)$$
Suppose that the conditional distribution of $y|x=u$ has density $f(y|u,\theta)$. Then for a prior distribution on $\theta$ with density $g(\theta)$, the posterior has density
$$g(\theta|T)\propto\prod_{i=1}^N f\left(y_i|x_i,\theta\right)\,g(\theta)$$
For any fixed $x$, this posterior distribution induces a corresponding distribution on $m(x|\theta)$. Then, a sensible predictor based on the Bayes structure is
$$\hat{f}(x)=E\left[m(x|\theta)|T\right]=\int m(x|\theta)\,g(\theta|T)\,d\theta$$
With this thinking in mind, let's now consider a more complicated application of essentially Bayesian thinking to the development of a predictor, based on the use of a Gaussian process as a more or less non-parametric "prior distribution" for the function (of $u$) $E[y|x=u]$. That is, for purposes of developing a predictor, suppose that one assumes that
$$y=\eta(x)+\epsilon\ \ \text{where}\ \ \eta(x)=\mu(x)+\delta(x)\,,$$
$E\epsilon=0$, $\mathrm{Var}\,\epsilon=\sigma^2$, the function $\mu(x)$ is known (it could be taken to be identically 0) and plays the role of a prior mean for the function (of $u$)
$$\eta(u)=E\left[y|x=u\right]$$
and (independent of the errors $\epsilon$), $\delta(x)$ is a realization of a mean 0 stationary Gaussian process on $\Re^p$, this Gaussian process describing the prior uncertainty for $\eta(x)$ around $\mu(x)$. More explicitly, the assumption on $\delta(x)$ is that $E\delta(x)=0$ and $\mathrm{Var}\,\delta(x)=\tau^2$ for all $x$, and for some appropriate (correlation) function $\rho$, $\mathrm{Cov}\left(\delta(x),\delta(z)\right)=\tau^2\rho(x-z)$ for all $x$ and $z$ ($\rho(0)=1$ and the function of two variables $\rho(x-z)$ must be positive definite). The "Gaussian" assumption is then that for any finite set of elements of $\Re^p$, say $z_1,z_2,\ldots,z_M$, the vector of corresponding values $\delta(z_i)$ is multivariate normal.

There are a number of standard forms that have been suggested for the correlation function $\rho$. The simplest ones are of a product form, i.e. if $\rho_j$ is a valid one-dimensional correlation function, then the product
$$\rho(x-z)=\prod_{j=1}^p\rho_j\left(x_j-z_j\right)$$
is a valid correlation function for a Gaussian process on $\Re^p$. Standard forms for correlation functions in one dimension are $\rho(t)=\exp\left(-ct^2\right)$ and $\rho(t)=\exp\left(-c|t|\right)$. The first produces "smoother" realizations than does the second, and in both cases, the constant $c$ governs how fast realizations vary.

One may then consider the joint distribution (conditional on the $x_i$ and assuming that for the training values $y_i$ the $\epsilon_i$ are iid independent of the $\delta(x_i)$) of the training output values and a value of $\eta(x)$. From this, one can find the conditional mean for $\eta(x)$ given the training data. To that end, let
$$\Gamma_{N\times N}=\tau^2\left(\rho\left(x_i-x_j\right)\right)_{\substack{i=1,2,\ldots,N\\ j=1,2,\ldots,N}}$$
Then for a single value of $x$,
$$\begin{pmatrix}y_1\\ y_2\\ \vdots\\ y_N\\ \eta(x)\end{pmatrix}\sim\mathrm{MVN}_{N+1}\left(\begin{pmatrix}\mu(x_1)\\ \mu(x_2)\\ \vdots\\ \mu(x_N)\\ \mu(x)\end{pmatrix},\begin{pmatrix}\Gamma+\sigma^2I & \gamma(x)\\ \gamma(x)' & \tau^2\end{pmatrix}\right)$$
for
$$\gamma(x)_{N\times 1}=\begin{pmatrix}\tau^2\rho(x-x_1)\\ \tau^2\rho(x-x_2)\\ \vdots\\ \tau^2\rho(x-x_N)\end{pmatrix}$$
Then standard multivariate normal theory says that the conditional mean of $\eta(x)$ given $Y$ is
$$\hat{f}(x)=\mu(x)+\gamma(x)'\left(\Gamma+\sigma^2I\right)^{-1}\begin{pmatrix}y_1-\mu(x_1)\\ y_2-\mu(x_2)\\ \vdots\\ y_N-\mu(x_N)\end{pmatrix} \tag{73}$$
Write
$$w_{N\times 1}=\left(\Gamma+\sigma^2I\right)^{-1}\begin{pmatrix}y_1-\mu(x_1)\\ y_2-\mu(x_2)\\ \vdots\\ y_N-\mu(x_N)\end{pmatrix} \tag{74}$$
and then note that (73) implies that
$$\hat{f}(x)=\mu(x)+\sum_{i=1}^N w_i\,\tau^2\rho\left(x-x_i\right) \tag{75}$$
and we see that this development ultimately produces $\mu(x)$ plus a linear combination of the "basis functions" $\tau^2\rho(x-x_i)$ as a predictor. Remembering that $\tau^2\rho(x-z)$ must be positive definite and seeing the ultimate form of the predictor, we are reminded of the RKHS material.

In fact, consider the case where $\mu(x)\equiv 0$. (If one has some non-zero prior mean for $\eta(x)$, arguably that mean function should be subtracted from the raw training outputs before beginning the development of a predictor. At a minimum, output values should probably be centered before attempting development of a predictor.) Compare (74) and (75) to (70) and (71) for the $\mu(x)\equiv 0$ case. What is then clear is that the present "Bayes" Gaussian process development of a predictor under squared error loss based on a covariance function $\tau^2\rho(x-z)$ and error variance $\sigma^2$ is equivalent to a RKHS regularized fit of a function to training data based on a kernel $K(x,z)=\tau^2\rho(x-z)$ and penalty weight $\lambda=\sigma^2$.
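A base R sketch (mine) of the Gaussian process predictor (73)-(75) with prior mean $\mu(x)\equiv 0$, process variance $\tau^2$, error variance $\sigma^2$, and the "smooth" correlation $\rho(t)=\exp(-ct^2)$. All constants here are my own choices.

```r
set.seed(602)
N <- 50
x <- sort(runif(N)); y <- sin(2 * pi * x) + rnorm(N, sd = 0.2)

tau2 <- 1; sigma2 <- 0.04; cc <- 50
rho <- function(t) exp(-cc * t^2)

Gamma <- tau2 * rho(outer(x, x, "-"))          # tau^2 * (rho(x_i - x_j))
w <- solve(Gamma + sigma2 * diag(N), y)        # equation (74), with mu = 0

f_hat <- function(x_new)                       # equation (75)
  as.vector((tau2 * rho(outer(x_new, x, "-"))) %*% w)

grid <- seq(0, 1, length = 200)
plot(x, y); lines(grid, f_hat(grid), col = "red")
```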
12 More on Understanding and Predicting Predictor Performance

There are a variety of theoretical and empirical quantities that might be computed to quantify predictor performance. Those that are empirical and reliably track important theoretical ones might potentially be used to select an effective predictor. We'll here consider some of those (theoretical and empirical) measures. We will do so in the by-now-familiar setting where training data $(x_1,y_1),(x_2,y_2),\ldots,(x_N,y_N)$ are assumed to be iid $P$ and independent of $(x,y)$ that is also $P$ distributed, and are used to pick a prediction rule $\hat{f}(x)$ to be used under a loss $L(\hat{y},y)\geq 0$.

What is very easy to think about and compute is the training error
$$\mathrm{err}=\frac{1}{N}\sum_{i=1}^N L\left(\hat{f}(x_i),y_i\right)$$
This typically decreases with increased complexity in the form of $\hat{f}$, and is no reliable indicator of predictor performance off the training set. Measures of prediction rule performance off the training set must have a theoretical basis (or be somehow based on data held back from the process of prediction rule development).

General loss function versions of (squared error loss) quantities related to err defined in Section 1.4 are
$$\mathrm{Err}(u)\equiv E_T E\left[L\left(\hat{f}(x),y\right)|x=u\right]$$
$$\mathrm{Err}_T\equiv E_{(x,y)}L\left(\hat{f}(x),y\right) \tag{76}$$
and
$$\mathrm{Err}\equiv E_x\mathrm{Err}(x)=E_T\mathrm{Err}_T \tag{77}$$
A slightly different and semi-empirical version of this expected prediction error (77) is the "in-sample" test error (7.12) of HTF
$$\frac{1}{N}\sum_{i=1}^N\mathrm{Err}(x_i)$$
12.1 Optimism of the Training Error

Typically, err is less than $\mathrm{Err}_T$. Part of the difference in these is potentially due to the fact that $\mathrm{Err}_T$ is an "extra-sample" error, in that the averaging in (76) is potentially over values $x$ outside the set of values in the training data. We might consider instead
$$\mathrm{Err}_{T\,\mathrm{in}}=\frac{1}{N}\sum_{i=1}^N E^{y_i}L\left(y_i,\hat{f}(x_i)\right) \tag{78}$$
where the expectations indicated in (78) are over $y_i\sim P_{y|x=x_i}$ (the entire training sample used to choose $\hat{f}$, both inputs and outputs, is being held constant in the averaging in (78)). The difference
$$\mathrm{op}=\mathrm{Err}_{T\,\mathrm{in}}-\mathrm{err}$$
is called the "optimism of the training error." HTF use the notation
$$\omega=E^Y\mathrm{op}=E^Y\left(\mathrm{Err}_{T\,\mathrm{in}}-\mathrm{err}\right) \tag{79}$$
where the averaging indicated by $E^Y$ is over the outputs in the training set (using the conditionally independent $y_i$'s, $y_i\sim P_{y|x=x_i}$). HTF say that for many losses
$$\omega=\frac{2}{N}\sum_{i=1}^N\mathrm{Cov}^Y\left(\hat{y}_i,y_i\right) \tag{80}$$
For example, consider the case of squared error loss. There
$$\begin{aligned}
\omega&=E^Y\left(\mathrm{Err}_{T\,\mathrm{in}}-\mathrm{err}\right)\\
&=E^Y\left(\frac{1}{N}\sum_{i=1}^N E^{y_i}\left(y_i-\hat{f}(x_i)\right)^2\right)-E^Y\left(\frac{1}{N}\sum_{i=1}^N\left(y_i-\hat{f}(x_i)\right)^2\right)\\
&=\frac{2}{N}\sum_{i=1}^N\left(E^Y y_i\hat{f}(x_i)-E^Y y_i\,E^Y\hat{f}(x_i)\right)\\
&=\frac{2}{N}\sum_{i=1}^N E^Y\hat{f}(x_i)\left(y_i-E\left[y|x=x_i\right]\right)\\
&=\frac{2}{N}\sum_{i=1}^N\mathrm{Cov}^Y\left(\hat{y}_i,y_i\right)
\end{aligned}$$
We note that in this context, assuming that given the $x_i$ in the training data the outputs are uncorrelated and have constant variance $\sigma^2$, by (14)
$$\omega=\frac{2\sigma^2}{N}\mathrm{df}\left(\hat{Y}\right)$$

12.2 Cp, AIC and BIC
The fact that
$$\mathrm{Err}_{T\,\mathrm{in}}=\mathrm{err}+\mathrm{op}$$
suggests the making of estimates of $\omega=E^Y\mathrm{op}$ and the use of
$$\mathrm{err}+\hat{\omega} \tag{81}$$
as a guide in model selection. This idea produces consideration of the model selection criteria Cp/AIC and BIC.

12.2.1 Cp and AIC

For the situation of least squares fitting with $p$ predictors or basis functions and squared error loss,
$$\mathrm{df}\left(\hat{Y}\right)=p=\mathrm{tr}\left(X\left(X'X\right)^{-1}X'\right)$$
so that $\sum_{i=1}^N\mathrm{Cov}^Y\left(\hat{y}_i,y_i\right)=p\sigma^2$. Then, if $\hat{\sigma}^2$ is an estimated error variance based on a low-bias/high-number-of-predictors fit, a version of (81) suitable for this context is Mallows' Cp
$$C_p\equiv\mathrm{err}+\frac{2p\hat{\sigma}^2}{N}$$
In a more general setting, if one can appropriately evaluate or estimate $\sum_{i=1}^N\mathrm{Cov}^Y\left(\hat{y}_i,y_i\right)=\sigma^2\,\mathrm{df}\left(\hat{Y}\right)$, a general version of (81) becomes the Akaike information criterion
$$\mathrm{AIC}=\mathrm{err}+\frac{2}{N}\sum_{i=1}^N\mathrm{Cov}^Y\left(\hat{y}_i,y_i\right)=\mathrm{err}+\frac{2\sigma^2}{N}\mathrm{df}\left(\hat{Y}\right)$$

12.2.2 BIC
For situations where fitting is done by maximum likelihood, the Bayesian Information Criterion of Schwarz is an alternative to AIC. That is, where the joint distribution $P$ produces density $P(y|\theta,u)$ for the conditional distribution of $y|x=u$ and $\hat{\theta}$ is the maximum likelihood estimator of $\theta$, a (maximized) log-likelihood is
$$\mathrm{loglik}=\sum_{i=1}^N\log P\left(y_i|\hat{\theta},x_i\right)$$
and the so-called Bayesian information criterion is
$$\mathrm{BIC}=-2\,\mathrm{loglik}+(\log N)\,\mathrm{df}\left(\hat{Y}\right)$$
For $y|x$ normal with variance $\sigma^2$, up to a constant, this is
$$\mathrm{BIC}=\frac{N}{\sigma^2}\left(\mathrm{err}+\frac{(\log N)\,\sigma^2}{N}\mathrm{df}\left(\hat{Y}\right)\right)$$
and after switching 2 for $\log N$, BIC is a multiple of AIC. The replacement of 2 with $\log N$ means that when used to guide model/predictor selections, BIC will typically favor simpler models/predictors than will AIC.

The Bayesian origins of BIC can be developed as follows. Suppose (as in Section 10.2.1) that $M$ models are under consideration, the $m$th of which has parameter vector $\theta_m$ and corresponding density for training data
$$f_m\left(T|\theta_m\right)$$
with prior density for $\theta_m$
$$g_m\left(\theta_m\right)$$
and prior probability for model $m$
$$\pi(m)$$
With this structure, the posterior distribution of the model index is
$$\pi(m|T)\propto\pi(m)\int f_m\left(T|\theta_m\right)g_m\left(\theta_m\right)d\theta_m$$
Under 0-1 loss and uniform $\pi(\cdot)$, one wants to choose model $m$ maximizing
$$\int f_m\left(T|\theta_m\right)g_m\left(\theta_m\right)d\theta_m=f_m(T)=\text{the}\ m\text{th marginal of}\ T$$
Apparently, the so-called Laplace approximation says that
$$\log f_m(T)\approx\log f_m\left(T|\hat{\theta}_m\right)-\frac{d_m}{2}\log N+O(1)$$
where $d_m$ is the real dimension of $\theta_m$. Assuming that the marginal of $x$ doesn't change model-to-model or parameter-to-parameter, $\log f_m\left(T|\hat{\theta}_m\right)$ is $\mathrm{loglik}+C_N$, where $C_N$ is a function of only the input values in the training set. Then
$$-2\log f_m(T)\approx-2\,\mathrm{loglik}+(\log N)\,d_m+O(1)-2C_N=\mathrm{BIC}+O(1)-2C_N$$
and (at least approximately) choosing $m$ to maximize $f_m(T)$ is choosing $m$ to minimize BIC.
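A base R sketch (mine) computing err, Cp (which in this squared error setting has the same form as the AIC expression above), and the normal-case BIC for nested OLS fits of increasing size; $\hat{\sigma}^2$ is taken from the largest ("low-bias") model, as suggested above. The simulated data are my own.

```r
set.seed(602)
N <- 100
x <- runif(N); y <- 1 + 2 * x - 3 * x^2 + rnorm(N, sd = 0.5)

max_deg <- 6
fits <- lapply(1:max_deg, function(d) lm(y ~ poly(x, d)))
err  <- sapply(fits, function(f) mean(resid(f)^2))
df   <- sapply(fits, function(f) length(coef(f)))        # p (including intercept)
sigma2_hat <- sum(resid(fits[[max_deg]])^2) / (N - df[max_deg])

Cp  <- err + 2 * df * sigma2_hat / N
BIC <- (N / sigma2_hat) * (err + log(N) * sigma2_hat * df / N)
cbind(degree = 1:max_deg, err, Cp, BIC)
```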
12.3 Cross-Validation Estimation of Err

K-fold cross-validation is described in Section 1.5. One hopes that
$$\mathrm{CV}\left(\hat{f}\right)=\frac{1}{N}\sum_{i=1}^N L\left(y_i,\hat{f}^{k(i)}(x_i)\right)$$
estimates Err. In predictor selection, say where predictor $\hat{f}_\lambda$ has a complexity parameter $\lambda$, we look at
$$\mathrm{CV}\left(\hat{f}_\lambda\right)$$
as a function of $\lambda$, try to optimize, and then refit (with that $\lambda$) to the whole training set.

K-fold cross-validation can be expected to estimate Err for
$$\text{"}N\text{"}=\left(1-\frac{1}{K}\right)N$$
The question of how cross-validation might be expected to do is thus related to how Err changes with $N$ (the size of the training sample). Typically, Err decreases monotonically in $N$, approaching some limiting value as $N$ goes to infinity. The "early" (small $N$) part of the "Err vs. $N$ curve" is steep and the "late" part (large $N$) is relatively flat. If $(1-1/K)N$ is large enough that at such size of the training data set the curve is flat, then the effectiveness of cross-validation is limited only by the noise inherent in estimating it, and not by the fact that training sets of size $(1-1/K)N$ are not of size $N$.

Operationally, $K=5$ or 10 seem standard, and a common rule of thumb is to select for use the predictor of minimum complexity whose cross-validation error is within one standard error of the minimum $\mathrm{CV}(\hat{f}_\lambda)$ for predictors $\hat{f}_\lambda$ under consideration. Presumably, "standard error" here refers to $(K/N)^{1/2}$ times a sample standard deviation of the $K$ values
$$\frac{K}{N}\sum_{i\,\ni\,k(i)=k}L\left(y_i,\hat{f}^{k}(x_i)\right)$$
(though it's not completely obvious which complexity's standard error is under discussion when one says this).
HTF say that for many linear fitting methods (that produce $\hat{Y}=MY$) including least squares projection and cubic smoothing splines, the $N=K$ (leave one out) cross-validation error is (for $\hat{f}^{-i}$ produced by training on $T-\{(x_i,y_i)\}$)
$$\mathrm{CV}\left(\hat{f}\right)=\frac{1}{N}\sum_{i=1}^N\left(y_i-\hat{f}^{-i}(x_i)\right)^2=\frac{1}{N}\sum_{i=1}^N\left(\frac{y_i-\hat{f}(x_i)}{1-M_{ii}}\right)^2$$
(for $M_{ii}$ the $i$th diagonal element of $M$). The so-called generalized cross-validation approximation to this is the much more easily computed
$$\mathrm{GCV}\left(\hat{f}\right)=\frac{1}{N}\sum_{i=1}^N\left(\frac{y_i-\hat{f}(x_i)}{1-\mathrm{tr}(M)/N}\right)^2=\frac{\mathrm{err}}{\left(1-\mathrm{tr}(M)/N\right)^2}$$
It is worth noting (per HTF Exercise 7.7) that since $1/(1-x)^2\approx 1+2x$ for $x$ near 0,
$$\mathrm{GCV}\left(\hat{f}\right)=\frac{1}{N}\sum_{i=1}^N\left(\frac{y_i-\hat{f}(x_i)}{1-\mathrm{tr}(M)/N}\right)^2\approx\frac{1}{N}\sum_{i=1}^N\left(y_i-\hat{f}(x_i)\right)^2+\frac{2\,\mathrm{tr}(M)}{N}\left(\frac{1}{N}\sum_{i=1}^N\left(y_i-\hat{f}(x_i)\right)^2\right)$$
which is close to AIC, the difference being that here $\sigma^2$ is being estimated based on the model being fit, as opposed to being estimated based on a low-bias/large model.
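A base R sketch (mine) of K-fold cross-validation with the one-standard-error rule of thumb, for polynomial regressions indexed by a complexity parameter (the degree). The fold-splitting scheme and the particular standard error convention (per-fold sd divided by the square root of $K$) are my choices.

```r
set.seed(602)
N <- 100; K <- 10
x <- runif(N); y <- sin(2 * pi * x) + rnorm(N, sd = 0.3)
fold <- sample(rep(1:K, length.out = N))
degrees <- 1:8

cv_table <- sapply(degrees, function(d) {
  fold_err <- sapply(1:K, function(k) {
    fit <- lm(y ~ poly(x, d), data = data.frame(x, y)[fold != k, ])
    mean((y[fold == k] - predict(fit, data.frame(x = x[fold == k])))^2)
  })
  c(cv = mean(fold_err), se = sd(fold_err) / sqrt(K))
})

cv <- cv_table["cv", ]; se <- cv_table["se", ]
best <- which.min(cv)
one_se_choice <- min(which(cv <= cv[best] + se[best]))   # smallest adequate complexity
c(min_cv_degree = degrees[best], one_se_degree = degrees[one_se_choice])
```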
12.4 Bootstrap Estimation of Err

Suppose that the values of the input vectors in the training set are unique. One might make $B$ bootstrap samples of $N$ (random samples with replacement of size $N$) from the training set $T$, say $T^1,T^2,\ldots,T^B$, and train on these bootstrap samples to produce predictors, say
$$\text{predictor}\ \hat{f}^b\ \text{based on}\ T^b$$
Let $C^i$ be the set of indices $b=1,2,\ldots,B$ for which $(x_i,y_i)\notin T^b$. A possible bootstrap estimate of Err is then
$$\widehat{\mathrm{Err}}^{(1)}\equiv\frac{1}{N}\sum_{i=1}^N\left[\frac{1}{\#C^i}\sum_{b\in C^i}L\left(y_i,\hat{f}^b(x_i)\right)\right]$$
It's not completely clear what to make of this. For one thing, the $T^b$ do not have $N$ distinct elements. In fact, it's fairly easy to see that the expected number of distinct cases in a bootstrap sample is about $.632N$. That is, note that
$$P\left[x_1\ \text{is in a particular bootstrap sample}\ T^b\right]=1-\left(1-\frac{1}{N}\right)^N\rightarrow 1-\exp(-1)=.632$$
Then
$$E\left(\text{number of distinct cases in}\ T^b\right)=E\left(\sum_{i=1}^N I\left[x_i\ \text{is in}\ T^b\right]\right)=\sum_{i=1}^N EI\left[x_i\ \text{is in}\ T^b\right]\approx\sum_{i=1}^N .632=.632N$$
So roughly speaking, we might expect $\widehat{\mathrm{Err}}^{(1)}$ to estimate Err at $.632N$, not at $N$. So unless Err as a function of training set size is fairly flat to the right of $.632N$, one might expect substantial positive bias in it as an estimate of Err (at $N$).

HTF argue for
$$\widehat{\mathrm{Err}}^{(.632)}\equiv .368\,\mathrm{err}+.632\,\widehat{\mathrm{Err}}^{(1)}$$
as a first order correction on the biased bootstrap estimate, but admit that this is not perfect either, and propose a more complicated fix (that they call $\widehat{\mathrm{Err}}^{(.632+)}$) for classification problems. It is hard to see what real advantage the bootstrap has over K-fold cross-validation, especially since one will typically want $B$ to be substantially larger than usual values for $K$.
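A base R sketch (mine) of the leave-out bootstrap estimate $\widehat{\mathrm{Err}}^{(1)}$ and the .632 correction, again with an ordinary least squares line standing in for an arbitrary predictor-construction method.

```r
set.seed(602)
N <- 100; B <- 200
x <- runif(N); y <- 2 * x + rnorm(N, sd = 0.5)
dat <- data.frame(x, y)

loss_oob <- vector("list", N)                 # per-case out-of-bag squared errors
for (b in 1:B) {
  idx <- sample(N, N, replace = TRUE)
  fit <- lm(y ~ x, data = dat[idx, ])
  oob <- setdiff(1:N, idx)                    # cases with b in their set C^i
  pred <- predict(fit, dat[oob, ])
  for (j in seq_along(oob))
    loss_oob[[oob[j]]] <- c(loss_oob[[oob[j]]], (y[oob[j]] - pred[j])^2)
}

err1 <- mean(sapply(loss_oob, function(v) if (length(v)) mean(v) else NA),
             na.rm = TRUE)                    # \hat{Err}^{(1)}
err  <- mean(resid(lm(y ~ x, data = dat))^2)  # training error
err632 <- 0.368 * err + 0.632 * err1          # \hat{Err}^{(.632)}
c(err = err, err1 = err1, err632 = err632)
```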
13 Linear (and a Bit on Quadratic) Methods of Classification

Suppose now that an output variable $y$ takes values in $G=\{1,2,\ldots,K\}$, or equivalently that $y$ is a $K$-variate set of indicators, $y_k=I[y=k]$. The object here is to consider methods of producing prediction/classification rules $\hat{f}(x)$ taking values in $G$ (and mostly ones) that have sets $\left\{x\in\Re^p\,|\,\hat{f}(x)=k\right\}$ with boundaries that are defined (at least piece-wise) by linear equalities
$$x'\beta=c \tag{82}$$
The most obvious/naive potential method here, namely to regress the $K$ indicator variables $y_k$ onto $x$ (producing least squares regression vector coefficients $\hat{\beta}_k$) and then to employ
$$\hat{f}(x)=\arg\max_k\hat{f}_k(x)=\arg\max_k x'\hat{\beta}_k$$
fails miserably because of the possibility of "masking" if $K>2$. One must be smarter than this. Three kinds of smarter alternatives are Linear (and Quadratic) Discriminant Analysis, Logistic Regression, and direct searches for separating hyperplanes. The first two of these are "statistical" in origin with long histories in the field.
13.1 Linear (and a bit on Quadratic) Discriminant Analysis

Suppose that for $(x,y)\sim P$, $\pi_k=P[y=k]$ and the conditional distribution of $x$ on $\Re^p$ given that $y=k$ is $\mathrm{MVN}_p\left(\mu_k,\Sigma\right)$, i.e. the conditional pdf is
$$g_k(x)=(2\pi)^{-p/2}\left(\det\Sigma\right)^{-1/2}\exp\left(-\frac{1}{2}\left(x-\mu_k\right)'\Sigma^{-1}\left(x-\mu_k\right)\right)$$
Then it follows that
$$\ln\left(\frac{P[y=k|x=u]}{P[y=l|x=u]}\right)=\ln\left(\frac{\pi_k}{\pi_l}\right)-\frac{1}{2}\mu_k'\Sigma^{-1}\mu_k+\frac{1}{2}\mu_l'\Sigma^{-1}\mu_l+u'\Sigma^{-1}\left(\mu_k-\mu_l\right) \tag{83}$$
so that a theoretically optimal predictor/decision rule is
$$f(x)=\arg\max_k\left(\ln\left(\pi_k\right)-\frac{1}{2}\mu_k'\Sigma^{-1}\mu_k+x'\Sigma^{-1}\mu_k\right)$$
and boundaries between regions in $\Re^p$ where $f(x)=k$ and $f(x)=l$ are subsets of the sets
$$\left\{x\in\Re^p\ \Big|\ x'\Sigma^{-1}\left(\mu_k-\mu_l\right)=\ln\left(\frac{\pi_l}{\pi_k}\right)+\frac{1}{2}\mu_k'\Sigma^{-1}\mu_k-\frac{1}{2}\mu_l'\Sigma^{-1}\mu_l\right\}$$
i.e. are defined by equalities of the form (82).
This is dependent upon all $K$ conditional normal distributions having the same covariance matrix, $\Sigma$. In the event these are allowed to vary, conditional distribution $k$ with covariance matrix $\Sigma_k$, a theoretically optimal predictor/decision rule is
$$f(x)=\arg\max_k\left(\ln\left(\pi_k\right)-\frac{1}{2}\ln\left(\det\Sigma_k\right)-\frac{1}{2}\left(x-\mu_k\right)'\Sigma_k^{-1}\left(x-\mu_k\right)\right)$$
and boundaries between regions in $\Re^p$ where $f(x)=k$ and $f(x)=l$ are subsets of the sets
$$\left\{x\in\Re^p\ \Big|\ \frac{1}{2}\left(x-\mu_k\right)'\Sigma_k^{-1}\left(x-\mu_k\right)+\frac{1}{2}\ln\left(\det\Sigma_k\right)-\frac{1}{2}\left(x-\mu_l\right)'\Sigma_l^{-1}\left(x-\mu_l\right)-\frac{1}{2}\ln\left(\det\Sigma_l\right)=\ln\left(\frac{\pi_k}{\pi_l}\right)\right\}$$
Unless $\Sigma_k=\Sigma_l$ this kind of set is a quadratic surface in $\Re^p$, not a hyperplane. One gets (not linear, but) Quadratic Discriminant Analysis.

Of course, in order to use LDA or QDA, one must estimate the vectors $\mu_k$ and the covariance matrix $\Sigma$ or matrices $\Sigma_k$ from the training data. Estimating $K$ potentially different matrices $\Sigma_k$ requires estimation of a very large number of parameters. So thinking about QDA versus LDA, one is again in the situation of needing to find the level of predictor complexity that a given data set will support. QDA is a more flexible/complex method than LDA, but using it in preference to LDA increases the likelihood of over-fit and poor prediction.

One idea that has been offered as a kind of continuous compromise between LDA and QDA is for $\alpha\in(0,1)$ to use
$$\hat{\Sigma}_k(\alpha)=\alpha\hat{\Sigma}_k+(1-\alpha)\hat{\Sigma}_{\mathrm{pooled}}$$
in place of $\hat{\Sigma}_k$ in QDA. This kind of thinking even suggests as an estimate of a covariance matrix common across $k$
$$\hat{\Sigma}(\gamma)=\gamma\hat{\Sigma}_{\mathrm{pooled}}+(1-\gamma)\hat{\sigma}^2I$$
for $\gamma\in(0,1)$ and $\hat{\sigma}^2$ an estimate of variance pooled across groups $k$ and then across coordinates of $x$, $j$, in LDA. Combining these two ideas, one might even invent a two-parameter set of fitted covariance matrices
$$\hat{\Sigma}_k(\alpha,\gamma)=\alpha\hat{\Sigma}_k+(1-\alpha)\left(\gamma\hat{\Sigma}_{\mathrm{pooled}}+(1-\gamma)\hat{\sigma}^2I\right)$$
for use in QDA. Employing these in LDA or QDA provides the flexibility of choosing a complexity parameter or parameters and potentially improving prediction performance.
Returning specifically to LDA, let
$$\bar{\mu}=\frac{1}{K}\sum_{k=1}^K\mu_k$$
and note that one is free to replace $x$ and all $K$ means $\mu_k$ with respectively
$$x^*=\Sigma^{-1/2}\left(x-\bar{\mu}\right)\ \ \text{and}\ \ \mu_k^*=\Sigma^{-1/2}\left(\mu_k-\bar{\mu}\right)$$
This produces
$$\ln\left(\frac{P[y=k|x^*=u^*]}{P[y=l|x^*=u^*]}\right)=\ln\left(\frac{\pi_k}{\pi_l}\right)-\frac{1}{2}\left\|u^*-\mu_k^*\right\|^2+\frac{1}{2}\left\|u^*-\mu_l^*\right\|^2$$
and (in "sphered" form) the theoretically optimal predictor/decision rule can be described as
$$f(x)=\arg\max_k\left(\ln\left(\pi_k\right)-\frac{1}{2}\left\|x^*-\mu_k^*\right\|^2\right)$$
That is, in terms of $x^*$, optimal decisions are based on ordinary Euclidean distances to the transformed means $\mu_k^*$. Further, this form can often be made even simpler/be seen to depend upon a lower-dimensional (than $p$) distance. The $\mu_k^*$ typically span a subspace of $\Re^p$ of dimension $\min(p,K-1)$. For
$$M_{p\times K}=\left(\mu_1^*,\mu_2^*,\ldots,\mu_K^*\right)$$
let $P_M$ be the $p\times p$ projection matrix projecting onto the column space of $M$ in $\Re^p$ ($C(M)$). Then
$$\left\|x^*-\mu_k^*\right\|^2=\left\|\left[P_M+(I-P_M)\right]\left(x^*-\mu_k^*\right)\right\|^2=\left\|\left(P_Mx^*-\mu_k^*\right)+(I-P_M)x^*\right\|^2=\left\|P_Mx^*-\mu_k^*\right\|^2+\left\|(I-P_M)x^*\right\|^2$$
the last equality coming because $\left(P_Mx^*-\mu_k^*\right)\in C(M)$ and $(I-P_M)x^*\in C(M)^{\perp}$. Since $\left\|(I-P_M)x^*\right\|^2$ doesn't depend upon $k$, the theoretically optimal predictor/decision rule can be described as
$$f(x)=\arg\max_k\left(\ln\left(\pi_k\right)-\frac{1}{2}\left\|P_Mx^*-\mu_k^*\right\|^2\right)$$
and theoretically optimal decision rules can be described in terms of the projection of $x^*$ onto $C(M)$ and its distances to the $\mu_k^*$.
Now,
$$\frac{1}{K}MM'$$
is the (typically rank $\min(p,K-1)$) sample covariance matrix of the $\mu_k^*$ and has an eigen decomposition as
$$\frac{1}{K}MM'=VDV'$$
for
$$D=\mathrm{diag}\left(d_1,d_2,\ldots,d_p\right)$$
where
$$d_1\geq d_2\geq\cdots\geq d_p$$
are the eigenvalues and the columns of $V$ are orthonormal eigenvectors corresponding in order to the successively smaller eigenvalues of $\frac{1}{K}MM'$. These $v_k$ with $d_k>0$ specify linear combinations of the coordinates of the $\mu_l^*$, $\langle v_k,\mu_l^*\rangle$, with the largest possible sample variances subject to the constraints that $\|v_k\|=1$ and $\langle v_l,v_k\rangle=0$ for all $l<k$. These $v_k$ are $\perp$ vectors in successive directions of most important unaccounted-for spread of the $\mu_k^*$. This suggests the possibility of "reduced rank LDA."

That is, for $l\leq\mathrm{rank}\left(MM'\right)$ define
$$V_l=\left(v_1,v_2,\ldots,v_l\right)$$
let
$$P_l=V_lV_l'$$
be the matrix projecting onto $C(V_l)$ in $\Re^p$. A possible "reduced rank" approximation to the theoretically optimal LDA classification rule is
$$f_l(x)=\arg\max_k\left(\ln\left(\pi_k\right)-\frac{1}{2}\left\|P_lx^*-P_l\mu_k^*\right\|^2\right)$$
and $l$ becomes a complexity parameter that one might optimize to tune or regularize the method.

Note also that for $w\in\Re^p$
$$P_lw=\sum_{k=1}^l\langle v_k,w\rangle v_k$$
For purposes of graphical representation of what is going on in these computations, one might replace the $p$ coordinates of $x^*$ and the means $\mu_k^*$ with the $l$ coordinates of
$$\left(\langle v_1,x^*\rangle,\langle v_2,x^*\rangle,\ldots,\langle v_l,x^*\rangle\right)' \tag{84}$$
and of the
$$\left(\langle v_1,\mu_k^*\rangle,\langle v_2,\mu_k^*\rangle,\ldots,\langle v_l,\mu_k^*\rangle\right)' \tag{85}$$
It seems to be essentially ordered pairs of these coordinates that are plotted in HTF in their Figures 4.8 and 4.11. In this regard, we need to point out that since any eigenvector $v_k$ could be replaced by $-v_k$ without any fundamental effect in the above development, the vector (84) and all of the vectors (85) could be altered by multiplication of any particular set of coordinates by $-1$. (Whether a particular algorithm for finding eigenvectors produces $v_k$ or $-v_k$ is not fundamental, and there seems to be no standard convention in this regard.) It appears that the pictures in HTF might have been made using the R function lda and its choice of signs for eigenvectors.

The form $x'\beta$ is (of course and by design) linear in the coordinates of $x$. An obvious natural generalization of this discussion is to consider discriminants that are linear in some (non-linear) functions of the coordinates of $x$. This is simply choosing some $M$ basis functions/transforms/features $h_m(x)$ and replacing the $p$ coordinates of $x$ with the $M$ coordinates of $\left(h_1(x),h_2(x),\ldots,h_M(x)\right)$ in the development of LDA.

Of course, upon choosing basis functions that are all coordinates, squares of coordinates, and products of coordinates of $x$, one produces linear (in the basis functions) discriminants that are general quadratic functions of $x$. Or, data-dependent basis functions that are $K(\cdot,x_i)$ for some kernel function $K$ produce "LDA" that depends upon the value of some "basis function network." The possibilities opened here are myriad and (as always) "the devil is in the details."
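A hedged illustration (mine): MASS::lda (the R function mentioned above) estimates the $\pi_k$, $\mu_k$, and pooled $\Sigma$ from training data and returns discriminant coordinates of the kind in (84)-(85); predict() gives the implied classifications. The example data set is the standard iris data.

```r
library(MASS)

fit <- lda(Species ~ ., data = iris)          # K = 3 classes, p = 4 inputs
pred <- predict(fit, iris)
table(iris$Species, pred$class)               # training-set confusion matrix
head(pred$x)                                  # the (at most K - 1) discriminant coordinates
```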
13.2 Logistic Regression

A generalization of the MVN conditional distribution result (83) is an assumption that for all $k<K$
$$\ln\left(\frac{P[y=k|x=u]}{P[y=K|x=u]}\right)=\beta_{k0}+u'\beta_k \tag{86}$$
Here there are $K-1$ constants $\beta_{k0}$ and $K-1$ $p$-vectors $\beta_k$ to be specified, not necessarily tied to class mean vectors or a common within-class covariance matrix for $x$. In fact, the set of relationships (86) do not fully specify a joint distribution for $(x,y)$. Rather, they only specify the nature of the conditional distributions of $y|x=u$. (In this regard, the situation is exactly analogous to that in ordinary simple linear regression. A bivariate normal distribution for $(x,y)$ gets one normal conditional distributions for $y$ with a constant variance and mean linear in $x$. But one may make those assumptions conditionally on $x$, without assuming anything about the marginal distribution of $x$, that in the bivariate normal model is univariate normal.)

Using $\theta$ as shorthand for a vector containing all the constants $\beta_{k0}$ and the vectors $\beta_k$, the linear log odds assumption (86) produces the forms
$$P[y=k|x=u]=p_k(u;\theta)=\frac{\exp\left(\beta_{k0}+u'\beta_k\right)}{1+\sum_{l=1}^{K-1}\exp\left(\beta_{l0}+u'\beta_l\right)}$$
for $k<K$, and
$$P[y=K|x=u]=p_K(u;\theta)=\frac{1}{1+\sum_{l=1}^{K-1}\exp\left(\beta_{l0}+u'\beta_l\right)}$$
and a theoretically optimal (under 0-1 loss) predictor/classification rule is
$$f(x)=\arg\max_k p_k(x;\theta)$$
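A hedged illustration (mine): for $K=2$ the model (86) is ordinary logistic regression, fit here with glm(); for $K>2$ the contributed nnet package's multinom() fits the same linear-log-odds form by maximum likelihood. The example uses two classes of the iris data.

```r
two_class <- droplevels(subset(iris, Species != "setosa"))
fit2 <- glm(Species ~ Sepal.Length + Sepal.Width, data = two_class,
            family = binomial)
coef(fit2)          # the beta_{k0} and beta_k for the single non-reference class

# library(nnet)
# fitK <- multinom(Species ~ ., data = iris)   # K = 3 version of (86)
```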
95
13.3 Separating Hyperplanes

In the $K=2$ group case, if there is a $\beta\in\Re^p$ and real number $\beta_0$ such that in the training data

$$y=2 \ \text{ exactly when }\ x'\beta+\beta_0>0$$

a "separating hyperplane"

$$\{x\in\Re^p\,|\,x'\beta+\beta_0=0\}$$

can be found via logistic regression. The (conditional) likelihood will not have a maximum, but if one follows a search path far enough toward a limiting value of $0$ for the loglikelihood or $1$ for the likelihood, satisfactory $\beta\in\Re^p$ and $\beta_0$ from an iteration of the search algorithm will produce separation.
A famous older algorithm for finding a separating hyperplane is the so-called "perceptron" algorithm. It can be defined as follows. Temporarily treat $y$ taking values in $\mathcal{G}=\{1,2\}$ as real-valued, and define

$$\tilde{y}=2y-3$$

(so that $\tilde{y}$ takes real values $\pm1$). From some starting points $\beta^0$ and $\beta_0^0$ cycle through the training data cases in order (repeatedly as needed). At any iteration $l$, take

$$\beta^l=\beta^{l-1} \text{ and } \beta_0^l=\beta_0^{l-1} \quad\text{if}\quad \{y_i=2 \text{ and } x_i'\beta+\beta_0>0\} \text{ or } \{y_i=1 \text{ and } x_i'\beta+\beta_0\le 0\}$$

$$\beta^l=\beta^{l-1}+\tilde{y}_i x_i \text{ and } \beta_0^l=\beta_0^{l-1}+\tilde{y}_i \quad\text{otherwise}$$

This will eventually identify a separating hyperplane; one knows it has been found when a series of $N$ iterations fails to change the values of $\beta$ and $\beta_0$.

If there is a separating hyperplane, it will typically not be unique. One can attempt to define and search for "optimal" such hyperplanes that, e.g., maximize distance from the plane to the closest training vector. See HTF Section 4.5.2 in this regard.
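A minimal base R sketch of the perceptron updates just described (all names are hypothetical, and the toy data are assumed to be linearly separable so that the cycling terminates):

  ## perceptron for y coded 1/2, following the updates above
  perceptron <- function(X, y, max_pass = 1000) {
    ytilde <- 2 * y - 3                     # recode {1,2} as {-1,+1}
    N <- nrow(X); p <- ncol(X)
    beta <- rep(0, p); beta0 <- 0
    changed <- TRUE
    for (pass in 1:max_pass) {
      changed <- FALSE
      for (i in 1:N) {
        margin  <- sum(X[i, ] * beta) + beta0
        correct <- (ytilde[i] == 1 && margin > 0) || (ytilde[i] == -1 && margin <= 0)
        if (!correct) {                     # update on a misclassified case
          beta  <- beta + ytilde[i] * X[i, ]
          beta0 <- beta0 + ytilde[i]
          changed <- TRUE
        }
      }
      if (!changed) break                   # a full cycle with no change: separation found
    }
    list(beta = beta, beta0 = beta0, converged = !changed)
  }

  ## example on well-separated toy data
  set.seed(2)
  X <- rbind(matrix(rnorm(40, -2), ncol = 2), matrix(rnorm(40, 2), ncol = 2))
  y <- rep(1:2, each = 20)
  perceptron(X, y)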
14 Support Vector Machines

Consider a 2-class classification problem. For notational convenience, we'll suppose that output $y$ takes values in $\mathcal{G}=\{-1,1\}$. Our present concern is in a further development of linear classification methodology beyond that provided in Section 13. For $\beta\in\Re^p$ and $\beta_0\in\Re$ we'll consider the form

$$g(x)=x'\beta+\beta_0 \qquad (87)$$

and a theoretical predictor/classifier

$$f(x)=\mathrm{sign}(g(x)) \qquad (88)$$

We will approach the problem of choosing $\beta$ and $\beta_0$ to in some sense provide a maximal cushion around a hyperplane separating between $x_i$ with corresponding $y_i=1$ and $x_i$ with corresponding $y_i=-1$.
14.1 The Linearly Separable Case

In the case that there is a classifier of form (88) with 0 training error rate, we consider the optimization problem

$$\max_{u \text{ with } \|u\|=1,\ \beta_0\in\Re}\ M \quad\text{subject to}\quad y_i(x_i'u+\beta_0)\ge M\ \forall i \qquad (89)$$

This can be thought of in terms of choosing a unit vector $u$ (or direction) in $\Re^p$ so that upon projecting the training input vectors $x_i$ onto the subspace of multiples of $u$ there is maximum separation between the convex hull of projections of the $x_i$ with $y_i=-1$ and the convex hull of projections of $x_i$ with corresponding $y_i=1$. (The sign on $u$ is chosen to give the latter larger $x_i'u$ than the former.) Notice that if $u$ and $\beta_0$ solve this optimization problem

$$M=\frac{1}{2}\left(\min_{x_i \text{ with } y_i=1} x_i'u-\max_{x_i \text{ with } y_i=-1} x_i'u\right)$$

and

$$\beta_0=-\frac{1}{2}\left(\min_{x_i \text{ with } y_i=1} x_i'u+\max_{x_i \text{ with } y_i=-1} x_i'u\right)$$
For purposes of applying standard optimization theory and software, it is useful to reformulate the basic problem (89) several ways. First, note that (89) may be rewritten as

$$\max_{u \text{ with } \|u\|=1,\ \beta_0\in\Re}\ M \quad\text{subject to}\quad y_i\left(x_i'\frac{u}{M}+\frac{\beta_0}{M}\right)\ge 1\ \forall i \qquad (90)$$

Then if we let

$$\beta=\frac{u}{M}$$

it's the case that

$$\|\beta\|=\frac{1}{M} \quad\text{or}\quad M=\frac{1}{\|\beta\|}$$

so that (90) can be rewritten

$$\min_{\beta\in\Re^p,\ \beta_0\in\Re}\ \frac{1}{2}\|\beta\|^2 \quad\text{subject to}\quad y_i(x_i'\beta+\beta_0)\ge 1\ \forall i \qquad (91)$$

This formulation (91) is that of a convex (quadratic criterion, linear inequality constraints) optimization problem for which there exists standard theory and algorithms.
The so-called primal functional corresponding to (91) is (for $\alpha\in\Re^N$)

$$F_P(\beta,\beta_0,\alpha)=\frac{1}{2}\|\beta\|^2-\sum_{i=1}^N \alpha_i\left(y_i(x_i'\beta+\beta_0)-1\right) \quad\text{for } \alpha\ge 0$$

To solve (91), one may for each $\alpha\ge 0$ choose $(\beta(\alpha),\beta_0(\alpha))$ to minimize $F_P(\beta,\beta_0,\alpha)$ and then choose $\alpha\ge 0$ to maximize $F_P(\beta(\alpha),\beta_0(\alpha),\alpha)$. The Karush-Kuhn-Tucker conditions are necessary and sufficient for solution of a constrained optimization problem. In the present context they are the gradient conditions

$$\frac{\partial F_P(\beta,\beta_0,\alpha)}{\partial \beta_0}=-\sum_{i=1}^N \alpha_i y_i=0 \qquad (92)$$

and

$$\frac{\partial F_P(\beta,\beta_0,\alpha)}{\partial \beta}=\beta-\sum_{i=1}^N \alpha_i y_i x_i=0 \qquad (93)$$

the feasibility conditions

$$y_i(x_i'\beta+\beta_0)-1\ge 0\ \forall i \qquad (94)$$

the non-negativity conditions

$$\alpha\ge 0 \qquad (95)$$

and the orthogonality conditions

$$\alpha_i\left(y_i(x_i'\beta+\beta_0)-1\right)=0\ \forall i \qquad (96)$$

Now (92) and (93) are

$$\sum_{i=1}^N \alpha_i y_i=0 \quad\text{and}\quad \beta(\alpha)=\sum_{i=1}^N \alpha_i y_i x_i \qquad (97)$$
and plugging these into $F_P(\beta,\beta_0,\alpha)$ gives a function of $\alpha$ only

$$F_D(\alpha)=\frac{1}{2}\|\beta(\alpha)\|^2-\sum_{i=1}^N \alpha_i\left(y_i x_i'\beta(\alpha)-1\right)
=\frac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j x_i'x_j-\sum_i\sum_j \alpha_i\alpha_j y_i y_j x_i'x_j+\sum_i \alpha_i
=\sum_i \alpha_i-\frac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j x_i'x_j
=\mathbf{1}'\alpha-\frac{1}{2}\alpha'H\alpha$$

for

$$H=\left(y_i y_j x_i'x_j\right)_{N\times N} \qquad (98)$$

Then the "dual" problem is

$$\max_\alpha\ \mathbf{1}'\alpha-\frac{1}{2}\alpha'H\alpha \quad\text{subject to}\quad \alpha\ge 0 \text{ and } \alpha'y=0 \qquad (99)$$

and apparently this problem is easily solved.
Now condition (96) implies that if $\alpha_i^{opt}>0$

$$y_i\left(x_i'\beta(\alpha^{opt})+\beta_0^{opt}\right)=1$$

so that

1. by (94) the corresponding $x_i$ has minimum $x_i'\beta(\alpha^{opt})$ for training vectors with $y_i=1$ or maximum $x_i'\beta(\alpha^{opt})$ for training vectors with $y_i=-1$ (so that $x_i$ is a support vector for the "slab" of thickness $2M$ around a separating hyperplane),

2. $\beta_0(\alpha^{opt})$ may be determined using the corresponding $x_i$ from

$$y_i\left(x_i'\beta(\alpha^{opt})+\beta_0^{opt}\right)=1 \quad\text{i.e.}\quad \beta_0^{opt}=y_i-x_i'\beta(\alpha^{opt})$$

(apparently for reasons of numerical stability it is common practice to average values $y_i-x_i'\beta(\alpha^{opt})$ for support vectors in order to evaluate $\beta_0(\alpha^{opt})$) and,

3.

$$1=y_i\beta_0^{opt}+y_i\left(\sum_{j=1}^N \alpha_j^{opt} y_j x_j\right)'x_i=y_i\beta_0^{opt}+\sum_{j=1}^N \alpha_j^{opt} y_j y_i x_j'x_i$$
The fact (97) that $\beta(\alpha)=\sum_{i=1}^N \alpha_i y_i x_i$ implies that only the training cases with $\alpha_i>0$ (typically corresponding to a relatively few support vectors) determine the nature of the solution to this optimization problem. Further, for $SV$ the set of indices of support vectors in the problem,

$$\|\beta(\alpha^{opt})\|^2=\sum_{i\in SV}\sum_{j\in SV} \alpha_i^{opt}\alpha_j^{opt} y_i y_j x_i'x_j
=\sum_{i\in SV} \alpha_i^{opt} y_i \sum_{j\in SV} \alpha_j^{opt} y_j x_j'x_i
=\sum_{i\in SV} \alpha_i^{opt}\left(1-y_i\beta_0^{opt}\right)
=\sum_{i\in SV} \alpha_i^{opt}$$

the next to last of these following from 3. above, and the last following from the gradient condition (92). Then the "margin" for this problem is simply

$$M=\frac{1}{\|\beta(\alpha^{opt})\|}=\frac{1}{\sqrt{\sum_{i\in SV} \alpha_i^{opt}}}$$
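For illustration, the dual problem (99) can be handed to a standard quadratic programming routine. The sketch below uses quadprog::solve.QP on simulated, linearly separable data; since $H$ is only positive semi-definite, a tiny ridge is added to satisfy the routine's positive-definiteness requirement (that ridge, the data, and the tolerance for identifying support vectors are choices of this illustration, not of the outline).

  ## solving the hard-margin dual (99) with quadprog
  library(quadprog)

  set.seed(3)
  X <- rbind(matrix(rnorm(40, -3), ncol = 2), matrix(rnorm(40, 3), ncol = 2))
  y <- rep(c(-1, 1), each = 20)
  N <- nrow(X)

  H    <- (y %*% t(y)) * (X %*% t(X))      # H = (y_i y_j x_i'x_j)
  Dmat <- H + 1e-6 * diag(N)               # small ridge for numerical positive definiteness
  dvec <- rep(1, N)
  Amat <- cbind(y, diag(N))                # alpha'y = 0 (equality), alpha >= 0
  bvec <- c(0, rep(0, N))
  sol  <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)

  alpha <- sol$solution
  sv    <- which(alpha > 1e-6)                                   # support vectors
  beta  <- colSums(alpha[sv] * y[sv] * X[sv, , drop = FALSE])    # (97)
  beta0 <- mean(y[sv] - X[sv, , drop = FALSE] %*% beta)          # averaged as in 2. above
  M     <- 1 / sqrt(sum(alpha[sv]))                              # the margin formula above
  list(beta = beta, beta0 = beta0, margin = M)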
14.2 The Linearly Non-separable Case
In a linearly non-separable case, the convex optimization problem (91) does not have a solution (no pair $\beta\in\Re^p$ and $\beta_0\in\Re$ provides $y_i(x_i'\beta+\beta_0)\ge 1\ \forall i$). We might, therefore (in looking for good choices of $\beta\in\Re^p$ and $\beta_0\in\Re$), try to relax the constraints of the problem slightly. That is, suppose that $\xi_i\ge 0$ and consider the set of constraints

$$y_i(x_i'\beta+\beta_0)+\xi_i\ge 1\ \forall i$$

(the $\xi_i$ are called "slack" variables and provide some "wiggle room" in the search for a hyperplane that "nearly" separates the two classes with a good margin). We might try to control the total amount of slack allowed by setting a limit

$$\sum_{i=1}^N \xi_i\le C$$

for some positive $C$. Note that if $y_i(x_i'\beta+\beta_0)\ge 0$ case $i$ is correctly classified in the training set, and so if for some pair $\beta\in\Re^p$ and $\beta_0\in\Re$ this holds for all $i$, we have a separable problem. So any non-separable problem must have at least one negative $y_i(x_i'\beta+\beta_0)$ for any $\beta\in\Re^p$ and $\beta_0\in\Re$ pair. This in turn requires that the budget $C$ must be at least 1 for a non-separable problem to have a solution even with the addition of slack variables. In fact, this reasoning implies that a budget $C$ allows for at most $C$ mis-classifications in the training set. And in a non-separable case, $C$ must be allowed to be large enough so that some choice of $\beta\in\Re^p$ and $\beta_0\in\Re$ produces a classifier with training error rate no larger than $C/N$.
In any event, we consider the optimization problem

$$\min_{\beta\in\Re^p,\ \beta_0\in\Re}\ \frac{1}{2}\|\beta\|^2 \quad\text{subject to}\quad y_i(x_i'\beta+\beta_0)+\xi_i\ge 1\ \forall i \text{ for some } \xi_i\ge 0 \text{ with } \sum_{i=1}^N \xi_i\le C \qquad (100)$$

that can be thought of as generalizing the problem (91). HTF prefer to motivate (100) as equivalent to

$$\max_{u \text{ with } \|u\|=1,\ \beta_0\in\Re}\ M \quad\text{subject to}\quad y_i(x_i'u+\beta_0)\ge M(1-\xi_i)\ \forall i \text{ for some } \xi_i\ge 0 \text{ with } \sum_{i=1}^N \xi_i\le C$$

generalizing the original problem (89), but it's not so clear that this is any more natural than simply beginning with (100).

A more convenient version of (100) is

$$\min_{\beta\in\Re^p,\ \beta_0\in\Re}\ \frac{1}{2}\|\beta\|^2+C\sum_{i=1}^N \xi_i \quad\text{subject to}\quad y_i(x_i'\beta+\beta_0)+\xi_i\ge 1\ \forall i \text{ for some } \xi_i\ge 0 \qquad (101)$$
A nice development on pages 376-378 of Izenman's book provides the following solution to this problem (101) parallel to the development in Section 14.1. Generalizing (99), the "dual" problem boils down to

$$\max_\alpha\ \mathbf{1}'\alpha-\frac{1}{2}\alpha'H\alpha \quad\text{subject to}\quad 0\le\alpha\le C\mathbf{1} \text{ and } \alpha'y=0 \qquad (102)$$

for

$$H=\left(y_i y_j x_i'x_j\right)_{N\times N} \qquad (103)$$

The constraint $0\le\alpha\le C\mathbf{1}$ is known as a "box constraint" and the "feasible region" prescribed in (102) is the intersection of a hyperplane defined by $\alpha'y=0$ and a "box" in the positive orthant. The $C=\infty$ version of this reduces to the "hard margin" separable case.

Upon solving (102) for $\alpha^{opt}$, the optimal $\beta\in\Re^p$ is of the form

$$\beta^{opt}=\sum_{i\in SV} \alpha_i^{opt} y_i x_i \qquad (104)$$

for $SV$ the set of indices of "support vectors" $x_i$ which have $\alpha_i^{opt}>0$. The points with $0<\alpha_i^{opt}<C$ will lie on the edge of the margin (have $\xi_i=0$) and the ones with $\alpha_i^{opt}=C$ have $\xi_i>0$. Any of the support vectors on the edge of the margin (with $0<\alpha_i^{opt}<C$) may be used to solve for $\beta_0\in\Re$ as

$$\beta_0^{opt}=y_i-x_i'\beta^{opt} \qquad (105)$$

and again, apparently for reasons of numerical stability it is common practice to average values $y_i-x_i'\beta^{opt}$ for such support vectors in order to evaluate $\beta_0(\alpha^{opt})$. Clearly, in this process the constant $C$ functions as a regularization/complexity parameter, and large $C$ in (101) corresponds to small $C$ in (100). Identification of a classifier requires only solution of the dual problem (102) and evaluation of (104) and (105) to produce linear form (87) and classifier (88).
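In practice one rarely solves (102) by hand; for example, the e1071 package wraps libsvm. A usage sketch (the data and cost value are illustrative only; cost plays the role of the constant C in (101), and the sign convention of the recovered linear form depends on libsvm's internal labeling of the two classes):

  ## soft-margin linear SVM via e1071
  library(e1071)

  set.seed(4)
  X <- rbind(matrix(rnorm(60, -1), ncol = 2), matrix(rnorm(60, 1), ncol = 2))
  y <- factor(rep(c(-1, 1), each = 30))

  fit <- svm(X, y, kernel = "linear", cost = 1, scale = FALSE)

  ## support vectors and the fitted linear form (87): g(x) = x'beta + beta0
  beta  <- colSums(fit$coefs[, 1] * fit$SV)   # coefs are y_i * alpha_i for the support vectors
  beta0 <- -fit$rho
  table(predicted = predict(fit, X), truth = y)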
14.3 SVM's and Kernels

The form (87) is (of course and by design) linear in the coordinates of $x$. A natural generalization of this development would be to consider forms that are linear in some (non-linear) functions of the coordinates of $x$. There is nothing really new or special to SVM classifiers associated with this possibility if it is applied by simply defining some basis functions $h_m(x)$ and considering the form

$$g(x)=\left(h_1(x),h_2(x),\ldots,h_M(x)\right)\beta+\beta_0$$

for use in a classifier. Indeed, HTF raise this kind of possibility already in their Chapter 4, where they introduce linear classification methods. However, the fact that in both linearly separable and linearly non-separable cases optimal SVM's depend upon the training input vectors $x_i$ only through their inner products (see again (98) and (103)), together with our experience with RKHS's and the computation of inner products in function spaces in terms of kernel values, suggests another way in which one might employ linear forms of nonlinear functions in classification.
14.3.1 Heuristics

Let $K$ be a non-negative definite kernel and consider the possibility of using functions $K(x,x_1),K(x,x_2),\ldots,K(x,x_N)$ to build new ($N$-dimensional) feature vectors

$$k(x)=\left(K(x,x_1),K(x,x_2),\ldots,K(x,x_N)\right)'$$

for any input vector $x$ (including the $x_i$ in the training set) and, rather than defining inner products for new feature vectors (for input vectors $x$ and $z$) in terms of $\Re^N$ inner products

$$k(x)'k(z)=\sum_{l=1}^N K(x,x_l)K(z,x_l)$$

we instead consider using the RKHS inner products of corresponding functions

$$\langle K(x,\cdot),K(z,\cdot)\rangle_{\mathcal{H}_K}=K(x,z)$$

Then, in place of (98) or (103), define

$$H=\left(y_i y_j K(x_i,x_j)\right)_{N\times N} \qquad (106)$$

and let $\alpha^{opt}$ solve either (99) or (102). Then with

$$\beta^{opt}=\sum_{i=1}^N \alpha_i^{opt} y_i k(x_i)$$

as in the developments of the previous sections, we replace the $\Re^N$ inner product of $\beta(\alpha^{opt})$ and a feature vector $k(x)$ with

$$\left\langle \sum_{i=1}^N \alpha_i^{opt} y_i K(x_i,\cdot),K(x,\cdot)\right\rangle_{\mathcal{H}_K}=\sum_{i=1}^N \alpha_i^{opt} y_i \langle K(x_i,\cdot),K(x,\cdot)\rangle_{\mathcal{H}_K}=\sum_{i=1}^N \alpha_i^{opt} y_i K(x,x_i)$$

Then for any $i$ for which $\alpha_i^{opt}>0$ (an index corresponding to a support feature vector in this context) we set

$$\beta_0^{opt}=y_i-\sum_{j=1}^N \alpha_j^{opt} y_j K(x_i,x_j)$$

and have an empirical analogue of (87) (for the kernel case)

$$\hat{g}(x)=\sum_{i=1}^N \alpha_i^{opt} y_i K(x,x_i)+\beta_0^{opt} \qquad (107)$$

with corresponding classifier

$$\hat{f}(x)=\mathrm{sign}(\hat{g}(x)) \qquad (108)$$

as an empirical analogue of (88). It remains to argue that this classifier, developed simply by analogy, is a solution to any well-posed optimization problem.
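A small base R sketch of the resulting kernel classifier (107)-(108): given kernel values, responses, and a dual solution alpha (from (99) or (102) with H as in (106)), the fitted g-hat is just a kernel-weighted sum over support vectors. The radial kernel and all object names here are illustrative assumptions.

  ## kernel-trick classifier (107)-(108), assuming alpha and beta0 are available
  rbf <- function(x, z, gamma = 1) exp(-gamma * sum((x - z)^2))

  ghat <- function(x, X, y, alpha, beta0, kernel = rbf) {
    sv <- which(alpha > 1e-8)                           # support vector indices
    s  <- sum(sapply(sv, function(i) alpha[i] * y[i] * kernel(x, X[i, ])))
    s + beta0                                           # (107)
  }

  fhat <- function(x, ...) sign(ghat(x, ...))           # (108)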
14.3.2 An Optimality Argument

The heuristic argument for the use of kernels in the SVM context to produce form (107) and classifier (108) is clever enough that some authors simply let it stand on its own as "justification" for using "the kernel trick" of replacing $\Re^N$ inner products of feature vectors with $\mathcal{H}_K$ inner products of basis functions. A more satisfying argument would be based on an appeal to optimality/regularization considerations. HTF's treatment of this possibility is not helpful. The following is based on a 2002 Machine Learning paper of Lin, Wahba, Zhang, and Lee.

As in Section 11 consider the RKHS associated with the non-negative definite kernel $K$, and the optimization problem involving the "hinge loss"

$$\min_{g\in\mathcal{H}_K,\ \beta_0\in\Re}\ \sum_{i=1}^N \left(1-y_i(\beta_0+g(x_i))\right)_+ +\frac{\lambda}{2}\|g\|^2_{\mathcal{H}_K} \qquad (109)$$

(The so-called hinge loss is convenient to use here in making optimality arguments. For $(x,y)\sim P$ it is reasonably straightforward to argue that the function $h$ minimizing the expected hinge loss $\mathrm{E}[1-yh(x)]_+$ is $h_{opt}(u)=\mathrm{sign}\left(P[y=1|x=u]-\frac{1}{2}\right)$, namely the minimum risk classifier under 0-1 loss.)

Exactly as noted in Section 11, an optimizing $g\in\mathcal{H}_K$ above must be of the form

$$g(x)=\sum_{j=1}^N \beta_j K(x,x_j)=\beta'k(x)$$

so the minimization problem is

$$\min_{\beta\in\Re^N,\ \beta_0\in\Re}\ \sum_{i=1}^N \left(1-y_i(\beta_0+\beta'k(x_i))\right)_+ +\frac{\lambda}{2}\left\|\sum_{j=1}^N \beta_j K(\cdot,x_j)\right\|^2_{\mathcal{H}_K}$$

that is,

$$\min_{\beta\in\Re^N,\ \beta_0\in\Re}\ \sum_{i=1}^N \left(1-y_i(\beta_0+\beta'k(x_i))\right)_+ +\frac{\lambda}{2}\beta'\mathbf{K}\beta$$

for

$$\mathbf{K}=\left(K(x_i,x_j)\right)_{N\times N}$$
Now apparently, this is equivalent to the optimization problem

$$\min_{\beta\in\Re^N,\ \beta_0\in\Re}\ \sum_{i=1}^N \xi_i+\frac{\lambda}{2}\beta'\mathbf{K}\beta \quad\text{subject to}\quad y_i(\beta'k(x_i)+\beta_0)+\xi_i\ge 1\ \forall i \text{ for some } \xi_i\ge 0$$

which (also apparently) for $H_{N\times N}=(y_i y_j K(x_i,x_j))$ as in (106) has the dual problem of the form

$$\max_\alpha\ \mathbf{1}'\alpha-\frac{1}{2\lambda}\alpha'H\alpha \quad\text{subject to}\quad 0\le\alpha\le\mathbf{1} \text{ and } \alpha'y=0 \qquad (110)$$
or

$$\max_\alpha\ \mathbf{1}'\alpha-\frac{1}{2}\alpha'\left(\frac{1}{\lambda}H\right)\alpha \quad\text{subject to}\quad 0\le\alpha\le\mathbf{1} \text{ and } \alpha'y=0 \qquad (111)$$

That is, the function space optimization problem (109) has a dual that is the same as (102) for the choice of $C=1$ and kernel $\frac{1}{\lambda}K(x,z)$ produced by the heuristic argument in Section 14.3.1. Then, if $\gamma^{opt}$ is a solution to (110), Lin et al. say that an optimal $\beta\in\Re^N$ is

$$\frac{1}{\lambda}\,\mathrm{diag}(y_1,\ldots,y_N)\,\gamma^{opt}$$

this producing coefficients to be applied to the functions $K(\cdot,x_i)$. On the other hand, the heuristic of Section 14.3.1 prescribes that for $\alpha^{opt}$ the solution to (111), coefficients in the vector

$$\mathrm{diag}(y_1,\ldots,y_N)\,\alpha^{opt}$$

get applied to the functions $\frac{1}{\lambda}K(\cdot,x_i)$. Upon recognizing that $\gamma^{opt}=\alpha^{opt}$ (problems (110) and (111) are identical), it becomes evident that for the choice of $C=1$ and kernel $\frac{1}{\lambda}K(\cdot,\cdot)$, the heuristic in Section 14.3.1 produces a solution to the optimization problem (109).
14.4 Other Support Vector Stuff

Several other issues related to the kind of arguments used in the development of SVM classifiers are discussed in HTF (and Izenman). One is the matter of multi-class problems. That is, where $\mathcal{G}=\{1,2,\ldots,K\}$ how might one employ machinery of this kind? There are both heuristic and optimality-based methods in the literature.

A heuristic one-against-all strategy might be the following. Invent 2-class problems ($K$ of them), the $k$th based on

$$y_{ki}=\begin{cases} 1 & \text{if } y_i=k \\ -1 & \text{otherwise}\end{cases}$$

Then for (a single) $C$ and $k=1,2,\ldots,K$ solve the (possibly linearly non-separable) 2-class optimization problems to produce functions $\hat{g}_k(x)$ (that would lead to one-versus-all classifiers $\hat{f}_k(x)=\mathrm{sign}(\hat{g}_k(x))$). A possible overall classifier is then

$$\hat{f}(x)=\arg\max_{k\in\mathcal{G}} \hat{g}_k(x)$$

A second heuristic strategy is to develop a voting scheme based on pair-wise comparisons. That is, one might invent $\binom{K}{2}$ problems of classifying class $l$ versus class $m$ for $l<m$, choose a single $C$, and solve the (possibly linearly non-separable) 2-class optimization problems to produce functions $\hat{g}_{lm}(x)$ and corresponding classifiers $\hat{f}_{lm}(x)=\mathrm{sign}(\hat{g}_{lm}(x))$. For $m>l$ define $\hat{f}_{ml}(x)=-\hat{f}_{lm}(x)$ and define an overall classifier by

$$\hat{f}(x)=\arg\max_{k\in\mathcal{G}}\left(\sum_{m\ne k}\hat{f}_{km}(x)\right)$$

or, equivalently

$$\hat{f}(x)=\arg\max_{k\in\mathcal{G}}\left(\sum_{m\ne k}I\left[\hat{f}_{km}(x)=1\right]\right)$$

In addition to these fairly ad hoc methods of trying to extend 2-class SVM technology to $K$-class problems, there are developments that directly address the problem (from an overall optimization point of view). Pages 391-397 of Izenman provide a nice summary of a 2004 paper of Lee, Lin, and Wahba in this direction.

Another type of question related to the support vector material is the extent to which similar methods might be relevant in regression-type prediction problems. As a matter of fact, there are loss functions alternative to squared error or absolute error that lead naturally to the use of the kind of technology needed to produce the SVM classifiers. That is, one might consider so-called "$\epsilon$-insensitive" losses for prediction like

$$L_1(y,\hat{y})=\max\left(0,|y-\hat{y}|-\epsilon\right)$$

or

$$L_2(y,\hat{y})=\max\left(0,(y-\hat{y})^2-\epsilon\right)$$

and be led to the kind of optimization methods employed in the SVM classification context. See Izenman pages 398-401 in this regard.
15 Boosting

The basic idea here is to take even a fairly simple (and poorly performing form of) predictor and, by sequentially modifying/perturbing it and re-weighting (or modifying) the training data set, to creep toward an effective predictor.
15.1 The AdaBoost Algorithm for 2-Class Classification

Consider a 2-class 0-1 loss classification problem. We'll suppose that output $y$ takes values in $\mathcal{G}=\{-1,1\}$ (this coding of the states is again more convenient than $\{0,1\}$ or $\{1,2\}$ codings for our present purposes). The AdaBoost.M1 algorithm is built on some base classifier form $f$ (that could be, for example, the simplest possible tree classifier, possessing just 2 final nodes). It proceeds as follows.

1. Initialize weights on the training data $(x_i,y_i)$ at

$$w_{i1}=\frac{1}{N} \text{ for } i=1,2,\ldots,N$$

2. Fit a $\mathcal{G}$-valued predictor/classifier $\hat{f}_1$ to the training data to optimize

$$\sum_{i=1}^N I\left[y_i\ne\hat{f}(x_i)\right]$$

let

$$\mathrm{err}_1=\frac{1}{N}\sum_{i=1}^N I\left[y_i\ne\hat{f}_1(x_i)\right]$$

and define

$$\alpha_1=\ln\left(\frac{1-\mathrm{err}_1}{\mathrm{err}_1}\right)$$

3. Set new weights on the training data

$$w_{i2}=\frac{1}{N}\exp\left(\alpha_1 I\left[y_i\ne\hat{f}_1(x_i)\right]\right) \text{ for } i=1,2,\ldots,N$$

(This up-weights mis-classified observations by a factor of $(1-\mathrm{err}_1)/\mathrm{err}_1$.)

4. For $m=2,3,\ldots,M$

(a) Fit a $\mathcal{G}$-valued predictor/classifier $\hat{f}_m$ to the training data to optimize

$$\sum_{i=1}^N w_{im} I\left[y_i\ne\hat{f}(x_i)\right]$$

(b) Let

$$\mathrm{err}_m=\frac{\sum_{i=1}^N w_{im} I\left[y_i\ne\hat{f}_m(x_i)\right]}{\sum_{i=1}^N w_{im}}$$

(c) Set

$$\alpha_m=\ln\left(\frac{1-\mathrm{err}_m}{\mathrm{err}_m}\right)$$

(d) Update weights as

$$w_{i(m+1)}=w_{im}\exp\left(\alpha_m I\left[y_i\ne\hat{f}_m(x_i)\right]\right) \text{ for } i=1,2,\ldots,N$$

5. Output a resulting classifier that is based on "weighted voting" by the classifiers $\hat{f}_m$ as

$$\hat{f}(x)=\mathrm{sign}\left(\sum_{m=1}^M \alpha_m\hat{f}_m(x)\right)=1-2I\left[\sum_{m=1}^M \alpha_m\hat{f}_m(x)<0\right]$$

(Classifiers with small $\mathrm{err}_m$ get big positive weights in the "voting.")

This apparently turns out to be a very effective way of developing a classifier, and has been called the best existing "off-the-shelf" methodology for the problem.
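A minimal R sketch of the AdaBoost.M1 steps above, using rpart stumps as the base classifier (the $\pm1$ coding of y, the guard on err_m, and the choice of M are assumptions of this illustration, not prescriptions of the outline):

  ## AdaBoost.M1 with 2-terminal-node rpart trees as base classifiers
  library(rpart)

  adaboost <- function(X, y, M = 50) {                 # y assumed coded -1/+1
    N <- nrow(X); w <- rep(1 / N, N)
    dat  <- data.frame(X, y = factor(y))
    fits <- vector("list", M); alpha <- numeric(M)
    for (m in 1:M) {
      fit  <- rpart(y ~ ., data = dat, weights = w, method = "class",
                    control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
      pred <- as.numeric(as.character(predict(fit, dat, type = "class")))
      err  <- sum(w * (pred != y)) / sum(w)            # weighted error rate
      err  <- min(max(err, 1e-10), 1 - 1e-10)          # guard the log
      alpha[m] <- log((1 - err) / err)
      w <- w * exp(alpha[m] * (pred != y))             # up-weight misclassified cases
      fits[[m]] <- fit
    }
    list(fits = fits, alpha = alpha)
  }

  predict_adaboost <- function(model, X) {             # weighted voting, step 5
    dat   <- data.frame(X)
    votes <- sapply(seq_along(model$fits), function(m) {
      model$alpha[m] *
        as.numeric(as.character(predict(model$fits[[m]], dat, type = "class")))
    })
    sign(rowSums(votes))
  }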
15.2 Why AdaBoost Might Be Expected to Work

HTF offer some motivation for the effectiveness of AdaBoost. This is related to the optimization of an expected "exponential loss."

15.2.1 Expected "Exponential Loss" for a Function of x
For $g$ an arbitrary function of $x$, consider the quantity

$$\mathrm{E}\exp(-yg(x))=\mathrm{E}\,\mathrm{E}\left[\exp(-yg(x))\,|\,x\right] \qquad (112)$$

If $g(x)$ were typically very large positively when $y=1$ and very large negatively when $y=-1$, this quantity (112) could be expected to be small. So if one were in possession of such a function of $x$, one might expect to produce a sensible classifier as

$$f(x)=\mathrm{sign}(g(x))=1-2I[g(x)<0] \qquad (113)$$

A bit differently put, a $g$ constructed to make (112) small should also produce a good classifier as (113).

It's a bit of a digression, but HTF point out that it's actually fairly easy to see what $g(x)$ optimizes (112). One needs only identify for each $u$ what value $a$ optimizes

$$\mathrm{E}\left[\exp(-ay)\,|\,x=u\right]$$

That is,

$$\mathrm{E}\left[\exp(-ay)\,|\,x=u\right]=\exp(-a)P[y=1|x=u]+\exp(a)P[y=-1|x=u]$$

and an optimal $a$ is easily seen to be half the log odds ratio, i.e. the $g$ optimizing (112) is

$$g(u)=\frac{1}{2}\ln\left(\frac{P[y=1|x=u]}{P[y=-1|x=u]}\right)$$
15.2.2 Sequential Search for a g(x) With Small Expected Exponential Loss and the AdaBoost Algorithm

Consider "base classifiers" $h_l(x;\gamma_l)$ (taking values in $\mathcal{G}=\{-1,1\}$) with parameters $\gamma_l$ and functions built from them of the form

$$g_m(x)=\sum_{l=1}^m \beta_l h_l(x;\gamma_l)$$

(for training-data-dependent $\beta_l$ and $\gamma_l$). Obviously, with this notation

$$g_m(x)=g_{m-1}(x)+\beta_m h_m(x;\gamma_m)$$

We might think of successive ones of these as perturbations of the previous ones (by addition of the term $\beta_m h_m(x;\gamma_m)$ to $g_{m-1}(x)$). We might further think about how such functions might be successively defined in an effort to make them effective at producing small values of (112). Of course, since we typically do not have available a complete probability model for $(x,y)$, we will only be able to seek to optimize/make small an empirical version of (112). As it turns out, taking these points of view produces a positive multiple of the AdaBoost voting function $\sum_{m=1}^M \alpha_m\hat{f}_m(x)$ as $g_M(x)$. (It then follows that AdaBoost might well produce effective classifiers by the logic that $g$ producing small (112) ought to translate to a good classifier via (113).)
So consider an empirical loss that amounts to an empirical version of $N$ times (112) for $g_m$

$$E_m=\sum_{i=1}^N \exp(-y_i g_m(x_i))$$

This is

$$E_m=\sum_{i=1}^N \exp\left(-y_i g_{m-1}(x_i)-y_i\beta_m h_m(x_i;\gamma_m)\right)=\sum_{i=1}^N \exp\left(-y_i g_{m-1}(x_i)\right)\exp\left(-y_i\beta_m h_m(x_i;\gamma_m)\right)$$

Let

$$v_{im}=\exp\left(-y_i g_{m-1}(x_i)\right)$$

and consider optimal choice of $\gamma_m$ and $\beta_m>0$ (for purposes of making $g_m$ the best possible perturbation of $g_{m-1}$ in terms of minimizing $E_m$). (We may assume that $\beta_l>0$ provided the set of classifiers defined by $h_l(x;\gamma_l)$ as $\gamma_l$ varies includes both $h$ and $-h$ for any version of the base classifier, $h$.)

Write

$$E_m=\sum_{i \text{ with } h_m(x_i;\gamma_m)=y_i} v_{im}\exp(-\beta_m)+\sum_{i \text{ with } h_m(x_i;\gamma_m)\ne y_i} v_{im}\exp(\beta_m)
=\left(\exp(\beta_m)-\exp(-\beta_m)\right)\sum_{i=1}^N v_{im} I\left[h_m(x_i;\gamma_m)\ne y_i\right]+\exp(-\beta_m)\sum_{i=1}^N v_{im}$$

So independent of the value of $\beta_m>0$, $\gamma_m$ needs to be chosen to optimize the $v_{im}$-weighted error rate of $h_m(x;\gamma_m)$. Call the optimized (using weights $v_{im}$) version of $h_m(x;\gamma_m)$ by the name $h_m(x)$, and notice that the prescription for choice of this base classifier is exactly the AdaBoost prescription of how to choose $\hat{f}_m$ (based on weights $w_{im}$).
So now consider minimization of

$$\sum_{i=1}^N v_{im}\exp\left(-y_i\beta h_m(x_i)\right)$$

over choices of $\beta$. This is

$$\exp(-\beta)\left(\sum_{i \text{ with } h_m(x_i)=y_i} v_{im}+\sum_{i \text{ with } h_m(x_i)\ne y_i} v_{im}\exp(2\beta)\right)
=\exp(-\beta)\left(\sum_{i=1}^N v_{im}+\sum_{i=1}^N v_{im}\left(\exp(2\beta)-1\right)I\left[h_m(x_i)\ne y_i\right]\right)$$

Let

$$\mathrm{err}^{h}_m=\frac{\sum_{i=1}^N v_{im} I\left[h_m(x_i)\ne y_i\right]}{\sum_{i=1}^N v_{im}}$$

and see that the choice of $\beta$ requires minimization of

$$\exp(-\beta)\left(1+\left(\exp(2\beta)-1\right)\mathrm{err}^{h}_m\right)$$

A bit of calculus shows that the optimizing $\beta$ is

$$\beta_m=\frac{1}{2}\ln\left(\frac{1-\mathrm{err}^{h}_m}{\mathrm{err}^{h}_m}\right) \qquad (114)$$

Notice that the prescription (114) calls for a coefficient that is derived as exactly half $\alpha_m$, the coefficient of $\hat{f}_m$ in the AdaBoost voting function (starting from weights $w_{im}$) prescribed in 4(c) of the AdaBoost algorithm. (And note that as regards sign, the 1/2 is completely irrelevant.)
So to finish the argument that successive choosing of $g_m$'s amounts to AdaBoost creation of a voting function, it remains to consider how the weights $v_{im}$ get updated. For moving from $g_m$ to $g_{m+1}$ one uses weights

$$v_{i(m+1)}=\exp(-y_i g_m(x_i))=\exp\left(-y_i(g_{m-1}(x_i)+\beta_m h_m(x_i))\right)=v_{im}\exp\left(-y_i\beta_m h_m(x_i)\right)
=v_{im}\exp\left(\beta_m\left(2I\left[h_m(x_i)\ne y_i\right]-1\right)\right)=v_{im}\exp\left(2\beta_m I\left[h_m(x_i)\ne y_i\right]\right)\exp(-\beta_m)$$

Since $\exp(-\beta_m)$ is constant across $i$, it is irrelevant to weighting, and since the prescription for $\beta_m$ produces half what AdaBoost prescribes in 4(c) for $\alpha_m$, the weights used in the choice of $\gamma_{m+1}$ and $h_{m+1}(x;\gamma_{m+1})$ are derived exactly according to the AdaBoost algorithm for updating weights.

So, at the end of the day, the $g_m$'s are built from base classifiers to creep towards a function optimizing an empirical version of $\mathrm{E}\exp(-yg(x))$. Since it is easy to see that $g_1$ corresponds to the first AdaBoost step, $g_M$ is then just 1/2 of the AdaBoost voting function. The $g_m$'s generate the same classifier as the AdaBoost algorithm.
15.3 Other Forms of Boosting

One might, in general, think about strategies for sequentially modifying/perturbing a predictor (perhaps based on a modified version of a data set) as a way of creeping towards a good predictor. Here is a general "gradient boosting algorithm" (taken from Izenman and apparently originally due to Friedman and implemented in the R package gbm) aimed in this direction for a general loss function and general class of potential updates for a predictor.
1. Start with $\hat{f}_0(x)=\arg\min_{\hat{y}}\sum_{i=1}^N L(y_i,\hat{y})$.

2. For $m=1,2,\ldots,M$

(a) let $\tilde{y}_{im}=-\left.\frac{\partial}{\partial\hat{y}}L(y_i,\hat{y})\right|_{\hat{y}=\hat{f}_{m-1}(x_i)}$

(b) fit, for some class of functions $h_m(\cdot;\gamma)$, parameters $\beta_m$ and $\gamma_m$ as

$$(\beta_m,\gamma_m)=\arg\min_{(\beta,\gamma)}\sum_{i=1}^N \left(\tilde{y}_{im}-\beta h_m(x_i;\gamma)\right)^2$$

(c) let

$$\rho_m=\arg\min_\rho\sum_{i=1}^N L\left(y_i,\hat{f}_{m-1}(x_i)+\rho\, h_m(x_i;\gamma_m)\right)$$

(d) set

$$\hat{f}_m(x)=\hat{f}_{m-1}(x)+\rho_m h_m(x;\gamma_m)$$

3. and finally, set $\hat{f}(x)=\hat{f}_M(x)$.

In this algorithm, the set of $N$ values $\beta_m h_m(x_i;\gamma_m)$ in 2b. functions as an approximate negative gradient of the total loss at the set of values $\hat{f}_{m-1}(x_i)$, and the least squares criterion at 2b. might be replaced by others (like least absolute deviations). The "minimizations" need to be taken with a grain of salt, as they are typically only "approximate" minimizations (like those associated with greedy algorithms used to fit binary trees, e.g.). The step 2c. is some kind of line search in a direction of steepest descent for the total loss.

Notice that there is nothing here that requires that the functions $h_m(x;\gamma)$ come from classification or regression trees; rather, the development allows for fairly arbitrary base predictors. When the problem is a regression problem with squared error loss, some substantial simplifications of this occur.
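A usage sketch of the gbm package mentioned above (squared error loss corresponds to distribution = "gaussian"; the shrinkage and bag.fraction arguments implement the "shrinkage" and "subsampling" ideas discussed in Section 15.4 below). The simulated data and tuning values are illustrative only.

  ## gradient boosting with the gbm package
  library(gbm)

  set.seed(5)
  n  <- 500
  x1 <- runif(n); x2 <- runif(n)
  y  <- sin(2 * pi * x1) + 2 * (x2 - 0.5)^2 + rnorm(n, sd = 0.2)
  dat <- data.frame(y, x1, x2)

  fit <- gbm(y ~ x1 + x2, data = dat, distribution = "gaussian",
             n.trees = 2000, interaction.depth = 2, shrinkage = 0.05,
             bag.fraction = 0.5, cv.folds = 5)

  best_M <- gbm.perf(fit, method = "cv")     # one way of choosing M
  yhat   <- predict(fit, dat, n.trees = best_M)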
15.3.1 SEL

Suppose for now that $L(y,\hat{y})=\frac{1}{2}(y-\hat{y})^2$. Then, in the gradient boosting algorithm above

1. $\hat{f}_0(x)=\bar{y}$,

2. $\tilde{y}_{im}=-\left.\frac{\partial}{\partial\hat{y}}\frac{1}{2}(y_i-\hat{y})^2\right|_{\hat{y}=\hat{f}_{m-1}(x_i)}=y_i-\hat{f}_{m-1}(x_i)$ so that step 2b. of the algorithm says "fit, using SEL, a model for the residuals from the previous iteration," and

3. the optimization of $\rho$ in step 2c of the algorithm produces $\rho_m=\beta_m$ so that step 2d of the algorithm sets

$$\hat{f}_m(x)=\hat{f}_{m-1}(x)+\beta_m h_m(x;\gamma_m)$$

and ultimately one is left with the prescription "fit, using SEL, a model for the residuals from the previous iteration and add it to the previous predictor to produce the $m$th one," a very intuitively appealing algorithm.

Again, there is nothing here that requires that the $h_m(x;\gamma)$ come from regression trees. This notion of gradient boosting could be applied to any kind of base regression predictor.
15.3.2 AEL (and Binary Regression Trees)

Suppose now that $L(y,\hat{y})=|y-\hat{y}|$. Then, in the gradient boosting algorithm

1. $\hat{f}_0(x)=\mathrm{median}\{y_i\}$, and

2. $\tilde{y}_{im}=-\left.\frac{\partial}{\partial\hat{y}}|y_i-\hat{y}|\right|_{\hat{y}=\hat{f}_{m-1}(x_i)}=\mathrm{sign}\left(y_i-\hat{f}_{m-1}(x_i)\right)$ so that step 2b. of the algorithm says "fit a model for the $\pm1$'s coding the signs of the residuals from the previous iteration."

In the event that the base predictors being fit are regression trees, the $h_m(x_i;\gamma_m)$ fit at step 2b. will be values between $-1$ and $1$ (constant on rectangles) if the fitting is done using least squares (exactly as indicated in 2b.). If the alternative of least absolute error were used in place of least squares in 2b., one would have the median of a set of $-1$'s and $1$'s, and thus values either $-1$ or $1$ constant on rectangles. In this latter case, the optimization for $\rho_m$ is of the form

$$\rho_m=\arg\min_\rho\sum_{i=1}^N \left|y_i-\hat{f}_{m-1}(x_i)-\rho\, h_m(x_i;\gamma_m)\right| \quad\text{(with each } h_m(x_i;\gamma_m)=\pm 1\text{)}$$

and while there is no obvious closed form for the optimizer, it is at least the case that one is minimizing a piecewise linear continuous function of $\rho$.
15.3.3 K-Class Classification

HTF allude to a procedure that uses gradient boosting in fitting a ($K$-dimensional) form

$$f=(f_1,\ldots,f_K)$$

under the constraint that $\sum_{k=1}^K f_k=0$, sets corresponding class probabilities (for $k\in\mathcal{G}=\{1,2,\ldots,K\}$) to

$$p_k(x)=\frac{\exp(f_k(x))}{\sum_{l=1}^K \exp(f_l(x))}$$

and has the implied theoretical form of classifier

$$f(x)=\arg\max_{k\in\mathcal{G}} p_k(x)$$

An appropriate (and convenient) loss is

$$L(y,p)=-\sum_{k=1}^K I[y=k]\ln(p_k)=-\sum_{k=1}^K I[y=k]f_k+\ln\left(\sum_{k=1}^K \exp(f_k)\right)$$

This leads to

$$-\frac{\partial}{\partial f_k}L(y,p)=I[y=k]-\frac{\exp(f_k)}{\sum_{l=1}^K \exp(f_l)}=I[y=k]-p_k$$

and thus

$$\tilde{y}_{ikm}=I[y_i=k]-\hat{p}_{k(m-1)}(x_i)$$

(Algorithm 10.4 on page 387 of HTF provides a version of the gradient boosting algorithm for this loss and $K$-class classification.)
15.4 Some Issues (More or Less) Related to Boosting and Use of Trees as Base Predictors

Here we consider several issues that might come up in the use of boosting, particularly if regression or classification trees are the base predictors.

One issue is the question of how large regression or classification trees should be allowed to grow if they are used as the functions $h_m(x;\gamma)$ in a boosting algorithm. The answer seems to be "Not too large, maybe to about 6 or so terminal nodes."

Section 10.12 of HTF concerns control of complexity parameters/avoiding overfit/regularization in boosting. Where trees are used as the base predictors, limiting their size (as above) provides some regularization. Further, there is the parameter $M$ in the boosting algorithm that can/should be limited in size (very large values surely producing overfit). Cross-validation here looks typically pretty unworkable because of the amount of computing involved. HTF seem to indicate that (for better or worse?) holding back a part of the training sample and watching performance of a predictor on that single test set as $M$ increases is a commonly used method of choosing $M$.

"Shrinkage" is an idea that has been suggested for regularization in boosting. The basic idea is that instead of setting

$$\hat{f}_m(x)=\hat{f}_{m-1}(x)+\rho_m h_m(x;\gamma_m)$$

as in 2d. of the gradient boosting algorithm, one chooses $\nu\in(0,1)$ and sets

$$\hat{f}_m(x)=\hat{f}_{m-1}(x)+\nu\rho_m h_m(x;\gamma_m)$$

i.e. one doesn't make the "full correction" to $\hat{f}_{m-1}$ in producing $\hat{f}_m$. This will typically require larger $M$ for good final performance of a predictor than the non-shrunken version, but apparently can (in combination with larger $M$) improve performance.

"Subsampling" is some kind of bootstrap/bagging idea. The notion is that at each iteration of the boosting algorithm, instead of choosing an update based on the whole training set, one chooses a fraction of the training set at random, and fits to it (using a new random selection at each iteration). This reduces the computation time per iteration and apparently can also improve performance. HTF show its performance on an example in combination with shrinkage.
16 Prototype and Nearest Neighbor Methods of Classification

We saw when looking at "linear" methods of classification in Section 13 that these can reduce to classification to the class with fitted mean "closest" in some appropriate sense to an input vector $x$. A related notion is to represent classes each by several "prototype" vectors of inputs, and to classify to the class with closest prototype. In their Chapter 13, HTF consider such classifiers and related nearest neighbor classifiers.

So consider a $K$-class classification problem (where $y$ takes values in $\mathcal{G}=\{1,2,\ldots,K\}$) and suppose that the coordinates of input $x$ have been standardized according to training means and standard deviations. For each class $k=1,2,\ldots,K$, represent the class by prototypes

$$z_{k1},z_{k2},\ldots,z_{kR}$$

belonging to $\Re^p$ and consider a classifier/predictor of the form

$$f(x)=\arg\min_k \min_l \|x-z_{kl}\|$$
(that is, one classi…es to the class that has a prototype closest to x).
The most obvious question in using such a rule is "How does one choose the prototypes?" One standard (admittedly ad hoc, but not unreasonable) method is to use the so-called "K-means (clustering) algorithm" one class at a time. (The "K" in the name of this algorithm has nothing to do with the number of classes in the present context. In fact, here the "K" naming the clustering algorithm is our present $R$, the number of prototypes used per class. And the point in applying the algorithm is not so much to see exactly how training vectors aggregate into "homogeneous" groups/clusters as it is to find a few vectors to represent them.)

For $T_k=\{x_i \text{ with corresponding } y_i=k\}$ an "$R$-means" algorithm might proceed by

1. randomly selecting $R$ different elements from $T_k$, say

$$z_{k1}^{(1)},z_{k2}^{(1)},\ldots,z_{kR}^{(1)}$$

2. then for $m=2,3,\ldots$ letting

$$z_{kl}^{(m)}=\text{the mean of all } x_i\in T_k \text{ with } \left\|x_i-z_{kl}^{(m-1)}\right\|<\left\|x_i-z_{kl'}^{(m-1)}\right\| \text{ for all } l'\ne l$$

iterating until convergence.
This way of choosing prototypes for class $k$ ignores the "location" of the other classes and the eventual use to which the prototypes will be put. A potential improvement on this is to employ some kind of algorithm (again ad hoc, but reasonable) that moves prototypes in the direction of training input vectors in their own class and away from training input vectors from other classes. One such method is known by the name "LVQ"/"learning vector quantization." This proceeds as follows.

With a set of prototypes (chosen randomly or from an $R$-means algorithm or some other way)

$$z_{kl} \quad k=1,2,\ldots,K \text{ and } l=1,2,\ldots,R$$

in hand, at each iteration $m=1,2,\ldots$ for some sequence of "learning rates" $\{\epsilon_m\}$ with $\epsilon_m\ge 0$ and $\epsilon_m\searrow 0$

1. sample an $x_i$ at random from the training set and find $k,l$ minimizing $\|x_i-z_{kl}\|$

2. if $y_i=k$ (from 1.), update $z_{kl}$ as

$$z_{kl}^{new}=z_{kl}+\epsilon_m(x_i-z_{kl})$$

and if $y_i\ne k$ (from 1.), update $z_{kl}$ as

$$z_{kl}^{new}=z_{kl}-\epsilon_m(x_i-z_{kl})$$

iterating until convergence.
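A minimal base R sketch of these LVQ updates (the learning-rate schedule, the number of prototypes per class, and all names are assumptions of this illustration):

  ## LVQ-style prototype updates
  lvq_train <- function(X, y, R = 3, n_iter = 5000, eps0 = 0.3) {
    classes <- sort(unique(y))
    ## initial prototypes: R randomly selected training cases per class
    protos <- do.call(rbind, lapply(classes, function(k) {
      Xk <- X[y == k, , drop = FALSE]
      Xk[sample(nrow(Xk), R), , drop = FALSE]
    }))
    proto_class <- rep(classes, each = R)
    for (m in 1:n_iter) {
      eps <- eps0 * (1 - m / n_iter)                 # learning rate decreasing toward 0
      i   <- sample(nrow(X), 1)                      # a random training case
      d   <- rowSums((protos - matrix(X[i, ], nrow(protos), ncol(X), byrow = TRUE))^2)
      j   <- which.min(d)                            # closest prototype
      sgn <- if (proto_class[j] == y[i]) 1 else -1   # attract same-class, repel other-class
      protos[j, ] <- protos[j, ] + sgn * eps * (X[i, ] - protos[j, ])
    }
    list(prototypes = protos, proto_class = proto_class)
  }

  lvq_classify <- function(model, x) {
    d <- rowSums((model$prototypes -
                  matrix(x, nrow(model$prototypes), length(x), byrow = TRUE))^2)
    model$proto_class[which.min(d)]                  # classify to class of closest prototype
  }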
As early as Section 1, we considered nearest neighbor methods, mostly for regression-type prediction. Consider here their use in classification problems. As before, define for each $x$ the $l$-neighborhood

$$n_l(x)=\text{the set of } l \text{ inputs } x_i \text{ in the training set closest to } x \text{ in } \Re^p$$

A nearest neighbor method is to classify $x$ to the class with the largest representation in $n_l(x)$ (possibly breaking ties at random). That is, define

$$\hat{f}(x)=\arg\max_k \sum_{x_i\in n_l(x)} I[y_i=k] \qquad (115)$$

$l$ is a complexity parameter that might be chosen by cross-validation. Apparently, properly implemented, this kind of classifier can be highly effective in spite of the curse of dimensionality. This depends upon clever/appropriate/application-specific choice of feature vectors/functions, definition of appropriate "distance" in order to define "closeness" and the neighborhoods, and appropriate local or global dimension reduction. HTF provide several attractive examples (particularly the LandSat example and the handwritten character recognition example) that concern feature selection and specialized distance measures. They then go on to consider more or less general approaches to dimension reduction for classification in high-dimensional Euclidean space.
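$l$-nearest-neighbor classification as in (115) is available in R as class::knn (its k argument is the $l$ above). A usage sketch with standardized inputs and an illustrative train/test split:

  ## nearest neighbor classification with class::knn
  library(class)

  set.seed(6)
  idx   <- sample(nrow(iris), 100)
  train <- scale(iris[idx, 1:4])                          # standardize the inputs
  test  <- scale(iris[-idx, 1:4],
                 center = attr(train, "scaled:center"),
                 scale  = attr(train, "scaled:scale"))
  yhat  <- knn(train, test, cl = iris$Species[idx], k = 5)
  table(yhat, iris$Species[-idx])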
A possibility for "local" dimension reduction is this. At $x\in\Re^p$ one might use regular Euclidean distance to find, say, 50 neighbors of $x$ in the training inputs to use to identify an appropriate local distance to employ in actually defining the neighborhood $n_l(x)$ to be employed in (115). The following is (what I think is a correction of) what HTF write in describing a DANN (discriminant adaptive nearest neighbor) (squared) metric at $x\in\Re^p$. Let

$$D^2(z,x)=(z-x)'Q(z-x)$$

for

$$Q=W^{-1/2}\left(B^*+\epsilon I\right)W^{-1/2}=W^{-1/2}\left(W^{-1/2}BW^{-1/2}+\epsilon I\right)W^{-1/2} \qquad (116)$$

for some $\epsilon>0$, where $W$ is a pooled within-class sample covariance matrix

$$W=\sum_{k=1}^K \hat{\pi}_k W_k=\sum_{k=1}^K \hat{\pi}_k\frac{1}{n_k-1}\sum_{\text{class } k}\left(x_i-\bar{x}_k\right)\left(x_i-\bar{x}_k\right)'$$

($\bar{x}_k$ is the average $x_i$ from class $k$ in the 50 used to create the local metric), $B$ is a weighted between-class covariance matrix of sample means

$$B=\sum_{k=1}^K \hat{\pi}_k\left(\bar{x}_k-\bar{x}\right)\left(\bar{x}_k-\bar{x}\right)'$$

($\bar{x}$ is a (probably weighted) average of the $\bar{x}_k$) and

$$B^*=W^{-1/2}BW^{-1/2}$$

Notice that in (116), the "outside" $W^{-1/2}$ factors "sphere" the $(z-x)$ differences relative to the within-class covariance structure. $B^*$ is then the between-class covariance matrix of sphered sample means. Without the $\epsilon I$, the distance would then discount differences in the directions of the eigenvectors corresponding to small (nearly zero) eigenvalues of $B^*$ (allowing the neighborhood defined in terms of $D$ to be severely elongated in those directions). The effect of adding the $\epsilon I$ term is to limit this elongation to some degree, preventing $x_i$ "too far in terms of Euclidean distance from $x$" from being included in $n_l(x)$.
A global use of the DANN kind of thinking might be to do the following. At each training input vector $x_i\in\Re^p$, one might again use regular Euclidean distance to find, say, 50 neighbors and compute a weighted between-class-mean covariance matrix $B_i$ as above (for that $x_i$). These might be averaged to produce

$$\bar{B}=\frac{1}{N}\sum_{i=1}^N B_i$$

Then for eigenvalues of $\bar{B}$, say $\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_p\ge 0$, with corresponding (unit) eigenvectors $u_1,u_2,\ldots,u_p$, one might do nearest neighbor classification based on the first few features

$$v_j=u_j'x$$

and ordinary Euclidean distance. I'm fairly certain that this is a nearest neighbor version of the reduced rank classification idea first met in the discussion of linear classification in Section 13.
17 Some Methods of Unsupervised Learning

As we said in Section 1, "supervised learning" is basically prediction of $y$ belonging to $\Re$ or some finite index set from a $p$-dimensional $x$ with coordinates each individually in $\Re$ or some finite index set, using training data pairs

$$(x_1,y_1),(x_2,y_2),\ldots,(x_N,y_N)$$

to create an effective prediction rule

$$\hat{y}=\hat{f}(x)$$

This is one kind of discovery and exploitation of structure in the training data. As we also said in Section 1, "unsupervised learning" is discovery and quantification of structure in

$$\underset{N\times p}{X}=\begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_N'\end{pmatrix}$$

without reference to some particular coordinate of a $p$-dimensional $x$ as an object of prediction. There are a number of versions of this problem in Ch 14 of HTF that we will outline here.
17.1 Association Rules/Market Basket Analysis

Suppose for the present that $x$ takes values in

$$\mathcal{X}=\prod_{j=1}^p \mathcal{X}_j$$

where each $\mathcal{X}_j$ is finite,

$$\mathcal{X}_j=\left\{s_{j1},s_{j2},\ldots,s_{jk_j}\right\}$$

For some set of indices

$$\mathcal{J}\subset\{1,2,\ldots,p\}$$

and subsets of the coordinate spaces

$$S_j\subset\mathcal{X}_j \text{ for } j\in\mathcal{J}$$

we suppose that the rectangle

$$R=\left\{x\,|\,x_j\in S_j\ \forall j\in\mathcal{J}\right\}$$

is selected as potentially of interest. (More below on how this might come about.) Apparently, the statement $x\in R$ (or "$x_j\in S_j\ \forall j\in\mathcal{J}$") is called a conjunctive rule.

In application of this formalism to "market basket analysis" it is common to call the collection of indices $j\in\mathcal{J}$ and corresponding sets $S_j$ "item sets" corresponding to the rectangle. (In this application, an $\mathcal{X}_j$ will typically specify a product type and the corresponding $s_{jk}$'s specific brands and/or models of the product type. As we probably want to allow the possibility that a given transaction involves no product of type $j$, there needs to be the understanding that one $s_{jk}$ for a given $j$ stands for "no purchase in this category." Further, since a given coordinate, $j$, of $x$ can take only one value from $\mathcal{X}_j$, it seems that we must somehow restrict attention to transactions that involve at most a single product of each type.) For an item set

$$I=\left\{(j_1,S_{j_1}),(j_2,S_{j_2}),\ldots,(j_r,S_{j_r})\right\}$$

consider a (non-trivial proper) subset $\mathcal{J}_1\subset\mathcal{J}$ and the item sets

$$I_1=\left\{(j,S_j)\,|\,j\in\mathcal{J}_1\right\} \quad\text{and}\quad I_2=I\setminus I_1$$
It is then common to talk about "association rules" of the form

$$I_1\Longrightarrow I_2 \qquad (117)$$

and to consider quantitative measures associated with them. In framework (117), $I_1$ is called the antecedent and $I_2$ is called the consequent in the rule. For an association rule $I_1\Longrightarrow I_2$,

1. the support of the rule (also the support of the item set or rectangle) is

$$\frac{1}{N}\#\left\{x_i\,|\,x_i\in R\right\}=\frac{1}{N}\#\left\{x_i\,|\,x_{ij}\in S_j\ \forall j\in\mathcal{J}\right\}$$

(the relative frequency with which the full item set is seen in the training cases, i.e. that the training vectors fall into $R$),

2. the confidence or predictability of the rule is

$$\frac{\#\left\{x_i\,|\,x_{ij}\in S_j\ \forall j\in\mathcal{J}\right\}}{\#\left\{x_i\,|\,x_{ij}\in S_j\ \forall j\in\mathcal{J}_1\right\}}$$

(the relative frequency with which the full item set $I$ is seen in the training cases that exhibit the smaller item set $I_1$),

3. the "expected confidence" of the rule is (for $\mathcal{J}_2=\mathcal{J}\setminus\mathcal{J}_1$)

$$\frac{1}{N}\#\left\{x_i\,|\,x_{ij}\in S_j\ \forall j\in\mathcal{J}_2\right\}$$

(the relative frequency with which item set $I_2$ is seen in the training cases), and

4. the lift of the rule is

$$\frac{\mathrm{confidence}}{\mathrm{expected\ confidence}}=\frac{N\,\#\left\{x_i\,|\,x_{ij}\in S_j\ \forall j\in\mathcal{J}\right\}}{\#\left\{x_i\,|\,x_{ij}\in S_j\ \forall j\in\mathcal{J}_1\right\}\,\#\left\{x_i\,|\,x_{ij}\in S_j\ \forall j\in\mathcal{J}_2\right\}}$$

(a measure of association).

If one thinks of the cases in the training set as a random sample from some distribution, then roughly, the support of the rule is an estimate of "$P(I)$," the confidence is an estimate of "$P(I_2|I_1)$," the expected confidence is an estimate of "$P(I_2)$," and the lift is an estimate of the ratio "$P(I_1 \text{ and } I_2)/(P(I_1)P(I_2))$." The basic thinking about association rules seems to be that usually (but perhaps not always) one wants rules with large support (so that the estimates can be reasonably expected to be reliable). Further, one then wants large confidence or lift, as these indicate that the corresponding rule will be useful in terms of understanding how the coordinates of $x$ are related in the training data. Apparently, standard practice is to identify a large number of promising item sets and association rules, and make a database of association rules that can be queried in searches like:

"Find all rules in which YYY is the consequent that have confidence over 70% and support more than 1%."
Basic questions that we have to this point not addressed are where one gets appropriate rectangles $R$ (or equivalently, item sets $I$) and how one uses them to produce association rules. In answer to the second of these questions, one might say "consider all $2^r-2$ association rules that can be associated with a given rectangle." But what then are "interesting" rectangles, or how does one find a potentially useful set of rectangles? We proceed to consider two answers to this question.
17.1.1 The "Apriori Algorithm" for the Case of Singleton Sets $S_j$

One standard way of generating a set of rectangles to process into association rules is to use the so-called "apriori algorithm." This produces potentially interesting rectangles where each $S_j$ is a singleton set. That is, for each $j\in\mathcal{J}$, there is $l_j\in\{1,2,\ldots,k_j\}$ so that $S_j=\{s_{jl_j}\}$. For such cases, one ends with item sets of the form

$$I=\left\{\left(j_1,\{s_{j_1 l_{j_1}}\}\right),\left(j_2,\{s_{j_2 l_{j_2}}\}\right),\ldots,\left(j_r,\{s_{j_r l_{j_r}}\}\right)\right\}$$

or more simply

$$I=\left\{\left(j_1,s_{j_1 l_{j_1}}\right),\left(j_2,s_{j_2 l_{j_2}}\right),\ldots,\left(j_r,s_{j_r l_{j_r}}\right)\right\}$$

or even just

$$I=\left\{s_{j_1 l_{j_1}},s_{j_2 l_{j_2}},\ldots,s_{j_r l_{j_r}}\right\}$$
The "apriori algorithm" operates to produce such rectangles as follows.

1. Pass through all $K=\sum_{j=1}^p k_j$ single items

$$s_{11},s_{12},\ldots,s_{1k_1},s_{21},\ldots,s_{2k_2},\ldots,s_{p1},\ldots,s_{pk_p}$$

identifying those $s_{jl}$ that individually have support/prevalence

$$\frac{1}{N}\#\left\{x_i\,|\,x_{ij}=s_{jl}\right\}$$

at least $t$ and place them in the set

$$\mathcal{I}_1^t=\left\{\text{item sets of size 1 with support at least } t\right\}$$

2. For each $s_{j'l'}\in\mathcal{I}_1^t$ check to see which two-element item sets

$$\left\{s_{j'l'},s_{j''l''}\right\} \text{ with } j''\ne j' \text{ and } s_{j''l''}\in\mathcal{I}_1^t$$

have support/prevalence

$$\frac{1}{N}\#\left\{x_i\,|\,x_{ij'}=s_{j'l'} \text{ and } x_{ij''}=s_{j''l''}\right\}$$

at least $t$ and place them in the set

$$\mathcal{I}_2^t=\left\{\text{item sets of size 2 with support at least } t\right\}$$

$\vdots$

m. For each $(m-1)$-entry item set $\left\{s_{j'l'},s_{j''l''},\ldots\right\}\in\mathcal{I}_{m-1}^t$ check to see which $m$-element item sets

$$\left\{s_{j'l'},s_{j''l''},\ldots\right\}\cup\left\{s_{j^*l^*}\right\} \text{ for } j^*\notin\{j',j'',\ldots\} \text{ and } s_{j^*l^*}\in\mathcal{I}_1^t$$

have support/prevalence

$$\frac{1}{N}\#\left\{x_i\,|\,x_{ij'}=s_{j'l'},\ x_{ij''}=s_{j''l''},\ \ldots,\ x_{ij^*}=s_{j^*l^*}\right\}$$

at least $t$ and place them in the set

$$\mathcal{I}_m^t=\left\{\text{item sets of size } m \text{ with support at least } t\right\}$$

This algorithm terminates when at some stage $m$ the set $\mathcal{I}_m^t$ is empty. Then a sensible set of rectangles (to consider for making association rules) is $\mathcal{I}^t=\cup_m\mathcal{I}_m^t$, the set of all item sets with singleton coordinate sets $S_j$ and prevalence in the training data of at least $t$. Apparently for commercial databases of "typical size," unless $t$ is very small it is feasible to use this algorithm to find $\mathcal{I}^t$. It is apparently also possible to use a variant of the apriori algorithm to find all association rules based on rectangles/item sets in $\mathcal{I}^t$ with confidence at least $c$. This then produces a database of association rules that can be queried by a user wishing to identify useful structure in the training data set.
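The apriori algorithm, together with the support/confidence/lift measures of Section 17.1, is implemented in the R package arules. A usage sketch on the Groceries transaction data that ship with that package (the support and confidence thresholds are illustrative only):

  ## mining association rules with arules
  library(arules)
  data(Groceries)

  rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5))

  ## e.g., query the resulting rule database for high-lift rules
  inspect(sort(rules, by = "lift")[1:5])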
17.1.2 Using 2-Class Classification Methods to Create Rectangles

What the apriori algorithm does is clear. What "should" be done is far less clear. The apriori algorithm seems to search for rectangles in $\mathcal{X}$ that are simply described and have large relative frequency content in the training data. But it obviously does not allow for rectangles that have coordinate sets $S_j$ containing more than one element, it's not the only thing that could possibly be done in this direction, it won't find practically important "small-relative-frequency-content" rectangles, and there is nothing in the algorithm that directly points it toward rectangles that will produce association rules with large lift.

HTF offer (without completely fleshing out the idea) another (fairly clever) path to generating a set of rectangles. That involves making use of simulated data and a rectangle-based 2-class classification algorithm like CART or PRIM. (BTW, it's perhaps the present context where association rules are in view that gives PRIM its name.) The idea is that if one augments the training set by assigning the value $y=1$ to all cases in the original data set, and subsequently generates $N$ additional fictitious data cases from some distribution $P$ on $\mathcal{X}$, assigning $y=0$ to those simulated cases, a classification algorithm can be used to search for rectangles that distinguish between the real and simulated data. One will want $N$ to be big enough so that the simulated data can be expected to represent $P$, and in order to achieve an appropriate balance between real and simulated values in the combined data set presented to the classification algorithm, each real data case may need to be weighted with a weight larger than 1. (One, of course, uses those rectangles that the classification algorithm assigns to the decision "$y=1$.")

Two possibilities for $P$ are

1. a distribution of independence for the coordinates of $x$, with the marginal of $x_j$ uniform on $\mathcal{X}_j$, and

2. a distribution of independence for the coordinates of $x$, with the marginal of $x_j$ the relative frequency distribution of $\{x_{ij}\}$ on $\mathcal{X}_j$.

One could expect the first of these to produce rectangles with large relative frequencies in the data set and thus results similar in spirit to those produced by the apriori algorithm (except for restriction to singleton $S_j$'s). The second possibility seems more intriguing. At least on the face of it, it seems like it might provide a more or less automatic way to look for parts of the original training set where one can expect large values of lift for association rules.

It is worth noting that HTF link this method of using classification to search for sets of high disparity between the training set relative frequencies and $P$ probability content to the use of classification methods with simulated data in the problem of multivariate density estimation. This is possible because knowing a set of optimal classifiers is essentially equivalent to knowing the data density. The "find optimal classifiers" problem and the "estimate the density" problem are essentially duals.
17.2 Clustering

Typically (but not always) the object in "clustering" is to find natural groups of rows or columns of

$$\underset{N\times p}{X}=\begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_N'\end{pmatrix}$$

(in some contexts one may want to somehow find homogeneous "blocks" in a properly rearranged $X$). Sometimes all columns of $X$ represent values of continuous variables (so that ordinary arithmetic applied to all its elements is meaningful). But sometimes some columns correspond to ordinal or even categorical variables. In light of all this, we will let $x_i$, $i=1,2,\ldots,r$ stand for "items" to be clustered (that might be rows or columns of $X$) with entries that need not necessarily be continuous variables.
In developing and describing clustering methods, it is often useful to have a dissimilarity measure $d(x,z)$ that (at least for the items to be clustered and perhaps for other possible items) quantifies how "unalike" items are. This measure is usually chosen to satisfy

1. $d(x,z)\ge 0\ \forall x,z$,

2. $d(x,x)=0\ \forall x$, and

3. $d(x,z)=d(z,x)\ \forall x,z$.

It may be chosen to further satisfy

4. $d(x,z)\le d(x,w)+d(z,w)\ \forall x,z,$ and $w$, or

4'. $d(x,z)\le\max\left[d(x,w),d(z,w)\right]\ \forall x,z,$ and $w$.

Where 1-4 hold, $d$ is a "metric." Where 1-3 hold and the stronger condition 4' holds, $d$ is an "ultrametric."

In a case where one is clustering rows of $X$ and each column of $X$ contains values of a continuous variable, a squared Euclidean distance is a natural choice for a dissimilarity measure

$$d(x_i,x_{i'})=\|x_i-x_{i'}\|^2=\sum_{j=1}^p (x_{ij}-x_{i'j})^2$$

In a case where one is clustering columns of $X$ and each column of $X$ contains values of a continuous variable, with $r_{jj'}$ the sample correlation between values in columns $j$ and $j'$, a plausible dissimilarity measure is

$$d(x_j,x_{j'})=1-|r_{jj'}|$$

When dissimilarities between $r$ items are organized into a (non-negative symmetric) $r\times r$ matrix

$$D=(d_{ij})=(d(x_i,x_j))$$

with 0's down its diagonal, the terminology "proximity matrix" is often used. For some clustering algorithms and for some purposes, the proximity matrix encodes all one needs to know about the items to do clustering. One seeks a partition of the index set $\{1,2,\ldots,r\}$ into subsets such that the $d_{ij}$ for indices within a subset are small (and the $d_{ij}$ for indices $i$ and $j$ from different subsets are large).
17.2.1 Partitioning Methods ("Centroid"-Based Methods)

By far the most commonly used clustering methods are based on partitioning related to "centroids," particularly the so-called "K-means" clustering algorithm for the rows of $X$ in cases where the columns contain values of continuous variables $x_j$. The version of the method we will describe here uses squared Euclidean distance to measure dissimilarity. Somewhat fancier versions (like that found in CFZ) are built on Mahalanobis distance.
The algorithm begins with some set of $K$ distinct "centers" $c_1^0,c_2^0,\ldots,c_K^0$. They might, for example, be a random selection of the rows of $X$ (subject to the constraint that they are distinct). One then assigns each $x_i$ to that center $c_{k^0(i)}^0$ minimizing

$$d\left(x_i,c_l^0\right)$$

over choice of $l$ (creating $K$ clusters around the centers) and replaces all of the $c_k^0$ with the corresponding cluster means

$$c_k^1=\frac{1}{\#\text{ of } i \text{ with } k^0(i)=k}\sum_i I\left[k^0(i)=k\right]x_i$$

At stage $m$, with all the $c_k^{m-1}$ available, one then assigns each $x_i$ to that center $c_{k^{m-1}(i)}^{m-1}$ minimizing

$$d\left(x_i,c_l^{m-1}\right)$$

over choice of $l$ (creating $K$ clusters around the centers) and replaces all of the $c_k^{m-1}$ with the corresponding cluster means

$$c_k^m=\frac{1}{\#\text{ of } i \text{ with } k^{m-1}(i)=k}\sum_i I\left[k^{m-1}(i)=k\right]x_i$$

This iteration goes on to convergence. One compares multiple random starts for a given $K$ (and then minimum values found for each $K$) in terms of

$$\text{Error Sum of Squares}(K)=\sum_{k=1}^K\ \sum_{x_i \text{ in cluster } k} d(x_i,c_k)$$

for $c_1,c_2,\ldots,c_K$ the final means produced by the iterations. One may then consider the monotone sequence of Error Sums of Squares and try to identify a value $K$ beyond which there seem to be diminishing returns for increased $K$.
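A usage sketch of the corresponding R function kmeans, using multiple random starts for each K and recording the error sum of squares so that an "elbow" can be looked for (the data set and the range of K are illustrative only):

  ## K-means with multiple random starts and an error-sum-of-squares trace
  set.seed(7)
  X <- scale(iris[, 1:4])                      # items to cluster (rows)

  ess <- sapply(1:8, function(K) {
    kmeans(X, centers = K, nstart = 25)$tot.withinss
  })
  plot(1:8, ess, type = "b", xlab = "K", ylab = "Error Sum of Squares")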
A more general version of this algorithm (that might be termed a K-medoid algorithm) doesn't require that the entries of the $x_i$ be values of continuous variables, but (since it is then unclear that one can even evaluate, let alone find a general minimizer of, $d(x_i,\cdot)$) restricts the "centers" to be original items. This algorithm begins with some set of $K$ distinct "medoids" $c_1^0,c_2^0,\ldots,c_K^0$ that are a random selection from the $r$ items $x_i$ (subject to the constraint that they are distinct). One then assigns each $x_i$ to that medoid $c_{k^0(i)}^0$ minimizing

$$d\left(x_i,c_l^0\right)$$

over choice of $l$ (creating $K$ clusters associated with the medoids) and replaces all of the $c_k^0$ with $c_k^1$, the corresponding minimizers over the $x_{i'}$ belonging to cluster $k$ of the sums

$$\sum_{i \text{ with } k^0(i)=k} d(x_i,x_{i'})$$

At stage $m$, with all the $c_k^{m-1}$ available, one then assigns each $x_i$ to that medoid $c_{k^{m-1}(i)}^{m-1}$ minimizing

$$d\left(x_i,c_l^{m-1}\right)$$

over choice of $l$ (creating $K$ clusters around the medoids) and replaces all of the $c_k^{m-1}$ with $c_k^m$, the corresponding minimizers over the $x_{i'}$ belonging to cluster $k$ of the sums

$$\sum_{i \text{ with } k^{m-1}(i)=k} d(x_i,x_{i'})$$

This iteration goes on to convergence. One compares multiple random starts for a given $K$ (and then minimum values found for each $K$) in terms of

$$\sum_{k=1}^K\ \sum_{x_i \text{ in cluster } k} d(x_i,c_k)$$

for $c_1,c_2,\ldots,c_K$ the final medoids produced by the iterations.
17.2.2 Hierarchical Methods

There are both agglomerative/bottom-up methods and divisive/top-down methods of hierarchical clustering. To apply an agglomerative method, one must first choose a method of using dissimilarities for items to define dissimilarities for clusters. Three common (and somewhat obvious) possibilities in this regard are as follows. For $C_1$ and $C_2$ different elements of a partition of the set of items, or equivalently their $r$ indices, one might define dissimilarity of $C_1$ and $C_2$ as

1. $D(C_1,C_2)=\min\{d_{ij}\,|\,i\in C_1 \text{ and } j\in C_2\}$ (this is the "single linkage" or "nearest neighbor" choice),

2. $D(C_1,C_2)=\max\{d_{ij}\,|\,i\in C_1 \text{ and } j\in C_2\}$ (this is the "complete linkage" choice), or

3. $D(C_1,C_2)=\frac{1}{\#C_1\#C_2}\sum_{i\in C_1,\,j\in C_2} d_{ij}$ (this is the "average linkage" choice).

An agglomerative hierarchical clustering algorithm then operates as follows. One begins with every item $x_i$, $i=1,2,\ldots,r$ functioning as a singleton cluster. Then one finds the minimum $d_{ij}$ for $i\ne j$ and puts the corresponding two items into a single cluster (of size 2). Then when one is at a stage where there are $m$ clusters, one finds the two clusters with minimum dissimilarity and merges them to make a single cluster, leaving $m-1$ clusters overall. This continues until there is only a single cluster. The sequence of $r$ different clusterings (with $r$ through 1 clusters) serves as a menu of potentially interesting solutions to the clustering problem. These are apparently often displayed in the form of a dendrogram, where cutting the dendrogram at a given level picks out one of the (increasingly coarse as the level rises) clusterings. Those items clustered together "deep" in the tree/dendrogram are presumably interpreted to be potentially "more alike" than ones clustered together only at a high level.

A divisive hierarchical algorithm operates as follows. Starting with a single "cluster" consisting of all items, one finds the maximum $d_{ij}$ and uses the two corresponding items as seeds for two clusters. One then assigns each $x_l$ for $l\ne i$ and $l\ne j$ to the cluster represented by $x_i$ if

$$d(x_i,x_l)<d(x_j,x_l)$$

and to the cluster represented by $x_j$ otherwise. When one is at a stage where there are $m$ clusters, one identifies the cluster with largest $d_{ij}$ corresponding to a pair of elements in the cluster, splitting it using the method applied to split the original "single large cluster" (to produce an $(m+1)$-cluster clustering). This, like the agglomerative algorithm, produces a sequence of $r$ different clusterings (with 1 through $r$ clusters) that serves as a menu of potentially interesting solutions to the clustering problem. And like the sequence produced by the agglomerative algorithm, this sequence can be represented using a dendrogram.

One may modify either the agglomerative or divisive algorithms by fixing a threshold $t>0$ for use in deciding whether or not to merge two clusters or to split a cluster. The agglomerative version would terminate when all pairs of existing clusters have dissimilarities more than $t$. The divisive version would terminate when all dissimilarities for pairs of items in all clusters are below $t$. Fairly obviously, employing a threshold has the potential to shorten the menu of clusterings produced by either of the methods to include fewer than $r$ clusterings. (Obviously, thresholding the agglomerative method cuts off the top of the corresponding full dendrogram, and thresholding the divisive method cuts off the bottom of the corresponding full dendrogram.)
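A usage sketch of agglomerative hierarchical clustering in R: dist() builds the proximity matrix, hclust() provides single, complete, and average linkage, and cutree() reads a particular clustering off the dendrogram (the data set and the choice of 4 clusters are illustrative only):

  ## agglomerative hierarchical clustering with hclust
  X <- scale(USArrests)                        # items to cluster (rows)

  D   <- dist(X)                               # Euclidean proximity matrix
  fit <- hclust(D, method = "average")         # or "single", "complete"

  plot(fit)                                    # the dendrogram
  clusters <- cutree(fit, k = 4)               # cut to get a 4-cluster solution
  table(clusters)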
17.2.3 (Mixture) Model-Based Methods

A completely different approach to clustering into $K$ clusters is based on use of mixture models. That is, for purposes of producing a clustering, I might consider acting as if items $x_1,x_2,\ldots,x_r$ are realizations of $r$ iid random variables with parametric marginal density

$$g(x\,|\,\pi,\theta_1,\ldots,\theta_K)=\sum_{k=1}^K \pi_k f(x\,|\,\theta_k) \qquad (118)$$

for probabilities $\pi_k>0$ with $\sum_{k=1}^K \pi_k=1$, a fixed parametric density $f(x\,|\,\theta)$, and parameters $\theta_1,\ldots,\theta_K$. (Without further restrictions the family of mixture distributions specified by density (118) is not identifiable, but we'll ignore that fact for the moment.)

One might think of the ratio

$$\frac{\pi_k f(x\,|\,\theta_k)}{\sum_{l=1}^K \pi_l f(x\,|\,\theta_l)}$$

as a kind of "probability" that $x$ is generated by component $k$ of the mixture and define cluster $k$ to be the set of $x_i$ for which

$$k=\arg\max_l \frac{\pi_l f(x_i\,|\,\theta_l)}{\sum_{k'=1}^K \pi_{k'} f(x_i\,|\,\theta_{k'})}=\arg\max_l\ \pi_l f(x_i\,|\,\theta_l)$$

In practice, $\pi,\theta_1,\ldots,\theta_K$ must be estimated and estimates used in place of parameters in defining clusters. That is, an implementable clustering method is to define cluster $k$ (say, $C_k$) to be

$$C_k=\left\{x_i\,|\,k=\arg\max_l\ \hat{\pi}_l f\left(x_i\,|\,\hat{\theta}_l\right)\right\} \qquad (119)$$

Given the lack of identifiability in the unrestricted mixture model, it might appear that prescription (119) could be problematic. But such is not really the case. While the likelihood

$$L(\pi,\theta_1,\ldots,\theta_K)=\prod_{i=1}^r g(x_i\,|\,\pi,\theta_1,\ldots,\theta_K)$$

will have multiple maxima, using any maximizer for an estimate of the parameter vector will produce the same set of clusters (119). It is common to employ the "EM algorithm" in the maximization of $L(\pi,\theta_1,\ldots,\theta_K)$ (the finding of one of many maximizers) and to include details of that algorithm in expositions of model-based clustering. However, strictly speaking, that algorithm is not intrinsic to the basic notion here, namely the use of the clusters in display (119).
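Normal-mixture model-based clustering of this kind is implemented in the R package mclust; Mclust fits a mixture of the form (118) with multivariate normal components via the EM algorithm and assigns each item to its highest estimated "probability" component as in (119). A usage sketch (the data set and choice of K = 3 are illustrative only):

  ## model-based clustering with mclust
  library(mclust)

  fit <- Mclust(iris[, 1:4], G = 3)            # K = 3 mixture components
  summary(fit)
  table(fit$classification, iris$Species)      # the resulting clusters C_k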
17.2.4 Self-Organizing Maps
The basic motivation here seems to be that for items $x_1, x_2, \ldots, x_r$ belonging to $\Re^p$, the object is to find $L \times M$ cluster centers/prototypes that adequately represent the items, where one wishes to think of those cluster centers/prototypes as indexed on an $L \times M$ regular grid in 2 dimensions (that we might take to be $\{1, 2, \ldots, L\} \times \{1, 2, \ldots, M\}$) with cluster centers/prototypes whose index vectors are close on the grid being close in $\Re^p$. (There could, of course, be 3-dimensional versions of this for $L \times M \times N$ cluster centers/prototypes, and so on.) I believe that the object is both production of the set of centers/prototypes and ?also? assignment of data points to centers/prototypes. In this way, this amounts to some kind of modified/constrained $K = L \times M$ group clustering problem.
One probably will typically want to begin with standardization of the $p$ coordinate variables $x_j$ (so that $\sum_{i=1}^r x_{ij} = 0$ and the sample variance of each set of values $\{x_{ij}\}_{i=1,2,\ldots,r}$ is 1). This puts all of the $x_j$ on the same scale and doesn't allow one coordinate of an $x_i$ to dominate a Euclidean norm. This topic seems to be driven by two somewhat ad hoc algorithms and to lack a well-thought-out (or at least well-articulated) rationale (for both its existence and its methodology). Nevertheless, it seems popular, and we'll describe the algorithms here, saying what we can about their motivation and usefulness.
One begins with some set of initial cluster centers $\left\{ z^0_{lm} \right\}_{l=1,\ldots,L;\; m=1,\ldots,M}$. This might be a random selection (without replacement or the possibility of duplication) from the set of items. It might be a set of grid points in the 2-dimensional plane in $\Re^p$ defined by the first two principal components of the items $\{x_i\}_{i=1,\ldots,r}$. And there are surely other sensible possibilities. Then define neighborhoods on the $L \times M$ grid, $n(l,m)$, that are subsets of the grid "close" in some kind of distance (like regular Euclidean distance) to the various elements of the $L \times M$ grid. $n(l,m)$ could be all of the grid, $(l,m)$ alone, all grid points $(l', m')$ within some constant 2-dimensional Euclidean distance, $c$, of $(l,m)$, etc. Then define a weighting function on $\Re^p$, say $w(\|x\|)$, so that $w(0) = 1$ and $w(\|x\|) \geq 0$ is monotone non-increasing in $\|x\|$. For some schedule of non-increasing positive constants $1 > \alpha_1 \geq \alpha_2 \geq \alpha_3 \geq \cdots$, the SOM algorithms iteratively define sets of cluster centers/prototypes $\left\{ z^j_{lm} \right\}$ for $j = 1, 2, \ldots$.
At iteration $j$, an "online" version of SOM selects (randomly, or perhaps in turn from an initially randomly set ordering of the items) an item $x^j$ and

1. identifies the center/prototype $z^{j-1}_{lm}$ closest to $x^j$ in $\Re^p$, call it $b^j$ and its grid coordinates $(l,m)^j$ (Izenman calls $b^j$ the "BMU" or best-matching unit),

2. adjusts those $z^{j-1}_{lm}$ with index vectors belonging to $n\!\left((l,m)^j\right)$ (close to the BMU index vector on the 2-dimensional grid) toward $x^j$ by the prescription
$$z^j_{lm} = z^{j-1}_{lm} + \alpha_j\, w\!\left(\left\| z^{j-1}_{lm} - b^j \right\|\right) \left( x^j - z^{j-1}_{lm} \right)$$
(adjusting those centers different from the BMU potentially less dramatically than the BMU), and

3. for those $z^{j-1}_{lm}$ with index pairs $(l,m)$ not belonging to $n\!\left((l,m)^j\right)$ sets
$$z^j_{lm} = z^{j-1}_{lm},$$

iterating to convergence.
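A minimal numpy sketch of the online version just described follows, under simplifying assumptions that are not part of the exposition above: a Gaussian weighting function $w$, a fixed neighborhood radius $c$ on the grid, a $1/j$ learning-rate schedule, and random initialization from the items themselves. It is illustrative only, not a tuned implementation.

```python
# Online SOM sketch: L*M prototypes in R^p, updated one item at a time.
# Assumes standardized data rows x_i and r >= L*M for the initialization.
import numpy as np

def online_som(X, L=5, M=5, n_iter=2000, c=1.5, seed=0):
    rng = np.random.default_rng(seed)
    r, p = X.shape
    grid = np.array([(l, m) for l in range(L) for m in range(M)], dtype=float)
    Z = X[rng.choice(r, size=L * M, replace=False)].copy()   # initial prototypes
    for j in range(1, n_iter + 1):
        x = X[rng.integers(r)]
        bmu = np.argmin(np.linalg.norm(Z - x, axis=1))        # best-matching unit
        in_nbhd = np.linalg.norm(grid - grid[bmu], axis=1) <= c
        w = np.exp(-np.linalg.norm(Z - Z[bmu], axis=1) ** 2)  # w(||z - b||), w(0) = 1
        alpha = 1.0 / j                                       # alpha_j schedule
        Z[in_nbhd] += alpha * w[in_nbhd, None] * (x - Z[in_nbhd])
    return Z, grid
```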
At iteration $j$, a "batch" version of SOM updates all centers/prototypes $\left\{ z^{j-1}_{lm} \right\}$ to $\left\{ z^j_{lm} \right\}$ as follows. For each $z^{j-1}_{lm}$, let $X^{j-1}_{lm}$ be the set of items for which the closest element of $\left\{ z^{j-1}_{lm} \right\}$ has index pair $(l,m)$. Then update $z^{j-1}_{lm}$ as some kind of (weighted) average of the elements of $\bigcup_{(l,m)' \in n(l,m)} X^{j-1}_{(l,m)'}$ (the set of $x_i$ closest to prototypes with labels that are 2-dimensional grid neighbors of $(l,m)$). A natural form of this is to set (with $\bar{x}^{j-1}_{(l,m)}$ the obvious sample mean of the elements of $X^{j-1}_{lm}$)
$$z^j_{lm} = \frac{\sum_{(l,m)' \in n(l,m)} w\!\left(\left\| z^{j-1}_{(l,m)'} - z^{j-1}_{lm} \right\|\right) \bar{x}^{j-1}_{(l,m)'}}{\sum_{(l,m)' \in n(l,m)} w\!\left(\left\| z^{j-1}_{(l,m)'} - z^{j-1}_{lm} \right\|\right)}$$
It is fairly obvious that even if these algorithms converge, different starting sets $\left\{ z^0_{lm} \right\}$ will produce different limits (symmetries alone mean, for example, that the choices $z^0_{lm} = u_{lm}$ and $z^0_{lm} = u_{L-l,\,M-m}$ produce what might look like different limits, but are really completely equivalent). Beyond this, what is provided by the 2-dimensional layout of indices of prototypes is not immediately obvious. It seems to be fairly common to compare an error sum of squares for a SOM to that of a $K = L \times M$ means clustering and to declare victory if the SOM sum is not much worse than the $K$-means value.
17.2.5 Spectral Clustering
This material is a bit hard to place in an outline like this one. It is about clustering, but requires the use of principal components ideas, and has emphases in common with the local version of multi-dimensional scaling treated in Section 17.3. The basic motivation is to do clustering in a way that doesn't necessarily look for "convex" clusters of points in $p$-space, but rather for "roughly connected"/"contiguous" sets of points of any shape in $p$-space.

Consider then $N$ points $x_1, x_2, \ldots, x_N$ in $\Re^p$. We consider weights $w_{ij} = w(\|x_i - x_j\|)$ for a decreasing function $w: [0, \infty) \to [0,1]$ and use them to define similarities/adjacencies $s_{ij}$. (For example, we might use $w(d) = \exp\!\left(-d^2/c\right)$ for some $c > 0$.) Similarities can be exactly $s_{ij} = w_{ij}$, but can be even more "locally" defined as follows. For fixed $K$ consider the symmetric set of index pairs
$$N_K = \left\{ (i,j) \;:\; \text{the number of } j' \text{ with } w_{ij'} > w_{ij} \text{ is less than } K, \text{ or the number of } i' \text{ with } w_{i'j} > w_{ij} \text{ is less than } K \right\}$$
(an index pair is in the set if one of the items is in the $K$-nearest neighbor neighborhood of the other). One might then define $s_{ij} = w_{ij}\, I[(i,j) \in N_K]$.
We'll call the matrix
$$S = (s_{ij})_{\substack{i=1,\ldots,N \\ j=1,\ldots,N}}$$
the adjacency matrix, and use the notation
$$g_i = \sum_{j=1}^N s_{ij} \quad \text{and} \quad G = \mathrm{diag}(g_1, g_2, \ldots, g_N)$$
It is common to think of the points $x_1, x_2, \ldots, x_N$ in $\Re^p$ as nodes on a graph, with edges between nodes weighted by the similarities $s_{ij}$, and the $g_i$ the so-called node degrees, i.e. sums of weights of the edges connected to node $i$. The matrix
$$L = G - S$$
is called the (unnormalized) graph Laplacian, and one standardized (with respect to the node degrees) version of this is
$$\tilde{L} = I - G^{-1} S \qquad (120)$$
Note that for any vector $u$,
$$\begin{aligned}
u'Lu &= \sum_{i=1}^N g_i u_i^2 - \sum_{i=1}^N \sum_{j=1}^N u_i u_j s_{ij} \\
&= \frac{1}{2}\left( \sum_{i=1}^N \sum_{j=1}^N s_{ij} u_i^2 + \sum_{j=1}^N \sum_{i=1}^N s_{ij} u_j^2 \right) - \sum_{i=1}^N \sum_{j=1}^N u_i u_j s_{ij} \\
&= \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N s_{ij} \left( u_i - u_j \right)^2
\end{aligned}$$
so that the $N \times N$ symmetric $L$ is nonnegative definite. We then consider the spectral/eigen decomposition of $L$ and focus on the small eigenvalues. For $v_1, \ldots, v_m$ eigenvectors corresponding to the 2nd through $(m+1)$st smallest eigenvalues (since $L\mathbf{1} = \mathbf{0}$ the smallest is an uninteresting 0 eigenvalue), let
$$V = (v_1, \ldots, v_m)$$
and cluster the rows of $V$ as a way of clustering the $N$ cases. (As we noted in the discussion in Section 2.3.1, small eigenvalues are associated with linear combinations of columns of $L$ that are close to $\mathbf{0}$.)
Why should this work? For $v_l$ a column of $V$ (that is, an eigenvector of $L$ corresponding to a small eigenvalue),
$$v_l' L v_l = \sum_{i=1}^N \sum_{j=1}^N s_{ij} \left( v_{li} - v_{lj} \right)^2 \approx 0$$
and points $x_i$ and $x_j$ with large adjacencies must have similar corresponding coordinates of the eigenvectors. HTF (at the bottom of their page 545) essentially argue that the number of "0 or nearly 0" eigenvalues of $L$ is indicative of the number of clusters in the original $N$ data vectors. Notice that a series of points could be (in sequence) close to successive elements of the sequence but have very small adjacencies for points separated in the sequence. "Clusters" by this methodology need NOT be "clumps" of points, but could also be serpentine "chains" of points in $\Re^p$.
It is easy to see that
$$P \equiv G^{-1} S$$
is a stochastic matrix (thus specifying an $N$-state stationary Markov Chain). One can expect spectral clustering with the normalized graph Laplacian (120) to yield groups of states such that transition by such a chain between the groups is relatively infrequent (the MC typically moves within the groups).
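A minimal numpy sketch of the procedure just described follows, under the assumptions that the Gaussian weight $w(d) = \exp(-d^2/c)$ is used with $s_{ij} = w_{ij}$ (no $K$-nearest-neighbor localization) and that the rows of $V$ are clustered with $K$-means; the particular function names and defaults are illustrative.

```python
# Spectral clustering sketch: Laplacian L = G - S, eigenvectors for the
# smallest (non-trivial) eigenvalues, then K-means on the rows of V.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_clusters(X, n_clusters=2, m=2, c=1.0):
    S = np.exp(-cdist(X, X, "sqeuclidean") / c)   # s_ij = w(||x_i - x_j||)
    G = np.diag(S.sum(axis=1))                    # node degrees g_i
    L = G - S                                     # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)          # eigenvalues in increasing order
    V = eigvecs[:, 1:m + 1]                       # skip the trivial 0 eigenvalue
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(V)
```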
17.3 Multi-Dimensional Scaling
This material begins (as in Section 17.2) with dissimilarities among $N$ items, $d_{ij}$, that might be collected in an $N \times N$ proximity matrix $D = (d_{ij})$. (These might, but do not necessarily, come from Euclidean distances among $N$ data vectors $x_1, x_2, \ldots, x_N$ in $\Re^p$.) The object of multi-dimensional scaling is to (to the extent possible) represent the $N$ items as points $z_1, z_2, \ldots, z_N$ in $\Re^k$ with
$$\|z_i - z_j\| \approx d_{ij}$$
This is phrased precisely in terms of one of several optimization problems, where one seeks to minimize a "stress function" $S(z_1, z_2, \ldots, z_N)$.
The least squares (or Kruskal-Shepard) stress function (optimization criterion) is
$$S_{LS}(z_1, z_2, \ldots, z_N) = \sum_{i<j} \left( d_{ij} - \|z_i - z_j\| \right)^2$$
This criterion treats errors in reproducing big dissimilarities exactly like it treats errors in reproducing small ones. A different point of view would make faithfulness to small dissimilarities more important than the exact reproduction of big ones. The so-called Sammon mapping criterion
$$S_{SM}(z_1, z_2, \ldots, z_N) = \sum_{i<j} \frac{\left( d_{ij} - \|z_i - z_j\| \right)^2}{d_{ij}}$$
reflects this point of view.
Another approach to MDS that emphasizes the importance of small dissimilarities is discussed in HTF under the name of "local multi-dimensional scaling." Here one begins for fixed $K$ with the symmetric set of index pairs
$$N_K = \left\{ (i,j) \;:\; \text{the number of } j' \text{ with } d_{ij'} < d_{ij} \text{ is less than } K, \text{ or the number of } i' \text{ with } d_{i'j} < d_{ij} \text{ is less than } K \right\}$$
(an index pair is in the set if one of the items is in the $K$-nearest neighbor neighborhood of the other). Then a stress function that emphasizes the matching of small dissimilarities and not large ones is (for some choice of $\lambda > 0$)
$$S_L(z_1, z_2, \ldots, z_N) = \sum_{i<j \text{ and } (i,j) \in N_K} \left( d_{ij} - \|z_i - z_j\| \right)^2 - \lambda \sum_{i<j \text{ and } (i,j) \notin N_K} \|z_i - z_j\|$$
Another version of MDS begins with similarities $s_{ij}$ (rather than with dissimilarities $d_{ij}$). (One important special case of similarities derives from vectors $x_1, x_2, \ldots, x_N$ in $\Re^p$ through centered inner products $s_{ij} = \langle x_i - \bar{x}, x_j - \bar{x} \rangle$.) A "classical scaling" criterion is
$$S_C(z_1, z_2, \ldots, z_N) = \sum_{i<j} \left( s_{ij} - \langle z_i - \bar{z}, z_j - \bar{z} \rangle \right)^2$$
HTF claim that if in fact similarities are centered inner products, classical scaling is exactly equivalent to principal components analysis.
The four scaling criteria above are all "metric" scaling criteria in that the distances $\|z_i - z_j\|$ are meant to approximate the $d_{ij}$ directly. An alternative is to attempt minimization of a non-metric stress function like
$$S_{NM}(z_1, z_2, \ldots, z_N) = \frac{\sum_{i<j} \left( \theta(d_{ij}) - \|z_i - z_j\| \right)^2}{\sum_{i<j} \|z_i - z_j\|^2}$$
over vectors $z_1, z_2, \ldots, z_N$ and increasing functions $\theta(\cdot)$. $\theta(\cdot)$ will preserve/enforce the natural ordering of dissimilarities without attaching importance to their precise values. Iterative algorithms for optimization of this stress function alternate between isotonic regression to choose $\theta(\cdot)$ and gradient descent to choose the $z_i$.

In general, if one can produce a small value of stress in MDS, one has discovered a $k$-dimensional representation of the $N$ items, and for small $k$, this is a form of "simple structure."
17.4 More on Principal Components and Related Ideas
Here we extend the principal components ideas first raised in Section 2.3 based on the SVD ideas of Section 2.2, still with the motivation of using them as a means for identifying simple structure in an $N$-case $p$-variable data set, where, as earlier, we write
$$\underset{N \times p}{X} = \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_N' \end{pmatrix}$$
17.4.1 "Sparse" Principal Components
In standard principal components analysis, the $v_j$ are sometimes called "loadings" because (in light of the fact that $z_j = X v_j$) they specify what linear combinations of variables $x_j$ are used in making the various principal component vectors. If the $v_j$ were "sparse" (had lots of 0's in them) interpretation of these loadings would be easier. So people have made proposals of alternative methods of defining "principal components" that will tend to produce sparse results. One due to Zou is as follows.

One might call a $v \in \Re^p$ a first sparse principal component direction if it is part of a minimizer (over choices of $v \in \Re^p$ and $\theta \in \Re^p$ with $\|\theta\| = 1$) of the criterion
$$\sum_{i=1}^N \left\| x_i - \theta\, x_i' v \right\|^2 + \lambda \|v\|^2 + \lambda_1 \|v\|_1 \qquad (121)$$
for $\|\cdot\|_1$ the $l_1$ norm on $\Re^p$ and constants $\lambda \geq 0$ and $\lambda_1 \geq 0$. The last term in this expression is analogous to the lasso penalty on a vector of regression coefficients as considered in Section 3.1.2, and produces the same kind of tendency to "0 out" entries that we saw in that context. If $\lambda_1 = 0$ the first sparse principal component direction, $v$, is proportional to the ordinary first principal component direction. In fact, if $\lambda = \lambda_1 = 0$ and $N > p$, $v =$ the ordinary first principal component direction is the optimizer.
For multiple components, an analogue of the first case is a set of $K$ vectors $v_k \in \Re^p$ organized into a $p \times K$ matrix $V$ that is part of a minimizer (over choices of $p \times K$ matrices $V$ and $p \times K$ matrices $\Theta$ with $\Theta'\Theta = I$) of the criterion
$$\sum_{i=1}^N \left\| x_i - \Theta V' x_i \right\|^2 + \lambda \sum_{k=1}^K \|v_k\|^2 + \sum_{k=1}^K \lambda_{1k} \|v_k\|_1 \qquad (122)$$
for constants $\lambda \geq 0$ and $\lambda_{1k} \geq 0$. Zou has apparently provided effective algorithms for optimizing criteria (121) or (122).
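For a quick practical stand-in, scikit-learn's SparsePCA fits sparse loadings by an $l_1$-penalized (dictionary-learning style) criterion that is related to, but not identical to, criteria (121)-(122); the sketch below just shows the call and inspects which loadings are zeroed, with arbitrary toy data.

```python
# Sparse loadings via scikit-learn's SparsePCA (a related l1-penalized
# formulation, not literally criterion (122)); alpha plays the role of
# the l1 weight lambda_1.
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
X -= X.mean(axis=0)                             # work with centered columns

spca = SparsePCA(n_components=3, alpha=1.0, random_state=0).fit(X)
print(spca.components_)                         # rows are (transposed) sparse loadings
print((spca.components_ == 0).mean())           # fraction of exactly-zero entries
print(PCA(n_components=3).fit(X).components_)   # ordinary loadings, for contrast
```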
17.4.2 Non-negative Matrix Factorization
There are contexts (for example, when data are counts) where it may not make intuitive sense to center inherently non-negative variables, so that $X$ is naturally non-negative, and one might want to find non-negative matrices $W$ and $H$ such that
$$\underset{N \times p}{X} \approx \underset{N \times r}{W}\; \underset{r \times p}{H}$$
Here the emphasis might be on the columns of $W$ as representing "positive components" of the (positive) $X$, just as the columns of the matrix $UD$ in SVD's provide the principal components of $X$. Various optimization criteria could be set to guide the choice of $W$ and $H$. One might try to minimize
$$\sum_{i=1}^N \sum_{j=1}^p \left( x_{ij} - (WH)_{ij} \right)^2$$
or maximize
$$\sum_{i=1}^N \sum_{j=1}^p \left( x_{ij} \ln (WH)_{ij} - (WH)_{ij} \right)$$
over non-negative choices of $W$ and $H$, and various algorithms for doing these have been proposed. (Notice that the second of these criteria is an extension of a loglikelihood for independent Poisson variables with means the entries of $WH$ to cases where the $x_{ij}$ need only be non-negative, not necessarily integer.)
While at first blush this enterprise seems sensible, there is a lack of uniqueness in a factorization producing a product $WH$, and therefore how to interpret the columns of one of the many possible $W$'s is not clear. (An easy way to see the lack of uniqueness is this. Suppose that all entries of the product $WH$ are positive. Then for $E$ a small enough (but not $0$) matrix, all entries of $W^* \equiv W(I+E) \neq W$ and $H^* \equiv (I+E)^{-1} H \neq H$ are positive, and $W^* H^* = WH$.) Lacking some natural further restriction on the factors $W$ and $H$ (beyond non-negativity), it seems the practical usefulness of this basic idea is also lacking.
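A minimal illustration using scikit-learn's NMF, which minimizes the squared-error criterion above (a Kullback-Leibler/Poisson-type loss is also available via beta_loss='kullback-leibler' with solver='mu'); the count matrix here is synthetic.

```python
# Non-negative matrix factorization sketch: X (N x p, non-negative) ~ W H
# with W (N x r) and H (r x p) both non-negative.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(3)
X = rng.poisson(lam=3.0, size=(60, 12)).astype(float)   # toy count data

r = 4
model = NMF(n_components=r, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(X)            # N x r non-negative "positive components"
H = model.components_                 # r x p non-negative
print(np.linalg.norm(X - W @ H))      # reconstruction quality
```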
17.4.3 Archetypal Analysis
Another approach to finding an interpretable factorization of $X$ was provided by Cutler and Breiman in their "archetypal analysis." Again one means to write
$$\underset{N \times p}{X} \approx \underset{N \times r}{W}\; \underset{r \times p}{H}$$
for appropriate $W$ and $H$. But here two restrictions are imposed, namely that

1. the rows of $W$ are probability vectors (so that the approximation to $X$ is in terms of convex combinations/weighted averages of the rows of $H$), and

2. $\underset{r \times p}{H} = \underset{r \times N}{B}\; \underset{N \times p}{X}$ where the rows of $B$ are probability vectors (so that the rows of $H$ are in turn convex combinations/weighted averages of the rows of $X$).

The $r$ rows of $H = BX$ are the "prototypes" (?archetypes?) used to represent the data matrix $X$.
With this notation and restrictions, (stochastic matrices) $W$ and $B$ are chosen to minimize
$$\left\| X - WBX \right\|^2$$
It's clearly possible to rearrange the rows of a minimizing $B$ and make corresponding changes in $W$ without changing $\|X - WBX\|^2$. So strictly speaking, the optimization problem has multiple solutions. But in terms of the set of rows of $H$ (a set of prototypes of size $r$) it's possible that this optimization problem often has a unique solution. (Symmetries induced in the set of $N$ rows of $X$ can be used to produce examples where it's clear that genuinely different sets of prototypes produce the same minimal value of $\|X - WBX\|^2$. But it seems likely that real data sets will usually lack such symmetries and lead to a single optimizing set of prototypes.)
Emphasis in this version of the "approximate X" problem is on the set of
prototypes as "representative data cases." This has to be taken with a grain of
salt, since they are nearly always near the "edges" of the data set. This should
be no surprise, as line segments between extreme cases in $\Re^p$ can be made to
run close to cases in the "middle" of the data set, while line segments between
interior cases in the data set will never be made to run close to extreme cases.
17.4.4 Independent Component Analysis
We begin by supposing (as before) that $X$ has been centered. For simplicity, suppose also that it is of full rank (rank $p$). With the SVD as before,
$$\underset{N \times p}{X} = \underset{N \times p}{U}\; \underset{p \times p}{D}\; \underset{p \times p}{V'}$$
we consider the "sphered" version of the data matrix
$$X^* = \sqrt{N}\, X V D^{-1}$$
so that the sample covariance matrix of the data is
$$\frac{1}{N} X^{*\prime} X^* = I$$
Note that the columns of $X^*$ are then scaled principal components of the (centered) data matrix and we operate with and on $X^*$. (For simplicity of notation, we'll henceforth drop the "$*$" on $X$.) This methodology seems to be an attempt to find latent probabilistic structure in terms of independent variables to account for the principal components.
In particular, in its linear form, ICA attempts to model the $N$ (transposed) rows of $X$ as iid of the form
$$\underset{p \times 1}{x_i} = \underset{p \times p}{A}\; \underset{p \times 1}{s_i} \qquad (123)$$
for iid vectors $s_i$, where the (marginal) distribution of the vectors $s_i$ is one of independence of the $p$ coordinates/components and the matrix $A$ is an unknown parameter. Consistent with our sphering of the data matrix, we'll assume that $\mathrm{Cov}\,x = I$ and without any loss of generality assume that the covariance matrix for $s$ is not only diagonal, but that $\mathrm{Cov}\,s = I$. Since then $I = \mathrm{Cov}\,x = A\,(\mathrm{Cov}\,s)\,A' = AA'$, $A$ must be orthogonal, and so
$$A'x = s$$
Obviously, if one can estimate $A$ with an orthogonal $\hat{A}$, then $\hat{s}_i \equiv \hat{A}' x_i$ serves as an estimate of what vector of independent components led to the $i$th row of $X$, and indeed
$$\hat{S} \equiv X \hat{A}$$
has columns that provide predictions of the $N$ (row) $p$-vectors $s_i'$, and we might thus call those the "independent components" of $X$ (just as we term the columns of $XV$ the principal components of $X$). There is a bit of arbitrariness in the representation (123) because the ordering of the coordinates of $s$ and the corresponding rows of $A$ is arbitrary. But this is no serious concern.
So then, the question is what one might use as a method to estimate $A$ in (123). There are several possibilities. The one discussed in HTF is related to entropy and Kullback-Leibler distance. If one assumes that a ($p$-dimensional) random vector $Y$ has a density $f$, with marginal densities $f_1, f_2, \ldots, f_p$, then an "independence version" of the distribution of $Y$ has density $\prod_{j=1}^p f_j$ and the K-L divergence of the distribution of $Y$ from its independence version is
$$\begin{aligned}
K\!\left( f, \prod_{j=1}^p f_j \right) &= \int f(y) \ln\!\left( \frac{f(y)}{\prod_{j=1}^p f_j(y_j)} \right) dy \\
&= \int f(y) \ln f(y)\, dy - \sum_{j=1}^p \int f(y) \ln f_j(y_j)\, dy \\
&= \int f(y) \ln f(y)\, dy - \sum_{j=1}^p \int f_j(y_j) \ln f_j(y_j)\, dy_j \\
&= \sum_{j=1}^p H(Y_j) - H(Y)
\end{aligned}$$
for $H$ the entropy function for a random argument, and one might think of this K-L divergence as the difference between the sum across the components of their individual information contents and the information carried by $Y$ (jointly).
If one then thinks of $s$ as random and of the form $A'x$ for random $x$, it is perhaps sensible to seek an orthogonal $A$ to minimize (for $a_j$ the $j$th column of $A$)
$$\begin{aligned}
\sum_{j=1}^p H(s_j) - H(s) &= \sum_{j=1}^p H\!\left(a_j' x\right) - H\!\left(A' x\right) \\
&= \sum_{j=1}^p H\!\left(a_j' x\right) - H(x) - \ln\left|\det A\right| \\
&= \sum_{j=1}^p H\!\left(a_j' x\right) - H(x)
\end{aligned}$$
(the last equality holding since $|\det A| = 1$ for orthogonal $A$).
This is equivalent, as it turns out (for orthogonal $A$), to maximization of
$$C(A) = \sum_{j=1}^p \left[ H(z) - H\!\left(a_j' x\right) \right] \qquad (124)$$
for $z$ standard normal. Then a common approximation is apparently
$$H(z) - H\!\left(a_j' x\right) \approx \left[ E\,G(z) - E\,G\!\left(a_j' x\right) \right]^2$$
for $G(u) \equiv \frac{1}{c} \ln \cosh(cu)$ for a $c \in [1,2]$. Then criterion (124) has the empirical approximation
$$\hat{C}(A) = \sum_{j=1}^p \left[ E\,G(z) - \frac{1}{N} \sum_{i=1}^N G\!\left(a_j' x_i\right) \right]^2$$
where, as usual, $x_i'$ is the $i$th row of $X$. $\hat{A}$ can be taken to be an optimizer of $\hat{C}(A)$.
Ultimately, this development produces a rotation matrix that makes the $p$ entries of rotated and scaled principal component score vectors "look as independent as possible." This is thought of as resolution of a data matrix into its "independent sources" and as a technique for "blind source separation."
17.4.5 Principal Curves and Surfaces
In the context of Section 2.3, the line in $\Re^p$ defined by $\{cv_1 \mid c \in \Re\}$ serves as a best straight-line representative of the data set. Similarly, the "plane" in $\Re^p$ defined by $\{c_1 v_1 + c_2 v_2 \mid c_1 \in \Re,\; c_2 \in \Re\}$ serves as a best "planar" representative of the data set. The notions of "principal curve" and "principal surface" are attempts to generalize these ideas to 1-dimensional and 2-dimensional structures in $\Re^p$ that may not be "straight" or "flat" but rather have some curvature.

A parametric curve in $\Re^p$ is represented by a vector-valued function
$$f(t) = \begin{pmatrix} f_1(t) \\ f_2(t) \\ \vdots \\ f_p(t) \end{pmatrix}$$
defined on some interval $[0,T]$, where we assume that the coordinate functions $f_j(t)$ are smooth. With
$$f'(t) = \begin{pmatrix} f_1'(t) \\ f_2'(t) \\ \vdots \\ f_p'(t) \end{pmatrix}$$
the "velocity vector" for the curve, f 0 (t) is then the "speed" for the curve
and the arc length (distance) along f (t) from t = 0 to t = t0 is
0
Lf (t ) =
Z
t0
f 0 (t) dt
0
In order to set an unambiguous representation of a curve, it will be useful to
assume that it is parameterized so that it has unit speed, i.e. that Lf (t0 ) = t0
1
for all t0 2 [0; T ]. Notice that if it does not, for (Lf ) the inverse of the arc
length function, the parametric curve
g ( ) = f (Lf )
1
( )
for
2 [0; Lf (T )]
(125)
does have unit speed and traces out the same set of points in <p that are traced
out by f (t). So there is no loss of generality in assuming that parametric
curves we consider here are parameterized by arc length, and we’ll henceforth
write f ( ).
Then, for a unit-speed parametric curve $f(\lambda)$ and point $x \in \Re^p$, we'll define the projection index
$$\lambda_f(x) = \sup\left\{ \lambda \;:\; \|x - f(\lambda)\| = \inf_{\tau} \|x - f(\tau)\| \right\} \qquad (126)$$
This is roughly the last arc length for which the distance from $x$ to the curve is minimum. If one thinks of $x$ as random, the "reconstruction error"
$$E \left\| x - f\!\left( \lambda_f(x) \right) \right\|^2$$
(the expected squared distance between $x$ and the curve) might be thought of as a measure of how well the curve represents the distribution. Of course, for a data set containing $N$ cases $x_i$, an empirical analog of this is
$$\frac{1}{N} \sum_{i=1}^N \left\| x_i - f\!\left( \lambda_f(x_i) \right) \right\|^2 \qquad (127)$$
and a "good" curve representing the data set should have a small value of this empirical reconstruction error. Notice, however, that this can't be the only consideration. If it were, there would surely be no real difficulty in running a very wiggly (and perhaps very long) curve through every element of a data set to produce a curve with 0 empirical reconstruction error. This suggests that with
$$f''(\lambda) = \begin{pmatrix} f_1''(\lambda) \\ f_2''(\lambda) \\ \vdots \\ f_p''(\lambda) \end{pmatrix}$$
the curve's "acceleration vector," there must be some kind of control exercised on the curvature, $\|f''(\lambda)\|$, in the search for a good curve. We'll note below where this control is implicitly applied in standard algorithms for producing principal curves for a data set.

Returning for a moment to the case where we think of $x$ as random, we'll say that $f(\lambda)$ is a principal curve for the distribution of $x$ if it satisfies a so-called self-consistency property, namely that
$$f(\lambda) = E\left[ x \mid \lambda_f(x) = \lambda \right] \qquad (128)$$
This provides motivation for an iterative 2-step algorithm to produce a "principal curve" for a data set.
This provides motivation for an iterative 2-step algorithm to produce a "principal curve" for a data set.
Begin iteration with f 0 ( ) the ordinary …rst principal component line for
the data set. Speci…cally, do something like the following. For some choice
T > 2 max jhxi ; v 1 ij begin with
i=1;:::;N
T
2
f0 ( ) =
138
v1
to create a unit-speed curve that extends past the data set in both directions
along the …rst principal component direction in <p . Then project the xi onto
the line to get N values
1
i
=
(xi ) = hxi ; v 1 i +
f0
T
2
and in light of the criterion (128) more or less average xi with corresponding
to get f 1 ( ). A speci…c possible version of this is to consider,
f 0 (xi ) near
for each coordinate, j, the N pairs
1
i ; xij
and to take as fj1 ( ) a function on [0; T ] that is a 1-dimensional cubic smoothing
spline. One may then assemble these into a vector function to create f 1 ( ).
NOTICE that implicit in this prescription is control over the second derivatives of the component functions through the sti¤ness parameter/weight in the
smoothing spline optimization. Notice also that for this prescription, the unitspeed property of f 0 ( ) will not carry over to f 1 ( ) and it seems that one must
usehthe idea (125) to iassure that f 1 ( ) is parameterized in terms of arc length
RT
on 0; 0 f 1 ( ) d .
With iterate $f^{m-1}(\lambda)$ in hand, one projects the $x_i$ onto the curve to get $N$ values
$$\lambda_i^m = \lambda_{f^{m-1}}(x_i)$$
and for each coordinate $j$ considers the $N$ pairs
$$\left\{ \left( \lambda_i^m, x_{ij} \right) \right\}$$
Fitting a 1-dimensional cubic smoothing spline to these produces $f_j^m(\lambda)$, and these functions are assembled into a vector to create $f^m(\lambda)$ (which may need some adjustment via (125) to assure that the curve is parameterized in terms of arc length). One iterates until the empirical reconstruction error (127) converges, and one takes the corresponding $f^m(\lambda)$ to be a principal curve. Notice that which stiffness parameter is applied in the smoothing steps will govern what one gets for such a curve.
There have been efforts to extend principal curves technology to the creation of 2-dimensional principal surfaces. Parts of the extension are more or less clear. A parametric surface in $\Re^p$ is represented by a vector-valued function
$$f(t) = \begin{pmatrix} f_1(t) \\ f_2(t) \\ \vdots \\ f_p(t) \end{pmatrix}$$
for $t \in S \subset \Re^2$. A projection index parallel to (126) and a self-consistency property parallel to (128) can clearly be defined for the surface case, and thin plate splines can replace 1-dimensional cubic smoothing splines for producing iterates of coordinate functions. But ideas of unit speed don't have obvious translations to $\Re^2$ and methods here seem fundamentally more complicated than what is required for the 1-dimensional case.
17.5 Density Estimation
One might phrase the problem of describing structure for $x$ in terms of estimating a pdf for the variable. We briefly consider the problem of "kernel density estimation." This is the problem:

given $x_1, x_2, \ldots, x_N$ iid with (unknown) pdf $f$, how to estimate $f$?

For the time being, suppose that $p = 1$. For $g(\cdot)$ some fixed pdf (like, for example, the standard normal pdf), invent a location-scale family of densities on $\Re$ by defining (for $\sigma > 0$)
$$h(x \mid \mu, \sigma) = \frac{1}{\sigma}\, g\!\left( \frac{x - \mu}{\sigma} \right)$$
One may, if one then wishes, think of a corresponding "kernel"
$$K_\sigma(x, \mu) \equiv h(x \mid \mu, \sigma)$$
The Parzen estimate of $f(x_0)$ is then
$$\hat{f}(x_0) = \frac{1}{N} \sum_{i=1}^N h(x_0 \mid x_i, \sigma) = \frac{1}{N} \sum_{i=1}^N K_\sigma(x_0, x_i)$$
an average of kernel values.
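A minimal sketch of this Parzen estimate with a standard normal $g$ and a hand-picked bandwidth sigma (scipy's gaussian_kde would do the same job with an automatic bandwidth choice); the two-component toy data are arbitrary.

```python
# Parzen / kernel density estimate with a standard normal kernel:
# f-hat(x0) = (1/N) * sum_i (1/sigma) * phi((x0 - x_i)/sigma)
import numpy as np
from scipy.stats import norm

def parzen_estimate(x0, data, sigma):
    return norm.pdf((x0 - data[:, None]) / sigma).sum(axis=0) / (len(data) * sigma)

rng = np.random.default_rng(5)
data = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 0.5, 100)])
grid = np.linspace(-6, 6, 200)
fhat = parzen_estimate(grid, data, sigma=0.4)
```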
A standard choice of univariate density $g(\cdot)$ is $\varphi(\cdot)$, the standard normal pdf. The natural generalization of this to $p$ dimensions is to let $h(\cdot \mid \mu, \sigma)$ be the $\mathrm{MVN}_p\!\left(\mu, \sigma^2 I\right)$ density. One should expect that unless $N$ is huge, this methodology will be reliable only for fairly small $p$ (say 3 at most) as a means of estimating a general $p$-dimensional pdf.

There is some discussion in HTF of classification based on estimated multivariate densities, and in particular on "naive" estimators. That is, for purposes of defining classifiers (not for the purpose of real density estimation) one might treat the elements $x_j$ of $x$ as if they were independent, and multiply together kernel estimates of marginal densities. Apparently, even where the assumption of independence is clearly "wrong," this naive methodology can produce classifiers that work pretty well.
17.6 Google PageRanks
This might be thought of as of some general interest beyond the particular application of ranking Web pages, if one abstracts the general notion of summarizing features of a directed graph with $N$ nodes ($N$ Web pages in the motivating application) where edges point from some nodes to other nodes (there are links on some Web pages to other Web pages). The basic idea is that one wishes to rank the nodes (Web pages) by some measure of importance.

If $i \neq j$ define
$$L_{ij} = \begin{cases} 1 & \text{if there is a directed edge pointing from node } j \text{ to node } i \\ 0 & \text{otherwise} \end{cases}$$
and define
$$c_j = \sum_{i=1}^N L_{ij} = \text{the number of directed edges pointed away from node } j$$
(There is the question of how we are going to define $L_{jj}$. We may either declare that there is an implicit edge pointing from each node $j$ to itself and adopt the convention that $L_{jj} = 1$, or we may declare that all $L_{jj} = 0$.)
A node (Web page) might be more important if many other (particularly, important) nodes have edges (links) pointing to it. The Google PageRanks $r_i > 0$ are chosen to satisfy$^{10}$
$$r_i = (1 - d) + d \sum_j \frac{L_{ij}}{c_j}\, r_j \qquad (129)$$
for some $d \in (0,1)$ (producing minimum rank $(1-d)$). (Apparently, a standard choice is $d = .85$.) The question is how one can identify the $r_i$.$^{11}$

$^{10}$There is the question of what $L_{ij}/c_j$ should mean in case $c_j = 0$. We'll presume that the meaning is that the ratio is 0.

$^{11}$These are, of course, simply $N$ linear equations in the $N$ unknowns $r_i$, and for small $N$ one might ignore special structure and simply solve these numerically with a generic solver. In what follows we exploit some special structure.
Without loss of generality, with
$$r = \begin{pmatrix} r_1 \\ \vdots \\ r_N \end{pmatrix}$$
we'll assume that $r'\mathbf{1} = N$ so that the average rank is 1. Then, for
$$d_j = \begin{cases} \dfrac{1}{c_j} & \text{if } c_j \neq 0 \\ 0 & \text{if } c_j = 0 \end{cases}$$
define the $N \times N$ diagonal matrix $D = \mathrm{diag}(d_1, d_2, \ldots, d_N)$. (Clearly, if we use the $L_{jj} = 1$ convention, then $d_j = 1/c_j$ for all $j$.) Then in matrix form, the $N$ equations (129) are (for $L = (L_{ij})_{\substack{i=1,\ldots,N \\ j=1,\ldots,N}}$)
$$r = (1-d)\mathbf{1} + dLDr = \left( \frac{1}{N}(1-d)\mathbf{1}\mathbf{1}' + dLD \right) r$$
(using the assumption that $r'\mathbf{1} = N$). Let
$$T = \frac{1}{N}(1-d)\mathbf{1}\mathbf{1}' + dLD$$
so that $r'T' = r'$.
Note all entries of $T$ are non-negative and that
$$T'\mathbf{1} = \left( \frac{1}{N}(1-d)\mathbf{1}\mathbf{1}' + dLD \right)' \mathbf{1} = \frac{1}{N}(1-d)\mathbf{1}\mathbf{1}'\mathbf{1} + dDL'\mathbf{1} = (1-d)\mathbf{1} + dD \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_N \end{pmatrix}$$
so that if all $c_j > 0$, $T'\mathbf{1} = \mathbf{1}$. We have this condition as long as we either limit application to sets of nodes (Web pages) where each node has an outgoing edge (an outgoing link) or we decide to count every node as pointing to itself (every page as linking to itself) using the $L_{jj} = 1$ convention. Henceforth suppose that indeed all $c_j > 0$.
Under this assumption, $T'$ is a stochastic matrix (with rows that are probability vectors), the transition matrix for an irreducible aperiodic finite state Markov Chain. Defining the probability vector
$$p = \frac{1}{N}\, r$$
it then follows that since $p'T' = p'$, the PageRank vector is $N$ times the stationary probability vector for the Markov Chain. This stationary probability vector can then be found as the limit of any row of
$$\left( T' \right)^n$$
as $n \to \infty$.
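A minimal power-iteration sketch of this computation, assuming all $c_j > 0$ (here enforced by the $L_{jj} = 1$ self-link convention) and an arbitrary tiny example graph:

```python
# PageRank sketch: build T from the link matrix L and iterate p <- T p
# (equivalent to taking a high power of T'); ranks are r = N * p.
import numpy as np

def pagerank(L, d=0.85, n_iter=100):
    # L[i, j] = 1 if node j points to node i; L_jj = 1 convention assumed
    N = L.shape[0]
    c = L.sum(axis=0)                               # out-degrees c_j (all > 0 here)
    T = (1 - d) * np.ones((N, N)) / N + d * L @ np.diag(1.0 / c)
    p = np.ones(N) / N                              # any probability vector works
    for _ in range(n_iter):
        p = T @ p                                   # p'T' = p' at the fixed point
    return N * p                                    # ranks r, averaging to 1

L = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 0, 1]], dtype=float)              # tiny 3-node example
print(pagerank(L))
```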
18 Document Features and String Kernels for Text Processing
An important application of various kinds of both supervised and unsupervised
learning methods is that of text processing. The object is to quantify structure
and commonalities in text documents. Patterns in characters and character
strings and words are used to characterize documents, group them into clusters,
and classify them into types. We here say a bit about some simple methods
that have been applied.
Suppose that $N$ documents in a collection (or corpus) are under study. One needs to define "features" for these, or at least some kind of "kernel" functions for computing the inner products required for producing principal components in an implicit feature space (and subsequently clustering or deriving classifiers, and so on).

If one treats documents as simply sets of words (ignoring spaces and punctuation and any kind of order of words) one simple set of features for documents $d_1, d_2, \ldots, d_N$ is a set of counts of word frequencies. That is, for a set of $p$ words appearing in at least one document, one might take
$$x_{ij} = \text{the number of occurrences of word } j \text{ in document } i$$
and operate on a representation of the documents in terms of an $N \times p$ data matrix $X$. These raw counts $x_{ij}$ are often transformed before processing.
One popular idea is the use of a "tf-idf" (term frequency-inverse document frequency) weighting of the elements of $X$. This replaces $x_{ij}$ with
$$t_{ij} = x_{ij} \ln\!\left( \frac{N}{\sum_{i=1}^N I[x_{ij} > 0]} \right)$$
or variants thereof. (This up-weights non-zero counts of words that occur in few documents. The logarithm is there to prevent this up-weighting from overwhelming all other aspects of the counts.) One might also decide that document length is a feature that is not really of primary interest and determine to normalize the vectors $x_i$ (or $t_i$) in one way or another. That is, one might begin with values
$$\frac{x_{ij}}{\sum_{j=1}^p x_{ij}} \quad \text{or} \quad \frac{x_{ij}}{\|x_i\|}$$
rather than values $x_{ij}$. This latter consideration is, of course, not relevant if the documents in the corpus all have roughly the same length.
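A small numpy sketch of exactly this transformation (the formula above, followed by Euclidean row normalization); library tf-idf implementations such as scikit-learn's TfidfTransformer use closely related but not identical conventions.

```python
# tf-idf weighting of a word-count matrix X (N documents x p words):
# t_ij = x_ij * ln(N / #{documents containing word j}), then row-normalize.
import numpy as np

def tfidf(X):
    N = X.shape[0]
    doc_freq = (X > 0).sum(axis=0)                  # sum_i I[x_ij > 0]
    T = X * np.log(N / np.maximum(doc_freq, 1))     # guard against unused words
    norms = np.linalg.norm(T, axis=1, keepdims=True)
    return T / np.where(norms > 0, norms, 1.0)      # t_i / ||t_i||

X = np.array([[2, 0, 1, 0],
              [0, 3, 1, 0],
              [1, 1, 1, 5]], dtype=float)
print(tfidf(X))
```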
Processing methods that are based only on variants of the word counts $x_{ij}$ are usually said to be based on the "Bag-of-Words." They obviously ignore potentially important word order. (The instructions "turn right then left" and "turn left then right" are obviously quite different instructions.) One could then consider ordered pairs or $n$-tuples of words.
So, with some "alphabet" $A$ (that might consist of English words, Roman letters, amino acids in protein sequencing, base pairs in DNA sequencing, etc.) consider strings of elements of the alphabet, say
$$s = b_1 b_2 \cdots b_{|s|} \quad \text{where each } b_i \in A$$
A document (... or protein sequence ... or DNA sequence) might be idealized as such a string of elements of $A$. An $n$-gram in this context is simply a string of $n$ elements of $A$, say
$$u = b_1 b_2 \cdots b_n \quad \text{where each } b_i \in A$$
Frequencies of unigrams (1-grams) in documents are (depending upon the alphabet) "bag-of-words" statistics for words or letters or amino acids, etc. Use of features of documents that are counts of occurrences of all possible $n$-grams appears to often be problematic, because unless $|A|$ and $n$ are fairly small, $p = |A|^n$ will be huge and then $X$ huge and sparse (for ordinary $N$ and $|s|$). And in many contexts, sequence/order structure is not so "local" as to be effectively expressed by only frequencies of $n$-grams for small $n$.
One idea that seems to be currently popular is to define a set of interesting strings, say $U = \{u_i\}_{i=1}^p$, and look for their occurrence anywhere in a document, with the understanding that they may be realized as substrings of longer strings. That is, when looking for string $u$ (of length $n$) in a document $s$, we count every different substring of $s$ (say $s' = s_{i_1} s_{i_2} \cdots s_{i_n}$ for indices $i_1 < i_2 < \cdots < i_n$) for which
$$s' = u$$
But we discount those substrings of $s$ matching $u$ according to length as follows. For some $\lambda > 0$ (the choice $\lambda = .5$ seems pretty common) give matching substring $s' = s_{i_1} s_{i_2} \cdots s_{i_n}$ weight
$$\lambda^{i_n - i_1 + 1} = \lambda^{|s'|}$$
(writing $|s'| = i_n - i_1 + 1$ for the length of the stretch of $s$ spanned by the match), so that document $i$ (represented by string $s_i$) gets value of feature $j$
$$x_{ij} = \sum_{s_{il_1} s_{il_2} \cdots s_{il_n} = u_j} \lambda^{l_n - l_1 + 1} \qquad (130)$$
It further seems that it's common to normalize the rows of $X$ by the usual Euclidean norm, producing in place of $x_{ij}$ the value
$$\frac{x_{ij}}{\|x_i\|} \qquad (131)$$
This notion of using features (130) or normalized features (131) looks attractive, but potentially computationally prohibitive, particularly since the "interesting set" of strings $U$ is often taken to be $A^n$. One doesn't want to have to compute all features (130) directly and then operate with the very large matrix $X$. But just as we were reminded in Section 2.3.2, it is only $XX'$ that is required to find principal components of the features (or to define SVM classifiers or any other classifiers or clustering algorithms based on principal components). So if there is a way to efficiently compute or approximate inner products for rows of $X$ defined by (130), namely
$$\langle x_i, x_{i'} \rangle = \sum_{u \in A^n} \left( \sum_{s_{il_1} s_{il_2} \cdots s_{il_n} = u} \lambda^{l_n - l_1 + 1} \right)\!\left( \sum_{s_{i'm_1} s_{i'm_2} \cdots s_{i'm_n} = u} \lambda^{m_n - m_1 + 1} \right) = \sum_{u \in A^n}\; \sum_{s_{il_1} \cdots s_{il_n} = u}\; \sum_{s_{i'm_1} \cdots s_{i'm_n} = u} \lambda^{l_n - l_1 + m_n - m_1 + 2}$$
it might be possible to employ this idea. And if the inner products $\langle x_i, x_{i'} \rangle$ can be computed efficiently, then so can the inner products
$$\left\langle \frac{1}{\|x_i\|}\, x_i,\; \frac{1}{\|x_{i'}\|}\, x_{i'} \right\rangle = \frac{\langle x_i, x_{i'} \rangle}{\sqrt{\langle x_i, x_i \rangle \langle x_{i'}, x_{i'} \rangle}}$$
needed to employ $XX'$ for the normalized features (131). For what it is worth, it is in vogue to call the function of documents $s$ and $t$ defined by
$$K(s,t) = \sum_{u \in A^n}\; \sum_{s_{l_1} s_{l_2} \cdots s_{l_n} = u}\; \sum_{t_{m_1} t_{m_2} \cdots t_{m_n} = u} \lambda^{l_n - l_1 + m_n - m_1 + 2}$$
the String Subsequence Kernel and then call the matrix $XX' = (\langle x_i, x_j \rangle) = (K(s_i, s_j))$ the Gram matrix for that "kernel." The good news is that there are fairly simple recursive methods for computing $K(s,t)$ exactly in $O(n|s||t|)$ time and that there are approximations that are even faster (see the 2002 Journal of Machine Learning Research paper of Lodhi et al.). That makes the implicit use of features (130) or normalized features (131) possible in many text processing problems.
19 Kernel Mechanics
We have noted a number of times (particularly in the context of RKHSs) the potential usefulness of non-negative definite "kernels" $K(s,t)$. In the RKHS context, these map $C \times C \to \Re$ continuously for $C$ a compact subset of $\Re^p$. But the basic requirement is that for any set of points $\{x_i\}_{i=1}^n$, $x_i \in C$, the Gram matrix
$$\left( K(x_i, x_j) \right)_{\substack{i=1,\ldots,n \\ j=1,\ldots,n}}$$
is non-negative definite. For most purposes in machine learning, one can give up (continuity and) requiring that the domain of $K(s,t)$ be a subset of $\Re^p \times \Re^p$, replacing it with an arbitrary $\mathcal{X} \times \mathcal{X}$ and requiring only that the Gram matrix be non-negative definite for any set of $\{x_i\}_{i=1}^n$, $x_i \in \mathcal{X}$. It is in this context that the "string kernels" of Section 18 can be called "kernels," and here we adopt this usage/understanding.
A direct way of producing a kernel function is through a Euclidean inner product of vectors of "features." That is, if $\phi: \mathcal{X} \to \Re^m$ (so that component $j$ of $\phi$, $\phi_j$, maps $\mathcal{X}$ to a univariate real feature) then
$$K(s,t) = \langle \phi(s), \phi(t) \rangle \qquad (132)$$
is a kernel function. (This is, of course, the basic idea used in Section 2.3.2.)
Section 6.2 of the book Pattern Recognition and Machine Learning by Bishop notes that it is very easy to make new kernel functions from known ones. In particular, for $c > 0$, $K_1(\cdot,\cdot)$ and $K_2(\cdot,\cdot)$ kernel functions on $\mathcal{X} \times \mathcal{X}$, $f(\cdot)$ arbitrary, $p(\cdot)$ a polynomial with non-negative coefficients, $\phi: \mathcal{X} \to \Re^m$, $K_3(\cdot,\cdot)$ a kernel on $\Re^m \times \Re^m$, and $M$ a non-negative definite matrix, all of the following are kernel functions:

1. $K(s,t) = cK_1(s,t)$ on $\mathcal{X} \times \mathcal{X}$,
2. $K(s,t) = f(s) K_1(s,t) f(t)$ on $\mathcal{X} \times \mathcal{X}$,
3. $K(s,t) = p(K_1(s,t))$ on $\mathcal{X} \times \mathcal{X}$,
4. $K(s,t) = \exp(K_1(s,t))$ on $\mathcal{X} \times \mathcal{X}$,
5. $K(s,t) = K_1(s,t) + K_2(s,t)$ on $\mathcal{X} \times \mathcal{X}$,
6. $K(s,t) = K_1(s,t) K_2(s,t)$ on $\mathcal{X} \times \mathcal{X}$,
7. $K(s,t) = K_3(\phi(s), \phi(t))$ on $\mathcal{X} \times \mathcal{X}$, and
8. $K(s,t) = s'Mt$ on $\Re^m \times \Re^m$.

(Fact 7 generalizes the basic insight of display (132).) Further, if $\mathcal{X} = \mathcal{X}_A \times \mathcal{X}_B$ and $K_A(\cdot,\cdot)$ is a kernel on $\mathcal{X}_A \times \mathcal{X}_A$ and $K_B(\cdot,\cdot)$ is a kernel on $\mathcal{X}_B \times \mathcal{X}_B$, then the following are both kernel functions:

9. $K((s_A, s_B), (t_A, t_B)) = K_A(s_A, t_A) + K_B(s_B, t_B)$ on $\mathcal{X} \times \mathcal{X}$, and
10. $K((s_A, s_B), (t_A, t_B)) = K_A(s_A, t_A) K_B(s_B, t_B)$ on $\mathcal{X} \times \mathcal{X}$.
An example of a kernel on a somewhat abstract (but finite) space is this. For a finite set $A$ consider $\mathcal{X} = 2^A$, the set of all subsets of $A$. A kernel on $\mathcal{X} \times \mathcal{X}$ can then be defined by
$$K(A_1, A_2) = 2^{|A_1 \cap A_2|} \quad \text{for } A_1 \subset A \text{ and } A_2 \subset A$$
There are several probabilistic and statistical arguments that can lead to forms for kernel functions. For example, a useful fact from probability theory (Bochner's Theorem) says that characteristic functions for $p$-dimensional distributions are non-negative definite complex-valued functions of $s \in \Re^p$. So if $\psi(s)$ is a real-valued characteristic function, then
$$K(s,t) = \psi(s - t)$$
is a kernel function on $\Re^p \times \Re^p$. Related to this line of thinking are lists of standard characteristic functions (that in turn produce kernel functions) and theorems about conditions sufficient to guarantee that a real-valued function is a characteristic function. For example, each of the following is a real characteristic function for a univariate random variable (that can lead to a kernel on $\Re^1 \times \Re^1$):

1. $\psi(t) = \cos(at)$ for some $a > 0$,
2. $\psi(t) = \dfrac{\sin(at)}{at}$ for some $a > 0$,
3. $\psi(t) = \exp\!\left(-at^2\right)$ for some $a > 0$, and
4. $\psi(t) = \exp\!\left(-a|t|\right)$ for some $a > 0$.

And one theorem about sufficient conditions for a real-valued function on $\Re^1$ to be a characteristic function says that if $f$ is symmetric ($f(-t) = f(t)$), $f(0) = 1$, and $f$ is decreasing and convex on $[0, \infty)$, then $f$ is the characteristic function of some distribution on $\Re^1$. (See Chung page 191.)
Bishop points out two constructions motivated by statistical modeling that yield kernels that have been used in the machine learning literature. One is this. For a parametric model on (a potentially completely abstract) $\mathcal{X}$, consider densities $f(x \mid \theta)$ that when treated as functions of $\theta$ are likelihood functions (for various possible observed $x$). Then for a distribution $G$ for $\theta \in \Theta$,
$$K(s,t) = \int f(s \mid \theta)\, f(t \mid \theta)\, dG(\theta)$$
is a kernel. This is the inner product in the space of square integrable functions on the probability space $\Theta$ with measure $G$ of the two likelihood functions. In this space, the distance between the functions (of $\theta$) $f(s \mid \theta)$ and $f(t \mid \theta)$ is
$$\sqrt{\int \left( f(s \mid \theta) - f(t \mid \theta) \right)^2 dG(\theta)}$$
and what is going on here is the implicit use of (infinite-dimensional) features that are likelihood functions for the "observations" $x$. Once one starts down this path, other possibilities come to mind. One is to replace likelihoods with loglikelihoods and consider the issue of "centering" and even "standardization." That is, one might define a feature (a function of $\theta$) corresponding to $x$ as
$$\ell_x(\theta) = \ln f(x \mid \theta) \quad \text{or} \quad \ell_x'(\theta) = \ln f(x \mid \theta) - \int \ln f(x \mid \theta)\, dG(\theta)$$
$$\text{or even} \quad \ell_x''(\theta) = \frac{\ln f(x \mid \theta) - \int \ln f(x \mid \theta)\, dG(\theta)}{\sqrt{\int \left( \ln f(x \mid \theta) - \int \ln f(x \mid \theta)\, dG(\theta) \right)^2 dG(\theta)}}$$
Then obviously, the corresponding kernel function is
$$K(s,t) = \int \ell_s(\theta)\, \ell_t(\theta)\, dG(\theta) \quad \text{or} \quad K'(s,t) = \int \ell_s'(\theta)\, \ell_t'(\theta)\, dG(\theta) \quad \text{or} \quad K''(s,t) = \int \ell_s''(\theta)\, \ell_t''(\theta)\, dG(\theta)$$
(Of these three possibilities, centering alone is probably the most natural from a statistical point of view. It is the "shape" of a loglikelihood that is important in statistical context, not its absolute level. Two loglikelihoods that differ by a constant are equivalent for most statistical purposes. Centering perfectly lines up two loglikelihoods that differ by a constant.)
In a regular statistical model for $x$ taking values in $\mathcal{X}$ with parameter vector $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$, the $k \times k$ Fisher information matrix, say $I(\theta)$, is nonnegative definite. Then with score function
$$\nabla \ln f(x \mid \theta) = \begin{pmatrix} \dfrac{\partial}{\partial \theta_1} \ln f(x \mid \theta) \\ \dfrac{\partial}{\partial \theta_2} \ln f(x \mid \theta) \\ \vdots \\ \dfrac{\partial}{\partial \theta_k} \ln f(x \mid \theta) \end{pmatrix}$$
(for any fixed $\theta$) the function
$$K(s,t) = \left( \nabla \ln f(s \mid \theta) \right)' \left( I(\theta) \right)^{-1} \nabla \ln f(t \mid \theta)$$
has been called the "Fisher kernel" in the machine learning literature. (It follows from Bishop's 7. and 8. that this is indeed a kernel function.) Note that $K(x,x)$ is essentially the score test statistic for a point null hypothesis about $\theta$.
The implicit feature vector here is the $k$-dimensional score function (evaluated at some fixed $\theta$, a basis for testing about $\theta$), and rather than Euclidean norm, the norm $\|u\| \equiv \sqrt{u' \left( I(\theta) \right)^{-1} u}$ is implicitly in force for judging the size of differences in feature vectors.

20 Undirected Graphical Models and Machine Learning
This is a very brief look at an area of considerable current interest in machine learning related to what is known as "deep learning." HTF's Section 17.4 and Chapter 27 of Murphy's Machine Learning can be consulted for more details.

Suppose that some set of $q$ binary variables (variables each taking values in $\{0,1\}$) $x_1, x_2, \ldots, x_q$ is of interest. One way to build a tractable coherent parametric model for these variables is to first adopt a corresponding undirected graph with $q$ nodes and some set of "edges" $E$ connecting some pairs of nodes as representing the variables. Figure 2 is a cartoon of one such graph for a $q = 8$ case.
Figure 2: A particular undirected graph for q = 8 nodes.
One then posits a very specific and convenient parametric form for probabilities corresponding to the undirected graph. For every pair $(j,k)$ such that $0 < j < k \leq q$ and there is an edge in $E$ connecting node $j$ and node $k$, let $\theta_{jk}$ be a corresponding real parameter. Further, for $k = 1, \ldots, q$ let $\theta_{0k}$ be an additional real parameter. Then a probability mass function with
$$f(x) = \frac{\exp\left( \sum_{\substack{j<k \text{ such that } x_j \text{ and} \\ x_k \text{ share an edge}}} \theta_{jk} x_j x_k + \sum_{k=1}^q \theta_{0k} x_k \right)}{\sum_{x \in \{0,1\}^q} \exp\left( \sum_{\substack{j<k \text{ such that } x_j \text{ and} \\ x_k \text{ share an edge}}} \theta_{jk} x_j x_k + \sum_{k=1}^q \theta_{0k} x_k \right)} \qquad (133)$$
for $x \in \{0,1\}^q$ specifies a particular kind of "Markov network," an "Ising model" or "Boltzmann Machine" for these variables. Let the normalizing constant that is the denominator on the right of display (133) be called $Z(\theta)$ and note the obvious fact that for these models
$$\ln(f(x)) = \sum_{\substack{j<k \text{ such that } x_j \text{ and} \\ x_k \text{ share an edge}}} \theta_{jk} x_j x_k + \sum_{k=1}^q \theta_{0k} x_k - \ln(Z(\theta)) \qquad (134)$$
Further, observe that such a model can have as many as
$$\binom{q}{2} + q = \frac{q(q+1)}{2}$$
different real parameters.
If we let $x_{-j}$ stand for the $(q-1)$-dimensional vector consisting of all entries of $x$ except $x_j$, it follows directly from relationship (134) that
$$\ln \frac{P\left[ x_j = 1 \mid x_{-j} \right]}{P\left[ x_j = 0 \mid x_{-j} \right]} = \sum_{\substack{l<j \text{ such that } x_l \text{ and} \\ x_j \text{ share an edge}}} \theta_{lj} x_l + \sum_{\substack{l>j \text{ such that } x_j \text{ and} \\ x_l \text{ share an edge}}} \theta_{jl} x_l + \theta_{0j}$$
and then that
$$P\left[ x_j = 1 \mid x_{-j} \right] = \frac{\exp\left( \sum_{\substack{l<j \text{ such that } x_l \text{ and} \\ x_j \text{ share an edge}}} \theta_{lj} x_l + \sum_{\substack{l>j \text{ such that } x_j \text{ and} \\ x_l \text{ share an edge}}} \theta_{jl} x_l + \theta_{0j} \right)}{1 + \exp\left( \sum_{\substack{l<j \text{ such that } x_l \text{ and} \\ x_j \text{ share an edge}}} \theta_{lj} x_l + \sum_{\substack{l>j \text{ such that } x_j \text{ and} \\ x_l \text{ share an edge}}} \theta_{jl} x_l + \theta_{0j} \right)}$$
One important observation about this model form is then that for fixed $\theta$, Gibbs sampling to produce realizations from this distribution (133) is obvious (using these simple conditional probabilities). Further, conditioning on any set of entries of $x$, how to simulate (again via Gibbs sampling) from a conditional distribution for the rest of the entries (again for fixed $\theta$) is equally obvious (as the conditioning variables simply provide some entries of $x_{-j}$ above that are fixed at their conditioning values throughout the Gibbs iterations).
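A minimal Gibbs-sampler sketch for a Boltzmann machine with given parameters, using exactly the conditional probabilities displayed above; the parameterization here (a symmetric matrix of pair parameters theta[j, k] that is nonzero only on edges, plus a vector theta0 of main effects) is just one convenient encoding.

```python
# Gibbs sampling from the Ising model / Boltzmann machine (133) for fixed theta.
# theta: symmetric (q x q) array with theta[j, k] = theta_jk on edges, 0 elsewhere
# (zero diagonal); theta0: length-q array of the theta_0k.
import numpy as np

def gibbs_sample(theta, theta0, n_sweeps=1000, clamped=None, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    q = len(theta0)
    x = rng.integers(0, 2, size=q).astype(float)
    if clamped:                                   # e.g. {0: 1, 3: 0} fixes entries
        for j, v in clamped.items():
            x[j] = v
    for _ in range(n_sweeps):
        for j in range(q):
            if clamped and j in clamped:
                continue                          # conditioning values stay fixed
            eta = theta[j] @ x + theta0[j]        # sum over neighbors + theta_0j
            p1 = 1.0 / (1.0 + np.exp(-eta))       # P[x_j = 1 | x_{-j}]
            x[j] = float(rng.random() < p1)
    return x
```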
A version of this general framework that is particularly popular in machine learning applications is that where the nodes are arranged in two layers and there are edges only between nodes on different layers, not between nodes in the same layer. Such a model is often called a restricted Boltzmann machine (an RBM). In machine learning contexts, one layer of nodes is called the "hidden layer" and the other is called the "visible layer." Typically the nodes in the visible layer correspond to (digital versions of) variables that are at least at some cost empirically observable, while the variables corresponding to hidden nodes are completely latent/unobservable and somehow represent some stochastic physical or mental mechanism. In addition, it is convenient in some contexts to think of visible nodes as being of two types, say belonging to a set $V_1$ or a set $V_2$. For example, in a prediction context, the nodes in $V_1$ might encode "$x$"/inputs and the nodes in $V_2$ might encode $y$/outputs. We'll use the naming conventions indicated in Figure 3.

Figure 3: An undirected graph corresponding to a restricted Boltzmann machine with $q = l + m + n$ total nodes. $l$ nodes are hidden and the $m + n$ visible nodes break into the two sets $V_1$ and $V_2$.
(As we have already suggested for general Boltzmann machines) given the parameter vector $\theta$, simulation (via Gibbs sampling) from an RBM is easy. In particular, where the nodes in $V_1$ encode an input vector $x$ and the nodes in $V_2$ encode an output $y$, simulation from the conditional distribution of the whole network given a set of values for $v_{l+1}, \ldots, v_{l+m}$ provides a predictive distribution for the response. So the fundamental difficulty one faces in using these models is the matter of fitting a vector of coefficients $\theta$.

What will be available as training data for an RBM is some set of (potentially incomplete) vectors of values for visible nodes, say $v_i$ for $i = 1, \ldots, N$ (that one will typically assume are independent from some appropriate marginal distributions for visible vectors derived from the overall joint distribution of values associated with all nodes, visible and hidden).
Specializing form (133) to the RBM case, one has
$$f(h, v) = \frac{\exp\left( \sum_{\substack{j=1,\ldots,l \\ k=l+1,\ldots,l+m+n}} \theta_{jk} h_j v_k + \sum_{k=1}^l \theta_{0k} h_k + \sum_{k=l+1}^{l+m+n} \theta_{0k} v_k \right)}{Z(\theta)} \qquad (135)$$
for
$$Z(\theta) = \sum_{(h,v) \in \{0,1\}^{l+m+n}} \exp\left( \sum_{\substack{j=1,\ldots,l \\ k=l+1,\ldots,l+m+n}} \theta_{jk} h_j v_k + \sum_{k=1}^l \theta_{0k} h_k + \sum_{k=l+1}^{l+m+n} \theta_{0k} v_k \right)$$
Notice now that even in a hypothetical case where one has "data" consisting of complete $(h,v)$ pairs, the existence of the unpleasant normalizing constant $Z(\theta)$ (of a sum form in the denominator) in form (135) would make optimization of a likelihood $\prod_{i=1}^N f(h_i, v_i)$ or loglikelihood $\sum_{i=1}^N \ln f(h_i, v_i)$ not obvious.
But the fact that one must sum out over (at least) all hidden nodes in order to get contributions to a likelihood or loglikelihood makes the problem even more computationally difficult. That is, if an $i$th training case provides a complete visible vector $v_i \in \{0,1\}^{m+n}$, the corresponding likelihood term is the marginal of that visible configuration
$$f(v_i) = \sum_{h \in \{0,1\}^l} f(h, v_i)$$
And the situation becomes even more unpleasant if an $i$th training case provides, for example, only values for variables corresponding to nodes in $V_1$ (say $v_{1i} \in \{0,1\}^m$), since the corresponding likelihood term is the marginal of only that visible configuration
$$f(v_{1i}) = \sum_{\substack{h \in \{0,1\}^l \\ v_2 \in \{0,1\}^n}} f(h, (v_{1i}, v_2))$$
Substantial effort in computer science circles has gone into the search for "learning" algorithms aimed at finding parameter vectors that produce large values of loglikelihoods based on $N$ training cases (each term based on some marginal of $f(h,v)$ corresponding to a set of visible nodes). These seem to be mostly based on approximate stochastic gradient descent ideas and approximations to appropriate expectations based on short Gibbs sampling runs. Hinton's notion of "contrastive divergence" appears to be central to the most well known of these.

The area of "deep learning" then seems to be based on the notion of generalizing RBMs to more than a single hidden layer and potentially employs visible layers on both the top and the bottom of an undirected graph. A cartoon of a deep learning network is given in Figure 4. The fundamental feature of the graph architecture here is that there are edges only between nodes in successive layers.
Figure 4: A hypothetical "deep" generalization of an RBM.
If the problem of how to fit an Ising model/Boltzmann machine to an RBM is a practically difficult one, fitting a deep network model (based on some training cases consisting of values for variables associated with some of the visible nodes) is clearly going to be doubly difficult. What seems to be currently popular is some kind of "greedy"/"forward" sequential fitting of one set of parameters connecting two successive layers at a time, followed by generation of simulated values for variables corresponding to nodes in the "newest" layer and treating those as "data" for fitting a next set of parameters connecting two layers (and so on).

If the fitting problem can be solved for a given architecture and form of training data, some very interesting possibilities for using a good "deep" model emerge. For one, in a classification/prediction context, one might treat the bottom visible layer as associated with an input vector $x$, and the top layer as also visible and associated with an output $y$ or $\mathbf{y}$. In theory, training cases could consist of complete $(x,y)$ pairs, but they could also involve some incomplete cases, where $y$ is missing, or part of $x$ is missing, or ... so on. (Once the training issue is handled, simulation of the top layer variables from inputs of interest used as bottom layer values again enables classification/prediction.)

As an interesting second possibility for using deep restricted Boltzmann machines, one could consider architectures where the top and bottom layers are both visible and encode essentially (or even exactly$^{12}$) the same information and between them are several hidden layers with one having very few nodes (like, say, two nodes). If one can fit such a thing to training cases consisting of sets of values corresponding to visible nodes, one can simulate (again via Gibbs sampling), for a fixed set of "visible values," corresponding values at the few nodes serving as the narrow layer of the network. The vector of estimated marginal probabilities of 1 at those nodes might then in turn serve as a kind of pair of generalized principal component values for the set of input visible values. (Notice that in theory these are possible to compute even for incomplete sets of visible values!)
$^{12}$I am not altogether sure what the practical implications are of essentially turning the kind of "tower" in Figure 4 into a "band" or flat bracelet that connects back on itself, the top hidden layer having edges directly to the bottom visible layer, but I see no obvious theoretical issues with this architecture.