More Simple Generalities About Predictive Analytics: Linear Methods and Nearest Neighbor Methods

Consider the basic supervised learning set-up for a vector of real-valued targets $\mathbf{Y}_{N\times 1}$ and a matrix of inputs $\mathbf{X}_{N\times p}$ that together comprise a training set $T$. A fixed function $f$ that takes $p$-vectors, say $\mathbf{x}$, as inputs and outputs a target value $\hat{y}$,

$$\hat{y} = f(\mathbf{x})$$
is a predictor. Where the training set is used to choose a predictor, we'll write $\hat{f}$ instead of $f$.
There is a whole spectrum of potential $\hat{f}$'s ranging widely in flexibility/complexity.
Mostly/"usually," highly flexible predictors tend to have high variance (across training sets $T$) of predictions $\hat{f}(\mathbf{u})$ (for fixed $\mathbf{u}$), and inflexible/simple predictors tend to have high bias (across training sets $T$) of predictions $\hat{f}(\mathbf{u})$ (for fixed $\mathbf{u}$). Some standard prediction methods $\hat{f}$, listed roughly in order of increasing complexity/flexibility, are:

• the grand sample mean, $\bar{y}$
• simple linear regression of $y$ on a single input, say $x_j$
• multiple linear regression of $y$ on all $p$ inputs
• $k$-nearest neighbor prediction

We all know some things about ordinary (simple and) multiple regression that are worth quickly
reiterating here. Then we will say a small amount about nearest neighbor prediction rules.
Recall from (new) Stat 500 or (old) Stat 511 (see Vardeman's Stat 511 page if you want to review this) that for $C(\mathbf{X})$ the column space of $\mathbf{X}$ and $P_{C(\mathbf{X})}$ the projection matrix onto that subspace of $\mathbb{R}^N$, ordinary least squares uses

$$\hat{\mathbf{Y}}^{\text{OLS}} = P_{C(\mathbf{X})}\mathbf{Y}$$

as a vector of predictions/fitted target values. One standard representation of this projection matrix (in terms of a generalized inverse $(\mathbf{X}'\mathbf{X})^{-}$ of $\mathbf{X}'\mathbf{X}$) is

$$P_{C(\mathbf{X})} = \mathbf{X}\left(\mathbf{X}'\mathbf{X}\right)^{-}\mathbf{X}'$$

This means that

$$\hat{\mathbf{Y}}^{\text{OLS}} = \mathbf{X}\left(\mathbf{X}'\mathbf{X}\right)^{-}\mathbf{X}'\mathbf{Y} = \mathbf{X}\left(\left(\mathbf{X}'\mathbf{X}\right)^{-}\mathbf{X}'\mathbf{Y}\right)$$

and with

$$\hat{\boldsymbol{\beta}}^{\text{OLS}} = \left(\mathbf{X}'\mathbf{X}\right)^{-}\mathbf{X}'\mathbf{Y}$$
it is the case that
$$\hat{\mathbf{Y}}^{\text{OLS}} = \mathbf{X}\hat{\boldsymbol{\beta}}^{\text{OLS}}$$
It's thus sensible to call
$$\hat{f}(\mathbf{x}) = \mathbf{x}\hat{\boldsymbol{\beta}}^{\text{OLS}} \qquad (*)$$
the ordinary least squares predictor derived from the training set. Recall from the QR
decomposition discussion that in the full rank case, an alternative formula for the vector of least
squares coefficients is
$$\hat{\boldsymbol{\beta}}^{\text{OLS}} = \mathbf{R}^{-1}\mathbf{Q}'\mathbf{Y}$$
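To make the algebra above concrete, here is a minimal numerical sketch (not part of the original notes; the simulated data, true coefficients, and random seed are assumptions chosen only for illustration) checking that in the full rank case the normal-equations coefficients, the QR-based coefficients, and the projection $P_{C(\mathbf{X})}\mathbf{Y}$ all agree.

```python
# Hypothetical illustration: OLS fitted via a normal-equations form,
# via QR, and via explicit projection onto the column space of X.
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))                         # N x p input matrix
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=N)

# Normal-equations form: beta_hat = (X'X)^- X'Y (pinv plays the role of (X'X)^-)
beta_ne = np.linalg.pinv(X.T @ X) @ X.T @ Y

# QR form (full rank case): X = QR, so beta_hat = R^{-1} Q'Y
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ Y)

# Fitted values two ways: X beta_hat and the projection P_{C(X)} Y
P = X @ np.linalg.pinv(X.T @ X) @ X.T               # projection matrix onto C(X)
assert np.allclose(beta_ne, beta_qr)
assert np.allclose(X @ beta_qr, P @ Y)
```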
Thinking about this "multiple regression"/OLS base for producing the linear predictor in (*), we
will see that:
1. "less flexible" versions of this predictor might be had by
a. reducing the size of X by deleting some columns (through some kind of "variable
search/selection" method),
b. using some other (than OLS) criterion to choose a vector of coefficients β̂ for use
in a form fˆ  x   xβˆ (often this is some form of "penalization" or "shrinkage"
that tends to make β̂ smaller than β̂ OLS , perhaps even setting some entries to 0
and having the effect of "dropping" those columns from the prediction form), and
2. "more flexible" versions of this predictor might be had by increasing the size of X by the
addition of some columns or replacement of the p original columns by a larger number
2 of functions of the p input variables. (This kind of possibility includes polynomial
regressions and use of other large sets of "basis" functions.)
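As one hedged illustration of these two directions (the ridge-style penalty, its value, and the cubic expansion are example choices made here, not prescriptions from these notes), a penalized criterion shrinks the coefficient vector relative to OLS, while replacing a single input column with several polynomial columns makes the linear form more flexible.

```python
# Illustrative only: one shrinkage criterion (ridge) and one basis expansion (cubic).
import numpy as np

rng = np.random.default_rng(1)
N = 50
x = rng.uniform(-2, 2, size=N)
y = np.sin(x) + rng.normal(scale=0.2, size=N)
X = x.reshape(-1, 1)                                  # original single input column

# (1b) Penalized coefficients: minimize ||y - Xb||^2 + lam * ||b||^2,
#      which shrinks the solution toward 0 relative to OLS.
lam = 5.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(np.abs(beta_ridge) < np.abs(beta_ols))          # shrinkage: True entrywise here

# (2) More flexible: replace the single column with polynomial columns 1, x, x^2, x^3
X_poly = np.column_stack([x**d for d in range(4)])
beta_poly = np.linalg.lstsq(X_poly, y, rcond=None)[0]
```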
Stat 502X will include methods between linear ones and "nearest neighbor" methods of prediction in terms of flexibility, but in order to introduce one potentially extremely flexible prediction method early, we develop the nearest neighbor idea a bit here.
Consider some joint distribution for $(\mathbf{x}, y)$ and remember that decision-theoretic considerations point at predictors based on the conditional distributions of $y \mid \mathbf{x} = \mathbf{u}$ and summaries of them like

$$f(\mathbf{u}) = \mathrm{E}\left[y \mid \mathbf{x} = \mathbf{u}\right] \quad \text{and} \quad f(\mathbf{u}) = \mathrm{Median}\left[y \mid \mathbf{x} = \mathbf{u}\right] \qquad (**)$$

(under respectively squared- and absolute-error losses).
Consider making approximations of these predictors (**) based on a training set. Naively, one
might hope to use predictors like
$$\hat{f}(\mathbf{u}) = \frac{1}{\#\{i \text{ with } \mathbf{x}_i = \mathbf{u}\}}\sum_{i \text{ with } \mathbf{x}_i = \mathbf{u}} y_i \quad \text{and} \quad \hat{f}(\mathbf{u}) = \mathrm{Median}\left\{y_i \mid \mathbf{x}_i = \mathbf{u}\right\}$$
The problem with this thinking is that typically (because of the continuous nature of distributions of $\mathbf{x}$ or just data sparsity) there will be at most one case for which $\mathbf{x}_i = \mathbf{u}$. So instead of averaging or taking a median of responses for vectors of inputs matching $\mathbf{u}$ exactly, one considers averaging or taking a median of responses for vectors of inputs near $\mathbf{u}$.
Let $N_k(\mathbf{u})$ stand for the set of indices $i$ corresponding to the $k$ vectors of inputs $\mathbf{x}_i$ in the training set closest to the point $\mathbf{u} \in \mathbb{R}^p$ (we'll assume some sensible method is adopted for breaking ties). $k$-nearest neighbor approximations to the optimal but typically unrealizable predictors (**) are
$$\hat{f}(\mathbf{u}) = \frac{1}{k}\sum_{i \in N_k(\mathbf{u})} y_i \quad \text{and} \quad \hat{f}(\mathbf{u}) = \mathrm{Median}\left\{y_i \mid i \in N_k(\mathbf{u})\right\}$$
For a fixed training set size $N$, the smaller is $k$, typically
1. the noisier (the higher-variance) is a $k$-nearest neighbor predictor, and
2. the smaller the potential for prediction bias.

One expects that for very large $N$, $k$ can be taken to be large enough to mitigate prediction variance and yet be a small enough fraction of $N$ to mitigate prediction bias.
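A minimal sketch of the $k$-nearest neighbor predictors just defined is below (the Euclidean distance, tie-breaking by sort order, and the simulated data are assumptions made only for illustration).

```python
# Hypothetical illustration of k-NN prediction: average (squared-error loss) or
# median (absolute-error loss) of the responses whose inputs are closest to u.
import numpy as np

def knn_predict(X_train, y_train, u, k, use_median=False):
    """Return the k-nearest neighbor prediction at a single query point u (a p-vector)."""
    dists = np.linalg.norm(X_train - u, axis=1)   # distances from u to every x_i
    neighbors = np.argsort(dists)[:k]             # indices in N_k(u); ties broken by sort order
    return np.median(y_train[neighbors]) if use_median else np.mean(y_train[neighbors])

# Tiny example with simulated data
rng = np.random.default_rng(2)
X_train = rng.normal(size=(200, 2))
y_train = X_train[:, 0] ** 2 + rng.normal(scale=0.1, size=200)
u = np.array([0.5, -0.3])
print(knn_predict(X_train, y_train, u, k=10))                    # mean-based predictor
print(knn_predict(X_train, y_train, u, k=10, use_median=True))   # median-based predictor
```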