More Simple Generalities About Predictive Analytics: Linear Methods and Nearest Neighbor Methods

Consider the basic supervised learning set-up for an $N \times 1$ vector of real-valued targets $Y$ and an $N \times p$ matrix of inputs $X$ that together comprise a training set $T$. A fixed function $f$ that takes $p$-vectors, say $x$, as inputs and outputs a target value $\hat{y} = f(x)$ is a predictor. Where the training set is used to choose a predictor, we'll write $\hat{f}$ instead of $f$.

There is a whole spectrum of potential $\hat{f}$'s ranging widely in flexibility/complexity. Mostly/"usually," highly flexible predictors tend to have high variance (across training sets $T$) of predictions $\hat{f}(u)$ (for fixed $u$), and inflexible/simple predictors tend to have high bias (across training sets $T$) of predictions $\hat{f}(u)$ (for fixed $u$). Some standard prediction methods $\hat{f}$, listed roughly in order of increasing complexity/flexibility, are:

- the grand sample mean, $\bar{y}$
- simple linear regression of $y$ on a single input, say $x_j$
- multiple linear regression of $y$ on all $p$ inputs
- $k$-nearest neighbor prediction

We all know some things about ordinary (simple and) multiple regression that are worth quickly reiterating here. Then we will say a small amount about nearest neighbor prediction rules.

Recall from (new) Stat 500 or (old) Stat 511 (see Vardeman's Stat 511 page if you want to review this) that for $C(X)$ the column space of $X$ and $P_{C(X)}$ the projection matrix onto that subspace of $\mathbb{R}^N$, ordinary least squares uses

$$\hat{Y}^{\mathrm{OLS}} = P_{C(X)} Y$$

as a vector of predictions/fitted target values. One standard representation of this projection matrix is

$$P_{C(X)} = X (X'X)^{-} X'$$

This means that

$$\hat{Y}^{\mathrm{OLS}} = X (X'X)^{-} X' Y$$

and with

$$\hat{\beta}^{\mathrm{OLS}} = (X'X)^{-} X' Y$$

it is the case that

$$\hat{Y}^{\mathrm{OLS}} = X \hat{\beta}^{\mathrm{OLS}}$$

It's thus sensible to call

$$\hat{f}(x) = x \hat{\beta}^{\mathrm{OLS}} \quad (*)$$

the ordinary least squares predictor derived from the training set. Recall from the QR decomposition discussion that in the full rank case, an alternative formula for the vector of least squares coefficients is

$$\hat{\beta}^{\mathrm{OLS}} = R^{-1} Q' Y$$

(Both routes to $\hat{\beta}^{\mathrm{OLS}}$ are illustrated in a short computational sketch below.)

Thinking about this "multiple regression"/OLS base for producing the linear predictor in (*), we will see that:

1. "less flexible" versions of this predictor might be had by
   a. reducing the size of $X$ by deleting some columns (through some kind of "variable search/selection" method), or
   b. using some other (than OLS) criterion to choose a vector of coefficients $\hat{\beta}$ for use in a form $\hat{f}(x) = x \hat{\beta}$ (often this is some form of "penalization" or "shrinkage" that tends to make $\hat{\beta}$ smaller than $\hat{\beta}^{\mathrm{OLS}}$, perhaps even setting some entries to 0 and having the effect of "dropping" those columns from the prediction form), and
2. "more flexible" versions of this predictor might be had by increasing the size of $X$ by the addition of some columns or replacement of the $p$ original columns by a larger number of functions of the $p$ input variables. (This kind of possibility includes polynomial regressions and use of other large sets of "basis" functions.)

(The second sketch below illustrates both kinds of modification.)

There will be in Stat 502X methods between linear ones and "nearest neighbor" methods of prediction in terms of flexibility, but in order to introduce one potentially extremely flexible prediction method early, we develop the nearest neighbor idea a bit here.

Consider some joint distribution for $(x, y)$ and remember that decision-theoretic considerations point at predictors based on the conditional distributions of $y \mid x = u$ and summaries of them like

$$f(u) = \mathrm{E}[y \mid x = u] \quad \text{and} \quad f(u) = \mathrm{Median}[y \mid x = u] \quad (**)$$

(under respectively squared-error and absolute-error losses). Consider making approximations of these predictors (**) based on a training set.
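Before turning to those approximations, a minimal computational sketch of the linear material above may help fix ideas. The sketch below (Python with NumPy, on made-up data; the sample sizes, coefficients, and new input vector are arbitrary illustrative choices, not anything from the notes) computes $\hat{\beta}^{\mathrm{OLS}}$ both from the generalized-inverse/normal-equations formula and from the QR route, and then evaluates the predictor (*) at a new input.

```python
import numpy as np

# Made-up illustrative data: N = 50 cases, p = 3 inputs.
rng = np.random.default_rng(0)
N, p = 50, 3
X = rng.normal(size=(N, p))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=N)

# Generalized-inverse/normal-equations route: beta_hat = (X'X)^- X'Y.
# np.linalg.pinv gives a Moore-Penrose generalized inverse, so this
# also makes sense when X is not of full column rank.
beta_hat = np.linalg.pinv(X.T @ X) @ (X.T @ Y)

# Fitted values Y_hat^OLS = X beta_hat (the projection of Y onto C(X)).
Y_hat = X @ beta_hat

# QR route (full rank case): with X = QR, beta_hat = R^{-1} Q'Y.
Q, R = np.linalg.qr(X)
beta_hat_qr = np.linalg.solve(R, Q.T @ Y)

# The two coefficient vectors agree (up to rounding) when X has full column rank.
print(np.allclose(beta_hat, beta_hat_qr))

# The OLS predictor f_hat(x) = x beta_hat applied at a new input vector u.
u = np.array([0.2, -1.0, 0.4])
print(u @ beta_hat)
```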
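In the same spirit, here is a hedged sketch of the "less flexible" and "more flexible" modifications of the linear predictor listed above: a ridge-style shrinkage of the coefficient vector and a polynomial basis expansion of a single input. The penalty value and polynomial degree are arbitrary illustrative choices, and ridge is only one of many possible penalization schemes.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
x = rng.uniform(-2, 2, size=N)                 # a single real input
y = np.sin(x) + rng.normal(scale=0.2, size=N)  # made-up responses

# "More flexible": replace the single column x by several basis functions of x
# (here a cubic polynomial basis, including an intercept column).
degree = 3                                     # arbitrary illustrative choice
X = np.column_stack([x**d for d in range(degree + 1)])

# Ordinary least squares fit in the expanded basis.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# "Less flexible": a ridge-style shrunken coefficient vector,
# beta_ridge = (X'X + lambda I)^{-1} X'y, with an arbitrary penalty lambda.
# (In practice the intercept column is often left unpenalized.)
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Shrinkage typically makes the coefficient vector smaller in norm than OLS.
print(np.linalg.norm(beta_ridge) <= np.linalg.norm(beta_ols))
```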
Returning to the problem of approximating the optimal predictors (**): naively, one might hope to use predictors like

$$\hat{f}(u) = \frac{1}{\#\{i \text{ with } x_i = u\}} \sum_{i \text{ with } x_i = u} y_i \quad \text{and} \quad \hat{f}(u) = \mathrm{Median}\{y_i \mid x_i = u\}$$

The problem with this thinking is that typically (because of the continuous nature of distributions of $x$, or just data sparsity) there will be at most one case for which $x_i = u$. So instead of averaging or taking a median of responses for vectors of inputs matching $u$ exactly, one considers averaging or taking a median of responses for vectors of inputs near $u$. Let $N_k(u)$ stand for the set of indices $i$ corresponding to the $k$ vectors of inputs $x_i$ in the training set closest to the point $u \in \mathbb{R}^p$ (we'll assume some sensible method is adopted for breaking ties). $k$-nearest neighbor approximations to the optimal but typically unrealizable predictors (**) are

$$\hat{f}(u) = \frac{1}{k} \sum_{i \in N_k(u)} y_i \quad \text{and} \quad \hat{f}(u) = \mathrm{Median}\{y_i \mid i \in N_k(u)\}$$

For a fixed training set size $N$, the smaller is $k$, typically

1. the noisier (the higher-variance) is a $k$-nearest neighbor predictor, and
2. the smaller the potential for prediction bias.

One expects that for very large $N$, $k$ can be taken to be large enough to mitigate prediction variance and yet be a small enough fraction of $N$ to mitigate prediction bias.
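A minimal sketch of both $k$-nearest neighbor predictors follows (again Python/NumPy on made-up data; Euclidean distance and the arbitrary tie-breaking implicit in the sort are assumptions for illustration, not requirements).

```python
import numpy as np

def knn_predict(X_train, y_train, u, k, use_median=False):
    """k-nearest neighbor prediction at input vector u.

    Averages (or takes the median of) the responses y_i for the k training
    inputs x_i closest to u in Euclidean distance.  Ties in distance are
    broken arbitrarily by the sort.
    """
    dists = np.linalg.norm(X_train - u, axis=1)   # distances from u to each x_i
    neighbors = np.argsort(dists)[:k]             # indices in N_k(u)
    if use_median:
        return np.median(y_train[neighbors])
    return np.mean(y_train[neighbors])

# Made-up illustrative training set: N = 100 cases, p = 2 inputs.
rng = np.random.default_rng(2)
X_train = rng.normal(size=(100, 2))
y_train = X_train[:, 0] ** 2 + rng.normal(scale=0.1, size=100)

u = np.array([0.5, -0.3])
print(knn_predict(X_train, y_train, u, k=5))                   # mean version (squared-error loss)
print(knn_predict(X_train, y_train, u, k=5, use_median=True))  # median version (absolute-error loss)
```

Rerunning such a sketch with smaller and larger $k$ (for a fixed training set size) gives a concrete feel for the variance/bias trade-off described above.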