Multiple Regression in R using LARS

Section 1 – Supervised Learning with a Numeric Response
DSCI 425 – Supervised (Statistical) Learning
1.1 – Introduction to Supervised Learning for a Numeric Response
The general model for predicting a numeric response (𝑌) given a set of predictors
(𝑋1 , 𝑋2 , … , 𝑋𝑝 ) can be represented as:
$$Y = f(X_1, X_2, \ldots, X_p) + \epsilon \quad \text{or, more simply,} \quad Y = f(\mathbf{X}) + \epsilon$$
Here 𝑓(∙) is a function that we need to estimate from our available data and
𝑿 = (𝑋1 , … , 𝑋𝑝 ) is the set of potential predictors (or features) available to help us
predict the value of the numeric response 𝑌. The predictors (𝑋𝑗 ) may be a
mixture of numeric and categorical/nominal/ordinal variables.
There are two main reasons we may wish to estimate 𝑓(∙): (1) prediction (the
focus of this course) and (2) inference (the focus of STAT 360). In prediction we
are really only interested in predicting the response accurately. Thus the
estimated function 𝑓̂ = 𝑓̂(𝑋1 , … , 𝑋𝑝 ) = 𝑓̂(𝑿) is treated as a “black box” that
hopefully accurately predicts the response (see diagram below). How each of the
inputs (i.e., predictors) to the black box is used is not really of interest, although
we may wish to measure, in some sense, which of the inputs are most important.
This is sometimes referred to as feature selection. From a practical standpoint
for future data collection and storage we would certainly like to eliminate the
need to retain unimportant 𝑋𝑗 ′𝑠.
[Diagram: the inputs (predictors or features) 𝑋1, 𝑋2, …, 𝑋𝑝 feed into the black box 𝑓̂(𝑋1, 𝑋2, …, 𝑋𝑝), which outputs the predicted value of the response 𝑌̂; feedback about performance is returned to the black box.]
In this course we will examine numerous methods for estimating 𝑓̂(𝑋1 , 𝑋2 , … , 𝑋𝑝 ).
We will start with the simplest methods and move towards increasingly
complex/flexible methods as the semester progresses. Many of these methods,
or variations of them, can be used for classification problems as well, where the
response is non-numeric.
As an example of a method for estimating 𝑓(∙), consider multiple linear
regression (MLR), where 𝑓(𝑿) is specified as a linear combination of functions of
the predictors (the 𝑈𝑘's), weighted by the model parameters (the 𝛽's), i.e.
$$Y = f(\mathbf{X}) + \epsilon = \beta_0 + \beta_1 U_1 + \cdots + \beta_k U_k + \epsilon$$
The functions 𝑈𝑘 = 𝑓𝑘 (𝑋𝑗 ) are called terms. A few examples of terms commonly
used in MLR are given below:
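Common examples of terms include the predictors themselves (𝑈𝑘 = 𝑋𝑗), polynomial terms (e.g., 𝑋𝑗²), logarithms (log 𝑋𝑗), interactions between predictors (𝑋𝑗 𝑋𝑙), and indicator (dummy) variables coding categorical predictors. As a minimal sketch, a model containing several such terms could be specified in R as follows (the data frame dat and the variable names are hypothetical placeholders, not from these notes):

# Hypothetical MLR fit whose terms are functions of the predictors:
# a quadratic term, a log transform, an interaction, and a factor (dummy-coded) predictor.
fit <- lm(y ~ x1 + I(x1^2) + log(x2) + x1:x2 + x3, data = dat)
summary(fit)   # estimated coefficients (beta-hats), one per term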
MLR is one “simple” method for estimating 𝑓(𝑋1 , 𝑋2 , … , 𝑋𝑝 ) and is an example of
what is referred to as a parametric linear model. In Section 2 we will “review”
MLR in much more detail; again, MLR was the main topic in STAT 360 –
Regression Analysis. Other methods we will examine in this course are
nonparametric in nature. Nonparametric methods do not make explicit
assumptions about the functional form of 𝑓(𝑿). Instead, they seek an estimate,
𝑓̂(𝑿), that gets as close to the target response 𝑌 as possible without being too
rough, noisy, or wiggly (see the simple diagram below).
[Figure: the estimated model 𝑓̂(x) plotted against the true function f(x), a quartic polynomial.]
1.2 – Prediction and the Mean Squared Error (MSE)
The quality of a given estimate 𝑓̂ is usually measured using the mean squared
error (MSE) or root mean squared error (RMSE), which are given by:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{f}(\mathbf{x}_i)\bigr)^2 \qquad \text{and} \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}$$
However, we need to be very careful here! The summation in the MSE formula above
was called the residual sum of squares (RSS) in regression. As we learned in
regression, whenever we increase the complexity of a model the RSS, and hence
the MSE, decreases. For prediction purposes, overly complicated models will
NOT perform well. This is because they fit the data used to develop the model too
well and thus will generally not perform well when making predictions for
future observations that were not used in the model development process.
Cross-validation, which we will discuss a bit later, is used to develop models
with “good” performance for the predictive problem at hand.
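As a minimal sketch, the (training) MSE and RMSE of a fitted model can be computed in R as follows (the data frame dat and the predictor names are assumptions for illustration):

# Training MSE/RMSE for a fitted model (hypothetical data frame 'dat').
fit  <- lm(y ~ x1 + x2, data = dat)
yhat <- fitted(fit)                 # fitted values f-hat(x_i) on the training data
mse  <- mean((dat$y - yhat)^2)      # training MSE (= RSS/n)
rmse <- sqrt(mse)                   # training RMSE

Remember that these are training-set quantities and will shrink as the model is made more complex, so they should not be used on their own to judge predictive performance.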
More on the MSE
The general predictive/regression model for a numeric response is given by,
$$Y = f(\mathbf{X}) + \epsilon \quad \text{with} \quad E(\epsilon) = 0 \ \text{and} \ Var(\epsilon) = \sigma^2.$$
The MSE above is a sample based estimate of the population MSE which is given
by,
$$\mathrm{MSE} = E\bigl[(Y - \hat{Y})^2\bigr] = E\Bigl[\bigl(f(\mathbf{X}) + \epsilon - \hat{f}(\mathbf{X})\bigr)^2\Bigr] = E\Bigl[\bigl(f(\mathbf{X}) - \hat{f}(\mathbf{X})\bigr)^2\Bigr] + Var(\epsilon)$$
Further expanding the right-hand side of the expression above (the cross term
vanishes because 𝐸[𝑓̂(𝑿) − 𝐸(𝑓̂(𝑿))] = 0), we obtain
$$= E\Bigl[\bigl(f(\mathbf{X}) - E(\hat{f}(\mathbf{X})) + E(\hat{f}(\mathbf{X})) - \hat{f}(\mathbf{X})\bigr)^2\Bigr] + Var(\epsilon)$$
$$= E\Bigl[\bigl(f(\mathbf{X}) - E(\hat{f}(\mathbf{X}))\bigr)^2\Bigr] + E\Bigl[\bigl(\hat{f}(\mathbf{X}) - E(\hat{f}(\mathbf{X}))\bigr)^2\Bigr] + Var(\epsilon)$$
$$= \mathrm{Bias}\bigl[\hat{f}(\mathbf{X})\bigr]^2 + Var\bigl(\hat{f}(\mathbf{X})\bigr) + Var(\epsilon)$$
Thus the fitted model represents a balance between the squared bias and the variance of the fit.
In general, more complex models will have smaller bias but larger
variance. Choosing the “best” model requires striking a balance between the
bias and the variance. Consider again the diagram above.
[Figure (repeated from above): the estimated model 𝑓̂(x) versus the true function f(x), a quartic polynomial.]
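A minimal simulation sketch of this trade-off (the quartic function, noise level, and polynomial degrees below are illustrative assumptions, not part of the notes): fit a low-degree and a high-degree polynomial to noisy data generated from a known quartic f(x) and compare their training errors.

# Bias-variance illustration: simpler fits have more bias, flexible fits have more variance.
set.seed(425)
f <- function(x) x^4 - 2*x^2 + x              # assumed "true" quartic function
x <- seq(-1.5, 1.5, length.out = 50)
y <- f(x) + rnorm(length(x), sd = 0.5)        # noisy training responses
fit2  <- lm(y ~ poly(x, 2))                   # low-degree fit: higher bias, lower variance
fit10 <- lm(y ~ poly(x, 10))                  # high-degree fit: lower bias, higher variance
mean((y - fitted(fit2))^2)                    # training MSE, degree 2
mean((y - fitted(fit10))^2)                   # training MSE, degree 10 (smaller, but prone to overfitting)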
Again, to truly measure predictive performance we need to use observations that
are NOT used in the model development process. Cross-validation methods will
help us achieve this.
As stated above, in order to obtain a reasonable (i.e., not overly optimistic)
measure of predictive performance we need to use observations that were not
used in any way to obtain the estimated model 𝑓̂(𝑿). Suppose we have a set of
m complete observations that were not used in the model development process.
Then we can calculate the Mean Squared Error for Prediction (MSEP) and the
Root Mean Squared Error for Prediction (RMSEP) as follows,
$$\mathrm{MSEP} = \frac{1}{m}\sum_{i=1}^{m}\bigl(y_i - \hat{f}(\mathbf{x}_i)\bigr)^2 = \frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2 \qquad \text{and} \qquad \mathrm{RMSEP} = \sqrt{\mathrm{MSEP}}$$
We could also express our prediction accuracy as a mean absolute error (MAE)
or as the mean absolute percentage error (MAPE).
$$\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\bigl|y_i - \hat{f}(\mathbf{x}_i)\bigr| = \frac{1}{m}\sum_{i=1}^{m}|y_i - \hat{y}_i|$$
and
$$\mathrm{MAPE} = \frac{1}{m}\sum_{i=1}^{m}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \times 100\% \qquad (\text{must have } y_i \neq 0).$$
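As a minimal sketch, these prediction-error measures can be computed in R on a held-out test set (the train/test data frames and variable names are assumptions for illustration):

# Prediction accuracy on m held-out observations (hypothetical 'train' and 'test' data frames).
fit   <- lm(y ~ x1 + x2, data = train)
yhat  <- predict(fit, newdata = test)                 # predictions for observations not used in fitting
msep  <- mean((test$y - yhat)^2)                      # MSEP
rmsep <- sqrt(msep)                                   # RMSEP
mae   <- mean(abs(test$y - yhat))                     # MAE
mape  <- mean(abs((test$y - yhat) / test$y)) * 100    # MAPE (requires all y_i != 0)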
1.3 – Role of Inference
In parametric modeling, inference is usually of interest. For example, when fitting
an MLR model
$$Y = f(\mathbf{X}) + \epsilon = \beta_0 + \beta_1 U_1 + \cdots + \beta_k U_k + \epsilon,$$
determining which terms to include in the final model is usually based on t-statistics and p-values associated with the estimated coefficients. Interpretation
of the estimated parameters (𝛽̂𝑗) is also of interest, particularly when
the terms (𝑈𝑘 ′𝑠) are the predictors themselves (i.e., the 𝑋𝑗 ′𝑠). In predictive modeling,
accurately predicting the response is really our only interest, so inferential
analyses (i.e., p-values and confidence intervals) will not be seen much in this
course. However, for the methods that are extensions of MLR (e.g., ridge, LASSO,
and Elastic Net regression) we will still have the inferential tools used as part of
MLR at our disposal.
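For example, after fitting an MLR model with lm() in R, summary() reports the t-statistic and p-value for each estimated coefficient, and confint() gives confidence intervals (the model formula and data frame here are hypothetical):

# Inferential output for a hypothetical MLR fit.
fit <- lm(y ~ x1 + x2 + x3, data = dat)
summary(fit)    # t-statistics and p-values for each estimated coefficient
confint(fit)    # confidence intervals for the coefficients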
In the next section we will review MLR with particular emphasis on the concepts
of predictors & terms, transformations, model building, model selection, and cross-validation, and we will discuss extensions of the MLR model.