Section 1 – Supervised Learning with a Numeric Response
DSCI 425 – Supervised (Statistical) Learning

1.1 – Introduction to Supervised Learning for a Numeric Response

The general model for predicting a numeric response ($Y$) given a set of predictors ($X_1, X_2, \dots, X_p$) can be represented as

$$Y = f(X_1, X_2, \dots, X_p) + \epsilon, \quad \text{or more simply,} \quad Y = f(\mathbf{X}) + \epsilon$$

Here $f(\cdot)$ is a function that we need to estimate from our available data, and $\mathbf{X} = (X_1, \dots, X_p)$ is the set of potential predictors (or features) available to help us predict the value of the numeric response $Y$. The predictors ($X_j$) may be a mixture of numeric and categorical/nominal/ordinal variables.

There are two main reasons we may wish to estimate $f(\cdot)$: (1) prediction (the focus of this course) and (2) inference (the focus of STAT 360). In prediction we are really only interested in predicting the response accurately. Thus the estimated function $\hat{f} = \hat{f}(X_1, \dots, X_p) = \hat{f}(\mathbf{X})$ is treated as a "black box" that hopefully predicts the response accurately (see diagram below). How each of the inputs (i.e. predictors) to the black box is used is not really of interest, although we may wish to measure in some sense which of the inputs are most important; this is sometimes referred to as feature selection. From a practical standpoint, for future data collection and storage we would certainly like to eliminate the need to retain unimportant $X_j$'s.

[Diagram: the inputs (predictors or features) $X_1, X_2, \dots, X_p$ feed into the black box $\hat{f}(X_1, X_2, \dots, X_p)$, which outputs the predicted value of the response, $\hat{Y}$; feedback about performance flows back into the box.]

In this course we will examine numerous methods for estimating $\hat{f}(X_1, X_2, \dots, X_p)$. We will start with the simplest methods and move toward increasingly complex/flexible methods as the semester progresses. Many of these methods, or variations of them, can be used for classification problems as well, where the response is non-numeric.
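To make the black-box picture concrete, here is a minimal Python sketch. This is my own illustration, not part of the notes: the data-generating model is made up, and ordinary least squares simply plays the role of the box. Inputs go in, predictions come out, and the magnitudes of the estimated coefficients give one crude screen for which inputs matter.

```python
import numpy as np

rng = np.random.default_rng(425)

# Simulated data: the analyst only sees (X, y), not the true f
n, p = 200, 3
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.0 * X[:, 2] + rng.normal(0, 0.5, n)

# A "black box" estimate f-hat: here, ordinary least squares via lstsq
A = np.column_stack([np.ones(n), X])             # add an intercept column
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

def f_hat(X_new):
    """Feed inputs into the box, get predicted responses out."""
    return np.column_stack([np.ones(len(X_new)), X_new]) @ beta_hat

y_hat = f_hat(X)

# Rough importance screen: magnitude of each estimated coefficient
print(np.round(np.abs(beta_hat[1:]), 2))  # X3's coefficient should be near 0
```

Any regression method could be swapped in for `lstsq` here; the point of the black-box view is that only the quality of `y_hat` matters, not the internals.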
As an example of a method for estimating $f(\cdot)$, consider multiple linear regression (MLR), where $f(\mathbf{X})$ is specified as a linear combination of functions of the predictors (the $U_k$'s) weighted by the model parameters (the $\beta_j$'s), i.e.

$$Y = f(\mathbf{X}) + \epsilon = \beta_0 + \beta_1 U_1 + \cdots + \beta_k U_k + \epsilon$$

The functions $U_k = f_k(X_j)$ are called terms; common examples include the predictor itself ($U = X_j$), polynomial terms ($U = X_j^2$), transformations such as $U = \log X_j$, and dummy variables coding categorical predictors.

MLR is one "simple" method for estimating $f(X_1, X_2, \dots, X_p)$ and is an example of what is referred to as a parametric linear model. In Section 2 we will "review" MLR in much more detail; again, MLR was the main topic in STAT 360 – Regression Analysis.

Other methods we will examine in this course are nonparametric in nature. Nonparametric methods do not make explicit assumptions about the functional form of $f(\mathbf{X})$. Instead they seek an estimate, $\hat{f}(\mathbf{X})$, that gets as close to the target response $Y$ as possible without being too rough, noisy, or wiggly (see the diagram below).

[Figure: a fitted model $\hat{f}(x)$ plotted against the true function $f(x)$, a quartic polynomial.]

1.2 – Prediction and the Mean Squared Error (MSE)

The quality of a given estimate $\hat{f}$ is usually measured using the mean squared error (MSE) or root mean squared error (RMSE), which are given by

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{f}(\mathbf{x}_i)\right)^2 \quad \text{and} \quad RMSE = \sqrt{MSE}$$

However, we need to be very careful here! The summation in the MSE formula above was called the Residual Sum of Squares (RSS) in regression. As we learned in regression, whenever we increase the complexity of a model the RSS, and hence the MSE, decreases. For prediction purposes, overly complicated models will NOT perform well.
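The MSE and RMSE defined above are straightforward to compute; a minimal Python sketch (the function names are illustrative, not from the notes):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: the average of the squared residuals."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.mean((y - y_hat) ** 2)

def rmse(y, y_hat):
    """Root mean squared error, on the same scale as the response."""
    return np.sqrt(mse(y, y_hat))

y     = [3.0, 1.5, 4.0, 2.5]   # observed responses
y_hat = [2.5, 1.0, 4.5, 2.0]   # fitted values from some model
print(mse(y, y_hat), rmse(y, y_hat))  # residuals are all +/-0.5, so MSE = 0.25, RMSE = 0.5
```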
Overly complicated models fit the data used to develop the model too well, and thus will not generally perform well when making predictions for future observations that were not used in the model development process. Cross-validation, which we will discuss a bit later, is used to develop models with "good" performance for the predictive problem at hand.

More on the MSE

The general predictive/regression model for a numeric response is given by

$$Y = f(\mathbf{X}) + \epsilon, \quad \text{with } E(\epsilon) = 0 \text{ and } Var(\epsilon) = \sigma^2.$$

The MSE above is a sample-based estimate of the population MSE, which is given by

$$MSE = E\left[\left(Y - \hat{Y}\right)^2\right] = E\left[\left(f(\mathbf{X}) + \epsilon - \hat{f}(\mathbf{X})\right)^2\right] = E\left[\left(f(\mathbf{X}) - \hat{f}(\mathbf{X})\right)^2\right] + Var(\epsilon)$$

Further expanding the first term on the right-hand side (adding and subtracting $E(\hat{f}(\mathbf{X}))$), we obtain

$$\begin{aligned}
MSE &= E\left[\left(f(\mathbf{X}) - E(\hat{f}(\mathbf{X})) + E(\hat{f}(\mathbf{X})) - \hat{f}(\mathbf{X})\right)^2\right] + Var(\epsilon)\\
&= E\left[\left(f(\mathbf{X}) - E(\hat{f}(\mathbf{X}))\right)^2\right] + E\left[\left(\hat{f}(\mathbf{X}) - E(\hat{f}(\mathbf{X}))\right)^2\right] + Var(\epsilon)\\
&= Bias\left[\hat{f}(\mathbf{X})\right]^2 + Var\left(\hat{f}(\mathbf{X})\right) + Var(\epsilon)
\end{aligned}$$

(the cross term vanishes because $E\left[\hat{f}(\mathbf{X}) - E(\hat{f}(\mathbf{X}))\right] = 0$). Thus the population MSE is a balance between the squared bias of the fit and the variance of the fit. In general, more complex models will have smaller bias but larger variance; choosing the "best" model requires striking a balance between the two. Consider again the earlier diagram of a fitted model versus the true quartic function.

Again, to truly measure predictive performance we need to use observations that are NOT used in the model development process; cross-validation methods will help us achieve this. In order to obtain a reasonable (i.e. not overly optimistic) measure of predictive performance, we need observations that were not used in any way to obtain the estimated model $\hat{f}(\mathbf{X})$. Suppose we have a set of $m$ complete observations that were not used in the model development process.
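The bias-variance trade-off can also be seen by simulation. The sketch below is my own illustration, not from the notes: it assumes a known quartic true function, repeatedly simulates training sets, fits polynomials of three different degrees, and estimates the squared bias and variance of the prediction at a single point.

```python
import numpy as np

rng = np.random.default_rng(425)

def f(x):
    """True (but normally unknown) mean function: a quartic polynomial."""
    return x**4 - 2 * x**2

sigma = 0.5                     # sd of the irreducible error epsilon
x = np.linspace(-1.5, 1.5, 30)  # fixed design points for each training set
x0 = 1.0                        # point at which the decomposition is studied

results = {}
for degree in (1, 4, 12):       # underfit, "correct", overfit
    preds = []
    for _ in range(2000):       # many training sets from the same model
        y = f(x) + rng.normal(0, sigma, x.size)
        coef = np.polyfit(x, y, degree)
        preds.append(np.polyval(coef, x0))
    preds = np.array(preds)
    bias_sq = (preds.mean() - f(x0)) ** 2   # squared bias of f-hat(x0)
    variance = preds.var()                  # variance of f-hat(x0)
    results[degree] = (bias_sq, variance)
    print(f"degree {degree:2d}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

With this setup the degree-1 fits show large squared bias and small variance, the degree-12 fits show the reverse, and the degree-4 (quartic) fits are essentially unbiased.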
Given such a hold-out set, we can calculate the Mean Squared Error for Prediction (MSEP) and the Root Mean Squared Error for Prediction (RMSEP) as follows:

$$MSEP = \frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{f}(\mathbf{x}_i)\right)^2 = \frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2 \quad \text{and} \quad RMSEP = \sqrt{MSEP}$$

We could also express our prediction accuracy as the mean absolute error (MAE) or the mean absolute percentage error (MAPE):

$$MAE = \frac{1}{m}\sum_{i=1}^{m}\left|y_i - \hat{f}(\mathbf{x}_i)\right| = \frac{1}{m}\sum_{i=1}^{m}\left|y_i - \hat{y}_i\right| \quad \text{and} \quad MAPE = \frac{1}{m}\sum_{i=1}^{m}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \times 100\% \quad (\text{must have each } y_i \neq 0).$$

1.3 – Role of Inference

In parametric modeling, inference is usually of interest. For example, when fitting an MLR model

$$Y = f(\mathbf{X}) + \epsilon = \beta_0 + \beta_1 U_1 + \cdots + \beta_k U_k + \epsilon$$

determining which terms to include in the final model is usually based on t-statistics and p-values associated with the estimated coefficients. Interpretation of the estimated parameters ($\hat{\beta}_j$) is also of interest, particularly when the terms (the $U_k$'s) are the predictors themselves (i.e. the $X_j$'s).

In predictive modeling, accurately predicting the response is really our only interest, so inferential analyses (i.e. p-values and confidence intervals) will not be seen much in this course. However, for the methods that are extensions of MLR (e.g. ridge, LASSO, and Elastic Net regression), we will still have the inferential tools of MLR at our disposal.

In the next section we will review MLR with particular emphasis on the concepts of predictors & terms, transformations, model building, model selection, and cross-validation, and we will discuss extensions of the MLR model.
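As a closing worked illustration (my own addition, with made-up hold-out values), the hold-out accuracy measures defined earlier in this section (MSEP, RMSEP, MAE, MAPE) can be computed directly:

```python
import numpy as np

def msep(y, y_hat):
    """Mean squared error for prediction, computed on held-out observations."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    """Mean absolute error on held-out observations."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean(np.abs(y - y_hat))

def mape(y, y_hat):
    """Mean absolute percentage error; requires every y_i != 0."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    if np.any(y == 0):
        raise ValueError("MAPE is undefined when any y_i equals 0")
    return np.mean(np.abs((y - y_hat) / y)) * 100.0

# Held-out (validation) observations never used to fit the model
y_new     = np.array([10.0, 20.0, 40.0])
y_new_hat = np.array([ 9.0, 22.0, 38.0])

print(msep(y_new, y_new_hat))           # (1 + 4 + 4) / 3 = 3.0
print(np.sqrt(msep(y_new, y_new_hat)))  # RMSEP
print(mae(y_new, y_new_hat))            # (1 + 2 + 2) / 3
print(mape(y_new, y_new_hat))           # (10% + 10% + 5%) / 3
```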