LECTURE 02: EVALUATING MODELS
January 27, 2016
SDS 293 Machine Learning

Announcements / Questions
• Life Sciences and Technology Fair is tomorrow: 3:30-6pm in the Carroll Room (www.smith.edu/lazaruscenter/fairs_scitech.php)
• Office hours: does anyone have a conflict?

Outline
• Evaluating Models
• Lab pt. 1 – Introduction to R:
  - Basic Commands
  - Graphics Overview
  - Indexing Data
  - Loading Data
  - Additional Graphical/Numerical Summaries
• Lab pt. 2 – Exploring other datasets (time permitting)

Beyond LR
• Stated goal of this course: explore methods that go beyond standard linear regression

One tool to rule them all…?
• Question: why not just teach you the best one first?
• Answer: it depends
  - No single method dominates all others
  - On a particular data set, for a particular question, one specific method may work well; on a related but not identical dataset or question, another might be better
  - Choosing the right approach is arguably the most challenging aspect of doing statistics in practice
  - So how do we do it?

Measuring "Quality of Fit"
• One question we might ask: how well do my model's predictions actually match the observations?
• What we need: a way to measure how close the predicted response is to the true response
• Flashback to your stats training: what do we use in regression?

Mean Squared Error
• $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \bigl(y_i - \hat{f}(x_i)\bigr)^2$
  - $y_i$ is the true response for the ith observation
  - $\hat{f}(x_i)$ is the prediction our model gives for the ith observation
  - We take the average of the squared difference over all observations

"Training" MSE
• This version of MSE is computed using the training data that was used to fit the model
• Reality check: is this what we care about?

Test MSE
• Better plan: see how well the model does on observations we didn't train on
• Given some never-before-seen examples, we can just calculate the MSE on those using the same method
• But what if we don't have any new observations to test?
  - Can we just use the training MSE?
  - Why or why not?

Example
• [Figure: test MSE and average training MSE plotted against model flexibility]

Training vs. Test MSE
• As the flexibility of the statistical learning method increases, we observe:
  - a monotone decrease in the training MSE
  - a U-shape in the test MSE
• Fun fact: this occurs regardless of the data set and statistical method being used
• As flexibility increases, training MSE will decrease, but the test MSE may not

Overfitting

Trade-off between bias and variance
• The U-shaped curve in the test MSE is the result of two competing properties: bias and variance
• Variance refers to the amount by which the model would change if we estimated it using different training data
• Bias refers to the error that is introduced by approximating a real-life problem (which may be extremely complicated) using a much simpler model

Relationship between bias and variance
• In general, more flexible methods have higher variance
• In general, more flexible methods have lower bias

Trade-off between bias and variance
• It is possible to show that the expected test MSE for a given test value $x_0$ can be decomposed into three terms:
  $E\bigl(y_0 - \hat{f}(x_0)\bigr)^2 = \mathrm{Var}\bigl(\hat{f}(x_0)\bigr) + \bigl[\mathrm{Bias}\bigl(\hat{f}(x_0)\bigr)\bigr]^2 + \mathrm{Var}(\varepsilon)$
  - the variance of our model on the test value
  - the squared bias of our model on the test value
  - the variance of the error terms

Balancing bias and variance
• We know variance and squared bias are always nonnegative (why?)
• There's nothing we can do about the variance of the irreducible error inherent in the model
• So we're looking for a method that minimizes the sum of the first two terms… which are (in some sense) competing

Balancing bias and variance
• It's easy to build a model with low variance but high bias (how?)
• Just as easy to build one with low bias but high variance (how?)
• The challenge: finding a method for which both the variance and the squared bias are low (a small simulation sketch follows below)
• This trade-off is one of the most important recurring themes in this course
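To make the training-vs.-test MSE behavior concrete, here is a minimal R simulation sketch (not from the slides): it fits polynomials of increasing degree to noisy data drawn from an assumed "true" function. All names, the sin() truth, and the noise level are illustrative assumptions; training MSE can only fall as the degree grows, while test MSE typically traces out the U-shape discussed above.

```r
# Minimal sketch: training vs. test MSE as model flexibility increases
set.seed(1)
n <- 100
f <- function(x) sin(2 * x)                      # assumed "true" relationship
x_train <- runif(n, 0, 3); y_train <- f(x_train) + rnorm(n, sd = 0.3)
x_test  <- runif(n, 0, 3); y_test  <- f(x_test)  + rnorm(n, sd = 0.3)

mse <- function(y, y_hat) mean((y - y_hat)^2)    # average squared difference

for (d in c(1, 3, 5, 10, 15)) {                  # increasing flexibility
  fit <- lm(y_train ~ poly(x_train, d))
  train_mse <- mse(y_train, predict(fit))
  test_mse  <- mse(y_test, predict(fit, newdata = data.frame(x_train = x_test)))
  cat(sprintf("degree %2d: training MSE = %.3f, test MSE = %.3f\n",
              d, train_mse, test_mse))
}
```

Low-degree fits here illustrate low variance / high bias; very high-degree fits illustrate low bias / high variance.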
What about classification?
• So far, we've only talked about how to evaluate the accuracy of a regression model
• The idea of a bias-variance trade-off also translates to the classification setting, but we need some minor modifications to deal with qualitative responses
• For example: we can't really compute MSE without numerical values, so what can we do instead?

Training error rate
• One common approach is to use the training error rate: the proportion of times our model incorrectly classifies a training data point:
  $\frac{1}{n}\sum_{i=1}^{n} I\bigl(y_i \neq \hat{y}_i\bigr)$
  where $I(y_i \neq \hat{y}_i)$ is an indicator function that equals 1 whenever the model's classification differs from the actual class; we tally up all the misclassifications and take the average (an R sketch illustrating this appears at the end of these notes)

Takeaways
• Choosing the "right" level of flexibility is critical for success in both the regression and classification settings
• The bias-variance trade-off can make this a difficult task
• In Chapter 5, we'll return to this topic and explore various methods for estimating test error rates
• We'll then use these estimates to find the optimal level of flexibility for a given ML method

Questions?

Lab pt. 1: Introduction to R
• Basic Commands
• Graphics
• Indexing data
• Loading external data
• Generating summaries
• Playing with real data (time permitting!)

Lab pt. 1: Introduction to R
• Today's walkthrough (and likely many others) will be run using a tool that lets me build "notebooks" which run live R code (Python, too!) in the browser
• Hint: this is also a nice way to format your homework!

Lab pt. 2: Exploring Other Datasets
• More datasets from the book
  - ISLR package
  - Already installed on the Smith RStudio server
  - Working locally? > install.packages('ISLR')
  - Details available at: cran.r-project.org/web/packages/ISLR
  - Dataset descriptions: www.inside-r.org/packages/cran/ISLR/docs
• Real-world data:
  - Olympic Athletes: goo.gl/1aUnJW
  - World Bank Indicators: goo.gl/0QdN9U
  - Airplane Bird Strikes: goo.gl/lFl5ld
  - …and a whole bunch more: goo.gl/kcbqfc

Coming Up
• Next class: Linear Regression 1: Simple and Multiple LR
• For planning purposes: Assignment 1 will be posted next week and will be due the following Wednesday (Feb. 10th)
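As referenced on the training error rate slide, here is a minimal R sketch of that calculation using the Smarket data that ships with the ISLR package mentioned above. The choice of logistic regression with Lag1 and Lag2 as predictors is purely illustrative, not part of the lab.

```r
# Minimal sketch: a training error rate on ISLR's Smarket data
# install.packages("ISLR")   # uncomment if the package isn't installed
library(ISLR)

# Fit a simple classifier: predict market Direction from two lagged returns
fit <- glm(Direction ~ Lag1 + Lag2, data = Smarket, family = binomial)

# Turn fitted probabilities into class predictions ("Up" vs. "Down")
probs <- predict(fit, type = "response")
pred  <- ifelse(probs > 0.5, "Up", "Down")

# Training error rate: the average of the indicator I(y_i != y_hat_i)
mean(pred != Smarket$Direction)
```

Remember from the lecture that this is a training error rate; Chapter 5 will cover how to estimate the test error rate instead.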