Prelude of Machine Learning 202
Statistical Data Analysis in the Computer Age (1991)
Bradley Efron and Robert Tibshirani

Agenda
• Overview
• Bootstrap
• Nonparametric Regression
• Generalized Additive Models
• Classification and Regression Trees
• Conclusion

Overview
• Classical statistical methods from 1920-1950:
  – Linear regression, hypothesis testing, standard errors, confidence intervals, etc.
• New statistical methods post-1980:
  – Based on the power of electronic computation
  – Require fewer distributional assumptions than their predecessors
• How to spend computational wealth wisely?

Bootstrap
• Random sample from 164 data points
• t(x) = 28.58
• How accurate is t(x)?
• A device for extending the standard error (SE) to estimators other than the mean
• Suppose t(x) is the 25% trimmed mean

Bootstrap
• Why use a trimmed mean rather than mean(x)?
• If the data come from a long-tailed probability distribution, the trimmed mean can be substantially more accurate than mean(x)
• In practice, one does not know a priori whether the true probability distribution is long-tailed. The bootstrap can help answer this question.

Nonparametric Regression
• Quadratic regression curve at 60% compliance
• 27.72 +/- 3.08

Nonparametric Regression
• Nonparametric regression with loess at 60% compliance
• 32.38 +/- ?
• i.e.
  – Windowing with the nearest 20% of data points
  – Smooth weight function
  – Weighted linear regression
• How to find SE?

Nonparametric Regression
• How to find SE?
• Bootstrap
• 32.38 +/- 5.71 with B=50 bootstrap replications
• At 60% compliance
  – Quadratic regression (QR): 27.72 +/- 3.08
  – Nonparametric regression (NPR): 32.38 +/- 5.71
• On balance, the quadratic estimate should probably be preferred in this case.
• It would have to have an unusually large bias to undo its superiority in SE.

Generalized Additive Models
• Generalized linear model (GLM):
  – Generalizes linear regression
  – Linear model related to the response variable using a link function
    Y = g(b0 + b1*X1 + ... + bm*Xm)
• Additive model:
  – Nonparametric regression method
  – Estimate a nonparametric function for each predictor
  – Combine all predictor functions to predict the dependent variable
• Generalized additive model (GAM):
  – Blends properties of additive models with the generalized linear model (GLM)
  – Each predictor function fi(xi) is fit using parametric or nonparametric means
  – Provides good fits to training data at the expense of interpretability

GAM Case Study
• Analyze survival of infants after cardiac surgery for heart defects
• Dataset: 497 infant records
• Explanatory variables:
  – Age (days)
  – Weight (kg)
  – Whether warm-blood cardioplegia (WBC) was applied
• WBC support data:
  – Of 57 infants who received the WBC procedure, 7 died
  – Of 440 infants who received the standard procedure, 133 died

GAM Case Study: Logistic regression results
• Three-parameter regression model
  – Age, weight: continuous variables
  – WBC applied: binary variable
• Results:
  – WBC has a strong beneficial effect: odds ratio of 3.8:1
  – Higher weight => lower risk of death
  – Age has no significant effect

GAM Case Study: GAM Analysis
• Add three individual smooth functions
  – Use the locally weighted scatterplot smoothing (loess) method
• Results:
  – WBC has a strong beneficial effect: odds ratio of 4.2:1
  – Lighter infants are 55 times more likely to die than heavier infants
  – Surprising findings from the log-odds curve for age!
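
As a rough cross-check on the WBC counts quoted in the case study, the unadjusted (crude) odds ratio can be computed directly from the two groups. This is only a sketch: it ignores age and weight, so it should not be expected to reproduce the adjusted 3.8:1 or 4.2:1 figures from the fitted models. The function name `odds_ratio` and the variable names are illustrative and do not come from the original slides.

```python
def odds_ratio(deaths_a, total_a, deaths_b, total_b):
    """Odds of death in group A divided by odds of death in group B."""
    odds_a = deaths_a / (total_a - deaths_a)
    odds_b = deaths_b / (total_b - deaths_b)
    return odds_a / odds_b

# Counts quoted on the case-study slide.
standard_deaths, standard_total = 133, 440
wbc_deaths, wbc_total = 7, 57

# Crude comparison only: no adjustment for age or weight.
crude_or = odds_ratio(standard_deaths, standard_total, wbc_deaths, wbc_total)
print(f"Crude odds ratio (standard vs. WBC): {crude_or:.2f}")
```

The crude ratio comes out at roughly 3.1:1, the same direction as the model-based estimates but smaller, which is consistent with the regression models adjusting for the other explanatory variables.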

GAM Case Study: Conclusion
• Traditional regression models may lead to oversimplification
  – Linear logistic regression forces curves to be straight lines
  – Vital information regarding the effect of age is lost in a linear model
  – The problem is more acute with a large number of explanatory variables
• GAM analysis exploits computational power to achieve a new level of analysis flexibility
  – A personal computer can do what required a mainframe 10 years ago

Classification and Regression Tree
• A nonparametric technique
• An ideal analysis method to apply computer algorithms
• Splits are chosen based on how well they explain variability
• Once a node is split, the procedure is applied to each “split” recursively

CART Case study
• Gain insight into causes of duodenal ulcers
  – Use a sample of 745 rats
  – 1 out of 56 different alkyl nucleophiles administered to each rat
  – Response: one of three severity levels (1, 2, 3), 3 being the highest severity
• Skewed misclassification costs
  – Severe ulcer misclassification is more expensive than mild ulcer misclassification
• Analysis tree construction:
  – Use the 745 observations as the training data
  – Compute ‘apparent’ misclassification rates
  – Training data misclassification rate has a downward bias

CART Case study
• Classification tree

CART Case study: Observations
• Optimal size of the classification tree is a tradeoff
  – Higher training errors versus overfitting
• It is usually better to construct a large tree and prune from the bottom
• How to choose the optimal size of the classification tree?
  – Use test data on different tree models to understand misclassification rate in each tree
  – In the absence of test data, use cross validation approach

CART: Cross validation
• Mimic the use of test sample
• Standard cross validation approach:
  – Divide dataset into 10 equal partitions
  – Use 90% of data as training set and the remaining 10% as test data
  – Repeat with all different combinations of the training and test data
• Cross validation misclassification errors found to be 10% higher than the original
• Cross validation and bootstrapping are closely related
  – Research on hybrid approaches in progress

Conclusion
• Computers have enabled a new generation of statistical methods and tools
• Replace traditional mathematical ways with computer algorithms.
• Freedom from bell-shaped curve assumptions of the traditional approach
• Modern statisticians need to understand:
  – Mathematical tractability is not required for computer based methods
  – Which computer based methods to use
  – When to use each method
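
The 10-partition procedure from the cross validation slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the analysis from the original study: the data are synthetic, the classifier is a stand-in that always predicts the most common training label (it is not CART), and the names `k_fold_indices`, `cv_error`, `fit_majority`, and `predict_majority` are hypothetical.

```python
import random
from collections import Counter

def k_fold_indices(n, k=10, seed=0):
    """Shuffle 0..n-1 and deal the indices into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_error(xs, ys, fit, predict, k=10, seed=0):
    """Pooled misclassification rate: each case is predicted exactly once,
    by a model fitted without the fold that contains it."""
    errors = 0
    for held_out in k_fold_indices(len(ys), k, seed):
        held = set(held_out)
        train = [i for i in range(len(ys)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors += sum(predict(model, xs[i]) != ys[i] for i in held_out)
    return errors / len(ys)

# Stand-in classifier (NOT CART): always predict the most common training label.
def fit_majority(xs, ys):
    return Counter(ys).most_common(1)[0][0]

def predict_majority(model, x):
    return model

# Synthetic, made-up data: 745 cases with severity labels 1, 2, 3.
rng = random.Random(1)
xs = [rng.random() for _ in range(745)]
ys = [rng.choice([1, 1, 1, 2, 3]) for _ in range(745)]

rate = cv_error(xs, ys, fit_majority, predict_majority)
print(f"10-fold CV misclassification rate: {rate:.3f}")
```

Passing `fit` and `predict` as arguments keeps the fold logic independent of the model, so a tree-growing routine could be substituted for the majority-class stand-in without changing `cv_error`.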