The Elements of Statistical Learning
Hastie, Tibshirani, Friedman
December 20, 2002

What is statistical learning?
• supervised learning - the response can be continuous, so this is broader than supervised classification
• unsupervised learning - cluster analysis

Examples
1. Email spam: How to design an automatic spam detector based on commonly occurring words and punctuation marks (57 variables)? (HTF use generalized additive models and trees)
2. Prostate cancer: How to predict the size of a cancer based on prostate size, age, ... (8 variables)? (HTF use regression)
3. Handwritten digit recognition: Build an automatic zip-code reader for the US postal system based on 16×16 normalized images with a 0-255 grey-scale value for each pixel. (HTF use neural network and ICA classification)
4. Gene expression: Which genes activate together to generate actions in organisms (thousands of genes, not many experiments)? (HTF use cluster analysis)

Outline
• terminology
• parametric regression
• parametric classification
• nonparametric approaches

Terminology
• observation ≡ instance
• explanatory variable ≡ feature (attribute)
• response variable ≡ outcome (concept)
• training set: the outcome and feature measurements for a sample of instances

Supervised Learning
Model: y = f(X) + ε; training set L = (y_i, X_i), i = 1, . . .
, N

Fit using:
• Least squares: RSS(f) = Σ_{i=1}^{N} (y_i − f(X_i))^2
• Nearest-neighbor: f̂(X) = (1/h) Σ_{X_i ∈ N_h(X)} y_i
• Penalized RSS: PRSS(f) = RSS(f) + λJ(f)

Parametric Regression
When p is small, use least squares; otherwise:
• Best subset
• Ridge
• Lasso
• Principal Component Regression (PCR)
• Partial Least Squares (PLS)

Parametric Classification
• Linear Discriminant Analysis (LDA)
• Logistic Regression
• Quadratic Discriminant Analysis (QDA)

Nonparametric Approaches
When p is small:
• Basis expansion: expand the dimension of the input space via non-linear transformations, e.g. splines, wavelets, GAM, SVM
• Local averaging: e.g. kernels, local likelihood, nearest neighbors

When p is large:
• Partitioning the input space, e.g. trees, PRIM, HME
• Additive spline methods: MARS
• Additive models with dimension reduction: projection pursuit, neural networks, SVM
• Subsampling and ensemble methods: bagging, boosting, random forests
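The three fitting criteria under "Supervised Learning" can be sketched in a few lines of Python for the one-predictor case. This is a toy illustration, not code from HTF: the data, the penalty J(f) = b², the λ value, and all function names are invented for the example.

```python
# Toy sketch of the three fitting criteria (invented data and names).

def fit_least_squares(xs, ys):
    """Simple linear regression: the closed-form minimizer of
    RSS(f) = sum_i (y_i - f(x_i))^2 for f(x) = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b           # intercept a, slope b

def fit_ridge(xs, ys, lam):
    """Penalized RSS with J(f) = b^2 (slope only, intercept unpenalized):
    minimizing RSS(f) + lam*b^2 shrinks the least-squares slope toward 0."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / (sxx + lam)           # shrunken slope
    return my - b * mx, b

def knn_fit(xs, ys, x, k=3):
    """Nearest-neighbor fit: average y_i over the k training points
    closest to x, i.e. the neighborhood N_k(x)."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x))[:k]
    return sum(ys[i] for i in nearest) / k

def rss(a, b, xs, ys):
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 1.1, 1.9, 3.2, 3.9]      # roughly y = x plus noise

a_ls, b_ls = fit_least_squares(xs, ys)
a_r, b_r = fit_ridge(xs, ys, lam=2.0)
print(b_ls, b_r)                    # the ridge slope is smaller in magnitude
print(knn_fit(xs, ys, x=2.0, k=3))  # local average of the 3 nearest responses
```

Note the trade-off the penalized criterion encodes: the least-squares fit always achieves the smallest RSS on the training set, while the ridge fit accepts a slightly larger RSS in exchange for a smaller (less variable) slope.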