The Elements of Statistical Learning
Hastie, Tibshirani, Friedman
December 20, 2002
What is statistical learning?
• supervised learning: the response can be continuous, so this is broader than supervised classification
• unsupervised learning: e.g., cluster analysis
Examples
1. Email spam: How do we design an automatic spam detector based on commonly occurring words and punctuation marks (57 variables)? (HTF use generalized additive models and trees)
2. Prostate cancer: How do we predict the size of a cancer from prostate size, age, ... (8 variables)? (HTF use regression)
3. Handwritten digit recognition: Build an automatic zip-code reader for the US postal system based on 16×16 normalized images with a 0-255 grey-scale value for each pixel. (HTF use neural networks and ICA classification)
4. Gene expression: Which genes activate together to generate actions in organisms (thousands of genes, not too many experiments)? (HTF use cluster analysis)
Outline
• terminology
• parametric regression
• parametric classification
• nonparametric approaches
Terminology
• observation ≡ instance
• explanatory variable ≡ feature (attribute)
• response variable ≡ outcome (concept)
• training set: the outcome and feature measurements for a sample of instances
Supervised Learning
Model: $y = f(X) + \varepsilon$; training set $\mathcal{L} = \{(y_i, X_i)\}$, $i = 1, \ldots, N$
Fit using (a NumPy sketch of the first two follows):
• Least squares: $\mathrm{RSS}(f) = \sum_{i=1}^{N} \left(y_i - f(X_i)\right)^2$
• Nearest-neighbor: $\hat{f}(X) = \frac{1}{h} \sum_{X_i \in N_h(X)} y_i$, where $N_h(X)$ is the neighborhood of the $h$ training points closest to $X$
• Penalized RSS: $\mathrm{PRSS}(f) = \mathrm{RSS}(f) + \lambda J(f)$, with $J$ a complexity (roughness) penalty
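A minimal NumPy sketch of the first two fits. The simulated one-dimensional data, the linear form of f, and the neighborhood size h = 5 are illustrative assumptions, not from HTF:

```python
# Sketch: least-squares fit and nearest-neighbor average on simulated data.
# The data generator, linear f, and h = 5 are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 50))                  # one illustrative feature
y = np.sin(2 * np.pi * X) + rng.normal(0, 0.3, 50)  # noisy response

# Least squares for a linear f(X) = b0 + b1*X: minimizes RSS(f)
A = np.column_stack([np.ones_like(X), X])
b0, b1 = np.linalg.lstsq(A, y, rcond=None)[0]

# Nearest-neighbor: average y over the h training points closest to x
def f_nn(x, h=5):
    nearest = np.argsort(np.abs(X - x))[:h]
    return y[nearest].mean()

print("least-squares fit at x = 0.5:", b0 + b1 * 0.5)
print("5-NN average at x = 0.5:", f_nn(0.5))
```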
Parametric Regression
When p is small, use ordinary least squares; otherwise consider (see the sketch after this list):
• Best subset
• Ridge
• Lasso
• Principal Component Regression (PCR)
• Partial Least Squares (PLS)
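As a hedged illustration of two of these shrinkage methods, a scikit-learn sketch; the synthetic data and the penalty strengths (alpha) are assumptions for the example:

```python
# Sketch: ridge and lasso vs. ordinary least squares on synthetic data.
# The data generator and the alpha values are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))                    # N = 100 observations, p = 8
beta = np.array([3.0, 1.5, 0, 0, 2.0, 0, 0, 0])  # only 3 features matter
y = X @ beta + rng.normal(0, 1, 100)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))
# Lasso tends to zero out irrelevant coefficients; ridge shrinks all of them.
```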
Parametric Classification
• Linear Discriminant Analysis (LDA)
• Logistic Regression
• Quadratic Discriminant Analysis (QDA)
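A small sketch fitting the first two classifiers above on a toy two-class problem; the two-Gaussian data generator is an assumption for illustration:

```python
# Sketch: LDA and logistic regression on a toy two-Gaussian problem.
# The data generator is an illustrative assumption.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-1, -1], 1.0, size=(100, 2)),   # class 0
               rng.normal([+1, +1], 1.0, size=(100, 2))])  # class 1
y = np.repeat([0, 1], 100)

for clf in (LinearDiscriminantAnalysis(), LogisticRegression()):
    clf.fit(X, y)
    print(type(clf).__name__, "training accuracy:", clf.score(X, y))
# Both give linear decision boundaries; LDA assumes Gaussian classes with a
# shared covariance, logistic regression models Pr(y | X) directly.
```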
Non-parametric Approach
When p is small:
• Basis expansion: expand the dimension of the input space via non-linear transformations, e.g., splines, wavelets, GAMs, SVMs
• Local average: e.g., kernels, local likelihood, nearest neighbors (a kernel-smoother sketch follows)
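As one concrete local-average method, a minimal Nadaraya-Watson kernel smoother; the Gaussian kernel and the bandwidth are illustrative choices, not HTF's:

```python
# Sketch: Nadaraya-Watson kernel smoother (a locally weighted average).
# The Gaussian kernel and bandwidth 0.05 are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 100))
y = np.sin(2 * np.pi * X) + rng.normal(0, 0.2, 100)

def kernel_smooth(x0, bandwidth=0.05):
    # Weight each y_i by a Gaussian kernel in the distance |X_i - x0|
    w = np.exp(-0.5 * ((X - x0) / bandwidth) ** 2)
    return np.sum(w * y) / np.sum(w)

print("smoothed value at 0.25:", kernel_smooth(0.25))
print("true f at 0.25:        ", np.sin(2 * np.pi * 0.25))
```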
When p is large:
• Partitioning the input space, e.g., trees, PRIM, HME
• Additive spline methods: MARS
• Additive models with dimension reduction: projection pursuit, neural networks, SVM
• Subsampling and ensemble methods: bagging, boosting, random forests (a random-forest sketch follows)
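A sketch of one of these ensemble methods, a random forest, via scikit-learn; the dataset shape and the settings are illustrative assumptions:

```python
# Sketch: a random forest on synthetic data with a moderately large p.
# The make_classification parameters and n_estimators are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=50,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each tree is grown on a bootstrap sample, splitting on random feature
# subsets; predictions are combined by majority vote across trees.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_tr, y_tr)
print("test accuracy:", rf.score(X_te, y_te))
```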