TAMING THE LEARNING ZOO

SUPERVISED LEARNING ZOO
- Bayesian learning
  - Maximum likelihood
  - Maximum a posteriori
- Decision trees
- Support vector machines
- Neural nets
- k-Nearest-Neighbors

VERY APPROXIMATE "CHEAT-SHEET" FOR TECHNIQUES DISCUSSED IN CLASS

| Technique         | Attributes | N scalability     | D scalability | Capacity  |
|-------------------|------------|-------------------|---------------|-----------|
| Bayes nets        | D          | Good              | Good          | Good      |
| Naïve Bayes       | D          | Excellent         | Excellent     | Low       |
| Decision trees    | D, C       | Excellent         | Excellent     | Fair      |
| Neural nets       | C          | Poor              | Good          | Good      |
| SVMs              | C          | Good              | Good          | Good      |
| Nearest neighbors | D, C       | Learn: E, Eval: P | Poor          | Excellent |

(D = discrete attributes, C = continuous attributes; for nearest neighbors, learning scales excellently with N but evaluation scales poorly.)

WHAT HAVEN'T WE COVERED?
- Boosting
  - A way of turning several "weak learners" into a "strong learner"
  - Closely related to the ensemble idea behind the popular random forests algorithm
- Regression: predicting continuous outputs y = f(x)
  - Neural nets and nearest neighbors work directly as described
  - Least squares, locally weighted averaging
- Unsupervised learning
  - Clustering
  - Density estimation
  - Dimensionality reduction
  - [Harder to quantify performance]

AGENDA
- Quantifying learner performance
  - Cross-validation
  - Precision & recall
- Model selection

CROSS-VALIDATION

ASSESSING PERFORMANCE OF A LEARNING ALGORITHM
- New samples from the underlying distribution over X are typically unavailable for testing
- Instead, take out some of the training set
- Train on the remaining training set
- Test on the excluded instances
- This procedure is cross-validation

CROSS-VALIDATION
- Split the original set of examples D and train on the training portion
  [Figure: labeled examples D split into a training set; a hypothesis is chosen from hypothesis space H]
- Evaluate the hypothesis on the held-out testing set
  [Figure: hypothesis predictions on the testing set]
- Compare the true concept against the predictions: 9/13 correct in this example

COMMON SPLITTING STRATEGIES
- k-fold cross-validation
  [Figure: dataset split into k folds; each fold in turn serves as the test set, the rest as the training set]
- Leave-one-out (n-fold cross-validation)
  [Figure: each single example in turn serves as the test set]

COMPUTATIONAL COMPLEXITY
- k-fold cross-validation requires:
  - k training steps on n(k-1)/k datapoints each
  - k testing steps on n/k datapoints each
- (There are efficient ways of computing leave-one-out estimates for some nonparametric techniques, e.g., nearest neighbors)
- Average results are reported

BOOTSTRAPPING
- A similar technique for estimating the confidence in the model parameters θ
- Procedure:
  1. Draw k hypothetical datasets from the original data, either via cross-validation or by sampling with replacement
  2. Fit the model to each dataset to compute parameters θ_1, …, θ_k
  3. Return the standard deviation of θ_1, …, θ_k (or a confidence interval)
- Can also be used to estimate confidence in a prediction y = f(x)

SIMPLE EXAMPLE: AVERAGE OF N NUMBERS
- Data D = {x(1), …, x(N)}; the model is a constant θ
- Learning: minimize E(θ) = Σ_i (x(i) - θ)^2, i.e., compute the average
- Repeat for j = 1, …, k:
  - Randomly sample a subset x(1)', …, x(N)' from D (with replacement)
  - Learn θ_j = (1/N) Σ_i x(i)'
- Return a histogram of θ_1, …, θ_k
  [Plot: bootstrap average with lower and upper range vs. |Data set| from 1 to 10,000, log scale]

PRECISION-RECALL CURVES

PRECISION VS. RECALL
- Precision = # of true positives / (# true positives + # false positives)
- Recall = # of true positives / (# true positives + # false negatives)
- A precise classifier is selective
- A classifier with high recall is inclusive

PRECISION-RECALL CURVES
- Measure precision vs. recall as the classification boundary is tuned (see the sketch just below)
  [Plot: precision-recall axes; curves toward the upper right indicate better learning performance]
- Which learner is better?
  [Plot: precision-recall curves for Learner A and Learner B]
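To make the threshold-sweeping idea concrete, the following is a minimal NumPy sketch, not code from the lecture: it sorts a classifier's scores, records precision and recall at every possible cutoff, and estimates the area under the resulting curve with the trapezoid rule. The function name precision_recall_curve and the toy scores/labels are invented for illustration.

```python
import numpy as np

def precision_recall_curve(scores, labels):
    """Precision and recall at every threshold of a scoring classifier.

    scores: real-valued outputs (higher means "more positive")
    labels: 0/1 ground-truth labels
    """
    order = np.argsort(-scores)      # descending score = decreasing threshold
    labels = labels[order]
    tp = np.cumsum(labels)           # true positives above each cutoff
    fp = np.cumsum(1 - labels)       # false positives above each cutoff
    fn = labels.sum() - tp           # positives below the cutoff (missed)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Toy example with made-up scores and labels.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2])
labels = np.array([1,   1,   0,   1,   0,    1,   0,   0])
p, r = precision_recall_curve(scores, labels)
print("precision:", np.round(p, 2))
print("recall:   ", np.round(r, 2))
# Crude AUC-PR estimate via the trapezoid rule over the swept thresholds.
print("AUC-PR ~=", round(float(np.sum(np.diff(r) * (p[1:] + p[:-1]) / 2)), 2))
```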
AREA UNDER CURVE
- AUC-PR: measure the area under the precision-recall curve
  [Plot: precision-recall curve with AUC = 0.68]

AUC METRICS
- A single number that measures "overall" performance across multiple thresholds
- Useful for comparing many learners
- "Smears out" the PR curve
- Note the dependence on the training / testing sets

MODEL SELECTION AND REGULARIZATION

COMPLEXITY VS. GOODNESS OF FIT
- More complex models can fit the data better, but can overfit
- Model selection: enumerate several possible hypothesis classes of increasing complexity, and stop when the cross-validated error levels off
- Regularization: explicitly define a metric of complexity and penalize it in addition to the loss

MODEL SELECTION WITH K-FOLD CROSS-VALIDATION
- Parameterize the learner by a complexity level C
- Model selection pseudocode (a runnable sketch appears at the end of these notes):
  - For increasing levels of complexity C:
    - errT[C], errV[C] = Cross-Validate(Learner, C, examples)
      [average k-fold CV training error and testing error]
    - If errT has converged, stop: the needed capacity has been reached
  - Find the value Cbest that minimizes errV[C]
  - Return Learner(Cbest, examples)

MODEL SELECTION: DECISION TREES
- C is the maximum depth of the decision tree; suppose there are N attributes
  - For C = 1, …, N:
    - errT[C], errV[C] = Cross-Validate(Learner, C, examples)
    - If errT has converged, stop
  - Find the value Cbest that minimizes errV[C]
  - Return Learner(Cbest, examples)

MODEL SELECTION: FEATURE SELECTION EXAMPLE
- Have many potential features f1, …, fN
- The complexity level C indicates the number of features allowed for learning
  - For C = 1, …, N:
    - errT[C], errV[C] = Cross-Validate(Learner, examples[f1, …, fC])
    - If errT has converged, stop
  - Find the value Cbest that minimizes errV[C]
  - Return Learner(Cbest, examples)

BENEFITS / DRAWBACKS
- Automatically chooses a complexity level that performs well on hold-out sets
- Expensive: many training / testing iterations
- [But wait: if we fit the complexity level to the testing set, aren't we "peeking"?]

REGULARIZATION
- Let the learner penalize the inclusion of new features vs. accuracy on the training set
- A feature is included if it improves accuracy significantly; otherwise it is left out
- Leads to sparser models
- Generalization to the test set is considered implicitly
- Much faster than cross-validation

REGULARIZATION
- Minimize: Cost(h) = Loss(h) + Complexity(h)
- Example with linear models y = θ^T x:
  - L2 error: Loss(θ) = Σ_i (y(i) - θ^T x(i))^2
  - Lq regularization: Complexity(θ) = Σ_j |θ_j|^q
- L2 and L1 are the most popular regularizers for linear models
  - L2 regularization leads to simple computation of the optimal θ
  - L1 is more complex to optimize, but produces sparse models in which many coefficients are 0!

DATA DREDGING
- As the number of attributes increases, so does the likelihood that a learner picks up on patterns that arise purely from chance
- In the extreme case where there are more attributes than datapoints (e.g., pixels in a video), even very simple hypothesis classes, such as linear classifiers, can overfit
- Enforcing sparsity is important
- Many opportunities for charlatans in the big-data age!

ISSUES IN PRACTICE
- The distinctions between learning algorithms diminish when you have a lot of data
- The web has made it much easier to gather large-scale datasets than in the early days of ML
- Understanding data with many more attributes than examples is still a major challenge!
- Do humans just have really great priors?

NEXT LECTURES
- Intelligent agents (R&N Ch. 2)
- Markov Decision Processes
- Reinforcement learning
- Applications of AI: computer vision, robotics
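As a concrete companion to the model-selection pseudocode above (MODEL SELECTION WITH K-FOLD CROSS-VALIDATION), here is a self-contained NumPy sketch of the same loop. It is an illustration under stated assumptions, not code from the lecture: the complexity level C is taken to be the degree of a polynomial fit, the data are synthetic, the helper names (kfold_cv_error, fit_poly, predict_poly) are invented, and the "stop once errT converges" check is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def kfold_cv_error(X, y, fit, predict, k=5):
    """Average k-fold training and validation mean-squared error."""
    folds = np.array_split(rng.permutation(len(y)), k)
    err_train, err_val = [], []
    for i in range(k):
        val = folds[i]
        train = np.hstack([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        err_train.append(np.mean((predict(model, X[train]) - y[train]) ** 2))
        err_val.append(np.mean((predict(model, X[val]) - y[val]) ** 2))
    return np.mean(err_train), np.mean(err_val)

# Complexity level C = degree of a polynomial fit to 1-D data (a toy stand-in
# for "max depth of a decision tree" or "number of features allowed").
def fit_poly(C):
    return lambda X, y: np.polyfit(X, y, deg=C)

def predict_poly(coeffs, X):
    return np.polyval(coeffs, X)

# Synthetic data: a noisy sine curve.
X = rng.uniform(-1, 1, 60)
y = np.sin(3 * X) + 0.1 * rng.standard_normal(60)

errV = {}
for C in range(1, 10):                   # increasing complexity levels
    _errT, errV[C] = kfold_cv_error(X, y, fit_poly(C), predict_poly)
C_best = min(errV, key=errV.get)         # minimize cross-validated error
final_model = fit_poly(C_best)(X, y)     # retrain on all examples
print("chosen complexity (polynomial degree):", C_best)
```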