Evaluating learning

TAMING THE LEARNING ZOO
SUPERVISED LEARNING ZOO

Bayesian learning
  Maximum likelihood
  Maximum a posteriori
Decision trees
Support vector machines
Neural nets
k-Nearest-Neighbors

VERY APPROXIMATE “CHEAT-SHEET” FOR TECHNIQUES DISCUSSED IN CLASS

Technique           Attributes   N scalability        D scalability   Capacity
Bayes nets          D            Good                 Good            Good
Naïve Bayes         D            Excellent            Excellent       Low
Decision trees      D,C          Excellent            Excellent       Fair
Neural nets         C            Poor                 Good            Good
SVMs                C            Good                 Good            Good
Nearest neighbors   D,C          Learn: E, Eval: P    Poor            Excellent

(Attributes: D = discrete, C = continuous. N scalability = scaling with the number of training examples; D scalability = scaling with the number of attributes. For nearest neighbors, learning scales Excellently but evaluation scales Poorly in N.)

WHAT HAVEN’T WE COVERED?

Boosting
  Way of turning several “weak learners” into a “strong learner”
  Ensembles of this kind also appear in the popular random forests algorithm (which technically uses the related idea of bagging rather than boosting)
Regression: predicting continuous outputs y=f(x)
  Neural nets, nearest neighbors work directly as described
  Least squares, locally weighted averaging
Unsupervised learning
  Clustering
  Density estimation
  Dimensionality reduction
  [Harder to quantify performance]

AGENDA

Quantifying learner performance
  Cross validation
  Precision & recall
Model selection

CROSS-VALIDATION
ASSESSING PERFORMANCE OF A LEARNING ALGORITHM

New samples from X are typically unavailable, so:
  Take out some of the training set
  Train on the remaining training set
  Test on the excluded instances
This is cross-validation
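As a minimal sketch of this idea (the function name, the 80/20 split, and the use of Python's random module are illustrative assumptions, not from the slides):

    # Hold-out evaluation sketch: set aside part of the training data for testing.
    import random

    def holdout_split(examples, test_fraction=0.2, seed=0):
        """Shuffle the (x, y) examples and reserve a fraction as a held-out test set."""
        rng = random.Random(seed)
        shuffled = list(examples)          # copy so the original order is untouched
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_fraction)
        return shuffled[n_test:], shuffled[:n_test]   # (train, test)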

CROSS-VALIDATION

Split the original set of examples; train on one part

[Figure: labeled examples D (+/− instances); a training subset is used to learn a hypothesis from hypothesis space H]

CROSS-VALIDATION

Evaluate the hypothesis on the testing set

[Figure: held-out testing set of +/− examples; the learned hypothesis from hypothesis space H is applied to each of them]

CROSS-VALIDATION

Compare the true concept against the prediction: 9/13 correct

[Figure: testing set with true labels and the hypothesis's predictions side by side; 9 of the 13 predictions agree]

COMMON SPLITTING STRATEGIES

k-fold cross-validation
  [Figure: dataset partitioned into k folds; each fold in turn is held out as the test set while the remaining folds are used for training]
Leave-one-out (n-fold cross-validation)

COMPUTATIONAL COMPLEXITY

k-fold cross-validation requires
  k training steps, each on n(k-1)/k datapoints
  k testing steps, each on n/k datapoints
  (There are efficient ways of computing L.O.O. estimates for some nonparametric techniques, e.g., nearest neighbors)
Average results are reported
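As a concrete illustration, here is a minimal k-fold cross-validation sketch in Python; the train and error callables are stand-ins for whatever learner and error measure are being evaluated (their names and signatures are assumptions, not from the slides).

    # Minimal k-fold cross-validation sketch (hypothetical train/error callables).
    import random

    def k_fold_cv(examples, train, error, k=5, seed=0):
        """Return the average held-out error over k folds.

        examples : list of (x, y) pairs
        train    : callable taking a list of examples and returning a hypothesis h
        error    : callable taking (h, examples) and returning an error rate
        """
        rng = random.Random(seed)
        shuffled = list(examples)
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]       # k roughly equal folds
        errors = []
        for i in range(k):
            test = folds[i]                              # n/k test points
            training = [ex for j, f in enumerate(folds) if j != i for ex in f]  # n(k-1)/k points
            h = train(training)                          # one of the k training steps
            errors.append(error(h, test))                # one of the k testing steps
        return sum(errors) / k
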
BOOTSTRAPPING

Similar technique for estimating the confidence in the model parameters θ
Procedure:
  1. Draw k hypothetical datasets from the original data, either via cross-validation or by sampling with replacement.
  2. Fit the model on each dataset to compute parameters θk.
  3. Return the standard deviation of θ1,…,θk (or a confidence interval).
Can also estimate confidence in a prediction y=f(x)
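A sketch of the sampling-with-replacement variant, assuming a generic fit callable that maps a dataset to a parameter vector (the name, numpy, and k = 100 are illustrative choices):

    # Bootstrap sketch: estimate the spread of fitted parameters by resampling with replacement.
    import numpy as np

    def bootstrap_params(data, fit, k=100, seed=0):
        """data : numpy array of examples; fit : callable mapping a dataset to a parameter vector.
        Returns (mean, standard deviation) of the k fitted parameter vectors."""
        rng = np.random.default_rng(seed)
        n = len(data)
        thetas = []
        for _ in range(k):
            sample = data[rng.integers(0, n, size=n)]   # draw a hypothetical dataset of size n
            thetas.append(fit(sample))                  # fit the model on this dataset
        thetas = np.array(thetas)
        return thetas.mean(axis=0), thetas.std(axis=0)  # spread ~ confidence in the parameters
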

SIMPLE EXAMPLE: AVERAGE OF N NUMBERS

Data D = {x(1),…,x(N)}; the model is a constant θ
Learning: minimize E(θ) = Σi (x(i) − θ)²  =>  compute the average
Repeat for j = 1,…,k:
  Randomly sample a subset x(1)',…,x(N)' from D
  Learn θj = 1/N Σi x(i)'
Return the histogram of θ1,…,θk

[Plot: Average, Lower range, and Upper range of the bootstrap estimate (roughly 0.47–0.55) vs. |Data set| from 1 to 10,000]
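A runnable version of this worked example; numpy and the uniform data distribution are assumptions made purely for illustration:

    # Bootstrap the mean of N numbers: each resample gives one estimate theta_j;
    # the spread of the theta_j values indicates how much to trust the average.
    import numpy as np

    rng = np.random.default_rng(0)
    N = 1000
    data = rng.random(N)                                    # illustrative dataset with true mean 0.5

    k = 200
    thetas = []
    for _ in range(k):
        resample = rng.choice(data, size=N, replace=True)   # x(1)',...,x(N)' drawn from D
        thetas.append(resample.mean())                      # theta_j = (1/N) * sum_i x(i)'

    print("average:", np.mean(thetas))
    print("spread (std):", np.std(thetas))                  # shrinks as |Data set| grows
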
PRECISION-RECALL CURVES
PRECISION VS. RECALL

Precision
  # of true positives / (# true positives + # false positives)
Recall
  # of true positives / (# true positives + # false negatives)
A precise classifier is selective
A classifier with high recall is inclusive
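A small sketch computing both quantities from paired true and predicted labels (the function name and boolean-label convention are illustrative assumptions):

    # Precision and recall from paired true/predicted boolean labels.
    def precision_recall(y_true, y_pred):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)        # true positives
        fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)    # false positives
        fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)    # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return precision, recall
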
PRECISION-RECALL CURVES

Measure precision vs. recall as the classification boundary is tuned

[Plot: precision vs. recall; curves farther toward the upper right indicate better learning performance]

PRECISION-RECALL CURVES

Measure precision vs. recall as the classification boundary is tuned
Which learner is better?

[Plot: precision-recall curves for Learner A and Learner B]

AREA UNDER CURVE

AUC-PR: measure the area under the precision-recall curve

[Plot: precision-recall curve with AUC = 0.68]

AUC METRICS

A single number that measures “overall” performance across multiple thresholds
Useful for comparing many learners
“Smears out” the PR curve
Note the training / testing set dependence
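One simple way to compute such a number is to sweep the decision threshold over the classifier's scores, collect the resulting precision-recall points, and integrate them; a sketch assuming score-based predictions and trapezoidal integration (both illustrative choices):

    # Approximate AUC-PR: sweep the threshold over the scores, collect (recall, precision)
    # points, and integrate with the trapezoid rule.
    def auc_pr(y_true, scores):
        points = []
        for t in sorted(set(scores), reverse=True):
            y_pred = [s >= t for s in scores]            # tune the classification boundary
            tp = sum(1 for yt, yp in zip(y_true, y_pred) if yt and yp)
            fp = sum(1 for yt, yp in zip(y_true, y_pred) if not yt and yp)
            fn = sum(1 for yt, yp in zip(y_true, y_pred) if yt and not yp)
            if tp + fp and tp + fn:
                points.append((tp / (tp + fn), tp / (tp + fp)))   # (recall, precision)
        points.sort()
        area = 0.0
        for (r0, p0), (r1, p1) in zip(points, points[1:]):
            area += (r1 - r0) * (p0 + p1) / 2.0           # trapezoid between consecutive recalls
        return area
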
MODEL SELECTION AND REGULARIZATION
COMPLEXITY VS. GOODNESS OF FIT

More complex models can fit the data better, but can overfit
Model selection: enumerate several possible hypothesis classes of increasing complexity, stop when the cross-validated error levels off
Regularization: explicitly define a metric of complexity and penalize it in addition to the loss
MODEL SELECTION WITH K-FOLD CROSS-VALIDATION

Parameterize the learner by a complexity level C
Model selection pseudocode (see the sketch below):
  For increasing levels of complexity C:
    errT[C], errV[C] = Cross-Validate(Learner, C, examples)
      [average k-fold CV training error, testing error]
    If errT has converged, the needed capacity has been reached: stop
  Find the value Cbest that minimizes errV[C]
  Return Learner(Cbest, examples)
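A minimal Python rendering of this pseudocode, assuming a hypothetical cross_validate(learner, C, examples) that returns (training error, validation error), a learner(C, examples) that trains a final hypothesis, and an illustrative convergence tolerance:

    # Model-selection sketch: grow complexity C until training error stops improving,
    # then pick the C with the lowest cross-validated error.
    def model_select(learner, cross_validate, examples, max_C=20, tol=1e-3):
        errT, errV = {}, {}
        for C in range(1, max_C + 1):
            errT[C], errV[C] = cross_validate(learner, C, examples)
            # errT has converged once added capacity no longer reduces training error
            if C > 1 and abs(errT[C] - errT[C - 1]) < tol:
                break
        C_best = min(errV, key=errV.get)          # complexity with lowest held-out error
        return learner(C_best, examples)          # retrain on all examples at C_best
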

MODEL SELECTION: DECISION TREES

C is the maximum depth of the decision tree; suppose there are N attributes
For C = 1,…,N:
  errT[C], errV[C] = Cross-Validate(Learner, C, examples)
  If errT has converged, stop
Find the value Cbest that minimizes errV[C]
Return Learner(Cbest, examples)
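For example, with scikit-learn (the library choice is an assumption; the slides do not prescribe one), the complexity level C maps directly onto the tree's max_depth parameter:

    # Sweep max_depth and keep the depth with the best cross-validated accuracy.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    def select_tree_depth(X, y, max_depth=10, k=5):
        best_depth, best_score = 1, -np.inf
        for C in range(1, max_depth + 1):
            scores = cross_val_score(DecisionTreeClassifier(max_depth=C, random_state=0),
                                     X, y, cv=k)          # k-fold CV accuracy at depth C
            if scores.mean() > best_score:
                best_depth, best_score = C, scores.mean()
        return DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X, y)
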

MODEL SELECTION: FEATURE SELECTION EXAMPLE

Have many potential features f1,…,fN
Complexity level C indicates the number of features allowed for learning
For C = 1,…,N:
  errT[C], errV[C] = Cross-Validate(Learner, examples[f1,…,fC])
  If errT has converged, stop
Find the value Cbest that minimizes errV[C]
Return Learner(Cbest, examples)
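A sketch of the same loop where complexity is the number of features kept; the cross_validate callable and the assumption that the features are already ordered are illustrative, not from the slides:

    # Feature-count selection: allow only the first C columns of X and cross-validate each C.
    def select_feature_count(learner, cross_validate, X, y, tol=1e-3):
        n_features = X.shape[1]
        errT, errV = {}, {}
        for C in range(1, n_features + 1):
            errT[C], errV[C] = cross_validate(learner, X[:, :C], y)   # only features f1..fC
            if C > 1 and abs(errT[C] - errT[C - 1]) < tol:
                break                                                 # training error converged
        C_best = min(errV, key=errV.get)
        return learner(X[:, :C_best], y), C_best
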

BENEFITS / DRAWBACKS

Automatically chooses a complexity level that performs well on hold-out sets
Expensive: many training / testing iterations
[But wait: if we fit the complexity level to the testing set, aren't we “peeking”?]
REGULARIZATION

Let the learner penalize the inclusion of new features vs. accuracy on the training set
  A feature is included if it improves accuracy significantly; otherwise it is left out
Leads to sparser models
Generalization to the test set is considered implicitly
Much faster than cross-validation
REGULARIZATION

Minimize:
  Cost(h) = Loss(h) + λ Complexity(h)
Example with linear models y = θᵀx:
  L2 error: Loss(θ) = Σi (y(i) − θᵀx(i))²
  Lq regularization: Complexity(θ) = Σj |θj|^q
  L2 and L1 are the most popular in linear regularization
L2 regularization leads to a simple computation of the optimal θ
L1 is more complex to optimize, but produces sparse models in which many coefficients are 0!
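To make the L2 case concrete: the penalized objective Σi (y(i) − θᵀx(i))² + λ Σj θj² has the closed-form minimizer θ = (XᵀX + λI)⁻¹ Xᵀy. A small numpy sketch (numpy and the bias-free formulation are illustrative assumptions):

    # Ridge (L2-regularized) linear regression via its closed-form solution.
    import numpy as np

    def ridge_fit(X, y, lam=1.0):
        """Minimize sum_i (y(i) - theta^T x(i))^2 + lam * sum_j theta_j^2."""
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)   # (X^T X + lam I) theta = X^T y

    # L1 (lasso) has no closed form; it is typically solved iteratively (e.g., coordinate
    # descent) and drives many coefficients exactly to zero, yielding sparse models.
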

DATA DREDGING

As the number of attributes increases, so does the likelihood that a learner picks up on patterns that arise purely by chance
In the extreme case where there are more attributes than datapoints (e.g., pixels in a video), even very simple hypothesis classes can overfit
  E.g., linear classifiers
  Sparsity is important to enforce
Many opportunities for charlatans in the big data age!
ISSUES IN PRACTICE

The distinctions between learning algorithms diminish when you have a lot of data
The web has made it much easier to gather large-scale datasets than in the early days of ML
Understanding data with many more attributes than examples is still a major challenge!
  Do humans just have really great priors?
NEXT LECTURES

Intelligent agents (R&N Ch. 2)
Markov Decision Processes
Reinforcement learning
Applications of AI: computer vision, robotics
