Evaluating learning

TAMING THE LEARNING ZOO
SUPERVISED LEARNING ZOO

Bayesian learning (find parameters of a probabilistic model)
  Maximum likelihood
  Maximum a posteriori

Classification
  Decision trees (discrete attributes, few relevant)
  Support vector machines (continuous attributes)

Regression
  Least squares (known structure, easy to interpret)
  Neural nets (unknown structure, hard to interpret)

Nonparametric approaches
  k-Nearest-Neighbors
  Locally-weighted averaging / regression

VERY APPROXIMATE “CHEAT-SHEET” FOR TECHNIQUES DISCUSSED IN CLASS

Technique                   Task  Attributes  N scalability  D scalability  Capacity
Bayes nets                  C     D           Good           Good           Good
Naïve Bayes                 C     D           Excellent      Excellent      Low
Decision trees              C     D,C         Excellent      Excellent      Fair
Linear least squares        R     C           Excellent      Excellent      Low
Nonlinear LS                R     C           Poor           Poor           Good
Neural nets                 R     C           Poor           Good           Good
SVMs                        C     C           Good           Good           Good
Nearest neighbors           C     D,C         L:E, E:P       Poor           Excellent*
Locally-weighted averaging  R     C           L:E, E:P       Poor           Excellent*
Boosting                    C     D,C         ?              ?              Excellent*

Note: in this class we have looked at a limited subset of existing techniques (typically, the “classical” versions). Most techniques extend to:
• Both C/R tasks (e.g., support vector regression)
• Both continuous and discrete attributes
• Better scalability for certain types of problem

* Excellent capacity for nearest neighbors and locally-weighted averaging holds with “sufficiently large” data sets
* Excellent capacity for boosting holds with “sufficiently diverse” weak learners

AGENDA

Quantifying learner performance
  Cross validation
  Error vs. loss
  Precision & recall

Model selection
CROSS-VALIDATION
ASSESSING PERFORMANCE OF A
LEARNING ALGORITHM
Additional samples from X are typically unavailable, so performance must be estimated from the examples we already have
Take out some of the training set
  Train on the remaining training set
  Test on the excluded instances
This procedure is cross-validation

CROSS-VALIDATION

Split original set of examples, train
[Figure: examples D, labeled + and -, with a training subset used to pick a hypothesis from hypothesis space H]

Evaluate hypothesis on testing set
[Figure: the hypothesis from H applied to the testing set]

Compare true concept against prediction
9/13 correct
[Figure: true and predicted labels shown together on the testing set]
COMMON SPLITTING STRATEGIES

k-fold cross-validation
Leave-one-out (n-fold cross validation)
[Figure: dataset partitioned into train and test portions]
COMPUTATIONAL COMPLEXITY

k-fold cross validation requires
  k training steps on n(k-1)/k datapoints
  k testing steps on n/k datapoints
  (There are efficient ways of computing L.O.O. estimates for some nonparametric techniques, e.g. Nearest Neighbors)

The results are averaged over the k folds and reported
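
The k training steps and k testing steps can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names (`k_fold_splits`, `cross_validate`) and the majority-label learner are assumptions made up for the example.

```python
import random

def k_fold_splits(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs that partition range(n) into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # fold sizes differ by at most 1
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def cross_validate(fit, predict, xs, ys, k=5):
    """Average test error rate of a learner over k folds."""
    errors = []
    for train, test in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        wrong = sum(predict(model, xs[i]) != ys[i] for i in test)
        errors.append(wrong / len(test))
    return sum(errors) / k

# Hypothetical learner: always predict the majority label of the training set.
def fit_majority(xs, ys):
    return max(set(ys), key=ys.count)

def predict_majority(model, x):
    return model

xs = list(range(20))
ys = [1] * 15 + [0] * 5
cv_error = cross_validate(fit_majority, predict_majority, xs, ys, k=5)
```

Calling `cross_validate` with `k=len(xs)` gives leave-one-out: every example is tested once by a model trained on the other n-1.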
BOOTSTRAPPING
Similar technique for estimating the confidence in the model parameters $\theta$
Procedure:
1. Draw k hypothetical datasets from the original data, either via cross validation or sampling with replacement.
2. Fit the model to each dataset $j = 1, \ldots, k$ to compute parameters $\theta_j$
3. Return the standard deviation of $\theta_1, \ldots, \theta_k$ (or a confidence interval)
Can also estimate confidence in a prediction y=f(x)

SIMPLE EXAMPLE: AVERAGE OF N NUMBERS

Data $D = \{x^{(1)}, \ldots, x^{(N)}\}$, model is constant $\theta$
Learning: minimize $E(\theta) = \sum_i (x^{(i)} - \theta)^2$ => compute average
Repeat for $j = 1, \ldots, k$:
  Randomly sample subset $x^{(1)\prime}, \ldots, x^{(N)\prime}$ from $D$
  Learn $\theta_j = \frac{1}{N} \sum_i x^{(i)\prime}$
Return histogram of $\theta_1, \ldots, \theta_k$

[Figure: average with lower and upper range (vertical axis from about 0.47 to 0.55) versus |Data set|, with ticks at 1, 10, 100, 1000, 10000]
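
The bootstrap procedure for the average can be written out directly. The sketch below assumes sampling with replacement and uses made-up uniform data; the function name `bootstrap_means`, the seeds, and the choice of k = 1000 resamples are illustrative assumptions, not values from the lecture.

```python
import random
import statistics

def bootstrap_means(data, k=1000, seed=0):
    """Return k averages, each computed on a resample of `data` drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(k)]

# Hypothetical data: 1000 uniform draws on [0, 1), so the true mean is 0.5.
rng = random.Random(1)
data = [rng.random() for _ in range(1000)]

thetas = bootstrap_means(data)
theta_hat = statistics.fmean(data)         # estimate from the full data set
spread = statistics.stdev(thetas)          # bootstrap standard deviation of the estimate
ordered = sorted(thetas)
lower, upper = ordered[25], ordered[974]   # rough 95% percentile interval
```

Rerunning this with larger data sets should narrow the gap between `lower` and `upper`, which is the trend a plot of range against data set size would show.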
BEYOND ERROR RATES

Predicting security risk
  Predicting “low risk” for a terrorist is far worse than predicting “high risk” for an innocent bystander (but maybe not for 5 million of them)
Searching for images
  Returning irrelevant images is worse than omitting relevant ones
BIASED SAMPLE SETS
Often there are orders of magnitude more
negative examples than positive
 E.g., all images of Kris on Facebook
 If I classify all images as “not Kris” I’ll have
>99.99% accuracy


Examples of Kris should count much more than
non-Kris!
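
A quick arithmetic check makes the point concrete. The counts below are hypothetical (1 positive image among 100,000), chosen only to show how a useless classifier can still report very high accuracy.

```python
# Hypothetical counts: 1 positive ("Kris") among 100,000 images.
n_pos, n_neg = 1, 99_999
labels = [1] * n_pos + [0] * n_neg
predictions = [0] * len(labels)          # classify everything as "not Kris"

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)         # fraction of all images labeled correctly

found = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = found / n_pos                   # fraction of Kris images actually found
```

Accuracy exceeds 99.99% while no image of Kris is ever retrieved, which is why error rate alone is a poor measure on skewed data.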
FALSE POSITIVES
An example incorrectly predicted to be positive
[Figure: true concept and learned concept in the (x1, x2) plane, with a new query]

FALSE NEGATIVES
An example incorrectly predicted to be negative
[Figure: true concept and learned concept in the (x1, x2) plane, with a new query]
PRECISION VS. RECALL

Precision
  # of relevant documents retrieved / # of total documents retrieved
Recall
  # of relevant documents retrieved / # of total relevant documents
Numbers between 0 and 1
PRECISION VS. RECALL

Precision
  # of true positives / (# true positives + # false positives)
Recall
  # of true positives / (# true positives + # false negatives)
A precise classifier is selective
A classifier with high recall is inclusive
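
The two ratios are easy to compute from predicted and true labels. In this sketch the function name `precision_recall`, the 0/1 label encoding, and the zero-denominator convention are assumptions made for illustration.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = positive, 0 = negative)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    # Convention assumed here: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
selective = [1, 0, 0, 0, 0, 0, 0, 0]   # few positive predictions
inclusive = [1, 1, 1, 1, 1, 1, 0, 0]   # many positive predictions

p_sel, r_sel = precision_recall(y_true, selective)   # precision 1.0, recall 0.25
p_inc, r_inc = precision_recall(y_true, inclusive)   # precision 4/6, recall 1.0
```

The selective classifier never raises a false alarm but misses three of four positives; the inclusive one finds every positive at the cost of two false alarms.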
REDUCING FALSE POSITIVE RATE
[Figure: true concept and learned concept in the (x1, x2) plane]

REDUCING FALSE NEGATIVE RATE
[Figure: true concept and learned concept in the (x1, x2) plane]
PRECISION-RECALL CURVES
Measure Precision vs Recall as the classification boundary is tuned
[Figure: precision and recall axes, comparing a perfect classifier with actual performance]
[Figure: points along the curve labeled “Penalize false negatives”, “Equal weight”, and “Penalize false positives”]
[Figure: curves annotated “Better learning performance”]
OPTION 1: CLASSIFICATION THRESHOLDS
Many learning algorithms (e.g., linear models,
NNets, BNs, SVM) give real-valued output v(x)
that needs thresholding for classification
v(x) > t => positive label given to x
v(x) < t => negative label given to x
 May want to tune threshold to get fewer false
positives or false negatives
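
Sweeping the threshold t traces out points on a precision-recall curve. The scores and labels below are invented for illustration, and treating v(x) = t as negative is an assumption, since only the strict inequalities are specified above.

```python
def confusion_at(scores, labels, t):
    """Count (tp, fp, fn) when v(x) > t is labeled positive."""
    tp = sum(v > t and y == 1 for v, y in zip(scores, labels))
    fp = sum(v > t and y == 0 for v, y in zip(scores, labels))
    fn = sum(v <= t and y == 1 for v, y in zip(scores, labels))  # v == t counted as negative
    return tp, fp, fn

# Hypothetical real-valued outputs v(x) and true labels.
scores = [0.95, 0.80, 0.70, 0.55, 0.45, 0.30, 0.20, 0.05]
labels = [1,    1,    0,    1,    0,    1,    0,    0]

curve = []
for t in [0.1, 0.5, 0.75]:
    tp, fp, fn = confusion_at(scores, labels, t)
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention for no predictions
    recall = tp / (tp + fn)
    curve.append((t, precision, recall))
```

On this data, raising t from 0.1 to 0.75 moves precision from 4/7 up to 1.0 while recall drops from 1.0 to 0.5: fewer false positives, more false negatives.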

OPTION 2: LOSS FUNCTIONS & WEIGHTED DATASETS
General learning problem: “Given data D and loss function L, find the hypothesis from hypothesis class H that minimizes L”
Loss functions: L may contain weights to favor accuracy on positive or negative examples
  E.g., L = 10 E+ + 1 E-
Weighted datasets: attach a weight w to each example to indicate how important it is
  Or construct a resampled dataset D’ where each example is duplicated proportionally to its w
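
Both ideas fit in a few lines. The sketch assumes E+ and E- are the numbers of misclassified positive and negative examples (the slide does not define them), and that weights are small integers so duplication is exact; the function names are made up for the example.

```python
def weighted_loss(y_true, y_pred, w_pos=10, w_neg=1):
    """L = w_pos * E+ + w_neg * E-, with E+ / E- the error counts on positives / negatives."""
    e_pos = sum(t == 1 and p != 1 for t, p in zip(y_true, y_pred))
    e_neg = sum(t == 0 and p != 0 for t, p in zip(y_true, y_pred))
    return w_pos * e_pos + w_neg * e_neg

def resample(examples, weights):
    """Build D' by repeating each example w times (integer weights assumed)."""
    return [ex for ex, w in zip(examples, weights) for _ in range(w)]

y_true = [1, 1, 0, 0, 0, 0]
h_a = [0, 1, 0, 0, 0, 0]   # hypothesis A misses one positive
h_b = [1, 1, 1, 1, 0, 0]   # hypothesis B raises two false alarms

loss_a = weighted_loss(y_true, h_a)   # 10 * 1 + 1 * 0
loss_b = weighted_loss(y_true, h_b)   # 10 * 0 + 1 * 2

examples = list(zip(range(6), y_true))
d_prime = resample(examples, [10 if y == 1 else 1 for y in y_true])
```

Hypothesis A makes fewer mistakes (one versus two) yet has the larger weighted loss, so a learner minimizing L would prefer B. An unweighted learner trained on `d_prime` sees the same preference, because each positive example now appears ten times.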
MODEL SELECTION
COMPLEXITY VS. GOODNESS OF FIT
More complex models can fit the data better, but
can overfit
 Model selection: enumerate several possible
hypothesis classes of increasing complexity, stop
when cross-validated error levels off
 Regularization: explicitly define a metric of
complexity and penalize it in addition to loss

MODEL SELECTION WITH K-FOLD CROSS-VALIDATION
Parameterize learner by a complexity level C
Model selection pseudocode:

  For increasing levels of complexity C:
    errT[C], errV[C] = Cross-Validate(Learner, C, examples)
    If errT has converged:
      Find value Cbest that minimizes errV[C]
      Return Learner(Cbest, examples)
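
A runnable version of this loop, under several assumptions: the learner is a hypothetical piecewise-constant regressor whose complexity C is its number of bins, the data are synthetic, and the "errT has converged" test is replaced by simply enumerating a fixed list of complexity levels.

```python
import random

def fit_bins(xs, ys, c):
    """Piecewise-constant regressor on [0, 1) with c equal-width bins."""
    sums, counts = [0.0] * c, [0] * c
    for x, y in zip(xs, ys):
        b = min(int(x * c), c - 1)
        sums[b] += y
        counts[b] += 1
    default = sum(ys) / len(ys)          # fallback value for empty bins
    return [s / n if n else default for s, n in zip(sums, counts)]

def mse(model, xs, ys):
    c = len(model)
    return sum((model[min(int(x * c), c - 1)] - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cross_validate(c, xs, ys, k=5):
    """Return (training error, validation error) averaged over k folds."""
    err_t = err_v = 0.0
    for i in range(k):
        tr = [j for j in range(len(xs)) if j % k != i]
        va = [j for j in range(len(xs)) if j % k == i]
        model = fit_bins([xs[j] for j in tr], [ys[j] for j in tr], c)
        err_t += mse(model, [xs[j] for j in tr], [ys[j] for j in tr]) / k
        err_v += mse(model, [xs[j] for j in va], [ys[j] for j in va]) / k
    return err_t, err_v

# Hypothetical data: a noisy step function on [0, 1).
rng = random.Random(0)
xs = [rng.random() for _ in range(200)]
ys = [(1.0 if x > 0.5 else 0.0) + rng.gauss(0, 0.3) for x in xs]

levels = [1, 2, 4, 8, 16, 32, 64]        # increasing complexity C
err_t, err_v = {}, {}
for c in levels:
    err_t[c], err_v[c] = cross_validate(c, xs, ys)

c_best = min(levels, key=lambda c: err_v[c])
final_model = fit_bins(xs, ys, c_best)   # retrain on all examples with the chosen C
```

Training error keeps falling as bins are added, so it cannot be used to pick C; the validation error is what identifies the level at which extra complexity stops helping.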

REGULARIZATION

Minimize:
  $\mathrm{Cost}(h) = \mathrm{Loss}(h) + \lambda \, \mathrm{Complexity}(h)$
Example with linear models $y = \theta^T x$:
  L2 error: $\mathrm{Loss}(\theta) = \sum_i (y^{(i)} - \theta^T x^{(i)})^2$
  Lq regularization: $\mathrm{Complexity}(\theta) = \sum_j |\theta_j|^q$
  L2 and L1 are most popular in linear regularization
L2 regularization leads to simple computation of optimal $\theta$
L1 is more complex to optimize, but produces sparse models in which many coefficients are 0!
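
The contrast between L2 and L1 shows up even with a single coefficient. The sketch below assumes a one-attribute model y = theta * x and calls the weight on the complexity term `lam`; both the function name and the toy data are invented for illustration. For q = 2 the minimizer comes from setting the derivative to zero; for q = 1 it is the standard soft-thresholding solution.

```python
def fit_one_coefficient(xs, ys, lam=0.0, q=2):
    """Minimize sum_i (y_i - theta * x_i)^2 + lam * |theta|^q for scalar theta, q in {1, 2}."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    if q == 2:
        # Derivative -2 * (sxy - theta * sxx) + 2 * lam * theta = 0 gives a closed form.
        return sxy / (sxx + lam)
    if q == 1:
        # Soft-thresholding: shrink sxy toward 0 by lam / 2, clipping at exactly 0.
        shrunk = max(abs(sxy) - lam / 2, 0.0)
        return (shrunk if sxy >= 0 else -shrunk) / sxx
    raise ValueError("only q = 1 or q = 2 is handled in this sketch")

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # y = 2x exactly

theta_ls = fit_one_coefficient(xs, ys)               # no penalty: 2.0
theta_l2 = fit_one_coefficient(xs, ys, lam=14, q=2)  # shrunk but nonzero: 1.0
theta_l1 = fit_one_coefficient(xs, ys, lam=56, q=1)  # driven exactly to 0.0
```

L2 shrinks the coefficient smoothly and never reaches zero for finite `lam`, whereas a large enough L1 penalty sets it to exactly zero, which is where the sparsity of L1-regularized models comes from.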

DATA DREDGING
As the number of attributes increases, so does the likelihood that a learner picks up on patterns that arise purely from chance
In the extreme case where there are more attributes than datapoints (e.g., pixels in a video), even very simple hypothesis classes can overfit
  E.g., linear classifiers
Many opportunities for charlatans in the big data age!
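
The overfitting claim for linear classifiers can be checked on pure noise. This sketch uses hypothetical sizes (10 datapoints, 50 attributes), Gaussian attributes, and coin-flip labels, and trains a perceptron as the linear classifier; none of these choices come from the lecture.

```python
import random

rng = random.Random(0)
n, d = 10, 50                      # more attributes than datapoints
X = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
y = [rng.choice([-1, 1]) for _ in range(n)]   # labels are coin flips: nothing to learn

# Perceptron: a linear classifier sign(w . x), updated on every mistake.
w = [0.0] * d
for epoch in range(1000):
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * sum(wj * xj for wj, xj in zip(w, xi)) <= 0:
            w = [wj + yi * xj for wj, xj in zip(w, xi)]
            mistakes += 1
    if mistakes == 0:
        break

train_correct = sum(yi * sum(wj * xj for wj, xj in zip(w, xi)) > 0 for xi, yi in zip(X, y))
train_accuracy = train_correct / n
```

The classifier reaches 100% training accuracy even though the labels are random, so a perfect fit here says nothing about performance on new data.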
OTHER TOPICS IN MACHINE LEARNING

Unsupervised learning
  Dimensionality reduction
  Clustering

Reinforcement learning
  Agent that acts and learns how to act in an environment by observing rewards

Learning from demonstration
  Agent that learns how to act in an environment by observing demonstrations from an expert
ISSUES IN PRACTICE
The distinctions between learning algorithms diminish when you have a lot of data
The web has made it much easier to gather large-scale datasets than in the early days of ML
Understanding data with many more attributes than examples is still a major challenge!
  Do humans just have really great priors?
NEXT LECTURES
Temporal sequence models (R&N 15)
 Decision-theoretic planning
 Reinforcement learning
 Applications of AI
