
Error estimation
Data Mining II
Year 2009-10
Lluís Belanche
Alfredo Vellido
Error estimation
• Introduction
• Resampling methods:
  • The Holdout
  • Cross-validation
    • Random subsampling
    • k-fold Cross-validation
    • Leave-one-out
  • The Bootstrap
• Error evaluation
• Accuracy and all that
Bias and variance estimates with
the bootstrap
Example: estimating bias & variance
Three-way data splits (1)
Three-way data splits (2)
Summary (data sample of size n)

• Resubstitution:
  • optimistically-biased estimate
  • especially when the ratio of n to dimension is small
• Holdout (if iterated we get Random subsampling):
  • pessimistically-biased estimate
  • different partitions yield different estimates
• K-fold CV (K « n):
  • higher bias than LOOCV; lower than holdout
  • lower variance than LOOCV
• LOOCV (n-fold CV): unbiased, but large variance
• Bootstrap:
  • lower variance than LOOCV
  • useful for very small n
  • computational burden
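As a rough illustration of how these estimators behave in practice, here is a minimal Python sketch. It assumes scikit-learn and NumPy are available; the synthetic data set and the 1-nearest-neighbour classifier are arbitrary choices made only for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import (train_test_split, cross_val_score,
                                     KFold, LeaveOneOut)

# Arbitrary synthetic sample S of size n = 100 (illustration only)
X, y = make_classification(n_samples=100, n_features=10, random_state=0)
clf = KNeighborsClassifier(n_neighbors=1)

# Resubstitution: train and test on the same data (optimistically biased)
resub_err = 1 - clf.fit(X, y).score(X, y)

# Holdout: a single train/test split (pessimistically biased, split-dependent)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
holdout_err = 1 - clf.fit(X_tr, y_tr).score(X_te, y_te)

# K-fold cross-validation (K << n)
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
kfold_err = 1 - cross_val_score(clf, X, y, cv=kfold).mean()

# Leave-one-out (n-fold CV): nearly unbiased, high variance, n model fits
loo_err = 1 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

# Bootstrap: train on a resample drawn with replacement,
# test on the observations left out of that resample
rng = np.random.default_rng(0)
boot_errs = []
for _ in range(200):
    idx = rng.integers(0, len(X), size=len(X))       # bootstrap sample
    oob = np.setdiff1d(np.arange(len(X)), idx)       # out-of-bag indices
    if len(oob) == 0:
        continue
    clf.fit(X[idx], y[idx])
    boot_errs.append(1 - clf.score(X[oob], y[oob]))
boot_err = np.mean(boot_errs)

print(resub_err, holdout_err, kfold_err, loo_err, boot_err)
```

With a 1-NN classifier the resubstitution error comes out as 0, which is exactly the optimism the summary warns about.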
Error Evaluation
Given:
• Hypothesis h(x): XC, in hypothesis space H,
mapping features x to a number of classes
• A data sample S of size n
Questions:
• What is the error of h on unseen data?
• If we have two competing hypotheses, which one will be better on unseen data?
• How do we compare two learning algorithms in the face of limited data?
• How certain are we about the answers to these questions?
Apparent & True Error
We can define two errors:
1) Error(h|S) is the apparent error, measured on the sample S:
$$\mathrm{error}(h \mid S) \;=\; \frac{1}{n}\sum_{i=1}^{n}\big[\,h(x_i) \neq y_i\,\big]$$
2) Error(h|P) is the true error on data sampled from the
distribution P(x):
$$\mathrm{error}(h \mid P) \;=\; \int dx\, P(x)\,\big[\,h(x) \neq f(x)\,\big]$$
where f(x) is the true hypothesis.
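To make the bracket notation concrete: [·] is 1 when its argument is true and 0 otherwise, so the apparent error is simply the fraction of misclassified sample points. A minimal NumPy sketch (the arrays are made-up illustrative values):

```python
import numpy as np

# Hypothetical predictions h(x_i) and true labels y_i on a sample S
predictions = np.array([0, 1, 1, 0, 1, 1, 0, 0])
labels      = np.array([0, 1, 0, 0, 1, 1, 1, 0])

# error(h | S) = (1/n) * sum over i of [h(x_i) != y_i]
apparent_error = np.mean(predictions != labels)
print(apparent_error)  # 2 misclassifications out of 8 -> 0.25
```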
A note on True Error
• True Error need not be zero!
  • Not even if we knew the probabilities P(x)
• Causes:
  • Lack of relevant features
  • Intrinsic randomness of the process
A consequence of this is that we should not attempt to fit hypotheses
with zero apparent error, i.e. error(h|S) = 0 !!!
Quite on the contrary, we should favor those hypotheses s.t. error(h|S) ≈ error(h|P)
If error(h|S) >> error(h|P), then h is underfitting the sample S
If error(h|S) << error(h|P), then h is overfitting the sample S
How to estimate True Error (te)?
• Split the sample S into a training part and a test set TE, and estimate te by the error of h on TE
• Note that this estimate, t̂e, is a r.v. ⇒ confidence interval (CI)
• Let TE⁻ be the subset of TE wrongly predicted by h
• Let n = |S|, t = |TE|
• |TE⁻| follows a binomial distribution B(te, t)
• The ML estimation of te is t̂e = |TE⁻| / t
• This estimator is unbiased:
  E[t̂e] = te
  Var[t̂e] = te(1 − te)/t
Confidence Intervals for te
“With N% confidence te = error(h|P) is contained in the interval:”

t̂e − s ≤ te ≤ t̂e + s,  where s = zN √(t̂e(1 − t̂e)/t)

In words, te is within zN standard errors of the estimation.
This is because, for te(1 − te)·t > 5 or t > 30, it is safe
to approximate a Binomial by a Gaussian, for
which we can compute “z-values”.
[Figure: Normal(0,1) density; 80% of the probability mass lies within ±1.28, i.e. z0.8 ≈ 1.28]
Example 1
• n = |S| = 1,000; t = |TE| = 250 (25% of S)
• Suppose |TE⁻| = 50 (our h hits 80% of TE)
• Then t̂e = 0.2. For a CI at the 95% level: z0.95 = 1.967, and te is in [0.15, 0.25]
• Exercise: recompute the CI at the 99% level, using z0.99 = 2.326
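A quick Python check of this interval (plain NumPy; the counts are those of Example 1 and the z-value is the one used in these slides):

```python
import numpy as np

t = 250                          # size of the test set TE
errors = 50                      # |TE-|, misclassified test points
te_hat = errors / t              # ML estimate of the true error: 0.2

z95 = 1.967                      # z-value used in the slides for 95% confidence
s = z95 * np.sqrt(te_hat * (1 - te_hat) / t)   # zN standard errors

print(te_hat - s, te_hat + s)    # roughly [0.15, 0.25]
```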
Example 2: comparing two hypotheses
• Assume we need to compare 2 hypotheses h1 and h2 on the same data
• We have t = |TE| = 100, on which h1 makes 10 errors and h2 makes 13
• The CIs at the 95% (α = 0.05) level are:
  • [0.04, 0.16] for h1
  • [0.06, 0.20] for h2
• We cannot conclude that h1 is better than h2
• Note: the above is written 10% ± 6% (h1), 13% ± 7% (h2)
Size does matter after all …
• How large would TE need to be (say T) to affirm that h1 is better than h2?
• Assume both h1 and h2 keep the same accuracy
• Force the UL of the CI for h1 to be below the LL of the CI for h2:
  • UL of CI for h1 is 0.10 + 1.967 √(0.1 · 0.9 / T)
  • LL of CI for h2 is 0.13 − 1.967 √(0.13 · 0.87 / T)
  • It turns out that T > 1,742 (the old size was 100!!!)
• The probability that this fails is at most (1 − α)/2
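The bound on T follows from rearranging and squaring that inequality. A quick way to reproduce it numerically (a sketch in plain Python, using the same error rates and z-value as above):

```python
import math

z = 1.967
e1, e2 = 0.10, 0.13   # assumed error rates of h1 and h2 on the larger test set

# Require: e1 + z*sqrt(e1*(1-e1)/T) < e2 - z*sqrt(e2*(1-e2)/T)
# i.e.     z*(sqrt(e1*(1-e1)) + sqrt(e2*(1-e2))) / sqrt(T) < e2 - e1
lhs = z * (math.sqrt(e1 * (1 - e1)) + math.sqrt(e2 * (1 - e2)))
T_min = (lhs / (e2 - e1)) ** 2

print(math.ceil(T_min))   # close to the T > 1,742 quoted above
```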
Paired t-test
• Chunk the data set S up in subsets s1,...,sk with |si |>30
• Design classifiers h1, h2 on every S\si
• On each subset si compute the errors and define:
$$\delta_i \;=\; \mathrm{error}(h_1 \mid s_i) \;-\; \mathrm{error}(h_2 \mid s_i)$$
• Now compute:

$$\bar{\delta} \;=\; \frac{1}{k}\sum_{i=1}^{k}\delta_i
\qquad\qquad
s(\bar{\delta}) \;=\; \sqrt{\frac{1}{k(k-1)}\sum_{i=1}^{k}\big(\delta_i - \bar{\delta}\big)^2}$$
• With N% confidence the difference in error between h1 and h2 is:

$$\bar{\delta} \;\pm\; t_{N,\,k-1}\, s(\bar{\delta})$$

• “tN,k−1” is the critical value of the Student-t distribution with k − 1 degrees of freedom
• Since error(h1 | si) and error(h2 | si) are both approximately Normal
their difference is approximately Normal
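A minimal sketch of this procedure in Python (assuming SciPy is available; the two error vectors are made-up placeholders, one entry per subset si):

```python
import numpy as np
from scipy import stats

# Hypothetical per-subset error rates for h1 and h2 (k = 5 subsets)
err_h1 = np.array([0.12, 0.15, 0.11, 0.14, 0.13])
err_h2 = np.array([0.16, 0.18, 0.15, 0.17, 0.16])

deltas = err_h1 - err_h2
k = len(deltas)

d_bar = deltas.mean()                                        # mean difference
s_dbar = np.sqrt(((deltas - d_bar) ** 2).sum() / (k * (k - 1)))  # s(delta_bar)

# 95% confidence interval for the true difference in error
t_crit = stats.t.ppf(0.975, df=k - 1)       # two-sided critical value
print(d_bar - t_crit * s_dbar, d_bar + t_crit * s_dbar)

# Equivalently, SciPy's built-in paired t-test
print(stats.ttest_rel(err_h1, err_h2))
```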
Exercise: the real case …
• A team of doctors has its own classifier and sample data of size 500
  • Split it into TR of size 300 and TE of size 200
  • They get an error of 22% on TE
  • They ask us for further advice …
• We design a second classifier
  • It has an error of 15% on the same TE
Exercise: the real case …
Answer the following questions:
1. Will you affirm that yours is better than theirs?
2. How large would TE need to be to (very reasonably) affirm that
yours is better than theirs?
3. What do you deduce from the above?
4. Suppose we move to 10-fold CV on the entire data set.
   a. Give a new estimation of the error of your classifier
   b. Perform a statistical test to check if there is any real difference
The doctors’ classifier errors: 0.22, 0.22, 0.29, 0.19, 0.23, 0.22, 0.20, 0.25, 0.19, 0.19
Your classifier’s errors: 0.15, 0.17, 0.21, 0.14, 0.13, 0.15, 0.14, 0.19, 0.11, 0.11
What is Accuracy?

Accuracy = No. of correct predictions / No. of predictions
         = (TP + TN) / (TP + TN + FP + FN)
Example

classifier   TP   TN   FP   FN   Accuracy
A            25   25   25   25   50%
B            50   25   25    0   75%
C            25   50    0   25   75%
D            37   37   13   13   74%

• Clearly, B, C, D are all better than A
• Is B better than C, D?
• Is C better than B, D?
• Is D better than B, C?
Accuracy may not tell the whole story
What is Sensitivity (aka Recall)?

Sensitivity (wrt positives) = No. of correct positive predictions / No. of positives
                            = TP / (TP + FN)

Sometimes sensitivity wrt negatives is termed specificity
What is Specificity (aka Precision)?

Precision (wrt positives) = No. of correct positive predictions / No. of positive predictions
                          = TP / (TP + FP)
Precision-Recall Trade-off
• A predicts better than B if A has better recall and precision than B
• There is a trade-off between recall and precision
• In some applications, once you reach a satisfactory precision, you optimize for recall
• In some applications, once you reach a satisfactory recall, you optimize for precision
Comparing prediction performance
• Accuracy is the obvious measure
  • But it conveys the right intuition only when the positive and negative populations are roughly equal in size
• Recall and precision together form a better measure
  • But what do you do when A has better recall than B and B has better precision than A?
F-measure
• The harmonic mean of recall and precision (wrt positives):

F = (2 · recall · precision) / (recall + precision)

classifier   TP   TN   FP   FN   Accuracy   F-measure
A            25   75   75   25   50%        33%
B             0  150    0   50   75%        undefined
C            50    0  150    0   25%        40%
D            30  100   50   20   65%        46%
Accuracy does not accord with intuition here (e.g. B scores 75% accuracy while making no correct positive prediction)
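A short Python sketch that reproduces the accuracy and F-measure columns of this table from the raw counts (plain Python, no extra libraries):

```python
# Confusion-matrix counts (TP, TN, FP, FN) for the four classifiers above
counts = {
    "A": (25, 75, 75, 25),
    "B": (0, 150, 0, 50),
    "C": (50, 0, 150, 0),
    "D": (30, 100, 50, 20),
}

for name, (tp, tn, fp, fn) in counts.items():
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if tp + fn else None        # sensitivity wrt positives
    precision = tp / (tp + fp) if tp + fp else None     # precision wrt positives
    if precision is None or recall is None or recall + precision == 0:
        f_measure = None                                 # undefined, as for classifier B
    else:
        f_measure = 2 * recall * precision / (recall + precision)
    print(name, accuracy, recall, precision, f_measure)
```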
Abstract model of a classifier
• Given a test observation x
  • Compute the prediction h(x)
  • Predict x as negative if h(x) < t
  • Predict x as positive if h(x) > t
• t is the decision threshold of the classifier
• Changing t affects the recall and precision, and hence the accuracy, of the classifier
ROC Curves
• By changing t, we get a range of sensitivities and specificities of a classifier
• This leads to the ROC curve, which plots sensitivity, P(TP), against (1 − specificity), P(FP)
• A predicts better than B if A has better sensitivities than B at most specificities
• Then the larger the area under the ROC curve, the better
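A minimal sketch of how the ROC curve is traced by sweeping the decision threshold t (assuming scikit-learn is available; the scores and labels are made-up illustrative values):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical classifier scores h(x) and true labels for a small test set
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
h_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9, 0.65, 0.3])

# Each threshold t yields one (P(FP), P(TP)) point of the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, h_score)
for t, p_fp, p_tp in zip(thresholds, fpr, tpr):
    print(f"t = {t}: P(FP) = {p_fp:.2f}, P(TP) = {p_tp:.2f}")

# Area under the ROC curve: the larger, the better
print("AUC =", roc_auc_score(y_true, h_score))
```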