Evaluating Classifiers
Lecture 2
Instructor: Max Welling
Read chapter 5
Evaluation
Given:
• Hypothesis h(x): XC, in hypothesis space H,
mapping attributes x to classes c=[1,2,3,...C]
• A data-sample S(n) of size n.
Questions:
• What is the error of “h” on unseen data?
• If we have two competing hypotheses, which one is better
on unseen data?
• How do we compare two learning algorithms in the face of limited data?
• How certain are we about our answers?
Sample and True Error
We can define two errors:
1) Error(h|S) is the error on the sample S:
$\mathrm{error}(h \,|\, S) = \frac{1}{n} \sum_{i=1}^{n} \delta\big[\, h(x_i) \neq y_i \,\big]$
2) Error(h|P) is the true error on the unseen data sampled from the
distribution P(x):
$\mathrm{error}(h \,|\, P) = \int dx\, P(x)\, \delta\big[\, h(x) \neq f(x) \,\big]$
where f(x) is the true hypothesis.
Binomial Distributions
• Assume you toss a coin n times.
• And it has probability p of coming up heads (which we will call success).
• What is the probability distribution governing the number of heads in n trials?
• Answer: the Binomial distribution.
$p(\#\text{heads} = r \,|\, p, n) = \frac{n!}{r!\,(n-r)!}\; p^{r} (1-p)^{n-r}$
Distribution over Errors
• Consider some hypothesis h(x)
• Draw n samples X~P(X).
• Do this k times.
• Compute e1=n*error(h|X1), e2=n*error(h|X2),...,ek=n*error(h|Xk).
• {e1,...,ek} are samples from a Binomial distribution !
• Why? Imagine a magic coin, where God secretly determines the probability
of heads by the following procedure. First He takes some random hypothesis h.
Then He draws x~P(x) and observes whether h(x) misclassifies the example.
If it does, He makes sure the coin lands heads up...
• You have a single sample S, for which you observe
e(S) errors. What do you think would be a reasonable estimate for error(h|P)?
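Below is a minimal simulation sketch (not from the slides; the error rate p_true and sample sizes are hypothetical) illustrating that the error counts e1,...,ek over repeated sample sets follow a Binomial distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = 0.2      # hypothetical true error rate error(h|P)
n, k = 50, 10000  # size of each sample set, number of repeated sample sets

# Each sample set X_i contributes e_i = number of points misclassified by h.
# Each point is an independent Bernoulli(p_true) "error" event:
errors = rng.binomial(n=1, p=p_true, size=(k, n)).sum(axis=1)

# The counts e_1,...,e_k are distributed as Binomial(n, p_true):
print(errors.mean(), n * p_true)                 # ~ np
print(errors.var(), n * p_true * (1 - p_true))   # ~ np(1-p)
```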
Binomial Moments
$\mathrm{mean}(r) = E[r \,|\, n, p] = np$
$\mathrm{var}(r) = E\big[(r - E[r])^2\big] = np\,(1-p)$
• If we match the mean, np, with the observed value n·error(h|S) we find:
$E[\mathrm{error}(h|P)] = E[r/n] = p \approx \mathrm{error}(h|S)$
• If we match the variance we can obtain an estimate of the width:
$\mathrm{var}[\mathrm{error}(h|P)] = \mathrm{var}[r/n] \approx \dfrac{\mathrm{error}(h|S)\,\big(1 - \mathrm{error}(h|S)\big)}{n}$
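A small worked example with hypothetical numbers (not from the slides): suppose h makes 12 errors on a test sample of size n = 100. Then
\[
\widehat{\mathrm{error}}(h|P) = \frac{12}{100} = 0.12,
\qquad
\sqrt{\mathrm{var}} \approx \sqrt{\frac{0.12 \times 0.88}{100}} \approx 0.032 .
\]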
Confidence Intervals
• We would like to state:
With N% confidence we believe that error(h|P) is contained in the interval:
$\mathrm{error}(h|P) \in \mathrm{error}(h|S) \pm z_N \sqrt{\dfrac{\mathrm{error}(h|S)\,\big(1 - \mathrm{error}(h|S)\big)}{n}}$
For example, for an 80% interval under the standard Normal(0,1), $z_{0.8} \approx 1.28$.
• In principle $z_N$ is hard to compute exactly, but for np(1-p) > 5 or n > 30 it is safe to
approximate a Binomial by a Gaussian for which we can easily compute “z-values”.
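A minimal sketch (a hypothetical helper, not from the slides) computing this normal-approximation confidence interval from an observed sample error:

```python
import math

# z-values for common confidence levels of the standard Normal(0,1)
Z = {0.80: 1.28, 0.90: 1.64, 0.95: 1.96, 0.98: 2.33, 0.99: 2.58}

def error_confidence_interval(sample_error, n, confidence=0.95):
    """Normal-approximation interval for error(h|P); valid for large n."""
    half_width = Z[confidence] * math.sqrt(sample_error * (1.0 - sample_error) / n)
    return sample_error - half_width, sample_error + half_width

# Example: 12 errors observed on a test sample of 100 points, 80% confidence
print(error_confidence_interval(0.12, 100, confidence=0.80))
```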
Bias-Variance
• The estimator is unbiased if $E[\mathrm{error}(h|X)] = p$
• Imagine again you have infinitely many sample sets X1,X2,.. of size n.
• Use these to compute estimates E1,E2,... of p where Ei=error(h|Xi)
• If the average of E1,E2,.. converges to p, then error(h|X) is an unbiased estimator.
• Two unbiased estimators can still differ in their
variance (efficiency). Which one do you prefer?
[Figure: two unbiased estimators, both centered on the true value p; E_av marks the average of the estimates.]
Flow of Thought
• Determine the property you want to know about the future data (e.g. error(h|P))
• Find an unbiased estimator E for this quantity based on observing data X (e.g. error(h|X))
• Determine the distribution P(E) of E under the assumption you have infinitely
many sample sets X1,X2,...of some size n. (e.g. p(E)=Binomial(p,n), p=error(h|P))
• Estimate the parameters of P(E) from an actual data sample S (e.g. p=error(h|S))
• Compute the mean and variance of P(E) and pray that P(E) is close to a Normal distribution.
(sums of random variables converge to normal distributions – central limit theorem)
• State your confidence interval as: with confidence N%, error(h|P) is contained in the interval
$Y = \mathrm{mean} \pm z_N \sqrt{\mathrm{var}}$
Assumptions
• We only consider discrete valued hypotheses (i.e. classes)
• Training data and test data are drawn IID from the same distribution P(x).
(IID: independently & identically distributed)
• The hypothesis must be chosen independently from the data sample S!
Homework: Consider training a classifier h(x) on data S.
Argue why the third assumption is violated.
Evaluating Learned h(x)
• If we learn h(x) from a sample Sn, it is a bad idea to evaluate it on the same data.
(You will be overly confident of yourself).
• Instead, use the following procedure:
• Split the data into k subsets s1,...,sk of size ni = N/k.
• Learn a hypothesis on the complement S − si.
• Compute error(h|si) on the left-out subset si.
• After you have finished, compute the total error as:
$\mathrm{error}(h|P) \approx \mathrm{error}(h|S) = \frac{1}{n} \sum_{i=1}^{k} n_i \,\mathrm{error}(h|s_i)$
• Compute the variance as:
$\mathrm{var}[\mathrm{error}(h|P)] \approx \dfrac{\mathrm{error}(h|S)\,\big(1 - \mathrm{error}(h|S)\big)}{n}$
• This “avoids” violating assumption 3.
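A minimal sketch of this procedure (not the course's code; it assumes scikit-learn is available, and LogisticRegression is just a placeholder hypothesis class):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def cross_validated_error(X, y, k=10):
    """Estimate error(h|P) by averaging errors over k held-out subsets s_i."""
    n = len(y)
    total_mistakes = 0
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        h = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])  # learn on S - s_i
        total_mistakes += np.sum(h.predict(X[test_idx]) != y[test_idx])        # errors on s_i
    error = total_mistakes / n        # equals (1/n) * sum_i n_i * error(h|s_i)
    var = error * (1 - error) / n     # binomial variance estimate
    return error, var
```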
Comparing 2 Hypotheses
• Assume we would like to compare 2 hypotheses h1 and h2, which we have
tested on two independent samples S1 and S2 of size n1 and n2.
• I.e. we are interested in the quantity: $d = \mathrm{error}(h_1|P) - \mathrm{error}(h_2|P)$
• Define an estimator for d: $\hat{d} = \mathrm{error}(h_1|X_1) - \mathrm{error}(h_2|X_2)$
with X1, X2 sample sets of size n1, n2.
• Since error(h1|S1) and error(h2|S2) are both approximately Normal,
their difference is approximately Normal with:
$\mathrm{mean} \approx \mathrm{error}(h_1|S_1) - \mathrm{error}(h_2|S_2)$
$\mathrm{var} \approx \dfrac{\mathrm{error}(h_1|S_1)\big(1 - \mathrm{error}(h_1|S_1)\big)}{n_1} + \dfrac{\mathrm{error}(h_2|S_2)\big(1 - \mathrm{error}(h_2|S_2)\big)}{n_2}$
• Hence, with N% confidence we believe that d is contained in the interval:
$d \in \mathrm{mean} \pm z_N \sqrt{\mathrm{var}}$
Say, mean = -0.1, sqrt(var)=0.5. Z(0.8)=1.28. Do you want to conclude that h1 is significantly better than h2?
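A small sketch (hypothetical helper and numbers, not from the slides) for computing such an interval on d from two independent test samples:

```python
import math

def difference_interval(err1, n1, err2, n2, z=1.28):
    """Normal-approximation interval for d = error(h1|P) - error(h2|P)."""
    mean = err1 - err2
    var = err1 * (1 - err1) / n1 + err2 * (1 - err2) / n2
    half_width = z * math.sqrt(var)
    return mean - half_width, mean + half_width

# Hypothetical example: h1 makes 15/100 errors on S1, h2 makes 25/100 errors on S2
print(difference_interval(0.15, 100, 0.25, 100, z=1.28))
```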
Paired Tests
• It is more likely that we want to compare 2 hypotheses on the same data.
• E.g. say we have a single data-set S and two hypotheses h1, h2. Split the data
again into subsets s1,...,sk:
error(h1|s1) = 0.10   error(h2|s1) = 0.13
error(h1|s2) = 0.20   error(h2|s2) = 0.21
error(h1|s3) = 0.66   error(h2|s3) = 0.68
error(h1|s4) = 0.45   error(h2|s4) = 0.47
... and so on.
• We have var(error(h1)) = large and var(error(h2)) = large.
However, h1 is consistently better than h2.
• We should look at the differences error(h1|si) − error(h2|si),
not at the difference error(h1|S) − error(h2|S).
• Problem: we cannot use
$\mathrm{var} \approx \mathrm{error}(h_1|s_i)\big(1 - \mathrm{error}(h_1|s_i)\big)/n_i + \mathrm{error}(h_2|s_i)\big(1 - \mathrm{error}(h_2|s_i)\big)/n_i$
because that assumes the errors are independent.
Since they are estimates on the same data, they are not independent.
• Solution: a “paired t-test” (next slide)
Paired t-test
• Chunk the data up into subsets s1,...,sk with |si| > 30.
• On each subset compute the errors and their difference: $\delta_i = \mathrm{error}(h_1|s_i) - \mathrm{error}(h_2|s_i)$
• Now compute:
$\bar{\delta} = \frac{1}{k} \sum_{i=1}^{k} \delta_i$
$s(\bar{\delta}) = \sqrt{\dfrac{1}{k(k-1)} \sum_{i=1}^{k} \big(\delta_i - \bar{\delta}\big)^2}$
• State: With N% confidence the difference in error between h1 and h2 is:
$\bar{\delta} \pm t_{N,k-1}\, s(\bar{\delta})$
• “t” is the t-statistic, which is related to the Student-t distribution (table 5.6).
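A minimal sketch of this computation (hypothetical helper; the t-value lookup assumes SciPy is available):

```python
import numpy as np
from scipy import stats

def paired_error_interval(errors_h1, errors_h2, confidence=0.95):
    """Paired t-test interval for the mean error difference between h1 and h2."""
    deltas = np.asarray(errors_h1) - np.asarray(errors_h2)   # delta_i per subset s_i
    k = len(deltas)
    delta_bar = deltas.mean()
    s = np.sqrt(np.sum((deltas - delta_bar) ** 2) / (k * (k - 1)))
    t = stats.t.ppf(0.5 + confidence / 2.0, df=k - 1)         # two-sided t_{N, k-1}
    return delta_bar - t * s, delta_bar + t * s

# Per-subset errors from the toy numbers on the previous slide
print(paired_error_interval([0.10, 0.20, 0.66, 0.45], [0.13, 0.21, 0.68, 0.47]))
```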
Comparing Learning Algorithms
• Again, we split the data into k subsets: S → {s1, s2, ..., sk}.
• Train both learning algorithm 1 (L1) and learning algorithm 2 (L2) on the complement
of each subset: {S−s1, S−s2, ...} to produce hypotheses {L1(S−si), L2(S−si)} for all i.
• Compute for all i: $\delta_i = \mathrm{error}\big(L_1(S - s_i) \,|\, s_i\big) - \mathrm{error}\big(L_2(S - s_i) \,|\, s_i\big)$
• As in the last slide, perform a paired t-test on these differences to compute an
estimate and a confidence interval for the relative error of the hypotheses produced
by L1 and L2 (see the sketch below).
Homework: Are all assumptions for the statistical test respected? If not, find one that is violated.
Do you think that this estimate is correct, optimistic or pessimistic?
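A minimal sketch of this procedure (not the course's code; the two learning algorithms are hypothetical placeholders, scikit-learn is assumed to be available, and paired_error_interval is the helper sketched after the previous slide):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def compare_learners(X, y, k=10):
    """Collect per-fold errors of L1 and L2, each trained on S - s_i and tested on s_i."""
    errs1, errs2 = [], []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        h1 = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])       # L1(S - s_i)
        h2 = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])  # L2(S - s_i)
        errs1.append(np.mean(h1.predict(X[test_idx]) != y[test_idx]))
        errs2.append(np.mean(h2.predict(X[test_idx]) != y[test_idx]))
    return paired_error_interval(errs1, errs2)   # interval for error(L1) - error(L2)
```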
Evaluation: ROC curves
[Figure: score distributions for class 0 (negatives) and class 1 (positives), separated by a moving threshold.]
• TP = true positive rate = # positives classified as positive / # positives
• FP = false positive rate = # negatives classified as positive / # negatives
• TN = true negative rate = # negatives classified as negative / # negatives
• FN = false negative rate = # positives classified as negative / # positives
• Identify a threshold in your classifier that you can shift.
Plot the ROC curve (TP rate against FP rate) while you shift that parameter.
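A minimal sketch (hypothetical scores and labels, not from the slides) of tracing out an ROC curve by sweeping a threshold over classifier scores:

```python
import numpy as np

def roc_curve_points(scores, labels):
    """Return (FP rate, TP rate) pairs obtained by sweeping a decision threshold."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    n_pos, n_neg = np.sum(labels == 1), np.sum(labels == 0)
    points = []
    for thr in np.sort(np.unique(scores))[::-1]:    # move the threshold from high to low
        predicted_pos = scores >= thr
        tp_rate = np.sum(predicted_pos & (labels == 1)) / n_pos
        fp_rate = np.sum(predicted_pos & (labels == 0)) / n_neg
        points.append((fp_rate, tp_rate))
    return points

# Hypothetical scores (e.g. predicted probability of class 1) and true labels
print(roc_curve_points([0.9, 0.8, 0.6, 0.4, 0.3, 0.1], [1, 1, 0, 1, 0, 0]))
```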
Conclusion
Never (ever) draw error-curves without confidence intervals
(The second most important sentence of this course)
Appendix
The following slide is optional:
Hypothesis Tests
• Suppose you want to compare, again, two hypotheses h1 and h2 on
sample sets S1 and S2 of size n1 and n2.
• Now we like to answer: will h2 have significantly less error on future data than h1?
• We can formulate this as: is the probability P(d>0) larger than 95% (say).
• Since $\hat{d} \approx \mathrm{mean}(d)$, this means: what is the total probability
that we observe $\hat{d} = 0.1$ when d > 0?
• This can be computed as:
$\int_{d>0} dd\; P\big(\hat{d} = 0.1 \,\big|\, \mathrm{mean} = d\big) = \int_{-\infty}^{0.1} d\hat{d}\; P\big(\hat{d} \,\big|\, \mathrm{mean} = 0\big)$
(move the Gaussian to the right)
[Figure: the density $P(\hat{d})$ with its mean marked and the observed value $\hat{d} = 0.1$ indicated.]
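A small sketch of this computation under the Gaussian approximation (hypothetical values for d̂ and its standard deviation; the standard normal CDF is written via math.erf):

```python
import math

def prob_d_positive(d_hat, std):
    """P(d > 0) under the Gaussian approximation: Phi(d_hat / std)."""
    return 0.5 * (1.0 + math.erf(d_hat / (std * math.sqrt(2.0))))

# Hypothetical: observed d_hat = 0.1 with standard deviation 0.05
print(prob_d_positive(0.1, 0.05))   # ~0.977, i.e. above the 95% level
```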