CS 367 Tutorial
24 September 2008
Week 8 (tutorial #6)
Carl Schultz
Material is taken from lecture notes (http://www.cs.auckland.ac.nz/compsci367s2c/lectures/index.html).

learning task
o learn a target concept “f” (e.g. given the weather of a day, the function
tells you if Aldo will play sport on that day)
o consider hypotheses in space “H”
o training examples are given to the learner – the trainer chooses each
example independently
- examples are drawn using a probability distribution “D” over all
possible days – e.g. maybe some days are more likely to be chosen
than others (and maybe not)
[Figure: a circle of all possible days – every tiny grey dot is a possible
data point, each blue dot is a training data point, and the probability
distribution D gives the probability of choosing each day]

Sample error
o fraction of examples in a particular sample that the hypothesis
misclassifies
o e.g. out of the 12 blue dots in the figure above, 3 are misclassified
- sample error would be: 3 / 12 = 0.25 (see the sketch below)
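
A minimal Python sketch of the calculation (the weather days, labels, and
hypothesis are invented for illustration – they are not from the lecture
notes):

def sample_error(hypothesis, examples, labels):
    """Fraction of the given examples that the hypothesis misclassifies."""
    errors = sum(1 for x, y in zip(examples, labels) if hypothesis(x) != y)
    return errors / len(examples)

# Toy hypothesis: "Aldo plays sport whenever it is sunny."
hypothesis = lambda day: day == "sunny"

# 12 examples (the "blue dots"), 3 of which the hypothesis gets wrong.
examples = ["sunny"] * 6 + ["rainy"] * 6
labels = [True] * 6 + [False] * 3 + [True] * 3  # last 3 are misclassified

print(sample_error(hypothesis, examples, labels))  # 3 / 12 = 0.25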

True error
o probability that the hypothesis misclassifies an example picked at random
(out of all possible examples) according to distribution D
o true error is what we want – but sample error is the best we can get
o so how good is sample error at estimating true error? (see the sketch
below)
o problems with sample error:
- bias
- variance
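
A hedged sketch of the difference: when D is known exactly we can compute the
true error directly, and sampling from D shows how well sample error estimates
it. The weather categories, probabilities, and target concept below are all
invented for illustration:

import random

days = ["sunny", "rainy", "windy"]
D = [0.5, 0.3, 0.2]                # probability of choosing each day
target = {"sunny": True, "rainy": False, "windy": True}  # true concept f
hypothesis = lambda day: day == "sunny"                  # learnt hypothesis

# True error: sum D(x) over every example x the hypothesis misclassifies.
true_error = sum(p for day, p in zip(days, D)
                 if hypothesis(day) != target[day])
print(true_error)  # 0.2 (only "windy" is misclassified)

# Sample error on a sample drawn according to D estimates the true error.
sample = random.choices(days, weights=D, k=10_000)
estimate = sum(hypothesis(d) != target[d] for d in sample) / len(sample)
print(estimate)    # close to 0.2, but varies from sample to sample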

bias (similar to accuracy)
o is the error between the “true value” f(x) and f̄(x), the average of a
collection of outputs (each from a different hypothesis) from your learner
o archery analogy – bull’s eye is the “true value” we are aiming for and an
arrow is the output from one learnt hypothesis produced by our learner
o both boards below have low bias because the error between the average
of the arrows and the bull’s eye is low
o whereas the following boards have high bias
o we’ll look at f̄(x) more closely – this is the average value, for input x,
given by the different hypotheses that your learner might produce,
depending on the training set
o f_si is a hypothesis learnt on training set “si” (of size m)
o just take the average over all the f_si:
f̄(x) = (f_s1(x) + f_s2(x) + … + f_sl(x)) / l
(see the sketch below)
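
A minimal sketch of computing f̄(x) at one point, assuming a toy 1-D
regression setting; the target f and the deliberately crude mean-predictor
learner are invented so that the bias is clearly visible:

import random

f = lambda x: 2 * x + 1            # the "true value" we are aiming for

def learn(xs, ys):
    """A deliberately biased learner: always predicts the mean label."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y        # hypothesis f_si

def draw_training_set(m):
    xs = [random.uniform(0, 1) for _ in range(m)]
    return xs, [f(x) for x in xs]

l, m, x0 = 1000, 10, 0.9
f_si = [learn(*draw_training_set(m)) for _ in range(l)]

f_bar = sum(h(x0) for h in f_si) / l   # f̄(x0): average over the l hypotheses
bias = f_bar - f(x0)
print(f_bar, f(x0), bias)              # about 2.0 vs 2.8: bias of about -0.8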

machine learning bias
o inductive bias (we’ve seen this before, e.g. candidate-elimination and ID3)
o this allows the learner to generalise beyond the training data
o …but also makes assumptions that might not hold

systematic error bias
o errors that deviate from the true value in a consistent way
o e.g. thermometer that reads 2 degrees higher than the real temperature
o e.g. reading a clock that is running fast
o very difficult to distinguish between systematic error and real patterns in
the data
o often requires independent source of information (e.g. instrument
calibration)

straight-statistical bias
o states that as the training set size gets smaller, the error will increase
(see the sketch below)
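
A hedged simulation of this claim in the same invented toy setting (noisy
linear target, mean-predictor learner): the average squared error at a point
grows as the training set size m shrinks:

import random

f = lambda x: 2 * x + 1

def average_error(m, trials=2000, x0=0.5):
    """Average squared error at x0 over hypotheses trained on sets of size m."""
    total = 0.0
    for _ in range(trials):
        xs = [random.uniform(0, 1) for _ in range(m)]
        ys = [f(x) + random.gauss(0, 1.0) for x in xs]
        prediction = sum(ys) / m       # mean-predictor hypothesis at x0
        total += (prediction - f(x0)) ** 2
    return total / trials

for m in (100, 20, 5, 2):
    print(m, round(average_error(m), 3))   # error grows as m shrinks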

statistical bias
o puts all the biases together into one formula:
bias(x) = f̄(x) – f(x)
o i.e. the error between the “true value” f(x) and the average f̄(x) of the
outputs of hypotheses learnt by your learner from different training sets
of size m

variance (similar to spread or precision)
o this is the average squared error between f̄(x) (the mean value of the
different hypotheses, each learnt from one of the training sets s1, s2, …, sl)
and the value f_si(x) from each hypothesis:
variance(x) = ((f_s1(x) – f̄(x))² + … + (f_sl(x) – f̄(x))²) / l
(see the sketch after this list)
o archery analogy – bull’s eye is the “true value” we are aiming for and an
arrow is the output from one learnt hypothesis produced by our learner
o both boards below have low variance because the error between (a) the
average of the arrows and (b) each particular arrow is low (i.e. all
arrows are near the average)
o whereas the following boards have high variance
o high variance = arrows are spread out
o low variance = arrows are near each other
o note: variance can be high while bias is low (low precision but high
accuracy – remember bias compares the average of the outputs from
different hypotheses with the true value)
o note: variance can be low while bias is high (high precision but low
accuracy)
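
A sketch of measuring variance at a point x0, again in the same invented toy
setting (noisy linear target, mean-predictor learner):

import random

f = lambda x: 2 * x + 1

l, m, x0 = 1000, 10, 0.9
predictions = []
for _ in range(l):
    xs = [random.uniform(0, 1) for _ in range(m)]
    ys = [f(x) + random.gauss(0, 0.5) for x in xs]
    predictions.append(sum(ys) / m)    # mean-predictor hypothesis f_si at x0

f_bar = sum(predictions) / l
variance = sum((p - f_bar) ** 2 for p in predictions) / l
print(f_bar, variance)  # spread of the f_si(x0) around their mean f̄(x0)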

error (mean squared error) = bias² + variance (checked numerically in the
sketch below)
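
A numeric check of the decomposition in the same invented toy setting; the
mean squared error at x0 matches bias² + variance (the identity holds exactly
for these sample averages):

import random

f = lambda x: 2 * x + 1
l, m, x0 = 10000, 10, 0.9

predictions = []
for _ in range(l):
    xs = [random.uniform(0, 1) for _ in range(m)]
    ys = [f(x) for x in xs]
    predictions.append(sum(ys) / m)    # mean-predictor hypothesis at x0

f_bar = sum(predictions) / l
bias = f_bar - f(x0)
variance = sum((p - f_bar) ** 2 for p in predictions) / l
mse = sum((p - f(x0)) ** 2 for p in predictions) / l
print(mse, bias ** 2 + variance)       # the two values agree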