Model Selection and Validation
“All models are wrong; some are useful.”
George E. P. Box
Some slides were taken from:
• J. C. Spall: MODELING CONSIDERATIONS AND STATISTICAL INFORMATION
• J. Hinton: Preventing overfitting
• Bei Yu: Model Assessment
Overfitting
• The training data contains information
about the regularities in the mapping from
input to output. But it also contains noise
– The target values may be unreliable.
– There is sampling error. There will be
accidental regularities just because of the
particular training cases that were chosen.
• When we fit the model, it cannot tell which
regularities are real and which are caused
by sampling error.
– So it fits both kinds of regularity.
– If the model is very flexible it can model the
sampling error really well. This is a disaster.
A simple example of
overfitting
• Which model do you
believe?
– The complicated model
fits the data better.
– But it is not economical.
• A model is convincing when
it fits a lot of data surprisingly
well.
– It is not surprising that a
complicated model can fit
a small amount of data.
Generalization
• The objective of learning is to achieve
good generalization to new cases,
otherwise just use a look-up table.
• Generalization can be defined as a
mathematical interpolation or regression
over a set of training points:
[Figure: fitted curve f(x) plotted against x through the training points]
Generalization
• Over-training is the equivalent of over-fitting a set of data points with a curve that is too complex.
• Occam’s Razor (William of Ockham, 14th-century English logician):
– “plurality should not be assumed without necessity”
• The simplest model which explains the
majority of the data is usually the best
Generalization
Preventing Over-training:
• Use a separate test or tuning set of examples
• Monitor error on the test set as network trains
• Stop network training just before over-fitting occurs (“early stopping” or tuning)
• The number of effective weights is thereby reduced
• Most new systems have automated early stopping
methods
Generalization
Weight Decay: an automated method of
effective weight control
• Adjust the backpropagation error function to penalize the growth of
unnecessary weights:

  E = (1/2) Σ_j (t_j − o_j)² + (λ/2) Σ_i w_ij²

  where λ = weight-cost parameter

• Each w_ij is decayed by an amount proportional to its magnitude;
weights that are not reinforced decay toward 0.
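As an illustration (not from the original slides), a minimal NumPy sketch of this penalized error; the function name and the use of a single flat weight array are assumptions made for the example.

    import numpy as np

    def penalized_error(targets, outputs, weights, lam):
        # E = 1/2 * sum_j (t_j - o_j)^2 + lam/2 * sum of squared weights
        sse = 0.5 * np.sum((targets - outputs) ** 2)
        decay = 0.5 * lam * np.sum(weights ** 2)
        return sse + decay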
Formal Model Definition
• Assume the model z = h(x, θ) + v, where z is the output, h(·) is some
function, x is the input, v is noise, and θ is the vector of model
parameters.
• A fundamental goal is to take n data points and estimate θ, forming θ̂_n.
Model Error Definition
• Given a data set {(x_i, y_i)}, i = 1, …, n
• Given a model output h(x, θ_n), where θ_n is taken from some family of
parameters, the sum of squared errors (SSE) is
  Σ_i [y_i − h(x_i, θ_n)]²
  (dividing by n gives the MSE)
• The likelihood is
  Π_i P(y_i | x_i, θ_n)
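A small NumPy sketch of these two criteria; the Gaussian form of the likelihood is an added assumption, consistent with the noise model z = h(x, θ) + v above, and the function names are illustrative only.

    import numpy as np

    def sse(h, theta, x, y):
        # sum of squared errors: Sum_i [y_i - h(x_i, theta)]^2
        return np.sum((y - h(x, theta)) ** 2)

    def log_likelihood(h, theta, x, y, sigma2):
        # log-likelihood under the assumption v ~ N(0, sigma2)
        resid = y - h(x, theta)
        n = len(y)
        return -0.5 * n * np.log(2.0 * np.pi * sigma2) - 0.5 * np.sum(resid ** 2) / sigma2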
Error surface as a function of Model
parameters can look like this
Error surface can also look like this
Which one is better?
Properties of the error surfaces
• The first surface is rough, so a small change in parameter space can
lead to a large change in error.
• Due to the steepness of the surface, a minimum can be found, although a
gradient-descent optimization algorithm can get stuck in local minima.
• The second surface is very smooth, so a large change in the parameter
set does not lead to much change in model error.
• In other words, generalization performance is expected to be similar to
performance on a test set.
Parameter stability
• A finer point: even though the surface is very smooth, it is impossible
to get to the true minimum.
• This suggests that penalizing models for smoothness may be misleading.
• Breiman (1992) has shown that even in simple problems with simple
nonlinear models, the degree of generalization depends strongly on the
stability of the parameters.
Bias-Variance Decomposition
• Assume:  Y = f(X) + ε,  ε ~ N(0, σ_ε²)
• Bias-Variance Decomposition:
  Err(x0) = E[(Y − f̂(x0))² | X = x0]
          = E[(f(x0) + ε − E f̂(x0) + E f̂(x0) − f̂(x0))²]
          = σ_ε² + [E f̂(x0) − f(x0)]² + E[f̂(x0) − E f̂(x0)]²
          = σ_ε² + Bias²(f̂(x0)) + Var(f̂(x0))
• k-NN:
  Err(x0) = σ_ε² + [f(x0) − (1/k) Σ_{l=1..k} f(x_(l))]² + σ_ε²/k
• Linear fit:
  Err(x0) = σ_ε² + [E f̂_p(x0) − f(x0)]² + ||h(x0)||² σ_ε²
  – Ridge regression:  h(x0) = X(XᵀX + λI)⁻¹ x0
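The decomposition can be checked numerically. The sketch below (a toy example of my own, not from the slides) repeatedly draws training sets from Y = sin(X) + ε, fits a cubic polynomial, and estimates the squared bias and variance of f̂ at a single point x0; the true function, noise level, sample size, and model degree are all assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    f = np.sin              # assumed true function
    sigma = 0.3             # noise std, eps ~ N(0, sigma^2)
    x0 = 1.0                # point at which Err(x0) is decomposed
    degree = 3              # assumed model complexity

    preds = []
    for _ in range(2000):                        # many independent training sets
        x = rng.uniform(0.0, 2.0 * np.pi, 30)
        y = f(x) + rng.normal(0.0, sigma, x.size)
        coef = np.polyfit(x, y, degree)          # fit f_hat on this training set
        preds.append(np.polyval(coef, x0))
    preds = np.array(preds)

    bias2 = (preds.mean() - f(x0)) ** 2          # [E f_hat(x0) - f(x0)]^2
    var = preds.var()                            # E[(f_hat(x0) - E f_hat(x0))^2]
    print("sigma^2 + bias^2 + var =", sigma**2 + bias2 + var)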
Bias-Variance Decomposition
• The MSE of the model at a fixed x can be decomposed as:
  E{[h(x, θ̂_n) − E(z|x)]² | x}
    = E{[h(x, θ̂_n) − E(h(x, θ̂_n))]² | x} + [E(h(x, θ̂_n)) − E(z|x)]²
    = variance at x + (bias at x)²
  where the expectations are computed w.r.t. θ̂_n
• The above implies:
  – Model too simple → high bias / low variance
  – Model too complex → low bias / high variance
Bias-Variance Tradeoff in Model
Selection in Simple Problem
Model Selection
• The bias-variance tradeoff provides conceptual
framework for determining a good model
– bias-variance tradeoff not directly useful
• Many methods for practical determination of a good
model
– AIC, Bayesian selection, cross-validation,
minimum description length, V-C dimension, etc.
• All methods based on a tradeoff between fitting error
(high variance) and model complexity (low bias)
• Cross-validation is one of the most popular model selection methods
Cross-Validation
• Cross-validation is a simple, general method for
comparing candidate models
– Other specialized methods may work better in specific
problems
• Cross-validation uses the training set of data
• Does not work on some pathological distributions
• Method is based on iteratively partitioning the full set of
training data into training and test subsets
• For each partition, estimate model from training subset
and evaluate model on test subset
• Select model that performs best over all test subsets
Division of Data for Cross-Validation
with Disjoint Test Subsets
Typical Steps for Cross-Validation
Step 0 (initialization) Determine size of test subsets and
candidate model. Let i be counter for test subset being used.
Step 1 (estimation) For the i-th test subset, let the remaining data be
the i-th training subset. Estimate θ from this training subset.
Step 2 (error calculation) Based on the estimate of θ from Step 1 (i-th
training subset), calculate the MSE (or another measure) with the data in
the i-th test subset.
Step 3 (new training/test subset) Update i to i + 1 and
return to step 1. Form mean of MSE when all test subsets
have been evaluated.
Step 4 (new model) Repeat steps 1 to 3 for next model.
Choose model with lowest mean MSE as best.
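A compact NumPy sketch of Steps 0-4 for one candidate model; fit and predict are hypothetical callables standing in for whatever estimator is being compared, and the data are assumed to be index-aligned arrays.

    import numpy as np

    def cross_validate(fit, predict, x, y, n_folds=5):
        # Steps 1-3: mean test-subset MSE over disjoint test subsets
        idx = np.arange(len(y))
        folds = np.array_split(idx, n_folds)              # disjoint test subsets
        mses = []
        for test_idx in folds:
            train_idx = np.setdiff1d(idx, test_idx)       # i-th training subset
            theta_hat = fit(x[train_idx], y[train_idx])   # Step 1: estimate theta
            pred = predict(theta_hat, x[test_idx])        # Step 2: error on i-th test subset
            mses.append(np.mean((y[test_idx] - pred) ** 2))
        return np.mean(mses)                              # Step 3: mean over all test subsets

    # Step 4: call cross_validate once per candidate model and keep the lowest value.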
Numerical Illustration of Cross-Validation
(Example 13.4 in ISSO)
• Consider true system corresponding to a sine function
of the input with additive normally distributed noise
• Consider three candidate models
– Linear (affine) model
– 3rd-order polynomial
– 10th-order polynomial
• Suppose 30 data points are available, divided into 5
disjoint test subsets
• Based on RMS error (equiv. to MSE) over test subsets,
3rd-order polynomial is preferred
• See following plot
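A rough reconstruction of this experiment in NumPy (the exact noise level and input sampling of ISSO Example 13.4 are not given here, so those are assumptions); it fits 1st-, 3rd-, and 10th-order polynomials and reports the RMS error over 5 disjoint test subsets.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0.0, 2.0 * np.pi, 30)          # 30 data points
    y = np.sin(x) + rng.normal(0.0, 0.2, 30)       # sine system + Gaussian noise (level assumed)

    folds = np.array_split(rng.permutation(30), 5)       # 5 disjoint test subsets
    for degree in (1, 3, 10):                            # affine, 3rd-order, 10th-order models
        sq_errs = []
        for test in folds:
            train = np.setdiff1d(np.arange(30), test)
            coef = np.polyfit(x[train], y[train], degree)
            sq_errs.append(np.mean((y[test] - np.polyval(coef, x[test])) ** 2))
        print(degree, np.sqrt(np.mean(sq_errs)))         # RMS error over test subsets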
Numerical Illustration (cont’d): Relative Fits
for 3 Models with Low-Noise Observations
Standard approach to Model
Selection
• Concurrently optimize the likelihood or mean squared error together
with a complexity penalty.
• Some penalties: the norm of the weight vector, smoothness, the number
of terminal leaves (in CART), variance weights, cross-validation, etc.
• Spend most of the computational time on optimizing the parameter
solution via sophisticated gradient-descent methods or even
global-minimum-seeking methods.
Alternative approach
MDL-based model selection (covered later)
Model Complexity
Preventing overfitting
• Use a model that has the right capacity:
– enough to model the true regularities
– not enough to also model the spurious
regularities (assuming they are weaker).
• Standard ways to limit the capacity of a
neural net:
– Limit the number of hidden units.
– Limit the size of the weights.
– Stop the learning before it has time to over-fit.
Limiting the size of the weights
• Weight-decay involves
adding an extra term to
the cost function that
penalizes the squared
weights.
– Keeps weights small
unless they have big
error derivatives.
C E

2
wi

2
i
C E

 wi
wi wi
C
1 E
when
 0, wi  
wi
 wi
C
w
13-27
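In gradient descent this penalty simply adds a shrinkage term to each update; a one-line sketch (the learning-rate argument and function name are my own):

    import numpy as np

    def weight_decay_step(w, grad_E, lr, lam):
        # dC/dw = dE/dw + lam * w, so each weight is pulled toward zero
        # in proportion to its magnitude as well as moved downhill on E.
        return w - lr * (np.asarray(grad_E) + lam * np.asarray(w))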
The effect of weight-decay
• It prevents the network from using weights
that it does not need.
– This can often improve generalization a lot.
– It helps to stop it from fitting the sampling error.
– It makes a smoother model in which the output changes more slowly as
the input changes.
• If the network has two very similar inputs, it prefers to put half the
weight on each (w/2 and w/2) rather than all the weight on one (w and 0).
Model selection
• How do we decide which limit to use and
how strong to make the limit?
– If we use the test data we get an unfair
prediction of the error rate we would get on
new test data.
– Suppose we compared a set of models that gave random results; the best
one on a particular dataset would do better than chance, but it won’t do
better than chance on another test set.
• So use a separate validation set to do model selection.
Using a validation set
• Divide the total dataset into three subsets:
– Training data is used for learning the
parameters of the model.
– Validation data is not used for learning but is used for deciding what
type of model and what amount of regularization work best.
– Test data is used to get a final, unbiased
estimate of how well the network works. We
expect this estimate to be worse than on the
validation data.
• We could then re-divide the total dataset
to get another unbiased estimate of the
true error rate.
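A minimal sketch of such a three-way split in NumPy, assuming index-aligned arrays and the 50%/25%/25% split quoted later in these slides; the function name and random seeding are illustrative choices.

    import numpy as np

    def train_val_test_split(x, y, frac=(0.5, 0.25), seed=0):
        # remaining 25% of the data goes to the test set
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(y))
        n_train = int(frac[0] * len(y))
        n_val = int(frac[1] * len(y))
        tr, va, te = np.split(idx, [n_train, n_train + n_val])
        return (x[tr], y[tr]), (x[va], y[va]), (x[te], y[te])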
Early stopping
• If we have lots of data and a big model, it is very expensive to keep
re-training it with different amounts of weight decay.
• It is much cheaper to start with very small
weights and let them grow until the
performance on the validation set starts
getting worse (but don’t get fooled by noise!)
• The capacity of the model is limited because
the weights have not had time to grow big.
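A sketch of the stopping rule; train_step and val_error are hypothetical callbacks for one epoch of training and for measuring validation error, and the patience counter is one common way of not being fooled by noise.

    import numpy as np

    def train_with_early_stopping(train_step, val_error, max_epochs=1000, patience=10):
        best_err, best_epoch, bad = np.inf, 0, 0
        for epoch in range(max_epochs):
            train_step(epoch)                    # let the weights grow a little
            err = val_error()                    # monitor the validation set
            if err < best_err:
                best_err, best_epoch, bad = err, epoch, 0
            else:
                bad += 1                         # wait several epochs before giving up
                if bad >= patience:
                    break
        return best_epoch, best_err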
Why early stopping works
• When the weights are very
small, every hidden unit is
in its linear range.
– So a net with a large layer of
hidden units is linear.
– It has no more capacity than
a linear net in which the
inputs are directly connected
to the outputs!
• As the weights grow, the
hidden units start using
their non-linear ranges so
the capacity grows.
Model Assessment and Selection
• Loss Function and Error Rate
• Bias, Variance and Model Complexity
• Optimization
• AIC (Akaike Information Criterion)
• BIC (Bayesian Information Criterion)
• MDL (Minimum Description Length)
Key Methods to Estimate
Prediction Error
• Estimate the optimism, then add it to the training error rate:
  Êrr_in = err̄ + ôp
• AIC: choose the model with the smallest AIC
  AIC(α) = err̄(α) + 2 (d(α)/N) σ̂_ε²
• BIC: choose the model with the smallest BIC
  BIC = (N/σ̂_ε²) [err̄ + (log N) (d/N) σ̂_ε²]
  (d = number of effective parameters, N = sample size)
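Both criteria are easy to compute once the training error, the effective number of parameters d, the sample size N, and a noise-variance estimate are available; a small sketch under those assumptions (function names are mine):

    import numpy as np

    def aic(train_err, d, n, sigma2):
        # AIC = training error + 2 * (d/N) * sigma_hat^2
        return train_err + 2.0 * d / n * sigma2

    def bic(train_err, d, n, sigma2):
        # BIC = (N / sigma_hat^2) * [training error + log(N) * (d/N) * sigma_hat^2]
        return n / sigma2 * (train_err + np.log(n) * d / n * sigma2)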
Model Assessment and
Selection
• Model Selection:
– estimating the performance of different
models in order to choose the best one.
• Model Assessment:
– having chosen the model, estimating the
prediction error on new data.
Approaches
• data-rich:
– data split: Train-Validation-Test
– typical split: 50%-25%-25% (how?)
• data-insufficient:
– Analytical approaches:
• AIC, BIC, MDL, SRM
– efficient sample re-use approaches:
• cross validation, bootstrapping
Model Complexity
Bias-Variance Tradeoff
Summary
• Cross-validation: a practical way to estimate model error.
• Model estimation should be done with a complexity penalty.
• Once the best model is chosen, re-estimate it on the whole data set or
average the models fitted on the cross-validation folds.
Loss Functions
• Continuous Response
– squared error
– absolute error
• Categorical Response
– 0-1 loss
– log-likelihood
Error Functions
• Training Error:
– the average loss over the training sample.
– Continuous response:  err̄ = (1/N) Σ_{i=1..N} L(y_i, f̂(x_i))
– Categorical response:  err̄ = −(2/N) Σ_{i=1..N} log p̂_{g_i}(x_i)
• Generalization Error:
– the expected prediction error over an independent test sample.
– Continuous response:  Err = E[L(Y, f̂(X))]
– Categorical response:  Err = E[L(G, Ĝ(X))]
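Both training-error formulas translate directly into NumPy; in the sketch below, class_probs[i, k] is assumed to hold the model's predicted probability of class k for sample i, and labels holds the integer-coded true classes g_i.

    import numpy as np

    def training_error_squared(y, y_hat):
        # (1/N) * sum_i (y_i - f_hat(x_i))^2, i.e. squared-error loss
        return np.mean((y - y_hat) ** 2)

    def training_error_loglik(class_probs, labels):
        # -(2/N) * sum_i log p_hat_{g_i}(x_i)
        n = len(labels)
        return -2.0 / n * np.sum(np.log(class_probs[np.arange(n), labels]))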
Detailed Decomposition for Linear
Model Family
• Average squared bias decomposition:
  β* = arg min_β E[f(X) − βᵀX]²
  E_{x0}[f(x0) − E f̂_α(x0)]² = E_{x0}[f(x0) − β*ᵀx0]² + E_{x0}[β*ᵀx0 − E β̂_αᵀx0]²
  Ave[Bias]² = Ave[Model Bias]² + Ave[Estimation Bias]²
• The estimation bias is 0 for LLSF (linear least squares) and > 0 for
ridge regression, where it trades off with the variance.
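A toy numerical check (my own construction, not from the slides): when f(X) is exactly linear the model bias is zero, the least-squares fit has essentially zero estimation bias, and ridge shrinkage introduces a positive estimation bias. The dimensions, ridge parameter, and the noiseless setup are all assumptions made to keep the sketch short.

    import numpy as np

    rng = np.random.default_rng(2)
    p, n = 5, 200
    beta_true = rng.normal(size=p)
    X = rng.normal(size=(n, p))
    f = X @ beta_true                  # f(X) is exactly linear, so the model bias is 0
    lam = 5.0

    beta_ols = np.linalg.solve(X.T @ X, X.T @ f)                      # LLSF fit to f
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ f)  # ridge fit to f

    # average squared estimation bias over the observed inputs
    print("LLSF :", np.mean((X @ beta_ols - X @ beta_true) ** 2))     # ~ 0
    print("ridge:", np.mean((X @ beta_ridge - X @ beta_true) ** 2))   # > 0 (shrinkage bias)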