Lecture 12 – Model Assessment and Selection

Rice ECE697
Farinaz Koushanfar
Fall 2006
Summary
• Bias, variance, and model complexity
• Optimism of the training error rate
• Estimates of in-sample prediction error, AIC
• Effective number of parameters
• The Bayesian approach and BIC
• Vapnik–Chervonenkis dimension
• Cross-validation
• Bootstrap method
Model Selection Criteria
• Loss function L(Y, f(X)), e.g., squared error or 0–1 loss
• Training error: the average loss over the training sample
• Generalization (test) error: the expected loss on new data drawn from the same population
Training Error vs. Test Error
(Figure: as model complexity grows, training error decreases steadily while test error first decreases and then rises.)
Model Selection and Assessment
• Model selection:
– Estimating the performance of different models in order to choose the best one
• Model assessment:
– Having chosen a final model, estimating its prediction error (generalization error) on new data
• If we were data-rich, we would split the data into three parts: Train | Validation | Test
– Train: fit the models; Validation: estimate prediction error for model selection; Test: assess the generalization error of the final chosen model
Bias-Variance Decomposition
• As we have seen before, for Y = f(X) + ε with E[ε] = 0 and Var(ε) = σε², the expected squared prediction error at a point x0 is
  Err(x0) = E[(Y − f̂(x0))² | X = x0] = σε² + [E f̂(x0) − f(x0)]² + Var(f̂(x0))
          = Irreducible error + Bias² + Variance
• The first term is the variance of the target around its true mean f(x0); the second term is the squared amount by which the average of our estimate differs from the true mean; the last term is the variance of f̂(x0)
• The more complex f̂, the lower the (squared) bias, but the higher the variance
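A small Python sketch (mine, not from the lecture) that illustrates this decomposition numerically: it simulates many training sets from an arbitrary choice of f and noise level, fits a polynomial of a chosen degree, and compares irreducible error, squared bias, and variance at a fixed test point x0.

import numpy as np

rng = np.random.default_rng(0)

def f(x):                        # "true" regression function (illustrative choice)
    return np.sin(2 * np.pi * x)

sigma = 0.3                      # noise standard deviation (assumed)
N, degree, x0 = 30, 5, 0.35      # training size, model complexity, test point
reps = 2000

preds = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, 1, N)
    y = f(x) + rng.normal(0, sigma, N)
    coef = np.polyfit(x, y, degree)      # least-squares polynomial fit
    preds[r] = np.polyval(coef, x0)      # prediction at x0

bias2 = (preds.mean() - f(x0)) ** 2
var = preds.var()
print("irreducible error:", sigma ** 2)
print("squared bias     :", bias2)
print("variance         :", var)
print("bias-variance sum:", sigma ** 2 + bias2 + var)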
Bias-Variance Decomposition (cont’d)
• For a k-nearest-neighbor fit:
  Err(x0) = σε² + [f(x0) − (1/k)·Σ_{ℓ=1}^{k} f(x_(ℓ))]² + σε²/k
• For a linear regression fit f̂p(x) with p components:
  Err(x0) = σε² + [f(x0) − E f̂p(x0)]² + ||h(x0)||²·σε²
Bias-Variance Decomposition (cont’d)
• For linear regression, h(x0) = X(XᵀX)⁻¹x0 is the vector of weights that produces the fit f̂p(x0) = x0ᵀ(XᵀX)⁻¹Xᵀy, and hence Var[f̂p(x0)] = ||h(x0)||²·σε²
• This variance changes with x0, but its average over the sample values xi is (p/N)·σε²
Example
• 50 observations and 20 predictors, uniformly distributed in the hypercube [0,1]²⁰
• Left: Y is 0 if X1 ≤ 1/2 and 1 otherwise, and we apply k-NN
• Right: Y is 1 if Σ_{j=1}^{10} Xj is greater than 5 and 0 otherwise, and we use best-subset linear regression of size p
(Figure: prediction error, squared bias, and variance for each case.)
Example – loss function
(Figure: prediction error, squared bias, and variance for the same examples.)
Optimism of Training Error
• The training error, err̄ = (1/N)·Σ_{i=1}^{N} L(yi, f̂(xi)),
• is typically less than the true (extra-sample) error Err = E[L(Y, f̂(X))]
• In-sample error: Err_in = (1/N)·Σ_{i=1}^{N} E_{Y⁰}[L(Yi⁰, f̂(xi))]
• Optimism: op = Err_in − err̄
• For squared error, 0–1, and other loss functions, one can show in general:
  ω ≡ E_y(op) = (2/N)·Σ_{i=1}^{N} Cov(ŷi, yi)
Optimism (cont’d)
• Thus, the amount by which the training error underestimates the true error depends on how much yi affects its own prediction
• For a linear fit with d inputs or basis functions: Σ_{i} Cov(ŷi, yi) = d·σε²
• For the additive error model Y = f(X) + ε, and thus:
  Err_in ≈ err̄ + 2·(d/N)·σε²
Optimism increases linearly with the number of inputs or basis functions d, and decreases as the training size N increases
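A quick simulation sketch (my own, with an arbitrary Gaussian design) that checks the 2·(d/N)·σε² formula: it compares the average gap between the in-sample error on fresh responses and the training error for an ordinary least-squares fit.

import numpy as np

rng = np.random.default_rng(1)
N, d, sigma, reps = 100, 10, 1.0, 3000

X = rng.normal(size=(N, d))            # fixed design
beta = rng.normal(size=d)              # "true" coefficients
f_true = X @ beta
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix: y_hat = H y

gap = 0.0
for _ in range(reps):
    y = f_true + rng.normal(0, sigma, N)
    y_hat = H @ y
    train_err = np.mean((y - y_hat) ** 2)
    y_new = f_true + rng.normal(0, sigma, N)     # fresh responses at the same inputs
    in_sample_err = np.mean((y_new - y_hat) ** 2)
    gap += (in_sample_err - train_err) / reps

print("simulated optimism:", gap)
print("2*d*sigma^2/N     :", 2 * d * sigma ** 2 / N)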
How to account for optimism?
• Estimate the optimism and add it to the training error, e.g., AIC, BIC, etc.
• Bootstrap and cross-validation are direct estimates of the extra-sample prediction error
Estimates of In-Sample Prediction Error
• The general form of the in-sample estimate is Êrr_in = err̄ + ω̂, where ω̂ is an estimate of the average optimism
• Cp statistic: for an additive error model, when d parameters are fit under squared error loss,
  Cp = err̄ + 2·(d/N)·σ̂ε²
  where σ̂ε² is an estimate of the noise variance
• Using this criterion, the training error is adjusted by a factor proportional to the number of basis functions used
• The Akaike Information Criterion (AIC) is a similar but more generally applicable estimate of Err_in, used when a log-likelihood loss function is adopted
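A minimal sketch (mine) of how the Cp adjustment could be computed for a least-squares fit; the noise-variance estimate here simply uses the residuals of the fitted model, standing in for the low-bias estimate the criterion assumes.

import numpy as np

def cp_statistic(y, y_hat, d, sigma2_hat):
    """Cp = training error + 2*(d/N)*sigma2_hat."""
    N = len(y)
    train_err = np.mean((y - y_hat) ** 2)
    return train_err + 2.0 * d / N * sigma2_hat

# toy usage with a least-squares fit
rng = np.random.default_rng(2)
N, d = 50, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + rng.normal(0, 1.0, N)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
sigma2_hat = np.sum((y - y_hat) ** 2) / (N - d)   # noise-variance estimate
print("Cp:", cp_statistic(y, y_hat, d, sigma2_hat))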
Akaike Information Criterion (AIC)
• AIC relies on a relationship that holds asymptotically as N → ∞:
  −2·E[log Pr_θ̂(Y)] ≈ −(2/N)·E[loglik] + 2·(d/N)
• Pr_θ(Y) is a family of densities for Y (containing the “true” density), θ̂ is the maximum likelihood estimate of θ, and “loglik” is the maximized log-likelihood:
  loglik = Σ_{i=1}^{N} log Pr_θ̂(yi)
AIC (cont’d)
• For the Gaussian model (with σε² assumed known), the AIC statistic is equivalent to Cp
• For logistic regression, using the binomial log-likelihood, we have
  AIC = −(2/N)·loglik + 2·(d/N)
• Choose the model that produces the smallest AIC
• What if we don’t know d?
• What about models with tuning parameters?
AIC (cont’d)
• Given a set of models fα(x) indexed by a tuning parameter α, denote by err̄(α) and d(α) the training error and the number of parameters:
  AIC(α) = err̄(α) + 2·(d(α)/N)·σ̂ε²
• The function AIC(α) provides an estimate of the test error curve, and we find the tuning parameter α̂ that minimizes it
• By choosing the best-fitting model with d inputs, the effective number of parameters fit is more than d
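An illustrative sketch (not from the slides) that traces AIC(α) over a tuning parameter, here the degree of a polynomial fit under squared error loss; σ̂ε² is estimated from the most flexible candidate model, which is one common convention.

import numpy as np

rng = np.random.default_rng(3)
N = 80
x = rng.uniform(-1, 1, N)
y = np.sin(3 * x) + rng.normal(0, 0.3, N)

degrees = range(1, 11)
# noise variance estimated from the most flexible candidate model
coef_big = np.polyfit(x, y, max(degrees))
sigma2 = np.mean((y - np.polyval(coef_big, x)) ** 2)

aic = {}
for deg in degrees:
    coef = np.polyfit(x, y, deg)
    err = np.mean((y - np.polyval(coef, x)) ** 2)   # training error err(alpha)
    d = deg + 1                                      # number of parameters d(alpha)
    aic[deg] = err + 2.0 * d / N * sigma2            # AIC(alpha)

best = min(aic, key=aic.get)
print("AIC by degree  :", {k: round(v, 4) for k, v in aic.items()})
print("selected degree:", best)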
AIC – Example: Phoneme recognition
The effective number of parameters
• Generalize the number of parameters to regularized fits that are linear in y: ŷ = Sy, where S is an N×N matrix depending on the inputs xi but not on the yi
• The effective number of parameters is d(S) = trace(S)
• The in-sample error estimate becomes: Êrr_in = err̄ + 2·(trace(S)/N)·σ̂ε²
The effective number of parameters
• Thus, for a regularized linear model such as ridge regression, the fit is ŷ = Sy with S = X(XᵀX + λI)⁻¹Xᵀ
• Hence d(S) = trace(S) = Σ_j dj²/(dj² + λ), where the dj are the singular values of X
• and d(S) replaces d in the Cp and AIC formulas
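A short sketch (mine, taking ridge regression as the regularized fit) showing how d(S) = trace(S) shrinks from the number of predictors toward zero as the penalty grows.

import numpy as np

rng = np.random.default_rng(4)
N, p = 60, 8
X = rng.normal(size=(N, p))

def ridge_smoother(X, lam):
    """Smoother matrix S with y_hat = S y for ridge regression."""
    p = X.shape[1]
    return X @ np.linalg.inv(X.T @ X + lam * np.eye(p)) @ X.T

for lam in [0.0, 1.0, 10.0, 100.0]:
    S = ridge_smoother(X, lam)
    print(f"lambda={lam:7.1f}  effective d(S) = trace(S) = {np.trace(S):.3f}")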
The Bayesian Approach and BIC
• Bayesian information criterion (BIC): BIC = −2·loglik + (log N)·d
• BIC/2 is also known as the Schwarz criterion
BIC is proportional to AIC (Cp), with the factor 2 replaced by log N.
BIC penalizes complex models more heavily, preferring simpler models.
BIC (cont’d)
• BIC is asymptotically consistent as a selection criterion: given a family of models that includes the true one, the probability of selecting the true model approaches 1 as N → ∞
• Suppose we have a set of candidate models Mm, m = 1,…,M, with corresponding model parameters θm, and we wish to choose the best model
• Assuming a prior distribution Pr(θm|Mm) for the parameters of each model Mm, compute the posterior probability of each model!
BIC (cont’d)
• The posterior probability of a model is
  Pr(Mm|Z) ∝ Pr(Mm)·Pr(Z|Mm)
  where Z represents the training data
• To compare two models Mm and Ml, form the posterior odds
  Pr(Mm|Z)/Pr(Ml|Z) = [Pr(Mm)/Pr(Ml)]·[Pr(Z|Mm)/Pr(Z|Ml)]
• If the posterior odds are greater than one, choose model m; otherwise choose model l
BIC (cont’d)
• Bayes factor: the rightmost term in the posterior odds, BF(Z) = Pr(Z|Mm)/Pr(Z|Ml)
• We need to approximate Pr(Z|Mm) = ∫ Pr(Z|θm, Mm)·Pr(θm|Mm) dθm
• A Laplace approximation to the integral gives
  log Pr(Z|Mm) ≈ log Pr(Z|θ̂m, Mm) − (dm/2)·log N + O(1)
• θ̂m is the maximum likelihood estimate and dm is the number of free parameters of model Mm
• If the loss function is set to −2·log Pr(Z|Mm, θ̂m), this is equivalent to the BIC criterion
BIC (cont’d)
• Thus, choosing the model with minimum BIC is
equivalent to choosing the model with largest
(approximate) posterior probability
• If we compute the BIC criterion for a set of M models, giving BICm, m = 1,…,M, then the posterior probability of each model is estimated as
  e^(−0.5·BICm) / Σ_{l=1}^{M} e^(−0.5·BICl)
• Thus, we can estimate not only the best model, but also assess the relative merits of the models considered
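A minimal sketch (mine) of turning BIC values into approximate posterior model probabilities with e^(−0.5·BICm)/Σ_l e^(−0.5·BICl); the BIC values below are made up for illustration, and the shift by the minimum is only for numerical stability.

import numpy as np

def bic_posterior(bic_values):
    """Approximate posterior model probabilities from BIC values."""
    b = np.asarray(bic_values, dtype=float)
    w = np.exp(-0.5 * (b - b.min()))   # subtract the minimum to avoid underflow
    return w / w.sum()

bic = [212.4, 208.9, 215.1]            # hypothetical BICs for three candidate models
print(bic_posterior(bic))              # the smallest BIC gets the largest posterior mass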
Vapnik–Chervonenkis Dimension
• It is often difficult to specify the number of parameters of a model
• The Vapnik–Chervonenkis (VC) dimension provides a general measure of complexity and associated bounds on optimism
• Consider a class of functions {f(x,α)} indexed by a parameter vector α, with x ∈ ℝᵖ
• Assume f is an indicator function, taking values 0 or 1
• If α = (α0, α1) and f is the linear indicator function I(α0 + α1ᵀx > 0), then it is reasonable to say its complexity is p + 1
• But what about f(x,α) = I(sin(α·x) > 0)?
VC Dimension (cont’d)
(Figure: the indicator sin(α·x) can separate arbitrarily many points for a suitable choice of the single parameter α.)
VC Dimension (cont’d)
• The Vapnik–Chervonenkis dimension is a way
of measuring the complexity of a class of
functions by assessing how wiggly its
members can be
• The VC dimension of the class {f(x,α)} is defined to be the largest number of points (in some configuration) that can be shattered by members of {f(x,α)}
VC Dimension (cont’d)
• A set of points is shattered by a class of functions if
no matter how we assign a binary label to each point,
a member of the class can perfectly separate them
• Example: the VC dimension of linear indicator functions in the plane is 3, since 3 points in general position can be shattered but 4 points cannot
VC Dimension (cont’d)
• Using the concept of VC dimension, one can prove results about the optimism of the training error when fitting a class of functions, e.g.:
• If we fit N data points using a class of functions {f(x,α)} having VC dimension h, then with probability at least 1 − η over training sets:
  Err ≤ err̄ + (ε/2)·(1 + √(1 + 4·err̄/ε))   (binary classification)
  Err ≤ err̄ / (1 − c·√ε)₊                  (regression)
  where ε = a1·[h·(log(a2·N/h) + 1) − log(η/4)] / N
For regression, a1 = a2 = 1
(Cherkassky and Mulier, 1998)
VC Dimension (cont’d)
• The bounds suggest that the optimism increases with
h and decreases with N in qualitative agreement with
the AIC correction d/N
• The VC dimension bounds are stronger: they give probabilistic upper bounds that hold for all functions f(x,α), and hence allow for searching over the class
VC Dimension (cont’d)
• Vapnik’s Structural Risk Minimization (SRM) is built
around the described bounds
• SRM fits a nested sequence of models of increasing
VC dimensions h1<h2<…, and then chooses the
model with the smallest value of the upper bound
• A drawback is the difficulty of computing the VC dimension of a class of functions
• Often only a crude upper bound is available, which may not be adequate
Example – AIC, BIC, SRM
Cross Validation (CV)
• The most widely used method
• Directly estimate the generalization error by applying
the model to the test sample
• K-fold cross validation
– Use one part of the data to build the model and a different part to test it
• Do this for k=1,2,…,K and calculate the prediction
error when predicting the kth part
CV (cont’d)
• :{1,…,N}{1,…,K} divides the data to groups
• Fitted function f^-(x), computed when  removed
• CV estimate of prediction error is
CV  1 N  L( y , fˆ
N
i 1
 ( i )
i
(x ))
i
• If K=N, is called leave-one-out CV
• Given a set of models f^-(x), the th model fit with
the kth part removed. For this set of models we have
CV()  1 N  L( y , fˆ
N
i 1
i
 ( i )
(x , ))
i
CV (cont’d)
CV()  1 N  L( y , fˆ
N
i 1
i
 ( i )
(x , ))
i
• CV() should be minimized over 
• What should we chose for K?
• With K=N, CV is unbiased, but can have a
high variance since the K training sets are
almost the same
• Computational complexity
CV (cont’d)
(Figure: a hypothetical learning curve, prediction error versus training set size.)
CV (cont’d)
• With lower K, CV has a lower variance, but bias
could be a problem!
• The most common are 5-fold and 10-fold!
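A compact sketch (mine, in plain NumPy) of K-fold cross-validation under squared error loss, used here to choose the degree of a polynomial fit.

import numpy as np

def kfold_cv_error(x, y, degree, K=10, seed=0):
    """K-fold CV estimate of squared-error prediction error for a polynomial fit."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, K)          # kappa: assign each observation to a fold
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        coef = np.polyfit(x[train], y[train], degree)   # fit with the kth part removed
        errs.append((y[test] - np.polyval(coef, x[test])) ** 2)
    return np.mean(np.concatenate(errs))

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, 100)
y = np.sin(3 * x) + rng.normal(0, 0.3, 100)
for deg in range(1, 9):
    print(deg, round(kfold_cv_error(x, y, deg), 4))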
CV (cont’d)
• Generalized cross-validation (GCV) approximates leave-one-out CV for linear fitting under squared error loss, ŷ = Sy
• For many linear fits (Sii is the ith diagonal element of S):
  (1/N)·Σ_{i=1}^{N} [yi − f̂^(−i)(xi)]² = (1/N)·Σ_{i=1}^{N} [ (yi − f̂(xi)) / (1 − Sii) ]²
• The GCV approximation is
  GCV = (1/N)·Σ_{i=1}^{N} [ (yi − f̂(xi)) / (1 − trace(S)/N) ]²
GCV may be advantageous when the trace of S is computed more easily than the individual Sii’s
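A small sketch (mine, taking a ridge-regression smoother as the linear fit) that compares the exact leave-one-out identity based on 1 − Sii with the GCV approximation based on 1 − trace(S)/N.

import numpy as np

rng = np.random.default_rng(6)
N, p, lam = 50, 6, 2.0
X = rng.normal(size=(N, p))
y = X @ rng.normal(size=p) + rng.normal(0, 1.0, N)

S = X @ np.linalg.inv(X.T @ X + lam * np.eye(p)) @ X.T   # linear smoother: y_hat = S y
y_hat = S @ y
resid = y - y_hat

loo = np.mean((resid / (1 - np.diag(S))) ** 2)            # exact leave-one-out CV
gcv = np.mean((resid / (1 - np.trace(S) / N)) ** 2)       # GCV approximation
print("LOO CV:", loo)
print("GCV   :", gcv)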
Bootstrap
• Denote the training set by Z=(z1,…,zN) where
zi=(xi,yi)
• Randomly draw a dataset with replacement from
training data
• This is done B times (e.g., B=100)
• Refit the model to each of the bootstrap datasets and
examine the behavior over the B replications
• From the bootstrap samples, we can estimate any aspect of the distribution of S(Z), where S(Z) is any quantity computed from the data Z
Bootstrap - Schematic
For example, the bootstrap estimate of the variance of S(Z) is
  Var[S(Z)] ≈ (1/(B−1))·Σ_{b=1}^{B} [S(Z*b) − S̄*]²,   where S̄* = (1/B)·Σ_{b=1}^{B} S(Z*b)
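A minimal sketch (mine) of the schematic: draw B bootstrap samples with replacement, recompute a statistic S(Z) on each, here the median, and estimate its variance from the B replications.

import numpy as np

def bootstrap_variance(z, stat, B=1000, seed=0):
    """Bootstrap estimate of Var[S(Z)] for a statistic `stat`."""
    rng = np.random.default_rng(seed)
    n = len(z)
    reps = np.array([stat(z[rng.integers(0, n, n)]) for _ in range(B)])
    return reps.var(ddof=1)            # (1/(B-1)) * sum of (S(Z*b) - mean)^2

rng = np.random.default_rng(7)
z = rng.exponential(scale=2.0, size=100)          # toy training data
print("bootstrap Var of the median:", bootstrap_variance(z, np.median))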
Bootstrap (Cont’d)
• Bootstrap estimate of the prediction error:
  Êrr_boot = (1/B)·(1/N)·Σ_{b=1}^{B} Σ_{i=1}^{N} L(yi, f̂*b(xi))
• Êrr_boot does not provide a good estimate:
  – each bootstrap dataset acts as both training set and test set, and the two share observations
  – the overfit predictions will look unrealistically good
• By mimicking cross-validation, better bootstrap estimates can be obtained
• Only keep track of predictions from bootstrap samples that do not contain the observation being predicted
Bootstrap (Cont’d)
• The leave-one-out bootstrap estimate of prediction error is
  Êrr^(1) = (1/N)·Σ_{i=1}^{N} (1/|C^(−i)|)·Σ_{b∈C^(−i)} L(yi, f̂*b(xi))
• C^(−i) is the set of indices of the bootstrap samples b that do not contain observation i
• We either choose B large enough to ensure that every |C^(−i)| is greater than zero, or simply leave out the terms for which |C^(−i)| is zero
Bootstrap (Cont’d)
• The leave-one-out bootstrap solves the overfitting problem, but it has a training-set-size bias
• The average number of distinct observations in each bootstrap sample is about 0.632·N
• Thus, if the learning curve has considerable slope at sample size N/2, the leave-one-out bootstrap will be biased upward as an estimate of the true error
• A number of methods have been proposed to alleviate this problem, e.g., the .632 estimator and the .632+ estimator based on the relative overfitting rate
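A sketch (mine, using a simple 1-nearest-neighbor classifier on toy data) of the leave-one-out bootstrap estimate and the .632 estimator 0.368·err̄ + 0.632·Êrr^(1); observations left out of no bootstrap sample are simply skipped.

import numpy as np

def nn1_predict(Xtr, ytr, Xte):
    """1-nearest-neighbor prediction (squared Euclidean distance)."""
    d2 = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)
    return ytr[d2.argmin(axis=1)]

rng = np.random.default_rng(8)
N, B = 80, 200
X = rng.normal(size=(N, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

err_bar = np.mean(nn1_predict(X, y, X) != y)      # training error (zero for 1-NN here)

loss_sum = np.zeros(N)
count = np.zeros(N)
for _ in range(B):
    idx = rng.integers(0, N, N)                   # bootstrap sample Z*b
    out = np.setdiff1d(np.arange(N), idx)         # observations not in Z*b
    if out.size == 0:
        continue
    pred = nn1_predict(X[idx], y[idx], X[out])
    loss_sum[out] += (pred != y[out])
    count[out] += 1

keep = count > 0                                   # drop i with |C^(-i)| = 0
err_loo_boot = np.mean(loss_sum[keep] / count[keep])
err_632 = 0.368 * err_bar + 0.632 * err_loo_boot
print("training error         :", err_bar)
print("leave-one-out bootstrap:", err_loo_boot)
print(".632 estimator         :", err_632)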
Bootstrap (Example)
• Five-fold CV and the .632 estimator applied to the same problems as before
• Any of these measures could be biased, but this does not hurt model selection as long as the relative performance of the models is unaffected