Slides for Introduction to Stochastic Search
and Optimization (ISSO) by J. C. Spall
CHAPTER 13
MODELING CONSIDERATIONS AND
STATISTICAL INFORMATION
“All models are wrong; some are useful.”
George E. P. Box
•Organization of chapter in ISSO
–Bias-variance tradeoff
–Model selection: Cross-validation
–Fisher information matrix: Definition, examples, and
efficient computation
Model Definition and MSE
• Assume model z = h(θ, x) + v, where z is output, h(·) is some function, x is input, v is noise, and θ is vector of model parameters
– h(·) may represent simulation model
– h(·) may represent “metamodel” (response surface) of
existing simulation
• A fundamental goal is to take n data points and estimate θ, forming θ̂_n
• A common measure of effectiveness for estimate is
mean of squared model error (MSE) at fixed x:
E{[h(θ̂_n, x) − E(z|x)]² | x}
Bias-Variance Decomposition
• The MSE of the model at a fixed x can be decomposed
as:
E{[h(θ̂_n, x) − E(z|x)]² | x}
= E{[h(θ̂_n, x) − E(h(θ̂_n, x))]² | x} + [E(h(θ̂_n, x)) − E(z|x)]²
= variance at x + (bias at x)²
where expectations are computed w.r.t. θ̂_n
• Above implies:
Model too simple → High bias/low variance
Model too complex → Low bias/high variance
Unbiased Estimator May Not be Best
(Example 13.1 from ISSO)
• Unbiased estimator is such that E[h(θ̂_n, x) | x] = E(z | x)
(i.e., mean of prediction is same as mean of data z)
• Example: Let θ̂_n denote sample mean of scalar i.i.d. data as estimator of true mean θ (h(θ, x) = θ in notation above)
• Alternative biased estimator of θ is rθ̂_n, where 0 < r < 1
• MSE of biased and unbiased estimators generally satisfy
E[(rθ̂_n − θ)²] < E[(θ̂_n − θ)²]
• Biased estimate better in MSE sense
– However, optimal value of r requires knowledge of unknown (true) θ
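The inequality above can be checked numerically. Below is a minimal Monte Carlo sketch (the true mean, noise variance, and sample size are assumed values chosen for illustration; r is set to its MSE-optimal value, which, as noted above, requires knowing θ):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0                      # true mean (assumed value for illustration)
sigma2 = 1.0                     # noise variance (assumed)
n, reps = 10, 200_000

# MSE-optimal shrinkage factor; note that it depends on the unknown theta
r = theta**2 / (theta**2 + sigma2 / n)

samples = rng.normal(theta, np.sqrt(sigma2), size=(reps, n))
theta_hat = samples.mean(axis=1)                    # unbiased estimator (sample mean)

mse_unbiased = np.mean((theta_hat - theta) ** 2)    # ~ sigma2/n = 0.100
mse_biased = np.mean((r * theta_hat - theta) ** 2)  # slightly smaller

print(f"r = {r:.4f}")
print(f"MSE(theta_hat)     = {mse_unbiased:.4f}")
print(f"MSE(r * theta_hat) = {mse_biased:.4f}")
```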
Bias-Variance Tradeoff in Model
Selection in Simple Problem
[Figure: conceptual illustration of the bias-variance tradeoff in model selection]
Example 13.2 in ISSO:
Bias-Variance Tradeoff
• Suppose true process produces output according to z = f(x) + noise, where f(x) = (x + x²)^1.1
• Compare linear, quadratic, and cubic approximations
• Table below gives average bias, variance, and MSE
                Linear Model   Quadratic Model   Cubic Model
bias²                510.6           0.53             0.005
variance              10.0          20.0             30.0
Overall MSE           520.6         20.53            30.005
• Overall pattern of decreasing bias and increasing
variance; optimal tradeoff is quadratic model
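The decomposition and the pattern in the table can be reproduced by simulation. The sketch below is only illustrative: the input grid, noise level, and replication count are assumptions, so the numbers will differ from the table, but the pattern of falling bias² and rising variance with model order should appear:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: (x + x**2) ** 1.1          # true mean response from Example 13.2
x = np.linspace(0.0, 2.0, 20)            # input design (assumed)
noise_sd = 1.0                           # noise standard deviation (assumed)
reps = 2000                              # number of simulated data sets

for degree, name in [(1, "linear"), (2, "quadratic"), (3, "cubic")]:
    preds = np.empty((reps, x.size))
    for i in range(reps):
        z = f(x) + rng.normal(0.0, noise_sd, size=x.size)
        coef = np.polyfit(x, z, degree)                  # least-squares polynomial fit
        preds[i] = np.polyval(coef, x)
    bias2 = np.mean((preds.mean(axis=0) - f(x)) ** 2)    # average of (bias at x)^2
    var = np.mean(preds.var(axis=0))                     # average of variance at x
    print(f"{name:9s}  bias^2 = {bias2:8.4f}  variance = {var:.4f}  MSE = {bias2 + var:.4f}")
```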
Model Selection
• The bias-variance tradeoff provides conceptual
framework for determining a good model
– Tradeoff by itself is conceptual; not directly usable as a selection procedure
• Need a practical method for optimizing bias-variance
tradeoff
• Practical aim is to pick a model that minimizes a criterion:
f1(fitting error from given data) + f2(model complexity)
where f1 and f2 are increasing functions
• All methods based on a tradeoff between fitting error
(high variance) and model complexity (low bias)
• Criterion above may/may not be explicitly used in given
method
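As one concrete (assumed) instance of such a criterion, the sketch below uses an AIC-style score for polynomial models, with f1 = n·log(mean squared residual) and f2 = 2·(number of parameters); this is an illustrative choice, not the specific rule used in ISSO:

```python
import numpy as np

def penalized_criterion(x, z, degree):
    """AIC-style instance of f1(fitting error) + f2(model complexity),
    assuming Gaussian residuals (illustrative choice, not ISSO's rule)."""
    coef = np.polyfit(x, z, degree)                  # fit candidate polynomial model
    resid = z - np.polyval(coef, x)
    n = len(z)
    f1 = n * np.log(np.mean(resid ** 2))             # grows with fitting error
    f2 = 2 * (degree + 1)                            # grows with number of parameters
    return f1 + f2

rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 30)                               # assumed inputs
z = np.sin(np.pi * x) + rng.normal(0.0, 0.3, size=x.size)    # assumed test process
for d in (1, 3, 10):
    print(d, round(penalized_criterion(x, z, d), 2))         # smaller is better
```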
Methods for Model Selection
• Among many popular methods are:
– Akaike Information Criterion (AIC) (Akaike, 1974)
• Popular in time series analysis
– Bayesian selection (Akaike, 1977)
– Bootstrap-based selection (Efron and Tibshirani, 1997)
– Cross-validation (Stone, 1974)
– Minimum description length (Rissanen, 1978)
– V-C dimension (Vapnik and Chervonenkis, 1971)
• Popular in computer science
• Cross-validation appears to be most popular model selection method
Cross-Validation
• Cross-validation is simple, general method for
comparing candidate models
– Other specialized methods may work better in specific
problems
• Cross-validation uses the training set of data
• Method is based on iteratively partitioning the full set of
training data into training and test subsets
• For each partition, estimate model from training subset
and evaluate model on test subset
– Number of training (or test) subsets = number of model
fits required
• Select model that performs best over all test subsets
Choice of Training and Test Subsets
• Let n denote total size of data set, nT denote size of test
subset, nT < n
• Common strategy is leave-one-out: nT = 1
– Implies n test subsets during cross-validation process
• Often better to choose nT > 1
– Sometimes more efficient (sampling w/o replacement)
– Sometimes more accurate model selection
• If nT > 1, sampling may be with or without replacement
– “With replacement” indicates that there are “n choose nT” possible test subsets, written C(n, nT)
– With replacement may be prohibitive in practice: e.g., n = 30, nT = 6 implies nearly 600K model fits!
• Sampling without replacement reduces number of test
subsets to n /nT (disjoint test subsets)
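A quick check of the subset counts above (a minimal sketch using Python's standard-library binomial coefficient):

```python
from math import comb

n, nT = 30, 6
print(comb(n, nT))   # 593775 possible test subsets ("with replacement"): ~600K model fits
print(n // nT)       # 5 disjoint test subsets ("without replacement")
```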
Conceptual Example of Sampling Without
Replacement: Cross-Validation with
3 Disjoint Test Subsets
[Figure: schematic of cross-validation with 3 disjoint test subsets]
Typical Steps for Cross-Validation
Step 0 (initialization) Determine size of test subsets and
candidate model. Let i be counter for test subset being used.
Step 1 (estimation) For the i th test subset, let the remaining
data be the i th training subset. Estimate  from this training
subset.
Step 2 (error calculation) Based on estimate for  from Step
1 (i th training subset), calculate MSE (or other measure) with
data in i th test subset.
Step 3 (new training and test subsets) Update i to i + 1 and
return to step 1. Form mean of MSE when all test subsets
have been evaluated.
Step 4 (new model) Repeat steps 1 to 3 for next model.
Choose model with lowest mean MSE as best.
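A minimal sketch of Steps 0-4 for polynomial candidate models on a sine-plus-noise process similar to Example 13.4 below; the fold construction, noise level, input grid, and seeds are assumptions made for illustration:

```python
import numpy as np

def cross_validation_mse(x, z, degree, n_folds=5, seed=0):
    """Steps 1-3: for each disjoint test subset, fit the model on the
    remaining data and record the MSE on the held-out points."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, n_folds)                      # disjoint test subsets
    fold_mse = []
    for test_idx in folds:
        train_idx = np.setdiff1d(idx, test_idx)               # i-th training subset
        coef = np.polyfit(x[train_idx], z[train_idx], degree) # Step 1: estimation
        pred = np.polyval(coef, x[test_idx])
        fold_mse.append(np.mean((pred - z[test_idx]) ** 2))   # Step 2: error calculation
    return np.mean(fold_mse)                                  # Step 3: mean over subsets

# Step 4: repeat for each candidate model; pick the one with lowest mean MSE
rng = np.random.default_rng(3)
x = np.linspace(-1.0, 1.0, 30)
z = np.sin(np.pi * x) + rng.normal(0.0, 0.3, size=x.size)     # sine process + noise
scores = {d: cross_validation_mse(x, z, d) for d in (1, 3, 10)}
print(scores)                       # 3rd-order model typically wins, as in Example 13.4
print("best degree:", min(scores, key=scores.get))
```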
Numerical Illustration of Cross-Validation
(Example 13.4 in ISSO)
• Consider true system corresponding to a sine function
of the input with additive normally distributed noise
• Consider three candidate models
– Linear (affine) model
– 3rd-order polynomial
– 10th-order polynomial
• Suppose 30 data points are available, divided into 5
disjoint test subsets (sampling w/o replacement)
• Based on RMS error (equiv. to MSE) over test subsets,
3rd-order polynomial is preferred
• See following plot
Numerical Illustration (cont’d): Relative Fits
for 3 Models with Low-Noise Observations
[Figure: fits of the linear, 3rd-order, and 10th-order polynomial models overlaid on the sine wave (process mean)]
Fisher Information Matrix
• Fundamental role of data analysis is to extract
information from data
• Parameter estimation for models is central to process of
extracting information
• The Fisher information matrix plays a central role in parameter estimation as a measure of information
Information matrix summarizes the amount
of information in the data relative to the
parameters being estimated
Problem Setting
• Consider the classical statistical problem of estimating
parameter vector θ from n data vectors z1, z2, …, zn
• Suppose have a probability density and/or mass function
associated with the data
• The parameters θ appear in the probability function and affect the nature of the distribution
– Example: zi ~ N(mean(θ), covariance(θ)) for all i
• Let l(θ | z1, z2, …, zn) represent the likelihood function, i.e., the p.d.f./p.m.f. viewed as a function of θ conditioned on the data
Information Matrix—Definition
• Recall likelihood function l(θ | z1, z2, …, zn)
• Information matrix defined as
Fn(θ) ≡ E[ (∂ log l/∂θ) (∂ log l/∂θ)ᵀ ]
where expectation is w.r.t. z1, z2 ,…, zn
• Equivalent form based on Hessian matrix:
Fn(θ) = −E[ ∂² log l / (∂θ ∂θᵀ) ]
• Fn(θ) is positive semidefinite of dimension p × p (p = dim(θ))
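As an illustration of the definition, the sketch below (an assumed example: i.i.d. N(μ, σ²) data with θ = (μ, σ²)) estimates Fn(θ) by Monte Carlo averaging of the score outer product and compares it with the known closed form n·diag(1/σ², 1/(2σ⁴)):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma2 = 1.0, 2.0            # true parameter values (assumed for the example)
n, reps = 50, 100_000

z = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
# Score (gradient of log-likelihood) for theta = (mu, sigma^2), one row per data set
d_mu = (z - mu).sum(axis=1) / sigma2
d_s2 = -n / (2 * sigma2) + ((z - mu) ** 2).sum(axis=1) / (2 * sigma2 ** 2)
scores = np.column_stack([d_mu, d_s2])

F_mc = scores.T @ scores / reps                              # E[(dlogl/dθ)(dlogl/dθ)ᵀ]
F_exact = n * np.diag([1 / sigma2, 1 / (2 * sigma2 ** 2)])   # known closed form
print(np.round(F_mc, 2))     # approximately [[25, 0], [0, 6.25]]
print(F_exact)
```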
Information Matrix—Two Key Properties
• Connection of Fn(θ) and uncertainty in estimate θ̂_n is rigorously specified via two famous results (θ* = true value of θ):
1. Asymptotic normality:
√n (θ̂_n − θ*) → N(0, F̄⁻¹) in distribution,
where F̄ ≡ lim (n→∞) Fn(θ*)/n
2. Cramér-Rao inequality:
cov(θ̂_n) ≥ Fn(θ*)⁻¹ for all n
Above two results indicate: greater variability of θ̂_n ↔ “smaller” Fn(θ) (and vice versa)
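A Monte Carlo sketch of the Cramér-Rao bound in the simplest setting (assumed: i.i.d. Gaussian data with known variance, so θ̂_n is the sample mean and actually attains the bound):

```python
import numpy as np

rng = np.random.default_rng(5)
theta, sigma2 = 1.0, 2.0         # assumed true mean and known noise variance
n, reps = 50, 200_000

z = rng.normal(theta, np.sqrt(sigma2), size=(reps, n))
theta_hat = z.mean(axis=1)       # MLE of theta (sample mean)

crlb = sigma2 / n                # Fn(theta)^{-1} = (n / sigma2)^{-1}
print(f"var(theta_hat) = {theta_hat.var():.5f}")
print(f"CRLB           = {crlb:.5f}")   # sample mean attains the bound here
```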
Selected Applications
• Information matrix is measure of performance for several
applications. Four uses are:
1. Confidence regions for parameter estimation
– Uses asymptotic normality and/or Cramér-Rao
inequality
2. Prediction bounds for mathematical models
3. Basis for “D-optimal” criterion for experimental
design
– Information matrix serves as measure of how well θ can be estimated for a given set of inputs
4. Basis for “noninformative prior” in Bayesian
analysis
– Sometimes used for “objective” Bayesian inference