Slides for Introduction to Stochastic Search and Optimization (ISSO) by J. C. Spall

CHAPTER 13
MODELING CONSIDERATIONS AND STATISTICAL INFORMATION

"All models are wrong; some are useful." (George E. P. Box)

• Organization of chapter in ISSO
  – Bias-variance tradeoff
  – Model selection: Cross-validation
  – Fisher information matrix: Definition, examples, and efficient computation

Model Definition and MSE

• Assume model z = h(θ, x) + v, where z is output, h(·) is some function, x is input, v is noise, and θ is vector of model parameters
  – h(·) may represent simulation model
  – h(·) may represent "metamodel" (response surface) of existing simulation
• A fundamental goal is to take n data points and estimate θ, forming θ̂_n
• A common measure of effectiveness for the estimate is the mean of the squared model error (MSE) at a fixed x:
  E{[h(θ̂_n, x) − E(z|x)]² | x}

Bias-Variance Decomposition

• The MSE of the model at a fixed x can be decomposed as:
  E{[h(θ̂_n, x) − E(z|x)]² | x}
    = E{[h(θ̂_n, x) − E(h(θ̂_n, x))]² | x} + [E(h(θ̂_n, x)) − E(z|x)]²
    = variance at x + (bias at x)²
  where expectations are computed w.r.t. θ̂_n
• Above implies:
  Model too simple ⇒ high bias / low variance
  Model too complex ⇒ low bias / high variance

Unbiased Estimator May Not Be Best (Example 13.1 from ISSO)

• Unbiased estimator is such that E[h(θ̂_n, x) | x] = E(z|x) (i.e., mean of prediction is same as mean of data z)
• Example: Let θ̂_n denote sample mean of scalar i.i.d. data as estimator of true mean θ (h(θ, x) = θ in notation above)
• Alternative biased estimator of θ is r·θ̂_n, where 0 < r < 1
• MSEs of the biased and unbiased estimators satisfy, for suitably chosen r:
  E[(r·θ̂_n − θ)²] < E[(θ̂_n − θ)²]
• Biased estimate better in MSE sense (a numerical sketch appears after Example 13.2 below)
  – However, optimal value of r requires knowledge of unknown (true) θ

Bias-Variance Tradeoff in Model Selection in Simple Problem (figure)

Example 13.2 in ISSO: Bias-Variance Tradeoff

• Suppose true process produces output according to z = f(x) + noise, where f(x) = (x + x²)^1.1
• Compare linear, quadratic, and cubic approximations
• Table below gives average bias², variance, and MSE:

                  Linear Model    Quadratic Model    Cubic Model
  bias²                510.6            0.53             0.005
  variance              10.0           20.0             30.0
  Overall MSE          520.6           20.53            30.005

• Overall pattern of decreasing bias and increasing variance; optimal tradeoff is quadratic model
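The claim in Example 13.1 above (the shrinkage estimator r·θ̂_n) is easy to check numerically. Below is a minimal Monte Carlo sketch, not taken from ISSO; the true mean, noise level, sample size, and shrinkage factor r are arbitrary illustrative assumptions.

```python
# Monte Carlo check of Example 13.1: a shrunken sample mean r*theta_hat
# can have lower MSE than the unbiased sample mean theta_hat.
# All numerical settings below are illustrative assumptions, not values from ISSO.
import numpy as np

rng = np.random.default_rng(0)
theta = 1.0          # true scalar mean (unknown in practice)
sigma = 2.0          # noise standard deviation (assumed)
n = 10               # sample size per experiment
r = 0.8              # shrinkage factor, 0 < r < 1 (assumed)
num_reps = 100_000   # Monte Carlo replications

data = rng.normal(theta, sigma, size=(num_reps, n))
theta_hat = data.mean(axis=1)                           # unbiased estimator (sample mean)

mse_unbiased = np.mean((theta_hat - theta) ** 2)        # approx sigma^2/n = 0.4
mse_shrunk   = np.mean((r * theta_hat - theta) ** 2)    # approx r^2*sigma^2/n + (1-r)^2*theta^2 = 0.296

print(f"MSE of theta_hat (unbiased): {mse_unbiased:.4f}")
print(f"MSE of r*theta_hat (biased): {mse_shrunk:.4f}")
# The MSE-optimal shrinkage is r* = theta^2 / (theta^2 + sigma^2/n),
# which requires knowledge of the unknown theta, as noted on the slide.
```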
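In the same spirit, the pattern in the Example 13.2 table can be reproduced qualitatively with a short simulation. The sketch below fits linear, quadratic, and cubic polynomials by least squares to noisy observations of f(x) = (x + x²)^1.1 and estimates bias² and variance of the prediction at one fixed input. The design points, noise level, and evaluation point are assumptions, so only the qualitative pattern (bias falling and variance rising with model complexity) should be expected to match the table, not the exact numbers.

```python
# Qualitative reproduction of the Example 13.2 pattern: as polynomial degree grows,
# squared bias falls while prediction variance rises. Design points, noise level,
# and the evaluation point are illustrative assumptions, not the settings in ISSO.
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: (x + x**2) ** 1.1           # true mean response (x >= 0 assumed)
x_design = np.linspace(0.0, 2.0, 20)      # fixed input design (assumed)
noise_sd = 1.0                            # additive noise std dev (assumed)
x0 = 1.5                                  # fixed point at which bias/variance are measured
num_reps = 5000                           # Monte Carlo replications

for degree, name in [(1, "linear"), (2, "quadratic"), (3, "cubic")]:
    preds = np.empty(num_reps)
    for rep in range(num_reps):
        z = f(x_design) + rng.normal(0.0, noise_sd, size=x_design.size)
        coefs = np.polyfit(x_design, z, degree)       # least-squares polynomial fit
        preds[rep] = np.polyval(coefs, x0)            # model prediction at x0
    bias_sq = (preds.mean() - f(x0)) ** 2
    variance = preds.var()
    print(f"{name:9s}  bias^2 = {bias_sq:8.5f}   variance = {variance:8.5f}   "
          f"MSE = {bias_sq + variance:8.5f}")
```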
Model Selection

• The bias-variance tradeoff provides a conceptual framework for determining a good model
  – Bias-variance tradeoff not directly useful by itself
• Need a practical method for optimizing the bias-variance tradeoff
• Practical aim is to pick a model that minimizes a criterion of the form:
  f1(fitting error from given data) + f2(model complexity)
  where f1 and f2 are increasing functions
• All methods based on a tradeoff between fitting error (high variance) and model complexity (low bias)
• Criterion above may or may not be explicitly used in a given method

Methods for Model Selection

• Among many popular methods are:
  – Akaike Information Criterion (AIC) (Akaike, 1974); popular in time series analysis
  – Bayesian selection (Akaike, 1977)
  – Bootstrap-based selection (Efron and Tibshirani, 1997)
  – Cross-validation (Stone, 1974)
  – Minimum description length (Rissanen, 1978)
  – V-C dimension (Vapnik and Chervonenkis, 1971); popular in computer science
• Cross-validation appears to be the most popular model-fitting method

Cross-Validation

• Cross-validation is a simple, general method for comparing candidate models
  – Other specialized methods may work better in specific problems
• Cross-validation uses the training set of data
• Method is based on iteratively partitioning the full set of training data into training and test subsets
• For each partition, estimate the model from the training subset and evaluate the model on the test subset
  – Number of training (or test) subsets = number of model fits required
• Select the model that performs best over all test subsets

Choice of Training and Test Subsets

• Let n denote the total size of the data set and n_T denote the size of a test subset, n_T < n
• Common strategy is leave-one-out: n_T = 1
  – Implies n test subsets during cross-validation process
• Often better to choose n_T > 1
  – Sometimes more efficient (sampling w/o replacement)
  – Sometimes more accurate model selection
• If n_T > 1, sampling may be with or without replacement
  – "With replacement" indicates that there are "n choose n_T" test subsets, i.e., the binomial coefficient C(n, n_T)
  – With replacement may be prohibitive in practice: e.g., n = 30, n_T = 6 implies nearly 600K model fits!
• Sampling without replacement reduces the number of test subsets to n/n_T (disjoint test subsets)

Conceptual Example of Sampling Without Replacement: Cross-Validation with 3 Disjoint Test Subsets (figure)

Typical Steps for Cross-Validation

Step 0 (initialization): Determine the size of the test subsets and a candidate model. Let i be the counter for the test subset being used.
Step 1 (estimation): For the i-th test subset, let the remaining data be the i-th training subset. Estimate θ from this training subset.
Step 2 (error calculation): Based on the estimate of θ from Step 1 (i-th training subset), calculate the MSE (or other measure) with the data in the i-th test subset.
Step 3 (new training and test subsets): Update i to i + 1 and return to Step 1. Form the mean of the MSE values when all test subsets have been evaluated.
Step 4 (new model): Repeat Steps 1 to 3 for the next candidate model. Choose the model with the lowest mean MSE as best. (A code sketch of these steps follows the numerical illustration below.)

Numerical Illustration of Cross-Validation (Example 13.4 in ISSO)

• Consider a true system corresponding to a sine function of the input with additive normally distributed noise
• Consider three candidate models:
  – Linear (affine) model
  – 3rd-order polynomial
  – 10th-order polynomial
• Suppose 30 data points are available, divided into 5 disjoint test subsets (sampling w/o replacement)
• Based on RMS error (equivalent to MSE for ranking models) over the test subsets, the 3rd-order polynomial is preferred
• See following plot

Numerical Illustration (cont'd): Relative Fits for 3 Models with Low-Noise Observations (figure; curves shown: sine wave (process mean), linear, 3rd-order, and 10th-order fits)
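The cross-validation steps above can be condensed into a few lines of code. The sketch below loosely mirrors Example 13.4 (sine-function process, 30 data points, 5 disjoint test subsets, three polynomial candidate models); the noise level, input range, and use of least-squares polynomial fitting as the estimation routine are illustrative assumptions rather than the exact setup in ISSO.

```python
# Cross-validation with disjoint (sampling w/o replacement) test subsets,
# following Steps 0-4 above. Setup loosely mirrors Example 13.4; noise level
# and input range are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(2)
n, num_subsets = 30, 5                     # n_T = n / num_subsets = 6
x = np.sort(rng.uniform(0.0, 2 * np.pi, n))
z = np.sin(x) + rng.normal(0.0, 0.2, n)    # observed data (assumed noise sd 0.2)

# Step 0: disjoint test subsets (index sets) and candidate models (polynomial degrees)
subsets = np.array_split(rng.permutation(n), num_subsets)
candidates = {"linear": 1, "3rd-order": 3, "10th-order": 10}

scores = {}
for name, degree in candidates.items():
    mse_per_subset = []
    for test_idx in subsets:                        # Steps 1-3: loop over test subsets
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        # Step 1: estimate model parameters from the training subset
        # (the 10th-order fit may trigger a conditioning warning; that is the overfitting at work)
        coefs = np.polyfit(x[train_mask], z[train_mask], degree)
        # Step 2: MSE on the held-out test subset
        resid = z[test_idx] - np.polyval(coefs, x[test_idx])
        mse_per_subset.append(np.mean(resid ** 2))
    scores[name] = np.mean(mse_per_subset)          # Step 3: mean MSE over all test subsets

for name, score in scores.items():                  # Step 4: compare candidate models
    print(f"{name:10s} mean test MSE = {score:.4f}")
print("selected model:", min(scores, key=scores.get))
```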
Fisher Information Matrix

• A fundamental role of data analysis is to extract information from data
• Parameter estimation for models is central to the process of extracting information
• The Fisher information matrix plays a central role in parameter estimation as a measure of information
• The information matrix summarizes the amount of information in the data relative to the parameters being estimated

Problem Setting

• Consider the classical statistical problem of estimating a parameter vector θ from n data vectors z1, z2, …, zn
• Suppose we have a probability density and/or mass function associated with the data
• The parameters θ appear in the probability function and affect the nature of the distribution
  – Example: zi ~ N(mean(θ), covariance(θ)) for all i
• Let l(θ | z1, z2, …, zn) represent the likelihood function, i.e., the p.d.f./p.m.f. viewed as a function of θ conditioned on the data

Information Matrix—Definition

• Recall the likelihood function l(θ | z1, z2, …, zn)
• The information matrix is defined as
  F_n(θ) ≡ E[(∂ log l/∂θ)(∂ log l/∂θ)ᵀ]
  where the expectation is w.r.t. z1, z2, …, zn
• Equivalent form based on the Hessian matrix:
  F_n(θ) = −E[∂² log l/(∂θ ∂θᵀ)]
• F_n(θ) is positive semidefinite of dimension p × p (p = dim(θ))
  (a numerical sketch of the definition appears after the applications slide below)

Information Matrix—Two Key Properties

• The connection between F_n(θ) and the uncertainty in the estimate θ̂_n is rigorously specified via two famous results (θ* = true value of θ):
  1. Asymptotic normality:
     √n (θ̂_n − θ*) → N(0, F̄⁻¹) in distribution, where F̄ = lim (n→∞) F_n(θ*)/n
  2. Cramér-Rao inequality (for unbiased θ̂_n):
     cov(θ̂_n) ≥ F_n(θ*)⁻¹ for all n
• The two results above indicate: greater variability of θ̂_n ⇔ "smaller" F_n(θ) (and vice versa)

Selected Applications

• The information matrix is a measure of performance for several applications. Four uses are:
  1. Confidence regions for parameter estimation
     – Uses asymptotic normality and/or the Cramér-Rao inequality
  2. Prediction bounds for mathematical models
  3. Basis for the "D-optimal" criterion for experimental design
     – Information matrix serves as a measure of how well θ can be estimated for a given set of inputs
  4. Basis for the "noninformative prior" in Bayesian analysis
     – Sometimes used for "objective" Bayesian inference
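As a concrete illustration of the definition above (and not of the efficient computation methods discussed in ISSO), the sketch below approximates F_n(θ) by Monte Carlo averaging of the score outer product for i.i.d. scalar data zi ~ N(μ, σ²), with θ = (μ, σ²), and compares the result with the known closed form n·diag(1/σ², 1/(2σ⁴)). The parameter values and replication count are assumptions chosen for illustration.

```python
# Monte Carlo evaluation of the information matrix definition
#   F_n(theta) = E[ (d log l / d theta)(d log l / d theta)^T ]
# for i.i.d. z_i ~ N(mu, sigma^2) with theta = (mu, sigma^2).
# Closed form for comparison: F_n = n * diag(1/sigma^2, 1/(2*sigma^4)).
# Parameter values and replication count are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
mu, sigma2 = 2.0, 1.5         # true parameter values (assumed)
n = 20                        # number of data points per likelihood evaluation
num_reps = 100_000            # Monte Carlo replications

def score(z, mu, sigma2):
    """Gradient of the log-likelihood of i.i.d. N(mu, sigma2) data w.r.t. (mu, sigma2)."""
    d_mu = np.sum(z - mu) / sigma2
    d_sigma2 = -z.size / (2 * sigma2) + np.sum((z - mu) ** 2) / (2 * sigma2 ** 2)
    return np.array([d_mu, d_sigma2])

F_hat = np.zeros((2, 2))
for _ in range(num_reps):
    z = rng.normal(mu, np.sqrt(sigma2), size=n)
    g = score(z, mu, sigma2)
    F_hat += np.outer(g, g) / num_reps    # average of score outer products

F_exact = n * np.diag([1 / sigma2, 1 / (2 * sigma2 ** 2)])
print("Monte Carlo estimate of F_n(theta):\n", np.round(F_hat, 3))
print("Closed form n*diag(1/sigma^2, 1/(2*sigma^4)):\n", np.round(F_exact, 3))
```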