Optimal Design Theory for Linear Models, Part I

The notation used here approximately (but not exactly) follows that of Silvey (1980). Less theoretical introductions to optimal design than Silvey's are included in many other textbooks, including Myers, Montgomery, and Anderson-Cook (2009) and Morris (2011).

Introduction

Our presentation of orthogonal arrays was based loosely on an argument that the orthogonality property made them "best" in a relevant statistical sense. In this section, we'll address the idea of "best" more carefully, and consider more direct and less restrictive ways of relating definitions of "best" to specific designs. In short, we usually do this by defining optimality criteria: functions of designs that represent their quality from a statistical perspective. We then define and solve (or try to solve) an optimization problem that consists of finding the (or a) design that leads to the largest (or smallest) value of the optimality criterion, and call such a design "optimal" (with respect to that criterion). Criterion functions (or often just criteria for short) are formulated in different ways, reflecting the specific goals of an experiment, and designs that are optimal with respect to one criterion are not necessarily optimal with respect to others. General theory has been developed that allows optimal designs to be derived analytically in some cases. Where this is impossible or impractical, numerical methods are employed to solve the optimization problem. (However, many of the algorithms used in numerical optimization of designs are rooted in, or at least related to, the analytical theory.)

Some Notation

Before undertaking a general discussion of theory, we shall define some notation that will hopefully be broad enough to carry us through any linear models setting. Data for any specific experimental trial or run consist of a set of r controllable predictors, denoted by the r-vector u, and a scalar-valued response y.
We usually think of these variables as being stated in "natural units", e.g. the elements of u may be in units of pounds or dollars, and have not generally been scaled to something more numerically convenient. Parametric inference requires that we specify the form of a model relating y to u, and for linear models we will write:

y = Σ_{i=1}^k θi fi(u) + ε,  or  y = Σ_{i=1}^k θi xi + ε

where:
• θi, i = 1, 2, ..., k, are unknown quantities
• fi, i = 1, 2, ..., k, are known functions
• ε is a random variable with mean 0 and variance σ²

The values that can be entertained for the vector u constitute the design space U, the physically meaningful domain; each u ∈ U. The induced design space is the set of corresponding values that can be entertained for x = (x1, x2, ..., xk)′; each x ∈ X. Another way to say this is X = f(U), where f is the vector-valued function (f1, f2, ..., fk)′. An experiment consists of N experimental trials, with an experimental design denoted as U = {u1, u2, ..., uN}. For present purposes, we shall assume that the random variable ε is i.i.d. across these runs. Each vector of predictors defines a corresponding vector from the induced design space, xi = f(ui), i = 1, 2, ..., N. This, in turn, leads to what we shall call the design information matrix or design moment matrix:

M(U) = (1/N) Σ_{i=1}^N xi xi′

Apart from the factor of N⁻¹, this is the familiar "X′X matrix" in notation that is often used (for example, in STAT 512), and can be regarded as a "fundamental element" in describing the quality of statistical inference that can be drawn from data collected with design U. For example, under the stated assumptions,

Var(θ̂ | U) = (σ²/N) M(U)⁻¹,

if the inverse exists. This, and other common variance formulae, suggest the intuition that good designs are associated with "large" M matrices, and by extension, that optimal designs should be associated with the "largest" M matrices.
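As a concrete illustration, the moment matrix and resulting parameter covariance can be computed directly. This is only a sketch with an arbitrary, hypothetical 4-run design and an assumed σ² = 1; the model is the SLR parameterization x = (1, u)′ used later in these notes.

```python
import numpy as np

# Hypothetical 4-run design in one controllable variable, f(u) = (1, u)'
design_u = [-1.0, -0.5, 0.5, 1.0]
X = np.array([[1.0, u] for u in design_u])   # rows are x_i = f(u_i)
N = len(design_u)

M = X.T @ X / N                   # design moment matrix M(U) = X'X / N
sigma2 = 1.0                      # assumed error variance
cov_theta = sigma2 / N * np.linalg.inv(M)    # Var(theta_hat | U)
print(M)            # diag(1, 0.625) for this design
print(cov_theta)    # diag(0.25, 0.4)
```

The determinant, trace, and eigenvalues of this M are the raw ingredients for every criterion discussed below.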
One concept that has been used for sorting out which design moment matrices are largest is the Loewner partial ordering for symmetric matrices, which defines M1 = M(U1) as being weakly "greater than" M2 = M(U2) iff M1 − M2 is positive semi-definite, suggesting that U1 is a superior design to U2. The hope, then, would be that for a collection of competing designs, the design moment matrix for one (or a few) would dominate those of the other designs in this sense. When this occurs, it is difficult to think of a statistical reason to dispute the indicated optimal design. Unfortunately, this is not the case in most practical situations. For example, consider the following two designs in N = 4 runs, with U = [−1,+1]² for first-order regression (i.e. an intercept and two slopes):

[Figure: dot plots of the two designs. U1 places one point at each of the four corners of [−1,+1]²; U2 also uses corner points, but replicates one corner and omits another.]

Although U1 would be considered superior to U2 for nearly all statistical purposes (except for the benefit of pure replication in the second design), the eigenvalues of M1 − M2 are, in this case, 0.7071, 0, and −0.7071, and so neither design is preferred to the other in terms of the Loewner partial ordering. In fact, the sense of "ordering" implied here is so strong that it is of very little practical use, since far too many pairs of designs cannot be ordered in most problems of real interest. A more practical approach to optimal design is to base the ranking of alternative designs on a scalar-valued criterion function, φ(M), so that the problem of optimal design construction becomes one of function optimization. In the following sections, we briefly describe some of the criteria more commonly used in practice. Most of the effort in optimal design construction has as its goal finding the optimal design or designs for a given problem.
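This failure to order can be checked numerically. The point sets below are a reconstruction (the original figure did not survive extraction): U1 is taken as the 2² factorial, and U2 as an assumed set of corner points with one corner replicated and one omitted, chosen to be consistent with the eigenvalues stated above.

```python
import numpy as np

def moment_matrix(design):
    """M(U) = X'X / N for a first-order model with x = (1, u1, u2)'."""
    X = np.array([[1.0, u1, u2] for u1, u2 in design])
    return X.T @ X / len(design)

U1 = [(-1, -1), (-1, 1), (1, -1), (1, 1)]    # 2^2 factorial (four corners)
U2 = [(1, 1), (1, 1), (-1, 1), (-1, -1)]     # assumed: one corner replicated

M1, M2 = moment_matrix(U1), moment_matrix(U2)
eigs = np.linalg.eigvalsh(M1 - M2)
print(np.round(eigs, 4))
# one negative and one positive eigenvalue (about -0.7071 and +0.7071):
# M1 - M2 is indefinite, so neither design dominates in the Loewner ordering
```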
However, we note at this point that in many cases, the criterion can be interpreted as being "monotonically related to the quality of the design", so that two designs can be ordered in their desirability based on the criterion function, even if neither is optimal. This also leads to various definitions of "efficiency" of a design, based on the difference or ratio of the criterion value of that design and the criterion value of an optimal design of the same size (N).

D-optimality

We turn now to examining specific optimality criteria that are in common use. Probably the most often used (and certainly the one for which the theory described later is best developed) is D-optimality. This criterion is appropriate for experiments for which the overall goal is satisfied by estimating θ well, and can be developed by noting that the classical confidence region for this parameter vector can be written as

CR(θ) = {θ : (θ − θ̂)′ M(U) (θ − θ̂) ≤ c1}

The boundary that defines this region is an ellipsoid in k-dimensional space, centered on the vector θ̂, with volume c2 |M(U)|^(−1/2). Hence, for any values of σ² and θ̂, the volume of CR is minimized by a design that minimizes:

φ(M(U)) = |M(U)⁻¹|  or  φ(M(U)) = −log|M(U)|

or equivalently, maximizes:

φ(M(U)) = |M(U)|  or  φ(M(U)) = log|M(U)|

and designs that are optimal with respect to this criterion are called D-optimal (for "determinant"). This is a reasonable choice of criterion for designing an experiment "to estimate θ well," but it is not fool-proof. For example, suppose that for a particular problem, the design that minimizes the volume of CR(θ) actually produces confidence ellipsoids that are extremely long in one direction, but very short in all directions orthogonal to this. (This would correspond to a matrix M(U) with one eigenvalue very small, but all others very large.)
While the volume of the confidence region might be small, the length of any or all confidence intervals on individual elements of θ̂ could be very large! This remark shouldn't be interpreted as a weakness of D-optimality specifically; similar arguments can be constructed for the "fallibility" of all common criteria. It is, instead, the inevitable consequence of relying on a single value (the criterion) to represent the value of an experimental design (which is hardly a scalar-valued structure). Reduction to a scalar-valued objective function makes optimization possible, but at a price. A reasonable practical approach to dealing with this is to construct a design using an appropriate criterion function, and check the quality of the design with respect to other related criteria to be sure that it is at least "good" with respect to them all.

Example: Simple Linear Regression

For sample size N = 3, set U = [−1,+1]. For simple linear regression, without scaling the independent variable, the usual parameterization is x = (1, u)′, so X = {1} × [−1,+1]. In this case,

M(U) = (1/3) [ 3, Σui ; Σui, Σui² ]

|M(U)| = (1/9) (3 Σui² − (Σui)²)

That is, |M(U)| is proportional to the sample variance of the ui's; therefore, any U that does not contain u = −1 and u = +1 can't be optimal, because "spreading it out" to the borders would increase that variance. So, let u1 = +1, u2 = −1; then

|M(U)| = (1/9)(3(2 + u3²) − u3²) = (1/9)(6 + 2u3²).

This quantity is maximized at u3 = +1 or −1, so:

U = {+1, −1, −1},  X = [ 1 +1 ; 1 −1 ; 1 −1 ]
and
U = {+1, −1, +1},  X = [ 1 +1 ; 1 −1 ; 1 +1 ]

are D-optimal for SLR with this U and N = 3. (It should be obvious that the argument would take the same form for any interval on the real line.)

Invariance of D-optimality

D-optimality has a linear invariance property not shared by some of the other popular optimality criteria. In the standard definition of the problem, we begin with u and transform to x = f(u), leading to what we'll temporarily call Mx(U).
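As a quick numerical check of this example, we can exhaustively search all N = 3 designs on a grid over [−1,+1] (the grid resolution here is an arbitrary choice) and confirm that |M| is maximized by designs with two runs at one endpoint and one run at the other:

```python
import itertools
import numpy as np

def det_M(design):
    """|M(U)| for SLR, x = (1, u)'."""
    X = np.array([[1.0, u] for u in design])
    return np.linalg.det(X.T @ X / len(design))

grid = np.linspace(-1, 1, 21)
best = max(itertools.combinations_with_replacement(grid, 3), key=det_M)
print(best, det_M(best))
# best design uses only the endpoints -1 and +1, with |M| = 8/9
```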
Suppose, now, that we further linearly transform x as z = Tx, where T is a square matrix of full rank. In the new parameterization, the design information matrix is

Mz = (1/N) Σ_{i=1}^N zi zi′ = T [ (1/N) Σ_{i=1}^N xi xi′ ] T′ = T Mx T′.

But it follows from this that |Mz| = |T|² |Mx| (since the determinant of a product of square matrices is the product of determinants). So any design that maximizes |Mx| also maximizes |Mz| and vice versa; that is, D-optimality is invariant under nonsingular linear transformation of x. (In fact, beyond the question of which design is optimal, it is clear that the ranking of designs of the same size (N) by this criterion is unchanged by linear transformation.)

Example: SLR, continued

Recall that the parameterization from the SLR example is x′ = (1, u). Now consider the effect of a linear transformation defined by

T = [ +1 +1 ; +1 −1 ],  z = Tx = ( 1+u , 1−u )′

That is, the new parameterization relates the dependent and predictor variables as y = θ1(1+u) + θ2(1−u) + ε, and Z is an upper-left to lower-right diagonal segment in [0,2]². Transforming the model matrix for one of the optimal designs in x to the new parameterization, we have

Xx = [ 1 +1 ; 1 −1 ; 1 +1 ]  →  Xz = [ 2 0 ; 0 2 ; 2 0 ].

Note that, as in the x-parameterization, the optimal design also leads to placement of two design points at one end of the induced design space and one point at the other end in the z-parameterization.

DA and Ds

Here is one generalization of D-optimality in two forms, one of which is a special case of the other. Suppose we want to estimate A′θ well, where A′ is s × k of full rank s, with s ≤ k. Note that when s = k, this is essentially a full-rank transformation as discussed above, so that a DA-optimal design in this case is the same as the D-optimal design; interesting cases correspond to s < k.
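The invariance identity |Mz| = |T|²|Mx| is easy to verify numerically. A minimal sketch, using the T from the SLR example and a few arbitrary 3-run designs:

```python
import numpy as np

T = np.array([[1.0, 1.0], [1.0, -1.0]])   # z = (1+u, 1-u)'

def Mx(design):
    X = np.array([[1.0, u] for u in design])
    return X.T @ X / len(design)

for design in ([-1, -1, 1], [-1, 0, 1], [-0.5, 0.2, 0.9]):
    M_x = Mx(design)
    M_z = T @ M_x @ T.T
    # |Mz| = |T|^2 |Mx| holds for every design, so D-rankings agree
    assert np.isclose(np.linalg.det(M_z),
                      np.linalg.det(T)**2 * np.linalg.det(M_x))
```

Since |T|² = 4 is the same constant for every design, the criterion values in the two parameterizations differ only by that factor, which cannot change which design is best.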
By reasoning analogous to that given above, designs that minimize the volume of a CR for A′θ are those that minimize

φ(M(U)) = |A′ M⁻¹(U) A|

A special case (which is the most common form you'll see in the literature) is where A′ = (I | 0), so that the focus is on a subset of s of the k parameters; this is Ds-optimality (for "subset"). (I've simplified the form of A here by writing it as if the first s parameters are the ones of interest.) Suppressing the argument U, let

M = [ M11 M12 ; M21 M22 ],  M⁻¹ = [ M^11 M^12 ; M^21 M^22 ]

where the matrix partitioning is as suggested by A′. The central matrix of the quadratic form defining a confidence region for the first s model parameters, acknowledging the remaining k − s as nuisance parameters, is M11 − M12 M22⁻¹ M21; hence the criterion for Ds-optimality can be written as:

φ(M(U)) = |M11 − M12 M22⁻¹ M21|, to be maximized, or
φ(M(U)) = |M^11| = |M11 − M12 M22⁻¹ M21|⁻¹, to be minimized.

While the notions of DA and Ds seem at first glance to be fairly natural and straightforward extensions of D-optimality, it should be noted that they can lead to substantially more difficult design construction problems (both analytically and computationally). This is because A′θ may be estimable for designs for which M is not of full rank, and in fact, such designs may actually be optimal. For example, consider multiple linear regression in two predictors, U = [−1,+1]²,

y = θ0 + θ1 u1 + θ2 u2 + ε

and let the subset of parameters of interest be (θ0, θ1)′. It is not difficult to show that when N is even,

U = { N/2 points at (+1, 0), N/2 points at (−1, 0) }

is an optimal design for estimating the subset, but

M = [ 1 0 0 ; 0 1 0 ; 0 0 0 ]

is singular. As a result, neither M⁻¹ nor M22⁻¹ exists! We defer until later the accommodations that are required to deal with this issue. One additional general point related to Ds-optimality should be made. You will recall that in STAT 512, much was made of the view that experiments should be comparative.
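The degenerate design can still be examined numerically by replacing the ordinary inverse with a generalized inverse, which is valid here because (θ0, θ1)′ is estimable (its coefficient vectors lie in the column space of X′X). The sketch below compares the singular design against the 2² factorial at N = 4; both attain the same covariance for the subset, illustrating why a singular-M design can deserve to be called optimal.

```python
import numpy as np

def subset_cov(design, idx):
    """Covariance (up to sigma^2) of the estimable subset theta[idx],
    using a pseudo-inverse so that singular X'X is allowed."""
    X = np.array([[1.0, u1, u2] for u1, u2 in design])
    XtX_pinv = np.linalg.pinv(X.T @ X)
    return XtX_pinv[np.ix_(idx, idx)]

# Degenerate design: N/2 points at (+1,0), N/2 at (-1,0); theta2 not estimable
U_deg = [(1, 0)] * 2 + [(-1, 0)] * 2
# A nonsingular competitor: the 2^2 factorial
U_fac = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

print(subset_cov(U_deg, [0, 1]))   # diag(1/4, 1/4)
print(subset_cov(U_fac, [0, 1]))   # also diag(1/4, 1/4)
```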
Rationale for this is based in the argument that, because tight experimental control (exercised to reduce variability) can result in unique common conditions for all runs in an experiment, comparisons between data in different experiments should not be expected to reflect only effects associated with experimental treatments. A direct consequence of this line of thinking is to treat the model intercept as a nuisance parameter. This point of view is nearly universal in what might be called the "traditional" treatment of statistical experimental design as developed by R.A. Fisher and those who followed him. In contrast to this, in much of the literature on optimal experimental design, less (and sometimes no) attention has been given to such issues as unit randomization and the overall effect of experimental control. Most of the design criteria we will discuss, at least in their basic forms, treat the entire data model, including the intercept, as "meaningful" from an experimental point of view. Hence D-optimality is based on a summary measure of precision of all parameters in the expectation portion of the linear model, including the intercept, while Ds-optimality offers one way to re-cast optimal design ideas in a context that is closer to the "traditional" perspective.

G-optimality

While D-optimality is motivated by the need to estimate model parameters well, G-optimality (for "global") is motivated by direct estimation of the expected response. The goal here is to estimate E(y(u)) well throughout U. For a given design, and at any one location u, the variance of that estimate is

Var(ŷ(u)) = (σ²/N) x′ M(U)⁻¹ x

where x = f(u). For any σ², the G-optimal design minimizes the largest such variance (over points in the induced design space):

φ(M(U)) = max_{x∈X} x′ [M(U)]⁻¹ x, to be minimized

or equivalently, maximizes:

φ(M(U)) = −max_{x∈X} x′ [M(U)]⁻¹ x

As defined, the optimization problem required by the G-criterion is more difficult than that for the D-criterion (and most others).
This is because evaluation of the criterion for a given design requires a complete optimization over all x ∈ X. Direct construction of a G-optimal design is then an optimization (over x ∈ X) within an optimization (over U ∈ U^N), leading to analytical and computational difficulties. Note that φ(M(U)) can also be written as −max_{x∈X} trace([M(U)]⁻¹ xx′). If we had decided to look for the design that minimizes the average, instead of the maximum, variance of estimated expected response, then an appropriate criterion might be

φ(M(U)) = ∫_{x∈X} trace([M(U)]⁻¹ xx′) ω(x) dx = trace( [M(U)]⁻¹ ∫_{x∈X} xx′ ω(x) dx )

The last integral is a region moment matrix, which does not depend on the design. Finding an optimal design here would not require the "inner" optimization (over x). This is a special case of what is called "linear optimality":

φ(M(U)) = trace([M(U)]⁻¹ C) = trace(C M(U)⁻¹), for a fixed k × k matrix C

Despite its construction advantages, this criterion may be less popular than G-optimality because it requires a weight function ω to define the average. (This can be ignored, but that amounts to tacitly assigning uniform weight across X, and note that except in simple cases, this is not equivalent to uniform weight across U.) The advantage of G-optimality is that the mini-max approach focuses entirely on the worst case (the point of greatest estimation variance), and so does not require specification of the relative importance of one region of X relative to another. However, the "average" version of this criterion is sometimes used, especially in response surface analysis applications; this is developed further in the discussion of "Q-optimality" in the section below on average performance of ŷ.

SLR Example, continued

We continue with the example of simple linear regression from the notes on D-optimality.
For any design U = {u1, u2, ..., uN}, it is straightforward to show that the variance of ŷ(u) is minimized when u is the average value of the controlled variable over the design runs, and that this variance is a quadratic function of u, so that the maximum variance of ŷ occurs at u = −1 or +1, the most extreme values of U. We determined that for N = 3, the D-optimal designs are U1 = {−1, −1, +1} and U2 = {−1, +1, +1}. For these designs, Var(ŷ) = (1/2)σ² and σ² at the two end-points, so that max_u Var(ŷ(u)) = σ². Consider now an alternative design, U3 = {−1, 0, +1}; for this plan, Var(ŷ(−1)) = Var(ŷ(+1)) = (5/6)σ². Therefore, the D-optimal plan cannot also be G-optimal. A remaining question: Is U3 a G-optimal design for this problem?

Other Criteria

Suppose we want to minimize the average estimation variance of several linear combinations of the θ's, A′θ. The motivation is similar to that for DA, but focuses on the average variance (ignoring correlations) instead of the volume of the CR. This leads to a criterion function for what is called A-optimality (for "average"):

φ(M(U)) = trace(A′ M(U)⁻¹ A)

which is minimized by the optimal design. Note that this can also be written as

φ(M(U)) = trace(M(U)⁻¹ AA′)

and so is also a special case of the "linear optimality" mentioned above. When A′ contains only one row, this is sometimes called c-optimality:

φ(M(U)) = c′ M(U)⁻¹ c = trace(M(U)⁻¹ cc′)

which is analogous to Ds where the subset contains only one parameter. With both A and c, complications can again arise with designs that should be called "optimal", but have singular M. Related to c-optimality, but without this problem, is E-optimality (for "eigenvalue"), which calls for minimizing

φ(M(U)) = max_{c′c=1} c′ M(U)⁻¹ c = ev_max(M(U)⁻¹)

or maximizing φ(M(U)) = ev_min(M(U)).

Criteria Involving the Average (over u) Performance of ŷ(u)

The notes to this point largely follow the development of Silvey's book.
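The variances quoted above, and a tentative answer to the remaining question, can be checked numerically. The sketch below evaluates max_u Var(ŷ(u))/σ² on a fine test grid, then searches all 3-run designs on a coarse design grid; both grid resolutions are arbitrary choices, so the search only suggests (rather than proves) that U3 = {−1, 0, +1} is G-optimal.

```python
import itertools
import numpy as np

def max_pred_var(design, n_test=201):
    """max over u in [-1,+1] of Var(yhat(u)) / sigma^2 for SLR."""
    X = np.array([[1.0, u] for u in design])
    XtX = X.T @ X
    if np.linalg.det(XtX) < 1e-9:        # degenerate design: infinite variance
        return np.inf
    XtX_inv = np.linalg.inv(XtX)
    us = np.linspace(-1, 1, n_test)
    F = np.column_stack([np.ones_like(us), us])
    return np.max(np.einsum('ij,jk,ik->i', F, XtX_inv, F))

print(max_pred_var([-1, -1, 1]))   # 1.0 for the D-optimal design
print(max_pred_var([-1, 0, 1]))    # 5/6 for U3

grid = np.linspace(-1, 1, 21)
best = min(itertools.combinations_with_replacement(grid, 3), key=max_pred_var)
print(best)   # the grid search returns {-1, 0, +1}
```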
This section is closer to the development of Myers et al. (I've tried to make the notation consistent, but this was a last-minute addition, so let me know if you find something that is totally inconsistent.)

Variance optimality

Recall that G-optimality uses a criterion based on the precision of mean response estimation:

φ(M(U)) = max_{x∈X} Var[ŷ(x)] = (σ²/N) max_{x∈X} x′ M⁻¹ x

As briefly noted above, we could also define the criterion by averaging over the design region rather than focusing on the point of maximum variance, i.e.

φ(M(U)) = (σ²/N) ∫_{u∈U} x′ M⁻¹ x du

This could also be defined with integration over x ∈ X; here we integrate with respect to u, with the understanding that x is defined as a function of u. Myers et al. call this Q-optimality, and it is also sometimes called IV- or I-optimality, for "Integrated Variance". As noted above, general "averaging" with an integral can be defined with a weight function, but here we'll compute "averages" as uniformly weighted integrals in u ∈ U, with the understanding that this can be made more general when that helps. Define the volume of U to be Ω = ∫_{u∈U} 1 du. Then a Q-optimality criterion function can be developed as:

ave_{u∈U} Var[ŷ(u)] = (1/Ω) ∫_{u∈U} σ² x′ (X′X)⁻¹ x du
 = (σ²/(NΩ)) ∫_{u∈U} x′ M(U)⁻¹ x du
 = (σ²/(NΩ)) ∫_{u∈U} trace[M(U)⁻¹ xx′] du
 = (σ²/N) trace[ M(U)⁻¹ (1/Ω) ∫_{u∈U} xx′ du ]
 = (σ²/N) trace[M(U)⁻¹ μ]

Here, μ = (1/Ω) ∫_{u∈U} xx′ du is a region moment matrix, and is analogous to what M would be in a limit as the experimental design increases in size in such a way that it "uniformly fills" U. Consistent with criterion functions presented above, we can omit constant factors of σ² and N, and define

φ(M(U)) = trace[M(U)⁻¹ μ]

which is a specific form of the linear optimality discussed above.

Example: First-order regression model in U = [−1,+1]^r

μ = (1/Ω) ∫_{u1∈[−1,+1]} ... ∫_{ur∈[−1,+1]} [ 1, u1, ..., ur ; u1, u1², ..., u1ur ; ... ; ur, uru1, ..., ur² ] du1 ... dur
Carrying out the integration on each scalar quantity:

• 1 integrates to 2^r, so Ω = 2^r
• odd powers each integrate to zero
• ui² integrates to 2^r (1/3)

so μ = diag(1, 1/3, 1/3, ..., 1/3). Q-optimality therefore leads to minimization of Var(θ̂1) + (1/3) Σ_{i=2}^{r+1} Var(θ̂i). The most commonly used form of A-optimality employs A = I, leading to a criterion that is minimization of the average of all coefficient estimate variances; Q-optimality resembles this but places more weight on the intercept. A little reflection shows that any 2-level orthogonal fractional factorial of resolution at least 3 is Q-optimal for first-order models on U = [−1,+1]^r.

Example: Second-order regression model in U = [−1,+1]²

With x′ = (1, u1, u2, u1², u1u2, u2²),

μ = (1/Ω) ∫_{u1∈[−1,+1]} ∫_{u2∈[−1,+1]} xx′ du1 du2

where xx′ is a symmetric matrix whose entries are monomials in u1 and u2 of total order up to four (e.g. u1², u1³u2, u1u2³, u1²u2², u1⁴, u2⁴ ). Here,

• Ω = 4
• ui² integrates to 2²(1/3)
• u1²u2² integrates to 2²(1/9)
• ui⁴ integrates to 2²(1/5)
• all terms with any odd power integrate to zero

Bias optimality

Q-optimality is based on average variance, but if the fitted model is of incorrect form, bias is also an issue. For example, for simple linear regression, r = 1, with U = [−1,+1], U = {−1, +1} is a Q-optimal design (along with any other reasonable form of optimality you might develop based only on variance). The alternative design, U = {−3/4, +3/4}, is certainly not Q-optimal, but may have smaller expected squared error in the estimates of E(y(u)) at most values of u ∈ U if the actual model is quadratic, e.g.

[Figure: a quadratic true response curve on u ∈ [−1,+1], with fitted straight lines from the two designs.]

To make this argument more formal, suppose E(y(u)) = x1′θ1 + x2′θ2, but we fit a model of form ŷ(u) = x1′θ̂1. The squared error of ŷ at any u is:

err(u)² = (x1′θ1 + x2′θ2 − x1′θ̂1)²

The expectation of this squared error is comprised of squared bias and variance terms:

E[err(u)²] = [E err(u)]² + Var[err(u)].
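The Q-criterion trace[M⁻¹μ] is cheap to evaluate once μ is in hand. A minimal sketch for the first-order model with r = 2, comparing the 2² factorial against an arbitrarily shrunken copy of it (the shrink factor 0.5 is just an illustrative choice):

```python
import numpy as np

mu = np.diag([1.0, 1/3, 1/3])   # region moment matrix for r = 2

def q_criterion(design):
    """trace(M^{-1} mu) for the first-order model x = (1, u1, u2)'."""
    X = np.array([[1.0, u1, u2] for u1, u2 in design])
    M = X.T @ X / len(design)
    return np.trace(np.linalg.inv(M) @ mu)

corners = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
shrunk = [(0.5 * u1, 0.5 * u2) for u1, u2 in corners]
print(q_criterion(corners))  # 1 + 1/3 + 1/3 = 5/3
print(q_criterion(shrunk))   # 1 + 4/3 + 4/3 = 11/3, worse (larger)
```

The factorial attains M = μ with the smallest possible intercept variance, consistent with the claim that resolution ≥ 3 orthogonal two-level designs are Q-optimal here.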
In turn, this expression can be integrated with respect to u to get a measure of integrated mean (or expected) squared error (IMSE), and the integration can be applied to the two components individually. Extend the definition of design and region moment matrices introduced earlier as:

M11(U) = (1/N) Σ_{i=1}^N x1,i x1,i′   M12(U) = (1/N) Σ_{i=1}^N x1,i x2,i′
μ11 = (1/Ω) ∫_{u∈U} x1 x1′ du   μ12 = (1/Ω) ∫_{u∈U} x1 x2′ du

Then we can write the integrated variance and squared-bias components as:

• V = ∫_{u∈U} Var[err(u)] du = (σ²/N) trace[M11(U)⁻¹ μ11]  (as with Q-optimality)

• B = ∫_{u∈U} [E err(u)]² du
  = θ2′ (μ22 − M21 M11⁻¹ μ12 − μ21 M11⁻¹ M12 + M21 M11⁻¹ μ11 M11⁻¹ M12) θ2

Hence IMSE-optimality would suggest minimizing a criterion function comprised of the sum of these two terms: IMSE = B + V. This has been tried in some contexts, but it is generally difficult in most practical cases because we don't know the relative size of θ2 and σ². For contexts in which variance is expected to dominate error (i.e. θ2 is expected to be small relative to σ²), ignoring B leads to Q-optimality. Now consider the opposite, where B is expected to dominate, so that V might be ignored. The expression above for squared bias can be reduced by adding and subtracting μ21 μ11⁻¹ μ12 in the central matrix, leading to a decomposition of B into two pieces, B = B1 + B2, with:

• B1 = θ2′ (μ22 − μ21 μ11⁻¹ μ12) θ2
• B2 = θ2′ (M21 M11⁻¹ − μ21 μ11⁻¹) μ11 (M11⁻¹ M12 − μ11⁻¹ μ12) θ2

Note that B1 is not a function of the design; it is determined only by the model form, U, and the unknown value of θ2, so we can ignore it for purposes of comparing designs. This would suggest using B2 as a design criterion, but this is still impractical since θ2 is unknown. What can be done, at least in some instances, is to design the experiment so that B2 = 0 for any value of θ2.
From the structure of B2, it is immediate that:

• a necessary and sufficient condition for B2 = 0 is: M11⁻¹ M12 = μ11⁻¹ μ12
• a sufficient condition for B2 = 0 is: M11 = μ11 and M12 = μ12

Note that M11⁻¹ M12 is what we called the "alias matrix" in STAT 512 and STAT 513. The sufficient condition is noted since it is sometimes easier to deal with in practice. Notice that this is a fundamentally different approach to design than those above that are associated with criterion functions. In the former cases (as we've noted), φ is a measure of "goodness" that can be used to rank designs even when they are not optimal. Because this "minimum bias" argument does not anticipate or "average over" the value of θ2, it does not result in a function φ that is directly associated with a statistical performance measure, but specifies "all or nothing" conditions that, when met, minimize integrated squared bias of expected response regardless of this value.

Example: First-order regression model in U = [−1,+1]^r

Suppose a first-order regression model is fitted in U = [−1,+1]^r, but in truth, data are being generated by a second-order polynomial. Then, with x1′ = (1, u1, ..., ur) and x2′ = (u1², ..., ur², u1u2, ...):

• μ11 = diag(1, 1/3, 1/3, ..., 1/3)
• μ12 = (1/Ω) ∫_{u∈U} x1 x2′ du, whose first row contains 1/3 in each pure-quadratic column and 0 in each cross-product column, and whose remaining rows are all zero.

The pattern of nonzeros in these matrices matches what we would have in M11 and M12 for a regular fractional factorial of resolution at least 3, but if the fraction were scaled so that all the values of ui in the design were ±1, these nonzeros would each be 1. A little thought suggests that the sufficient condition for minimum integrated bias can be achieved by rescaling such a design so that each ui is ±1/√3. This is what intuition (and the figure a few pages earlier) suggests; minimizing integrated bias generally requires a design that is "shrunken" away from the borders of U, relative to a design that is Q-optimal (or most other versions of optimality based only on variance considerations).
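The ±1/√3 rescaling can be verified directly. The sketch below takes r = 2 and the full 2² factorial (so all odd design moments vanish), and checks the sufficient condition M11 = μ11 and M12 = μ12 for a first-order fit when the truth is second-order:

```python
import numpy as np

c = 1 / np.sqrt(3)
design = [(s1 * c, s2 * c) for s1 in (-1, 1) for s2 in (-1, 1)]  # scaled 2^2

# first-order terms x1 = (1, u1, u2)'; second-order terms x2 = (u1^2, u2^2, u1u2)'
X1 = np.array([[1.0, u1, u2] for u1, u2 in design])
X2 = np.array([[u1**2, u2**2, u1 * u2] for u1, u2 in design])
N = len(design)
M11, M12 = X1.T @ X1 / N, X1.T @ X2 / N

mu11 = np.diag([1.0, 1/3, 1/3])            # region moments over [-1,+1]^2
mu12 = np.zeros((3, 3)); mu12[0, :2] = 1/3

print(np.allclose(M11, mu11) and np.allclose(M12, mu12))  # True: B2 = 0
```

With the points at ±1 instead, M12's first row would contain 1's where μ12 has 1/3's, the alias-matrix condition would fail, and B2 would be positive whenever θ2 has a pure-quadratic component.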
References

Myers, R.H., D.C. Montgomery, and C.M. Anderson-Cook (2009). Response Surface Methodology: Process and Product Optimization Using Designed Experiments, 3rd ed. Wiley, New York.

Morris, M.D. (2011). Design of Experiments: An Introduction Based on Linear Models. Chapman and Hall, Boca Raton, FL.

Silvey, S.D. (1980). Optimal Design: An Introduction to the Theory for Parameter Estimation. Chapman and Hall, London.