Extensions to Nonlinear Models

The theory we've covered to this point is specifically developed for models that are linear in the parameters of interest. This is certainly the case for which the theory is most complete and elegant. However, some elements of it can be used in developing designs for nonlinear models as well.

General Case

Using as much of our previously defined notation as possible, we continue to denote the response variable by y, the vector of r predictors by u, and the design space of permitted values of u by U. In our "general case," we specify the distribution of the response, y ∼ f(y; u, θ), where θ is the set of k unknown parameters of interest and the form of the distribution f is known. (f will also be used to denote the probability density function of y in these notes. Note that we have not introduced x yet, but something analogous to it will emerge.)

We continue to regard a discrete experimental design as a set of N values of the vector of predictors, U = {u_1, u_2, ..., u_N}, and now use the N-vector y to represent the responses associated with the conditions specified in the design.

Our previous definition of M depends specifically on model linearity and must be replaced in this context. We rely now on likelihood theory (assuming that standard regularity conditions are satisfied) and let θ̂ be the maximum likelihood estimate of θ. An asymptotically valid expression for the variance matrix of θ̂ is the inverse of the Fisher information matrix:

    I(θ) = E[ −∂²/∂θ∂θ' log L(θ; y) ],

where the likelihood function L(θ; y) ≡ f(y; U, θ), the joint pdf of all the y's. Although an even more "general" version of "the general case" could be developed, here we focus on experiments in which the N responses are statistically independent. In this case

    L(θ; y) = ∏_{i=1}^N f(y_i; u_i, θ),    log L(θ; y) = Σ_{i=1}^N log f(y_i; u_i, θ),

so

    I(θ) = Σ_{i=1}^N −E[ ∂²/∂θ∂θ' log f(y_i; u_i, θ) ].

Hence the information matrix is the negative "expected curvature" of the log-likelihood. In the spirit of previous notation, we can define

    M(U, θ) = (1/N) I(θ)

for a discrete experimental design U. The serious difficulty here is that M is generally a function of θ as well as of the design. But if we (unrealistically) claim to know θ, the optimality theory we've developed for linear models can be applied.

Let µ be a probability measure (or design measure) over U, and let the set of such measures be H(U). Then define the moment matrix for this design measure as

    M(µ, θ) = E_µ[ −E_{y|u,θ} ∂²/∂θ∂θ' log f(y; u, θ) ].

For fixed θ, let H(U) correspond to M_θ; that is, µ ∈ H(U) if and only if M(µ, θ) ∈ M_θ. (A distinction between this set-up and what we used with linear models is that η was defined over X; here we don't have an analogue of "x" – at least, not yet.) With this, we define the φ_θ-optimal design measure to be

    argmax_{µ ∈ H(U)} φ(M(µ, θ))

and designate any such measure by µ*. Given φ, we define the Frechet derivative as before:

    F_φ(M_1, M_2) = lim_{ε→0+} (1/ε) [ φ{(1 − ε)M_1 + εM_2} − φ{M_1} ],

where M_1 = M(µ_1, θ), M_2 = M(µ_2, θ), and µ_1, µ_2 ∈ H(U). With this, following the arguments for linear models:

Theorem 3.6A (6.1.1 in Silvey) For fixed θ, and φ concave on M_θ, µ_0 is φ_θ-optimal iff F_φ[M(µ_0, θ), M(µ, θ)] ≤ 0 for all µ ∈ H(U).

Let J(u, θ) = M(µ, θ) for the design measure µ that assigns probability 1 to the single vector u. Then:

Theorem 3.7A (6.1.2 in Silvey) For fixed θ, and φ concave on M_θ and differentiable at M(µ_0, θ), µ_0 is φ_θ-optimal iff F_φ[M(µ_0, θ), J(u, θ)] ≤ 0 for all u ∈ U.

So, in summary, if we say we know θ, the fundamental arguments of equivalence theory we developed for linear models hold for general nonlinear models as well, with the accommodation that x and X, as we used them before, do not have exact counterparts in this case.
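To make the equivalence-theorem check concrete, here is a minimal numerical sketch for an assumed illustrative model (a Poisson count response with log-linear mean, y ∼ Poisson(exp(θ1 + θ2 u)); this model is not used elsewhere in these notes). For the D-criterion φ(M) = log|M|, the Frechet derivative defined above works out to tr(M⁻¹J(u, θ)) − k, so checking Theorem 3.7A for a candidate design measure amounts to verifying that this quantity is nonpositive over a grid of u values.

```python
import numpy as np

# Assumed illustrative model: y ~ Poisson(lambda(u)), lambda(u) = exp(theta1 + theta2*u).
# Per-observation information: J(u, theta) = lambda(u) * z z', with z = (1, u)'.
def J(u, theta):
    z = np.array([1.0, u])
    lam = np.exp(theta[0] + theta[1] * u)
    return lam * np.outer(z, z)

def moment_matrix(points, weights, theta):
    """M(mu, theta) for a discrete design measure mu."""
    return sum(w * J(u, theta) for u, w in zip(points, weights))

theta_guess = np.array([0.0, -1.0])               # "known" (assumed) parameter value
design_pts, design_wts = [0.0, 2.0], [0.5, 0.5]   # candidate design measure on U = [0, 2]

M = moment_matrix(design_pts, design_wts, theta_guess)
k = len(theta_guess)

# Frechet derivative of log|M| toward the one-point design at u, over a grid:
grid = np.linspace(0.0, 2.0, 201)
F = [np.trace(np.linalg.solve(M, J(u, theta_guess))) - k for u in grid]

print("max_u F_phi =", max(F))   # <= 0 (to numerical tolerance) iff the candidate is D-optimal
```

If the reported maximum is positive, the candidate measure is not φ_θ-optimal, and the maximizing u indicates where mass should be added; this is the idea behind the Wynn-type algorithm used later in these notes.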
An Often-Used Trick

For most distributions/models, I(θ), and therefore also M(µ, θ), can be written as a function of first (rather than second) derivatives. This begins by noting that ∫ f(y; u, θ) dy = 1. (We drop the arguments of f for a bit, with the understanding that it depends on both u and θ.) Because this quantity is a constant, ∂/∂θ ∫ f dy = 0. If the order of integration and differentiation can be exchanged, this gives us

    ∫ ∂f/∂θ dy = 0.

The next step in the argument is to multiply and divide the integrand by f and write an equivalent quantity:

    ∫ (1/f)(∂f/∂θ) f dy = ∫ (∂/∂θ log f) f dy = E(∂/∂θ log f) = 0.

(Note that this is beginning to look a bit like I(θ), but only involves first derivatives.) The same strategy is used a second time; because the above quantity is a constant, differentiation with respect to θ' yields a k × k matrix of zeros, and if the order of integration and differentiation can again be exchanged:

    ∫ ∂/∂θ' [ ((1/f)(∂f/∂θ)) f ] dy = 0.

Applying the product rule to the integrand and re-expressing each piece in terms of log f,

    ∫ (∂/∂θ log f)(∂/∂θ' log f) f dy + ∫ (∂²/∂θ∂θ' log f) f dy = 0.

But note now that the second term is the negative of the information matrix, and so rearrangement gives

    −E(∂²/∂θ∂θ' log f) = E[ (∂/∂θ log f)(∂/∂θ log f)' ]

for the information associated with a single observation at the selected u, or, for a design U of N points,

    I(θ) = Σ_{i=1}^N E[ (∂/∂θ log f(y_i; u_i, θ)) (∂/∂θ log f(y_i; u_i, θ))' ],

and based on this, M(U, θ) = (1/N) I(θ) as before.

In summary, the expectations of the squares and products of first derivatives of the log-likelihood can be used in place of the negatives of the expectations of their corresponding second derivatives, if the required conditions hold. Note that this does not quite give us a form equivalent to xx', since the expectation with respect to y applies to the product of the two terms, rather than to each of them individually. We will see shortly that for one common class of problems, a further simplification can be made that leads to a separable product, and an analogue of x.
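As a quick sanity check of this identity, the following sketch uses simulation for an assumed illustrative model (an exponentially distributed response with log-linear rate, y ∼ Exponential with rate exp(θ1 + θ2 u); again, this model is only for illustration). The average outer product of the score and the average negative Hessian should approach the same 2 × 2 matrix.

```python
import numpy as np

rng = np.random.default_rng(1)

theta = np.array([0.5, -0.3])    # assumed "known" parameter value
u = 1.7                          # a single design point
z = np.array([1.0, u])
lam = np.exp(theta @ z)          # rate of the exponential response

# log f(y) = log(lam) - lam*y, so
#   score   = (1 - lam*y) * z
#   Hessian = -lam*y * z z'
n = 200_000
y = rng.exponential(scale=1.0 / lam, size=n)

scores = (1.0 - lam * y)[:, None] * z[None, :]          # n x 2 matrix of score vectors
outer_avg = scores.T @ scores / n                        # Monte Carlo E[score score']
neg_hess_avg = (lam * y).mean() * np.outer(z, z)         # Monte Carlo E[-Hessian]

print(outer_avg)
print(neg_hess_avg)   # both should be close to np.outer(z, z) for this model
```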
A Simple Example using an Exponential Growth Model

For a single controlled variable u ∈ U = (0, ∞), suppose that the response variable y has a Bernoulli distribution:

    y = 0 with probability e^{−θu}
    y = 1 with probability 1 − e^{−θu}

for some unknown θ > 0, and that the responses taken on different experimental trials are independent. Applications in which this kind of set-up might arise include reliability experiments in which u represents "time on test," θ is a rate of physical degradation under a constant stress, and y is the result of a pass-fail performance test of a unit at time u, where the probability of failure (y = 1) increases with time.

For a single fixed u,

    L = (e^{−θu})^{1−y} (1 − e^{−θu})^{y}
    log L = (1 − y)(−θu) + y log(1 − e^{−θu})
    ∂/∂θ log L = −(1 − y)u + y u e^{−θu}/(1 − e^{−θu}).

Because y(1 − y) = 0, the cross-product term is not present in the square of the above derivative, and since y = y²,

    (∂/∂θ log L)² = (1 − y)u² + y [ u e^{−θu}/(1 − e^{−θu}) ]².

Taking the expectation requires only substitution of probabilities for y and 1 − y, and so for any single value of u,

    I(θ) = (1/θ²) (θu)² e^{−θu}/(1 − e^{−θu}).

Regardless of the value of θ, the information is maximized when v = θu maximizes v² e^{−v}/(1 − e^{−v}), i.e. at the solution to

    ∂/∂v [ v² e^{−v}/(1 − e^{−v}) ] = 0,

which is approximately v = 1.6. Since the information matrix is a scalar in this case, essentially all reasonable variance-based optimality criteria boil down to the same thing, i.e. φ = I, and since this function is maximized at only one value of u (as a multiple of θ), any optimal design will include only replicates of this value of u.

Here is some intuition for why this form of solution makes sense. Suppose a much smaller value of u is used; then the responses will each, with very large probability, be 0. This would be solid information that e^{−θu} is large, or that θu is small, but since u is small, this leaves a great deal of uncertainty about θ. Likewise, if a much larger value of u is used, then the responses will each, with very large probability, be 1, and there will be little uncertainty that e^{−θu} is small or that θu is large, but this leaves open the possibility that θ could be very, very large. Hence, only values of u that lead to an "intermediate" binomial probability can (statistically) eliminate the possibility of both large and small values of θ. For u = 1.6/θ, the probability of y = 1 is 1 − e^{−1.6} ≈ 0.80. A Bernoulli probability of 0.20, obtained at u ≈ 0.22/θ (since e^{−0.22} ≈ 0.80), would be "equivalent" in at least one sense, because the uncertainty (variance) associated with the probability of failure would be the same. Why would u = 0.22/θ be a less desirable design point?
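The maximizing value of v can be confirmed numerically. A minimal sketch, using scipy's bounded scalar minimizer on the negative of the information (the constant factor 1/θ² is dropped since it does not affect the location of the maximum):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Information at v = theta*u, up to the constant factor 1/theta^2.
def neg_info(v):
    return -(v**2) * np.exp(-v) / (1.0 - np.exp(-v))

res = minimize_scalar(neg_info, bounds=(1e-6, 20.0), method="bounded")
v_star = res.x
print(v_star)                     # approximately 1.59
print(1.0 - np.exp(-v_star))      # failure probability at the optimum, about 0.80
```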
Additive Error Models and Exponential Families

Now consider a more restrictive, but still widely applicable, setting in which the random component of each observation is additive:

    U = {u_1, u_2, ..., u_N},    y_i = m(u_i, θ) + ε_i = m_i + ε_i.

Here we will say that ε_i has density p_i(·) with mean zero (which does not limit generality, given the form of the model), with independent responses over the N experimental trials. We subscript p with i to indicate that this distribution may depend on the value of u. Because of the model form, the distribution of y_i is f(y_i) = p_i(y_i − m_i), i.e. a simple shift transformation. Since f(y_i) = p_i(ε_i), log f(y_i) = log p_i(ε_i), so

    ∂/∂θ log f(y_i) = (1/f(y_i)) ∂f(y_i)/∂θ = [ (1/p_i(ε_i)) p_i'(ε_i) ] [ −∂m_i/∂θ ].

But in this expression, only the scalar (1/p_i(ε_i)) p_i'(ε_i) is a function of the random variable, and so

    E[ (∂/∂θ log f(y_i)) (∂/∂θ' log f(y_i)) ] = ( ∫ p_i'(ε)²/p_i(ε) dε ) (∂m_i/∂θ)(∂m_i/∂θ)'.

If p_i is a distribution in the exponential family, the first factor is 1/σ_i², and so the information matrix for an N-point design is

    I(θ) = Σ_{i=1}^N σ_i^{−2} (∂m_i/∂θ)(∂m_i/∂θ)'.

This form has the structure of M = Σ w xx' from the linear case, but still requires that we know the parameters to fully define

    x_i = σ_i^{−1} ∂m_i/∂θ,    so that    M(U, θ) = (1/N) Σ_{i=1}^N x_i x_i'.

If the σ_i's are all equal, x_i can be simplified to just the vector of derivatives for purposes of experimental design.

Example: Michaelis-Menten Model

The Michaelis-Menten model is a standard nonlinear regression form used in enzyme kinetics and other applications. The response y is the rate of a reaction, the single controlled variable u is the concentration of the substrate under study, u ∈ U = (0, u*), where u* is the operational upper bound on concentration level, and the parameters of interest are θ1, the maximum possible (asymptotic in u) reaction rate (in units of y), and θ2, the value of u at which the reaction rate is half of θ1; both θ1 and θ2 must be positive. With additive random error ε (say, for measurement effects), the form of the model is

    y = θ1 u/(θ2 + u) + ε.

If the distribution of ε is from the exponential family and the variance is constant across experimental runs, we can define

    x = ∂m/∂θ = ( u/(θ2 + u), −θ1 u/(θ2 + u)² )' = (x1, x2)'.

The induced design space is a parametric curve in two dimensions (x1 and x2), traced out by values of the single variable u (again remembering that θ1 and θ2 are, for our purposes, fixed).

[Figure: left panel, "M-M Function with theta = (1,1)," E(y) versus u; right panel, "Induced Dsn. Space for theta = (1,1)," x2 versus x1; both for θ1 = θ2 = 1 and u* = 5.]

What design points would lead to a "diagonally dominant" information matrix (generally good)?

• u near u* → x1 as near as possible to 1 → large x1² → large I_{1,1}
• u near θ2 → x2 near −θ1/(4θ2), its minimum → large x2² → large I_{2,2}
• u near 0 or u* → x1 or x2 small → small x1 x2 → small I_{1,2}

This exercise helps establish some intuition for what to expect, but doesn't lead to a complete answer since, for example, x1 and x2 both small isn't good. In fact, the D-optimal design places probability 1/2 on each of two support points:

    µ*: u = θ2 u*/(u* + 2θ2) with probability 1/2, u = u* with probability 1/2, all other u with probability 0.

As u* grows, the lower support point approaches θ2, so for large u* the D-optimal design is (essentially) equal weight on u = θ2 and u = u*. You could, for example, find this design numerically with a point-exchange algorithm (Frechet derivatives, et cetera) with candidate points on the parabola traced out by x(u). We will do this with a more interesting model shortly.
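A quick numerical check of this claim uses the equivalence theorem (Theorem 3.7A with the D-criterion): the sketch below builds M for the equal-weight design on {θ2u*/(u* + 2θ2), u*} with θ1 = θ2 = 1 and u* = 5, and confirms that x(u)'M⁻¹x(u) ≤ 2 over a fine grid of u (equivalently, the Frechet derivative is nonpositive), with the maximum of 2 attained at the support points.

```python
import numpy as np

theta1, theta2, u_star = 1.0, 1.0, 5.0

def x_vec(u):
    """Induced regressor x(u) = dm/dtheta for the Michaelis-Menten mean."""
    return np.array([u / (theta2 + u), -theta1 * u / (theta2 + u) ** 2])

# Candidate D-optimal design: equal weight on the two support points.
u_low = theta2 * u_star / (u_star + 2.0 * theta2)   # = 5/7 here; approaches theta2 as u* grows
pts, wts = [u_low, u_star], [0.5, 0.5]
M = sum(w * np.outer(x_vec(u), x_vec(u)) for u, w in zip(pts, wts))

grid = np.linspace(1e-3, u_star, 2000)
d = np.array([x_vec(u) @ np.linalg.solve(M, x_vec(u)) for u in grid])

print(round(float(d.max()), 4))   # about 2.0 (= k), so the design passes the D-optimality check
print(grid[d.argmax()])           # location of the maximum: one of the two support points
```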
Efficiency

The optimal design for the Michaelis-Menten model is a function of θ2, but not of θ1. The performance of the design, however, is a function of both parameters. To see this, consider the design measure

    µ: u = u1 with probability 1/2, u = ∞ with probability 1/2, all other u with probability 0,

which is D-optimal if u* = ∞ and u1 is set to the value of θ2. For this design measure,

    |I(θ)| = (1/4) θ1² u1²/(θ2 + u1)^4,

and for the optimal u1 = θ2, this is (1/64) θ1²/θ2². But what if you are incorrect about your assumed value of θ2? Suppose you assume that θ1 = θ2 = 1 and u* = ∞, and you place equal weight on (the assumed-optimal) u1 = 1 and u2 = ∞. As a function of the actual values of the parameters,

    |I(θ)| = (1/4) θ1²/(θ2 + 1)^4.

The figure displays the values of this criterion function over (θ1, θ2) ∈ [0.5, 1.5]².

[Figure: contour plot of |I| as a function of (θ1, θ2).]

Note that the criterion function of information is greatest at the lower right corner (big θ1, small θ2), even though this isn't the set of parameter values for which the design is optimal. However, that shouldn't really be surprising. The logic is that if θ1 = θ2 = 1, then the design we've picked is the best (with respect to our criterion). But this doesn't say that the criterion value for this design might not be even greater at different parameter values (even though at those other parameter values, there will be another design that is better still than the one we've selected).

For this reason, it is sometimes preferable to consider an adjusted measure of optimality, efficiency, which compares, as a function of the "true" parameter values, the design you've chosen to the best design you could have chosen for those parameters. For the Michaelis-Menten model, the D-optimal design measure places half its weight on each of u = θ2 and u = u* (which we're taking to be ∞ here to simplify things). That design has a criterion value which we'll call

    |I(θ)|_best = (1/64) θ1²/θ2².

We can then define a measure of efficiency for the design we've selected (based on the assumption that θ2 = 1) as

    eff(θ1, θ2) = |I(θ)| / |I(θ)|_best = 16 θ2²/(θ2 + 1)^4.

This function is plotted below for θ2 ∈ [0.5, 1.5]:

[Figure: eff as a function of θ2.]

This shows that even though the information (as measured by our criterion) associated with our design is greater for θ2 = 0.5 than for θ2 = 1, efficiency, which evaluates a design relative to the best that could have been used, is less than 1 at θ2 = 0.5. Unfortunately, there is usually no unique way to define efficiency; it can, for example, also be defined as a power of this ratio, or as |I(θ)| − |I(θ)|_best.

Example: Speed of R → P1 + P2

Here is a somewhat more extensive nonlinear regression problem based on another function from chemical kinetics. The model form is

    y = θ1 θ3 u1/(1 + θ1 u1 + θ2 u2) + ε,

with errors that are assumed to have a distribution in the exponential family with constant variance. Suppose the design region for the two controlled variables is U = [0, 3]², and the guessed/assumed value of θ is (2.9, 12.2, 0.69)'. A Wynn algorithm, essentially like the one described in the notes on algorithms for linear models, was written to generate a D-optimal design using a grid of u values with spacing 0.03 in each direction (for a total of 101² = 10,201 grid points), with corresponding points in the (three-dimensional) X computed by evaluating the derivatives of the model form with respect to each of the three model parameters at the assumed value of θ. The design was initialized with a randomly chosen 10-point design, and the algorithm ran for 261 iterations before stopping, based on a stopping rule of δ = 0.1. The graphs show all points included in the initial design and added through the iterations, and an image plot of accumulated probability mass at each point.

[Figure: left panel, "W-algorithm, all points," plotted over jittered (u1, u2); right panel, "W-algorithm, mass," an image plot of accumulated mass over (u1, u2).]

The three locations at which most of the mass accumulates are (u1, u2) = (0.27, 0.00) with mass 0.3180; (3.00, 0.78) and (3.00, 0.81) with mass 0.3065; and (3.00, 0.00) with mass 0.3218. (The two neighboring grid points with substantial mass likely indicate a single mass point with u2 between 0.78 and 0.81.) It seems likely that, for practical purposes, a design that places equal numbers of replicates at each of the three indicated points would be near optimal – if the assumed value of θ is correct.
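A minimal sketch of a Wynn-type (vertex-direction) algorithm for this problem is given below. It is not the exact implementation used to produce the results above (the initialization, step-size rule, and stopping tolerance here are simplified assumptions), but it illustrates the computation: at each iteration, the candidate point with the largest directional derivative of log|M| receives a small amount of additional design mass.

```python
import numpy as np

theta = np.array([2.9, 12.2, 0.69])      # assumed ("guessed") parameter value

def x_vec(u1, u2, th):
    """Gradient of the mean theta1*theta3*u1/(1 + theta1*u1 + theta2*u2) w.r.t. theta."""
    t1, t2, t3 = th
    D = 1.0 + t1 * u1 + t2 * u2
    return np.array([t3 * u1 * (1.0 + t2 * u2) / D**2,
                     -t1 * t3 * u1 * u2 / D**2,
                     t1 * u1 / D])

# Candidate grid on U = [0, 3]^2 with 0.03 spacing, as in the example above.
g = np.arange(0.0, 3.0 + 1e-9, 0.03)
U1, U2 = np.meshgrid(g, g)
X = np.array([x_vec(a, b, theta) for a, b in zip(U1.ravel(), U2.ravel())])

k = 3
w = np.full(len(X), 1.0 / len(X))        # simplified start: uniform mass over the grid
for it in range(5000):
    M = X.T @ (w[:, None] * X)
    d = np.einsum("ij,jk,ik->i", X, np.linalg.inv(M), X)   # x' M^{-1} x for every candidate
    j = int(d.argmax())
    if d[j] - k < 0.1:                    # stopping rule on the equivalence-theorem gap (delta = 0.1)
        break
    alpha = 1.0 / (it + k + 1)            # Wynn-type step size
    w = (1.0 - alpha) * w
    w[j] += alpha

# Report the grid points carrying the most design mass.
for j in np.argsort(w)[-6:][::-1]:
    print(U1.ravel()[j], U2.ravel()[j], round(w[j], 3))
```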
Nonlinear Functions of Linear Model Parameters

One application of the ideas discussed here is to the design of experiments for linear models in which the quantities of interest are nonlinear functions of the linear model parameters. A simple example is that of a quadratic regression,

    y = θ0 + θ1 x + θ2 x² + ε,

where interest centers on (1) the location of the peak or dip, (2) the expected response value at the peak/dip, and (3) the second derivative at the peak/dip (or, given the form of the model, the second derivative anywhere). Simple calculus leads to:

1. g1 = −θ1/(2θ2)
2. g2 = θ0 − θ1²/(4θ2)
3. g3 = 2θ2

as the three quantities of interest. More generally, a linear model y = x'θ + ε is appropriate for modeling, but interest rests not in the elements of θ as such, but in k* functions of them. Let g be a vector of these functions of θ, and define a k × k* matrix G with (i, j) element ∂g_j/∂θ_i. For standard, well-behaved likelihood asymptotics,

    Var(ĝ) = G' Var(θ̂) G ∝ G' M(µ, θ)^{-1} G.

If G is square and of full rank,

    Var(ĝ)^{-1} ∝ G^{-1} M(µ, θ) G'^{-1} = ∫ [G^{-1} x][x' G'^{-1}] µ(dx).

If some or all of the functions g are nonlinear in θ, then G depends on these parameter values. We can indicate this by writing G_θ, and for design purposes say

    I(θ) = ∫ [G_θ^{-1} x][G_θ^{-1} x]' µ(dx).

So, call G_θ^{-1} x the augmented design space vector ("x") instead; this fits the nonlinear set-up we've been discussing. For k* = k, G_θ of full rank, and the collection of functions g invertible, an alternative way to see this is that the model could be rewritten as a nonlinear model in the quantities of interest. For example, in the quadratic regression problem mentioned above, inverting gives:

1. θ0 = g2 + g1² g3/2
2. θ1 = −g1 g3
3. θ2 = g3/2

Substituting these expressions into the linear model (for θ) yields a nonlinear model (for g), and the approach we've discussed above leads to the same result. Complications arise, as they do with Ds-optimality, when k* < k, since an optimal design may not require that all elements of θ be individually estimable.
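Here is a small sketch of this transformation for the quadratic example, under an assumed parameter guess (θ = (1, 2, −1), chosen here only for illustration): it forms G_θ from the derivatives of (g1, g2, g3) and then maps the usual quadratic regressor (1, x, x²)' into the "augmented" vector G_θ⁻¹x that plays the role of x in the nonlinear design machinery.

```python
import numpy as np

theta = np.array([1.0, 2.0, -1.0])       # assumed guess (theta0, theta1, theta2), theta2 != 0

def G_of(th):
    t0, t1, t2 = th
    # (i, j) element is d g_j / d theta_i, for
    # g1 = -t1/(2 t2), g2 = t0 - t1^2/(4 t2), g3 = 2 t2.
    return np.array([
        [0.0,                  1.0,                    0.0],
        [-1.0 / (2.0 * t2),    -t1 / (2.0 * t2),       0.0],
        [t1 / (2.0 * t2**2),   t1**2 / (4.0 * t2**2),  2.0],
    ])

Ginv = np.linalg.inv(G_of(theta))

def z(x):
    """Augmented design-space vector G_theta^{-1} (1, x, x^2)'."""
    return Ginv @ np.array([1.0, x, x * x])

for x in (-1.0, 0.0, 1.0, 2.0):
    print(x, z(x))
```

A design criterion evaluated on these transformed vectors (rather than on (1, x, x²)') targets precision for the quantities g rather than for the raw coefficients θ.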
Various "Robust" Approaches

This section is really a placeholder for at least one body of work that is called "robust" nonlinear design. In short, the story we've told is predicated on knowing, or being willing to assume, a value for the parameter vector of interest, θ. This allows the development of theory that parallels that for linear models, but obviously does not (as presented so far) lead to designs for practical situations. One attempt to overcome this practical problem is to broaden the process we've talked about, generally by minimax arguments, to a broader collection of possible θ vectors. Suppose that, rather than a single value, we are willing to assume that the parameter vector of interest lies in some specified region, θ ∈ Θ. Then it can make sense to construct designs as solutions to:

• argmax_µ min_{θ∈Θ} φ(µ, θ), or
• argmax_µ ∫_{θ∈Θ} φ(µ, θ) w(θ) dθ, for some suitable weight function w.

Alternative approaches can be constructed by substituting a measure of efficiency for the optimality criterion in either of these. We shall not explore these possibilities further here, but do note that, despite the computational intensity that generally accompanies minimax problems, a number of researchers have spent considerable effort in directions such as these. (A related approach, which we will consider in a bit more detail in the next unit, is "Bayesian design.")

Another Approach: Sequential Estimation/Design

Another approach to making optimal design theory more applicable to practical situations is to adopt a sequential approach in which data, as they are gathered, are used to improve the assumed (or "guessed") value of θ. A general outline of such an approach might be written as follows (a code sketch of one version appears after the discussion below):

1. Begin with a best guess, θ_g.
2. Construct an optimal design as if θ = θ_g.
3. Execute the experiment, collect the data, and compute θ̂.
4. Redefine θ_g ← θ̂, and return to step 2.

This algorithm is not complete. For example, in step 2, does "construct an optimal design" mean a design that is optimal in its own right, or optimal under the constraint that it includes the experimental runs from previous stages? Similarly, in step 3, does "compute θ̂" mean an estimate based only on the most recent data, or does the estimation also include data from previous stages? Pros and cons of the various choices are related to those we briefly discussed in our treatment of group screening designs. Where sequential experimentation is practical, this is often a very effective and practical approach to experimental design.

A general difficulty with the sequential approach is that there is, in many cases, no guarantee that Var(θ̂)^{-1} ≈ I(θ) as derived under the standard (and relatively simple) formulation of I from the likelihood function for independent responses. Instead, sequential experiments are explicitly constructed so that u_2 depends on y_1, u_3 depends on y_2 and perhaps y_1, et cetera. The statistical nature of this dependence is a function of f and of the details of the sequential rules. Where asymptotic arguments are developed, they are often based on showing that, with probability approaching 1, design points "pile up" on the (correct) locally optimal design, and that the sequence of θ̂_i's converges to θ "quickly" in some sense, relative to the number of experimental trials that can be attempted.
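Here is a minimal sketch of one version of this loop for the R → P1 + P2 model of the earlier example. Each stage augments the accumulated design with the single candidate point that is best (in the D sense) under the current estimate, responses are simulated from an assumed "true" θ as a stand-in for running a real trial, and the estimate is refit to all accumulated data. The noise level, initial guess, and one-point-at-a-time augmentation rule are assumptions made only for this illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
theta_true = np.array([2.9, 12.2, 0.69])    # used only to simulate responses
sigma = 0.02                                 # assumed error standard deviation

def mean(U, t1, t2, t3):
    u1, u2 = U
    return t1 * t3 * u1 / (1.0 + t1 * u1 + t2 * u2)

def x_vec(u, th):
    t1, t2, t3 = th
    u1, u2 = u
    D = 1.0 + t1 * u1 + t2 * u2
    return np.array([t3 * u1 * (1 + t2 * u2) / D**2, -t1 * t3 * u1 * u2 / D**2, t1 * u1 / D])

# Candidate grid over U = [0, 3]^2 and a small first-stage design (a 2^2 factorial, as in Fedorov's demo).
grid = [(a, b) for a in np.arange(0.0, 3.01, 0.03) for b in np.arange(0.0, 3.01, 0.03)]
runs = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
y = [mean(np.array(u), *theta_true) + sigma * rng.standard_normal() for u in runs]
theta_g = np.array([1.0, 5.0, 1.0])          # initial guess

for stage in range(9):                       # nine sequential additions
    U = np.array(runs).T
    theta_g, _ = curve_fit(mean, U, np.array(y), p0=theta_g, maxfev=20000)   # refit to all data
    M = sum(np.outer(x_vec(u, theta_g), x_vec(u, theta_g)) for u in runs)
    # D-optimal one-point augmentation: add the candidate maximizing x' M^{-1} x.
    scores = [x @ np.linalg.solve(M, x) for x in (x_vec(u, theta_g) for u in grid)]
    u_next = grid[int(np.argmax(scores))]
    runs.append(u_next)
    y.append(mean(np.array(u_next), *theta_true) + sigma * rng.standard_normal())
    print(stage + 1, u_next, np.round(theta_g, 2))
```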
Example Continued: Speed of R → P1 + P2

Fedorov (1972) presents a demonstration of sequential experimentation using the model of the previous example:

    E(y) = θ1 θ3 u1/(1 + θ1 u1 + θ2 u2).

His demonstration is apparently based on real (rather than simulated) data, with experiments on catalytically dehydrated n-hexyl alcohol performed at 555°F, for which the two products of reaction are olefin and water. As in the previous example, the design region for this experiment is U = [0, 3]². For this setting, the "known" value of the parameter vector θ is (2.9, 12.2, 0.69)', the set of values we used as "guesses" in the previous example. However, for this demonstration, this information was not used. Instead, a standard 2² factorial experiment, in all combinations of the values 1 and 2 for u1 and u2, was used as a first-stage design. Based on the data collected for these 4 experimental trials, an estimate of the parameter vector was computed:

    θ̂_4 = (10.39, 48.83, 0.74)'.

(Note that these estimates are not especially close to the true parameter values, but also that they are based on only 4 data points.) Using these estimates as if they were the actual parameter values, Fedorov then determined the optimal 5th design point to add, using the Frechet derivative associated with D-optimality; the rather extreme parameter estimates led to selection of (u1, u2) = (0.1, 0.0) as the next point. A new experimental run was conducted at this point, a new estimate of the parameter vector was computed based on the accumulated 5-run data set, and a 6th design point was selected. This process continued through a total of 13 experimental trials, with results tabulated below:

    trial    u1     u2     y        θ̂1      θ̂2      θ̂3
      1      1.0    1.0    0.126
      2      2.0    1.0    0.129
      3      1.0    2.0    0.076
      4      2.0    2.0    0.126    10.39    48.83    0.74
      5      0.1    0.0    0.186     3.11    15.19    0.79
      6      3.0    0.0    0.606     3.96    15.32    0.66
      7      0.2    0.0    0.268     3.61    14.00    0.66
      8      3.0    0.0    0.614     3.56    13.96    0.67
      9      0.3    0.0    0.318     3.32    13.04    0.67
     10      3.0    0.8    0.298     3.33    13.48    0.67
     11      3.0    0.0    0.509     3.74    13.71    0.63
     12      0.2    0.0    0.247     3.58    13.15    0.63
     13      3.0    0.8    0.319     3.57    12.77    0.63

Note that the estimate based on all 13 data values is considerably closer to the "known" parameter vector than θ̂_4. Fedorov does not discuss a formal stopping rule that led to the experiment ending after 13 trials, but does note that "dispersion of the parameter estimates became insignificant and ... close to [those of] the theoretical value(s)." Note also that, of the 9 sequential runs made (excluding the initial 4), 4 were close to (0.2, 0.0), 3 were at (3.0, 0.0), and 2 were at (3.0, 0.8) – very close to the 3 apparent mass points found in our earlier example that assumed the "true" parameter values.

References

Fedorov, V.V. (1972). Theory of Optimal Experiments, Academic Press, London. Originally published in Russian (1969) as Teoriya Optimal'nogo Eksperimenta.