Construction Algorithms

Reviewing where we've been, our focus is on experimental designs defined as discrete points in, or probability distributions over, an r-dimensional design space U or k-dimensional augmented design space X. Finding a φ-optimal exact N-point design directly is an (N × r)-dimensional optimization:

    U* = argmax_U φ(M(U)), where the maximum is taken over all N-point designs U in the design space

Without further development, this is a difficult optimization problem, a fact that led to the development of the theory we talked about in the "Optimality II" notes. The key change required by that theory is the replacement of the discrete design U with a continuous design measure η. On the face of it, this would actually seem to make things worse, because now we need to solve for an entire distribution:

    η* = argmax_{η ∈ H} φ(M(η))

If φ is concave on M, we can use Theorem 3.6 to pose this as an optimization based on the Frechet derivative instead:

    η* = the η such that F_φ(M(η), M(η′)) ≤ 0 for all η′ ∈ H

but this is potentially even worse; we must check each η (infinitely many of them if U is continuous) against all η′ (also infinitely many). Theorem 3.7, which says

    η* = the η such that F_φ(M(η), xx′) ≤ 0 for all x ∈ X

is potentially more helpful since x is of smaller dimension than η′, but it still presents a huge problem when used as the basis for solving the optimization problem. In fact, the strength of the theory we've presented is really for confirming the optimality of a design, rather than constructing it "from scratch." In the construction problems so far, we've avoided the optimization by making an intelligent guess and proving it to be optimal. But in many cases the problem is more complicated, and intelligent guesses are difficult to pose.

Still, the theory we've seen suggests an approach to generating designs even when a good guess isn't available. Since F_φ is a derivative of φ with respect to η, a large value of F_φ(M(η), xx′) suggests that η may be improved by increasing the probability assigned to x in the design measure. This suggests a general algorithm for iteratively improving a "seed" design measure.

A general iterative algorithm for design measures:

1. begin with an arbitrary η_0 → η_current
2. find x_add = argmax_{x ∈ X} F_φ(M(η_current), xx′)
3. if F_φ(M(η_current), x_add x_add′) ≤ 0, STOP ... η_current is φ-optimal
4. replace η_current by η_next = (1 − α) η_current + α η_add (where η_add puts all mass on x_add), for some α ∈ (0, 1)
5. return to step 2

For this algorithm, each iteration requires a search over x ∈ X in step 2, an r-dimensional optimization. In applications where X is infinite, the search is often modified by substituting a large, finite subset of X, such as a grid.

The rationale for the algorithm depends on the requirements of Theorem 3.7, that is, that (1) φ be concave on M, and (2) φ be differentiable at each η_current. The second of these can be guaranteed if η_0 is chosen so that M(η_0) is nonsingular; in this case, each subsequent M(η_current) will also be nonsingular (since the weight of η_0 is never reduced to zero), so that differentiability is assured for φ at each η_current.

In practice η_0 is often not really chosen "arbitrarily", but is constructed with mass at points that are judged likely to be valuable. In reality, the performance of algorithms of this type, as with most iterative optimization solvers, is heavily dependent on this choice, and it is usually prudent to repeat the calculation a number of times using different η_0.
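To make the steps above concrete, here is a minimal sketch of the general algorithm in R, assuming D-optimality (for which F_φ(M(η), xx′) = x′M(η)⁻¹x − k, as used in the next section) and a finite candidate set in place of a continuous X. The names (vertex_direction, f_model, alpha_rule, cand, w0) are illustrative and not from any particular package; the step-size rule is left as a user-supplied function, and the two sections that follow describe specific choices for it.

```r
# Minimal sketch of the general iterative algorithm for D-optimality over a
# finite candidate set.  Illustrative assumptions: f_model(u) returns the
# k-vector of model terms x = f(u); cand is a matrix with one candidate point
# per row; w0 is a starting weight vector (the measure eta_0) on those rows.
vertex_direction <- function(cand, f_model, alpha_rule, w0,
                             tol = 1e-4, max_iter = 5000) {
  X <- t(apply(cand, 1, f_model))   # one row of model terms per candidate point
  k <- ncol(X)
  w <- w0                           # current design measure eta_current
  for (iter in 1:max_iter) {
    M     <- crossprod(sqrt(w) * X)           # M(eta) = sum_i w_i x_i x_i'
    M_inv <- solve(M)
    d     <- rowSums((X %*% M_inv) * X) - k   # F_phi(M(eta), x x') at each candidate
    i_add <- which.max(d)                     # step 2: most "helpful" point
    if (d[i_add] <= tol) break                # step 3, with a small positive
                                              #   tolerance in place of "<= 0"
    alpha    <- alpha_rule(iter, d[i_add], k) # step size for this iteration
    w        <- (1 - alpha) * w               # step 4: shrink eta_current ...
    w[i_add] <- w[i_add] + alpha              # ... and move mass alpha to x_add
  }
  list(weights = w, max_deriv = d[i_add], iterations = iter)
}
```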
Finally, note that the definition of the algorithm isn't complete until a value of α is specified. It is clear that relatively larger values produce larger changes in the design at each iteration. But as with any numerical optimization procedure based on derivatives, taking "steps" that are too large at each iteration can cause the search to fail. The following two sections describe particular proposals, each of which has some theoretical justification, for selecting the value of α.

"V Algorithm" (Silvey, for V.V. Fedorov (1972))

At each iteration, select the value of α that results in the greatest improvement, i.e. that maximizes φ(M(η_next)) − φ(M(η_current)). This could be done numerically using a one-dimensional "line search" for the best α ∈ (0, 1). Alternatively, the same thing can be achieved by looking at the Frechet derivative:

[Figure: φ((1 − α)M(η_current) + α x_add x_add′) plotted as a function of α between M(η_current) (α = 0) and x_add x_add′ (α = 1); the curve usually falls to −∞ as α → 1.]

That is, given η_current and x_add, find α such that

    F_φ((1 − α)M(η_current) + α x_add x_add′, x_add x_add′) = 0

V Algorithm for D-optimality

In this case, we want to solve for α:

    x_add′ [(1 − α)M(η_current) + α x_add x_add′]⁻¹ x_add = k

To do this efficiently, we need to be able to perform the indicated matrix inversion easily/quickly – or better yet, analytically – since it will change with different values of α. If we already know M(η_current)⁻¹ (which doesn't change with α), the needed inversion can be computed efficiently via "rank-one update formulae". In short, if A is a square symmetric matrix for which A⁻¹ exists, and b is a column vector of the same order for which [A − bb′]⁻¹ exists,

1. [A + bb′]⁻¹ = A⁻¹ − A⁻¹bb′A⁻¹ / (1 + b′A⁻¹b)
2. [A − bb′]⁻¹ = A⁻¹ + A⁻¹bb′A⁻¹ / (1 − b′A⁻¹b)

Using "update 1", let A = (1 − α)M, b = √α x_add, and Q = x_add′ M⁻¹ x_add. Then

    F_φ((1 − α)M + α x_add x_add′, x_add x_add′) = Q / (1 + α(Q − 1)) − k

and setting this to zero leads to:

    α = (Q − k) / (k(Q − 1))

This algorithm (V, for D-optimality) can be shown to converge to η* under the conditions of Theorem 3.7.

"W Algorithm" (Silvey, for H.P. Wynn (1970))

Rather than computing an "optimal" α at each iteration, the Wynn version of the algorithm is based on selecting a fixed sequence of α's in advance. This simplifies the procedure somewhat, although it may slow the rate of convergence in some cases. The key aspect of doing things this way is selection of a sequence with the correct properties. In particular:

• A sequence for which lim_{i→∞} α_i ≠ 0 cannot guarantee convergence to an optimal design, since "adjustments" smaller than lim_{i→∞} α_i cannot ever be made.
• A sequence for which lim_{I→∞} Σ_{i=1}^{I} α_i < ∞ cannot guarantee convergence to an optimal design, since the algorithm can converge prematurely ("stutter to a stop on the side of a hill," Silvey).

Hence, what is needed is a sequence that decreases to zero but has non-convergent partial sums. The most commonly used sequence of this type is a partial harmonic sequence:

• Choose a nominal "N" for η_0,
• Let α_i = 1/(N + i), i = 1, 2, 3, ...

This corresponds to putting equal weight at each added point, e.g.:

• Start with 10 equally weighted points.
• In the first iteration, α = 1/11.
• The old points get a total new weight of 1 − α = 10/11, or 1/11 each.
• The new point gets weight 1/11.

Convergence is more difficult to show than for the V-algorithm, but can be proven in many specific cases.
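In code, the two step-size rules and the rank-one update might look as follows; this is a sketch under the same illustrative naming as the earlier general sketch, with Q = x_add′ M(η_current)⁻¹ x_add and N0 the nominal size of η_0. Since the maximized directional derivative returned there is d_max = Q − k, alpha_rule can be set to function(iter, d_max, k) alpha_V(d_max + k, k) for the V version, or to function(iter, d_max, k) alpha_W(iter, N0) for the W version.

```r
# V-algorithm step for D-optimality: the alpha derived above,
#   alpha = (Q - k) / (k (Q - 1)),  with  Q = x_add' M(eta_current)^{-1} x_add.
alpha_V <- function(Q, k) (Q - k) / (k * (Q - 1))

# W-algorithm step: partial harmonic sequence alpha_i = 1 / (N0 + i),
# where N0 is the nominal size of eta_0 and i is the iteration number.
alpha_W <- function(i, N0) 1 / (N0 + i)

# "Update 1" applied with A = (1 - alpha) M and b = sqrt(alpha) x, giving
# [(1 - alpha) M + alpha x x']^{-1} from M^{-1} without a fresh matrix inversion.
update_M_inv <- function(M_inv, x, alpha) {
  A_inv <- M_inv / (1 - alpha)                            # A^{-1}
  Ax    <- A_inv %*% x                                    # A^{-1} x
  A_inv - alpha * (Ax %*% t(Ax)) / drop(1 + alpha * t(x) %*% Ax)
}
```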
The W-algorithm based on the partial harmonic sequence does have one characteristic that some applied practitioners find appealing. The general algorithm, of which the V- and W-versions are both special cases, is really intended to converge to an optimal design measure – a probability distribution over X – as motivated by general equivalence theory. (The following comments address a practical consequence of the fact that, while the theory is motivated by convergence properties, algorithmic searches must generally be stopped before convergence is reached; we address the implications of early stopping in the next section.) Eventually, the solution to a practical design problem requires that η_final be replaced with a discrete design that assigns probability r_i/N to a finite number of points of support, where the r_i are positive integers and Σ_i r_i = N for a specified number of experimental runs N. While both the V- and W-versions of the algorithm can be shown (at least in some cases) to converge to an optimal design measure, for any finite number of iterations the design constructed using the W-algorithm does place weight r_i/N on each of a finite number of values of x, where N is the number of points of support in η_0 plus the number of completed iterations, and the r_i are positive integers with Σ_i r_i = N. Since the weights are computed as "optimal" real numbers in the V-algorithm, there is no such guarantee in that case. This suggests that design measures produced using the W-algorithm may be easier to "round" in some cases. However, this "advantage" is most often not realized, since the number of iterations required for reasonable near-convergence is usually much larger than an acceptable N for a discrete design in most applications.

Stopping Rule and Guarantee

Step 3 of the algorithm outlined above:

    if F_φ(M(η_current), xx′) ≤ 0, STOP ...

is the "stopping criterion" for the algorithm – the rule that defines when the iterations cease and the "current" design measure is reported as the "final" design measure, η_final. This criterion, however, is impractical in use, because it is never satisfied unless η_current really is an (exactly) optimal design measure. The realistic goal of a numerical search is to find a design that is very near optimal given a reasonable expenditure of computer time. So as with most "stopping criteria," this step is usually modified to:

    if F_φ(M(η_current), xx′) ≤ δ, STOP ...

for some specified small but positive δ. This is reasonable, but we should also ask how "far from optimal" the η_final produced by the algorithm may be. One answer to this question can be developed as follows. Suppose we've selected a value of δ, have executed the algorithm, and have constructed η_final. The actual optimal design, which is likely not exactly the same, is η*. We know that at least one η* has a finite number of support points (Caratheodory), and the differentiability requirement for φ says that we can write

    F_φ(M(η_final), M(η*)) = Σ_{i=1}^{I} λ_i F_φ(M(η_final), x_i x_i′)

where the x_i are the support points and the λ_i are the associated probabilities of such an η*. The sum on the right is clearly bounded by max_{x ∈ X} F_φ(M(η_final), xx′), so

    F_φ(M(η_final), M(η*)) ≤ max_{x ∈ X} F_φ(M(η_final), xx′) ≤ δ

the last inequality being true because the algorithm stopped. We also know, because φ is concave, that

    F_φ(M(η_final), M(η*)) ≥ (1/ε) [ φ((1 − ε)M(η_final) + ε M(η*)) − φ(M(η_final)) ]

for any ε ∈ (0, 1]; letting ε = 1 gives

    F_φ(M(η_final), M(η*)) ≥ φ(M(η*)) − φ(M(η_final))

Together, these imply

    φ(M(η*)) − φ(M(η_final)) ≤ δ

so the selected value of δ is a (generally loose) bound on the decrease in φ resulting from use of the modified stopping rule.
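For D-optimality this guarantee is easy to evaluate once the search has stopped: the largest directional derivative over the candidate set is itself the bound on φ(M(η*)) − φ(M(η_final)). A minimal sketch in the same illustrative notation as before (X is the candidate model matrix, w the reported weight vector):

```r
# Bound on the D-criterion (log det M) shortfall of a reported design measure:
#   phi(M(eta*)) - phi(M(eta_final)) <= max_x F_phi(M(eta_final), x x')
d_gap_bound <- function(X, w) {
  k     <- ncol(X)                          # number of model parameters
  M_inv <- solve(crossprod(sqrt(w) * X))    # M(eta_final)^{-1}
  max(rowSums((X %*% M_inv) * X)) - k       # max_x  x' M^{-1} x  -  k
}
```

If the algorithm stopped under the modified rule with δ = 0.1 (rather than by hitting an iteration limit), this value will be at most 0.1, and the reported measure is then guaranteed to be within 0.1 of the optimal value of log det M(η).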
A Numerical Exercise

Two R scripts were written to implement the V-algorithm and the W-algorithm for D-optimality, respectively, for a regression problem in which U = [0, 1]² with a cubic polynomial model:

    y(u) = θ_1 + θ_2 u_1 + θ_3 u_2 + θ_4 u_1² + θ_5 u_1 u_2 + θ_6 u_2² + θ_7 u_1³ + θ_8 u_1² u_2 + θ_9 u_1 u_2² + θ_10 u_2³ + ε

For purposes of the search, U was "discretized" to a uniform grid of size 101 × 101. Each algorithm began with a randomly constructed design measure of 10 points of support, each with equal weight, and a stopping rule value of δ = 0.1 was used. The following plots show, for each algorithm, the points of η_0 together with the collection of points added (i.e. the u's associated with the x_add's), and an "image plot" of the total probability mass accumulated at each point in U.

[Figure: four panels – "V-algorithm, all points", "V-algorithm, mass", "W-algorithm, all points", "W-algorithm, mass" – plotted over u_1 and u_2 (jittered in the point plots).]

The two algorithms started with different random seed designs, so it is not surprising that the collections of points identified (left column) are somewhat different. However, both clearly place the great majority of probability mass on the same (to the precision of the grid) set of 16 points (more than the 10 parameters in the model, but substantially fewer than the Caratheodory bound). These 16 points account for 0.8976 of the accumulated probability mass over the 788 iterations of the V-algorithm, and 0.8974 of the accumulated probability mass over the 497 iterations of the W-algorithm. The points of support identified by both algorithms, and the approximate probabilities associated with each (after proportionately redistributing the approximately 0.1 of mass assigned to other points by each algorithm), are:

    point set                                                prob. mass per point
    (0.00,0.00) (0.00,1.00) (1.00,0.00) (1.00,1.00)                0.1006
    (0.00,0.32) (0.00,0.68) (1.00,0.32) (1.00,0.68)
    (0.32,0.00) (0.32,1.00) (0.68,0.00) (0.68,1.00)                0.0555
    (0.26,0.26) (0.26,0.74) (0.74,0.26) (0.74,0.74)                0.0378

A reasonably similar (and perhaps operationally simpler) design would round each of the coordinates above to the nearest value in {0, 1/4, 1/3, 2/3, 3/4, 1}, and include 10, 5, and 4 replicates of the points in the three sets listed, respectively, for a discrete design of N = 96 runs (although "reasonably similar" would need to be verified via calculation).
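The two scripts themselves are not reproduced in these notes, but the setup for this exercise is easy to state in code. The following sketch shows one way the model vector and the 101 × 101 candidate grid might be constructed (object and function names are illustrative, not those of the original scripts); from here, a random 10-point equally weighted starting measure and either step-size rule can be passed to the general sketch given earlier.

```r
# Full cubic model in u = (u1, u2): the 10-vector of model terms for one point.
f_cubic <- function(u) {
  u1 <- u[1]; u2 <- u[2]
  c(1, u1, u2, u1^2, u1 * u2, u2^2, u1^3, u1^2 * u2, u1 * u2^2, u2^3)
}

# Uniform 101 x 101 discretization of U = [0, 1]^2 used as the candidate set.
grid <- as.matrix(expand.grid(u1 = seq(0, 1, length.out = 101),
                              u2 = seq(0, 1, length.out = 101)))
X <- t(apply(grid, 1, f_cubic))   # 10201 x 10 candidate model matrix
```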
Point-Exchange Algorithms

Except asymptotically, all points added to a design in the V- and W-algorithms stay in the design. For example, the final weight of any point in the "initial design" that does not receive additional weight in any iteration is

    (1/N) ∏_{i=1}^{final} (1 − α_i)

which can never be exactly zero. In practice, this generally leads to an η_final that places very small weight on a large number of points, especially if η_0 isn't well chosen. Again, since the ultimate aim in practical design problems is to generate a discrete design, so-called "point-exchange" algorithms have been proposed that limit the number of points of support in any η_current by both adding and deleting points in the iterations. While such an approach can be taken with general design measures, convergence is often much more difficult (or impossible) to prove. Further, since the motivating concern is to produce a near-optimal discrete design, point-exchange algorithms are often written to directly construct discrete designs for a specified value of N, with equal weight assigned to each run (but allowing for multiple runs at any x).

Fedorov Algorithm (Not "V-")

If iterative construction of a discrete design with a specific value of N is the goal, perhaps the most obvious approach is the following:

1. begin with an arbitrary N-point discrete design, U_0 → U_current
2. find (x_add, x_delete) = argmax_{x_1 ∈ X, x_2 ∈ U_current} [φ(U_current + x_1 − x_2) − φ(U_current)]
3. if the difference computed in step 2 = 0, STOP ...
4. modify U_current ← U_current + x_add − x_delete
5. return to step 2

Among point-exchange algorithms, this approach is one of the most successful at generating good designs; however, it is very computationally intensive. Note that a very large number of comparisons must be examined in each iteration. Suppose that U (and therefore also X) is (or has been "narrowed down to") a large-but-finite collection of M "candidate points". Then each iteration requires the examination of M × N possible "exchanges". Furthermore, these are not "rank-one" changes as described above (although a careful implementation can be based on looking at, for example, each of N rank-one deletions after implementing each of M rank-one additions). It should also be noted that, even with its generally demanding computational burden, the Fedorov algorithm can only guarantee that the design produced, U_final, cannot be improved by the replacement of any single point in the design with any point from U; it might be quite possible that a more extensive modification would lead to a larger value of φ. Still, as noted, the algorithm is generally effective. The two alternatives described below are attempts to construct a near-optimal design more quickly.

Wynn Algorithm (Not "W-")

A "Fedorov-like" algorithm with less demanding calculations at each iteration is as follows:

1. begin with an arbitrary N-point design, U_0 → U_current
2. find x_add = argmax_{x ∈ X} F_φ(M(U_current), xx′), and set U⁺_current ← U_current + x_add
3. find x_delete = argmin_{x ∈ U⁺_current} F_φ(M(U⁺_current), xx′), and set U⁺⁻_current ← U⁺_current − x_delete
4. if φ(U⁺⁻_current) ≤ φ(U_current), STOP ...
5. U_current ← U⁺⁻_current
6. return to step 2

The Wynn algorithm accomplishes an "exchange" in M + N rank-one calculations (compared to the M × N more difficult calculations required by the Fedorov algorithm). The strength of the claim that can be made for the result of the calculation is correspondingly weakened: no improvement to U_final can be made by first adding the "most helpful" point and then removing the "least helpful" point, but a Fedorov-type exchange (which is more general) might lead to an improvement. In practice, because an implementation of the Wynn algorithm executes much more quickly than an implementation of the Fedorov algorithm, the former can often be run several times (from different seed designs) with the same or less effort than a single execution of the latter. Again, there is no general theory favoring one approach or the other, but the Fedorov approach is probably more popular in situations where computational capabilities support it.
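A minimal sketch of one Wynn-type exchange for D-optimality follows, operating on an N-row discrete design Ucur and a finite candidate matrix cand, with f_model as in the earlier sketches (all names are illustrative). The additive constant −k in F_φ is dropped, since it does not affect the argmax or argmin, and the explicit solve() calls could be replaced by the rank-one updates described earlier in a more careful implementation.

```r
# One "add the most helpful point, then delete the least helpful" exchange for
# D-optimality on an N-point discrete design.  Returns the new design and a
# flag saying whether the exchange improved the criterion (step 4 above).
wynn_exchange <- function(Ucur, cand, f_model) {
  infmat <- function(U) crossprod(t(apply(U, 1, f_model))) / nrow(U)
  logdet <- function(U) determinant(infmat(U), logarithm = TRUE)$modulus
  dirder <- function(U, pts) {               # x' M(U)^{-1} x for each row of pts
    Xp    <- t(apply(pts, 1, f_model))
    M_inv <- solve(infmat(U))
    rowSums((Xp %*% M_inv) * Xp)
  }
  i_add <- which.max(dirder(Ucur, cand))     # step 2: most helpful candidate
  Uplus <- rbind(Ucur, cand[i_add, ])        # U+  : design with that point added
  i_del <- which.min(dirder(Uplus, Uplus))   # step 3: least helpful point of U+
  Unew  <- Uplus[-i_del, , drop = FALSE]     # U+- : back to N points
  if (logdet(Unew) > logdet(Ucur)) list(design = Unew, improved = TRUE)
  else                             list(design = Ucur, improved = FALSE)
}
```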
DETMAX

Several discrete design construction algorithms have been proposed that attempt to improve on the performance of the Wynn scheme without incurring the computational cost of the Fedorov approach. One of the most generally successful is a modified point-exchange algorithm described by Mitchell (1974), which he called DETMAX. It was specifically proposed in the context of D-optimality, but could be implemented (and is described below) for a general criterion φ.

1. set values for the design size N, an upper bound UB, and a lower bound LB
2. initialize "failure lists" F_LB ... F_N ... F_UB and "excursion lists" E_LB ... E_N ... E_UB
3. select an initial N-point design U_0
   U_current ← U_0, U_best ← U_0
   add U_current to E_N, n ← N
4. "EXCURSION" begins here:
   if φ(U_current) > φ(U_best), U_best ← U_current and clear the F's; else add the E's to the F's
   in any case, clear the E's
5. with probability 1/2, go to step 6; with probability 1/2, go to step 7
6. find x_add = argmax_{x ∈ X} F_φ(M(U_current), xx′)
   n ← n + 1
   U_current ← U_current + x_add
   add U_current to E_n
   if n = N, go to 4
   if n > UB, report U_best and stop
   if U_current ∈ F_n, go to 6; else go to 7
7. find x_delete = argmin_{x ∈ U_current} F_φ(M(U_current), xx′)
   n ← n − 1
   U_current ← U_current − x_delete
   add U_current to E_n
   if n = N, go to 4
   if n < LB, report U_best and stop
   if U_current ∈ F_n, go to 7; else go to 6

The idea behind DETMAX is that a one-point sequential exchange, as used in the Wynn algorithm, may be improved upon if the algorithm is allowed to take "excursions" that entertain intermediate designs with more than N + 1 or fewer than N − 1 points. Hence, a DETMAX excursion might add two or three points sequentially before making the deletions that bring the design back to the specified size. "Excursion" lists are used to keep track of the designs of each allowable size that have been considered in an excursion, and "failure" lists are used to keep track of designs considered in an unsuccessful excursion. In each excursion, designs of more than N + 1 points (or fewer than N − 1 points) are considered only if the best (N + 1)-point (or best (N − 1)-point) design has been recorded as a "failure", and designs with even more (or fewer) points are similarly considered only to avoid repeating "failures". The algorithm stops when any excursion is forced to consider a design of size larger than the set upper bound, or smaller than the set lower bound.

A review and comparison of the three discrete design algorithms described here, as well as a few others, was presented by Cook and Nachtsheim (1980).
A Few Other Approaches

Beyond the material we've focused on here, a number of other approaches to numerical optimization (most of them not specific to the optimal design problem) have been used in this context. I briefly describe only a few of them in the following notes, with pointers to a few publications in which these ideas have been used to design experiments.

Coordinate Exchange – Of the five methods listed here, this is the only one introduced in the statistics literature, by Meyer and Nachtsheim (1995). As opposed to the point-exchange methods discussed above, coordinate exchange works iteratively by:

• identifying the k points in the current design at which Var(ŷ(u)) is smallest
• for each of these, evaluating the effect of exchanging one coordinate of u with another acceptable value for that coordinate
• making changes where these improve the design

Because changes are considered on a coordinate-by-coordinate basis, U must be a "hyperrectangle", i.e. the range of each element of u must be independent of the values of the other elements.

Branch-and-Bound – This optimization technique essentially provides a short-cut to actually evaluating all possible experimental designs of size N that can be assembled from a finite collection of candidate points U, and so actually guarantees that the design found is optimal. Welch (1982) described an implementation in which:

• a design is characterized by a vector of probabilities η associated with the points in U
• a constraint set is defined by vectors LB and UB, and applied by requiring LB ≤ η ≤ UB
• a binary tree is constructed for which
  – each node is associated with a constraint set
  – "lower" nodes are partitions of this constraint set
  – the lowest-level nodes correspond to individual designs
• bounds on φ are placed on sub-trees, allowing elimination of some sub-trees based on knowledge of some known (good) designs

The technique is very effective, but generally cannot be used on very large problems.

Simulated Annealing – Haines (1987) described how this probabilistic search process, which was inspired by the chemistry of material annealing, can be used in design construction. Her algorithm (as with most in the optimization literature) is described as a function minimizer, and it operates by improving a single design iteratively. At each iteration:

• a point u_delete is randomly selected from the current design, and another point u_add is randomly selected from U
• Δ = φ(U_current + u_add − u_delete) − φ(U_current) is computed
• the exchange is "accepted" (i.e. becomes the next U_current) with probability 1 if Δ ≤ 0, and with probability e^(−Δ/t) if Δ > 0
• t, the "temperature", is reduced over iterations according to an "annealing schedule"

Genetic Algorithms – This is a popular stochastic optimization approach based loosely on biological evolution. Hamada et al. (2001) described how it can be implemented to iteratively improve the best of a "population" of trial designs of a fixed size:

• a design (a "chromosome") is defined by N vectors u, each of which contains r elements ("genes")
• the population of designs is improved by random application of:
  – "crossover", in which genes from two different designs are exchanged
  – "mutation", in which one or more genes from a single design are changed
• the size of the population increases through "reproduction" at each stage, but is reduced back to the (generally same) size by retaining only the best (as evaluated by φ) designs

Particle Swarm Optimization – This is a more recent approach to optimization, loosely based on the observed flight patterns of groups of birds. Chen et al. (2011) have written a widely referenced technical report on applications to experimental design, in which each of a group of experimental designs is characterized by the design U and a "velocity" V indicating a nominal direction and rate of change of U in U^r. In each iteration:

• each V is updated, with bias toward the currently best U and toward the global (to this point in the search) best U
• each U is updated with its corresponding V

Because this method is based on an idea of "smooth" transitions, it is most easily applied in continuous (rather than discrete) design contexts.
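Of the stochastic approaches above, the simulated annealing exchange described by Haines (1987) is the simplest to sketch. The version below is an illustration only: it is written for D-optimality as a maximization of log det M, so the sign of Δ is flipped relative to the minimization form described above, and the geometric cooling schedule, argument names, and default settings are assumptions of this sketch rather than details of Haines' implementation.

```r
# Simulated-annealing point exchange, maximizing log det M over N-point designs.
# Ucur: current design (one point per row); cand: candidate points; f_model as before.
anneal_design <- function(Ucur, cand, f_model, t0 = 1, cool = 0.99, n_iter = 5000) {
  logdet <- function(U)
    determinant(crossprod(t(apply(U, 1, f_model))) / nrow(U), logarithm = TRUE)$modulus
  phi_cur <- logdet(Ucur)
  temp <- t0
  for (i in 1:n_iter) {
    i_del   <- sample(nrow(Ucur), 1)                    # random point to drop
    i_add   <- sample(nrow(cand), 1)                    # random candidate to add
    Unew    <- rbind(Ucur[-i_del, , drop = FALSE], cand[i_add, ])
    phi_new <- logdet(Unew)
    Delta   <- phi_cur - phi_new                        # > 0 means the exchange is worse
    if (Delta <= 0 || runif(1) < exp(-Delta / temp)) {  # accept improvements always,
      Ucur    <- Unew                                   # worse moves with prob e^(-Delta/t)
      phi_cur <- phi_new
    }
    temp <- cool * temp                                 # "annealing schedule"
  }
  Ucur
}
```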
References

Chen, R.-B., S.-P. Chang, W. Wang, and W.K. Wong (2011, preprint). "Optimal Experimental Designs via Particle Swarm Optimization."

Cook, R.D. and C.J. Nachtsheim (1980). "A Comparison of Algorithms for Constructing Exact D-Optimal Designs," Technometrics 22, 315-324.

Fedorov, V.V. (1972). Theory of Optimal Experiments. Trans. and ed. by W.J. Studden and E.M. Klimko. New York: Academic Press.

Haines, L.M. (1987). "The Application of the Annealing Algorithm to the Construction of Exact Optimal Designs for Linear Regression Models," Technometrics 29, 439-447.

Hamada, M., H.F. Martz, C.S. Reese, and A.G. Wilson (2001). "Finding Near-Optimal Bayesian Experimental Designs via Genetic Algorithms," The American Statistician 55, 175-181.

Meyer, R.K. and C.J. Nachtsheim (1995). "The Coordinate-Exchange Algorithm for Constructing Exact Optimal Experimental Designs," Technometrics 37, 60-69.

Mitchell, T.J. (1974). "An Algorithm for the Construction of 'D-Optimal' Experimental Designs," Technometrics 16, 203-210.

Silvey, S.D. (1980). Optimal Design: An Introduction to the Theory for Parameter Estimation. London: Chapman and Hall.

Welch, W.J. (1982). "Branch-and-Bound Search for Experimental Designs Based on D-Optimality and Other Criteria," Technometrics 24, 41-48.

Wynn, H.P. (1970). "The Sequential Generation of D-Optimum Experimental Designs," Annals of Mathematical Statistics 41, 1655-1664.