Stated Choice Experiments

Most of these notes closely follow material in the textbook of Street and Burgess (2007). That book goes well beyond what is discussed here; this material should be considered only a very brief introduction to the topic.

Introduction and Definitions

In a stated choice experiment, individuals (who are the experimental units) are given a set of options, and are asked to identify the one element of the set they prefer. Street and Burgess note that applications can be found in settings such as government services (e.g. where visitors to national parks are asked about various elements of the provided services) and business (in the form of market research). More specifically, a stated choice experiment is defined by a set of choice sets. Each choice set contains two or more options/alternatives. These choice sets are presented to respondents/subjects, each of whom is asked to identify the option they most prefer from each set. A very simple example might involve a survey of a group of employees who are asked how they prefer commuting to work; the single choice set in this case might be { drive, catch a bus, walk, cycle, other }. Where the rules of response require that exactly one choice be reported by each respondent for each choice set, this is sometimes called a forced choice experiment. For obvious reasons, it is helpful to have options such as "other" and "none of the above" with forced choice experiments. Much of what will be discussed here focuses on the setting in which each option in the choice set can be described by specifying the level of each of a number of attributes. This is exactly parallel to the idea of defining treatments in factorial experiments.
An example might involve choices among health insurance plans, which are described by the maximum out-of-pocket expense a patient might be charged, whether the coverage extends to facilities outside of the patient's local care network, whether specified pre-existing conditions can be covered, et cetera. A number of other variants and specific issues have been addressed in the literature on stated choice experiments, including:

• binary choice, in which options are not comparative, but responses are only "yes" or "no" for sets that each contain a single option

• implications for including (or not including) a "none" option in each set

• issues that arise when some combinations of attribute levels are unrealistic or impossible, or when it is known ahead of time that certain combinations of attribute levels would always be preferred (i.e. a universally perceived best product for the lowest price)

We will not discuss these here, but focus on the basic problem and the construction and comparison of experimental designs, which in this case amount to selection of choice sets from among the collection of options to be studied.

The Bradley-Terry Model

The Bradley-Terry model was developed for paired comparison experiments; these correspond to stated choice experiments in which each choice set contains two elements. Suppose for the moment that we denote the entire collection of choices to be compared as {T_1, T_2, ..., T_t}. The model specifies that when T_i is compared to T_j, T_i will be preferred with probability

    Pr(T_i is preferred to T_j) = π_i / (π_i + π_j),   i ≠ j; i, j = 1, 2, 3, ..., t

for model parameters π_i ≥ 0, i = 1, 2, 3, ..., t, written collectively as the t-vector π. Note that there is an overparameterization here; multiplying all π_i's by the same positive number results in an equivalent model. This ambiguity is eliminated by adding a constraint such as ∏_i π_i = 1.
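As a small concrete illustration, the preference probabilities and the ∏_i π_i = 1 normalization can be written in a few lines of Python (a sketch; the function names are mine, not from Street and Burgess):

```python
import numpy as np

def bt_prob(pi, i, j):
    """Bradley-Terry probability that option i is preferred to option j."""
    return pi[i] / (pi[i] + pi[j])

def normalize(pi):
    """Rescale pi so that prod(pi) = 1, removing the overparameterization."""
    pi = np.asarray(pi, dtype=float)
    return pi / np.prod(pi) ** (1 / len(pi))

# Rescaling leaves every preference probability unchanged.
pi = normalize([2.0, 1.0, 1.0])
```

Note that `bt_prob([2, 1, 1], 0, 1)` and `bt_prob(normalize([2, 1, 1]), 0, 1)` return the same value, which is the sense in which the model is overparameterized.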
It should be noted that the Bradley-Terry model is not "nonparametric", but imposes a specific structure on the preference probabilities. This is made obvious by observing that there are t − 1 "free" parameters in the model after imposing the constraint, but C(t,2) = t(t−1)/2 probabilities being modeled. Because the amount of information in categorical data is much more limited than that associated with real-valued data, relatively "heavy-handed" parametric modeling may be necessary to achieve useful results. The constraints implied by the model form are generally reasonable in this context; they imply that options that are "relatively good" (π > c for some constant c, say) have probability greater than one-half of being preferred to options that are "relatively bad" (π < c). A little algebra reveals that the structure of the model prevents "circular" patterns of the form

    π_1/(π_1 + π_2) > 1/2,   π_2/(π_2 + π_3) > 1/2,   and   π_3/(π_3 + π_1) > 1/2.

Suppose that each subject is shown the same choice sets (pairs of choices to be compared), and that subjects are not asked to repeat any comparison (no replicate comparisons within subject). Let

    n_ij = 1 if the pair (T_i, T_j) is included as a choice set
           0 if the pair (T_i, T_j) is not included as a choice set

Suppose that the choice sets are presented to s subjects, and for subject α the experimental results are represented as:

    w_ijα = 1 when T_i is preferred to T_j
            0 when T_j is preferred to T_i

With the convention that w_ijα = 0 if n_ij = 0, we can write the probability function for w_ijα as

    f_ijα(w_ijα, π) = π_i^{w_ijα} π_j^{w_jiα} / (π_i + π_j)^{n_ij}

If we let w_ij = Σ_{α=1}^s w_ijα and w_i = Σ_{j=1}^t w_ij, and if we assume independence among all choices, then the likelihood function for the entire experiment may be written as

    L(π) = π_1^{w_1} π_2^{w_2} ... π_t^{w_t} / ∏_{i<j} (π_i + π_j)^{s·n_ij}

The independence assumption, like the parametric model form, can probably be questioned for its realism in some cases.
For example, suppose that 2/3 of all subjects prefer T_1 to any other choice, and 1/3 of all subjects prefer T_2 to any other choice. Before parameter normalization, this could be expressed by saying π_1 = 2 and π_2 = 1. But in this case, when any subject prefers T_1 to T_2, that subject will also prefer T_1 to (for example) T_3, and so the assumption of independence does not hold. Again, this should probably be viewed as a pragmatic assumption motivated by the need to extract a meaningful inference from data that contain limited information, but it also should be a warning that some effort should be made to verify that the assumptions are at least reasonable in any given situation. Possible model extensions to partially address this (that I've not seen, but may well have been explored) might involve relatively small random effects associated with each subject and treatment, to allow for more complex subject-conditional probabilities, and so perhaps more realistic independence assumptions. Maximum likelihood estimates under the independent-comparisons assumption can be found by differentiating log(L) with respect to each π_i, equating to zero, and iteratively solving the t resulting equations. Convergence is assured under either of two conditions:

• All n_ij = 1.

• In every possible partition of the objects into two non-empty subsets, some object in the first set is preferred at least once to some object in the second.

The second condition does not require the first, but is not entirely design-dependent. That is, there is always the possibility that a given choice is never selected even if it is directly compared to all other choices. What can be done at the design stage to ensure that the second condition can be met is to require that the design be connected. This means that for every two options T_i and T_j, it is possible to construct a list of objects T_i, T_{i1}, T_{i2}, ..., T_j such that each consecutive pair is included as a choice set.
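The iterative solution of the likelihood equations can be sketched with the classical fixed-point update π_i ← w_i / Σ_{j≠i} N_ij/(π_i + π_j), where N_ij is the total number of times the pair (i, j) was compared. This is one standard way of solving the Bradley-Terry likelihood equations, offered here only as an illustration; the function names and the data are mine, not from Street and Burgess:

```python
import numpy as np

def bt_mle(t, counts, n_iter=500):
    """Iteratively solve the Bradley-Terry likelihood equations.

    counts[i, j] = number of times T_i was preferred to T_j over all
    subjects.  Each sweep applies the fixed-point update
        pi_i <- w_i / sum_{j != i} N_ij / (pi_i + pi_j)
    and then re-imposes the constraint prod(pi) = 1.
    """
    counts = np.asarray(counts, dtype=float)
    N = counts + counts.T          # comparisons per pair
    w = counts.sum(axis=1)         # total "wins" per option
    pi = np.ones(t)
    for _ in range(n_iter):
        for i in range(t):
            denom = sum(N[i, j] / (pi[i] + pi[j]) for j in range(t) if j != i)
            pi[i] = w[i] / denom
        pi /= np.prod(pi) ** (1 / t)   # normalize so prod(pi) = 1
    return pi

# Hypothetical data: 3 options, every pair shown to 10 subjects.
counts = np.array([[0, 7, 8],
                   [3, 0, 6],
                   [2, 4, 0]])
pi_hat = bt_mle(3, counts)
```

At convergence the fitted π's satisfy the likelihood equations: the expected number of wins Σ_{j≠i} N_ij π_i/(π_i + π_j) matches the observed w_i for each option.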
Under this constraint, if each option is chosen at least once (over the entire experiment), then the second condition is satisfied.

The per-choice contribution of subject α to the (i, j) element of the Fisher information matrix (i.e. the actual information matrix associated with subject α's data, divided by N) is:

    I(π)_{i,j} = Σ_{a<b} λ_ab E[ (∂ log f_abα(w_abα, π)/∂π_i) (∂ log f_abα(w_abα, π)/∂π_j) ]

where λ_ab = n_ab/N and N = Σ_{i<j} n_ij. For i ≠ j, we find that

    ∂ log f_ijα(w_ijα, π)/∂π_i = w_ijα/π_i − 1/(π_i + π_j).

w_ijα is a Bernoulli random variable that takes the value of 1 with probability π_i/(π_i + π_j). Assuming independence for the collection of comparisons, this leads to

    I(π)_{i,j} = −λ_ij / (π_i + π_j)^2.

Similar considerations lead to an expression for the diagonal elements,

    I(π)_{i,i} = Σ_{a≠i} λ_ia π_a / [ (π_a + π_i)^2 π_i ].

Assuming that the probabilities are the same for each subject, this matrix is also the actual overall information matrix for the experiment, divided by s × N. (This is in the spirit of our previous definitions of M, which were normalized by the number of experimental trials in the design.) Where choices are re-expressed in factorial-like attribute parameterization, it will be more convenient to work with the logs of the Bradley-Terry π's. Define γ_i = log π_i, and the corresponding t-element vector as γ. Changing variables is fairly straightforward, given ∂γ_i/∂π_i = 1/π_i. It follows that the information matrix for γ has elements

    I(γ)_{i,i} = π_i Σ_{a≠i} λ_ia π_a / (π_i + π_a)^2,    I(γ)_{i,j} = −λ_ij π_i π_j / (π_i + π_j)^2,   i ≠ j.

Following Street and Burgess (who apparently follow others), we also denote I(γ) by Λ(π).

Example:

1.) Suppose t = 4, and all 6 possible option pairs are shown to each of 5 subjects.
From the expressions above:

    Λ(π) = (1/6) ×
      [ π_1 Σ_{j≠1} π_j/(π_1+π_j)^2    −π_1π_2/(π_1+π_2)^2           −π_1π_3/(π_1+π_3)^2           −π_1π_4/(π_1+π_4)^2
        −π_1π_2/(π_1+π_2)^2           π_2 Σ_{j≠2} π_j/(π_2+π_j)^2    −π_2π_3/(π_2+π_3)^2           −π_2π_4/(π_2+π_4)^2
        −π_1π_3/(π_1+π_3)^2           −π_2π_3/(π_2+π_3)^2           π_3 Σ_{j≠3} π_j/(π_3+π_j)^2    −π_3π_4/(π_3+π_4)^2
        −π_1π_4/(π_1+π_4)^2           −π_2π_4/(π_2+π_4)^2           −π_3π_4/(π_3+π_4)^2           π_4 Σ_{j≠4} π_j/(π_4+π_j)^2 ]

2.) Suppose t = 4, and that only 3 choice sets – (T_1, T_2), (T_1, T_3), and (T_1, T_4) – are shown to each of the 5 subjects. Then:

    Λ(π) = (1/3) ×
      [ π_1 Σ_{j≠1} π_j/(π_1+π_j)^2    −π_1π_2/(π_1+π_2)^2    −π_1π_3/(π_1+π_3)^2    −π_1π_4/(π_1+π_4)^2
        −π_1π_2/(π_1+π_2)^2           π_1π_2/(π_1+π_2)^2     0                      0
        −π_1π_3/(π_1+π_3)^2           0                      π_1π_3/(π_1+π_3)^2     0
        −π_1π_4/(π_1+π_4)^2           0                      0                      π_1π_4/(π_1+π_4)^2 ]

Here we run into the difficulty we have for experimental design with any nonlinear model – the information matrix that is relevant for judging the value of an experimental design depends on the model parameters that would be the focus of the experiment – in this case, the vector π or equivalently γ. If some prior information is known about these parameters, a single guessed value might be used, or an approach based on a region ("robust design") or prior distribution ("Bayes") might be taken. In this case, a common choice is to design as if all choices are to be equally preferred, that is π_i = 1 for all i = 1, 2, 3, ..., t. For this set of parameter values, the information matrices for these two examples become:

    Λ(1) = (1/24) [  3 −1 −1 −1        and    Λ(1) = (1/12) [  3 −1 −1 −1
                    −1  3 −1 −1                                −1  1  0  0
                    −1 −1  3 −1                                −1  0  1  0
                    −1 −1 −1  3 ]                              −1  0  0  1 ]

respectively.

A Representation Based on Attributes

Consider re-expression of the choices available as a specified level for each of k attributes. Suppose the qth attribute has l_q levels (denoted 0, 1, 2, ..., l_q − 1 as with standard notation for symbols in orthogonal arrays) so that the number of choices is t = ∏_{q=1}^k l_q. A particular choice can be fully specified by a list of the corresponding attribute levels.
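The Λ(π) matrices for these two examples can be generated numerically from the I(γ) expressions above (a sketch; the function name is mine, not from Street and Burgess):

```python
import numpy as np

def bt_info_gamma(t, pairs, pi):
    """Per-comparison information matrix Lambda(pi) = I(gamma) for a
    paired-comparison design given as a list of (i, j) index pairs."""
    pi = np.asarray(pi, dtype=float)
    N = len(pairs)
    Lam = np.zeros((t, t))
    for i, j in pairs:
        lam = 1.0 / N                                  # lambda_ij = n_ij / N
        term = pi[i] * pi[j] / (pi[i] + pi[j]) ** 2
        Lam[i, i] += lam * term
        Lam[j, j] += lam * term
        Lam[i, j] -= lam * term
        Lam[j, i] -= lam * term
    return Lam

all_pairs = [(i, j) for i in range(4) for j in range(i + 1, 4)]
t1_pairs = [(0, 1), (0, 2), (0, 3)]
L_all = bt_info_gamma(4, all_pairs, np.ones(4))   # example 1
L_t1 = bt_info_gamma(4, t1_pairs, np.ones(4))     # example 2
```

With π = 1, `L_all` reproduces the (1/24)(4I − J) pattern displayed above, and `L_t1` reproduces the second matrix; both have zero row sums, anticipating the singularity discussed later.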
For notational specificity, now assume that the treatments are ordered lexicographically by this list representation; that is, T_1 corresponds to (0, 0, ..., 0), T_2 to (0, 0, ..., 1), ..., T_{l_k} to (0, 0, ..., l_k − 1), T_{l_k + 1} to (0, 0, ..., 1, 0), ..., and finally T_t to (l_1 − 1, l_2 − 1, ..., l_k − 1). The aim now will be to express each γ_i with a linear model in terms of meaningful main effects, two-factor interactions, et cetera defined from the attributes. Specifically, we will work with a partitioned model. Toward this end, construct a (t − 1) × t matrix B for which each row is a contrast of unit length, and every two rows are orthogonal. The rows define factorial contrasts in the elements of γ. We partition B into 3 sets of rows (or sub-matrices by rows):

    B = [ B_1
          B_2
          B_3 ]

The first submatrix B_1 contains the set of contrasts in which we have interest, the second submatrix B_2 contains contrasts for which corresponding effects should be included in the model, but which will be regarded as nuisance parameters, and the third submatrix B_3 contains contrasts for which we are willing to assume that the corresponding effects are actually zero. The implied linear model for γ is then:

    γ = B′β = B_1′β_1 + B_2′β_2 + B_3′β_3,   β_3 = 0

For example, suppose t = 8, k = 3, and each attribute has 2 levels. Suppose our interest centers on estimating the 3 main effect contrasts, we suspect that the two-factor interactions may be present but have no real interest in them, and that we are willing to assume that the three-factor interaction is absent. Using the common ± full-rank parameterization (with treatments in lexicographic order and interaction rows formed as elementwise products of main effect rows), the B matrix, partitioned as described above, is:

    B = (1/(2√2)) [ −1 −1 −1 −1  1  1  1  1
                    −1 −1  1  1 −1 −1  1  1
                    −1  1 −1  1 −1  1 −1  1
                     1  1 −1 −1 −1 −1  1  1
                     1 −1  1 −1 −1  1 −1  1
                     1 −1 −1  1  1 −1 −1  1
                    −1  1  1 −1  1 −1 −1  1 ]

Denote by B_12 the matrix containing the rows of B_1 and B_2. Then the information matrix for β_12 = (β_1′, β_2′)′ is

    I(β_12) = B_12 Λ(π) B_12′ = C = [ C_11  C_12
                                      C_12′ C_22 ].
The information matrix for β_1, given that β_2 must also be included in the model, is

    I(β_1 | β_2) = C_11 − C_12 C_22^{−1} C_12′

Example, Continued:

Continuing with the earlier t = 4 example, suppose the choices are actually defined by two 2-level attributes/factors, and that we want to make inferences about the two main effects:

    B_1 = (1/2) [ −1 −1  1  1
                  −1  1 −1  1 ]

and that, while not of direct interest, we are not willing to ignore the possibility of a two-factor interaction:

    B_2 = (1/2) ( 1 −1 −1 1 )

For the two designs specified earlier (all 6 pairs, and only pairs including T_1), the matrix C is

    (1/(24×4)) [ 16  0  0     =  (1/6) I_3
                  0 16  0
                  0  0 16 ]

and

    (1/(12×4)) [  8  4 −4     =  (1/12) [  2  1 −1
                  4  8 −4                  1  2 −1
                 −4 −4  8 ]               −1 −1  2 ]

respectively, and the corresponding matrix I(β_1 | β_2) is

    (1/6) [ 1 0     and    (1/24) [ 3 1
            0 1 ]                   1 3 ]

respectively.

Structural Properties

Street and Burgess cite Huber and Zwerina (1996) for a set of 4 design properties they describe as "structural". These clearly have implications for the statistical properties of inferences that can be drawn, but are appealing otherwise in their own right:

1. Level Balance: All the levels of each attribute should occur with equal frequency over all options in the choice sets.

2. Orthogonality: The joint occurrence of any two levels of different attributes should appear in options with frequencies equal to the product of their marginal frequencies.

3. Minimal Overlap: The frequency with which an attribute level repeats itself in each choice set should be as small as possible. If the number of items in each choice set is fewer than the number of levels for an attribute, then no attribute level should be repeated.

4. Utility Balance: Options within a choice set should be equally attractive to subjects.

Level balance and orthogonality, in particular, are properties of some of the best designs for linear models; their connection to statistical optimality may not be as firm here, but they are certainly reasonable design properties in any case.
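The C matrices and Schur complements in the continued t = 4 example can be checked numerically (a sketch; function names mine, and the contrast signs follow the B_1 and B_2 displayed above):

```python
import numpy as np

# Contrast rows for the t = 4, two-attribute example (treatments ordered
# 00, 01, 10, 11): two main effects (B1) and the interaction (B2).
B1 = 0.5 * np.array([[-1, -1, 1, 1],
                     [-1,  1, -1, 1]])
B2 = 0.5 * np.array([[1, -1, -1, 1]])
B12 = np.vstack([B1, B2])

def info_beta(Lam):
    """C = B12 Lambda B12' and the Schur complement I(beta1 | beta2)."""
    C = B12 @ Lam @ B12.T
    C11, C12, C22 = C[:2, :2], C[:2, 2:], C[2:, 2:]
    return C, C11 - C12 @ np.linalg.inv(C22) @ C12.T

# Lambda(1) for the two designs (all 6 pairs; only pairs containing T1).
L_all = (4 * np.eye(4) - np.ones((4, 4))) / 24
L_t1 = np.array([[3, -1, -1, -1],
                 [-1, 1,  0,  0],
                 [-1, 0,  1,  0],
                 [-1, 0,  0,  1]]) / 12

C_all, I_all = info_beta(L_all)
C_t1, I_t1 = info_beta(L_t1)
```

The all-pairs design gives C = (1/6) I_3 and I(β_1 | β_2) = (1/6) I_2; the T_1-only design gives a non-diagonal C and the poorer conditional information matrix (1/24)[[3, 1], [1, 3]].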
Minimal overlap might be interpreted as a sort of "level balance" within each choice set. Utility balance is a bit different from the others in that it is actually a property of the anticipated responses; it can be partial justification for a "design assumption" that the π's are of equal value.

Optimal Designs for Paired Comparisons and Binary Attributes

Chapter 4 of Street and Burgess discusses the form of D- and A-optimal stated choice designs, for situations in which the choice sets are of size 2, all attributes have 2 levels, and the design values of all π_i's are equal. We only briefly outline their results here; the proofs of these results involve lengthy (although not especially difficult) algebra. Their arguments are restricted to a specific class of designs in which the set of possible choice pairs is restricted so that each pair of treatment combinations in which there are v attributes with different levels appears equally often. For example, for k = 2 attributes, the treatment pairs (00,11) and (01,10) each differ in the levels of v = 2 attributes, and so should either both appear, or neither should appear. Similarly the treatment pairs (00,01), (00,10), (01,11), and (10,11) each differ in the level of v = 1 attribute and should either all appear, or none should appear. This means that for this (very small) example, the only three acceptable designs consist of the first set of pairs, the second set of pairs, or both sets. Within this restriction, the properties of designs can be expressed in terms of:

    i_v = 1 if all choice pairs differing in v attributes are included
          0 otherwise

These quantities are then scaled by the number of comparisons made, a_v = i_v/N. Limiting attention to this class of designs constitutes a rather major restriction on the experimental design process.
However, all designs in this class have the appealing feature that the information matrix for factorial effects, C, is diagonal no matter what effects are classified as being of interest (contrasts in B_1), nuisance (B_2), or assumed absent (B_3).

Result 1 (Lemma 4.1.4, Theorem 4.1.1, and Theorem 4.1.2)

The information matrix for paired comparison designs for estimating main effects, assuming all higher order effects are absent, is

    C = [ (1/2) Σ_{v=1}^k a_v C(k−1, v−1) ] I_k

where C(n, m) denotes the binomial coefficient. The D-optimal paired comparison designs for estimating main effects, assuming all higher order effects are absent, consist of the foldover pairs only; that is, all k attributes appear at different levels in the two options in each choice set, i.e. a_k = 1/2^{k−1} and all other a_v = 0. This is also the subset of designs that are A-optimal in this context.

Example

The following presents most of the contents of Table 4.5 in Street and Burgess, showing the determinant calculation for the 7 competing designs for k = 3 binary attributes (t = 8 options):

    a_1    a_2    a_3    N     12a_1 + 12a_2 + 4a_3            |C|
    1/12   0      0      12    12/12 + 0 + 0         = 1    7.234 × 10^−5
    0      1/12   0      12    0 + 12/12 + 0         = 1    5.787 × 10^−4
    0      0      1/4    4     0 + 0 + 4/4           = 1    1.953 × 10^−3
    0      1/16   1/16   16    0 + 12/16 + 4/16      = 1    8.240 × 10^−4
    1/16   0      1/16   16    12/16 + 0 + 4/16      = 1    2.441 × 10^−4
    1/24   1/24   0      24    12/24 + 12/24 + 0     = 1    2.441 × 10^−4
    1/28   1/28   1/28   28    12/28 + 12/28 + 4/28  = 1    3.644 × 10^−4

The first 3 designs described in this table are comprised of the following choice sets:

• (000, 001), (000, 010), (000, 100), (001, 011), (001, 101), (010, 011), (010, 110), (100, 101), (100, 110), (011, 111), (101, 111), (110, 111)

• (000, 011), (000, 101), (000, 110), (001, 010), (001, 100), (001, 111), (010, 100), (010, 111), (011, 101), (011, 110), (100, 111), (101, 110)

• (000, 111), (001, 110), (010, 101), (100, 011)

The last 4 designs are comprised of combinations of these sets.
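Result 1's diagonal form makes the Table 4.5 determinants easy to reproduce by direct substitution (a sketch; the function name is mine):

```python
from math import comb

def det_C_main_effects(a, k=3):
    """|C| under Result 1: C = [ (1/2) sum_v a_v C(k-1, v-1) ] I_k,
    so |C| is the common diagonal element raised to the k-th power."""
    diag = 0.5 * sum(a[v - 1] * comb(k - 1, v - 1) for v in range(1, k + 1))
    return diag ** k

# The 7 designs of Table 4.5 as (a_1, a_2, a_3), with the reported |C|.
designs = {
    (1/12, 0, 0):       7.234e-5,
    (0, 1/12, 0):       5.787e-4,
    (0, 0, 1/4):        1.953e-3,
    (0, 1/16, 1/16):    8.240e-4,
    (1/16, 0, 1/16):    2.441e-4,
    (1/24, 1/24, 0):    2.441e-4,
    (1/28, 1/28, 1/28): 3.644e-4,
}
```

The foldover-only design (a_3 = 1/4) indeed gives the largest determinant, as Result 1 asserts.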
Linear Models Intuition from STAT 512

For many nonlinear design problems, it can be difficult to see any parallel at all to arguments for linear models. However, there are elements in this case that may make the comparison exercise useful. Consider a 2-level factorial problem in which a first-order linear model (intercept and main effects) is to be used, and the experiment must be performed using blocks of size 2, with block effects that are to be considered fixed. The fixed block assumption means (from STAT 512) that only the difference between the two responses from each block is informative; the individual values and their average are aliased with the unknown block effect and so cannot be directly compared to their counterparts from other blocks. This is not an exact analogue to the stated choice problem we are considering, where there is no interval-valued scale of measurement on which "the difference between the two responses from each block" can be recorded. However, the two problems are similar in that the useful information is confined to a comparison of two matched treatments. A full factorial design for factors A through F, for which experimental runs are "paired" with their fold-over counterparts, is specified by the generating relation

    I = ±AB = ±BC = ... = ±EF

along with all generalized interactions. Note that for k factors, this generating relation contains k − 1 "independent" words, and so does result in blocks of size 2^{k−(k−1)} = 2. Note also that all words of even length not listed are included as generalized interactions, e.g. AC = AB × BC, ACDF = AB × BC × DE × EF, et cetera. Hence the factorial effects confounded with the 2^{k−1} block differences are the 2^{k−1} − 1 factorial effects of even order. But now recall that in a regular blocked factorial experiment, all pairs of factorial effects are either orthogonal or confounded. Hence all effects of odd order are estimable (unbiased by any other effects).
As a result, a first order model could be estimated with full efficiency, i.e. each coefficient independently of the others, and each with variance σ²/2^k, and so this design is clearly optimal for this linear models setting. In fact, we know that if we are willing to assume higher-order odd effects are zero, smaller designs of block-size 2 could be constructed that are optimal for this problem. (For example, begin with a regular fraction of resolution III, "double" it by adding all foldover runs, creating a resolution IV design with a generating relation of only even-length words, and use fold-over pairs as blocks.) Street and Burgess continue their discussion of optimal designs for the stated choice problem by using fractional factorial plans in this way.

Result 2 (Lemma 4.1.6, Theorem 4.1.3, and Theorem 4.1.4)

The information matrix for paired comparison designs for estimating main effects and two-factor interactions, assuming all higher order effects are absent, is

    C = [ [ (1/2) Σ_{v=1}^k a_v C(k−1, v−1) ] I_k                     0
          0                     [ Σ_{v=1}^k a_v C(k−2, v−1) ] I_{k(k−1)/2} ]

The D-optimal paired comparison designs for estimating main effects and two-factor interactions, assuming all higher order effects are absent, are given by

    a_v = [ 2^{k−1} C(k, (k+1)/2) ]^{−1},   v = (k+1)/2,        if k is odd
    a_v = [ 2^{k−1} C(k+1, k/2) ]^{−1},     v = k/2, k/2 + 1,   if k is even

and all other a_v = 0. This is also the subset of designs that are A-optimal in this context.
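Result 2's block-diagonal form can be checked by direct substitution; the following sketch (function name mine) computes |C| for the seven k = 3 designs considered earlier, reproducing the determinants reported in Table 4.6 of Street and Burgess. For k = 3 the theorem prescribes v = (k+1)/2 = 2 with a_2 = [2² · C(3,2)]^{−1} = 1/12:

```python
from math import comb

def det_C_me_2fi(a, k=3):
    """|C| under Result 2: block-diagonal C with main-effect block
    (1/2) sum_v a_v C(k-1, v-1) I_k and two-factor-interaction block
    sum_v a_v C(k-2, v-1) I_{k(k-1)/2}.  Note math.comb(n, m) = 0
    when m > n, which handles the foldover (v = k) terms."""
    me = 0.5 * sum(a[v - 1] * comb(k - 1, v - 1) for v in range(1, k + 1))
    tfi = sum(a[v - 1] * comb(k - 2, v - 1) for v in range(1, k + 1))
    return me ** k * tfi ** (k * (k - 1) // 2)

# The 7 competing k = 3 designs, as (a_1, a_2, a_3).
designs = [(1/12, 0, 0), (0, 1/12, 0), (0, 0, 1/4), (0, 1/16, 1/16),
           (1/16, 0, 1/16), (1/24, 1/24, 0), (1/28, 1/28, 1/28)]
dets = {a: det_C_me_2fi(a) for a in designs}
```

The foldover-only design is now useless (|C| = 0, since foldover pairs carry no information about two-factor interactions), and the v = 2 design with a_2 = 1/12 is the D-optimal one, as the theorem asserts.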
Example

The following presents most of the contents of Table 4.6 in Street and Burgess, showing the determinant calculation for the 7 competing designs for k = 3 binary attributes (t = 8 options):

    a_1    a_2    a_3    N     |C|
    1/12   0      0      12    (1/24)^3 (1/12)^3 = 4.1862 × 10^−8
    0      1/12   0      12    (1/12)^3 (1/12)^3 = 3.3490 × 10^−7
    0      0      1/4    4     (1/8)^3 · 0^3     = 0
    0      1/16   1/16   16    (3/32)^3 (1/16)^3 = 2.0117 × 10^−7
    1/16   0      1/16   16    (1/16)^3 (1/16)^3 = 5.9605 × 10^−8
    1/24   1/24   0      24    (1/16)^3 (1/12)^3 = 1.4129 × 10^−7
    1/28   1/28   1/28   28    (1/14)^3 (1/14)^3 = 1.3281 × 10^−7

The Multinomial Logit Model

The multinomial logit model (MNL) appears to be a direct generalization of the Bradley-Terry model, extended for the case where choice sets have more than two elements. Again, suppose that we denote the entire collection of choices to be compared as {T_1, T_2, ..., T_t}. The model specifies that when T_i is compared to T_{j_1}, T_{j_2}, ..., T_i will be preferred with probability

    Pr(T_i is preferred to T_{j_1}, T_{j_2}, ...) = π_i / (π_i + Σ_l π_{j_l}),   i ≠ j_l; i, j_l = 1, 2, 3, ..., t

for model parameters π_i ≥ 0, i = 1, 2, 3, ..., t, written collectively as the t-vector π. Suppose that each subject is shown the same N choice sets, each comprised of m options, of which n_{i1,i2,...,im} compare the specific options T_{i1}, T_{i2}, ..., T_{im}, where

    n_{i1,i2,...,im} = 1 if the set (T_{i1}, T_{i2}, ..., T_{im}) is included as a choice set
                       0 if the set (T_{i1}, T_{i2}, ..., T_{im}) is not included as a choice set

so that N = Σ_{i1<i2<...<im} n_{i1,i2,...,im}. Extending the notation from the paired-comparisons case, for any specified subject, let

    w_{i1,i2,...,im} = 1 when T_{i1} is preferred to T_{i2}, ..., T_{im}
                       0 otherwise

with the convention that w_{i1,i2,...,im} = 0 if n_{i1,i2,...,im} = 0. Arguments paralleling those for the basic Bradley-Terry model lead to the per-comparison information matrix for γ, Λ. Define λ_{i1,i2,...,im} = n_{i1,i2,...,im}/N.
Then:

    Λ_{i1,i1} = π_{i1} Σ_{i2<i3<...<im} λ_{i1,i2,...,im} (Σ_{j=2}^m π_{ij}) / (Σ_{j=1}^m π_{ij})^2

    Λ_{i1,i2} = −π_{i1} π_{i2} Σ_{i3<i4<...<im} λ_{i1,i2,...,im} / (Σ_{j=1}^m π_{ij})^2

Under the assumption that π_1 = π_2 = ... = π_t = 1, this reduces to:

    Λ_{i1,i1} = ((m−1)/m^2) Σ_{i2<i3<...<im} λ_{i1,i2,...,im},    Λ_{i1,i2} = −(1/m^2) Σ_{i3<i4<...<im} λ_{i1,i2,...,im}

Optimal Designs for Larger Choice Sets and Binary Attributes

Chapter 5 of Street and Burgess extends some of the arguments made in Chapter 4 to situations in which each choice set has m (not necessarily 2) elements, continuing to focus on 2-level attributes. One simplification available with 2-element choice sets is the central role of v, the number of attributes for which the levels differ between the two choices offered. Optimal design characterization is similar here, but made more complicated by the fact that there are C(m,2) = m(m−1)/2 pairs of choices within each choice set. For this reason, for any potential choice set, define a difference vector

    v = (d_1, d_2, ..., d_{m(m−1)/2})

to be the collection of numbers of attributes with different levels for each pair of elements in the choice set. (Hence, for example, each d_i must be between 1 and k.) The order of elements isn't important to the arguments made, so the convention used is that d_1 ≤ d_2 ≤ ... ≤ d_{m(m−1)/2}. The results of this chapter are restricted to the class of designs for which, if any choice set with a given difference vector is included, all choice sets with that difference vector are also included. For example, if m = 3 and k = 3, all possible choice sets have one of the difference vectors v_1 = (1, 1, 2), v_2 = (1, 2, 3), v_3 = (2, 2, 2). There are 24, 24, and 8 possible choice sets associated with these difference vectors (respectively). Hence the only designs considered for this problem contain 8, 24, 32, 48, or 56 choice sets.
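The census of difference vectors quoted above can be verified by brute-force enumeration (a sketch; helper names mine):

```python
from itertools import combinations, product
from collections import Counter

k = 3
treatments = list(product((0, 1), repeat=k))   # the 8 treatments

def diff(s, t):
    """Number of attributes at different levels in treatments s and t."""
    return sum(si != ti for si, ti in zip(s, t))

# Sorted difference vector for each 3-element choice set (m = 3).
counts = Counter()
for S in combinations(treatments, 3):
    v = tuple(sorted(diff(s, t) for s, t in combinations(S, 2)))
    counts[v] += 1
```

Of the C(8,3) = 56 possible choice sets, 24 have difference vector (1, 1, 2), 24 have (1, 2, 3), and 8 have (2, 2, 2), so the admissible design sizes are exactly 8, 24, 32, 48, and 56.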
Result 3 (Theorem 5.1.1)

The D-optimal design for testing main effects only, when all other effects are assumed to be zero, is given by any collection of choice sets in which, for each difference vector v present,

    Σ_{i=1}^{m(m−1)/2} d_i = (m^2 − 1)k/4,   m odd
                             m^2 k/4,        m even

For these D-optimal designs and for main-effects contrasts,

    |C| = [ (m^2 − 1)/(m^2 2^k) ]^k,   m odd
          [ 1/2^k ]^k,                 m even

Example

Continuing the example from the beginning of this section, consider m = 3 and k = 3. The three possible difference vectors are v_1 = (1, 1, 2), v_2 = (1, 2, 3), and v_3 = (2, 2, 2). m is odd, so the theorem requires that each v have sum (m^2 − 1)k/4 = 6, so D-optimal designs can be constructed from choice sets associated with difference vectors v_2 and v_3. Within the class of designs being considered, this identifies three designs – the 24 choice sets associated with v_2, the 8 choice sets associated with v_3, or the 32 choice sets associated with either of them.

Thinking of This Another Way

Return to the case of m = 2, choice sets with only two elements, and recall that in this case,

    Λ(π)_{i,i} = π_i Σ_{a≠i} λ_ia π_a/(π_i + π_a)^2,    Λ(π)_{i,j} = −λ_ij π_i π_j/(π_i + π_j)^2,   i ≠ j

Note that this information matrix can also be written in a form we've used earlier. Define u_ij to be a t-element vector for which all elements are 0 except the ith and jth, and let these be:

    {u_ij}_i = −√(π_i π_j)/(π_i + π_j),    {u_ij}_j = +√(π_i π_j)/(π_i + π_j)

When the π's are assumed for design purposes to be of equal value, the two non-zero elements of u_ij are ±1/2. Then Λ(π) can be written as:

    Λ(π) = (1/N) Σ_{(i,j)} u_ij u_ij′

where the sum is taken over the N treatment pairs (i, j) that are compared. This form makes clear a fact that you may have noticed in earlier examples; for any design (selection of choice sets), Λ has row- and column-sums of zero, and hence is singular. This reflects the fact that the π's can be arbitrarily scaled by any multiplier, so the γ's can be shifted by any additive constant.
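The u-vector representation can be checked directly against the elementwise formulas for Λ(π), including for unequal π's (a sketch; the function names, the design, and the π values below are arbitrary illustrations of mine):

```python
import numpy as np

def u_vec(i, j, pi, t):
    """u_ij: zero except in positions i and j, built from the pi's."""
    u = np.zeros(t)
    s = np.sqrt(pi[i] * pi[j]) / (pi[i] + pi[j])
    u[i], u[j] = -s, s
    return u

def lam_from_u(pairs, pi):
    """Lambda(pi) assembled as (1/N) sum_(i,j) u_ij u_ij'."""
    t, N = len(pi), len(pairs)
    return sum(np.outer(u_vec(i, j, pi, t), u_vec(i, j, pi, t))
               for i, j in pairs) / N

def lam_direct(pairs, pi):
    """Lambda(pi) from the elementwise formulas."""
    t, N = len(pi), len(pairs)
    Lam = np.zeros((t, t))
    for i, j in pairs:
        term = pi[i] * pi[j] / (pi[i] + pi[j]) ** 2 / N
        Lam[i, i] += term
        Lam[j, j] += term
        Lam[i, j] -= term
        Lam[j, i] -= term
    return Lam

pi = np.array([2.0, 1.0, 1.0, 0.5])   # arbitrary unequal design values
pairs = [(0, 1), (0, 2), (1, 3)]      # arbitrary connected design
Lam_u = lam_from_u(pairs, pi)
```

Both constructions agree, and the zero row sums (and hence singularity) hold for any design and any π, as claimed.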
That is, there is no information in the data regarding the absolute location of the γ's on the number line, and so linear combinations of them that are not contrasts are not estimable. However, if interest centers on factorial contrasts in the elements of γ, this singularity is not an issue. Consider, for example, t = 8 and k = 3 attributes of 2 levels each. Without regard to the factorial structure, we might define an experimental design region U containing C(8,2) = 28 vectors of length 8:

    u_{1,2} = (1/2) (−1, +1, 0, 0, 0, 0, 0, 0)′
    u_{1,3} = (1/2) (−1, 0, +1, 0, 0, 0, 0, 0)′
    u_{1,4} = (1/2) (−1, 0, 0, +1, 0, 0, 0, 0)′
    ...
    u_{7,8} = (1/2) (0, 0, 0, 0, 0, 0, −1, +1)′

If our interest centers on main effect contrasts, this suggests that an augmented design region X containing 28 vectors of length 3 can be generated as

    x_{i,j} = [ + + + + − − − −
                + + − − + + − −
                + − + − + − + − ] u_{i,j},    i < j; i, j = 1, ..., 8

These vectors x_ij are, explicitly:

    comparison   x′             comparison   x′
    (1, 2)       (0, 0, −1)     (3, 5)       (−1, 1, 0)
    (1, 3)       (0, −1, 0)     (3, 6)       (−1, 1, −1)
    (1, 4)       (0, −1, −1)    (3, 7)       (−1, 0, 0)
    (1, 5)       (−1, 0, 0)     (3, 8)       (−1, 0, −1)
    (1, 6)       (−1, 0, −1)    (4, 5)       (−1, 1, 1)
    (1, 7)       (−1, −1, 0)    (4, 6)       (−1, 1, 0)
    (1, 8)       (−1, −1, −1)   (4, 7)       (−1, 0, 1)
    (2, 3)       (0, −1, 1)     (4, 8)       (−1, 0, 0)
    (2, 4)       (0, −1, 0)     (5, 6)       (0, 0, −1)
    (2, 5)       (−1, 0, 1)     (5, 7)       (0, −1, 0)
    (2, 6)       (−1, 0, 0)     (5, 8)       (0, −1, −1)
    (2, 7)       (−1, −1, 1)    (6, 7)       (0, −1, 1)
    (2, 8)       (−1, −1, 0)    (6, 8)       (0, −1, 0)
    (3, 4)       (0, 0, −1)     (7, 8)       (0, 0, −1)

The per-observation information matrix for the main-effect part of γ's factorial representation can now be written as

    C = (1/N) Σ_{(i,j)} x_ij x_ij′

where the sum is over the size-two treatment pairings in the included choice sets.
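The x-vectors above, and the information matrix of any selection of pairs, can be reproduced as follows (a sketch; names mine, 0-based indices):

```python
import numpy as np
from itertools import combinations

# Main-effect contrast matrix for t = 8 (columns in lexicographic order
# 000, 001, ..., 111), with + for level 0 as in the text.
M = np.array([[+1, +1, +1, +1, -1, -1, -1, -1],
              [+1, +1, -1, -1, +1, +1, -1, -1],
              [+1, -1, +1, -1, +1, -1, +1, -1]])

def u_vec(i, j, t=8):
    """u_ij with -1/2 in position i and +1/2 in position j (equal pi's)."""
    u = np.zeros(t)
    u[i], u[j] = -0.5, 0.5
    return u

x = {(i, j): M @ u_vec(i, j) for i, j in combinations(range(8), 2)}

# Information matrix of the 4 foldover pairs (1,8), (2,7), (3,6), (4,5)
# in the text's 1-based labels, i.e. (0,7), (1,6), (2,5), (3,4) here.
fold = [(0, 7), (1, 6), (2, 5), (3, 4)]
C = sum(np.outer(x[p], x[p]) for p in fold) / len(fold)
```

The foldover design gives C = I_3, anticipating the claim made next about these four pairs.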
Note that the vectors x for choice comparisons (1, 8), (2, 7), (3, 6), and (4, 5) each contain only non-zero elements, and that the design including just these 4 choice sets has information matrix

    (1/4) (x_{1,8} x_{1,8}′ + x_{2,7} x_{2,7}′ + x_{3,6} x_{3,6}′ + x_{4,5} x_{4,5}′) = I_3

These 4 pairs of treatments are, in factorial notation, the foldover pairs identified as the D-optimal design by Street and Burgess. However, their development is entirely algebraic and requires constraining the class of designs under consideration. Representing the problem this way would allow use of the optimal design tools we've discussed previously, including continuous design theory, Frechet derivatives, and related construction algorithms. For problems in which the a priori π values are not all equal, algebraic arguments are made more complicated, but approaches based on the above representation can be easily adapted by simply changing the non-zero values of u_ij.

Now consider the same problem, but change the specification to require that the second order model be used. This does not change the form of Λ(π), because this matrix does not depend on the form of the factorial model. (Hence the u_ij are as before.) However, the matrix B_1 must be altered to reflect this change, and the elements of X are now defined as

    x_{i,j} = [ + + + + − − − −
                + + − − + + − −
                + − + − + − + −
                + + − − − − + +
                + − + − − + − +
                + − − + + − − + ] u_{i,j},    i < j; i, j = 1, ..., 8

Of the 28 x-vectors generated, 16 contain three 0 elements and three ±1 elements; the remaining 12 contain two 0 elements and four ±1 elements.
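A sketch verifying the zero/non-zero counts, and the claim (made next) that the 12 pairs differing in v = 2 attributes yield a diagonal information matrix (names mine; the 2/3 diagonal value is a consequence of the ±1 scaling used here, not a quantity quoted by Street and Burgess):

```python
import numpy as np
from itertools import combinations

# Contrast rows for the second-order model: main effects A, B, C and
# two-factor interactions AB, AC, BC (columns ordered 000, ..., 111).
M2 = np.array([[+1, +1, +1, +1, -1, -1, -1, -1],
               [+1, +1, -1, -1, +1, +1, -1, -1],
               [+1, -1, +1, -1, +1, -1, +1, -1],
               [+1, +1, -1, -1, -1, -1, +1, +1],
               [+1, -1, +1, -1, -1, +1, -1, +1],
               [+1, -1, -1, +1, +1, -1, -1, +1]])

def u_vec(i, j, t=8):
    u = np.zeros(t)
    u[i], u[j] = -0.5, 0.5
    return u

xs = {(i, j): M2 @ u_vec(i, j) for i, j in combinations(range(8), 2)}
n_nonzero = [np.count_nonzero(v) for v in xs.values()]

# The 12 pairs differing in v = 2 attributes are exactly those whose
# x-vector has four non-zero entries.
v2 = [p for p, v in xs.items() if np.count_nonzero(v) == 4]
C2 = sum(np.outer(xs[p], xs[p]) for p in v2) / len(v2)
```

As claimed, 16 of the x-vectors have three non-zero entries and 12 have four, and the v = 2 design's information matrix is diagonal (a multiple of I_6).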
These 12 vectors, and the treatment differences they represent, are:

    comparison      x′
    (000), (011)    (0, −1, −1, −1, −1, 0)
    (000), (101)    (−1, 0, −1, −1, 0, −1)
    (000), (110)    (−1, −1, 0, 0, −1, −1)
    (001), (010)    (0, −1, 1, −1, 1, 0)
    (001), (100)    (−1, 0, 1, −1, 0, 1)
    (001), (111)    (−1, −1, 0, 0, 1, 1)
    (010), (100)    (−1, 1, 0, 0, −1, 1)
    (010), (111)    (−1, 0, −1, 1, 0, 1)
    (011), (101)    (−1, 1, 0, 0, 1, −1)
    (011), (110)    (−1, 0, 1, 1, 0, −1)
    (100), (111)    (0, −1, −1, 1, 1, 0)
    (101), (110)    (0, −1, 1, 1, −1, 0)

The treatment pairs here are those that differ in the levels of v = 2 attributes. We can observe directly that this is the optimal design, because:

• These x vectors have the largest number of non-zero elements, leading to the largest diagonal elements of Σ x x′, and

• If this set of x vectors is used as the rows of an X-matrix, every pair of columns in that matrix is orthogonal, i.e. the off-diagonal elements of Σ x x′ are all zero.

Result 2 of Street and Burgess indicates that for k = 3, all choice pairs should differ in v = (k + 1)/2 = 2 attributes, so their optimal design coincides with this one. Unfortunately, generalization in this way is not so straightforward for choice sets of size m > 2. Recall that in this case,

    Λ(π)_{i1,i1} = ((m−1)/m^2) Σ_{i2<i3<...<im} λ_{i1,i2,...,im},    Λ(π)_{i1,i2} = −(1/m^2) Σ_{i3<i4<...<im} λ_{i1,i2,...,im}

Again, the information matrix for γ can be written in linear form, but not with a single "u-vector" representing each choice set. For choice sets of size m, m − 1 vectors are required to express the information associated with one choice set. For example, for m = 3, the general form would be

    Λ(π) = (1/N) Σ_{(i,j,k)} ( u1_ijk u1_ijk′ + u2_ijk u2_ijk′ )

for a design comprised of choice sets (i, j, k). Still, a linear transformation to the vectors we've called x is possible; it would be interesting to think about how general design theory might be used to construct optimal designs under fewer constraints than those used by Street and Burgess.

References

Street, D.
J. and L. Burgess (2007), The Construction of Optimal Stated Choice Experiments: Theory and Methods, John Wiley & Sons, New York.