Data Analysis for Bioinformatics:


Fractional factorial designs

Material in this lecture is based on Chapter 8 of the textbook "Design and Analysis of Experiments" by Douglas Montgomery.

The 2^k factorial design is efficient, but as the number of factors, k, increases, the total number of observations for a complete replicate of all possible 2^k combinations will grow beyond the available resources. In these cases, rather than running all 2^k combinations, we will run a carefully selected fraction of the combinations. Typically, the fraction is 1/2, 1/4, or 1/8 of all the possible combinations.
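To make the growth concrete, here is a quick sketch (Python, used here purely for illustration) of the run counts for a full 2^k design and for its 1/2, 1/4, and 1/8 fractions:

```python
# Runs required for a full 2^k factorial and for common fractions of it.
for k in range(3, 9):
    full = 2 ** k
    print(f"k={k}: full={full}, half={full // 2}, quarter={full // 4}, eighth={full // 8}")
```

Even a 1/8 fraction of a 2^7 design (16 runs) can still be informative for screening, which is the motivation for the designs below.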

The one-half fraction of the 2^k design

Consider a situation in which three factors, each at two levels, are of interest, but the researchers cannot afford to run all 2^3 = 8 treatment combinations. They can, however, afford 4 runs. This suggests a one-half fraction of a 2^3 design. Because the design contains 2^(3-1) = 4 treatment combinations, a one-half fraction of the 2^3 design is often called a 2^(3-1) design.

Here is the design matrix for all combinations of a 2^3 design.

Treatment combination   I   A   B   C   AB  AC  BC  ABC
a                       +   +   -   -   -   -   +   +
b                       +   -   +   -   -   +   -   +
c                       +   -   -   +   +   -   -   +
abc                     +   +   +   +   +   +   +   +
(1)                     +   -   -   -   +   +   +   -
ab                      +   +   +   -   +   -   -   -
ac                      +   +   -   +   -   +   -   -
bc                      +   -   +   +   -   -   +   -

Suppose we select the four treatment combinations a, b, c, abc as our 2^(3-1) one-half fraction. These are shown in the top half of the table, where ABC = +.

Notice that, in the first 4 rows of the table,

 the + and – signs are the same for column A and column BC.

 the + and – signs are the same for column B and column AC.

 the + and – signs are the same for column C and column AB.

We say

 A is aliased with BC

 B is aliased with AC

 C is aliased with AB

Terminology (useful but not required to know): Notice that the 2^(3-1) design is formed by selecting only those treatment combinations that have a + sign in the ABC column. Thus, ABC is called the generator of this design. Sometimes we will refer to a generator such as ABC as a word. Furthermore, the identity column I is also always +, so we call I = ABC the defining relation.
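The generator idea is easy to express in code. This sketch (Python, for illustration; the ±1 coding replaces the +/- signs of the table above) enumerates all 2^3 sign combinations and keeps the runs where the product A*B*C is +1:

```python
from itertools import product

# All 8 runs of the 2^3 design, coded as -1/+1 for factors A, B, C.
full_design = list(product([-1, 1], repeat=3))

# The one-half fraction generated by ABC: keep runs where A*B*C = +1.
half_fraction = [(a, b, c) for a, b, c in full_design if a * b * c == 1]

print(half_fraction)
# The four surviving runs correspond to treatment combinations a, b, c, abc.
```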

In the 2^(3-1) design, we estimate the main effects of A, B, and C using the + and - signs in the table:

[A] = ½ * {a - b - c + abc}

[B] = ½ * {-a + b - c + abc}

[C] = ½ * {-a - b + c + abc}

We estimate the two-factor interactions AB, AC, and BC using the + and - signs in the table:

[BC] = ½ * {a - b - c + abc}

[AC] = ½ * {-a + b - c + abc}

[AB] = ½ * {-a - b + c + abc}
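We can check numerically that the A and BC contrasts are identical in this fraction. This sketch (Python, for illustration) builds the effect estimate from the sign columns; the four response values are made-up numbers chosen only to demonstrate the identity:

```python
# The four runs of the 2^(3-1) fraction: treatment label -> signs of (A, B, C).
runs = {"a": (1, -1, -1), "b": (-1, 1, -1), "c": (-1, -1, 1), "abc": (1, 1, 1)}

# Hypothetical responses for each run (made-up numbers, illustration only).
y = {"a": 10.0, "b": 12.0, "c": 14.0, "abc": 20.0}

def effect(column):
    """Estimate an effect as 1/2 * sum(sign * response) over the 4 runs."""
    return 0.5 * sum(column(*runs[t]) * y[t] for t in runs)

A  = effect(lambda a, b, c: a)       # signs of column A
BC = effect(lambda a, b, c: b * c)   # signs of column BC
print(A, BC)  # identical, because A and BC have the same signs in every run
```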

We saw that, in the first 4 rows of the table, the + and – signs are the same for A and BC.

So

[A] = [BC]

[B] = [AC]

[C] = [AB]

Consequently,

The apparent effect of A is actually the effect of A plus the interaction BC.

The apparent effect of B is actually the effect of B plus the interaction AC.

The apparent effect of C is actually the effect of C plus the interaction AB.

This is why we say

 A is aliased with BC

 B is aliased with AC

 C is aliased with AB

Aliasing is the price we pay when we use a fractional factorial design. We use fewer total runs, but some effects will be aliased. Aliasing is also called confounding.

Example fractional factorial design for 3 factors in 4 runs

If we want to look at all the possible interactions between k factors, then we need to run 2^k experiments. However, we may be unable to afford 2^k experiments, and may be willing to give up the ability to test for all those interactions.

Suppose we are baking cookies, and we are interested in the factors that may affect the yield of good cookies. The 3 factors we examine are flour, oven, and sifting. We believe that the interactions among these factors' effects on yield will be small compared to the main effects. We can examine this assumption later.

Usually, we will use a pre-defined design matrix to look at 3 factors in 4 runs, which we can get at a website that has tables of fractional factorial designs: http://www.itl.nist.gov/div898/handbook/pri/section3/pri3347.htm

Just to see how it is done, we'll generate our own design matrix. Here is the design matrix for 3 factors in 8 runs, with all interactions. To get a one-half fraction, we'll select the rows where ABC values are 1.

 A   B   C  AB  AC  BC  ABC
-1  -1  -1   1   1   1   -1
 1  -1  -1  -1  -1   1    1
-1   1  -1  -1   1  -1    1
 1   1  -1   1  -1  -1   -1
-1  -1   1   1  -1  -1    1
 1  -1   1  -1   1  -1   -1
-1   1   1  -1  -1   1   -1
 1   1   1   1   1   1    1

Here is R code to create the design matrix we can use to examine the effects of 3 factors in 4 runs.

mystring.levels="run, oven, flour, sift, yield
1,-1,-1,1,50
2,-1,1,-1,50
3,1,-1,-1,60
4,1,1,1,60"
OFS.data=read.table(textConnection(mystring.levels), header=TRUE, sep=",", row.names="run")
OFS.data


> OFS.data

oven flour sift yield

1 -1 -1 1 50

2 -1 1 -1 50

3 1 -1 -1 60

4 1 1 1 60

> lm.3factor = lm(yield ~ oven + flour + sift, data=OFS.data)

> summary(lm.3factor)

Call: lm(formula = yield ~ oven + flour + sift, data = OFS.data)

Residuals:

ALL 4 residuals are 0: no residual degrees of freedom!

Coefficients:

              Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.500e+01         NA      NA       NA
oven         5.000e+00         NA      NA       NA
flour       -8.882e-16         NA      NA       NA
sift         8.882e-16         NA      NA       NA

Residual standard error: NaN on 0 degrees of freedom

Multiple R-squared: 1, Adjusted R-squared: NaN

F-statistic: NaN on 3 and 0 DF, p-value: NA

This experiment indicates that oven type has a big effect (Estimate = 5), but neither sifting nor flour type has an effect (Estimates on the order of 1e-15, i.e., zero to machine precision). We don't know if these are significant; at this point we just know which factor appears to be the most important.

But this design matrix looks exactly like the design matrix we used previously to look at 2 factors with interaction:

mystring.levels="run, oven, flour, oven.flour, yield
1,-1,-1,1,50
2,-1,1,-1,50
3,1,-1,-1,60
4,1,1,1,60"
OF.interaction.data=read.table(textConnection(mystring.levels), header=TRUE, sep=",", row.names="run")
OF.interaction.data

Design matrix with the interaction term (oven*flour) in the 3rd column:

> OF.interaction.data

oven flour oven.flour yield

1 -1 -1 1 50

2 -1 1 -1 50

3 1 -1 -1 60

4 1 1 1 60

>

Design matrix with an additional factor (sift) in the 3rd column:

> OFS.data

oven flour sift yield

1 -1 -1 1 50

2 -1 1 -1 50

3 1 -1 -1 60

4 1 1 1 60

>

If we wanted to test for an interaction between "Flour type" and "Oven", we would use Column 3, labeled "oven.flour".

When we use Column 3 to look at another factor (sift), the calculated effect of that factor is actually the sum of the effect of that factor plus the effect of any interaction.

We say that the factor "sift" is aliased with the oven*flour interaction.

The apparent effect of "sift" is the combined effect of "sift" and any oven*flour interaction.

 apparent sift effect = sift + flour*oven interaction

 apparent oven effect = oven + flour*sift interaction

 apparent flour effect = flour + oven*sift interaction
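The column identity behind this aliasing is easy to check numerically. This sketch (Python, for illustration) takes the four runs from the design matrix above and confirms that the sift column is elementwise equal to oven*flour:

```python
# The four runs from the design matrix above (columns oven, flour, sift).
oven  = [-1, -1, 1, 1]
flour = [-1, 1, -1, 1]
sift  = [1, -1, -1, 1]

# In every run, the sift setting equals the product oven*flour, so the two
# columns are indistinguishable: sift is aliased with the oven*flour interaction.
for o, f, s in zip(oven, flour, sift):
    assert s == o * f
print("sift column == oven*flour column in all 4 runs")
```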

Fractional factorial design for 7 factors in 8 runs

Here is the design matrix for a full factorial for 3 factors in 8 runs including the interaction terms.

A (Length)  B (Width)  C (Paper type)  D (L*W)  E (L*P)  F (W*P)  G (L*W*P)
 1          -1         -1              -1       -1        1        1
 1          -1          1              -1        1       -1       -1
 1           1         -1               1       -1       -1       -1
 1           1          1               1        1        1        1
-1          -1         -1               1        1        1       -1
-1          -1          1               1       -1       -1        1
-1           1         -1              -1        1       -1        1
-1           1          1              -1       -1        1       -1

Column D is the interaction term for Length*Width: D = A*B.

E = A*C

F = B*C

G = A*B*C

We could use all 2^3 = 8 runs to test for the main effects and interactions of the 3 factors, as we did in the full factorial experiment.

If we are willing to assume that 2-way and 3-way interactions are small compared to the main effects, we can test 7 factors in 8 runs, as in the cookie baking example.

The design matrix is identical to the design matrix for looking at 3 factors with all possible interactions.
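The reuse of interaction columns as factor columns can be sketched as follows (Python, for illustration): build the full 2^3 design in A, B, C and derive columns D through G as products, giving 7 columns in 8 runs.

```python
from itertools import product

# Full 2^3 design in factors A, B, C (8 runs).
rows = []
for a, b, c in product([-1, 1], repeat=3):
    # Interaction columns double as extra factors in the 2^(7-4) design:
    # D = A*B, E = A*C, F = B*C, G = A*B*C.
    rows.append((a, b, c, a * b, a * c, b * c, a * b * c))

for r in rows:
    print(r)
# 8 runs, 7 columns: enough to screen 7 factors if interactions are assumed small.
```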

Batch   A      B      C     D     E       F     G
        Temp   Water  Sift  Time  Butter  Soda  Sugar
1       -1     -1     -1     1     1       1    -1
2        1     -1     -1    -1    -1       1     1
3       -1      1     -1    -1     1      -1     1
4        1      1     -1     1    -1      -1    -1
5       -1     -1      1     1    -1      -1     1
6        1     -1      1    -1     1      -1    -1
7       -1      1      1    -1    -1       1    -1
8        1      1      1     1     1       1     1

In this design matrix, we have the same aliasing between columns:

D = A*B. So column D, Time, is aliased with the interaction term for Temp*Water (= A*B).

There are other aliases.


In this design matrix, notice that the Temp column is the same as the Soda*Sugar interaction: A = F*G. So Temp is aliased with the Soda*Sugar interaction.

The Temp column is also the same as the Water*Time and Sift*Butter columns.

A=B*D

A=C*E

So Temp is aliased with 3 two-way interactions: Soda*Sugar, Water*Time, and Sift*Butter.

A = F*G

A = B*D

A = C*E

The apparent effect of Temp is aliased with these three interactions, so the apparent effect of Temp is actually the sum of the main effect of Temp plus these 2-way interactions (plus higher-order interactions):

apparent Temp effect = Temp + Soda*Sugar + Water*Time + Sift*Butter + higher-order interactions

The other columns have similar aliasing patterns. When we choose this design, we can examine which interactions are aliased, and decide if we think the interactions may be real and large. If so, we can consider them in further experiments.
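The full alias pattern can be derived mechanically from the design's generator words. Multiplying effects amounts to cancelling repeated letters (a symmetric difference of letter sets), and the defining relation is the group generated by the words ABD, ACE, BCF, and ABCG for this design. A Python sketch, for illustration:

```python
# Generator words for the 2^(7-4) design: D=AB, E=AC, F=BC, G=ABC,
# written as the words ABD, ACE, BCF, ABCG.
generators = [frozenset("ABD"), frozenset("ACE"), frozenset("BCF"), frozenset("ABCG")]

# The defining relation contains every product of the generators
# (multiplication = symmetric difference, since a letter squared cancels to I).
words = {frozenset()}
for g in generators:
    words |= {w ^ g for w in words}
words.discard(frozenset())  # drop the identity word I

# Aliases of main effect A: multiply A by each word in the defining relation.
aliases_of_A = {"".join(sorted(frozenset("A") ^ w)) for w in words}
two_factor = sorted(x for x in aliases_of_A if len(x) == 2)
print(two_factor)  # the 2-way interactions aliased with A: BD, CE, FG
```

With A=Temp, B=Water, C=Sift, D=Time, E=Butter, F=Soda, G=Sugar, this reproduces the three aliases listed above.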

Tables of fractional factorial designs are available at this website: http://www.itl.nist.gov/div898/handbook/pri/section3/pri3347.htm

Here is an example from the website, showing the design matrix that we used above for 7 factors in 8 runs. The resolution of a design tells us what level of interactions the design is able to resolve, for example, whether main effects can be separated from 2-way interactions.

=======

2**(7-4) FRACTIONAL FACTORIAL DESIGN

NUMBER OF LEVELS FOR EACH FACTOR = 2

NUMBER OF FACTORS = 7

NUMBER OF OBSERVATIONS = 8 (A SATURATED DESIGN)

RESOLUTION = 3 (CAUTION! MAIN EFFECTS ARE

CONFOUNDED WITH 2-TERM INTERACTIONS)

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

FACTOR DEFINITION CONFOUNDING STRUCTURE

1 1 1 + 24 + 35 + 67 + HIGHER-ORDER INTERACTIONS

2 2 2 + 14 + 36 + 57 + HIGHER-ORDER INTERACTIONS

3 3 3 + 15 + 26 + 47 + HIGHER-ORDER INTERACTIONS

4 12 4 + 12 + 37 + 56 + HIGHER-ORDER INTERACTIONS

5 13 5 + 13 + 27 + 46 + HIGHER-ORDER INTERACTIONS

6 23 6 + 17 + 23 + 45 + HIGHER-ORDER INTERACTIONS

7 123 7 + 16 + 25 + 34 + HIGHER-ORDER INTERACTIONS

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

12 12 + 4 + 37 + 56 + HIGHER-ORDER INTERACTIONS

13 13 + 5 + 27 + 46 + HIGHER-ORDER INTERACTIONS

14 14 + 2 + 36 + 57 + HIGHER-ORDER INTERACTIONS

15 15 + 3 + 26 + 47 + HIGHER-ORDER INTERACTIONS

16 16 + 7 + 25 + 34 + HIGHER-ORDER INTERACTIONS

17 17 + 6 + 23 + 45 + HIGHER-ORDER INTERACTIONS

23 23 + 6 + 17 + 45 + HIGHER-ORDER INTERACTIONS

24 24 + 1 + 35 + 67 + HIGHER-ORDER INTERACTIONS

25 25 + 7 + 16 + 34 + HIGHER-ORDER INTERACTIONS

26 26 + 3 + 15 + 47 + HIGHER-ORDER INTERACTIONS

27 27 + 5 + 13 + 46 + HIGHER-ORDER INTERACTIONS

34 34 + 7 + 16 + 25 + HIGHER-ORDER INTERACTIONS

35 35 + 1 + 24 + 67 + HIGHER-ORDER INTERACTIONS

36 36 + 2 + 14 + 57 + HIGHER-ORDER INTERACTIONS

37 37 + 4 + 12 + 56 + HIGHER-ORDER INTERACTIONS

45 45 + 6 + 17 + 23 + HIGHER-ORDER INTERACTIONS

46 46 + 5 + 13 + 27 + HIGHER-ORDER INTERACTIONS

47 47 + 3 + 15 + 26 + HIGHER-ORDER INTERACTIONS

56 56 + 4 + 12 + 37 + HIGHER-ORDER INTERACTIONS

57 57 + 2 + 14 + 36 + HIGHER-ORDER INTERACTIONS

67 67 + 1 + 24 + 35 + HIGHER-ORDER INTERACTIONS

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

DEFINING RELATION = I = 124 = 135 = 236 = 347 = 257 = 167 = 456

DEFINING RELATION (CONT.) = 1237 = 2345 = 1346 = 1256 = 1457

DEFINING RELATION (CONT.) = 2467 = 3567 = 1234567

REFERENCE--BOX, HUNTER & HUNTER, STAT. FOR EXP., PAGE 410, 391, 392.

NOTE--THIS DESIGN IS EQUIVALENT TO A TAGUCHI L8 DESIGN

REFERENCE--TAGUCHI, SYS. OF EXP. DES., VOL. 2.

NOTE--IF POSSIBLE, THIS (AS WITH ALL EXPERIMENT DESIGNS) SHOULD BE

RUN IN RANDOM ORDER (SEE DATAPLOT'S RANDOM PERMUTATION FILES).

NOTE--TO READ THIS FILE INTO DATAPLOT--

SKIP 75

READ 2TO7M4.DAT X1 TO X7

DATE--DECEMBER 1988

NOTE--IN THE DESIGN BELOW, "-1" REPRESENTS THE "LOW" SETTING OF A FACTOR

"+1" REPRESENTS THE "HIGH" SETTING OF A FACTOR

NOTE--ALL FACTOR EFFECT ESTIMATES WILL BE OF THE FORM

AVERAGE OF THE "HIGH" - AVERAGE OF THE "LOW"

X1 X2 X3 X4 X5 X6 X7

--------------------------

-1 -1 -1 +1 +1 +1 -1

+1 -1 -1 -1 -1 +1 +1

-1 +1 -1 -1 +1 -1 +1

+1 +1 -1 +1 -1 -1 -1

-1 -1 +1 +1 -1 -1 +1

+1 -1 +1 -1 +1 -1 -1

-1 +1 +1 -1 -1 +1 -1

+1 +1 +1 +1 +1 +1 +1

==========

Let's use this design to analyze a cookie baking experiment with 7 factors in 8 runs.

mystring.levels="batch, flour, temp, soda, time, water, butter, sift, yield
1,brown,low,low,high,high,sift,low,47
2,white,low,low,low,low,sift,high,74
3,brown,high,low,low,high,no,high,84
4,white,high,low,high,low,no,low,62
5,brown,low,high,high,low,no,high,53
6,white,low,high,low,high,no,low,78
7,brown,high,high,low,low,sift,low,87
8,white,high,high,high,high,sift,high,60"
cookie.data=read.table(textConnection(mystring.levels), header=TRUE, sep=",", row.names="batch")
cookie.data

flour temp soda time water butter sift yield

1 brown low low high high sift low 47

2 white low low low low sift high 74

3 brown high low low high no high 84

4 white high low high low no low 62

5 brown low high high low no high 53

6 white low high low high no low 78

7 brown high high low low sift low 87

8 white high high high high sift high 60

>

plot.design(cookie.data)
lm.cookie = lm(yield ~ ., data=cookie.data)
summary(lm.cookie)

Call: lm(formula = yield ~ ., data = cookie.data)

Residuals:

ALL 8 residuals are 0: no residual degrees of freedom!

Coefficients:

             Estimate Std. Error t value Pr(>|t|)
(Intercept)     61.50         NA      NA       NA
flourwhite       0.75         NA      NA       NA
templow        -10.25         NA      NA       NA
sodalow         -2.75         NA      NA       NA
timelow         25.25         NA      NA       NA
waterlow         1.75         NA      NA       NA
buttersift      -2.25         NA      NA       NA
siftlow          0.75         NA      NA       NA

Residual standard error: NaN on 0 degrees of freedom

Multiple R-squared: 1, Adjusted R-squared: NaN

F-statistic: NaN on 7 and 0 DF, p-value: NA

We are estimating 8 parameters in the model (intercept + 7 factors) with only 8 batches. This fits the data perfectly, so R cannot calculate a standard error, and R-squared = 1. But we can look at the magnitudes of the coefficients (labeled "Estimate" in the summary), which are informative.

coef(lm.cookie)

(Intercept) flourwhite templow sodalow

61.50 0.75 -10.25 -2.75

timelow waterlow buttersift siftlow

25.25 1.75 -2.25 0.75

>

Time appears to have the biggest effect (25.25), followed by temperature (10.25), with lesser effects for the other factors. Sifting and type of flour have very little influence on yield.
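These coefficient magnitudes can be reproduced directly from run averages: with treatment contrasts, the timelow coefficient is the average yield at the low time setting minus the average at the high setting (the reference level). A Python sketch, for illustration, using the yields and time settings from the table above:

```python
from statistics import mean

# Yields from the 8 cookie batches above, and the time setting in each batch.
yields = [47, 74, 84, 62, 53, 78, 87, 60]
time   = ["high", "low", "low", "high", "high", "low", "low", "high"]

# Effect estimate: average yield at one level minus average at the other.
low_avg  = mean(y for y, t in zip(yields, time) if t == "low")
high_avg = mean(y for y, t in zip(yields, time) if t == "high")
print(low_avg - high_avg)  # matches the timelow coefficient, 25.25
```

This works because the design is balanced and orthogonal, so each factor's level averages are unaffected by the other factors.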

If "sifting" was a labor-intensive or expensive step, this experiment would be evidence that the step could be eliminated without affecting quality.

If the two types of "flour" had different cost, this experiment would be evidence that the less expensive type could be used without affecting quality.

We can't directly test the statistical significance of the coefficients, because we have estimated 8 coefficients with 8 runs. But there is another way to get an idea of what is significant.

If there were no significant factors, then their effect sizes should be randomly distributed around zero in a roughly normal distribution. We would expect them all to fall on the normal line in the qq plot. Two factors fall far from the line: time and temperature.

qqnorm(coef(lm.cookie)[2:8], pch=names(coef(lm.cookie)[2:8]))
qqline(coef(lm.cookie)[2:8])

This can be easier to see in a half-normal qq plot (halfnorm() comes from the faraway package):

library(faraway)
halfnorm(coef(lm.cookie)[2:8], labs=names(coef(lm.cookie)[2:8]))

In this screening example, we assume that interaction effect (such as an interaction of time and temperature) are small compared to the main effects of each factor. We can examine this further.

A post-hoc look at the interaction terms

In the screening study for the cookies, we looked at 7 variables in 8 batches. We assumed interaction effects (such as an interaction of time and temperature) were small compared to the main effects of each factor.

The two most important factors according to the screening study were time, and temp.

A good next step would be further experiments on these variables.

However, at some risk of data dredging, before doing that, we can re-analyze the cookie data and look for interactions.

lm.cookie.2vars = lm(yield ~ time * temp, data=cookie.data)
summary(lm.cookie.2vars)

Call: lm(formula = yield ~ time * temp, data = cookie.data)

Residuals:

1 2 3 4 5 6 7 8

-3.0 -2.0 -1.5 1.0 3.0 2.0 1.5 -1.0

Coefficients:

                Estimate Std. Error t value Pr(>|t|)
(Intercept)       61.000      2.016  30.264  7.1e-06 ***
timelow           24.500      2.850   8.595  0.00101 **
templow          -11.000      2.850  -3.859  0.01816 *
timelow:templow    1.500      4.031   0.372  0.72869

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.85 on 4 degrees of freedom

Multiple R-squared: 0.9786, Adjusted R-squared: 0.9626

F-statistic: 60.98 on 3 and 4 DF, p-value: 0.0008523

The interaction term timelow:templow gives a p-value of 0.73, indicating that there is not a significant interaction, at least at the times and temperatures we used for this experiment. We could analyze other pairs of factors in a similar way.

with(cookie.data, interaction.plot(temp, time, yield))

The lines on the time * temp interaction plot are about as parallel as they can get, again indicating no significant interaction.

A compromise

So far we've looked at two extreme cases:

1. Test for all possible interactions among k factors using 2^k runs

2. Assume no interactions, and study the factors using the minimum possible number of runs.

We can compromise between these two extremes.

 We can design experiments that let us test for, say, 2-way interactions, but assume that 3-way and higher interactions are small compared to the main effects and 2-way interactions. The price we will pay is doing more runs, though still many fewer than 2^k.

 We can design experiments that let us test for 2-way and 3-way interactions, but assume that 4-way and higher interactions are small compared to the main effects and the 2-way and 3-way interactions. Again, the price we will pay is doing more runs, though fewer than 2^k.

Here is another example design matrix. In this design matrix, we examine 5 factors in 16 runs. Having fewer factors and/or more runs allows us to resolve interactions. In this design, no main effects are confounded with any 2-factor interactions or 3-factor interactions. Main effects are confounded with 4-factor interactions.

Fractional factorial designs from the website: http://www.itl.nist.gov/div898/handbook/pri/section3/pri3347.htm

=====

THIS IS DATAPLOT DESIGN-OF-EXPERIMENT FILE 2TO5M.DAT

2**(5-1) FRACTIONAL FACTORIAL DESIGN

NUMBER OF LEVELS FOR EACH FACTOR = 2

NUMBER OF FACTORS = 5

NUMBER OF OBSERVATIONS = 16

RESOLUTION = 5 (THEREFORE NO MAIN EFFECTS ARE

CONFOUNDED WITH ANY 2-FACTOR INTERACTIONS

OR 3-FACTOR INTERACTIONS;

MAIN EFFECTS ARE CONFOUNDED WITH

4-FACTOR INTERACTIONS)

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

FACTOR DEFINITION CONFOUNDING STRUCTURE

1 1 1 + 2345

2 2 2 + 1345

3 3 3 + 1245

4 4 4 + 1235

5 1234 5 + 1234

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

12 12 + 345

13 13 + 245

14 14 + 235

15 15 + 234

23 23 + 145

24 24 + 135

25 25 + 134

34 34 + 125

35 35 + 124

45 45 + 123

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

DEFINING RELATION = I = 12345

REFERENCE--BOX, HUNTER & HUNTER, STAT. FOR EXP., PAGE 410,

NOTE--IF POSSIBLE, THIS (AS WITH ALL EXPERIMENT DESIGNS) SHOULD BE

RUN IN RANDOM ORDER (SEE DATAPLOT'S RANDOM PERMUTATION FILES).

NOTE--TO READ THIS FILE INTO DATAPLOT--

SKIP 50

READ 2TO5M1.DAT X1 TO X5

DATE--DECEMBER 1988

NOTE--IN THE DESIGN BELOW, "-1" REPRESENTS THE "LOW" SETTING OF A FACTOR

"+1" REPRESENTS THE "HIGH" SETTING OF A FACTOR

NOTE--ALL FACTOR EFFECT ESTIMATES WILL BE OF THE FORM

AVERAGE OF THE "HIGH" - AVERAGE OF THE "LOW"

X1 X2 X3 X4 X5

------------------

-1 -1 -1 -1 +1

+1 -1 -1 -1 -1

-1 +1 -1 -1 -1

+1 +1 -1 -1 +1

-1 -1 +1 -1 -1

+1 -1 +1 -1 +1

-1 +1 +1 -1 +1

+1 +1 +1 -1 -1

-1 -1 -1 +1 -1

+1 -1 -1 +1 +1

-1 +1 -1 +1 +1

+1 +1 -1 +1 -1

-1 -1 +1 +1 +1

+1 -1 +1 +1 -1

-1 +1 +1 +1 -1

+1 +1 +1 +1 +1

======

In general we want to choose designs that, at least, can distinguish (resolve) main effects from 2-way interactions. Usually, we'll also want at least 8, preferably 16 runs or more to give us power to detect modest effects.

Center points

When we run factorial designs with only two levels for a factor, we are not able to detect curvature in the response surface.

One way to help deal with this problem is to add center points to the design. If the original two levels for a factor are coded as {-1,1}, a center point would be put at 0, half way between the high and low settings.

For example, here is a center point we could use for the factor Temperature in cookie baking.

Actual value:   350°   400°   450°
Coding:          -1      0      1
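The coded values come from centering and scaling the actual settings: coded = (actual - center) / half-range. A quick sketch (Python, for illustration, using the temperatures above):

```python
def code(actual, low=350.0, high=450.0):
    """Map an actual factor setting onto the coded -1..+1 scale."""
    center = (low + high) / 2      # 400
    half_range = (high - low) / 2  # 50
    return (actual - center) / half_range

print(code(350), code(400), code(450))  # -1.0 0.0 1.0
```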

When we have several quantitative factors, we use a center point where each factor is set to a zero coded value. Here's an example of using center points for the cookie baking.

mystring.levels="Temp,Time,GoodCookies

-1,-1,50

-1,-1,51

-1,1,60

-1,1,61

1,-1,60

1,-1,61

1,1,70

1,1,71

0,0,61

0,0,62

0,0,61

0,0,59"
cookies.data=read.table(textConnection(mystring.levels), header=TRUE, sep=",")
cookies.data

> cookies.data

Temp Time GoodCookies

1 -1 -1 50

2 -1 -1 51

3 -1 1 60

4 -1 1 61

5 1 -1 60

6 1 -1 61

7 1 1 70

8 1 1 71

9 0 0 61

10 0 0 62

11 0 0 61

12 0 0 59

> lm.centerpoints = lm(GoodCookies ~ Temp*Time, data = cookies.data)

> summary(lm.centerpoints)

Call: lm(formula = GoodCookies ~ Temp * Time, data = cookies.data)

Residuals:

Min 1Q Median 3Q Max

-1.5833 -0.5833 0.4167 0.4167 1.4167

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 6.058e+01 2.684e-01 225.71 < 2e-16 ***

Temp 5.000e+00 3.287e-01 15.21 3.46e-07 ***

Time 5.000e+00 3.287e-01 15.21 3.46e-07 ***

Temp:Time 9.735e-15 3.287e-01 0.00 1

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9298 on 8 degrees of freedom

Multiple R-squared: 0.983, Adjusted R-squared: 0.9766

F-statistic: 154.2 on 3 and 8 DF, p-value: 2.040e-07

In this case, there does not appear to be a significant interaction. Using 4 center points has given us a better estimate of the error. Because we have replicates, we can use the center points to do a test for lack of fit. If we see evidence of curvature, we would use response surface methods.
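One common way to use the center points is a single-degree-of-freedom test for curvature: compare the average response at the factorial points with the average at the center points, using the standard curvature sum of squares SS = nF*nC*(ȳF - ȳC)^2 / (nF + nC), with pure error estimated from the center-point replicates. A Python sketch, for illustration, using the GoodCookies data above:

```python
from statistics import mean, variance

# Responses from the design above: 8 factorial runs and 4 center-point runs.
factorial = [50, 51, 60, 61, 60, 61, 70, 71]
center = [61, 62, 61, 59]

nF, nC = len(factorial), len(center)
yF, yC = mean(factorial), mean(center)

# Single-df curvature sum of squares; pure error from the center replicates.
ss_curvature = nF * nC * (yF - yC) ** 2 / (nF + nC)
ms_pure_error = variance(center)  # sample variance, nC - 1 df
f_stat = ss_curvature / ms_pure_error

print(round(ss_curvature, 3), round(f_stat, 3))
# Compare f_stat to an F(1, nC-1) reference; here it is far below 1,
# so there is no evidence of curvature in this response.
```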
