CHE 4372/5372 – Chapter 2: ANOVA and Regression
Joe Gauthier – Spring 2023
Last updated March 26, 2023
Learning goals:
• Understand the analysis of variance (ANOVA):
– why it is useful
– how it is performed
– how it is analyzed
– how to use JMP for ANOVA
• Understand regression models and how they connect to ANOVA
• Learn how to perform residual analysis
• Learn how to evaluate the assumptions of normal responses and constant variance
Before we start, a brief recap of terminology from Chapter 0:
• response: the dependent variable (i.e. what we measure or would like to predict)
• factor: an independent (explanatory) variable
• level or treatment: specific values of the factor at which experimental runs are performed
• replicates: identical runs for a given experiment
Recall also the types of variables:
• Continuous variables are always numeric
• Discrete (categorical) variables may be numeric or character; all character variables are discrete, but
not all discrete variables are character
Figure 1: Data types in JMP
The previous chapter involved analysis of data from a single factor, two–level experiment. For example,
with the lab rat weights, we had:
• response: weight of rats
• factor: group
• levels: 1, 2
• n=5, 8
Similarly, when analyzing the fuel economy of the mid–sized SUV, we had:
• response: miles per gallon
• factor: model of car
• levels: Ford Explorer, Chevy Blazer
• n =8, 9
In this chapter, we will focus on experiments studied at a levels with n replicates performed at each level.
1 Analysis of Variance (ANOVA)
Let’s start with an example. Suppose we want to study the effect of cotton content on the tensile strength
of a composite fiber containing cotton and synthetic materials (e.g. polyester). Here we have:
• response: tensile strength
• factor: weight % cotton in the fiber
• levels: 15, 20, 25, 30, 35 % (a = 5)
• replicates: n = 5
The experimental data has been entered into JMP and can be found on Blackboard: Lecture content >
Chapter 2 > polycotton_fiber_tensile_strength.JMP
Figure 2: Cotton tensile strength data as entered in JMP
Five measurements were made for each treatment using a completely randomized design. Note in particular that the factor is specified to be a numeric, nominal variable, and the response is a continuous variable.
Just as we did before with a two–level comparison, we can use the Fit Y by X platform in JMP, and see
the following:
Figure 3: Result of Fit Y by X analysis in JMP. Display options enabled: jittered points, box and whisker
plot.
Do you think the mean tensile strength depends on the weight percent of cotton? What about the
variance?
A reasonable question to ask is whether or not there is a significant difference between the mean tensile
strength at different levels. In other words, does the factor (wt% cotton) have an effect on the response?
Using what we learned in the previous chapter, we could look at each pair, and there are in fact 10 different
pairs to consider:
Figure 4: Needed pairwise t tests to rigorously determine the effect of cotton weight percent on the mean
strength.
If each pairwise test is performed at a significance level α, then the probability of correctly not rejecting
the null hypothesis when H0 is in fact true is (1 − α) for each test. If H0 is true for all m pairs, then the
probability of not committing any type I errors is (1 − α)^m – much lower!
• For α = 0.05 and m = 10, (1 − 0.05)^10 ≈ 0.60; the probability of committing at least one type I error is thus about 40%.
• For α = 0.01 and m = 10, the probability of committing at least one type I error is about 10%.
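To see these numbers concretely, here is a minimal Python sketch of the family-wise error calculation (plain arithmetic, nothing assumed beyond the formula above):

```python
# Probability of committing at least one type I error across m independent tests,
# each run at significance level alpha: 1 - (1 - alpha)^m
def familywise_error(alpha, m):
    return 1 - (1 - alpha) ** m

print(familywise_error(0.05, 10))  # ~0.40
print(familywise_error(0.01, 10))  # ~0.10
```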
So, we need an alternative approach. Let’s summarize the data reporting for a single–factor experiment
such as the cotton tensile strength example, illustrated in the table below.
Table 1: Summary of data reporting for a single factor experiment. The number of treatments or levels is
given as a, the number of replicate observations (for each treatment) is given as n, and N = an is the total
number of experiments/observations.
treatment (level) | observed responses   | totals | averages
1                 | y11, y12, ..., y1n   | y1•    | ȳ1•
2                 | y21, y22, ..., y2n   | y2•    | ȳ2•
⋮                 | ⋮                    | ⋮      | ⋮
a                 | ya1, ya2, ..., yan   | ya•    | ȳa•
                  |                      | y••    | ȳ••
We define a few new variables under the ‘totals’ and ‘averages’ column, which will simplify our notation
later:
yi• = Σ_{j=1}^{n} yij   (1)
y•• = Σ_{i=1}^{a} Σ_{j=1}^{n} yij   (2)
ȳi• = yi• / n   (3)
ȳ•• = y•• / N   (4)
This might seem confusing at first. To understand the notation, ȳi• is just the mean of all observations of
treatment i. In the case of the cotton tensile strength, this might refer to the mean tensile strength of all
observations at 30 wt%. ȳ•• is referred to as the ‘grand’ mean, which is just the average of all observations.
In the case of the cotton tensile strength, this would just be the average of all 25 observations.
Returning to our question: does the factor affect the response? If the answer is no, then all the y values
are random samples from the same population distribution, regardless of the factor level. The variability
between samples at the same factor level (i.e. within treatments) will then be the same as the variability
between samples at different factor levels (i.e. between treatments). Again returning to our example, this
would suggest that the mean and variance of the tensile strength are the same at every cotton weight
percentage. However, if the factor does affect the response, then the variation between treatments
should be greater than the variation within treatments. Strategy: we can propose a mathematical model
to describe how the response depends on the factor, and then perform statistical inference tests on the
parameters of this model.
1.1 Empirical models
In this class, we will only consider problems in which the response variable is continuous. The factors (i.e.
independent variables) may be discrete or continuous, leading to,
• discrete–factor effects models
• continuous–factor effects models
• mixed–factor effects models
The key point to take away is that hypotheses will now be made about model parameters, not parameters of
a statistical distribution. In particular, we will focus on building empirical models that propose an additive
(but not necessarily linear) relationship between the response and factor(s).
response = intercept + effect1 + effect2 + ... + error .   (5)
Each additive term in this model equation between the intercept and error terms is an effect. Each effect term
is composed of one or more parameters. The error term is assumed to be random and normally distributed
(as predicted by the Central Limit Theorem). Here’s an example of a continuous–factor effects model similar
to others we’ll see again soon:
y = β0 + β1 T + β2 P + β3 T P + β4 T² + ε .   (6)
Here, β1 T is the main/linear effect of factor T , β2 P is the main/linear effect of factor P , β3 T P is the cross
(or interaction) effect of factors T and P , and β4 T 2 is the quadratic effect of T . Note that the factors’
variable types determine the possible types of effects and also the number of model parameters.
Discrete–factor effects models
Consider a mathematical model for describing the main effect of a single discrete factor:
yij = µ + τi + εij ,   i = 1, 2, ..., a ;  j = 1, 2, ..., n   (7)
Here µ is the intercept of the model. ȳ•• should serve as a good estimator for this parameter, as we’ll see
later. τi are parameters that describe the main effect of the discrete factor. Remember that what we’re
trying to test here is whether the effect between treatments is the same as the effect within treatments. In
other words, we want to test E(yij ) = µ + τi = µi for i = 1, 2, ..., a. To probe this, we can use the following
hypotheses:
H0 : µ1 = µ2 = ... = µa   (8)
H1 : µi ≠ µj for at least one pair i, j   (9)
Each of these means refers to the mean effect of treatment i. In the context of our cotton tensile strength
example, µ1 might refer to the mean tensile strength of 15 wt%, µ2 might refer to the mean tensile strength
of 20 wt%, and so on. Typically we like to think of the mean µ as an ‘overall’ mean so that,
(1/a) Σ_{i=1}^{a} µi = µ .   (10)
Recalling that each µi = µ + τi , we can plug this in to find,
µ = (1/a) Σ_{i=1}^{a} (µ + τi) = (1/a) ( aµ + Σ_{i=1}^{a} τi ) = µ + (1/a) Σ_{i=1}^{a} τi   (11)
In other words, this assumption implies a constraint on the effects τi ,
Σ_{i=1}^{a} τi = 0 .   (12)
The interpretation of the above analysis is that our model essentially is just the ‘grand’ mean ȳ•• plus an
effect τi . That is, the effect estimates τi are a ‘deviation’ from the overall mean. Another consequence of
the above constraint is that we can rewrite the hypotheses to be tested by just subtracting µ,
H0 : τ1 = τ2 = ... = τa = 0   (13)
H1 : τi ≠ 0 for at least one treatment   (14)
Note that the model has a independent parameters: the overall mean, plus the a − 1 independent τi values.
The model used to test the null hypothesis therefore has a − 1 degrees of freedom.
If you’ve been paying close attention, this might seem familiar to you. We already know how to use e.g.
least squares regression to fit a line to data; i.e. to fit a linear model that describes the relationship between
a continuous dependent variable and a continuous independent variable, illustrated below in Figure 5.
Figure 5: Fitting a continuous model to discrete data. It’s just a line!
We will soon see that the discrete–factor and continuous–factor models are indeed closely related.
Let’s now return to our example of cotton tensile strength. Recall that our model is,
yij = µ + τi + εij ,   i = 1, 2, ..., a ;  j = 1, 2, ..., n   (15)
with i referring to the factor level (in this case, the weight percent of cotton in the fiber) and j referring to
the experiment at the particular level (in this case, experiments 1–5 at each level). Our first goal is to use
the experimental data to obtain estimates for the parameters of our discrete factor effects model. This can
be easily done:
µ̂ = ȳ•• = 15.04   (16)
τ̂1 = ȳ1• − ȳ•• = 9.8 − 15.04 = −5.24 for 15% cotton   (17)
τ̂2 = ȳ2• − ȳ•• = 15.4 − 15.04 = 0.36 for 20% cotton   (18)
etc.   (19)
Here, the ‘hat’ means that this quantity is an estimate for the parameter. In other words µ̂ is an estimate for
the model parameter µ, and τ̂i is an estimate for τi . We therefore have our prediction equation, ŷi = µ̂ + τ̂i ,


ŷ = 15.04 + { −5.24 (wt% = 15), 0.36 (wt% = 20), 2.56 (wt% = 25), 6.56 (wt% = 30), −4.24 (wt% = 35) }   (20)
But, we’re not done yet! We now have estimates for the model parameters, but we still need to determine
whether any of them are significantly different than zero.
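As a quick sanity check, the estimates above can be reproduced with a few lines of Python. The five treatment means below are back-calculated from the grand mean and effect estimates quoted above (the raw 25 observations live in the JMP table), so treat this as an illustrative sketch rather than a re-analysis of the data:

```python
import numpy as np

# Treatment means implied by the estimates above (wt% cotton -> mean strength)
treatment_means = {15: 9.8, 20: 15.4, 25: 17.6, 30: 21.6, 35: 10.8}

# Balanced design (n = 5 at every level), so the grand mean is the mean of means
mu_hat = np.mean(list(treatment_means.values()))              # ~15.04
tau_hat = {lvl: m - mu_hat for lvl, m in treatment_means.items()}

print(mu_hat)    # ~15.04
print(tau_hat)   # {15: -5.24, 20: 0.36, 25: 2.56, 30: 6.56, 35: -4.24}
```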
1.2 One–way ANOVA
Let’s again recall our model:
yij = µ + τi + εij ,   i = 1, 2, ..., a ;  j = 1, 2, ..., n   (21)
According to this effects model, there are two possible sources of variability, i.e. two possible reasons why
every experiment doesn’t give exactly the same response y:
1. variability due to the treatment τi
2. random error εij
The analysis of variance method provides a way to partition the total variability between these two
sources.
Warning: there is a good deal of math ahead, and it’s important that you understand how this method is
derived. I strongly encourage you to take the time to work these steps out yourself and ensure you understand
each step.
The sample variance for all N observations is calculated following the standard sample variance expression, but substituting our known value ȳ•• ,
S² = (1/(N − 1)) Σ_{i=1}^{a} Σ_{j=1}^{n} (yij − ȳ••)² = SST / (N − 1) .   (22)
The analysis of variance method focuses its attention on the numerator, i.e. the double sum, SST , referred
to as the total corrected sum of squares. Our goal here is to determine how much of this variability is due
to random errors, and how much is due to the effect of the factor on the response. We can decompose this
total into two separate terms:
SST = Σ_{i=1}^{a} Σ_{j=1}^{n} (yij − ȳ••)² = n Σ_{i=1}^{a} (ȳi• − ȳ••)² + Σ_{i=1}^{a} Σ_{j=1}^{n} (yij − ȳi•)²   (23)
    = SSTreatments + SSE ,   (24)
where SSTreatments is the ‘treatments sum of squares’ and SSE is the ‘error sum of squares’. The above
equation is the fundamental ANOVA identity. It illustrates precisely how we are able to decouple the total
variability into two pieces: the first estimates the variability due to the treatment (i.e. it measures the
variability between the treatment averages and the overall average), and the second estimates the variability within
treatments (i.e. the error of individual measurements compared to the corresponding treatment average). From the
polycotton fiber tensile strength, we can calculate these quantities,
SST = 636.96   (25)
SSTreatments = 475.76   (26)
SSE = 161.20   (27)
This is all well and good, but how do we use these quantities? We need to identify a test statistic (and
reference distribution) that will allow us to test our null hypothesis of τi = 0 for all i.
Recall that we assumed the response is normally distributed, i.e. yij ∼ N(µ + τi , σ²). In other words,
(yij − µ − τi) / σ ∼ N(0, 1) .   (28)
Now, if H0 is true, then all τi = 0 and we can cast S² as a χ²_N statistic,
(1/σ²) Σ_{i=1}^{a} Σ_{j=1}^{n} (yij − µ)² ∼ χ²_N .   (29)
Additionally, if H0 is true, we also expect that each treatment mean should be a good estimate of the
population mean (since there is no variability due to the treatment), and so,
SSE / σ² = (1/σ²) Σ_{i=1}^{a} Σ_{j=1}^{n} (yij − ȳi•)² ∼ χ²_{N−a} .   (30)
We can also invoke the CLT to assert that the treatment mean at any level i should be distributed as,
ȳi• ∼ N(µ + τi , σ²/n) ,   (31)
in other words,
(ȳi• − µ − τi) / √(σ²/n) ∼ N(0, 1) .   (32)
Again, if H0 is true, then all τi = 0 and so,
(n/σ²) Σ_{i=1}^{a} (ȳi• − µ)² ∼ χ²_a .   (33)
In this case, the grand mean provides a good estimate of the population mean, and so,
SSTreatments / σ² = (n/σ²) Σ_{i=1}^{a} (ȳi• − ȳ••)² ∼ χ²_{a−1} .   (34)
We also know that the ratio of these two χ2 variables follows an F distribution with (a − 1) numerator d.o.f.
and a(n − 1) = (N − a) denominator d.o.f.,
[ χ²_{a−1} / (a − 1) ] / [ χ²_{N−a} / (N − a) ] ∼ F_{a−1, N−a} .   (35)
Substituting SSTreatments and SSE ,
[ (SSTreatments/σ²) / (a − 1) ] / [ (SSE/σ²) / (N − a) ] = [ SSTreatments / (a − 1) ] / [ SSE / (N − a) ] ∼ F_{a−1, N−a} .   (36)
We now define two new quantities,
MSTreatments = SSTreatments / (a − 1) ,   (37)
MSE = SSE / (N − a) .   (38)
Here, MSE is a pooled estimate of the common variance σ² within each of the a treatments. If the treatment
means are all equal, then MSTreatments also provides an estimate of σ². If the treatment means are not all
equal, then the value of MSTreatments will be greater than σ². From our polycotton example,
MSTreatments = SSTreatments / (a − 1) = 475.76 / (5 − 1) = 118.94   (39)
MSE = SSE / (N − a) = 161.20 / (25 − 5) = 8.06   (40)
Summarizing the hypotheses
Recall our hypotheses:
H0 : τ1 = τ2 = ... = τa = 0   (41)
H1 : τi ≠ 0 for at least one treatment   (42)
The test statistic and reference distribution are,
F0 = MSTreatments / MSE ∼ F_{a−1, N−a} .   (43)
We reject H0 when F0 is an extreme value in the right tail of the Fa−1,N −a distribution. In other words,
F0 = MSTreatments / MSE ≈ 1 if H0 is true   (44)
F0 = MSTreatments / MSE ≫ 1 if H0 is false   (45)
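Putting the pieces together for the polycotton example, here is a short sketch using the sums of squares computed earlier (SciPy is assumed only for the F reference distribution):

```python
from scipy import stats

a, N = 5, 25                      # number of levels, total observations
SS_treat, SS_E = 475.76, 161.20   # sums of squares from the polycotton example

MS_treat = SS_treat / (a - 1)     # 118.94
MS_E = SS_E / (N - a)             # 8.06
F0 = MS_treat / MS_E              # ~14.76

p_value = stats.f.sf(F0, a - 1, N - a)   # right-tail area of F(4, 20)
print(F0, p_value)                       # p < 0.0001, so reject H0
```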
We made several assumptions, which we will come back to in a later section:
• experimental design is completely and properly randomized
• model errors are ∼ N (0, σ 2 )
• response variance σ 2 is the same for all levels of the factor
• the observations (i.e. response) are ∼ N (µ + τi , σ 2 )
Aside: is ANOVA always two sided?
H0 : τ1 = τ2 = ... = τa = 0   (46)
H1 : τ1 ≠ 0 or τ2 ≠ 0 or ... or τa ≠ 0   (47)
The short answer is: yes. The ANOVA is not capable of testing a one–sided H1 . Both positive and
negative values of τi have a tendency to increase M STreatments , and hence the F statistic. This is actually
a good thing! If one–sided tests were possible, as the number of levels (i.e. a) increased, the number of
possible alternative hypotheses would become hideously large.
Aside: is the equal variances assumption valid?
A key assumption in ANOVA is that the variances for all levels are equal, i.e. σ1² = σ2² = ... = σa². There
are modified ANOVA methods (e.g. Welch) that have been developed to handle the case where variances
cannot be assumed to be equal. However, these are rarely needed. For a well–designed experiment, why is it
often reasonable to expect that the variance should be about the same for all levels?
One–way ANOVA summary
We’ve just described how we will perform the analysis of variance for a single discrete factor effects model.
This is referred to as “one–way analysis of variance.” ANOVA is a remarkably useful and powerful tool in
statistics and data analysis.
Table 2: Summary table for performing one–way ANOVA
source | sum of squares | degrees of freedom | mean square  | F0 ∼ F_{a−1, N−a}
factor | SSTreatments   | a − 1              | MSTreatments | MSTreatments / MSE
error  | SSE            | N − a              | MSE          |
total  | SST            | N − 1              |              |
And recall the expressions for SST and SSTreatments ,
SST = Σ_{i=1}^{a} Σ_{j=1}^{n} (yij − ȳ••)²   (48)
SSTreatments = n Σ_{i=1}^{a} (ȳi• − ȳ••)²   (49)
We’re finally ready to return to our cotton fiber tensile strength example. We’ll use JMP to build and
test the following model:
yij = µ + τi + εij ,   i = 1, 2, ..., a ;  j = 1, 2, ..., n   (50)
Σ_{i=1}^{a} τi = 0 ,   (51)
H0 : τ1 = τ2 = ... = τa = 0   (52)
H1 : τi ≠ 0 for at least one treatment   (53)
To do this, we’ll use the ‘Fit Model’ platform in JMP, illustrated below in Figure 6.
Figure 6: Specifying the discrete–factor effects model for the polycotton data in the ‘Fit Model’ platform of JMP.
JMP did all of the arithmetic for us. Let’s write out the model equation:
yij = µ + τi + εij ,   (54)
and we have our prediction equation,
ŷij = µ̂ + τ̂i ,   (55)
with parameter estimates τ̂i calculated by JMP as,


ŷ = 15.04 + { −5.24 (wt% = 15), 0.36 (wt% = 20), 2.56 (wt% = 25), 6.56 (wt% = 30), −4.24 (wt% = 35) } .   (56)
The first panel in the JMP report labeled ‘Analysis of Variance’ tests the hypotheses listed above. The
calculated p–value is < 0.0001, so we reject H0 . At least one treatment effect is significantly different from zero.
Which one(s) are significantly non–zero? We’ll get to that shortly.
Aside: coding data
Recall that in Chapter 1 we briefly discussed normalization of data, which was the process of rescaling/shifting data by a simple multiplication/addition by constants. This process is also referred to as coding.
For example, consider the following dataset:
data = {13.4, 7.2, 29.0, 10.1, 17.7} .   (57)
We can normalize (code) these data so that they range from 0 to +1:
ycoded = (y − ymin) / (ymax − ymin) = (y − 7.2) / (29.0 − 7.2) ,   (58)
giving the following:
y:         13.4   7.2   29.0   10.1   17.7
y (coded): 0.28   0     1      0.13   0.48
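A minimal sketch of the same coding operation in Python:

```python
import numpy as np

y = np.array([13.4, 7.2, 29.0, 10.1, 17.7])
y_coded = (y - y.min()) / (y.max() - y.min())   # rescale to the range [0, 1]
print(y_coded.round(2))                          # [0.28 0.   1.   0.13 0.48]
```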
This process is actually very commonly done, for a couple of reasons. For example, non–integer data can
be coded so that all values are integers. Computationally, this is beneficial for speed, accuracy, and
long–term data storage. Perhaps more importantly, it is also done for proprietary/security reasons. Coded
data can be shared with “outsiders” without giving them your actual data. All of the statistical analyses we
have done so far will work just as well with coded data as they will with uncoded data!
A few lessons learned
• The analysis of variance involves all of the concepts we’ve covered so far: data types, probability,
distributions, formulating and testing hypotheses, ...
• ANOVA is a sophisticated analysis technique, but easy to understand when we break it down
• With software such as JMP, ANOVA is very easy to perform ... but correctly interpreting the results
demands that you understand the fundamentals
1.3 Generalized ANOVA
Consider a general example, where a response variable y is measured at three levels of a factor. Ten
measurements are collected at each level, with the results illustrated below in Figure 7.
Figure 7: Generalized ANOVA example. The response y is measured at three levels of a factor, given by the
three Normal distributions.
Here, vertical lines correspond to averages of the three levels (ȳ1• , ȳ2• , ȳ3•) and the grand mean ȳ•• .
Next we can follow our prescription from the previous section and calculate the sum of squares associated
with the treatment and the error,
SST = SSTreatments + SSE ,   (59)
with
SSTreatments = n Σ_{i=1}^{a} (ȳi• − ȳ••)²   (60)
             = 10 [ (ȳ1• − ȳ••)² + (ȳ2• − ȳ••)² + (ȳ3• − ȳ••)² ] ,   (61)
and
SSE = Σ_{i=1}^{a} Σ_{j=1}^{n} (yij − ȳi•)²   (62)
    = Σ_{j=1}^{10} (y1j − ȳ1•)² + Σ_{j=1}^{10} (y2j − ȳ2•)² + Σ_{j=1}^{10} (y3j − ȳ3•)² .   (63)
Remembering the hypotheses that ANOVA tests,
H0 : τ1 = τ2 = τ3 = 0   (64)
H1 : τ1 ≠ 0 or τ2 ≠ 0 or τ3 ≠ 0 ,   (65)
with test statistic,
F0 = MSTreatments / MSE = [ SSTreatments / (a − 1) ] / [ SSE / (N − a) ] = [ SSTreatments / (3 − 1) ] / [ SSE / (30 − 3) ] ∼ F_{2,27} .   (66)
If the ANOVA p–value> α, then we do not move on to the next step of comparing individual pairs of effects.
The factor does not have a significant effect on the response. Case closed!
However, if p–value < α, then the factor has a significant effect on the response. In other words, at least
one of the effects is significantly non-zero. But which one(s)? To determine this, our next step is to compare
all possible pairs of treatments. In other words, we wish to test a series of hypotheses having the general
form,
H0 : µi = µj   (67)
H1 : µi ≠ µj ,   (68)
for all pairs i 6= j. But we can’t simply do a t–test on each pair, as the probability of committing a type I
error would be substantially higher than α. What do we do?
Controlling the ‘experimentwise’ or ‘family-wise’ error
What we really need to do is to find some way to compare all of the pairwise means while controlling
the total probability of committing a type I error. Recall that if α = 0.05 and we have m = 10 pairs of
t–tests to compute, then the probability of committing at least one type I error is given as
1 − (1 − α)^m = 1 − (1 − 0.05)^10 ≈ 40% .   (69)
In other words, if we take the naive approach, we really have an effective α = 0.4, much higher than we
would typically be comfortable with. But what if we just reduce α in our pairwise tests? If α were set to a
lower value, say 0.005, then we would have an effective α of
αeffective = 1 − (1 − 0.005)^10 ≈ 0.049 .   (70)
This approach is called the Bonferroni correction. By simply using a (much) more conservative significance
level when performing the pairwise tests, we can take the naive approach while maintaining our comfortable
overall significance level. This overall significance level is referred to as the ‘experimentwise’ or ‘family-wise’
error – the error associated with the entire procedure, or entire family of procedures.
In principle there’s no problem with this approach. However, in reality, using a significance level of
α = 0.005 when doing the pairwise tests means that it will become much more difficult to reject the null
hypothesis even when you should. In other words, the probability of committing a type II error increases,
or equivalently, the power decreases. It is entirely possible to reject the null hypothesis from ANOVA (i.e.,
at least one effect is significant), and then fail to reject any null hypothesis during the pairwise t–tests with
the Bonferroni correction (i.e. no effect is significant). An apparent contradiction caused by type II error!
In 1953, an American mathematician named John Tukey developed an alternative way to do many
pairwise mean comparisons while controlling the family-wise error, and without significantly reducing the
power of the test. The original formulation of this approach assumed a balanced experiment design (i.e. all
factors have the same number of samples), but it was later modified by Kramer to allow for a mixed design.
More details can be found in the textbook, but the general idea behind the Tukey(–Kramer) Honestly
Significant Difference (HSD) Test is that, rather than doing an actual pairwise comparison, it determines
the minimum difference between sample means that can be considered significant. In particular, two means
are said to be significantly different if
s
qα (a, f )
1
1
|y i• − y j• | > √
M SE
+
,
(71)
ni
nj
2
where a is the total number of treatments/levels, f is the number of degrees of freedom associated with
MSE , and q is the corresponding ‘Studentized range’ statistic. This is essentially just another reference
distribution, with values given in the back of the textbook just as the z, t, χ², and F distributions. Using
a similar procedure to our previous discussion on confidence intervals, we can also compute a confidence
interval for the difference of each pair of (population) means:
ȳi• − ȳj• − ( qα(a, f) / √2 ) √( MSE (1/ni + 1/nj) ) ≤ µi − µj ≤ ȳi• − ȳj• + ( qα(a, f) / √2 ) √( MSE (1/ni + 1/nj) ) .   (72)
If this confidence interval does not contain zero, then the population means are said to be significantly
different. Generally speaking, this confidence interval will be a bit wider than the naive approach of using a
standard t–test, but will be much narrower than using e.g. the Bonferroni correction to the standard t–test.
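For the polycotton example, the HSD value JMP reports can be reproduced from the Studentized range distribution. A minimal sketch (it assumes a recent SciPy, which provides scipy.stats.studentized_range; for real data, statsmodels’ pairwise_tukeyhsd produces the full comparison table):

```python
import numpy as np
from scipy import stats

a, N, n = 5, 25, 5          # levels, total runs, replicates per level (balanced)
MS_E = 8.06                 # pooled error mean square from the ANOVA
f = N - a                   # error degrees of freedom

q = stats.studentized_range.ppf(0.95, a, f)          # q_0.05(5, 20) ~ 4.23
HSD = (q / np.sqrt(2)) * np.sqrt(MS_E * (1/n + 1/n))
print(HSD)                  # ~5.37, matching the value JMP reports
```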
We can visualize this in JMP by asking it to do pairwise comparisons with a standard t–test (the naive
approach without Bonferroni correction) and also Tukey’s HSD, shown below in Figure 8.
Figure 8: Pairwise mean comparisons using the standard t–test (called Student’s t, the naive approach
without Bonferroni correction), the Tukey HSD test, and the t–test with Bonferroni correction. The radius
of the circle is related to the (one–sided) width of the confidence interval.
To the right side of Figure 8 are three sets of comparisons. The first, labeled ‘Each Pair Student’s t’,
uses α = 0.05 for each pairwise comparison, which as we discussed above gives a roughly 40% chance of
committing a type I error. Each circle is roughly related (but not equivalent) to a confidence interval about
the mean of a given treatment (in this case, the tensile strength given the weight % cotton). The second shows
the Tukey HSD comparisons. Finally, the last section shows the naive t–test with the Bonferroni correction
(reducing the significance level of each pairwise comparison) for a total type I error probability of about 0.05,
the same as the Tukey HSD test.
JMP also reports a comparison for all pairs using the Tukey–Kramer HSD, shown in Figure 9.
Figure 9: Computed values of |ȳi• − ȳj•| − HSD, where HSD = ( qα(a, f) / √2 ) √( MSE (1/ni + 1/nj) ). Here we can see from the diagonal that JMP reports HSD = 5.3730.
A little further down, JMP reports the significant differences in a way that is a little easier to see than
the circles, shown in Figure 10.
Figure 10: Report of significant differences between means from the Tukey–Kramer HSD test.
Finally, JMP automatically computes confidence intervals when you run a ‘Compare Means’ test,
illustrated in Figure 11.
Figure 11: Confidence intervals for the pairwise differences in means from the Tukey–Kramer HSD test.
Wait a second...
If the Tukey–Kramer HSD test is able to (effectively) perform pairwise means comparisons without
increasing the risk of committing a type I error, why did we even bother with ANOVA in the first place?
Why can we not just skip ANOVA and go straight to Tukey? This is a great question and is actually a
pretty deep ‘rabbit hole’ and area of current research. If you ask a statistician, they would say you don’t
necessarily need to do ANOVA if you’re going to use a comparison method like Tukey HSD which controls
the family–wise error. In fact, from the Wikipedia page for the Tukey HSD test:
“A common mistaken belief is that the Tukey hsd should only be used following a significant ANOVA. The
ANOVA is not necessary because the Tukey test controls the Type I error rate on its own.” – some Wikipedia
user
My response:
In all seriousness, the Wikipedia user is in principle correct. In practice though, the scientific community
is typically skeptical of results presented without first rejecting H0 via ANOVA. It’s possible, for instance,
to fail to reject H0 during ANOVA (i.e., suggesting no significant difference between treatments), and to
also find significant difference(s) from Tukey HSD (i.e. suggesting at least one significant difference between
treatments). This would likely represent a ‘borderline’ case of significance, and would therefore be met with
skepticism. Peer reviewers would likely demand you do a few more experiments to make sure the effect is
legitimate. If it is legitimate, the increased ‘significance’ of the effect would likely lead you to now reject H0
with ANOVA. The end result of this practice is a slightly reduced overall significance level, i.e. we lower α
from 0.05 to something slightly lower by this practice.
1.4 Designing experiments with a control
In some experiments, there may be one particular treatment that for some reason is special. We call this
level a control. Examples include:
• Exploring modifications of a process, and using the current operating conditions as a control
• Exploring changes in product composition/formulation, and using the current composition/formulation
as a control
• Exploring the effect of different drugs and using a placebo as the control
Some people enjoy the idea of a control so much that they believe all experiments should have one. This
is certainly not true! But it does explain why, when you present your results, the first question out of some
people’s mouths is so often “well yes that’s great, but what was your control?”
We do want to use a control when the goal of an experiment is to compare every other treatment to the
control level. But that’s obviously not always the case. Did we need a control for the cotton fiber
experiments?
Some of the confusion related to controls is certainly related to the fact that we sometimes refer to
experiments as ‘controlled.’ This is an experiment in which we set factor (independent variables) levels
(values) as desired. In science and engineering, we are usually able to do controlled experiments, though
not in cases where it would be unethical. For example, most of our knowledge on nutrition in humans is
extrapolated from experiments using animals. As a society we are ‘okay’ with the idea of e.g. locking mice
in a cage for their entire lives and completely controlling their diet and exercise, which is important when
investigating the effect of diet on e.g. weight loss and longevity. But we’re less okay with doing this with
human beings! In any case, a ‘controlled experiment’ is not the same thing as a control in an experiment.
When you are using a control, it is usually a good idea to perform more replicates for the control level.
A rule of thumb is usually
nc = n √a ,   (73)
where a as before is the number of levels, nc is the number of replicates for the control level, and n is the
number of replicates for the other levels. For example, if n = 5 and a = 5, then nc ≈ 11. As always, the
overall run order must be properly randomized, including the control runs.
Example: drug toxicity
Suppose we want to measure the toxicity of two drugs. We can measure this by exposing a culture of
cells to the drug and measuring the percentage of cells that die after exposure. In this setup, it would be
good to have a control, i.e. a level that receives no drug, as a benchmark to compare to the effect of the
drugs. We will do 15 runs: 7 with a control (placebo, i.e. no drug), 4 with drug A, and 4 with drug B.
The run order is completely randomized. Note that in this case, we are not interested in comparing drug A
to drug B, only the comparisons of the drugs to the control. An important distinction! The results of the
experiment are shown below in Figure 12.
Figure 12: Drug toxicity experiment results.
As we did with the tensile strength analysis, we start with ANOVA:
H0 : τplacebo = τA = τB = 0   (74)
H1 : τplacebo ≠ 0 or τA ≠ 0 or τB ≠ 0 .   (75)
Using JMP, we load the data into the Fit Y by X platform and click the ‘Means/ANOVA’ option in the
dropdown box. JMP reports:
Figure 13: Results of ANOVA on the drug toxicity data.
With p–value= 0.0384 < α = 0.05, we reject H0 and conclude that at least one treatment pair is
significantly different. Next, we proceed to the means comparisons. One twist when using a control is that
Tukey–Kramer HSD is no longer the ideal test. Recall that Tukey–Kramer compares all means. However,
with a control, we really just want to compare everything to the control, i.e. we don’t care about drug A
vs drug B, we only care about drug A vs the control and drug B vs the control. A modification to the
Tukey–Kramer HSD to account for this was developed by a person named Dunnett in the early 1960s, and
allows for a slightly more accurate comparison. We can do this with JMP, with results shown in Figure 14
below.
Figure 14: Result of Tukey and Dunnett analysis on the data following rejection of H0 with ANOVA.
Note that the Tukey–Kramer HSD test finds no significant difference, but Dunnett’s test finds drug B
to be significantly different. We get a little extra power (i.e. reduced probability of a type II error) by
neglecting to compare drug A to drug B!
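A sketch of the same comparison outside JMP: recent SciPy versions (1.11+) include scipy.stats.dunnett for exactly this control-vs-treatment setup. The numbers below are made up to stand in for the JMP data table, so only the structure of the call matters:

```python
from scipy import stats

# Hypothetical % cell death values standing in for the real JMP data
placebo = [9, 11, 10, 12, 8, 10, 11]   # control, n_c = 7 runs
drug_a  = [12, 14, 11, 13]             # 4 runs
drug_b  = [16, 18, 15, 17]             # 4 runs

res = stats.dunnett(drug_a, drug_b, control=placebo)
print(res.pvalue)   # one p-value per drug-vs-control comparison
```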
A key question
Since the primary goal of this experiment was to compare each of the drugs with the control, why not just
do two separate experiments? In other words, why not do one experiment to explore the effect of drug A,
and then another experiment to explore the effect of drug B? The answer is economics. The joint experiment
with both drugs at the same time is more efficient and provides more reliable results, since we are able to
pool the results and get a better estimate of e.g. the effect of the control level.
2 Residuals analysis
In the last section, we made several important assumptions about the populations from which our experiments
are sampling. In this section, we will see how we can analyze the residuals of a model to see if the assumptions
we made are likely to be valid. First, let’s briefly review the three analysis platforms we’ve used with JMP
so far:
1. Distribution – analysis of a single sample: histograms, tests on means and variances, normal quantile
plots, outlier box plots, ...
2. Fit Y by X – data with a single response and a single factor having one, two, or more levels: two
sample tests, one–way ANOVA, ...
3. Fit Model – data with a single response and one or more factors (i.e. can do most of what the Fit
Y by X platform does and much more)
Next, let’s define what we mean by a residual. A residual is simply the difference between the actual
(i.e. observed) and estimated (i.e. predicted by the model) values of the response variable. The residual for
observation j in treatment i is defined as,
eij = yij − ŷi ,   (76)
where as before the ‘hat’ variable is just our estimate of the response in treatment i.
Residuals analysis is a very useful technique for answering a variety of questions:
1. Is the response variable normally distributed?
2. Are there any outlier data points?
3. Is the model adequate?
The residual of a data point is related to the error term in the model. Recall:
proposed model:  yij = µ + τi + εij   (77)
fitted model:  ŷi = µ̂ + τ̂i   (78)
residual:  eij = yij − ŷi .   (79)
If the proposed model includes all relevant factors, and if µ̂ and τ̂i are good estimators for µ and τi , then eij
should be a good estimator for εij .
Checking our assumptions: normally distributed response variable
The statistical methods we now have in our toolbox (z–tests, t–tests, χ2 –tests, F –tests, ANOVA, ...) all
assume the response variable to be normally distributed.
If the model accounts for all important factors (i.e. if the model is “adequate”) and if the response
variable is normally distributed, and if the variance σ 2 is constant, and if the experiment was properly
randomized, then
e ≈ ε ∼ N(0, σ²) .   (80)
In other words, the model errors (i.e. residuals) should be ∼ N (0, σ 2 )!
An important point: although y is assumed to be normally distributed with constant σ, this does NOT
mean we expect all observations to be from a single normal distribution. So we can NOT check the assumption
simply by looking at a histogram of all the y values. Recall the figure from the beginning of our discussion
of generalized ANOVA:
Here we have measured the response variable y at different levels of a factor. If we just plotted a
histogram of the entire data set, it would not appear to be normally distributed! We have to make sure we
plot the residuals, which would subtract out the effective difference in the mean to give us a single normally
distributed variable ∼ N (0, σ 2 ). We can do this in one of three ways, in order of least rigorous to most
rigorous:
1. visual inspection of histogram of residuals (requires large n).
2. normal quantile plots
3. statistical tests of normality
All three of these can be done in the ‘Distribution’ platform of JMP. The good news is that even when the
normality assumption appears to be questionable, in many cases these tests (especially t–tests) have been
found to give mostly reliable results.
First, let’s discuss the normal quantile plots, also referred to as a ‘Q–Q’ plot (quantile–quantile plot).
In short, the way this works is by computing quantiles of the given data set and comparing the magnitude
of those quantiles to what we would expect from a normal distribution. Recall that a quantile is just the
x value of a distribution at which point a certain percentage of the data is to the left of that x value, and
it’s closely related to a percentile. For instance, the median of a distribution is just the 50th percentile: by
definition, 50% of the data in the distribution is to the left of the median.
We can use Q–Q plots to estimate if a data set is normally distributed pretty easily. If the data set is
actually normally distributed, then the quantiles of that data set will be roughly the same as what we would
expect of the quantiles for the normal distribution. So if we plot the quantiles of our data set on the y–axis,
and the quantiles of a normal distribution on the x–axis, and the points roughly fall along the line y = x,
then we can say that the data are normally distributed. A few examples are shown below in Figure 15.
Figure 15: Example of Q–Q plots for a normal distribution (top left), χ²_5 distribution (top right), F_{5,5}
distribution (bottom left), and exponential distribution (bottom right). Normally distributed data will lie
approximately along the line y = x on the Q–Q plot, while other distributions will deviate substantially.
In the above figure, note that the Q–Q for data drawn from an N (0, 1) distribution pretty closely follows
the diagonal line. All others show clear deviations, which are evidence of them not being normally distributed
– because they’re not!
In addition to quantile plots, we can statistically test if a data set is normally distributed using the
Shapiro–Wilk test. The details of how this works aren’t important, and JMP can do it for us easily. The
null hypothesis of this test supposes that the data are normally distributed. If the p–value is less than the
significance level, we reject H0 and conclude the data are not normally distributed. What should be noted,
however, is that we need to distinguish between a statistically significant result and a practically significant
result, which we will return to later. Suffice it to say: very large sample sizes will tend to reject H0 even if the
true deviation from normality is very small!
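Outside of JMP, the same two checks take a few lines of Python with SciPy; the residuals below are random normal draws standing in for the saved residuals column:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
residuals = rng.normal(0, 1, size=25)   # stand-in for the residuals column

# Normal quantile (Q-Q) plot of the residuals
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()

# Shapiro-Wilk test: H0 = the data are normally distributed
W, p = stats.shapiro(residuals)
print(W, p)   # do not reject H0 if p > alpha
```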
Now let’s look at how we can analyze residuals in JMP. We’ll use the polycotton fiber tensile strength
data to illustrate. First we will load the polycotton fiber data back into the ‘Fit Model’ platform and make
sure to set the emphasis to ‘minimal report’:
Figure 16: Loading data into the Fit Model platform of JMP. Setting Emphasis = ‘Minimal Report’ basically
says ‘do not drown me in results!’
Next we tell JMP to calculate the residuals and save them as a column in the data table:
Figure 17: Using JMP to calculate residuals and save them as a column in the data table.
This creates a new column in the data table containing the residuals. We then load this column into the
‘Distribution’ platform of JMP and click the ‘Normal Quantile Plot’ option:
Figure 18: Using JMP to calculate and plot a Q–Q plot, testing the normality and constant variance
assumption in ANOVA.
The data mostly follow the diagonal line, indicating that the response variable is normally distributed and
that the variance in each treatment is constant. Good news! We can also test this using the Shapiro–Wilk
test, which as a reminder takes as a null hypothesis the data being normally distributed. First we ask JMP
to fit a normal distribution to the residuals:
Figure 19: Using JMP to fit a normal distribution to the residuals.
Then within the ‘Fitted Normal’ dialogue box, we tell it to perform a ‘Goodness of Fit’ test:
Figure 20: Using JMP to perform a goodness of fit test on the resulting normal distribution, fit to residuals.
Finally, the results of the test are shown here:
Figure 21: Results of the Shapiro–Wilk test in JMP. Since the p–value = 0.18 is greater than the significance
level α = 0.05, we do not reject H0 and conclude the data are normally distributed.
With all of this done, we can be reasonably confident that the assumptions made during ANOVA/Tukey
HSD were founded. But what do we do if our data are not normally distributed? There are a few options:
1. Search the literature: is the probability distribution for your response variable y known? Have others
identified an appropriate test statistic and reference distribution?
2. Use nonparametric tests – these are statistical analyses that do not make assumptions about the
population distribution. Examples include kernel density estimation techniques, or the Welch ANOVA
method (which does not assume constant variance between levels).
3. Look for a mathematical transformation of y that produces an approximately normally distributed
quantity. As an example, sometimes y is not normally distributed, but ln(y) is normally distributed.
In this case, y is said to follow a lognormal probability distribution. These variables are fairly common
in some areas of interest, including:
• survival time after diagnosis of cancer
• latent period of an infectious disease
• rainfall amounts
• species abundance
• income of working people
Aside: outliers
For identification of outliers, we first need to define the standardized residual:
dij = eij / √(MSE) .   (81)
Or, sometimes the MSE is modified slightly to make a Studentized residual:
dij = eij / √( MSE (1 − hii) ) ,   (82)
where hii is an element of the “hat” matrix H which we will discuss later. Basically, we just divide the residual
by an estimate of its standard deviation, i.e. the square root of the mean square error MSE (or of the slightly
modified quantity MSE(1 − hii) for the Studentized residual).
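A minimal sketch of the scaling step and of the |dij| > 3 rule of thumb discussed next (the residual values are made up; MSE = 8.06 is taken from the polycotton ANOVA):

```python
import numpy as np

def flag_outliers(residuals, MS_E, cutoff=3.0):
    """Standardize residuals as d = e / sqrt(MS_E) and flag |d| > cutoff."""
    d = np.asarray(residuals) / np.sqrt(MS_E)
    return d, np.where(np.abs(d) > cutoff)[0]

# Example with made-up residuals; MS_E = 8.06 from the polycotton example
d, idx = flag_outliers([-2.8, 1.2, 0.4, 9.5, -1.1], MS_E=8.06)
print(d.round(2), idx)   # the 9.5 residual gives d ~ 3.3 and is flagged
```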
A rule of thumb for outlier identification: an observation that has a scaled residual |dij | > 3 may be an
outlier. An outlier is an “unusual” or “wild” observation that doesn’t appear to be consistent with most of
the other data. What can cause outliers?
• a mistake in experimental measurement
• incorrect calculation when processing the data
• a poor/inadequate model
• a genuine extreme observation
Note: outliers are not always “bad” data. An outlier may turn out to be the most interesting and useful
observation in the entire experiment! Many famous discoveries were made by the person who did not throw
out the outlier. Examples: teflon, post–it notes, corn flakes, LSD, silly putty, aniline dyes, scotchgard,
cellophane, rayon, vulcanized rubber, penicillin, identification of Charon as a moon of Pluto, the ozone hole
over Antarctica, ...
By default, JMP will create a box plot for every histogram. In the box plot, outliers are identified as a
‘dot’ outside of the box plot, illustrated below in Figure 22
Figure 22: Identification of outliers using JMP. The outlier is illustrated as a dot outside of the box plot.
This plot for the response variable itself will identify extreme or unusual values; however, these should
not be called “outliers” necessarily because a model has not been specified at this stage. To identify outliers,
box plots should be constructed on the residuals, not the response variable. Are there any outliers in
the tensile strength experiment? Let’s check JMP.
Figure 23: Analyzing the polycotton tensile strength data for outliers using JMP. No outliers!
Dealing with an outlier
• Review your lab notebook to see if anything unusual happened on that particular run. Are there any
good reasons to think this is a “bad” data point?
• Explore what happens when you repeat the data analysis with the outlier excluded. Do your conclusions
change?
• Remember that an outlier might possibly be a genuinely interesting and novel result. Don’t throw
your ticket to (scientific) fame and fortune down the drain!
Remember also that outliers are determined by extreme residuals, not extreme responses. Is your data point
an outlier because the response is unusual, or is your model just inadequate? Whether or not a point might
be considered an outlier depends on the model under consideration! Thus, a point that is excluded as an
outlier in the analysis of one model should not be deleted entirely. If we decide to try out a different model,
we should re-include any previously excluded points, fit the model, and again check for outliers.
Model adequacy
A model is said to be adequate if it quantitatively captures how the effects influence
the response, so that the only unexplained variability is due to random error. A model is therefore said to
be inadequate if other factors (e.g. nuisance variables that were not properly randomized out of influence)
are having an effect but were not included in the model. As we’ll see in the next section, even if we identify
all relevant factors, a model may still be inadequate if the appropriate effects are not included in the model!
How can we check for model adequacy? There are a few useful and informative plots we can construct
from the residuals to check model adequacy:
• time sequence plot
• plot of residuals vs estimated response values
• plot of residuals vs independent variables not included in the study
• and of course the familiar and much–loved R2 statistic also provides some measure of model adequacy,
with some important caveats
Remember that if our model accurately captures all relevant effects, then the residuals should be distributed
as eij ∼ N (0, σ 2 ). We can sequentially plot data drawn from a normal distribution to get an idea of what a
residuals plot ‘should’ look like, illustrated in Figure 24.
Figure 24: Plot of data drawn sequentially from N (0, 1). The data is structureless, with no obvious patterns
or trends.
Order vs disorder – successes and failures of our brains
Between your ears rests the most sophisticated device in the known universe for collecting, analyzing, and
interpreting data. Humans are especially good at processing visual information. We run great simulation
software to model our 3D world in real time. Take advantage of this when presenting your work or choosing
a computational tool. But also be wary of our susceptibility to being fooled by illusions, or seeing patterns
in chaos. One particular phenomenon to be aware of is called ‘Pareidolia.’ What do you see in the picture
below?
Figure 25: Image of the surface of Mars taken by Viking 1 in July of 1976. Is that a human face, or a rock
formation?
From Wikipedia, Pareidolia is: “the tendency for perception to impose a meaningful interpretation on a
nebulous stimulus, usually visual, so that one sees an object, pattern, or meaning where there is none.” As
human beings, we have a tendency to find patterns where none exist. If you’re old enough to remember the
above image circulating in the early age of the internet (late 1990’s, early 2000’s), you certainly remember
the conclusions that were being drawn as a result of this phenomenon. In 2001, a much higher resolution
image of the same rock formation was taken by a NASA probe, shown below:
Figure 26: Image of the same rock formation as above taken by a NASA probe in 2001. How about now: is
that a human face, or a rock formation?
Back to the polycotton fiber data, let’s use JMP to analyze the residuals. First we’ll load the data into
the Fit Model platform as before. We can easily plot the residuals by the test sequence (called Row Number
in JMP) and against the predicted values:
Figure 27: Telling JMP to plot residuals by row number and predicted values to test the assumptions in our
analyses.
We’re interested in the ‘Plot Residual by Row’ and ‘Plot Residual by Predicted’ options. Since we
organized our data table to have the test sequence associated with the rows sequentially, plotting residuals
by row is equivalent to plotting residuals by test sequence. This is the resulting plot from JMP:
Figure 28: JMP residuals analysis: plot of residuals vs test sequence appears to be structureless, so there’s
nothing to worry about.
Next, the residuals plotted against the predicted value:
Figure 29: JMP residuals analysis: plot of residuals vs predicted value. This is a good way to graphically
check the assumption that the variance (σ 2 ) is similar at all levels.
The final thing we would want to check is the residuals plotted against other variables. The day of the
week of each experiment was not a factor in the experiment, but suppose we had that data and plotted the
residuals against that variable:
Figure 30: JMP residuals analysis: plot of residuals vs day of the week. It looks like the day of the week is
a nuisance variable that we’ll need to take into account in the future.
Uh oh, there’s a clear trend when we plot the residuals vs the day of the week that the experiment was
performed! This is a nuisance variable we didn’t take into account and will need to be properly blocked/randomized in the future.
Brief comment on R²
A familiar statistic when evaluating model adequacy is the R² statistic, defined as,
R² = SSTreatments / SST .   (83)
The R² statistic can be useful, but its reference distribution is not known, and so it cannot be used on its
own to make statistical conclusions. In other words, at a given significance level α, there is no valid way to
determine a “cutoff” R² value. Furthermore, the R² statistic does not punish ‘overfitting’ of data, which we
will learn more about towards the end of the class. Bottom line: take R² statistics with a heap of salt!
2.1 Side note: statistical significance vs practical significance
In general, increasing the sample size of an experiment will reduce the probability of committing a type II
error (failing to reject H0 when H0 is false). At the same time, decreasing the significance level α reduces
our chance of committing a type I error, but increases the chance of making a type II error. In the real
world we never know the true population parameters, and so the sample size is a very important decision
you have to make as a scientist. The statistical guidance behind this topic is called ‘power analysis’ which
we unfortunately won’t have time to cover in this class, but is discussed in the textbook.
To understand why this is important, let’s take an example of rat weights. Suppose a medical researcher
is using two species of rats as their test subject. Rats of species A and B are randomly selected from two
large populations in which rat weight is ∼ N(µA , σ²) and ∼ N(µB , σ²), respectively. The researcher doesn’t
know (and never will know) the population parameters µA , µB , σ². To determine if the difference in mean
weight between the species is significant, the researcher plans to weigh a randomly selected sample of N rats
(so N/2 of each species) and perform a t–test at α = 0.05. All hypotheses tested will be two–tailed. We’ll
consider several different scenarios for the A (blue) and B (orange) rat populations.
Figure 31: Scenarios for the rat populations to be investigated.
Case 1: Large difference between means and small sample size. µA = 500 g, µB = 550 g, σ = 10 g, N = 10.
Figure 32: Data analysis for case 1 with µA = 500 g, µB = 550 g, σ = 10 g, N = 10. In this case we reject
H0 and conclude the population means are significantly different.
Case 2: Small difference between means and small sample size. µA = 500 g, µB = 510 g, σ = 10 g, N = 10.
Figure 33: Data analysis for case 2 with µA = 500 g, µB = 510 g, σ = 10 g, N = 10. In this case we fail to
reject H0 and conclude the population means are not significantly different.
Case 3: Small difference between means and larger sample size. µA = 500 g, µB = 510 g, σ = 10 g, N = 20.
Figure 34: Data analysis for case 3 with µA = 500 g, µB = 510 g, σ = 10 g, N = 20. In this case we reject
H0 and conclude the population means are significantly different.
Case 4: Tiny difference between means and larger sample size. µA = 500 g, µB = 501 g, σ = 10 g, N = 20.
Figure 35: Data analysis for case 4 with µA = 500 g, µB = 501 g, σ = 10 g, N = 20. In this case we fail to
reject H0 and conclude the population means are not significantly different.
At this point you might be wondering if I have lost my mind. Would H0 ever really be rejected if the two
populations being sampled are like those above? The answer is yes! We can discover very small differences,
but only with sufficiently large sample sizes.
Case 5: Tiny difference between means and very large sample size. µA = 500 g, µB = 501 g, σ = 10 g,
N = 2000.
Figure 36: Data analysis for case 5 with µA = 500 g, µB = 501 g, σ = 10 g, N = 2000. In this case we reject
H0 and conclude the population means are significantly different.
Statistically significant differences summarized: for comparative tests between two samples, if the sample
size is too small, even large differences between µA and µB will not be statistically significant. The converse
is also true. Extremely small differences between µA and µB will be found to be statistically significant if
the sample size is sufficiently large. Although a difference may be statistically significant, it might not be
practically significant.
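The effect of sample size can also be seen directly by simulation. A sketch (the population parameters mirror the cases above; the exact rejection rates will vary from run to run):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def rejection_rate(mu_a, mu_b, sigma, N, trials=2000, alpha=0.05):
    """Fraction of two-sample t-tests (N/2 rats per species) that reject H0."""
    count = 0
    for _ in range(trials):
        a = rng.normal(mu_a, sigma, N // 2)
        b = rng.normal(mu_b, sigma, N // 2)
        if stats.ttest_ind(a, b).pvalue < alpha:
            count += 1
    return count / trials

# A 1 g difference is almost never detected at N = 20,
# but is detected a majority of the time at N = 2000
print(rejection_rate(500, 501, 10, N=20))
print(rejection_rate(500, 501, 10, N=2000))
```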
Question: is this a problem with our approach? Or a strength?
Determining practical significance depends on common sense, familiarity with the system being studied,
and how you intend to use the results. Deciding on a sample size is hard, and even experts make mistakes.
If you’re curious, there is a somewhat rigorous way to do this called ‘prospective power analysis’ which we
unfortunately won’t have time to cover. But the textbook has a good discussion on it, and there are many
internet resources available. And, JMP can do it (of course!).
3 Effects models and regression analysis
As a reminder, an effects model is an empirical mathematical model proposed to describe the relationship
between a response variable and one or more factors. Each additive term in the model is called an effect.
Building and analyzing a model is a stepwise process as we discussed in the last section:
1. deciding on a model
2. estimation of the model parameters (“fitting”)
3. ANOVA to perform a “whole model” test to determine whether at least one effect is significant
4. statistical tests on individual effects (e.g. Tukey HSD)
5. residuals analysis
6. power analysis – did you make a type II error?
As we saw in the last section, estimating parameters for a discrete–factor effects model is relatively easy.
Parameters are estimated from the group (level) sample means. In a continuous–factor effects model, it’s
a bit more complicated. We have a few options, but the most common is linear least squares regression,
technically a sub–class of the more general method of “maximum likelihood estimation” (MLE). See the
Gauss–Markov theorem for more details. In any case, for both discrete– and continuous–factor models,
ANOVA is used for the first analysis: the whole model test.
3.1 Linear regression
What do we mean by a linear model?
Consider the familiar case in which the independent variable (factor) is continuous – there are no levels of x. Suppose we have data in the form of n observed pairs (x1, y1), (x2, y2), ..., (xn, yn). We could have a variety of different models to fit this data:
y = β0 + β1 x + ε   (84)
y = β0 + β1 x² + ε   (85)
y = β0 + β1 ln(x) + ε   (86)
y = β0 + β1 / √x + ε   (87)
All of the above models are linear models because they are linear with respect to the model parameters
β0 and β1 . These parameters are also called the regression coefficients.
Let’s take as an example a simple linear model we all know and love:
y = β0 + β1 x + ε .   (88)
Here, β0 and β1 are the “true” model parameters. From experimental data (i.e. random samples from a
larger population), we hope to obtain accurate estimates of these parameters. We will then have a model
that allows us to predict the response,
ŷ = β̂0 + β̂1 x   (89)
However, fitting these parameters usually involves an over–determined set of equations. Take as an
example an experiment with six observations (xi, yi) studying the effect of continuous variable x on continuous
response y. Writing the prediction equation for each observation gives six equations with 2 unknowns (β̂0 and β̂1 ).
This system is mathematically over–determined.
Real world data will always contain some random error and so there will be no exact solution to this set of
equations. So instead we focus on finding the “best” solution, the model for which predicted values provide
the “best fit” to the actual values.
The best approach to solving this problem is called the method of ‘maximum likelihood estimation’ or
MLE which we will cover towards the end of the course. Another option in addition to MLE is to choose what
is called a ‘loss function’ L. The loss function essentially tells the model what to minimize when optimizing
the parameters. Two typical choices of loss function are,
L = Σ_{i=1}^n |ei| ,   (90)

L = Σ_{i=1}^n ei² ,   (91)
where ei is the residual of point i, i.e.

ei = yi − ŷi   (92)
   = yi − (β̂0 + β̂1 xi) .   (93)
The optimal values of the parameters are those which minimize the loss function. Taking L = Σ_{i=1}^n ei² as
our loss function, we can determine optimal values of the parameters by taking the derivative with respect
to each parameter and setting them equal to zero (i.e. minimizing with respect to the parameters):

L = Σ_{i=1}^n ei² = Σ_{i=1}^n (yi − ŷi)² = Σ_{i=1}^n (yi − β̂0 − β̂1 xi)²   (94)

∂L/∂β̂0 |_{β̂1} = −2 Σ_{i=1}^n (yi − β̂0 − β̂1 xi) = 0   (95)

∂L/∂β̂1 |_{β̂0} = −2 Σ_{i=1}^n xi (yi − β̂0 − β̂1 xi) = 0   (96)
This is now just two equations with two unknowns (β̂0 and β̂1 ) which can be easily solved. In general
there will be one equation for each parameter, since we need to take the derivative with respect to every
parameter and set it equal to zero. The above analysis is called “least squares linear regression” – “least
squares” refers to the fact that we are minimizing the square of the error (i.e. a quadratic loss function),
and linear regression refers to the fact that the model is linear in the parameters. As it turns out, least
squares linear regression gives exactly the same result as the maximum likelihood estimation approach
when the errors are distributed normally (i.e. the model is adequate). In other words, under that assumption,
least squares linear regression is statistically optimal in estimating the parameters!
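To make the algebra concrete, here is a minimal sketch (assuming Python with NumPy; we use JMP in this course, so this is purely illustrative) that solves Eqs. (95)–(96) directly. The (x, y) values are invented for illustration only:

```python
# Minimal sketch of least squares for y = b0 + b1*x by solving Eqs. (95)-(96).
# The (x, y) values below are invented purely for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])
n = len(x)

# Rearranging Eqs. (95)-(96) gives two linear equations in b0_hat and b1_hat:
#   n*b0      + sum(x)*b1   = sum(y)
#   sum(x)*b0 + sum(x^2)*b1 = sum(x*y)
A = np.array([[n, x.sum()],
              [x.sum(), (x**2).sum()]])
b = np.array([y.sum(), (x * y).sum()])
b0_hat, b1_hat = np.linalg.solve(A, b)

e = y - (b0_hat + b1_hat * x)          # residuals, Eq. (92)
print(b0_hat, b1_hat, (e**2).sum())    # estimates and the minimized loss L
```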
Least squares linear regression can be generalized to multivariable models without too much difficulty.
For example, suppose x1 and x2 are two different continuous variables (e.g. temperature and pressure). The
following are a few examples of linear models relating these two variables to a response variable y:
y = β0 + β1 x1 + β2 x2 + ε   (97)

y = β0 + β1 x1 + β2 x2 + β3 x1 x2 + ε   (98)

y = β0 + β1 x1 + β2 sin(x1⁵/x2) + ε   (99)
Again, these are all linear models – ‘linear’ refers to the parameters, not to the variables! In an even more
general case, suppose we have k continuous variables, x1 , x2 , ...xk . We can write down the simplest linear
model for this case,
y = β0 + β1 x1 + β2 x2 + ... + βk xk + ε .   (100)

Given n total observations, we can write n equations in the p = k + 1 unknown coefficients. For example, the ith equation in this
list will be:

ŷi = β̂0 + β̂1 xi1 + β̂2 xi2 + ... + β̂k xik .   (101)

With n equations and p = k + 1 unknown coefficients, if n > p there is (in general) no exact solution! We can
apply the same least squares framework as before,
L = Σ_{i=1}^n ei² = Σ_{i=1}^n (yi − ŷi)² = Σ_{i=1}^n (yi − β̂0 − Σ_{j=1}^k β̂j xij)²   (102)

∂L/∂β̂0 |_{β̂j, j≠0} = −2 Σ_{i=1}^n (yi − β̂0 − Σ_{j=1}^k β̂j xij) = 0   (103)

∂L/∂β̂1 |_{β̂j, j≠1} = etc. ,   (104)
and solve this set of k + 1 equations for the estimators β̂0 , β̂1 , ..., β̂k . This is what the computer is doing
‘behind the scenes’ when you fit a line in Excel or elsewhere!
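For example, here is a sketch (again assuming NumPy, with synthetic x1, x2, and y values invented purely for illustration) of what that looks like for the two–regressor model of Eq. (97):

```python
# Minimal sketch of the multivariable case, Eq. (100): assemble the model matrix
# and solve the k + 1 normal equations. Synthetic data, invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 20
x1 = rng.uniform(0.0, 10.0, n)                   # first regressor (arbitrary units)
x2 = rng.uniform(1.0, 5.0, n)                    # second regressor (arbitrary units)
y = 3.0 + 1.5 * x1 - 2.0 * x2 + rng.normal(0.0, 0.5, n)

X = np.column_stack([np.ones(n), x1, x2])        # n rows, p = k + 1 = 3 columns

# Equivalent to solving Eqs. (103)-(104); lstsq handles the over-determined system
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)                                  # estimates of beta0, beta1, beta2
```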
Aside: dimensions and units
In discrete factor effects models, all model parameters (µ, τi ) have the same units as the response variable.
Recall the example with the polycotton fiber tensile strength. However, in continuous factor effects models,
the model parameters will generally not all have the same units. Regression coefficient units depend on the
dimensions of both the response and the factor. For example, in a single variable linear regression,
y = β0 + β1 x² + β2/√x + ε ,   (105)
it certainly cannot be the case that each of these coefficients has the same units, unless both y and x are
unitless.
3.2 Testing regression models
Recall that for a discrete factor effects model with multiple levels, our first goal (after fitting the model to
obtain parameter estimates) was to use ANOVA to determine whether the factor affects the result at any
level (i.e. at least one level has a significant effect). If we find that it does, we move to the second goal, which
is to compare effects at each level. A similar strategy is used for continuous factor effects models:
1. Whole model test via ANOVA
2. Tests on the individual parameters of the model (i.e. the regression coefficients)
Just as before, the total variability can be decomposed into the variability explained by our regression
model and the variability due to random error:
SST = Σ_{i=1}^N (yi − ȳ••)²   (106)

SSModel = Σ_{i=1}^N (ŷi − ȳ••)²   (107)

SSE = Σ_{i=1}^N (ŷi − yi)²   (108)
We set up ANOVA similarly to before, and first test whether there is a relationship between the response
variable and any regressor variable:
H0 : β1 = β2 = ... = βk = 0   (109)
H1 : βj ≠ 0 for at least one j   (110)
note that β0 is not included in these hypotheses! That’s because it represents the “overall” mean, similar
to µ in the discrete–factor effects model example. Summarizing into a similar table as in the discrete factor
case,
Table 3: ANOVA for testing significance of regression

  source    sum of squares    degrees of freedom    mean square    F0 ∼ F(p−1, N−p)
  factor    SSModel           k = p − 1             MSModel        MSModel/MSE
  error     SSE               N − p                 MSE
  total     SST               N − 1
Example: Suppose you are doing an experiment to test the effect of oil viscosity x on the wear y of a
bearing assembly:
Figure 37: Ball bearing wear experiment.
We’ll start by proposing a simple model describing how the viscosity affects the wear:
y = β0 + β1 x + ε .   (111)
We can analyze this in the Fit Y by X platform of JMP, making sure to specify that both variables are
continuous. After loading the data, we fit a line by selecting ‘Fit Line’ as illustrated in Figure 38:
Figure 38: Performing linear regression with JMP, part 1.
With the resulting JMP output:
Figure 39: Performing linear regression with JMP, part 2.
In addition to the usual R2 statistics, we get:
• ANOVA results (an F –test to determine the significance of the whole model)
• parameter estimates and t–tests for each of the regression coefficients
What conclusions can we make based on the R2 statistics and ANOVA? How are the parameter t–tests done
and what do they mean?
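If you are curious what JMP is computing, here is a minimal sketch (assuming NumPy and SciPy) of both the whole model F test of Table 3 and the coefficient t tests, applied to the six viscosity–wear observations tabulated in Section 3.3. The t statistic for each coefficient is t0 = β̂j / se(β̂j), with the standard errors taken from the diagonal of MSE·(X′X)⁻¹, compared against a t distribution with N − p degrees of freedom:

```python
# Sketch of the whole model F test (Table 3) and the coefficient t tests
# for the viscosity-wear data. Assumes NumPy and SciPy are available.
import numpy as np
from scipy import stats

x = np.array([1.6, 15.5, 22.0, 43.0, 33.0, 40.0])   # oil viscosity
y = np.array([193., 230., 172., 91., 113., 125.])   # bearing wear

X = np.column_stack([np.ones_like(x), x])            # model matrix, p = 2 columns
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)         # least squares estimates
y_hat = X @ beta_hat
N, p = X.shape

# Whole model test (Table 3)
SST = ((y - y.mean())**2).sum()
SSM = ((y_hat - y.mean())**2).sum()
SSE = ((y - y_hat)**2).sum()
MSM, MSE = SSM / (p - 1), SSE / (N - p)
F0 = MSM / MSE
p_whole = stats.f.sf(F0, p - 1, N - p)

# t tests on individual coefficients: t0 = beta_hat_j / se(beta_hat_j)
cov_beta = MSE * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(cov_beta))
t0 = beta_hat / se
p_coef = 2 * stats.t.sf(np.abs(t0), N - p)

print("R^2 =", SSM / SST)            # should be close to the 0.73 quoted in Sec. 3.4
print("F0 =", F0, "p =", p_whole)
print("t0 =", t0, "p =", p_coef)
```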
We can also run this in the Fit Model platform, which is a bit more powerful and has more details
available:
Figure 40: Performing linear regression with JMP, part 3.
Note that in the Fit Model platform, JMP reports confidence intervals on the model parameters.
What does the null hypothesis here really mean? Really what we have is a null model that we are
comparing with our proposed model. If the factor has no effect on the response, then we fail to reject the
null model. The whole model test is essentially just comparing the proposed model, i.e.

y = β0 + β1 x + ε ,   (112)

with the null hypothesis model,

y = β0 + ε .   (113)
A significant result in the ANOVA indicates that at least one of the regressor variables (we only have one in
this example) has a significant effect on the response, illustrated in Figure 41.
Figure 41: Performing linear regression with JMP, part 4.
What if ANOVA p–value > α? In this case the proper conclusion is that the proposed model is not
significantly better than the null model. Note that this is not the same as concluding that x has no
significant effect on y. It could be that our model is deficient. ANOVA tests the model! So, we should
analyze the residuals to determine whether the model is adequate. If the residual analysis checks out, only
then can we justify a conclusion that x has no significant effect on the response y. We do that in Figure 42:
Figure 42: Performing linear regression with JMP, part 5: residuals analysis.
The residuals appear to be structureless when plotted by row number and by predicted value. The residual
normal quantile plot follows the line y = x without any shift or deviating trend. Plotting the actual values
against the predicted values similarly shows close agreement along the parity line, suggesting our model is adequate.
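Outside of JMP, the same residual checks can be sketched in a few lines (assuming NumPy, SciPy, and matplotlib; the data are the viscosity–wear observations):

```python
# Sketch of residual checks like Figure 42: residuals vs. predicted values
# and a normal quantile plot. Assumes NumPy, SciPy, and matplotlib.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.array([1.6, 15.5, 22.0, 43.0, 33.0, 40.0])
y = np.array([193., 230., 172., 91., 113., 125.])
X = np.column_stack([np.ones_like(x), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat                                  # raw residuals

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(X @ beta_hat, e)                          # should look structureless
ax1.axhline(0.0, color="gray", linewidth=0.8)
ax1.set_xlabel("predicted wear")
ax1.set_ylabel("residual")
stats.probplot(e, dist="norm", plot=ax2)              # normal quantile plot
plt.tight_layout()
plt.show()
```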
3.3 Leverage points
In a given dataset, sometimes one or a handful of points will be what is called a 'high leverage' point. In
other words, these points are particularly influential on the fitted model parameters. To see how we can identify
high leverage points, first we need to dust off some of our linear algebra knowledge. First, we note that
linear algebra lets us write linear regression models in a very compact way:

y = Xβ + ε ,   (114)

where y and ε are column vectors containing every observation of the response and the random error, X is the
model matrix with one row per observation (and a leading column of ones for the intercept), and β is the column
vector of regression coefficients. In the case of the viscosity–wear example,

        [193]          [1   1.6 ]
        [230]          [1   15.5]
        [172]          [1   22  ]             [β0]
  y  =  [ 91]  ;  X =  [1   43  ]  ;  β  =    [β1]  .   (115)
        [113]          [1   33  ]
        [125]          [1   40  ]
Additionally, linear algebra provides a way to easily and analytically solve for the least squares estimates
of the regression coefficients. We won't prove this relationship (if you're interested, look up the 'normal
equation'), but we can directly write,

β̂ = (X′X)⁻¹ X′y ,   (116)

where X′ refers to the transpose of X. Plugging this result into our prediction equation, we have,

ŷ = X β̂ = X(X′X)⁻¹X′ y = Hy ,   (117)

where H is called the "hat" matrix – it "puts the hat on y." The H and (X′X)⁻¹ matrices both play an
important role in regression analysis. Recall for example that the diagonal elements of the hat matrix were
used in the calculation of Studentized residuals.
For the viscosity–wear data, we can calculate the hat matrix:

                        [ 0.64   0.37   0.24  −0.16   0.03  −0.11]
                        [ 0.37   0.25   0.20   0.03   0.11   0.05]
  H = X(X′X)⁻¹X′  =     [ 0.24   0.20   0.18   0.11   0.14   0.12]   (118)
                        [−0.16   0.03   0.11   0.40   0.26   0.36]
                        [ 0.03   0.11   0.14   0.26   0.21   0.25]
                        [−0.11   0.05   0.12   0.36   0.25   0.33]
Diagonal elements of H may indicate observations that have high leverage, i.e. are highly influential, by
virtue of their location in X space. In JMP, you can save diagonal elements of the hat matrix with ‘Save’
> ‘Columns’ > ‘Hats’:
    x      y     diag(H)
   1.6    193     0.64
  15.5    230     0.25
  22.0    172     0.18
  43.0     91     0.40
  33.0    113     0.21
  40.0    125     0.33
Here, the high leverage points are identified by the rows where the corresponding diagonal element of the hat
matrix is large – in other words, the first and fourth rows. Note that these rows correspond to the minimum and
maximum values of x. Leverage point analysis is more interesting and useful in multivariate problems, when X
contains multiple regressors.
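As a quick check, here is a minimal sketch (assuming NumPy) that reproduces the diagonal of the hat matrix in Eq. (118) – the same numbers JMP saves in the diag(H) column:

```python
# Sketch reproducing the diagonal of the hat matrix in Eq. (118)
# for the viscosity-wear data. Assumes NumPy is available.
import numpy as np

x = np.array([1.6, 15.5, 22.0, 43.0, 33.0, 40.0])
X = np.column_stack([np.ones_like(x), x])          # model matrix from Eq. (115)

H = X @ np.linalg.inv(X.T @ X) @ X.T               # hat matrix, Eq. (117)
print(np.round(np.diag(H), 2))                     # [0.64 0.25 0.18 0.4  0.21 0.33]
```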
3.4 Checking the adequacy of a regression model
The final thing to check with a regression model is adequacy. We will talk about this in greater detail
towards the end of the course (keep an eye out for bias/variance tradeoff), but for now, we can use many of
the same tools we used when analyzing discrete factor effects models. In particular:
• residuals analysis:
– normal quantile plot for standardized/Studentized residuals
– plot residuals vs. each regressor variable and look for a structureless plot
• R² and adjusted R²:

R² = SSModel / SST   (119)

adj. R² = 1 − MSE / (SST/(n − 1)) = 1 − [(n − 1)/(n − p)] (1 − R²)   (120)
The adjusted R2 is just a modification of the original R2 which attempts (somewhat successfully) to punish
overfitting of the data. To see this, let’s return to our viscosity–wear data. In our initial analysis, we fit
a linear model with only one coefficient (besides the intercept), i.e. we fit a line to the data. What if we
included higher order terms and fit a polynomial to this data? We can fit a higher order polynomial in the
Fit Y by X platform of JMP. After loading the data, choose ‘Fit Polynomial’ > ‘4,quartic’ to fit a fourth
order polynomial. The results are summarized below in Figure 43.
Figure 43: Fitting a fourth order polynomial to the viscosity–wear data.
This model looks great! Recall that with the linear (i.e. fitting a line) model, R² = 0.73 and adj.
R² = 0.66. Here we see the fourth order fit has R² = 0.98 and adj. R² = 0.92, a significant improvement!
But hold on a second, we fail to reject H0 with ANOVA! In other words, ANOVA is telling us that this
fourth order model is not significantly better than the null model. What’s going on here?
Adding terms to the model decreases the error degrees of freedom, which results in a higher p–value
despite a higher F –value. To fit a higher order model, we need more observations. ANOVA can partition
the variance effectively only if we have enough data to allow for a good look at random error. We need
more error degrees of freedom! For example, if we have only two experimental points, we would have a
perfect fit (R2 = 1) but zero error d.o.f., and so the model would not be testable. This phenomenon is called
‘overfitting’ of the data, which we will return to later.
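To see the adjusted R² penalty in action outside of JMP, here is a small sketch (assuming NumPy) that applies Eqs. (119)–(120) to the viscosity–wear data for both the straight line and the fourth order polynomial; the values should come out close to those quoted above:

```python
# Sketch of Eqs. (119)-(120) for the viscosity-wear data: straight line
# versus 4th order polynomial fit to the same six points. Assumes NumPy.
import numpy as np

x = np.array([1.6, 15.5, 22.0, 43.0, 33.0, 40.0])
y = np.array([193., 230., 172., 91., 113., 125.])

def r2_stats(degree):
    X = np.vander(x, degree + 1, increasing=True)   # columns 1, x, x^2, ...
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    SSE = ((y - X @ beta)**2).sum()
    SST = ((y - y.mean())**2).sum()
    n, p = X.shape
    R2 = 1.0 - SSE / SST
    adj_R2 = 1.0 - (1.0 - R2) * (n - 1) / (n - p)
    return R2, adj_R2

print("line:    R2, adj R2 =", r2_stats(1))   # roughly 0.73, 0.66
print("quartic: R2, adj R2 =", r2_stats(4))   # R2 rises, but the n - p penalty is harsher
```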