Topic 20: Single Factor Analysis of Variance

Outline
• Analysis of Variance
–One set of treatments (i.e., single
factor)
• Cell means model
• Factor effects model
–Link to linear regression using
indicator explanatory variables
One-Way ANOVA
• The response variable Y is continuous
• The explanatory variable is categorical
– We call it a factor
– The possible values are called levels
• This approach is a generalization of the
independent two-sample pooled t-test
• In other words, it can be used when there
are more than two treatments
Data for One-Way ANOVA
• Y is the response variable
• X is the factor (it is qualitative/discrete)
– r is the number of levels
– often refer to these levels as groups
or treatments
• Yi,j is the jth observation in the ith group
Notation
• For Yi,j we use
– i to denote the level of the factor
– j to denote the jth observation at
factor level i
• i = 1, . . . , r levels of factor X
• j = 1, . . . , ni observations for level i
of factor X
– ni does not need to be the same in each
group
KNNL Example (p 685)
• Y is the number of cases of cereal sold
• X is the design of the cereal package
– there are 4 levels for X because there
are 4 different package designs
• i =1 to 4 levels
• j =1 to ni stores with design i (ni=5,5,4,5)
• We write n in place of ni when every group has the same number of observations
Data for one-way ANOVA
data a1;
infile 'c:../data/ch16ta01.txt';
input cases design store;
proc print data=a1;
run;
The data (first 8 of the 19 observations)
Obs  cases  design  store
1    11     1       1
2    17     1       2
3    16     1       3
4    14     1       4
5    15     1       5
6    12     2       1
7    10     2       2
8    15     2       3
Plot the data
symbol1 v=circle i=none;
proc gplot data=a1;
plot cases*design;
run;
The plot
Plot the means
proc means data=a1;
var cases; by design;
output out=a2 mean=avcases;
proc print data=a2;
symbol1 v=circle i=join;
proc gplot data=a2;
plot avcases*design;
run;
New Data Set
Obs  design  _TYPE_  _FREQ_  avcases
1    1       0       5       14.6
2    2       0       5       13.4
3    3       0       4       19.5
4    4       0       5       27.2
Plot of the means
The Model
• We assume that the response variable is
– Normally distributed with a
1. mean that may depend on the level
of the factor
2. constant variance
• All observations assumed independent
• NOTE: Same assumptions as linear
regression except there is no assumed
linear relationship between X and E(Y|X)
Cell Means Model
• A “cell” refers to a level of the factor
• Yij = μi + εij
– where μi is the theoretical mean or
expected value of all observations
at level (or cell) i
– the εij are iid N(0, σ2) which means
– Yij ~N(μi, σ2) and independent
– This is called the cell means model
Parameters
• The parameters of the model are
– μ1, μ2, … , μr
– σ2
• Question (Version 1) – Does our
explanatory variable help explain Y?
• Question (Version 2) – Do the μi vary?
H0: μ1= μ2= … = μr = μ (a constant)
Ha: not all μ’s are the same
Estimates
• Estimate μi by the sample mean of the observations at level i
• $\hat{\mu}_i = \bar{Y}_{i.} = \sum_j Y_{ij} / n_i$
• For each level i, also get an estimate of the variance
• $s_i^2 = \sum_j (Y_{ij} - \bar{Y}_{i.})^2 / (n_i - 1)$ (sample variance)
• We combine these $s_i^2$ to get an overall estimate of $\sigma^2$
• Same approach as pooled t-test
Pooled estimate of $\sigma^2$
• If the ni were all the same we would average the $s_i^2$
– Do not average the $s_i$
• In general we pool the $s_i^2$, giving weights proportional to the df, $n_i - 1$
• The pooled estimate is
$s^2 = \sum_i (n_i - 1) s_i^2 / \sum_i (n_i - 1) = \sum_i (n_i - 1) s_i^2 / (n_T - r)$
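The pooling rule can be checked by hand. The sketch below is written in Python rather than the SAS used in these slides, purely as a cross-check. It takes the group sizes and sample standard deviations that proc glm reports for the cereal example and pools the squared values with weights n_i - 1. The function name pooled_variance is an illustrative choice, not part of any package.

```python
# Group sizes and sample standard deviations for the four package designs,
# copied from the proc glm MEANS output for the cereal example.
n = [5, 5, 4, 5]
sd = [2.30217289, 3.64691651, 2.64575131, 3.96232255]


def pooled_variance(sizes, sds):
    """Pool the s_i^2 with weights n_i - 1; the divisor equals n_T - r."""
    numerator = sum((ni - 1) * si ** 2 for ni, si in zip(sizes, sds))
    df_error = sum(ni - 1 for ni in sizes)
    return numerator / df_error, df_error


s2, df_error = pooled_variance(n, sd)
print(df_error)      # error degrees of freedom, n_T - r
print(round(s2, 4))  # pooled estimate of sigma^2
```

The result should agree with the Error mean square in the ANOVA output (10.5467 on 15 df).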
Running proc glm
proc glm data=a1;
  /* Difference 1: need to specify factor variables */
  class design;
  model cases=design;
  /* Difference 2: ask for mean estimates */
  means design;
  lsmeans design;
run;
Output
Class Level Information
Class   Levels  Values
design  4       1 2 3 4
Number of Observations Read  19
Number of Observations Used  19
Important: check these summaries!
MEANS statement output (SAS 9.3 default)
Level of          cases
design    N    Mean         Std Dev
1         5    14.6000000   2.30217289
2         5    13.4000000   3.64691651
3         4    19.5000000   2.64575131
4         5    27.2000000   3.96232255
Table of sample means and sample standard deviations
LSMEANS statement output (SAS 9.3 default)
design  cases LSMEAN  Standard Error  Pr > |t|
1       14.6000000    1.4523544       <.0001
2       13.4000000    1.4523544       <.0001
3       19.5000000    1.6237816       <.0001
4       27.2000000    1.4523544       <.0001
Provides estimates based on the model (i.e., constant variance)
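Because the model assumes one common variance, the LSMEANS standard errors depend only on the pooled MSE and the group size. The Python sketch below (a cross-check outside SAS) assumes the usual formula SE = sqrt(MSE / n_i), with the MSE taken from the ANOVA output.

```python
import math

mse = 10.5466667               # Error mean square from the ANOVA output
n = {1: 5, 2: 5, 3: 4, 4: 5}   # number of stores for each package design

# Under the constant-variance model, SE(group mean i) = sqrt(MSE / n_i).
se = {design: math.sqrt(mse / ni) for design, ni in n.items()}

for design, value in se.items():
    print(design, round(value, 4))
```

Design 3 has the largest standard error only because it has four stores instead of five. Its own sample standard deviation plays no separate role.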
Notation
• $\bar{Y}_{i.} = \sum_j Y_{ij} / n_i$ (treatment sample mean)
• $\bar{Y}_{..} = \sum_i \sum_j Y_{ij} / n_T$ (grand sample mean)
• $n_T = \sum_i n_i$ is the total number of observations
ANOVA Table
Source  df     SS                                              MS
Model   r-1    $\sum_{ij} (\bar{Y}_{i.} - \bar{Y}_{..})^2$     SSR/dfR
Error   nT-r   $\sum_{ij} (Y_{ij} - \bar{Y}_{i.})^2$           SSE/dfE
Total   nT-1   $\sum_{ij} (Y_{ij} - \bar{Y}_{..})^2$           SST/dfT
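The table entries can be rebuilt from group summaries alone. The Python sketch below (not the course's SAS program) uses the sizes, means, and standard deviations reported for the cereal example. It relies on two identities: the Model sum of squares equals the sum of n_i times the squared deviation of each group mean from the grand mean, and the Error sum of squares equals the sum of (n_i - 1) s_i^2.

```python
# Group summaries for the cereal example, copied from the proc glm output.
n = [5, 5, 4, 5]
mean = [14.6, 13.4, 19.5, 27.2]
sd = [2.30217289, 3.64691651, 2.64575131, 3.96232255]

r = len(n)
n_total = sum(n)
grand_mean = sum(ni * mi for ni, mi in zip(n, mean)) / n_total

# Model SS: every observation in group i contributes (Ybar_i. - Ybar_..)^2.
ssr = sum(ni * (mi - grand_mean) ** 2 for ni, mi in zip(n, mean))
# Error SS: within-group variation, recovered from the sample variances.
sse = sum((ni - 1) * si ** 2 for ni, si in zip(n, sd))
# Total SS via the decomposition SST = SSR + SSE.
sst = ssr + sse

msr = ssr / (r - 1)
mse = sse / (n_total - r)
f_value = msr / mse

print(round(ssr, 4), round(sse, 4), round(sst, 4))
print(round(f_value, 2))
```

Up to rounding in the reported standard deviations, these values should match the Model, Error, and Corrected Total rows of the SAS output.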
ANOVA SAS Output
Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             3  588.2210526     196.0736842  18.59    <.0001
Error            15  158.2000000     10.5466667
Corrected Total  18  746.4210526

R-Square  Coeff Var  Root MSE  cases Mean
0.788055  17.43042   3.247563  18.63158
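The summary line under the ANOVA table follows directly from the table itself. The Python sketch below assumes the standard definitions: R-Square = SSR/SST, Root MSE = sqrt(MSE), and Coeff Var = 100 × Root MSE / mean of the response.

```python
import math

# Values copied from the ANOVA output above.
ssr, sst = 588.2210526, 746.4210526
mse = 10.5466667
cases_mean = 18.63158

r_square = ssr / sst                        # share of variation explained by design
root_mse = math.sqrt(mse)                   # estimate of sigma
coeff_var = 100 * root_mse / cases_mean     # Root MSE as a percentage of the mean

print(round(r_square, 6), round(coeff_var, 3), round(root_mse, 6))
```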
Expected Mean Squares
$E(MSE) = \sigma^2$
$E(MSR) = \sigma^2 + \sum_i n_i (\mu_i - \mu_.)^2 / (r - 1)$
where $\mu_. = \sum_i n_i \mu_i / n_T$
• E(MSR) > E(MSE) when the group
means are different
• See KNNL p 694 – 698 for more details
• In more complicated models, these tell
us how to construct the F test
F test
• F = MSR/MSE
• H0: μ1 = μ2 = … = μr
• Ha: not all of the μi are equal
• Under H0, F ~ F(r-1, nT-r)
• Reject H0 when F is large
• Report the P-value
Maximum Likelihood Approach
proc glimmix data=a1;
class design;
model cases=design / dist=normal;
lsmeans design;
run;
GLIMMIX Output
Model Information
Data Set                   WORK.A1
Response Variable          cases
Response Distribution      Gaussian
Link Function              Identity
Variance Function          Default
Variance Matrix            Diagonal
Estimation Technique       Restricted Maximum Likelihood
Degrees of Freedom Method  Residual
GLIMMIX Output
Fit Statistics
-2 Res Log Likelihood       84.12
AIC (smaller is better)     94.12
AICC (smaller is better)   100.79
BIC (smaller is better)     97.66
CAIC (smaller is better)   102.66
HQIC (smaller is better)    94.08
Pearson Chi-Square         158.20
Pearson Chi-Square / DF     10.55
GLIMMIX Output
Type III Tests of Fixed Effects
Effect  Num DF  Den DF  F Value  Pr > F
design  3       15      18.59    <.0001

design Least Squares Means
design  Estimate  Standard Error  DF  t Value  Pr > |t|
1       14.6000   1.4524          15  10.05    <.0001
2       13.4000   1.4524          15   9.23    <.0001
3       19.5000   1.6238          15  12.01    <.0001
4       27.2000   1.4524          15  18.73    <.0001
Factor Effects Model
• A reparameterization of the cell means
model
• A useful way of looking at more complicated models
• Null hypotheses are easier to state
• Yij = μ + τi + εij
– the εij are iid N(0, σ2)
Parameters
• The parameters of the model are
– μ, τ1, τ2, … , τr
– σ2
• The cell means model had r + 1 parameters
– r μ’s and σ2
• The factor effects model has r + 2 parameters
– μ, the r τ’s, and σ2
– Cannot uniquely estimate all parameters
An example
• Suppose r=3; μ1 = 10, μ2 = 20, μ3 = 30
• What is an equivalent set of parameters
for the factor effects model?
• We need to have μ + τi = μi
• μ = 0, τ1 = 10, τ2 = 20, τ3 = 30
• μ = 20, τ1 = -10, τ2 = 0, τ3 = 10
• μ = 5000, τ1 = -4990, τ2 = -4980, τ3 = -4970
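The three parameter sets above can be checked mechanically. This small Python sketch (the helper name implied_cell_means is illustrative) adds the overall constant to each effect and confirms that every set reproduces the same cell means, which is exactly why the data cannot distinguish between them.

```python
cell_means = [10, 20, 30]

# (mu, [tau_1, tau_2, tau_3]) for each parameter set listed in the example.
candidates = [
    (0, [10, 20, 30]),
    (20, [-10, 0, 10]),
    (5000, [-4990, -4980, -4970]),
]


def implied_cell_means(mu, taus):
    """Cell means implied by a factor effects parameterization: mu + tau_i."""
    return [mu + tau for tau in taus]


results = [implied_cell_means(mu, taus) for mu, taus in candidates]
all_match = all(means == cell_means for means in results)
print(results)
print(all_match)
```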
Problem with factor effects?
• These parameters are not estimable, i.e., not uniquely defined
• There are many solutions to the least squares problem
• The X΄X matrix for this parameterization does not have an inverse (perfect multicollinearity)
• SAS proc glm flags the resulting parameter estimators as biased
Factor effects solution
• Put a constraint on the τi
• Common to assume Σi τi = 0
• This effectively reduces the number
of parameters by 1
• Numerous other constraints possible
Consequences
• Regardless of constraint, we always have μi = μ + τi
• The constraint Σi τi = 0 implies
– μ = (Σi μi)/r (unweighted grand mean)
– τi = μi – μ (group effect)
• The “unweighted” complicates things when the ni are not all equal; see KNNL p 702-708
Hypotheses
• H0: μ1 = μ2 = … = μr
• H1: not all of the μi are equal
are translated into
• H0: τ1 = τ2 = … = τr = 0
• H1: at least one τi is not 0
Estimates of parameters
• With the constraint Σi τi = 0
$\hat{\mu} = \sum_i \bar{Y}_{i.} / r$ (equal to $\bar{Y}_{..}$ if $n_i = n$)
$\hat{\tau}_i = \bar{Y}_{i.} - \hat{\mu}$
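Applied to the cereal example, the sum-to-zero estimates can be worked out from the four group means. The Python sketch below is an illustration outside SAS. It also computes the grand sample mean to show that, with unequal group sizes (5, 5, 4, 5), the unweighted mean of the group means is not the same number.

```python
n = [5, 5, 4, 5]
group_means = [14.6, 13.4, 19.5, 27.2]   # Ybar_i. for designs 1 to 4
r = len(group_means)

# Sum-to-zero constraint: mu_hat is the unweighted mean of the group means.
mu_hat = sum(group_means) / r
tau_hat = [m - mu_hat for m in group_means]

# Grand sample mean Ybar_.. weights each group mean by its sample size.
grand_mean = sum(ni * m for ni, m in zip(n, group_means)) / sum(n)

print(round(mu_hat, 4), [round(t, 4) for t in tau_hat])
print(round(sum(tau_hat), 10))   # effects sum to zero up to rounding
print(round(grand_mean, 5))      # differs from mu_hat because the n_i differ
```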
Solution used by SAS
• Recall, X΄X does not have an inverse
• We can use a generalized inverse in
its place
• (X΄X)- is the standard notation
• There are many generalized inverses,
each corresponding to a different
constraint
Solution used by SAS
• (X΄X)- used in proc glm corresponds to the constraint τr = 0
• Recall that μ and the τi are not estimable
• But the linear combinations μ + τi are estimable
• These are estimated by the cell
means
Cereal package example
• Y is the number of cases of cereal sold
• X is the design of the cereal package
• i =1 to 4 levels
• j =1 to ni stores with design i
SAS coding for X
• Class statement generates r explanatory variables
• The ith explanatory variable is equal to 1 if the observation is from the ith group, and 0 otherwise
• In other words, the rows of X (intercept column first) are
1 1 0 0 0 for design=1
1 0 1 0 0 for design=2
1 0 0 1 0 for design=3
1 0 0 0 1 for design=4
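The coding above can be reproduced to see where the singularity comes from. The Python sketch below (plain lists, no SAS) builds the 19 rows of X from the group sizes 5, 5, 4, 5, forms X'X, and checks that the intercept column equals the sum of the four indicator columns. The helper name crossprod is an illustrative choice.

```python
# Design label for each of the 19 stores (group sizes 5, 5, 4, 5).
design = [1] * 5 + [2] * 5 + [3] * 4 + [4] * 5
levels = [1, 2, 3, 4]

# Each row: intercept, then one 0/1 indicator per level of the factor.
X = [[1] + [1 if d == level else 0 for level in levels] for d in design]


def crossprod(matrix):
    """Return matrix' * matrix as a list of lists."""
    p = len(matrix[0])
    return [[sum(row[a] * row[b] for row in matrix) for b in range(p)]
            for a in range(p)]


xtx = crossprod(X)
for row in xtx:
    print(row)

# Perfect multicollinearity: the intercept is the sum of the indicators.
intercept_is_sum = all(row[0] == sum(row[1:]) for row in X)
print(intercept_is_sum)
```

The first row and column of X'X hold the total and per-group sample sizes, and the remaining diagonal entries are the group sizes.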
Some options
proc glm data=a1;
  class design;
  /* xpx prints X'X (the printed matrix also contains X'Y) */
  model cases=design / xpx inverse solution;
run;
Output
The X'X Matrix
        Int   d1   d2   d3   d4   cases
Int      19    5    5    4    5     354
d1        5    5    0    0    0      73
d2        5    0    5    0    0      67
d3        4    0    0    4    0      78
d4        5    0    0    0    5     136
cases   354   73   67   78  136    7342
Output
X'X Generalized Inverse (g2)
        Int    d1     d2     d3     d4   cases
Int     0.2   -0.2   -0.2   -0.2    0     27.2
d1     -0.2    0.4    0.2    0.2    0    -12.6
d2     -0.2    0.2    0.4    0.2    0    -13.8
d3     -0.2    0.2    0.2    0.45   0     -7.7
d4      0      0      0      0      0      0
cases  27.2  -12.6  -13.8   -7.7    0    158.2
Output matrix
• Actually, this matrix is
$\begin{bmatrix} (X'X)^- & (X'X)^- X'Y \\ Y'X(X'X)^- & Y'Y - Y'X(X'X)^- X'Y \end{bmatrix}$
• Parameter estimates are in the upper right corner and SSE is in the lower right corner (the last column of the previous output)
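The printed matrices can be verified against the block structure described above. The Python sketch below copies X'X, the generalized inverse, X'Y, and Y'Y from the SAS output. It then checks the defining property of a generalized inverse, A G A = A, and recomputes the parameter estimates and SSE. The helper matmul is a plain-Python illustration, not a SAS or library routine.

```python
# Blocks copied from the proc glm output (order: Int, d1, d2, d3, d4).
xtx = [[19, 5, 5, 4, 5],
       [5, 5, 0, 0, 0],
       [5, 0, 5, 0, 0],
       [4, 0, 0, 4, 0],
       [5, 0, 0, 0, 5]]
g = [[0.2, -0.2, -0.2, -0.2, 0],
     [-0.2, 0.4, 0.2, 0.2, 0],
     [-0.2, 0.2, 0.4, 0.2, 0],
     [-0.2, 0.2, 0.2, 0.45, 0],
     [0, 0, 0, 0, 0]]
xty = [354, 73, 67, 78, 136]
yty = 7342


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


# Defining property of a generalized inverse: (X'X) G (X'X) = X'X.
aga = matmul(matmul(xtx, g), xtx)
is_g_inverse = all(abs(aga[i][j] - xtx[i][j]) < 1e-9
                   for i in range(5) for j in range(5))

# Upper right block: b = G X'Y. Lower right block: SSE = Y'Y - Y'X b.
b = [sum(g[i][k] * xty[k] for k in range(5)) for i in range(5)]
sse = yty - sum(xty[k] * b[k] for k in range(5))

print(is_g_inverse)
print([round(v, 4) for v in b])
print(round(sse, 4))
```

The zero row and column for d4 in the generalized inverse are what impose the constraint that the last effect is zero.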
Parameter estimates
Par  Est      St Err  t       P
Int   27.2 B  1.45    18.73   <.0001
d1   -12.6 B  2.05    -6.13   <.0001
d2   -13.8 B  2.05    -6.72   <.0001
d3    -7.7 B  2.17    -3.53   0.0030
d4     0.0 B  .       .       .
Caution Message
NOTE: The X'X matrix has been
found to be singular, and a
generalized inverse was used
to solve the normal equations.
Terms whose estimates are
followed by the letter 'B' are
not uniquely estimable.
Interpretation
• If τr = 0 (in our case, τ4 = 0), then the corresponding estimate should be zero
• The intercept μ is then estimated by the mean of the observations in group 4
• Since μ + τi is the mean of group i, the τi are the differences between the mean of group i and the mean of group 4
Recall the means output
Level of design  N  Mean  Std Dev
1                5  14.6  2.3
2                5  13.4  3.6
3                4  19.5  2.6
4                5  27.2  3.9
Parameter estimates based on means
Level of design  Mean
1                14.6
2                13.4
3                19.5
4                27.2
$\hat{\mu} = 27.2$
$\hat{\tau}_1 = 14.6 - 27.2 = -12.6$
$\hat{\tau}_2 = 13.4 - 27.2 = -13.8$
$\hat{\tau}_3 = 19.5 - 27.2 = -7.7$
$\hat{\tau}_4 = 27.2 - 27.2 = 0$
Last slide
• Read KNNL Chapter 16 up to 16.10
• We used the program topic20.sas to generate the output for today
• Will focus more on the relationship
between regression and one-way
ANOVA in next topic