biol.582.f2011.lec.21

BIOL 582
Lecture Set 21
One-Way MANOVA, Part I
Multivariate data
• So far we have learned two things about multivariate data:
1. That linear models work equally well with multivariate
data, compared to univariate data
2. That we can visualize general patterns in multivariate data
spaces in reduced dimensions
• The next two lectures will attempt to wrap up some other issues in working with multivariate data, namely
1. Can I use ANOVA on a multivariate linear model to test group differences?
2. Can I determine which of my response variables are best for distinguishing group differences?
3. What is the difference between scatter plots that resemble
data spaces and ones that resemble statistical spaces?
• Before getting into these issues, let’s take a quick look again at
some issues with the Bumpus data
Bumpus data redux: interpreting PCA
• From last time
> bumpus<-read.csv("bumpus.csv")
> attach(bumpus)
> # the following morphological variables will be used
> # AE, BHL, FL, TTL, SW, SKL
> Y<-cbind(AE,BHL,FL,TTL,SW,SKL)
• This time, let’s try this simple procedure
> # PCA on correlation matrix
> Y.cor.pca<-princomp(Y,cor=T)
> plot(Y.cor.pca)
[Scree plot of Y.cor.pca: variances of Comp.1–Comp.6]
• It might be obvious that the plot shows the distribution of eigenvalues for the six possible PCs
> Y.cor.pca$sdev^2
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
3.6394870 0.7481825 0.6176918 0.4498947 0.3605078 0.1842361
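• As a quick check (and a preview of the pop quiz coming up), the six eigenvalues above sum to the number of variables, which is the total variance of any correlation-matrix PCA:

sum(Y.cor.pca$sdev^2)   # = p = 6 for these data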
• Let’s compare this to random, uncorrelated data…
Bumpus data redux: interpreting PCA
• Random data:
> Y.rand<-matrix(rnorm(6*nrow(Y),0,1),nrow=nrow(Y),ncol=6)
> Y.rand.cor.pca<-princomp(Y.rand,cor=T)
> par(mfrow=c(1,2))
> plot(Y.cor.pca)
> plot(Y.rand.cor.pca)
• Compare plots
[Side-by-side scree plots: variances of Comp.1–Comp.6 for Y.cor.pca (left) and Y.rand.cor.pca (right)]
• Pop quiz! What is the total variance of any PCA using a correlation matrix? Answer: p, where p is the number of variables.
The Bumpus data produced a principal eigenvalue that can be described as
“overdispersed” (more variance than expected). Another way to describe the
pattern is lower entropy (uncertainty), which means greater information content in
fewer dimensions than expected by chance.
There are (debatable) ways to test the “significance” of PCs (and some use these
methods to decide which information to keep and which to throw out). This is not
advocated here.
Just realize that in the Bumpus data, prominent variable covariation is apparent.
Covariation of response / group differences
• Why was the previous page important? Sometimes the covariation
of variables helps explain group differences in multivariate
responses. Utilizing a multivariate linear model might help reveal
this. Below is a hypothetical example to illustrate this point
• Are there obvious differences between the two groups for each trait,
independently?
[Scatter plot of the two groups on traits y1 and y2]
• But the two groups occupy different regions of the data space
• By considering the covariation of responses, it is apparent that a test to compare locations should reveal a difference.
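• Below is a minimal simulation sketch of this hypothetical pattern (the object names and numbers are made up, not from the Bumpus data): two groups that overlap on each trait separately but occupy different regions of the joint data space.

set.seed(2)
n   <- 50
grp <- factor(rep(c("A", "B"), each = n))
z   <- rnorm(2 * n)                                   # shared source of covariation
y1  <- z + rnorm(2 * n, 0, 0.3) + (grp == "B") * 0.7  # group B shifted up on y1
y2  <- z + rnorm(2 * n, 0, 0.3) - (grp == "B") * 0.7  # ...and down on y2
boxplot(y1 ~ grp)                              # heavy overlap for y1 alone
boxplot(y2 ~ grp)                              # heavy overlap for y2 alone
plot(y1, y2, col = as.numeric(grp), pch = 19)  # but the groups separate in the plane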
Multivariate linear model
• Let’s go through the steps of a linear model with multivariate response data, as we did with univariate response data (sort of a side-by-side comparison).
• From each model, the error can be obtained: e = y − Xβ̂ for a univariate response, and E = Y − XB̂ for a multivariate response
• Then the error can be “squared”: eᵀe in the univariate case, EᵀE in the multivariate case
• Thus, for univariate data, the inner product of two vectors produces a scalar, which is SSE. For multivariate data, the result is a sums of squares and cross-products (SSCP) matrix. This is the key difference. The error is not one value; it is a matrix of measures of covariance.
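• A tiny sketch of that difference with made-up data (hypothetical names, nothing from the Bumpus example): the same t(e) %*% e operation returns a 1 × 1 SSE for one response and a p × p SSCP matrix for p responses.

set.seed(1)
x  <- rnorm(20)
y  <- 2 + 0.5 * x + rnorm(20)                 # one response
Y2 <- cbind(y, y2 = 1 + 0.3 * x + rnorm(20))  # two responses
e1 <- resid(lm(y ~ x))
t(e1) %*% e1                                  # 1 x 1: the SSE (a scalar)
E2 <- resid(lm(Y2 ~ x))
t(E2) %*% E2                                  # 2 x 2: the SSCP matrix of the errors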
Multivariate linear model & ANOVA
• Recall that if one wishes to test group differences in means for univariate data, one can compare the error of a “full” linear model (with both an intercept and parameters for estimating group effects) to a “reduced” model that contains just an intercept. This can be done with a likelihood ratio test (using, e.g., an F stat):

$$F = \frac{(SSE_r - SSE_f)/\Delta k}{SSE_f/(n-k)}$$

• Inspection of this F value formula indicates that it will only work if the residuals form a single vector, such that the inner product produces a scalar. Here is a way to generalize this for matrices:

$$\mathbf{F} = \left[\frac{1}{n-k}\,\mathbf{S}_f\right]^{-1}\left[\frac{1}{\Delta k}\left(\mathbf{S}_r-\mathbf{S}_f\right)\right]$$

• Note that the F is bolded but not italicized, as it is a matrix
Multivariate linear model & ANOVA
• Two questions: What are the dimensions of this F matrix? Can we get a P-value from it?
• The first answer is easy; the second, not so much:
$$\mathbf{F} = \left[\frac{1}{n-k}\,\mathbf{S}_f\right]^{-1}\left[\frac{1}{\Delta k}\left(\mathbf{S}_r-\mathbf{S}_f\right)\right]$$
• A sums of squares and cross-products (SSCP) matrix will ALWAYS be a p × p matrix! The difference between two SSCP matrices, like the right-hand term of the equation above, is also an SSCP matrix. It is the SSCP matrix for the effect that is tested.
• Sometimes a particular nomenclature is used to describe particular SSCPs. The error SSCP is called E (do not confuse with eigenvectors) and the difference between the error SSCPs of the two models (Sr − Sf) is called H, which is much like “hypothesis.” This nomenclature has some benefit. For example, one could state that the expected value of tr(H) is 0 under the null hypothesis. Thus, F can be written as
$$\mathbf{F} = \left[\frac{1}{n-k}\,\mathbf{E}\right]^{-1}\left[\frac{1}{\Delta k}\,\mathbf{H}\right]$$
Multivariate linear model & ANOVA
• Some sources take it a step further. Since an SSCP multiplied by the reciprocal of its degrees of freedom produces a covariance matrix, some prefer to do the following:

$$\mathbf{F} = \left[\frac{1}{n-k}\,\mathbf{E}\right]^{-1}\left[\frac{1}{\Delta k}\,\mathbf{H}\right] = \mathbf{W}^{-1}\mathbf{B}$$
• where W and B are the “within-group” and “between-group” covariance matrices, respectively. This approach obviously assumes one is doing a one-way “MANOVA” rather than a regression. However, the original formula is more general for comparing any two nested models.
• As for the second question, there is no way to get a P-value for a
matrix…
• But, there are a few multivariate coefficients that can be converted
to values that approximately follow F distributions. In some cases
(i.e., when Δk = 1), the values are said to be “Exact” F values.
• We will come back to what we can do with this F matrix next lecture
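• As a sketch of what this F matrix looks like in R (not part of the original slides; it assumes Y and group as they are defined in the Bumpus example later in this lecture):

lm.full <- lm(Y ~ group)
lm.red  <- lm(Y ~ 1)
E  <- crossprod(resid(lm.full))            # error SSCP of the full model
H  <- crossprod(resid(lm.red)) - E         # "hypothesis" SSCP for the group effect
n  <- nrow(Y)
k  <- nrow(coef(lm.full))                  # parameters per response in the full model
dk <- k - 1                                # change in parameters vs. the reduced model
F.mat <- solve(E / (n - k)) %*% (H / dk)   # p x p matrix; no single P-value attaches to it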
Multivariate linear model & ANOVA
• Let’s not forget assumptions!
• Multivariate linear model assumptions are similar to univariate
assumptions, with a few twists
1. The error is multivariate normal (not easy to test, but one can look at a PC plot of residuals and check for an elliptical shape of scatter; R has a multivariate Shapiro-Wilk test, as in the sketch after this list)
2. Homogeneity of covariance (there is a test for this – Box’s M – but it is overly sensitive and generally not used. One can examine the group covariance matrices – the off-diagonal elements should be similar across groups, as should the diagonal elements)
3. The relationship of dependent variables is linear.
4. Independent observation vectors (not values) of subjects
5. n >> p (this matters more for MANOVA than for the linear model)
6. Y is not rank deficient, i.e., it is full rank: its rank equals the number of variables. If one variable is a function of another, the matrix has more columns than it needs, and if many variables are highly correlated, rank deficiency might occur. (One ad-hoc solution is to do a PCA and keep the components with eigenvalues > 0. This matters more for MANOVA than for the linear model; multivariate test statistics assume a full-rank Y.) Note: rank deficiency in X is also a problem for either univariate or multivariate responses. Computers have trouble inverting XᵀX if X is not full rank. This is called multicollinearity: the variables are inherently related, and prediction should use a smaller set. (E.g., Assignment 8 – DPY and PercipIn – stepwise regression most likely removed one of these.)
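• For assumptions 1 and 6, here is a brief sketch (it assumes the lm.group fit and Y from the Bumpus example below; the multivariate Shapiro-Wilk test shown is from the contributed mvnormtest package, which is just one option):

library(mvnormtest)      # install.packages("mvnormtest") if needed
e <- resid(lm.group)     # residuals of the full multivariate model
mshapiro.test(t(e))      # multivariate Shapiro-Wilk; expects variables in rows
qr(Y)$rank               # equals ncol(Y) when Y is full rank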
Multivariate ANOVA
One-way MANOVA Steps
1. State Null/Alternative hypotheses
2. Define model (evaluate assumptions)
3. Calculate Sf for the error of the full model
4. Define the reduced “null” model (always contains just an intercept)
5. Calculate Sr for the error of the reduced model
6. Calculate a multivariate test statistic using Sf and/or Sr. (Note: one can use only Sf to get a test statistic if performing a randomization test.)
7. Evaluate the probability of the test statistic if the null hypothesis were true
   a. By converting the test statistic to an F stat, which approximately follows an F distribution
   b. By performing a randomization test to create an empirical probability distribution
8. Generate plots/tables
9. Maybe do a discriminant analysis
Multivariate ANOVA
Example MANOVA: Bumpus Data
1. State Null/Alternative hypotheses
• H0: Morphology does not differ between survivors and non-survivors, and
males and females (i.e., sex-survival groups)
• HA: Morphology differs between survivors and non-survivors, and males and
females (i.e., sex-survival groups) in some way.
$$H_0: \mathrm{tr}(\boldsymbol{\Sigma}_H) = 0, \qquad H_A: \mathrm{tr}(\boldsymbol{\Sigma}_H) > 0$$
where Σ_H is the population covariance matrix for the effect.
Since V = (1/Δk)H is a sample estimate of Σ_H, the trace of H is an adequate sample statistic for testing the null hypothesis. The expected value of tr(H) under the null hypothesis is 0.
Multivariate ANOVA
Example MANOVA: Bumpus Data
2. Define model (evaluate assumptions)
3. Calculate Sf for the error of the full model
> bumpus<-read.csv("bumpus.csv")
> attach(bumpus)
> # the following morphological variables will be used
> # AE, BHL, FL, TTL, SW, SKL
> Y<-cbind(AE,BHL,FL,TTL,SW,SKL)
> group<-factor(paste(sex,survived,sep='.'))
> lm.group<-lm(Y~group)
> lm.group
Call:
lm(formula = Y ~ group)
Coefficients:
                      AE        BHL        FL        TTL        SW        SKL
(Intercept)   241.571429  31.478571  18.027650  28.728307  15.279914  20.845236
groupf.TRUE    -0.571429  -0.045238   0.129721   0.319617  -0.037495  -0.035379
groupm.FALSE    5.734127   0.082540  -0.008467  -0.113796   0.015825   0.610709
groupm.TRUE     5.840336   0.215546   0.174687   0.120116   0.065670   0.887702
>
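• A quick way to see what this table means under R’s default treatment contrasts (a base-R check, not in the original slides): the (Intercept) row holds the means of the baseline group, f.FALSE, and each other row is that group’s deviation from the baseline.

colMeans(Y[group == "f.FALSE", ])                                     # ~ (Intercept) row
colMeans(Y[group == "m.TRUE", ]) - colMeans(Y[group == "f.FALSE", ])  # ~ groupm.TRUE row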
Multivariate ANOVA
Example MANOVA: Bumpus Data
2. Define model (evaluate assumptions)
3. Calculate Sf for the error of the full model
> # Consider assumptions......
> # One can go crazy trying to do tests
> # Almost any test will be too sensitive
> # They will suggest assumptions are violated
> # Yet the multivariate test stats are anti-conservative
> # So testing assumptions is rather futile
> # Better to look at residuals
>
> e<-resid(lm.group)
> pca.e<-princomp(e,cor=F)
> yh<-predict(lm.group)
> pca.yh<-princomp(yh,cor=F)
> par(mfrow=c(2,2))
> plot(pca.e,main="distribution of eigenvalues")
> plot(pca.e$scores,asp=1,main="best 2-D view of residuals")
> plot(pca.yh$scores[,1],pca.e$scores[,1],main="best residual vs predicted view")
> plot(group,pca.e$scores[,1],main="best residual vs group view")
Multivariate ANOVA
Example MANOVA: Bumpus Data
2. Define model (evaluate assumptions)
3. Calculate Sf for the error of the full model
[Four diagnostic panels: distribution of eigenvalues (scree plot of pca.e); best 2-D view of residuals (Comp.2 vs Comp.1); best residual vs predicted view (PC1 of residuals vs PC1 of predicted values); best residual vs group view (PC1 of residuals by group: f.FALSE, f.TRUE, m.FALSE, m.TRUE)]
Multivariate ANOVA
Example MANOVA: Bumpus Data
2. Define model (evaluate assumptions)
3. Calculate Sf for the error of the full model
> Sf<-t(e)%*%e
> Sf
            AE        BHL       FL        TTL       SW        SKL
AE   2966.8490  226.59416 256.26670 433.30517 111.23737 285.34416
BHL   226.5942   65.18760  35.11097  57.50294  18.80858  41.58504
FL    256.2667   35.11097  49.67377  68.00522  16.19890  35.69328
TTL   433.3052   57.50294  68.00522 141.80487  24.52523  55.61453
SW    111.2374   18.80858  16.19890  24.52523  19.39935  18.06867
SKL   285.3442   41.58504  35.69328  55.61453  18.06867 115.81108
> # potential homogeneity of covariance problem…
Multivariate ANOVA
Example MANOVA: Bumpus Data
4. Define the reduced “null” model (always contains just an intercept)
5. Calculate Sr for the error of the reduced model
> lm.null<-lm(Y~1)
> e.r<-resid(lm.null)
> Sr<-t(e.r)%*%e.r
> Sr
            AE        BHL       FL        TTL       SW        SKL
AE   4115.0294  261.26912 263.98818 410.11550 123.17917 435.47590
BHL   261.2691   66.59993  35.79883  57.34511  19.31347  46.82664
FL    263.9882   35.79883  50.64240  69.25259  16.41615  37.77880
TTL   410.1155   57.34511  69.25259 144.59029  24.40976  54.03802
SW    123.1792   19.31347  16.41615  24.40976  19.58572  19.88596
SKL   435.4759   46.82664  37.77880  54.03802  19.88596 136.92128
> # compare to Sf
> Sf
            AE        BHL       FL        TTL       SW        SKL
AE   2966.8490  226.59416 256.26670 433.30517 111.23737 285.34416
BHL   226.5942   65.18760  35.11097  57.50294  18.80858  41.58504
FL    256.2667   35.11097  49.67377  68.00522  16.19890  35.69328
TTL   433.3052   57.50294  68.00522 141.80487  24.52523  55.61453
SW    111.2374   18.80858  16.19890  24.52523  19.39935  18.06867
SKL   285.3442   41.58504  35.69328  55.61453  18.06867 115.81108
Because AE has such a large variance compared to the other variables, the Sf matrix might violate the homogeneity of covariance assumption (i.e., the MANOVA largely devolves into an ANOVA of AE)
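• One informal follow-up, as suggested in the assumptions slide, is to print each group’s covariance matrix and compare the diagonal and off-diagonal elements by eye (a sketch, not part of the original slides):

by(as.data.frame(Y), group, cov)   # one covariance matrix per sex-survival group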
Multivariate ANOVA
Example MANOVA: Bumpus Data
6. Calculate a multivariate test statistic using Sf and/or Sr. (Note: one can use only Sf to get a test statistic if performing a randomization test.)
• There are several multivariate test statistics that can be used (e.g., check out
Wikipedia for “MANOVA”)
• It can be debated which are the best (although the statistics differ appreciably only in small samples or when significance is borderline)
• The following is all we will consider in order to remain practical
$$H_0: \mathrm{tr}(\boldsymbol{\Sigma}_H) = 0, \qquad H_A: \mathrm{tr}(\boldsymbol{\Sigma}_H) > 0$$
where Σ_H is the population covariance matrix for the effect.
Since V = (1/Δk)H is a sample estimate of Σ_H, the trace of H is an adequate sample statistic for testing the null hypothesis. The expected value of tr(H) under the null hypothesis is 0.
• There are two basic ways to make this test statistic more “recognizable”:
• Pillai’s trace: PT = tr[(H + E)⁻¹H] → resembles an R² value
• Hotelling-Lawley trace: HLT = tr[E⁻¹H] → resembles an F value (without df)
• Both of these test statistics can be converted to values that approximately follow F distributions, or one can randomize rows of Y and create error distributions of tr(H)
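• Both traces are easy to compute directly from the SSCP matrices built in steps 3 and 5 (a sketch using the Sf and Sr objects from the Bumpus example; the results should match what summary(manova()) reports in step 7a):

H   <- Sr - Sf                         # effect ("hypothesis") SSCP
E   <- Sf                              # error SSCP
PT  <- sum(diag(solve(E + H) %*% H))   # Pillai's trace
HLT <- sum(diag(solve(E) %*% H))       # Hotelling-Lawley trace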
Multivariate ANOVA
Example MANOVA: Bumpus Data
7. Evaluate the probability of the test statistic if the null hypothesis were true
   a. By converting the test statistic to an F stat, which approximately follows an F distribution
> # simple solution in R
> summary(manova(lm.group))
            Df  Pillai approx F num Df den Df    Pr(>F)    
group        3 0.51629   4.4692     18    387 7.506e-09 ***
Residuals  132                                             
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> summary(manova(lm.group),test="Hotelling")
            Df Hotelling-Lawley approx F num Df den Df    Pr(>F)    
group        3          0.94039   6.5653     18    377 2.758e-14 ***
Residuals  132                                                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
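• Not shown in the slides, but the same call reports the other standard statistics through its test argument (partial matching, as with "Hotelling" above, is allowed):

summary(manova(lm.group), test = "Wilks")
summary(manova(lm.group), test = "Roy")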
Multivariate ANOVA
Example MANOVA: Bumpus Data
7. Evaluate the probability of the test statistic if the null hypothesis were true
   b. By performing a randomization test to create an empirical probability distribution
> # Randomization approach
> H<-Sr-Sf
> t.H<-sum(diag(H)) # this is the test stat
> permute<-999
> result<-t.H; P<-1
> for(i in 1:permute){
+   Y.r<-Y[sample(nrow(Y)),] # this randomizes row vectors
+   # important to realize that all values for each subject
+   # are held together when shuffling values
+   lm.r<-lm(Y.r~group)
+   Sf.r<-t(resid(lm.r))%*%resid(lm.r)
+   t.H.r<-sum(diag(Sr-Sf.r))
+   result<-c(result,t.H.r)
+   P<-ifelse(t.H.r>=t.H,P+1,P+0)
+ }
>
> # error distribution
> hist(result,xlab="trace(H)",breaks=50,col="blue")
> rug(result)
> rug(result[1],lwd=2,col='red')
>
> # Results
> P.value<-P/(permute+1)
> R2<-sum(diag(solve(Sr,H))) # equals Pillai's trace, tr[(H+E)^-1 H], since Sr = H + E here
> t.H
[1] 1174.643
> R2
[1] 0.5162905
> P.value
[1] 0.001
[Histogram of the empirical distribution of trace(H); the observed value (1174.6) is marked with a red rug tick]
Multivariate ANOVA
Example MANOVA: Bumpus Data
8. Generate plots/tables
9. Maybe do a discriminant analysis
• These two are linked because we are in a position where we might want to do multiple comparisons. Therefore, what we plot or how we present a table might be contingent upon what a multiple comparisons test reveals
• Also, multiple comparisons can be done several ways
• These topics will be explored in the next lecture