General Linear Model (GLM)

The GLM gives us a convenient, code-efficient way to conduct statistical tests with
design matrices. These matrices contain all the relevant group-membership
assignments, and linear algebra applied to them yields sums over groups,
coefficients, and so on.
The first thing we need is a design matrix, given by X, of dimensions
[n x (p+1)]. So, we have rows of subjects and columns of predictors, with an extra
concatenated column of all 1s (which allows the mean to enter each
calculation). This is essentially nothing more than our multiple regression equation,
but in matrix notation.
The treatment (predictor) columns receive a 1 for each subject that was in that
treatment and a 0 if that subject was not. Now, when we multiply this matrix by a
vector that is [(p+1) x 1] we will get a vector of size [n x 1] that we can then add an
error vector of the same dimensions to. This essentially yields a model associated
with each subject’s response.
However this matrix is slightly redundant and can be revised. We don’t need all
treatments to be represented, because if we know a subject isn’t in A or B, then they
are, by default, in C. Furthermore, we can remove the mean from the equation since
it doesn’t have any variance and can just be added in as a constant later.
By removing these columns we change the scope of our question: we are now comparing
groups A and B to C instead of to the grand mean. However, we don't want that, so we
need to make the mean of each column 0 so that we measure in relation to the grand
mean instead of the left-out group. We can do that by setting both columns A and B
to -1 whenever the subject was a member of group C. So, now:

$$X_{\text{dims}} = [n \times (p - 1)]$$
Akin to an ANOVA, the intercept term that results from this equation is equal to the
grand mean of Y (whereas with dummy coding it would be the mean of the control
group). Our regression coefficients will represent the estimated treatment effects
(the mean of a group minus the grand mean; with dummy coding they would be comparisons
to control), and our R-squared will be the same as eta-squared, because both
estimate the percentage of variation in the dependent variable that is accounted for
by variation among treatments.
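As a concrete sketch (toy numbers, not from the text), we can build this effect-coded design matrix in NumPy and confirm that the intercept recovers the grand mean:

```python
import numpy as np

# Hypothetical scores for 9 subjects, 3 per group (A, B, C).
groups = np.array(["A"] * 3 + ["B"] * 3 + ["C"] * 3)
y = np.array([5.0, 6.0, 5.5, 8.0, 7.5, 8.5, 4.0, 3.5, 4.5])

# Effect coding: each column is 1 for its own group, -1 for the left-out
# group C, and 0 otherwise, so every column has mean 0 in a balanced design.
col_a = np.where(groups == "A", 1.0, np.where(groups == "C", -1.0, 0.0))
col_b = np.where(groups == "B", 1.0, np.where(groups == "C", -1.0, 0.0))

# Design matrix: a column of 1s for the intercept plus the coded columns.
X = np.column_stack([np.ones(len(y)), col_a, col_b])

# Least squares: b[0] is the grand mean of y; b[1] and b[2] are the
# treatment effects (group mean minus grand mean).
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)  # approximately [5.83, -0.33, 2.17] for these numbers
```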
If we want to start adding in factorial designs and interaction terms, we need to split
up the levels of each main effect and also create a column for each interaction. For
a main effect, we set up a within-column comparison: if factor A has two levels, we
give 1 to the first level of A and -1 to the second level of A (giving the
column a mean of 0). The same goes for B. The interaction column is nothing more
than a multiplication of the columns that the interaction relates to. If there are 3
levels, use 0 if the subject is not in the current one, 1 if in the current one, and
-1 if in the third (same setup as before).
A word about different types of Sum of Squares:

Type III adjusts for all other effects:
$$SS_{AB} = SS_{\text{regression}(A,B,AB)} - SS_{\text{regression}(A,B)}$$

Type II ignores the interaction for main effects:
$$SS_{A} = SS_{\text{regression}(A,B)} - SS_{\text{regression}(B)}$$
Type I SS depends on hierarchy. The highest effect in the model simply gets
the SS associated with a regression run with that effect as the sole predictor.
Each subsequent effect is measured as the difference between the current effect
combined with the previous effects, compared to the previous effects alone.
So, if you are interested in A, then B, then AB, you would use
$SS_{\text{regression}(A)}$ as $SS_A$, then
$SS_B = SS_{\text{regression}(A,B)} - SS_{\text{regression}(A)}$, then
$SS_{AB} = SS_{\text{regression}(A,B,AB)} - SS_{\text{regression}(A,B)}$.
We can examine the effects of our interactions. If the interaction term accounts for
any of the variation in Y, then removing the interaction predictors from the model
should lead to a decrease in accountable variation. We calculate R² with and without
the interaction term(s) and observe the difference in our SSregression. We can do
the same thing to test a main effect: find the SSregression of the model that
includes our effect of interest, then get the SSregression when it is not included.
This yields SS for the effect of interest. Then, we use the SSresidual from our full
model as the error term when we wish to compute an F-statistic. As an example:

$$SS_{\text{regression}(A,B,AB)} - SS_{\text{regression}(B,AB)} = SS_A$$

$$F = \frac{MS_A}{MS_{\text{residual}(A,B,AB)}}$$
The df for our error term is (N-a*b) and the df for A is simply (a-1).
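Continuing the statsmodels sketch above, this SS-difference test is just a nested-model comparison:

```python
# SSregression(A,B,AB) - SSregression(A,B), tested over the full model's
# MSresidual: here, the F for the dropped AB interaction.
no_inter = smf.ols("y ~ C(A, Sum) + C(B, Sum)", data=df).fit()
print(anova_lm(no_inter, model))
```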
Analysis of Covariance (ANCOVA)
What if a variable of non-interest is driving the change in our dependent variable? It
might account for some of the variance, but it's not what we are interested in.
Therefore, we want to see if our predictors can account for variance in the
dependent variable above and beyond these covariates, so we partial out the
variance that can be attributed to the covariate. The covariate is usually correlated
with the DV; if we did our job of random assignment properly, including the covariate
will reduce the error term. However, it shouldn't be correlated with the IV. If it is,
then the treatment means get adjusted.
Essentially, what we want to do is obtain “adjusted means” for our treatment effects,
where the means are “adjusted” to be what they would be if treatments had not
varied on the covariate. Then we use an ANCOVA to test whether these adjusted
means differ significantly by using an error term from which the variance
attributable to the covariate has been partialled out. The analysis is not much
different from a normal ANOVA, except that we want to remove the effect of the
covariate.
Before we start with an ANCOVA we need to meet the assumption of homogeneity of
regression. This states that the regression coefficients are equal across treatments
(that they are parallel). Data that violates this can’t be analyzed by an ANCOVA
because it would be impossible to estimate a common slope between the covariate
and the DV.
Basically, we are just comparing SSregression when the covariate is included versus
when it is not. The difference between a model that includes the
covariate and one that does not is the variation attributable to treatment effects
over and above that attributable to the covariate.
Formally, the full ANCOVA model is given by:

$$Y_{ij} = \tau_j + c + c\tau_j + e_{ij}$$
The covariate-by-treatment interaction term ($c\tau_j$) is what tests the
homogeneity of regression. If the regression lines are parallel, and thus the slopes
that could be calculated for each group separately are homogeneous, then we have
homogeneity of regression, and deleting the interaction term should produce
only a trivial decrement in the percentage of accountable variation.
We can compare the obtained R² values with and without the covariate interacting
with a particular treatment by running an F-test on the difference between the two
models. If we don't reject the null, then we can remove that interaction of covariate
and treatment, but still keep the covariate in.
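A minimal sketch of this check in statsmodels, with invented data (one covariate, three treatment groups):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Invented ANCOVA data: y depends on the covariate plus a treatment bump.
rng = np.random.default_rng(1)
n = 30
anc = pd.DataFrame({
    "treatment": np.repeat(["T1", "T2", "T3"], n // 3),
    "covariate": rng.normal(50, 10, n),
})
anc["y"] = (0.4 * anc["covariate"]
            + np.where(anc["treatment"] == "T2", 3.0, 0.0)
            + rng.normal(0, 2, n))

# Model with covariate-by-treatment interaction vs. parallel-slopes ANCOVA.
full = smf.ols("y ~ C(treatment) * covariate", data=anc).fit()
parallel = smf.ols("y ~ C(treatment) + covariate", data=anc).fit()

# A non-significant F here means homogeneous slopes: drop the interaction,
# keep the covariate.
print(anova_lm(parallel, full))
```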
We always want the SS residual of the full model (including the covariate) to be our
SS error for subsequent testing.
The difference between SSregression with and without the group-membership
predictors must be the amount of the sum of squares attributable to treatment over
and above the amount that can be explained by the covariate. This difference is the
adjusted SStreatment, since we have removed the covariate's contribution from the
full model and essentially set SStreatment to what it would be if the covariate
played no role (i.e., if each treatment group were evaluated at the same covariate
level). This subtraction adjusts for any effects of the covariate and can be seen
formally as:

$$SS_{\text{treatment(adjusted)}} = SS_{\text{regression}(\tau, c)} - SS_{\text{regression}(c)}$$
We can also do the reverse to estimate SS covariate to see how powerful our
covariate is as a predictor of non-interest.
The df for covariates is simply the number of covariates and the df for error is N-k-c.
We can test our MS adjusted treatment (with df k-1) against the MS error where we
use the SS residual from the full model (covariate included) to obtain an F-statistic.
In order to interpret the results of a significant adjusted treatment effect, we need to
obtain the treatment means adjusted for the effects of the covariate: an estimate of
what the mean for a specific treatment would be if the groups had not differed on
the covariate. We are simply trying to predict $\bar{Y}'_j$. To do so, we use the
mean of the covariate instead of each individual covariate value, multiply it by the
covariate slope from the full model, and add in the regression coefficients times
the effect codes for the treatment group of interest. Then add the intercept and
you will yield an adjusted group mean.
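Continuing that sketch, the adjusted means are just the parallel-slopes model's predictions with every group evaluated at the grand mean of the covariate:

```python
# One prediction per treatment group, all at the mean covariate value.
grid = pd.DataFrame({
    "treatment": ["T1", "T2", "T3"],
    "covariate": [anc["covariate"].mean()] * 3,
})
print(parallel.predict(grid))  # the adjusted group means
```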
Any individual comparisons among treatments would now be made from these
adjusted means. However, we need to modify our error term rather than just using
the SS residual from the full model: we use $MS'_{\text{error}}$ (the full-model
error) in conjunction with $SS_{e(c)}$, the error term from the analysis of variance
performed on the covariate. We also take the mean covariate for each treatment of
interest, forming the following comparison of adjusted group means:

$$F_{(1,\,N-a-1)} = \frac{(\bar{Y}'_j - \bar{Y}'_k)^2}{MS'_{\text{error}}\left[\left(\frac{1}{n_j} + \frac{1}{n_k}\right) + \frac{(\bar{C}_j - \bar{C}_k)^2}{SS_{e(c)}}\right]}$$
As for calculating an effect size: if the covariate naturally varies in the
population, then we can obtain

$$\eta^2 = \frac{SS_{\text{adjusted treatment}}}{SS_{\text{total}}}$$
We can also yield this eta-squared by taking the difference between the R²
from a model predicting the dependent variable from only the covariate and the R²
from one predicting it from both the covariate and the treatment. This difference is
the contribution of treatment to explained variation after controlling for the
covariate.
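Continuing the ANCOVA sketch above, this R²-difference version of eta-squared is a one-liner:

```python
# Treatment's added share of explained variance, after the covariate.
cov_only = smf.ols("y ~ covariate", data=anc).fit()
both = smf.ols("y ~ covariate + C(treatment)", data=anc).fit()
print(both.rsquared - cov_only.rsquared)
```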
We can also calculate a d-family statistic in the normal format of

$$d = \frac{\psi}{\sigma}$$

where $\psi$ is the difference between two means, and $\sigma$ is the square root of
our MS error term from the full model (SS residual), or the error from a control
group if we are comparing means against a control.
We typically want to use our covariate to help reduce the error term, because our
treatment groups should be randomly assigned and the subjects within them
should not vary aside from pure error.
If we want to compare adjusted means in a weighted fashion (comparing one mean
to two others like in a linear contrast) then we need a new error term:

$$MS''_{\text{error}} = MS'_{\text{error}}\left[1 + \frac{SS_{g(c)}/(g-1)}{SS_{e(c)}}\right]$$
Where anything subscripted with (c) is a sum of squares from the analysis of
variance run on the covariate itself (i.e., the covariate as the outcome, with
group membership as the predictor).
Then we do our typical linear contrast to yield $\psi$ and get an F-statistic via:

$$F = \frac{\psi^2}{\sum a_i^2 \, MS''_{\text{error}}}$$
If we want to do a test of simple effects (between cells) then we still need to adjust
our error term further:

$$MS''_{\text{error}} = \frac{2MS'_{\text{error}}}{n}\left[1 + \frac{SS_{\text{cells}(c)}/(tg-1)}{SS_{e(c)}}\right]$$

Where $SS_{\text{cells}(c)}$ is from an ANOVA on the covariate.
Meta-Analysis
A meta-analysis averages the results of many studies on a single topic.
We want to first plot each study on a separate line in a forest plot, where we can
indicate effect size and the confidence interval on that effect size.
We must decide if what we are trying to measure is a fixed or random effect. A fixed
effect model assumes that there is one true effect size that we are trying to estimate
by looking at the results of multiple experiments. If you were an astronomer
attempting to measure the luminosity of a particular star, it is reasonable to think
that it does have one true luminosity and the difference between the measurements
of you and your colleagues is just error variance. Thus, we assume that the only
reason for variability in measurement is random sampling error. If each of our
studies had an infinite number of participants, all studies would come to the same
result because they are all measuring the same thing. With regard to other variables,
like depression, the waters are muddier: it may vary based on gender, family
settings, and a host of other variables. Random effect models, thus, will assume that
the true effects are randomly and normally distributed around some value. So, we
insert random error into our random model because the true effects we are aiming
for may well differ from study to study and are not all equal to some overall mean
effect.
So, a random-effects model includes a term for the difference between each study's
true effect and the grand mean of effects, whereas a fixed-effect model does not have
this term because it assumes that the grand mean is the true effect.
With a meta-analysis we want to calculate the overall effect by weighting the effect
size of each study. To do this we use the inverse of the variance estimate:

$$W_{\text{current study}} = \frac{1}{s^2_{d,\,\text{current study}}}$$
We remember d as the difference between a treatment and a control mean over the
standard deviation of the control. Its sampling variance is:

$$s^2_{d} = \frac{n_1 + n_2}{n_1 n_2} + \frac{d^2}{2(n_1 + n_2)}$$

We can use this to construct a confidence interval by adding/subtracting
$1.96 \cdot s_d$ (the square root of the variance) to d.
When looking at multiple studies, we want to yield the overall effect:

$$\bar{d} = \frac{\sum W_i d_i}{\sum W_i}$$
We can compute confidence intervals around this overall effect to see if it includes 0
by getting the standard error of the overall effect:

$$s_{\bar{d}} = \sqrt{\frac{1}{\sum W_i}}$$
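A minimal fixed-effect sketch with invented per-study effects and variances:

```python
import numpy as np

# Hypothetical effect sizes (d) and their sampling variances, one per study.
d = np.array([0.40, 0.55, 0.30, 0.62])
var_d = np.array([0.04, 0.02, 0.05, 0.03])

w = 1.0 / var_d                     # inverse-variance weights
d_bar = np.sum(w * d) / np.sum(w)   # weighted overall effect
se = np.sqrt(1.0 / np.sum(w))       # standard error of the overall effect
print(d_bar, d_bar - 1.96 * se, d_bar + 1.96 * se)  # estimate and 95% CI
```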
However, we also want to make sure that our effects are measuring the same thing.
We check this by measuring the heterogeneity of a fixed-effect model with the
statistic Q, which is simply a weighted sum of squared deviations of each study
from the mean effect size (analogous to SS between):

$$Q = \sum_{i=1}^{k} W_i (d_i - \bar{d})^2$$
We can test Q against the chi-square distribution on k - 1 df (for k studies).
If we want to test the same thing for a random-effects model, we estimate how much
each study's true effect varies around the average effect:

$$T^2 = \frac{Q - df}{C}; \qquad C = \sum W_i - \frac{\sum W_i^2}{\sum W_i}$$
Keep in mind that Q measures the differences among effect-size measures, while df
is the expected variability if the null hypothesis is true. So the numerator of $T^2$
is the excess variability that cannot be attributed to random differences among studies.
You can think of C as analogous to the within groups term.
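Continuing that sketch, Q and then T²:

```python
# Heterogeneity Q: weighted SS of study effects around the overall effect.
Q = np.sum(w * (d - d_bar) ** 2)
df_q = len(d) - 1                             # expected Q under the null
C = np.sum(w) - np.sum(w ** 2) / np.sum(w)    # the scaling term C
tau2 = max(0.0, (Q - df_q) / C)               # T^2, floored at 0 when Q < df
print(Q, tau2)
```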
We can compare Qs within subgroups of our studies as well. Say one group of studies
did an intervention on rainy days and another did it on sunny days; what if there is
a meta-effect? We can calculate Q for each subgroup and test whether the Qs differ
by comparing the sum of the subgroup Qs to the Q calculated using every study at
once. The Q representing the difference between these two is tested, under the null
hypothesis that the effect size is the same for all groups, against the chi-square
distribution on g - 1 df, where g is the number of groupings we split our collection
of studies into. If the variance between studies is not significant, then we have a
fixed-effect model.
We can do these Q statistics with any effect measure we want; if we are dealing with
risk ratios, though, we will always want to take the log first.
Non-Parametric Testing
Your data doesn't meet the distribution assumptions of the test you want to use?
Fine. Make your own distribution. These tests will generally be robust to outliers.
In bootstrapping we assume that the shape of the population is accurately reflected
in the distribution of the sample we have acquired. So, we draw samples from
our own sample over and over again, with replacement (we pull one value and
then allow that value to be an option again on our second pull), in order to
create new samples that are just shufflings (with some duplications) of our current
sample. We do this sampling 10,000 times, determine the extremes of the resulting
distribution (the 2.5% at each end), and see if our actual, observed result falls
outside them and is considered extreme. We mainly use bootstrapping, though, for
deriving estimates of variation; we can do this for regressions to get the standard
error of our beta coefficients as well.
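A minimal bootstrap sketch (invented sample) for the standard error and confidence interval of a mean:

```python
import numpy as np

rng = np.random.default_rng(2)
sample = rng.normal(100, 15, 40)   # stands in for the one sample we observed

# Resample the sample itself, with replacement, 10,000 times.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(10_000)
])
print(boot_means.std(ddof=1))                  # bootstrap SE of the mean
print(np.percentile(boot_means, [2.5, 97.5]))  # percentile 95% CI
```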
Resampling (a permutation test) does a similar thing: it shuffles group
assignment randomly, runs the statistical test on the shuffled groups, and forms a
distribution of results by repeating this 10,000 times. Then we compare our actual
statistic to this distribution. We do the same for paired samples by drawing a large
number of samples of 19 difference scores, randomly assigning positive and negative
signs to the differences, and then calculating the median of the differences. Do this
10,000 times and compare to the median difference of the true, non-shuffled dataset.
We can get a p-value by seeing how many values out of the 10,000 are at or above our
observed, actual statistic.
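A sketch of a two-group permutation test on invented scores:

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.normal(5.0, 1.0, 12)   # hypothetical group A scores
b = rng.normal(5.8, 1.0, 12)   # hypothetical group B scores
observed = b.mean() - a.mean()

pooled = np.concatenate([a, b])
null = np.empty(10_000)
for i in range(10_000):
    rng.shuffle(pooled)        # reshuffle the group labels
    null[i] = pooled[12:].mean() - pooled[:12].mean()

# Two-tailed p: the share of shuffles at least as extreme as observed.
print(observed, np.mean(np.abs(null) >= abs(observed)))
```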
We can do this for correlations as well: sample XY pairs with replacement, calculate
the correlation among the sampled pairs each time, and compare our actual r to the
resulting bootstrapped distribution.
We can also use more straightforward nonparametric tests.
If we want the nonparametric analogue to an independent t-test, we would use
Wilcoxon's rank-sum test. This test is especially sensitive to population differences
in central tendency. All you have to do is rank all N scores without regard to group
membership and then sum the rank numbers dependent on their group
membership. It would make sense that the sums should be relatively even if there is
no ranked ordering to the groups. We then compare the sum for the smaller group
(with unequal ns), or the smaller of the two sums, to the critical value for $W_s$ in
the Wilcoxon table. This table tells us the smallest value of $W_s$ that we would
expect to obtain by chance if the null hypothesis were true; which entry we use
depends on the number of subjects in each group, where $n_1$ is always the number of
subjects in the smaller group. We will only be able to reject the null if the sum of
the ranks for the smaller group is sufficiently small. This might not make sense,
because what if the smaller group is the higher-scoring one? Well, that's why we
test the opposite as well, which is:

$$W'_s = 2\bar{W} - W_s; \qquad 2\bar{W} = n_1(n_1 + n_2 + 1)$$
Compare that to our normal Ws and submit the smaller of them to the table for
testing.
If there are ties in our ranks, the usual fix is to give each tied score the average of the ranks involved, so 1, 2.5, 2.5, 4 would be an order.
We can also shuffle our group labels on the ranks and run a permutation test to
observe our obtained Ws as compared to a distribution of random samplings.
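scipy implements this test directly; a toy sketch with invented scores (scipy reports the normal approximation for the rank-sum):

```python
from scipy import stats

g1 = [12, 15, 9, 20, 14]           # smaller group (n1 = 5)
g2 = [22, 25, 18, 30, 27, 21]      # larger group (n2 = 6)
print(stats.ranksums(g1, g2))      # Wilcoxon rank-sum, normal approximation
print(stats.mannwhitneyu(g1, g2))  # the equivalent Mann-Whitney U form
```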
We can do the same thing for matched samples by calculating difference scores,
ordering them by magnitude (absolute value) and then summing the positive ranks
and the negative ranks as T+ and T-. We can then compare this to a T table where
we use the smaller of T+ and T- to submit to the test.
If there is a difference score of 0, it would be advised to eliminate that participant
from consideration.
We can also do a sign test that looks only at the sign of the differences and asks
how likely a given arrangement of +s and -s is, which amounts to submitting the
number of +s and the number of -s to a chi-square goodness-of-fit test to see
whether the observed split differs from the expected one (an even split of + and -).
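A scipy sketch of both matched-samples tests on invented difference scores (the sign test here in its equivalent binomial form):

```python
import numpy as np
from scipy import stats

# Hypothetical difference scores for matched pairs (zeros already dropped).
diffs = np.array([2.0, -1.0, 3.0, 0.5, 1.5, -0.5, 2.5, 1.0])
print(stats.wilcoxon(diffs))   # signed-rank: T is the smaller of T+ and T-

# Sign test: is the split of +s and -s compatible with 50/50?
n_pos = int(np.sum(diffs > 0))
print(stats.binomtest(n_pos, n=diffs.size, p=0.5))
```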
The non-parametric equivalent of an ANOVA is the Kruskal-Wallis One-Way ANOVA
where we calculate an H statistic that is scaled.
We rank all scores without regard to group membership and then compute the sum
of the ranks for each group and weight and scale them to fit with the chi-square
distribution:

$$H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1)$$
We can then evaluate this H on the Chi-Square distribution with k-1 df.
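A toy Kruskal-Wallis sketch in scipy (invented groups):

```python
from scipy import stats

# H is referred to the chi-square distribution with k - 1 = 2 df.
g1, g2, g3 = [6, 7, 5, 8], [9, 11, 10, 12], [4, 3, 5, 6]
print(stats.kruskal(g1, g2, g3))
```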
Lastly, if we want to mirror a repeated-measures ANOVA, then we can use
Friedman's rank test for k correlated samples, where we rank the scores within a
subject (or level/row) and then get the sums and their variance across subjects. For
example, if a subject takes 3 tests, we rank their scores and then see if, consistently,
subjects are getting the best score on test 2. To obtain a statistic to compare to the
chi-square distribution:

$$\chi^2_F = \frac{12}{Nk(k+1)} \sum_{i=1}^{k} R_i^2 - 3N(k+1)$$
$R_i$ is the sum of the ranks for condition $i$, which we go on to square.
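A toy Friedman sketch in scipy (invented scores; each position across the three lists is one subject):

```python
from scipy import stats

# Five subjects, each ranked across k = 3 tests.
test1 = [7, 6, 8, 5, 7]
test2 = [9, 8, 9, 8, 9]
test3 = [6, 5, 7, 4, 6]
print(stats.friedmanchisquare(test1, test2, test3))
```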
If our N is greater than 50 in any of these nonparametric tests, we should use
a normal approximation.