Coding a Categorical Predictor

Dummy codes binarize group membership; effect codes are built so that the mean of each coded vector is 0. These coded vectors are used as X values, which are predictors of an ordinary (continuous) Y vector.
When entered into a regression, the group you coded as 0 on the dummy is your "control
group." Therefore, the coefficient associated with the dummy vector after a
regression is the difference between the group coded as 1 (treatment
group) and the group coded as 0 (control group). Furthermore, the intercept
coefficient will be the mean of the control group. This can be illustrated as follows: if
you put in 0 for X, you are left with just the intercept, but if you enter 1, you
add whatever b1 is to that intercept.
With effect codes, the intercept represents the grand mean. The b-coefficient
assigned to an effect-coded vector is then the difference between that treatment group's
mean and the grand mean.
When we have more than just two groups, we need to be a bit cleverer about our
coding. We need to have multiple dummy coded/effect vectors.
Group         D1   D2   E1   E2
Treatment 1    1    0    1    0
Treatment 2    0    1    0    1
Control        0    0   -1   -1
The group that has 0/-1 in each vector is the control group.
Dummy Interpretation:
Intercept: mean of the control group.
b1: the difference between the group that had 1 in the first vector and the control
group.
b2: the difference between the group that had 1 in the second vector and the control
group.
Effect Interpretation:
Intercept: grand mean (with unequal n, this is the unweighted average of the group means).
b1: the difference between the group that had 1 in the first vector and the grand
mean. This is akin to the effect of treatment 1. With unequal n, the comparison is to the
unweighted average of the group means rather than the weighted grand mean.
b2: the difference between the group that had 1 in the second vector and the grand
mean. This is akin to the effect of treatment 2. With unequal n, the comparison is to the
unweighted average of the group means rather than the weighted grand mean.
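As a concrete illustration, here is a minimal sketch (assuming made-up data for two treatment groups and a control, equal n) showing that the dummy-coded intercept recovers the control mean while the effect-coded intercept recovers the grand mean:

import numpy as np

# Hypothetical data: treatment 1, treatment 2, control; 5 subjects each
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(12, 2, 5),   # treatment 1
                    rng.normal(10, 2, 5),   # treatment 2
                    rng.normal(8, 2, 5)])   # control
group = np.repeat([0, 1, 2], 5)             # 2 = control

# Dummy codes: control is 0 on both vectors
D1 = (group == 0).astype(float)
D2 = (group == 1).astype(float)

# Effect codes: control gets -1 on both vectors
E1 = np.where(group == 2, -1.0, D1)
E2 = np.where(group == 2, -1.0, D2)

ones = np.ones_like(y)
b_dummy, *_ = np.linalg.lstsq(np.column_stack([ones, D1, D2]), y, rcond=None)
b_effect, *_ = np.linalg.lstsq(np.column_stack([ones, E1, E2]), y, rcond=None)

print(b_dummy)   # [control mean, group 1 - control, group 2 - control]
print(b_effect)  # [grand mean, group 1 - grand mean, group 2 - grand mean]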
If we think of effect codes in terms of ANOVA, each subject’s score represents
contributions of the grand mean + treatment effect + error.
𝑌 = 𝑏0 + 𝑏1 𝐸1 + 𝑏2 𝐸2 + 𝑒
This represents any one subject's score. The equation simplifies depending on group, since
one or both effect codes are 0 for a given group. When the effect codes are -1 and
-1 (as they are for the control group), then we see:
𝑌 = 𝑏0 − (𝑏1 + 𝑏2 ) + 𝑒
What if our ns are unequal? For dummy codes, it's the same thing. However, for
effect codes, the mean of each vector needs to be 0. If it isn't, our interpretations
change slightly. So we should weight our effect codes so that our
interpretations still hold.
Instead of using -1 for the control group, we should instead use values that are
based on the sample sizes of the groups involved in the contrast.
Basically, you set up your binary effect codes as usual, and then for the control group
you use the negative of the ratio of the treatment 1 n to the control group n. Then repeat for the
second group. If we had 6 in treatment 1, 5 in treatment 2, and 10 in the control group:
effect code for the control group on vector 1 = -(n treatment 1 / n control) = -(6/10) = -0.6
(and on vector 2, -(n treatment 2 / n control) = -(5/10) = -0.5)
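A minimal sketch of these weighted effect codes (assuming the 6/5/10 layout above, ordered treatment 1, treatment 2, control) showing that each vector now has mean zero:

import numpy as np

# Hypothetical unequal-n example from the notes: n1 = 6, n2 = 5, n_control = 10
n1, n2, nc = 6, 5, 10

# Vector 1: treatment 1 coded 1, treatment 2 coded 0, control coded -(n1 / nc)
e1 = np.concatenate([np.ones(n1), np.zeros(n2), np.full(nc, -n1 / nc)])
# Vector 2: treatment 1 coded 0, treatment 2 coded 1, control coded -(n2 / nc)
e2 = np.concatenate([np.zeros(n1), np.ones(n2), np.full(nc, -n2 / nc)])

print(e1.mean(), e2.mean())  # both 0: each weighted effect vector has mean zero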
We aim to have orthogonality (independence) amongst our vectors of effect
codes. A zero-order correlation is necessary but not sufficient to define
orthogonality (a zero correlation does not by itself mean the vectors are independent;
the relationship could be curvilinear and still show no linear relationship). Orthogonal vectors can be
used to code a priori comparisons among group means. A full set of orthogonal
comparisons includes (number of levels) -1 comparisons, which exhaust the
available information.
We can contrast groups where the sum of each vector is 0. If there is a 0 in the
vector, then that means that particular group is not involved in that particular
contrast.
To determine whether we have orthogonal contrasts, we must show that the sum of the products of the
codes equals 0. Basically, if we have two vectors, multiply them element by element
and sum the products. That sum must be 0 for us to have orthogonal
contrasts.
When building our codes for contrasts, our b-coefficients will represent the
difference between the groups of interest scaled by the number of groups.
If we are doing specific contrasts AND our n is unequal, then we should use the
number of subjects per group as the actual effect codes.
For the first vector, we would use the number of the control group for the effect
codes of both treatments and then the effect code for the control would be the
negative sum of both treatment groups. If we are just doing a contrast between two
treatment groups, set the control to 0 and then assign the number of subjects in the
opposite group to each group (and make one negative so it all sums to 0). That is, if
democrats have 8 subjects and republicans have 6, then we’d make a contrast where
the effect code for democrats is 6 and the effect code for republicans is -8.
Be sure to incorporate sample size when checking the orthogonality of contrasts.
Thus, if we are multiplying two vectors, make a third vector that has the n for each
row (group), multiply all three vectors element by element, and sum the products. That number should be 0.
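A small sketch of this unequal-n orthogonality check (assuming the contrast codes described above, with hypothetical ns of 6, 5, and 10):

import numpy as np

# Hypothetical three-group design with unequal n (rows: treat 1, treat 2, control)
n = np.array([6, 5, 10])

# Contrast 1: both treatments vs. control, coded with group sizes
c1 = np.array([10, 10, -(6 + 5)])
# Contrast 2: treatment 1 vs. treatment 2; control not involved (coded 0)
c2 = np.array([5, -6, 0])

# Orthogonality check with unequal n: weight each product by the group's n
print(np.sum(n * c1 * c2))  # 0 -> the contrasts are orthogonal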
With this setup, the intercept will still be the grand mean. The contrasts are now
between weighted means of groups.
With criterion scaling, we use a single coded vector to represent group membership.
You simply use the group's mean on the DV as the code for every member of that group. This produces an
accurate R², but it will give the wrong df since we only used one vector; you need to calculate MS
and F by hand. Your intercept and b1 will be meaningless since they are
constants (0 and 1).
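A minimal sketch of criterion scaling (hypothetical three-group data): the single predictor is each subject's group mean on Y, the intercept and slope come out as 0 and 1, and the R² matches the fully coded model, but the df implied by a single predictor would be wrong.

import numpy as np

# Hypothetical data: three groups with unequal n
rng = np.random.default_rng(1)
group = np.repeat([0, 1, 2], [6, 5, 10])
y = np.array([12.0, 10.0, 8.0])[group] + rng.normal(0, 2, group.size)

# Criterion scaling: the single predictor is each subject's group mean on the DV
group_means = np.array([y[group == g].mean() for g in range(3)])
x = group_means[group]

design = np.column_stack([np.ones(y.size), x])
b, *_ = np.linalg.lstsq(design, y, rcond=None)
yhat = design @ b
r2 = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
print(b, r2)   # b comes out as (0, 1); R^2 equals the full dummy-coded model's R^2
# df for MS and F must use (number of groups - 1), not the 1 df this single vector implies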
Interactions
Interactions basically mean that a predictor may account for some variance in the outcome,
but only at certain levels of another variable. "The effect of one variable
depends on the level of another."
The most straightforward way to model this is to take your two vectors of values,
multiply them element by element, and then include all three vectors in a multiple regression.
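A minimal sketch (hypothetical data with a continuous X and a dummy-coded Z) of building the cross-product vector and fitting all three terms:

import numpy as np

# Hypothetical predictors: X continuous, Z a dummy (0 = control, 1 = treatment)
rng = np.random.default_rng(2)
n = 100
X = rng.normal(0, 1, n)
Z = rng.integers(0, 2, n).astype(float)
y = 1.0 + 0.5 * X + 0.3 * Z + 0.8 * X * Z + rng.normal(0, 1, n)

XZ = X * Z                                     # the cross-product (interaction) vector
design = np.column_stack([np.ones(n), X, Z, XZ])
b, *_ = np.linalg.lstsq(design, y, rcond=None)
print(b)  # [intercept, slope of X when Z = 0, Z effect at X = 0, difference in slopes]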
The intercept of a regression with interactions that used dummy codes is the mean
of the group coded 0 on every dummy variable.
The first dummy code for factor A (not the interaction) will have a coefficient that tells us
the difference between levels of factor A when everything else in the model is set to 0 (that
is, within the control group of the other factor).
Interaction terms tell us the difference in differences. That is, if we have two levels
of factor A and two levels of factor B, the interaction tells us how much the B1-B2
difference changes across levels of A. If A has another level (a second dummy, A2), that dummy gets its own separate interaction term.
If we don’t have an interaction, get rid of it and test the main effects.
If there is an interaction, split the data into subsamples and regress the outcome on
the treatment variable in each subsample to find the appropriate MSregression for
treatment; then use the MSres from the full model. You can do this AT each level of
the continuous variable of interest by using a subsetting strategy (thresholding your
continuous variable to only include certain ranges of values).
You can also compute regions of significance by using the Potthoff extension of the
Johnson-Neyman procedure. To do so, you need the variance/covariance matrix of the
coefficients. This yields two X values beyond which the difference between T and C is
significant (T significantly below C on one side and T significantly above C on the other).
You can also use a recentering strategy. We subtract a value of interest from each
number in our X, recompute X and the interaction, and then the b-coefficient
associated with the dummy variable (b2) will test the group difference at X = 0
(which now represents your level of interest).
Recentering is important because it reduces the correlation of X and XZ, since X
makes up part of XZ. So center X and center Z and then take the cross-product of
those two centered vectors.
We interpret a significant interaction by saying "the expected amount of change in the
relationship between Z and Y for a 1-unit increase in X, or the expected amount of change
in the relationship between X and Y for a 1-unit increase in Z."
Centering guarantees that our zero points are meaningful: the coefficient of X
"when Z = 0" means much more if we know that 0 is actually a particular value that
we have deliberately placed at 0 by recentering.
An interaction coefficient will be exactly the same regardless of whether or not we
recenter. The highest order term(the interaction) is invariant to linear
transformation.
If we do in fact have an interaction, we want to dissect our data to find what is
driving our interaction. Where on Z does X change its relationship with Y? We
should choose particularly meaningful values such as the mean, +/- 1SD, percentiles
of the variable distribution (top 90%, etc.).
All we have to do is plug in our changing value (when all variables are centered),
using a rearranged regression equation:
𝑌̂ = (𝑏0 + 𝑏2 𝑍) + (𝑏1 + 𝑏3 𝑍)𝑋
Our next step is to obtain the standard errors of simple slopes so that we may test
their significance. We do this by using the following formula:
s_b = √(s²_bX + 2·Z·s_bX,bXZ + Z²·s²_bXZ)
Where the first term under the root is the variance of the beta for X, Z is the Z value
you are testing the effect at (e.g., +1 SD), the next s is the covariance between the betas for X
and the interaction XZ, and the last term is the variance of the beta for the interaction.
You can do this for all interesting levels of Z that you are interested in and submit it
to a t-test with n-k-1 df, where k is the number of predictors.
For example:
simple slope = b1 + b3·Z
t(+1 SD on Z) = simple slope / (s_b for +1 SD on Z)
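A minimal sketch of the whole simple-slopes routine (hypothetical centered X and Z), computing the slope of X at +1 SD on Z, its standard error from the coefficient variance/covariance matrix, and the t value:

import numpy as np

# Hypothetical continuous X and Z (both already centered)
rng = np.random.default_rng(3)
n = 200
X = rng.normal(0, 1, n)
Z = rng.normal(0, 1, n)
y = 2.0 + 0.4 * X + 0.3 * Z + 0.5 * X * Z + rng.normal(0, 1, n)

design = np.column_stack([np.ones(n), X, Z, X * Z])    # columns: b0, b1 (X), b2 (Z), b3 (XZ)
b, *_ = np.linalg.lstsq(design, y, rcond=None)

resid = y - design @ b
k = design.shape[1] - 1                                # number of predictors
mse = np.sum(resid ** 2) / (n - k - 1)
cov_b = mse * np.linalg.inv(design.T @ design)         # variance/covariance matrix of the b's

z_probe = Z.std(ddof=1)                                # probe the slope of X at +1 SD on Z
simple_slope = b[1] + b[3] * z_probe
se = np.sqrt(cov_b[1, 1] + 2 * z_probe * cov_b[1, 3] + z_probe ** 2 * cov_b[3, 3])
t = simple_slope / se                                  # compare to t with n - k - 1 df
print(simple_slope, se, t)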
We can rearrange these equations the other way to look at levels of X that we are
interested in. Depends on the research questions. Just need to rearrange our
regression equation to look like:
Ŷ = (b0 + b1·X) + (b2 + b3·X)·Z
and the standard error to:
s_b = √(s²_bZ + 2·X·s_bZ,bXZ + X²·s²_bXZ)
ANCOVA
An ANCOVA aims to remove extraneous error for a more powerful test of group
differences (treatment). A bad ANCOVA adjusts away pre-existing group differences
on the covariate.
ANCOVA assumes that the covariate slope is the same across all treatment groups
(no interaction). You want to test this interaction, hope it is non-significant, and
then proceed with testing your main effects.
Don’t forget about the adjusted means:
𝑌̅𝑗(𝑎𝑑𝑗) = 𝑌̅𝑗 − 𝑏(𝑋̅𝑗 − 𝑋̅)
Differences in adjusted means are the same as differences among intercepts of
various lines. If you dummy coded your variables, then these difference tests are
built into the coefficient output.
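A minimal sketch (hypothetical two-group data with one covariate) showing that the difference in adjusted means equals the dummy coefficient from the ANCOVA model:

import numpy as np

# Hypothetical two-group ANCOVA: outcome y, covariate x, dummy D (0 = control)
rng = np.random.default_rng(4)
n_per = 30
D = np.repeat([0.0, 1.0], n_per)
x = rng.normal(50, 10, 2 * n_per) + 5 * D              # groups differ somewhat on the covariate
y = 5 + 0.4 * x + 3 * D + rng.normal(0, 2, 2 * n_per)

design = np.column_stack([np.ones(y.size), D, x])
b, *_ = np.linalg.lstsq(design, y, rcond=None)         # b = [b0, treatment dummy, covariate slope]

# Adjusted means: group mean on Y minus b * (group mean on X minus grand mean on X)
adj = [y[D == g].mean() - b[2] * (x[D == g].mean() - x.mean()) for g in (0.0, 1.0)]
print(adj[1] - adj[0], b[1])   # the adjusted-mean difference equals the dummy coefficient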
Curvilinear Regression
Sometimes, a straight line just won’t do.
Each line has as many bends as the highest order exponent -1.
We need to make sure that our curved lines fit with our hypothesis and that we are
not overfitting.
If you had 20 unique X values, you could fit up to a 19th order polynomial. After that
point, the prediction line simply traces the mean for the number of observations at
each level.
The functions of the variables are non-linear, but the model is still linear in the coefficients (we are still
just multiplying each term by a coefficient and adding them all into one linear equation).
A good plan is to start with a higher order than you hypothesize, hope it is non-significant, and return to your hypothesis (e.g., start with cubic and then reduce down to
quadratic).
A key note is that any shared variance for the terms in your equation will belong to
the lower order term. You can only talk about the highest order term in the model
when saying what it accounts for “above and beyond” the other terms in the
equation.
Interpreting our coefficients is a bit more complex, now. We normally think of
coefficients as the change in Y with a unit increase in X. However, if we have X and
then X² and then X³, then those latter terms depend on X. (The intercept is
independent, because 0 for X would make the higher-order terms 0 as well.)
In a quadratic model, the b for X² is known as the acceleration. That is, how much is
the change changing? If we expect a change in Y from X, but the magnitude of that
change changes, then the b coefficient for X² tells us how much the change is
changing.
The b for X is, technically, the instantaneous linear change at X = 0 (the first derivative).
That is…from the get go, what is the slope of X before it starts accelerating….
We do not interpret lower order effects in the presence of a significant higher order
effect.
Centering in a curvilinear regression framework helps us interpret our data with
more ease and also reduces collinearity.
After centering, the coefficients give us a much better idea of the overall trend and are more
easily interpreted. The first coefficient, b1, tells us the predominant direction of the curve;
the second is the concavity or acceleration (positive for opening upward, negative for opening
downward). If b1 is 0, the curve is a symmetric U (or inverted U) centered on the mean.
If our X variable is relatively limited in the number of values it can take on, then you
should use orthogonal polynomials. Orthogonal polynomials are unique variables
that are structured to capture specific curve components. The weights of orthogonal
polynomials should sum to 0. Furthermore, the sum of the products of any pair of
corresponding weights will be 0.
Tables of these weights for u-point scales (u levels of X) showcase the number of turns in each
curve component: the number of times the series of weights changes from increasing
to decreasing tells you which trend (linear, quadratic, cubic, ...) that vector captures.
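A small sketch of one way to build such orthogonal polynomial weights (assuming 5 equally spaced levels of X), using a QR decomposition of the centered power terms; the resulting columns sum to 0 and are pairwise orthogonal:

import numpy as np

# 5 equally spaced levels of X
levels = np.arange(1, 6, dtype=float)
# Columns [1, x, x^2, x^3] built from the centered levels
powers = np.vander(levels - levels.mean(), 4, increasing=True)
Q, _ = np.linalg.qr(powers)
linear, quadratic, cubic = Q[:, 1], Q[:, 2], Q[:, 3]

print(linear.sum(), quadratic.sum(), cubic.sum())   # each ~0
print(linear @ quadratic, linear @ cubic)           # pairwise products sum to ~0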
If the relationship between X and Y is monotonic, we can use the bulging rule to
determine how to transform X and Y depending on which quadrant (Cartesian) the
line’s curve falls on. This is REPLACING the actual X or Y variable with a transformed
one.
I: X up, Y up
II: X down, Y up
III: X down, Y down
IV: X up, Y down
By up or down, we mean raising X or Y to a power on the ladder [3, 2, 1, .5, 0, -.5, -1, -2, -3] (where 0 corresponds to the log transform).
Transforming Y might change the error structure. If non-linearity is the only issue,
just transform X.
So we can either add terms or transform terms. Which to use when?
Transformation is appropriate for monotonic only. Polynomial works for both
monotonic and non-monotonic associations. Polynomial approach uses additional
model degrees of freedom, whereas a transformation does not.
Another option is exponential and power functions. For an exponential we use e^(b1·X)
and for a power function we use X^(b1).
The predicted value generated by a trend may be applied to X values not included in,
but within the range of the original data (interpolation).
However, extrapolating predicted values generated by a trend that is outside the
range of X value is risky. The trend may reverse at different ranges of X (since that
happened in our current dataset already!).
Variables that are non-linearly transformed can also interact with other variables.
In a quadratic model with a group dummy, b1 is the instantaneous linear change for the
control group and b2 is the acceleration in the control group. b3 (attached to the dummy
variable) shows the difference in intercepts between control and treatment. b4 (the interaction
between the original X and the dummy) shows the difference in instantaneous change between
control and treatment. b5 (the interaction between the quadratic term and the dummy) is the
difference in quadratic acceleration between control and treatment.
Mediation
A mediator informs us about the mechanism through which the predictor gives rise to its
effect on the outcome.
We must satisfy the following:
1) the predictor must be related to the mediator → a
2) the predictor must be related to the outcome → c (total effect)
3) the mediator must be related to the outcome (when the predictor is in the model)
→ b; the predictor's coefficient in this model is c'
4) the predictor must be less strongly related to the outcome when the mediator is in the model →
c' < c
Total effect (c) = direct effect (c') + mediated effect (a·b)
Mediated effect = total effect − direct effect
Mediation = c − c'
OR
Mediation = a·b
We need to test if the mediation is substantial.
We would first need to calculate the standard error of the mediated effect:
s_med = √(b²·s_a² + a²·s_b²)
Then we can calculate the Z statistic by:
z = mediation / s_med
We can also bootstrap to test the mediated effect, since the z-test assumes
normality and the product a·b is not necessarily normally distributed.
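A minimal sketch of both tests on hypothetical single-mediator data (slope_and_se is a small OLS helper introduced here for illustration, not anything from the notes):

import numpy as np

# Hypothetical predictor x, mediator m, outcome y
rng = np.random.default_rng(5)
n = 200
x = rng.normal(0, 1, n)
m = 0.5 * x + rng.normal(0, 1, n)
y = 0.4 * m + 0.2 * x + rng.normal(0, 1, n)

def slope_and_se(design, outcome, col):
    # OLS coefficient for one column and its standard error
    b, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    resid = outcome - design @ b
    df = design.shape[0] - design.shape[1]
    cov = (resid @ resid / df) * np.linalg.inv(design.T @ design)
    return b[col], np.sqrt(cov[col, col])

ones = np.ones(n)
a, s_a = slope_and_se(np.column_stack([ones, x]), m, 1)        # path a: X -> M
b, s_b = slope_and_se(np.column_stack([ones, x, m]), y, 2)     # path b: M -> Y given X

s_med = np.sqrt(b**2 * s_a**2 + a**2 * s_b**2)                  # Sobel standard error
print(a * b, (a * b) / s_med)                                   # mediated effect and its z

# Bootstrap the a*b product, since its sampling distribution need not be normal
boots = []
for _ in range(2000):
    idx = rng.integers(0, n, n)
    xb, mb, yb = x[idx], m[idx], y[idx]
    ab, _ = slope_and_se(np.column_stack([np.ones(n), xb]), mb, 1)
    bb, _ = slope_and_se(np.column_stack([np.ones(n), xb, mb]), yb, 2)
    boots.append(ab * bb)
print(np.percentile(boots, [2.5, 97.5]))                         # 95% bootstrap CI for a*b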
If we have multiple mediators, then we need to get an estimate of c' by iterating over the
mediators, allowing each one into the model alongside the predictor, one at a time. Then
sum all of the c's and estimate each a·b separately each time, depending on which
mediator was in the model.
A phenomenon known as suppression will actually strengthen the predictive
validity of another variable by increasing the predictor's coefficient.
Classical suppression is when the suppressor is related to the predictor but not to
the outcome. Thus, it suppresses variance in predictor that is irrelevant to the
outcome. (Verbal ability may serve as a suppressor in the association between a
paper and pencil test of job skills and a measure of job performance).
We can see suppression if c’ > c
Be sure to be wary of common cause models (where the mediator causes X and Y)
when interpreting any hint of “causality” in your mediated model.
Piecewise Models
Basically, a continuous piecewise is similar to an interaction in that the relationship
between the predictor and the outcome is different for different ranges of the
predictor variable.
We can add in a categorical/dummy variable that interacts with X recentered at the split point
in order to model a piecewise relationship. The dummy is a conditional (D1 = 0 if X < some value,
1 otherwise) multiplied by X recentered at that same value.
Let's say the value we want to recenter at is 10. The model is Y = b0 + b1·X + b2·D1·(X − 10) + e,
so for the segment where D1 = 1 we can rearrange our terms to have:
Y = (b0 − 10·b2) + (b1 + b2)·X + e
The intercept will be the first segment's intercept.
b1 will be the first segment's slope.
b2 will be the difference between the 1st and 2nd segment slopes.
Thus b1 + b2 is the second segment's slope.
If the interaction between the dummy variable (with the if statement) and the
recentered X is significant, then you need two slopes.
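A minimal sketch (hypothetical data with a knot at X = 10) of fitting the continuous piecewise model above:

import numpy as np

# Hypothetical data whose slope changes at X = 10 but stays continuous
rng = np.random.default_rng(6)
X = np.linspace(0, 20, 80)
y = np.where(X < 10, 2 + 0.5 * X, 7 + 1.5 * (X - 10)) + rng.normal(0, 1, X.size)

D = (X >= 10).astype(float)               # if statement: 0 below the knot, 1 at/above it
X2 = D * (X - 10)                          # X recentered at the knot, switched on past it

design = np.column_stack([np.ones(X.size), X, X2])
b, *_ = np.linalg.lstsq(design, y, rcond=None)
print(b)  # [1st-segment intercept, 1st-segment slope, difference in slopes]
# the second segment's slope is b[1] + b[2]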
A discontinuous piecewise model would be useful for something like an intervention,
where we expect one line to change so much that we need a new line at some point in
the graph, because a) the intercept would change (if the post-intervention line could be
extended back to the beginning of the experiment, which it can't) and b) the slope would change.
We would set up our dummy code to be a condition statement based on the time
during our measurements when we expect our intervention to be. Additionally, we
would take our time predictor and recenter it to be at the time of interest. We also
need to have the dummy variable as its own isolated predictor. Thus, if we rearrange
our terms we have (if 3 is the value for the if statement):
Y = (b0 − 3·b2) + (b1 + b2)·X + b3 + e
Here,
b0 is the pre-intervention (if statement) intercept, or the expected value when X (raw) = 0
b1 is the pre-intervention slope
b2 is the difference between the pre and post intervention slopes. If this is
significant, then we need two separate lines.
b3 is the immediate intervention effect. This is the difference between where we
would expect the original line (pre intervention) to keep going and where the new,
discontinuous line actually is. If this is significant, then we need a discontinuous
model.
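A small sketch (hypothetical time series with an intervention at time 3) of the discontinuous model, with the slope-change and jump terms as separate predictors:

import numpy as np

# Hypothetical measurements over time with an intervention at t = 3
rng = np.random.default_rng(7)
t = np.arange(0, 8, 0.25)
D = (t >= 3).astype(float)                 # if statement: post-intervention indicator
y = 1 + 0.5 * t + D * (2 + 1.0 * (t - 3)) + rng.normal(0, 0.5, t.size)

design = np.column_stack([np.ones(t.size), t, D * (t - 3), D])
b, *_ = np.linalg.lstsq(design, y, rcond=None)
print(b)  # [pre intercept, pre slope, change in slope, immediate jump at the intervention]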
We can add in another dummy variable to this model to represent a control group
that underwent the same testing at the same times, but without an intervention.
We would add in additional terms where that new dummy interacts with time, the if
statement dummy, and both the time and if statement together.
We would be interested in whether the 3-way interaction and the 2-way interaction
between the group dummy and the if-statement dummy are significant. The coefficients
for these capture the difference between the control and intervention groups
POST the intervention (the if statement).
In a discontinuous piecewise model, the intercept is for the control, X is for the control,
the interaction between the recentered X and the condition dummy (the if statement) is for the
control, and that dummy itself is for the control. The group dummy for the intervention is for the
treatment group, as is anything that interacts with it; these terms show the
differences from the control.
In summary:
Continuous piecewise: intercept + X + D1 (if statement) · X2 (X recentered at the if-statement value)
Discontinuous piecewise: the continuous model + D1 as its own predictor
Knots are the locations on a regression where we see a "spline," a general shift in
the function.