
Week 8
Hour 1: More on polynomial fits. The AIC
Hour 2: Dummy Variables – what are they? Lots of
examples.
Hour 3: Interactions. The stepwise method.
What are dummy variables?
In short, dummy variables are the way to include categorical
variables in a regression as explanatory variables.
A dummy variable can take two values:
0 – The observation does not belong in this category,
1 – The observation DOES belong in this category.
If a variable has only two categories, then you can assign one
group to 0 and the other to 1.
We could describe the two means of a two-sample t-test as a
regression like this:
μ1 = β0 + β1(0)
μ2 = β0 + β1(1)
...where the 0 and 1 are the values of a dummy variable x.
Looking at a simple regression formula now,
y = β0 + β1(x) + error
... where β0 is the intercept and β1 is the slope, we have an
alternative interpretation:
β0 is the mean of the first group, and
β1 is the difference between the group means.
Example: Taking the means of two samples, we find group 1
has a mean of 45 and group 2 has a mean of 60.
Letting group 1 be the 'baseline', we would estimate the
parameters of the regression equation
y = β0 + β1(x) + error
to be
β0 = 45 and β1 = 15.
This way, the predicted value for group 1 is 45 + 15(0) = 45,
and for group 2 it is 45 + 15(1) = 60.
What's more, a t-test, which asks if
μ1 = μ2, or 'are the two means different?',
is the same as asking if
β1 = 0, or 'is the difference between the means 0?'
In short, by using a dummy variable for a grouping
variable, we can do a t-test with a regression.
We even get the same answers.
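Here is a minimal sketch with simulated data (not from the course
notes) showing the match. Note the var.equal = TRUE option: the
regression assumes equal variance in both groups, so it matches
the pooled t-test.

set.seed(302)
y <- c(rnorm(20, mean = 45), rnorm(20, mean = 60))
group <- rep(c(0, 1), each = 20)     # dummy: 0 = group 1, 1 = group 2

t.test(y ~ group, var.equal = TRUE)  # pooled two-sample t-test
summary(lm(y ~ group))               # same p-value for 'group';
                                     # t statistic matches up to sign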
A regression slope is rise/run. For a dummy variable, the
'run' is the difference between the dummy values 0 and 1, which is 1.
The 'rise' is the difference in means. Rise/run = rise/1 = rise.
Some dummies are variable, but some are predictable.
Dummy variables can be used to translate ANOVA-style
problems into regressions as well.
However, dummy variables MUST only take two values
(typically 0 and 1)*.
Each dummy variable is 1 only for observations belonging to
that category / group. Each dummy variable is 0 otherwise.
* There are other coding schemes that use -1 and 1, etc. These are
beyond the scope of this course.
So how do you describe a categorical variable with more than 2
possible outcomes using dummy variables? Use more than one
dummy.
One of the categories is considered a baseline. All of the
dummy variables will be 0 for observations in that category.
For observations in other categories, one of the dummies is
1 and the rest are 0.
A variable with 3 categories needs 2 dummy variables to
fully describe it.
Here, with colours as the categories, Blue is the baseline:

Colour    Red dummy    Green dummy
Blue         0             0
Red          1             0
Green        0             1

Since a colour can’t be red and green at the same time, only
one of the dummy variables will ever be 1 for a particular case.
Doing a linear model with just these two dummy variables
would look like:
y = β0 + βred(1 if Red) + βgreen(1 if Green) + error
which works out to:
= β0 for blue cases.
= β0 + βred for red cases.
= β0 + βgreen for green cases.
β0, the intercept, is the value when Red = 0 and Green = 0:
the average response for blue cases.
βgreen is the difference in means between green and blue.
βred is the difference in means between red and blue.
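A minimal R sketch (the colour variable and response values here
are made up for illustration): R builds these dummies automatically
for a factor, and model.matrix() displays them explicitly.

colour <- factor(c("Blue", "Blue", "Red", "Red", "Green", "Green"))
y <- c(10, 12, 14, 16, 20, 22)

model.matrix(~ colour)   # Blue is the alphabetical baseline;
                         # one 0/1 column each for Green and Red

coef(lm(y ~ colour))     # intercept = blue mean; the other two
                         # coefficients are differences from it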
A variable with K categories needs K-1 dummy variables.
ANOVA treats categorical variables as dummies, and that's
what determines where df are used up. One df for the
baseline, and each dummy variable costs one df.
That's why a K-group ANOVA has K-1 df for the grouping
variable.
Three big advantages to regressing with dummy variables:
- They allow multiple grouping variables to be considered in
a single model.
- They can show which means are significantly different from
the baseline.
- Most importantly, they allow grouping variables and continuous
variables to be used together in a single model.

One big disadvantage:
- Any hypothesis tests are done in comparison (also known
as in contrast) to the baseline.
Sometimes it's good to be a dummy.
Consider the NHL dataset, and our multiple regression
model: Number of wins as a response to goals against and
goals for.
The National Hockey League is split into two conferences,
and teams from different conferences occasionally (but not
often) play against each other.
Styles of play may differ between conferences, and we want
to see if one conference is winning more often than the
other.
We can do this with a model that includes goals for, goals
against AND a dummy variable for conference.
R creates the dummy variable automatically for us. By
default, the baseline is the first category alphabetically.
Here the baseline is 'E' for the Eastern Conference, and the
'ConfNameW' parameter is the additional wins for being in the
'W'estern Conference.
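A sketch of the model call. The data frame and the Wins/GF/GA
column names are assumptions for illustration; only the ConfName
variable is named in the output above.

# Hypothetical NHL data frame 'nhl' with columns Wins, GF (goals for),
# GA (goals against), and ConfName (factor with levels 'E' and 'W').
nhl.model <- lm(Wins ~ GF + GA + ConfName, data = nhl)
summary(nhl.model)   # 'ConfNameW' is the Western Conference dummy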
So, when holding 'goals for' and 'goals against' constant,
teams in the Western Conference win 0.082 more games on
average.
However, the parameter for the conference dummy variable
is not showing up as significant. How well is the rest of the
model doing?
82.9% of the variance in the number of wins can be
explained by these three things together.
In other words, adding conference to our model told us
nothing about wins that goals for and against weren’t already
covering.
The R-squared of the model is the same with or without
conference.
The AIC and BIC confirm this because they are both higher
for the model with the conference dummy variable.
That means just as much variance is explained by
considering only goals for/against as by considering both
goals for/against and the conference of the team.
Conference contributes nothing extra.
This is probably because the strength of your opponents is
already reflected in the goals for / goals against record. It’s
not like goals against weak teams count for more.
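A sketch of that comparison, using the same assumed column names
as above:

m1 <- lm(Wins ~ GF + GA, data = nhl)
m2 <- lm(Wins ~ GF + GA + ConfName, data = nhl)

# R-squared barely moves, and the extra parameter's penalty makes
# AIC and BIC higher (worse) for the model with the dummy.
c(summary(m1)$r.squared, summary(m2)$r.squared)
AIC(m1, m2)
BIC(m1, m2)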
We can combine variables in surprising ways.
One more example: the npk dataset from assignment two.
In the assignment, we had only looked at the blocks.
Now let's look at a full model using N, P, K, and block.
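A sketch of the fit. The npk data frame ships with R, so this runs
as-is:

# npk: yield, block (factor, 6 levels), and 0/1 indicator factors
# N, P, K for nitrogen, phosphate, and potassium.
npk.model <- lm(yield ~ N + P + K + block, data = npk)
summary(npk.model)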
Recall that the intercept is the predicted value of the
response when all the explanatory variables are zero.
That includes all the dummy variables.
All the dummy variables are 0 in the baseline group.
The baseline group for 'block' is block1.
So the intercept is the expected yield...
...in block 1,
...with N, P, and K equal to zero.
The expected yield in block 2 is 3.425 more than it is in
block 1, holding N, P, and K constant.
Each block parameter is a comparison to block 1.
The other variables are controlled for when using dummy
variables, just as they would be for any other variable.
The yield decreases by 3.98 as K increases by 1, holding
block, N, and P constant.
The df column in the ANOVA table reflects the number of
parameters estimated for each variable.
The 'baseline' group mean is estimated as part of the
intercept, which is why there are 5 df for 6 groups.
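Continuing the npk sketch:

# block has 6 levels, so 5 dummy variables and 5 df;
# N, P, and K each use 1 df.
anova(npk.model)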
For categorical variables, the p-value in the ANOVA table
tells you whether the response changes between ANY two
categories. This is usually more revealing than the p-values for
individual dummy variables.
For 1 DF variables (i.e. continuous and two-group categorical
variables), the p-values are the same between ANOVA and
regression.
Finally, because it uses 5 degrees of freedom, a categorical
variable that needs 5 dummy variables also incurs AIC and BIC
penalties five times as large.
(2×5 = 10 points for AIC; log(24)×5 = 15.9 points for BIC, since
there are n = 24 observations.)
[R output shown here: AIC and BIC for the model with N, P, K, and
block, and for the model with N, P, K only.]
On Thursday: Interaction Terms!
The stepwise method!