Week 8
Hour 1: More on polynomial fits. The AIC.
Hour 2: Dummy Variables – what are they? Lots of examples.
Hour 3: Interactions. The stepwise method.

Stat 302 Notes. Week 8, Hour 2.

What are dummy variables?

In short, dummy variables are the way to include categorical variables in a regression as explanatory variables. A dummy variable can take two values:
0 – The observation does not belong in this category.
1 – The observation DOES belong in this category.

If a variable has only two categories, you can assign one group to 0 and the other to 1. We could describe the two means of a two-sample t-test as a regression like this:

μ1 = β0 + β1(0)
μ2 = β0 + β1(1)

...where the 0 and 1 are the values of a dummy variable x.

Looking at a simple regression formula now,

y = β0 + β1(x) + error

...where β0 is the intercept and β1 is the slope, we now have an alternative interpretation: β0 is the mean of the first group, and β1 is the difference between the group means.

Example: Taking the means of two samples, we find group 1 has a mean of 45 and group 2 has a mean of 60. Letting group 1 be the 'baseline', we would estimate the parameters of the regression equation y = β0 + β1(x) + error to be β0 = 45 and β1 = 15. This way, we predict a value in group 1 to be 45 + 15(0) = 45, and a value in group 2 to be 45 + 15(1) = 60.

What's more, a t-test, which asks if μ1 = μ2, or 'are the two means different?', is the same as asking if β1 = 0, or 'is the difference between the means 0?' In short, by using a dummy variable for a grouping variable, we can do a t-test with a regression. We even get the same answers.

A regression slope is rise/run. For a dummy variable, the 'run' is the difference between the groups' x values, 1 − 0 = 1.
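The two-group arithmetic above can be sketched in Python. The samples here are made up to match the example means of 45 and 60:

```python
# Two-sample means expressed as a dummy-variable regression.
# Group 1 is the baseline (x = 0); group 2 has x = 1.
# Samples are hypothetical, chosen to match the example means of 45 and 60.

group1 = [40, 45, 50]          # hypothetical sample with mean 45
group2 = [55, 60, 65]          # hypothetical sample with mean 60

mean1 = sum(group1) / len(group1)
mean2 = sum(group2) / len(group2)

b0 = mean1          # intercept = baseline group mean
b1 = mean2 - mean1  # slope = difference in means ("rise" over a "run" of 1)

def predict(x):
    """Predicted value for a dummy x (0 = group 1, 1 = group 2)."""
    return b0 + b1 * x

print(predict(0))  # 45.0 -> group 1 mean
print(predict(1))  # 60.0 -> group 2 mean
```

Testing H0: β1 = 0 in this regression is the same test as the two-sample t-test of H0: μ1 = μ2.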
The 'rise' is the difference in means. Rise/run = Rise/1 = Rise.

Some dummies are variable, but some are predictable.

Dummy variables can be used to translate ANOVA-style problems into regression as well. However, dummy variables MUST only take two values (typically 0 and 1)*. Each dummy variable is 1 only for observations belonging to that category / group, and 0 otherwise.

* There are other codings that use -1 and 1, etc. These are beyond the scope of this course.

So how do you describe a categorical variable with more than 2 possible outcomes using dummy variables? Use more than one dummy. One of the categories is considered a baseline. All of the dummy variables are 0 for observations in that category. For observations in other categories, one of the dummies is 1 and the rest are 0.

A variable with 3 categories needs 2 dummy variables to fully describe it. Here, blue is the baseline. Since a colour can't be red and green at the same time, only one of the dummy variables will ever be 1 for a particular case.

Doing a linear model with just these two dummy variables would look like:

y = β0 + βred(1 if Red) + βgreen(1 if Green) + error

which would be
= β0 for blue cases,
= β0 + βred for red cases,
= β0 + βgreen for green cases.

β0, the intercept, is the value when Red = 0 and Green = 0, which is the average response for blue cases. βgreen is the difference in means between green and blue. βred is the difference in means between red and blue.

A variable with K categories needs K-1 dummy variables. ANOVA treats categorical variables as dummies, and that's what determines where degrees of freedom (df) are used up.
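The colour example can be sketched as follows. The fitted parameter values here are made up for illustration; only the coding scheme (blue as baseline, one dummy each for red and green) comes from the slides:

```python
# Encode a 3-category variable (blue/red/green) with 2 dummy variables.
# Blue is the baseline: both dummies are 0 for blue cases.

def dummies(colour):
    """Return the (red, green) dummy pair for one observation."""
    return (1 if colour == "red" else 0,
            1 if colour == "green" else 0)

# Hypothetical fitted parameters: baseline mean and the two contrasts.
b0, b_red, b_green = 10.0, 2.5, -1.0

def predicted_mean(colour):
    red, green = dummies(colour)
    return b0 + b_red * red + b_green * green

print(predicted_mean("blue"))   # 10.0  (baseline mean, b0)
print(predicted_mean("red"))    # 12.5  (b0 + red contrast)
print(predicted_mean("green"))  # 9.0   (b0 + green contrast)
```

Note that for any single case, at most one dummy is 1, so each prediction is the baseline plus at most one contrast.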
The baseline group's mean is estimated as part of the intercept, and each of the K-1 dummy variables costs one df. That's why a k-group ANOVA has k-1 df for the grouping variable.

Three big advantages to regressing with dummy variables:
- They allow multiple grouping variables to be considered in a single model.
- They can show which means are significantly different from the baseline.
- *** They allow grouping variables and continuous variables to be used together in a single model.

One big disadvantage:
- Any hypothesis tests are done in comparison (also known as in contrast) to the baseline.

Sometimes it's good to be a dummy.

Consider the NHL dataset, and our multiple regression model: number of wins as a response to goals against and goals for.

The National Hockey League is split into two conferences, and teams from the two conferences occasionally (but not often) play against each other. Styles of play may differ between conferences, and we want to see if one conference is winning more often than the other. We can do this with a model that includes goals for, goals against, AND a dummy variable for conference.

R creates the dummy variable automatically for us. By default, the baseline is the first category alphabetically. The baseline is 'E' for East, and the 'ConfNameW' parameter is the additional wins for being in the 'W'estern Conference.

So, when holding goals for and goals against constant, teams in the Western Conference win 0.082 more games on average. However, the parameter for the conference dummy variable is not showing up as significant. How well is the rest of the model doing?
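R's default treatment coding, with the alphabetically first level as baseline, can be imitated in a short sketch. Only the conference labels 'E' and 'W' come from the slides; the data and function are illustrative:

```python
# Build K-1 dummy columns the way R's default treatment coding does:
# sort the levels, take the first as baseline, make one dummy per other level.

def dummy_columns(values):
    """Return (baseline level, {level: dummy column}) for a categorical list."""
    levels = sorted(set(values))           # alphabetical, like R's default
    baseline, others = levels[0], levels[1:]
    columns = {lvl: [1 if v == lvl else 0 for v in values] for lvl in others}
    return baseline, columns

conf = ["W", "E", "E", "W"]                # conferences of four hypothetical teams
baseline, cols = dummy_columns(conf)

print(baseline)    # 'E' -- East is first alphabetically, so it is the baseline
print(cols["W"])   # [1, 0, 0, 1] -- a 'ConfNameW'-style dummy column
```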
82.9% of the variance in the number of wins can be explained by these three things together. In other words, adding conference to our model told us nothing about wins that goals weren't already covering. The R-squared of the model is the same with or without conference.

The AIC and BIC confirm this: they are both higher for the model with the conference dummy variable. That means just as much variance is explained by considering only goals for/against as by considering both goals for/against and the conference of the team. Conference contributes nothing extra. This is probably because the strength of your opponents is already reflected in the goals for / goals against record. It's not like goals against weak teams count for more.

We can combine variables in surprising ways.

One more example: the npk dataset from Assignment 2. In the assignment we had only looked at the blocks. Now let's look at a full model using N, P, K, and block.

Recall that the intercept is the predicted value of the response when all the explanatory variables are zero. That includes all the dummy variables. All the dummy variables are 0 in the baseline group. The baseline group for 'block' is block 1. So the intercept is the expected yield for a response in block 1 with N, P, and K equal to zero.

The expected yield for a response in block 2 is 3.425 more than it is in block 1, holding N, P, and K constant. Each block parameter is in comparison to block 1. The other variables are controlled for when using dummy variables, just as they would be for any other variable.

The yield decreases by 3.98 as K increases by 1, holding block, N, and P constant.
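A quick sketch of how those coefficient interpretations combine into a prediction. Only the block 2 effect (+3.425) and the K effect (−3.98) are taken from the slides; the intercept and the N and P effects are placeholders:

```python
# Predicted yield = intercept + block contrast + N, P, K effects.
# block 2 (+3.425) and K (-3.98) come from the slides; the rest are placeholders.

intercept = 50.0                      # placeholder: block 1 with N = P = K = 0
block_effect = {1: 0.0, 2: 3.425}     # each block is contrasted with block 1
b_n, b_p, b_k = 5.0, 1.0, -3.98       # N and P effects are placeholders

def predicted_yield(block, n, p, k):
    return intercept + block_effect[block] + b_n * n + b_p * p + b_k * k

# Moving from block 1 to block 2, holding N, P, K constant, adds 3.425:
print(round(predicted_yield(2, 1, 1, 1) - predicted_yield(1, 1, 1, 1), 3))  # 3.425
# Raising K by 1, holding everything else constant, changes yield by -3.98:
print(round(predicted_yield(1, 1, 1, 2) - predicted_yield(1, 1, 1, 1), 3))  # -3.98
```

Holding the other variables constant makes the placeholder effects cancel out of each difference, which is exactly the "holding ... constant" interpretation of a coefficient.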
The ANOVA table reflects the number of parameters being estimated for each variable. The 'baseline' group mean is estimated as part of the intercept, which is why there are 5 df for 6 groups.

For categorical variables, the p-value in the ANOVA table tells you whether the response changes between ANY two categories. This is usually more revealing than the p-values for the individual dummy variables. For 1-df variables (i.e. continuous and two-group categorical variables), the p-values are the same between ANOVA and regression.

Finally, by using 5 degrees of freedom, a categorical variable that needs 5 dummy variables also incurs AIC and BIC penalties that are five times as large: 2×5 = 10 points for AIC, and log(24)×5 = 15.9 points for BIC.

[Model output: with N, P, K, and block.]
[Model output: with N, P, K.]

On Thursday: Interaction Terms! The stepwise method!
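The penalty arithmetic on the AIC/BIC slide can be checked directly, with n = 24 observations in the npk data and k = 5 extra dummy parameters for block:

```python
import math

# AIC penalizes each extra parameter by 2; BIC penalizes each by log(n).
# The npk data has n = 24 observations; block uses k = 5 dummy variables.
n, k = 24, 5

aic_penalty = 2 * k            # 10 points added to AIC
bic_penalty = math.log(n) * k  # about 15.9 points added to BIC

print(aic_penalty)             # 10
print(round(bic_penalty, 1))   # 15.9
```

Since log(24) ≈ 3.18 > 2, BIC punishes the 5-dummy block variable harder than AIC does.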