Agenda for Week 7, Hour 1: ANOVA and R-squared revisited. Multiple regression and r-squared.

Week 7, Hour 2: Multiple regression: co-linearity, perturbations, correlation matrix.

Stat 302 Notes. Week 7, Hour 1, Page 1 / 28

Consider this made-up dataset on silicon wafers, wafers.csv. It’s based on a very common type of quality control analysis in manufacturing.

A factory manager is interested in reducing the number of bad wafers the factory produces in a batch. She sets the factory to make 6 batches of wafers at each combination of 3 levels of cooking temperature and 3 levels of spin speed. There are 54 batches of wafers in total.

The response variable is the number of bad wafers (in a batch of 1000).

Here are select rows from the dataset.

'cooktemp' is the cooking temperature in Celsius
'spinrpm' is the spin rate while cooling, in RPM
'bad' is the number of bad wafers in the batch

Note that even though we can describe temperature and speed as continuous variables, we are treating them as categories here. Essentially we are calling them ‘low’, ‘medium’, and ‘high’ settings.

    # assumes wafers.csv is in the working directory
    wafers = read.csv("wafers.csv")
    wafers$spinrpm = as.factor(wafers$spinrpm)
    wafers$cooktemp = as.factor(wafers$cooktemp)

Here is the one-way ANOVA of 'bad' using cooking temperature as an explanatory variable.

    mod = lm(bad ~ cooktemp, data=wafers)
    anova(mod)

The p-value is small, so we have strong evidence that cooking temperature matters.

Without the p-value, we could compare the obtained F to a critical value for F. (Recall: the F test is one-tailed, because only a large F, meaning large between-group variance relative to within-group variance, counts as evidence against the null.)

A hypothesis test tells us that some of the variance in bad wafer count is explained by cooktemp. It doesn't tell us how much of the variance is explained.
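As a rough sketch of the critical-value comparison mentioned above, the obtained F can be rebuilt from the sums of squares these notes report for the one-way ANOVA (727 for cooktemp, 2934 for residuals). The degrees of freedom are an assumption worked out from the design (3 temperature levels, 54 batches), and the 5% significance level is also an assumption, since the notes do not state one.

```r
# Sums of squares quoted in the notes for bad ~ cooktemp
ss_group <- 727
ss_resid <- 2934

# ASSUMPTION: degrees of freedom derived from the design
# (3 temperature levels, 54 batches in total)
df_group <- 3 - 1
df_resid <- 54 - 3

# Obtained F = mean square for groups / mean square for residuals
f_obs <- (ss_group / df_group) / (ss_resid / df_resid)

# ASSUMPTION: 5% significance level. The test is one-tailed,
# so the critical value is the upper 95th percentile of F.
f_crit <- qf(0.95, df_group, df_resid)

# Upper-tail p-value for the obtained F
p_val <- pf(f_obs, df_group, df_resid, lower.tail = FALSE)

print(c(f_obs = f_obs, f_crit = f_crit, p_val = p_val))
```

If the obtained F is larger than the critical value, we reject the null at that level, which agrees with reading the p-value directly. Either way, the comparison only tells us that cooktemp explains some of the variance, not how much.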
For that we need the Sum of Squares total, which is
SSgroup + SSresid = 727 + 2934 = 3661

Proportion of variance explained, or R-squared
= SSgroup / SStotal = 727 / 3661 = 0.1986,
or 19.86% of variation explained.

We can also get this information from the summary of the lm() object that we used to get the ANOVA in the first place.

There's no such thing as a correlation in an ANOVA, but sometimes the ANOVA is referred to as having an R-squared because of this connection to variance explained.

A two-armed bird needs a two-way ANOVA

Here is the two-way ANOVA using both cooking temperature and spin rate to explain 'bad'.

    mod = lm(bad ~ spinrpm + cooktemp, data=wafers)
    anova(mod)

First, do we have evidence that the number of bad wafers changes with temperature? What about with spin speed?

Yes to both. The p-value associated with each factor is small.

So both factors are explaining a significant proportion of the variance. But how much?

We need the sum of squares total. This is 3661, the total of the sum of squares from all sources: temperature, spin speed, and residuals.

SStotal = SSspin + SStemp + SSresid = 1840 + 727 + 1094 = 3661
(The same as in the one-way ANOVA)

Of this total, spin speed explains
SSspin / SStotal = 1840 / 3661 = 50.26%
of the variation, and temperature explains
SStemp / SStotal = 727 / 3661 = 19.86%
of it.

Both grouping variables together explain
(1840 + 727) / 3661 = 2567 / 3661 = 70.12% of the variation.

We can confirm this by looking at the linear model summary. The multiple r-squared should match the variance explained by the model (i.e. everything but the residuals).
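The bookkeeping above can be sketched in R. Because wafers.csv is not reproduced in these notes, the example below simulates a stand-in dataset with the same layout (6 batches at each of 3 x 3 settings). The level values, effect sizes, and the names `sim`, `share`, `r2_from_ss`, and `r2_from_lm` are all invented for illustration, so the printed numbers will not match 50.26%, 19.86%, or 70.12%.

```r
# Simulated stand-in for wafers.csv: the levels and effect sizes below are
# invented for illustration and are NOT the values from the notes.
set.seed(302)
sim <- expand.grid(rep      = 1:6,
                   cooktemp = c(900, 1000, 1100),   # hypothetical levels
                   spinrpm  = c(2000, 3000, 4000))  # hypothetical levels
sim$bad <- round(40 +
                 0.02  * (sim$cooktemp - 1000) +
                 0.006 * (sim$spinrpm - 3000) +
                 rnorm(nrow(sim), sd = 5))
sim$cooktemp <- as.factor(sim$cooktemp)
sim$spinrpm  <- as.factor(sim$spinrpm)

# Two-way ANOVA, same model formula as in the notes
mod <- lm(bad ~ spinrpm + cooktemp, data = sim)
tab <- anova(mod)

# Sum of squares for each source, named by row of the ANOVA table
ss <- setNames(tab[["Sum Sq"]], rownames(tab))
ss_total <- sum(ss)

# Share of the total variation attributed to each source
share <- ss / ss_total
print(round(share, 4))

# Everything but the residuals should match the multiple R-squared
r2_from_ss <- 1 - share[["Residuals"]]
r2_from_lm <- summary(mod)$r.squared
print(c(from_ss = r2_from_ss, from_lm = r2_from_lm))
```

With the real data, the same steps applied to `wafers` in place of `sim` should reproduce the percentages worked out by hand above.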
Are they evolving, or are we regressing?

Recall that in assignment 1, we looked at some National Hockey League data. We made a model of wins as a function of goals against. This is a simple regression model.

The regression equation is
Wins = 78.83 - 0.163*GA + error

…which means that a team with 0 goals against it is expected to win 78.83 of its 82 games, and that every goal against the team costs it 0.163 wins.

In this model, goals against explained 42.21% of the variation in the number of wins.

We can expand this from a simple regression into a multiple regression model by incorporating a second explanatory variable, Goals For (GF).

The regression equation is
Wins = 37.95 - 0.163*GA + 0.177*GF + error

…meaning that a team with both 0 goals against and 0 goals for is expected to win 37.95 games (a bit fewer than half).

Every goal against will reduce this win count by 0.163 (holding “goals for” constant).

Every goal for will increase the win count by 0.177 (holding “goals against” constant).

When doing a multiple regression, the slope coefficient associated with each variable is implicitly interpreted while “holding other variables constant”.

That means we take each slope effect separately, even if the changes often appear together.

Example: If the team makes a change (e.g. a trade or a coaching change) such that it will score 5 more goals in a season, but also allow 3 more goals, then:

Adding 5 'goals for' and 3 'goals against':

The effect of the additional goals against is to earn 0.163 * 3 = 0.489 fewer wins per season.

The effect of the increase in goals for is to earn 0.177 * 5 = 0.885 more wins per season.
The total effect is the sum of each separate effect, so with the change, we expect an increase of (-0.489) + 0.885 = 0.396 wins this season.

In this multiple regression model, goals for and against together explain 82.93% of the variation in wins.

In this model, it’s not surprising that including both 'goals for' and 'goals against' is better than including only one.

However, r-squared increases whenever we include additional explanatory variables, until it reaches 100%.

Even if the new variables are completely random noise, the r-squared will increase by a little bit.

We use the ‘multiple r-squared’ in the model summary because it’s easy to interpret, but the adjusted r-squared is also useful. It’s always a little less than the multiple r-squared, to account for the amount that r-squared would increase from random noise alone.

Question: Don’t goals for and goals against determine wins entirely? If you score more goals than your opponent, you win. End of story. Right?

Answer: For a single game, that’s true. But we don’t have data at this resolution. We have the total goals for and against for the entire season, but not for individual games.

When we aggregate data (e.g. add together the goals from different games in a season), we lose some information. Winning a game by 1 goal and winning it by 50 goals both count as a single win.

That’s where the remaining 17% unexplained variance is: in the differences between individual games.

Question: Could there be a team so terrible that the model would predict it to have fewer than 0 wins?

Answer: Yes. However, such a team would be an extreme outlier in the data. We shouldn’t extrapolate and apply the model to cases far outside the data we have observed.
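The earlier claim that r-squared rises even when a pure-noise variable is added can be checked with a small simulation. This is only a sketch: the data below are simulated, not the NHL data from assignment 1, and the names `teams`, `noise`, `fit_two`, and `fit_noise` are made up for illustration. The coefficients used to generate `wins` are loosely modelled on the fitted equation in these notes.

```r
# Simulated stand-in data: NOT the NHL data from assignment 1.
set.seed(7)
n  <- 30
ga <- round(rnorm(n, mean = 230, sd = 25))   # hypothetical goals against
gf <- round(rnorm(n, mean = 230, sd = 25))   # hypothetical goals for
wins  <- 38 - 0.16 * ga + 0.18 * gf + rnorm(n, sd = 3)
noise <- rnorm(n)                            # unrelated to wins by construction
teams <- data.frame(wins, ga, gf, noise)

# Model with the two real predictors, then the same model plus noise
fit_two   <- lm(wins ~ ga + gf, data = teams)
fit_noise <- lm(wins ~ ga + gf + noise, data = teams)

# Multiple and adjusted R-squared for both models
r2  <- c(two        = summary(fit_two)$r.squared,
         with_noise = summary(fit_noise)$r.squared)
adj <- c(two        = summary(fit_two)$adj.r.squared,
         with_noise = summary(fit_noise)$adj.r.squared)

print(round(rbind(multiple = r2, adjusted = adj), 4))
```

The multiple r-squared for the model with the noise column should be at least as large as for the model without it, while the adjusted r-squared stays below the multiple r-squared in both models and may move in either direction when the noise column is added.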
Break time