ANOVA and R-squared revisited. Multiple regression and r-squared. Multiple regression: correlation matrix

Agenda for Week 7, Hour 1
ANOVA and R-squared revisited.
Multiple regression and r-squared.
Week 7, Hour 2
Multiple regression: co-linearity, perturbations,
correlation matrix
Stat 302 Notes. Week 7, Hour 1, Page 1 / 28
Consider this made-up dataset on silicon wafers, wafers.csv.
It’s based on a very common type of quality control analysis
in manufacturing.
A factory manager is interested in reducing the number of
bad wafers the factory produces in a batch.
She sets the factory to make 6 batches of wafers at each combination
of 3 levels of cooking temperature and 3 levels of spin speed.
There are 54 batches of wafers in total. The response
variable is number of bad wafers (in a batch of 1000).
Here are select rows from the dataset.
'cooktemp' is the cooking temperature in Celsius
'spinrpm' is the spin rate while cooling, in RPM
'bad' is the number of bad wafers in the batch
Note that even though we can describe temperature
and speed as continuous variables, we are treating
them as categories here.
Essentially we are calling them ‘low’, ‘medium’, and
‘high’ settings.
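wafers = read.csv("wafers.csv")  # assumes the file is in the working directory
# Treat the numeric settings as categories
wafers$spinrpm = as.factor(wafers$spinrpm)
wafers$cooktemp = as.factor(wafers$cooktemp)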
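A minimal sketch of what the conversion changes. The temperature settings (150, 175, 200) and the counts below are made up for illustration and are not taken from wafers.csv. As a number, temperature gets a single slope (1 degree of freedom); as a factor with 3 levels, it gets a separate mean per setting (2 degrees of freedom).

```r
# Hypothetical data: 3 temperature settings, 6 batches each (not the real wafers.csv)
set.seed(302)
toy = data.frame(cooktemp = rep(c(150, 175, 200), each = 6))
toy$bad = rpois(nrow(toy), lambda = 20)

# Numeric predictor: one straight-line slope, so 1 degree of freedom
df_numeric = anova(lm(bad ~ cooktemp, data = toy))["cooktemp", "Df"]

# Factor predictor: a separate mean for each setting, so 3 - 1 = 2 degrees of freedom
toy$cooktemp_f = as.factor(toy$cooktemp)
df_factor = anova(lm(bad ~ cooktemp_f, data = toy))["cooktemp_f", "Df"]

nlevels(toy$cooktemp_f)  # number of categories
c(numeric = df_numeric, factor = df_factor)
```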
Here is the one-way ANOVA of 'bad' using cooking
temperature as an explanatory variable.
mod = lm(bad ~ cooktemp, data=wafers)
anova(mod)
The p-value is small, so we have strong evidence that cooking
temperature matters.
Without the p-value, we could compare the obtained F to a
critical value for F.
(Recall: the F test is one-tailed; we only care about larger variances.)
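The critical-value comparison can be sketched in R. The sums of squares (727 for cooktemp, 2934 for residuals) are the ones reported for this one-way ANOVA; the degrees of freedom are derived from the design (3 temperature levels, 54 batches), and the 5% significance level is an assumed choice.

```r
# Sums of squares from the one-way ANOVA of 'bad' on cooktemp
ss_group = 727
ss_resid = 2934

# Degrees of freedom implied by the design: 3 temperature levels, 54 batches
df_group = 3 - 1
df_resid = 54 - 3

# Observed F is the ratio of the two mean squares
f_obs = (ss_group / df_group) / (ss_resid / df_resid)

# Critical value at the 5% level (alpha = 0.05 is an assumed choice)
f_crit = qf(0.95, df_group, df_resid)

# Upper-tail p-value, matching the one-tailed nature of the F test
p_value = pf(f_obs, df_group, df_resid, lower.tail = FALSE)

c(F = f_obs, critical = f_crit, p = p_value)
```

If the observed F exceeds the critical value, we reject the null hypothesis at that level, which is the same decision the p-value gives.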
A hypothesis test tells us that some of the variance in bad
wafer count is explained by cooktemp.
It doesn't tell us how much of the variance is explained.
For that we need the Sum of Squares total,
which is SSgroup + SSresid = 727 + 2934 = 3661
Proportion of variance explained, or R-squared
= SSgroup / SStotal
= 727 / 3661
= 0.1986, or 19.86% of variation explained.
We can also get this information from the summary of the
lm() object that we used to get the ANOVA in the first place.
There's no such thing as a correlation in an ANOVA, but the
ANOVA is sometimes referred to as having an R-squared
because of this variance-explained connection.
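A small sketch of that connection, using simulated data as a stand-in for wafers.csv (the values are made up, so the numbers will not match the ones in these notes). It computes the proportion of variance explained by hand from the ANOVA table and compares it with the R-squared stored in the model summary.

```r
# Simulated stand-in for wafers.csv (values are made up, for illustration only)
set.seed(7)
toy = data.frame(cooktemp = as.factor(rep(c("low", "medium", "high"), each = 18)))
toy$bad = rpois(nrow(toy), lambda = 20) + 4 * (toy$cooktemp == "high")

mod = lm(bad ~ cooktemp, data = toy)

# By hand: sum of squares for the groups over the total sum of squares
tab = anova(mod)
r2_by_hand = tab["cooktemp", "Sum Sq"] / sum(tab[["Sum Sq"]])

# From the model summary
r2_summary = summary(mod)$r.squared

c(by_hand = r2_by_hand, summary = r2_summary)
```

The two values agree, which is why the 'Multiple R-squared' line of summary(mod) can be read as the proportion of variance explained.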
A two-armed bird needs a two-way ANOVA
Here is the two-way ANOVA using both cooking
temperature and spin rate to explain 'bad'.
mod = lm(bad ~ spinrpm + cooktemp, data=wafers)
anova(mod)
First, do we have evidence that the number of bad
wafers changes with temperature?
What about with spin speed?
Yes to both.
The p-value associated with each factor is small.
So both factors are explaining a significant
proportion of the variance. But how much?
We need the sum of squares total. This is 3661, the
total of the sums of squares from all sources:
temperature, spin speed, and residuals.
SStotal = SSspin + SStemp + SSresid
= 1840 + 727 + 1094
= 3661
(The same as in the one-way ANOVA)
Of this total, spin speed explains
SSspin / SStotal = 1840 / 3661 = 50.26%
of the variation, and temperature explains
SStemp / SStotal = 727 / 3661 = 19.86%
of it.
Both grouping variables together explain
2567 / 3661 = 70.12% of the variation
We can confirm this by looking at the linear model
summary.
The multiple r-squared should match variance
explained by the model. (i.e. everything but the
residuals)
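The arithmetic above can be checked with a few lines of R. This sketch uses only the sums of squares quoted in these notes (1840, 727, 1094); the variable names are illustrative.

```r
# Sums of squares reported in the two-way ANOVA table
ss = c(spinrpm = 1840, cooktemp = 727, residuals = 1094)

ss_total = sum(ss)      # 3661
share = ss / ss_total   # proportion of variation from each source

# Variance explained by the model is everything except the residuals
r2_model = 1 - share[["residuals"]]

round(100 * share, 2)
round(100 * r2_model, 2)
```

On the real data, summary(mod)$r.squared for the two-way model should give the same proportion as r2_model.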
Are they evolving, or are we regressing?
Recall that in Assignment 1, we looked at some National
Hockey League data. We made a model of wins as a function
of goals against.
This is a simple regression model. The regression equation is
Wins = 78.83 – 0.163*GA + error
This means that a team with 0 goals against it is expected to
win 78.83 of its 82 games, and that every goal against the
team costs it 0.163 wins.
In this model, goals against explained 42.21% of the
variation in the number of wins.
We can expand this from a simple regression into a multiple
regression model by incorporating a second explanatory
variable, Goals For (GF).
The regression equation is
Wins = 37.95 – 0.163*GA + 0.177*GF + error
This means that a team with both 0 goals against and 0 goals
for is expected to win 37.95 games (a bit fewer than half).
Every goal against will reduce this win count by 0.163
(holding “goals for” constant)
Every goal for will increase the win count by 0.177
(holding “goals against” constant)
When doing a multiple regression, the slope coefficient
associated with each variable is implicitly interpreted while
“holding the other variables constant”.
That means we take each slope effect separately, even if
the changes often appear together.
Example:
If the team makes a change (e.g. a trade or a coaching
change) such that it will score 5 more goals in a season, but
also allow 3 more goals, then:
Adding 5 'goals for' and 3 'goals against'.
The effect of the additional goals against is to earn
0.163 * 3 = 0.489 fewer wins per season.
The effect of the increase in goals for is to earn
0.177 * 5 = 0.885 more wins per season.
The total effect is the sum of each separate effect, so with
the change, we expect an increase of
(-0.489) + 0.885 = 0.396 wins this season.
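The same calculation as a short R sketch. The coefficients are the ones from the fitted equation in these notes; the variable names are illustrative. The intercept drops out because we are looking at a change in wins, not the number of wins itself.

```r
# Coefficients from the fitted equation: Wins = 37.95 - 0.163*GA + 0.177*GF
b_ga = -0.163
b_gf = 0.177

# Change from the example: 5 more goals for, 3 more goals against
delta_gf = 5
delta_ga = 3

# Each slope is applied separately, then the effects are added
effect_ga = b_ga * delta_ga   # -0.489
effect_gf = b_gf * delta_gf   #  0.885
delta_wins = effect_ga + effect_gf

delta_wins
```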
In this multiple regression model, goals for and against
together explain 82.93% of the variation in wins.
In this model, it’s not surprising that including both 'goals
for' and 'goals against' is better than including only one.
However, when additional explanatory variables are included, r-squared
never decreases; it can only stay the same or rise toward 100%.
Even if the new variables are completely random noise, the
r-squared will increase by a little bit.
We use the ‘multiple r-squared’ in the model summary
because it’s easy to interpret, but the adjusted r-squared is
also useful, because it’s always a little less than the multiple
r-squared to account for the amount that r-squared would
increase from random noise.
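A simulation sketch of this behaviour. Everything here is made up: y depends only on x, and five columns of pure noise are added one at a time. The by-hand line uses the standard adjusted r-squared formula, 1 - (1 - R^2)(n - 1)/(n - p - 1), where p is the number of explanatory variables.

```r
set.seed(2024)
n = 30

# Simulated data: y depends on x only; everything else is pure noise
x = rnorm(n)
y = 2 + 1.5 * x + rnorm(n)
noise = matrix(rnorm(n * 5), nrow = n)   # 5 unrelated noise columns
dat_all = data.frame(y = y, x = x, noise)

r2 = numeric(6)
adj = numeric(6)
for (k in 0:5) {
  # Model with x plus the first k noise columns
  dat = dat_all[, seq_len(2 + k), drop = FALSE]
  s = summary(lm(y ~ ., data = dat))
  r2[k + 1] = s$r.squared
  adj[k + 1] = s$adj.r.squared
}

# Adjusted r-squared by hand for the largest model (p = 6 explanatory variables)
p = 6
adj_by_hand = 1 - (1 - r2[6]) * (n - 1) / (n - p - 1)

round(rbind(multiple = r2, adjusted = adj), 4)
```

The multiple r-squared row never goes down as noise columns are added, while the adjusted row is free to fall.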
Question: Don’t goals for and goals against
determine wins entirely? If you score more goals
than your opponent, you win. End of story. Right?
Answer: For a single game, that’s true. But we don’t
have data of this resolution.
We have the total goals for and against for the entire
season, but not for individual games.
When we aggregate data (e.g. add together the goals
from different games in a season), we lose some
information.
Winning a game by 1 goal and winning it by 50 goals
both count as a single win.
That’s where the remaining 17% unexplained
variance is: in the differences between individual
games.
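A tiny illustration of the information lost by aggregating, using two hypothetical teams with made-up scores over four games. Both teams have identical totals for goals for and goals against, yet they win different numbers of games.

```r
# Two hypothetical teams over a 4-game stretch (made-up scores)
team_a = data.frame(gf = c(5, 0, 0, 0), ga = c(0, 1, 1, 1))
team_b = data.frame(gf = c(2, 1, 1, 1), ga = c(1, 0, 0, 2))

# Aggregate the way the season data does: totals only
totals = rbind(a = colSums(team_a), b = colSums(team_b))

# Wins need the game-by-game comparison, which the totals throw away
wins = c(a = sum(team_a$gf > team_a$ga), b = sum(team_b$gf > team_b$ga))

totals
wins
```

A model that sees only the totals has to give both teams the same prediction, so some variation in wins is necessarily left unexplained.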
Question: Could there be a team so terrible that the
model predicts it will have fewer than 0 wins?
Answer: Yes. However, such a team would be an
extreme outlier in the data.
We shouldn’t extrapolate and apply the model to
cases far outside the data we have observed.
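To see where the prediction turns negative, here is a sketch using the simple regression equation from earlier in these notes (Wins = 78.83 – 0.163*GA). The function name and the 500-goal team are hypothetical.

```r
# Simple regression from the notes: Wins = 78.83 - 0.163*GA
predict_wins = function(ga) 78.83 - 0.163 * ga

# Goals against at which the fitted line crosses 0 wins
ga_zero = 78.83 / 0.163

# A hypothetical team allowing 500 goals gets a negative prediction
predict_wins(500)
ga_zero
```

The fitted line only reaches 0 wins at roughly 484 goals against in a season, which is the kind of extreme case the model should not be applied to.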
Break time