Week 7 Hour 3
Polynomial Fit
Working with the VIF (Variance Inflation Factor)
Stat 302 Notes. Week 7, Hour 3, Page 1 / 36
Another big advantage of multiple regression is that it provides another tool for handling non-linearity: the polynomial model.
With linear regression, we assume that Y increases or decreases with X, and does so at the same rate for every X.
This is not always the case. Sometimes a transformation, like the log transform, can fix a non-linear pattern.
For simple regression, either X or Y can be transformed.
For multiple regression, it has to be Y, because there is more
than one X.
But transforms only work when the trend is always
increasing or always decreasing.
In cases where the trend reaches some maximum or
minimum for Y in the middle of the X values, a transform will
fail to produce a reasonable linear model.
If we use multiple terms to describe a trend, we can get a model that fits well: a polynomial model.
Regression equation: Y = β0 + β1x + β2x²
with β0 = 5, β1 = 0.4, β2 = -1
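As a quick sketch (using simulated data with the slide's coefficients, not real data), a quadratic model can be fit in R by adding an I(x^2) term to lm():

```r
# Simulated data with beta0 = 5, beta1 = 0.4, beta2 = -1, as on the slide
set.seed(302)
x = runif(100, 0, 10)
y = 5 + 0.4*x - 1*x^2 + rnorm(100)

# I() makes ^ mean arithmetic squaring rather than formula syntax
quad.model = lm(y ~ x + I(x^2))
summary(quad.model)
```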
This situation can come up when there are two competing
factors along X.
Or when the function is simply non-linear.
Value = β0 + β1x + β2x²
Interpretation:
β1: the value of the mineral per carat
β2: the added value because larger jewels are rarer
There are other variables to consider
Before we start,
To calculate the VIF, you need the ‘car’ package for R, and
you need R version 3.2 or newer to install that.
To install the package, use the command
install.packages("car")
…and follow the prompts. This only needs to be done once.
To load the package, use the command
library(car)
…which has to be done every time you open R.
Now let’s jump into another multiple regression, using vif()
Consider the dataset gapminder.csv, which has lots of
economic and health information from 150 countries, from
the year 2007 or the nearest available year.
This is real data derived from the Gapminder foundation’s
website, gapminder.org
We are interested in understanding the factors behind the
differing birth rates of countries. We have a few variables
that could be useful:
GDPpercap: The Gross Domestic Product (GDP) per person
(Year 2000, $USD). Roughly the amount of money made per
person.
argi_in_gdp: The proportion of a country’s GDP that comes
from agriculture and farming.
HDI: Human Development Index. A measure of how
advanced a country is, based on health, education, and
economy.
GINI: A measure of income inequality in the country. A
higher index means more inequality.
health_spending: The amount of money (Year 2000, $USD)
spent per person on health care.
female_work: The proportion of adult females (ages 18-64)
that are in the workforce.
We can start with a linear model that includes all of these
explanatory variables together.
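As a sketch, the full model might be fit like this. The name of the birth-rate column is an assumption, and note that the notes spell the agriculture column both argi_in_gdp and agri_in_gdp; argi_in_gdp is used here.

```r
gapminder = read.csv("gapminder.csv")

# Column names are assumptions based on the variable list above
full.model = lm(birth_rate ~ GDPpercap + argi_in_gdp + HDI + GINI +
                  health_spending + female_work, data = gapminder)
summary(full.model)
```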
What is this all saying?
None of the other variables appears significant, but that
could be due to co-linearity.
βHDI is negative and significant (whatever that means),
meaning that the birth rate decreases as the HDI increases,
holding all else constant.
Recall from last time that co-linearity makes the slope
parameters difficult to estimate. To reflect this difficulty, the
standard errors are inflated.
We can look at the correlation matrix of the response and
the 6 explanatory variables to try and detect co-linearity.
Well, that’s a mess. It’s also ambiguous.
Which matters more for co-linearity?
1. The large correlations between argi_in_gdp and HDI.
2. The moderate correlations between health spending and
several other variables.
This is resolved with the Variance Inflation Factor (VIF),
which incorporates the correlation between one explanatory
variable and all the other ones, and combines it into a single
factor.
High VIF variables could be doing more harm than good.
Mathematics given at the end of this hour's notes, for those interested
Inflation can introduce difficulties
A VIF of 5 or more is bad. However, since each variable's VIF
is affected by every other variable, we should remove only
one variable at a time.
Let’s start with the worst offender, health_spending
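A sketch of the workflow, assuming a fitted model object named full.model containing all six explanatory variables:

```r
library(car)      # provides vif()

vif(full.model)   # one VIF per explanatory variable

# Drop the worst offender and re-check: update() refits the
# same model with health_spending removed from the formula
reduced.model = update(full.model, . ~ . - health_spending)
vif(reduced.model)
```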
Without the health_spending variable, there are some small
changes, but nothing dramatic.
R-squared dropped from 0.8427 to 0.8352, so
health_spending wasn't essential to explaining variation in
birth rates.
Looking at the VIF for the remaining 5 explanatory variables,
a more dramatic change appears:
The VIF for GDPpercap has changed from 5.003 to 1.857.
What happened?
The correlation between GDPpercap and health_spending is
0.919. So even if these were the only two explanatory
variables, there would be co-linearity.
It also means that removing health_spending would reduce
GDPpercap's VIF.
However, GDPpercap's VIF has been reduced so much that
it's no longer a problem. In short, there aren't any additional
correlations of note with GDPpercap.
Had we removed GDPpercap instead of health_spending,
the VIF of health_spending would have dropped instead.
The VIFs without GDPpercap look like this:
So it makes sense to include GDP or to include
health_spending in the birth rates model, but not both.
Why? Because including both is redundant.
This makes real world sense. Both GDP and health spending
are in terms of dollars per person. Having more dollars in
general means that more dollars will be spent on health.
None of the remaining VIFs are more than 5, so removing
more variables based on VIF is a judgement call.
The variable with the highest VIF is Human Development
Index (HDI), so it makes sense to try removing that one next.
Recall that the slope for HDI was statistically significant, so
this may seem a little crazy.
Watch what happens...
- Farming's portion of the economy, agri_in_gdp, has gone
from not significant to strongly significant. Also, its slope has
increased from 0.10 to 0.81.
- This implies that there was some relationship between a
country's level of development (HDI) and its reliance on
agriculture for income (agri_in_gdp).
Also, the VIF of agri_in_gdp has been reduced substantially,
implying a correlation between agri_in_gdp and HDI.
However...
With the HDI measure, this model explained 83.52% of the
variation in birth rates. Now it only explains 64.48%.
Therefore, HDI should remain in the model.
Should anything else be removed from the model? No.
There are only trivial amounts of co-linearity left.
In other terms, the remaining four variables all have high
uniqueness.
Their contributions to the model are all unique and
irreplaceable by the other explanatory variables.
There's lots more to read into this data set...
One issue we haven’t addressed yet: Weights.
Every country is being counted as an equally important
observation.
However, since things like birth rate in larger countries are
derived from larger populations, it may be better to treat
those larger countries as more important when fitting a
model.
There is a setting in the lm() command, weights=, that we
can use to do this, but we need the population size first. We
will revisit weighted data if time permits.
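Assuming a population column were added to the dataset (and a birth_rate column as assumed earlier), the call might look like this sketch:

```r
# weights= gives each country influence proportional to its population;
# the 'population' and 'birth_rate' column names are assumptions
weighted.model = lm(birth_rate ~ GDPpercap + argi_in_gdp + HDI + GINI +
                      female_work,
                    data = gapminder, weights = population)
summary(weighted.model)
```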
Another issue is whether hypothesis testing here is even
appropriate.
Usually we have a sample statistic and we hypothesize about
a population parameter. But here we’re using the 150 largest
of the world’s countries.
We have information about nearly the whole population
here, so we're not inferring from a sample to a population,
we are measuring the whole population directly.
When we have population-level data like this, for example if
we used census data directly instead of taking a survey, the
standard errors can be treated as if they are near zero.
What does this mean for us?
The parameter estimates are still good, and so are the
predictions. However, the p-values are pretty much
meaningless in this case.
Next week: More polynomial models, and model selection!
To get the correlation matrix in these slides:
gapminder = read.csv("gapminder.csv")
x = cor(gapminder[,c(2,3,10,11,12,13,14)],
        use="pairwise.complete.obs")
round(x,3)
To compute the VIF of the explanatory variable xi:
Treat xi as a response variable, and make a regression model using the
other explanatory variables. Then,
VIFi = 1 / (1 - Ri²)
where Ri² is the proportion of variance in xi explained by the
other explanatory variables.
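This calculation can be done by hand in R. For example, for GDPpercap (column names as elsewhere in these notes):

```r
# Auxiliary regression: GDPpercap as the response,
# the other explanatory variables as predictors
aux = lm(GDPpercap ~ argi_in_gdp + HDI + GINI + health_spending +
           female_work, data = gapminder)

# VIF = 1 / (1 - R^2) of the auxiliary regression;
# this matches the value reported by vif() for GDPpercap
vif.GDPpercap = 1 / (1 - summary(aux)$r.squared)
vif.GDPpercap
```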