Week 7 Hour 3
Polynomial Fit
Working with the VIF (Variance Inflation Factor)
Stat 302 Notes. Week 7, Hour 3, Page 1 / 36
Another big advantage of multiple regression is that it provides another tool for handling non-linearity: the polynomial model.
With linear regression, we assume that Y increases or decreases with X, and does so at the same rate for every X.
This is not always the case. Sometimes a transformation, like the log transform, can fix a non-linear pattern.
For simple regression, either X or Y can be transformed.
For multiple regression, it has to be Y, because there is more
than one X.
But transforms only work when the trend is always
increasing or always decreasing.
In cases where the trend reaches some maximum or
minimum for Y in the middle of the X values, a transform will
fail to produce a reasonable linear model.
If we use multiple terms to describe a trend, we can get a model that fits well: a polynomial model.
Regression equation: Y = β0 + β1x + β2x²
with β0 = 5, β1 = 0.4, β2 = -1
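As a quick sketch (using simulated data with the slide's coefficients, not real data), a quadratic model can be fit in R by adding an I(x^2) term to lm():

```r
# Simulated data with beta0 = 5, beta1 = 0.4, beta2 = -1, as on the slide
set.seed(302)
x = runif(100, 0, 10)
y = 5 + 0.4*x - 1*x^2 + rnorm(100)

# I() makes ^ mean arithmetic squaring rather than formula syntax
quad.model = lm(y ~ x + I(x^2))
summary(quad.model)
```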
This situation can come up when there are two competing
factors along X.
Or when the function is simply non-linear.
Value = β0 + β1x + β2x²
Interpretation:
β1: the value of the mineral per carat
β2: the added value because larger jewels are rarer
There are other variables to consider
Before we start,
To calculate the VIF, you need the ‘car’ package for R, and
you need R version 3.2 or newer to install that.
To install the package, use the command
install.packages("car")
…and follow the prompts. This only needs to be done once.
To load the package, use the command
library(car)
…which has to be done every time you open R.
Now let’s jump into another multiple regression, using vif()
Consider the dataset gapminder.csv, which has lots of
economic and health information from 150 countries, from
the year 2007 or the nearest available year.
This is real data derived from the Gapminder foundation’s
website, gapminder.org
We are interested in understanding the factors behind the
differing birth rates of countries. We have a few variables
that could be useful:
GDPpercap: The Gross Domestic Product (GDP) per person
(Year 2000, $USD). Roughly the amount of money made per
person.
argi_in_gdp: The proportion of a country’s GDP that comes
from agriculture and farming.
HDI: Human Development Index. A measure of how
advanced a country is, based on health, education, and
economy.
GINI: A measure of income inequality in the country. A
higher index means more inequality.
health_spending: The amount of money (Year 2000, $USD)
spent per person on health care.
female_work: The proportion of adult females (ages 18-64)
that are in the workforce.
We can start with a linear model that includes all of these
explanatory variables together.
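As a sketch, the full model might be fit like this. The name of the birth-rate column is an assumption, and note that the notes spell the agriculture column both argi_in_gdp and agri_in_gdp; argi_in_gdp is used here.

```r
gapminder = read.csv("gapminder.csv")

# Column names are assumptions based on the variable list above
full.model = lm(birth_rate ~ GDPpercap + argi_in_gdp + HDI + GINI +
                  health_spending + female_work, data = gapminder)
summary(full.model)
```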
What is this all saying?
None of the other variables appears significant, but that
could be due to co-linearity.
βHDI is negative and significant (whatever that means),
meaning that the birth rate decreases as the HDI increases,
holding all else constant.
Recall from last time that co-linearity makes the slope
parameters difficult to estimate. To reflect this difficulty, the
standard errors are inflated.
We can look at the correlation matrix of the response and
the 6 explanatory variables to try and detect co-linearity.
Well, that’s a mess. It’s also ambiguous.
Which matters more for co-linearity?
1. The large correlations between argi_in_gdp and HDI.
2. The moderate correlations between health spending and
several other variables.
This is resolved with the Variance Inflation Factor (VIF),
which incorporates the correlation between one explanatory
variable and all the other ones, and combines it into a single
factor.
High VIF variables could be doing more harm than good.
Mathematics given at the end of this hour's notes, for those interested
Inflation can introduce difficulties
A VIF of 5 or more is bad. However, since each variable's VIF
is affected by every other variable, we should remove only
one variable at a time.
Let’s start with the worst offender, health_spending
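A sketch of the workflow, assuming a fitted model object named full.model containing all six explanatory variables:

```r
library(car)      # provides vif()

vif(full.model)   # one VIF per explanatory variable

# Drop the worst offender and re-check: update() refits the
# same model with health_spending removed from the formula
reduced.model = update(full.model, . ~ . - health_spending)
vif(reduced.model)
```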
Without the health_spending variable, there are some small
changes, but nothing dramatic.
R-squared dropped from 0.8427 to 0.8352, so
health_spending wasn't essential to explaining variation in
birth rates.
Looking at the VIF for the remaining 5 explanatory variables,
a more dramatic change appears:
The VIF for GDPpercap has changed from 5.003 to 1.857.
What happened?
The correlation between GDPpercap and health_spending is
0.919. So even if these were the only two explanatory
variables, there would be co-linearity.
It also means that removing health_spending would reduce
GDPpercap's VIF.
However, GDPpercap's VIF has been reduced so much that
it's no longer a problem. In short, there aren't any additional
correlations of note with GDPpercap.
Had we removed GDPpercap instead of health_spending,
the VIF of health_spending would have dropped instead.
The VIFs without GDPpercap look like this:
So it makes sense to include GDP or to include
health_spending in the birth rates model, but not both.
Why? Because including both is redundant.
This makes real world sense. Both GDP and health spending
are in terms of dollars per person. Having more dollars in
general means that more dollars will be spent on health.
None of the remaining VIFs are more than 5, so removing
more variables based on VIF is a judgement call.
The variable with the highest VIF is Human Development
Index (HDI), so it makes sense to try removing that one next.
Recall that the slope for HDI was statistically significant, so
this may seem a little crazy.
Watch what happens...
- Farming's portion of the economy, agri_in_gdp, has gone
from not significant to strongly significant. Also, its slope has
increased from 0.10 to 0.81.
- This implies that there was some relationship between a
country's level of development (HDI) and its reliance on
agriculture for income (agri_in_gdp).
Also, the VIF of agri_in_gdp has been reduced substantially,
implying a correlation between agri_in_gdp and HDI.
However...
With the HDI measure, this model explained 83.52% of the
variation in birth rates. Now it only explains 64.48%.
Therefore, HDI should remain in the model.
Should anything else be removed from the model? No.
There are only trivial amounts of co-linearity left.
In other terms, the remaining four variables all have high
uniqueness.
Their contributions to the model are all unique and
irreplaceable by the other explanatory variables.
There's lots more to read into this data set...
One issue we haven’t addressed yet: Weights.
Every country is being counted as an equally important
observation.
However, since things like birth rate in larger countries are
derived from larger populations, it may be better to treat
those larger countries as more important when fitting a
model.
There is a setting in the lm() command, weights=, that we
can use to do this, but we need the population size first. We
will revisit weighted data if time permits.
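Assuming a population column were added to the dataset (and a birth_rate column as assumed earlier), the call might look like this sketch:

```r
# weights= gives each country influence proportional to its population;
# the 'population' and 'birth_rate' column names are assumptions
weighted.model = lm(birth_rate ~ GDPpercap + argi_in_gdp + HDI + GINI +
                      female_work,
                    data = gapminder, weights = population)
summary(weighted.model)
```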
Another issue is whether hypothesis testing here is even
appropriate.
Usually we have a sample statistic and we hypothesize about
a population parameter. But here we’re using the 150 largest
of the world’s countries.
We have information about nearly the whole population
here, so we're not inferring from a sample to a population,
we are measuring the whole population directly.
When we have population-level data like this, for example if
we used census data directly instead of taking a survey, the
standard errors can be treated as if they are near zero.
What does this mean for us?
The parameter estimates are still good, and so are the
predictions. However, the p-values are pretty much
meaningless in this case.
Next week: More polynomial models, and model selection!
To get the correlation matrix in these slides:
gapminder = read.csv("gapminder.csv")
x = cor(gapminder[,c(2,3,10,11,12,13,14)],
        use="pairwise.complete.obs")
round(x,3)
To compute the VIF of the explanatory variable xi:
Treat xi as a response variable, and make a regression model using the
other explanatory variables. Then,
VIFi = 1 / (1 - Ri²)
where Ri² is the proportion of variance in xi explained by the
other explanatory variables.
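This calculation can be done by hand in R. For example, for GDPpercap (column names as elsewhere in these notes):

```r
# Auxiliary regression: GDPpercap as the response,
# the other explanatory variables as predictors
aux = lm(GDPpercap ~ argi_in_gdp + HDI + GINI + health_spending +
           female_work, data = gapminder)

# VIF = 1 / (1 - R^2) of the auxiliary regression;
# this matches the value reported by vif() for GDPpercap
vif.GDPpercap = 1 / (1 - summary(aux)$r.squared)
vif.GDPpercap
```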