Limitations of simple linear regression
• So far, we’ve only been able to examine the relationship between two
variables.
• In many instances, we believe that more than one independent variable is
correlated with the dependent variable.
• Multiple linear regression is a tool that allows us to examine the
relationship between two or more regressors and a response variable.
• This technique is especially useful when trying to account for potential
confounding factors in observational studies.
General form of the model
• The model is very similar to the simple linear model from before, with the
addition of other regressor(s).
• As before, we are assuming that the regressors relate to the response
variable linearly.
• We assume the relationship is of the form:
E(Y) = β0 + β1X1 + β2X2 + ... + βkXk
• We use the least squares methodology from before, which minimizes
Σ_{i=1}^{n} (yi − ŷi)².
• The fitted model is of the form:
Ŷ = β̂0 + β̂1X1 + β̂2X2 + ... + β̂kXk
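As a rough illustration (not from the original slides), here is an R/S-style sketch of such a fit; the variable names and simulated numbers below are invented for the example:

    ## Simulated data purely for illustration; two regressors x1, x2.
    set.seed(1)
    n  <- 50
    x1 <- runif(n, 0, 10)
    x2 <- runif(n, 0, 5)
    y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(n)

    fit <- lm(y ~ x1 + x2)   # least squares: minimizes sum of (yi - yhat_i)^2
    coef(fit)                # the estimates: beta0-hat, beta1-hat, beta2-hat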
Model with two regressors
Consider the case of 2 regressors, which is much easier to visualize than more
complicated cases.
• In the case of 2 regressors we are fitting a plane through our data using the
same least squares strategy we had before.
• The coefficient of each independent variable tells us the estimated change in
the response associated with a one-unit increase in that independent variable,
given that the other independent variable is held constant.
• If both the regressors change, then the estimated change in the response
variable (∆Y) is given by β̂1 ∆X1 + β̂2 ∆X2.
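For instance, with made-up numbers: if β̂1 = 0.08, β̂2 = −0.3, ∆X1 = 2, and
∆X2 = 1, then ∆Y = 0.08 ∗ 2 + (−0.3) ∗ 1 = −0.14.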
Example with two regressors
Imagine that we have a random sample of families. For each family we have
information about the annual income, annual savings, and the number of
children. We are interested in how the number of children and the level of income
relate to the amount of money saved.
• Our response variable is the annual savings.
• Our regressors are number of children and annual income.
• Our fitted model will take the form:
estimated average savings = β̂0 + β̂1 ∗ income + β̂2 ∗ number of children
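A hedged sketch of this fit in R/S-style code; the `families` data frame below is simulated, since we do not have the real sample:

    ## Simulated stand-in for the family sample described above.
    set.seed(2)
    families <- data.frame(income   = runif(60, 10, 35),   # thousands of dollars
                           children = sample(1:3, 60, replace = TRUE))
    families$savings <- 0.4 + 0.08 * families$income -
                        0.3 * families$children + rnorm(60, sd = 0.3)

    fit <- lm(savings ~ income + children, data = families)
    coef(fit)   # beta0-hat, income coefficient, children coefficient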
Looking at the results as separate regression lines
[Figure: scatterplot of savings (in thousands of dollars) against income (in
thousands of dollars); each point is labeled 1, 2, or 3 by number of children,
and three parallel fitted lines are drawn for one-child, two-child, and
three-child families.]
What if the slopes are different?
• In the last example, the coefficient of income remains the same, regardless of
the number of kids in the family.
• What if you think that there’s an interaction between income and children?
(That is, you think the effect is not strictly additive.)
• You might choose to fit the model with an interaction effect, in which case
you are modeling:
mean savings = β0 + β1 ∗ income + β2 ∗ children + β3 ∗ income ∗ children
• This allows the coefficient associated with income to change based on the
number of children.
• This sort of model is still linear, because the unknowns (the βs) are linear in
their relationship to the knowns (income, children, income*children).
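A sketch of the interaction fit on simulated data (in R/S formula notation, income * children expands to income + children + income:children):

    ## Simulated data in which the income slope depends on children.
    set.seed(3)
    families <- data.frame(income   = runif(60, 10, 35),
                           children = sample(1:3, 60, replace = TRUE))
    families$savings <- 0.4 +
      (0.10 - 0.02 * families$children) * families$income + rnorm(60, sd = 0.3)

    fit_int <- lm(savings ~ income * children, data = families)
    coef(fit_int)   # now includes an income:children term (beta3-hat)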
Fitted lines with interaction term
[Figure: the same scatterplot of savings against income (both in thousands of
dollars), now with three separately sloped fitted lines for one-child,
two-child, and three-child families from the interaction model.]
Dummy variables
• What happens when we want to denote a category as a regressor?
• For instance, let’s say that we have a data set with gender as a variable. We
have denoted males as “0” and females as “1”.
• Such binary variables are called “dummy variables”.
• Adding such a variable into the regression allows different intercepts in the
fitted regression equation for males and females (or whatever two categories
you have).
• Adding an interaction term of the form “dummy*non-dummy” allows
different coefficients for the non-dummy variable for males and females.
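A sketch with a 0/1 dummy, again on simulated data with invented names:

    ## Dummy variable: 1 = female, 0 = male, as in the coding above.
    set.seed(4)
    n      <- 80
    income <- runif(n, 10, 40)
    female <- rbinom(n, 1, 0.5)
    savings <- 0.5 + 0.07 * income + 0.2 * female + rnorm(n, sd = 0.3)

    lm(savings ~ income + female)    # different intercepts for the two groups
    lm(savings ~ income * female)    # dummy interaction: different slopes too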
Interpretation of dummy variables
Imagine that we have a data set for a sample of families, including annual
income, annual savings, and whether the family has a single breadwinner
(“1”) or not (“0”).
Fit the model: mean savings = β0 + β1 ∗ income + β2 ∗ oneearn.
Assume we obtain the fitted regression equation:
estimated savings = 400 + 0.05 ∗ income − 0.02 ∗ oneearn
How does being a one breadwinner family affect the estimated average savings?
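(With these illustrative numbers, oneearn = 1 only shifts the intercept: a
one-breadwinner family's estimated average savings is 0.02 lower than an
otherwise identical family's, at every income level.)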
Interpretation of dummy variables (cont.)
Fit the model: mean savings = β0 + β1 ∗ income + β2 ∗ oneearn + β3 ∗ income ∗ oneearn
(variables same as those mentioned in the last slide).
Assume we obtain the fitted regression equation:
estimated savings = 400 + 0.10 ∗ income − 175 ∗ oneearn − 0.04 ∗ income ∗ oneearn
Re-examine how being a one breadwinner family affects the estimated average
savings.
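(With these illustrative numbers, oneearn = 1 now changes both pieces of the
fit: the intercept drops by 175, and the income coefficient falls from 0.10 to
0.10 − 0.04 = 0.06, so the estimated savings gap widens as income grows.)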
Inference concerning multiple regression coefficients
• As in the case of simple linear regression, we can also form confidence
intervals and conduct hypothesis tests for the coefficients βi .
• In the case of k regressors, the statistic β̂i / SEβ̂i has a t distribution
with n − k − 1 degrees of freedom.
• (100 − α)% confidence interval for βi: β̂i ± t(α/2; n − k − 1) ∗ SEβ̂i
• The estimates β̂i and their standard errors can be found on the output from
a statistical package like S-Plus. Also, the fitted regression equation is
sometimes presented with the standard errors listed under each estimate in
parentheses:

ŷ =  β̂0   +  β̂1 x1  +  β̂2 x2
    (SEβ̂0)   (SEβ̂1)    (SEβ̂2)
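A hedged sketch of such output and the t-based intervals in R/S-style code (data simulated for illustration):

    ## n = 40 observations, k = 2 regressors, so n - k - 1 = 37 df.
    set.seed(5)
    n        <- 40
    income   <- runif(n, 10, 40)
    children <- sample(1:3, n, replace = TRUE)
    savings  <- 0.5 + 0.08 * income - 0.3 * children + rnorm(n, sd = 0.4)

    fit <- lm(savings ~ income + children)
    summary(fit)                 # estimates, SEs, and t statistics
    confint(fit, level = 0.95)   # 95% CIs: beta-hat +/- t * SE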
Adjusted/corrected R²
• For multiple regression, we can still calculate the coefficient of
determination R² = SSR/SST.
• As before, R² measures the proportion of the sum of squares of deviations
of Y that can be explained by the relationship we have fitted using the
explanatory variables.
• Note that adding regressors can never cause R² to decrease, even if the
regressors do not seem to have a significant effect on the response Y.
• Adjusted (sometimes called “corrected”) R² takes into account the number
of regressors included in the model; in effect, it penalizes us for adding in
regressors that don’t “contribute their part” to explaining the response
variable.
• Adjusted R² is given by the following, where k is the number of regressors:

Adjusted R² = ((n − 1)R² − k) / (n − k − 1)
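A one-line helper implementing this formula (hypothetical function name):

    ## Adjusted R^2 from R^2, sample size n, and number of regressors k.
    adj_r2 <- function(r2, n, k) ((n - 1) * r2 - k) / (n - k - 1)
    adj_r2(0.5209, 93, 4)   # 0.4991, matching the hiring-salaries example below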
Example: Hiring salaries
Call: lm(formula = lgsalhr ~ age + educatn + seniorty + gender)

Coefficients:
                Value  Std. Error  t value  Pr(>|t|)
(Intercept)    8.7161      0.1116  78.1167    0.0000
age            0.0002      0.0001   2.7001    0.0083
educatn        0.0170      0.0045   3.8012    0.0003
seniorty      -0.0041      0.0009  -4.3281    0.0000
gender        -0.1439      0.0216  -6.6626    0.0000
Residual standard error: 0.09145 on 88 d.f.
Multiple R-Squared: 0.5209
Using the above information, adjusted
R² = ((93 − 1)(0.5209) − 4) / (93 − 4 − 1) = 0.4991.