Regression
Hal Varian
10 April 2006
What is regression?

History
Curve fitting v statistics
Correlation and causation
Statistical models
Gauss-Markov theorem
Maximum likelihood
Conditional mean
What can go wrong…
Examples
Francis Galton, 1877

Plotted first regression line:


Diameter of sweet peas v diameter of parents
Heights of fathers v heights of sons


Sons of unusually tall fathers tend to be tall, but
shorter than their fathers. Galton called this
“regression to mediocrity”.
But this is also true the other way around: fathers of unusually tall sons tend to be shorter than their sons!
Regression to the mean fallacy.


Pick the lowest-scoring 10% on the midterm and give them extra tutoring.
If they do better on the final, what can you conclude? Did the tutoring help?
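A minimal simulation sketch of the fallacy, assuming midterm and final scores are both noisy measures of the same underlying ability; all names and numbers here are invented for illustration:

# Regression to the mean without any intervention:
# midterm and final are two noisy measurements of the same skill.
set.seed(1)
ability <- rnorm(1000)                 # true (unobserved) skill
midterm <- ability + rnorm(1000)       # midterm = ability + luck
final   <- ability + rnorm(1000)       # final   = ability + fresh luck

bottom <- midterm <= quantile(midterm, 0.10)  # lowest 10% on the midterm
mean(midterm[bottom])   # very low, partly because of bad luck
mean(final[bottom])     # noticeably higher, with no tutoring at all

The bottom group improves on the final even though nothing was done for them, because part of their low midterm score was bad luck.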
Regression analysis

Assume a linear relation between two
variables and estimate unknown
parameters



y_t = a + b x_t + e_t for t = 1, …, T
observed = fitted + error (residual)
dependent variable ~ independent variables / predictors / correlates
Curve fitting v regression

Often choose (a,b) to minimize the sum of
squared residuals (“least squares”)





Why not absolute value of residuals?
Why not fit x_t = a + b y_t?
How much can you trust the estimated values?
Need a statistical model to answer these
questions!
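To make the first question concrete, here is a sketch comparing least squares with the least-absolute-deviations criterion by direct numerical minimization; the data and function names are made up for illustration:

# Compare least squares with least absolute deviations on the same data.
set.seed(2)
x <- 1:50
y <- 3 + 0.5 * x + rnorm(50, sd = 5)

sse <- function(p) sum((y - p[1] - p[2] * x)^2)   # sum of squared residuals
sae <- function(p) sum(abs(y - p[1] - p[2] * x))  # sum of absolute residuals

optim(c(0, 0), sse)$par   # numerically minimized least-squares fit
optim(c(0, 0), sae)$par   # least-absolute-deviations fit
coef(lm(y ~ x))           # closed-form least-squares answer for comparison

Both criteria produce fits; deciding between them, and attaching standard errors, is where the statistical model comes in.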
Linear regression: linear in parameters

Nonlinear regression, local regression, generalized linear models, generalized additive models: same principles apply
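A quick illustration of "linear in parameters": a quadratic in x is still a linear regression, since the model is linear in (a, b, c). The data below are simulated for illustration:

# "Linear" means linear in the coefficients, not in x:
# y = a + b*x + c*x^2 is still a linear regression.
set.seed(3)
x <- runif(100, 0, 10)
y <- 1 + 2 * x - 0.3 * x^2 + rnorm(100)
quad <- lm(y ~ x + I(x^2))   # I() protects ^ from formula syntax
coef(quad)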
Possible goals




Estimate parameters (a, b, and the error variance)
Test hypotheses (such as "x has no influence on y")
Make predictions about y conditional on observing a new x-value
Summarize data (most common unstated goal!)
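The first three goals map directly onto base R calls; a sketch with simulated data (the new x-value 1.5 is arbitrary):

set.seed(4)
x <- rnorm(100); y <- 2 + 3 * x + rnorm(100)
fit <- lm(y ~ x)

coef(fit)                          # goal 1: parameter estimates
summary(fit)$coefficients          # goal 2: t-tests of "no influence"
predict(fit, newdata = data.frame(x = 1.5),
        interval = "prediction")   # goal 3: prediction at a new x-value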
Summarizing relationships

Would like to be able to interpret regression as "causal":
"If x changes by Δx, then y will on average change by Δx·b."

Correlation v causation:
Compare the time on my wristwatch with the time on your wristwatch…

Even ideally, the best you can say is:
"When x changes by Δx in the sample, then on average y changes by Δx·b in the sample."
Problem with causality

There may be a “third cause”


“my watch time” and “your watch time” both
depend on NIST time
Economics example




income ~ b·education + (unobserved IQ + other)
education ~ IQ
Higher income is associated with higher education in the sample, but b is a biased estimate of the partial effect of education on income.
Need a controlled experiment or a more elaborate estimation technique to resolve this "simultaneous equations bias".
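A simulation sketch of the bias, with all coefficients invented for illustration: income depends on both education and unobserved IQ, and education is itself correlated with IQ:

# Omitted variable bias: the short regression's education coefficient
# absorbs part of the IQ effect.
set.seed(5)
iq        <- rnorm(1000)
education <- 0.7 * iq + rnorm(1000)              # education ~ IQ
income    <- 1.0 * education + 1.0 * iq + rnorm(1000)

coef(lm(income ~ education))        # biased upward: picks up the IQ effect
coef(lm(income ~ education + iq))   # controlling for IQ recovers ~1.0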
Statistical regression model



y_t = a + b x_t + e_t for t = 1, …, T
Think of the random variable e_t as the sum of the other, omitted effects.
What are attractive properties for the error term?

E e_t = 0
Var e_t = constant
E e_t e_s = 0 for t ≠ s (errors are uncorrelated)
E x_t e_t = 0 (errors are uncorrelated with the explanatory variables – often problematic for the reasons on the last slide! Exogenous v endogenous.)

Have to ask: how do the variables you don't observe affect the variables you do observe?
Optimality properties


Gauss-Markov theorem: if the error term has these properties, then the linear regression estimates of (a, b) are BLUE = "best linear unbiased estimates" = out of all unbiased estimates that are linear in y_t, the least squares estimates have minimum variance.
If the e_t are IID Normal, then the OLS estimates are maximum likelihood estimates.
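A Monte Carlo sketch of the "minimum variance" claim: compare the OLS slope with another estimator that is also linear in y_t and unbiased, the slope through the two endpoints. The design and replication count are arbitrary:

# Both estimators are unbiased for the true slope 2,
# but OLS has the smaller sampling variance.
set.seed(6)
x <- 1:20
b_ols <- b_end <- numeric(5000)
for (i in 1:5000) {
  y <- 1 + 2 * x + rnorm(20)
  b_ols[i] <- coef(lm(y ~ x))[2]
  b_end[i] <- (y[20] - y[1]) / (x[20] - x[1])   # also linear in y, unbiased
}
c(mean(b_ols), mean(b_end))   # both close to 2
c(var(b_ols),  var(b_end))    # OLS variance is smaller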
Conditional means


In the regression model, note that the expected value of y_t is a + b x_t. So the conditional mean is linear in x_t, which is another interpretation of regression.
More generally, can think of the regression model as being: E y_t = f(x_t, b)
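For a conditional mean that is nonlinear in the parameters, base R's nls() fits E y = f(x, b) by nonlinear least squares; a sketch with a made-up exponential mean function:

# Nonlinear conditional mean E y = a * exp(b * x), fit by nls().
set.seed(7)
x <- seq(0, 5, length.out = 100)
y <- 2 * exp(0.5 * x) + rnorm(100)
fit <- nls(y ~ a * exp(b * x), start = list(a = 1, b = 0.3))
coef(fit)   # estimates of (a, b) in the nonlinear conditional mean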
Regression output




Estimates of parameters
Standard errors of estimates and error
term
t-statistics = estimate/se and p-values
R² = goodness-of-fit measure
Total SS = Fitted SS + Residual SS
R² = Fitted SS / Total SS
Example from R

> x <- 1:100
> y <- x + 10*rnorm(100)
> summary(lm(y~x))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.81944    1.86779   0.974    0.332
x            0.97354    0.03211  30.319   <2e-16 ***

Residual standard error: 9.269 on 98 degrees of freedom
Multiple R-Squared: 0.9037, Adjusted R-squared: 0.9027

[Scatterplot of y against x with the fitted regression line]
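Continuing with the x and y just generated, a sketch verifying the R² identity from the previous slide by hand:

fit <- lm(y ~ x)
total_ss  <- sum((y - mean(y))^2)
fitted_ss <- sum((fitted(fit) - mean(y))^2)
resid_ss  <- sum(resid(fit)^2)
c(total_ss, fitted_ss + resid_ss)   # the decomposition holds
fitted_ss / total_ss                # matches summary(fit)$r.squared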
What can go wrong?

Nonlinear relationship
  Try quadratic, interaction term, logs, etc.
Var e_t is not constant
  Heteroskedasticity – affects testing, not estimates
  Take logs or use weighted least squares
Serial correlation – affects testing and prediction accuracy
  Use time series methods
Multiple regression – collinearity
  Socks ~ right shoes + left shoes + shoes + error
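A sketch of how perfect collinearity shows up in practice: with made-up shoe counts, "shoes" is exactly right + left, and R reports the redundant coefficient as NA:

# Perfect collinearity: shoes = right + left, so lm() cannot
# separate the three effects and flags the aliased regressor.
set.seed(9)
right <- rpois(100, 5)
left  <- right + rbinom(100, 1, 0.2)    # a few odd singles, so right != left
shoes <- right + left                   # exact linear combination
socks <- 0.5 * shoes + rnorm(100)
coef(lm(socks ~ right + left + shoes))  # shoes comes back NA (aliased)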
What can go wrong, cont

Errors in variables
  Underestimate magnitude of the true effect
Omitted variable bias
  Bias depends on the correlation of omitted with included variables
Simultaneous equations bias
  The "third cause" alluded to earlier; need to estimate the full model or use a controlled experiment
Outliers
  Non-normality of errors and influential observations – remove them or use robust estimation
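For outliers, one option is a robust M-estimator; a sketch using rlm() from the MASS package (shipped with R), with one fabricated wild point:

# One outlier pulls the least squares fit; a robust fit largely ignores it.
library(MASS)
set.seed(10)
x <- 1:50
y <- 2 + 0.5 * x + rnorm(50)
y[50] <- 100                       # one wild observation
coef(lm(y ~ x))                    # slope pulled toward the outlier
coef(rlm(y ~ x))                   # robust M-estimate stays near 0.5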
Diagnostics


Look at residuals!!
R allows you to plot various regression diagnostics:

reg <- lm(y~x)   # fit and store the model
plot(reg)        # residuals v fitted, normal Q-Q, scale-location, residuals v leverage
Examples to follow…