LIR 832 - Lecture 5 - Michigan State University

Regression: An Introduction
LIR 832
Regression Introduced

Topics of the day:

A. What does OLS do? Why use OLS? How does it work?
B. Residuals: What we don't know.
C. Moving to the Multi-variate Model
D. Quality of Regression Equations: R²
Regression Example #1
 Just what is regression and what can it do?
 To address this, consider the study of truck driver turnover in the first lecture…
Regression Example #2
 Suppose that we are interested in understanding the determinants of teacher pay.
 What we have is a data set on average per-pupil expenditures and average teacher pay by state…
Regression Example #2
Descriptive Statistics: pay, expenditures

Variable     N   Mean  Median  TrMean
pay         51  24356   23382   23999
expendit    51   3697    3554    3596

Variable  Minimum  Maximum     Q1     Q3  StDev  SE Mean
pay         18095    41480  21419  26610   4179      585
expendit     2297     8349   2967   4123   1055      148
Regression Example #2
Covariances: pay, expenditures

              pay  expendit
pay      17467605
expendit  3679754   1112520
Correlations: pay, expenditures
Pearson correlation of pay and expenditures = 0.835
P-Value = 0.000
Regression Example #2
[Scatter plot: Avg. Pay ($0–$45,000) against Expenditures ($0–$9,000), by state]
Regression Example #2
The regression equation is
pay = 12129 + 3.31 expenditures

Predictor     Coef  SE Coef      T      P
Constant     12129     1197  10.13  0.000
expendit    3.3076   0.3117  10.61  0.000

S = 2325   R-Sq = 69.7%   R-Sq(adj) = 69.1%
pay = 12129 + 3.31 expenditures is the equation of a line
and we can add it to our plot of the data.
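As a rough sketch of how such a line can be estimated in practice, the snippet below fits a least-squares line with Python's numpy. The five (expenditure, pay) pairs are illustrative stand-ins drawn from the summary table, not the actual 51-observation state data set.

import numpy as np

# Illustrative stand-ins for the 51 state observations (hypothetical values).
expend = np.array([2297.0, 3554.0, 4123.0, 8349.0, 2967.0])
pay = np.array([18095.0, 23382.0, 26610.0, 41480.0, 21419.0])

# np.polyfit with degree 1 returns the least-squares (slope, intercept).
b1, b0 = np.polyfit(expend, pay, deg=1)
print(f"pay = {b0:.0f} + {b1:.2f} * expenditures")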
Regression Example #2
[Scatter plot of Avg. Pay against Expenditures with the fitted line Pay = 12129 + 3.31*Expenditures]
Regression: What Can We Learn?
 What can we learn from the regression?
 Q1: What is the relationship between per-pupil expenditures and teacher pay?
 A: For every additional dollar of expenditure, pay increases by $3.31.
Regression: What Can We Learn?
 Q2: Given our sample, is it reasonable to suppose that increased expenditures are associated with higher teacher pay?
 H0: expenditures make no difference: β ≤ 0
 HA: expenditures increase pay: β > 0
 P( (b₁ − 0)/SE(b₁) > (3.3076 − 0)/0.3117 ) = P( t > 10.61 )
 A: Reject our null; it is reasonable to believe there is a positive relationship.
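A quick sketch of this test using only the reported coefficient and standard error, assuming a one-sided t test with n − 2 = 49 degrees of freedom:

from scipy import stats

coef, se, n = 3.3076, 0.3117, 51
t_stat = (coef - 0.0) / se            # (b1 - 0) / SE(b1)
p_one_sided = stats.t.sf(t_stat, df=n - 2)
print(f"t = {t_stat:.2f}, one-sided p = {p_one_sided:.6f}")   # t = 10.61, p ~ 0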
Regression: What Can We Learn?
 Q3: What proportion of the variance in teacher pay can we explain with our regression line?
 A: R-Sq = 69.7%
Regression: What Can We Learn?
 Q4: We can also make predictions from the regression model. What would teacher pay be if we spent $4,000 per pupil?
 A: pay = 12129 + 3.31 expenditures
 pay = 12129 + 3.31*4000 = $25,369
 What if we had per-pupil expenditures of $6,400 (Michigan's amount)?
 Pay = 12129 + 3.31*6400 = $33,313
Regression: What Can We Learn?
 Q5: For the states where we have data, we can also observe the difference between our prediction and the actual amount.
 A: Take the case of Alaska:
 expenditures: $8,349
 actual pay: $41,480
 predicted pay = 12129 + 3.3076*8,349 ≈ $39,744
 difference between actual and predicted pay:
 41480 − 39744 ≈ $1,736
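The same prediction-and-residual arithmetic, written as a small Python check (the fitted coefficients are taken from the regression output above):

def predict_pay(expenditures: float) -> float:
    """Fitted line from the output: pay = 12129 + 3.3076*expenditures."""
    return 12129 + 3.3076 * expenditures

# Alaska: expenditures $8,349, actual pay $41,480
predicted = predict_pay(8349)
residual = 41480 - predicted
print(f"predicted = {predicted:,.0f}, residual = {residual:,.0f}")   # ~39,744 and ~1,736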
Regression: What Can We Learn?
 Note that we have under-predicted actual pay. Why might this occur?
 This difference is called the residual; it is a measure of the imperfection of our model.
 What is the residual for the state of Maine?
 per-pupil expenditure is $3,346
 actual teacher pay is $19,583
Regression: What Can We Learn?
Residual (e) = Actual − Predicted

[Scatter plot of Avg. Pay against Expenditures with the fitted line, illustrating residuals]
Regression Nomenclature
Yᵢ = β₀ + β₁*Xᵢ + εᵢ

  Y: Dependent Variable       β₀: Intercept
  X: Explanatory Variable     β₁: Slope Coefficient
  ε: Residual or Error        i indexes the observation

Payᵢ = 12,129 + 3.31*Expenditureᵢ + eᵢ
Components of a Regression
Model
 Dependent variable: we are trying to explain the movement of the dependent variable around its mean.
 Explanatory variable(s): we use these variables to explain the movement of the dependent variable.
 Error term: the difference between what we can account for with our explanatory variables and the actual value taken on by the dependent variable.
 Parameter: the measure of the relationship between an explanatory variable and the dependent variable.
Regression Models are Linear
 Q: What do we mean by "linear"?
 A: The equation takes the form:

Y = a + b*X

where
  Y: the variable being predicted
  X: the predictor variable
  a: intercept of the line
  b: slope of the line
Regression Example #3
 Using numbers, let's make up an equation for a compensation bonus system in which everyone starts with a bonus of $500 annually and then receives an additional $100 for every job point earned.

Bonus Income = $500 + $100 * Job Points

 Now create a table relating job points to bonus income, as sketched below.
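A minimal sketch of that table in Python (the point values 0–50 are arbitrary choices for illustration):

# Bonus rule: $500 base plus $100 per job point.
for points in range(0, 60, 10):
    bonus = 500 + 100 * points
    print(f"{points:3d} job points -> ${bonus:,} bonus")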
Regression Example #3

[Slides: table and plot of job points against bonus income]
Regression Example #3
 Basic model takes the form:

Y = β₀ + β₁*X + ε

 or, for the bonus pay example,

Bonus Income = $500 + $100*Job Points + ε
Regression Example #3
 This is the equation of a line where:
 $500 is the minimum bonus when the individual has no bonus points. This is the intercept of the line.
 $100 is the increase in the total bonus for every additional job point. This is the slope of the line.
 Or:
 β₀ is the intercept of the vertical axis (Y axis) when X = 0
 β₁ is the change in Y for every 1-unit change in X, or:

β₁ = (Y₂ − Y₁)/(X₂ − X₁) = ΔY/ΔX = rise/run
Regression Example #3
 For points on the line:
 Let X₁ = 10 and X₂ = 20
 Using our line:
 Y₁ = $500 + $100*10 = $1,500
 Y₂ = $500 + $100*20 = $2,500
Regression Example #3
 1. The change in bonus pay for a 1-point increase in job points:

β₁ = ($2,500 − $1,500)/(20 − 10) = $1,000/10 = $100
 2. What do we mean by "linear"?
 Y = β₀ + β₁*X + ε is the equation of a line.
Regression Example #3
 Equation of a line which is linear in coefficients but not variables:

Y = β₀ + β₁*X + β₂*X² + ε

 Think about a new bonus equation:

Total Bonus = $500 + $0 * Bonus Points + $10 * Bonus Points²

 Base bonus is still $500.
 You now get $0 per bonus point and $10 per bonus point squared, as sketched below.
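The sketch below illustrates why this is still an OLS problem: generating data from the quadratic bonus rule and regressing bonus on points and points squared recovers the coefficients, because the model is linear in the coefficients. The point values are arbitrary.

import numpy as np

points = np.arange(0.0, 11.0)                  # arbitrary illustrative values
bonus = 500 + 0 * points + 10 * points**2      # the new bonus rule

# Linear in the coefficients: the design matrix has columns 1, X, X^2.
X = np.column_stack([np.ones_like(points), points, points**2])
coefs, *_ = np.linalg.lstsq(X, bonus, rcond=None)
print(np.round(coefs, 6))                      # approximately [500, 0, 10]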
Regression Example #3

[Slides: table and plot of bonus income against bonus points under the quadratic rule]
Linearity of Regression Models
 Y = β₀ + β₁*X^β₂ + ε is not the equation of a line.
 Regression has to be linear in coefficients, not variables.
 We can mimic curves and much else if we are clever.
The Error Term
 The error term is the difference between what has occurred and what we predict as an outcome.
 Our models are imperfect because of:
 omitted "minor" influences
 measurement error in Y and the X's
 issues of functional form (linear model for a non-linear relationship)
 pure randomness of behavior
The Error Term
 Our full equation is Y = β₀ + β₁*X + ε
 However, we often write the deterministic part of our model as:

E(Y|X) = β₀ + β₁*X

 The use of "conditional on X" is similar to conditional probabilities. Essentially we are saying this is our best guess about Y given the value of X.
The Error Term
 This is also written as

Ŷ = β̂₀ + β̂₁*X

 Note that Ŷ is called "Y-hat," the estimate of Y.
 So we can write the full model as:

Y = Ŷ + ε

 What does this mean in practice? The same X value may produce somewhat different Y values. Our predictions are imperfect!
Populations, Samples, and
Regression Analysis
 Population Regression: Y = β₀ + β₁X₁ + ε
 The population regression is the equation for the entire group of interest, similar in concept to μ, the population mean.
 The population regression is indicated with Greek letters.
 The population regression is typically not observed.
Populations, Samples, and
Regression Analysis
 Sample Regressions:
 As with means, we take samples and use these samples to learn about (make inferences about) populations (and population regressions).
 The sample regression is written as

yᵢ = b₀ + b₁x₁ᵢ + eᵢ

 or as

yᵢ = β̂₀ + β̂₁X₁ᵢ + eᵢ
Populations, Samples, and
Regression Analysis
 As with all sample results, there are many samples which might be drawn from a population. These samples will typically provide somewhat different estimates of the coefficients. This is, once more, sampling variation.
Populations and Samples:
Regression Example
 Illustrative Exercise:
 1. Estimate a simple regression model for all of the data on managers and professionals, then take random 10% subsamples of the data and compare the estimates (see the sketch after this list).
 2. Sample estimates are generated by assigning a number between 0 and 1 to every observation using a uniform distribution. We then choose observations for all of the numbers between 0 and 0.1, 0.1 and 0.2, 0.2 and 0.3, etc.
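A sketch of this sampling exercise with simulated data; the "population" here is generated from the estimated equation plus noise, since the actual CPS file is not reproduced in these notes:

import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for the managers/professionals data set.
n = 47576
educ = rng.integers(8, 21, size=n).astype(float)
earn = -485 + 87.5 * educ + rng.normal(0, 530, size=n)

# Each observation gets a uniform draw; each tenth of [0, 1) is a 10% subsample.
u = rng.uniform(0, 1, size=n)
for j in range(5):
    mask = (u >= j / 10) & (u < (j + 1) / 10)
    b1, b0 = np.polyfit(educ[mask], earn[mask], deg=1)
    print(f"sample {j + 1}: intercept {b0:8.2f}, slope {b1:6.2f}")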
Populations and Samples:
Regression Example
POPULATION ESTIMATES: Results for: lir832-managers-and-professionals-2000.mtw

The regression equation is
weekearn = - 485 + 87.5 years ed

47576 cases used 7582 cases contain missing values

Predictor      Coef  SE Coef       T      P
Constant    -484.57    18.18  -26.65  0.000
years ed     87.492    1.143   76.54  0.000

S = 530.5   R-Sq = 11.0%   R-Sq(adj) = 11.0%

Analysis of Variance
Source             DF           SS          MS        F      P
Regression          1   1648936872  1648936872  5858.92  0.000
Residual Error  47574  13389254994      281441
Total           47575  15038191866
Side Note: Reading Output
The regression equation is
weekearn = - 485 + 87.5 years ed
[equation with dependent variable]

47576 cases used 7582 cases contain missing values
[number of observations and number with missing data - why is the latter important]

Predictor      Coef  SE Coef       T      P
Constant    -484.57    18.18  -26.65  0.000
years ed     87.492    1.143   76.54  0.000
[detailed information on estimated coefficients, standard error, t against a null of zero, and a p against a null of 0]

S = 530.5   R-Sq = 11.0%   R-Sq(adj) = 11.0%
[two goodness of fit measures]
Side Note: Reading Output
Analysis of Variance
Source             DF           SS          MS        F      P
Regression          1   1648936872  1648936872  5858.92  0.000   (ESS)
Residual Error  47574  13389254994      281441                   (SSR)
Total           47575  15038191866                               (TSS)

[This tells us the number of degrees of freedom, the explained sum of squares, the residual sum of squares, the total sum of squares and some test statistics]
Populations and Samples:
Regression Example
SAMPLE 1 RESULTS

The regression equation is
weekearn = - 333 + 79.2 Education

4719 cases used 726 cases contain missing values

Predictor      Coef  SE Coef      T      P
Constant    -333.24    58.12  -5.73  0.000
Educatio     79.208    3.665  21.61  0.000

S = 539.5   R-Sq = 9.0%   R-Sq(adj) = 9.0%
Populations and Samples:
Regression Example
SAMPLE 2 RESULTS

The regression equation is
weekearn = - 489 + 88.2 Education

4792 cases used 741 cases contain missing values

Predictor      Coef  SE Coef      T      P
Constant    -488.51    56.85  -8.59  0.000
Educatio     88.162    3.585  24.59  0.000

S = 531.7   R-Sq = 11.2%   R-Sq(adj) = 11.2%
Populations and Samples:
Regression Example
SAMPLE 3 RESULTS

The regression equation is
weekearn = - 460 + 85.9 Education

4652 cases used 773 cases contain missing values

Predictor      Coef  SE Coef      T      P
Constant    -460.15    56.45  -8.15  0.000
Educatio     85.933    3.565  24.10  0.000

S = 525.2   R-Sq = 11.1%   R-Sq(adj) = 11.1%
Populations and Samples:
Regression Example
SAMPLE 4 RESULTS

The regression equation is
weekearn = - 502 + 88.4 Education

4708 cases used 787 cases contain missing values

Predictor      Coef  SE Coef      T      P
Constant    -502.18    57.51  -8.73  0.000
Educatio     88.437    3.632  24.35  0.000

S = 535.6   R-Sq = 11.2%   R-Sq(adj) = 11.2%
Populations and Samples:
Regression Example
SAMPLE 5 RESULTS

The regression equation is
weekearn = - 485 + 87.9 Education

4737 cases used 787 cases contain missing values

Predictor      Coef  SE Coef      T      P
Constant    -485.19    56.60  -8.57  0.000
Educatio     87.875    3.572  24.60  0.000

S = 523.4   R-Sq = 11.3%   R-Sq(adj) = 11.3%
Populations and Samples:
A Recap of the Example
Estimate        β₀ (Intercept)   β₁ (Coefficient on Education)
POPULATION             -484.57           87.49
Sample 1               -333.24           79.21
Sample 2               -488.51           88.16
Sample 3               -460.15           85.93
Sample 4               -502.18           88.44
Sample 5               -485.19           87.88
Populations and Samples: A
Recap of the Example
 The sample estimates are not exactly equal to the population estimates.
 Different samples produce different estimates of the slope and intercept.
Ordinary Least Squares (OLS):
How We Determine the Estimates
 The residual is a measure of what we do not know:

eᵢ = yᵢ − (b₀ + b₁x₁ᵢ)

 We want the eᵢ to be as small as possible.
 How do we choose (b₀, b₁)? AKA: criteria for the sample regression:
 Choose among lines so that:

Σᵢ₌₁ⁿ eᵢ = 0

 The average value of the residual is zero.
 Statistically, this occurs for any line that passes through the point of means (X̄, Ȳ).
 Problem: there are an infinity of lines which meet this criterion.
Example of a Possible Regression
Line
[Scatter plot of Avg. Pay against Expenditures with a candidate line through the point of means: Expenditures mean = $3,696, Pay mean = $24,356]
Problem: Many Lines
Meet That Criterion

[Scatter plot showing multiple lines through the point of means: Expenditures mean = $3,696, Pay mean = $24,356]
OLS: Choosing the Coefficients
 Among these lines, find the (b₀, b₁) pair which minimizes the sum of squared residuals:

min Σᵢ₌₁ⁿ eᵢ² = e₁² + e₂² + … + eₙ²

 We want to make the difference between the prediction and the actual value, (Y − E(Y|X)), as small as possible.
 Squaring puts greater weight on avoiding large individual differences between actual and predicted values.
 So we will choose the middle course, middle-sized errors, rather than a combination of large and small errors (illustrated numerically in the sketch below).
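To make the criterion concrete, the sketch below minimizes the sum of squared residuals numerically, using the small attendance/output data that appears later in this lecture, and lands on the same coefficients the OLS formulas give:

import numpy as np
from scipy.optimize import minimize

# Attendance/output data used later in the lecture.
x = np.array([8.0, 3.0, 2.0, 6.0, 4.0])
y = np.array([40.0, 28.0, 20.0, 39.0, 28.0])

def ssr(b):
    """Sum of squared residuals for candidate intercept b[0] and slope b[1]."""
    return np.sum((y - b[0] - b[1] * x) ** 2)

result = minimize(ssr, x0=[0.0, 0.0])
print(np.round(result.x, 3))   # approximately [15.733, 3.319]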
OLS: Choosing the Coefficients
 What are the characteristics of a sample regression?
 It can be shown that, if the two conditions above hold (mean residual of zero, minimized sum of squared residuals), our regression line is the:
 Best
 Linear
 Unbiased
 Estimator (or B-L-U-E)
 This result is called the Gauss-Markov Theorem.
OLS: Choosing the Coefficients
Descriptive Statistics: pay, expenditures

Variable     N   Mean  Median  TrMean
pay         51  24356   23382   23999
expendit    51   3697    3554    3596

Variable  Minimum  Maximum     Q1     Q3  StDev  SE Mean
pay         18095    41480  21419  26610   4179      585
expendit     2297     8349   2967   4123   1055      148
OLS: Choosing the Coefficients
The regression equation is
pay = 12129 + 3.31 expenditures

Predictor     Coef  SE Coef      T      P
Constant     12129     1197  10.13  0.000
expendit    3.3076   0.3117  10.61  0.000

S = 2325   R-Sq = 69.7%   R-Sq(adj) = 69.1%

Analysis of Variance
Source          DF         SS         MS       F      P
Regression       1  608555015  608555015  112.60  0.000
Residual Error  49  264825250    5404597
Total           50  873380265
OLS: Choosing the Coefficients
Mean of the residuals equal to zero?

Descriptive Statistics: Residual

Variable     N  Mean  Median  TrMean
Residual    51    -0    -218    -107

Variable  Minimum  Maximum     Q1    Q3  StDev  SE Mean
Residual    -3848     5529  -2002  1689   2301      322
OLS: Choosing the Coefficients
Passes Through the Point of Means?

pay = 12129 + 3.3076 expenditures

Variable     N   Mean
pay         51  24356
expendit    51   3697

$24,356 ≈ 12129 + 3.3076*3697
$24,356 ≈ 12129 + 12,228.20
$24,356 ≈ $24,357.20

Not too bad with rounding!
OLS: Demonstrating Residuals
[Scatter plot of Avg. Pay against Expenditures with the fitted line; residuals e₁ and e₂ shown as vertical distances from observations to the line, with means marked at $3,696 and $24,356]
How Does OLS Know
Which Line is BLUE?
 If we are trying to minimize the sum of squared residuals, we can manipulate the model to find the following:

yᵢ = b₀ + b₁x₁ᵢ + eᵢ
eᵢ = yᵢ − b₀ − b₁x₁ᵢ

 Therefore:

Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ (yᵢ − b₀ − b₁xᵢ)²

 Thus, since BLUE requires us to minimize the sum of squared residuals, OLS chooses the b₀ and b₁ that minimize the right side (since we know y and x).
How Does OLS Calculate the
Coefficients?
 The formulas used for the coefficients are as follows:

b₁ = cov(x, y)/var(x) = σ_xy/σ_x² = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)²

b₀ = ȳ − b₁x̄
Illustrative Example:
Attendance and Output
We want to build a model of output based on attendance. We hypothesize the following:

output = β₀ + β₁*attendance + e

Attendance  Output
         8      40
         3      28
         2      20
         6      39
         4      28
Example Results
The regression equation is
output = 15.7 + 3.32 attend

Predictor     Coef  SE Coef     T      P
Constant    15.733    3.247  4.85  0.017
attend      3.3190   0.6392  5.19  0.014

S = 3.079   R-Sq = 90.0%   R-Sq(adj) = 86.6%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       1  255.56  255.56  26.96  0.014
Residual Error   3   28.44    9.48
Total            4  284.00

Obs  attend  output    Fit  SE Fit  Residual  St Resid
  1    8.00   40.00  42.28    2.57     -2.28     -1.35
  2    3.00   28.00  25.69    1.72      2.31      0.90
  3    2.00   20.00  22.37    2.16     -2.37     -1.08
  4    6.00   39.00  35.65    1.64      3.35      1.29
  5    4.00   28.00  29.01    1.43     -1.01     -0.37
Computing the Coefficients
Obs   Attendance(X)  Output(Y)  (X-Xbar)  (X-Xbar)^2  (Y-Ybar)  (X-Xbar)*(Y-Ybar)
1                 8         40       3.4       11.56         9               30.6
2                 3         28      -1.6        2.56        -3                4.8
3                 2         20      -2.6        6.76       -11               28.6
4                 6         39       1.4        1.96         8               11.2
5                 4         28      -0.6        0.36        -3                1.8
mean            4.6         31            sum: 23.2                     sum: 77

cov(x,y)/var(x) = 77/23.2 = 3.31896
So, b1 = 3.31896.
Thus, b0 = ybar - b1*xbar = 31 - 3.31896*4.6 = 15.732
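The same computation, as a short Python check of the hand calculation above:

import numpy as np

x = np.array([8.0, 3.0, 2.0, 6.0, 4.0])        # attendance
y = np.array([40.0, 28.0, 20.0, 39.0, 28.0])   # output

# b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2);  b0 = ybar - b1*xbar
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(round(b1, 5), round(b0, 3))   # 3.31897 and 15.733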
Example: Residual Analysis
Variable   N  Mean  Median  TrMean
C15        5  0.00   -1.01    0.00

Variable  Minimum  Maximum     Q1    Q3  StDev  SE Mean
C15         -2.37     3.35  -2.33  2.83   2.66     1.19
Exercise
 We are interested in the relationship between the number of weeks an employee has been in some firm-sponsored training course and output. We have data on three employees. Compute the coefficients for the following model:

output = β₀ + β₁*training + e

              Weeks Training  Output
employee 1                10     590
employee 2                20     400
employee 3                30     430
Exercise: Worksheet
Using the data, calculate b1 and b0:

            Training (X)  Output (Y)  X-Xbar  (X-Xbar)^2  Y-Ybar  (X-Xbar)*(Y-Ybar)
Employee 1
Employee 2
Employee 3
mean (= X-BAR, Y-BAR)                    sum:                                  sum:

VAR(X) =                 COV(X,Y) =
b1 = COV(X,Y)/VAR(X) =
b0 = Y-BAR - b1*X-BAR =
OLS: The Intercept (b₀)

 Why you shouldn't spend too much time worrying about the value of the intercept:

b₀ = 24356 − 3.3076*3697 ≈ 12129

 Note that b₀ is the value for pay if expenditures were equal to 0, something we may never observe.
Multiple Regression
 Few outcomes are determined by a single factor:
 1. We know that gender plays an important role in determining pay. Is gender the only factor?
 2. What is likely to matter in determining attendance at a work site:
   our program
   holidays
   weather
   illness
   demographics of the labor force
Multiple Regression
 A complete model of an outcome will depend not only on inclusion of our explanatory variable of interest, but also on including other variables which we believe influence the outcome.
 Getting the "correct" estimates of our coefficients depends on specifying the balance of the equation correctly. This raises the bar in our work.
Multiple Regression: Example
 An example with weekly earnings:
 1. Regress weekly earnings of managers on education
 2. Add age and gender to the model
 3. Add weekly hours to the model
Example: Weekly Earnings
The regression equation is
weekearn = - 485 + 87.5 years ed

47576 cases used 7582 cases contain missing values

Predictor      Coef  SE Coef       T      P
Constant    -484.57    18.18  -26.65  0.000
years ed     87.492    1.143   76.54  0.000

S = 530.5   R-Sq = 11.0%   R-Sq(adj) = 11.0%

Analysis of Variance
Source             DF           SS          MS        F      P
Regression          1   1648936872  1648936872  5858.92  0.000
Residual Error  47574  13389254994      281441
Total           47575  15038191866
Example: Weekly Earnings
The regression equation is
weekearn = - 402 + 76.4 years ed + 6.29 age - 319 Female

47576 cases used 7582 cases contain missing values

Predictor       Coef  SE Coef       T      P
Constant     -401.76    18.87  -21.29  0.000
age           6.2874   0.2021   31.11  0.000
Female      -318.522    4.625  -68.87  0.000
years ed      76.432    1.089   70.16  0.000

S = 500.4   R-Sq = 20.8%   R-Sq(adj) = 20.8%

Analysis of Variance
Source             DF           SS          MS        F      P
Regression          3   3126586576  1042195525  4162.27  0.000
Residual Error  47572  11911605290      250391
Total           47575  15038191866
Example: Weekly Earnings
The regression equation is
weekearn = - 1055 + 65.7 years ed + 6.87 age - 229 Female + 18.2 uhour-cd

44839 cases used 10319 cases contain missing values

Predictor       Coef  SE Coef       T      P
Constant    -1054.63    19.48  -54.15  0.000
age           6.8736   0.1932   35.57  0.000
Female      -229.466    4.490  -51.10  0.000
uhour-cd     18.2205   0.2183   83.47  0.000
years ed      65.701    1.041   63.12  0.000

S = 459.1   R-Sq = 31.8%   R-Sq(adj) = 31.8%

Analysis of Variance
Source             DF           SS          MS        F      P
Regression          4   4415565740  1103891435  5237.13  0.000
Residual Error  44834   9450180490      210782
Total           44838  13865746230
Example: Weekly Earnings
 In the last model, how does age affect weekly earnings? How does gender affect weekly earnings? How do average weekly hours of work affect weekly earnings?
 How does the estimated effect of education change as we add these "control variables"?
Interpreting the Coefficients
 In the last model, the coefficient on education indicates that for every additional year of education a manager earns an additional $65.09 per week, holding age, gender, and hours of work constant:

E(Weekly Income | education, age, gender, hours of work)

 Alternatively, it is the difference in weekly earnings between two individuals who, except for a one-year difference in years of education, are the same age and gender and work the same weekly hours (otherwise equivalent managers).
Interpreting the Coefficients
 The coefficient on gender indicates that women managers earn $229.79 less per week than male managers who are otherwise similar in education, age, and weekly hours of work.
 Note the similarity to the comparative statics exercises in labor economics, in which we attempt to tease out the effect of one factor holding all other factors constant:
 What is the effect of raising the demand for labor, holding the supply of labor constant?
 What is the effect on the wage of an improvement in working conditions, holding other compensation-related factors constant (theory of compensating differentials)?
The Effect of Adding Variables
 The addition of factors to a model doesn't always make a difference.
 Example: model of teacher pay as a function of expenditures per pupil. Does region make a difference?
The Effect of Adding Variables
Regression Analysis: pay versus expenditures

The regression equation is
pay = 12129 + 3.31 expenditures

Predictor     Coef  SE Coef      T      P
Constant     12129     1197  10.13  0.000
expendit    3.3076   0.3117  10.61  0.000

S = 2325   R-Sq = 69.7%   R-Sq(adj) = 69.1%
The Effect of Adding Variables
Regression Analysis: pay versus expenditures, NE, S

The regression equation is
pay = 13269 + 3.29 expenditures - 1674 NE - 1144 S

Predictor     Coef  SE Coef      T      P
Constant     13269     1395   9.51  0.000
expendit    3.2888   0.3176  10.35  0.000
NE         -1673.5    801.2  -2.09  0.042
S          -1144.2    861.1  -1.33  0.190

S = 2270   R-Sq = 72.3%   R-Sq(adj) = 70.5%
Region matters, but its influence on the expenditure/pay
relationship is de minimis.
Evaluating the Results
 We will consider a number of criteria in analyzing a regression equation.
 Before touching the data:
 Is the equation supported by sound theory?
 Are all the obviously important variables included in the model?
 Should we be using OLS to estimate this model (what is OLS)?
 Has the correct form been used to estimate the model?
Evaluating the Results
 The data itself:
 Is the data set a reasonable size and accurate?
 The results:
 How well does the estimated regression fit the data?
 Do the estimated coefficients correspond to the expectations developed by the researcher before the data was collected?
 Does the regression appear to be free of major econometric problems?
Evaluating the Results:
R-Squared (Goodness of Fit)
 R² (also seen as r²), the Coefficient of Determination:
 We would like a simple measure which tells us how well our equation fits our data. This is R², the Coefficient of Determination.
 For example, in our teacher pay model: R² = 69.7%
 For attendance/output: R² = 86.6%
 For our weekly earnings models, R² varies from 10.6% to 31.9%
R-Squared (Goodness of Fit)
 What is R²? The percentage of the total movement of the dependent variable around its mean (variance × n) explained by the explanatory variables.
R-Squared (Goodness of Fit)
 Concept of R²:
 Our dependent variable, Y, moves around its mean.
 We are trying to explain that movement with our X's. If we are doing well, then most of the movement of Y should be explained (predicted) by the X's.
 That suggests that explained movement should be large and unexplained movement should be small.
R-Squared (Goodness of Fit)
TSS = Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² = n · var(Y) = n · (1/n) Σᵢ₌₁ⁿ (Yᵢ − Ȳ)²

TSS = Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² = RSS + ESS

where RSS is the residual sum of squares and ESS is the explained sum of squares.
R-Squared (Goodness of Fit)
R² = ESS/TSS = explained sum of squares / total sum of squares

 Note: 0 ≤ R² ≤ 1
 Suppose that we have a regression which explains nothing. Then ESS = 0 and the measure is equal to zero.
 Now suppose we have a model which fits the data exactly: every movement in Y is correctly predicted. Then ESS = TSS and our measure is equal to 1.
R-Squared (Goodness of Fit)
 In other words, as we approach R² = 1, our ability to explain movement in the dependent variable increases.
 Most of our results will fall into the middle range between 0 and 1.
R-Squared (Goodness of Fit)
R² = ESS/TSS

or, more commonly,

R² = 1 − RSS/TSS
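A small sketch computing R² both ways for the attendance/output example; the fitted values come from the estimated line reported earlier:

import numpy as np

x = np.array([8.0, 3.0, 2.0, 6.0, 4.0])
y = np.array([40.0, 28.0, 20.0, 39.0, 28.0])
y_hat = 15.733 + 3.319 * x            # fitted values from the estimated line

rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)     # total sum of squares
ess = tss - rss                       # explained sum of squares
print(ess / tss, 1 - rss / tss)       # both approximately 0.90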
Returning to Weekly Earnings of
Managers Examples
The regression equation is
weekearn = - 485 + 87.5 years ed
[equation with dependent variable]

47576 cases used 7582 cases contain missing values
[number of observations and number with missing data - why is the latter important]

Predictor      Coef  SE Coef       T      P
Constant    -484.57    18.18  -26.65  0.000
years ed     87.492    1.143   76.54  0.000
[detailed information on estimated coefficients, standard error, t against a null of zero, and a p against a null of 0]

S = 530.5   R-Sq = 11.0%   R-Sq(adj) = 11.0%
[two goodness of fit measures]
Returning to Weekly Earnings of
Managers Examples
Regression Analysis: weekearn versus Education

The regression equation is
weekearn = - 442 + 85.2 Education

47576 cases used 7582 cases contain missing values

Predictor      Coef  SE Coef       T      P
Constant    -442.42    17.99  -24.59  0.000
Educatio     85.228    1.136   75.01  0.000

S = 531.7   R-Sq = 10.6%   R-Sq(adj) = 10.6%

Analysis of Variance
Source             DF           SS          MS        F      P
Regression          1   1590256151  1590256151  5625.76  0.000
Residual Error  47574  13447935715      282674
Total           47575  15038191866
Returning to Weekly Earnings of
Managers Examples
Regression Analysis: weekearn versus Education, age, female

The regression equation is
weekearn = - 382 + 75.0 Education + 6.53 age - 320 female

47576 cases used 7582 cases contain missing values

Predictor       Coef  SE Coef       T      P
Constant     -382.38    18.78  -20.36  0.000
Educatio      74.967    1.079   69.45  0.000
age           6.5320   0.2020   32.34  0.000
female      -319.952    4.628  -69.14  0.000

S = 500.9   R-Sq = 20.6%   R-Sq(adj) = 20.6%

Analysis of Variance
Source             DF           SS          MS        F      P
Regression          3   3103974768  1034658256  4124.34  0.000
Residual Error  47572  11934217098      250866
Total           47575  15038191866
Returning to Weekly Earnings of
Managers Examples
Regression Analysis: weekearn versus Education, age, female, hours

The regression equation is
weekearn = - 1053 + 65.1 Education + 7.07 age - 230 female + 18.3 hours

44839 cases used 10319 cases contain missing values

Predictor       Coef  SE Coef       T      P
Constant    -1053.01    19.43  -54.20  0.000
Educatio      65.089    1.029   63.27  0.000
age           7.0741   0.1929   36.68  0.000
female      -229.786    4.489  -51.19  0.000
hours        18.3369   0.2180   84.11  0.000

S = 459.0   R-Sq = 31.9%   R-Sq(adj) = 31.9%
Returning to Weekly Earnings of
Managers Examples
 So the fit of the final model, with a control for hours of work, is considerably better than the fit of the model which added gender and age, and much better than the fit of a model with just education as an explanatory variable.
Adjusted R-Squared (“R-bar
Squared”)
 First limitation of R²:
 1. As we add variables, the magnitude of ESS never falls and typically increases. If we just use R² as a criterion for adding variables to a model, we will keep adding ad infinitum: R² never falls and usually increases as one adds variables.
 2. Instead, use the measure R-bar-squared, which is calculated as:
Adjusted R-Squared (“R-bar
Squared”)
R̄² = 1 − [ RSS/(n − k − 1) ] / [ TSS/(n − 1) ] = 1 − (1 − R²) · (n − 1)/(n − k − 1)

where
n is the number of observations
k is the number of explanatory variables
(n − k − 1) is the degrees of freedom
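As a quick check of the formula against the Minitab output shown in these notes:

def adj_r_squared(r2: float, n: int, k: int) -> float:
    """R-bar-squared = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Teacher pay models, n = 51 states:
print(round(adj_r_squared(0.697, 51, 1), 3))   # ~0.691, matching R-Sq(adj) = 69.1%
print(round(adj_r_squared(0.723, 51, 3), 3))   # ~0.705, matching R-Sq(adj) = 70.5%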
Adjusted R-Squared (“R-bar
Squared”)
 As k, the number of regressors, becomes large, R-bar-squared becomes smaller, all else constant. It imposes a penalty on adding variables which really have very little to do with the dependent variable. If you add irrelevant variables, R² may remain the same or increase, but R-bar-squared may well fall.
Adjusted R-Squared (“R-bar
Squared”)
Regression Analysis: pay versus expenditures

The regression equation is
pay = 12129 + 3.31 expenditures

Predictor     Coef  SE Coef      T      P
Constant     12129     1197  10.13  0.000
expendit    3.3076   0.3117  10.61  0.000

S = 2325   R-Sq = 69.7%   R-Sq(adj) = 69.1%
Adjusted R-Squared (“R-bar
Squared”)
Regression Analysis: pay versus expenditures, NE, S

The regression equation is
pay = 13269 + 3.29 expenditures - 1674 NE - 1144 S

Predictor     Coef  SE Coef      T      P
Constant     13269     1395   9.51  0.000
expendit    3.2888   0.3176  10.35  0.000
NE         -1673.5    801.2  -2.09  0.042
S          -1144.2    861.1  -1.33  0.190

S = 2270   R-Sq = 72.3%   R-Sq(adj) = 70.5%
Adjusted R-Squared (“R-bar
Squared”)
 Note that the increase in R-bar-squared is more modest than the increase in R². This is because the explanatory power of region is modest, and the effect of that power in reducing the RSS is being counter-balanced by the increase in the number of parameters.
Adjusted R-Squared (“R-bar
Squared”)
 Need to be careful in the use of R² or R̄² to compare regressions.
 It can be good for comparing specifications, such as with the variables in our specification for managers. It confirms our view that weekly pay is influenced by education but also by age, gender, and hours (note that both R² and R̄² increase).
 It is not good for comparing different equations with different data sets.
Example: Teachers’ Pay
Our model using state average earnings and expenditures has an R-sq of 72.3%.

Regression Analysis: pay versus expenditures, NE, S

The regression equation is
pay = 13269 + 3.29 expenditures - 1674 NE - 1144 S

Predictor     Coef  SE Coef      T      P
Constant     13269     1395   9.51  0.000
expendit    3.2888   0.3176  10.35  0.000
NE         -1673.5    801.2  -2.09  0.042
S          -1144.2    861.1  -1.33  0.190

S = 2270   R-Sq = 72.3%   R-Sq(adj) = 70.5%
Example: Teachers’ Pay
Now consider a micro-data model: use our CPS data set for 2000 and merge the expenditure data into data on individual teachers. Using STATE DATA:

. reg teacherpay expenditures

      Source |       SS       df       MS            Number of obs =      51
-------------+------------------------------        F(  1,    49) =  112.60
       Model |  608555015     1   608555015         Prob > F      =  0.0000
    Residual |  264825250    49  5404596.94         R-squared     =  0.6968
-------------+------------------------------        Adj R-squared =  0.6906
       Total |  873380265    50  17467605.3         Root MSE      =  2324.8

------------------------------------------------------------------------------
  teacherpay |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
expenditures |   3.307585   .3117043    10.61   0.000     2.681192    3.933978
       _cons |   12129.37   1197.351    10.13   0.000     9723.205    14535.54
------------------------------------------------------------------------------
Example: Teachers’ Pay
Now shift to year 2000 micro-data and append state expenditures on education:

. summ weekearn age female uhour1 expenditure if pocc1 >= 151 & pocc1 <= 159

    Variable |   Obs        Mean    Std. Dev.       Min        Max
-------------+-----------------------------------------------------
    weekearn |  7579    702.5348    429.1002        .02    2884.61
         age |  7903    41.63254     11.9243         15         90
      female |  7903     .732127    .4428791          0          1
      uhour1 |  7903    35.51411    15.84321         -4         99
expenditures |  7903    3786.745    990.0271       2297       8349
Example: Teachers’ Pay
Now estimate a regression equation similar to the state data equation. Note the number of observations:

. reg weekearn expenditure NE Midwest South if pocc1 >= 151 & pocc1 <= 159

      Source |       SS       df       MS            Number of obs =    7579
-------------+------------------------------        F(  4,  7574) =   29.17
       Model |  21170657.3     4  5292664.32        Prob > F      =  0.0000
    Residual |  1.3741e+09  7574  181429.033        R-squared     =  0.0152
-------------+------------------------------        Adj R-squared =  0.0147
       Total |  1.3953e+09  7578  184126.967        Root MSE      =  425.94

------------------------------------------------------------------------------
    weekearn |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
expenditures |   .0390489   .0060053     6.50   0.000     .0272769    .0508209
          NE |   58.93258   15.97818     3.69   0.000     27.61091    90.25425
     Midwest |   32.89631   14.26238     2.31   0.021     4.938092    60.85454
       South |   2.824219    13.6974     0.21   0.837    -24.02649    29.67493
       _cons |   533.6598   24.94855    21.39   0.000     484.7538    582.5659
------------------------------------------------------------------------------
For every $1 in expenditures we get 3.9¢ in teacher pay per week or, on a 52-week basis, about $2.03!
Example: Teachers’ Pay
Build a more suitable model and R-sq increases.

. reg weekearn expenditure female black NE Midwest South age coned if pocc1 >= 151 & pocc1 <= 159

      Source |       SS       df       MS            Number of obs =    7479
-------------+------------------------------        F(  8,  7579) =   16.17
       Model |  19477648.0     8  2434706.00        Prob > F      =  0.0000
    Residual |  61297869.1   407  150609.015        R-squared     =  0.2411
-------------+------------------------------        Adj R-squared =  0.2262
       Total |  80775517.1   415   194639.80        Root MSE      =  388.08

------------------------------------------------------------------------------
    weekearn |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
expenditures |   .0636406   .0726116     0.88   0.381       -.0791    .2063811
      female |  -88.36832   43.00688    -2.05   0.041    -172.9117   -3.824972
       black |   72.28883   48.35753     1.49   0.136    -22.77287    167.3505
          NE |  -84.77944   116.1567    -0.73   0.466    -313.1213    143.5625
     Midwest |   -38.8376   54.29961    -0.72   0.475    -145.5803     67.9051
       South |  -.8350449   48.33866    -0.02   0.986    -95.85964    94.18955
         age |   8.797438   1.653155     5.32   0.000     5.547649    12.04723
       coned |   81.40359   11.48141     7.09   0.000     58.83332    103.9739
       _cons |  -1089.701   300.7268    -3.62   0.000    -1680.873   -498.5296
------------------------------------------------------------------------------
Example: Teachers’ Pay
 Why the difference in R-sq?
 Different levels of aggregation of data lead to different total variance.
 Micro-data has much more variance than state average data (why might this be?).
 Time series data often has an R-sq of .98 or .99.
 As a result, we cannot use R-sq to compare results across different data sets or types of regressions. It can be useful for comparing specifications within a particular model.
Correlation & R-Squared
 R² and ρ: What is the relationship?
 ρ is the population value of the correlation; in the sample the symbol for correlation is r.
 If r is the correlation between X and Y, then R², the goodness-of-fit measure of a regression equation, is r².
 Note that this ONLY holds for bivariate relationships.
 An example for the relationship between education expenditures and teacher pay:
Correlation & R-Squared: Example
Results for: Teacher Expenditure.MTW

Correlations: pay, expenditures
Pearson correlation of pay and expenditures = 0.835
P-Value = 0.000

Regression Analysis: pay versus expenditures

The regression equation is
pay = 12129 + 3.31 expenditures

Predictor     Coef  SE Coef      T      P
Constant     12129     1197  10.13  0.000
expendit    3.3076   0.3117  10.61  0.000

S = 2325   R-Sq = 69.7%   R-Sq(adj) = 69.1%

r² = 0.835² = 0.697225 = R²
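A final sketch verifying the identity r² = R² for a bivariate regression (with illustrative data, not the actual teacher file):

import numpy as np

# Illustrative (x, y) pairs; any bivariate data shows the identity.
x = np.array([2297.0, 3554.0, 4123.0, 8349.0, 2967.0])
y = np.array([18095.0, 23382.0, 26610.0, 41480.0, 21419.0])

r = np.corrcoef(x, y)[0, 1]                    # Pearson correlation
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(round(r ** 2, 6), round(r2, 6))          # identical up to rounding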