Regression Statistics

Last week we talked, among other things, about supply
and demand equations and said that having those
available may improve the accuracy of our predictions.
• How can we obtain those equations?
• What information do we need for that?
• What techniques do we use to translate
raw data into an equation?
• How much faith should we have in
the resulting equations?
Regression analysis
The simplest case is the relationship between two
variables, which may help answer such business-type
questions as
• How does the number of TVs sold at an outlet
depend on the TV price?
• How does the quantity demanded of paper
towels depend on the population of a town?
• How does the volume of ice cream sales
depend on the outside temperature?
Simple regression
Step 1. Collect data.
Observation #   Quantity   Price
 1              18         475
 2              59         400
 3              43         450
 4              25         550
 5              27         575
 6              72         375
 7              66         375
 8              49         450
 9              70         400
10              21         500
[Scatter plot of the data: Price ($0–700) on the vertical axis, Quantity (0–80) on the horizontal axis.]
Step 2. Assume a form of the relationship between
the variables of interest, such as
QD = A – B∙P,
or
QD = A/P – B∙P + C∙P – D,
or any other you can possibly think of,
where A through D are some numbers, or
“coefficients”.
Which of the above specifications would you prefer
to use? How would you justify your choice?
None of them represents the “real” relationship
between P and Q, so let’s use the linear form, the
simplest one that is consistent with theory
(and therefore with common sense).
[The same scatter plot with a fitted straight line added, labeled “Linear (Price)”: Price ($0–700) on the vertical axis, Quantity (0–80) on the horizontal axis.]
Step 3. Find the best values for coefficients.
How?
Each observation is a point in the price-quantity space.
Each pair of numbers, A and B, when plugged into the
equation above, uniquely defines a line in the price-quantity space.
It is highly unlikely that all the points (observations) will
fit on the same line.
The best line will be the one that minimizes the sum of
squared deviations between the line and the actual
data points.
This procedure can be performed using an MS Excel
spreadsheet.
In Excel 2003: in the top pull-down menu, choose
Tools → Data Analysis → Regression.
In Excel 2007:
Data tab → Analysis → Data Analysis → Regression,
then enter the range of cells that contain data for each
variable.
• Y is the “dependent variable” (in our case, QD).
• X is the “independent variable”, also called the
explanatory variable (in our case, P).
• Check the “Labels” box if and only if you include
column headers.
• Click on the cell where you want the output printout to
start, then click ‘OK’.
SUMMARY OUTPUT

Regression Statistics
Multiple R         0.868299
R Square           0.753944
Adjusted R Square  0.723187
Standard Error     11.147108
Observations       10

ANOVA
            df   SS         MS         F          Significance F
Regression   1   3045.935   3045.935   24.51298   0.0011193
Residual     8   994.0642   124.2580
Total        9   4040

            Coefficients   Standard Error   t Stat    P-value    Lower 95%   Upper 95%
Intercept   163.706        24.2337           6.7553   0.000144   107.823     219.589
Price       -0.2608        0.0526           -4.9510   0.001119   -0.382      -0.1393
R² shows the portion of the variation in the dependent
variable, QD, that is explained by the
independent one, P.
The greater the F-statistic, the lower the probability that
the estimated regression model fits the data purely by
accident. That probability is given under “Significance F”.
The ‘Coefficients’ column contains the values of A and
B that provide the best fit.
In our case, the regression analysis
suggests that the best estimate of the
demand for TVs based on the data
provided is
QD = A + B∙P
QD = 163.7 – 0.261∙P
Which means that
for every $1 increase in the price of a TV set, the
quantity demanded drops by (approx.) 0.26 units.
t-values show the ‘statistical significance’ of each
coefficient. The larger the t-statistic is in absolute value,
the higher the chances that the coefficient
is different from zero.
As in the case of the
F-statistic, the same
information is also available
in the form of probabilities, or
P-values.
In our case, there is only a 0.11% probability that
price is irrelevant for consumer decisions.
Another way to say it is, there is a 99.89%
probability that price matters for consumer decisions
(more specifically, that higher price is associated
with a lower quantity demanded).
This, however, is very different from the statement
“There is 99.89% probability that the coefficient on
price is –0.26”, which is FALSE.
The regression output gives us the
“expected value” of each coefficient.
But the actual value is not certain:
it is distributed around the expected
value in a probabilistic manner.

[Figure: a coefficient distribution lying entirely in the positive range (P ≈ 0): we are certain the sign of the coefficient is positive, but we don’t know its actual value.]

[Figure: a coefficient distribution with 80% of the area in the positive range and 20% in the negative range, so P = 0.2.]
The “Lower 95%” and “Upper 95%” columns in the
printout give the lower and upper bounds within which
the true value of each coefficient falls with a certain
probability (95% probability in this case). This range
is also known as the “95% confidence interval”.
In our case, there is a 95% probability that the value of
the coefficient on price lies between –0.382 and –0.1393.
Note that the P-value for the estimated coefficient on
price equals the significance of F. This is always the
case when there is only one independent variable.
To summarize, we want:
• The sign of coefficients to make economic sense;
• R² to be LARGE (close to 1);
• The F-statistic to be LARGE;
• The t-statistics to be LARGE in absolute value;
• ‘Significance F’ and P-values to be SMALL;
• The confidence intervals to be SMALL / NARROW.
Statistical significance of a coefficient is a statement
about how reliable the sign of the calculated coefficient
is (look at the t-statistic and the p-value).
Example: There is a 0.11% probability that
the coefficient on “Price” is non-negative, i.e. that
there is either a positive or no relationship
between price and quantity demanded.
You may also come across the expression “the
coefficient is significant at the 5% level”. That means
there is no more than a 5 percent probability that this
coefficient has a sign opposite to the estimate.
When statistical significance of a coefficient is low, the
role of the variable in question is weak or unclear.
Economic significance of a variable:
How, according to the regression results, a change in
one variable will affect the other variable (look at the
value of the coefficient itself).
Example: For every $1 increase in price,
quantity demanded drops by 0.261 unit.
Example: For every $100 increase in price,
quantity demanded drops by ~ 26 units.
Example: For every $4 decrease in price,
quantity demanded increases by
approximately one unit.
Avoid causality statements!
What if you are not happy with the fit?
(Say, the F-statistic is too small, the statistical
significance is low, etc.)
Sometimes this is because the points on
the scatter plot do not align well along a straight line.
In that case you may be able to make things better
by trying a different specification, such as a log-linear
regression.
A curve provides a better fit than a line.
To run a log-linear regression, you first need to create
new auxiliary variables
Q′D = ln QD
and
P′ = ln P
Here, ln stands for the natural logarithm
(the logarithm with base e, where e ≈ 2.72 is a
mathematical constant).
ln X is the number such that e^(ln X) = X.
After that, proceed with the regression as usual.
The resulting equation will be of the form
Q′D = β0 + βP∙P′
or
ln QD = β0 + βP∙ln P
which is equivalent to
QD = e^β0 ∙ P^βP
What if the log-linear specification doesn’t help?
How can we increase the explanatory power further?
Add more explanatory variables!
Another potential source of errors: the “specification problem”.
Example: Data on demand for soft drinks:

[Figure: scatter of price (P) against quantity (Q) with observations labeled Oct, Jan, Apr, and July. A single fitted line runs through all the points, while “the real story” is a separate demand curve for each season.]
Once again, adding more explanatory variables
may help us understand things better
Multiple regression
The idea is similar to a simple regression, except that
there is more than one explanatory (independent)
variable.
When compared to a simple regression, a multiple
regression helps avoid the aforementioned
“specification problem”, improve the overall
goodness of fit, and improve the understanding of
factors relevant for the variable of interest.
What variables other than own price could matter in
the soft drink example?
Outside temperature?
Town population? Etc.
Running a multiple regression in MS Excel is
similar to a simple regression, except that, when
choosing the cell range for independent
variables, you need to include all the
independent variables at once.
The output will contain more lines, according to the
number of variables included in the regression.
(A demonstration session follows.)
Regression output can again be translated into an
equation.
Such an equation helps us not only evaluate the
relationship between price and quantity, but also
answer such questions as…
- Are goods X and Y substitutes or complements?
- How does the consumption of our good depend on
income?
- Does advertising matter and how much?
and so on.
Things to look for when adding explanatory variables:
Does R² improve (increase) when variables are
added? (Normally, the answer is ‘yes’.)
What is happening to R²-adjusted?
(R²adj punishes the researcher for adding variables
that don’t contribute much to the explanatory power,
so it is a better criterion.)
Only statistically significant variables should be
included in the final regression run and the resulting
equation.
More advanced statistical packages perform
‘stepwise regressions’, in which the program itself
decides which variables are worth keeping and
which deserve to be dropped.
“Dummy” variables
Sometimes, we are interested in the role of a factor
that doesn’t have a numerical value attached to it,
such as gender, race, day of the week, etc.
Such factors can be included in the regression by
creating a separate variable for each realization of the
factor except one.
These “dummy” variables usually take only the values 0 or 1.
Examples:
Gender: “0” if male, “1” if female.
(one dummy does the job)
Day of the week: We need six (7-1=6) additional
variables.
X1: “1” if Monday, “0” otherwise.
X2: “1” if Tuesday, “0” otherwise… etc. up to X6.
No variable for Sunday.
We will know the day of the week is important if a
regression with the dummies included produces a
noticeably better R²adj than one without them.
The “economic” interpretation of
the effect of dummy variables is
similar to that for regular
variables.
An average male buys 10 more gallons of
soft drinks in a year than an average female
On average, there are 200 more people
attending a Washburn home soccer game
when the game is on Friday