YALE School of Management
EMGT511: HYPOTHESIS TESTING AND REGRESSION
K. Sudhir
Lecture 3
Introduction to Regression
The first two sessions covered hypothesis testing about the mean, about the proportion, and
about differences between means. In each case, the process started with a null hypothesis,
and we identified decision rules based on extreme outcomes of a test statistic under the
null hypothesis. Simple random sampling was critical to the testing procedure; without
probability sampling, we could not have justified those applications.
We now move to situations where we want to know the relationship between variables.
Regression analysis is a method that is used to estimate an equation that expresses how
one variable (the dependent or criterion variable) depends on one or more other variables
(independent or predictor variables). To justify statistical inference for regression
analysis, we make assumptions. Obviously, the relevance of the conclusions we make
depends on the extent to which the assumptions are correct.
Examples of Regression:
Testing for Gender Discrimination
A firm is sued for gender discrimination. The plaintiffs performed a hypothesis test on the
average difference between the salaries of males and females at the firm. They obtained a
statistically significant difference between the average salaries of males and females.
Attorneys for the firm argue that this evidence is moot, because they know that the males
have higher salaries due to longer experience. Historically there were more males in the
firm and therefore on average they tend to have greater experience. The question these
attorneys want to address is whether there are average salary differences between males
and females with comparable experience. How does one compare differences in salaries
between men and women controlling for experience?
A regression equation would help us answer this question. As we will see later, we
could create a variable called Gender that takes the value 0 if the person is male and 1 if
the person is female. We could then estimate an equation such as:
Salary = β₀ + β₁ Gender + β₂ Experience
In this equation, we explore the discrimination issue by holding experience constant. That
is, β₁ represents the average salary difference between females and males at the firm,
controlling for experience.
In this case, Salary is the dependent variable; Gender and Experience are the independent
variables.
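To make the dummy-variable idea concrete, here is a minimal sketch in Python (my own illustration, not part of the original analysis); the salary, gender, and experience numbers are invented purely to show the mechanics.

import numpy as np

# Hypothetical data: salary (in $000), gender (0 = male, 1 = female), experience (years)
salary     = np.array([80, 95, 72, 88, 66, 81, 90, 70], dtype=float)
gender     = np.array([ 0,  0,  1,  0,  1,  1,  0,  1], dtype=float)
experience = np.array([ 5, 10,  4,  8,  3,  7,  9,  5], dtype=float)

# Design matrix: a column of ones (for the intercept), Gender, and Experience
X = np.column_stack([np.ones_like(salary), gender, experience])

# Least-squares estimates of beta_0, beta_1, beta_2
betas, *_ = np.linalg.lstsq(X, salary, rcond=None)
b0, b1, b2 = betas
print(f"Salary = {b0:.1f} + {b1:.1f}*Gender + {b2:.1f}*Experience")
# b1 estimates the female-minus-male salary difference, holding experience fixed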
A Model of Demand
Economists and others are frequently interested in estimating demand equations. An
understanding of how demand depends on various activities may help managers decide on
allocation of resources. It can also help both managers and policy makers learn about the
intensity of competition.
If we consider a given firm in isolation and ignore competitors, we could consider the
following simplified representation. A model of linear effects of price and advertising on
demand is:
Sales = β₀ + β₁ Price + β₂ Advertising
Sales is the dependent variable; Price and Advertising are the independent variables.
Modeling the Experience Curve Effect
Firms often find that as they have produced more in the past, the marginal cost of
production falls due to learning or cumulative experience. Understanding the relationship
between marginal costs and total production is useful for firms, because it helps them
forecast how much and how quickly costs will fall over time.
A typical regression equation for the experience curve is:
ln(Marginal Cost) = β₀ + β₁ ln(Cumulative Production)
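A quick sketch of how this log-log equation could be estimated in Python; the cumulative-production and cost figures below are invented for illustration. Because the equation is in logs, β₁ acts like an elasticity: each doubling of cumulative production multiplies marginal cost by a factor of 2^β₁.

import numpy as np

# Hypothetical experience-curve data
cum_production = np.array([100, 200, 400, 800, 1600, 3200], dtype=float)
marginal_cost  = np.array([10.0, 8.1, 6.4, 5.2, 4.1, 3.3])

# Regress ln(marginal cost) on ln(cumulative production); the slope is beta_1
b1, b0 = np.polyfit(np.log(cum_production), np.log(marginal_cost), 1)
print(f"ln(Marginal Cost) = {b0:.2f} + {b1:.2f} ln(Cumulative Production)")
print(f"Each doubling of cumulative production multiplies cost by {2**b1:.2f}")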
Academic performance and GMAT Scores
Many business schools rely heavily on GMAT scores for selecting applicants. The
assumption is that students with higher GMAT scores will have better academic
performance. However, it is unclear how an admission officer should combine GMAT
scores with information on undergraduate academic performance, work experience, etc.
Unless an admission officer analyzes the relationship between academic performance of
past graduates and the information available in the application file, s/he will use weights
for the different independent variables that are “biased”.
A simplified regression analysis would examine how an individual's academic performance
in a specific MBA program relates to independent variables such as that individual's
GMAT scores (verbal, quantitative), undergraduate academic performance, quality of the
undergraduate institution, type of undergraduate major, etc.
Such a regression analysis can help the admission director screen applicants on minimum
academic competence so that s/he can then concentrate on harder to assess dimensions
such as career performance potential. For example we could specify a simplified equation
such as:
MBA  GPA   0  1GMAT _ VERB   2GMAT _ QUANT  3UG _ GPA   4 EXP _ YR
Uses of Regression
Regression is useful for two primary purposes:
(1) Prediction and Forecasting: Given specific values for the independent variables, we
can predict or forecast a value for the dependent variable. For example, what is the
predicted MBA score for an applicant with specific GMAT scores, and how good is
this prediction?
(2) Description: The regression result informs us how one or more independent variables
affect the dependent variable, assuming there is a causal relationship. This is useful
for managers deciding the "optimal" levels of the independent variables needed to
achieve a desired outcome. For example, based on an estimated demand equation, a
manager can learn how sales (and therefore profits) will change as a function of
price. This enables the firm to decide on the level of price that maximizes profits.
(Both uses are illustrated in the short sketch below.)
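The Python sketch below illustrates both uses with a purely hypothetical estimated demand equation (Sales = 500 − 20·Price) and a hypothetical unit cost of $5; none of these numbers come from the note.

# Hypothetical numbers for illustration only
b0, b1 = 500.0, -20.0        # assumed estimates of beta_0 and beta_1 in Sales = b0 + b1*Price
unit_cost = 5.0              # assumed unit cost

# (1) Prediction: forecast sales at a candidate price
price = 12.0
predicted_sales = b0 + b1 * price
print(f"Predicted sales at ${price:.0f}: {predicted_sales:.0f} units")

# (2) Description/decision: profit = (Price - cost) * (b0 + b1*Price);
#     setting d(profit)/d(Price) = 0 gives Price* = (b1*cost - b0) / (2*b1)
optimal_price = (b1 * unit_cost - b0) / (2 * b1)
print(f"Profit-maximizing price: ${optimal_price:.2f}")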
A Simple Regression Problem
Consider the following data on prices and demand. Estimate a regression model for these
data.
Unit sales (lbs.)    Price ($)
115                   5
105                   5
 95                  10
105                  10
 95                  15
 85                  15
Prior to estimating a regression model, it is useful to graph the data. The best fitting line
runs approximately through the middle of the data. Intuitively this is the regression line.
Note that none of the data fall on the line.
[Figure: scatter plot of unit sales (lbs.) against price ($), with the fitted regression line running through the middle of the data.]
On average, these data show that a $5 increase in price results in a 10 lb. decrease in
sales. Thus, the slope of a linear equation equals −10/5 = −2. By extrapolation we see that
the linear equation intersects the y-axis at 120. Thus, the average relationship between
the two variables is yᵢ = 120 − 2xᵢ, where yᵢ is unit sales in lbs. and xᵢ is price in dollars.
This is the regression equation. The interpretation of the slope coefficient in this problem
is that if price increases by $1, demand is expected to decrease by 2 lbs.
We will verify if we can recover the parameters of the regression equation using Excel.
At the end of this note, I detail how to do regressions in Excel. Right now let us look at
the output of the regression for this example.
Regression Output
SUMMARY OUTPUT

Regression Statistics
  Multiple R          0.852803
  R Square            0.727273
  Adjusted R Square   0.659091
  Standard Error      6.123724
  Observations        6

ANOVA
               df    SS     MS     F          Significance F
  Regression    1    400    400    10.66667   0.030906
  Residual      4    150    37.5
  Total         5    550

              Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
  Intercept   120            6.614378         18.14229   5.43E-05   101.6355    138.3645
  Price       -2             0.612372         -3.26599   0.030906   -3.70022    -0.29978
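As a cross-check on this Excel output, the same numbers can be reproduced with any regression routine; here is a sketch using scipy.stats.linregress (the choice of library is mine, not the note's).

from scipy import stats

price = [5, 5, 10, 10, 15, 15]
sales = [115, 105, 95, 105, 95, 85]

fit = stats.linregress(price, sales)                # simple regression of sales on price
print(f"Intercept        : {fit.intercept:.4f}")    # 120
print(f"Slope (Price)    : {fit.slope:.4f}")        # -2
print(f"R-square         : {fit.rvalue**2:.6f}")    # 0.727273
print(f"p-value (slope)  : {fit.pvalue:.6f}")       # 0.030906
print(f"Std error (slope): {fit.stderr:.6f}")       # 0.612372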
Interpretation of the Slope Coefficient and Intercept
Slope: for a one dollar increase in price, demand tends to decrease by 2 lbs.
Intercept: for a price of zero, demand is estimated to be 120 lbs.
Note: “tends to” and “estimated” are expressions that reflect the inexact nature of the
function.
(Note: We should not interpret this intercept too literally, because we have never
charged a price of zero and do not have adequate information about how sales would behave
at a price of zero; regression estimates usually work well within the range of the x data
used in the regression. In this case, we can be more confident of predictions in the price
range of $5-$15, because this is the range of the data we have used for the regression.
Nevertheless, we should always attempt to estimate an equation that can be used for all
possible values of the predictor variable(s), Price in this case.)
Assumptions used in Regression and how they tie to Statistical Inference
We discussed last time that when doing hypothesis testing about means and proportions,
the use of Simple Random Sampling allows us to claim that the expected value of the
sample mean or sample proportion equals the value of the parameter. Simple random
sampling also allows us to derive the variance of the random variable. And we refer to the
Central Limit Theorem to justify the assumption that the random variable (sample mean
or sample proportion) is normally distributed.
For regression, we use historical data. To obtain formulas for statistical inference, we
make the following assumptions about the error term εᵢ in the regression model:
(1) E(εᵢ) = 0.
(2) Var(εᵢ) = σ² (written σ_ε² to be clear about what the variance refers to).
(3) Cov(εᵢ, εⱼ) = 0 for i ≠ j; independence of errors.
(4) εᵢ is normally distributed.
See the figure below for how these assumptions can be thought of pictorially.
[Figure: the regression line y = β₀ + β₁x plotted against x, with identical normal curves centered on the line at x = 1, 2, and 3, representing the distribution of the error term.]
As you can see from the picture, the dependent variable y falls along the regression
line, subject to a normal error term that is centered on the regression line. To
interpret the picture, it is useful to think of each normal curve as projecting out of the
page, with its mean exactly on the regression line. From the picture we can see:
a. The error terms have a mean zero, irrespective of the values of x, as indicated by the
normal curves which are centered around the regression line.
b. The variance of the error term is a constant irrespective of the values of x, as indicated
by the constant variance normal curve.
c. In the picture, we are not able to show the independence assumption.
d. The errors are normally distributed (as indicated by the bell curve).
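One way to internalize these assumptions is to simulate data that satisfies them and then re-estimate the line. The sketch below is my own illustration; the true parameters are chosen to mirror the price/sales example, and the errors are drawn independent, mean-zero, constant-variance, and normal.

import numpy as np

rng = np.random.default_rng(0)

# True parameters (chosen for illustration)
beta0, beta1, sigma_eps = 120.0, -2.0, 6.0

# x values, and errors satisfying assumptions (1)-(4):
# mean zero, constant variance sigma_eps**2, independent, normally distributed
x = np.repeat([5.0, 10.0, 15.0], 20)
eps = rng.normal(loc=0.0, scale=sigma_eps, size=x.size)
y = beta0 + beta1 * x + eps

# The least-squares estimates should land close to the true beta0 and beta1
b1_hat, b0_hat = np.polyfit(x, y, 1)
print(f"Estimated line: y = {b0_hat:.1f} + {b1_hat:.2f} x")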
Statistical Inference
We need to make the four assumptions about the error term to justify statistical inference
(hypothesis testing and confidence intervals). Let us see how these assumptions help us in
statistical inference.
1. Assumption (1), E(εᵢ) = 0, implies that the least squares estimates are unbiased.
   Thus E(β̂₀) = β₀ and E(β̂₁) = β₁.
2. Assumptions (2) and (3) imply that

   σ_β̂₀ = σ_ε √(1/n + x̄²/S_xx)   and   σ_β̂₁ = σ_ε / √S_xx,   where S_xx = Σ(xᵢ − x̄)².

   Usually we don't know σ_ε, so we estimate it as

   s_ε = √[ Σ(yᵢ − ŷᵢ)² / (n − 2) ] = √[ Σ ε̂ᵢ² / (n − 2) ],   where ŷᵢ = β̂₀ + β̂₁ xᵢ.
3. Assumption (4), the normality of εᵢ, is what allows us to do hypothesis testing. For
   example, we can test a hypothesis about β₁ by constructing the following Z and
   t statistics. If the null value is β₁,Null:

   Z = (β̂₁ − β₁,Null) / σ_β̂₁

   If we have to estimate s_ε, then we do a t-test. The corresponding t statistic is

   t = (β̂₁ − β₁,Null) / s_β̂₁
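These formulas translate directly into code. The helper below is my own sketch (not from the note); it computes β̂₀, β̂₁, s_ε, and the two standard errors for any x, y data, and reproduces the numbers in the Excel output when applied to the price/sales data.

import numpy as np

def simple_ols(x, y):
    # Least-squares fit of y = b0 + b1*x, using the standard-error formulas above
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = x.size
    sxx = np.sum((x - x.mean()) ** 2)                        # S_xx
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx       # beta1-hat
    b0 = y.mean() - b1 * x.mean()                            # beta0-hat
    resid = y - (b0 + b1 * x)
    s_eps = np.sqrt(np.sum(resid ** 2) / (n - 2))            # estimate of sigma_eps
    se_b1 = s_eps / np.sqrt(sxx)                             # std. error of beta1-hat
    se_b0 = s_eps * np.sqrt(1.0 / n + x.mean() ** 2 / sxx)   # std. error of beta0-hat
    return b0, b1, s_eps, se_b0, se_b1

print(simple_ols([5, 5, 10, 10, 15, 15], [115, 105, 95, 105, 95, 85]))
# -> approximately (120.0, -2.0, 6.1237, 6.6144, 0.6124)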
Statistical Inference Example:
Performing a t-test
In the example problem above we estimated β̂₁ = −2 and β̂₀ = 120.
Suppose we wish to test the following hypothesis for the example problem discussed
earlier:

   Null:         H₀: β₁ = 0
   Alternative:  Hₐ: β₁ ≠ 0
   t = (β̂₁ − β₁,Null) / s_β̂₁ = (β̂₁ − β₁,Null) / (s_ε / √S_xx)

We need to estimate s_ε = √[ Σ ε̂ᵢ² / (n − 2) ] = √[ Σ(yᵢ − ŷᵢ)² / (n − 2) ].
We go back to the data to estimate s_ε as follows, where ŷᵢ = 120 − 2xᵢ:

   yᵢ      ŷᵢ      ε̂ᵢ      ε̂ᵢ²
   115     110      5       25
   105     110     -5       25
    95     100     -5       25
   105     100      5       25
    95      90      5       25
    85      90     -5       25
   Sum              0      150

s_ε = √(150/4) = √37.5 = 6.1
We had computed S_xx = Σ(xᵢ − x̄)² = 100. Hence the t statistic can be computed as follows:

   β̂₁ = −2;  β₁,Null = 0;  s_ε = 6.1;  S_xx = 100

   t = (−2 − 0) / (6.1/√100) = −2/0.61 = −3.28

Since the critical value is t(n−2, 0.025) = t(4, 0.025) = 2.776 and −3.28 < −2.776, we reject H₀.
Computing a confidence interval for the true slope:

   β̂₁ ± t(n−2, α/2) · s_β̂₁ = −2 ± 2.776 × 0.61 = −2 ± 1.7

We can be 95% sure that the true but unknown slope is between −3.7 and −0.3
(consistent with rejecting H₀).
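The critical value t(4, 0.025) = 2.776, the t statistic, and the interval above can be checked with a few lines of Python (using scipy for the t distribution; the library choice is mine).

from scipy import stats

b1_hat, b1_null, se_b1, df = -2.0, 0.0, 0.612372, 4   # values from the example above

t_stat = (b1_hat - b1_null) / se_b1
t_crit = stats.t.ppf(0.975, df)                        # two-sided 5% critical value

print(f"t statistic   : {t_stat:.3f}")                 # about -3.27
print(f"critical value: {t_crit:.3f}")                 # 2.776
print(f"95% CI        : ({b1_hat - t_crit*se_b1:.2f}, {b1_hat + t_crit*se_b1:.2f})")   # (-3.70, -0.30)
# Since |t| exceeds 2.776, we reject H0: beta1 = 0 at the 5% level.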
How well did the regression do?
R-square and Adjusted R-square:
[Figure: data points scattered around the regression line, showing for one point the decomposition of its deviation from the mean ȳ into (y − ŷ), the deviation from the regression line, and (ŷ − ȳ), the deviation of the fitted value from the mean.]

It can be shown that

   Σᵢ(yᵢ − ȳ)²   =   Σᵢ(yᵢ − ŷᵢ)²   +   Σᵢ(ŷᵢ − ȳ)²
   SS(Total)     =   SS(Residual)   +   SS(Regression)
where SS(Total) represents the sum of squares of the total variation in y around its mean;
SS(Regression) represents the sum of squares of the variation in y explained by the
regression; and SS(Residual) represents the sum of squares of the variation in y left
unexplained by the regression (the residuals).
Thus the proportion of the variation explained by the regression, called R-square or R²,
is defined as follows:

   R² = SS(Regression)/SS(Total) = [SS(Total) − SS(Residual)]/SS(Total) = 1 − SS(Residual)/SS(Total)

Clearly R² is a number between 0 and 1. It will never decrease (and will typically
increase) when a new variable is included in a regression.
In our example:
Suppose we did not use the x variable in the regression and estimated the model
y = β₀ + ε. Then our best estimate of β₀ is ȳ. In this case, the sum of squared errors
is given by Σ(yᵢ − ȳ)².
   yᵢ      ȳ       ε̂ᵢ      ε̂ᵢ²
   115     100     15      225
   105     100      5       25
    95     100     -5       25
   105     100      5       25
    95     100     -5       25
    85     100    -15      225
   Sum              0      550
For this problem, this sum equals 550. This is the total sum of squared errors from the
mean for the y variable, and is referred to as the total variation in the y variable.
We had computed earlier that SS(Residual) = 150, so R-square = 1 − 150/550 = 0.727.
Compare this against the Excel output shown earlier. This is referred to as the R-square
of the regression and is interpreted as the fraction of the variation in y in the sample
that is explained by the regression.
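The sum-of-squares decomposition and R-square are easy to verify numerically; here is a short sketch for the price/sales data.

import numpy as np

price = np.array([5, 5, 10, 10, 15, 15], dtype=float)
sales = np.array([115, 105, 95, 105, 95, 85], dtype=float)

y_hat = 120 - 2 * price                                # fitted values from the regression line

ss_total      = np.sum((sales - sales.mean()) ** 2)    # 550
ss_residual   = np.sum((sales - y_hat) ** 2)           # 150
ss_regression = np.sum((y_hat - sales.mean()) ** 2)    # 400

print(ss_total, ss_residual, ss_regression)            # 550.0 150.0 400.0
print("R-square:", 1 - ss_residual / ss_total)         # 0.7272...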
The worst any variable in a regression could do is to explain zero variation. This
R-square measure can therefore only go up as you add more variables on the right-hand
side of the equation. However, managers and statisticians like to have the simplest
possible models of the world that provide correct inferences and explain the most
variation.
It makes sense to penalize models with variables that do not explain much variation in y.
Statisticians have therefore constructed an alternative measure, called adjusted R-square.
When adding variables to a regression, it is useful to look at whether the adjusted
R-square increases, rather than whether the (unadjusted) R-square increases. However,
this is not the only way to decide which model is best.
An intuitive way to think about the difference between R-square and adjusted R-square is
as follows:
R-square is the proportion of the sample variation in y that is explained by the
regression.
Adjusted R-square is the proportion of the population variation in y that is
explained by the regression.
Adding a variable can never worsen the sample variation explained. But a variable that
explains only a small proportion of the sample variation in y may actually hurt our
ability to explain the overall population variation. Hence R-square will always go up,
while adjusted R-square may go down, when variables without much explanatory power are
added to the regression.
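The note does not give the adjusted R-square formula; a standard version, which penalizes the number of predictors k and reproduces the 0.659 in the Excel output for this example (n = 6 observations, k = 1 predictor), is adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1). For instance:

def adjusted_r_square(r2, n, k):
    # Penalizes R-square for the number of predictors k
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r_square(r2=0.727273, n=6, k=1))   # about 0.659091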
Doing Regressions in Excel
1. Below I outline the steps to do regressions in Excel. Enter the data as shown
below with the labels in the first row.
2. Select Tools > Data Analysis from the menu.
3. This opens the dialog box shown below: select Regression.
4. This opens another dialog box below:
a. Enter the y and x data ranges in the appropriate fill-in boxes. (If there is more
than one x variable, enter the full range; for example, with 2 x variables you
would enter $B$1:$C$7.)
b. Mark the Labels checkbox; this makes sure the first row is recognized as labels.
c. If you want a different confidence level (say 99%), enter it.
d. You may request any plots you want. We will discuss these later.
The results of the regression are given below:
1. In the last three rows of the output, compare the estimates (called coefficients),
the standard errors, the t statistics, and the confidence intervals with those we
computed earlier.
2. Look at the adjusted R-square.
3. The standard error of the regression is what we call s_ε.
SUMMARY OUTPUT

Regression Statistics
  Multiple R          0.852803
  R Square            0.727273
  Adjusted R Square   0.659091
  Standard Error      6.123724
  Observations        6

ANOVA
               df    SS     MS     F          Significance F
  Regression    1    400    400    10.66667   0.030906
  Residual      4    150    37.5
  Total         5    550

              Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
  Intercept   120            6.614378         18.14229   5.43E-05   101.6355    138.3645
  Price       -2             0.612372         -3.26599   0.030906   -3.70022    -0.29978
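If you prefer to reproduce this output outside Excel, a sketch using the statsmodels package (one option among many) gives essentially the same table:

import statsmodels.api as sm

price = [5, 5, 10, 10, 15, 15]
sales = [115, 105, 95, 105, 95, 85]

X = sm.add_constant(price)        # adds the intercept column
model = sm.OLS(sales, X).fit()    # ordinary least squares of sales on price
print(model.summary())            # coefficients, standard errors, t stats, R-square, ANOVA F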