Regression

Understanding Relationships

To this point we have primarily done analyses of data on one variable,
i.e., on univariate data. Now we would like to look at relating two or
more variables, i.e., multivariate data.

We have already looked at relating quantitative variables through
covariance and correlation. In this section we will return to
investigate relationships between two quantitative variables. Now
our main focus will be on regression.

The primary objective of this section is to learn how relationships
between variables can be quantified and interpreted. In this context,
we will review causation, and we will show how the computer can be
used to ease the computational burden.

When we talk about relationships between variables, we usually
cannot conclude anything about causation (i.e., that change in one
variable causes a change in the other variable). We only can
conclude whether the variables are related or not.
Linear Regression

With correlation we found a single number that represented the
strength of the linear relationship between two variables. Another
way to look at the relationship between two variables is to
hypothesize the form of the linear relationship, and then from the data
estimate the equation of the line that relates them. Our objectives for
this section are to learn to relate two or more random variables in a
meaningful way, to test the significance of the relationships, and to
use the relationships to make predictions.

Most of what we have learned so far will be used in this section. We
will use histograms, summary measures, confidence intervals, and
tests of hypotheses to talk about regression.

In all of these discussions, we will again be talking about linear
relationships. We will see, however, that this isn't as restrictive as it
first sounds.

Here we often assume that a set of the variables is under our control.
These variables are referred to as the independent variables, and are
denoted x. The variable that is not under our control is called the
dependent (or response) variable, and is denoted y. For our purposes
there will only be one dependent variable, but we can have many
independent variables. Again, we assume nothing about causality,
i.e., if we find a linear relationship, we cannot necessarily say that a
change in the independent variables causes a change in the response
of the dependent variable.
Let's begin with a simple example. The following table and graph
show Narco Medical's advertising expense (in hundreds of dollars) in
each period, and the associated sales (in thousands of dollars).

Advertising Expense    Sales
        8               15
        9               11
        7               10
        6               11
        5                8
        1                5

[Scatter plot: Sales (thousands) vs. Advertising Expense (hundreds)]
When there is only one independent variable, as in this case, we call
the modeling "simple linear regression."

We want to fit a line to the data. We will assume a linear model of
the form:
y = 0 + 1x + .
What criterion or criteria should we use to fit the line? What line
would best fit the data?
Least Squares

Most often a method known as least squares is used to fit the line.
The estimate of the line's intercept we will call b0, and that of the
line's slope will be b1. The estimated or predicted value of y is
denoted ŷ. Hence our estimated line has the form

ŷi = b0 + b1xi.
In least squares we minimize the sum of squared differences between
the observed y values and the predicted ŷ values. We define a residual
to be ei = yi − ŷi, and minimize

    Σ ei²   (summed over i = 1, ..., n).
For you calculus fans, the procedure is to take the partial derivatives
with respect to b0 and b1 and set them equal to 0. We will not go
through the details, nor even worry about writing the result. The
book shows the result, and for us, the important thing is that the
computer calculates the values of b0 and b1 for us.
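
To see what the software is doing behind the scenes, here is a minimal Python sketch (not part of the spreadsheet workflow in these notes) that computes b0 and b1 for the advertising data by least squares; it should reproduce the intercept of 4 and slope of 1 shown in the printout below.

    import numpy as np

    # Advertising data from the table above
    x = np.array([8, 9, 7, 6, 5, 1])       # advertising expense (hundreds of dollars)
    y = np.array([15, 11, 10, 11, 8, 5])   # sales (thousands of dollars)

    # Least squares estimates of the slope and intercept
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    print(b0, b1)   # 4.0 1.0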

For the example, here is a partial printout from the spreadsheet:
                        Coefficients
Intercept                     4
Advertising Expense           1

So the estimated relationship is
What would we predict sales to be if the advertising expense is $400?
What about an advertising expense of $1500?
Multiple Linear Regression

We can write a similar model when there is more than one
independent variable. The general form of the model is

y = β0 + β1x1 + β2x2 + ... + βkxk + ε,

and the estimated relationship is

ŷ = b0 + b1x1 + b2x2 + ... + bkxk.

We use least squares to find the values of b0, b1, ..., bk that minimize
the sum of the squared differences between yi and ŷi.
Using Data Analysis Tools to Do Regression

Doing regression in Excel is very similar to using the other analysis
tools. With regression, however, having the data in the right form is
more important. First, all data should be entered in columns. Second,
all independent variables should be next to each other (i.e., in a
contiguous set of cells).
Once the data are entered correctly, select "Regression" from the
Tools/ Data Analysis menu item in Excel. You will be presented with
the dialogue box shown on the following page.
In the Input Y Range, enter the cell range referring to the column
containing the dependent variable. In the Input X Range, enter the
range of cells containing all independent variables. This is why the X
variables need to be next to each other. If your range of cells
includes a row of labels, check the Labels box.
I never check the Constant is Zero box. In some physical systems it
only makes sense for the intercept to be 0, so we can force it to be so. In
our examples that will never be the case. If you want a confidence
interval for the β values other than a 95% confidence interval, click in
the Confidence Level box and enter a different confidence level.
Next, indicate where you want the output to go. Finally, click on the
box next to “Residuals.” I leave all other boxes blank, because I
don’t like the way that Excel does the rest of the residual analysis or
the normal probability plot. Then hit enter.
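
If you are working outside Excel, the same fit can be sketched in a few lines of Python with numpy; this is only an illustrative alternative, and the array names y and X are placeholders for data arranged as described above, with the X columns side by side.

    import numpy as np

    def fit_least_squares(y, X):
        """Return the least squares coefficients [b0, b1, ..., bk]."""
        design = np.column_stack([np.ones(len(y)), X])   # prepend an intercept column
        coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
        return coefs

    # Example with made-up data: two independent variables in columns
    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
    y = np.array([5.1, 4.9, 9.2, 8.8, 13.1])
    print(fit_least_squares(y, X))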
Adequacy of Fit

We have discussed how we fit the line to the data, but we must
remember that we are using sample values to estimate the
hypothesized line. Since the results come from sample estimates,
they are subject to error, and hence we need to be sure that the
relationship we are seeing is really significant. There are a few
measures and tests that we can use to look at how good the fitted
relationship really is.

Example: We will use the pizza delivery example to motivate our
discussion. Let's concentrate on the delivery time as our dependent
variable. We want to predict delivery time from some of the other
variables in the data set. Which ones might it make sense to include?

We will try using distance, day of the week, and hour of the day as
our independent variables. On the next page is the overall output
from the computer for fitting these data. We will talk about every
part of the printout.
Regression Statistics
Multiple R              0.936248
R Square                0.87656
Adjusted R Square       0.874991
Standard Error          0.670277
Observations            240

Analysis of Variance
             df    Sum of Squares    Mean Square    F           Significance F
Regression     3       752.919          250.973     558.6225        7.1E-107
Residual     236       106.028          0.449271
Total        239       858.947

             Coefficients    Standard Error    t Statistic    P-value      Lower 95%    Upper 95%
Intercept        1.156832          0.229887       5.032169    9.54E-07      0.703939     1.609725
Day             -0.02521           0.022013      -1.14541     0.253183     -0.06858      0.018153
Hour            -0.00592           0.018919      -0.31297     0.754578     -0.04319      0.031351
Distance         1.754525          0.042988      40.8147      1E-109        1.669837     1.839213

What is the estimated relationship?

Significance of the Overall Relationship:
Our first question to answer when looking at the goodness of fit of
the model is whether or not the overall relationship is significant. In
other words, is y related to any of the x's?

We do this by testing a hypothesis.
H0:
Ha:

The hypothesis is tested by comparing the amount of variation
explained by the independent variables to the amount of variation left
unexplained. The unexplained variance is shown on the printout as
Residual Mean Square. The explained portion is referred to as
Regression Mean Square. What would be the appropriate test
statistic?

We use the Analysis of Variance (ANOVA) portion of the printout to
test the hypothesis.
ANOVA
             df    Sum of Squares    Mean Square    F           Significance F
Regression     3       752.919          250.973     558.6225        7.1E-107
Residual     236       106.028          0.449271
Total        239       858.947
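
As a quick check of the printout, the F statistic is just the ratio of the two mean squares, and its tail probability can be reproduced with any F-distribution routine. Here is a small Python sketch (values copied from the table above; scipy is assumed to be available):

    from scipy import stats

    f_stat = 250.973 / 0.449271            # Regression MS / Residual MS
    p_value = stats.f.sf(f_stat, 3, 236)   # upper-tail area with (3, 236) df
    print(f_stat, p_value)                 # about 558.6 and 7.1E-107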

Testing Individual Contributions:
We can also test the marginal contribution of an individual
independent variable when all other variables are included in the
model.

We again do this by testing a hypothesis.
H0:
Ha:
This turns out to be a t-test, very similar to the types we have done
before. Here is the part of the printout which can be used to do this
analysis.
             Coefficients    Standard Error    t Statistic    P-value      Lower 95%    Upper 95%
Intercept        1.156832          0.229887       5.032169    9.54E-07      0.703939     1.609725
Day             -0.02521           0.022013      -1.14541     0.253183     -0.06858      0.018153
Hour            -0.00592           0.018919      -0.31297     0.754578     -0.04319      0.031351
Distance         1.754525          0.042988      40.8147      1E-109        1.669837     1.839213

Note that we can also use this section of the printout to form
confidence intervals around individual contributions.
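
Each t statistic is simply the estimated coefficient divided by its standard error, and the p-value comes from a t distribution with n − k − 1 = 236 degrees of freedom. For instance, for Distance (a quick Python check, with scipy assumed available):

    from scipy import stats

    t_stat = 1.754525 / 0.042988                # coefficient / standard error
    p_value = 2 * stats.t.sf(abs(t_stat), 236)  # two-sided p-value, 236 df
    print(t_stat, p_value)                      # about 40.8 and 1E-109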

If we do simple linear regression, testing whether the single slope is
equal to 0 is equivalent to testing whether the correlation is equal to
0.
Here is the test:
H0: ρ = 0 versus Ha: ρ ≠ 0,
or equivalently
H0: β1 = 0 versus Ha: β1 ≠ 0.
For example, if we want to test whether the correlation between day of the
week and preparation time is significant, running a regression
gives the following results:
Regression Statistics
Multiple R              0.101878
R Square                0.010379
Adjusted R Square       0.006221
Standard Error          1.970273
Observations            240

Analysis of Variance
             df    Sum of Squares    Mean Square    F           Significance F
Regression     1       9.689971         9.689971    2.496145        0.115453
Residual     238       923.91           3.881975
Total        239       933.6

             Coefficients    Standard Error    t Statistic    P-value      Lower 95%    Upper 95%
Intercept        1.242385          1.813175       0.685199    0.493883     -2.32954      4.814311
Prep Time        0.191105          0.120959       1.579919    0.115448     -0.04718      0.429391
Conclusions:

Amount of Variation Explained by the Independent Variable:
Our next measure of goodness of fit is referred to as the coefficient of
(multiple) determination, and is denoted R2. R2 is the proportion of
variation explained by the model compared to the overall variation in
the data. It is computed as the ratio of the regression sums of squares
to the total sums of squares.
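
For the pizza delivery fit above, for example, R² = 752.919 / 858.947 ≈ 0.8766, which matches the R Square value reported in the printout.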

Adjusted R2:
One way of increasing the amount of variation explained is to
increase the number of independent or explanatory variables. To
adjust for this, there is a number called the adjusted coefficient of
multiple determination, which is adjusted for the number of variables
in the model.
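
For reference, the adjustment is the standard degrees-of-freedom correction: with n observations and k independent variables, adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1). For the pizza fit, 1 − (1 − 0.87656)(239/236) ≈ 0.875, matching the printout.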

The printout which shows R2 and the adjusted R2 is repeated below.
Regression Statistics
Multiple R              0.936248
R Square                0.87656
Adjusted R Square       0.874991
Standard Error          0.670277
Observations            240

ONLY if we are doing simple linear regression is the square root of R²
equal to the magnitude of the correlation coefficient between the
dependent and independent variables.

The other important number displayed in this portion of the printout
is called “Standard Error.” This is an estimate of the standard
deviation of the observations about the line. The book refers to this
quantity as the “standard error of the estimate,” and denotes it s.
Indicator Variables

In many cases there is reason to believe that the dependent variable is
related to qualitative variables. These variables are usually
categorical and include gender, marital status, educational level, etc.
These variables can also be handled in a regression model.

To use categorical variables, we must define what are called indicator
or dummy variables. These indicator variables assign numbers to the
categories. In general, if we have a total of m categories, we will
need m-1 indicator variables. Most of the time the qualitative
variable will have only two categories, so we will need only one
dummy variable.

To illustrate, consider a case where we want to relate a person’s
income to the number of months they have worked and to the
person’s gender. We define y to be income, x1 to be months on the
job, and x2 to be an indicator variable for gender, where x2=1 if the
person is male, and x2=0 if the person is female.
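
If you are coding such a variable outside the spreadsheet, here is a minimal Python sketch of the step (the column names are made up for illustration):

    import pandas as pd

    df = pd.DataFrame({
        "months_worked": [12, 30, 7, 45],
        "gender": ["male", "female", "female", "male"],
    })
    # 0/1 indicator: 1 if male, 0 if female, matching the definition of x2 above
    df["x2"] = (df["gender"] == "male").astype(int)
    print(df)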

Our model is y = β0 + β1x1 + β2x2 + ε. If we want to investigate the
effect of gender, then we can look at each case separately. Plugging
in the appropriate values for x2 gives

y = (β0 + β2) + β1x1 + ε    for males, and
y = β0 + β1x1 + ε           for females.
The model then says that gender affects the intercept of the
line, but does not affect the slope. Graphically we have
[Graph: Income vs. Tenure, showing two parallel lines, one for Males and one for Females]

β2 then represents the differential influence on income of being male.
If we want to know if the difference is significant, we can test the
hypothesis H0: β2 = 0. If we reject H0, then we conclude the
difference is significant.

Testing Assumptions
The last thing we want to do to examine how well the model fits the
data is to check the validity of the assumptions. Recall that the first
four assumptions were almost all concerned with the error terms of
the model. We can check these assumptions by analyzing the
residuals.

Let us begin by stating the assumptions of linear regression, which
will guide our analyses. There are five important assumptions.
1. The relationship is linear.
2. The error terms (ε) are normally distributed.
3. At every x value, the error terms have the same variance.
4. The error terms are independent of each other.
5. The independent variables (x1, x2, ..., xk) are independent of each
   other.


Residual Analysis
To check most of the assumptions, we do something called a residual
analysis, which involves looking at the difference between the actual
values and the predicted values (the residuals). The residuals are
estimates of the error terms.
To check for model linearity, error term independence, and equal
variance, we will use scatter plots. First we plot the residuals against
the predicted values. Second, we can make a time series plot of the
residuals. If the data are nonlinear, we should see a systematic
pattern in the first plot. If the error terms are dependent, we may also
see a systematic pattern in the scatter plot, or in the time series plot.
Finally, if the error terms have different variances, we should see the
spread in the residuals changing as a function of the predicted values.
If all the assumptions are met, we should see a random scatter plot
with no identifiable patterns.
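
If you prefer to build these plots outside Excel, here is a minimal Python sketch using matplotlib; the predicted values and residuals below are random placeholders standing in for the residual output of an actual regression.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    predicted = rng.uniform(2, 14, size=240)   # placeholder predicted values
    residuals = rng.normal(0, 0.67, size=240)  # placeholder residuals

    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    axes[0].scatter(predicted, residuals)      # residuals vs. predicted values
    axes[0].set_title("Residuals vs. Predicted")
    axes[1].plot(residuals)                    # residuals in observation order
    axes[1].set_title("Time Series of Residuals")
    axes[2].hist(residuals, bins=15)           # histogram of residuals
    axes[2].set_title("Histogram of Residuals")
    plt.tight_layout()
    plt.show()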

To check for normality, we will use a histogram of the residuals. If it
looks close to a bell shaped distribution, we can feel comfortable that
the error terms are close to normally distributed. We can also use
standardized residuals (described below) as another check. The book
mentions a normal probability plot, and Excel presumably constructs
such a plot, but Excel’s is not consistent with the book’s. We will
ignore the normal probability plot for now. Later we will discuss a
hypothesis test to test if the residuals follow a normal distribution.
The plots for the pizza example are shown below.
Linearity:
Variance:
Independence:
Normality:
[Scatter Plot: Residuals vs. Predicted Values]
[Time Series Plot of the Residuals]

[Histogram of the Residuals (Frequency by Bin)]
Examples of identifiable patterns:

[Two example plots of Residuals vs. Predicted Values showing identifiable patterns]

Multicollinearity
The final assumption in multiple linear regression is that the
independent variables are truly independent of each other. If they are
not, the model will exhibit multicollinearity. We will discuss the
effects of multicollinearity, and how to detect it.

Effects of multicollinearity: If we are only interested in prediction,
then multicollinearity does not represent a large problem. If we are
trying to explain the relationships between dependent and
independent variables, however, it does cause problems. The main
problem is that the standard errors of the regression coefficients are
highly inflated. Hence the estimated regression coefficients have
large sampling variability, and tend to vary widely from one sample
to the next when the independent variables are highly correlated.
Another problem is in the interpretation of the estimated coefficients.
When the explanatory variables are correlated, we cannot really vary
one variable without the correlated variable(s) changing at the same
time.

Detecting Multicollinearity: There are a few ways of informally
detecting multicollinearity, and there is one more formal way. Here
are some informal indications:
1. Large changes in the estimated regression coefficients when a
   variable is added or deleted.
2. Nonsignificant results in individual tests on, or wide confidence
   intervals for, the regression coefficients representing important
   independent variables.
3. Estimated regression coefficients with an algebraic sign that is
   opposite of that expected from theoretical considerations or prior
   experience.
4. Large correlation coefficients between independent variables in
   the correlation matrix.
Variance Inflation Factors

A more formal way to test for multicollinearity is to calculate
variance inflation factors (VIFs). These factors measure how much
the variances of the estimated regression coefficients are inflated as
compared to when the independent variables are not correlated.

An individual VIF for variable i is defined as

VIFi = 1 / (1 − Ri²),

where Ri² is the coefficient of multiple determination when Xi is
regressed on all of the other X's in the model. Fortunately, to
calculate the VIFs we do not need to run a regression analysis of each
independent variable on all of the others.

To calculate the VIF values, we will use the correlation matrix of the
independent variables only. After finding the correlation matrix
using the CORRELATION analysis tool, fill in the blank values, and
then invert the matrix. In Quattro Pro for Windows this can be done
by using the NUMERIC TOOL called INVERT. The procedure is
almost self-explanatory.
To invert the matrix in Excel, you need to use the function
MINVERSE(array), where "array" is the range of cells containing the
correlation matrix. This function is called an array function. When
entering array functions you need to first block out all the cells where
you want the result to go, and then enter the formula by
simultaneously pressing CTRL+SHIFT+ENTER. We will illustrate
this in the lab.

The VIF values will be on the diagonal of the resulting matrix. If
any VIF value exceeds 10, there is evidence of multicollinearity.
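
Outside the spreadsheet, the same procedure takes only a few lines in Python; the sketch below assumes X is a numpy array whose columns are the independent variables. Applied to the three body measurements in the example that follows, it should reproduce the diagonal values shown there.

    import numpy as np

    def variance_inflation_factors(X):
        """VIFs are the diagonal entries of the inverse of the correlation matrix."""
        corr = np.corrcoef(X, rowvar=False)   # correlation matrix of the columns of X
        return np.diag(np.linalg.inv(corr))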

Example: To show the effects of multicollinearity, and to show how
to detect it, consider a study of the relation of body fat to triceps
skinfold thickness, thigh circumference, and midarm circumference
based on a sample of 20 healthy females 25-34 years old. The data
are shown below.
Skinfold      Thigh            Midarm           Body
Thickness     Circumference    Circumference    Fat
 19.5             43.1             29.1         11.9
 24.7             49.8             28.2         22.8
 30.7             51.9             37.0         18.7
 29.8             54.3             31.1         20.1
 19.1             42.2             30.9         12.9
 25.6             53.9             23.7         21.7
 31.4             58.5             27.6         27.1
 27.9             52.1             30.6         25.4
 22.1             49.9             23.2         21.3
 25.5             53.5             24.8         19.3
 31.1             56.6             30.0         25.4
 30.4             56.7             28.3         27.2
 18.7             46.5             23.0         11.7
 19.7             44.2             28.6         17.8
 14.6             42.7             21.3         12.8
 29.5             54.4             30.1         23.9
 27.7             55.3             25.7         22.6
 30.2             58.6             24.6         25.4
 22.7             48.2             27.1         14.8
 25.2             51.0             27.5         21.1
The correlation matrix for all of the data is shown below.
                        Skinfold       Thigh           Midarm          Body
                        Thickness      Circumference   Circumference   Fat
Skinfold Thickness      1
Thigh circumference     0.923843       1
Midarm Circumference    0.457777       0.084667        1
Body Fat                0.843265       0.87809         0.142444        1
We can see that the skinfold thickness and thigh circumference are
quite highly correlated. Hence we should suspect multicollinearity.

First let's look at the regression results for body fat when only
skinfold thickness and thigh circumference are used as independent
variables.
Regression Statistics
Multiple R              0.882072
R Square                0.778052
Adjusted R Square       0.75194
Standard Error          2.543166
Observations            20

Analysis of Variance
             df    Sum of Squares    Mean Square    F           Significance F
Regression     2       385.4387         192.7194    29.79723        2.77E-06
Residual      17       109.9508         6.467694
Total         19       495.3895

                       Coefficients    Standard Error    t Statistic    P-value     Lower 95%    Upper 95%
Intercept                -19.1742           8.360641      -2.29339      0.033401    -36.8137     -1.53481
Skinfold Thickness         0.222353         0.303439       0.732775     0.47264      -0.41785     0.862554
Thigh circumference        0.659422         0.291187       2.264597     0.035425      0.045069    1.273774
Suppose that we tested significance at the .01 level. What can we
observe?

Now let's include the third variable.
Regression Statistics
Multiple R              0.895186
R Square                0.801359
Adjusted R Square       0.764113
Standard Error          2.479981
Observations            20

Analysis of Variance
             df    Sum of Squares    Mean Square    F           Significance F
Regression     3       396.9846         132.3282    21.51571        7.34E-06
Residual      16       98.40489         6.150306
Total         19       495.3895

                        Coefficients    Standard Error    t Statistic    P-value     Lower 95%    Upper 95%
Intercept                 117.0847           99.7824        1.1734       0.255136    -94.4445     328.6139
Skinfold Thickness          4.334092          3.015511      1.437266     0.166908     -2.05851     10.72669
Thigh circumference        -2.85685           2.582015     -1.10644      0.282348     -8.33047      2.616779
Midarm Circumference       -2.18606           1.595499     -1.37014      0.186616     -5.56837      1.196246

Now what do we see?

Here is what we would do to get the variance inflation factors. First
we need the correlation matrix of the independent variables.
                        Skinfold       Thigh           Midarm
                        Thickness      Circumference   Circumference
Skinfold Thickness      1
Thigh circumference     0.923843       1
Midarm Circumference    0.457777       0.084667        1
Next we fill in the missing values.
1           0.923843    0.457777
0.923843    1           0.084667
0.457777    0.084667    1
Finally, we invert the matrix and look for the diagonal elements.
 708.8429142    -631.9152231    -270.9894176
-631.9152231     564.3433857     241.4948157
-270.9894176     241.4948157     104.606
There is clearly multicollinearity present in this case. Note that even
though the third independent variable (midarm circumference) was
not highly correlated with either of the other two independent
variables, it has a high VIF value. This is a case where one variable
is strongly related to two others together, even though the pairwise
correlations are small.
Making Predictions

One of the most common uses of fitting a linear model to a set of data
is to make predictions at new levels of the independent variables. To
make a prediction, we simply insert the desired values of the
independent variables into the estimated relationship. The result is a
point estimate. As with other statistical estimates, we can also make
interval estimates.

For example, suppose a new subdivision is being constructed and we
are interested in estimating how long it will take to deliver pizzas to
homes in the neighborhood. The distance from the pizza parlor to the
entrance to the subdivision is 5 miles. We will drop day of the week
and time of day from the model since they did not add to the model.
If we run a new regression, we obtain the estimated relationship
Estimated Delivery Time=1.02815+1.74962*Distance.
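Plugging in the 5-mile distance gives the point estimate 1.02815 + 1.74962(5) ≈ 9.78.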

We will call the point estimate of y given a particular set of x values
ŷp. ŷp is actually an estimate of two things. First, it is an estimate of
the mean level of y given all the values of x, represented by E(yp).
Second, it is an estimate of an individual value of y given all of the x
values.

We can also find interval estimates of the prediction, but there is no
built-in function in Excel to do it. Hence we will not discuss it in this
class. If you are interested in how to do it, I have a macro function
that can be used.
Curvilinear Regression

So far we have looked at cases where the relationship is linear in both
the 's and the x's. The assumption on the x's is not as limiting as it
seems. In multiple regression, we can substitute nonlinear functions
for the x's and for y. This allows us to fit functions which show
nonlinearity. Hence, if the residual analysis shows that the linearity
assumption is violated, we can try to fit a model which is nonlinear.

The most common types of nonlinear models are polynomials and
logarithmic models. For example, in some physical systems there is
a theoretical relationship that says y = ax^b. If we take the natural
logarithm of both sides, we obtain ln(y) = ln(a) + b·ln(x). If we
substitute y' = ln(y), x' = ln(x), and a' = ln(a), we have y' = a' + bx',
which is a linear model. We can then use y' and x' as inputs to the
linear regression model, and find estimates for a and b, which usually
have useful physical interpretations.
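
As a sketch of what the transformed fit might look like in code (the x and y arrays are made-up placeholder data, assumed positive so the logarithms exist):

    import numpy as np

    x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])     # placeholder data
    y = np.array([3.0, 5.9, 12.1, 24.2, 47.8])   # placeholder data
    # Fit ln(y) = ln(a) + b*ln(x) by simple linear regression
    b, log_a = np.polyfit(np.log(x), np.log(y), 1)   # slope first, then intercept
    a = np.exp(log_a)
    print(a, b)   # estimates of a and b in y = a * x**b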

Even when the model is linear, we can frequently use the logarithm as
a way of fixing problems with heteroscedasticity or normality. For
some reason, the transformation fixes the problem in many cases.

If we assume a polynomial relationship, the model may look like

y = β0 + β1x + β2x² + β3x³ + ε.

We can let x1 = x, x2 = x², and x3 = x³. However, this has an inherent
problem.

What is it?

We can fix the multicollinearity by using the following substitutions:
x1 = x − x̄, x2 = (x − x̄)², and x3 = (x − x̄)³.
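
A quick sketch of building these centered columns in Python (x is a placeholder array):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # placeholder data
    xc = x - x.mean()                          # centered x
    X = np.column_stack([xc, xc**2, xc**3])    # columns x1, x2, x3 for the cubic model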

Indicator Variables and Curvilinear Regression
In our example with indicator variables, we said that the slopes were
assumed to be equal even when gender differed. We can use
curvilinear regression, combined with indicator variables, to relax
that assumption.
The model is y = 0 + 1x1 + 2x2 + 3x1x2 + .
Model for the men:
Model for the women:
What do we do if we want to test if the slope with respect to income
is the same for men and women?