Introduction to Multiple Linear Regression


Multiple Linear Regression Notes – Introduction

Relationships

Estimating Multiple Linear Functions

Sources of Variability

Assumptions

Assumption Checks

Testing for the effect of all the variables

Inferences about each coefficient

Testing the effect of some of the variables

Prediction and Estimation

NCSS

What you should be able to do when you finish the notes

o Discuss differences in the types of multiple linear relationships
o Define and put in English the effects of variables in the different models
o Define and put in English the standard error of the estimate and the coefficient of determination
o Discuss the required assumptions
o Know how to check the validity of the assumptions
o Test the hypothesis about the effect of the model
o Test hypotheses and construct confidence intervals for the slopes
o Construct confidence intervals for an average value of Y and prediction intervals for a value of Y (for specific values of X)
o Compare the similarities and differences in simple linear and multiple linear regression
o Get NCSS to calculate the required estimates and tests

1. Relationships

1.1 Types of relationships

First Order: additional distinct variables are added to the SLR model

Second Order: squared terms are added to model

Interaction: product terms are added

1.2 Effects of each variable

First Order Model: This is the change in the average value of the dependent variable when an independent variable increases by one, holding constant all the others.

Second Order Model: The change in the average value of the dependent variable when an independent variable increases by one depends on the current value of the independent variable, holding constant all the others.

Interaction Model: The change in the average value of the dependent variable when an independent variable increases by one depends on the value of another variable, while holding constant all the others.

1.3 Examples

Based on trying to predict exam grade, Y, based on hours studied, X1, and GPA, X2.

First Order Model: This is the change in the average exam grade when you study an additional hour, for those people with the same GPA.

E(ExamGrade) = β0 + β1(HoursStudied) + β2(GPA)

Or E(Y) = β0 + β1X1 + β2X2

Second Order Model:

E(ExamGrade) = β0 + β1(HoursStudied) + β2(HoursStudied)² + β3(GPA)

The change in the average exam grade when you study an additional hour depends on how long you have studied, controlling for the effect of GPA.

To find the change, substitute the value of "HoursStudied + 1" into the above equation and examine the difference:

E(ExamGrade) = β0 + β1(HoursStudied + 1) + β2(HoursStudied + 1)² + β3(GPA)

minus

E(ExamGrade) = β0 + β1(HoursStudied) + β2(HoursStudied)² + β3(GPA)

which equals the following, as long as GPA is the same value in both equations:

β1 + β2(1 + 2·HoursStudied)

Interaction Model: The change in the average exam grade for each additional hour studied depends on the value of GPA.

E(ExamGrade) = β0 + β1(HoursStudied) + β2(GPA) + β3(HoursStudied × GPA)

or E(Y) = β0 + β1X1 + β2X2 + β3X1X2, which can be rewritten as

E(Y) = β0 + (β1 + β3X2)X1 + β2X2
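To see this effect explicitly, use the same substitute-and-subtract argument as in the second-order case: increase HoursStudied by one while holding GPA fixed, and subtract the original equation:

E(ExamGrade) using HoursStudied + 1, minus E(ExamGrade) using HoursStudied
= [β0 + β1(HoursStudied + 1) + β2(GPA) + β3(HoursStudied + 1)(GPA)] − [β0 + β1(HoursStudied) + β2(GPA) + β3(HoursStudied)(GPA)]
= β1 + β3(GPA)

So the change in the average exam grade per additional hour studied is β1 + β3(GPA), which depends on the value of GPA, just as the rewritten form above shows.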

Other examples:

Baseball: http://baseballanalysts.com/archives/2007/05/the_value_of_th.php

Accounting http://www.nysscpa.org/cpajournal/2005/105/essentials/p56.htm

Business: http://www.cdc.gov/MMWR/preview/mmwrhtml/00037061.htm

U.S. Weather Forecasts: http://www.weather.gov/ost/NWS_TIP.pdf

Create your own examples of interpretations by changing the words and numbers in

First Order Model: http://wweb.uta.edu/faculty/eakin/busa5325/BetaInter1stOrderModel.xls

Interaction Model: http://wweb.uta.edu/faculty/eakin/busa5325/BetaInterInteractionModel.xls

Quadratic Model: http://wweb.uta.edu/faculty/eakin/busa5325/BetaInterQuadraticModel.xls

2. Estimating Multiple Linear Functions

Same as for SLR: Least squares

Interpretation of estimates: just place the word "estimated" in the interpretations above and provide the values from the estimated equation

Examples for a first order two variable model: http://wweb.uta.edu/faculty/eakin/busa5325/EstbiMLR.xls
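If you prefer to see the least-squares step in code rather than in Excel or NCSS, here is a minimal Python sketch (statsmodels) for the first-order two-variable exam-grade model; the data values below are made up for illustration and are not from the notes.

```python
# A sketch of least-squares estimation of the first-order model
#   E(ExamGrade) = b0 + b1(HoursStudied) + b2(GPA)
# using made-up sample data; NCSS, SAS, or Excel would give the same estimates.
import numpy as np
import statsmodels.api as sm

# hypothetical sample data (not from the notes)
hours = np.array([2, 5, 1, 8, 6, 3, 7, 4, 9, 5])
gpa   = np.array([3.1, 3.5, 2.4, 3.8, 3.0, 2.8, 3.6, 3.2, 3.9, 2.9])
grade = np.array([68, 85, 55, 95, 80, 70, 90, 75, 98, 78])

X = sm.add_constant(np.column_stack([hours, gpa]))  # adds the intercept column
fit = sm.OLS(grade, X).fit()                         # least-squares estimates b0, b1, b2

print(fit.params)  # b1 = estimated change in average grade per extra hour, holding GPA constant
```

The fitted coefficients are interpreted exactly as in section 1.2, with the word "estimated" added.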

3. Sources of Variation

Total: The variability of the y values around their mean

Regression: the variability of the estimated (fitted) y values around their mean

Error: the variability of the y values around the mean given the values of the X’s

Standard Error of the Estimate, S_y|x1,x2,…,xk or S_e: an estimate of the typical error that occurs when the estimated model is used to predict y in the sample data. This plays the role of what we earlier called the sample standard deviation.

Coefficient of Determination, R 2 : Percent of sample variability of Y that can be associated with variation in the independent variables

Example: Sales, size, and ad budget of a store; S_e = 966,000, R² = 0.904. Interpret the meaning of the standard error of the estimate and the coefficient of determination.

Solution: As always replace Y and X with the names from the example

Standard Error of the Estimate, S_y|x1,x2: If the estimated equation is used, there is a typical error of $966,000 when predicting sales in this sample.

Coefficient of Determination, R²: 90.4% of the sample variability of sales can be associated with variation in the size of the store and the advertising budget.

Click on the following link to load an Excel worksheet that will allow you to create new examples. When asked for the independent variables, list all the variables, separated by commas: http://wweb.uta.edu/faculty/eakin/busa3321/VariationExplanation.xls
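As a supplement to the worksheet, the sketch below shows how the three sources of variation, the coefficient of determination, and the standard error of the estimate are computed from the y values and the fitted values; the function name and arguments are made up for illustration.

```python
# A sketch of the variance breakdown described above:
#   SST = SSR + SSE,  R^2 = SSR/SST,  S_e = sqrt(SSE/(n-k-1))
import numpy as np

def variation_summary(y, y_hat, k):
    """Return SST, SSR, SSE, R^2, and the standard error of the estimate."""
    n = len(y)
    sst = np.sum((y - y.mean()) ** 2)    # total variability of y around its mean
    sse = np.sum((y - y_hat) ** 2)       # error variability around the fitted values
    ssr = sst - sse                      # variability associated with the regression
    r2  = ssr / sst                      # coefficient of determination
    s_e = np.sqrt(sse / (n - k - 1))     # standard error of the estimate
    return sst, ssr, sse, r2, s_e
```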

4. Assumptions

Same as in simple linear regression but more independent variables to list

Example: Based on trying to predict exam grade, Y, based on hours studied, X1, and GPA, X2.

Linearity: the average exam grade has the specified relationship with hours studied and GPA

Independence: the students are independently selected, and the values of the exam grades are not related for given values of hours studied and GPA

Normality: the exam grades are normally distributed for students with the same values of hours studied and GPA

Equal Variation: the variation in the exam grades is the same (equal) for students with the same values of hours studied and GPA, regardless of what those values are

Examples: Interpreting the Assumptions of MLR

5. Assumption Checks

Residual plots versus estimated average

Residual plots versus each variable

Normal probability plots
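A minimal Python sketch of these three checks (matplotlib and scipy), using the same kind of made-up exam-grade data as in section 2; all data values and variable names are hypothetical.

```python
# Residual plots versus the estimated average and each predictor, plus a normal probability plot.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

# hypothetical sample data (not from the notes)
hours = np.array([2, 5, 1, 8, 6, 3, 7, 4, 9, 5])
gpa   = np.array([3.1, 3.5, 2.4, 3.8, 3.0, 2.8, 3.6, 3.2, 3.9, 2.9])
grade = np.array([68, 85, 55, 95, 80, 70, 90, 75, 98, 78])

fit = sm.OLS(grade, sm.add_constant(np.column_stack([hours, gpa]))).fit()
resid, fitted = fit.resid, fit.fittedvalues

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
axes[0].scatter(fitted, resid); axes[0].set_title("Residuals vs estimated average")
axes[1].scatter(hours, resid);  axes[1].set_title("Residuals vs HoursStudied")
axes[2].scatter(gpa, resid);    axes[2].set_title("Residuals vs GPA")
stats.probplot(resid, plot=axes[3])       # normal probability plot of the residuals
axes[3].set_title("Normal probability plot")
plt.tight_layout()
plt.show()
```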

6. Overall effects:

6.1 Test of the model: E(Y) = β0 + β1X1 + β2X2 + … + βkXk

Note that any one of the X's above could be a function of one or more other independent variables; e.g., the square of another variable or the product of two variables. Therefore find the value of k by looking at the subscript on the last β in the model. So if our model was

E(Y) = β0 + β1X1 + β2X1² + β3X1²X2 + β4X2² + β5X2, then k = 5
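A small sketch of how k is counted in practice: build one column for every term after β0 (including squared and product terms) and count the columns. The term list below mirrors the example model written above; the data values are made up.

```python
# Count k as the number of X columns in the design matrix (one column per beta after beta0).
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical values
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])   # hypothetical values

X = np.column_stack([x1, x1**2, x1**2 * x2, x2**2, x2])
k = X.shape[1]
print(k)   # 5
```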

6.1.1 Test form

Hypothesis: Ho: β1 = β2 = … = βk = 0 (no variable has an effect)

H1: at least one has an effect

Rejection region: One sided F with k and n-k-1 degrees of freedom

Test Statistic: This is an application of the 5th Building Block of the course: if the sample slopes vary enough from zero, we can conclude the population slopes vary from zero. We will use the F form of the test:

F = variation of coefficient estimates / variability of randomness

Conclusion: We can (not) say that changes in the value of at least one independent variable are associated with changes in the average value of the dependent variable.

6.1.2 Example: Based on trying to predict exam grade, Y, based on hours studied, X1, and GPA, X2. You are given that n = 30, MSR = 200 and MSE = 10 using the model: E(ExamGrade) = β0 + β1(HoursStudied) + β2(GPA)

Hypothesis: Ho: β1 = β2 = 0 (neither hours studied nor GPA has an effect on the average exam grade)

H1: at least one has an effect

Rejection region: Reject Ho if F > F(2, 27) = 3.354

Test Statistic: variation of coefficient estimates / variability of randomness

F = MSR/MSE = 200/10 = 20

Conclusion: We can say that changes in GPA and/or hours studied are useful in predicting the exam grade.

To create other examples, click on the link below and change numbers and names http://wweb.uta.edu/faculty/eakin/busa5325/MLRTestOfModel.xls
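The overall F test in this example can also be reproduced with a few lines of Python, using scipy only for the F table value; the numbers (n = 30, k = 2, MSR = 200, MSE = 10) are the ones given above.

```python
# Overall F test of the model from section 6.1.2.
from scipy import stats

n, k = 30, 2
MSR, MSE = 200, 10

F = MSR / MSE                              # 20
F_crit = stats.f.ppf(0.95, k, n - k - 1)   # F(2, 27) at alpha = 0.05, about 3.354
p_value = 1 - stats.f.cdf(F, k, n - k - 1)

print(F, F_crit, p_value)  # F > F_crit, so reject Ho: at least one variable has an effect
```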

7. Test of each coefficient

We will only consider the test for the first-order model

The test is the same as in SLR but with n-k-1 degrees of freedom and adding the phrase "adjusting for all other variables"

Example: You are given that n = 30, with estimated model:

ŷ = 30 + 1.67(HoursStudied) + 20(GPA), where S_b1 = 0.157

o Hypothesis: H0: β1 = 0 versus H1: β1 > 0

o Rejection region: The degrees of freedom are n-k-1 = 27. Since this is a right-sided t-test, find the t-table value of 1.7033 in row 27, column 0.05. Therefore reject H0 if t > 1.7033

o Test Statistic: t = (1.67 - 0)/0.157 = 10.64

o Conclusion: After controlling for the effect of GPA, we can say that increases in hours studied are associated with increases in the average exam grade

To create other examples, click on the link below and change numbers and names

http://wweb.uta.edu/faculty/eakin/busa5325/MLRTestOfBeta1.xls

7.1 Confidence interval for the slope

Same as in SLR but with n-k-1 degrees of freedom and controlling for all other variables.

Click on the following link to load an Excel worksheet that will allow you to create new examples for a two variable first-order model.

http://wweb.uta.edu/faculty/eakin/busa3321/C.I.slope.xls
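A short Python sketch of the coefficient t test and the matching confidence interval, using the numbers from the example in section 7 (b1 = 1.67, S_b1 = 0.157, n = 30, k = 2); scipy supplies the t table values.

```python
# t test and confidence interval for one slope in a first-order model.
from scipy import stats

n, k = 30, 2
b1, se_b1 = 1.67, 0.157

t = (b1 - 0) / se_b1                       # about 10.64
t_crit = stats.t.ppf(0.95, n - k - 1)      # right-sided test at alpha = 0.05: 1.7033

# 95% confidence interval for beta1 (two-sided table value)
t_two = stats.t.ppf(0.975, n - k - 1)
ci = (b1 - t_two * se_b1, b1 + t_two * se_b1)

print(t, t_crit, ci)  # t > 1.7033, so reject H0; hours studied matters after adjusting for GPA
```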

8. Testing a subset of the coefficients.

8.1 Variation of the subset of variables over and above the effect of all other variables.

Find the sum of squares of regression of all variables (Full Model)

Find the sum of squares regression of the other variables (Reduced Model).

The difference is the variation in the subset over and above the others.

The estimate of this variation is the difference divided by the number of variables, c, being tested.

8.2 Example: Based on trying to predict exam grade, Y, based on hours studied, X1, and GPA, X2. You are given that n = 30 and

For the Full model: E(ExamGrade) = β0 + β1(HoursStudied) + β2(HoursStudied)² + β3(GPA)

The MSE = 5 with d.f. = n-k-1 = 26 and the SSR_f = 250

You want to test the effect of Hours Studied. So after fitting the reduced model: E(ExamGrade) = β0 + β3(GPA), you find SSR_r = 50

Hypothesis: Ho: β1 = β2 = 0 (hours studied has no effect on the average exam grade after adjusting for GPA)

H1: it does have an effect

Rejection region: Reject Ho if F > F(2, 26) = 3.369

Test Statistic: variation of coefficient estimates / variability of randomness

F = [(SSR_f - SSR_r)/c] / MSE_f = [(250 - 50)/2] / 5 = 20

Conclusion: We can say that hours studied is useful for predicting the exam grade after adjusting for the effect of GPA.
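A short Python sketch of this full-versus-reduced (partial) F test, using the numbers given in the example (SSR_f = 250, SSR_r = 50, c = 2, MSE_f = 5, error d.f. = 26).

```python
# Partial F test: does the subset of coefficients add anything over and above the others?
from scipy import stats

SSR_full, SSR_reduced = 250, 50
c = 2                        # number of coefficients being tested
MSE_full, df_error = 5, 26

F = ((SSR_full - SSR_reduced) / c) / MSE_full   # (200/2)/5 = 20
F_crit = stats.f.ppf(0.95, c, df_error)         # F(2, 26) at alpha = 0.05, about 3.369

print(F, F_crit)   # F > F_crit: hours studied matters after adjusting for GPA
```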

9. Estimating the average and predicting an individual

9.1 Estimating the average value of all Y values for observations with the same value of X1, the same value of X2, …, and the same value of Xk

Formula: estimated mean plus and minus the margin of error: ŷ ± t(n-k-1)(SE_mean)

Conclusion: We can say with ___ confidence that the average value of the dependent variable is _____ with a margin of error of _________ for all observations with values of the independent variables of ____

Example: Find average sales for all stores that have 4,000 square feet and a $100,000 advertising budget. You are given that the estimated average sales = 964,000 + 1,670,000(size) + 1.45(ad budget) and that SE_mean, the standard error of the mean estimate, = 309,000

Solution: Substitute the value of x into the estimated equation to obtain the estimated average sales:

Estimate of average sales = 964,000 + 1,670,000(4) + 1.45(100,000) = 7,789,000

Next substitute values into the confidence interval ŷ ± t(n-k-1)(SE_mean):

7,789,000 ± 2.2010(309,000) = 7,789,000 ± 680,109

Conclusion: For all stores that have 4,000 square feet and have a 100,000 advertising budget, we can say with 95% confidence that the average sales is

$7,789,000 with a margin of error of ± $680,109

Click on the following link to load an Excel worksheet that will allow you to create new examples (special case of two independent variables with the estimated standard error of estimating the average value of Y already calculated):

http://wweb.uta.edu/faculty/eakin/busa5325/CIMuYMLR.xls
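The interval above is simple arithmetic once the standard error is available; the Python sketch below reproduces it using the numbers given in the example, taking the t table value of 2.2010 as given.

```python
# Confidence interval for the average value of Y from section 9.1.
y_hat = 964_000 + 1_670_000 * 4 + 1.45 * 100_000   # 7,789,000
se_mean = 309_000
t_table = 2.2010                                    # t value used in the notes

margin = t_table * se_mean                          # about 680,109
print(y_hat - margin, y_hat + margin)               # interval for the average sales
```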

9.2 Predicting the value of Y for an observation with given values of the X's

Formula: predicted value plus and minus the margin of error: ŷ ± t(n-k-1)(SE_individual)

Conclusion: We can say with ___ confidence that the value of the dependent variable is _____ with a margin of error of _________ for an observation with values of the independent variables of ____

Example: Find sales for a store that has 4,000 square feet and a $100,000 advertising budget. You are given the predicted sales = 964,000 + 1,670,000(size) + 1.45(ad budget) and its estimated standard error is 1,104,000.

Solution: Substitute the value of x into the estimated equation to obtain the value of the predicted sales:

Predicted sales = 964,000 + 1,670,000(4) + 1.45(100,000) = 7,789,000

Next substitute values into the prediction interval ŷ ± t(n-k-1)(SE_individual):

7,789,000 ± 2.2010(1,104,000) = 7,789,000 ± 2,429,904

Conclusion: For a store that has 4,000 square feet and has a $100,000 advertising budget, we can say with 95% confidence that the sales will be $7,789,000 with a margin of error of ± $2,429,904

Click on the following link to load an Excel worksheet that will allow you to create new examples (special case of two independent variables with the estimated standard error of predicting an individual value of Y already calculated):

http://wweb.uta.edu/faculty/eakin/busa5325/CIYMLR.xls
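The prediction interval is computed the same way, only with the larger standard error for an individual observation; the sketch below uses the numbers given in the example.

```python
# Prediction interval for one store's sales from section 9.2.
y_hat = 964_000 + 1_670_000 * 4 + 1.45 * 100_000    # 7,789,000
se_individual = 1_104_000
t_table = 2.2010                                     # same t value as in section 9.1

margin = t_table * se_individual                     # about 2,429,904
print(y_hat - margin, y_hat + margin)                # interval for one store's sales
```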

10. SAS (to be constructed)
