Simple Linear Regression Notes

Regression Notes

Relationships

Estimating the Linear Function

Measures of Variation

Assumptions

Assumption Checks

One Slope

Multiple Slopes

Estimate Averages

Predict Individual Values

NCSS

What you should be able to do when you finish the notes
o Discuss differences in the types of relationships
o Define and put in English the slopes and intercept of a population
o Discuss how the estimation procedure works
o Define and put in English the estimates of the slopes and intercept
o Define and put in English the standard error of the estimate and the coefficient of
determination
o Discuss the required assumptions
o Know how to check the validity of the assumptions
o Test hypotheses about the slope and construct confidence intervals for it
o Test the hypothesis about the effect of the model
o Construct confidence intervals for an average value of Y and prediction intervals
for a value of Y (for specific values of the X’s)
o Get NCSS to calculate the required estimates and tests
1. Relationships
Example 1: Suppose you believe that the reimbursed cost of a trip is a function of the
mileage. You receive a fixed cost of $40 and a variable cost of $0.35 per mile. In English:

Fixed cost: When the mileage is zero, the reimbursed cost is $40.

Variable cost: For each mile traveled, the reimbursed cost increases by $0.35

Reimbursement for a trip of 100 miles = $40 + 0.35*(100) = $75
Example 2: A $10,000 computer is being depreciated over 5 years (with no resale value
at the end of 5 years). In English, what is the meaning of the fixed and variable components?

Fixed: When ____________ is zero, the _____________ is _______

Variable: For each _______, the ________ decreases by ___________

For a 2 year old computer the remaining value of the computer is _____ _______ = ______
Example 3: Suppose you believe that the reimbursed cost of a trip is a function of the
mileage and the number of days of the trip. You receive a fixed cost of $40 and variable
costs of $0.35 per mile and $150 per day. In English

Fixed cost: When the mileage and the number of days are zero, the reimbursed cost
is $40.

Variable cost of mileage: Assuming the number of days does not change, for each
mile traveled, the reimbursed cost increases by $0.35

Variable cost of days: Assuming the mileage does not change, for each day of the
trip, the reimbursed cost increases by $150

Reimbursement for a trip of 100 miles that lasted two days = $40 + 0.35*(100) +
$150*(2) = $375
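The reimbursement rule in Examples 1 and 3 can be written as a small function. This is only a sketch: the function name and argument names are made up for illustration, and the dollar amounts are the ones given in the examples.

```python
def reimbursement(miles, days=0, fixed=40.0, per_mile=0.35, per_day=150.0):
    """Reimbursed cost as a first-order function of mileage and trip length."""
    return fixed + per_mile * miles + per_day * days


example_1 = reimbursement(100)           # Example 1: mileage only
example_3 = reimbursement(100, days=2)   # Example 3: mileage and days

# Meaning of a slope: one more mile, with days held constant,
# changes the reimbursed cost by the per-mile amount.
extra_mile = reimbursement(101, days=2) - reimbursement(100, days=2)

print(f"Example 1: ${example_1:.2f}")
print(f"Example 3: ${example_3:.2f}")
print(f"Effect of one more mile: ${extra_mile:.2f}")
```

Holding days fixed while changing miles is the same "holding all other variables constant" idea used later when interpreting slopes.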
 Forms of relationships:
 Mathematical versus tendencies
Do you believe that the actual cost of a trip is exactly equal to the reimbursed cost?
We will assume that the relationship is just a tendency (or on average).

Types of relationships – Of all possible depreciation methods, what is the simplest?
For one independent variable (Simple Linear Regression): The mean value of one
variable, Y, depends on the values of another, X. For example, the average starting salary
depends on a student’s GPA. There are many relationship functions but the simplest is a
straight line:
\[ \mu_{y|x} = \beta_0 + \beta_1 X \]
This is read as “the mean value of Y given X is a straight line function of X”.
For two independent variables (two or more independent variables is Multiple Linear
Regression): The mean value of one variable, Y, depends on the values of variables X1
and X2. For example, the average starting salary depends on a student’s GPA and their
work experience. There are many relationship functions but the simplest is a first-order
model:
\[ \mu_{y|x_1,x_2} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 \]
In general the model for k independent variables is:
\[ \mu_{y|x_1,x_2,\ldots,x_k} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k \]

Incomplete information
Do you believe that it is possible to obtain information about all possible trips?
If no, then the estimates are in error and we need to compute their margins of error.
Example 4: Suppose you believe that the number of bars your company sells depends on the
price of the bar (measured in cents) and the amount spent on promotion (in dollars). You have
recorded these three values for a random sample of 34 stores in your chain. There are
five things of interest in this example. What are the value, interpretation, and margin of error of…
Parameter | Statistic (given) | Standard Error (given) | Margin of Error: use a t (n-k-1)
Fixed (intercept) | 5837.52 | 628.15 |
Slope of Price | -53.22 | 6.85 |
Slope of Promotion | 3.61 | 0.68 |
The number of bars sold for all stores where the bar is priced 79 cents and $400 is spent on promotion | Estimated average number of bars = 5837.52 - 53.22(79) + 3.61(400) = 3077.14 bars | 110.08 |
The number of bars sold for a store where the bar is priced 79 cents and $400 is spent on promotion | Predicted number of bars = 5837.52 - 53.22(79) + 3.61(400) = 3077.14 bars | 647.49 |
Put each of the five in English with their margin of error

Intercept

Slope of price

Slope of promotion

Estimate of average

Prediction of an individual
2. Meaning of Population Slopes and Intercept

Intercept, 0: The average value of the dependent variable when the independent
variable takes on the value zero (We would usually add the phrase “in the
population” to this. However for the remainder of this class this will not be
included but will be understood to be in the sentence. Therefore I will assume you
are talking about the population unless you specifically state that your slope or
intercept is an estimate or use the phrase “in the sample” in its discussion.)

Slope, 1: Holding all other variables constant, when the first independent variable
increases by one unit, the average value of the dependent variable
increases/decreases by 1 units.

Slope, 2: Holding all other variables constant, when the second independent
variable increases by one unit, the average value of the dependent variable
increases/decreases by 2 units.
Example: Interpret the meaning of the intercept and slope when relating Starting Salary and
graduating GPA.
Solution: (1) Determine if one variable depends on the other. In this case the starting salaries of
students should depend on their GPA.
(2) Next determine the units of the independent variable. Here GPA is measured in points.
(3) Finally, replace the names and units in the definitions above.

Intercept, 0: The average value of the starting salary when the GPA is zero.
Reword this to make it sound better. For example: For all students who graduate
with a GPA of zero, 0 is the average of their starting salaries. (Many times the
intercept does not have a realistic meaning.)

Slope, 1: The change in the average starting salary when the GPA increases by one
point. Rewording example: Average starting salary is 1 higher/lower (higher if
positive and lower if negative) for all students whose graduating GPA is higher than
another group by one point.

Two Notes: (1) Unless this is an experiment we cannot say that increasing GPA
causes average starting salary to change only that this is a difference in two
populations and (2)There are no other variables to hold constant.


Click on the following link to load an Excel worksheet that will allow you to create new
examples:

for one independent variable:
http://wweb.uta.edu/faculty/eakin/busa3321/Beta0-Beta1Interpretation.xls

for two independent variables
http://wweb.uta.edu/faculty/eakin/busa5325/BetaInter1stOrderModel.xls
3. Estimating the Simple Linear Function
3.1 Least squares – Minimizing the sum of the squared differences between the line and the actual
values of Y creates an estimate close to all values of Y. The estimated line is symbolized
as
\[ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k \]
This is read as “the predicted value of Y is a first-order linear function of the Xi’s”. The
predicted value will be both (1) our estimate of the population mean and (2) our
prediction of the value of Y for specified values of the Xi.
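For the case of one independent variable, the least squares estimates have a simple closed form. The sketch below uses the standard textbook formulas (they are not written out in these notes) and a small set of made-up trip data; the function name `fit_line` and the data values are assumptions for illustration only.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for one independent variable."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx              # estimated slope
    b0 = y_bar - b1 * x_bar     # estimated intercept
    return b0, b1


# Hypothetical trips: miles driven and actual cost (not data from the notes).
miles = [50, 100, 150, 200, 250]
cost = [58.0, 74.0, 95.0, 108.0, 129.0]

b0, b1 = fit_line(miles, cost)
y_hat = [b0 + b1 * m for m in miles]               # predicted values
residuals = [c - p for c, p in zip(cost, y_hat)]   # actual minus predicted

print(f"y-hat = {b0:.2f} + {b1:.3f} * miles")
print("residuals:", [round(e, 2) for e in residuals])
```

The residuals of a least squares line with an intercept always sum to zero, which is a quick check that the fit was computed correctly.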
3.2 Interpretation of estimates

Estimated intercept, b0: The estimated average value of the dependent variable
when the independent variable takes on the value zero

Estimated slope, bi: Holding all other variables constant, when the i-th independent
variable increases by one, the average value of the dependent variable is estimated
to increase/decrease by bi units.
3.3 Example: Sales in dollars, size of store in thousands of square feet, and advertising
budget in dollars: Predicted sales, \(\hat{y} = 964{,}000 + 1{,}670{,}000(\text{size}) + 1.45(\text{ad budget})\)
Solution: Same as before, determine the dependent and independent variables along with
their units. Sales in dollars depends on size in thousands and ad budget in dollars.
Next, replace underlined terms with the appropriate names and insert the values of the
estimated slope and intercept. Rephrase if necessary.

Estimated intercept, b0: The estimated average value of the sales is $964,000 when
the size of the store and the advertising budget are zero.

Estimated slope, b1: Holding the advertising budget constant, when the size of the
store increases by one thousand square feet, the average sales is estimated to
increase by $1,670,000.

Estimated slope, b2: Holding the store size constant, when the advertising budget
increases by one dollar, the average sales is estimated to increase by $1.45.

Click on the following link to load an Excel worksheet that will allow you to create new
examples for the case of one independent variable.
http://wweb.uta.edu/faculty/eakin/busa3321/EstimateInterpretation.xls
for two independent variables: http://wweb.uta.edu/faculty/eakin/busa5325/EstbiMLR.xls
3.4 Correlation Coefficient, r: a Special Case of the Slope in Simple Linear Regression
The values of X and Y are standardized; i.e. subtract the sample mean from each value
and then divide by the sample standard deviation of the values. This results in the two
new variables:
\[ Z_x = (X - \bar{X}) / S_x \quad \text{and} \quad Z_y = (Y - \bar{Y}) / S_y \]
When you estimate the least squares line
\[ \hat{z}_y = b_0 + b_1 z_x, \]
then b0 = 0 and b1 = r.
The correlation coefficient is just the sample slope of standardized values and is used to
measure the strength of the linear association between two variables. It means that for
each increase of one sample standard deviation in X, there is an increase/decrease of r
sample standard deviations in Y.
Notes: (1) Correlation can fall between -1 (a perfect negative linear association) and +1 (a
perfect positive linear association). See textbook for example pictures corresponding
with different values of r.
(2) A low r just means there is a weak straight-line association. The association could be
strong but not follow a straight line.
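The claim that r is the slope of the standardized values can be checked numerically. The sketch below uses made-up GPA and salary pairs; the helper names `standardize` and `slope_intercept` are assumptions for illustration, and r is computed from the usual textbook definition (sum of the products of the z-scores divided by n-1).

```python
from statistics import mean, stdev


def slope_intercept(x, y):
    """Least-squares (b0, b1) for one independent variable."""
    x_bar, y_bar = mean(x), mean(y)
    b1 = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
          / sum((a - x_bar) ** 2 for a in x))
    return y_bar - b1 * x_bar, b1


def standardize(values):
    """Subtract the sample mean, divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)   # stdev uses the n-1 divisor
    return [(v - m) / s for v in values]


# Hypothetical GPA and starting salary (in $1000s) pairs.
gpa = [2.4, 2.9, 3.1, 3.4, 3.8]
salary = [41.0, 45.0, 52.0, 50.0, 60.0]

n = len(gpa)
zx, zy = standardize(gpa), standardize(salary)
r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)

b0_z, b1_z = slope_intercept(zx, zy)      # line fitted to standardized values
_, b1_raw = slope_intercept(gpa, salary)  # line fitted to original values

print(f"r = {r:.4f}, standardized slope = {b1_z:.4f}, intercept = {b0_z:.4f}")
print(f"raw slope = {b1_raw:.4f}, r * Sy / Sx = {r * stdev(salary) / stdev(gpa):.4f}")
```

The second printed line shows the related fact that the slope in original units equals r rescaled by the ratio of the two sample standard deviations.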
4. Measures of Variation

Variation and estimate: Variability of data away from its average

Standard Error of Estimate, Sy|x1,x2,…,xk or Se: An estimate of the typical error when predicting
y that occurs when using this model estimate on the sample data. This is the regression
version of what we called the sample standard deviation.

Coefficient of Determination, R2: Percent of sample variability of Y that can be
associated with variation in the independent variables

Example: Sales, size, and ad budget of store, Se = 966,000, R2= 0.904, Interpret the
meaning of the standard error of the estimate and the coefficient of determination.

Solution: As always replace Y and X with the names from the example
Standard Error of Estimate, Sy|x: If the estimated line is used, there is a typical error of
$966,000 when predicting sales in this sample.
Coefficient of Determination, R2: 90.4% of the sample variability of sales can be
associated with variation in the size of the store and advertising budget.
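Both measures come from the residuals. The notes do not write out the formulas, so the sketch below assumes the standard textbook ones: Se is the square root of SSE/(n-k-1) and R2 is 1 - SSE/SST. The observed and fitted values are hypothetical, and the function name is made up for illustration.

```python
from math import sqrt


def variation_measures(y, y_hat, k):
    """Return (Se, R2) given observed y, fitted y_hat, and k independent variables."""
    n = len(y)
    y_bar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # variation left unexplained
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation in Y
    se = sqrt(sse / (n - k - 1))    # typical prediction error in the sample
    r2 = 1 - sse / sst              # share of variation associated with the X's
    return se, r2


# Hypothetical observed values and least-squares fitted values (k = 1).
y = [58.0, 74.0, 95.0, 108.0, 129.0]
y_hat = [57.6, 75.2, 92.8, 110.4, 128.0]

se, r2 = variation_measures(y, y_hat, k=1)
print(f"Se = {se:.3f}, R2 = {r2:.4f}")
```

Note that the R2 formula used here assumes the fitted values come from a least squares line that includes an intercept.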

Click on the following link to load an Excel worksheet that will allow you to create new
examples: When asked for the independent variable list all the variables,
separated by commas
http://wweb.uta.edu/faculty/eakin/busa3321/VariationExplanation.xls
Uses of R2:
Baseball Example: http://www.insiderbaseball.com/Angelo-v2.htm
Business Example: http://www.sdcounty.ca.gov/rtf/docs/CNASmallBusiness.pdf
(search for “simple linear regression”)
Information Systems example :
http://databases.about.com/od/datamining/a/datamining.htm
5. Assumptions

Linearity: the average value of the dependent variable has a first order relationship with
the independent variables

Independence: the observations are randomly and independently selected

Normality: the values of the dependent variable are normally distributed for observations
with the same value of the independent variables

Equal Variation: the variation in the values of the dependent variable is the same (equal)
for observations with the same value of the independent variables regardless of the value
of the independent variables

Example : In the previous examples, the dependent variable was sales and the
independent variables were size and ad budget. All variables were measured on stores.
Interpret the assumptions in this context.
Solution: Reword the above sentences
Linearity: the average value of the sales has a first-order relationship with the size of
the store and advertising budget
Independence: the stores are randomly and independently selected
Normality: the values of the sales are normally distributed for stores of the same size
and advertising budget
Equal Variation: the variation in the values of the sales is the same (equal) for stores of
the same size and advertising budget regardless of the value of the size and advertising
budget

Click on the following link to load an Excel worksheet that will allow you to create new
examples. When asked for the independent variable, type in the list of independent
variables separated by commas
http://wweb.uta.edu/faculty/eakin/busa3321/RegAssumptionsInterpretation.xls
6. Assumption Checks

Errors versus residuals: Errors are the differences between the Y’s and their true average
values, while the residuals are the differences between the Y’s and their predicted
values.

Residual plots – these should show a random scattering of points. Any pattern indicates a
possible violation of an assumption.

Normal probability plots – these plots should show a straight line. Any pattern other
than a straight line indicates a possible problem with normality.
[Figure: Residual Plot - All Assumptions Met (Residual versus X)]

Plots showing assumption violations:
[Figure: Residual Plot - Not a First Order Model (Residual versus X)]
[Figure: Residual Plot - Unequal Variance (Residual versus X)]
7. Slope
7.1 Test for usefulness (slope)

Hypothesis:
Ho: 1=0
H1: one sided or two sided alternatives

Rejection region: One sided or two sided t distributions with n-k-1 degrees of freedom

Test Statistic : difference between estimated slope and hypothesized slope in number of
estimated standard errors
\[ t = \frac{b_1 - 0}{s_{b_1}} \]
Conclusion: Holding all other variables constant, we can (not) say that changes in the
values of the independent variable are associated with changes (increases or decreases for
one-sided tests) in the average value of the dependent variable.

Example: Holding advertising budget constant, determine if increases in the size of the
store are associated with an increase in the average sales. You are given that b1 =
1,670,000, Sb1 = 157,000, and n = 14.
o Hypothesis:
H0: \(\beta_1 = 0\)
H1: \(\beta_1 > 0\)
o Rejection region: The degrees of freedom are n-k-1 = 14-2-1 = 11. Since this is a right-sided
t-test, find the t-table value of 1.7959 in row 11, column 0.05.
Therefore reject H0 if t > 1.7959.
o Test Statistic (slope and standard error expressed in millions):
\[ t = \frac{1.67 - 0}{0.157} = 10.64 \]
Since 10.64 > 1.7959, reject H0.
o Conclusion: We can say that increases in the size of the store are associated with
increases in the average sales, holding advertising budget constant.
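The arithmetic of this test can be reproduced in a few lines. This is a sketch only: the variable names are made up, and the critical value is simply the t-table value quoted in the example rather than something computed by the code.

```python
b1 = 1_670_000          # estimated slope for store size (given)
sb1 = 157_000           # estimated standard error of b1 (given)
n, k = 14, 2            # 14 stores; size and ad budget are the two X's
hypothesized_slope = 0

df = n - k - 1
t_stat = (b1 - hypothesized_slope) / sb1
t_critical = 1.7959     # t-table value for df = 11, right tail area 0.05

reject_h0 = t_stat > t_critical   # right-sided test
print(f"df = {df}, t = {t_stat:.2f}, reject H0: {reject_h0}")
```

For a left-sided test the comparison would be against the negative of the critical value, and for a two-sided test the absolute value of t would be compared against the 0.025 column.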

Click on the following link to load an Excel worksheet that will allow you to create new
examples

http://wweb.uta.edu/faculty/eakin/busa3321/TestOfBeta1.xls
7.2 Confidence interval for slope

Formula: estimated slope plus and minus the margin of error
\[ b_i \pm t_{n-k-1} S_{b_i} \]

Conclusion: Holding all other variables constant, we can say with ___ confidence that an
increase of one unit in the independent variable is associated with (increases/decreases)
of _____ in the average value of the dependent variable, with a margin of error of ±_________.
Or
Holding all other variables constant, we can say with ___ confidence that an increase of one unit
in the independent variable is associated with (increases/decreases) of between _____ and
_______ in the average value of the dependent variable.


Example: Holding advertising budget constant, determine the amount of change in the
average value of the sales when the size of the store increases by one
thousand square feet. You are given that \(\hat{y} = 964{,}000 + 1{,}670{,}000(\text{size}) + 1.45(\text{ad budget})\),
Sb1 = 157,000, and n = 14.
Solution: Substitute names and numbers into the above template
o Formula:
\[ b_1 \pm t_{n-k-1} S_{b_1} \]
1,670,000 ± 2.2010 * (157,000)
1,670,000 ± 345,557
o Conclusion: Holding advertising budget constant, we can say with 95%
confidence that an increase of one thousand square feet in the size of the store is
associated with an increase of $1,670,000 in average sales with a margin of error
of ±$345,557.
Or
o Holding advertising budget constant, we can say with 95% confidence that an
increase of one thousand square feet in the size of the store is associated with an
increase of between $1,324,443 and $2,015,557 in average sales.
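The interval above is just the estimate plus and minus t times the standard error. A minimal sketch, with a hypothetical helper name and the t-table value 2.2010 (df = 11, 95% confidence) taken from the example:

```python
def interval(estimate, t_value, std_error):
    """Return (lower, upper, margin) for estimate +/- t_value * std_error."""
    margin = t_value * std_error
    return estimate - margin, estimate + margin, margin


b1, sb1 = 1_670_000, 157_000   # given in the example
t_value = 2.2010               # t-table value, df = 14 - 2 - 1 = 11, 95% two-sided

lower, upper, margin = interval(b1, t_value, sb1)
print(f"margin of error: {margin:,.0f}")
print(f"95% interval: {lower:,.0f} to {upper:,.0f}")
```

The same helper works for any estimate in these notes once its standard error and the matching t-table value are known.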

Click on the following link to load an Excel worksheet that will allow you to create new
examples

http://wweb.uta.edu/faculty/eakin/busa3321/C.I.slope.xls

8. Testing for the Effect of the Model:
8.1 Test of the model: \( E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k \)
Note that any one of the X’s above could be a function of one or more other independent
variables; e.g., the square of another variable or the product of two variables. Therefore
find the value of k by looking at the number on the last \(\beta\) in the model. So if our model
was
\[ E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 X_1^2 X_2 + \beta_4 X_2^2 + \beta_5 X_2 \]
then k = 5.
8.1.1 Test form for

Hypothesis:
Ho: 1=2=…=k=0 (no variable has an effect)
H1: at least one has an effect

Rejection region: One sided F with k and n-k-1 degrees of freedom

Test Statistic : variation of coefficient estimates/ variability of randomness

Conclusion: We can (not) say that changes in the values of at least one independent
variable are associated with changes in the average value of the dependent variable.
8.1.2 Example: Predict exam grade, Y, from hours studied, X1, and
GPA, X2. You are given that n = 30, MSR = 200 and MSE = 10 using model:
\[ E(\text{ExamGrade}) = \beta_0 + \beta_1(\text{HoursStudied}) + \beta_2(\text{GPA}) \]

Hypothesis:
Ho: \(\beta_1 = \beta_2 = 0\) (neither hours studied nor GPA has an effect on the
average exam grade)
H1: at least one has an effect

Rejection region: Reject Ho if F > \(F_{2,27}\) = 3.354

Test Statistic : variation of coefficient estimates/ variability of randomness

F = MSR/MSE = 200/10 = 20



Conclusion: We can say that GPA and/or hours studied is useful in
predicting the exam grade.
Or
We can say that changes in GPA and/or hours studied are associated with changes in
the average exam grade.
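The steps of this F test can be laid out in code. This is a sketch with made-up variable names; the critical value 3.354 is the F-table value quoted in the example and is not computed here.

```python
n, k = 30, 2            # 30 observations; hours studied and GPA
msr, mse = 200, 10      # mean squares given in the example

df_numerator = k
df_denominator = n - k - 1
f_stat = msr / mse
f_critical = 3.354      # F-table value for (2, 27) degrees of freedom

reject_h0 = f_stat > f_critical   # the F test is always right-sided
print(f"F({df_numerator}, {df_denominator}) = {f_stat:.1f}, reject Ho: {reject_h0}")
```

Because F = 20 is well above 3.354, Ho is rejected, which is what supports the conclusion stated in the example.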
To create other examples, click on the link below and change numbers and names
http://wweb.uta.edu/faculty/eakin/busa5325/MLRTestOfModel.xls
9. Estimating the average and predicting an individual
9.1 Estimating the average value of all Y values for observations with the same value of X1,
the same value of X2, …. , and the same value of Xk

Formula: estimated mean plus and minus the margin of error
\[ \hat{y} \pm t_{n-k-1}(SE_{mean}) \]

Conclusion: We can say with ___ confidence that the average value of the dependent variable is
_____ with a margin of error of _________ for all observations with values of the
independent variables of ____
Or

We can say with ___ confidence that the average value of the dependent variable is between ___
and ____ for all observations with values of the independent variables of ____

Example: Find the average sales for all stores that have 4,000 square feet and a $100,000
advertising budget. You are given that the estimated average sales
= 964,000 + 1,670,000(size) + 1.45(ad budget) and that SEmean, the standard error of
the mean estimate, is 309,000.
Solution: Substitute the values of the X’s into the estimated equation to obtain the estimated
average sales:
Estimate of average sales = 964,000 + 1,670,000(4) + 1.45(100,000) = 7,789,000
Next substitute values into the confidence interval
\[ \hat{y} \pm t_{n-k-1}(SE_{mean}) \]
7,789,000 ± 2.2010 (309,000)
7,789,000 ± 680,109

Conclusion: For all stores that have 4,000 square feet and an advertising budget of
$100,000, we can say with 95% confidence that the average sales is $7,789,000 with a
margin of error of ±$680,109
Or

Conclusion: For all stores that have 4,000 square feet and an advertising budget of
$100,000, we can say with 95% confidence that the average sales is between $7,108,891
and $8,469,109

Click on the following link to load an Excel worksheet that will allow you to create new
examples

http://wweb.uta.edu/faculty/eakin/busa5325/CIMuYMLR.xls (Special case of two
independent variables with the estimated standard error of estimating the average value of
Y already calculated)

9.2 Predicting the value of Y for an observation with a given value of X

Formula: predicted value plus and minus the margin of error
\[ \hat{y} \pm t_{n-k-1}(SE_{individual}) \]

Conclusion: We can say with ___ confidence that the value of the dependent variable is _____
with a margin of error of _________ for an observation with values of the independent
variables of ____
or

We can say with ___ confidence that the value of the dependent variable is between _____ and
______ for an observation with values of the independent variables of ____
Example: Find sales for a store that has 4,000 square feet and an advertising budget of
$100,000. You are given that the predicted sales = 964,000 + 1,670,000(size) + 1.45(ad
budget) and its estimated standard error is 1,104,000.
Solution: Substitute the values of the X’s into the estimated equation to obtain the value of the
predicted sales:
predicted sales = 964,000 + 1,670,000(4) + 1.45(100,000) = 7,789,000
Next substitute values into the prediction interval
\[ \hat{y} \pm t_{n-k-1}(SE_{individual}) \]
7,789,000 ± 2.2010 * (1,104,000)
7,789,000 ± 2,429,904

Conclusion: For a store that has 4,000 square feet and an advertising budget of $100,000,
we can say with 95% confidence that the sales will be $7,789,000 with a margin of error
of ±$2,429,904
Or

Conclusion: For a store that has 4,000 square feet and an advertising budget of $100,000,
we can say with 95% confidence that the sales will be between $5,359,096 and
$10,218,904
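The two intervals in this section use the same predicted value and differ only in the standard error. The sketch below reproduces both side by side; the function and variable names are made up, and it assumes the df = 11 t-table value 2.2010, which reproduces the margins of error stated in both examples.

```python
def predicted_sales(size_thousand_sqft, ad_budget):
    """Estimated equation from the notes (size in thousands of square feet)."""
    return 964_000 + 1_670_000 * size_thousand_sqft + 1.45 * ad_budget


T_VALUE = 2.2010              # t-table value, df = 11, 95% two-sided
SE_MEAN = 309_000             # standard error for estimating the average (given)
SE_INDIVIDUAL = 1_104_000     # standard error for predicting one store (given)

y_hat = predicted_sales(4, 100_000)

mean_margin = T_VALUE * SE_MEAN
indiv_margin = T_VALUE * SE_INDIVIDUAL
mean_interval = (y_hat - mean_margin, y_hat + mean_margin)
indiv_interval = (y_hat - indiv_margin, y_hat + indiv_margin)

print(f"y-hat = {y_hat:,.0f}")
print(f"average of all such stores: {mean_interval[0]:,.0f} to {mean_interval[1]:,.0f}")
print(f"one such store:             {indiv_interval[0]:,.0f} to {indiv_interval[1]:,.0f}")
```

The interval for an individual store is much wider because it must cover the store-to-store variation in addition to the uncertainty in the estimated average.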
Click on the following link to load an Excel worksheet that will allow you to create new
examples

http://wweb.uta.edu/faculty/eakin/busa5325/CIYMLR.xls (Special case of two
independent variables with the estimated standard error of predicting an individual value
of Y already calculated)
10. Why study regression?
The following link is a blog which has some references to studies where simple linear regression
outperforms experts. I have not followed up on the references though.
http://lesswrong.com/lw/3gv/statistical_prediction_rules_outperform_expert/
11. Other forms of regression
11.1 Names
 Simple Linear Model (SLR): a first order model with only one independent variable
 First Order Model : more distinct variables are added to the SLR model
 Second Order Model: squared terms are added to model
 Interaction Model: product terms are added
 Logarithmic Models
11.2 Effects of each variable

Simple Linear Model: The change in the average value of the dependent variable
when an independent variable increases by one. Note: A separate F test is not needed since only
one slope needs to be tested (the F test gives the same result as the two-sided t test of that slope).

First Order Model: This is the change in the average value of the dependent variable
when an independent variable increases by one, holding constant all the others.

Second Order Model: The change in the average value of the dependent variable
when an independent variable increases by one depends on the current value of the
independent variable, holding constant all the others.

Interaction Model: The change in the average value of the dependent variable when
an independent variable increases by one depends on the value of another variable,
while holding constant all the others.

Logarithmic Model: Holding all other variables constant, for each one unit increase
in an independent variable, there is an average percentage increase in the
dependent variable.
11.3 Examples
All examples predict exam grade, Y, from hours studied, X1, and GPA, X2.

Simple Linear Model (using only X1): This is the change in the average exam grade
when you study an additional hour.
\[ E(\text{ExamGrade}) = \beta_0 + \beta_1(\text{HoursStudied}) \]

First Order Model: This is the change in the average exam grade when you study an
additional hour, for those people with the same GPA.
\[ E(\text{ExamGrade}) = \beta_0 + \beta_1(\text{HoursStudied}) + \beta_2(\text{GPA}) \]
Or
\[ E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 \]
Second Order Model:
\[ E(\text{ExamGrade}) = \beta_0 + \beta_1(\text{HoursStudied}) + \beta_2(\text{HoursStudied})^2 + \beta_3(\text{GPA}) \]
The change in the average exam grade when you study an additional hour depends
on how long you have studied, controlling for the effect of GPA.
To find the change, substitute the value of “HoursStudied+1” into the above
equation and examine the difference:
\[ E(\text{ExamGrade}) = \beta_0 + \beta_1(\text{HoursStudied}+1) + \beta_2(\text{HoursStudied}+1)^2 + \beta_3(\text{GPA}) \]
minus
\[ E(\text{ExamGrade}) = \beta_0 + \beta_1(\text{HoursStudied}) + \beta_2(\text{HoursStudied})^2 + \beta_3(\text{GPA}) \]
which equals the following as long as GPA is the same value in both equations:
\[ \beta_1 + \beta_2(1 + 2 \cdot \text{HoursStudied}) \]

Interaction Model: The change in the average exam grade for each additional
hour studied depends on the value of the GPA.
\[ E(\text{ExamGrade}) = \beta_0 + \beta_1(\text{HoursStudied}) + \beta_2(\text{GPA}) + \beta_3(\text{HoursStudied} \cdot \text{GPA}) \]
or
\[ E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 \]
\[ E(Y) = \beta_0 + (\beta_1 + \beta_3 X_2) X_1 + \beta_2 X_2 \]
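In the interaction model the effect of one more hour is the term multiplying X1 in the last equation, which changes with GPA. A small sketch with made-up coefficients (the notes give no estimates for this model; the function names are hypothetical):

```python
def expected_grade(hours, gpa, b0=30.0, b1=2.0, b2=8.0, b3=1.5):
    """Interaction model; the default coefficients are hypothetical."""
    return b0 + b1 * hours + b2 * gpa + b3 * hours * gpa


def slope_of_hours(gpa, b1=2.0, b3=1.5):
    """Effect of one more hour studied at a given GPA: b1 + b3 * GPA."""
    return b1 + b3 * gpa


for gpa in (2.0, 3.0, 4.0):
    change = expected_grade(6, gpa) - expected_grade(5, gpa)
    print(f"GPA {gpa}: change per extra hour = {change:.2f}, "
          f"b1 + b3*GPA = {slope_of_hours(gpa):.2f}")
```

Unlike the second order model, the change here does not depend on how many hours have already been studied, only on the GPA.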

Logarithmic Model: For each additional hour studied, there is an average
percentage increase in the exam grade, holding GPA constant.
\[ E[\log(\text{ExamGrade})] = \beta_0 + \beta_1(\text{HoursStudied}) + \beta_2(\text{GPA}) + \beta_3(\text{HoursStudied} \cdot \text{GPA}) \]
\[ E(\log(Y)) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 \]
Other examples:
Baseball: http://baseballanalysts.com/archives/2007/05/the_value_of_th.php
Accounting http://www.nysscpa.org/cpajournal/2005/105/essentials/p56.htm
Business: http://www.cdc.gov/MMWR/preview/mmwrhtml/00037061.htm
U.S. Weather Forecasts: http://www.weather.gov/ost/NWS_TIP.pdf
11.4 Create your own examples of interpretations by changing the words and numbers in
First Order Model:
http://wweb.uta.edu/faculty/eakin/busa5325/BetaInter1stOrderModel.xls
Interaction Model:
http://wweb.uta.edu/faculty/eakin/busa5325/BetaInterInteractionModel.xls
Quadratic Model
http://wweb.uta.edu/faculty/eakin/busa5325/BetaInterQuadraticModel.xls