Simple Linear Regression Notes


Relationships

Estimating the Simple Linear Function

Measures of Variation

Assumptions

Assumption Checks

Slope

Estimate Averages

Predict Individual Values

NCSS

What you should be able to do when you finish the notes
o Discuss differences in the types of relationships
o Define and put in English the slope and intercept of a population
o Discuss how the estimation procedure works
o Define and put in English the estimates of the slope and intercept
o Define and put in English the standard error of the estimate and the coefficient of
determination
o Discuss the required assumptions
o Know how to check the validity of the assumptions
o Test hypotheses and construct confidence intervals for the slope
o Construct confidence intervals for an average value of Y and prediction intervals
for a value of Y (for specific values of X)
o Get NCSS to calculate the required estimates and tests
Purpose: To illustrate the application of the Building Blocks to other estimates.
1. Relationships
Example 1: Suppose you believe that the reimbursed cost of a trip is a function of the
mileage.
You receive a fixed amount of $40 and a variable amount of $0.35 per mile. In English:
Fixed cost: When the mileage is zero, the reimbursed cost is $40.
Variable cost: For each mile traveled, the reimbursed cost increases by $0.35.
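The reimbursement rule in Example 1 can be sketched as a small Python function. The function name and argument names are made up for illustration; only the $40 and $0.35 come from the example.

```python
def reimbursed_cost(miles, fixed=40.0, per_mile=0.35):
    """Return the reimbursed cost in dollars for a trip of `miles` miles."""
    return fixed + per_mile * miles

# Fixed cost (intercept): with zero miles only the $40 is paid.
print(reimbursed_cost(0))

# Variable cost (slope): one extra mile adds $0.35 to the reimbursement.
extra = reimbursed_cost(101) - reimbursed_cost(100)
print(round(extra, 2))
```

The first value printed is the intercept and the second is the slope, matching the two English sentences above.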
Example 2: A $10,000 computer is being depreciated over 5 years (with no resale
value at the end of 5 years). In English, what is the meaning of the fixed and variable amounts?
Fixed: When ____________ is zero, the _____________ is _______
Variable: For each _______, the ________ decreases by ___________
 Forms of relationships:
 Mathematical versus tendencies
Do you believe that the actual cost of a trip is exactly equal to the reimbursed cost?
We will assume that the relationship is just a tendency (or on average).

Types of relationships – Of all possible depreciation methods, what is the simplest?
The mean value of one variable, Y, depends on the values of another, X. For example, the
average starting salary depends on a student’s GPA. There are many relationship
functions but the simplest is a straight line:
$\mu_{y|x} = \beta_0 + \beta_1 X$
This is read as “the mean value of Y given X is a straight line function of X”.

Incomplete information
Do you believe that it is possible to obtain information about all possible trips?
If not, then the estimates are in error (Second Building Block) and we need to compute
their standard errors and margins of error (Third and Fourth Building Blocks). The
standard error consists of measures of ___________ and _____________. This
calculation is too complicated and the value of the standard error will be provided; you
will not have to calculate it. However you are responsible for the calculation of the
margin of error which is a product of _____________________ and
_________________.
Example 3: Suppose you feel that the sales price of a used car depends on the odometer reading.
Based on a sample of 100 cars, you find the relationship predicted price = $17,250 – $0.06 × (miles).
There are four things of interest. What is the value, interpretation, and margin of error of each?

| Parameter | Statistic (given) | Standard Error (given) | Margin of Error (use a t with n-2 df) |
|---|---|---|---|
| Fixed (intercept) | $17,250 | $182 | |
| Variable (slope) | -$0.06 | $0.005 | |
| The average price of all cars with 40,000 miles on the odometer | Estimated average = $17,250 - $0.06(40,000) = $14,850 | $38.31 | |
| The price of a car with 40,000 miles on the odometer | Predicted price = $17,250 - $0.06(40,000) = $14,850 | $328.63 | |
Put each of the four in English with their margin of error
Intercept
Slope
Estimate of average
Prediction of an individual
2. Meaning of slope and intercept

Intercept, β0: The average value of the dependent variable when the independent
variable takes on the value zero.

Slope, β1: The change in the average value of the dependent variable when the
independent variable increases by one unit.


Example: Interpret the meaning of the intercept and slope when relating Starting Salary
and GPA
Solution: Determine which variable depends on the other. In this case the starting salaries of
students should depend on their GPA.
Next determine the units of the independent variable. Here GPA is measured in points.
Finally, replace the names and units in the definitions above.

Intercept, β0: The average value of the starting salary when the GPA is zero. (Many
times the intercept does not have a realistic meaning.)

Slope, β1: The change in the average starting salary when the GPA increases by one
point.

Click on the following link to load an Excel worksheet that will allow you to create new
examples
http://wweb.uta.edu/faculty/eakin/busa3321/Beta0-Beta1Interpretation.xls
3. Estimating Simple Linear Function

Least squares – Minimizing the sum of the squared differences between the line and the actual
values of Y creates an estimated line that is close to all values of Y. The estimated line is
symbolized as
$\hat{y} = b_0 + b_1 x$
This is read as “the predicted value of Y is a linear function of X”.
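The least squares idea can be sketched in Python. The data below are hypothetical (invented only for illustration), the function names are made up, and the formulas used are the standard closed-form least squares solution, which these notes do not spell out.

```python
def least_squares(x, y):
    """Return (b0, b1) that minimize the sum of squared differences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx              # estimated slope
    b0 = y_bar - b1 * x_bar     # estimated intercept
    return b0, b1

def sum_squared_errors(x, y, b0, b1):
    """Sum of squared differences between the actual Y values and the line."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.9, 5.2, 6.8, 9.1, 11.0]

b0, b1 = least_squares(x, y)
print(f"estimated line: y-hat = {b0:.2f} + {b1:.2f} x")

# Any other line, for example y = 1 + 2x, has a larger sum of squared errors.
print(sum_squared_errors(x, y, b0, b1) < sum_squared_errors(x, y, 1.0, 2.0))
```

The final comparison is the point of the method: no other straight line gives a smaller sum of squared differences for the same data.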

Interpretation of estimates

Estimated intercept, b0: The estimated average value of the dependent variable
when the independent variable takes on the value zero

Estimated slope, b1: The estimated change in the average value of the dependent
variable when the independent variable increases by one unit.

Example: Sales in millions of dollars and Size of store in thousands of square feet
Estimated slope is 1.670, Estimated intercept is 0.964
Solution: Same as before, determine the dependent and independent variables along with
the units of Y and X. Sales in millions of dollars depends on size in thousands of square feet.
Next, replace underlined terms with the appropriate names and insert the values of the
estimated slope and intercept. Rephrase if necessary.

Estimated intercept, b0: The estimated average value of the sales is $964,000 when
the size of the store is zero.

Estimated slope, b1: When the size of the store increases by one thousand square
feet, the average sales is estimated to increase by $1,670,000.


Click on the following link to load an Excel worksheet that will allow you to create new
examples
http://wweb.uta.edu/faculty/eakin/busa3321/EstimateInterpretation.xls
4. Measures of Variation

Variation and estimate: Variability of data away from its average

Standard Error of Estimate, Sy|x: an estimate of the typical error made when the estimated
line is used to predict Y for the sample data

Coefficient of Determination, R²: Percent of the sample variability of Y that can be
associated with variation in X

Example: Sales and size of store, Sy|x = 0.966, R² = 0.904. Interpret the meaning of the
standard error of the estimate and the coefficient of determination.

Solution: As always, replace Y and X with the names from the example.
Standard Error of Estimate, Sy|x: If the estimated line is used, there is a typical error of
$966,000 when predicting sales in this sample.
Coefficient of Determination, R²: 90.4% of the sample variability of sales can be
associated with variation in the size of the store.
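A minimal sketch of how these two measures are usually computed follows. It assumes the standard definitions Sy|x = sqrt(SSE / (n - 2)) and R² = 1 - SSE / SST, which these notes do not state. The data are hypothetical, and the function name is made up for illustration.

```python
import math

def variation_measures(x, y, b0, b1):
    """Return (s_yx, r_squared) for the estimated line y-hat = b0 + b1 * x."""
    n = len(y)
    y_bar = sum(y) / n
    # SSE: squared differences between the actual Y values and the line.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    # SST: squared differences between the actual Y values and their average.
    sst = sum((yi - y_bar) ** 2 for yi in y)
    s_yx = math.sqrt(sse / (n - 2))   # standard error of the estimate
    r_squared = 1 - sse / sst         # coefficient of determination
    return s_yx, r_squared

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.9, 5.2, 6.8, 9.1, 11.0]
# Least squares estimates for these hypothetical data.
b0, b1 = 0.97, 2.01

s_yx, r_squared = variation_measures(x, y, b0, b1)
print(f"S_y|x = {s_yx:.3f}, R^2 = {r_squared:.3f}")
```

Sy|x is in the units of Y, so it reads as a typical prediction error, while R² is a proportion that is reported as a percent.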

Click on the following link to load an Excel worksheet that will allow you to create new
examples:
http://wweb.uta.edu/faculty/eakin/busa3321/VariationExplanation.xls
Uses of R²:
Baseball Example: http://www.insiderbaseball.com/Angelo-v2.htm
Business Example: http://www.sdcounty.ca.gov/rtf/docs/CNASmallBusiness.pdf
(search for “simple linear regression”)
Information Systems example :
http://databases.about.com/od/datamining/a/datamining.htm
5. Assumptions

Linearity: the average value of the dependent variable has a straight line relationship with
the independent variable

Independence: the observations are randomly and independently selected

Normality: the values of the dependent variable are normally distributed for any value of
the independent variable

Equal Variation: the variation in the values of the dependent variable is the same (equal)
for any value of the independent variable

Example : In the previous examples, the dependent variable was sales and the
independent variable was size. Both variables were measured on stores. Interpret the
assumptions in this context.
Solution: Reword the above sentences
Linearity: the average value of the sales has a straight line relationship with the size of
the store
Independence: the stores are randomly and independently selected
Normality: the values of the sales are normally distributed for stores of the same size
Equal Variation: the variation in the values of the sales is the same regardless of the
size of the store.

Click on the following link to load an Excel worksheet that will allow you to create new
examples
http://wweb.uta.edu/faculty/eakin/busa3321/SLRassumptionsInterpretation.xls
6. Assumption Checks

Errors versus residuals: Errors are the differences between the Y’s and their true average
values, while residuals are the differences between the Y’s and their predicted
values.

Residual plots – these should show a random scattering of points. Any pattern indicates a
possible violation of an assumption.

Normal probability plots – these plots should show a straight line. Any pattern other
than a straight line indicates a possible problem with normality.
Plots showing no assumption violations:
[Figure: Residual Plot - All Assumptions Met (Residual versus X)]

Plots showing assumption violations:
[Figure: Residual Plot - Not a First Order Model (Residual versus X)]
[Figure: Residual Plot - Unequal Variance (Residual versus X)]
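As a rough illustration of what a pattern in the residuals looks like, the sketch below fits a straight line to hypothetical data that actually follow a curve (Y = X squared) and lists the residuals. The data and function names are invented for illustration. The fit uses the standard least squares formulas.

```python
def fit_line(x, y):
    """Least squares estimates (b0, b1) for a straight line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def residuals(x, y, b0, b1):
    """Differences between the actual Y values and their predicted values."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Hypothetical curved data: a straight line is the wrong model here.
x = [1, 2, 3, 4, 5, 6, 7]
y = [xi ** 2 for xi in x]

b0, b1 = fit_line(x, y)
res = residuals(x, y, b0, b1)
print(res)  # [5.0, 0.0, -3.0, -4.0, -3.0, 0.0, 5.0]
```

The residuals are positive at both ends and negative in the middle. Plotted against X they form a U shape instead of a random scatter, which is the kind of pattern that signals a possible linearity violation.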
7. Slope
7.1 Test for usefulness (slope)

Hypothesis:
H0: β1 = 0
H1: one sided or two sided alternatives

Rejection region: One sided or two sided t distributions with n-2 degrees of freedom

Test Statistic: The 2nd Building Block says the estimate is in error (b1 - β1 ≠ 0). The 3rd Building Block says that to
evaluate the error, divide it by the standard error. Here we divide by the sample standard
error and use the t-test version with β1 hypothesized to be zero.

$t = \frac{b_1 - \beta_1}{s_{b_1}} = \frac{b_1 - 0}{s_{b_1}}$

Conclusion: We can (not) say that changes in the values of the independent variable are
associated with changes (increases or decreases for one-sided tests) in the average value
of the dependent variable.

Example: Determine if increases in the size of the store are associated with an increase in
the average sales. You are given that b1 = 1.670, Sb1 = 0.157, and n = 14.
o Hypothesis:
H0: β1 = 0
H1: β1 > 0
o Rejection region: The degrees of freedom are n-2 = 12. Since this is a right-sided t-test, find the t-table value of 1.7823 in row 12, column 0.05.
Therefore reject H0 if t > 1.7823.
o Test Statistic: $t = \frac{1.67 - 0}{0.157} = 10.64$
o Conclusion: Since 10.64 > 1.7823, reject H0. We can say that increases in the size of the store are associated with
increases in the average sales.
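The arithmetic of this test can be checked with a short Python sketch. The numbers are the ones given in the example, and the critical value is the t-table value quoted above instead of one computed in code. The function name is made up for illustration.

```python
def slope_t_statistic(b1, se_b1, hypothesized_slope=0.0):
    """t statistic: (estimate - hypothesized value) / standard error."""
    return (b1 - hypothesized_slope) / se_b1

b1 = 1.670           # estimated slope (given)
se_b1 = 0.157        # standard error of the slope (given)
n = 14               # sample size (given)
df = n - 2           # degrees of freedom
t_critical = 1.7823  # t-table value, row 12, column 0.05 (right-sided test)

t_stat = slope_t_statistic(b1, se_b1)
reject_h0 = t_stat > t_critical
print(df, round(t_stat, 2), reject_h0)  # 12 10.64 True
```

For a left-sided test the comparison would be t < -t_critical, and for a two-sided test the absolute value of t would be compared with the two-sided table value.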

Click on the following link to load an Excel worksheet that will allow you to create new
examples (Excel file is not working correctly on some computers at the moment)

http://wweb.uta.edu/faculty/eakin/busa3321/TestOfBeta1Interpretation.xls
7.2 Confidence interval for slope

Formula: By the 4th Building Block, the population slope is estimated to be the sample
slope plus and minus the margin of error
$b_1 \pm t_{n-2} S_{b_1}$

Conclusion: We can say with ___ confidence that an increase of one unit in the independent
variable is associated with an (increase/decrease) of _____ in the average value of the
dependent variable with a margin of error of ±_________.

Example: Determine the amount of change in the average value of the sales when the size
of the store increases by one thousand square feet. You are given that b1 =
1.670, Sb1 = 0.157, and n = 14.
Solution: Substitute names and numbers into the above template. For 95% confidence with n-2 = 12 degrees of freedom, the t-table value is 2.1788.
o Formula:
$b_1 \pm t_{n-2} S_{b_1}$
$1.67 \pm 2.1788(0.157)$
$1.67 \pm 0.342$
o Conclusion: We can say with 95% confidence that an increase of one thousand
square feet in the size of the store is associated with an increase of $1,670,000 in
average sales with a margin of error of ±$342,000.
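The interval in this example can be reproduced with a short Python sketch. The inputs are the given values and the t-table value for 12 degrees of freedom at 95% confidence. The function name is made up for illustration.

```python
def slope_confidence_interval(b1, se_b1, t_critical):
    """Return (lower, upper, margin) for b1 +/- t * S_b1."""
    margin = t_critical * se_b1
    return b1 - margin, b1 + margin, margin

b1 = 1.670           # estimated slope (given)
se_b1 = 0.157        # standard error of the slope (given)
t_critical = 2.1788  # t-table value, 12 df, 95% confidence

lower, upper, margin = slope_confidence_interval(b1, se_b1, t_critical)
print(f"{b1:.3f} +/- {margin:.3f} -> ({lower:.3f}, {upper:.3f})")
# 1.670 +/- 0.342 -> (1.328, 2.012)
```

The interval lies entirely above zero, which agrees with the conclusion of the test in 7.1.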

Click on the following link to load an Excel worksheet that will allow you to create new
examples (Excel file is not working correctly on some computers at the moment)


http://wweb.uta.edu/faculty/eakin/busa3321/SlopeC.I.Interpretation.xls
8. Estimating the average and predicting an individual
8.1 Estimating the average value of all Y values for observations with the same value of X

Formula: estimated mean plus and minus the margin of error
$\hat{y} \pm t_{n-2}\,(SE_{mean})$

Conclusion: We can say with ___ confidence that the average value of the dependent variable is
_____ with a margin of error of _________ for all observations with a value of the
independent variable of ____

Example: Find the average sales for all stores that have 4,000 square feet. You are given that
the estimated average sales = 0.964 + 1.670 (size) and that SEmean, the standard error
of the mean estimate, is 0.309.
Solution: Substitute the value of x into the estimated equation to obtain the estimated
average sales:
Estimate of average sales = 0.964 + 1.670(4) = 7.644 ($7,644,000)
Next substitute values into the confidence interval
$\hat{y} \pm t_{n-2}\,(SE_{mean})$
$7.644 \pm 2.1788(0.309)$
$7.644 \pm 0.673$

Conclusion: For all stores that have 4,000 square feet, we can say with 95% confidence
that the average sales is $7,644,000 with a margin of error of ± $673,000

Click on the following link to load an Excel worksheet that will allow you to create new
examples (Excel file is not working correctly on some computers at the moment)

http://wweb.uta.edu/faculty/eakin/busa3321/CIMuGivenXIntSEKnown.xls (Estimated
Standard Error of Mean already calculated)

http://wweb.uta.edu/faculty/eakin/busa3321/CIMuGivenXInterpreation.xls (Estimated
Standard Error of the mean needs to be calculated)
8.2 Predicting the value of Y for an observation with a given value of X

Formula: predicted value plus and minus the margin of error
$\hat{y} \pm t_{n-2}\,(SE_{individual})$

Conclusion: We can say with ___ confidence that the value of the dependent variable is _____
with a margin of error of _________ for an observation with a value of the independent
variable of ____

 Example: Find sales for a store that has 4,000 square feet. You are given the predicted sales
= 0.964 + 1.670 (size) and that SEindividual, its estimated standard error, is 1.104.
Solution: Substitute the value of x into the estimated equation to obtain the value of the
predicted sales:
predicted sales = 0.964 + 1.670(4) = 7.644 ($7,644,000)
Next substitute values into the prediction interval
$\hat{y} \pm t_{n-2}\,(SE_{individual})$
$7.644 \pm 2.1788(1.104)$
$7.644 \pm 2.405$

Conclusion: For a store that has 4,000 square feet, we can say with 95% confidence that
the sales will be $7,644,000 with a margin of error of ± $2,405,000
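To see how the interval for an average (8.1) and the interval for an individual (8.2) differ, the sketch below computes both margins from the values given in the two examples. The function name is made up for illustration, and the t value is the t-table value for 12 degrees of freedom at 95% confidence.

```python
def interval(estimate, t_critical, standard_error):
    """Return (lower, upper, margin) for estimate +/- t * SE."""
    margin = t_critical * standard_error
    return estimate - margin, estimate + margin, margin

# Values given in the store examples of these notes.
b0, b1 = 0.964, 1.670    # estimated intercept and slope
size = 4                 # 4,000 square feet (size is in thousands)
t_critical = 2.1788      # t-table value, 12 df, 95% confidence
se_mean = 0.309          # standard error for the average of all such stores
se_individual = 1.104    # standard error for one individual store

# The point estimate is the same for both intervals.
y_hat = b0 + b1 * size

mean_low, mean_high, mean_margin = interval(y_hat, t_critical, se_mean)
ind_low, ind_high, ind_margin = interval(y_hat, t_critical, se_individual)

print(f"point estimate (millions): {y_hat:.3f}")
print(f"average of all such stores: +/- {mean_margin:.3f}")
print(f"one individual store:       +/- {ind_margin:.3f}")
```

Both intervals are centered on the same value. Only the standard error changes, and because predicting one store involves more uncertainty than estimating an average, the prediction interval is wider.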

Click on the following link to load an Excel worksheet that will allow you to create new
examples (Excel file is not working correctly on some computers at the moment)

http://wweb.uta.edu/faculty/eakin/busa3321/CIYGivenXIntGivenSE.xls (Estimated
Standard Error of Predicting Y already calculated)
http://wweb.uta.edu/faculty/eakin/busa3321/CIyGivenXInterpreation.xls (Estimated
Standard Error of Predicting a value of Y needs to be calculated)
9. Why study regression?
The following link is a blog which has some references to studies where simple linear regression
outperforms experts. I have not followed up on the references though.
http://lesswrong.com/lw/3gv/statistical_prediction_rules_outperform_expert/
10. SAS To Be Constructed