Simple Linear Regression Notes

o Relationships
o Estimating the Simple Linear Function
o Measures of Variation
o Assumptions
o Assumption Checks
o Slope
o Estimate Averages
o Predict Individual Values
o NCSS

What you should be able to do when you finish the notes
o Discuss differences in the types of relationships
o Define and put in English the slope and intercept of a population
o Discuss how the estimation procedure works
o Define and put in English the estimates of the slope and intercept
o Define and put in English the standard error of the estimate and the coefficient of determination
o Discuss the required assumptions
o Know how to check the validity of the assumptions
o Test hypotheses and construct confidence intervals for the slope
o Construct confidence intervals for an average value of Y and prediction intervals for a value of Y (for specific values of X)
o Get NCSS to calculate the required estimates and tests

Purpose: To illustrate the application of the Building Blocks to other estimates.

1. Relationships

Example 1: Suppose you believe that the reimbursed cost of a trip is a function of the mileage. You receive a fixed amount of $40 and a variable amount of $0.35 per mile.
In English
Fixed cost: When the mileage is zero, the reimbursed cost is $40.
Variable cost: For each mile traveled, the reimbursed cost increases by $0.35.

Example 2: A $10,000 computer is being depreciated over 5 years (with no resale value at the end of 5 years). In English, what is the meaning of the fixed and variable parts?
Fixed: When ____________ is zero, the _____________ is _______
Variable: For each _______, the ________ decreases by ___________

Forms of relationships: Mathematical versus tendencies
Do you believe that the actual cost of a trip is exactly equal to the reimbursed cost? We will assume that the relationship is just a tendency (that is, it holds on average).

Types of relationships – Of all possible depreciation methods, what is the simplest?
The mean value of one variable, Y, depends on the values of another, X. For example, the average starting salary depends on a student's GPA. There are many relationship functions, but the simplest is a straight line:

\[ \mu_{y|x} = \beta_0 + \beta_1 X \]

This is read as "the mean value of Y given X is a straight-line function of X".

Incomplete information
Do you believe that it is possible to obtain information about all possible trips? If not, then the estimates are in error (Second Building Block) and we need to compute their standard errors and margins of error (Third and Fourth Building Blocks). The standard error consists of measures of ___________ and _____________. This calculation is too complicated, so the value of the standard error will be provided; you will not have to calculate it. However, you are responsible for calculating the margin of error, which is a product of _____________________ and _________________.

Example 3: Suppose you feel that the sales price of a used car depends on the odometer reading. Based on a sample of 100 cars, you find the relationship predicted price = $17,250 - $0.06 (per mile). There are four things of interest. What is the value, interpretation and margin of error of…

Parameter                                      Statistic (given)                                      Standard Error (given)   Margin of Error: use a t (n-2)
Fixed (intercept)                              $17,250                                                $182
Variable (slope)                               -$0.06                                                 0.005
Average price of all cars with 40,000 miles    Estimated average = $17,250 - 0.06(40,000) = $14,850   $38.31
Price of a car with 40,000 miles               Predicted price = $17,250 - 0.06(40,000) = $14,850     $328.63

Put each of the four in English with its margin of error:
o Intercept
o Slope
o Estimate of average
o Prediction of an individual

2. Meaning of slope and intercept

Intercept, \( \beta_0 \): The average value of the dependent variable when the independent variable takes on the value zero.
Slope, \( \beta_1 \): The change in the average value of the dependent variable when the independent variable increases by one unit.
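The two definitions above can be checked numerically. The following is a minimal Python sketch, not part of the NCSS workflow used in these notes; the function name `mean_response` is made up for illustration, and the numbers are the ones given in Example 1 (trip reimbursement) and Example 3 (used cars).

```python
def mean_response(intercept, slope, x):
    """Mean value of Y given X under a straight-line relationship."""
    return intercept + slope * x

# Example 1: reimbursed cost = $40 fixed + $0.35 per mile.
# Intercept: the value of the line when the mileage is zero.
cost_at_zero_miles = mean_response(40.0, 0.35, 0)

# Slope: the change in the line when the mileage increases by one unit (one mile).
change_per_mile = mean_response(40.0, 0.35, 101) - mean_response(40.0, 0.35, 100)

# Example 3: predicted price = $17,250 - $0.06 per mile, evaluated at 40,000 miles.
price_at_40k = mean_response(17250.0, -0.06, 40000)

print(f"Cost at zero miles: ${cost_at_zero_miles:.2f}")
print(f"Change per additional mile: ${change_per_mile:.2f}")
print(f"Estimated price at 40,000 miles: ${price_at_40k:,.2f}")
```

The same function works for any intercept and slope, so it can be reused to check answers to Example 2 once the blanks are filled in.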
Example: Interpret the meaning of the intercept and slope when relating Starting Salary and GPA.
Solution: Determine which variable depends on the other. In this case the starting salaries of students should depend on their GPA. Next, determine the units of the independent variable. Here GPA is measured in points. Finally, replace the names and units in the definitions above.
Intercept, \( \beta_0 \): The average value of the starting salary when the GPA is zero. (Many times the intercept does not have a realistic meaning.)
Slope, \( \beta_1 \): The change in the average starting salary when the GPA increases by one point.

Click on the following link to load an Excel worksheet that will allow you to create new examples:
http://wweb.uta.edu/faculty/eakin/busa3321/Beta0-Beta1Interpretation.xls

3. Estimating Simple Linear Function

Least squares – Minimizing the sum of the squared differences between the line and the actual values of Y creates an estimate close to all values of Y. The estimated line is symbolized as

\[ \hat{y} = b_0 + b_1 x \]

This is read as "the predicted value of Y is a linear function of X".

Interpretation of estimates
Estimated intercept, \( b_0 \): The estimated average value of the dependent variable when the independent variable takes on the value zero.
Estimated slope, \( b_1 \): The estimated change in the average value of the dependent variable when the independent variable increases by one unit.

Example: Sales in millions and Size of store in thousands of square feet. The estimated slope is 1.670 and the estimated intercept is 0.964.
Solution: Same as before, determine the dependent and independent variables along with the units of Y and X. Sales in millions depends on size in thousands. Next, replace the generic terms in the definitions (dependent variable, independent variable, one unit) with the appropriate names and insert the values of the estimated slope and intercept. Rephrase if necessary.
Estimated intercept, \( b_0 \): The estimated average value of sales is $964,000 when the size of the store is zero.
Estimated slope, \( b_1 \): When the size of the store increases by one thousand square feet, average sales are estimated to increase by $1,670,000.

Click on the following link to load an Excel worksheet that will allow you to create new examples:
http://wweb.uta.edu/faculty/eakin/busa3321/EstimateInterpretation.xls

4. Measures of Variation

Variation and estimate: Variability of data away from its average.
Standard Error of Estimate, \( S_{y|x} \): An estimate of the typical error that occurs when this model estimate is used to predict y on the sample data.
Coefficient of Determination, \( R^2 \): Percent of the sample variability of Y that can be associated with variation in X.

Example: Sales and size of store, \( S_{y|x} = 0.966 \), \( R^2 = 0.904 \). Interpret the meaning of the standard error of the estimate and the coefficient of determination.
Solution: As always, replace Y and X with the names from the example.
Standard Error of Estimate, \( S_{y|x} \): If the estimated line is used, there is a typical error of $966,000 when predicting sales in this sample.
Coefficient of Determination, \( R^2 \): 90.4% of the sample variability of sales can be associated with variation in the size of the store.

Click on the following link to load an Excel worksheet that will allow you to create new examples:
http://wweb.uta.edu/faculty/eakin/busa3321/VariationExplanation.xls

Uses of \( R^2 \):
Baseball example: http://www.insiderbaseball.com/Angelo-v2.htm
Business example: http://www.sdcounty.ca.gov/rtf/docs/CNASmallBusiness.pdf (search for "simple linear regression")
Information systems example: http://databases.about.com/od/datamining/a/datamining.htm

5. Assumptions

Linearity: the average value of the dependent variable has a straight-line relationship with the independent variable
Independence: the observations are randomly and independently selected
Normality: the values of the dependent variable are normally distributed for any value of the independent variable
Equal Variation: the variation in the values of the dependent variable is the same (equal) for any value of the independent variable

Example: In the previous examples, the dependent variable was sales and the independent variable was size. Both variables were measured on stores. Interpret the assumptions in this context.
Solution: Reword the above sentences.
Linearity: the average value of sales has a straight-line relationship with the size of the store
Independence: the stores are randomly and independently selected
Normality: the values of sales are normally distributed for stores of the same size
Equal Variation: the variation in the values of sales is the same regardless of the size of the store

Click on the following link to load an Excel worksheet that will allow you to create new examples:
http://wweb.uta.edu/faculty/eakin/busa3321/SLRassumptionsInterpretation.xls

6. Assumption Checks

Errors versus residuals: Errors are the differences between the Y's and their true average values, while residuals are the differences between the Y's and their predicted values.
Residual plots – These should show a random scattering of points; any pattern indicates a possible violation of an assumption.
Normal probability plots – These plots should show a straight line; any pattern other than a straight line indicates a possible problem with normality.

Plots showing no assumption violations:
[Figure: Residual Plot - All Assumptions Met (residual versus X)]

Plots showing assumption violations:
[Figure: Residual Plot - Not a First Order Model (residual versus X)]
[Figure: Residual Plot - Unequal Variance (residual versus X)]

7. Slope

7.1 Test for usefulness (slope)

Hypothesis: H0: \( \beta_1 = 0 \); H1: one-sided or two-sided alternative
Rejection region: One-sided or two-sided, from the t distribution with n-2 degrees of freedom
Test Statistic: The 2nd Building Block says the estimate is in error (\( b_1 - \beta_1 \neq 0 \)). The 3rd Building Block says to evaluate the error by dividing it by the standard error. Here we divide by the sample standard error and use the t-test version with \( \beta_1 \) hypothesized to be zero:

\[ t = \frac{b_1 - \beta_1}{s_{b_1}} = \frac{b_1 - 0}{s_{b_1}} \]

Conclusion: We can (not) say that changes in the values of the independent variable are associated with changes (increases or decreases for one-sided tests) in the average value of the dependent variable.

Example: Determine if increases in the size of the store are associated with an increase in average sales. You are given that \( b_1 = 1.670 \), \( s_{b_1} = 0.157 \), and n = 14.
o Hypothesis: H0: \( \beta_1 = 0 \); H1: \( \beta_1 > 0 \)
o Rejection region: The degrees of freedom are n-2 = 12. Since this is a right-sided t-test, find the t-table value of 1.7823 in row 12, column 0.05.
Therefore, reject H0 if t > 1.7823.
o Test Statistic:

\[ t = \frac{1.67 - 0}{0.157} = 10.64 \]

o Conclusion: Since 10.64 > 1.7823, we reject H0. We can say that increases in the size of the store are associated with increases in average sales.

Click on the following link to load an Excel worksheet that will allow you to create new examples (Excel file is not working correctly on some computers at the moment):
http://wweb.uta.edu/faculty/eakin/busa3321/TestOfBeta1Interpretation.xls

7.2 Confidence interval for slope

Formula: By the 4th Building Block, the population slope is estimated to be the sample slope plus and minus the margin of error:

\[ b_1 \pm t_{n-2} \, s_{b_1} \]

Conclusion: We can say with ___ confidence that an increase of one unit in the independent variable is associated with an (increase/decrease) of _____ in the average value of the dependent variable, with a margin of error of ±_________.

Example: Determine the change in average sales associated with an increase of one thousand square feet in the size of the store. You are given that \( b_1 = 1.670 \), \( s_{b_1} = 0.157 \), and n = 14.
Solution: Substitute names and numbers into the above template.
o Formula:

\[ b_1 \pm t_{n-2} \, s_{b_1} = 1.67 \pm 2.1788(0.157) = 1.67 \pm 0.342 \]

o Conclusion: We can say with 95% confidence that an increase of one thousand square feet in the size of the store is associated with an increase of $1,670,000 in average sales, with a margin of error of ±$342,000.

Click on the following link to load an Excel worksheet that will allow you to create new examples (Excel file is not working correctly on some computers at the moment):
http://wweb.uta.edu/faculty/eakin/busa3321/SlopeC.I.Interpretation.xls

8. Estimating the average and predicting an individual

8.1 Estimating the average value of all Y values for observations with the same value of X

Formula: estimated mean plus and minus the margin of error:

\[ \hat{y} \pm t_{n-2} \, (SE_{mean}) \]

Conclusion: We can say with ___ confidence that the average value of the dependent variable is _____ with a margin of error of _________ for all observations with a value of the independent variable of ____.

Example: Find average sales for all stores that have 4,000 square feet. You are given that the estimated average sales = 0.964 + 1.670 (size) and that \( SE_{mean} \), the standard error of the mean estimate, is 0.309.
Solution: Substitute the value of x into the estimated equation to obtain the estimated average sales:
Estimate of average sales = 0.964 + 1.670(4) = 7.644 ($7,644,000)
Next, substitute values into the confidence interval:

\[ \hat{y} \pm t_{n-2} \, (SE_{mean}) = 7.644 \pm 2.1788(0.309) = 7.644 \pm 0.673 \]

Conclusion: For all stores that have 4,000 square feet, we can say with 95% confidence that the average sales is $7,644,000 with a margin of error of ±$673,000.

Click on the following links to load Excel worksheets that will allow you to create new examples (Excel file is not working correctly on some computers at the moment):
http://wweb.uta.edu/faculty/eakin/busa3321/CIMuGivenXIntSEKnown.xls (Estimated Standard Error of Mean already calculated)
http://wweb.uta.edu/faculty/eakin/busa3321/CIMuGivenXInterpreation.xls (Estimated Standard Error of the mean needs to be calculated)

8.2 Predicting the value of Y for an observation with a given value of X

Formula: predicted value plus and minus the margin of error:

\[ \hat{y} \pm t_{n-2} \, (SE_{individual}) \]

Conclusion: We can say with ___ confidence that the value of the dependent variable is _____ with a margin of error of _________ for an observation with a value of the independent variable of ____.

Example: Find sales for a store that has 4,000 square feet. You are given the predicted sales = 0.964 + 1.670 (size) and its estimated standard error is 1.104.
Solution: Substitute the value of x into the estimated equation to obtain the value of the predicted sales:
predicted sales = 0.964 + 1.670(4) = 7.644 ($7,644,000)
Next, substitute values into the prediction interval:

\[ \hat{y} \pm t_{n-2} \, (SE_{individual}) = 7.644 \pm 2.1788(1.104) = 7.644 \pm 2.405 \]

Conclusion: For a store that has 4,000 square feet, we can say with 95% confidence that the sales will be $7,644,000 with a margin of error of ±$2,405,000.

Click on the following links to load Excel worksheets that will allow you to create new examples (Excel file is not working correctly on some computers at the moment):
http://wweb.uta.edu/faculty/eakin/busa3321/CIYGivenXIntGivenSE.xls (Estimated Standard Error of Predicting Y already calculated)
http://wweb.uta.edu/faculty/eakin/busa3321/CIyGivenXInterpreation.xls (Estimated Standard Error of Predicting a value of Y not calculated yet)

9. Why study regression?

The following link is a blog post that references studies in which simple linear regression outperforms experts. I have not followed up on the references, though.
http://lesswrong.com/lw/3gv/statistical_prediction_rules_outperform_expert/

10. SAS

To Be Constructed
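Until the software sections are complete, the hand calculations in Sections 7 and 8 can be checked with a short script. The following is a minimal Python sketch, not NCSS or SAS output. It assumes the standard errors and t-table values are typed in exactly as given in the store-size example (n = 14, so 12 degrees of freedom), and the helper names (`t_statistic`, `margin_of_error`, `interval`) are made up for illustration.

```python
# Values as given in the store-size example (sales in $ millions, size in 1000s of sq. ft.).
B0, B1 = 0.964, 1.670      # estimated intercept and slope
SE_SLOPE = 0.157           # standard error of the slope
SE_MEAN = 0.309            # standard error of the mean estimate at x = 4
SE_INDIVIDUAL = 1.104      # standard error for predicting an individual at x = 4
T_ONE_SIDED = 1.7823       # t-table value: 12 df, column 0.05 (right-sided test)
T_TWO_SIDED = 2.1788       # t-table value: 12 df, 95% confidence

def t_statistic(estimate, hypothesized, std_error):
    """3rd Building Block: the error divided by its standard error."""
    return (estimate - hypothesized) / std_error

def margin_of_error(t_value, std_error):
    """4th Building Block: t-table value times the standard error."""
    return t_value * std_error

def interval(center, margin):
    """Lower and upper limits of center plus and minus margin."""
    return (center - margin, center + margin)

# 7.1 Test for usefulness of the slope (H0: beta1 = 0, H1: beta1 > 0).
t_slope = t_statistic(B1, 0.0, SE_SLOPE)
reject_h0 = t_slope > T_ONE_SIDED

# 7.2 Confidence interval for the slope.
slope_margin = margin_of_error(T_TWO_SIDED, SE_SLOPE)

# 8.1 and 8.2: both intervals are centered on the same predicted value at x = 4.
y_hat = B0 + B1 * 4
mean_margin = margin_of_error(T_TWO_SIDED, SE_MEAN)
pred_margin = margin_of_error(T_TWO_SIDED, SE_INDIVIDUAL)

print(f"t = {t_slope:.2f}, reject H0: {reject_h0}")
print(f"slope: {B1:.3f} +/- {slope_margin:.3f}")
print(f"average sales at x = 4: {y_hat:.3f} +/- {mean_margin:.3f}", interval(y_hat, mean_margin))
print(f"individual sales at x = 4: {y_hat:.3f} +/- {pred_margin:.3f}", interval(y_hat, pred_margin))
```

Both intervals in Section 8 share the same center; only the standard error changes, which is why the prediction interval for one store is wider than the confidence interval for the average of all such stores.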