Regression Notes

Relationships; Estimating the Linear Function; Measures of Variation; Assumptions; Assumption Checks; One Slope; Multiple Slopes; Estimate Averages; Predict Individual Values; NCSS

What you should be able to do when you finish the notes
o Discuss differences in the types of relationships
o Define and put in English the slopes and intercept of a population
o Discuss how the estimation procedure works
o Define and put in English the estimates of the slopes and intercept
o Define and put in English the standard error of the estimate and the coefficient of determination
o Discuss the required assumptions
o Know how to check the validity of the assumptions
o Test hypotheses and construct confidence intervals for the slope
o Test the hypothesis about the effect of the model
o Construct confidence intervals for an average value of Y and prediction intervals for a value of Y (for specific values of the X's)
o Get NCSS to calculate the required estimates and tests

1. Relationships

Example 1: Suppose you believe that the reimbursed cost of a trip is a function of the mileage. You receive a fixed cost of $40 and a variable cost of $0.35 per mile.
In English
Fixed cost: When the mileage is zero, the reimbursed cost is $40.
Variable cost: For each mile traveled, the reimbursed cost increases by $0.35.
Reimbursement for a trip of 100 miles = $40 + $0.35*(100) = $75

Example 2: A $10,000 computer is being depreciated over 5 years (with no resale value at the end of 5 years). In English, what is the meaning of the fixed and variable parts?
Fixed: When ____________ is zero, the _____________ is _______
Variable: For each _______, the ________ decreases by ___________
For a 2 year old computer the remaining value of the computer is _____ _______ = ______

Example 3: Suppose you believe that the reimbursed cost of a trip is a function of the mileage and the number of days of the trip. You receive a fixed cost of $40 and variable costs of $0.35 per mile and $150 per day.
In English
Fixed cost: When the mileage and the number of days are zero, the reimbursed cost is $40.
Variable cost of mileage: Assuming the number of days does not change, for each mile traveled, the reimbursed cost increases by $0.35.
Variable cost of days: Assuming the mileage does not change, for each day of the trip, the reimbursed cost increases by $150.
Reimbursement for a trip of 100 miles that lasted two days = $40 + $0.35*(100) + $150*(2) = $375

Forms of relationships: Mathematical versus tendencies
Do you believe that the actual cost of a trip is exactly equal to the reimbursed cost? We will assume that the relationship is just a tendency (that is, it holds on average).

Types of relationships: Of all possible depreciation methods, what is the simplest?

For one independent variable (Simple Linear Regression): The mean value of one variable, Y, depends on the values of another, X. For example, the average starting salary depends on a student's GPA. There are many relationship functions but the simplest is a straight line:
\( \mu_{y|x} = \beta_0 + \beta_1 X \)
This is read as "the mean value of Y given X is a straight line function of X".

For two independent variables (two or more independent variables is Multiple Linear Regression): The mean value of one variable, Y, depends on the values of variables X1 and X2. For example, the average starting salary depends on a student's GPA and their work experience. There are many relationship functions but the simplest is a first-order model:
\( \mu_{y|x_1,x_2} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 \)
In general the model for k independent variables is:
\( \mu_{y|x_1,x_2,\ldots,x_k} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k \)

Incomplete information
Do you believe that it is possible to obtain information about all possible trips? If not, then the estimates are in error and we need to compute their margins of error.

Example 4: Suppose you believe that the number of bars your company sells depends on the price of the bar (measured in cents) and the amount spent on a promotion (in dollars).
You have recorded these three values for a random sample of 34 stores in your chain of stores. There are five things of interest in this example. What are the value, interpretation and margin of error of…

Parameter | Statistic (given) | Standard Error (given) | Margin of Error: use a t (n-k-1)
Fixed (intercept) | 5837.52 | 628.15 |
Slope of Price | -53.22 | 6.85 |
Slope of Promotion | 3.61 | 0.68 |
The number of bars sold for all stores where the bar is priced 79 cents and $400 is spent on promotion | Estimated average number of bars = 5837.52 - 53.22(79) + 3.61(400) = 3077.14 bars | 110.08 |
The number of bars sold for a store where the bar is priced 79 cents and $400 is spent on promotion | Predicted number of bars = 5837.52 - 53.22(79) + 3.61(400) = 3077.14 bars | 647.49 |

Put each of the five in English with their margin of error
o Intercept
o Slope of price
o Slope of promotion
o Estimate of average
o Prediction of an individual

2. Meaning of Population Slopes and Intercept

Intercept, \( \beta_0 \): The average value of the dependent variable when the independent variable takes on the value zero. (We would usually add the phrase "in the population" to this. However, for the remainder of this class this phrase will not be included but will be understood to be in the sentence. Therefore I will assume you are talking about the population unless you specifically state that your slope or intercept is an estimate or use the phrase "in the sample" in its discussion.)
Slope, \( \beta_1 \): Holding all other variables constant, when the first independent variable increases by one unit, the average value of the dependent variable increases/decreases by \( \beta_1 \) units.
Slope, \( \beta_2 \): Holding all other variables constant, when the second independent variable increases by one unit, the average value of the dependent variable increases/decreases by \( \beta_2 \) units.

Example: Interpret the meaning of the intercept and slope when relating Starting Salary and graduating GPA.
Solution: (1) Determine if one variable depends on the other.
In this case the starting salaries of students should depend on their GPA. (2) Next determine the units of the independent variable. Here GPA is measured in points. (3) Finally, replace the names and units in the definitions above.
Intercept, \( \beta_0 \): The average value of the starting salary when the GPA is zero. Reword this to make it sound better. For example: For all students who graduate with a GPA of zero, \( \beta_0 \) is the average of their starting salaries. (Many times the intercept does not have a realistic meaning.)
Slope, \( \beta_1 \): The change in the average starting salary when the GPA increases by one point. Rewording example: Average starting salary is \( \beta_1 \) higher/lower (higher if positive and lower if negative) for all students whose graduating GPA is higher than another group's by one point.
Two Notes: (1) Unless this is an experiment, we cannot say that increasing GPA causes average starting salary to change, only that this is a difference between two populations, and (2) there are no other variables to hold constant.

Click on the following link to load an Excel worksheet that will allow you to create new examples:
for one independent variable: http://wweb.uta.edu/faculty/eakin/busa3321/Beta0-Beta1Interpretation.xls
for two independent variables: http://wweb.uta.edu/faculty/eakin/busa5325/BetaInter1stOrderModel.xls

3. Estimating the Simple Linear Function

3.1 Least squares: Minimizing the sum of the squared differences between the line and the actual values of Y creates an estimate close to all values of Y. The estimated line is symbolized as
\( \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \ldots + b_k x_k \)
This is read as "the predicted value of Y is a first-order linear function of the Xi's".
The predicted value will be both (1) our estimate of the population mean and (2) our prediction of the value of Y for specified values of the Xi.

3.2 Interpretation of estimates
Estimated intercept, b0: The estimated average value of the dependent variable when the independent variable takes on the value zero.
Estimated slope, bi: Holding all other variables constant, when the i-th independent variable increases by one, the average value of the dependent variable is estimated to increase/decrease by bi units.

3.3 Example: Sales in dollars, size of store in thousands of square feet, and advertising budget in dollars:
Predicted sales, \( \hat{y} = 964{,}000 + 1{,}670{,}000(\text{size}) + 1.45(\text{ad budget}) \)
Solution: Same as before, determine the dependent and independent variables along with their units. Sales in dollars depends on size in thousands of square feet and ad budget in dollars. Next, replace the generic terms in the definitions above with the appropriate names and insert the values of the estimated slopes and intercept. Rephrase if necessary.
Estimated intercept, b0: The estimated average value of the sales is $964,000 when the size of the store and the advertising budget are zero.
Estimated slope, b1: Holding the advertising budget constant, when the size of the store increases by one thousand square feet, the average sales is estimated to increase by $1,670,000.
Estimated slope, b2: Holding the store size constant, when the advertising budget increases by one dollar, the average sales is estimated to increase by $1.45.

Click on the following link to load an Excel worksheet that will allow you to create new examples
for one independent variable: http://wweb.uta.edu/faculty/eakin/busa3321/EstimateInterpretation.xls
for two independent variables: http://wweb.uta.edu/faculty/eakin/busa5325/EstbiMLR.xls

3.4 Correlation Coefficient, r: a Special Case of the Slope in Simple Linear Regression
The values of X and Y are standardized; i.e., subtract the sample mean from each value and then divide by the sample standard deviation of the values. This results in the two new variables:
\( Z_x = (X - \bar{X}) / S_x \) and \( Z_y = (Y - \bar{Y}) / S_y \)
When you estimate the least squares line \( \hat{z}_y = b_0 + b_1 z_x \), then b0 = 0 and b1 = r.
The correlation coefficient is just the sample slope of the standardized values and is used to measure the strength of the linear association between two variables. It means that for each increase of one sample standard deviation in X, there is an increase/decrease of r sample standard deviations in Y.
Notes: (1) Correlation can fall between -1 (a perfect negative linear association) and +1 (a perfect positive linear association). See the textbook for example pictures corresponding to different values of r. (2) A low r just means there is a weak straight-line association. The association could be strong but not a straight line.

4. Measures of Variation

Variation and estimate: Variability of data away from its average.
Standard Error of Estimate, \( S_{y|x_1,x_2,\ldots,x_k} \) or \( S_e \): An estimate of the typical error that occurs when predicting Y using this model estimate on the sample data. This was what we called the sample standard deviation.
Coefficient of Determination, \( R^2 \): Percent of the sample variability of Y that can be associated with variation in the independent variables.

Example: Sales, size, and ad budget of store, \( S_e \) = 966,000, \( R^2 \) = 0.904. Interpret the meaning of the standard error of the estimate and the coefficient of determination.
Solution: As always, replace Y and X with the names from the example.
Standard Error of Estimate, \( S_{y|x} \): If the estimated line is used, there is a typical error of $966,000 when predicting sales in this sample.
Coefficient of Determination, \( R^2 \): 90.4% of the sample variability of sales can be associated with variation in the size of the store and the advertising budget.
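The definitions in 3.4 and section 4 can be checked numerically. The following is a minimal sketch, not part of the original notes: the sample values and the helper names `fit_line` and `standardize` are made up for illustration, and it assumes the usual formulas Se = sqrt(SSE/(n-k-1)) and R2 = 1 - SSE/SST, which the notes do not spell out.

```python
import math
import statistics


def fit_line(x, y):
    """Return the least squares intercept b0 and slope b1 for one X."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def standardize(values):
    """Subtract the sample mean, then divide by the sample standard deviation."""
    mean, sd = statistics.mean(values), statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Made-up sample for illustration only (not data from the notes).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
n, k = len(x), 1  # k = number of independent variables

b0, b1 = fit_line(x, y)
y_bar = statistics.mean(y)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # variation left around the line
sst = sum((yi - y_bar) ** 2 for yi in y)                       # total variation of Y

s_e = math.sqrt(sse / (n - k - 1))  # standard error of the estimate
r_squared = 1 - sse / sst           # coefficient of determination

# Regress standardized Y on standardized X: the intercept is (numerically) zero
# and the slope is the correlation coefficient r.
z_b0, r = fit_line(standardize(x), standardize(y))

print(f"Se = {s_e:.3f}, R^2 = {r_squared:.3f}, r = {r:.3f}")
```

With one independent variable, the square of r from the standardized fit should agree with R2 from the original fit, which is a quick consistency check on both calculations.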
Click on the following link to load an Excel worksheet that will allow you to create new examples. When asked for the independent variable, list all the variables, separated by commas:
http://wweb.uta.edu/faculty/eakin/busa3321/VariationExplanation.xls

Uses of \( R^2 \):
Baseball example: http://www.insiderbaseball.com/Angelo-v2.htm
Business example: http://www.sdcounty.ca.gov/rtf/docs/CNASmallBusiness.pdf (search for "simple linear regression")
Information Systems example: http://databases.about.com/od/datamining/a/datamining.htm

5. Assumptions

Linearity: the average value of the dependent variable has a first-order relationship with the independent variables
Independence: the observations are randomly and independently selected
Normality: the values of the dependent variable are normally distributed for observations with the same value of the independent variables
Equal Variation: the variation in the values of the dependent variable is the same (equal) for observations with the same value of the independent variables regardless of the value of the independent variables

Example: In the previous examples, the dependent variable was sales and the independent variables were size and ad budget. All variables were measured on stores. Interpret the assumptions in this context.
Solution: Reword the above sentences
Linearity: the average value of the sales has a first-order relationship with the size of the store and advertising budget
Independence: the stores are randomly and independently selected
Normality: the values of the sales are normally distributed for stores of the same size and advertising budget
Equal Variation: the variation in the values of the sales is the same (equal) for stores of the same size and advertising budget regardless of the value of the size and advertising budget

Click on the following link to load an Excel worksheet that will allow you to create new examples.
When asked for the independent variable, type in the list of independent variables separated by commas:
http://wweb.uta.edu/faculty/eakin/busa3321/RegAssumptionsInterpretation.xls

6. Assumption Checks

Errors versus residuals: Errors are the differences between the Y's and their true average values, while the residuals are the differences between the Y's and their predicted values.
Residual plots: these should show a random scattering of points; any pattern indicates a possible violation of an assumption.
Normal probability plots: these plots should show a straight line; any pattern other than a straight line indicates a possible problem with normality.

[Figure: Residual Plot - All Assumptions Met (residual versus X)]

Plots showing assumption violations:
[Figure: Residual Plot - Not a First Order Model (residual versus X)]
[Figure: Residual Plot - Unequal Variance (residual versus X)]

7. Slope

7.1 Test for usefulness (slope)
Hypothesis: \( H_0: \beta_1 = 0 \); \( H_1 \): one-sided or two-sided alternatives
Rejection region: One-sided or two-sided t distribution with n-k-1 degrees of freedom
Test Statistic: the difference between the estimated slope and the hypothesized slope, in number of estimated standard errors
\( t = \dfrac{b_1 - 0}{s_{b_1}} \)
Conclusion: Holding all other variables constant, we can (not) say that changes in the values of the independent variable are associated with changes (increases or decreases for one-sided tests) in the average value of the dependent variable.

Example: Holding advertising budget constant, determine if increases in the size of the store are associated with an increase in the average sales. You are given that b1 = 1,670,000, Sb1 = 157,000, and n = 14.
o Hypothesis: \( H_0: \beta_1 = 0 \); \( H_1: \beta_1 > 0 \)
o Rejection region: The degrees of freedom are n-k-1 = 11. Since this is a right-sided t-test, find the t-table value of 1.7959 in row 11 column 0.05.
Therefore reject H0 if t > 1.7959.
o Test Statistic (slope and standard error expressed in millions): \( t = \dfrac{1.67 - 0}{0.157} = 10.64 \)
o Conclusion: Since 10.64 > 1.7959, we reject H0. We can say that increases in the size of the store are associated with increases in the average sales, holding advertising budget constant.

Click on the following link to load an Excel worksheet that will allow you to create new examples
http://wweb.uta.edu/faculty/eakin/busa3321/TestOfBeta1.xls

7.2 Confidence interval for slope
Formula: estimated slope plus and minus the margin of error
\( b_i \pm t_{n-k-1} S_{b_i} \)
Conclusion: Holding all other variables constant, we can say with ___ confidence that an increase of one unit in the independent variable is associated with (increases/decreases) of _____ in the average value of the dependent variable with a margin of error of ±_________.
Or: Holding all other variables constant, we can say with ___ confidence that an increase of one unit in the independent variable is associated with (increases/decreases) of between _____ and _______ in the average value of the dependent variable.

Example: Holding advertising budget constant, determine the amount of change in the average value of the sales that is associated with an increase of one thousand square feet in the size of the store. You are given that
\( \hat{y} = 964{,}000 + 1{,}670{,}000(\text{size}) + 1.45(\text{ad budget}) \), Sb1 = 157,000, and n = 14.
Solution: Substitute names and numbers into the above template
o Formula: \( b_1 \pm t_{n-k-1} S_{b_1} = 1{,}670{,}000 \pm 2.2010 \times (157{,}000) = 1{,}670{,}000 \pm 345{,}557 \)
o Conclusion: Holding advertising budget constant, we can say with 95% confidence that an increase of one thousand square feet in the size of the store is associated with an increase of $1,670,000 in average sales with a margin of error of ±$345,557.
Or
o Holding advertising budget constant, we can say with 95% confidence that an increase of one thousand square feet in the size of the store is associated with an increase of between $1,324,443 and $2,015,557 in average sales.
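The arithmetic in the two slope examples above can be reproduced in a few lines. This is a minimal sketch rather than the NCSS procedure: the function names are hypothetical, and the critical values 1.7959 and 2.2010 are typed in from the t-table (row 11) as used in the notes instead of being computed.

```python
def slope_t_statistic(b1, s_b1, hypothesized=0.0):
    """Number of estimated standard errors between b1 and the hypothesized slope."""
    return (b1 - hypothesized) / s_b1


def slope_confidence_interval(b1, s_b1, t_critical):
    """Return (lower, upper, margin) for b1 plus and minus t * s_b1."""
    margin = t_critical * s_b1
    return b1 - margin, b1 + margin, margin


# Values given in the store-size example.
b1 = 1_670_000   # estimated slope of size (dollars per thousand square feet)
s_b1 = 157_000   # estimated standard error of the slope
n, k = 14, 2     # 14 stores; 2 independent variables (size, ad budget)
df = n - k - 1   # degrees of freedom

# t-table values for 11 degrees of freedom, copied from the notes.
T_RIGHT_TAIL_05 = 1.7959   # one-sided test at the 0.05 level
T_TWO_SIDED_95 = 2.2010    # 95% confidence interval

t_stat = slope_t_statistic(b1, s_b1)
reject_h0 = t_stat > T_RIGHT_TAIL_05
lower, upper, margin = slope_confidence_interval(b1, s_b1, T_TWO_SIDED_95)

print(f"df = {df}, t = {t_stat:.2f}, reject H0: {reject_h0}")
print(f"95% CI: {lower:,.0f} to {upper:,.0f} (margin {margin:,.0f})")
```

Keeping the critical value as an explicit input mirrors the table lookup in the notes; a statistics library could supply it instead if one is available.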
Click on the following link to load an Excel worksheet that will allow you to create new examples
http://wweb.uta.edu/faculty/eakin/busa3321/C.I.slope.xls

8. Testing for the Effect of the Model

8.1 Test of the model:
\( E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k \)
Note that any one of the X's above could be a function of one or more other independent variables; e.g., the square of another variable or the product of two variables. Therefore find the value of k by looking at the subscript on the last slope in the model. So if our model was
\( E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 X_1^2 X_2 + \beta_4 X_2^2 + \beta_5 X_2 \)
then k = 5.

8.1.1 Test form
Hypothesis: \( H_0: \beta_1 = \beta_2 = \ldots = \beta_k = 0 \) (no variable has an effect); \( H_1 \): at least one has an effect
Rejection region: One-sided F with k and n-k-1 degrees of freedom
Test Statistic: variation of coefficient estimates / variability of randomness
Conclusion: We can (not) say that changes in the values of at least one independent variable are associated with changes in the average value of the dependent variable.

8.1.2 Example: Predict exam grade, Y, based on hours studied, X1, and GPA, X2. You are given that n = 30, MSR = 200 and MSE = 10 using the model:
\( E(\text{ExamGrade}) = \beta_0 + \beta_1(\text{HoursStudied}) + \beta_2(\text{GPA}) \)
Hypothesis: \( H_0: \beta_1 = \beta_2 = 0 \) (neither hours studied nor GPA has an effect on the average exam grade); \( H_1 \): at least one has an effect
Rejection region: Reject H0 if \( F > F_{2,27} = 3.354 \)
Test Statistic: variation of coefficient estimates / variability of randomness
F = MSR/MSE = 200/10 = 20
Conclusion: Since 20 > 3.354, we reject H0. We can say that GPA and/or hours studied is useful in predicting the exam grade.
Or: We can say that changes in GPA and/or hours studied are associated with changes in the average exam grade.

To create other examples, click on the link below and change numbers and names
http://wweb.uta.edu/faculty/eakin/busa5325/MLRTestOfModel.xls

9.
Estimating the average and predicting an individual

9.1 Estimating the average value of all Y values for observations with the same value of X1, the same value of X2, …, and the same value of Xk
Formula: estimated mean plus and minus the margin of error
\( \hat{y} \pm t_{n-k-1}(SE_{mean}) \)
Conclusion: We can say with ___ confidence that the average value of the dependent variable is _____ with a margin of error of _________ for all observations with values of the independent variables of ____
Or: We can say with ___ confidence that the average value of the dependent variable is between ___ and ____ for all observations with values of the independent variables of ____

Example: Find average sales for all stores that have 4,000 square feet and a $100,000 advertising budget. You are given that the estimated average sales = 964,000 + 1,670,000(size) + 1.45(ad budget) and that SEmean, the standard error of the mean estimate, = 309,000.
Solution: Substitute the values of the X's into the estimated equation to obtain the estimated average sales:
Estimate of average sales = 964,000 + 1,670,000(4) + 1.45(100,000) = 7,789,000
Next substitute values into the confidence interval:
\( \hat{y} \pm t_{n-k-1}(SE_{mean}) = 7{,}789{,}000 \pm 2.2010 \times (309{,}000) = 7{,}789{,}000 \pm 680{,}109 \)
Conclusion: For all stores that have 4,000 square feet and an advertising budget of $100,000, we can say with 95% confidence that the average sales is $7,789,000 with a margin of error of ±$680,109.
Or
Conclusion: For all stores that have 4,000 square feet and an advertising budget of $100,000, we can say with 95% confidence that the average sales is between $7,108,891 and $8,469,109.

Click on the following link to load an Excel worksheet that will allow you to create new examples
http://wweb.uta.edu/faculty/eakin/busa5325/CIMuYMLR.xls
(Special case of two independent variables with the estimated standard error of estimating the average value of Y already calculated)

9.2 Predicting the value of Y for an observation with a given value of X
Formula: predicted value plus and minus the margin of error
\( \hat{y} \pm t_{n-k-1}(SE_{individual}) \)
Conclusion: We can say with ___ confidence that the value of the dependent variable is _____ with a margin of error of _________ for an observation with values of the independent variables of ____
Or: We can say with ___ confidence that the value of the dependent variable is between _____ and ______ for an observation with values of the independent variables of ____

Example: Find sales for a store that has 4,000 square feet and an advertising budget of $100,000. You are given the predicted sales = 964,000 + 1,670,000(size) + 1.45(ad budget) and that its estimated standard error is 1,104,000.
Solution: Substitute the values of the X's into the estimated equation to obtain the value of the predicted sales:
Predicted sales = 964,000 + 1,670,000(4) + 1.45(100,000) = 7,789,000
Next substitute values into the prediction interval:
\( \hat{y} \pm t_{n-k-1}(SE_{individual}) = 7{,}789{,}000 \pm 2.2010 \times (1{,}104{,}000) = 7{,}789{,}000 \pm 2{,}429{,}904 \)
Conclusion: For a store that has 4,000 square feet and an advertising budget of $100,000, we can say with 95% confidence that the sales will be $7,789,000 with a margin of error of ±$2,429,904.
Or
Conclusion: For a store that has 4,000 square feet and an advertising budget of $100,000, we can say with 95% confidence that the sales will be between $5,359,096 and $10,218,904.

Click on the following link to load an Excel worksheet that will allow you to create new examples
http://wweb.uta.edu/faculty/eakin/busa5325/CIYMLR.xls
(Special case of two independent variables with the estimated standard error of predicting an individual value of Y already calculated)

10. Why study regression?

The following link is a blog post with some references to studies where simple linear regression outperforms experts. I have not followed up on the references though.
http://lesswrong.com/lw/3gv/statistical_prediction_rules_outperform_expert/

11.
Other forms of regression

11.1 Names
Simple Linear Model (SLR): a first-order model with only one independent variable
First Order Model: more distinct variables are added to the SLR model
Second Order Model: squared terms are added to the model
Interaction Model: product terms are added
Logarithmic Models

11.2 Effects of each variable
Simple Linear Model: The change in the average value of the dependent variable when an independent variable increases by one. Note: A separate F test is not needed since only one slope needs to be tested.
First Order Model: The change in the average value of the dependent variable when an independent variable increases by one, holding constant all the others.
Second Order Model: The change in the average value of the dependent variable when an independent variable increases by one depends on the current value of the independent variable, holding constant all the others.
Interaction Model: The change in the average value of the dependent variable when an independent variable increases by one depends on the value of another variable, while holding constant all the others.
Logarithmic Model: Holding all other variables constant, for each one-unit increase in an independent variable, there is an average percentage increase in the dependent variable.

11.3 Examples
Predict exam grade, Y, based on hours studied, X1, and GPA, X2.

Simple Linear Model (using only X1): The slope \( \beta_1 \) is the change in the average exam grade when you study an additional hour.
\( E(\text{ExamGrade}) = \beta_0 + \beta_1(\text{HoursStudied}) \)

First Order Model: The slope \( \beta_1 \) is the change in the average exam grade when you study an additional hour, for those people with the same GPA.
\( E(\text{ExamGrade}) = \beta_0 + \beta_1(\text{HoursStudied}) + \beta_2(\text{GPA}) \)
Or
\( E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 \)

Second Order Model:
\( E(\text{ExamGrade}) = \beta_0 + \beta_1(\text{HoursStudied}) + \beta_2(\text{HoursStudied})^2 + \beta_3(\text{GPA}) \)
The change in the average exam grade when you study an additional hour depends on how long you have studied, controlling for the effect of GPA.
To find the change, substitute the value of "HoursStudied+1" into the above equation and examine the difference:
\( E(\text{ExamGrade}) = \beta_0 + \beta_1(\text{HoursStudied}+1) + \beta_2(\text{HoursStudied}+1)^2 + \beta_3(\text{GPA}) \)
minus
\( E(\text{ExamGrade}) = \beta_0 + \beta_1(\text{HoursStudied}) + \beta_2(\text{HoursStudied})^2 + \beta_3(\text{GPA}) \)
which equals the following as long as GPA has the same value in both equations:
\( \beta_1 + \beta_2(1 + 2 \times \text{HoursStudied}) \)

Interaction Model: The change in the average exam grade for each additional hour studied depends on the value of the GPA.
\( E(\text{ExamGrade}) = \beta_0 + \beta_1(\text{HoursStudied}) + \beta_2(\text{GPA}) + \beta_3(\text{HoursStudied} \times \text{GPA}) \)
or
\( E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 \)
\( E(Y) = \beta_0 + (\beta_1 + \beta_3 X_2) X_1 + \beta_2 X_2 \)

Logarithmic Model: For each additional hour studied, there is an average percentage increase in the exam grade, holding GPA constant.
\( E(\log(\text{ExamGrade})) = \beta_0 + \beta_1(\text{HoursStudied}) + \beta_2(\text{GPA}) + \beta_3(\text{HoursStudied} \times \text{GPA}) \)
\( E(\log(Y)) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 \)

Other examples:
Baseball: http://baseballanalysts.com/archives/2007/05/the_value_of_th.php
Accounting: http://www.nysscpa.org/cpajournal/2005/105/essentials/p56.htm
Business: http://www.cdc.gov/MMWR/preview/mmwrhtml/00037061.htm
U.S. Weather Forecasts: http://www.weather.gov/ost/NWS_TIP.pdf

11.4 Create your own examples of interpretations by changing the words and numbers in
First Order Model: http://wweb.uta.edu/faculty/eakin/busa5325/BetaInter1stOrderModel.xls
Interaction Model: http://wweb.uta.edu/faculty/eakin/busa5325/BetaInterInteractionModel.xls
Quadratic Model: http://wweb.uta.edu/faculty/eakin/busa5325/BetaInterQuadraticModel.xls
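The second-order and interaction interpretations in 11.3 can be checked numerically. The sketch below is an illustration only: the coefficient values and function names are hypothetical (the notes give no estimates for the exam grade models). It compares the direct difference in the mean for one more hour studied against the closed forms from 11.3, slope 1 plus slope 2 times (1 + 2 * HoursStudied) for the second-order model and slope 1 plus slope 3 times GPA for the interaction model.

```python
# Hypothetical coefficients, chosen only to illustrate the algebra.
B0, B1, B2, B3 = 40.0, 6.0, -0.25, 5.0


def second_order_mean(hours, gpa):
    """E(ExamGrade) = B0 + B1*hours + B2*hours^2 + B3*gpa."""
    return B0 + B1 * hours + B2 * hours ** 2 + B3 * gpa


def interaction_mean(hours, gpa):
    """E(ExamGrade) = B0 + B1*hours + B2*gpa + B3*hours*gpa."""
    return B0 + B1 * hours + B2 * gpa + B3 * hours * gpa


def second_order_change(hours, gpa):
    """Change in the mean for one more hour, GPA held constant."""
    return second_order_mean(hours + 1, gpa) - second_order_mean(hours, gpa)


def interaction_change(hours, gpa):
    """Change in the mean for one more hour, GPA held constant."""
    return interaction_mean(hours + 1, gpa) - interaction_mean(hours, gpa)


for hours in (1, 4, 8):
    direct = second_order_change(hours, gpa=3.0)
    closed_form = B1 + B2 * (1 + 2 * hours)
    print(f"second order, hours={hours}: direct={direct:.2f}, formula={closed_form:.2f}")

for gpa in (2.0, 3.0, 4.0):
    direct = interaction_change(hours=5, gpa=gpa)
    closed_form = B1 + B3 * gpa
    print(f"interaction, GPA={gpa}: direct={direct:.2f}, formula={closed_form:.2f}")
```

In the second-order model the effect of an extra hour changes with the hours already studied but not with GPA; in the interaction model it changes with GPA but not with the hours already studied.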