Correlation and Linear Regression
Chapter 13
McGraw-Hill/Irwin
Copyright © 2012 by The McGraw-Hill Companies, Inc. All rights reserved.

Topics
1. Correlation
   - Scatter diagram
   - Correlation coefficient
   - Test on the correlation coefficient
2. Simple Linear Regression Analysis
   - Estimation
   - Validity of the model: test on the slope coefficient
   - Fitness of the model: coefficient of determination
   - Prediction
   - The error term and residuals

Regression Analysis: Introduction
Recall that in Chapter 4 we used Applewood Auto Group data to show the relationship between two variables with a scatter diagram. The profit for each vehicle sold and the age of the buyer were plotted on an XY graph. The graph showed that, in general, as the age of the buyer increased, so did the profit on the vehicle. This chapter develops numerical measures to express the strength of the relationship between two variables. In addition, an equation is used to express the relationship between variables, allowing us to estimate one variable on the basis of another.

Examples
- Does the amount Healthtex spends per month on training its sales force affect its monthly sales?
- Is the number of square feet in a home related to the cost to heat the home in January?
- In a study of fuel efficiency, is there a relationship between miles per gallon and the weight of a car?
- Does the number of hours that students studied for an exam influence the exam score?

Dependent Versus Independent Variable
The dependent variable is the variable being predicted or estimated.
The independent variable provides the basis for estimation; it is the predictor variable.
Which are the dependent and independent variables in each of the questions below?
1. Does the amount Healthtex spends per month on training its sales force affect its monthly sales?
2. Is the number of square feet in a home related to the cost to heat the home in January?
3. In a study of fuel efficiency, is there a relationship between miles per gallon and the weight of a car?
4. Does the number of hours that students studied for an exam influence the exam score?

Scatter Diagram Example
The sales manager of Copier Sales of America, which has a large sales force throughout the United States and Canada, wants to determine whether there is a relationship between the number of sales calls made in a month and the number of copiers sold that month. The manager selects a random sample of 10 representatives and determines the number of sales calls each representative made last month and the number of copiers sold.

Minitab Scatter Plots
[Three example scatter plots: strong positive correlation, no correlation, and weak negative correlation.]

The Coefficient of Correlation, r
The coefficient of correlation (r) is a measure of the strength of the relationship between two variables.
- It shows the direction and strength of the linear relationship between two interval- or ratio-scale variables.
- It can range from -1.00 to +1.00.
- Values of -1.00 or +1.00 indicate perfect correlation; values close to -1.00 or +1.00 indicate strong correlation.
- Values close to 0.0 indicate weak correlation.
- Negative values indicate an inverse (indirect) relationship; positive values indicate a direct relationship.

Correlation Coefficient - Interpretation
[Scale running from perfect negative correlation at -1.00, through no correlation at 0, to perfect positive correlation at +1.00.]

Correlation Coefficient - Example
For the copier data, the sample statistics are X̄ = 22 calls, Ȳ = 45 copiers, sX = 9.189, and sY = 14.337.
In Excel, =CORREL(VAR1, VAR2) returns r = 0.759.
What does a correlation of 0.759 mean? First, it is positive, so we see there is a direct relationship between the number of sales calls and the number of copiers sold. The value of 0.759 is fairly close to 1.00, so we conclude that the association is strong. However, does this mean that more sales calls cause more sales? No, we have not demonstrated cause and effect here, only that the two variables, sales calls and copiers sold, are related.
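As a cross-check on the Excel =CORREL result, r can be computed directly from its definition. A minimal Python sketch using the ten observations from the copier example:

```python
import math

# Copier Sales of America sample: sales calls (X) and copiers sold (Y)
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

n = len(calls)
mean_x = sum(calls) / n          # 22.0
mean_y = sum(sales) / n          # 45.0

# Pearson's r = (sum of cross-deviations) / sqrt(SSx * SSy)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(calls, sales))
ssx = sum((x - mean_x) ** 2 for x in calls)
ssy = sum((y - mean_y) ** 2 for y in sales)
r = sxy / math.sqrt(ssx * ssy)

print(round(r, 3))  # 0.759
```

This reproduces the r = 0.759 reported on the slide and the Multiple R (0.759014) in the Excel output shown later.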
Testing the Significance of the Correlation Coefficient
H0: ρ = 0 (the correlation in the population is 0)
H1: ρ ≠ 0 (the correlation in the population is not 0)
Test statistic: t = r√(n − 2) / √(1 − r²), with n − 2 degrees of freedom.
Rejection region: reject H0 if t > t(α/2, n−2) or t < −t(α/2, n−2).

Testing the Significance of the Correlation Coefficient - Copier Sales Example
1. H0: ρ = 0 (the correlation in the population is 0)
   H1: ρ ≠ 0 (the correlation in the population is not 0)
2. α = .05
3. Computing t, we get
   t = r√(n − 2) / √(1 − r²) = .759√(10 − 2) / √(1 − .759²) = 3.297
4. Reject H0 if t > t(α/2, n−2) or t < −t(α/2, n−2); here t(0.025, 8) = 2.306, so reject if t > 2.306 or t < −2.306.
5. The computed t (3.297) is within the rejection region, therefore we reject H0. This means the correlation in the population is not zero. From a practical standpoint, it indicates to the sales manager that, in the population of salespeople, the number of sales calls made and the number of copiers sold are correlated.

One-tail Test of the Correlation Coefficient - Applewood Example
Sometimes, instead of a two-tail test on the correlation coefficient, we may want to test whether the coefficient takes a specific sign. Example: in Chapter 4 we constructed a scatter plot of the profit earned on a vehicle sale versus the age of the purchaser for the Applewood Auto Group. The correlation coefficient is found to be .262. Test at the 5% significance level to determine whether the relationship is positive (n = 180).
[Scatter plot of profit ($0 to $3,500) versus age of buyer (0 to 80).]

1. H0: ρ = 0 (the correlation in the population is 0)
   H1: ρ > 0 (the correlation in the population is positive/direct)
2. α = .05
3. Computing t, we get
   t = r√(n − 2) / √(1 − r²) = .262√(180 − 2) / √(1 − .262²) = 3.622
4. Reject H0 if t > t(α, n−2) = t(0.05, 178) = 1.653.
5.
The computed t (3.622) is within the rejection region, therefore we reject H0. This means the correlation in the population is positive. From a practical standpoint, it indicates a positive correlation between profit and age in the population.

Regression Analysis
In regression analysis we use the independent variable (X) to estimate the dependent variable (Y).
- The relationship between the variables is linear.
- Both variables must be at least interval scale.
- The least squares criterion is used to determine the equation.
REGRESSION EQUATION: An equation that expresses the linear relationship between two variables.
LEAST SQUARES PRINCIPLE: Determining a regression equation by minimizing the sum of the squares of the vertical distances between the actual Y values and the predicted values of Y.

Linear Regression Model
Simple linear regression model: Y = α + βX + ε
- Y is the dependent variable and X is the independent variable.
- α is the Y-intercept and β is the slope; both are population coefficients and must be estimated from sample data.
- ε is the error term.
The model represents the linear relationship between the two variables in the population.
Estimated linear regression equation: Ŷ = a + bX
The estimated equation represents the linear relationship between the two variables as estimated from the sample.

Regression Analysis - Least Squares Principle
The least squares principle is used to obtain a and b.

Computing the Slope of the Line and the Y-intercept
b = r(sY / sX)  and  a = Ȳ − bX̄

Regression Equation Example
Recall the example involving Copier Sales of America. The sales manager gathered information on the number of sales calls made and the number of copiers sold for a random sample of 10 sales representatives. Use the least squares method to determine a linear equation to express the relationship between the two variables. What is the expected number of copiers sold by a representative who made 20 calls?
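The least squares estimates can also be computed directly from the equivalent form b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² together with a = Ȳ − bX̄. A minimal Python sketch using the copier data:

```python
# Least squares fit for the copier data
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

n = len(calls)
mean_x = sum(calls) / n
mean_y = sum(sales) / n

# b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2);  a = y_bar - b * x_bar
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(calls, sales)) / \
    sum((x - mean_x) ** 2 for x in calls)
a = mean_y - b * mean_x

# Expected sales for a representative who made 20 calls
y_hat_20 = a + b * 20
print(round(a, 4), round(b, 4), round(y_hat_20, 4))  # 18.9474 1.1842 42.6316
```

Note the intercept here is 18.9474 (Excel reports 18.94737); the slides' 18.9476 arises from rounding the slope to 1.1842 before computing a. Either way, the predicted sales for 20 calls round to about 42.6 copiers.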
Regression Equation Example

Representative    Calls (X)  Sales (Y)
Tom Keller           20         30
Jeff Hall            40         60
Brian Virost         20         40
Greg Fish            30         60
Susan Welch          10         30
Carlos Ramirez       10         40
Rich Niles           20         40
Mike Keil            20         50
Mark Reynolds        20         30
Soni Jones           30         70

Excel: Data -> Data Analysis -> Descriptive Statistics (see the Excel instructions in Lind et al., p. 329, #5 for more details):

                     Calls      Sales
Mean                22.000     45.000
Standard Error       2.906      4.534
Median              20.000     40.000
Mode                20.000     30.000
Standard Deviation   9.189     14.337
Sample Variance     84.444    205.556
Kurtosis             0.396     -1.001
Skewness             0.601      0.566
Range               30.000     40.000
Minimum             10.000     30.000
Maximum             40.000     70.000
Sum                220.000    450.000
Count               10.000     10.000

Step 1 - Find the slope (b) of the line: b = r(sY / sX) = 0.759(14.337 / 9.189) = 1.1842
Step 2 - Find the Y-intercept (a): a = Ȳ − bX̄ = 45 − 1.1842(22) = 18.9476

Finding and Fitting the Regression Equation Example
The regression equation is Ŷ = a + bX:
Ŷ = 18.9476 + 1.1842X
For a representative who made 20 calls, the expected number of copiers sold is
Ŷ = 18.9476 + 1.1842(20) = 42.6316

Validity of the Model - Copier Sales Example
1. H0: β = 0 (the slope is 0; there is no linear relationship; the model is invalid)
   H1: β ≠ 0 (the slope is not 0; there is a linear relationship; the model is valid)
2. α = .05
3. Test statistic: t = (b − 0) / sb = (1.1842 − 0) / 0.3591 = 3.297
4. Rejection region: reject H0 if t > t(α/2, n−2) or t < −t(α/2, n−2); t(0.025, 8) = 2.306, so reject if t > 2.306 or t < −2.306.
5. Conclusion: The slope of the equation is significantly different from zero; there is a linear relationship between the two variables, and thus the model is valid.

One-tail Test on Slope Coefficient - Copier Sales Example
1. H0: β = 0
   H1: β > 0 (positive slope)
2. α = .05
3. Test statistic: t = (b − 0) / sb = (1.1842 − 0) / 0.3591 = 3.297
4. Rejection region: reject H0 if t > t(α, n−2) = t(0.05, 8) = 1.860.
5.
Conclusion: the test statistic is 3.297, which exceeds the critical value of 1.860. Thus we reject the null hypothesis and conclude that the slope is positive.

Fitness of the Model - Coefficient of Determination
The coefficient of determination (r²) is the proportion of the total variation in the dependent variable (Y) that is explained, or accounted for, by the variation in the independent variable (X).
- It is the square of the coefficient of correlation.
- It ranges from 0 to 1.
- The higher r² is, the better the model fits the data.

Coefficient of Determination (r²) - Copier Sales Example
Variation in Y = variation explained by the variation in X + unexplained variation
r² = (variation explained by variation in X) / (total variation in Y) = SSR / SS Total
- The coefficient of determination, r², is 0.576, found by (0.759)² [recall r = 0.759].
- Interpretation: 57.6 percent of the variation in the number of copiers sold is explained, or accounted for, by the variation in the number of sales calls.

Confidence Interval and Prediction Interval Estimates of Y
- A confidence interval reports the mean value of Y for a given X.
- A prediction interval reports the range of values of Y for a particular value of X.

Confidence Interval Estimate - Example
We return to the Copier Sales of America illustration. Determine a 95 percent confidence interval for all sales representatives who make 25 calls.
Ŷ = 18.9476 + 1.1842X = 18.9476 + 1.1842(25) = 48.5526
t(α/2, n−2) = t(.025, 8) = 2.306
Thus, the 95 percent confidence interval for the average sales of all sales representatives who make 25 calls is from 40.9170 up to 56.1882 copiers.

Prediction Interval Estimate - Example
We return to the Copier Sales of America illustration. Determine a 95 percent prediction interval for Sheila Baker, a West Coast sales representative who made 25 calls.
If Sheila Baker makes 25 sales calls, the number of copiers she will sell will be between about 24 and 73 copiers.
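The interval endpoints above can be reproduced with the standard small-sample formulas Ŷ ± t·se·√(1/n + (X − X̄)²/Σ(X − X̄)²) for the confidence interval and Ŷ ± t·se·√(1 + 1/n + (X − X̄)²/Σ(X − X̄)²) for the prediction interval, where se is the standard error of estimate. A Python sketch (the fitted coefficients and the t value 2.306 are the rounded figures from the example):

```python
import math

calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

n = len(calls)
mean_x = sum(calls) / n
ss_x = sum((x - mean_x) ** 2 for x in calls)

# Fitted line from the example: Y-hat = 18.9476 + 1.1842 X
a, b = 18.9476, 1.1842
residuals = [y - (a + b * x) for x, y in zip(calls, sales)]
s_e = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))  # standard error of estimate, about 9.90

x0 = 25
y_hat = a + b * x0      # 48.5526
t = 2.306               # t(.025, 8) from the t table

half_ci = t * s_e * math.sqrt(1 / n + (x0 - mean_x) ** 2 / ss_x)
half_pi = t * s_e * math.sqrt(1 + 1 / n + (x0 - mean_x) ** 2 / ss_x)

print(round(y_hat - half_ci, 2), round(y_hat + half_ci, 2))  # 40.92 56.19
print(round(y_hat - half_pi, 2), round(y_hat + half_pi, 2))  # 24.48 72.63
```

The prediction interval is wider than the confidence interval because it includes the extra "1" term for the variability of an individual observation around the mean, which is why Sheila Baker's range (about 24 to 73) is so much wider than the interval for the average of all representatives making 25 calls.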
Confidence and Prediction Intervals - Minitab Illustration
[Minitab output showing the fitted line with 95% confidence and prediction bands.]

Regression Analysis Using Excel
(See the Excel instructions in Lind et al., p. 510, #2.)

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.759014    (equals |r|; the sign of r depends on the sign of b)
R Square             0.576102    (coefficient of determination, r²)
Adjusted R Square    0.523115
Standard Error       9.900824
Observations         10

ANOVA
             df    SS         MS         F          Significance F
Regression    1    1065.789   1065.789   10.87248   0.010902
Residual      8    784.2105   98.02632
Total         9    1850

With SSR = 1065.789 and SS Total = 1850, r² = 1065.789 / 1850 = .576.

             Coefficients  Standard Error  t Stat     P-value    Lower 95%  Upper 95%
Intercept    18.94737      8.498819        2.229412   0.056349   -0.65094   38.54568
Calls        1.184211      0.359141        3.297345   0.010902   0.356031   2.01239

The Coefficients column gives the estimated a and b; the t Stat and P-value in the Calls row are the test statistic and p-value for the test on the slope.

Error Term and Residuals
Simple linear regression model: Y = α + βX + ε
The error term ε accounts for all other variables, measurable and immeasurable, that are not part of the model but can affect the magnitude of Y. In the copier sales example, the error term may represent the salesperson's skills, effort, etc. The error term varies from one salesperson to another even if they make the same number of calls, that is, for the same value of X. The error term makes the model probabilistic and thus differentiates it from deterministic models like Profit = Revenue − Cost.

Error Term and Residuals
Assumptions about the error term:
- The probability distribution of ε is normal.
- The mean of the distribution is 0.
- The variance of ε is constant regardless of the value of Ŷ.
- The value of ε associated with any particular value of Y is independent of the ε associated with any other value of Y. In other words, we require the error terms to be independent of each other.
Error Term and Residuals
Residuals: the estimated error terms, denoted ε̂.
The residuals can be used to investigate whether the assumptions about the error term are satisfied for a regression analysis; we will explore this later.
The residuals measure the vertical distance between each observation and the regression line. The residual for a specific observation is calculated as:
ε̂ = Y − Ŷ = Y − (a + bX)

Error Term and Residuals - Example
In the copier sales example, the estimated regression equation is Ŷ = 18.9476 + 1.1842X.
Let's calculate the residual that corresponds to the observation of Soni Jones, who made 30 calls (X) and sold 70 copiers (Y):
ε̂ = Y − Ŷ = 70 − (18.9476 + 1.1842 × 30) = 70 − 54.4736 = 15.5264

Error Term and Residuals - Example
LEAST SQUARES PRINCIPLE: Determining a regression equation by minimizing the sum of the squares of the vertical distances between the actual Y values and the predicted values of Y.
If we take the residuals from the regression obtained by the least squares principle, square them, and sum the squares, the total is the smallest among all possible regression lines that we could fit to the original data.

Representative    Calls (X)  Sales (Y)     Ŷ        ε̂       ε̂²
Tom Keller           20         30       42.63   -12.63   159.56
Jeff Hall            40         60       66.32    -6.32    39.89
Brian Virost         20         40       42.63    -2.63     6.93
Greg Fish            30         60       54.47     5.53    30.54
Susan Welch          10         30       30.79    -0.79     0.62
Carlos Ramirez       10         40       30.79     9.21    84.83
Rich Niles           20         40       42.63    -2.63     6.93
Mike Keil            20         50       42.63     7.37    54.29
Mark Reynolds        20         30       42.63   -12.63   159.56
Soni Jones           30         70       54.47    15.53   241.07
SUM (smallest possible)                                   784.21
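The "smallest possible" claim in the table can be checked numerically: the least squares line yields a smaller sum of squared residuals than any other line we might draw through the data. A small Python sketch (the alternative intercept and slope, 20 and 1.1, are arbitrary values chosen only for comparison):

```python
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

def sse(a, b):
    """Sum of squared residuals for the line Y-hat = a + b X."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(calls, sales))

# Least squares line from the example: matches the 784.21 total in the table
print(round(sse(18.9476, 1.1842), 2))  # 784.21

# An arbitrary competing line gives a larger sum of squared residuals
print(round(sse(20, 1.1), 2))          # 796.0
print(sse(18.9476, 1.1842) < sse(20, 1.1))  # True
```

This also reproduces the Residual SS (784.2105) in the Excel ANOVA table shown earlier.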