Chapter 10 Correlation and Regression

Chapter Problem: Can we predict the cost of a subway fare from the price of a slice of pizza?

Table 10-1 Cost of a Slice of Pizza, Subway Fare, and the CPI
Year           1960   1973   1986    1995    2002    2003
Cost of Pizza  0.15   0.35   1.00    1.25    1.75    2.00
Subway Fare    0.15   0.35   1.00    1.35    1.50    2.00
CPI            30.2   48.3   112.3   162.2   191.9   197.8

Figure 10-1 Scatterplot of Pizza Costs and Subway Costs (pizza cost on the x-axis, subway fare on the y-axis; the six points, labeled by year from 1960 to 2003, follow an approximately straight-line pattern)

• If there is a correlation between the two variables, how can it be described? Is there an equation that can be used to predict the cost of a subway fare given the cost of a slice of pizza?
• If we can predict the cost of a subway fare, how accurate is the prediction?
• Is there also a correlation between the CPI and the cost of a subway fare, and if so, is the CPI better for predicting the cost of a subway fare?

10.1 Review and Preview

Ch. 9 presented methods for making inferences from two samples
• Sec 9-4 considers two dependent samples
• Sec 9-4 considers differences between the paired values, and illustrates two methods
  – Hypothesis test for a claim about the population of differences
  – Confidence interval for estimating the mean of all such differences
Ch. 10 introduces methods for determining whether a correlation, or association, between two variables exists and whether the correlation is linear.
• Sec 10-2 introduces the correlation coefficient between two variables
• Sec 10-3 introduces regression analysis, which allows us to describe the relationship with an equation

10.2 Correlation

Key Concept
• Linear correlation coefficient r
  – Numerical measure of the strength of the relationship between two variables
• Use paired sample data to find r
• Use the value of r to conclude whether there is a relationship between the two variables
• We consider only linear relationships

Part I: Basic Concepts of Correlation

Definition – A correlation exists between two variables when the values of one variable are somehow associated with the values of the other variable.
• e.g. Table 10-1: x = cost of pizza, y = cost of subway fare
• Explore the data with a scatterplot first

Figure 10-2 Scatterplots (four examples: r = 0.721; r = 0.851; a nonlinear relationship with r = 0.087; r = 0)

Linear Correlation Coefficient

Definition – The linear correlation coefficient r measures the strength of the linear relationship between the paired quantitative x- and y-values in a sample. Its value is computed with Formula 10-1 or Formula 10-2 (next pages). The linear correlation coefficient r is sometimes referred to as the Pearson product moment correlation coefficient, in honor of Karl Pearson (1857–1936).

Computing r: Requirements
1. The paired (x, y) data are a simple random sample of independent quantitative data
2. Visual examination of the scatterplot confirms a straight-line pattern
3. Any outliers must be removed if they are known to be errors
Formally, the paired (x, y) data should have a bivariate normal distribution, but this requirement is usually difficult to check. For now, we use requirements 2 and 3 instead.
Notation for the Linear Correlation Coefficient
n        number of pairs of data present
Σ        denotes the addition of the items indicated
Σx       the sum of all x-values
Σx²      each x-value is squared, and then those squares are added
(Σx)²    the x-values are added, and the total is then squared. It is extremely important to avoid confusing Σx² and (Σx)²
Σxy      each x-value is multiplied by its paired y-value, and then those products are added
r        the linear correlation coefficient for a sample
ρ        the linear correlation coefficient for a population

Formula 10-1
r = [n·Σxy − (Σx)(Σy)] / [√(n·Σx² − (Σx)²) · √(n·Σy² − (Σy)²)]

Formula 10-2
r = Σ(z_x · z_y) / (n − 1)
Here z_x and z_y are the z-scores for the sample values x and y, respectively.

Interpreting r
• Using software, compute the P-value
  – If the P-value is less than the significance level, conclude that there is a linear correlation
  – Otherwise, conclude that there is no linear correlation
• Using Table A-6, find the critical value r₀ (depending on n and α)
  – If |r| > r₀, conclude that there is a linear correlation
  – Otherwise, conclude that there is no linear correlation
Round the linear correlation coefficient to three decimal places (so that it can be compared to the critical values in Table A-6). Round only final results.

Properties of the Linear Correlation Coefficient (LCC) r
1. The value of r is always between −1 and 1. That is, −1 ≤ r ≤ 1
2. The value of r does not change if all values of either variable are converted to a different scale
3. The value of r is not affected by the choice of x and y: interchange all x- and y-values and the value of r will not change
4. The LCC measures the strength of a linear relationship. It is not designed to measure the strength of a relationship that is not linear
5. The LCC is very sensitive to outliers

Computing r

e.g.1 Calculating r using software (TI-84 for the P-value and r)
Table 10-2 Costs of a Slice of Pizza and Subway Fare (in dollars)
Cost of Pizza  0.15  0.35  1.00  1.25  1.75  2.00
Subway Fare    0.15  0.35  1.00  1.35  1.50  2.00
Sol.
1. Requirements: simple random data; linear pattern; no outliers
2. TI-84 (LinRegTTest): P-value = 0.00022195436, r = 0.987811

e.g.2 Calculating r using Formula 10-1
x (pizza)   y (subway)   x²        y²        xy
0.15        0.15         0.0225    0.0225    0.0225
0.35        0.35         0.1225    0.1225    0.1225
1.00        1.00         1.0000    1.0000    1.0000
1.25        1.35         1.5625    1.8225    1.6875
1.75        1.50         3.0625    2.2500    2.6250
2.00        2.00         4.0000    4.0000    4.0000
Σx = 6.50   Σy = 6.35    Σx² = 9.77   Σy² = 9.2175   Σxy = 9.4575
Sol.
1. Requirements: done in e.g.1
2. r = [n·Σxy − (Σx)(Σy)] / [√(n·Σx² − (Σx)²) · √(n·Σy² − (Σy)²)]
     = [6(9.4575) − (6.50)(6.35)] / [√(6(9.77) − (6.50)²) · √(6(9.2175) − (6.35)²)]
     = 15.47 / (√16.37 · √14.9825)
     ≈ 0.988

e.g.3 Calculating r using Formula 10-2
Sol. First, x̄ = 1.083333, s_x = 0.738693; ȳ = 1.058333, s_y = 0.706694, and
z_x = (x − x̄)/s_x,  z_y = (y − ȳ)/s_y
x (pizza)   y (subway)   z_x        z_y        z_x · z_y
0.15        0.15         −1.26349   −1.28533   1.62400
0.35        0.35         −0.99274   −1.00232   0.99504
1.00        1.00         −0.11281   −0.08254   0.00931
1.25        1.35         0.22562    0.41272    0.09312
1.75        1.50         0.90250    0.62498    0.56404
2.00        2.00         1.24093    1.33250    1.65354
                                    Σ(z_x·z_y) = 4.93905
r = Σ(z_x · z_y) / (n − 1) = 4.93905 / 5 = 0.988

e.g.4 Interpreting r. Interpret the value of r = 0.988 found in e.g.1, 2, and 3. Use a significance level of 0.05. Is there sufficient evidence to support a claim that there is a linear correlation between the cost of a slice of pizza and subway fares?
Ans. Using Table A-6: for n = 6 and α = 0.05, the critical value is 0.811.
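The sums-based calculation in e.g.2 can be checked with a short script (a sketch using only Python's standard library; the data are the pizza/subway pairs from Table 10-2):

```python
from math import sqrt

# Paired sample data from Table 10-2 (in dollars)
x = [0.15, 0.35, 1.00, 1.25, 1.75, 2.00]  # cost of pizza
y = [0.15, 0.35, 1.00, 1.35, 1.50, 2.00]  # subway fare

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v * v for v in x)
sum_y2 = sum(v * v for v in y)
sum_xy = sum(a * b for a, b in zip(x, y))

# Formula 10-1
r = (n * sum_xy - sum_x * sum_y) / (
    sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
)
print(round(r, 3))  # 0.988
```

The intermediate sums printed by such a script match the table above (Σxy = 9.4575, and so on), which is a quick way to catch arithmetic slips in a hand calculation.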
Since |0.988| > 0.811, we conclude that there is a linear correlation between x and y.

e.g.5 Interpreting r. Using a 0.05 significance level, interpret the value of r = 0.117 found using the 62 pairs of weights of discarded paper and glass listed in Data Set 22 in Appendix B. When the paired data are used with computer software, the P-value is found to be 0.364. Is there sufficient evidence to support the claim of a linear correlation between the weights of discarded paper and glass?
Ans.
1. Using technology: P-value = 0.364 > 0.05, so there is not sufficient evidence to support the claim that there is a linear correlation between the weights of discarded paper and glass.
2. Using Table A-6: for n = 62 and α = 0.05, the critical value is 0.254. Since |r| = 0.117 < 0.254, we reach the same conclusion as above.

Interpreting r: Explained Variation
The value of r² is the proportion of the variation in y that is explained by the linear relationship between x and y.

e.g.6 Explained Variation. Using the pizza/subway fare costs in Table 10-2, the LCC is found to be r = 0.988. What proportion of the variation in the subway fare can be explained by the variation in the costs of a slice of pizza?
Ans. r² = 0.976. Interpretation: about 97.6% of the variation in subway fares can be explained by the linear relationship between the costs of pizza and the subway fares. This implies that about 2.4% of the variation in subway fares cannot be explained by the costs of pizza.

Common Errors Involving Correlation
1. A common error is to conclude that correlation implies causality (pizza/subway)
2. Another error arises with data based on averages (averaging may make the LCC larger)
3.
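The explained-variation figure in e.g.6 follows directly from r (a minimal sketch; r is the unrounded value from e.g.1):

```python
# LCC for the pizza/subway data, from the TI-84 output in e.g.1
r = 0.987811

# Proportion of the variation in y explained by the linear relationship
explained = r ** 2
print(round(explained, 3))  # 0.976

# The remaining proportion is unexplained by the linear relationship
print(round(1 - explained, 3))  # 0.024
```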
The third error involves the property of linearity: a relationship may exist between x and y even when there is no linear correlation.

Part II: Formal Hypothesis Test

Method 1. Hypothesis Test for Correlation Using the Test Statistic r
Notation:
– n: number of pairs of sample data
– r: LCC for a sample of paired data
– ρ: LCC for a population of paired data
Requirements: see the beginning of the section.
Hypotheses:
H0: ρ = 0 (There is no linear correlation)
H1: ρ ≠ 0 (There is a linear correlation)
Test statistic: r
Critical values: use Table A-6
Conclusion:
• If |r| > the critical value, reject H0 and conclude that there is a linear correlation.
• If |r| ≤ the critical value, fail to reject H0; there is not sufficient evidence to conclude that there is a linear correlation.

e.g.7 Hypothesis Test with Pizza/Subway Fare Costs. Use the paired pizza/subway fare data in Table 10-2 to test the claim that there is a linear correlation between the costs of a slice of pizza and the subway fares. Use a 0.05 significance level.
Solution. The requirements are met as discussed in e.g.1. Consider the hypothesis test
H0: ρ = 0 (There is no linear correlation)
H1: ρ ≠ 0 (There is a linear correlation)
with r = 0.988. From Table A-6 with n = 6 and α = 0.05, the critical value is 0.811. Since |0.988| > 0.811, we reject H0. This indicates that there is a linear correlation between the two costs.

Method 2.
P-value Method for a Hypothesis Test for Correlation
Hypotheses:
H0: ρ = 0 (There is no linear correlation)
H1: ρ ≠ 0 (There is a linear correlation)
Test statistic t:
t = r / √((1 − r²)/(n − 2))
P-value: use Table A-3 with df = n − 2
Conclusion:
• If the P-value < significance level, reject H0 and conclude that there is a linear correlation
• Otherwise, fail to reject H0 and conclude that there is not sufficient evidence to support the claim of a linear correlation

e.g.8 Hypothesis Test with Pizza/Subway Fare Costs. Use the paired pizza/subway fare data in Table 10-2 and the P-value method to test the claim that there is a linear correlation between the costs of a slice of pizza and the subway fares. Use a 0.05 significance level.
Sol. Test statistic t:
t = r / √((1 − r²)/(n − 2)) = 0.988 / √((1 − 0.988²)/(6 − 2)) = 12.793
Table A-3 with df = n − 2 = 4 shows that the P-value is less than 0.01. Because the P-value is < 0.05, reject H0 and conclude that there is a linear correlation between the two costs.
Note: TI-84 and Statdisk can calculate the P-value exactly; on the TI-84, P-value = 0.00022.

One-tailed Tests
• The previous examples involve two-tailed tests
• A one-tailed test can occur with a claim of positive correlation or a claim of negative correlation:
Claim of negative correlation (left-tailed test): H0: ρ = 0, H1: ρ < 0
Claim of positive correlation (right-tailed test): H0: ρ = 0, H1: ρ > 0
• For both one-tailed tests, the P-value method can be handled as in earlier chapters

Rationale
• Formula 10-1: r = [n·Σxy − (Σx)(Σy)] / [√(n·Σx² − (Σx)²) · √(n·Σy² − (Σy)²)]
• Formula 10-2: r = Σ(z_x · z_y) / (n − 1)
• Other equivalent formulas:
  r = Σ(x − x̄)(y − ȳ) / [(n − 1)·s_x·s_y]
  r = s_xy / √(s_xx · s_yy)
• Rationale: self reading

10.3 Regression

Key Concept
• In 10.2 we computed the LCC r to decide whether there is a linear relationship between two variables x and y.
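The test statistic of e.g.8 can be reproduced from the raw data (a sketch with the standard library only; computing the exact P-value requires a t-distribution routine such as SciPy's, so only the statistic is shown here):

```python
from math import sqrt

# Paired data from Table 10-2
x = [0.15, 0.35, 1.00, 1.25, 1.75, 2.00]
y = [0.15, 0.35, 1.00, 1.35, 1.50, 2.00]
n = len(x)

# Linear correlation coefficient via Formula 10-1
num = n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
den = sqrt(n * sum(a * a for a in x) - sum(x) ** 2) * \
      sqrt(n * sum(b * b for b in y) - sum(y) ** 2)
r = num / den

# Test statistic t with df = n - 2
t = r / sqrt((1 - r ** 2) / (n - 2))
print(round(t, 2))  # about 12.7 (the slides get 12.793 by first rounding r to 0.988)
```

The small discrepancy with the slide value illustrates why the chapter recommends rounding only final results.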
• If there is, in this section we find a linear equation, ŷ = b0 + b1x, that best represents the relationship
• This line is called the regression line, and its equation is called the regression equation

Part 1: Basic Concept

Requirements (same as in 10.2)
1. The sample of paired (x, y) data is a random sample of quantitative data
2. The scatterplot shows a linear pattern
3. Outliers must be removed if they are known to be errors
Note: requirements 2 and 3 are simplified attempts at checking the more formal requirements for regression analysis (book page 537).

Notation for the Regression Equation
                                       Population parameter   Sample statistic
y-intercept of regression equation     β0                     b0
Slope of regression equation           β1                     b1
Equation of the regression line        y = β0 + β1x           ŷ = b0 + b1x
Rounding the slope b1 and the y-intercept b0: round b1 and b0 to three significant digits.
x: predictor variable, or independent variable
ŷ: response variable, or dependent variable

Definition – Given a collection of paired sample data, ŷ = b0 + b1x is called the regression equation. It describes the relationship between x and y algebraically. Its graph is called the regression line (or line of best fit, or least-squares line).

Finding the slope b1 and y-intercept b0 in the regression equation
Formula 10-3 (slope):     b1 = r · (s_y / s_x)
Formula 10-4 (intercept): b0 = ȳ − b1·x̄

e.g.1 (data from 10.2) Using technology to find the regression equation. Refer to the sample data in Table 10-2.
Table 10-2 Costs of a Slice of Pizza and Subway Fare (in dollars)
Cost of Pizza  0.15  0.35  1.00  1.25  1.75  2.00
Subway Fare    0.15  0.35  1.00  1.35  1.50  2.00
Sol.
1. Requirements: checked in 10.2.
2. Technology.
– TI-84 (LinRegTTest): a = 0.034560171, so b0 ≈ 0.0346; b = 0.9450213806, so b1 ≈ 0.945
– Statdisk
Equation of the regression line: ŷ = 0.0346 + 0.945x, or Subway = 0.0346 + 0.945 · Pizza

e.g.2 Using manual calculations to find the regression equation. Use Formulas 10-3 and 10-4 to find the equation of the regression line (summary statistics from the TI-84).
Sol. r = 0.987811, s_x = 0.738693, s_y = 0.706694, x̄ = 1.083333, ȳ = 1.058333
b1 = r · (s_y / s_x) = 0.987811 · (0.706694 / 0.738693) = 0.945
b0 = ȳ − b1·x̄ = 1.058333 − (0.945)(1.083333) = 0.0346
The regression equation is ŷ = 0.0346 + 0.945x

e.g.3 Graphing the regression equation. Plot the data and graph the line on the same xy-plane. (I don't have Minitab.)
Sol. Data: Table 10-2 Costs of a Slice of Pizza and Subway Fare (in dollars)
Cost of Pizza  0.15  0.35  1.00  1.25  1.75  2.00
Subway Fare    0.15  0.35  1.00  1.35  1.50  2.00
The regression equation: ŷ = 0.0346 + 0.945x
How do I plot two groups of data on the same Excel graph?

Using the Regression Equation for Predictions
Figure 10-5 Recommended Strategy for Predicting Values of y

e.g.4 Predicting Subway Fare. Table 10-5 includes the pizza/subway fare costs from the chapter problem, as well as the total number of runs scored in the baseball World Series for six different years. As of this writing, the cost of a slice of pizza in New York was $2.25, and 33 runs were scored in the last World Series.
Table 10-5 Costs of a Slice of Pizza and Subway Fare (in dollars), with World Series Runs
Cost of Pizza      0.15  0.35  1.00  1.25  1.75  2.00
Runs Scored in WS  82    45    59    42    85    38
Subway Fare        0.15  0.35  1.00  1.35  1.50  2.00
a. Use the pizza/subway fare data to predict the cost of a subway fare given that a slice of pizza costs $2.25.
b. Use the runs/subway fare data to predict the subway fare in a year in which 33 runs were scored in the World Series.
a.
Use the pizza/subway fare data to predict the cost of a subway fare given that a slice of pizza costs $2.25.
Sol. We already know that the regression equation is a good model.
ŷ = 0.0346 + 0.945x = 0.0346 + 0.945(2.25) ≈ $2.16

b. Use the runs/subway fare data to predict the subway fare in a year in which 33 runs were scored in the World Series.
Sol. (calculate r and the P-value with a TI-84)
– The regression line does not fit the data well
– LCC r = −0.332, which suggests that there is not a linear correlation
– The P-value is 0.520
– 33 runs is not too far beyond the scope of the available data
Because the regression model is not a good model, the best predicted subway fare is the mean, ȳ = $1.06.
Interpretation. Use the regression equation for predictions only if it is a good model. Otherwise, use ȳ as the predicted value.

Part 2: Beyond the Basics

Interpreting the Regression Equation: Marginal Change (skip Part 2 if time presses)
Definition – For two variables related by a regression equation, the marginal change is the amount that one variable changes when the other variable changes by exactly one unit. The slope b1 in the regression equation represents the marginal change in y that occurs when x changes by one unit.
e.g. For the pizza/subway fare data in the chapter problem, b1 = 0.945. So if x (the cost of a slice of pizza) increases by $1, the predicted cost of a subway fare increases by $0.945. That is, for every additional $1 increase in the cost of a slice of pizza, we expect the subway fare to increase by 94.5¢.

Outliers and Influential Points
Definition – In a scatterplot, an outlier is a point lying far away from the other data points. Paired sample data may include one or more influential points, which are points that strongly affect the graph of the regression line.
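The prediction strategy of e.g.4 (use the regression equation when the model is good, otherwise fall back to ȳ) can be sketched as follows. The helper `predict` is a hypothetical illustration, not from the book; the "good model" flag stands in for the r / P-value checks described above:

```python
import statistics as st

def predict(x_new, b0, b1, good_model, y_values):
    """Return the regression prediction if the model is good, else the mean of y.

    Hypothetical helper: 'good_model' encodes the outcome of the
    correlation / P-value checks discussed in the text.
    """
    if good_model:
        return b0 + b1 * x_new
    return st.mean(y_values)

subway = [0.15, 0.35, 1.00, 1.35, 1.50, 2.00]

# Part a: pizza/subway regression is a good model
fare_a = predict(2.25, 0.0346, 0.945, True, subway)
print(round(fare_a, 2))  # 2.16

# Part b: runs/subway regression is NOT a good model, so use the mean
fare_b = predict(33, 0.0346, 0.945, False, subway)
print(round(fare_b, 2))  # 1.06
```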
The book illustrates outliers and influential points with a technology example (e.g.5).

Residuals and the Least-Squares Property
Definition – For a pair of sample x- and y-values, the residual is the difference between the observed sample value of y and the y-value that is predicted by the regression equation. That is,
residual = observed y − predicted y = y − ŷ
Consider an example with data
x   1   2   4   5
y   4   24  8   32
The regression equation is ŷ = 5 + 4x. When x = 5, the observed value of y is 32, while the predicted value is ŷ = 5 + 4(5) = 25. So the residual is 32 − 25 = 7.

Figure 10-6 Residuals and Squares of Residuals (the line ŷ = 5 + 4x plotted with the four data points; the residuals −5, 11, −13, and 7 are drawn as vertical distances from the points to the line)

The regression equation represents the line that fits the points "best" according to the following least-squares property.
Definition – A straight line satisfies the least-squares property if the sum of the squares of the residuals is the smallest sum possible.
From Figure 10-6, the residuals are −5, 11, −13, and 7, so the sum of their squares is
(−5)² + 11² + (−13)² + 7² = 364
For any straight line other than the regression line, this sum is greater than 364.

Residual Plots (if time presses, skip this)
Definition – A residual plot is a scatterplot of the (x, y) values after each y-coordinate has been replaced by the residual y − ŷ (where ŷ denotes the predicted value of y). That is, a residual plot is a graph of the points (x, y − ŷ).
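The residual arithmetic for the four-point example can be verified directly (a minimal sketch with ŷ = 5 + 4x as given above):

```python
# Four-point example from the text
x = [1, 2, 4, 5]
y = [4, 24, 8, 32]

predicted = [5 + 4 * xi for xi in x]          # y-hat = 5 + 4x
residuals = [yi - pi for yi, pi in zip(y, predicted)]
ssr = sum(res ** 2 for res in residuals)      # sum of squared residuals

print(residuals)  # [-5, 11, -13, 7]
print(ssr)        # 364
```

Replacing (5, 4) with any other (intercept, slope) pair and recomputing `ssr` gives a value larger than 364, which is the least-squares property in action.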
When analyzing a residual plot, look for a pattern in the way the points are configured, and use these criteria:
• The residual plot should not have an obvious pattern that is not a straight-line pattern. (This confirms that a scatterplot of the sample data has a straight-line pattern.)
• The residual plot should not become thicker (or thinner) when viewed from left to right. (This confirms the requirement that for the different fixed values of x, the distributions of the corresponding y-values all have the same standard deviation.)

See the plot in ch10.xlsx: a residual plot of "Residual Versus Pizza" with residuals between about −0.25 and 0.2 and no obvious pattern.

Complete Regression Analysis
In Part 1, we identified simplified criteria for determining whether a regression equation is a good model. A more thorough analysis can be implemented with the following steps:
1. Construct a scatterplot and verify that the pattern of the points is approximately a straight line without outliers.
2. Construct a residual plot and verify that there is no pattern, and also verify that the residual plot does not become thicker (or thinner).
3. Use a histogram and/or a normal quantile plot to confirm that the residuals have a distribution that is approximately normal.
4. Consider any effects of a pattern over time.

Using Technology
• Statdisk
  – Compute the LCC
  – Graph a scatterplot with the regression line included (need to practice this!)
• Minitab (a powerful one, but we don't have it)
• Excel
• TI-83/84 Plus (does a lot)
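The residuals shown in the ch10.xlsx plot can be recomputed for checking (a sketch; the fitted line ŷ = 0.0346 + 0.945x is the one found in Section 10.3):

```python
# Pizza/subway data from Table 10-2 and the fitted regression line
pizza = [0.15, 0.35, 1.00, 1.25, 1.75, 2.00]
subway = [0.15, 0.35, 1.00, 1.35, 1.50, 2.00]
b0, b1 = 0.0346, 0.945

# Residuals y - y-hat for each pair, as plotted in "Residual Versus Pizza"
residuals = [y - (b0 + b1 * x) for x, y in zip(pizza, subway)]

# All residuals are small (roughly between -0.19 and 0.14), with no
# obvious pattern -- consistent with the residual-plot criteria above.
print([round(e, 3) for e in residuals])
```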