12 Simple Linear Regression and Correlation
Copyright © Cengage Learning. All rights reserved.

12.5 Correlation

There are many situations in which the objective in studying the joint behavior of two variables is to see whether they are related, rather than to use one to predict the value of the other.

In this section, we first develop the sample correlation coefficient r as a measure of how strongly related two variables x and y are in a sample and then relate r to the population correlation coefficient ρ.

The Sample Correlation Coefficient r

Given n numerical pairs (x1, y1), (x2, y2), …, (xn, yn), it is natural to speak of x and y as having a positive relationship if large x’s are paired with large y’s and small x’s with small y’s. Similarly, if large x’s are paired with small y’s and small x’s with large y’s, then a negative relationship between the variables is implied. Consider the quantity

$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

Then if the relationship is strongly positive, an xi above the mean x̄ will tend to be paired with a yi above the mean ȳ, so that (xi − x̄)(yi − ȳ) > 0, and this product will also be positive whenever both xi and yi are below their respective means. Thus a positive relationship implies that Sxy will be positive. An analogous argument shows that when the relationship is negative, Sxy will be negative, since most of the products will be negative.

This is illustrated in Figure 12.19.

[Figure 12.19: (a) Scatter plot with Sxy positive; (b) scatter plot with Sxy negative. + means (xi − x̄)(yi − ȳ) > 0, and − means (xi − x̄)(yi − ȳ) < 0.]

Although Sxy seems a plausible measure of the strength of a relationship, we do not yet have any idea of how positive or negative it can be.
Unfortunately, Sxy has a serious defect: By changing the unit of measurement for either x or y, Sxy can be made either arbitrarily large in magnitude or arbitrarily close to zero. For example, if Sxy = 25 when x is measured in meters, then Sxy = 25,000 when x is measured in millimeters and .025 when x is expressed in kilometers.

A reasonable condition to impose on any measure of how strongly x and y are related is that the calculated measure should not depend on the particular units used to measure them. This condition is achieved by modifying Sxy to obtain the sample correlation coefficient.

Definition
The sample correlation coefficient for the n pairs (x1, y1), …, (xn, yn) is

$$r = \frac{S_{xy}}{\sqrt{\sum (x_i - \bar{x})^2}\,\sqrt{\sum (y_i - \bar{y})^2}} = \frac{S_{xy}}{\sqrt{S_{xx}}\,\sqrt{S_{yy}}} \qquad (12.8)$$

Example 15
An accurate assessment of soil productivity is critical to rational land-use planning. Unfortunately, as the author of the article “Productivity Ratings Based on Soil Series” (Prof. Geographer, 1980: 158–163) argues, an acceptable soil productivity index is not so easy to come by. One difficulty is that productivity is determined partly by which crop is planted, and the relationship between the yield of two different crops planted in the same soil may not be very strong.

To illustrate, the article presents data on corn yield x and peanut yield y (mT/Ha) for eight different types of soil.

[Data table, summary quantities, and the resulting value of r are missing from this copy.]

Properties of r
The most important properties of r are as follows:
1. The value of r does not depend on which of the two variables under study is labeled x and which is labeled y.
2. The value of r is independent of the units in which x and y are measured.
3. −1 ≤ r ≤ 1
4. r = 1 if and only if (iff) all (xi, yi) pairs lie on a straight line with positive slope, and r = −1 iff all (xi, yi) pairs lie on a straight line with negative slope.
5.
The square of the sample correlation coefficient gives the value of the coefficient of determination that would result from fitting the simple linear regression model; in symbols, (r)² = r².

Property 1 stands in marked contrast to what happens in regression analysis, where virtually all quantities of interest (the estimated slope, estimated y-intercept, s², etc.) depend on which of the two variables is treated as the dependent variable. However, Property 5 shows that the proportion of variation in the dependent variable explained by fitting the simple linear regression model does not depend on which variable plays this role.

Property 2 is equivalent to saying that r is unchanged if each xi is replaced by cxi and each yi by dyi, where c and d are positive constants (a change in the scale of measurement), as well as if each xi is replaced by xi − a and each yi by yi − b (which changes the location of zero on the measurement axis). This implies, for example, that r is the same whether temperature is measured in °F or °C.

Property 3 tells us that the maximum value of r, corresponding to the largest possible degree of positive relationship, is r = 1, whereas the most negative relationship is identified with r = −1. According to Property 4, the largest positive and largest negative correlations are achieved only when all points lie along a straight line. Any other configuration of points, even if the configuration suggests a deterministic relationship between variables, will yield an r value less than 1 in absolute magnitude.

Thus r measures the degree of linear relationship between the variables. A value of r near 0 is not evidence of the lack of a strong relationship, but only of the absence of a linear relation, so such a value of r must be interpreted with caution. Figure 12.20 illustrates several configurations of points associated with different values of r.
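The properties discussed above can be checked numerically. The following is a minimal sketch, not part of the original text: it assumes the standard formula r = Sxy/(√Sxx √Syy), and the helper name `sample_corr` and all data values are made up purely for illustration.

```python
import math

def sample_corr(x, y):
    """Sample correlation r = Sxy / sqrt(Sxx * Syy) (illustrative helper)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: temperature in degrees C and some response y.
temps_c = [10.0, 15.0, 20.0, 25.0, 30.0]
y = [3.1, 4.0, 4.2, 5.9, 6.3]

# Property 2: changing units (C -> F is a shift plus a positive rescaling).
temps_f = [1.8 * c + 32.0 for c in temps_c]
r_c = sample_corr(temps_c, y)
r_f = sample_corr(temps_f, y)

# Property 1: swapping the roles of x and y.
r_swapped = sample_corr(y, temps_c)

# Property 4: points exactly on a line with negative slope.
x_sym = [-2.0, -1.0, 0.0, 1.0, 2.0]
r_line = sample_corr(x_sym, [3.0 - 2.0 * v for v in x_sym])

# Deterministic but nonlinear relationship (y = x^2 on a symmetric grid).
r_quad = sample_corr(x_sym, [v ** 2 for v in x_sym])

print(r_c, r_f, r_swapped, r_line, r_quad)
```

In this sketch r_c, r_f, and r_swapped agree up to floating-point rounding, r_line equals −1, and r_quad is 0 even though y is an exact function of x, which is the caution about interpreting r near 0.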
[Figure 12.20: Data plots for different values of r. (a) r near +1; (b) r near −1; (c) r near 0, no apparent relationship; (d) r near 0, nonlinear relationship.]

A frequently asked question is, “When can it be said that there is a strong correlation between the variables, and when is the correlation weak?” Here is an informal rule of thumb for characterizing the value of r:

Weak: −.5 ≤ r ≤ .5
Moderate: either −.8 < r < −.5 or .5 < r < .8
Strong: either r ≥ .8 or r ≤ −.8

It may surprise you that an r as substantial as .5 or −.5 goes in the weak category. The rationale is that if r = .5 or −.5, then r² = .25 in a regression with either variable playing the role of y. A regression model that explains at most 25% of observed variation is not in fact very impressive. In Example 15, the correlation between corn yield and peanut yield would be described as weak.

Inferences About the Population Correlation Coefficient

The correlation coefficient r is a measure of how strongly related x and y are in the observed sample. We can think of the pairs (xi, yi) as having been drawn from a bivariate population of pairs, with (Xi, Yi) having some joint pmf or pdf. We defined the correlation coefficient ρ(X, Y) by

$$\rho = \rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

where

$$\mathrm{Cov}(X, Y) = \begin{cases} \sum_x \sum_y (x - \mu_X)(y - \mu_Y)\, p(x, y) & (X, Y) \text{ discrete} \\ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mu_X)(y - \mu_Y)\, f(x, y)\, dx\, dy & (X, Y) \text{ continuous} \end{cases}$$

If we think of p(x, y) or f(x, y) as describing the distribution of pairs of values within the entire population, ρ becomes a measure of how strongly related x and y are in that population.

The population correlation coefficient ρ is a parameter or population characteristic, just as μX, μY, σX, and σY are, so we can use the sample correlation coefficient to make various inferences about ρ. In particular, r is a point estimate for ρ, and the corresponding estimator is

$$\hat{\rho} = R = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2}\,\sqrt{\sum (Y_i - \bar{Y})^2}}$$

Example 16
In some locations, there is a strong association between concentrations of two different pollutants.
The article “The Carbon Component of the Los Angeles Aerosol: Source Apportionment and Contributions to the Visibility Budget” (J. of Air Pollution Control Fed., 1984: 643–650) reports data on ozone concentration x (ppm) and secondary carbon concentration y (μg/m³).

[Data table missing from this copy.]

The summary quantities are n = 16, Σxi = 1.656, Σyi = 170.6, Σxi² = .196912, Σxiyi = 20.0397, and Σyi² = 2253.56, from which

$$r = \frac{20.0397 - (1.656)(170.6)/16}{\sqrt{.196912 - (1.656)^2/16}\,\sqrt{2253.56 - (170.6)^2/16}} = \frac{2.3826}{(.1597)(20.8456)} = .716$$

The point estimate of the population correlation coefficient ρ between ozone concentration and secondary carbon concentration is ρ̂ = r = .716.

The small-sample intervals and test procedures presented in Chapters 7–9 were based on an assumption of population normality. To test hypotheses about ρ, an analogous assumption about the distribution of pairs of (x, y) values in the population is required. We are now assuming that both X and Y are random, whereas much of our regression work focused on x fixed by the experimenter.

Assumption
The joint probability distribution of (X, Y) is specified by

$$f(x, y) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}} \exp\left\{ -\frac{\left(\frac{x - \mu_1}{\sigma_1}\right)^2 - 2\rho \frac{(x - \mu_1)(y - \mu_2)}{\sigma_1 \sigma_2} + \left(\frac{y - \mu_2}{\sigma_2}\right)^2}{2(1 - \rho^2)} \right\}, \quad -\infty < x < \infty,\ -\infty < y < \infty \qquad (12.9)$$

where μ1 and σ1 are the mean and standard deviation of X, and μ2 and σ2 are the mean and standard deviation of Y; f(x, y) is called the bivariate normal probability distribution.

The bivariate normal distribution is obviously rather complicated, but for our purposes we need only a passing acquaintance with several of its properties. The surface determined by f(x, y) lies entirely above the x, y plane [f(x, y) ≥ 0] and has a three-dimensional bell- or mound-shaped appearance, as illustrated in Figure 12.21.

[Figure 12.21: A graph of the bivariate normal pdf.]

If we slice through the surface with any plane perpendicular to the x, y plane and look at the shape of the curve sketched out on the “slicing plane,” the result is a normal curve.
More precisely, if X = x, it can be shown that the (conditional) distribution of Y is normal with mean μY·x = μ2 − ρμ1σ2/σ1 + ρσ2x/σ1 and variance (1 − ρ²)σ2². This is exactly the model used in simple linear regression with β0 = μ2 − ρμ1σ2/σ1, β1 = ρσ2/σ1, and σ² = (1 − ρ²)σ2² independent of x.

The implication is that if the observed pairs (xi, yi) are actually drawn from a bivariate normal distribution, then the simple linear regression model is an appropriate way of studying the behavior of Y for fixed x. If ρ = 0, then μY·x = μ2 independent of x; in fact, when ρ = 0, the joint probability density function f(x, y) of (12.9) can be factored as f1(x)f2(y), which implies that X and Y are independent variables.

Assuming that the pairs are drawn from a bivariate normal distribution allows us to test hypotheses about ρ and to construct a CI. There is no completely satisfactory way to check the plausibility of the bivariate normality assumption. A partial check involves constructing two separate normal probability plots, one for the sample xi’s and another for the sample yi’s, since bivariate normality implies that the marginal distributions of both X and Y are normal. If either plot deviates substantially from a straight-line pattern, the following inferential procedures should not be used for small n.

Testing for the Absence of Correlation
When H0: ρ = 0 is true, the test statistic

$$T = \frac{R\sqrt{n - 2}}{\sqrt{1 - R^2}}$$

has a t distribution with n − 2 df.

Alternative Hypothesis    Rejection Region for Level α Test
Ha: ρ > 0                 t ≥ tα,n−2
Ha: ρ < 0                 t ≤ −tα,n−2
Ha: ρ ≠ 0                 either t ≥ tα/2,n−2 or t ≤ −tα/2,n−2

A P-value based on n − 2 df can be calculated as described previously.
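As a rough illustration of the test for the absence of correlation, the sketch below computes t = r√(n − 2)/√(1 − r²) for the values r = .716 and n = 16 reported in Example 16. The choice of a two-tailed test at level .05 is an assumption made here for illustration only, and the critical value t.025,14 = 2.145 is taken from a standard t table rather than computed; the function name `corr_t_statistic` is hypothetical.

```python
import math

def corr_t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    if n < 3:
        raise ValueError("need at least 3 pairs")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r ** 2)

# Values reported in Example 16.
r, n = 0.716, 16
t_stat = corr_t_statistic(r, n)

# ASSUMPTION: two-tailed test at level .05; 2.145 is the tabled value of
# t_{.025, 14} (upper-tail area .025 with n - 2 = 14 df).
T_CRIT = 2.145
reject_h0 = abs(t_stat) >= T_CRIT

print(f"t = {t_stat:.2f}, reject H0: rho = 0 at level .05? {reject_h0}")
```

Under these assumptions the statistic comes out near 3.84, well beyond the tabled critical value, so H0: ρ = 0 would be rejected at level .05.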
Because ρ measures the extent to which there is a linear relationship between the two variables in the population, the null hypothesis H0: ρ = 0 states that there is no such population relationship. In Section 12.3, we used the t ratio β̂1/sβ̂1 to test for a linear relationship between the two variables in the context of regression analysis. It turns out that the two test procedures are completely equivalent because

$$\frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} = \frac{\hat{\beta}_1}{s_{\hat{\beta}_1}}$$

When interest lies only in assessing the strength of any linear relationship rather than in fitting a model and using it to estimate or predict, the test statistic formula just presented requires fewer computations than does the t ratio.

Other Inferences Concerning ρ

The procedure for testing H0: ρ = ρ0 when ρ0 ≠ 0 is not equivalent to any procedure from regression analysis. The test statistic is based on a transformation of R called the Fisher transformation.

Proposition
When (X1, Y1), …, (Xn, Yn) is a sample from a bivariate normal distribution, the rv

$$V = \frac{1}{2} \ln\left(\frac{1 + R}{1 - R}\right) \qquad (12.10)$$

has approximately a normal distribution with mean and variance

$$\mu_V = \frac{1}{2} \ln\left(\frac{1 + \rho}{1 - \rho}\right), \qquad \sigma_V^2 = \frac{1}{n - 3}$$

The rationale for the transformation is to obtain a function of R that has a variance independent of ρ; this would not be the case with R itself. Also, the transformation should not be used if n is quite small, since the approximation will not be valid.

The test statistic for testing H0: ρ = ρ0 is

$$Z = \frac{V - \frac{1}{2} \ln\left[(1 + \rho_0)/(1 - \rho_0)\right]}{1/\sqrt{n - 3}}$$

Alternative Hypothesis    Rejection Region for Level α Test
Ha: ρ > ρ0                z ≥ zα
Ha: ρ < ρ0                z ≤ −zα
Ha: ρ ≠ ρ0                either z ≥ zα/2 or z ≤ −zα/2

A P-value can be calculated in the same manner as for previous z tests.

Example 18
The article “Size Effect in Shear Strength of Large Beams—Behavior and Finite Element Modelling” (Mag. of Concrete Res., 2005: 497–509) reported on a study of various characteristics of large reinforced concrete deep and shallow beams tested until failure.
Consider the following data on x = cube strength and y = cylinder strength (both in MPa):

[Data table missing from this copy.]

Then Sxx = 367.74, Syy = 488.54, and Sxy = 322.37, from which r = .761. Does this provide strong evidence for concluding that the two measures of strength are at least moderately positively correlated? Our previous interpretation of moderate positive correlation was .5 < ρ < .8, so we wish to test H0: ρ = .5 versus Ha: ρ > .5. The computed value of V is then

$$v = .5 \ln\left(\frac{1 + .761}{1 - .761}\right) = .999 \quad \text{and} \quad .5 \ln\left(\frac{1 + .5}{1 - .5}\right) = .549$$

Thus z = (.999 − .549)√(n − 3), where n is the number of (x, y) pairs in the data set. The P-value for an upper-tailed test is .0359 (which corresponds to z ≈ 1.80). The null hypothesis can therefore be rejected at significance level .05 but not at level .01. This latter result is somewhat surprising in light of the magnitude of r, but when n is small, a reasonably large r may result even when ρ is not all that substantial. At significance level .01, the evidence for a moderately positive correlation is not compelling.

To obtain a CI for ρ, we first derive an interval for μV = ½ ln[(1 + ρ)/(1 − ρ)]. Standardizing V, writing a probability statement, and manipulating the resulting inequalities yields

$$\left( v - \frac{z_{\alpha/2}}{\sqrt{n - 3}},\ v + \frac{z_{\alpha/2}}{\sqrt{n - 3}} \right) \qquad (12.11)$$

as a 100(1 − α)% interval for μV, where v = ½ ln[(1 + r)/(1 − r)]. This interval can then be manipulated to yield a CI for ρ.

A 100(1 − α)% confidence interval for ρ is

$$\left( \frac{e^{2c_1} - 1}{e^{2c_1} + 1},\ \frac{e^{2c_2} - 1}{e^{2c_2} + 1} \right)$$

where c1 and c2 are the left and right endpoints, respectively, of the interval (12.11).
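The Fisher-transformation test and the confidence interval for ρ can be sketched in a few lines using only the Python standard library. This is an illustrative sketch, not the textbook's own computation: it relies on the identities ½ ln[(1 + r)/(1 − r)] = atanh(r) and (e^{2c} − 1)/(e^{2c} + 1) = tanh(c), the function names are hypothetical, and the sample size n = 19 is an assumption (the Example 18 data table is not available here), chosen because it gives a P-value close to the .0359 quoted in that example.

```python
import math
from statistics import NormalDist

def fisher_v(r):
    """Fisher transformation v = 0.5 * ln((1 + r) / (1 - r)) = atanh(r)."""
    return math.atanh(r)

def fisher_z_test(r, n, rho0):
    """z statistic and upper-tailed P-value for H0: rho = rho0 vs Ha: rho > rho0."""
    if n <= 3:
        raise ValueError("n must exceed 3")
    z = (fisher_v(r) - fisher_v(rho0)) * math.sqrt(n - 3)
    p_upper = 1.0 - NormalDist().cdf(z)
    return z, p_upper

def rho_confidence_interval(r, n, conf=0.95):
    """Approximate CI for rho: build the interval for mu_V, then map back with tanh."""
    if n <= 3:
        raise ValueError("n must exceed 3")
    z_crit = NormalDist().inv_cdf(1.0 - (1.0 - conf) / 2.0)
    half_width = z_crit / math.sqrt(n - 3)
    c1 = fisher_v(r) - half_width
    c2 = fisher_v(r) + half_width
    return math.tanh(c1), math.tanh(c2)

r, rho0 = 0.761, 0.5   # values from Example 18
n = 19                 # ASSUMPTION: sample size is not given in the text above

z, p_upper = fisher_z_test(r, n, rho0)
lo, hi = rho_confidence_interval(r, n, conf=0.95)

print(f"z = {z:.2f}, upper-tailed P-value = {p_upper:.4f}")
print(f"approximate 95% CI for rho: ({lo:.3f}, {hi:.3f})")
```

With the assumed n = 19, z is about 1.80 and the P-value is about .036; the small difference from .0359 comes from carrying v unrounded rather than as .999. The 95% level for the interval is also just an illustrative choice.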