Linear Regression and Correlation

When performing research studies, scientists often wish to know whether two variables are related. If the variables are determined to be related, a scientist may then wish to find an equation that can be used to model the relationship. For instance, a geologist might want to know whether there is a relationship between the duration of an eruption of a geyser and the time between eruptions. A first step in this determination is to collect some data. Data involving two variables are called bivariate data. Table 1 gives bivariate data showing the time between two eruptions and the duration of the second eruption for 10 eruptions of the geyser Old Faithful.

Table 1.

| Time between eruptions (in seconds), $x$ | Duration of eruption (in seconds), $y$ |
|---|---|
| 272 | 89 |
| 227 | 79 |
| 237 | 83 |
| 238 | 82 |
| 203 | 81 |
| 270 | 85 |
| 218 | 78 |
| 226 | 81 |
| 250 | 85 |
| 245 | 79 |

Once the data are collected, a scatter diagram or scatter plot can be drawn, as shown below in Figure 2.

[Figure 2. Old Faithful: scatter plot of the ten data points in Table 1.]

Least-Squares Regression Line

For a set of bivariate data given as ordered pairs, the line that minimizes the sum of the squares of the vertical deviations from each data point to the line is referred to as the least-squares regression line, or least-squares line. The equation of the least-squares line for the $n$ ordered pairs $(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)$ is $\hat{y} = ax + b$, where

$$a = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2} \quad\text{and}\quad b = \bar{y} - a\bar{x}$$

The equation of the least-squares line can be used to predict the value of one variable when the value of the other variable is known.

To apply this formula to the data for Old Faithful, we first find the value of each summation.

$$\sum x = 2386, \quad \sum y = 822, \quad \sum x^2 = 573{,}560, \quad \sum xy = 196{,}636$$

Next, we use these values to find the value of $a$.
$$a = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2} = \frac{10(196{,}636) - (2386)(822)}{10(573{,}560) - (2386)^2} \approx 0.1189559666$$

We then find the values of $\bar{x}$ and $\bar{y}$,

$$\bar{x} = \frac{\sum x}{n} = \frac{2386}{10} = 238.6 \quad\text{and}\quad \bar{y} = \frac{\sum y}{n} = \frac{822}{10} = 82.2$$

and use them to find the $y$-intercept, $b$.

$$b = \bar{y} - a\bar{x} \approx 82.2 - (0.1189559666)(238.6) \approx 53.81710637$$

The regression equation is $\hat{y} = ax + b$, that is, $\hat{y} \approx 0.1189559666x + 53.81710637$. The graph of the regression equation and a scatter plot of the data are shown below.

[Figure: Old Faithful. Scatter plot of the data with the regression line $y = 0.1189559666x + 53.81710637$. Horizontal axis: seconds between eruptions. Vertical axis: length of the eruption.]

We can now use the regression equation to estimate the duration of an eruption given the time between eruptions. For instance, if the time between two eruptions is 200 seconds, then the estimated duration of the second eruption is

$$\hat{y} = 0.1189559666x + 53.81710637 = 0.1189559666(200) + 53.81710637 \approx 78$$

Thus, the approximate duration of the eruption is 78 seconds. ∎

Linear Correlation Coefficient

The linear correlation coefficient measures the strength of a linear relationship between two variables. The closer the value of $|r|$ is to 1, the stronger the linear relationship between the variables. For $n$ ordered pairs $(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)$, the linear correlation coefficient $r$ is given by

$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2} \cdot \sqrt{n\sum y^2 - \left(\sum y\right)^2}}$$

The Correlation Coefficient (r)

The sample correlation coefficient ($r$) is a measure of the closeness of association of the points in a scatter plot to a linear regression line based on those points, as in the example above for accumulated savings over time. Possible values of the correlation coefficient range from -1 to +1, with -1 indicating a perfectly linear negative (i.e., inverse) correlation, sloping downward, and +1 indicating a perfectly linear positive correlation, sloping upward.
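The least-squares and correlation formulas above can be checked numerically. The following is a minimal Python sketch, assuming the ten Old Faithful data pairs from Table 1. The function names `least_squares_line` and `correlation_coefficient` are illustrative choices and do not come from the source.

```python
from math import sqrt

# Old Faithful data from Table 1 (both variables in seconds).
x = [272, 227, 237, 238, 203, 270, 218, 226, 250, 245]
y = [89, 79, 83, 82, 81, 85, 78, 81, 85, 79]


def least_squares_line(xs, ys):
    """Return (a, b) for y-hat = a*x + b using the summation formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(v * v for v in xs)
    sum_xy = sum(u * v for u, v in zip(xs, ys))
    # Slope: a = (n*Sxy - Sx*Sy) / (n*Sx2 - Sx^2)
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: b = y-bar - a * x-bar
    b = sum_y / n - a * (sum_x / n)
    return a, b


def correlation_coefficient(xs, ys):
    """Return the linear correlation coefficient r from the summation formula."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(v * v for v in xs)
    sum_y2 = sum(v * v for v in ys)
    sum_xy = sum(u * v for u, v in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


a, b = least_squares_line(x, y)
r = correlation_coefficient(x, y)
print(f"a = {a:.10f}, b = {b:.8f}")
print(f"estimated duration at x = 200: {a * 200 + b:.1f} seconds")
print(f"r = {r:.4f}")
```

The printed slope and intercept should agree with the hand computation above, and the estimate at 200 seconds should round to 78. The same script also reports $r$ for the Old Faithful data, which the text does not compute.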
Describing Correlation Coefficients

The table below provides some guidelines for how to describe the strength of correlation coefficients, but these are just guidelines for description. Also, keep in mind that even weak correlations can be statistically significant, as you will learn shortly. The four images below give an idea of how some correlation coefficients might look on a scatter plot.

Example. A random sample of 50-year-old men ($n = 10$) was taken. Weight, height, and systolic blood pressure were measured, and BMI was computed. In this analysis, the independent (or predictor) variable $x$ is BMI and the dependent (or outcome) variable $y$ is systolic blood pressure (SBP).

Solution.

$$\sum x = 272.9, \quad \sum y = 1{,}395, \quad \sum x^2 = 7{,}734.11, \quad \sum y^2 = 199{,}675, \quad \sum xy = 39{,}106.5$$

Substitute these values into the formula for $r$:

$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2} \cdot \sqrt{n\sum y^2 - \left(\sum y\right)^2}} = \frac{10(39{,}106.5) - (272.9)(1{,}395)}{\sqrt{10(7{,}734.11) - (272.9)^2} \cdot \sqrt{10(199{,}675) - (1{,}395)^2}} \approx 0.86$$

Since $r \approx 0.86$ is close to 1, there is a strong positive correlation between BMI and systolic blood pressure.
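When only the summations are available, as in the BMI example above, $r$ can be computed directly from them. The following is a minimal Python sketch under that assumption. The function name `r_from_sums` and its parameter names are hypothetical and chosen only for illustration.

```python
from math import sqrt


def r_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Linear correlation coefficient computed from the summations alone."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


# Summations reported for the BMI (x) and systolic blood pressure (y) sample.
r = r_from_sums(
    n=10,
    sum_x=272.9,
    sum_y=1395,
    sum_x2=7734.11,
    sum_y2=199675,
    sum_xy=39106.5,
)
print(round(r, 2))
```

Rounded to two decimal places, the result should match the value of 0.86 obtained by hand.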