11.7 CORRELATION: HOW STRONG IS THE LINEAR RELATIONSHIP?

DEFINITION: The sample correlation coefficient r measures the strength of the linear relationship between two quantitative variables. It describes the direction of the linear association and indicates how close the points in a scatter-plot lie to the least squares regression line.

Features of the correlation coefficient

1. Range: $-1 \le r \le 1$.

2. Sign: The sign of the correlation coefficient indicates the direction of the association: negative for r in [-1, 0) or positive for r in (0, +1].

3. Magnitude: The magnitude of the correlation coefficient indicates the strength of the linear association. If the data follow a straight line exactly, then r = +1 (if the slope is positive) or r = -1 (if the slope is negative), indicating a perfect linear association. If r = 0, then there is no linear association.

4. Measures Strength: The correlation only measures the strength of the linear association.

5. Unit-less: The correlation is computed using standard scores of the two variables. It has no unit of measure, and the absolute value of r will not change if the units of measurement for x or y are changed. The correlation between x and y is the same as the correlation between y and x.

Some Pictures

[Scatter-plot: positive, moderate to strong linear association, r = 0.8]
[Scatter-plot: negative, weak linear association, r = -0.2]
[Scatter-plot: a strong association, just not a linear one, r = 0]

Let's Do It! 1 Matching Graphs

The scatter-plot #1 to the right yields a regression line of y = -2.6 + 1.1x and a correlation of r = 0.84. Using this information as a base, match each of the four scatter-plots below to the correct description of its regression line and correlation coefficient. The scales on the axes of the scatter-plots are the same.
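The features of r listed earlier in this section (range, symmetry in x and y, and independence from units) can also be checked numerically. The Python sketch below computes r from standard scores, as described under "Unit-less". The function name `correlation` and the height/weight data are hypothetical, made up purely for illustration; they are not part of the course materials.

```python
import math

def correlation(x, y):
    """Sample correlation r computed from standard scores (z-scores)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sample standard deviations (divisor n - 1).
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    # r is the sum of the products of the standard scores, divided by n - 1.
    products = (((a - mean_x) / sx) * ((b - mean_y) / sy) for a, b in zip(x, y))
    return sum(products) / (n - 1)

# Hypothetical data, invented for this illustration only.
height_in = [60, 62, 65, 68, 70, 72]
weight_lb = [115, 120, 140, 155, 160, 180]

r = correlation(height_in, weight_lb)

# Changing the units (inches to cm, pounds to kg) should leave r unchanged.
height_cm = [2.54 * h for h in height_in]
weight_kg = [0.4536 * w for w in weight_lb]
r_metric = correlation(height_cm, weight_kg)

# Swapping the roles of x and y should leave r unchanged.
r_swapped = correlation(weight_lb, height_in)

print(round(r, 4), round(r_metric, 4), round(r_swapped, 4))
```

All three printed values should agree, and each lies between -1 and 1.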
How to Calculate the Correlation Coefficient r

Dep var: PERCENT    N: 18    Multiple R: .840    Squared multiple R: .706
Adjusted squared multiple R: .688    Standard error of estimate: 10.547

Variable    Coefficient   Std error   Std coef   Tolerance   T         P(2 tail)
CONSTANT    96.681        6.289       0.000      .           15.373    0.000
LENGTH      -5.970        0.963       -0.840     .100E+01    -6.201    0.000

Analysis of Variance
Source       Sum-of-squares   DF   Mean-square   F-ratio   P
Regression   4276.908         1    4276.908      38.448    0.000
Residual     1779.815         16   111.238

"Multiple R: 0.840" is the absolute value of the correlation coefficient r. The sign of r can be determined by looking at the sign of the slope, which here is -5.970. The correlation between length of putt and percentage of putts made is therefore r = -0.84.

The formula:

$$ r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n\sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n\sum y_i^2 - \left(\sum y_i\right)^2}} $$

Example: Test 1 versus Test 2

Obtaining the Correlation Coefficient "By Hand"

We have already computed the summation quantities needed for finding r, shown in the calculation table.

Completed Calculation Table

          x_i    y_i    x_i^2    x_i y_i    y_i^2
          8      9      64       72         81
          10     13     100      130        169
          12     14     144      168        196
          14     15     196      210        225
          16     19     256      304        361
Total:    60     70     760      884        1032

With n = 5, $\sum x_i = 60$, $\sum y_i = 70$, $\sum x_i^2 = 760$, $\sum x_i y_i = 884$, and $\sum y_i^2 = 1032$:

$$ r = \frac{5(884) - (60)(70)}{\sqrt{5(760) - (60)^2}\,\sqrt{5(1032) - (70)^2}} = \frac{220}{\sqrt{200}\,\sqrt{260}} \approx 0.965 $$

The large positive correlation coefficient and the scatter-plot indicate a strong, positive, linear association between Test 1 and Test 2 scores.

Let's Do It! 2 Birth Rates

We gathered data from 1970 for twelve nations on the percentage of women aged 14 or older who were economically active and the crude birth rate. (We define the crude birth rate as the number of births in a year per 1000 population.) We are interested in the relationship between the crude birth rate (y) and the percentage of women who were economically active (x).

a. Create the scatter-plot. Determine whether there is a positive, negative, or no association between x and y.

Nation          x     y
Algeria         2     48
Argentina       19    21
Denmark         34    14
E. Germany      40    11
Guatemala       8     41
India           12    37
Ireland         20    22
Jamaica         20    31
Japan           37    19
Philippines     19    42
USA             30    15
Soviet Union    46    18

b. Find the equation of the regression line. Interpret the slope.

c. Find the correlation coefficient r.

Obtaining the Correlation Using the TI

To get the regression line and the correlation coefficient using the TI, we first need to turn on the diagnostic option. If the x data is in L1 and the y data is in L2, then the steps are as follows:

[Figure: TI calculator keystroke steps]

Let's Do It! 3 Birth Rates

Check the value of r you obtained in Let's Do It! 2 above using the TI calculator.

Let's Do It! 4 Data on Milk Production

Milk samples were obtained from 14 Holstein-Friesian cows, and each was analyzed to determine uric acid concentration (Y), measured in mol/L. In addition to acid concentration, the total milk production (X), measured in kg/day, was recorded for each cow. The data were entered into a computer and the following regression output was obtained.

[Figure: regression output for the milk production data]

(a) What is the equation of the least squares regression line?

(b) What is the correlation between x and y? r = __________.

(c) We are interested in whether the linear relationship is significant, which we assess by testing the following hypothesis.

Main Hypothesis (H0): The slope of the regression line equals 0.

Circle the p-value in the output that is used to test this hypothesis. At the 0.05 level of significance, we would (circle one):

Accept H0    Reject H0    Can't Tell

THE SQUARED CORRELATION $r^2$: WHAT DOES IT TELL US?

r = correlation coefficient; it gives the strength and the direction of the linear relationship between two quantitative variables x and y, with $-1 \le r \le 1$. Note that when we square r we get $0 \le r^2 \le 1$.

The value of $r^2$ is the proportion of the variation in the dependent variable y that is explained by the independent variable x. The quantity $r^2$ is generally denoted in computer output as $R^2$, and is often reported as a percent.
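As a check on the "by hand" calculation for the Test 1 versus Test 2 example, the short Python sketch below applies the computational formula for r and then squares the result to obtain $r^2$. It assumes that x holds the Test 1 scores and y the Test 2 scores; the variable names are illustrative only.

```python
import math

# Scores from the Test 1 versus Test 2 example
# (assumption: x = Test 1, y = Test 2).
x = [8, 10, 12, 14, 16]
y = [9, 13, 14, 15, 19]
n = len(x)

# Summation quantities from the calculation table.
sum_x = sum(x)                              # 60
sum_y = sum(y)                              # 70
sum_x2 = sum(a * a for a in x)              # 760
sum_xy = sum(a * b for a, b in zip(x, y))   # 884
sum_y2 = sum(b * b for b in y)              # 1032

# Computational formula for the sample correlation coefficient.
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
r = numerator / denominator

# Squared correlation: share of the variation in y explained by x.
r_squared = r ** 2

print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")  # r = 0.965, r^2 = 0.931
```

Here $r^2 \approx 0.931$, so roughly 93% of the variation in the y scores is explained by the linear relationship with the x scores.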
Example: $r^2 = 0.75$ means that about 75% of the variation in the response variable y can be explained by the linear relationship between x and y.

Homework, page 546: 2, 3, 4, 14, 15, 22, 36, 37, 39. (For 2, 14, 22, 37, and 39, create SPSS data files for the tables and use the regression procedure to answer the questions; include a copy of your output. For the other questions about the same tables, you may use the output only to check your answers; your work will still be required.)