3.3 Correlation: The Strength of a Linear Trend

The __________________ ________________ is the statistic used to quantitatively measure the amount of linear relationship between two variables. The __________________ ________________ is usually written as ____ (rho). As well as giving information about the strength of the linear relationship (how well a line fits the data), it also gives information about the direction of the linear relationship (whether the variables are positively or negatively associated).

The most commonly used correlation coefficient (there are several different ones) is the Pearson product-moment correlation coefficient, or more simply, the correlation coefficient. It is defined as “the average product of the z-scores of the x and y variables” (using n - 1 instead of n to correct for population bias, similar to the ____________ ____________ formula).

$$ r = \frac{1}{n-1} \sum \left( \frac{x - \bar{x}}{s_x} \right) \left( \frac{y - \bar{y}}{s_y} \right) = \frac{1}{n-1} \sum z_x z_y $$

where $z_x$ are the z-scores of the x-values and $z_y$ are the z-scores of the y-values.

The following graph helps you to visualize why the r value signifies what it does.

[Graph: scatterplot with a point (x, y) marked, divided into four quadrants by a line at the mean of x and a line at the mean of y]

The ______________ line is at the mean value of x and the ___________ line is at the mean value of y.

For points in Quadrant I both the x and y values have positive z-values, so their product contributes a positive amount to the calculation of r.

For points in Quadrant III both the x and y values have negative z-values, but their product is positive and so also contributes a positive amount to the calculation of r.

For the remaining points in Quadrants II and IV, the x and y values will have z-values with opposite signs, hence their product is negative and contributes a negative amount to the calculation of r.

For relationships where most of the points fall in Quadrants I and III, the correlation coefficient will be positive, as its calculation is dominated by positive terms.
Conversely, for relationships where most of the points fall in Quadrants II and IV, the correlation coefficient will be negative, as its calculation is dominated by negative terms.

DO NOT INTERPRET CORRELATION AS CAUSATION. Just because two variables are related does not mean that one ___________ the other; some third variable may be influencing them both. E.g. seeing a fire truck at almost every fire doesn’t mean that fire trucks cause fires. This third variable that you didn’t include in your analysis, but that might explain the relationship between the variables you did include, is called a __________________.

Properties of the Correlation Coefficient

Correlation does not depend on which variable you call explanatory and which you call response.

Because z-scores have no units, and r is the average product of two z-scores, r has no units. This means changing the units of measurement for the variables has no effect on the value of the ________________ ________________.

The correlation coefficient is always a number between -1 and 1. If r = -1 or r = 1, the points all lie on a line. The following rough guidelines help you categorize r.

Value of r                                  Strength of Linear Relationship
-1 ≤ r ≤ -0.8  or  0.8 ≤ r ≤ 1              strong
-0.8 < r ≤ -0.5  or  0.5 ≤ r < 0.8          moderate
-0.5 < r < 0.5                              weak

If r > 0 the variables are ____________ associated. If r < 0 the variables are _______________ associated.

If r = 0 it indicates there is no _________ association that would allow you to predict y from x. It does not mean there is no relationship, just not a linear one.

A resistant statistic is one which is not dramatically affected by __________ values. The correlation coefficient r is not resistant because it is based on the ________. A single extreme value can have a powerful impact on r and cause you to overinterpret the relationship. That is why you must always look at a scatterplot of the data as well as r.
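The definition of r as the average product of z-scores, and the properties listed above (r does not depend on which variable is explanatory, r has no units, r is not resistant), can be checked numerically. What follows is a minimal Python sketch, not part of the original notes: the data set and the function name `correlation` are made up purely for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson r: sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)

# Made-up illustrative data (an assumption, not taken from the notes).
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

r = correlation(x, y)
r_swapped = correlation(y, x)                        # swap explanatory and response
r_rescaled = correlation([2.54 * v for v in x], y)   # change the units of x
r_outlier = correlation(x + [20], y + [0.0])         # add one extreme point

print(round(r, 4), round(r_swapped, 4), round(r_rescaled, 4), round(r_outlier, 4))
```

With this particular data set, swapping the variables or rescaling x leaves r unchanged (up to rounding error), while the single extreme point is enough to turn a strong positive correlation into a negative one. That is the sense in which r is not resistant.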
The scatterplots on the left illustrate how values of r closer to 1 or -1 correspond to stronger linear relationships.

Example 1: Match each of the five scatterplots on the right with its correlation, choosing from: -0.95, -0.5, 0, 0.5, and 0.95

The Relationship Between the Correlation and the Slope

As well as the correlation and the slope always having the same sign, they are also related by the equation:

$$ b_1 = r \frac{s_y}{s_x} $$

From the equation you can see that if the data is __________________ so that $s_x = 1$ and $s_y = 1$, then the slope of the regression line is ________ to the correlation.

Note: Most of the time you will be able to use your calculator to find the LSRL equation; however, the AP Exam may also ask you to calculate and interpret the linear regression equation given the summary statistics $\bar{x}$, $\bar{y}$, $s_x$, $s_y$, and r.

             From Data                                                                          From Summary Statistics
SLOPE        $b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$       $b_1 = r \frac{s_y}{s_x}$
INTERCEPT    $b_0 = \bar{y} - b_1 \bar{x}$                                                      $b_0 = \bar{y} - b_1 \bar{x}$

Example 2: P15 page 156
Imagine a scatterplot of two sets of exam scores for students in a statistics class. The score for a student on Exam 1 is graphed on the x-axis, and his or her score on Exam 2 is graphed on the y-axis. The slope of the regression line is 0.368. The mean of the Exam 1 scores is 72.99, and the standard deviation is 12.37. The mean of the Exam 2 scores is 75.80, and the standard deviation is 7.00.

a. Find the correlation of these two scores.

b. Find the equation of the regression line for predicting an Exam 2 score from an Exam 1 score. Predict an Exam 2 score for a student who got a score of 80 on Exam 1.

c. Find the equation of the regression line for predicting an Exam 1 score from an Exam 2 score.

One way to evaluate how well a linear model fits the data is the coefficient of determination, described in the rest of this section (the second way will be covered in section 3.4).
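Returning to Example 2, the summary-statistics formulas for the slope and intercept can be applied step by step. The Python sketch below is one possible way to set up the arithmetic, not an official solution: it uses only the numbers quoted in Example 2, and the variable names are invented for illustration.

```python
# Summary statistics quoted in Example 2.
slope_exam2_on_exam1 = 0.368           # b1 for predicting Exam 2 from Exam 1
mean_exam1, sd_exam1 = 72.99, 12.37    # x-bar and s_x
mean_exam2, sd_exam2 = 75.80, 7.00     # y-bar and s_y

# (a) b1 = r * s_y / s_x, so solving for r gives r = b1 * s_x / s_y.
r = slope_exam2_on_exam1 * sd_exam1 / sd_exam2

# (b) Intercept b0 = y-bar - b1 * x-bar, then predict Exam 2 for an Exam 1 score of 80.
b0 = mean_exam2 - slope_exam2_on_exam1 * mean_exam1
predicted_exam2 = b0 + slope_exam2_on_exam1 * 80

# (c) Swap the roles of the variables: Exam 2 is now explanatory, so the
# slope is r * (sd of Exam 1) / (sd of Exam 2). The correlation r is unchanged.
slope_exam1_on_exam2 = r * sd_exam1 / sd_exam2
intercept_exam1_on_exam2 = mean_exam1 - slope_exam1_on_exam2 * mean_exam2

print(f"r = {r:.3f}")
print(f"predicted Exam 2 = {b0:.2f} + {slope_exam2_on_exam1} (Exam 1)")
print(f"prediction at Exam 1 = 80: {predicted_exam2:.2f}")
print(f"predicted Exam 1 = {intercept_exam1_on_exam2:.2f} + {slope_exam1_on_exam2:.3f} (Exam 2)")
```

Under these formulas r comes out near 0.65 and the predicted Exam 2 score for an Exam 1 score of 80 is about 78.4. Note that the line in part (c) is not obtained by algebraically inverting the line in part (b); it has to be rebuilt from r and the two standard deviations.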
Coefficient of Determination

Let’s look, for example, at the relationship between age and median height in inches for girls aged 2 to 14 years (E19 page 136). The equation of the LSRL is

predicted height = 31.57 + 2.43 (age)

If you had to guess a girl’s height with no other information, on average you'd make the smallest errors by always guessing the mean height.

In general all observed y values exhibit variability. A rough measure of this variability is the total sum of squares:

$$ SST = \sum (y - \bar{y})^2 $$ (the total variation of the observed y values about their overall average)

This total variability can be broken into two parts: the first attributed to the differences in x (the linear relationship) and the second attributed to other unexplained factors (residuals).

$$ SSR = \sum (\hat{y} - \bar{y})^2 $$ (the variation explained by the regression)

$$ SSE = \sum \text{residual}^2 = \sum (y - \hat{y})^2 $$ (the remaining unexplained variation)

SST (total sum of squares) = SSR (regression sum of squares) + SSE (residual sum of squares)

The coefficient of determination, $r^2$, is a numerical quantity that tells you how well the least-squares line does at ______________ values of the response variable y. Although it is true that this quantity is equal to the square of r, the correlation coefficient, there is much more to this relationship. $r^2$ can be calculated using the following formula:

$$ r^2 = \frac{\text{explained variation due to the linear relationship}}{\text{total variation}} = \frac{SSR}{SST} $$

In brief, $r^2$ is a value between 0 and 1, and the closer it is to 1 the _________ your model is.

A value of $r^2$ equal to 1 implies that your model provides _________ predictions and it would pass through every point on the scatterplot exactly, i.e. it would be able to “explain” all of the variation.

In the worst case scenario, the least-squares line does no better at predicting y than $\bar{y}$. In this case SSE = SST and $r^2$ = 0.
If you have a coefficient of determination between 0 and 1, for example $r^2$ = 0.606, then about 61% of the variation in y among the individual subjects is due to the straight-line relationship between y and x. The other 39% is individual variation among subjects that is not explained by the linear relationship.

When interpreting the coefficient of determination, $r^2$, say the following:

About $r^2$ (written as a percent) of the variation in y can be explained by the linear relationship between x and y (of course you must replace x and y with their real-life meanings).

With some algebra it can be shown that the coefficient of determination is actually the correlation squared (not easy to do and beyond our course). This fact provides an important connection between correlation and regression. However, just remember that while it is true that one is the square of the other, they have different meanings:

r is a measure of the strength and direction of the linear relationship.

$r^2$ tells you how much better the linear model is at predicting y-values than simply using $\bar{y}$.

Notes:
Even though $r^2$ is not in the AP Stats curriculum, it has appeared on some exams.
Regression toward the mean is not in the course.
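The decomposition SST = SSR + SSE and the fact that SSR/SST equals the square of the correlation can be checked numerically for a least-squares line. The Python sketch below is only an illustration under stated assumptions: the five data points are made up (they are not the height data from E19), and the line is fitted with the "From Data" slope and intercept formulas from this section.

```python
from statistics import mean

# Made-up illustrative data (an assumption, not taken from the notes).
x = [2, 4, 6, 8, 10]
y = [3.0, 4.5, 4.0, 7.5, 8.0]

x_bar, y_bar = mean(x), mean(y)

# Least-squares slope and intercept from the data.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]          # predicted values

# Sums of squares.
sst = sum((yi - y_bar) ** 2 for yi in y)                 # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)             # explained by the line
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # residual (unexplained)

r_squared = ssr / sst
r = sxy / (sxx * sst) ** 0.5   # correlation; sst is also the sum of squared y deviations

print(f"SST = {sst:.4f}, SSR + SSE = {ssr + sse:.4f}")
print(f"SSR/SST = {r_squared:.4f}, r squared = {r ** 2:.4f}")
```

The two printed pairs agree up to rounding error. The decomposition holds because the line is the least-squares line; for an arbitrary line through the data, SSR + SSE need not equal SST.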