AP Stats Correlation Fact Sheet

When asked to describe an association between two variables (scatterplot), always talk about:
1) Direction (positive, negative)
2) Form (linear, exotic, quadratic, etc.)
3) Strength (strong, moderate, weak)

Do not use correlation and least squares lines unless the following conditions and assumptions are met:
1) Quantitative data
2) Linearity Assumption: Straight Enough Condition
3) Equal Variance Assumption: Does the Plot Thicken? Condition (equal amount of scatter throughout)
4) Outlier Condition

When looking for a correlation, the explanatory variable is the x-variable and the response variable is the y-variable.

r, the correlation coefficient, is the slope of the standardized line of best fit (the line that best fits the scatterplot of z-scores for all data x and y):

$r = \frac{\sum z_x z_y}{n - 1}$

For every 1 standard deviation change in the explanatory variable, the response variable is predicted to change by r standard deviations.

The least squares line (aka line of best fit) is the line that minimizes the sum of the squared residuals.

The slope of the least squares line for the actual data (not standardized) is $b = \frac{r s_y}{s_x}$.
The point $(\bar{x}, \bar{y})$ will be on the least squares line for the actual data.
The y-intercept of the least squares line for the actual data is $a = \bar{y} - b\bar{x}$.
The equation of the least squares line is $\hat{y} = a + bx$.

NOTE: When given data you could find the equation of the least squares line by:
1) Calculating the mean and standard deviation for x and y
2) Converting all the x and y scores to z-scores, finding $z_x z_y$ for each data point, summing them, and then dividing by n - 1 to find r: $r = \frac{\sum z_x z_y}{n - 1}$
3) Calculating $b = \frac{r s_y}{s_x}$
4) Calculating $a = \bar{y} - b\bar{x}$

OR YOU COULD PUT THE DATA INTO LISTS ON YOUR CALCULATOR AND USE STAT-CALC-8. However, you should still be able to use the equations for b and a to find the equation of the least squares line when you are not given the data but simply a summary of the data (r, means, and standard deviations).
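The four steps in the NOTE above can be sketched in Python as a check on hand or calculator work. This is only an illustration: the function name `least_squares_line` and the five data points are made up for the example and do not come from the fact sheet.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (r, a, b) for the line y-hat = a + b*x, using the z-score method."""
    n = len(xs)
    # Step 1: means and sample standard deviations (stdev divides by n - 1).
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # Step 2: convert to z-scores, multiply pairwise, sum, divide by n - 1.
    z_x = [(x - x_bar) / s_x for x in xs]
    z_y = [(y - y_bar) / s_y for y in ys]
    r = sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)
    # Step 3: slope for the actual (unstandardized) data.
    b = r * s_y / s_x
    # Step 4: intercept, which forces the line through (x-bar, y-bar).
    a = y_bar - b * x_bar
    return r, a, b

# Hypothetical data, invented purely for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r, a, b = least_squares_line(xs, ys)
print(f"r = {r:.4f}, a = {a:.4f}, b = {b:.4f}")
```

For this made-up data the slope works out to 0.6 and the intercept to 2.2, so the line is y-hat = 2.2 + 0.6x, and plugging in the mean of x (3) returns the mean of y (4).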
A residual for any data point is the actual y value minus the predicted y value: residual = $y - \hat{y}$.

The slope (b) of the least squares line means that for each increase of 1 unit in the explanatory variable, the response variable is predicted to change by b units.

The y-intercept (a) of the least squares line is the predicted value of the response variable when the explanatory variable equals 0 (sometimes unrealistic, in which case it just provides a starting point for the line).

A residual plot is a scatterplot with the x-values (or predicted y values) on the x-axis and the residuals on the y-axis.

A positive residual means the actual value of y is greater than the predicted value, and a negative residual means the actual y value is less than the predicted value.

You find predicted values for y by plugging the corresponding x-value into the least squares line.

When writing the equation in the context of the problem, you should put the word description of the variables in for x and y. Remember, a least squares line does not give us y, it gives us a prediction for y, so you should put a hat (^) on the y variable name, as in $\hat{y}$.

If you know y and wish to predict x, you should not use the equation $\hat{y} = a + bx$, but should instead recalculate the least squares line using the y variable as the explanatory variable and the x variable as the response variable. NOTE: a and b will change.

CORRELATION DOES NOT MEAN CAUSATION

R² is the coefficient of determination (found by squaring r). R² is the percent of the variation in the y-values of the data that can be accounted for by the linear relationship with the x-values of the data.

_____% of the variation in ________________ can be explained by _______________. The other _____% is left in the residuals and can be attributed to other factors, such as lurking variables.

Be able to find r, a and b from printed software output.

A lurking variable is a variable not included in the analysis that simultaneously affects both the explanatory and the response variables, causing the association to appear different than it actually is.
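The ideas of predicted values, residuals, and R² can be illustrated with a short Python sketch. The data are hypothetical, and all variable names are chosen for this example only. The slope is computed here from sums of deviations, which is algebraically the same as b = r·s_y/s_x.

```python
from statistics import mean

# Hypothetical data, invented purely for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(xs), mean(ys)

# Least squares slope and intercept (equivalent to b = r*s_y/s_x, a = y_bar - b*x_bar).
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

# Predicted values: plug each x into y-hat = a + b*x.
predicted = [a + b * x for x in xs]

# Residual = actual y - predicted y.
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]

# R^2 = 1 - (sum of squared residuals) / (total sum of squares of y about its mean).
sse = sum(e ** 2 for e in residuals)
sst = sum((y - y_bar) ** 2 for y in ys)
r_squared = 1 - sse / sst

for x, y, y_hat, e in zip(xs, ys, predicted, residuals):
    print(f"x={x}  actual={y}  predicted={y_hat:.1f}  residual={e:+.1f}")
print(f"R^2 = {r_squared:.3f}")
```

Plotting `residuals` against `xs` (or against `predicted`) would give the residual plot described above. For this made-up data the residuals sum to zero and R² comes out to 0.6, meaning 60% of the variation in y is accounted for by the line.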
A sample of 25 students found the mean math ACT score to be 24.5 with a standard deviation of 3.5. The same students were tested on the Urkle Nerdiness Scale and the mean score was 12 with a standard deviation of 9. The correlation coefficient for the data sets was 0.79. Use the ACT score as the explanatory variable.

b =
a =

1. Write the equation of the least squares line in the context of the problem.
2. What does the line predict would be the Nerdiness score of a person with a Math ACT of 28?
3. The residual for the person at point (31, 11) is = _______________.
4. R² = 0.624. Explain what this means.
5. What would the correlation coefficient be if we switched and made Nerdiness Score the explanatory variable?
6. What would the equation of the least squares line be if we switched?
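One way to check the arithmetic for this problem is a short Python sketch that works only from the summary statistics. The helper name `line_from_summary` is hypothetical; the numbers are taken from the problem statement. Work the problem by hand first and use this only to confirm your values.

```python
def line_from_summary(r, x_bar, s_x, y_bar, s_y):
    """Return (a, b) for y-hat = a + b*x from r, the means, and the standard deviations."""
    b = r * s_y / s_x          # slope
    a = y_bar - b * x_bar      # intercept
    return a, b

# Summary statistics from the problem statement.
act_mean, act_sd = 24.5, 3.5     # Math ACT (explanatory variable)
nerd_mean, nerd_sd = 12.0, 9.0   # Nerdiness score (response variable)
r = 0.79

# Line predicting Nerdiness from Math ACT.
a, b = line_from_summary(r, act_mean, act_sd, nerd_mean, nerd_sd)

# Prediction for a Math ACT of 28.
predicted_28 = a + b * 28

# Residual for the point (31, 11): actual minus predicted.
residual_31 = 11 - (a + b * 31)

# Coefficient of determination.
r_squared = r ** 2

# Switching roles: r is unchanged, but a and b must be recalculated.
a_switched, b_switched = line_from_summary(r, nerd_mean, nerd_sd, act_mean, act_sd)

print(f"b = {b:.3f}, a = {a:.2f}")
print(f"predicted Nerdiness at ACT 28 = {predicted_28:.2f}")
print(f"residual at (31, 11) = {residual_31:.2f}")
print(f"R^2 = {r_squared:.3f}")
print(f"switched: b = {b_switched:.3f}, a = {a_switched:.2f}")
```

With these inputs the sketch gives b ≈ 2.03 and a ≈ -37.77 for the original line, and b ≈ 0.307 and a ≈ 20.81 after switching. The two lines are not algebraic inverses of each other, which is why the line has to be recalculated rather than solved for x.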