Wed, June 26 (Lecture 8-2). Nonlinearity. Significance test for correlation. R-squared, SSE, and SST. Correlation in SPSS.

Last time, we looked at scatterplots, which show the relationship between two variables, and correlation. The correlation coefficient r measures how well the pairs of values fit on a line. r is positive when the two values increase together. r is negative when one value goes up as the other goes down. However, correlation only shows the linear relation between two variables. The variables could still be related in a non-linear way and have little or no correlation. In real-world contexts, the most common form of non-linear relationship is a curvilinear one. (SOURCE: GAPMINDER.ORG)

One common reason is a scaling issue, where a fixed change in one thing doesn't mean a fixed change in another. Life expectancy increases with the logarithm of income, not with income itself. (SOURCE: GAPMINDER.ORG) When we rescale income onto a log scale (a scale that shows very small and very large numbers equally well), a line appears. Another reason for non-linearity could be two competing factors. In a too-easy course, nobody learns anything new. In a too-hard course, nobody learns anything at all.

Spearman correlation is a measure that can handle curves as long as the trend doesn't switch between increasing and decreasing. The only time we'll be using it is as a check in SPSS. Everything else we do in Chapters 10 and 11 uses the Pearson correlation, which is restricted to linear relationships. We use the Pearson correlation because it produces stronger results and the math is simpler.

Math: the ugly sweater around an otherwise pretty graph. You can do hypothesis testing. We may be interested in whether or not there is a correlation between two variables. Since samples are random, the sample correlation between two variables will show up as a little above or below zero by chance. How far from zero does a correlation have to be before it's significant?

The t-score of a correlation is t = r·√(n − 2) / √(1 − r²). The null hypothesis is: true correlation = zero. The alternative is: true correlation ≠ zero. The t in this formula is the same t-score as in Chapters 6 and 7. This t-score gets compared to the critical values in the t-table at n − 2 degrees of freedom.

The stronger the correlation, the farther r is from zero. As r gets farther from zero, the t-score gets bigger. So a stronger correlation gives you a higher t-score, and a stronger correlation means better evidence of a correlation. The t-score also increases with sample size; as usual, the sample size is under a square root. Having more data points makes it easier to detect correlations. A larger t-score means more evidence against the null, just like before. So a large t-score means more evidence of a correlation.

If there's a weak correlation and a small sample, we might not detect it. (Example: n = 10, r = 0.25, giving t ≈ 0.73.) t* = 1.397 at 8 df, 0.20 significance. t* = 2.306 at 8 df, 0.05 significance. No significant evidence of a correlation: p > 0.20.

What if we get a larger sample with this same correlation? (n = 46, r = 0.25, giving t ≈ 1.71.) We should get some evidence of a correlation, but not much. t* = 1.684 at 44 df, 0.10 significance. t* = 2.021 at 44 df, 0.05 significance. Weak evidence of a correlation: 0.05 < p < 0.10.

What happens when you get a near-perfect correlation? (Example: n = 10, r = 0.99, giving t ≈ 19.9.) Expectation: very strong evidence of a correlation. t* = 2.306 at 8 df, 0.05 significance. t* = 5.041 at 8 df, 0.001 significance. Reality: very strong evidence of a correlation, p < 0.001.
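To make these worked examples concrete, here is a minimal Python sketch (assuming numpy and scipy are installed; the only inputs are the r and n values quoted above) that computes the t-score t = r·√(n − 2) / √(1 − r²) and a two-tailed p-value for each case.

```python
import numpy as np
from scipy import stats

def corr_t_test(r, n):
    """t-score and two-tailed p-value for H0: true correlation = 0."""
    df = n - 2                              # the first two points pin down a line exactly
    t = r * np.sqrt(df) / np.sqrt(1 - r**2)
    p = 2 * stats.t.sf(abs(t), df)          # two-tailed p-value from the t distribution
    return t, p

for r, n in [(0.25, 10), (0.25, 46), (0.99, 10)]:
    t, p = corr_t_test(r, n)
    print(f"r = {r}, n = {n}: t = {t:.2f}, p = {p:.4f}")
```

The printed p-values land in the same brackets as the t-table comparisons above: p > 0.20 for the small sample, between 0.05 and 0.10 for n = 46, and far below 0.001 for the near-perfect correlation.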
When r is close to 1, the bottom of the formula, √(1 − r²), gets very small, and dividing by a small number gives you something huge. The same thing happens with a near-perfect negative correlation, but the t-score is huge and negative.

For interest: you can always put a line exactly through two points, so with only two points we have no idea what the true correlation is. Points after the first two are what tell us about correlation. That's why correlation has n − 2 degrees of freedom.

More math? More ugly sweaters! Show your pet some love by forcing it into a tea cosy.

First, we need to set down a convention. We're looking at two variables of the same object, and we call these variables x and y. Example: if we were talking about dragons, x could be the length and y could be the width. x is the independent/explanatory variable (the one we control or can measure more precisely); y is the dependent/response variable.

When x and y are correlated, we say that some of the variation in y is explained by x. Meaning: across all the x values, the range of y can be large, but if we only consider a particular x (or a small x-interval), the range of y shrinks considerably. y varies less for a particular x; y has less variance when accounting for x.

r² is the proportion by which the variance of y is reduced when accounting for x. r = 0.6 in this graph, so r² = 0.6² = 0.36: 36% of the variation in y is explained by x. The same proportion of variance is explained for a negative correlation of equal strength. A negative times itself is positive, so r² is always between 0 and 1.

In a perfect correlation, knowing x automatically gives you y as well, so there is no variation in y left to explain: r = 1 or −1, so r² = 1. All of the variation in y is explained by x. When two values are uncorrelated, using a linear function of x to guess at y is useless: r = 0, so r² = 0. None of the variation in y is explained by x.

The total squared difference from the mean of y is called the sum of squares total, or SST. In the plot, SST is the total squared length of all the vertical red lines from each point to the mean of y. If we fit a line through the middle of the points in the scatterplot (called a regression line, the subject of Chapter 11), those vertical lines, on average, get shorter. The total squared length of these lines is the sum of squared error, or SSE. The stronger the correlation, the shorter the vertical lines get; in other words, the smaller our errors get, and with them the sum of squared error does too. Here, the correlation is very strong, and there are barely any errors at all.

r² can also be expressed in terms of SSE and SST: r² = (SST − SSE) / SST = 1 − SSE/SST. SST is the total amount of variation in y. SSE is the amount of variation in y left unexplained by x. When r² is zero, SSE is the same as SST. When r² is one, SSE disappears completely.

An ugly sweater for every occasion! Even SPSS! To find a correlation in SPSS, go to Analyze → Correlate → Bivariate (bivariate means two-variable). Pick the variables you want to correlate and drag them to the right. The Pearson correlation coefficient MUST be selected; the Spearman coefficient is optional. In the output, there is a correlation of r = .940 between weight and height. It's a significant correlation, with a p-value of less than .001 (it shows up as Sig. (2-tailed) = .000). Also, anything correlates with itself perfectly, so the correlation between length and length is r = 1.

To build a scatterplot, go to Graphs → Legacy Dialogs → Scatter/Dot. Choose Simple Scatter if it's not already picked, and click Define.
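The identity r² = 1 − SSE/SST can be checked numerically. Below is a small Python sketch with made-up x and y values (not the lecture's data): it fits a least-squares line with numpy, computes SST and SSE as described above, and compares 1 − SSE/SST with the squared Pearson correlation.

```python
import numpy as np

# Made-up example data, not the dataset used in lecture
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.3, 8.9])

# SST: total squared distance of each y from the mean of y
sst = np.sum((y - y.mean()) ** 2)

# Fit a least-squares regression line, then measure the leftover (unexplained) errors
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept
sse = np.sum((y - y_hat) ** 2)              # SSE: squared vertical distances to the line

r = np.corrcoef(x, y)[0, 1]                 # Pearson correlation between x and y
print(f"r^2 from correlation: {r**2:.4f}")
print(f"1 - SSE/SST         : {1 - sse / sst:.4f}")   # should match r^2
```

Both printed numbers should agree, since for a least-squares line the proportion of variance explained is exactly r².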
Move the independent variable into the x-axis box and the dependent variable into the y-axis box, then click OK (way at the bottom). Our result: there is a definite upward trend, so the strong positive correlation of r = 0.940 makes sense.

Next time: residuals, outliers and influence, and the assumption of constant variance.
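For comparison with the SPSS steps, here is a rough Python equivalent (assuming scipy and matplotlib are available; the height/weight numbers are invented stand-ins, not the class dataset): it reports the Pearson correlation with its p-value, the optional Spearman check, and draws the same kind of scatterplot with the independent variable on the x-axis.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Invented height (cm) and weight (kg) values standing in for the class data
height = np.array([150, 155, 160, 165, 170, 175, 180, 185, 190])
weight = np.array([52, 55, 60, 63, 68, 72, 78, 83, 90])

r, p = stats.pearsonr(height, weight)       # Pearson r and its two-tailed p-value
rho, p_s = stats.spearmanr(height, weight)  # Spearman check for monotone (possibly curved) trends
print(f"Pearson  r   = {r:.3f} (p = {p:.4f})")
print(f"Spearman rho = {rho:.3f} (p = {p_s:.4f})")

# Scatterplot: independent variable on the x-axis, dependent on the y-axis
plt.scatter(height, weight)
plt.xlabel("height (cm)")
plt.ylabel("weight (kg)")
plt.show()
```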