Wed, June 26, (Lecture 8-2). Nonlinearity.
Significance test for correlation
R-squared, SSE, and SST.
Correlation in SPSS.
Last time, we looked at scatterplots, which show the
relationship between two variables, and at correlation.
The correlation coefficient r measures how well the pairs of
values fit on a line.
r is positive when two values increase together.
r is negative when one value goes up as the other goes
down.
However, correlation only shows the linear relation between
two variables.
The variables could still be related in a non-linear way and
have little or no correlation.
In real world contexts, the most common form of non-linear
relationship is a curvilinear one. (SOURCE: GAPMINDER.ORG)
One common reason is a scaling issue, where a fixed change in
one thing doesn’t mean a fixed change in another.
Life expectancy increases with the logarithm of income, not
with income. (SOURCE: GAPMINDER.ORG)
When we rescale income into a log-scale (a scale that shows
very small and very large numbers equally well), a line appears.
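As a rough sketch (not part of the course), here is what that rescaling looks like in Python, using made-up income and life-expectancy numbers rather than the Gapminder data: the correlation is noticeably stronger after taking the log of income.

    import numpy as np
    from scipy.stats import pearsonr

    # Made-up incomes spread evenly on a log scale, roughly $500 to $100,000 per year.
    rng = np.random.default_rng(0)
    income = 10 ** rng.uniform(2.7, 5.0, size=200)

    # Made-up life expectancy that rises with log(income), plus a little noise.
    life_exp = 30 + 8 * np.log10(income) + rng.normal(0, 1.5, size=200)

    r_raw, _ = pearsonr(income, life_exp)            # correlation on the raw income scale
    r_log, _ = pearsonr(np.log10(income), life_exp)  # correlation after the log rescaling

    print(f"r with raw income:    {r_raw:.3f}")  # weaker, because the relation is curved
    print(f"r with log10(income): {r_log:.3f}")  # stronger, because a line now fits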
Another reason for non-linearity could be two competing
factors.
In a too-easy course, nobody learns anything new.
In a too-hard course, nobody learns anything at all.
Spearman correlation is a measure that can handle curves
as long as the trend doesn’t switch between increasing and
decreasing. The only time we’ll be using this is as a check in
SPSS.
Everything else we do in Ch. 10 and 11 is the…
Pearson correlation, which is restricted to linear
relationships. We use the Pearson correlation because it
gives stronger (more powerful) results when the relationship
really is linear, and the math is simpler.
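As a quick illustration of the difference (again in Python, not SPSS, and with made-up numbers), here is a monotonic but curved relationship: Spearman sees the perfectly increasing trend, while Pearson is pulled below 1 by the curvature.

    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    # A made-up monotonic but curved relationship: y grows exponentially with x.
    x = np.linspace(1, 10, 50)
    y = np.exp(x / 2)

    r_pearson, _ = pearsonr(x, y)    # pulled well below 1 by the curvature
    r_spearman, _ = spearmanr(x, y)  # exactly 1: the trend only ever increases

    print(f"Pearson  r = {r_pearson:.3f}")
    print(f"Spearman r = {r_spearman:.3f}")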
Math: The ugly sweater around an otherwise pretty graph.
You can do hypothesis testing. We may be interested in
whether or not there is a correlation between two variables.
Since samples are random, the sample correlation between
two variables will land a little above or below zero by
chance, even when the true correlation is zero.
How far from zero correlation does something have to be
before it’s significant?
This formula gives the t-score for a correlation:
t = r·√(n − 2) / √(1 − r²).
The null hypothesis is: true correlation = zero.
The alternative is: correlation not zero.
The t in this formula is the same t-score as in chapters 6 and 7.
This t-score gets compared to the critical values in the t-table at
n − 2 degrees of freedom.
The stronger the correlation, the farther r goes from zero.
As r gets farther from zero, t-score gets bigger.
So a stronger correlation gives you higher t-score.
Stronger correlation → better evidence of a correlation.
t-score also increases with sample size. As usual, it’s under a
square root.
Having more data points makes it easier to detect correlations.
A larger t-score meant more evidence against the null, just like
before.
So a large t-score means more evidence of a correlation.
If there’s a weak correlation and a small sample, we might not
detect it. (Example: n=10, r=.25, which gives t ≈ 0.73.)
t* = 1.397, at 8 df, 0.20 significance.
t* = 2.306, at 8 df, 0.05 significance.
No significant evidence of a correlation.
p > 0.20
What if we get a larger sample of this correlation?
(n=46, r=0.25, which gives t ≈ 1.71)
We should get some evidence of a correlation, but not much.
t* = 1.684, at 44 df, 0.10 significance.
t* = 2.021, at 44 df, 0.05 significance.
Weak evidence of a correlation, 0.05 < p < 0.10.
What happens when you get a near perfect correlation?
(Example: n=10, r=.99, which gives t ≈ 19.9).
Expectation: Very strong evidence of a correlation.
t* = 2.306, at 8 df, 0.05 significance.
t* = 5.041, at 8 df, 0.001 significance.
Reality: Very strong evidence of a correlation.
The bottom of the formula, √(1 − r²), gets very small, and dividing
by a small number gives you something huge.
The same thing happens with a near-perfect negative
correlation, but the t-score is negative and huge.
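If you want to check the three examples above, here is a short Python sketch of the same t-score formula, t = r·√((n − 2)/(1 − r²)), using scipy’s t distribution to get exact p-values instead of bracketing them with the t-table.

    import numpy as np
    from scipy.stats import t as t_dist

    def t_from_r(r, n):
        # t-score for H0: true correlation = 0, with n - 2 degrees of freedom
        return r * np.sqrt((n - 2) / (1 - r**2))

    # The three lecture examples: weak r / small n, weak r / bigger n, near-perfect r.
    for n, r in [(10, 0.25), (46, 0.25), (10, 0.99)]:
        t = t_from_r(r, n)
        p = 2 * t_dist.sf(abs(t), df=n - 2)  # two-tailed p-value
        print(f"n = {n:2d}, r = {r:.2f}:  t = {t:6.2f},  p = {p:.4f}")

    # Rounded output:
    #   n = 10, r = 0.25:  t ~  0.73,  p ~ 0.49  (p > 0.20: no evidence)
    #   n = 46, r = 0.25:  t ~  1.71,  p ~ 0.09  (0.05 < p < 0.10: weak evidence)
    #   n = 10, r = 0.99:  t ~ 19.85,  p < 0.001 (very strong evidence)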
For interest: You can always put a line exactly through two
points.
With only two points, we have no idea what the true
correlation is. Points after the first two tell us about
correlation.
That’s why correlation has n-2 degrees of freedom.
More math? More ugly sweaters!
Show your pet some love by forcing it into a tea cosy.
First, we need to set down a convention. We’re looking at two
variables of the same object. We call these variables x and y.
Example: If we were talking about dragons, X could be the
length and Y could be the width.
X is the independent/explanatory variable (the one we control
or can measure more perfectly),
Y is the dependent/response variable.
When x and y are correlated, we say that some of the variation
in y is explained by x.
Meaning: Across all the x, the range of y can be large.
But if we only consider a particular x (or a small x-interval), the
range of y shrinks considerably.
Y varies less for a particular X.
Y has less variance when accounting for X.
r² is the proportion by which the variance of y is reduced when
accounting for x.
r = 0.6 in this graph, so r² = 0.6² = 0.36.
36% of the variation in Y is explained by X.
The same proportion of variance is explained for a negative
correlation of equal strength. A negative times itself is
positive, so r² is always between 0 and 1.
In a perfect correlation, knowing x automatically gives you y
as well. So there is no variation in y left to explain.
r = 1 or −1, so r² = 1.
All of the variation in y is explained by x.
When two values are uncorrelated, using a linear function of x
to guess at y is useless.
r = 0, so r² = 0.
None of the variation in y is explained by x.
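A tiny Python check of these two extreme cases, with made-up numbers: one y that is an exact linear function of x, and one y drawn independently of x.

    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 10, size=50)

    y_perfect = 3 * x + 1                      # knowing x pins down y exactly
    y_unrelated = rng.uniform(0, 10, size=50)  # y has nothing to do with x

    r1, _ = pearsonr(x, y_perfect)
    r2, _ = pearsonr(x, y_unrelated)
    print(f"perfect:   r = {r1:.2f}, r^2 = {r1**2:.2f}")  # r = 1, so r^2 = 1
    print(f"unrelated: r = {r2:.2f}, r^2 = {r2**2:.2f}")  # r near 0, so r^2 near 0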
The total squared difference from the mean of y is called the
sum of squares total, or SST.
SST is the total squared length of all the vertical red lines.
If we fit a line through the middle of the points in the scatter
plot (called a regression line, the subject of chapter 11),
the lines, on average, get shorter.
The total squared length of these lines is the sum of
squared error, or SSE.
The stronger the correlation, the shorter the vertical lines get.
In other words, our errors get smaller, and with them the
sum of squared error does too.
Here, the correlation is very strong, and there are barely any
errors at all.
r² can also be expressed in terms of SSE and SST: r² = 1 − SSE/SST.
SST is the total amount of variation in Y.
SSE is the amount of variation in Y left unexplained by X.
When r² is zero, SSE is the same as SST.
When r² is one, SSE disappears completely.
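Here is a small Python sketch with made-up x and y values (the names are placeholders, not course data) showing that 1 − SSE/SST computed from a fitted line matches r² from the Pearson correlation.

    import numpy as np
    from scipy.stats import pearsonr

    # Made-up paired data (think length vs. width of our dragons).
    rng = np.random.default_rng(1)
    x = rng.uniform(2, 10, size=30)
    y = 1.5 * x + rng.normal(0, 2, size=30)

    # SST: total squared distance of each y from the mean of y.
    sst = np.sum((y - y.mean()) ** 2)

    # SSE: total squared distance of each y from the fitted regression line (Ch. 11).
    slope, intercept = np.polyfit(x, y, deg=1)
    y_hat = slope * x + intercept
    sse = np.sum((y - y_hat) ** 2)

    r, _ = pearsonr(x, y)
    print(f"r^2 from the correlation: {r**2:.3f}")
    print(f"1 - SSE/SST:              {1 - sse / sst:.3f}")  # same number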
An ugly sweater for every occasion!
Even SPSS!
To find a correlation in SPSS, go to
Analyze → Correlate → Bivariate
(Means two-variable)
Pick the variables you want to correlate, drag them right.
Pearson correlation coefficient MUST be selected.
Spearman coefficient is optional.
There is a correlation of r = .940 between weight and height.
It’s a significant correlation, with a p-value of less than .001
(shows up as Sig. (2-tailed) = .000)
Also, anything correlates with itself perfectly, so the
correlation between length and length is r = 1.
To build a scatterplot, go to
Graphs → Legacy Dialogs → Scatter/Dot
Choose Simple Scatter if it’s not already picked, and click
Define.
Move the independent variable into the x-axis,
and the dependent variable into the y-axis,
then click OK (way at the bottom).
Our result:
There is a definite upward trend, so the strong positive
correlation of r = 0.940 makes sense.
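For anyone who wants to double-check SPSS outside of SPSS, here is a rough Python equivalent of the two menu paths above, using made-up height and weight columns rather than the course data set.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import pearsonr

    # Made-up height (cm) and weight (kg) columns, not the course data.
    height = np.array([150, 155, 160, 165, 170, 175, 180, 185, 190, 195])
    weight = np.array([50, 52, 58, 61, 66, 70, 74, 79, 85, 88])

    # Analyze -> Correlate -> Bivariate
    r, p = pearsonr(height, weight)
    print(f"Pearson r = {r:.3f}, Sig. (2-tailed) = {p:.4f}")

    # Graphs -> Legacy Dialogs -> Scatter/Dot -> Simple Scatter
    plt.scatter(height, weight)
    plt.xlabel("height (independent variable, x-axis)")
    plt.ylabel("weight (dependent variable, y-axis)")
    plt.show()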
Next time: Residuals, Outliers and Influence, and the
assumption of constant variance.