3.3 Correlation: The Strength of a Linear Trend

The __________________ ________________ is the statistic used to quantitatively
measure the amount of linear relationship between two variables. The
__________________ ________________ is usually written as ____ (rho). As well as giving
information about the strength of the linear relationship (how well a line fits the data) it also
gives information about the direction of the linear relationship (whether the variables are
positively or negatively associated).
The most commonly used correlation coefficient (there are several different ones) is the
Pearson product-moment correlation coefficient, or more simply, the correlation
coefficient. It is defined as “the average product of the z-scores of the x and y variables”
(using n – 1 instead of n to correct for population bias similar to the ____________
____________ formula).
$$r = \frac{1}{n-1} \sum \left( \frac{x - \bar{x}}{s_x} \right) \left( \frac{y - \bar{y}}{s_y} \right) = \frac{1}{n-1} \sum z_x z_y$$
where $z_x$ are the z-scores of the x-values and $z_y$ are the z-scores of the y-values.
The following graph helps you to visualize why the r value signifies what it does.
[Figure: scatterplot divided into four quadrants, with a typical point labeled (x, y)]
The ______________ line is at the mean value of x and the ___________ line is at the mean
value of y.


For points in Quadrant I both the x and y values have positive z-values, so their
product contributes a positive amount to the calculation of r.
For points in Quadrant III both the x and y values have negative z-values, but their
product is positive and so also contributes a positive amount to the calculation of r.
For the remaining points in Quadrants II and IV, the x and y values will have z-values with
opposite signs, hence their product is negative and contributes a negative amount to the
calculation of r.
For relationships where most of the points fall in Quadrants I and III the correlation
coefficient will be positive as its calculation is dominated by positive terms. Conversely for
relationships where most of the points fall in Quadrants II and IV, the correlation coefficient
will be negative as its calculation is dominated by negative terms.
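The quadrant argument above can be checked numerically. The following is a minimal sketch on made-up data (the values of xs and ys and the helper name quadrant are illustrative assumptions, not part of the notes). It computes each point's z-scores, reports which quadrant the point falls in relative to the mean lines, and averages the products with n - 1.

```python
import statistics as st

def quadrant(zx, zy):
    """Name the quadrant of a point relative to the two mean lines."""
    if zx > 0 and zy > 0:
        return "I"
    if zx < 0 and zy > 0:
        return "II"
    if zx < 0 and zy < 0:
        return "III"
    if zx > 0 and zy < 0:
        return "IV"
    return "on a mean line"

# Made-up data for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 3, 6]

n = len(xs)
x_bar, y_bar = st.mean(xs), st.mean(ys)
s_x, s_y = st.stdev(xs), st.stdev(ys)  # sample standard deviations (divide by n - 1)

z_x = [(x - x_bar) / s_x for x in xs]
z_y = [(y - y_bar) / s_y for y in ys]
products = [zx * zy for zx, zy in zip(z_x, z_y)]

# Show each point's contribution to r
for x, y, zx, zy, p in zip(xs, ys, z_x, z_y, products):
    print(f"({x}, {y})  quadrant {quadrant(zx, zy):<14}  z_x*z_y = {p:+.3f}")

# r is the "average" product of z-scores, using n - 1 instead of n
r = sum(products) / (n - 1)
print(f"r = {r:.3f}")
```

In this data set only the point (4, 3) lies in Quadrant IV, so it is the only point with a negative product. The points in Quadrants I and III dominate, and r comes out positive.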
DO NOT INTERPRET CORRELATION AS CAUSATION. Just because two variables are related
does not mean that one ___________ the other – some third variable may be influencing
them both. E.g. seeing a fire truck at almost every fire doesn’t mean that fire trucks cause
fires. This third variable that you didn’t include in your analysis, but that might explain the
relationship between the variables you did include, is called a __________________.
Properties of the Correlation Coefficient



Correlation does not depend on which variable you call explanatory and which
you call response.
Because z-scores have no units, and r is the average product of two z-scores, r
has no units. This means changing the units of measurement for the variables has
no effect on the value of the ________________ ________________.
The correlation coefficient is always a number between -1 and 1. If r = -1 or r = 1,
the points all lie on a line. The following rough guidelines help you categorize r.
Value of r                              Strength of Linear Relationship
-1 ≤ r ≤ -0.8  or  0.8 ≤ r ≤ 1          strong
-0.8 < r ≤ -0.5  or  0.5 ≤ r < 0.8      moderate
-0.5 < r < 0.5                          weak
If r > 0 the variables are ____________ associated. If r < 0 the variables are
_______________ associated.
If r = 0 it indicates there is no _________ association that would allow you to
predict y from x. It does not mean there is no relationship – just not a linear one.
A resistant statistic is one which is not dramatically affected by __________
values. The correlation r is not resistant because it is based on the ________. A single extreme
value can have a powerful impact on r and cause you to overinterpret the
relationship. That is why you must always look at a scatterplot of the data as well
as r.
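Several of these properties can be checked with a short experiment. The sketch below uses made-up height and weight values and a hypothetical helper named corr; none of this data comes from the notes. It checks that swapping x and y or changing units leaves r unchanged, that a perfect curved relationship can give r = 0, and that one extreme point can change r dramatically.

```python
import statistics as st

def corr(xs, ys):
    """Correlation as the average product of z-scores (dividing by n - 1)."""
    n = len(xs)
    x_bar, y_bar = st.mean(xs), st.mean(ys)
    s_x, s_y = st.stdev(xs), st.stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Made-up data for illustration only
heights_in = [60, 62, 65, 68, 70]
weights_lb = [105, 120, 130, 155, 160]

r_original = corr(heights_in, weights_lb)   # about 0.99

# Swapping explanatory and response variables does not change r
r_swapped = corr(weights_lb, heights_in)

# Changing units (inches to centimeters) does not change r
heights_cm = [2.54 * h for h in heights_in]
r_cm = corr(heights_cm, weights_lb)

# A perfect but non-linear relationship (y = x^2) can still give r = 0
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]
r_parabola = corr(xs, ys)

# r is not resistant: one extreme point changes it dramatically
r_with_outlier = corr(heights_in + [75], weights_lb + [90])   # about 0.02

print(f"original: {r_original:.3f}  swapped: {r_swapped:.3f}  in cm: {r_cm:.3f}")
print(f"parabola: {r_parabola:.3f}  with one extreme point: {r_with_outlier:.3f}")
```

Neither the parabola nor the extreme point would be visible from r alone, which is why the scatterplot should always be examined alongside it.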
The scatterplots on the left illustrate how values of r closer to 1 or -1 correspond to stronger linear relationships.
Example 1: Match each of the five scatterplots on the right with its correlation, choosing from:
-0.95, -0.5, 0, 0.5, and 0.95
The Relationship Between the Correlation and the Slope
As well as the correlation and the slope always having the same sign, they are also related
by the equation:
$$b_1 = r \, \frac{s_y}{s_x}$$
From the equation you can see that if the data is __________________ so that $s_x = 1$ and
$s_y = 1$, then the slope of the regression line is ________ to the correlation.
Note: Most of the time you will be able to use your calculator to find the LSRL equation;
however, the AP Exam may also ask you to calculate and interpret the linear regression
equation given the summary statistics $\bar{x}$, $\bar{y}$, $s_x$, $s_y$, and r.
             From Data                                                                      From Summary Statistics
SLOPE        $b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$     $b_1 = r \cdot \frac{s_y}{s_x}$
INTERCEPT    $b_0 = \bar{y} - b_1 \bar{x}$                                                  $b_0 = \bar{y} - b_1 \bar{x}$
Example 2: P15 page 156
Imagine a scatterplot of two sets of exam scores for students in a statistics class. The score
for a student on Exam 1 is graphed on the x-axis, and his or her score on Exam 2 is graphed
on the y-axis. The slope of the regression line is 0.368. The mean of the Exam 1 scores is
72.99, and the standard deviation is 12.37. The mean of the Exam 2 scores is 75.80, and the
standard deviation is 7.00.
a. Find the correlation of these two scores.
b. Find the equation of the regression line for predicting an Exam 2 score from an Exam
1 score. Predict an Exam 2 score for a student who got a score of 80 on Exam 1.
c. Find the equation of the regression line for predicting an Exam 1 score from an Exam
2 score.
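One way to check your work on Example 2 after attempting it by hand is sketched below. It uses only the summary statistics given in the problem and the formulas from the table above; the variable names are illustrative. Part (c) assumes that predicting Exam 1 from Exam 2 swaps the roles of the two variables, so the slope becomes r times s_x / s_y and the line still passes through the point of means.

```python
# Summary statistics given in Example 2
b1 = 0.368                    # slope for predicting Exam 2 from Exam 1
x_bar, s_x = 72.99, 12.37     # Exam 1 mean and standard deviation
y_bar, s_y = 75.80, 7.00      # Exam 2 mean and standard deviation

# (a) Solve b1 = r * s_y / s_x for r
r = b1 * s_x / s_y

# (b) Intercept from b0 = y_bar - b1 * x_bar, then predict for Exam 1 = 80
b0 = y_bar - b1 * x_bar
predicted_exam2 = b0 + b1 * 80

# (c) Swap the roles of x and y: Exam 2 is now the explanatory variable
b1_reverse = r * s_x / s_y
b0_reverse = x_bar - b1_reverse * y_bar

print(f"(a) r = {r:.3f}")
print(f"(b) predicted Exam 2 = {b0:.2f} + {b1:.3f}(Exam 1); at 80: {predicted_exam2:.2f}")
print(f"(c) predicted Exam 1 = {b0_reverse:.2f} + {b1_reverse:.3f}(Exam 2)")
```

The reverse line in part (c) is not found by solving the line in part (b) for x. Both slopes are built from the same r, so their product equals r squared rather than 1.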
The following is a description of one way to evaluate how well a linear model fits the data
(the second way will be covered in section 3.4).
Coefficient of Determination
Let’s look, for example, at the relationship between the age of girls (2 to 14 years) and their median
height in inches (E19 page 136). The equation of the LSRL is
predicted height = 31.57 + 2.43 (age)
If you had to guess a girl’s height with no
other information, on average you'd make
the smallest errors by always guessing the
mean height.
In general all observed y values exhibit variability. A rough measure of this variability is the
total sum of squares:
$$SST = \sum (y - \bar{y})^2$$
(the total variation of the observed y values about their overall average)
This total variability can be broken into two parts: the first attributed to the differences in x
(the linear relationship) and the second attributed to other unexplained factors (residuals).
$$SSR = \sum (\hat{y} - \bar{y})^2$$
(the variation explained by the regression)
$$SSE = \sum \text{residual}^2 = \sum (y - \hat{y})^2$$
(the remaining unexplained variation)
SST (total sum of squares)= SSR (regression sum of squares) + SSE (residual sums of squares)
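The decomposition can be verified on a small data set. The sketch below uses made-up values, not the girls' height data from the text. It fits the least-squares line, computes the three sums of squares from their definitions, and checks that the explained and unexplained parts add up to the total.

```python
import statistics as st

# Made-up data for illustration only
xs = [1, 2, 3, 4, 5]
ys = [3, 7, 9, 5, 11]

x_bar, y_bar = st.mean(xs), st.mean(ys)

# Least-squares slope and intercept
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar

# Predicted values from the fitted line
y_hat = [b0 + b1 * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)                 # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained by the regression
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))    # unexplained (residuals)

print(f"SST = {sst:.2f}")
print(f"SSR = {ssr:.2f}")
print(f"SSE = {sse:.2f}")
print(f"SSR + SSE = {ssr + sse:.2f}")
```

The identity SST = SSR + SSE holds for the least-squares line specifically. For an arbitrary line through the data the two parts generally do not add up to the total.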
The coefficient of determination, $r^2$, is a numerical quantity that tells you how well the
least-squares line does at ______________ values of the response variable y. Although it is
true that this quantity is equal to the square of r, the correlation coefficient, there is much
more to this relationship. You can calculate $r^2$ using the following formula:
$$r^2 = \frac{\text{explained variation due to the linear relationship}}{\text{total variation}} = \frac{SSR}{SST}$$
In brief, $r^2$ is a value between 0 and 1, and the closer it is to 1 the _________ your model is.
A value of $r^2$ equal to 1 implies that your model provides _________ predictions and that the line
passes through every point on the scatterplot exactly, i.e. it is able to “explain”
all of the variation.
In the worst-case scenario, the least-squares line does no better at predicting y than $\bar{y}$. In
this case SSE = SST and $r^2$ = 0.
If you have a coefficient of determination between 0 and 1, for example $r^2$ = 0.606, then
about 61% of the variation in y among the individual subjects is due to the straight-line
relationship between y and x. The other 39% is individual variation among subjects that is
not explained by the linear relationship.
When interpreting the coefficient of determination, $r^2$, say the following:
About ($r^2$ written as a percentage) of the variation in y can be explained by the linear relationship between x and y
(of course you must replace x and y with their real-life meanings)
With some algebra it can be shown that the coefficient of determination is actually the
correlation squared (not easy to do and beyond our course). This fact provides an important
connection between correlation and regression. However, just remember that while it is
true that one is the square of the other, they have different meanings:
r measures the strength and direction of the linear relationship
$r^2$ tells you how much better the linear model is at predicting y-values than simply
using $\bar{y}$.
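The claim that the coefficient of determination equals the correlation squared can be checked numerically without the algebra. The sketch below uses made-up data and hypothetical helper names. It computes r from z-scores, then compares the squared prediction errors made by always guessing the mean of y with those made by the least-squares line.

```python
import statistics as st

def correlation(xs, ys):
    """Average product of z-scores, dividing by n - 1."""
    x_bar, y_bar = st.mean(xs), st.mean(ys)
    s_x, s_y = st.stdev(xs), st.stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

def prediction_errors(xs, ys):
    """Return (SST, SSE): squared error from guessing y_bar vs. using the LSRL."""
    x_bar, y_bar = st.mean(xs), st.mean(ys)
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar
    sst = sum((y - y_bar) ** 2 for y in ys)
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return sst, sse

# Made-up data for illustration only
xs = [2, 4, 6, 8, 10, 12]
ys = [35, 41, 44, 52, 55, 61]

r = correlation(xs, ys)
sst, sse = prediction_errors(xs, ys)

# Fraction of the total variation removed by using the line instead of y_bar;
# for the least-squares line this equals SSR / SST
r_squared = 1 - sse / sst

print(f"r = {r:.4f}, r squared = {r ** 2:.4f}")
print(f"1 - SSE/SST = {r_squared:.4f}")
```

The two printed quantities agree, while r itself additionally carries the sign of the association, which is lost once it is squared.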
Notes:
• Even though $r^2$ is not in the AP Stats curriculum, it has appeared on some exams.
• Regression toward the mean is not in the course.