Chapter 9 – Correlation and Regression

9.1 Correlation
A correlation is a relationship between two statistical variables measured from the
same population. In this chapter, we will consider only linear correlation, which
comes in three types:
Positive Linear Correlation - high values for one variable tend to correspond to high
values for the second variable.
Examples:
• Height vs. Weight for adults
• Blood Alcohol Level vs. Reaction Time

Negative Linear Correlation - high values for one variable tend to correspond to low
values for the second variable.
Examples:
• Age vs. Retail Value of a Ford F-150 Truck
• Blood Alcohol Level vs. Weight for adults after consuming one drink
• Reaction Time vs. Hours of Sleep for the previous night

No Linear Correlation - no relationship between the variables or a non-linear
relationship.
Examples:
• Height vs. Number of Years of Education
• Natural hair color and intelligence quotient score
Scatter Diagrams
One way to determine the type of linear correlation between two variables is by means
of a scatter diagram. To construct a scatter diagram, we plot the value of one
variable along the x-axis and the other along the y-axis, and then for each member of
our population or sample group, we plot a point corresponding to the measurements of
the individual.
We can then determine the type of linear correlation as follows:
Positive Linear Correlation
General trend in the plotted points is from bottom left to top right.
Negative Linear Correlation
General trend in the plotted points is from top left to bottom right.
No Linear Correlation
No general trend in plotted points, or a non-linear trend.
The strength of the linear correlation can be judged by looking at how closely the
points approximate a straight line.
Example: The following table shows the Height (x) vs. Femur Length (y)
measurements (both in inches) for 10 men:

x    70.8  66.2  71.7  68.7  67.6  69.2  66.5  67.2  68.3  65.6
y    42.5  40.2  44.4  42.8  40    47.3  43.4  40.1  42.1  36
[Figure: Scatter Diagram for Height vs. Femur Length. x-axis: Height; y-axis: Length of Femur.]
The diagram shows a ___________ linear correlation between the variables.
Example: The following table gives the weight (x) (in 1000 lbs.) and highway fuel
efficiency (y) (in miles/gallon) for a sample of 13 cars.

Vehicle                x        y
Chevrolet Camaro       3.545    30
Dodge Neon             2.6      32
Honda Accord           3.245    30
Lincoln Continental    3.93     24
Oldsmobile Aurora      3.995    26
Pontiac Grand Am       3.115    30
Mitsubishi Eclipse     3.235    33
BMW 3-Series           3.225    27
Honda Civic            2.44     37
Toyota Camry           3.24     32
Hyundai Accent         2.29     37
Mazda Protégé          2.5      34
Cadillac DeVille       4.02     26
[Figure: Scatter Diagram for Weight vs. Highway MPG. x-axis: weight (1000 lbs); y-axis: MPG Highway.]
The diagram indicates a _____________ linear correlation between the variables.
Coefficient of Correlation
A more precise method of determining the type and strength of a linear correlation is
to calculate the coefficient of linear correlation (denoted by r) for the two variables
using the formula:
\[
r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}
\]
The coefficient of linear correlation will always be a number between −1 and 1, with a
positive value indicating a positive correlation and a negative value a negative
correlation. A coefficient of r = 1 for a data set indicates perfect positive linear
correlation, and r = −1 indicates perfect negative linear correlation, while r = 0 would
indicate no linear correlation. The closer the value of r is to ±1, the stronger the
correlation, and the closer to zero, the weaker the correlation.
Calculating the Coefficient of Correlation
The coefficient of correlation between two variables is most easily calculated by
constructing a table (see example below) with columns that contain the x and y
variable values for each individual, the value of xy for each individual, and the values
of x² and y² for each individual.
The sum of each column is found, and these sums can then be substituted into the
formula above to find r.
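The tabular procedure above can be sketched in Python. This is a minimal illustration and not part of the original notes: the function name correlation_coefficient and the list names are hypothetical, and the data are the height and femur measurements from the earlier scatter diagram example.

```python
from math import sqrt

# Height (x) and femur length (y) for the 10 men in the earlier example.
heights = [70.8, 66.2, 71.7, 68.7, 67.6, 69.2, 66.5, 67.2, 68.3, 65.6]
femurs = [42.5, 40.2, 44.4, 42.8, 40.0, 47.3, 43.4, 40.1, 42.1, 36.0]


def correlation_coefficient(xs, ys):
    """Compute r from the five column sums, mirroring the table method."""
    n = len(xs)
    sum_x = sum(xs)                               # column x
    sum_y = sum(ys)                               # column y
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # column xy
    sum_x2 = sum(x * x for x in xs)               # column x^2
    sum_y2 = sum(y * y for y in ys)               # column y^2

    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


r = correlation_coefficient(heights, femurs)
print(round(r, 3))
```

Each intermediate variable corresponds to one column sum in the table, so the printed value can be compared directly with a hand computation.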
Example: Using our previous data set of height vs femur length for 10 men, we get
the table:

Variable    x        y        xy          x²          y²
            70.8     42.5     3009        5012.64     1806.25
            66.2     40.2     2661.24     4382.44     1616.04
            71.7     44.4     3183.48     5140.89     1971.36
            68.7     42.8     2940.36     4719.69     1831.84
            67.6     40       2704        4569.76     1600
            69.2     47.3     3273.16     4788.64     2237.29
            66.5     43.4     2886.1      4422.25     1883.56
            67.2     40.1     2694.72     4515.84     1608.01
            68.3     42.1     2875.43     4664.89     1772.41
            65.6     36       2361.6      4303.36     1296
Sum         681.8    418.8    28589.09    46520.4     17622.76
The coefficient of correlation for the variables is thus:
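\[
r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}
\]
\[
r = \frac{10(28589.09) - (681.8)(418.8)}{\sqrt{10(46520.4) - (681.8)^2}\,\sqrt{10(17622.76) - (418.8)^2}}
  = \frac{353.06}{\sqrt{352.76}\,\sqrt{834.16}}
  = \frac{353.06}{542.4558} \approx .651
\]
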
Exercise: Calculate the coefficient of correlation for the vehicle weight and miles per
gallon data sets. The table of variables is given below:
Variable    x        y      xy          x²            y²
            3.545    30     106.35      12.567025     900
            2.6      32     83.2        6.76          1024
            3.245    30     97.35       10.530025     900
            3.93     24     94.32       15.4449       576
            3.995    26     103.87      15.960025     676
            3.115    30     93.45       9.703225      900
            3.235    33     106.755     10.465225     1089
            3.225    27     87.075      10.400625     729
            2.44     37     90.28       5.9536        1369
            3.24     32     103.68      10.4976       1024
            2.29     37     84.73       5.2441        1369
            2.5      34     85          6.25          1156
            4.02     26     104.52      16.1604       676
Sums        41.38    398    1240.58     135.93675     12388

\[
r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}
\]
Significance of the Coefficient of Correlation
When the coefficient of correlation is calculated from sample data sets, there is a
chance that a linear correlation will be found when, in fact, no correlation exists
between the population variables. Therefore, before deciding that a linear correlation
exists between two variables when using sample data, we will run a test for
significance.
The population parameter representing the coefficient of correlation for population
data is denoted by ρ (rho), and we use the sample coefficient r to determine if the
hypothesis H₀: ρ = 0 can be rejected. This is in fact a two-tailed t-test, but the resulting
critical r values for the α = .05 and α = .01 levels of significance are listed in Table 11 on
page A 28.
Example: The height vs. femur length data set has a coefficient of correlation of
r = .651. For this sample of 10 men, r must exceed .632 to be significant at the α = .05
level and .765 to be significant at the α = .01 level. Since .632 < .651 < .765, this
correlation is significant at the α = .05 level of significance but not at the α = .01 level.
Note that as the sample size grows, r can be _________________ to reject the null
hypothesis that ρ = 0 and conclude that the correlation is significant (i.e., ρ is not
actually zero).
Exercise: Use Table 11 to determine the level(s) of significance for the vehicle weight
vs. highway mpg data.
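The link between the two-tailed t-test and the critical r values quoted in the height vs. femur length example can be sketched as follows. This is an illustration under stated assumptions, not the method used to build Table 11: it relies on the standard test statistic t = r·sqrt(n − 2)/sqrt(1 − r²) with n − 2 degrees of freedom, and the critical t values 2.306 and 3.355 (8 degrees of freedom, two-tailed, α = .05 and α = .01) are taken from a standard t table rather than from these notes. The function names are hypothetical.

```python
from math import sqrt


def t_statistic(r, n):
    """Test statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)


def critical_r(t_critical, n):
    """Smallest |r| that reaches the given critical t for a sample of size n."""
    return t_critical / sqrt(t_critical ** 2 + (n - 2))


# Assumed two-tailed critical t values for 8 degrees of freedom (n = 10),
# read from a standard t table.
T_CRIT_05 = 2.306
T_CRIT_01 = 3.355

n = 10
r = 0.651  # height vs. femur length

print(round(critical_r(T_CRIT_05, n), 3))  # compare with .632
print(round(critical_r(T_CRIT_01, n), 3))  # compare with .765

t = t_statistic(r, n)
print("significant at .05:", t > T_CRIT_05)
print("significant at .01:", t > T_CRIT_01)
```

Comparing r against a critical r and comparing t against a critical t are the same test; the table of critical r values simply saves the conversion step.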
9.2 Linear Regression
If a pair of variables has a significant linear correlation, then the relationship between
the data values can be roughly approximated by a linear equation. The process of
finding the linear equation which best fits the data values is known as linear
regression and the line of best fit is called the regression line.
It is a fact of linear algebra and analysis that the least squares line of best fit to a set of
data values has an equation of the form ŷ = mx + b, where:
\[
m = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2}
\qquad\text{and}\qquad
b = \bar{y} - m\bar{x} = \frac{\sum y - m\sum x}{n}
\]
Example: For the vehicle weight vs. highway mileage data set, we have:
\[
m = \frac{13(1240.58) - (41.38)(398)}{13(135.937) - (41.38)^2} = \frac{-341.7}{54.877} \approx -6.23
\]
and
\[
b = \frac{398 - (-6.23)(41.38)}{13} = \frac{655.797}{13} \approx 50.45
\]
so our regression line is given by the equation ŷ = −6.23x + 50.45. The graph of this line
is shown on the scatter diagram for the data set below.
[Figure: Vehicle Weight vs. MPG Highway, scatter diagram with the regression line. x-axis: weight (1000 lbs); y-axis: MPG Highway.]
Exercise: Find the equation of the regression line for the height vs. femur length data.
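The slope and intercept formulas for the regression line can be checked numerically with a short script. The sketch below is only an illustration, assuming the vehicle weight and highway mpg values from the earlier table; the function name least_squares_line is hypothetical.

```python
# Vehicle weight (1000 lbs) and highway mpg for the 13 cars in the example.
weights = [3.545, 2.6, 3.245, 3.93, 3.995, 3.115, 3.235,
           3.225, 2.44, 3.24, 2.29, 2.5, 4.02]
mpg = [30, 32, 30, 24, 26, 30, 33, 27, 37, 32, 37, 34, 26]


def least_squares_line(xs, ys):
    """Return (m, b) for the least squares line y-hat = m*x + b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n  # same as y-bar minus m times x-bar
    return m, b


m, b = least_squares_line(weights, mpg)
print(f"y-hat = {m:.2f}x + {b:.2f}")
```

Because the script carries the unrounded slope into the intercept, it gives b ≈ 50.44 rather than the 50.45 obtained by hand with the rounded slope −6.23. The difference is rounding only.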
Using the Regression Line to Predict Data Values
The primary use for the regression equation is to predict values for one variable given
a value for the other variable. Meaningful predictions can only be made for values
within the range of the original data.
Example: Using our regression equation for the car data, we could estimate that a car
that weighed 3000 lbs (x = 3) would have a highway mpg of ŷ = −6.23(3) + 50.45 = 31.76.
Likewise, if we knew a car’s highway mpg was 36 mpg, then we would estimate its
weight by solving 36 = −6.23x + 50.45 to get x = 2.319, or a car that weighs 2319 lbs.
A margin of error could also be added on to these estimates to generate a
confidence interval (using the t-distribution), but we will not cover this in this class.
Note: it would not be meaningful to either predict the mileage of a car weighing 5,000
lbs or predict the weight of a car getting 50 mpg because neither value is within the
range of the original data.
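The rule that predictions are only meaningful inside the range of the original data can be built into a small helper. The following is a sketch, assuming the rounded regression equation for the car data and the smallest and largest weight and mpg values in the vehicle table; the names predict_mpg and estimate_weight are hypothetical.

```python
# Rounded regression line for the car data: y-hat = -6.23x + 50.45
SLOPE = -6.23
INTERCEPT = 50.45

# Smallest and largest observed values in the vehicle table.
WEIGHT_RANGE = (2.29, 4.02)   # weight in 1000 lbs
MPG_RANGE = (24, 37)          # highway mpg


def predict_mpg(weight):
    """Predict highway mpg from weight (1000 lbs), refusing to extrapolate."""
    low, high = WEIGHT_RANGE
    if not low <= weight <= high:
        raise ValueError(f"weight {weight} is outside the data range {low} to {high}")
    return SLOPE * weight + INTERCEPT


def estimate_weight(mpg):
    """Solve the regression equation for x, refusing to extrapolate."""
    low, high = MPG_RANGE
    if not low <= mpg <= high:
        raise ValueError(f"mpg {mpg} is outside the data range {low} to {high}")
    return (mpg - INTERCEPT) / SLOPE


print(round(predict_mpg(3), 2))       # car weighing 3000 lbs
print(round(estimate_weight(36), 3))  # car getting 36 mpg
```

Calling predict_mpg(5) or estimate_weight(50) raises a ValueError in this sketch, which mirrors the note about the 5,000 lb car and the 50 mpg car.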
Exercise: Suppose a crime scene investigator digs up the femur of a man and finds
that it is 38.5 inches long. Based on our regression line for the height vs. femur length
data, what would we estimate the man’s height to have been?
Finding Correlation and Regression Using the TI-83
The coefficient of correlation for a set of paired data values can be found using the TI-83 by placing the x values in L1 and the y values in L2, and then running the LinRegT-Test program.
To run the LinRegT-Test program,
• Press the STAT key
• Use the arrow keys to select the TESTS menu
• Choose number E: LinRegT-Test by pressing ENTER
When you are presented with the menu, choose
• Xlist: L1
• Ylist: L2
• Freq: 1
• β & ρ: ≠ 0
• RegEQ: (blank)
Choose Calculate and press ENTER. The following will be displayed:
a = y-intercept of regression equation (b)
b = Slope of regression equation (m)
r = Coefficient of Correlation (r)
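To graph the scatter diagram:
• Press 2nd, then Y= (STAT PLOT)
• Highlight Plot 1
- On
- no L1, yes L2
- GRAPH
- ZOOM, then ZoomStat (#9)
To graph the regression line:
• Press STAT
• CALC #4: LinReg(ax+b)
• L1, L2; VARS, Y-VARS #1 (Function)
• Enter Y1, press ENTER, then GRAPH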