Chapter 8

Regression and Correlation
(and scatter plots)
Outline
• Making a Scatter plot
• Calculating the Regression Line
• The Correlation
• Alternative Procedures
• Further Considerations of Regressions and Correlations
Estimating the speed of a car
Estimated_i   Actual_i
     8           12
     8           29
    35           24
    13           16
     5            7
     7            5
    14           24
    22           21
    27           26
     5           24
Four Steps
1. Decide the scale and draw the x axis
2. Decide the scale and draw the y axis
3. Add all the points
4. Label axes
[Figure: scatter plot with Actual velocity (mph) on the x axis and Estimated velocity (mph) on the y axis, both axes running from 0 to 40.]
Regression Lines
• There are different types of regression line.
• Here, a straight line that gets close to most of the points.
• One way to define “close to most points” is to find the line that minimizes the sum of the squared vertical distances from the line to the points, that is, the line minimizing $\sum e_i^2$.
• This is called ordinary least squares (OLS) regression.
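As a rough illustration of the least squares criterion, the sketch below computes the sum of squared vertical distances for three candidate lines using the speed data from this chapter. The OLS coefficients (3.62 and 0.5735) are the estimates reported later in the chapter; the other two lines and all function and variable names are assumptions chosen only for comparison.

```python
# Speed data from this chapter: estimated (y) and actual (x) velocity in mph.
estimated = [8, 8, 35, 13, 5, 7, 14, 22, 27, 5]
actual = [12, 29, 24, 16, 7, 5, 24, 21, 26, 24]


def sum_squared_residuals(intercept, slope, xs, ys):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Candidate lines as (intercept, slope); the labels are illustrative only.
candidates = {
    "OLS line": (3.62, 0.5735),
    "estimate equals actual": (0.0, 1.0),
    "flat line at the mean estimate": (14.4, 0.0),
}

sse = {
    label: sum_squared_residuals(b0, b1, actual, estimated)
    for label, (b0, b1) in candidates.items()
}

for label, value in sse.items():
    print(f"{label}: {value:.1f}")
```

Of these three candidates, the OLS line has the smallest sum of squared residuals, which is what the criterion asks for.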
What is a straight line?
• In mathematics: $y = ax + b$
  a is the slope
  b is the intercept
• In statistics: $y_i = \beta_0 + \beta_1 x_i + e_i$
  Subscripts show that different people (cases) have different values on the variables
  $\beta_0$ is the intercept
  $\beta_1$ is the slope
  $e_i$ are the residuals: the vertical distances between the line and the points.
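Scatter plot with regression line and residuals
$\text{Estimated}_i = 3.62 + 0.57\,\text{Actual}_i$
[Figure: scatter plot with the regression line and residuals drawn in; Actual velocity (mph) on the x axis and Estimated velocity (mph) on the y axis, both axes running from 0 to 40.]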
How to find the OLS line?
1 
 ( x  x )( y  y )
 (x  x)
i
i
2
i
 0  y   1x
where y is the mean of the y variable
where x is the mean of the x variabl e.
and
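The two OLS formulas can be sketched in a few lines of plain Python. This is a minimal illustration using the speed data from this chapter, with Actual as x and Estimated as y; the function name `ols_line` is an assumption, not something from the slides.

```python
estimated = [8, 8, 35, 13, 5, 7, 14, 22, 27, 5]   # y variable
actual = [12, 29, 24, 16, 7, 5, 24, 21, 26, 24]   # x variable


def ols_line(xs, ys):
    """Return (intercept, slope) of the ordinary least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of cross-products of deviations from the means.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: sum of squared deviations of x from its mean.
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = ols_line(actual, estimated)
print(round(slope, 4), round(intercept, 2))  # prints 0.5735 3.62
```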
Lots of calculations
       E_i  A_i  E_i - Ē  A_i - Ā  (E_i - Ē)(A_i - Ā)  (E_i - Ē)²  (A_i - Ā)²  pred_i    e_i   e_i²
         8   12     -6.4     -6.8                43.5        41.0        46.2    10.5   -2.5    6.3
         8   29     -6.4     10.2               -65.3        41.0       104.0    20.3  -12.3  150.1
        35   24     20.6      5.2               107.1       424.4        27.0    17.4   17.6  310.4
        13   16     -1.4     -2.8                 3.9         2.0         7.8    12.8    0.2    0.0
         5    7     -9.4    -11.8               110.9        88.4       139.2     7.6   -2.6    6.9
         7    5     -7.4    -13.8               102.1        54.8       190.4     6.5    0.5    0.3
        14   24     -0.4      5.2                -2.1         0.2        27.0    17.4   -3.4   11.4
        22   21      7.6      2.2                16.7        57.8         4.8    15.7    6.3   40.2
        27   26     12.6      7.2                90.7       158.8        51.8    18.5    8.5   71.8
         5   24     -9.4      5.2               -48.9        88.4        27.0    17.4  -12.4  153.3
Sum    144  188        0        0               358.8       956.4       625.6     144      0  750.6

Estimate for β1 = 358.8/625.6 = 0.5735
Estimate for β0 = 14.4 - 0.5735(18.8) = 3.62
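The column totals in these calculations can be checked with a short script. The sketch below recomputes the deviation columns, predictions and residuals from the raw data; the coefficients 3.62 and 0.5735 are the rounded estimates from the chapter, so the residual totals agree only to about one decimal place. Variable names are illustrative assumptions.

```python
estimated = [8, 8, 35, 13, 5, 7, 14, 22, 27, 5]   # E_i, the y variable
actual = [12, 29, 24, 16, 7, 5, 24, 21, 26, 24]   # A_i, the x variable

n = len(estimated)
e_bar = sum(estimated) / n   # mean of E, 14.4
a_bar = sum(actual) / n      # mean of A, 18.8

# Deviations from the means and their products and squares.
dev_e = [e - e_bar for e in estimated]
dev_a = [a - a_bar for a in actual]
sum_cross = sum(de * da for de, da in zip(dev_e, dev_a))
sum_sq_e = sum(de ** 2 for de in dev_e)
sum_sq_a = sum(da ** 2 for da in dev_a)

# Rounded estimates reported in the chapter (assumed here, not re-derived).
b0, b1 = 3.62, 0.5735

predicted = [b0 + b1 * a for a in actual]
residuals = [e - p for e, p in zip(estimated, predicted)]
sum_sq_resid = sum(r ** 2 for r in residuals)

print(f"sum of cross-products: {sum_cross:.1f}")
print(f"sum of squares, E:     {sum_sq_e:.1f}")
print(f"sum of squares, A:     {sum_sq_a:.1f}")
print(f"sum of residuals:      {sum(residuals):.1f}")
print(f"sum of squared resid.: {sum_sq_resid:.1f}")
```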
SStotal = SSmodel + SSresidual
• SStotal is 956.4
• SSresidual is 750.6
• SSmodel = SStotal - SSresidual = 956.4 - 750.6 = 205.8
• R² = SSmodel / SStotal = 205.8/956.4 = 0.22
• R² = the proportion of variation accounted for by the model.
• The square root of R², R (or r), is also useful.
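The arithmetic for this decomposition is short enough to check directly. The sketch below starts from the two sums of squares reported in the chapter; the variable names are assumptions. Note that `math.sqrt` returns only the magnitude of R, not its sign.

```python
import math

ss_total = 956.4      # total sum of squares of the y variable
ss_residual = 750.6   # sum of squared residuals around the OLS line

ss_model = ss_total - ss_residual
r_squared = ss_model / ss_total
r_magnitude = math.sqrt(r_squared)  # magnitude only; the sign is not recovered here

print(f"SSmodel = {ss_model:.1f}")
print(f"R squared = {r_squared:.2f}")
print(f"R = {r_magnitude:.2f}")
```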
Testing the hypothesis R² = 0 in the population
• dfmodel = # variables in the model (here 1)
• dferror = n - dfmodel - 1 (here 10 - 1 - 1 = 8)
• MSSmodel = SSmodel / dfmodel (here 205.8/1 = 205.8)
• MSSerror = SSerror / dferror (here 750.6/8 = 93.8); SSerror is another name for SSresidual
• F(dfmodel, dferror) = MSSmodel / MSSerror
• F(1, 8) = 205.8/93.8 = 2.19
• If a computer does the calculations, it gives p = .18.
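The F calculation, and a rough check of the p value, can be sketched with the standard library alone. The p value step is an assumption of this sketch rather than the method used in the slides: with one model degree of freedom, F equals the square of a t statistic, so the p value is taken as the two-sided tail area of a t distribution, integrated numerically with Simpson's rule. In practice a statistics package would report this directly.

```python
import math

ss_model, ss_error = 205.8, 750.6
n, df_model = 10, 1
df_error = n - df_model - 1

ms_model = ss_model / df_model
ms_error = ss_error / df_error
f_stat = ms_model / ms_error


def t_density(t, df):
    """Probability density of Student's t distribution."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_t_p_value(t, df, steps=10_000):
    """P(|T| > |t|), using Simpson's rule on [0, |t|]; steps must be even."""
    upper = abs(t)
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_density(k * h, df)
    area = total * h / 3
    return 1 - 2 * area


# Valid only because df_model is 1, where F(1, df) is the square of t(df).
p_value = two_sided_t_p_value(math.sqrt(f_stat), df_error)

print(f"F({df_model}, {df_error}) = {f_stat:.2f}, p = {p_value:.2f}")
```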
Scatter plots with several values at the same coordinate
Pearson’s Correlation r
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$
$$r = \frac{358.8}{\sqrt{(625.6)(956.4)}} = \frac{358.8}{773.5} = 0.46$$
r is the square root of R², but keeping the sign of the association. It ranges from -1 to 1,
with negative associations having negative r values.
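A minimal sketch of Pearson's r computed from the raw speed data, using only the standard library. The function name `pearson_r` is an assumption for illustration.

```python
import math

estimated = [8, 8, 35, 13, 5, 7, 14, 22, 27, 5]   # y variable
actual = [12, 29, 24, 16, 7, 5, 24, 21, 26, 24]   # x variable


def pearson_r(xs, ys):
    """Pearson's correlation: cross-products over the root of the two sums of squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


r = pearson_r(actual, estimated)
print(round(r, 2))  # prints 0.46
```

Because the numerator keeps its sign, a negative association gives a negative r without any extra step.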
Testing if r = 0 in the population
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}, \qquad F = \frac{r^2 (n-2)}{1-r^2}$$
Also worth calculating confidence intervals (see text)
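The two test statistics can be computed from r and n as follows. This sketch uses the unrounded r from the sums reported in the chapter (358.8, 625.6 and 956.4); variable names are assumptions.

```python
import math

n = 10
r = 358.8 / math.sqrt(625.6 * 956.4)  # unrounded Pearson correlation

t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
f_stat = r ** 2 * (n - 2) / (1 - r ** 2)

print(f"t = {t_stat:.2f}, F = {f_stat:.2f}")
# The two forms are equivalent: F is the square of t.
print(math.isclose(t_stat ** 2, f_stat))
```

The F value here agrees with the F(1, 8) obtained from the sums of squares, so testing r = 0 and testing R² = 0 lead to the same conclusion when there is one predictor.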
An Alternative Procedure: Spearman's rS
• Rank the data, and then run Pearson’s correlation on the ranks.
• There are some complications and variations if there are lots of ties.
• Consider the impact of univariate outliers (both those that increase r and those that decrease it).
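The rank-then-correlate procedure can be sketched as below, applied to the speed data from earlier in the chapter. Tied values are given the average of the ranks they share, which is one common convention and an assumption here; the slides note that other variations exist. Function names are illustrative.

```python
import math

estimated = [8, 8, 35, 13, 5, 7, 14, 22, 27, 5]
actual = [12, 29, 24, 16, 7, 5, 24, 21, 26, 24]


def average_ranks(values):
    """Rank from 1 (smallest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based ranks in the block
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson_r(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Spearman's rS is Pearson's r computed on the ranks.
r_s = pearson_r(average_ranks(actual), average_ranks(estimated))
print(f"rS = {r_s:.2f}")
```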
Scatterplot on Raw data
Scatterplot on logs of data
[Figure: two scatter plots of Drug offences and Theft offences, with the point Regency labelled. The raw-data panel uses Theft offences and Drug offences; the logs panel uses ln(Thefts + 1) and ln(Drugs + 1). Correlations printed on the panels: r = .76, rS = .74 and r = .94, rS = .78.]
Further Considerations
• What we have discussed applies to straight lines. Look at
scatter plots to see whether other techniques, for curves, are necessary.
• Correlation does not imply causation (but it suggests that
somewhere in the network of hypotheses that includes these
two variables there are causal relationships).