June 20

advertisement
Overview
• Correlation
-Definition
-Deviation Score Formula, Z score formula
-Hypothesis Test
• Regression
-Intercept and Slope
-Unstandardized Regression Line
-Standardized Regression Line
-Hypothesis Tests
Associations among Continuous
Variables
3 characteristics of a relationship
Direction
Positive(+)
Negative (-)
Degree of association
Between –1 and 1
Absolute values signify strength
Form
Linear
Non-linear
Direction
Positive
Negative
C1 vs C2
C1 vs C2
120.0
20.0
80.0
C2
C2
13.3
40.0
6.7
0.0
0.0
4.0
8.0
12.0
C1
0.0
0.0
83.3
166.7
250.0
C1
Large values of X = large values of Y,
Small values of X = small values of Y.
Large values of X = small values of Y
Small values of X = large values of Y
- e.g. IQ and SAT
-e.g. SPEED and ACCURACY
Degree of association
Strong
(tight cloud)
Weak
(diffuse cloud)
C1 vs C2
C1 vs C2
120.0
20.0
80.0
C2
C2
13.3
40.0
6.7
0.0
0.0
4.0
8.0
C1
12.0
0.0
0.0
4.0
8.0
C1
12.0
Form
Linear
Non- linear
Regression & Correlation
The 2001 Mets
250
225
Weight
200
in
Pounds
175
150
67
68
69
70
71
72
73
74
75
76
77
Height in Inches
What is the best fitting straight line? Regression Equation: Y = a + bX
How closely are the points clustered around the line? Pearson’s R
Correlation
Correlation - Definition
Correlation: a statistical technique that measures and describes
the degree of linear relationship between two variables
Scatterplot
Obs
A
B
C
D
E
F
X
1
1
3
4
6
7
Y
1
3
2
5
4
5
Y
Dataset
X
Pearson’s r
A value ranging from -1.00 to 1.00 indicating the strength
and direction of the linear relationship.
Absolute value indicates strength
+/- indicates direction
The Logic of Correlation
MEAN of X
MEAN of Y
Below average on X
Above average on X
Above average on Y
Above average on Y
Below average on X
Above average on X
Below average on Y
Below average on Y
Cross-Product = ( X  X )(Y  Y )
For a strong positive
association, the cross-products
will mostly be positive
The Logic of Correlation
MEAN of X
MEAN of Y
Below average on X
Above average on X
Above average on Y
Above average on Y
Below average on X
Above average on X
Below average on Y
Below average on Y
Cross-Product = ( X  X )(Y  Y )
For a strong negative
association, the cross-products
will mostly be negative
The Logic of Correlation
MEAN of X
MEAN of Y
Below average on X
Above average on X
Above average on Y
Above average on Y
Below average on X
Above average on X
Below average on Y
Below average on Y
Cross-Product = ( X  X )(Y  Y )
For a weak association, the
cross-products will be mixed
Pearson’s r
covariatio n of X, Y
r
total variation of X, Y
Deviation score formula
SP
r
SSXSSY
SP (sum of products) =∑ (X – X)(Y – Y)
Deviation Score Formula
Femur Humerus ( X  X )
A
38
41
B
56
63
C
59
70
D
64
72
E
74
84
mean
58.2
66.00
(Y  Y ) ( X  X )2
SSX
SP
r
SSXSSY
(Y  Y )2 ( X  X )(Y  Y )
SSY
SP
Deviation Score Formula
Femur Humerus ( X  X )
A
38
41 -20.2
B
56
63 -2.2
C
59
70
0.8
D
64
72
5.8
E
74
84 15.8
mean
58.2
66.00
SP
r
SSXSSY
= .99
(Y  Y ) ( X  X )2
(Y  Y )2 ( X  X )(Y  Y )
-25 408.04
4.84
-3
.64
4
625
9
16
505
6.6
3.2
6 33.64
18 249.64
696.8
SSX
36
324
1010
SSY
34.8
284.4
834
SP
Pearson’s r
Below average on X
Above average on X
Below Average on Y
Above average on Y
Below average on X
Above average on X
Below average on Y
Below average on Y
Deviation score formula
SP
r
SSXSSY
For a strong positive
association, the SP will be a big
positive number
SP (sum of products) =∑ (X – X)(Y – Y)
Pearson’s r
Below average on X
Above average on X
Below Average on Y
Above average on Y
Below average on X
Above average on X
Below average on Y
Below average on Y
Deviation score formula
SP
r
SSXSSY
For a strong negative
association, the SP will be a big
negative number
SP (sum of products) =∑ (X – X)(Y – Y)
Pearson’s r
Below average on X
Above average on X
Below Average on Y
Above average on Y
Below average on X
Above average on X
Below average on Y
Below average on Y
Deviation score formula
SP
r
SSXSSY
For a weak association, the SP
will be a small number (+ and –
will cancel each other out)
SP (sum of products) =∑ (X – X)(Y – Y)
Pearson’s r
covariatio n of X, Y
r
total variation of X, Y
Z score formula
Z

r
X
Zy
n -1
Standardized cross-products
Z-score formula
Femur Humerus
A
38
41
B
56
63
C
59
70
D
64
72
E
74
84
mean 58.2
66.00
s 13.20
15.89
Z

r
X
Zy
n -1
ZX
ZY
ZXZY
Z-score formula
Femur Humerus
ZX
ZY
A
38
41 -1.530 -1.573
B
56
63 -0.167 -0.189
C
59
70 0.061 0.252
D
64
72 0.439 0.378
E
74
84 1.197 1.133
mean 58.2
66.00
s 13.20
15.89
Z

r
X
Zy
n -1
ZXZY
Z-score formula
Femur Humerus
ZX
ZY
ZXZY
A
38
41 -1.530 -1.573
2.408
B
56
63 -0.167 -0.189
0.031
C
59
70 0.061 0.252
0.015
D
64
72 0.439 0.378
0.166
E
74
84 1.197 1.133
1.356
mean 58.2
66.00
∑=3.976
s 13.20
15.89
Z

r
X
Zy
n -1
r = .99
Formulas for R
Deviations formula
SP
r
SS X SSY
Z score formula
Z

r
X
Zy
n -1
Interpretation of R
• A measure of strength of association: how closely
do the points cluster around a line?
•A measure of the direction of association: is it
positive or negative?
Interpretation of R
• r = .10
very small association, not usually reliable
• r = .20
small association
• r = .30
typical size for personality and social studies
• r = .40
moderate association
• r = .60
you are a research rock star
• r = .80
hmm, are you for real?
Interpretation of R-squared
r 
2
• The amount of covariation compared to the amount of total
variation
• “The percent of total variance that is shared variance”
• E.g. “If r = .80, then X explains 64% of the variability in Y”
(and vice versa)
Hypothesis testing with r
Hypotheses
H0:  = 0
HA :  ≠ 0
Test statistic = r
r
t
SEr
1 r 2
SEr 
n2
df  n  2
Or just use table E.2 to find
critical values of r
Practice
A
B
C
D
E
mean
alcohol
6.47
6.13
6.19
4.89
5.63
tobacco ( X  X )
4.03
3.76
3.77
3.34
3.47
(Y  Y ) ( X  X )2
SSX
r
SP
SS X SSY
df  n  2
(Y  Y )2 ( X  X )(Y  Y )
SSY
SP
Practice
A
B
C
D
E
mean
alcohol tobacco ( X  X )
6.47
4.03
6.13
3.76
6.19
3.77
4.89
3.34
5.63
3.47
r  .95
(Y  Y ) ( X  X )2
1.55
SSX
df  3
(Y  Y )2 ( X  X )(Y  Y )
.30
SSY
.64
SP
Properties of R
• A standardized statistic – will not change if you change the
units of X or Y. (bc based on z-scores)
•The same whether X is correlated with Y or vice versa
• Fairly unstable with small n
•Vulnerable to outliers
• Has a skewed distribution
Linear Regression
Linear Regression
• But how do we describe the line?
• If two variables are linearly related it is possible to develop
a simple equation to predict one variable from the other
• The outcome variable is designated the Y variable, and the
predictor variable is designated the X variable
• E.g. centigrade to Fahrenheit:
F = 32 + 1.8C
this formula gives a specific straight line
The Linear Equation
Where
a = intercept
b = slope
X = the predictor
Y = the criterion
a and b are constants
in a given line;
X and Y change
Criterion
• F = 32 + 1.8(C)
• General form is Y = a + bX
• The prediction equation: Y’ = a+ bX
Predictor
The Linear Equation
Where
a = intercept
b = slope
X = the predictor
Y = the criterion
Criterion
• F = 32 + 1.8(C)
• General form is Y = a + bX
• The prediction equation: Y’ = a + bX
Different b’s…
Predictor
The Linear Equation
Where
a = intercept
b = slope
X = the predictor
Y = the criterion
Criterion
• F = 32 + 1.8(C)
• General form is Y = a + bX
• The prediction equation: Y’ = a + bX
Different a’s…
Predictor
The Linear Equation
Where
a = intercept
b = slope
X = the predictor
Y = the criterion
Criterion
• F = 32 + 1.8(C)
• General form is Y = a + bX
• The prediction equation: Y’ = a + bX
Different a’s and b’s …
Predictor
Slope and Intercept
• Equation of the line
Y '  a  bX
 The slope b: the amount of change in y with one unit
change in x
 The intercept a: the value of y when x is zero
Slope and Intercept
• Equation of the line
 The slope b  r
 The intercept
Y '  a  bX
sy
SP

s x SS X
a  Y  bX
The slope is influenced by r, but is not the same as r
When there is no linear association (r = 0), the
regression line is horizontal. b=0.
The 2001 Mets
45
40
Age
in
Years
35
30
25
20
68
69
70
71
72
73
74
75
Height in Inches
and our best estimate of age is 29.5 at all heights.
76
77
When the correlation is perfect (r = ± 1.00),
all the points fall along a straight line with a
s
slope b  r s
y
x
The 2001 Mets
200
195
190
Height
in
185
centimeters
180
175
170
68
69
70
71
72
73
Height in Inches
74
75
76
77
When there is some linear association (0<|r|<1), the
regression line fits as close to the points as possible
s
and has a slope b  r
y
sx
The 2001 Mets
250
225
Weight
200
in
Pounds
175
150
67
68
69
70
71
72
73
Height in Inches
74
75
76
77
Where did this line come from?
 It is a straight line which is drawn through a scatterplot, to
summarize the relationship between X and Y
 It is the line that minimizes the squared deviations (Y’ – Y)2
 We call these vertical deviations “residuals”
Regression lines
Minimizing the squared vertical distances, or “residuals”
Unstandardized Regression Line
• Equation of the line
 The slope b  r
 The intercept
Y '  a  bX
sy
SP

s x SS X
a  Y  bX
Properties of b (slope)
• An unstandardized statistic – will change if you change
the units of X or Y.
•Depends on whether Y is regressed on X or vice versa
Standardized Regression Line
• Equation of the line
 The slope
 r
 The intercept
none
ZY '   (Z X )
A person 1 stdev above the mean on height would be
how many stdevs above the mean on weight?
Properties of β (standardized slope)
• A standardized statistic – will not change if you change
the units of X or Y.
•Is equal to r, in simple linear regression
Exercise
X
Y
1
3
1
4
5
4
5
6
9
6
9
Calculate:
r=
b=
a=
β=
Write the regression equation:
Write the standardized
equation:
7
Y '  a  bX
br
sy
sx
a  Y  bX
Exercise
X
Calculate:
Y
1
3
1
4
5
4
5
6
9
6
9
r = .866
b = .375
a = 3.125
β = .866
Write the regression equation:
Y '  3.125  .375 X
Write the standardized
equation: Z  .866Z
7
Y'
Y '  a  bX
br
sy
sx
a  Y  bX
X
Regression Coefficients Table
Predictor
Unstandardized
Coefficient
Standard Standardized
error
Coefficient
Intercept
a
SEa
-
Variable X
b
SEb

t
a
SEa
b
t
SEb
t
sig
Summary
Correlation: Pearson’s r
r
Z
r
SP
SS X SSY
X
ZY
n 1
Unstandardized Regression Line
a  Y  bX
b
SP
SS X
br
sy
sx
Standardized Regression Line
 r
t
r
SEr
Y '  a  bX
t
a
SEa
t
b
SEb
ZY '   (Z X )
Exercise in Excel
X
Y
1
-1.5
1
0
2
-3
4
-4
7
-2
9
-6
Calculate:
r=
b=
a=
β=
Write the regression equation:
Write the standardized equation:
Sketch the scatterplot and regression line
Download