Association Variables

Association
• Variables
– Response: an outcome variable whose values exhibit variability.
– Explanatory: a variable that we use to try to explain the variability in the response.

Association
• There is an association between two variables if values of one variable are more likely to occur with certain values of a second variable.

Picturing Association
• Two Categorical (Qualitative).
– Cross-tabs table, mosaic plot.
• Two Numerical (Quantitative).
– Scatter diagram.

Categorical Data
• Who?
– Students in a statistics class at Penn State University.
• What?
– “With whom is it easiest to make friends?” Opposite sex, same sex, no difference.
– Gender. Male, female.

Cross-tabs Table
With whom is it easiest to make friends?

          Same Sex   Opposite Sex   No Diff   Total
Female       16           58           63      137
Male         13           15           40       68
Total        29           73          103      205

Bar Graph
With whom is it easiest to make friends?
[Figure: bar graph of the distribution of Answer. No Diff 50.2%, Opposite 35.6%, Same 14.1%.]

Frequencies
Level       Count   Prob
No Diff      103    0.50244
Opposite      73    0.35610
Same          29    0.14146
Total        205    1.00000
N Missing: 0. 3 Levels.

Percentages
With whom is it easiest to make friends?

Count (Row %)   Same Sex     Opposite Sex   No Diff       Total
Female          16 (11.7%)   58 (42.3%)     63 (46.0%)    137 (100%)
Male            13 (19.1%)   15 (22.1%)     40 (58.8%)     68 (100%)
Total           29           73             103           205

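The row percentages in this table can be reproduced from the counts alone. The following is a minimal sketch; the dictionary layout and the function name `row_percentages` are illustrative choices, not part of the original slides.

```python
# Counts from the cross-tabs table (rows: gender, columns: answer).
counts = {
    "Female": {"Same": 16, "Opposite": 58, "No Diff": 63},
    "Male": {"Same": 13, "Opposite": 15, "No Diff": 40},
}


def row_percentages(table):
    """Return each cell as a percentage of its row total, rounded to 1 decimal."""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())  # row total, e.g. 137 for Female
        result[row] = {col: round(100 * n / total, 1) for col, n in cells.items()}
    return result


row_pct = row_percentages(counts)
for row, cells in row_pct.items():
    print(row, cells)
```

Dividing by the row total (rather than the grand total of 205) is what makes the female and male distributions directly comparable even though the two groups differ in size.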
Mosaic Plot
[Figure: mosaic plot of Answer (Same, Opposite, No Diff) by Gender (Female, Male); the vertical axis shows proportion from 0.00 to 1.00.]

Interpretation
• More than 50% of males say no difference while less than 50% of females say no difference.
• Females are about twice as likely as males to say opposite (42.3% versus 22.1%).
• Males are more likely than females to say the same (19.1% versus 11.7%), roughly 1.6 times as likely.

Scatter Plot
• Statistics is about … variation.
• Recognize, quantify and try to explain variation.
• Variation in two quantitative variables is displayed in a scatter plot.

Scatter Plot
• Numerical variable on the vertical axis, y, is the response variable.
• Numerical variable on the horizontal axis, x, is the explanatory variable.

Scatter Plot
• Example: Body mass (kg) and Bite force (N) for Canidae.
– y, Response: Bite force (N)
– x, Explanatory: Body mass (kg)
– Cases: 28 species of Canidae.

Bivariate Fit of BFca (N) By Body Mass (kg)
[Figure: scatter plot with BFca (N), 0 to 500, on the vertical axis and Body Mass (kg), 0 to 40, on the horizontal axis.]

Positive Association
• Positive Association
– Above average values of Bite force are associated with above average values of Body mass.
– Below average values of Bite force are associated with below average values of Body mass.

Scatter Plot
• Example: Outside temperature and amount of natural gas used.
– Response: Natural gas used (1000 ft³).
– Explanatory: Outside temperature (°C).
– Cases: 26 days.

[Figure: scatter plot of Gas, 0 to 10, on the vertical axis versus Temp, -5.0 to 15.0, on the horizontal axis.]

Negative Association
– Above average values of gas are associated with below average temperatures.
– Below average values of gas are associated with above average temperatures.

Association
• Positive
– As x goes up, y tends to go up.
• Negative
– As x goes up, y tends to go down.

Correlation
• Linear Association
– How closely do the points on the scatter plot represent a straight line?
– The correlation coefficient gives the direction of and quantifies the strength of the linear association between two quantitative variables.

Correlation
• Standardize y: $z_y = \dfrac{y - \bar{y}}{s_y}$
• Standardize x: $z_x = \dfrac{x - \bar{x}}{s_x}$

Bite Force vs Body Mass of Canidae
[Figure: scatter plot of Standardized Bite Force versus Standardized Body Mass; both axes run from about -1 to 3.]

Correlation Coefficient
$r = \dfrac{\sum z_x z_y}{n - 1}$
$r = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y}$

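As an illustration of the first formula, the sketch below standardizes both variables and averages the products of the z-scores with an n - 1 divisor. The data are hypothetical (they are not the Canidae measurements), and the function name `correlation` is chosen here for illustration.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation coefficient r = sum(z_x * z_y) / (n - 1)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    z_x = [(x - x_bar) / s_x for x in xs]  # standardized x
    z_y = [(y - y_bar) / s_y for y in ys]  # standardized y
    return sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)


# Hypothetical data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

r = correlation(xs, ys)
print(round(r, 4))
```

Points that lie exactly on an upward-sloping line, such as `correlation([1, 2, 3], [2, 4, 6])`, give r = 1.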
Correlation Coefficient
• Body mass and Bite force
$r = \dfrac{\sum z_x z_y}{n - 1} = \dfrac{26.4796}{27}$
$r = 0.9807$

Correlation Coefficient
• There is a very strong positive correlation, linear association, between the body mass and bite force for the various species of Canidae.

JMP
• Analyze
– Multivariate methods
– Multivariate
• Y, Columns
– Body mass
– BF ca (Bite force at the canine)

Multivariate
Correlations
                  Body Mass (kg)   BFca (N)
Body Mass (kg)        1.0000        0.9807
BFca (N)              0.9807        1.0000

Scatterplot Matrix
[Figure: scatterplot matrix of Body Mass (kg), 5 to 40, and BFca (N), 100 to 500.]

Correlation Properties
• The sign of r indicates the direction of the association.
• The value of r is always between -1 and +1.
• Correlation has no units.
• Correlation is not affected by changes of center or scale.

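The last property can be checked numerically. The sketch below uses hypothetical masses and forces (not the Canidae data) and the covariance form of the formula. It converts kilograms to grams and applies an arbitrary shift and positive rescaling to y, then confirms that r is unchanged. It also shows one caveat worth keeping in mind: multiplying a variable by a negative number flips the sign of r.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """r = sum((x - x_bar)(y - y_bar)) / ((n - 1) * s_x * s_y)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * stdev(xs) * stdev(ys))


# Hypothetical data for illustration only.
mass_kg = [2.0, 5.0, 9.0, 14.0, 30.0]
force_n = [60.0, 95.0, 150.0, 210.0, 430.0]

r_original = correlation(mass_kg, force_n)

# Change of scale on x (kg to g); change of center and scale on y.
mass_g = [1000 * m for m in mass_kg]
force_shifted = [0.5 * f + 10 for f in force_n]
r_rescaled = correlation(mass_g, force_shifted)

# A negative scale factor reverses the direction of the association.
r_flipped = correlation(mass_kg, [-f for f in force_n])

print(round(r_original, 4), round(r_rescaled, 4), round(r_flipped, 4))
```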
Algebra Review
• The equation of a straight line
• y = mx + b
– m is the slope: the change in y over the change in x, or rise over run.
– b is the y-intercept: the value where the line cuts the y axis.

y = 3x + 2
[Figure: graph of the line y = 3x + 2, with x from -5 to 5 and y from -15 to 15.]

Review
• y = 3x + 2
– x = 0, y = 2 (y-intercept)
– x = 3, y = 11
– Change in y (+9) divided by the change in x (+3) gives the slope, 3.

Linear Regression
• Example: Body mass (kg) and Bite force (N) for Canidae.
– y, Response: Bite force (N)
– x, Explanatory: Body mass (kg)
– Cases: 28 species of Canidae.

Correlation Coefficient
• Body mass and Bite force
$r = \dfrac{\sum z_x z_y}{n - 1} = \dfrac{26.4796}{27}$
$r = 0.9807$

Correlation Coefficient
• There is a strong correlation, linear association, between the body mass and bite force for the various species of Canidae.

Linear Model
• The linear model is the equation of a straight line through the data.
• A point on the straight line through the data gives a predicted value of y, denoted ŷ.

Residual
• The difference between the observed value of y and the predicted value of y, ŷ, is called the residual.
• Residual = y - ŷ

Regression Plot
[Figure: plot of BF ca (N), 0 to 500, versus Body mass (kg), 0 to 35, with a residual labeled.]

Line of “Best Fit”
• There are lots of straight lines that go through the data.
• The line of “best fit” is the line for which the sum of squared residuals is the smallest: the least squares line.

Line of “Best Fit”
• Some positive and some negative residuals but they sum to zero.
• Passes through the point $(\bar{x}, \bar{y})$.

Line of “Best Fit”
$\hat{y} = a + bx$
Least squares
– slope: $b = r \dfrac{s_y}{s_x}$
– intercept: $a = \bar{y} - b\bar{x}$

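The slope and intercept formulas, together with the two properties of the least squares line stated earlier (residuals sum to zero; the line passes through the point of means), can be checked on a small data set. The sketch below uses hypothetical data, and `fit_line` is an illustrative name rather than anything from the slides.

```python
from statistics import mean, stdev


def fit_line(xs, ys):
    """Least squares line y_hat = a + b*x via b = r*s_y/s_x and a = y_bar - b*x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    r = cross / ((n - 1) * s_x * s_y)  # correlation coefficient
    b = r * s_y / s_x                  # slope
    a = y_bar - b * x_bar              # intercept
    return a, b


# Hypothetical data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"sum of residuals = {sum(residuals):.10f}")
print(f"line at x_bar = {a + b * mean(xs):.3f}, y_bar = {mean(ys):.3f}")
```

Up to floating-point rounding, the residuals sum to zero and the fitted value at the mean of x equals the mean of y.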
Least Squares Estimates
Body mass, x: $\bar{x} = 9.207$ kg, $s_x = 8.016$ kg
Bite Force, y: $\bar{y} = 154.029$ N, $s_y = 109.760$ N
$r = 0.9807$

Least Squares Estimates
$b = 0.9807 \times \dfrac{109.760}{8.016} = 13.428$
$a = 154.029 - 13.428(9.207) = 30.397$
$\hat{y} = 30.397 + 13.428x$

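The arithmetic above can be reproduced from the five summary statistics. This is a sketch only; the function name is illustrative. It rounds the slope to three decimals before computing the intercept, because that is what the slide's calculation does. Carrying the unrounded slope forward instead would give an intercept of about 30.394 rather than 30.397.

```python
def least_squares_from_summary(x_bar, s_x, y_bar, s_y, r):
    """Slope and intercept of the least squares line from summary statistics."""
    b = round(r * s_y / s_x, 3)  # slope, rounded as on the slide
    a = y_bar - b * x_bar        # intercept uses the rounded slope
    return a, b


# Summary statistics for body mass (x, kg) and bite force (y, N).
a, b = least_squares_from_summary(
    x_bar=9.207, s_x=8.016, y_bar=154.029, s_y=109.760, r=0.9807
)
print(f"slope b = {b:.3f}, intercept a = {a:.3f}")
```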
Interpretation
• Slope: for a 1 kg increase in body mass, the bite force increases, on average, by 13.428 N.
• Intercept: there is no reasonable interpretation of the intercept in this context because one wouldn’t see a canid with a body mass of 0 kg.

Bite Force vs Body Mass
[Figure: scatter plot of BF ca (N), 0 to 500, versus Body mass (kg), 0 to 35, with the line $\hat{y} = 30.397 + 13.428x$.]

Prediction
• Least squares line
$\hat{y} = 30.397 + 13.428x$
$x = 25$
$\hat{y} = 30.397 + 13.428(25) = 366.1$ N

Residual
• Body mass, x = 25 kg
• Bite force, y = 351.5 N
• Predicted, ŷ = 366.1 N
• Residual, y - ŷ = 351.5 - 366.1 = -14.6 N

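The prediction and residual for the 25 kg case can be written as a short sketch. The function name `predict` and its default arguments are illustrative; the coefficients are the least squares estimates from the slides.

```python
def predict(x, a=30.397, b=13.428):
    """Predicted bite force (N) for body mass x (kg) from y_hat = a + b*x."""
    return a + b * x


x = 25               # body mass in kg
y_observed = 351.5   # observed bite force in N

y_hat = predict(x)
residual = y_observed - y_hat  # observed minus predicted

print(f"predicted = {y_hat:.1f} N, residual = {residual:.1f} N")
```

The negative residual means the observed bite force lies below the line: the model overpredicts for this case.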
Residuals
• Residuals help us see if the linear model makes sense.
• Plot residuals versus the explanatory variable.
– If the plot is a random scatter of points, then the linear model is the best we can do.

Plot of Residuals vs Body Mass
[Figure: scatter plot of Residual, -30 to 60, versus Body mass (kg), 0 to 35.]

Interpretation of the Plot
• The residuals are scattered randomly. This indicates that the linear model is an appropriate model for the relationship between body mass and bite force for Canidae.

(r)² or R²
• The square of the correlation coefficient gives the proportion of the variation in y that is accounted for, or explained, by the linear relationship with x.

Body mass and Bite force
• r = 0.9807
• (r)² = (0.9807)² = 0.962 or 96.2%
• 96.2% of the variation in bite force can be explained by the linear relationship with body mass.

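One way to see why r² reads as "variation explained" is to compare the squared residuals left over after fitting the least squares line with the total squared variation of y about its mean. The sketch below uses hypothetical data (not the Canidae measurements) and checks that 1 - SSE/SST matches r². The names `sse` and `sst` are conventional labels assumed here, not taken from the slides.

```python
from math import sqrt
from statistics import mean

# Hypothetical data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

x_bar, y_bar = mean(xs), mean(ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

r = sxy / sqrt(sxx * syy)  # correlation coefficient
b = sxy / sxx              # least squares slope (equal to r * s_y / s_x)
a = y_bar - b * x_bar      # least squares intercept

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained variation
sst = syy                                                  # total variation in y
r_squared = 1 - sse / sst

print(f"r^2 = {r ** 2:.4f}, 1 - SSE/SST = {r_squared:.4f}")
```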
Regression Conditions
• Quantitative variables: both variables should be quantitative.
• Linear model: does the scatter diagram show a reasonably straight line?
• Outliers: watch out for outliers as they can be very influential.

Regression Cautions
• Beware of extraordinary points.
• Don’t extrapolate beyond the data.
• Don’t infer x causes y just because there is a good linear model relating the two variables.

Extraordinary Points
• https://netfiles.uiuc.edu/jimarden/www/cuwu/datalist.html
– Scatter Plots
– Check – Blank Plot and click Update
– Add point

Don’t Extrapolate
• Explanatory (x): Average outdoor temperature (°C).
• Response (y): Amount of natural gas used (1000 cu ft).
$\hat{y} = 6.85 - 0.393x$

Don’t Extrapolate
[Figure: scatter plot of Gas versus Temp, with the Temp axis extended to 20.]

Don’t Extrapolate
• Explanatory (x = 20): Average outdoor temperature (°C).
• Response (y): Amount of natural gas used (1000 cu ft).
$\hat{y} = 6.85 - 0.393(20)$
$\hat{y} = -1.01$

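Evaluating the fitted line at 20 °C shows why extrapolation is risky: the prediction is a negative amount of gas, which is physically impossible. The sketch below reproduces the arithmetic and also finds the temperature at which the line crosses zero. The function name `predict_gas` is illustrative.

```python
def predict_gas(temp_c, a=6.85, b=-0.393):
    """Predicted natural gas use (1000 cu ft) from y_hat = 6.85 - 0.393 * temp."""
    return a + b * temp_c


y_hat = predict_gas(20)
zero_crossing = 6.85 / 0.393  # temperature (deg C) at which the line predicts zero gas

print(f"predicted gas at 20 C = {y_hat:.2f} (1000 cu ft)")
print(f"line predicts zero gas at about {zero_crossing:.1f} C")
```

Any temperature above roughly 17.4 °C yields a negative prediction, so the linear model should not be trusted beyond the range of temperatures actually observed.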
Correlation ≠ Causation
• Don’t confuse correlation with causation.
– There is a strong positive correlation between the number of crimes committed in communities and the number of 2nd graders in those communities.
• Beware of lurking variables.