Association Variables

Association

Variables
–Response – an outcome variable
whose values exhibit variability.
–Explanatory – a variable that we
use to try to explain the
variability in the response.
Association

There is an association
between two variables if
values of one variable are
more likely to occur with
certain values of a second
variable.
Picturing Association

Two Categorical
(Qualitative).
–Cross-tabs table, mosaic plot.

Two Numerical
(Quantitative).
–Scatter diagram.
Categorical Data

Who?
– Students in a statistics class at Penn
State University.

What?
– “With whom is it easiest to make
friends?” Opposite sex, same sex,
no difference.
– Gender. Male, female.
Cross-tabs Table
With whom is it easiest to make friends?

          Same Sex   Opposite Sex   No Diff   Total
Female        16          58           63      137
Male          13          15           40       68
Total         29          73          103      205
Bar Graph
With whom is it easiest to make friends?
[Figure: bar graph of the distribution of Answer, with a Count axis from 0 to 100. Bars: No Diff (50.2%), Opposite (35.6%), Same (14.1%).]

Frequencies
Level       Count   Prob
No Diff       103   0.50244
Opposite       73   0.35610
Same           29   0.14146
Total         205   1.00000
N Missing: 0
3 Levels
Percentages
With whom is it easiest to make friends?
Count (Row %)

          Same Sex     Opposite Sex   No Diff       Total
Female    16 (11.7%)   58 (42.3%)     63 (46.0%)    137 (100%)
Male      13 (19.1%)   15 (22.1%)     40 (58.8%)     68 (100%)
Total     29           73            103            205
Mosaic Plot
[Figure: mosaic plot of Answer (Same, Opposite, No Diff; vertical scale 0.00 to 1.00) by Gender (Female, Male).]
Interpretation



More than 50% of males (58.8%) say no
difference, while less than 50%
of females (46.0%) say no difference.
Females are about twice as
likely as males to say opposite sex
(42.3% vs. 22.1%).
Males are about 1.6 times as likely
as females to say same sex
(19.1% vs. 11.7%).
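The row percentages behind these comparisons can be recomputed from the counts in the cross-tabs table. The following is a minimal sketch; the dictionary layout and the function name `row_percentages` are illustrative choices, not part of the original slides.

```python
# Counts from the cross-tabs table (rows: gender, columns: answer).
counts = {
    "Female": {"Same": 16, "Opposite": 58, "No Diff": 63},
    "Male": {"Same": 13, "Opposite": 15, "No Diff": 40},
}


def row_percentages(table):
    """Return each cell as a percentage of its row total."""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())
        result[row] = {col: 100 * n / total for col, n in cells.items()}
    return result


row_pct = row_percentages(counts)
for gender, cells in row_pct.items():
    print(gender, {answer: round(p, 1) for answer, p in cells.items()})

# Ratio used in the interpretation: females vs. males answering "Opposite".
ratio_opposite = row_pct["Female"]["Opposite"] / row_pct["Male"]["Opposite"]
print(round(ratio_opposite, 2))
```

Comparing row percentages rather than raw counts matters here because there are about twice as many females as males in the class.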
Scatter Plot



Statistics is about …
variation.
Recognize, quantify and try
to explain variation.
Variation in two quantitative
variables is displayed in a
scatter plot.
Scatter Plot


Numerical variable on the
vertical axis, y, is the
response variable.
Numerical variable on the
horizontal axis, x, is the
explanatory variable.
Scatter Plot

Example: Body mass (kg)
and Bite force (N) for
Canidae.
–y, Response: Bite force (N)
–x, Explanatory: Body mass (kg)
–Cases: 28 species of Canidae.
[Figure: Bivariate Fit of BFca (N) By Body Mass (kg). Scatter plot with BFca (N) on the vertical axis (0 to 500) and Body Mass (kg) on the horizontal axis (0 to 40).]
Positive Association

Positive Association
– Above average values of Bite force
are associated with above average
values of Body mass.
– Below average values of Bite force
are associated with below average
values of Body mass.
Scatter Plot

Example: Outside
temperature and amount of
natural gas used.
– Response: Natural gas used
(1000 ft³).
– Explanatory: Outside
temperature (°C).
– Cases: 26 days.
[Figure: scatter plot of Gas (vertical axis, 0 to 10) against Temp (horizontal axis, -5.0 to 15.0).]
Negative Association
–Above average values of gas
are associated with below
average temperatures.
–Below average values of gas
are associated with above
average temperatures.
Association

Positive
–As x goes up, y tends to go up.

Negative
–As x goes up, y tends to go
down.
Correlation

Linear Association
– How closely do the points on the
scatter plot represent a straight line?
– The correlation coefficient gives the
direction of and quantifies the
strength of the linear association
between two quantitative variables.
Correlation

Standardize y
$z_y = \frac{y - \bar{y}}{s_y}$

Standardize x
$z_x = \frac{x - \bar{x}}{s_x}$
[Figure: Bite Force vs Body Mass of Canidae. Scatter plot of Standardized Bite Force (vertical axis) against Standardized Body Mass (horizontal axis).]
Correlation Coefficient
$r = \frac{\sum z_x z_y}{n - 1}$

$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1) s_x s_y}$
Correlation Coefficient

Body mass and Bite force
$r = \frac{\sum z_x z_y}{n - 1} = \frac{26.4796}{27}$

$r = 0.9807$
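The correlation coefficient can be computed directly from the standardized values. The sketch below follows the formula r = (sum of z_x z_y) / (n - 1). The five data points are hypothetical, since the 28 Canidae measurements are not listed in the slides. Only the final arithmetic, 26.4796 / 27, comes from the slides.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation: sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1)
    z_x = [(x - x_bar) / s_x for x in xs]
    z_y = [(y - y_bar) / s_y for y in ys]
    return sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)


# Hypothetical body mass (kg) and bite force (N) values, for illustration only.
body_mass = [2.0, 5.0, 9.0, 14.0, 30.0]
bite_force = [50.0, 95.0, 160.0, 210.0, 440.0]
r_example = correlation(body_mass, bite_force)
print(round(r_example, 4))

# Arithmetic from the slides: the sum of z_x * z_y is 26.4796 with n = 28.
r_slides = 26.4796 / (28 - 1)
print(round(r_slides, 4))
```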
Correlation Coefficient

There is a very strong
positive correlation, linear
association, between the
body mass and bite force for
the various species of
Canidae.
JMP
Analyze > Multivariate Methods > Multivariate
Y, Columns
– Body mass
– BF ca (Bite force at the canine)
Multivariate
Correlations
                 Body Mass (kg)   BFca (N)
Body Mass (kg)       1.0000        0.9807
BFca (N)             0.9807        1.0000

[Figure: Scatterplot Matrix of Body Mass (kg) and BFca (N).]
Correlation Properties

The sign of r indicates the direction
of the association.
The value of r is always between
–1 and +1.

Correlation has no units.

Correlation is not affected by
changes of center or scale.
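The last property can be checked numerically. In this sketch, converting temperature from Celsius to Fahrenheit changes both the center and the scale of x, yet r stays the same. The data values and the helper `correlation` are hypothetical and used only for illustration.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation: sum of z-score products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)
    )
    return total / (len(xs) - 1)


# Hypothetical temperatures (Celsius) and gas use, not the 26 days in the slides.
temp_c = [-4.0, 0.0, 3.0, 7.0, 12.0]
gas = [9.1, 6.8, 5.9, 4.2, 2.0]

# Change of center (+32) and scale (x1.8): Celsius to Fahrenheit.
temp_f = [1.8 * t + 32 for t in temp_c]

r_celsius = correlation(temp_c, gas)
r_fahrenheit = correlation(temp_f, gas)
print(r_celsius, r_fahrenheit)
```

Standardizing removes the units, which is why r has no units and is unchanged by a shift or a rescaling with a positive factor.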

Algebra Review


The equation of a straight line
y = mx + b
– m is the slope – the change in y
over the change in x – or rise over
run.
– b is the y-intercept – the value
where the line cuts the y axis.
[Figure: graph of the line y = 3x + 2, with x from -5 to 5 and y from -15 to 15.]
Review

y = 3x + 2
– x = 0 gives y = 2 (the y-intercept).
– x = 3 gives y = 11.
– Change in y (+9) divided by the
change in x (+3) gives the slope, 3.
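The rise-over-run calculation for y = 3x + 2 can be written out as a short check. The function name `line` is an illustrative choice.

```python
def line(x, m=3, b=2):
    """Evaluate the straight line y = m*x + b."""
    return m * x + b


y_at_0 = line(0)  # y-intercept
y_at_3 = line(3)

# Slope: change in y divided by change in x.
slope = (y_at_3 - y_at_0) / (3 - 0)
print(y_at_0, y_at_3, slope)
```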
Linear Regression

Example: Body mass (kg)
and Bite force (N) for
Canidae.
–y, Response: Bite force (N)
–x, Explanatory: Body mass (kg)
–Cases: 28 species of Canidae.
Correlation Coefficient

Body mass and Bite force
$r = \frac{\sum z_x z_y}{n - 1} = \frac{26.4796}{27}$

$r = 0.9807$
Correlation Coefficient

There is a strong correlation,
linear association, between
the body mass and bite force
for the various species of
Canidae.
Linear Model


The linear model is the equation
of a straight line through the
data.
A point on the straight line
through the data gives a
predicted value of y, denoted ŷ.
Residual


The difference between the
observed value of y and the
predicted value of y, ŷ, is
called the residual.
Residual = $y - \hat{y}$
Regression Plot
[Figure: scatter plot of BF ca (N) (vertical axis, 0 to 500) against Body mass (kg) (horizontal axis, 0 to 35), with a residual labeled.]
Line of “Best Fit”


There are lots of straight lines
that go through the data.
The line of “best fit” is the
line for which the sum of
squared residuals is the
smallest – the least squares
line.
Line of “Best Fit”
Some positive and some
negative residuals, but they sum
to zero.
Passes through the point $(\bar{x}, \bar{y})$.
Line of “Best Fit”
$\hat{y} = a + bx$
Least squares
slope: $b = r \frac{s_y}{s_x}$
intercept: $a = \bar{y} - b\bar{x}$
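Two properties of the least squares line can be verified numerically: the residuals sum to zero, and the line passes through the point of means. The sketch below uses the slope and intercept formulas b = r s_y / s_x and a = y-bar - b x-bar. The data and the function name `least_squares` are hypothetical.

```python
from statistics import mean, stdev


def least_squares(xs, ys):
    """Return (a, b) for y-hat = a + b*x using b = r*s_y/s_x, a = y_bar - b*x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (
        (n - 1) * s_x * s_y
    )
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data, for illustration only.
xs = [2.0, 5.0, 9.0, 14.0, 30.0]
ys = [50.0, 95.0, 160.0, 210.0, 440.0]

a, b = least_squares(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(sum(residuals))               # zero, up to floating-point rounding
print(a + b * mean(xs), mean(ys))   # the line passes through (x_bar, y_bar)
```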
Least Squares Estimates
Body mass, x: $\bar{x} = 9.207$ kg, $s_x = 8.016$ kg
Bite force, y: $\bar{y} = 154.029$ N, $s_y = 109.760$ N
$r = 0.9807$
Least Squares Estimates
$b = 0.9807 \left( \frac{109.760}{8.016} \right) = 13.428$
$a = 154.029 - 13.428(9.207) = 30.397$
$\hat{y} = 30.397 + 13.428x$
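The estimates can be reproduced from the summary statistics alone. This sketch mirrors the arithmetic in the slides, including rounding the slope to three decimals before computing the intercept. Without that intermediate rounding the intercept differs slightly in the third decimal place. Variable names are illustrative.

```python
# Summary statistics from the slides.
x_bar, s_x = 9.207, 8.016       # body mass (kg)
y_bar, s_y = 154.029, 109.760   # bite force (N)
r = 0.9807

# Slope: b = r * s_y / s_x, rounded to three decimals as in the slides.
b = round(r * s_y / s_x, 3)

# Intercept: a = y_bar - b * x_bar, using the rounded slope.
a = round(y_bar - b * x_bar, 3)

print(f"y-hat = {a} + {b} x")
```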
Interpretation


Slope – for a 1 kg increase in body
mass, the bite force increases, on
average, 13.428 N.
Intercept – there is not a reasonable
interpretation of the intercept in this
context because one wouldn’t see a
Canidae with a body mass of 0 kg.
Bite Force vs Body Mass
[Figure: scatter plot of BF ca (N) (vertical axis, 0 to 500) against Body mass (kg) (horizontal axis, 0 to 35), with the least squares line ŷ = 30.397 + 13.428x.]
Prediction

Least squares line
$\hat{y} = 30.397 + 13.428x$
$x = 25$
$\hat{y} = 30.397 + 13.428(25) = 366.1$ N
Residual




Body mass, x = 25 kg
Bite force, y = 351.5 N
Predicted, $\hat{y} = 366.1$ N
Residual, $y - \hat{y} = 351.5 - 366.1 = -14.6$ N
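The prediction and residual for the 25 kg case can be computed as follows. The coefficients and the observed value come from the slides; the function name `predict_bite_force` is an illustrative choice.

```python
def predict_bite_force(body_mass_kg, a=30.397, b=13.428):
    """Predicted bite force (N) from the least squares line y-hat = a + b*x."""
    return a + b * body_mass_kg


observed = 351.5                      # observed bite force (N) at x = 25 kg
predicted = predict_bite_force(25)    # predicted bite force (N)
residual = observed - predicted       # observed minus predicted

print(round(predicted, 1), round(residual, 1))
```

A negative residual means the observed value lies below the line, so the model overpredicts bite force for this case.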
Residuals


Residuals help us see if the
linear model makes sense.
Plot residuals versus the
explanatory variable.
– If the plot is a random scatter of
points, then the linear model is the
best we can do.
Plot of Residuals vs Body Mass
[Figure: Residual (vertical axis, -30 to 60) plotted against Body mass (kg) (horizontal axis, 0 to 35).]
Interpretation of the Plot

The residuals are scattered
randomly. This indicates that
the linear model is an
appropriate model for the
relationship between body mass
and bite force for Canidae.
$r^2$ or $R^2$

The square of the correlation
coefficient gives the proportion
of the variation in y that is
accounted for, or explained,
by the linear relationship
with x.
Body mass and Bite force



$r = 0.9807$
$r^2 = (0.9807)^2 = 0.962$, or 96.2%
96.2% of the variation in bite
force can be explained by the
linear relationship with body
mass.
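The "explained variation" reading of r squared can be checked against the residuals. For a least squares line, r squared equals 1 minus the sum of squared residuals divided by the total variation in y about its mean. The sketch below first squares the r from the slides, then confirms the identity on hypothetical data. The variable names are illustrative.

```python
from statistics import mean, stdev

# Value from the slides.
r_slides = 0.9807
print(round(r_slides ** 2, 3))  # 0.962

# Hypothetical data, used only to check that r^2 = 1 - SSE / SST.
xs = [2.0, 5.0, 9.0, 14.0, 30.0]
ys = [50.0, 95.0, 160.0, 210.0, 440.0]

n = len(xs)
x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)

# Least squares line.
b = r * s_y / s_x
a = y_bar - b * x_bar

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained variation
sst = sum((y - y_bar) ** 2 for y in ys)                    # total variation in y
explained = 1 - sse / sst

print(r ** 2, explained)
```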
Regression Conditions



Quantitative variables – both
variables should be quantitative.
Linear model – does the scatter
diagram show a reasonably
straight line?
Outliers – watch out for outliers
as they can be very influential.
Regression Cautions



Beware of extraordinary points.
Don’t extrapolate beyond the
data.
Don’t infer x causes y just
because there is a good linear
model relating the two variables.
Extraordinary Points
Don’t Extrapolate
Explanatory (x): Average outdoor
temperature (°C).
Response (y): Amount of natural
gas used (1000 cu ft).
$\hat{y} = 6.85 - 0.393x$
Don’t Extrapolate
[Figure: plot of Gas (vertical axis, -5 to 10) against Temp (horizontal axis, 0 to 20).]
Don’t Extrapolate
Explanatory (x = 20): Average
outdoor temperature (°C).
Response (y): Amount of natural
gas used (1000 cu ft).
$\hat{y} = 6.85 - 0.393(20)$
$\hat{y} = -1.01$
The predicted amount of gas used is negative, which is impossible.
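A simple guard can flag predictions made outside the range of the data. In this sketch the line ŷ = 6.85 - 0.393x comes from the slides, while the observed temperature range of -5 to 15 °C is an assumption read off the scatter plot axis. The function names are hypothetical.

```python
def predicted_gas(temp_c, a=6.85, b=-0.393):
    """Predicted gas use (1000 cu ft) from the least squares line in the slides."""
    return a + b * temp_c


# Assumed range of observed temperatures, read off the plot axis.
# The slides do not state the exact minimum and maximum.
OBSERVED_MIN_C, OBSERVED_MAX_C = -5.0, 15.0


def is_extrapolation(temp_c, low=OBSERVED_MIN_C, high=OBSERVED_MAX_C):
    """True when temp_c falls outside the assumed range of the data."""
    return temp_c < low or temp_c > high


for t in (10, 20):
    note = "extrapolation" if is_extrapolation(t) else "within data range"
    print(t, round(predicted_gas(t), 2), note)
```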
Correlation ≠ Causation
Don’t confuse correlation with
causation.
– There is a strong positive correlation
between the number of crimes
committed in communities and the
number of 2nd graders in those
communities.
Beware of lurking variables.