Chapter 3 PowerPoint

ASSOCIATION: CONTINGENCY,
CORRELATION, AND REGRESSION
Chapter 3
3.1 The Association between Two Categorical Variables
Response and Explanatory Variables



Response variable
(dependent, y)
outcome variable
Explanatory variable
(independent, x)
defines groups
Response/Explanatory
1. Grade on test / Amount of study time
2. Yield of corn / Amount of rainfall
Association
Association: certain values of one variable are more
likely to occur with certain values of the other variable
Data analysis with two variables
1. Tell whether there is an association and
2. Describe that association
Contingency Table
• Displays two categorical variables
• The rows list the categories of one variable; the columns list the categories of the other
• Entries in the table are frequencies
Contingency Table



What is the response (outcome) variable? Explanatory?
What proportion of organic foods contain pesticides?
What proportion of conventionally grown foods?
What proportion of all sampled foods contain pesticides?
Proportions & Conditional Proportions
Side by side bar charts
show conditional
proportions and allow
for easy comparison
Proportions & Conditional Proportions
If there were no association, the conditional
proportions would be the same
Because there is an association, the conditional
proportions are different
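
As a minimal sketch of how conditional proportions are computed from a contingency table, the Python below uses made-up counts. The table shown on the slide is not reproduced in this text, so the numbers, the `counts` dictionary, and both function names are illustrative assumptions rather than the study's data.

```python
# Hypothetical counts for illustration only; these are NOT the data from the slide.
# Rows: explanatory variable (food type). Columns: response (pesticide status).
counts = {
    "organic":      {"present": 30,  "absent": 70},
    "conventional": {"present": 150, "absent": 50},
}

def conditional_proportion(table, row, column):
    """Proportion of `column` within one row (conditional on the row category)."""
    row_total = sum(table[row].values())
    return table[row][column] / row_total

def overall_proportion(table, column):
    """Proportion of `column` over all rows combined."""
    column_total = sum(r[column] for r in table.values())
    grand_total = sum(sum(r.values()) for r in table.values())
    return column_total / grand_total

p_organic = conditional_proportion(counts, "organic", "present")            # 30 / 100
p_conventional = conditional_proportion(counts, "conventional", "present")  # 150 / 200
p_all = overall_proportion(counts, "present")                                # 180 / 300

print(f"organic: {p_organic:.2f}, conventional: {p_conventional:.2f}, all: {p_all:.2f}")
```

With these invented counts the two conditional proportions differ (0.30 versus 0.75), which is what an association between food type and pesticide status looks like in a table.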
3.2 The Association between Two Quantitative Variables
Internet Usage & GDP Data Set
Country          INTERNET    GDP
Algeria              0.65    6.09
Argentina           10.08   11.32
Australia           37.14   25.37
Austria             38.7    26.73
Belgium             31.04   25.52
Brazil               4.66    7.36
Canada              46.66   27.13
Chile               20.14    9.19
China                2.57    4.02
Denmark             42.95   29
Egypt                0.93    3.52
Finland             43.03   24.43
France              26.38   23.99
Germany             37.36   25.35
Greece              13.21   17.44
India                0.68    2.84
Iran                 1.56    6
Ireland             23.31   32.41
Israel              27.66   19.79
Japan               38.42   25.13
Malaysia            27.31    8.75
Mexico               3.62    8.43
Netherlands         49.05   27.19
New Zealand         46.12   19.16
Nigeria              0.1     0.85
Norway              46.38   29.62
Pakistan             0.34    1.89
Philippines          2.56    3.84
Russia               2.93    7.1
Saudi Arabia         1.34   13.33
South Africa         6.49   11.29
Spain               18.27   20.15
Sweden              51.63   24.18
Switzerland         30.7    28.1
Turkey               6.04    5.89
United Kingdom      32.96   24.16
United States       50.15   34.32
Vietnam              1.24    2.07
Yemen                0.09    0.79
Scatterplot
Graph of two quantitative variables:
• Horizontal Axis: Explanatory, x
• Vertical Axis: Response, y
Interpreting Scatterplots


The overall pattern includes trend, direction, and strength of the relationship
• Trend: linear, curved, clusters, no pattern
• Direction: positive, negative, no direction
• Strength: how closely the points fit the trend
Also look for outliers from the overall trend
Used-car Dealership
What association would we expect between the age
of the car and mileage?
a) Positive
b) Negative
c) No association
Linear Correlation, r
Measures the strength and direction of the linear
association between x and y
Correlation coefficient: Measuring Strength &
Direction of a Linear Relationship
Positive r => positive association
Negative r => negative association
r close to +1 or -1 indicates strong linear association
r close to 0 indicates weak linear association
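
To make the definition of r concrete, here is a hedged sketch that computes the correlation for the Internet Usage & GDP data set listed in Section 3.2. The helper name `pearson_r` is an assumption of this sketch, as is the choice of GDP as x and Internet use as y; the formula is the standard one for the linear correlation coefficient.

```python
import math

# (country, INTERNET, GDP) values copied from the Internet Usage & GDP data set.
data = [
    ("Algeria", 0.65, 6.09), ("Argentina", 10.08, 11.32), ("Australia", 37.14, 25.37),
    ("Austria", 38.7, 26.73), ("Belgium", 31.04, 25.52), ("Brazil", 4.66, 7.36),
    ("Canada", 46.66, 27.13), ("Chile", 20.14, 9.19), ("China", 2.57, 4.02),
    ("Denmark", 42.95, 29), ("Egypt", 0.93, 3.52), ("Finland", 43.03, 24.43),
    ("France", 26.38, 23.99), ("Germany", 37.36, 25.35), ("Greece", 13.21, 17.44),
    ("India", 0.68, 2.84), ("Iran", 1.56, 6), ("Ireland", 23.31, 32.41),
    ("Israel", 27.66, 19.79), ("Japan", 38.42, 25.13), ("Malaysia", 27.31, 8.75),
    ("Mexico", 3.62, 8.43), ("Netherlands", 49.05, 27.19), ("New Zealand", 46.12, 19.16),
    ("Nigeria", 0.1, 0.85), ("Norway", 46.38, 29.62), ("Pakistan", 0.34, 1.89),
    ("Philippines", 2.56, 3.84), ("Russia", 2.93, 7.1), ("Saudi Arabia", 1.34, 13.33),
    ("South Africa", 6.49, 11.29), ("Spain", 18.27, 20.15), ("Sweden", 51.63, 24.18),
    ("Switzerland", 30.7, 28.1), ("Turkey", 6.04, 5.89), ("United Kingdom", 32.96, 24.16),
    ("United States", 50.15, 34.32), ("Vietnam", 1.24, 2.07), ("Yemen", 0.09, 0.79),
]

def pearson_r(xs, ys):
    """Linear correlation: r = Sxy / sqrt(Sxx * Syy), using deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

gdp = [row[2] for row in data]       # treated as explanatory, x (an assumption)
internet = [row[1] for row in data]  # treated as response, y (an assumption)

r = pearson_r(gdp, internet)
print(f"r = {r:.3f}")
```

A value of r near +1 here would indicate a strong positive linear association: nations with higher GDP tend to have higher Internet use.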
3.3 Can We Predict the Outcome of a Variable?
Regression Line

Predicts y, given x:
$\hat{y} = a + bx$
The y-intercept is a and the slope is b
• Only an estimate; actual data vary
• Describes the relationship between x and the estimated means of y

Residuals



Prediction errors:
vertical distance
between data point
and regression line
Large residual
indicates unusual
observation
Each residual is:
$y - \hat{y}$

Sum of residuals is always zero for the
least squares regression line

Goal: Minimize distance
from data to regression line
Least Squares Method

Residual sum of squares:
$\sum (\text{residuals})^2 = \sum (y - \hat{y})^2$

Least squares regression line minimizes the
sum of squared vertical distances between
points and their predictions
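
The least squares idea can be sketched in a few lines of Python. The five data points below are invented for illustration, and the function names are assumptions of this sketch; the slope and intercept use the standard closed-form solution, b = Sxy / Sxx and a = ȳ - b·x̄.

```python
# Illustrative data only; these points are not from the slides.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

def least_squares(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def residual_sum_of_squares(a, b, xs, ys):
    """Sum of squared residuals (y - y-hat)^2 for the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

a, b = least_squares(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
sse = residual_sum_of_squares(a, b, xs, ys)

print(f"a = {a:.2f}, b = {b:.2f}")
print(f"residuals sum to zero: {abs(sum(residuals)) < 1e-9}")
print(f"residual sum of squares = {sse:.2f}")

# Any other line, such as this arbitrarily shifted one, has a larger sum of squares.
sse_other = residual_sum_of_squares(a + 0.5, b - 0.1, xs, ys)
print(f"shifted line sum of squares = {sse_other:.2f}")
```

For this toy data the fitted line is ŷ = 2.2 + 0.6x, and comparing `sse` with `sse_other` shows why it is called the least squares line.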
Regression Analysis
Identify response and explanatory variables
• Response variable is y
• Explanatory variable is x
Anthropologists Predict Height Using Remains?

Regression Equation:
$\hat{y} = 61.4 + 2.4x$
ŷ is the predicted height (cm)
and x is the length of the
femur, or thighbone (cm)
Predict height for femur
length of 50 cm
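
The prediction can be worked out by substituting x = 50 into the regression equation from this slide. The function name below is an assumption of this sketch; the coefficients 61.4 and 2.4 are the ones given above.

```python
def predict_height_cm(femur_length_cm):
    """Predicted height from the slide's regression equation: y-hat = 61.4 + 2.4x."""
    return 61.4 + 2.4 * femur_length_cm

height = predict_height_cm(50)
print(f"{height:.1f} cm")  # 61.4 + 2.4 * 50 = 181.4
```

So a femur length of 50 cm gives a predicted height of 181.4 cm.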

Interpreting the y-Intercept and Slope




y-intercept: y-value when
x=0
Helps plot line
Slope: change in y for 1
unit increase in x
1 cm increase in femur
length means 2.4 cm
increase in predicted
height
$\hat{y} = 61.4 + 2.4x$
Slope Values: Positive, Negative, Zero
Slope and Correlation

Slope, b:
• Doesn’t tell strength
• Has units
• Changes if x and y are swapped

Correlation, r:
• Describes strength
• No units
• Same if x and y are swapped
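
The contrast between slope and correlation can be checked numerically. This is a sketch on invented data (the five points and the helper names are assumptions, not material from the slides): it swaps x and y, then rescales x as if changing units from cm to mm.

```python
import math

# Illustrative data only; not from the slides.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

def sum_cross(us, vs):
    """Sum of products of deviations from the means."""
    mean_u = sum(us) / len(us)
    mean_v = sum(vs) / len(vs)
    return sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))

def slope(xs, ys):
    """Least squares slope for predicting ys from xs."""
    return sum_cross(xs, ys) / sum_cross(xs, xs)

def correlation(xs, ys):
    """Linear correlation coefficient r."""
    return sum_cross(xs, ys) / math.sqrt(sum_cross(xs, xs) * sum_cross(ys, ys))

# Swapping the roles of x and y changes the slope but not r.
b_y_on_x = slope(xs, ys)
b_x_on_y = slope(ys, xs)
r_xy = correlation(xs, ys)
r_yx = correlation(ys, xs)

# Changing the units of x (say cm to mm) rescales the slope but not r.
xs_mm = [10 * x for x in xs]
b_mm = slope(xs_mm, ys)
r_mm = correlation(xs_mm, ys)

print(f"slope y on x: {b_y_on_x:.2f}, slope x on y: {b_x_on_y:.2f}")
print(f"r (x, y): {r_xy:.4f}, r (y, x): {r_yx:.4f}")
print(f"slope with x in mm: {b_mm:.2f}, r with x in mm: {r_mm:.4f}")
```

In this example the slope of x on y is 1.0, which is not simply the reciprocal of the 0.6 slope of y on x, while r is identical in all three cases.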
Squared Correlation, r²

Proportional reduction in error, r²
Variation in y-values explained
by relationship of y to x
A correlation, r, of .9 means
$r^2 = (.9)^2 = .81 = 81\%$
81% of variation in y is
explained by x
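
One way to see r² as a proportional reduction in error is to compare the squared prediction errors around the mean of y with those around the regression line. The sketch below uses invented data and hypothetical variable names; it checks that (total error - residual error) / total error equals the square of r.

```python
import math

# Illustrative data only; not from the slides.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)

# Least squares line and correlation.
b = sxy / sxx
a = mean_y - b * mean_x
r = sxy / math.sqrt(sxx * syy)

# Error when predicting every y by the mean of y (ignoring x).
total_error = syy
# Error when predicting y from x with the regression line.
residual_error = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

reduction = (total_error - residual_error) / total_error

print(f"r = {r:.4f}, r squared = {r ** 2:.2f}")
print(f"proportional reduction in error = {reduction:.2f}")
```

Here both quantities come out to 0.60, so 60% of the variation in these y-values is explained by the linear relationship with x.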
3.4 What Are Some Cautions in Analyzing Associations?
Extrapolation

Neil Weiss, Elementary Statistics, 7th Edition
Extrapolation: Predicting y for x-values outside the range of the data
• Riskier the farther from the range of x
• No guarantee the trend holds
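
A simple guard against extrapolation is to compare a new x-value with the range of x actually observed. The sketch below reuses the femur equation from Section 3.3; the observed range of 40 to 55 cm is purely an assumption for illustration, since the slides do not state it, and the function names are hypothetical.

```python
# Assumed range of femur lengths in the data; the slides do not give it.
X_MIN_CM, X_MAX_CM = 40.0, 55.0

def predict_height_cm(femur_length_cm):
    """Regression equation from Section 3.3: y-hat = 61.4 + 2.4x."""
    return 61.4 + 2.4 * femur_length_cm

def is_extrapolation(x, x_min=X_MIN_CM, x_max=X_MAX_CM):
    """True when x lies outside the (assumed) observed range of the data."""
    return x < x_min or x > x_max

for x in (50.0, 0.0):
    label = "extrapolation" if is_extrapolation(x) else "within range"
    print(f"x = {x:4.1f} cm -> predicted {predict_height_cm(x):.1f} cm ({label})")
```

The equation still returns a number at x = 0 (the y-intercept, 61.4 cm), but a femur of length 0 is far outside any plausible data, so that prediction has no practical meaning.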
Outliers and Influential Points


Regression outlier lies far away from the
trend of the rest of the data
Influential if both:
1. Its x-value is low or high compared to the rest of the data
2. It is a regression outlier
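
The effect of an influential point can be illustrated by fitting the line with and without it. The data below are invented for this sketch: the extra point (20, 0) has an x-value far above the others and falls well away from their trend, so it satisfies both conditions. The function name is an assumption.

```python
def slope(xs, ys):
    """Least squares slope for predicting ys from xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Illustrative data only; not from the slides.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Hypothetical observation: extreme in x and far from the trend of the rest.
xs_with = xs + [20]
ys_with = ys + [0]

b_without = slope(xs, ys)
b_with = slope(xs_with, ys_with)

print(f"slope without the point: {b_without:.2f}")
print(f"slope with the point:    {b_with:.2f}")
```

A single observation is enough here to turn a positive slope into a negative one, which is why such points deserve a separate look before the regression line is interpreted.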
Correlation Does Not Imply Causation
Strong correlation between x and y means
• Strong linear association between the variables
• Does not mean x causes y
Ex. 95.6% of cancer patients have eaten pickles,
so do pickles cause cancer?
Lurking Variables & Confounding
1. Ice cream sales & drowning => temperature
2. Reading level & shoe size => age
• Confounding: two explanatory variables are both associated with the response variable and with each other
• Lurking variables: not measured in the study but may confound

Simpson’s Paradox Example
Simpson’s Paradox:
• Association between two variables reverses after a third is included
Probability of Death of Smoker
= 139/582 = 24%
Probability of Death of
Nonsmoker = 230/732 = 31%
Simpson’s Paradox Example
Break out Data by Age
Simpson’s Paradox Example
Associations look quite different
after adjusting for third variable
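
The reversal can be reproduced with a small calculation. The overall rates below use the counts quoted on the slide (139/582 and 230/732). The age-group table from the slide is not reproduced in this text, so the per-group counts in `groups` are hypothetical numbers chosen only to show how a reversal can arise.

```python
# Overall death rates from the counts quoted on the slide.
smoker_rate = 139 / 582
nonsmoker_rate = 230 / 732
print(f"overall (slide data): smokers {smoker_rate:.0%}, nonsmokers {nonsmoker_rate:.0%}")

# Hypothetical (deaths, total) counts by age group; NOT the slide's data.
groups = {
    "younger": {"smoker": (20, 400), "nonsmoker": (10, 300)},
    "older":   {"smoker": (60, 100), "nonsmoker": (200, 400)},
}

def rate(deaths_and_total):
    """Death rate from a (deaths, total) pair."""
    deaths, total = deaths_and_total
    return deaths / total

# Within each age group, smokers have the higher death rate.
for name, group in groups.items():
    print(f"{name}: smokers {rate(group['smoker']):.1%}, "
          f"nonsmokers {rate(group['nonsmoker']):.1%}")

def combined(status):
    """Pool the age groups for one smoking status into a single (deaths, total) pair."""
    deaths = sum(g[status][0] for g in groups.values())
    total = sum(g[status][1] for g in groups.values())
    return deaths, total

# Pooled over age, the comparison reverses: nonsmokers look worse.
pooled_smoker = rate(combined("smoker"))        # 80 / 500
pooled_nonsmoker = rate(combined("nonsmoker"))  # 210 / 700
print(f"pooled (hypothetical): smokers {pooled_smoker:.0%}, nonsmokers {pooled_nonsmoker:.0%}")
```

In the hypothetical counts, most nonsmokers fall in the older group, where death rates are high regardless of smoking. That imbalance in the third variable is what makes the pooled comparison point the opposite way from the comparison within each age group.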