ch_03

Chapter 3
Association: Contingency,
Correlation, and Regression
Learn how to examine links
between two variables
Agresti/Franklin Statistics, 1 of 52
Variables
Response variable: the outcome
variable
 Explanatory variable: the variable
that explains the outcome variable

Association

An association exists between the
two variables if a particular value for
one variable is more likely to occur
with certain values of the other
variable
Section 3.1
How Can We Explore the Association
Between Two Categorical Variables?
Example: Food Type and
Pesticide Status


What is the response variable?
What is the explanatory variable?
                 Pesticides:
Food Type:       Yes      No
Organic           29      98
Conventional   19485    7086
Example: Food Type and
Pesticide Status


What proportion of organic foods contain
pesticides?
What proportion of conventionally grown foods
contain pesticides?
                 Pesticides:
Food Type:       Yes      No
Organic           29      98
Conventional   19485    7086
Example: Food Type and
Pesticide Status

What proportion of all sampled items contain
pesticide residues?
                 Pesticides:
Food Type:       Yes      No
Organic           29      98
Conventional   19485    7086
Contingency Table

The Food Type and Pesticide Status
Table is called a contingency table

A contingency table:
• Displays 2 categorical variables
• The rows list the categories of 1 variable
• The columns list the categories of the other
variable
• Entries in the table are frequencies
Example: Food Type and
Pesticide Status

Contingency Table Showing Conditional
Proportions



What is the sum over each row?
What proportion of organic foods contained
pesticide residues?
What proportion of conventional foods
contained pesticide residues?
                 Pesticides:
Food Type:       Yes     No
Organic          0.23    0.77
Conventional     0.73    0.27
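
The conditional proportions in this table can be recomputed from the counts in the contingency table. The following is a minimal Python sketch; the function name `conditional_proportions` and the dictionary layout are illustrative choices, not part of the original slides.

```python
# Counts from the food type / pesticide status contingency table.
counts = {
    "Organic": {"Yes": 29, "No": 98},
    "Conventional": {"Yes": 19485, "No": 7086},
}


def conditional_proportions(table):
    """Divide each cell by its row total (proportions within each food type)."""
    result = {}
    for row, cells in table.items():
        row_total = sum(cells.values())
        result[row] = {col: n / row_total for col, n in cells.items()}
    return result


props = conditional_proportions(counts)
for food_type, row in props.items():
    print(food_type, {col: round(p, 2) for col, p in row.items()})

# Proportion of all sampled items that contain pesticide residues.
total_yes = sum(cells["Yes"] for cells in counts.values())
total_all = sum(sum(cells.values()) for cells in counts.values())
overall_yes = total_yes / total_all
print("All items:", round(overall_yes, 2))
```

Each row of conditional proportions sums to 1, which is why the table can be read one food type at a time.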
Example: For the following pair of variables,
which is the response variable and which is
the explanatory variable?
College grade point average (GPA) and high
school GPA
a. College GPA: response variable and
High School GPA: explanatory variable
b. College GPA: explanatory variable and
High School GPA: response variable
Section 3.2
How Can We Explore the Association
Between Two Quantitative
Variables?
Scatterplot

Graphical display of two quantitative
variables:
• Horizontal Axis: Explanatory variable, x
• Vertical Axis: Response variable, y
Example: Internet Usage and
Gross Domestic Product (GDP)
Positive Association

Two quantitative variables, x and y, are
said to have a positive association when
high values of x tend to occur with high
values of y, and when low values of x
tend to occur with low values of y
Negative Association

Two quantitative variables, x and y,
are said to have a negative
association when high values of x
tend to occur with low values of y,
and when low values of x tend to
occur with high values of y
Example: Did the Butterfly Ballot Cost
Al Gore the 2000 Presidential
Election?
Linear Correlation: r

Measures the strength of the linear
association between x and y
• A positive r-value indicates a positive association
• A negative r-value indicates a negative association
• An r-value close to +1 or -1 indicates a strong linear association
• An r-value close to 0 indicates a weak linear association
Calculating the correlation, r
$r = \frac{1}{n-1} \sum \left( \frac{x - \bar{x}}{s_x} \right) \left( \frac{y - \bar{y}}{s_y} \right)$
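
The correlation formula can be applied directly: standardize each x and y, multiply the pairs, sum, and divide by n - 1. The sketch below uses only Python's standard library; the function name `correlation` and the five data points are made up for illustration and do not come from the slides.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation r: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    products = (
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)
    )
    return sum(products) / (n - 1)


# Hypothetical data, chosen only to illustrate the calculation.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
print(round(r, 2))  # about 0.77: a fairly strong positive linear association
```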
Example: 100 cars on the lot of a
used-car dealership
Would you expect a positive association, a
negative association or no association between
the age of the car and the mileage on the
odometer?



Positive association
Negative association
No association
Section 3.3
How Can We Predict the Outcome
of a Variable?
Regression Line

Predicts the value for the response
variable, y, as a straight-line function of
the value of the explanatory variable, x
$\hat{y} = a + bx$
Example: How Can Anthropologists
Predict Height Using Human Remains?

Regression Equation:
$\hat{y} = 61.4 + 2.4x$

$\hat{y}$ is the predicted height and x is the
length of a femur (thighbone), measured
in centimeters
Example: How Can Anthropologists
Predict Height Using Human Remains?

Use the regression equation to predict the
height of a person whose femur length
was 50 centimeters
$\hat{y} = 61.4 + 2.4(50)$
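
As a quick check of the arithmetic, the regression equation can be written as a small function. The name `predict_height` is a hypothetical helper introduced here for illustration.

```python
def predict_height(femur_cm):
    """Predicted height in cm from femur length in cm, using y-hat = 61.4 + 2.4x."""
    return 61.4 + 2.4 * femur_cm


predicted = predict_height(50)
print(round(predicted, 1))  # about 181.4 cm
```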
Interpreting the y-Intercept

y-Intercept:
• the predicted value for y when x = 0
• helps in plotting the line
• May not have any interpretative value if
no observations had x values near 0
Interpreting the Slope

Slope: measures the change in the
predicted value of y for every one-unit
increase in the explanatory variable

Example: A 1 cm increase in femur
length results in a 2.4 cm increase in
predicted height
Slope Values: Positive,
Negative, Equal to 0
Residuals

Measure the size of the prediction
errors

Each observation has a residual

Calculation for each residual:
$y - \hat{y}$
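
To make the residual calculation concrete, the sketch below applies it to the femur regression line, $\hat{y} = 61.4 + 2.4x$. The three (femur length, observed height) pairs are invented for illustration; they are not data from the slides.

```python
def predict_height(femur_cm):
    """Predicted height in cm from the femur regression line."""
    return 61.4 + 2.4 * femur_cm


# Hypothetical observations: (femur length in cm, observed height in cm).
observations = [(45.0, 171.0), (50.0, 178.0), (52.0, 190.0)]

# Residual = observed y minus predicted y, one per observation.
residuals = [height - predict_height(femur) for femur, height in observations]

for (femur, height), res in zip(observations, residuals):
    print(f"femur={femur} cm, observed={height} cm, residual={res:+.1f} cm")
```

A positive residual means the observed height is above the line; a negative residual means it is below.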
Residuals

A large residual indicates an unusual
observation

Large residuals can easily be found
by constructing a histogram of the
residuals
“Least Squares Method”
Yields the Regression Line

Residual sum of squares:
$\sum (\text{residuals})^2 = \sum (y - \hat{y})^2$

The optimal line through the data is the
line that minimizes the residual sum of
squares
Regression Formulas for
y-Intercept and Slope

Slope:
$b = r \left( \frac{s_y}{s_x} \right)$

Y-Intercept:
$a = \bar{y} - b\bar{x}$
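
The two formulas can be sketched in Python and checked against the least squares idea: the resulting line should have a residual sum of squares no larger than that of nearby lines. The function names and the data are illustrative assumptions, not taken from the slides.

```python
from statistics import mean, stdev


def least_squares_line(xs, ys):
    """Return (a, b) using b = r * (s_y / s_x) and a = y_bar - b * x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (
        (n - 1) * s_x * s_y
    )
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b


def residual_sum_of_squares(a, b, xs, ys):
    """Sum of squared residuals (y - y_hat)^2 for the line y_hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, chosen only to illustrate the calculation.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares_line(xs, ys)
best = residual_sum_of_squares(a, b, xs, ys)
print(round(a, 2), round(b, 2), round(best, 2))

# Nudging the slope or the intercept should never reduce the sum of squares.
for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    assert residual_sum_of_squares(a + da, b + db, xs, ys) >= best
```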
The Slope and the Correlation

Correlation:
• Describes the strength of the association between 2 variables
• Does not change when the units of measurement change
• It is not necessary to identify which variable is the response and which is the explanatory
The Slope and the Correlation

Slope:
• Numerical value depends on the units used to measure the variables
• Does not tell us whether the association is strong or weak
• The two variables must be identified as response and explanatory variables
• The regression equation can be used to predict the response variable
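
The contrast between the correlation and the slope under a change of units can be checked numerically. The sketch below uses invented femur and height values (not data from the slides) and converts femur length from centimeters to inches: r should stay the same while the slope is rescaled.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation r computed from z-scores."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)
    ) / (n - 1)


def slope(xs, ys):
    """Regression slope b = r * (s_y / s_x)."""
    return correlation(xs, ys) * stdev(ys) / stdev(xs)


# Hypothetical femur lengths and heights, both in centimeters.
femur_cm = [40.0, 45.0, 50.0, 55.0]
height_cm = [158.0, 170.0, 181.0, 193.0]

# The same femur lengths expressed in inches (1 inch = 2.54 cm).
femur_in = [x / 2.54 for x in femur_cm]

r_cm, r_in = correlation(femur_cm, height_cm), correlation(femur_in, height_cm)
b_cm, b_in = slope(femur_cm, height_cm), slope(femur_in, height_cm)

print("r:", round(r_cm, 4), round(r_in, 4))      # same value in both units
print("slope:", round(b_cm, 2), round(b_in, 2))  # cm of height per cm vs per inch
```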
Section 3.4
What Are Some Cautions in
Analyzing Associations?
Extrapolation

Extrapolation: Using a regression line
to predict y-values for x-values
outside the observed range of the
data
• Riskier the farther we move from the range of the given x-values
• There is no guarantee that the relationship will have the same trend outside the range of x-values
Regression Outliers

Construct a scatterplot

Search for data points that are well
removed from the trend that the rest
of the data points follow
Influential Observation

An observation that has a large effect on
the regression analysis

Two conditions must hold for an
observation to be influential:
• Its x-value is relatively low or high compared to the rest of the data
• It is a regression outlier, falling quite far from the trend that the rest of the data follow

Which Regression Outlier is
Influential?
Example: Does More Education Cause
More Crime?
Correlation does not Imply
Causation

A correlation between x and y means
that a linear trend exists
between the two variables

A correlation between x and y does
not mean that x causes y
Lurking Variable

A lurking variable is a variable,
usually unobserved, that influences
the association between the
variables of primary interest
Simpson’s Paradox

The direction of an association
between two variables can change
after we include a third variable and
analyze the data at separate levels of
that variable
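
A small numerical sketch can show how such a reversal arises. The counts below are invented purely for illustration (they are not the data from the smoking example): within each age group nonsmokers have the higher survival rate, yet the pooled rates point the other way because most smokers in this made-up sample are in the younger group.

```python
# Invented counts: (number who survived, group size). Not real data.
data = {
    "younger": {"smoker": (90, 100), "nonsmoker": (19, 20)},
    "older": {"smoker": (5, 20), "nonsmoker": (40, 100)},
}


def pooled_rate(status):
    """Survival rate for one smoking status, ignoring age group."""
    survived = sum(data[group][status][0] for group in data)
    total = sum(data[group][status][1] for group in data)
    return survived / total


# Survival rates at each separate level of the third variable (age group).
within = {
    group: {status: s / n for status, (s, n) in statuses.items()}
    for group, statuses in data.items()
}

for group, rates in within.items():
    print(group, {status: round(p, 2) for status, p in rates.items()})

print("pooled smoker:", round(pooled_rate("smoker"), 2))
print("pooled nonsmoker:", round(pooled_rate("nonsmoker"), 2))
```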
Example: Is Smoking Actually
Beneficial to Your Health?

An association can look quite
different after adjusting for the effect
of a third variable by grouping the
data according to the values of the
third variable
Data are available for all fires in Chicago
last year on x = number of firefighters at the
fires and y = cost of damages due to fire
Would you expect the correlation to be
negative, zero, or positive?
a. Negative
b. Zero
c. Positive
Data are available for all fires in Chicago
last year on x = number of firefighters at the
fires and y = cost of damages due to fire
If the correlation is positive, does this mean
that having more firefighters at a fire
causes the damages to be worse?
a. Yes
b. No
Data are available for all fires in Chicago
last year on x = number of firefighters at the
fires and y = cost of damages due to fire
Identify a third variable that could be
considered a common cause of x and y:
a. Distance from the fire station
b. Intensity of the fire
c. Time of day that the fire was
discovered