Lectures 4-6

Chapter 3
Association: Contingency,
Correlation, and Regression
Learn: how to examine links between two variables
Section 3.2
How Can We Explore the Association
Between Two Quantitative
Variables?
Scatterplot
Graphical display of two quantitative
variables:
• Horizontal Axis: Explanatory variable, x
• Vertical Axis: Response variable, y
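A minimal sketch of how such a display could be drawn with matplotlib; the femur-length and height values below are made-up illustration data, not from the textbook:

```python
import matplotlib.pyplot as plt

# Hypothetical illustration data (not from the textbook).
femur_length_cm = [42, 45, 47, 50, 53]   # explanatory variable, x
height_cm = [162, 168, 174, 181, 188]    # response variable, y

plt.scatter(femur_length_cm, height_cm)
plt.xlabel("Femur length (cm) -- explanatory variable, x")
plt.ylabel("Height (cm) -- response variable, y")
plt.title("Scatterplot of two quantitative variables")
plt.show()
```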
Example: Internet Usage and
Gross Domestic Product (GDP)
Positive Association
Two quantitative variables, x and y, are
said to have a positive association when
high values of x tend to occur with high
values of y, and when low values of x
tend to occur with low values of y
Negative Association
Two quantitative variables, x and y,
are said to have a negative
association when high values of x
tend to occur with low values of y,
and when low values of x tend to
occur with high values of y
Example: Did the Butterfly Ballot Cost
Al Gore the 2000 Presidential
Election?
Linear Correlation: r
Measures the strength of the linear
association between x and y
• A positive r-value indicates a positive association
• A negative r-value indicates a negative association
• An r-value close to +1 or -1 indicates a strong linear association
• An r-value close to 0 indicates a weak association
Calculating the correlation, r
$r = \frac{1}{n-1} \sum \left( \frac{x - \bar{x}}{s_x} \right) \left( \frac{y - \bar{y}}{s_y} \right)$
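A minimal sketch of this calculation in Python, using made-up data; the hand computation is checked against NumPy's built-in correlation routine:

```python
import numpy as np

# Hypothetical paired observations, used only to illustrate the formula.
x = np.array([42.0, 45.0, 47.0, 50.0, 53.0])
y = np.array([162.0, 168.0, 174.0, 181.0, 188.0])

n = len(x)
x_bar, y_bar = x.mean(), y.mean()
s_x, s_y = x.std(ddof=1), y.std(ddof=1)   # sample standard deviations

# r = 1/(n - 1) * sum[ ((x - x_bar)/s_x) * ((y - y_bar)/s_y) ]
r = np.sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)) / (n - 1)

print(r)                        # correlation from the formula above
print(np.corrcoef(x, y)[0, 1])  # NumPy's value, for comparison
```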
Example: 100 cars on the lot of a
used-car dealership
Would you expect a positive association, a
negative association or no association between
the age of the car and the mileage on the
odometer?
• Positive association
• Negative association
• No association
Section 3.3
How Can We Predict the Outcome
of a Variable?
Regression Line
Predicts the value for the response
variable, y, as a straight-line function of
the value of the explanatory variable, x
$\hat{y} = a + bx$
Example: How Can Anthropologists
Predict Height Using Human Remains?
Regression Equation:
$\hat{y} = 61.4 + 2.4x$
$\hat{y}$ is the predicted height and x is the
length of a femur (thighbone), measured
in centimeters
Example: How Can Anthropologists
Predict Height Using Human Remains?
Use the regression equation to predict the
height of a person whose femur length
was 50 centimeters
$\hat{y} = 61.4 + 2.4(50) = 181.4$ centimeters
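A tiny sketch of the same prediction written as a Python function (the function name is ours, not the textbook's):

```python
def predict_height(femur_length_cm):
    """Predicted height in cm from the regression equation y-hat = 61.4 + 2.4x."""
    return 61.4 + 2.4 * femur_length_cm

print(predict_height(50))   # 181.4 (up to floating-point rounding)
```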
Interpreting the y-Intercept
y-Intercept:
• the predicted value for y when x = 0
• helps in plotting the line
• May not have any interpretative value if
no observations had x values near 0
Interpreting the Slope
Slope: measures the change in the predicted value of y for every one-unit increase in the explanatory variable, x
Example: A 1 cm increase in femur
length results in a 2.4 cm increase in
predicted height
Slope Values: Positive,
Negative, Equal to 0
Residuals
Measure the size of the prediction
errors
Each observation has a residual
Calculation for each residual:
$y - \hat{y}$
Residuals
A large residual indicates an unusual
observation
Large residuals can easily be found
by constructing a histogram of the
residuals
“Least Squares Method”
Yields the Regression Line
Residual sum of squares:
$\sum (\text{residuals})^2 = \sum (y - \hat{y})^2$
The optimal line through the data is the
line that minimizes the residual sum of
squares
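A small sketch, with made-up data, of what "minimizes the residual sum of squares" means: the least-squares line has a smaller sum of squared residuals than any other line, such as one with a shifted intercept.

```python
import numpy as np

# Hypothetical illustration data.
x = np.array([42.0, 45.0, 47.0, 50.0, 53.0])
y = np.array([162.0, 168.0, 174.0, 181.0, 188.0])

# Least-squares slope b and intercept a (np.polyfit returns the slope first).
b, a = np.polyfit(x, y, deg=1)

def residual_sum_of_squares(a, b):
    residuals = y - (a + b * x)        # y - y_hat for each observation
    return np.sum(residuals ** 2)

print(residual_sum_of_squares(a, b))         # smallest possible value
print(residual_sum_of_squares(a + 1.0, b))   # any other line does worse
```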
Regression Formulas for
y-Intercept and Slope
Slope: $b = r \left( \frac{s_y}{s_x} \right)$

y-Intercept: $a = \bar{y} - b \bar{x}$
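A sketch (again with made-up data) showing that these two formulas reproduce the least-squares line found by a standard fitting routine:

```python
import numpy as np

# Hypothetical illustration data.
x = np.array([42.0, 45.0, 47.0, 50.0, 53.0])
y = np.array([162.0, 168.0, 174.0, 181.0, 188.0])

r = np.corrcoef(x, y)[0, 1]
s_x, s_y = x.std(ddof=1), y.std(ddof=1)

b = r * (s_y / s_x)              # slope:       b = r * (s_y / s_x)
a = y.mean() - b * x.mean()      # y-intercept: a = y_bar - b * x_bar

b_check, a_check = np.polyfit(x, y, deg=1)   # same line from np.polyfit
print(b, a)
print(b_check, a_check)
```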
The Slope and the Correlation
Correlation:
• Describes the strength of the association between 2 variables
• Does not change when the units of measurement change
• It is not necessary to identify which variable is the response and which is the explanatory
The Slope and the Correlation
Slope:
• Numerical value depends on the units used to measure the variables (see the sketch below)
• Does not tell us whether the association is strong or weak
• The two variables must be identified as response and explanatory variables
• The regression equation can be used to predict the response variable
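A quick sketch of the units point above: converting x from centimeters to inches rescales the slope but leaves the correlation unchanged (made-up data again):

```python
import numpy as np

# Hypothetical data with x measured in centimeters.
x_cm = np.array([42.0, 45.0, 47.0, 50.0, 53.0])
y = np.array([162.0, 168.0, 174.0, 181.0, 188.0])

x_in = x_cm / 2.54   # the same measurements expressed in inches

print(np.corrcoef(x_cm, y)[0, 1])   # correlation with x in cm
print(np.corrcoef(x_in, y)[0, 1])   # same correlation (up to rounding) with x in inches

print(np.polyfit(x_cm, y, deg=1)[0])  # slope per centimeter
print(np.polyfit(x_in, y, deg=1)[0])  # slope per inch: 2.54 times larger
```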
Section 3.4
What Are Some Cautions in
Analyzing Associations?
Extrapolation
Extrapolation: Using a regression line
to predict y-values for x-values
outside the observed range of the
data
• Riskier the farther we move from the range of the given x-values
• There is no guarantee that the relationship will have the same trend outside the range of x-values
Regression Outliers
Construct a scatterplot
Search for data points that are well
removed from the trend that the rest
of the data points follow
Influential Observation
An observation that has a large effect on
the regression analysis
Two conditions must hold for an observation to be influential (see the sketch below):
• Its x-value is relatively low or high compared to the rest of the data
• It is a regression outlier, falling quite far from the trend that the rest of the data follow
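A small sketch of the idea with made-up numbers: one added point that is extreme in x and falls far from the trend noticeably changes the fitted line.

```python
import numpy as np

# Hypothetical data that follow the line y = 2x exactly.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Add one point with an extreme x-value that falls far from the trend.
x_with = np.append(x, 20.0)
y_with = np.append(y, 5.0)

print(np.polyfit(x, y, deg=1))             # roughly slope 2, intercept 0
print(np.polyfit(x_with, y_with, deg=1))   # a very different line
```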
Which Regression Outlier is
Influential?
Example: Does More Education Cause
More Crime?
Correlation does not Imply
Causation
A correlation between x and y means that a linear trend exists between the two variables
A correlation between x and y does not mean that x causes y
Lurking Variable
A lurking variable is a variable,
usually unobserved, that influences
the association between the
variables of primary interest
Simpson’s Paradox
The direction of an association
between two variables can change
after we include a third variable and
analyze the data at separate levels of
that variable
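A tiny sketch with made-up numbers (not the smoking data) showing such a reversal: within each level of the third variable the association is positive, yet it is negative when the groups are combined.

```python
import numpy as np

# Hypothetical groups defined by a third variable.
x1 = np.array([1.0, 2.0, 3.0])
y1 = np.array([8.0, 9.0, 10.0])
x2 = np.array([7.0, 8.0, 9.0])
y2 = np.array([1.0, 2.0, 3.0])

print(np.corrcoef(x1, y1)[0, 1])   # essentially +1 within group 1
print(np.corrcoef(x2, y2)[0, 1])   # essentially +1 within group 2

x_all = np.concatenate([x1, x2])
y_all = np.concatenate([y1, y2])
print(np.corrcoef(x_all, y_all)[0, 1])   # negative for the combined data
```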
Example: Is Smoking Actually
Beneficial to Your Health?
An association can look quite
different after adjusting for the effect
of a third variable by grouping the
data according to the values of the
third variable
Data are available for all fires in Chicago
last year on x = number of firefighters at the
fires and y = cost of damages due to fire
Would you expect the correlation to be
negative, zero, or positive?
a. Negative
b. Zero
c. Positive
If the correlation is positive, does this mean
that having more firefighters at a fire
causes the damages to be worse?
a. Yes
b. No
Identify a third variable that could be
considered a common cause of x and y:
a. Distance from the fire station
b. Intensity of the fire
c. Time of day that the fire was
discovered