Chapter 5 - Moodle at EMU

Basic Practice of
Statistics
7th Edition
Lecture PowerPoint Slides
Preliminary results
 Questions on the Chapter 4 Moodle Quiz?
 What just happened in Spreadsheet Assignment 4?
 How does this connect with SA5?
Most of research is…
 (finding variance)
 Explaining variance (prediction, correlation)
 Explaining what is causing the variance (causation)
Article
What’s wrong?
F3LIVARR
 1 = Lives by him/herself
 2 = Lives in parent/guardian’s home
 3 = Not in parents’ home; lives w/ spouse
 4 = Not in parents’ home; lives w/ partner
 5 = Not in parents’ home; lives w/ children
 6 = Not in parents’ home; lives w/ sibling
 7 = Not in parents’ home; lives w/ roommate/friend
 8 = Other living arrangement
What’s wrong?
Let’s examine our data
 Which variables have the lowest means? The highest
means?
 Which variables have the lowest standard deviation?
The highest standard deviation?
 Which pairs of variables have the strongest
correlations? (positive or negative) The weakest
correlations? Which pairs of variables provide an
interesting question to ask?
 What are the limitations of our data collection?
Starter Question
 We hear about the U.S. being a “violent” place to live, but how does it compare to the rest of the developed world in terms of serial killings?
Let’s find and interpret the regression
line for your spreadsheets
Regression line
REVIEW OF STRAIGHT LINES
 Suppose that y is a response variable and x is an explanatory variable. A straight line relating y to x has the form
$y = a + bx$
 The coefficient of x, b, is the slope, the amount by which y changes when x increases by one unit. The number a is the intercept, the value of y when x = 0.
Influential Observations
 An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation.
 Results are questionable if they depend strongly on a few influential observations.
Chapter 5, #6: From a graph in Tania Singer et al., “Empathy for pain involves the
affective but not sensory components of pain,” Science, 303 (2004), pp. 1157-1162.
Figure 5.5, The Basic Practice of Statistics, © 2015 W. H. Freeman
Outliers and influential points
Empathy score and brain activity
After removing observation 16: $r^2 = 33.1\%$
From all of the data: $r^2 = 51.5\%$
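To see how a single observation can move r² this much, here is a minimal sketch. The numbers are made up for illustration and are NOT the empathy study data; the helper name `r_squared` is also just an assumption for this example.

```python
from statistics import mean

def r_squared(xs, ys):
    """Square of the correlation between xs and ys."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

# Hypothetical data: the last point sits far to the right of the others.
xs = [1, 2, 3, 4, 5, 6, 15]
ys = [2, 1, 3, 2, 4, 3, 9]

r2_all = r_squared(xs, ys)                 # every observation
r2_without = r_squared(xs[:-1], ys[:-1])   # drop the far-right point

print(f"r^2 with all points:        {r2_all:.3f}")
print(f"r^2 without the last point: {r2_without:.3f}")
```

In this toy data set the far-right point props up r², so removing it lowers the value noticeably. That is the sense in which a point is influential.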
Multiple Regression
Let’s take a shot at predicting your future salary (with
some important caveats!)
By putting other variables into the model, we increase
our overall predictive power ($R^2$) and we can “control” for
variables to get a better sense of the unique relationship
between two variables.
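A rough sketch of the idea, assuming NumPy is available. The variables (years of education, a homework score, salary in thousands) and all of the numbers are hypothetical, chosen only to show that adding a predictor cannot lower R² in an ordinary least-squares fit.

```python
import numpy as np

def fit_r_squared(predictors, y):
    """Fit y on the predictor columns plus an intercept; return (R^2, coefficients)."""
    design = np.column_stack([np.ones(len(y)), predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_residual = float(residuals @ residuals)
    ss_total = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_residual / ss_total, coef

# Hypothetical data, for illustration only.
education = np.array([12, 12, 14, 16, 16, 18, 18, 20], dtype=float)
homework = np.array([1, 3, 2, 2, 4, 3, 4, 4], dtype=float)
salary = np.array([30, 36, 38, 44, 50, 52, 57, 60], dtype=float)

r2_one, _ = fit_r_squared(education.reshape(-1, 1), salary)
r2_two, coef = fit_r_squared(np.column_stack([education, homework]), salary)

print(f"R^2, education only:       {r2_one:.3f}")
print(f"R^2, education + homework: {r2_two:.3f}")
# coef[2] is the homework slope with education held fixed ("controlled for").
print(f"homework slope, controlling for education: {coef[2]:.2f}")
```

The second model's homework coefficient describes the relationship between homework and salary among people with the same education, which is what "controlling for" a variable means here.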
Least-squares regression
 The distinction between explanatory and response variables is essential in regression.
 There is a close connection between correlation and the slope of the least-squares line. The slope is
$b = r \frac{s_y}{s_x}$
 The slope b and correlation r always have the same sign.
 The least-squares regression line always passes through $(\bar{x}, \bar{y})$.
 The square of the correlation, $r^2$, is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x.
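These facts can be checked numerically. The sketch below uses a small made-up data set (not from the text) and computes everything by hand with the standard library.

```python
from statistics import mean, stdev

# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 6.0]

n = len(xs)
x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)

# Correlation from its definition.
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)

# Slope and intercept of the least-squares line.
b = r * s_y / s_x
a = y_bar - b * x_bar

# Fraction of the variation in y explained by the line.
predicted = [a + b * x for x in xs]
ss_total = sum((y - y_bar) ** 2 for y in ys)
ss_residual = sum((y - p) ** 2 for y, p in zip(ys, predicted))
explained_fraction = 1 - ss_residual / ss_total

print(f"slope b = {b:.3f}, correlation r = {r:.3f} (same sign)")
print(f"line at x-bar: {a + b * x_bar:.3f}, y-bar: {y_bar:.3f}")
print(f"r^2 = {r ** 2:.3f}, explained fraction = {explained_fraction:.3f}")
```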
Evidence of causation
 A properly conducted experiment may establish causation.
Other considerations when we cannot do an experiment:
 The association is strong and consistent.
 Control for lurking variables.
 Higher doses are associated with stronger responses.
 Alleged cause precedes the effect in time.
 Alleged cause is plausible (reasonable explanation).
Cautions about correlation and regression
 Correlation and regression lines describe only linear relationships.
 Correlation and least-squares regression lines are not resistant.
 Beware ecological correlation, or correlation based on averages rather than individuals.
 Beware of extrapolation: predicting outside of the range of x.
 Beware of lurking variables: these have an important effect on the relationship among the variables in a study, but are not included in the study.
 Correlation does not imply causation!
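As a small illustration of the first caution, the sketch below uses made-up numbers in which y is completely determined by x, yet the correlation is zero because the relationship is curved rather than linear. The `correlation` helper is written out here only for the example.

```python
from math import sqrt
from statistics import mean

def correlation(xs, ys):
    """Correlation r between xs and ys."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys_line = [2 * x + 1 for x in xs]   # perfectly linear
ys_curve = [x ** 2 for x in xs]     # perfectly related, but not linear

r_line = correlation(xs, ys_line)
r_curve = correlation(xs, ys_curve)

print(f"r for the straight line: {r_line:.3f}")
print(f"r for the parabola:      {r_curve:.3f}")
```

A correlation near zero therefore says only that there is no linear pattern. Always look at the scatterplot.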
Least Squares Regression Line
Why is the trendline through a scatterplot called a “least
squares regression line”?
Regression line
 A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes.
Example: Predict the gain in fat (in kg) based on the change in Non-Exercise Activity (NEA change, in calories).
 If the NEA change is 400 calories, what is the expected fat gain?
This regression line describes the overall pattern of the relationship.
How can we explain differences in accuracy?
Basketball Regression
The least-squares regression line
LEAST-SQUARES REGRESSION LINE
The least-squares regression line of y on x is the
line that makes the sum of the squares of the
vertical distances of the data points from the line
as small as possible.
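One way to see what "least squares" means is to compare the sum of squared vertical distances for the fitted line against a few other candidate lines. This is a minimal sketch on made-up data; the function names and the alternative lines are assumptions chosen for illustration.

```python
from statistics import mean

def fit_least_squares(xs, ys):
    """Return (intercept a, slope b) of the least-squares line of y on x."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

def sum_squared_errors(a, b, xs, ys):
    """Sum of squared vertical distances from the points to the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 6.0]

a, b = fit_least_squares(xs, ys)
best = sum_squared_errors(a, b, xs, ys)
print(f"least-squares line: y-hat = {a:.2f} + {b:.2f}x, sum of squares = {best:.2f}")

# Any other line gives a larger sum of squared distances.
alternatives = [(2.0, 0.8), (1.8, 0.9), (1.0, 1.0), (4.2, 0.0)]
for alt_a, alt_b in alternatives:
    sse = sum_squared_errors(alt_a, alt_b, xs, ys)
    print(f"y-hat = {alt_a:.2f} + {alt_b:.2f}x, sum of squares = {sse:.2f}")
```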
Entry Slip Question
What’s it called when we predict a y-value for an x-value that is far outside of our range? Extrapolation.
(Example: Trying to predict salary from age. We studied people between ages 25 and 65, but now attempt to predict the salary of a 100-year-old woman using our same regression line.)
The least-squares regression line
EQUATION OF THE LEAST-SQUARES REGRESSION LINE

We have data on an explanatory variable x and a response variable y for n individuals. From the data, calculate the means $\bar{x}$ and $\bar{y}$ and the standard deviations $s_x$ and $s_y$ of the two variables and their correlation r. The least-squares regression line is the line
$\hat{y} = a + bx$,
with slope
$b = r \frac{s_y}{s_x}$,
and intercept
$a = \bar{y} - b\bar{x}$
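The recipe needs only five summary numbers. A short sketch follows, with hypothetical summary statistics picked so the arithmetic is easy to follow; the function name `line_from_summary` is an assumption for this example.

```python
def line_from_summary(x_bar, y_bar, s_x, s_y, r):
    """Return (intercept a, slope b) of the least-squares line from summary statistics."""
    b = r * s_y / s_x        # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

# Hypothetical summary statistics, for illustration only.
a, b = line_from_summary(x_bar=50.0, y_bar=100.0, s_x=10.0, s_y=20.0, r=0.5)
print(f"y-hat = {a} + {b}x")
```

With these numbers the slope is 0.5 × 20 / 10 = 1 and the intercept is 100 − 1 × 50 = 50.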
Prediction via regression line
 For the non-exercise activity example, the least-squares regression line is:
$\hat{y} = 3.5051 - 0.0034x$
 Suppose we know someone has an increase of 400 calories of NEA. What would we predict for fat gain?
$\hat{y} = 3.5051 - 0.0034(400) = 2.1451$ kg
This is the predicted response for someone with an increase of 400 calories of NEA.
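The same prediction as a few lines of code, using the intercept and slope given for the NEA example. The function and constant names are assumptions for this sketch.

```python
INTERCEPT_KG = 3.5051          # predicted fat gain (kg) when NEA change is 0
SLOPE_KG_PER_CALORIE = -0.0034 # change in predicted fat gain per extra calorie of NEA

def predicted_fat_gain(nea_change_calories):
    """Predicted fat gain in kg for a given NEA change in calories."""
    return INTERCEPT_KG + SLOPE_KG_PER_CALORIE * nea_change_calories

prediction = predicted_fat_gain(400)
print(f"Predicted fat gain: {prediction:.4f} kg")  # 2.1451 kg
```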
What calculations should you know?
Definitely know these
 Mean, median
 Z-scores (and conversions for standard normal)
 Interpret and use the linear regression line
No need to memorize
 How to calculate standard deviation or variance
 How to calculate correlation from data
 How to calculate the linear regression line
The $1,300 homework finding
 Remember that our regression found an average
difference in salary of $5,000 between students who
rarely completed homework and those who nearly
always did.
 Based on some (questionable) calculations, this could
be interpreted as an additional $1,300 per night.
 What should we be careful of?
Correlation does not imply causation
 Even very strong correlations may not correspond to a
real causal relationship (changes in x actually causing
changes in y).
 Correlation may be explained by a lurking variable
Social Relationships and Health
House, J., Landis, K., and Umberson, D. “Social Relationships and
Health,” Science, Vol. 241 (1988), pp 540-545.
Does lack of social relationships cause people to become ill? (There was
a strong correlation.)
 Or, are unhealthy people less likely to establish and maintain
social relationships? (reversed relationship)
 Or, is there some other factor that predisposes people both to
have lower social activity and become ill?
Caution: beware of extrapolation
Sarah’s height was plotted against her age.
Can you predict her height at age 42 months?
Can you predict her height at age 30 years (360 months)?
[Figure: scatterplot of height (cm) against age (months)]
Caution: beware of extrapolation
Regression line: $\hat{y} = 71.95 + 0.383x$
Predicted height at age 42 months? $\hat{y} = 88$
Predicted height at age 30 years? $\hat{y} = 209.8$
 She is predicted to be 6’10.5” at age 30!
[Figure: regression line of height (cm) on age (months), extended to 390 months]
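A sketch of the two predictions, with a simple flag for extrapolation. The intercept and slope come from the slide. The observed age range of 30 to 65 months is an assumption read off the horizontal axis of the first plot, so the actual data may span less.

```python
INTERCEPT_CM = 71.95
SLOPE_CM_PER_MONTH = 0.383
OBSERVED_AGE_RANGE = (30, 65)  # assumed from the plot's axis, in months

def predict_height(age_months):
    """Return (predicted height in cm, True if age is outside the observed range)."""
    height_cm = INTERCEPT_CM + SLOPE_CM_PER_MONTH * age_months
    low, high = OBSERVED_AGE_RANGE
    is_extrapolation = not (low <= age_months <= high)
    return height_cm, is_extrapolation

for age in (42, 360):
    height, extrapolating = predict_height(age)
    note = "EXTRAPOLATION, do not trust" if extrapolating else "within observed range"
    print(f"age {age} months: predicted height {height:.1f} cm ({note})")
```

The arithmetic works the same way for both ages. Only the flag tells us that the second prediction lies far outside the ages where the line was fitted.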