day19 - University of South Carolina

STAT 110 - Section 5
Lecture 19
Professor Hao Wang
University of South Carolina
Spring 2012
Last time: normal density curve
Chapter 14 – Describing Relationships
Most statistical studies examine data on
more than one variable. The steps when
trying to talk about two variables at once
are the same as what we used earlier in
the semester with just one variable:
• Plot the data.
• Look for overall patterns and deviations
from those patterns.
• Use numerical summaries.
Scatterplots
scatterplot – shows the relationship between two
quantitative variables measured on
the same individuals
• Values of one variable appear on the x-axis. This
is typically the one doing the explaining – the
explanatory, predictor, or independent variable.
• Values of the other variable appear on the y-axis.
This is typically the one being explained – called the
response or dependent variable.
Scatterplot
Example:
When water flows across farmland, some of the soil is
washed away, resulting in erosion. An experiment
was conducted to investigate the effect of the rate of
water flow on the amount of soil washed away. Flow
is measured in liters/second and the eroded soil is
measured in kilograms.
flow rate (liters/sec)    eroded soil (kg)
.31                       .82
.85                       1.95
1.26                      2.18
2.47                      3.01
3.75                      6.07
Scatterplot
• Is there an explanatory variable?
• What’s the response variable?
• Which variable should be on the x-axis?
[Figure: scatterplot "Flow Rate vs Eroded Soil"; x-axis: Flow Rate (liters/sec), y-axis: Eroded Soil (kg)]
Measuring Strength
Through Correlation
A Linear Relationship
Correlation, represented by the letter r:
Indicator of how closely the values fall to a straight line.
Measures linear relationships only; that is, it measures how
close the individual points in a scatterplot are to a straight
line.
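The slides do not give a formula for r, so the following is only a minimal sketch using the standard Pearson correlation formula (an assumption about which definition is intended). It is applied to the flow rate and eroded soil values from the erosion example; the function name `correlation` is made up for illustration.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation r between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Data from the erosion example: flow rate (liters/sec) and eroded soil (kg).
flow_rate = [0.31, 0.85, 1.26, 2.47, 3.75]
eroded_soil = [0.82, 1.95, 2.18, 3.01, 6.07]

r = correlation(flow_rate, eroded_soil)
print(f"r = {r:.3f}")
```

For these five points r comes out close to 0.97, a strong positive linear relationship: the points lie near an upward-sloping straight line.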
Correlation
Example : Verbal SAT and GPA
Scatterplot of
GPA and verbal
SAT score.
The correlation is
.485, indicating a
moderate positive
relationship.
Higher verbal SAT scores tend to indicate higher
GPAs as well, but the relationship is nowhere
close to being exact.
Example: Husbands’ and Wives’
Ages and Heights
Scatterplot of British husbands’
and wives’ ages; r = .94
Scatterplot of British husbands’ and
wives’ heights (in millimeters); r = .36
Husbands’ and wives’ ages are likely to be closely related,
whereas their heights are less likely to be so.
Source: Marsh (1988, p. 315) and Hand et al. (1994, pp. 179-183)
Occupational Prestige
and Suicide Rates
Plot of suicide rate
versus occupational
prestige for 36
occupations.
Correlation of .109
– there is not much
of a relationship.
If the outlier is removed,
r drops to .018.
Source: Labovitz (1970, Table 1) and Hand et al. (1994, pp. 395-396)
Example :
Professional Golfers’
Putting Success
Scatterplot of
distance of putt
and putting
success rates.
Correlation r = −.94.
Negative sign
indicates that as
distance goes up,
success rate goes
down.
Source: Iman (1994, p. 507)
Which one has
r = -0.86 ?
Which one has
r = 0.52 ?
(A was -0.86)
Cricket chirps and temperature. Each day, the temperature was
recorded, as well as the number of times a cricket chirped
in 15 seconds. According to this scatterplot, what can be
said about the relationship between cricket chirps and
temperature?
[Figure: scatterplot; x-axis: # Cricket Chirps per 15 seconds (0 to 45), y-axis: Temperature (degrees Fahrenheit) (0 to 90)]
a. There appears to be a strong positive linear
relationship.
b. There appears to be a weak linear relationship
because the slope of the line is fairly flat.
c. There can’t be a relationship; cricket chirps
can’t be related to temperature.
d. None of the above.
Summary: Features of Correlations
r has no units and won’t change if we change
the units of measurement
r ignores the distinction between explanatory
and response variables
r is strongly affected by outliers
http://bcs.whfreeman.com/ips4e/cat_010/applets/CorrelationRegression.html
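The three features of r listed in the summary can be checked numerically. This is a rough sketch, assuming the standard Pearson formula for r and reusing the erosion data from earlier in the lecture. The extra point (4.5, 0.5) is hypothetical, added only to show the effect of an outlier.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation r between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


flow_rate = [0.31, 0.85, 1.26, 2.47, 3.75]    # liters/sec
eroded_soil = [0.82, 1.95, 2.18, 3.01, 6.07]  # kg

r_original = correlation(flow_rate, eroded_soil)

# 1. Changing units (kg -> grams) does not change r.
eroded_soil_grams = [1000 * kg for kg in eroded_soil]
r_grams = correlation(flow_rate, eroded_soil_grams)

# 2. Swapping explanatory and response variables does not change r.
r_swapped = correlation(eroded_soil, flow_rate)

# 3. One outlier can change r a lot.
#    HYPOTHETICAL point (not in the lecture data): high flow, little erosion.
r_outlier = correlation(flow_rate + [4.5], eroded_soil + [0.5])

print(f"original: {r_original:.3f}")
print(f"grams:    {r_grams:.3f}")
print(f"swapped:  {r_swapped:.3f}")
print(f"outlier:  {r_outlier:.3f}")
```

The first three values agree, while the single added point pulls r well below the original value.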
Chapter 15 – Describing Relationships
regression line – a straight line that describes how
a response variable y changes as
an explanatory variable x changes
• regression line summarizes a linear relationship
between two variables
• one variable helps explain or predict the other
Example
The data to the right concerns the
relationship between the prevalence
of a supposed fertility enhancer and
the population of Oldenburg
Germany in thousands of people
between 1930 and 1936.
The original data can be found in:
Ornithologische Monatsberichte, 44,
No.2, Jahrgang, 1936, Berlin, and
48, No.1, Jahrgang, 1940, Berlin,
and Statistisches Jahrbuch Deutscher
Gemeinden, 27-33, Jahrgang, 1932-1938, Gustav Fischer, Jena.
X      People (thousands)
140    55.5
148    55.5
175    64.9
195    67.5
245    69.0
250    72.0
250    75.5
Example (cont’d)
r = 0.941
Equation of a Line
• The equation of a line is y = mx + b
• m is the slope of the line
• slope = the amount by which y changes when x
increases one unit - a slope of zero
means that there is no linear relationship
between x and y
• b is the intercept of the line
• intercept = the value of y when x=0
Least Squares Regression Line
least-squares regression line – the line that makes
the sum of the squared vertical distances to
the line as small as possible
Example – Fitting the Least Squares Line
People = 35.49 + 0.1507 x
[Figure: plot of People (y-axis, 60 to 75) versus X (x-axis, 150 to 250)]
Interpretation of the Slope: For every increase of 1
in X, we expect the population of Oldenburg, Germany
to increase by about 150 people (0.1507 thousand).
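The fitted line can be reproduced from the Oldenburg data. The sketch below assumes the usual closed-form least-squares formulas (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x), which the slides do not state explicitly. The function name `least_squares` is made up for illustration.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Oldenburg data from the example: X and population in thousands.
x = [140, 148, 175, 195, 245, 250, 250]
people = [55.5, 55.5, 64.9, 67.5, 69.0, 72.0, 75.5]

intercept, slope = least_squares(x, people)
print(f"People = {intercept:.2f} + {slope:.4f} x")
```

Rounded, the coefficients agree with the equation on the slide. Since People is in thousands, a slope of about 0.15 corresponds to roughly 150 people per unit of X.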
Prediction
Three Things to Understand about Prediction:
• Prediction is based on fitting some “model” to a
set of data.
• Prediction works best when the model fits the
data closely.
• Prediction outside the range of the available data
is risky. This is called an extrapolation.
Prediction Example
People = 35.49 + 0.1507 x
Using this equation to estimate the mean
population of Oldenburg Germany for an X level of
200, we have
35.49 + 0.1507(200) = 65.63
So, we estimate the mean population of Oldenburg
Germany (1930-1936) to be 65.63 thousand people
for an X level of 200.
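A small sketch of this prediction step, using the rounded coefficients from the slide. The helper `predict_people` and the extrapolation flag are illustrative additions, not part of the lecture. The observed range of X (140 to 250) is read off the data table.

```python
INTERCEPT = 35.49  # from the fitted line on the slide
SLOPE = 0.1507     # thousands of people per unit of X
X_MIN, X_MAX = 140, 250  # smallest and largest X in the observed data


def predict_people(x):
    """Return (predicted population in thousands, is_extrapolation)."""
    prediction = INTERCEPT + SLOPE * x
    # Predictions outside the observed range of X are extrapolations.
    is_extrapolation = not (X_MIN <= x <= X_MAX)
    return prediction, is_extrapolation


for x_value in (200, 400):
    predicted, risky = predict_people(x_value)
    note = " (extrapolation, treat with caution)" if risky else ""
    print(f"X = {x_value}: {predicted:.2f} thousand{note}")
```

X = 200 lies inside the data and gives 65.63 thousand. X = 400 lies far outside it, so the model still returns a number but there is no data to support it.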
Correlation and Regression
r² - the fraction of the variation in the values of y that
is explained by the least-squares regression of y
on x
In the example, r² ≈ 0.8857 (the square of r = 0.941 before
rounding), so about 89% of the variation in the population
is explained by the regression using X.
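One way to see "fraction of variation explained" is to compute it directly as 1 minus (leftover variation around the line) / (total variation in y). This is a sketch under the assumption that the line is the least-squares fit to the Oldenburg data. The variable names are illustrative.

```python
# Oldenburg data from the example: X and population in thousands.
x = [140, 148, 175, 195, 245, 250, 250]
people = [55.5, 55.5, 64.9, 67.5, 69.0, 72.0, 75.5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(people) / n

# Least-squares slope and intercept.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, people))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Total variation in y, and the variation left over after fitting the line.
fitted = [intercept + slope * xi for xi in x]
sst = sum((yi - mean_y) ** 2 for yi in people)
sse = sum((yi - fi) ** 2 for yi, fi in zip(people, fitted))

r_squared = 1 - sse / sst
print(f"r^2 = {r_squared:.4f}")
```

The result should match the square of the correlation, about 0.8857 here.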
[Figure: plot of People (y-axis, 60 to 75) versus X (x-axis, 150 to 250)]
Causation
•The moral of the story is:
Only experimentation can show causation!
• When dealing with regression and/or correlation,
NEVER say that one variable causes another.
• Snake bites and ice cream sales are highly
correlated. Does that mean that one causes the
other?
I would not trust a
prediction from this
regression for a car
with City MPG of 10
because:
A - The linear model
doesn’t seem to fit the
data
B – It would be
extrapolating
C – It would have a
large error because the
points are very spread
out around the line.
If a car gets 20 MPG in the city, how many MPG do
you predict it will get on the highway?
A – 0.896
B – 17.91
C – 26.97
D – 29.06
Each time the City MPG increases by one, what do we
predict happens to the highway MPG ? Goes…
A – Down 9.06
B – Down 0.896
C – Up 0.896
D – Up 9.06
What % of the variation in Highway MPG is
explained by the regression using City MPG?
A – 1.77%
B – 89.1%
C – 93%
D – 94.39%
• Suppose an algebra professor found that the correlation
between study time (in hours) and exam score (out of 100) is
+.80, and the regression line was found to be y = 20 + 4x. He
arrived at this equation through years of collecting data on his
students, most of whom reported studying anywhere from 0
to 20 hours for his exams.
For which values of study time does the professor’s regression
equation make sense in terms of predicting exam scores?
a. Between 0 and 20 hours.
b. Between 0 and 100 hours.
c. Anything greater than or equal to 0 hours.
d. It is not possible to predict exam score with study time.
Suppose the professor later found out that his
correlation was not +.80, but rather it was +.08. How
does this change the predictions he can make about
exam scores based on study time?
a. You have to take the results and divide them by 10,
because .80/10 = .08.
b. It won’t change the predictions because the
regression line stays the same.
c. The predictions should no longer be used because
they won’t be very accurate.
d. Not enough information to tell.