Chapter 3 - Examining Relationships

Chapter 3
Examining Relationships
AP Statistics
Hamilton/Mann
Introduction
• A medical study finds that short women are more
likely to have heart attacks than women of average
height, while tall women have the fewest heart
attacks.
• An insurance group reports that heavier cars have
fewer deaths per 100,000 vehicles than do lighter
cars.
• These and many other statistical studies look at the
relationship between two variables. Statistical
relationships are overall tendencies, not ironclad
rules. They allow individual exceptions.
Introduction
• For example, smokers on average die younger than
non-smokers, but some smokers who smoke three
packs a day live to be 90.
• To understand a statistical relationship between two
variables, we measure both variables on the same
individuals.
• Often we must measure other variables as well.
• To conclude that shorter women have higher risk of
heart attacks, the researchers had to eliminate the
effect of other variables such as weight and exercise
habits.
Introduction
• One thing to always remember is that the
relationship between two variables can be strongly
influenced by other variables that are lurking in the
background.
• When examining the relationship between two or
more variables, ask these key questions:
– Who are the individuals described by the data?
– What are the variables?
– Why were the data gathered?
– When, where, how, and by whom were the data
produced?
Introduction
• When we have data on several variables, categorical
variables are often present and of interest to us.
• For example, a medical study may record each
subject’s sex and smoking status along with
quantitative data such as weight, blood pressure
and cholesterol levels.
• We may be interested in the relationship between
two quantitative variables (weight and blood
pressure), between a categorical and quantitative
variable (sex and blood pressure) or between two
categorical variables (sex and smoking status).
• This chapter focuses on quantitative variables.
Introduction
• When you examine relationships among variables, a
new question becomes important:
– Do you want simply to explore the nature of the
relationship, or do you think that some of the variables
help explain or even cause changes in the others?
• This gets to the idea of response and explanatory
variables.
• Explanatory variables are often called independent
variables, and response variables are called dependent
variables.
Introduction
• Prediction requires that we identify an explanatory
and response variable.
• Remember that calling one variable explanatory
and the other a response does not mean that
changes in one cause changes in the other.
• To examine the data we will do the following:
1. Plot the data and compute numerical summaries.
2. Look for overall patterns and deviations from these
patterns.
3. When the pattern is quite regular, use a compact
mathematical model to describe it.
CHAPTER 3 SECTION 1
Scatterplots and Correlation
HW: 3.21, 3.22, 3.23, 3.25
Scatterplots
• The most effective way to display the relationship between
two quantitative variables is a scatterplot.
• Tips for drawing scatterplots by hand:
1. Plot the explanatory variable on the horizontal axis (x-axis). The
explanatory variable is usually called x and the response variable
is usually called y.
2. Label both axes.
3. Scale the horizontal and vertical axes. The intervals must be
uniform; that is, the distance between tick marks must be the
same.
4. If you are given a grid, try to use a scale so your plot uses the
whole grid. Make the plot large enough so details can be seen.
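• If you prefer software to plotting by hand, here is a minimal Python
sketch (using matplotlib; the data values are made up for illustration)
that follows the same tips:

import matplotlib.pyplot as plt

# Illustrative data only -- not the real state values
percent_taking = [9, 23, 55, 71, 12, 80, 34, 63]         # explanatory variable (x)
mean_sat_math = [605, 572, 518, 509, 589, 501, 545, 512]  # response variable (y)

plt.scatter(percent_taking, mean_sat_math)        # explanatory variable on the x-axis
plt.xlabel("Percent of seniors taking the SAT")   # label both axes
plt.ylabel("Mean SAT Math score")
plt.show()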
State SAT Math Scores
• More than a million high school seniors take the SAT
each year. We sometimes see state school systems
“rated” by the average SAT score of their seniors. This
is not proper because the percent of high school
seniors who take the SAT varies from state to state.
Let’s examine the relationship between the percents of
a state’s high school seniors who took the exam in 2005
and the mean SAT score in the state that year.
• We think that percent taking will cause some change in
the mean score. So percent taking will be the
explanatory variable (on the x-axis) and mean score will
be the response variable (on the y-axis). The
scatterplot is on the next slide.
State SAT Math Scores
[Scatterplot: percent of seniors taking the SAT (x) vs. mean SAT Math score (y)]
Interpreting Scatterplots
• To interpret a scatterplot, look for patterns and any
important deviations from those patterns.
• What patterns did we see in the State SAT Math
Scores scatterplot?
State SAT Math Scores
• The scatterplot shows a clear direction: the overall
pattern is from the upper left to the lower right. That
is, states with a higher percent of seniors taking the SAT
tend to have a lower mean math SAT score. This is
called a negative association between two variables.
• The form of the relationship is slightly curved. More
important, most states fall into one of two clusters. In
the cluster at the right, more than half of high school
seniors take the SAT and the mean scores are low. In
the cluster at the left, states have higher SAT math
scores and less than 30% of seniors take the test. Only
three states lie in the gap between the two clusters
(Arizona, Nevada, and California).
State SAT Math Scores
• What explains the clusters?
– There are two tests students can take for acceptance
into college: the SAT and the ACT. The cluster at the left
are states that tend to prefer the ACT while the cluster at
the right are states that prefer the SAT. The students in
ACT states who take the SAT tend to be students who are
applying to highly selective colleges. Therefore, the
mean SAT score for these states is higher because the
mean score for the best students will be higher than that
for all students.
State SAT Math Scores
• The strength of a relationship in a scatterplot is
determined by how closely the points follow a clear
form. For our example, the relationship is only
moderately strong because states with the same
percent taking the SAT show quite a bit of scatter in
their mean scores.
• Are there any deviations from the pattern?
• West Virginia stands out: 20% of its high school
seniors take the SAT, but the mean SAT Math score is
only 511. This point is an outlier.
Beer and Blood Alcohol
• How well does the number of beers a student
drinks predict his or her blood alcohol content
(BAC)? Sixteen student volunteers at The Ohio
State University drank a randomly assigned number
of cans of beer. Thirty minutes later, a police officer
measured their BAC. The data are below.
Student:  1     2     3     4     5     6      7     8
Beers:    5     2     9     8     3     7      3     5
BAC:      0.10  0.03  0.19  0.12  0.04  0.095  0.07  0.06

Student:  9     10    11    12    13    14     15    16
Beers:    3     5     4     6     5     7      1     4
BAC:      0.02  0.05  0.07  0.10  0.085 0.09   0.01  0.05
Beer and Blood Alcohol
• The students were equally divided between men
and women and differed in weight and drinking
habits.
• Because of this variation, many students don't
believe that the number of drinks will predict BAC well.
What do the data say?
Beer and Blood Alcohol
Beer and Blood Alcohol
• The scatterplot shows a fairly strong positive
association. Generally, more beers consumed result
in a higher BAC.
• The form of this relationship is linear. That is, the
points lie in a straight-line pattern.
• It is a fairly strong relationship because the points
fall pretty close to a line, with relatively little scatter.
• If we know how many beers a student has
consumed, we can predict BAC quite accurately
from the scatterplot.
• Not all relationships have a
simple form and a clear
direction that we can describe
as positive association or
negative association.
Adding Categorical Variables to Scatterplots
• The South has long lagged behind the rest of the
country in the performance of its schools. Efforts to
improve southern education have reduced the gap. We
wonder whether the South stands out in the study of
average SAT math scores.
• To observe this relationship, we will plot the 12
southern states in blue and observe what happens.
• Most of the southern states blend in with the rest of
the country. Several southern states do lie at the lower
edges of their clusters. Florida, Georgia, South
Carolina, and West Virginia have lower SAT math scores
than we would expect from their percent of high school
seniors who take the examination.
Adding Categorical Variables to Scatterplots
• Dividing the states into “southern” and
“nonsouthern” introduces a third variable into the
scatterplot.
• This is a categorical variable that has only two
values.
• The two values are displayed by the two different
plotting colors.
Measuring Linear Association: Correlation
• A scatterplot displays the direction, form, and
strength of the relationship between two
quantitative variables.
• Linear relations are particularly important because
a straight line is a simple pattern that is quite
common.
• We say a linear relation is quite strong if the points
lie close to a straight line, and weak if they are
widely scattered about a line.
• Unfortunately, our eyes are not good judges of how
strong a linear relationship is.
• Which looks
more linear?
• They are two
different
scatterplots of
the same data.
• So neither is
more linear.
• This is why our
eyes are not
good judges of
strength.
Measuring Linear Association: Correlation
• As you can see, our eyes can be fooled by changing the
plotting scales or the amount of empty space around
the cloud of points in the scatterplot.
• We need to follow our strategy for data analysis by
using a numerical measure to supplement the graph.
Correlation is the measure we use.
Measuring Linear Association: Correlation
• The correlation is
r = 1/(n − 1) · Σ [(xi − x̄)/sx] · [(yi − ȳ)/sy]
• Notice that the two terms inside the summation
notation are just the standardized values for x and y.
• The formula helps us to see what correlation is, but,
in practice, it is much too tedious to calculate by
hand. Instead we will find correlation on the
calculator.
• Input the following in list 1 and list 2.
Body weight (lb):      120  187  109  103  131  165  158  116
Backpack weight (lb):  26   30   26   24   29   35   31   28
Measuring Linear Association: Correlation
• Now go to Stat and then Calc.
• We have two options for running a linear
regression.
1. 4:LinReg (ax+b)
2. 8:LinReg(a+bx)
• For whatever reason, in Statistics, we like to have
a+bx, so we will use 8.
• When you select this, you should get a screen like
the one here.
Measuring Linear Association: Correlation
• If you did not get the r and r2, then we need to fix a
setting on your calculator.
• To do this:
1. Press 2nd and then 0 to go to the Catalog.
2. Press D (it's above x-1).
3. Scroll down to DiagnosticOn and press ENTER.
4. Press ENTER again.
5. Now do the LinReg(a+bx) again and r and r2 should
all be there.
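• If you want to double-check the calculator with software, here is a
sketch using Python's scipy (an assumption; any statistics package
works) on the body weight and backpack weight lists above:

from scipy.stats import linregress

body = [120, 187, 109, 103, 131, 165, 158, 116]  # list 1 (x)
pack = [26, 30, 26, 24, 29, 35, 31, 28]          # list 2 (y)

result = linregress(body, pack)
print("a =", result.intercept)       # intercept in a + bx
print("b =", result.slope)           # slope in a + bx
print("r =", result.rvalue)          # correlation
print("r2 =", result.rvalue ** 2)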
Facts about Correlation
• The formula for correlation helps us see that r is
positive when there is a positive association
between the variables.
• Height and weight, for example, have a positive
association. People who are above average in
height tend to be above average in weight.
• Let’s play with a correlation applet.
Facts about Correlation
• Here is what you need to know in order to interpret
correlation.
1. Correlation makes no distinction between explanatory
and response variables. It makes no difference which
variable you call x and which you call y in calculating
the correlation.
2. Because r uses the standardized values of the
observations, r does not change when we change the
units of measurement of x, y, or both. Measuring
height in centimeters rather than inches and weight in
kilograms rather than pounds does not change the
correlation between height and weight. The
correlation r itself has no unit of measurement; it is just
a number.
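• A quick way to convince yourself of fact 2 is to compute r before and
after a change of units. A sketch in Python (numpy assumed; the
height/weight values are made up):

import numpy as np

height_in = np.array([62, 65, 68, 70, 72, 74])        # inches
weight_lb = np.array([120, 140, 155, 170, 185, 200])  # pounds

r_before = np.corrcoef(height_in, weight_lb)[0, 1]
# Centimeters and kilograms: a linear change of units
r_after = np.corrcoef(height_in * 2.54, weight_lb * 0.4536)[0, 1]
print(r_before, r_after)  # identical -- r has no unit of measurement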
Facts about Correlation
3. Positive r indicates positive association between the
variables, and negative r indicates negative association.
4. The correlation r is always a number between -1 and 1.
Values of r near 0 indicate a very weak linear
relationship. The strength of the linear relationship
increases as r moves away from 0 toward either -1 or 1.
Values of r near -1 or 1 indicate that the points in a
scatterplot lie close to a straight line. The extreme
values -1 and 1 occur only in the case of a perfect linear
relationship, when the points lie exactly along a straight
line.
• This gives us some
idea of how the
correlation relates
to linearity and
spread of the data.
Facts about Correlation
• Describing the relationship between two variables is
more complex than describing the distribution of
one variable. Here are some cautions to keep in
mind.
1. Correlation requires that both variables be
quantitative, so it makes sense to do the arithmetic
indicated by the formula for r.
2. Correlation describes the strength of only the linear
relationship between variables. Correlation does not
describe curved relationships between variables, no
matter how strong they are.
Facts about Correlation
3. Like the mean and standard deviation, the correlation
is not resistant: r is strongly affected by a few outlying
observations.
4. Correlation is not a complete summary of two-variable
data, even when the relationship between two
variables is linear. You should give the means and
standard deviations of both x and y along with the
correlation.
• Because the formula for correlation uses the means
and standard deviations, these measures are the
proper choice to accompany a correlation.
Scoring Figure Skaters
• Two judges, Pierre and Elena, have awarded scores
for many figure skaters. Our question is: how well
do they agree?
• We calculate that the correlation between their
scores is r = 0.9, but the mean of Pierre’s scores is
0.8 point lower than Elena’s mean.
• The mean scores show that Pierre awards lower
scores than Elena. But because his scores are
consistently about 0.8 point lower than Elena's, the
correlation remains high.
• Adding the same number to all values of x or y does
not change the correlation.
Scoring Figure Skaters
• If both judges score the same skaters, the
competition is scored consistently because Pierre
and Elena agree on which performances are better
than others.
• But if Pierre scores some skaters and Elena others,
we must add 0.8 point to Pierre’s scores to arrive at
a fair comparison.
Word of Warning
• Even giving means, standard deviations, and
correlation for “state SAT math scores” and
“percent taking” will not point out the clusters that
we saw in the scatterplot.
• Numerical summaries complement plots of data,
but they do not replace them.
CHAPTER 3 SECTION 2
Least-Squares Regression Line
HW: 3.29, 3.30, 3.32, 3.34, 3.35 due Thursday
3.40, 3.41, 3.43, 3.44, 3.46 due Tuesday after spring
break!!! Along with section 1 quizzes
Least-Squares Regression
• Linear (straight-line) relationships between two
quantitative variables are easy to understand and
are quite common.
• In Section 1, we found linear relationships in
settings as varied as sparrowhawk colonies, sodium
and calories in hot dogs, and blood alcohol levels.
• Correlation measures the strength and direction of
these relationships.
• When a scatterplot shows a linear relationship, we
would like to summarize the overall pattern by
drawing a line on the scatterplot.
Least-Squares Regression
• A regression line summarizes the relationship
between two variables, but only in a specific
setting: when one of the variables helps explain or
predict the other.
• Regression, unlike correlation, requires that we
have an explanatory variable and a response
variable.
Does Fidgeting Keep You Slim?
• Some people don’t gain weight even when they
overeat. Perhaps fidgeting and other “nonexercise
activity” (NEA) explain why – some people may
spontaneously increase NEA when fed more.
• Researchers deliberately overfed 16 healthy young
adults for 8 weeks. They measured fat gain (in
kilograms) and, as an explanatory variable, change
in energy use (in calories) from activity other than
deliberate exercise – fidgeting, daily living, and the
like.
NEA change (cal):  -94   -57   -29   135   143   151   245   355
Fat gain (kg):     4.2   3.0   3.7   2.7   3.2   3.6   2.4   1.3

NEA change (cal):  392   473   486   535   571   580   620   690
Fat gain (kg):     3.8   1.7   1.6   2.2   1.0   0.4   2.3   1.1
Does Fidgeting Keep You Slim?
1. Who? The individuals are 16 healthy young adults
who participated in a study on overeating.
2. What? The explanatory variable is change in NEA
(in calories), and the response variable is fat gain
(kilograms).
3. Why? Researchers wondered whether changes in
fidgeting and other NEA would help explain weight
gain in individuals who overeat.
4. When, Where, How, and By Whom? The data come
from a controlled experiment in which subjects
were forced to overeat for an 8-week period. The
results of the study were published in Science
magazine in 1999.
Does Fidgeting Keep You Slim?
• The correlation between NEA change and fat gain is
r = -0.7786.
Does Fidgeting Keep You Slim?
• Interpretation: the scatterplot shows a moderately
strong negative linear association between NEA
change and fat gain, with no outliers.
• People with larger increases in NEA do indeed gain
less fat.
• A line drawn through the points will describe the
overall pattern well. This is what we wish to learn
to do in this section.
Interpreting a Regression Line
• When a scatterplot displays a linear form, we can
draw a regression line through the points.
• A regression line is a model for the data.
• The equation of a regression line gives a compact
mathematical description of what this model tells
us about the dependence of the response variable y
on the explanatory variable x.
Interpreting a Regression Line
• Although you are familiar with the form y = mx + b
for the equation of a line from algebra, statisticians
have adopted y = a + bx as the form for the
equation of the regression line.
• We will adopt this form too, so we will be
consistent with the notation that is used by others.
Does Fidgeting Keep You Slim?
• Any straight line describing the nonexercise activity
data has the form
fat gain = a + b(NEA change)
• In the plot below, the regression line with equation
fat gain = 3.505 – 0.00344(NEA change)
has been drawn.
• The plot shows that the
line fits the data well.
Does Fidgeting Keep You Slim?
• Interpreting Slope
– The slope b = -0.00344 tells us that fat gained goes down
by 0.00344 kilogram for each added calorie of NEA,
according to this linear model.
– The slope of a regression line y = a + bx is the predicted
rate of change in the response y as the explanatory
variable x changes.
– The slope of the regression line is an important
numerical description of the relationship between two
variables.
Does Fidgeting Keep You Slim?
• Interpreting the y-intercept
– The y-intercept, a = 3.505 kilograms, is the fat gain
estimated by this model if NEA does not change when a
person overeats.
– Although we need the value of the y-intercept to draw
the line, it is statistically meaningful only when x can
actually take values close to zero.
Does Fidgeting Keep You Slim?
• The slope b = -0.00344 is small for our example.
• This does not mean that change in NEA has little
effect on fat gain. The size of a regression slope
depends on the units in which we measure the two
variables.
• For our example, the slope is the change in
kilograms when NEA increases by 1 calorie.
• There are 1000 grams in a kilogram. So if we
measured in grams instead, the slope would be
1000 times larger: b = -3.44.
• The point is, you cannot say how important a
relationship is by looking at how big the regression
slope is.
Prediction
• We can use a regression line to predict the response y
for a specific value of the explanatory variable x.
• For our example, we want to use the regression line to
predict the fat gain for an individual whose NEA
increases by 400 calories when she overeats.
• The easiest, and most accurate, way to do this is to
substitute x = 400 into the regression equation.
• The predicted fat gain is
fat gain = 3.505 – 0.00344(400) = 2.13 kilograms
• The accuracy of predictions from a regression line
depends on how much scatter about the line the data
show. Our scatterplot shows that similar changes in
NEA correspond to a spread of 1 or 2 kilograms in fat gain.
• The regression line summarizes the pattern but gives
only roughly accurate predictions.
Prediction
• Can we predict the fat gain for someone whose NEA
increases by 1500 calories when she overeats?
• Obviously we can substitute x = 1500 into the
equation.
• The prediction is
fat gain = 3.505 – 0.00344(1500) = -1.66 kilograms
• That is, we predict that this individual will lose
weight when she overeats. This prediction makes
no sense.
Prediction
• Looking at the scatterplot, we see that an NEA
increase of 1500 calories is far outside the range of
our data. We can't say whether increases this large
ever occur, or whether the relationship remains
linear at such extreme values. Predicting fat gain
when NEA increases by 1500 calories is an
extrapolation of the relationship beyond what the
data show.
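• One way to guard against extrapolation in software is to warn whenever
the x-value falls outside the observed data. A Python sketch (the cutoffs
are the smallest and largest NEA changes in our data):

def predict_fat_gain(nea_change, x_min=-94, x_max=690):
    """Predict fat gain (kg) from the fitted line, warning on extrapolation."""
    if not (x_min <= nea_change <= x_max):
        print("Warning: x is outside the data range -- extrapolation!")
    return 3.505 - 0.00344 * nea_change

print(predict_fat_gain(400))    # 2.13 kg, within the data range
print(predict_fat_gain(1500))   # -1.66 kg, nonsense from extrapolation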
The Least-Squares Regression Line
• In most cases, no line will pass through all the
points in a scatterplot.
• Different people will draw different lines by eye.
• So we need a way to draw a regression line that
doesn’t depend on our guess as to where the line
should go.
• Because we use the line to predict y from x, the
prediction errors we make are errors in y, the
vertical direction in the scatterplot.
• A good regression line makes the vertical distances
of the points from the line as small as possible.
The Least-Squares Regression Line
• This graph illustrates the idea we discussed on the
prior slide.
• Notice that
if we move
the line
down, we
would
increase the
distance for
the top two
points.
The Least-Squares Regression Line
• There are many ways to make the collection of
vertical distances “as small as possible.”
• The most common is the “least-squares” method.
• The line for the NEA and weight gain example was
the least-squares regression line.
• The next slide has a visual representation of this
idea using hiker weight and backpack weight from a
problem in the book.
The Least-Squares Regression Line
• The least-squares regression line shown minimizes
the sum of the squared vertical distances of the
points from the line to 30.90. No other line would
give a smaller sum of squared errors.
• What is the
equation of the
least-squares
regression
line?
• Is it the same
equation we
got on the
calculator?
The Least-Squares Regression Line
• One reason for the popularity of the least-squares
regression line is that the problem of finding the
equation of the line has a simple answer.
• We can give the equation of the least-squares
regression line in terms of the means and standard
deviations of the two variables and their correlation:
the line is ŷ = a + bx, with slope b = r(sy/sx) and
intercept a = ȳ − b·x̄.
The Least-Squares Regression Line
• We write ŷ in the equation of the regression line
to emphasize that the line gives a predicted
response ŷ for any x.
• Because of the scatter of the points about the line,
the predicted response will usually not be exactly
the same as the actually observed response y.
• Note: If you write a least-squares prediction
equation and do not use ŷ, but use y instead, you
will get the answer wrong. This is considered a
major error.
Fat Gain and NEA
• Use the calculator to verify that the mean and
standard deviation of the 16 changes in NEA are
x̄ = 324.8 calories and sx = 257.66 calories, and that the
mean and standard deviation of the 16 fat gains are
ȳ = 2.388 kg and sy = 1.1389 kg.
• The correlation between fat gain and NEA change is
r = -0.7786. Therefore the least-squares regression
line of fat gain y on NEA change x has slope
b = r(sy/sx) = -0.7786(1.1389/257.66) = -0.00344.
Fat Gain and NEA
• Now that we have the slope, we use the fact that
the least-squares line passes through (x̄, ȳ) = (324.8, 2.388).
• The intercept is a = ȳ − b·x̄ = 2.388 − (-0.00344)(324.8) = 3.505,
so the equation of the least-squares line is
ŷ = 3.505 − 0.00344x.
• When doing calculations like this by hand, you need
to carry extra decimal places in the calculations to
get accurate values of the slope and y-intercept.
Using a calculator eliminates this worry.
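• The same computation in Python, working only from the summary
statistics (a sketch of the formulas b = r·sy/sx and a = ȳ − b·x̄):

x_bar, s_x = 324.8, 257.66   # mean, std dev of NEA change (cal)
y_bar, s_y = 2.388, 1.1389   # mean, std dev of fat gain (kg)
r = -0.7786                  # correlation

b = r * s_y / s_x            # slope
a = y_bar - b * x_bar        # the line passes through (x_bar, y_bar)
print(f"fat gain = {a:.3f} + ({b:.5f})(NEA change)")  # about 3.505 - 0.00344x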
Using Technology
• In practice, we do not have to calculate the means,
standard deviations, and correlation first.
• The calculator will give the slope b and intercept a
of the least-squares line from keyed in values of the
variables x and y.
• This allows us to concentrate on understanding and
using the regression line.
• Examples of software output are given on the next slides.
How Well the Line Fits the Data: Residuals
• One of the first principles of data analysis is to look
for an overall pattern and also for striking
deviations from the pattern.
• A regression line describes the overall pattern of a
linear relationship between an explanatory variable
and a response variable.
• We see deviations from this pattern by looking at
the scatter of the data points about the regression
line.
• The vertical distances from the points to the
least-squares regression line are as small as possible,
in the sense that they have the smallest possible sum
of squares.
How Well the Line Fits the Data: Residuals
• Because they represent “leftover” variation in the
response after fitting the regression line, these
distances are called residuals.
Fat Gain and NEA
• The graph below is the scatterplot of the NEA and
fat gain data with the least-squares regression line
superimposed on it.
Residuals: Fat Gain and NEA
• For one subject, NEA rose by 135 calories while the
subject gained 2.7 kg.
• The predicted gain for an NEA increase of 135 calories is
ŷ = 3.505 − 0.00344(135) = 3.04 kg.
• So the residual for this subject is
residual = observed y − predicted ŷ = 2.7 − 3.04 = −0.34 kg.
Residuals: Fat Gain and NEA
• The residual for this subject was negative because
the data point lies below the LSRL.
• The 16 data points used in calculating the least-squares
line produce 16 residuals. Rounded to two
decimal places, they are
 0.37  -0.70   0.10  -0.34
 0.19   0.61  -0.26  -0.98
 1.64  -0.18  -0.23   0.54
-0.54  -1.11   0.93  -0.03
Most graphing calculators will calculate and store
these residuals for you.
• Because the residuals show how far the data are
from the regression line, examining the residuals
helps assess how well the line describes the data.
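• As a cross-check, a Python sketch (numpy assumed) that computes all
16 residuals at once from the NEA data:

import numpy as np

nea = np.array([-94, -57, -29, 135, 143, 151, 245, 355,
                392, 473, 486, 535, 571, 580, 620, 690])
fat = np.array([4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3,
                3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1])

predicted = 3.505 - 0.00344 * nea    # y-hat from the LSRL
residuals = fat - predicted          # observed y minus predicted y-hat
print(np.round(residuals, 2))
print("sum:", round(residuals.sum(), 3))  # essentially zero, as noted below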
Residuals: Fat Gain and NEA
• Residuals can be calculated from any model that is
fitted to the data. The residuals from a least-squares
line have a special property: the sum of the
least-squares residuals is always zero.
• You can see the residuals by looking at the
scatterplot. This, however, is a little difficult. It’s
easier to see if we can rotate the graph.
• A residual plot makes it easier to study the residuals
by plotting them against the explanatory variable.
• Because the mean of the residuals is always zero,
the horizontal line at zero helps to orient us. This
horizontal line corresponds to the regression line.
Residuals: Fat Gain and NEA
[Residual plot: NEA change (x) vs. residual (kg), with a horizontal line at residual = 0]
Residual Plots
• The residual plot magnifies the deviations from the
line to make the patterns easier to see.
• If the regression line captures the overall pattern of
the data, there should be no pattern in the
residuals.
• Let’s look at two residual plots to better understand
this point.
Residual Plots
[Residual plot A: unstructured scatter about the zero line, so a linear
model fits well. Residual plot B: a curved pattern, so a linear model
is not appropriate.]
Residual Plots
• Two important things to look for when you examine a
residual plot.
1. The residual plot should show no obvious pattern.
– A curved pattern shows that the relationship is not linear. This was
like plot B on the prior slide. In this case, a straight line may not be
the best model for the data.
– Increasing (or decreasing) spread about the line as x increases
indicates that prediction of y will be less accurate for larger x (or for
smaller x, respectively). An example is below.
Residual Plots
2. The residuals should be relatively small in size. A
regression line in a model that fits the data well should
come “close” to most of the points. That is, the residuals
should be fairly small. How do we decide if the residuals
are “small enough”? We consider the size of a “typical”
prediction error.
– For the fat gain and NEA data, almost all of the residuals are
between -0.7 and 0.7. For these individuals, the predicted fat gain
is within 0.7 kg of their actual fat gain during the study. This
sounds pretty good.
– The subjects, however, gained only between 0.4 kg and 4.2 kg. So
a prediction error of 0.7 kg is relatively large compared with the
actual fat gain for an individual.
– The largest residual, 1.64, corresponds to a prediction error of 1.64
kg. This subject's actual fat gain was 3.8 kg, but the regression line
predicted a gain of only 2.16 kg. That's a pretty large error!
Residual Plots
• A commonly used measure of typical prediction
error is the standard deviation of the residuals,
which is given by
s = sqrt( Σ residuals² / (n − 2) )
• For the NEA and fat gain data,
s = sqrt(7.645 / 14) ≈ 0.74 kg
• Researchers would have to decide whether they
would feel comfortable using this linear model to
make predictions that might be consistently “off” by
0.74 kg.
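• Continuing the Python sketch from the residuals slide, the
computation of s looks like this:

import numpy as np

residuals = np.array([0.37, -0.70, 0.10, -0.34, 0.19, 0.61, -0.26, -0.98,
                      1.64, -0.18, -0.23, 0.54, -0.54, -1.11, 0.93, -0.03])
n = len(residuals)
s = np.sqrt(np.sum(residuals ** 2) / (n - 2))  # divide by n - 2, not n
print(round(s, 2))   # about 0.74 kg of typical prediction error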
To make a Residual Plot on the Calculator
1. Input Data into L1 and L2.
2. Go to Stat, the CALC menu and select
8:LinReg(a+bx). Hit enter.
3. Behind the LinReg(a+bx) put L1, L2, Y1. You can
find Y1 by hitting the VARS key, selecting the
Y-VARS menu, 1:Function and then 1:Y1. What you
have entered should look like LinReg(a+bx) L1, L2, Y1
on the calculator.
4. Now go back to your lists. Clear L3 and highlight
L3. Now type in Y1(L1). This tells the calculator to
plug the values from L1 into equation Y1.
To make a Residual Plot on the Calculator
5. Now clear L4 and then highlight L4. Type in L2-L3.
This figures the difference between the actual data
and what was predicted by the LSRL.
6. Now go to StatPlot. You want to do a scatterplot
of L1 and L4. This will be your residual plot.
Before graphing do a ZoomStat.
7. If other graphs come up, you will need to go turn
off plots. If you don’t know how, we can talk about
this.
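• The same residual plot can be made in Python; a sketch with numpy
and matplotlib, drawing the horizontal line at zero for orientation:

import numpy as np
import matplotlib.pyplot as plt

nea = np.array([-94, -57, -29, 135, 143, 151, 245, 355,
                392, 473, 486, 535, 571, 580, 620, 690])
fat = np.array([4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3,
                3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1])

b, a = np.polyfit(nea, fat, 1)       # least-squares slope and intercept
residuals = fat - (a + b * nea)

plt.scatter(nea, residuals)
plt.axhline(0)                        # corresponds to the regression line
plt.xlabel("NEA change (cal)")
plt.ylabel("Residual (kg)")
plt.show()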
The Role of r2 in Regression
• We also have a numerical value that helps us
determine how well the least-squares line does at
predicting values of the response variable y.
• This value is r2, the coefficient of determination.
• Some computer packages call it “R-sq.”
• r2 is the square of r, the correlation, but it tells us
much more than that.
• What if our least-squares line does not help predict
the values of the response variable y as x changes?
• Then our best guess would be the mean of y.
The Role of r2 in Regression
• The idea of r2 is this: how much better is the
least-squares line at predicting responses y than if we just
used ȳ as our predictor for every point?
• For each point, we could ask: Which comes closer to
the actual y-value, the least-squares line or the
horizontal line y = ȳ? Then we could count how
many times each was closer, and declare a “winner.”
• This approach, however, does not take into account
how much better one line is than the other.
• r2 does take into account how much better one line
is than the other.
The Role of r2 in Regression
• In general, r2 = 1 − SSE/SST, where SSE = Σ(y − ŷ)² is
the sum of squared errors about the least-squares line
and SST = Σ(y − ȳ)² is the sum of squared deviations
about ȳ.
• If all of the points fall directly on the least-squares
line, SSE = 0 and r2 = 1. In this case, all of the
variation in y is explained by the linear relationship
with x.
• So the r2 value tells how much of the variation of y
can be explained by the linear model.
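• A sketch verifying r2 = 1 − SSE/SST for the NEA data in Python:

import numpy as np

nea = np.array([-94, -57, -29, 135, 143, 151, 245, 355,
                392, 473, 486, 535, 571, 580, 620, 690])
fat = np.array([4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3,
                3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1])

b, a = np.polyfit(nea, fat, 1)
sse = np.sum((fat - (a + b * nea)) ** 2)  # squared errors about the LSRL
sst = np.sum((fat - fat.mean()) ** 2)     # squared errors about y-bar
print("r2 =", 1 - sse / sst)              # about 0.606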
Fat Gain and NEA
• Let's look back at our example.
• What was r2? How can we find it?
• It was 0.606.
• What does this mean?
• About 61% of the variation in y among the
individual subjects is due to the straight-line
relationship between y and x.
Facts about Least-Squares Regression
• One reason for the popularity of LSRLs is that they
have many convenient special properties. Here is a
summary of several important facts about LSRLs.
1. The distinction between explanatory and response
variable is essential in regression because the LSRL
minimizes distances only in the y direction. If we
reverse the role of the two variables, we would get a
different LSRL.
2. There is a close connection between correlation and
the slope of the least-squares line. The slope is
b = r(sy/sx).
Facts about Least-Squares Regression
3. The LSRL of y on x always passes through the point
(x̄, ȳ). So the LSRL of y on x is the line with slope r(sy/sx)
that passes through the point (x̄, ȳ).
4. The correlation r describes the strength of a straight
line relationship. In the regression setting, this
description takes a specific form: the square of the
correlation, r2, is the fraction (percent) of the variation
in the values of y that is explained by the least-squares
regression of y on x.
CHAPTER 3 SECTION 3
Correlation and Regression Wisdom
HW: 3.59, 3.60, 3.62, 3.64, 3.65, 3.70
Correlation and Regression
• Correlation and regression are powerful tools for
describing the relationship between two variables.
When you use these tools, remember that they have
limitations that we’ve already discussed:
– Correlation and regression describe only linear
relationships. The calculations can be done for any two
quantitative variables, but the results are only useful if the
scatterplot shows a linear pattern.
– Extrapolation often produces unreliable predictions.
– Correlation is not resistant. Always plot your data and look
for unusual observations before you interpret correlation.
• Here are some other cautions to keep in mind when
you apply correlation and regression or read accounts
of their use.
Look for Outliers and Influential Observations
• We already know that the correlation r is not
resistant. One unusual point in a scatterplot can
greatly change the value of r.
• Is the least-squares line resistant? The example on
the next few slides should shed some light on this
question.
Talking Age and Mental Ability
• We wonder if the age (in months) at which a child
begins to talk can predict the child’s later score on a
test of mental ability.
• A study recorded this information for 21 children.
The data are listed below.
Child  Age  Score    Child  Age  Score    Child  Age  Score
1      15   95       8      11   100      15     11   102
2      26   71       9      8    104      16     10   100
3      10   83       10     20   94       17     12   105
4      9    91       11     7    113      18     42   57
5      15   102      12     9    96       19     17   121
6      20   87       13     10   83       20     11   86
7      18   93       14     11   84       21     10   100
Talking Age and Mental Ability
• W5HW
– Who? – The individuals are 21 young children.
– What? – The variables measured are age at first spoken
word and later score on the Gesell Adaptive test.
– Why? – Researchers believe the age at which a child first
speaks can help predict the child’s mental ability.
– When, where, how and by whom? – Too specific and not
overly important for this question.
• The next slide has a scatterplot of the data with age
at first spoken word as the explanatory variable and
Gesell score as the response variable.
Talking Age and Mental Ability
• Numerical Summaries: the mean age at first word is
x̄ = 14.38 months, the mean Gesell score is ȳ = 93.67,
and the correlation is r = -0.640.
• Model: The LSRL has equation ŷ = 109.87 − 1.13x.
• What do you
think about
Child 19 and
Child 18?
• The residual plot for this model is on the next slide.
Talking Age and Mental Ability
• The residual plot helps us to identify outliers and
influential observations. So what do you think
about Children 18 and 19 now?
• Child 18 –
Influential
Observation
• Child 19 –
Outlier
Talking Age and Mental Ability
• Interpretation
– The scatterplot (and the correlation) shows a negative
association. That is, children who speak later tend to
have lower test scores than early talkers. The correlation
tells us that the overall pattern is moderately linear.
– The slope of the regression line tells us that for every
month older a child is when he or she first talks, the
predicted Gesell score decreases by 1.13 points. The
y-intercept of 109.87 means that a child who speaks at
age 0 months would score 109.87 on the Gesell test. This
prediction is obviously ridiculous; it is an extrapolation.
Talking Age and Mental Ability
• Interpretation (continued)
– How well does the LSRL fit the data? The residual plot
shows a fairly “random” scatter of points around the
“residual = 0” line. There is one very large positive
residual. Most of the prediction errors (residuals) are 10
points or fewer on the Gesell test.
– Since r2 = 0.41, 41% of the variation in Gesell scores can
be explained by the LSRL of Gesell scores on age at first
spoken word. That leaves 59% of the variation in Gesell
scores unexplained by the linear relationship.
– Children 18 and 19 are both special points. Child 19 lies
far from the regression line and is an outlier. Child 18
lies close to the line but far out in the x direction and is
an influential observation.
Talking Age and Mental Ability
• As we said, Child 18 is an influential observation.
But why?
• Since this child began to speak much later, his or
her extreme position on the age scale causes this
point to have a strong influence on the position of
the regression line.
• The next slide compares the regression line with
Child 18 to the regression line without child 18.
• The equation of the LSRL without Child 18 is
ŷ = 105.63 − 0.779x.
• The equation of the LSRL with Child 18 is
ŷ = 109.87 − 1.13x.
Talking Age and Mental Ability
• Since the LSRL makes the sum of the squares of the
vertical distances to the points as small as possible,
a point that is extreme in the x direction, with no
other point near it, pulls the line toward itself.
We call these points influential.
• The LSRL is most likely to be heavily influenced by
observations that are outliers in the x direction. The
scatterplot will alert you to such observations.
Influential points will often have small residuals
because they pull the regression line toward
themselves. If you look at a residual plot, you may miss
influential points.
• The surest way to verify that a point is influential is to
find the regression line with and without the suspect
point. If the line moves more than a small amount, the
point is influential.
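• A sketch of this with-and-without check in Python for the talking-age
data (numpy assumed; index 17 is Child 18 because Python counts
from zero):

import numpy as np

age = np.array([15, 26, 10, 9, 15, 20, 18, 11, 8, 20, 7,
                9, 10, 11, 11, 10, 12, 42, 17, 11, 10])
score = np.array([95, 71, 83, 91, 102, 87, 93, 100, 104, 94, 113,
                  96, 83, 84, 102, 100, 105, 57, 121, 86, 100])

b_all, a_all = np.polyfit(age, score, 1)
keep = np.arange(len(age)) != 17          # drop Child 18
b_sub, a_sub = np.polyfit(age[keep], score[keep], 1)

print(f"with Child 18:    y-hat = {a_all:.2f} + ({b_all:.3f})x")
print(f"without Child 18: y-hat = {a_sub:.2f} + ({b_sub:.3f})x")
# A large shift in the line means the point is influential.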
Talking Age and Mental Ability
• The strong influence of Child 18 makes the original
regression of Gesell scores on age at first word
misleading. The original data have r2 = 0.41. The
relationship is strong enough to be interesting to
parents. If we leave out Child 18, r2 drops to 11%.
The strength of the association was due largely to a
single influential observation.
• What should the researcher do?
– Should she exclude Child 18? If so, there is basically no
relationship.
– Should she keep Child 18? If so, then she must collect
data on other children that are slow to begin talking so
that the analysis is not so dependent on just one child.
Beware the Lurking Variable
• Another caution is perhaps even more important:
the relationship between two variables can often be
understood only by taking other variables into
account.
• Lurking variables can make a correlation or
regression misleading.
• You should always think about possible lurking
variables before you draw conclusions based on
correlation or regression.
Is Math the Key to College Success?
• A College Board study of 15,941 high school graduates
found a strong correlation between how much math
minority students took in high school and their later
success in college.
• News articles quoted College Board as saying that
“math is the gatekeeper for success in college.”
• This might be true, but we should also consider lurking
variables.
• Minority students from middle-class homes with
educated parents no doubt take more high school math
courses. They are also more likely to have a stable
family, parents who emphasize education and can pay
for college, and so on. As you can see, family
background is a lurking variable for this study.
Imported Goods and Private Health Spending
• As you can see, there is a strong positive linear
association between the value of imported goods
and private spending on health. The correlation is
r = 0.9749. Because
r2 = 0.9504,
least-squares
regression of y on x
will explain 95% of
the variation in the
values of y.
• Are they really this
associated?
Imported Goods and Private Health Spending
• The explanatory variable was the dollar value of
goods imported into the U.S. in the years 1990 to
2001.
• The response variable is private spending on health
in these years.
• There is no economic relationship between these
variables. The strong association is due entirely to
the fact that both imports and health spending
grew rapidly in these years. The common year is a
lurking variable for each point.
• Any two variables that both increase over time will
show a strong association. This does not mean that
one variable explains or influences the other.
Lurking Variables
• Correlations such as that between imported goods
and private health spending are sometimes called
“nonsense correlations.” The correlation is real.
What is nonsense is the idea that the variables are
directly related so that changing one of the
variables causes changes in the other.
• This example shows that association does not imply
causation.
Housing and Health
• A study of housing conditions in the city of Hull,
England, measured a large number of variables for
each of the wards in the city. Two of the variables
were a measure of overcrowding x and a measure
of the lack of indoor toilets y.
• Because x and y are both a measure of inadequate
housing, we expect a high correlation.
• However, our correlation is only r = 0.08. How can
this be?
Housing and Health
• Investigation found that some poor wards had a lot
of public housing. These wards had high values of x
but low values of y because public housing always
includes indoor toilets.
• Other poor wards lacked public housing, and these
wards had high values of both x and y.
• Within wards of both types, there was a strong
positive association between x and y.
• Analyzing all wards together ignored the lurking
variable – amount of public housing – and hid the
nature of the relationship between x and y.
Housing and Health
• The scatterplot shows the two distinct groups
formed by the lurking variable.
• There is a strong correlation between x and y in
each of the two groups.
• In fact, r = 0.85 and r = 0.91 in the two groups.
• However, because similar values of x correspond to
very different values of y, x alone is of little use in
predicting y.
• This example is another reminder of why it is
important to plot the data instead of simply
calculating numerical measures.
Beware Correlations Based on Averages
• Many regression or correlation studies work with
averages or other measures that combine information
from many individuals.
• For example, if we plot the average height of young
children against their age in months, we will see a very
strong positive association with correlation near 1.
• But individual children of the same age vary a great
deal in height. A plot of height against age for
individual children will show much more scatter and
lower correlation than the plot of average height
against age.
• Correlations based on averages are usually too high
when applied to individuals.
• This is another reminder that it is important to note
exactly what variables were measured in a study.
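• A quick simulation of this effect (a sketch with made-up numbers,
numpy assumed): correlate height with age for simulated individual
children, then for the average height at each age:

import numpy as np

rng = np.random.default_rng(0)
ages = np.repeat(np.arange(24, 61), 50)                  # 50 children per age (months)
heights = 65 + 0.6 * ages + rng.normal(0, 6, ages.size)  # cm, with individual scatter

r_individuals = np.corrcoef(ages, heights)[0, 1]
unique_ages = np.unique(ages)
avg_heights = np.array([heights[ages == a].mean() for a in unique_ages])
r_averages = np.corrcoef(unique_ages, avg_heights)[0, 1]

print(r_individuals, r_averages)   # the averages give a much higher r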