EXAMINING
RELATIONSHIPS
Least Squares Regression
• A regression line is a straight line that
describes how a response variable y
changes as an explanatory variable x
changes.
• The least squares regression line (LSRL) is
a mathematical model for the data.
– We often use a regression line to predict the
value of y for a given value of x.
– Regression, unlike correlation, requires that we
have an explanatory variable and a response
variable.
Least Squares Regression
• Example 3.9
– On the next slide is the least squares
regression line of Figure 3.2.
– We can use this line to predict the natural gas
consumption for this family.
• For example, “If a month averages 20 degree days
per day (45ºF), how much gas will the family use?”
• Locate 20 on the x-axis, then go up and over to find
the consumption on the y-axis that corresponds to
x=20.
• The family will use about 4.9 hundreds of cubic feet
of gas each day.
Least Squares Regression
• Investigating the least squares
regression line
– Different people might draw different lines by
eye on a scatterplot, especially when the
points are widely scattered.
– We want a line that can be used to predict y
from x.
Least Squares Regression
• Investigating the least squares
regression line
– We also want a line that is as close as
possible in the vertical direction to each
point.
• Prediction errors occur in y; that is why we want a
line as close as possible in the vertical direction.
– When we predicted that the family would use
4.9 hundreds of cubic feet for a month with
20 degree-days, it is possible that the
prediction involves some error.
• For example, if the actual usage turns out to be
5.1 hundreds of cubic feet, we have an error of 0.2.
Least Squares Regression
• Investigating the least squares
regression line
– We want a regression line that makes the
vertical distance from each point on the
scatterplot as small as possible.
– In Figure 3.10a (see the next slide) there are
three points from Figure 3.9, along with the
line, on an expanded scale.
– The line passes above two of the points and
below one of them.
– The vertical distances of the data points
appear as vertical line segments.
Least Squares Regression
• One reason the LSRL is so popular is that
the problem of finding the line has a simple
answer.
• We can give the recipe for the LSRL in
terms of the means and the standard
deviations of the two variables and their
correlation.
Least Squares Regression
• EQUATION OF THE LEAST SQUARES
REGRESSION LINE
– We have data on an explanatory variable x and a
response variable y for n individuals. From the data,
calculate the means x̄ and ȳ, the standard deviations
s_x and s_y of the two variables, and their correlation r.
– The LSRL is the line
ŷ = a + bx
with slope
b = r(s_y / s_x)
and y-intercept
a = ȳ − b·x̄
p. 150
3.36a, b
Least Squares Regression
• We write ŷ (read "y hat") in the equation of the
regression line to emphasize that the line gives a
predicted response ŷ for any x:
ŷ = a + bx
• The predicted response ŷ will usually not be exactly
the same as the actually observed response y.
Least Squares Regression
• The least squares regression line as reported by
the TI83 calculator is
ŷ = 1.0892 + 0.1890x
• Do not forget to put the hat symbol over the y to
show the predicted value.
Least Squares Regression
• Slope
– Slope of a regression line is usually important for
the interpretation of the data.
– The slope is the rate of change, the amount of
change in y when x increases by 1.
– The slope in this example is 0.1890, meaning that,
on average, each additional degree-day per day
predicts consumption of 0.1890 more hundreds of
cubic feet of natural gas per day.
Least Squares Regression
• Intercept
– The intercept of a line is the value of y when x=0.
– In our example, x=0 when the average outdoor
temperature is at least 65ºF.
– Substituting x=0 into the least squares
regression line gives a y-intercept of 1.0892
hundreds of cubic feet of gas per day when there
are no degree-days.
Least Squares Regression
• Predicting
– The least squares regression line makes
predicting easy.
– To predict gas consumption at 20 degree-days,
substitute x=20 to get ŷ = 4.869.
ŷ = 1.0892 + (0.1890)(20)
ŷ = 1.0892 + 3.78
ŷ = 4.869
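As a quick arithmetic check of this prediction, a one-line Python version (coefficients taken from the TI83 output above):

    a, b = 1.0892, 0.1890       # intercept and slope from the TI83 output
    y_hat = a + b * 20          # predicted gas use at 20 degree-days per day
    print(round(y_hat, 3))      # 4.869 hundreds of cubic feet per day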
Least Squares Regression
• Plotting the line
– To plot the line, use the equation to find ŷ for two
values of x.
– One value should be near the lower end of the
range of x and one should be near the high end.
– Plot each ŷ above its x and draw the line through
the two points.
ŷ = 1.0892 + (0.1890)(10)
ŷ = 1.0892 + 1.89
ŷ = 2.9792
ŷ = 1.0892 + (0.1890)(50)
ŷ = 1.0892 + 9.45
ŷ = 10.5392
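A short matplotlib sketch of this two-point plotting method (the family's raw data points are not reproduced in these slides, so only the line is drawn):

    import matplotlib.pyplot as plt

    a, b = 1.0892, 0.1890
    xs = [10, 50]                     # one x near each end of the range
    ys = [a + b * x for x in xs]      # 2.9792 and 10.5392
    plt.plot(xs, ys)                  # two points determine the line
    plt.xlabel("Degree-days per day")
    plt.ylabel("Gas used (100s of cubic feet per day)")
    plt.show()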
Least Squares Regression
• Least squares regression line with the actual
data and the two points from the least
squares regression equation.
Least Squares Regression
p. 142 3.31, 3.33 Together
p. 161 3.42 – 3.45, 3.47, 3.48
Least Squares Regression
• Facts about least-squares regression
– Example 3.11
• Figure 3.11(see next slide) shows a scatterplot that
played a central role in the discovery that the
universe is expanding.
• The data represent the distances from the earth
of 24 galaxies and the speeds at which these
galaxies are moving away from us, reported by the
astronomer Edwin Hubble in 1929.
Least Squares Regression
• Facts about least-squares regression
– Example 3.11
• This relationship is a linear relationship with a
correlation r=.7892
• This relationship shows that more distant galaxies
are moving away more rapidly.
• It is important to note that astronomers now
believe that this relationship is in fact a perfect
relationship, and that the scatter is caused by
imperfect measurements.
Least Squares Regression
• Facts about least-squares regression
– Example 3.11
• The two lines on the plot are the two least squares
regression lines.
• The regression line of velocity on distance is solid.
• The regression line of distance on velocity is
dashed.
• Regression of velocity on distance and regression
of distance on velocity give different lines.
• Remember to make sure you know which variable
is explanatory.
Least Squares Regression
• Facts about least-squares regression
– Even though correlation, r, ignores the distinction
between explanatory and response variables,
there is a close connection between correlation
and regression.
• The slope
b = r(s_y / s_x)
of the least squares regression line
ŷ = a + bx
means that a change of one standard deviation in x
corresponds to a change of r standard deviations in y.
Least Squares Regression
• Facts about least-squares regression
– SLOPE OF THE LEAST SQUARES
REGRESSION LINE
• When the variables are perfectly correlated (r=1
or r=-1), the change in the predicted response ŷ
is the same as the change in x (both measured in
standard deviation units).
• When −1 < r < 1, the change in ŷ is less than the
change in x.
• This means that as the correlation grows weaker,
the predicted ŷ moves less in response to changes
in x.
Least Squares Regression
• Facts about least-squares regression
• One way to determine the usefulness of the
least squares regression model is to
measure how well x predicts y.
• The coefficient of determination, r², is the
number we use to measure how well x
predicts y.
Least Squares Regression
• Facts about least-squares regression
• If x is a good predictor of y, then r² will be close
to 1 or 100%.
r² = .842
• This means that about 84% of the variation in y is
explained by the least squares regression of y on x.
• We can then say that x is a pretty good predictor of
y.
Least Squares Regression
• Facts about least-squares regression
• If x is a poor predictor of y, then r² will be close to 0
or 0%.
r² = .0357
• This means that only 3.57% of the variation in y is
explained by the least squares regression of y on x.
• We can now say that x is not a good predictor of y.
Least Squares Regression
• Facts about least-squares regression
– When you report a regression, give r2 as a
measure of how successful the regression was
in explaining the response.
– When you see a correlation, square it to get a
better feel for the strength of the association.
– Perfect correlation (r = ±1) means the points lie
exactly on a line. Then r²=1 and x predicts y
exactly.
Least Squares Regression
• Facts about least-squares regression
– If r=-.7 or r=.7, then r2=.49 and about half the
variation is accounted for by the linear
relationship.
– On the r² scale, a correlation of ±.7 is about
halfway between 0 and ±1.
– The special properties of least squares
regression are another reason why it is the most
common method of fitting a regression line to
data.
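A small sketch of computing r² as the squared correlation and reading it as the fraction of variation explained. The data here are made up, and statistics.correlation requires Python 3.10 or later:

    import statistics

    x = [1, 2, 3, 4, 5]
    y = [2.1, 3.9, 6.2, 7.8, 10.1]
    r = statistics.correlation(x, y)
    # r**2 is the fraction of the variation in y
    # explained by the LSRL of y on x
    print(f"r = {r:.4f}, r^2 = {r**2:.4f}")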
Least Squares Regression
• P. 150 3.36c, 3.37, 3.38 Together
Least Squares Regression
• RESIDUALS
– A residual is the difference between an observed value
of the response variable and the value predicted by the
regression line.
– That is, residual = observed y − predicted y:
residual = y − ŷ
– Pay attention to your window on your calculator when
looking at a residual plot.
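In code, this definition is a one-liner; the sketch below assumes the intercept a and slope b have already been computed:

    def residuals(x, y, a, b):
        # residual = observed y - predicted y = y - (a + b*x)
        return [yi - (a + b * xi) for xi, yi in zip(x, y)]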
Least Squares Regression
• RESIDUALS
– Example 3.15
• Does the age at which a child begins to talk predict a
later score on a test of mental ability?
• A study of the development of young children
recorded the age in months at which each of 21
children spoke their first word, along with each
child's Gesell Adaptive Score (an aptitude test
taken much later).
• The data appear in Table 3.4.
• The scatterplot with age at first word as the
explanatory variable x and Gesell score as the
response variable y appears in Figure 3.14.
Least Squares Regression
• RESIDUALS
Child  Age (months)  Score
  1        15          95
  2        26          71
  3        10          83
  4         9          91
  5        15         102
  6        20          87
  7        18          93
  8        11         100
  9         8         104
 10        20          94
 11         7         113
 12         9          96
 13        10          83
 14        11          84
 15        11         102
 16        10         100
 17        12         105
 18        42          57
 19        17         121
 20        11          86
 21        10         100
Least Squares Regression
• RESIDUALS
– Example 3.15 Continued
• Children 3 and 13 and also children 16 and 21 are
marked with a different plotting symbol because they
have identical values of both variables.
• The plot shows that children who begin to speak
later tend to have lower test scores than early
talkers (negative association).
• The overall pattern is moderately linear, with a
correlation of r = −.640.
Least Squares Regression
• RESIDUALS
– Example 3.15 Continued
• The regression line is
ŷ = 109.8738 − 1.1270x
• For Child 1, who first spoke at 15 months, the
predicted score would be 92.97.
ŷ = 109.8738 − (1.1270)(15) = 92.97
• The actual score was 95.
• So the residual is 95 − 92.97 = 2.03.
– The residual is positive because the data point lies above
the line.
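A sketch reproducing this fit with NumPy from the Table 3.4 data; polyfit returns a slope and intercept close to the slide's −1.1270 and 109.8738, and the residual for Child 1 comes out to about 2.03:

    import numpy as np

    age = np.array([15, 26, 10, 9, 15, 20, 18, 11, 8, 20, 7,
                    9, 10, 11, 11, 10, 12, 42, 17, 11, 10])
    score = np.array([95, 71, 83, 91, 102, 87, 93, 100, 104, 94, 113,
                      96, 83, 84, 102, 100, 105, 57, 121, 86, 100])
    b, a = np.polyfit(age, score, 1)  # slope ~ -1.1270, intercept ~ 109.8738
    resid = score - (a + b * age)
    print(round(resid[0], 2))         # residual for Child 1: ~2.03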
Least Squares Regression
• RESIDUALS
– Example 3.15 Continued
• There is a residual for each data point.
• Here are the 21 residuals for the data, in child
order (TI83 or Microsoft Excel):
2.0310   −9.5721  −15.6040  −8.7309   9.0310  −0.3341  3.4120
2.5230    3.1421    6.6659  11.0151  −3.7309 −15.6040 −13.4770
4.5230    1.3960    8.6500  −5.5403  30.2850 −11.4770  1.3960
Least Squares Regression
• RESIDUALS
– Example 3.15 Continued
• Check the residuals to see how well the regression
line fits the data.
• Look at the vertical deviations of the points from the
line in a scatterplot of the original data (Figure 3.14).
• A residual plot plots the residuals on the vertical
axis against the explanatory variable (Figure 3.15).
• This makes patterns easier to see.
• Residuals from least squares regression always
have a mean of zero.
• When you check the sum of the 21 residuals you get
−.0002 rather than exactly 0, because the printed
values were rounded (roundoff error).
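A sketch of this residual plot, continuing from the NumPy fit above (age and resid as computed there); the dashed horizontal line at zero marks the mean of the residuals:

    import matplotlib.pyplot as plt

    plt.scatter(age, resid)
    plt.axhline(0, linestyle="--")   # residuals always average to zero
    plt.xlabel("Age at first word (months)")
    plt.ylabel("Residual")
    plt.show()
    print(resid.mean())              # ~0, up to roundoff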
Least Squares Regression
• RESIDUALS
– What should a residual plot look like?
• The ideal pattern for a residual plot is Figure 3.16(a).
– The plot shows a uniform scatter of points above and
below the horizontal line at zero.
– The plot shows no unusual individual observations.
Least Squares Regression
• RESIDUALS
– Examining Residuals
• A curved pattern (Figure 3.16(b) on the next slide)
shows that a straight line is not a good model for
such data.
• Increasing or decreasing spread about the line as x
increases (Figure 3.16(c) on the next slide) produces
predictions of y that are less accurate for larger
values of x in that example.
• Individual points with large residuals, like Child 19 in
Figures 3.14 and 3.15, are outliers because they lie
outside the straight-line pattern.
• Individual points that are extreme in the x direction,
like Child 18 in Figures 3.14 and 3.15, may not have
large residuals but can still be very important.
Least Squares Regression
• Individual Observations
– In the Gesell Test Scores example, Child 18
and Child 19 are unusual in their own way.
– Child 19 lies far from the regression line, whereas
Child 18 lies close to the line but far out in the
x direction.
– Child 19 is an outlier, with a Gesell score so high
that we should check for a mistake in recording.
– Child 18 began to speak much later than any of
the other children.
– Because of its extreme position on the age
scale, this point has a strong influence on the
position of the regression line.
Least Squares Regression
• Individual Observations
– Figure 3.17 (next slide) adds a second
regression line, calculated after leaving out
Child 18.
– You can see that this one point moves the line
quite a bit.
– Least squares regression lines make the sum
of squares of the vertical distances to the points
as small as possible.
– A point that is extreme in the x direction with no
other points near it pulls the line toward itself
(influential).
Least Squares Regression
• Individual Observations
– An outlier is an observation that lies outside
the overall pattern of the other observations in
a scatterplot.
– An observation can be an outlier in the x
direction, in the y direction, or in both
directions.
– An observation is influential if removing it
would markedly change the position of the
regression line.
– Points that are outliers in the x direction are
often influential.
Least Squares Regression
• Individual Observations
– Child 18 is an outlier in the x direction and is
influential.
– Child 19 is an outlier in the y direction but has
less influence on the regression line because
many other points with similar values of x
anchor the line well below the outlying point.
– Influential points often have small residuals,
because they pull the line toward themselves.
– You must do more than just look at the
residuals to find influential points.
Least Squares Regression
• Individual Observations
– Example 3.16
• Correlation and least squares regression are
strongly influenced by extreme observations.
• In the Gesell example, the original data have r2=.41.
• That means the age at which a child begins to talk
explains 41% of the variation in a later test of
mental ability.
• This relationship is strong enough to be interesting
to parents, but if we leave out Child 18, r² drops to
.11 or 11%.
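A sketch of this influence check, continuing from the NumPy arrays above: refit the correlation with Child 18 (index 17) removed and compare the two r² values.

    import numpy as np

    keep = np.ones(len(age), dtype=bool)
    keep[17] = False                     # drop Child 18 (age 42 months)
    r_all = np.corrcoef(age, score)[0, 1]
    r_sub = np.corrcoef(age[keep], score[keep])[0, 1]
    print(round(r_all**2, 2), round(r_sub**2, 2))  # ~0.41 and ~0.11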
Least Squares Regression
• Individual Observations
– Example 3.16
• If the researcher in this case decides to exclude
Child 18, much of the connection between the age at
which a child begins to talk and later ability score
vanishes.
• If the researcher keeps Child 18, more data are
needed on other children who were also slow to
begin talking, so that the analysis no longer depends
so heavily on one child.
Least Squares Regression
• P. 159 3.39 a,b,d Together
• P. 159 3.40, 3.41
• P. 162 3.46,3.49
• Chapter 3 Review
• p. 166 3.51, 3.52, 3.54 – 3.59
• Interactive LSRL and Residual Graph