Regression and Prediction

Chapter 15 plus extra
May 2, 2012
Prediction
Vertical Chimneys
Regression Line
Equation of the Regression Line
Regression and Least Squares
Regression Fallacy
1.0 Prediction
If we have two quantitative variables X and Y that are linearly related, then knowing the value of X for an individual can help us estimate (or predict) the value of Y for that individual. We will explore two questions: What is the best prediction of the response variable (Y) given a value of the explanatory variable (X)? And what is the likely size of the prediction error?
1.1 Fundamental Principle of Prediction
Incoming students at a large law school have an average L.S.A.T. score of 163 and an S.D. of 8. You may assume the histogram of these scores follows the normal curve approximately. Tomorrow one of these students will be chosen at random. What is your best guess for their score? The guess will be compared to their actual score to see how far off it is. What is the likely size of the error in your guess?
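The principle can be checked by simulation. The sketch below (my code, with simulated scores; none of this is from the slides) guesses the group average for every student and measures the typical size of the error, which should come out close to the S.D. of 8.

```python
import random

# Hypothetical simulation: L.S.A.T. scores drawn from normal(mean=163, sd=8).
random.seed(0)
mean, sd = 163, 8
scores = [random.gauss(mean, sd) for _ in range(100_000)]

# Best guess with no other information: the group average.
guess = mean

# Root-mean-square error of that guess -- the "likely size" of the error.
rmse = (sum((s - guess) ** 2 for s in scores) / len(scores)) ** 0.5
print(round(rmse, 1))  # close to the S.D. of 8
```

The r.m.s. error of guessing the mean is, by definition, the S.D. of the group, which is why the S.D. answers the "likely size of the error" question.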
2.0 Vertical Chimneys in a Scatterplot
[Scatterplot of son's height (inches) against father's height (inches); the vertical strip of points at each father's height forms a "chimney" of sons' heights.]
The graph of averages shows the average son's height for each father's height. It is close to a straight line in the middle; at the ends, it is quite bumpy.
2.1 Prediction in a Scatterplot
Use the mean of the relevant sub-group of the data as the predictor. The S.D. of that sub-group gives the "likely size" of the error in the prediction.
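The chimney idea can be sketched in a few lines of Python. The heights below are made up for illustration; nothing here comes from the actual father-son data.

```python
from statistics import mean, pstdev
from collections import defaultdict

# Toy (father, son) heights in inches -- invented data for illustration.
pairs = [(65, 66), (65, 64), (68, 69), (68, 70), (68, 68),
         (71, 70), (71, 72), (71, 73)]

# Group sons' heights into vertical "chimneys" by father's height.
chimneys = defaultdict(list)
for father, son in pairs:
    chimneys[father].append(son)

# The graph of averages: one (father's height, mean son's height) point
# per chimney; the chimney S.D. is the likely size of the prediction error.
for father in sorted(chimneys):
    sons = chimneys[father]
    print(father, round(mean(sons), 2), round(pstdev(sons), 2))
```

With real data some chimneys at the extremes hold very few points, which is why the graph of averages gets bumpy at the ends.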
3.0 Regression Line
[Scatterplot of son's height (inches) against father's height (inches), with the regression line.]
The regression line is a line fit to the graph of averages. It smooths away some of the chance variation in the data. If the graph of averages is close to a straight line, then we use the regression line to predict Y for a given X. If the graph of averages is non-linear, it is better to use the graph of averages itself.
3.1 Predicting using a Regression Line
Estimate the average weight of the men whose height is 69 inches. If you used the regression method to estimate weight from height, would your estimates generally be a little too high, a little too low, or about right for men in the sample with heights between 72 in. and 74 in.?
4.0 The Regression Line
The regression line for predicting Y from X passes through the point of averages (X̄, Ȳ) and has slope

    r × (S.D. of Y) / (S.D. of X).
5.0 The Equation of the Regression Line
The regression line for predicting Y from X has the form:

    Y = a + b X = intercept + slope × X.

Here

    b = slope = r × (S.D. of Y) / (S.D. of X),

    a = intercept = Ȳ − b X̄ = Ȳ − r × (S.D. of Y) / (S.D. of X) × X̄.
5.1 Prediction from a Regression Line
The predicted value of Y for a given value of X, say X*, has the form:

    Ŷ = a + b X* = [Ȳ − r × (S.D. of Y) / (S.D. of X) × X̄] + r × (S.D. of Y) / (S.D. of X) × X*.
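The formulas above can be collected into one small function (a sketch; the variable names are mine):

```python
def regression_predict(x_star, x_bar, y_bar, sd_x, sd_y, r):
    """Predict Y at X = x_star from the five summary statistics."""
    slope = r * sd_y / sd_x            # b = r * (S.D. of Y) / (S.D. of X)
    intercept = y_bar - slope * x_bar  # a = Y-bar - b * X-bar
    return intercept + slope * x_star  # Y-hat = a + b * X*

# Predicting at X = X-bar returns Y-bar: the line passes through
# the point of averages.
print(round(regression_predict(68, 68, 69, 2.7, 2.8, 0.5), 6))  # 69.0
```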
5.2 Predicting Sons’ Heights
1,078 father-son pairs and their heights were measured.
- Average height of fathers is ≈ 68 in.
- S.D. of height of fathers is ≈ 2.7 in.
- Average height of sons is ≈ 69 in.
- S.D. of height of sons is ≈ 2.8 in.
- r ≈ 0.5.
What are the co-ordinates for the point of averages?
What is the slope of the regression line?
What is the intercept of the regression line?
Write the equation of the regression line.
Suppose a father has a height of 72 inches. What would you predict for his son's height?
Suppose a father has a height of 62 inches. What would
you predict for his son’s height?
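Using the summary statistics on this slide, the arithmetic works out roughly as follows (the point of averages is (68, 69); the printed values are approximate because the inputs are):

```python
x_bar, sd_x = 68, 2.7   # fathers: average and S.D. (inches)
y_bar, sd_y = 69, 2.8   # sons: average and S.D. (inches)
r = 0.5

slope = r * sd_y / sd_x            # r * (S.D. of Y) / (S.D. of X)
intercept = y_bar - slope * x_bar  # Y-bar - slope * X-bar

print(round(slope, 2))                   # 0.52
print(round(intercept, 1))               # 33.7
print(round(intercept + slope * 72, 1))  # 71.1 -- prediction for a 72 in. father
print(round(intercept + slope * 62, 1))  # 65.9 -- prediction for a 62 in. father
```

Note that both predictions are closer to the sons' average of 69 in. than the fathers' heights are to 68 in.: the regression effect at work.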
5.3 Interpreting the Regression Coefficients
Associated with a unit increase in X, there is some average change in Y. The slope of the regression line estimates this change. The formula for the slope is:

    r × (S.D. of Y) / (S.D. of X).

That is, associated with an increase of one S.D. in X, there is an increase of r S.D.s in Y, on average. The intercept is just the predicted value of Y when X equals zero. Be wary of extrapolation.
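The "r S.D.s of Y per S.D. of X" reading and the slope formula say the same thing; a one-line check with the father-son numbers (my notation):

```python
# If X goes up by one S.D. of X, the prediction goes up by
# slope * sd_x = (r * sd_y / sd_x) * sd_x = r * sd_y, i.e. r S.D.s of Y.
r, sd_x, sd_y = 0.5, 2.7, 2.8
slope = r * sd_y / sd_x
change_in_yhat = slope * sd_x  # response to a one-S.D. change in X
print(abs(change_in_yhat - r * sd_y) < 1e-12)  # True
```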
6.0 Regression and Least Squares
The regression line is commonly referred to as the least-squares line. This is because it minimizes the sum of the squares of the vertical distances from the data points to the line.
[Diagram: a data point above the regression line, with its vertical distance to the line marked; axes labelled x and y.]
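As a quick numeric check of the least-squares property, the sketch below (toy data of my own) fits the line from the summary statistics and confirms that perturbing the slope or intercept never reduces the sum of squared vertical distances:

```python
# Toy data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 2, 4, 4, 6]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sd_x = (sum((x - x_bar) ** 2 for x in xs) / n) ** 0.5
sd_y = (sum((y - y_bar) ** 2 for y in ys) / n) ** 0.5
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n * sd_x * sd_y)

b = r * sd_y / sd_x   # slope
a = y_bar - b * x_bar  # intercept

def sse(a_, b_):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (a_ + b_ * x)) ** 2 for x, y in zip(xs, ys))

best = sse(a, b)
# Nudging the line in any direction gives a sum of squares at least as large.
print(all(sse(a + da, b + db) >= best
          for da in (-0.1, 0, 0.1) for db in (-0.1, 0, 0.1)))  # True
```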
7.0 The Regression Fallacy
In virtually every scatterplot with less-than-perfect correlation, the data points that are extreme along the x-axis tend not to be as extreme along the y-axis. This is called the regression effect.

Definition: Thinking that the regression effect must be due to something important, and not just chance error, is called the regression fallacy.
7.1 Example
An instructor standardizes both her midterm and the final
each semester so the class average is 50 and the S.D. is 10 on
both tests. The correlation between the tests is around 0.5.
One semester she took all the students who scored below 30 in
the midterm and gave them special tutoring. On average, they
gained 10 points the final. She claims that her tutoring
worked. Can you give her alternate explanation?
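Chance alone predicts a gain of roughly this size. The simulation below (my code; the score model is a standard way to generate correlated normals, not anything from the slides) draws midterm and final scores with mean 50, S.D. 10, and r = 0.5, with no tutoring effect at all, and looks at the students who scored below 30 on the midterm:

```python
import random

random.seed(1)
r = 0.5
gains = []
for _ in range(200_000):
    z1 = random.gauss(0, 1)
    z2 = random.gauss(0, 1)
    midterm = 50 + 10 * z1
    # Standard construction of a second normal with correlation r to the first.
    final = 50 + 10 * (r * z1 + (1 - r ** 2) ** 0.5 * z2)
    if midterm < 30:
        gains.append(final - midterm)

# Average "gain" for the below-30 group, with no tutoring at all.
print(round(sum(gains) / len(gains), 1))  # roughly 12 points
```

The below-30 students average around 26 on the midterm, so the regression method predicts a final of about 50 + 0.5 × (26 − 50) ≈ 38: a double-digit "gain" from the regression effect alone.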