PEARSON’S FATHER-SON DATA
• The following scatter diagram shows the heights of 1,078
fathers and their full-grown sons, in England, circa 1900.
There is one dot for each father-son pair.
[Figure: Heights of fathers and their full-grown sons. Scatter
diagram of son’s height (inches) against father’s height
(inches); both axes run from 58 to 78.]
• How would you describe the relationship between the heights
of the fathers and the heights of their sons?
• For a father of a given height, what height would you predict
for his son?
• How tall are the sons of 6 foot fathers?
[Figure: Father-son pairs where the father is 6 feet tall. The
same scatter diagram, with a vertical chimney over the 6 foot
fathers and a cross in the chimney.]
• The points in the vertical chimney are the father-son pairs
where the father is 6 feet tall, to the nearest inch.
• The cross marks the average height of the sons of these
fathers.
• These sons are ______ inches tall, on average.
• This is the natural guess for the height of a son of a 6
foot father.
• What’s a simple way to describe how son’s height varies
with father’s height?
• How does son’s height vary with father’s height?
[Figure: The graph of averages. Scatter diagram of son’s height
(inches) against father’s height (inches); crosses mark the
average son’s height for each value of father’s height.]
[Figure: The graph of averages and the regression line. The
same diagram, with the regression line drawn through the
crosses.]
• Reading from left to right, the crosses mark the average of
son’s height, for fathers who are respectively 61, 62, . . . , 73
inches tall, to the nearest inch.
• These averages are the conditional means of son’s height,
for fathers of the given heights.
• As fathers increase in height, on average so do their sons.
The relationship is nearly linear.
• Are the bumps “real”?
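The graph of averages can be sketched in a few lines of Python. This is an illustration only: Pearson's measurements are not reproduced here, so the code simulates 1,078 hypothetical pairs from the summary figures shown on the later diagram (averages 67.7 and 68.7 inches, SDs 2.7 inches, r = 0.5). The function name `graph_of_averages` and the seed are assumptions made for the example.

```python
import random
from collections import defaultdict
from statistics import mean

def graph_of_averages(xs, ys):
    """Average of y within each strip of x, with x rounded to the nearest unit."""
    strips = defaultdict(list)
    for x, y in zip(xs, ys):
        strips[round(x)].append(y)
    return {x: mean(strip) for x, strip in sorted(strips.items())}

# Synthetic stand-in for the father-son data (NOT Pearson's measurements).
rng = random.Random(0)
AVE_FATHER, AVE_SON, SD, R = 67.7, 68.7, 2.7, 0.5
fathers, sons = [], []
for _ in range(1078):
    z_father = rng.gauss(0, 1)
    # Son's standard units: r times the father's, plus independent scatter.
    z_son = R * z_father + (1 - R ** 2) ** 0.5 * rng.gauss(0, 1)
    fathers.append(AVE_FATHER + SD * z_father)
    sons.append(AVE_SON + SD * z_son)

averages = graph_of_averages(fathers, sons)
for height in range(61, 74):
    if height in averages:
        print(height, round(averages[height], 1))
```

Each printed row plays the role of one cross. Rerunning with a different seed moves the individual averages around a little while the overall upward trend stays, which is one way to think about whether the bumps are "real".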
• The so-called regression line (of son’s height on father’s
height) is a smoothed version of the graph of averages.
• The average height of sons whose fathers are x inches
tall is approximately equal to the y coordinate of the point
on the regression line directly above x.
• How does the regression line compare to the SD line?
• Where does the term “regression” come from?
[Figure: The regression line and the SD line. Scatter diagram
of son’s height (average 68.7 inches, SD = 2.7 inches) against
father’s height (average = 67.7 inches, SD = 2.7 inches);
correlation coefficient r = 0.5. Both the regression line and
the SD line are drawn.]
• How is the regression line like the SD line?
• It goes through the point of averages. Fathers of average
height tend to have sons of average height.
• How is it different from the SD line?
• It is less steeply sloped, by a factor of r.
• What does that mean?
• For each increase of one SD in father’s height, there is
an increase of only r SDs in son’s height, on average.
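The comparison of the two lines can be made concrete with the summary figures from the diagram. The sketch below is a minimal illustration, not part of the original slides; the function names `sd_line` and `regression_line` are made up for the example, and the SD line is written for the case r > 0.

```python
# Summary statistics read off the father-son diagram (inches).
AVE_FATHER, SD_FATHER = 67.7, 2.7
AVE_SON, SD_SON = 68.7, 2.7
R = 0.5

def sd_line(father):
    """Height on the SD line: one SD of son per SD of father (case r > 0)."""
    return AVE_SON + (SD_SON / SD_FATHER) * (father - AVE_FATHER)

def regression_line(father):
    """Regression estimate: only r SDs of son per SD of father."""
    return AVE_SON + R * (SD_SON / SD_FATHER) * (father - AVE_FATHER)

# Average father, father one SD above average, and a 6 foot father.
for father in (67.7, 70.4, 72.0):
    print(father, round(sd_line(father), 2), round(regression_line(father), 2))
```

Both lines pass through the point of averages (67.7, 68.7). For a 6 foot father (72 inches) the SD line reads 73.0 inches, while the regression line reads about 70.85 inches, which is the estimate for the average height of the sons.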
[Figure: The regression line and the SD line, shown again:
son’s height against father’s height, r = 0.5.]
• Sons of fathers who are taller (shorter) than average are
themselves taller (shorter) than average, but not by as much
as the fathers.
• There is a “falling back” towards the mean. In Galton’s
terms: a “regression towards mediocrity.”
• The term “regression” is widely used in statistics in connection with how one variable varies with respect to another.
• Why do sons of fathers who are taller than average tend
themselves to be taller than average, but by not so much as
their fathers?
• First answer: A person’s height depends on the heights of
both the mother and father. Very tall men generally marry
women who are not exceptionally tall, not because they prefer
short women, but because factors other than height enter into
the choice of a mate. A man who is 2 SDs taller than the
average male may marry a woman who is 2 SDs taller than
the average female. But because attributes other than height
matter, the wife is very likely to be a woman who is less than 2
SDs taller than average, just because there are so many more
such women. Thus, most very tall men marry women who are
not as exceptionally tall as they are, and so have sons who are
not as exceptionally tall either.
• Second answer: Diet, exercise, and other environmental
factors influence height, so that observed height is not a
perfect reflection of one’s genes. Someone who is very tall
is much more likely to be the unusually tall result of more
average genes than the unusually short result of extremely tall
genes, simply because there are many more average genes in
the population. Thus the observed height of a very tall person
is usually an overstatement of his or her genetic height, which
is what determines the expected height of the child.
THE REGRESSION METHOD
• For any football-shaped scatter diagram
[Figure: a football-shaped cloud of points, with x on the
horizontal axis and y on the vertical axis.]
the average value of y corresponding to a given value of x may
be estimated by the regression line (for y on x).
[Figure: the regression line passes through the point of
averages (Avex, Avey). At a value of x lying # × SDx from Avex,
the regression estimate lies r × # × SDy from Avey.]
For however many SDs x is above or below average, the
regression estimate for the conditional mean of y given x is
only r times that many SDs above or below the overall average
for y.
• Associated with each increase of one SD in x there is an
increase of only r SDs in y, on the average.
Why is r the right factor?
• First answer: it just turns out that way. We saw it
happen in the father-son example. It happens that way for all
football-shaped scatter diagrams.
• Second answer: Three special cases are easy to understand:
r = 0, 1, and −1.
• If r = 1, there is a perfect positive association between
x and y. A one SD increase in x corresponds exactly to a
one SD increase in y.
• If r = −1, there is a perfect negative association. A one
SD increase in x corresponds exactly to a one SD decrease
in y.
• If r = 0, there is no association between x and y. For a
one SD increase in x, there is no increase or decrease in y,
on average.
NONLINEAR ASSOCIATION
• Beware: linear regression doesn’t make sense when the data
structure isn’t linear.
[Figure: scatter diagram with a nonlinear pattern; x runs from
0 to 6 and y from 0 to 8. The regression line and a smoothed
version of the graph of averages are both drawn.]
• You can always use the graph of averages to describe how y
changes with respect to x, on average.
• It usually makes sense to smooth the graph of averages to
get a simpler description of the variation.
• But when the pattern in the data is nonlinear, it doesn’t
make sense to smooth the graph of averages to a straight line.
• How do you tell if the pattern in the data is nonlinear?
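One rough check for nonlinearity is to set the graph of averages beside the regression line and see whether the averages stray from the line in a systematic way. The sketch below uses a hypothetical arch-shaped data set invented for this illustration (it is not the data behind the slide's diagram); the helper `correlation` and the seed are likewise assumptions.

```python
import random
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical nonlinear data: y rises and then falls as x goes from 0 to 6.
rng = random.Random(1)
xs = [rng.uniform(0, 6) for _ in range(600)]
ys = [8 - (x - 3) ** 2 + rng.gauss(0, 0.5) for x in xs]

def correlation(xs, ys):
    """Average of the products of x and y in standard units."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    return mean([(x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys)])

r = correlation(xs, ys)
ave_x, ave_y = mean(xs), mean(ys)
slope = r * pstdev(ys) / pstdev(xs)   # slope of the regression line

# Graph of averages: average y within each strip of x (nearest unit).
strips = defaultdict(list)
for x, y in zip(xs, ys):
    strips[round(x)].append(y)

print("r =", round(r, 2))
for x in sorted(strips):
    on_line = ave_y + slope * (x - ave_x)
    print(x, round(mean(strips[x]), 1), round(on_line, 1))
```

In this made-up example r comes out close to 0, so the regression line is nearly flat, yet the strip averages climb and then fall by several units. A gap of that kind between the two columns is a sign that a straight line is the wrong summary.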
PREDICTIONS FOR INDIVIDUALS
• Consider again the 1,078 pairs of fathers and sons.
[Figure: Heights of fathers and sons in England, circa 1900.
Scatter diagram of son’s height (inches; average 68.7) against
father’s height (inches).]
• I’m going to pick one of these pairs at random, and ask you
to predict the son’s height.
• If I tell you nothing about the father, your prediction
would be ______. In the diagram, that corresponds to using
the ______ line.
• If I tell you the father’s height, your prediction would
be ______. In the diagram, that corresponds to using the
______ line.
• Would you use the regression line to predict the height of a
son:
• For fathers in England circa 1900?
• For fathers who are midgets?
• For fathers in the U.S. in 1997?
• A regression line can be used to make predictions for
individuals. But if you have to extrapolate far from the data,
or to a different group of subjects, watch out!
USING THE REGRESSION LINE
• For however many SDs x is above or below average, the
regression estimate for the conditional mean of y given x is
only r times that many SDs above or below the overall average
for y.
[Figure: the regression line passes through the point of
averages (Avex, Avey). At a value of x lying # × SDx from Avex,
the regression estimate lies r × # × SDy from Avey.]
• Consider the father-son data.
average height of fathers ≈ 68 inches, SD ≈ 3 inches
average height of sons ≈ 69 inches, SD ≈ 3 inches, r ≈ 0.5
The scatter diagram is football-shaped. On average, how tall
are sons of 6 foot fathers?
• A 6 foot father is ______ inches above average in
height.
• That’s ______ of an SD.
• The average height of sons of 6 foot fathers is ______
the overall average by ______ of an SD.
• That’s ______ inches.
• So sons of 6 foot fathers are on average ______ inches
tall.
USING THE REGRESSION LINE
• For however many SDs x is above or below average, the
regression estimate for the conditional mean of y given x is
only r times that many SDs above or below the overall average
for y.
• For the men ages 18–24 in the HANES sample, the relationship
between height and systolic blood pressure can be summarized
as follows:
average height ≈ 68 inches, SD ≈ 3 inches
average blood pressure ≈ 124 mm, SD ≈ 15 mm, r ≈ −0.2
The scatter diagram is football-shaped. Estimate the average
blood pressure of the men who were 64 inches tall.
• These men are ______ inches ______ average in
height.
• That’s ______ of an SD.
• On average, these men are ______ the overall average
in blood pressure by ______ of an SD.
• That’s ______ mm.
• So the average blood pressure of these men is estimated
as ______ mm.
• In general, to estimate the average value of y for a given
value of x, you:
(1) Convert the given value of x to ______.
(2) Multiply by the ______. This gives
the corresponding average value of y in standard units.
(3) Convert back from standard units.
[Figure: the three steps. Step 1 goes from original units for x
(Ave. − 2 SD, Ave., Ave. + 2 SD) to standard units for x (−2
to 2); Step 2 goes from standard units for x to standard units
for y; Step 3 goes from standard units for y back to original
units for y.]
In Step 1, you use the Average and SD of ______.
In Step 3, you use the Average and SD of ______.
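The regression method can be traced numerically on the HANES blood pressure question. The code below is a sketch built from the rounded summary figures quoted on the slide; the helper names `to_standard_units` and `from_standard_units` are invented for this illustration.

```python
def to_standard_units(value, ave, sd):
    """How many SDs the value lies above (+) or below (-) the average."""
    return (value - ave) / sd

def from_standard_units(z, ave, sd):
    """Convert a value in standard units back to the original units."""
    return ave + z * sd

# Rounded summary statistics for the men ages 18-24 in the HANES sample.
AVE_HEIGHT, SD_HEIGHT = 68, 3    # inches
AVE_BP, SD_BP = 124, 15          # mm
R = -0.2

z_height = to_standard_units(64, AVE_HEIGHT, SD_HEIGHT)   # height in SDs from average
z_bp = R * z_height                                       # r times as many SDs for y
bp_estimate = from_standard_units(z_bp, AVE_BP, SD_BP)    # back to mm

print(round(z_height, 2), round(z_bp, 2), round(bp_estimate, 1))
```

Because r is negative, men who are below average in height come out above average in blood pressure, and only by a fraction of an SD.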
PREDICTING PERCENTILE RANKS
• For the first-year students at Podunk University, the
correlation between SAT scores and first-year GPA is 0.60.
Predict the percentile rank of the first-year GPA for a student
whose percentile rank on the SAT was 30%.
Normal table (Area = percent of the area under the normal
curve between −z and z):
z       Area
0.30    23.58
0.35    27.37
0.40    31.08
0.45    34.73
0.50    38.29
0.55    41.77
• Line of attack:
[Figure: SAT points (Ave. − 2 SD, Ave., Ave. + 2 SD) → SAT in
standard units (−2 to 2) → GPA in standard units (−2 to 2) →
GPA points (Ave. − 2 SD, Ave., Ave. + 2 SD).]
• Percentile rank for SAT = ______.
• Standard units for SAT = ______.
• Standard units for GPA = ______.
• Estimated percentile rank for GPA = ______.
• You don’t need to know the Average and SD for SAT or
GPA. But you do need to assume that ______.
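The line of attack for percentile ranks can be sketched with Python's standard library in place of the normal table. The sketch assumes, as a labeled assumption of this illustration, that both variables follow the normal curve and that the scatter diagram is football-shaped; the function name `predict_percentile` is made up for the example.

```python
from statistics import NormalDist

def predict_percentile(percentile_x, r):
    """Regression estimate of the percentile rank of y from that of x.

    Assumes both variables follow the normal curve (an assumption of
    this sketch, not something the code can check).
    """
    std_normal = NormalDist()                       # mean 0, SD 1
    z_x = std_normal.inv_cdf(percentile_x / 100)    # percentile -> standard units
    z_y = r * z_x                                   # regression method
    return 100 * std_normal.cdf(z_y)                # standard units -> percentile

print(round(predict_percentile(30, 0.60), 1))
```

The predicted rank lies between the 30th percentile and the 50th: the student is expected to be below the median on GPA, but closer to it than on the SAT.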
THE TEST-RETEST SCENARIO
• An instructor standardizes her midterm and final so the class
average is 50 and the SD is 10 on both tests. The correlation
between the tests is always around 0.50. On one occasion,
she took the students who scored in the lower quartile at the
midterm and gave them special tutoring. On average, they
scored about 6 points higher on the final than they did on
the midterm. Does this show that the special tutoring was
effective? Answer yes or no, and explain briefly.
[Figure: Midterm and final scores. Scatter diagram of score on
final against score on midterm, r = 0.52. The scores of the
students in the lower quartile on the midterm are marked by
•’s; their point of averages is marked by the ×. The scores of
the remaining students are marked by ◦’s.]
• No. The improvement can be explained by chance variation,
using the model
Observed test score = True score + chance error.
The data points in the preceding diagram were actually created
as follows. First a set of normally distributed “true scores” was
randomly assigned to the students in the class; these scores
represent the students’ innate abilities. Then the midterm
and final scores were constructed by adding random normally
distributed “chance errors” to the true scores; these chance
errors account for factors such as preparedness, concentration,
and anxiety that vary from test day to test day. The following
diagram shows the breakdown of the midterm scores into their
innate-ability and test-day components.
[Figure: Performance on the midterm. Random error on the
midterm (−10 to 10) plotted against true score (30 to 70).
Students in the lower quartile on the midterm are represented
by •’s, the other students by ◦’s.]
The diagram shows that students in the lower quartile on
the midterm tended to have some bad luck: their chance
errors for that test were mostly negative. Their midterm
scores thus tend to understate their true abilities, which are
what determine their average performance on the final. In
other words, you should expect this group to do better on
average on the final, even without any special tutoring. The
expected amount of improvement can be found by applying
the regression method to estimate the group’s average score
on the final from its average score on the midterm.
• What is the expected improvement?
• Suggest a design to study whether special tutoring is
effective. What problems might arise?
THE REGRESSION EFFECT AND
THE REGRESSION FALLACY
• In a typical test-retest situation, the subjects get different
scores on the two tests. Take the bottom group on the first
test. Some improve on the second test, others do worse; but
on the average the bottom group shows an improvement. Now
the top group. Some do better the second time, others fall
back; but on average the top group does worse the second
time. This is the regression effect, and it happens whenever
the scatter diagram spreads out around the SD line into a
football-shaped cloud of points.
• The regression fallacy consists in thinking that the regression
effect is due to something other than spread around the SD
line.
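The regression effect in the tutoring example can be checked two ways: by the regression method, and by simulating the model "observed score = true score + chance error" with no tutoring at all. The sketch below is an illustration under stated assumptions: scores are taken to be normally distributed, the true-score and chance-error SDs are chosen so that the two tests have SD 10 and correlation 0.50, and the class size of 20,000 and the seed are arbitrary.

```python
import random
from statistics import NormalDist, mean

AVE, SD, R = 50, 10, 0.50

# Regression method, assuming normally distributed scores.
std_normal = NormalDist()
cutoff_z = std_normal.inv_cdf(0.25)                  # lower-quartile cutoff
mean_z_low = -std_normal.pdf(cutoff_z) / 0.25        # average z below the cutoff
expected_midterm = AVE + SD * mean_z_low
expected_final = AVE + SD * R * mean_z_low           # only r times as many SDs
expected_gain = expected_final - expected_midterm

# Simulation: observed score = true score + chance error, no tutoring.
rng = random.Random(2)
n = 20000
true_sd = SD * R ** 0.5           # variance of true scores = r * SD^2
error_sd = SD * (1 - R) ** 0.5    # the rest of the variance is chance error
true_scores = [rng.gauss(AVE, true_sd) for _ in range(n)]
midterm = [t + rng.gauss(0, error_sd) for t in true_scores]
final = [t + rng.gauss(0, error_sd) for t in true_scores]

cutoff = sorted(midterm)[n // 4]
gains = [f - m for m, f in zip(midterm, final) if m < cutoff]

print(round(expected_gain, 1), round(mean(gains), 1))
```

Under these assumptions both numbers come out a little over 6 points, about the size of the improvement the instructor observed, even though nobody in the simulation was tutored.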
THERE ARE TWO REGRESSION LINES
• As well as regressing son’s height on father’s height, we can
regress father’s height on son’s height:
[Figure: The regression of father’s height on son’s height.
Scatter diagram of son’s height (SH; average 68.7, SD = 2.7)
against father’s height (FH; average = 67.7, SD = 2.7), with
r = 0.5. The SD line and the regression line for predicting FH
from SH are drawn; a horizontal strip contains a cross.]
• The points in the horizontal strip are the father-son pairs
where the son is one SD above average, to the nearest inch.
• The cross at the point (69.3, 71.4) marks the average height
of the fathers of these sons. The cross lies very close to the
regression line for FH on SH, which predicts these fathers to
be above average in height by only ______.
• Two regression lines can be drawn across a scatter diagram:
[Figure: a football-shaped scatter diagram of y against x,
showing the SD line, the regression line for y on x, and the
regression line for x on y, with Ave marked on each axis.]
• The regression line for y on x is used to predict y from x.
• It involves thinking about:
• dividing the scatter diagram into ______ strips;
• finding the average value of ______ in each strip; and
• smoothing the resulting graph of averages to a
straight line.
• The line goes through the point of averages and is less
steeply sloped than the SD line by a factor of |r|.
• The regression line for x on y is used to predict x from y.
• It involves thinking about:
• dividing the scatter diagram into horizontal strips;
• finding the average value of x in each strip; and
• smoothing the resulting graph of averages to a
straight line.
• The line goes through the point of averages and is more
steeply sloped than the SD line by a factor of 1/|r|.
• In general the two regression lines are different. You can
screw up a prediction badly by using the wrong line!
• When will the two lines be the same?
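The cost of using the wrong line can be seen with the father-son summary figures. The sketch below is illustrative only; the function names `predict_son` and `predict_father` are invented for the example, and `wrong_father` shows what happens if the line for predicting sons is read backwards.

```python
# Summary statistics from the father-son diagram (inches).
AVE_FH, SD_FH = 67.7, 2.7    # father's height
AVE_SH, SD_SH = 68.7, 2.7    # son's height
R = 0.5

def predict_son(fh):
    """Regression of SH on FH: predict a son's height from the father's."""
    return AVE_SH + R * SD_SH * (fh - AVE_FH) / SD_FH

def predict_father(sh):
    """Regression of FH on SH: predict a father's height from the son's."""
    return AVE_FH + R * SD_FH * (sh - AVE_SH) / SD_SH

son = AVE_SH + SD_SH              # a son one SD above average
father = predict_father(son)      # the right line for this question

# Wrong line: solve the SH-on-FH line for FH instead of regressing FH on SH.
wrong_father = AVE_FH + (son - AVE_SH) * SD_FH / (R * SD_SH)

print(round(son, 2), round(father, 2), round(wrong_father, 2))
```

The right line puts the fathers of these sons at about 69.05 inches, close to the cross at 69.3, while the wrong line puts them at about 73.1 inches.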
SUMMARY
• Associated with each increase of one SD in x there is an
increase of only r SDs in y, on the average. Plotting these
regression estimates gives the regression line of y on x.
• The graph of averages is often close to a straight line, but
may be a little bumpy. The regression line smooths out the
bumps. If the graph of averages is a straight line, then it
coincides with the regression line.
• The regression line can be used to make predictions of one
variable from another. But if you have to extrapolate far from
the data, or to a different group of subjects, be careful.
• Beware of nonlinear structure and outliers. They can fool
the regression method.
• Remember the regression effect and the regression fallacy.
• You can draw two different regression lines on a scatter
diagram: one predicts y from x; the other predicts x from y.