REGRESSION

AMS 5
Regression
The idea behind the calculation of the coefficient of correlation is
that the scatter plot of the data corresponds to a cloud that
follows a straight line. This idea can be formalized by regression
methods.
In this class we will:
• Consider the definition of simple linear regression
• Find a method to predict an individual value
• Use the normal curve to estimate percentile ranks
• Describe the regression effect
• Compute the regression errors and their RMS
• Study the behavior of regression errors
Regression
The regression method describes how one variable depends on
another.
The Northern California temperature data have an average
altitude of 3,524 feet with an SD of 1,839 feet, and an average
temperature of 70.3 degrees with an SD of 6.5 degrees.
The correlation between temperature and altitude is -0.76.
Regression
The cloud of points shows a mild negative association between
the two variables, as does the value of r. Can we use the values
of altitude to estimate the average values of temperature?
Regression
How does the regression line work?
Associated with an increase of
one SD in x there is an increase
of r SDs in y on average.
Clearly, if the correlation coefficient is negative, then the
average value of y decreases as x increases. In the temperature
and altitude example, an increase in altitude of 1,839 feet (one SD)
is associated with a change of -0.76 × 6.5 ≈ -4.94 degrees in the
average temperature.
Regression
How do we use the method to predict an individual value?
If we consider two variables x and y and we want to predict the
value of y for a specific value of x, we use the average value of y
that corresponds to the value of x according to the regression
method.
Example: The first year GPAs and the Math SAT for the students
of a university produce the following data
average SAT score = 550, SD = 80
average 1st-year GPA = 2.6, SD = 0.6
r = 0.40
We want to predict the 1st-year GPA of a student with a SAT
score of 650.
Regression
The student's SAT score in standard units is
$$\frac{650 - 550}{80} = 1.25$$
so the score is 1.25 SDs above average. An increase of one SD
above the average SAT score is associated with an increase of
0.4 × 0.6 = 0.24 GPA points. This implies that our student is
predicted to be
1.25 × 0.4 × 0.6 = 0.3
GPA points above average. Since the average GPA is 2.6, the
predicted GPA is
2.6 + 0.3 = 2.9
This is the average GPA that we expect for students with SAT
scores around 650.
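The steps above (convert x to standard units, multiply by r, convert back to the units of y) can be sketched in a few lines of Python. The function name `regression_predict` and its parameter names are illustrative choices, not part of the original slides.

```python
def regression_predict(x, mean_x, sd_x, mean_y, sd_y, r):
    """Predict the average y for a given x using the regression method."""
    z_x = (x - mean_x) / sd_x      # x in standard units
    z_y = r * z_x                  # predicted y in standard units
    return mean_y + z_y * sd_y     # back to the original units of y


# SAT / GPA example from the text
predicted_gpa = regression_predict(650, mean_x=550, sd_x=80,
                                   mean_y=2.6, sd_y=0.6, r=0.40)
print(round(predicted_gpa, 2))  # 2.9
```

The same function applies to any pair of variables once the two averages, the two SDs and r are known.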
Regression
WARNING: You can use the regression method on new subjects
provided that they are similar to the ones that were used to
produce the averages, SDs and r used in the regression method.
In the previous example the method will not be valid for students
of a different institution.
Regression
We can use the regression method and the normal curve to
produce estimates of the percentile ranks.
Example: In the previous example suppose a student has a
percentile rank of 90% for the SAT scores. That is, only 10% of
the scores are higher than his. What is the predicted percentile
rank for the 1st-year GPA of this student?
Using the normal curve, a percentile rank of 90% corresponds to
a z score of about 1.3. This means that the student's SAT
score is 1.3 SDs above average. This corresponds to being
0.4 × 1.3 ≈ 0.5 SDs above the average GPA
and this corresponds to an accumulated probability, under the
normal curve, of approximately 69%.
Regression
So the percentile rank on 1st-year GPA of a student with a
percentile rank on SAT score of 90% is predicted to be 69%.
Notice that the student with a SAT percentile rank of 90% was
"pulled down" to only 69% by the regression method. Why is
that?
Suppose the correlation were perfect, r = 1; then 90% would convert
to 90%. The other extreme is that there is no correlation, so, in
the absence of any information, the best guess is the median, or
50th percentile. The regression method produces a rank that is
somewhere between these two extremes.
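The percentile calculation can be sketched with `statistics.NormalDist` from the Python standard library. The helper name `predicted_percentile` is an illustrative choice, and the sketch assumes both variables follow the normal curve, as the text does.

```python
from statistics import NormalDist


def predicted_percentile(percentile_x, r):
    """Predicted percentile rank of y given the percentile rank of x."""
    std_normal = NormalDist()                 # mean 0, SD 1
    z_x = std_normal.inv_cdf(percentile_x)    # percentile rank -> standard units
    z_y = r * z_x                             # regression method in standard units
    return std_normal.cdf(z_y)                # standard units -> percentile rank


rank = predicted_percentile(0.90, r=0.40)
print(f"{rank:.1%}")
```

Without the intermediate rounding used in the text (1.3 and 0.5), the result is close to 70% rather than 69%. Setting r = 1 returns the original 90%, and r = 0 returns 50%, matching the two extremes described above.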
Example
The shoe size and the heights of 14 men are recorded. The shoe
size average is 10.46 with a SD of 1.21. The average height is
70.45 inches with a SD of 2.45 inches. The correlation is 0.93.
What is the average height of a man who wears shoes of size 11.5?
We convert 11.5 to standard units
$$\frac{11.5 - 10.46}{1.21} \approx 0.859$$
so the shoe size is 0.859 SDs above average. This means that
the height will be 0.859 × 0.93 × 2.45 ≈ 1.96 inches above
average. So the average height of a man with shoe size
11.5 will be 70.45 + 1.96 = 72.41 inches.
Regression effect
Galton, a British statistician, studied the relationship between the
heights of fathers and sons in 1,078 families. He noticed
that tall fathers tended to have sons somewhat shorter than themselves,
and short fathers tended to have sons somewhat taller than themselves.
He termed this fact regression to
mediocrity. This is where the term regression comes from.
Example: Children are tested for IQ before and after taking a
preschool program. In both cases the scores average 100 and the
SD is 15. So, on average, there seems to be no effect.
Nevertheless, children below average in the first test had an
average gain of 5 IQ points and those above average had an average
loss of 5 IQ points. This is the regression effect.
Regression effect
A model for the test-retest situation is
observed test score = true score + chance error
Suppose that the chance error can be either positive or negative.
Suppose that the true scores in the population follow the normal
curve with an average of 100 and a SD of 15. Consider the
children who scored 140 on the first test. There are two
possibilities:
• true score below 140, with a positive chance error
• true score above 140, with a negative chance error
Which one is more likely?
According to the normal curve, the first possibility is more likely,
since the mean is 100 and so the interval above 140 has less
probability than the one below 140. Under this scenario, the
second test is more likely to produce a value below 140.
Regression effect
A symmetric situation is valid for those scoring, say, 80. It is
likely that the true score is above 80 with a negative chance error,
and so the second score is likely to be above the first.
In other words, if a student scores above average in the first
test, it is likely that the true score is lower than the observed one.
If the student takes the test again, chances are that the second
score will be lower than the first. A symmetric situation is true for
a person scoring below average in the first test.
This explains the regression effect.
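A small simulation of the test-retest model can make the argument concrete. This is only a sketch: the size of the chance error (`ERROR_SD = 5`), the cut-offs 130 and 70, and the normal shape of the chance error are assumptions made for illustration and do not come from the slides. With these choices the observed scores have an SD slightly above 15.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

N = 100_000
TRUE_MEAN, TRUE_SD = 100, 15   # true scores, as in the text
ERROR_SD = 5                   # assumed size of the chance error


def average(values):
    return sum(values) / len(values)


pairs = []
for _ in range(N):
    true_score = random.gauss(TRUE_MEAN, TRUE_SD)
    first = true_score + random.gauss(0, ERROR_SD)    # observed = true + chance error
    second = true_score + random.gauss(0, ERROR_SD)   # new chance error on the retest
    pairs.append((first, second))

high = [(f, s) for f, s in pairs if f >= 130]   # well above average on the first test
low = [(f, s) for f, s in pairs if f <= 70]     # well below average on the first test

high_first, high_second = average([f for f, _ in high]), average([s for _, s in high])
low_first, low_second = average([f for f, _ in low]), average([s for _, s in low])

print(f"high scorers: first {high_first:.1f}, second {high_second:.1f}")
print(f"low scorers:  first {low_first:.1f}, second {low_second:.1f}")
```

No program or treatment is involved in the simulation, yet the high group drops on the retest and the low group rises, which is exactly the regression effect.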
Regression errors
The regression method can be used to predict y from x. But actual
values differ from predictions. These are the regression errors.
error = actual value of y - predicted value of y
Some of the errors defined in this way are positive and some are
negative, reflecting the fact that some observations are above
and some are below the regression line.
How do we measure the error in a regression?
The overall size of the error is measured using the root-mean-square (RMS), as we did to obtain the SD. This is equal to
$$\text{RMS error} = \sqrt{\frac{(\text{error}_1)^2 + (\text{error}_2)^2 + \cdots + (\text{error}_N)^2}{N}}$$
where N is the number of points in the scatter diagram.
Regression errors
What if we ignore the values of x?
Then our prediction for y is the average of y. In this case the
RMS error coincides with the SD of y.
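The RMS error, and the special case where x is ignored, can be sketched as follows. The numbers in `actual` and `predicted` are hypothetical values invented for the illustration.

```python
from math import sqrt
from statistics import pstdev


def rms_error(actual, predicted):
    """Root-mean-square of the errors (actual minus predicted)."""
    errors = [a - p for a, p in zip(actual, predicted)]
    return sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical actual values of y and predictions from some line
actual = [70.0, 64.0, 75.0, 68.0]
predicted = [67.0, 68.0, 75.0, 68.0]
print(rms_error(actual, predicted))  # errors are 3, -4, 0, 0 -> 2.5

# Ignoring x: predict the average of y for every point
mean_y = sum(actual) / len(actual)
rms_ignoring_x = rms_error(actual, [mean_y] * len(actual))
print(rms_ignoring_x, pstdev(actual))  # the two values agree
```

`pstdev` divides by N, which matches the definition of the SD used throughout these slides.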
Computing the RMS error
We saw that the error that corresponds to a prediction where the
values of x are ignored corresponds to the SD of y. The overall
size of the error for a regression using x has to be smaller than
the SD. How much smaller?
$$\text{RMS error} = \sqrt{1 - r^2} \times \text{SD of } y$$
We observe the following features
• The units of the RMS error are the same as the units of the
variable being predicted.
• Perfect correlation corresponds to zero RMS error.
• Zero correlation corresponds to maximum RMS error (equal to
SD of y).
Computing the RMS error
Example 1: In the California temperature example the SD of y is
6.5 degrees and the correlation is -0.76, so
$$\sqrt{1 - 0.76^2} \times 6.5 \text{ degrees} \approx 4.22 \text{ degrees}$$
So, in this case, knowing the altitude reduces the SD from 6.5 to
4.22 degrees.
Example 2: In the shoe size example the SD of y is
2.45 inches and the correlation is 0.93, so
$$\sqrt{1 - 0.93^2} \times 2.45 \text{ inches} \approx 0.90 \text{ inches}$$
So we observe that knowing the shoe size produces a dramatic
reduction of the SD from 2.45 to 0.90 inches.
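Both examples can be checked with a one-line function. The name `regression_rms_error` is an illustrative choice.

```python
from math import sqrt


def regression_rms_error(r, sd_y):
    """RMS error of the regression line for predicting y from x."""
    return sqrt(1 - r ** 2) * sd_y


print(round(regression_rms_error(-0.76, 6.5), 2))   # temperature example: 4.22
print(round(regression_rms_error(0.93, 2.45), 2))   # shoe size example: 0.9
```

Because r enters only through its square, the sign of the correlation does not affect the RMS error.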
Plotting the residuals
Prediction errors are usually called residuals. It is important to
explore the graphical properties of residuals to find out about the
goodness of the fit by the regression line.
In a residual plot the x coordinates are the same as for the
original data. The y coordinates correspond to the values of the
residuals.
So there is one point for each point in the original scatter
diagram.
Plotting the residuals
Thus, if everything is OK with the regression line, we expect to
see a cloud of points around the zero line in the y axis.
• We expect to see no trends or clusters in the residuals
• There should be about the same number of positive as
negative residuals
• A histogram of the residuals should look symmetric around
zero
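The residuals behind such a plot can be computed as sketched below. The data are hypothetical values invented for the illustration, and the regression line is obtained from the averages, SDs and r as in the rest of these slides. Plotting is left out so that the sketch needs only the standard library.

```python
from math import sqrt

# Hypothetical data, invented only to illustrate the computation
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sd_x = sqrt(sum((v - mean_x) ** 2 for v in x) / n)
sd_y = sqrt(sum((v - mean_y) ** 2 for v in y) / n)
r = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n * sd_x * sd_y)

# Regression line for predicting y from x
slope = r * sd_y / sd_x
intercept = mean_y - slope * mean_x

# One residual per original point; the x coordinates are unchanged
residuals = [b - (slope * a + intercept) for a, b in zip(x, y)]
residual_points = list(zip(x, residuals))

positives = sum(e > 0 for e in residuals)
negatives = sum(e < 0 for e in residuals)
print("sum of residuals:", sum(residuals))      # zero up to rounding error
print("positive:", positives, "negative:", negatives)
```

The pairs in `residual_points` are what a residual plot displays; any plotting library can draw them against a horizontal line at zero.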
Problem
The following results are taken from a study of about 1,000
families:
average height of husband 68 inches, SD ≈ 2.7 inches
average height of wife 63 inches, SD ≈ 2.5, r ≈ 0.25
Predict the height of a wife when the height of her husband is
1. 72 inches
The husband is 4 inches above average height. This is 4/2.7 ≈
1.5 SDs above the average. So the wife is predicted to be
r × 1.5 = 0.25 × 1.5 ≈ 0.4 SDs above average, which corresponds
to 0.4 × 2.5 = 1 inch. Her predicted height is 63 + 1 = 64 inches.
2. 68 inches
This husband is right at the average, so the wife is predicted to
be right at the average as well, that is, 63 inches.
Prediction for data in a vertical strip
Example: A law school finds the following relationship between
LSAT scores and first-year scores
average LSAT score = 162, SD = 6
average first-year score = 68, SD = 10, r=0.60
Q: About what percentage of the students had first-year scores
over 75?
A: We use the normal curve approximation. Converting to
standard units
$$\frac{75 - 68}{10} = 0.7$$
this corresponds to a right-hand tail of about 24% under the normal
curve.
Prediction for data in a vertical strip
Q: Of the students who scored 165 on the LSAT, about what
percentage had first-year scores over 75?
A: We first convert to standard units for the x variable:
$$\frac{165 - 162}{6} = 0.5$$
then convert to standard units for the y variable:
r × 0.5 = 0.6 × 0.5 = 0.3, which corresponds to 0.3 × 10 = 3 points
above average, or 68 + 3 = 71.
Since the data corresponding to a strip are a smaller and more
homogeneous sample, the corresponding SD will be smaller.
How much smaller?
Prediction for data in a vertical strip
We expect the dispersion in the y variable to be about the same
for each vertical strip. This is given by the RMS error, thus the
new SD is
$$\sqrt{1 - r^2} \times \text{SD of } y = \sqrt{1 - 0.6^2} \times 10 = 8 \text{ points}$$
This new SD can be used to convert to standard units
$$\frac{75 - 71}{8} = 0.5$$
and, using the normal curve, we obtain an area of 31% above
0.5. This is the percentage of students scoring more than 75 in
the first year among those who scored 165 on the LSAT. Notice
that this percentage is higher than the 24% we obtained before
for all students.
This is because we have focused on a smaller portion of the sample,
one with a higher average (71 instead of 68) and a smaller SD.
Prediction for data in a vertical strip
In summary, when considering data for a vertical strip:
• Convert to standard units in the x variable.
• Obtain the predicted value of the y variable.
• Calculate the SD for the y variable in the strip using RMS
error.
• Convert to standard units in the y variable and use the
normal curve.
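The four steps in the summary can be sketched as one function. The name `share_above_in_strip` and its parameters are illustrative choices; the sketch assumes the normal approximation holds inside the strip, as the text does.

```python
from math import sqrt
from statistics import NormalDist


def share_above_in_strip(threshold, x, mean_x, sd_x, mean_y, sd_y, r):
    """Share of y values above `threshold` among points with the given x."""
    z_x = (x - mean_x) / sd_x                    # 1. x in standard units
    strip_mean = mean_y + r * z_x * sd_y         # 2. predicted y for the strip
    strip_sd = sqrt(1 - r ** 2) * sd_y           # 3. SD in the strip = RMS error
    z_y = (threshold - strip_mean) / strip_sd    # 4. threshold in standard units
    return 1 - NormalDist().cdf(z_y)             #    right-hand tail


# LSAT example from the text: students who scored 165 on the LSAT
share = share_above_in_strip(75, x=165, mean_x=162, sd_x=6,
                             mean_y=68, sd_y=10, r=0.60)
print(f"{share:.0%}")  # 31%
```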
Slope and intercept
All lines can be determined by a slope and an intercept.
The intercept is the height of the line when x = 0.
The slope is the rate at which y increases, per unit increase in x.
If the slope is negative then y decreases as x increases.
Slope and intercept
How do you get the slope of a regression line?
Example: A sample of 555 California men age 25-29 in 1993 was
surveyed to find out about education and income. The data are
summarized by
average education ≈ 12.5 years; SD ≈ 4 years
average income ≈ $21,500; SD ≈ $16,000; r ≈ 0.35
This means that, for every increase of one SD in education, there
is an increase of r SD in income.
Thus, 4 extra years of education are worth an extra
0.35 × $16,000 = $5,600 of income. So, each extra year is worth
$$\frac{0.35 \times \$16{,}000}{4} = \$1{,}400$$
This is the slope of the regression line.
Slope and intercept
The intercept of the regression line is the value of y
when x = 0. This is 12.5 years below average in education. Since
each year is worth $1,400, a man with no education should have an
income that is below average by 12.5 years × $1,400 per year
= $17,500. Since the average income is $21,500, the predicted income
of a man with no education is $21,500 − $17,500 = $4,000. This is the
intercept of the regression line.
The slope corresponds to the change in y associated with a one-unit
increase in x.
Slope and intercept
The intercept is given by
intercept = average of y − slope × average of x
The equation for the regression line is called the regression
equation and can be written as
y = slope × x + intercept
So, for our example, we have that
predicted income = $1,400 per year × education + $4,000
Slope and intercept
Q: What is the predicted income of a man with an education of 15
years?
A: Using the regression equation we have
y = $1,400 × 15 + $4,000 = $25,000
We can plug in any value of education and obtain the expected
income for that level of education.
Warning: It is usually a bad idea to use the regression line for
extrapolations.
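The slope and intercept calculations can be sketched as a small function that works from the five summary statistics. The name `regression_line` is an illustrative choice.

```python
def regression_line(mean_x, sd_x, mean_y, sd_y, r):
    """Return (slope, intercept) of the regression line for predicting y from x."""
    slope = r * sd_y / sd_x                  # r SDs of y per SD of x
    intercept = mean_y - slope * mean_x      # the line passes through the point of averages
    return slope, intercept


# Education / income example from the text
slope, intercept = regression_line(mean_x=12.5, sd_x=4,
                                   mean_y=21_500, sd_y=16_000, r=0.35)
predicted_income = slope * 15 + intercept
print(round(slope), round(intercept), round(predicted_income))  # 1400 4000 25000
```

The line is only trustworthy for values of x within the range of the data, in line with the warning about extrapolation.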
Example
Back to our shoe size example. The shoe size and the heights of
14 men are recorded. The shoe size average is 10.46 with
a SD of 1.21. The average height is 70.45 inches with a SD of
2.45 inches. The correlation is 0.93.
The slope of the regression line is
$$\frac{r \times \text{SD of height}}{\text{SD of shoe size}} = \frac{0.93 \times 2.45}{1.21} \approx 1.88$$
To obtain the intercept we consider a shoe size of zero. This is
10.46 units below average and so will correspond to a height that
is 1.883 × 10.46 ≈ 19.70 inches below average (using the unrounded
slope). So it corresponds
to a height of 70.45 − 19.70 = 50.75 inches. The regression line is
height = 1.88 × shoe size + 50.75 inches
Q: What is the predicted height of a man with a shoe size of 9?
A: Using the regression equation we have
1.88 × 9 + 50.75 inches = 67.67 inches
Least Squares
Consider a cloud of points produced by obtaining the scatter
diagram of observations corresponding to two variables x and y.
There are many lines that we can draw through the cloud. Which
is the straight line that fits the points best? One natural
criterion is to choose the line with the smallest RMS error in
predicting y from x. Among all lines, the regression line is the
one that achieves this.
This is the reason why the regression line is called the least
squares line.
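The least squares property can be checked numerically on a small example. The data below are hypothetical values invented for the illustration, and the comparison lines are arbitrary perturbations of the regression line, so this is a spot check rather than a proof.

```python
from math import sqrt

# Hypothetical data, invented only to illustrate the comparison
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.8, 3.1, 3.9, 5.2, 5.8, 7.1]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sd_x = sqrt(sum((v - mean_x) ** 2 for v in x) / n)
sd_y = sqrt(sum((v - mean_y) ** 2 for v in y) / n)
r = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n * sd_x * sd_y)

# Regression line from the summary statistics
slope = r * sd_y / sd_x
intercept = mean_y - slope * mean_x


def rms_error_of_line(m, b):
    """RMS error when predicting y by the line m * x + b."""
    return sqrt(sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y)) / n)


best = rms_error_of_line(slope, intercept)

# Other lines through the cloud: shift the slope and/or the intercept
others = [rms_error_of_line(slope + ds, intercept + di)
          for ds in (-0.2, 0.0, 0.2)
          for di in (-0.5, 0.0, 0.5)
          if (ds, di) != (0.0, 0.0)]

print(best, min(others))
print(all(best < other for other in others))  # True
```

The RMS error of the regression line also agrees with the shortcut formula, the square root of 1 − r² times the SD of y.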
Least Squares
Example: Let b be the length of a spring with no load. If a load x
is attached to the spring, the stretch is proportional to x. Thus the
length of the spring is y = mx + b, where m and b are constants
that depend on the spring. An experiment is run to determine the
constants for a given spring; the data are shown in the table.
The correlation coefficient is r = 0.999, so the points are very
close to a straight line. But they are not exactly on a straight line.
This is probably due to measurement error.
The regression line for these data produces estimates of b and m,
given, respectively, by the intercept and the slope of the line. The
values are m ≈ 0.5 cm per kg, and b ≈ 439.01 cm. These are the
least squares estimates of m and b.
Problem
Find the regression equation for predicting final score from
midterm score, based on the following information:
average midterm score = 70, SD = 10
average final score = 55, SD = 20 , r = 0.60
The slope of the line can be obtained as
$$\frac{r \times \text{SD of final}}{\text{SD of midterm}} = \frac{0.60 \times 20}{10} = 1.2$$
A score of 0 in the midterm will correspond to a final score that is
1.2 × 70 = 84 units below average. So the intercept is
55 – 84 = -29 units of the final score. Thus, the regression
equation is
final score = 1.2 × midterm score - 29