Lesson 3 - Regression

Objective
Find the line of regression.
Use the line of regression to make predictions.
Relevance
To be able to find a model to best represent quantitative data with 2 variables and use it to make predictions.
2-Variable Statistics
A better alternative to “storing” numbers!
2-Variable Statistics
Now that we have used one-variable statistics to “store” our necessary numbers, let’s learn another way that’s even better.
Find the mean and standard deviation of the x’s and y’s using 2-var stats.
x     y
21    6
18    9
30    3
35    4
Use this when
using your lists
to find r.
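As a cross-check on the calculator, the same summary statistics can be computed in a few lines of Python. This is a minimal sketch, assuming the table pairs up as (21, 6), (18, 9), (30, 3), (35, 4); the variable names are illustrative and are not part of the lesson.

```python
from statistics import mean, stdev

# Data from the table above (assumed pairing: x = 21, 18, 30, 35; y = 6, 9, 3, 4).
x = [21, 18, 30, 35]
y = [6, 9, 3, 4]

# stdev() is the sample standard deviation (it divides by n - 1).
x_bar, s_x = mean(x), stdev(x)
y_bar, s_y = mean(y), stdev(y)

print(f"x: mean = {x_bar:.3f}, s = {s_x:.3f}")
print(f"y: mean = {y_bar:.3f}, s = {s_y:.3f}")
```

The sample standard deviation is used here because the correlation formula in this lesson divides by n − 1.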
Find the correlation coefficient:
x     y
4     6
8     15
15    22
19    18
22    27
Sum of the $z_x z_y$ products: 3.599887921
$r = \frac{3.599887921}{4} = 0.900$
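The z-score method for r can be sketched in Python as follows. This is only an illustration under the lesson's definition (sum of the z-score products divided by n − 1); the function name `correlation` is a hypothetical helper, not something from the lesson.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Return r as the sum of z-score products divided by n - 1."""
    x_bar, s_x = mean(xs), stdev(xs)
    y_bar, s_y = mean(ys), stdev(ys)
    # One product z_x * z_y per data point.
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (len(xs) - 1)

# First correlation example: (4, 6), (8, 15), (15, 22), (19, 18), (22, 27).
x = [4, 8, 15, 19, 22]
y = [6, 15, 22, 18, 27]
r = correlation(x, y)
print(f"r = {r:.3f}")
```

The same function can be reused on the other correlation examples by swapping in their x and y lists.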
Find the correlation coefficient:
x     y
32    27
40    82
30    34
18    14
15    1
25    22
Sum of the $z_x z_y$ products: 4.558674571
$r = \frac{4.558674571}{5} = 0.912$
Find the correlation coefficient:
x     y
2     72
8     60
10    64
14    52
28    43
32    40
18    32
Sum of the $z_x z_y$ products: −4.868894211
$r = \frac{-4.868894211}{6} = -0.811$
A student wonders if tall women tend to date taller men than do
short women. She measures herself, her dormitory roommate,
and the women in the adjoining rooms. Then she measures the
next man each woman dates. Draw & discuss the scatterplot
and calculate the correlation coefficient.
Women (x)   Men (y)
66          72
64          68
66          70
65          68
70          71
65          65
Women (x)   Men (y)   z_x       z_y       z_x z_y
66          72        0         1.1859    0
64          68        -0.9535   -0.3953   0.3769
66          70        0         0.3953    0
65          68        -0.4767   -0.3953   0.1884
70          71        1.9069    0.7906    1.5076
65          65        -0.4767   -1.581    0.7538
Sum of the $z_x z_y$ products: 2.826668855
$r = \frac{2.826668855}{5} = 0.565$
Linear Regression
Guess the correlation coefficient
• http://istics.net/stat/Correlations/
Can we make a line of best fit?
Want:
1) The distances from the points to the line to be the same.
2) The distances to be as small as possible.
Regression Line
• When a scatterplot shows a linear relationship, we’d like to summarize the overall pattern by drawing a line on the scatterplot.
• A regression line summarizes the relationship between two variables, but only in a specific setting: when one of the variables helps explain or predict the other.
• Regression, unlike scatterplots, REQUIRES that we have an explanatory variable and a response variable.
Regression Line
• This is a line that describes how a response variable (y) changes as an explanatory variable (x) changes.
• It’s used to predict the value of (y) for a given value of (x).
• The regression line is a model for the data.
Let’s try some!
• http://illuminations.nctm.org/ActivityDetail.aspx?ID=146
Regression Line
• When given the response variable (y) and the explanatory variable (x), the regression line relating y to x has an equation of the following form:
$\hat{y} = b_0 + b_1 x$
• Predicted value ($\hat{y}$, read “y hat”): the predicted value of y for a given value of x.
• y-intercept ($b_0$): the predicted value of y when x is 0.
• Slope ($b_1$): the amount by which y is predicted to change when x increases by 1 unit.
The following data shows the number of miles driven and advertised price for 11 used Honda CR-Vs from the 2002-2006 model years (prices found at www.carmax.com). The scatterplot below shows a strong, negative linear association between number of miles and advertised cost. The correlation is -0.874. The line on the plot is the regression line for predicting advertised price based on number of miles.
Thousand miles driven   Cost (dollars)
22                      17998
29                      16450
35                      14998
39                      13998
45                      14599
49                      14988
55                      13599
56                      14599
69                      11998
70                      14450
86                      10998
Use the regression line to answer the following.
$\widehat{\text{cost}} = 18773 - 86.18\,(\text{thousands of miles})$
Slope
• The predicted price of the car decreases by $86.18 for every additional thousand miles driven.
y-intercept
• The predicted cost of a used 2002-2006 Honda CR-V with 0 miles is $18,773.
Predict the price for a Honda with 50,000 miles. (Use 50 in the equation!)
$\widehat{\text{price}} = 18773 - 86.18(50)$
$\widehat{\text{price}} = \$14{,}464$
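The prediction step can be written as a tiny function. This is a sketch that hard-codes the intercept and slope quoted on the slide; the name `predicted_cost` is hypothetical, and x is measured in thousands of miles.

```python
def predicted_cost(thousand_miles):
    """Predicted advertised price in dollars, using the slide's regression line."""
    return 18773 - 86.18 * thousand_miles

# 50,000 miles is entered as 50 because x is in thousands of miles.
price = predicted_cost(50)
print(f"Predicted price: ${price:,.0f}")
```

Calling `predicted_cost(0)` returns the y-intercept, and increasing the input by 1 lowers the prediction by 86.18, which matches the slope interpretation.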
Extrapolation
• This refers to using a regression line for prediction far outside the interval of values of the explanatory variable x used to obtain the line.
• Such predictions are usually not very accurate.
Should we predict the asking price for a used 2002-2006 Honda CR-V with 250,000 miles?
No! We only have data for cars with between 22,000 and 86,000 miles. We don’t know if the linear pattern will continue beyond these values. In fact, if we did predict the asking price for a car with 250 thousand miles, it would be −$2772!
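One way to guard against extrapolation is to compare the input with the range of x-values in the data before trusting the prediction. The sketch below is an illustration only: the function name is hypothetical, and it flags any value outside the observed range, which is stricter than "far outside."

```python
# Mileage values (in thousands) from the CR-V table; they define the data range.
MILES = [22, 29, 35, 39, 45, 49, 55, 56, 69, 70, 86]

def predict_cost(thousand_miles):
    """Return (predicted cost in dollars, True if x is outside the observed range)."""
    cost = 18773 - 86.18 * thousand_miles
    outside = not (min(MILES) <= thousand_miles <= max(MILES))
    return cost, outside

for miles in (50, 250):
    cost, outside = predict_cost(miles)
    label = "extrapolation" if outside else "within data range"
    print(f"{miles} thousand miles: {cost:,.0f} dollars ({label})")
```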
• Slope: slope = 40. The rat’s predicted weight increases by 40 grams per week.
• y-intercept: y-intercept = 100. The rat’s predicted weight at birth is 100 grams.
• Predict weight after 16 weeks: $\hat{y} = 100 + 40(16) = 740$ grams
• Predict weight at 2 years: 2 years = 104 weeks; $\hat{y} = 100 + 40(104) = 4260$ grams (about 9.4 pounds)
This is unreasonable and is a result of extrapolation.
Residual
A residual is the difference between an observed value of the response variable and the value predicted by the regression line.
residual = observed y − predicted y
residual = $y - \hat{y}$
Example
• The equation of the least-squares regression line for the sprint time and long-jump distance data is
$\widehat{\text{distance}} = 304.56 - 27.63\,(\text{sprint time})$
• Find and interpret the residual for the student who had a sprint time of 8.09 seconds.
$\widehat{\text{distance}} = 304.56 - 27.63(8.09) = 81.03$ inches
$\text{residual} = y - \hat{y}$
$\text{residual} = 151 - 81.03$
$\text{residual} = 69.97$
This student jumped 69.97 inches farther than we predicted based on his sprint time.
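The residual calculation can be checked with a short script. This is a sketch using the coefficients and the observed jump of 151 inches from the example; the function and variable names are illustrative.

```python
def predicted_distance(sprint_time):
    """Predicted long-jump distance (inches) from sprint time (seconds)."""
    return 304.56 - 27.63 * sprint_time

observed = 151                        # observed jump, in inches
predicted = predicted_distance(8.09)  # value on the regression line
residual = observed - predicted       # residual = observed y - predicted y

print(f"predicted = {predicted:.2f} in, residual = {residual:.2f} in")
```

A positive residual means the point lies above the line; a negative one would mean the student jumped less than predicted.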
Regression
• Let’s see how a regression line is calculated.
Fat vs Calories in Burgers
Fat (g)   Calories
19        410
31        580
34        590
35        570
39        640
39        680
43        660
Let’s standardize the variables
Fat   Cal   z_x      z_y
19    410   -1.959   -2
31    580   -0.42    -0.1
34    590   -0.036   0
35    570   0.09     -0.2
39    640   0.6      0.56
39    680   0.6      1
43    660   1.12     0.78
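The line must contain the point $(\bar{x}, \bar{y})$ and pass through the origin. (On the standardized axes, the point $(\bar{x}, \bar{y})$ is the origin, because the z-score of a mean is 0.)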
Let’s clarify a little. (Just watch & listen)
The equation for a line that passes through the origin can be written with just a slope and no intercept: $y = mx$
But we’re using z-scores, so our equation should reflect this, and thus it’s
$\hat{z}_y = m z_x$
Many lines with different slopes pass through the origin. Which one fits our data the best? That is, which slope determines the line that minimizes the sum of the squared residuals?
Line of Best Fit: Least-Squares Regression Line
It’s the line for which the sum of the squared residuals is smallest.
We want to find the mean squared residual.
$\text{Residual} = \text{Observed} - \text{Predicted}$
Focus on the vertical deviations from the line.
Let’s find it. (just watch & soak it in)
Note: MSR is “Mean Squared Residual”
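$$\text{MSR} = \frac{\sum (z_y - \hat{z}_y)^2}{n-1}$$
$$\text{MSR} = \frac{\sum (z_y - m z_x)^2}{n-1} \quad \text{since } \hat{z}_y = m z_x$$
$$\text{MSR} = \frac{\sum \left(z_y^2 - 2m z_x z_y + m^2 z_x^2\right)}{n-1}$$
$$\text{MSR} = \frac{\sum z_y^2}{n-1} - 2m\,\frac{\sum z_x z_y}{n-1} + m^2\,\frac{\sum z_x^2}{n-1}$$
$$\text{MSR} = 1 - 2mr + m^2$$
The middle fraction, $\frac{\sum z_x z_y}{n-1}$, is r!
The standard deviation of z-scores is 1, so the variance is 1 also. That makes the first and last fractions equal to 1.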
Continue...
$\text{MSR} = 1 - 2rm + m^2$
Since this is a parabola, it reaches its minimum at $x = -\frac{b}{2a}$ (here the variable is m, with a = 1 and b = −2r).
This gives us
$m = \frac{-(-2r)}{2(1)} = r$
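The result of this derivation can be checked numerically on the burger data. The sketch below is an illustration, not part of the lesson: it standardizes both variables, computes the mean squared residual for many candidate slopes, and picks the smallest. The helper names `z_scores` and `msr` and the 0.01 grid step are assumptions made for this example.

```python
from statistics import mean, stdev

fat = [19, 31, 34, 35, 39, 39, 43]
cal = [410, 580, 590, 570, 640, 680, 660]

def z_scores(values):
    """Standardize a list using its mean and sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

zx, zy = z_scores(fat), z_scores(cal)
n = len(fat)
r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)

def msr(m):
    """Mean squared residual of the line z_y-hat = m * z_x."""
    return sum((b - m * a) ** 2 for a, b in zip(zx, zy)) / (n - 1)

# Try slopes from -2.00 to 2.00 in steps of 0.01 and keep the best one.
candidates = [k / 100 for k in range(-200, 201)]
best_m = min(candidates, key=msr)

print(f"r = {r:.4f}, best slope on the grid = {best_m:.2f}")
print(f"msr(0.5) = {msr(0.5):.6f}, formula = {1 - 2 * 0.5 * r + 0.5 ** 2:.6f}")
```

The grid search lands on the candidate closest to r, and the directly computed MSR agrees with 1 − 2mr + m² for any slope tried.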
Hence the slope of the best-fit line for z-scores is the correlation coefficient, r.
Slope: rise over run
A slope of r for z-scores means that for every increase of 1 in $z_x$ (one standard deviation in x), the predicted $z_y$ increases by r (r standard deviations in y): “over 1 and up r.”
Translate back to x and y values: “over one standard deviation in x, up r standard deviations in y.”
Slope of the regression line is:
$b_1 = \frac{r s_y}{s_x}$
Why is correlation “r”?
• Because it was calculated from the regression of y on x after standardizing the variables, just as we have done, r was used to stand for (standardized) regression.
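Let’s Write the Equation
From algebra: $y = mx + b$
Regression form: $\hat{y} = b_0 + b_1 x$, where $b_0$ is the y-intercept and $b_1$ is the slope.
Slope:
$b_1 = \frac{r s_y}{s_x} = \frac{0.961\,(89.815)}{7.804} = 11.056$
(The value 11.056 comes from carrying unrounded values; the rounded figures shown give about 11.06.)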
Explain the slope:
$\hat{y} = b_0 + b_1 x$
The predicted calories increase by 11.056 for every additional gram of fat.
Now for the final part: the equation!
y-intercept: Remember, the line has to pass through the point $(\bar{x}, \bar{y})$.
$\bar{y} = b_0 + b_1 \bar{x}$
Solve for the y-intercept:
$b_0 = \bar{y} - b_1 \bar{x}$
Find the value of the y-intercept:
$b_0 = \bar{y} - b_1 \bar{x} = 210.954$
Put the parts together to form the equation of the regression line. Now it can be used to predict.
$\hat{y} = 210.954 + 11.056x$
or even better
$\widehat{\text{calories}} = 210.954 + 11.056\,(\text{fat grams})$
How many calories do I expect to find in a hamburger that has 25 grams of fat?
$\widehat{\text{calories}} = b_0 + b_1\,(\text{fat grams})$
$\widehat{\text{calories}} = 210.954 + 11.056(25)$
$\widehat{\text{calories}} = 487.354$
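The whole burger calculation, from the raw data to the prediction, can be sketched as follows. The code applies the lesson's two formulas, b1 = r·s_y/s_x and b0 = ȳ − b1·x̄; the variable names are illustrative. Because it carries unrounded values, the final prediction differs from the slide's 487.354 by about 0.01.

```python
from statistics import mean, stdev

fat = [19, 31, 34, 35, 39, 39, 43]
cal = [410, 580, 590, 570, 640, 680, 660]
n = len(fat)

x_bar, y_bar = mean(fat), mean(cal)
s_x, s_y = stdev(fat), stdev(cal)

# Correlation: average product of z-scores, dividing by n - 1.
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(fat, cal)) / (n - 1)

b1 = r * s_y / s_x        # slope
b0 = y_bar - b1 * x_bar   # intercept: the line passes through (x_bar, y_bar)

prediction = b0 + b1 * 25  # predicted calories for 25 g of fat
print(f"calories-hat = {b0:.3f} + {b1:.3f} * fat")
print(f"prediction at 25 g: {prediction:.3f}")
```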
Try another problem
Mean call-to-shock time (min)   Survival rate
2                               90
6                               45
7                               30
9                               5
12                              2
$r = \frac{-3.84030233}{4} = -0.960$
$b_1 = \frac{r s_y}{s_x} = -9.2956$
$b_0 = \bar{y} - b_1 \bar{x} = 101.3285$
$\hat{y} = 101.3285 - 9.2956x$
$\widehat{\text{survival rate}} = 101.3285 - 9.2956\,(\text{mean call-to-shock time})$
Interpret the slope:
The predicted survival rate decreases by 9.2956 for every additional minute of call-to-shock time.
Interpret the y-intercept:
The predicted survival rate is 101.3285 when the call-to-shock time is 0 minutes.
Predict the survival rate for a 10-minute call-to-shock time:
$\widehat{\text{survival rate}} = 101.3285 - 9.2956(10) = 8.3725$
Predict the survival rate for a 20-minute call-to-shock time:
$\widehat{\text{survival rate}} = 101.3285 - 9.2956(20) = -84.5835$
Extrapolation
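The call-to-shock fit can be reproduced with a small least-squares helper. This sketch computes the slope as Sxy/Sxx, which is algebraically the same as r·s_y/s_x; the function name `least_squares` is hypothetical. It carries unrounded coefficients, so the predictions differ from the slide in the fourth decimal place.

```python
from statistics import mean

time = [2, 6, 7, 9, 12]     # mean call-to-shock time, minutes
rate = [90, 45, 30, 5, 2]   # survival rate

def least_squares(xs, ys):
    """Return (b0, b1) for the line y-hat = b0 + b1 * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx            # same value as r * s_y / s_x
    b0 = y_bar - b1 * x_bar   # line passes through (x_bar, y_bar)
    return b0, b1

b0, b1 = least_squares(time, rate)
pred_10 = b0 + b1 * 10   # 10 minutes is inside the data range (2 to 12)
pred_20 = b0 + b1 * 20   # 20 minutes is outside the data range

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
print(f"10 min: {pred_10:.4f}   20 min: {pred_20:.4f}")
```

The negative value at 20 minutes is the signal that the line is being used outside the range where it was fitted.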
Try another problem
SAT Math   SAT Verbal
600        650
720        800
540        600
450        500
620        620
$r = \frac{3.853}{4} = 0.963$
$b_1 = 1.05$
$b_0 = 20.7$
$\widehat{\text{SAT Verbal}} = 20.7 + 1.05\,(\text{SAT Math})$
Interpret the slope:
The predicted verbal score increases by 1.05 points for every additional point in math.
Interpret the y-intercept:
The predicted verbal score is 20.7 when the math score is 0.
Extrapolated!
Predict the verbal score for a math score of 400 (using unrounded coefficients):
$\widehat{\text{verbal score}} = 20.7 + 1.05(400) = 439.33$
Predict the verbal score for a math score of 500 (using unrounded coefficients):
$\widehat{\text{verbal score}} = 20.7 + 1.05(500) = 543.99$
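The predictions 439.33 and 543.99 do not follow from the rounded coefficients 20.7 and 1.05 (which would give 440.7 for a math score of 400); they match what the unrounded least-squares coefficients produce. The sketch below, with illustrative variable names, shows the difference.

```python
from statistics import mean

math_scores = [600, 720, 540, 450, 620]
verbal_scores = [650, 800, 600, 500, 620]

x_bar, y_bar = mean(math_scores), mean(verbal_scores)
sxy = sum((x - x_bar) * (y - y_bar)
          for x, y in zip(math_scores, verbal_scores))
sxx = sum((x - x_bar) ** 2 for x in math_scores)

b1 = sxy / sxx            # unrounded slope, about 1.0466
b0 = y_bar - b1 * x_bar   # unrounded intercept, about 20.70

unrounded_400 = b0 + b1 * 400
unrounded_500 = b0 + b1 * 500
rounded_400 = 20.7 + 1.05 * 400   # using the rounded coefficients instead

print(f"unrounded: {unrounded_400:.2f}, {unrounded_500:.2f}")
print(f"rounded coefficients at 400: {rounded_400:.1f}")
```

Rounding the slope from about 1.0466 to 1.05 looks harmless, but multiplying by 400 turns that small difference into more than a point.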
That’s all, folks!
• Homework: p. 191 (27-32, 35, 37, 39, 41, 47)