living with the lab - Louisiana Tech University

living with the lab
linear regression
quality of the fit and automating the analysis in Excel
[Figure: three panels of "heart rate versus exercise time", each plotting heart rate (bpm) against cumulative exercise time (s), labeled "good fit", "better fit" and "best fit"]
“good,” “better” and “best” are not quantitative descriptions of the “quality of the fit”; a numerical measure is needed
© 2011 David Hall and the LWTL faculty team
The Living with the Lab label, the Louisiana Tech Logo, and this copyright notice should not be removed when any part of this work is
used by others. This work may not be used for commercial purposes. Inquiries should be addressed to dhall@latech.edu. This
presentation on linear regression is based partially on class notes created by Dr. Mark Barker at Louisiana Tech University.
DISCLAIMER
The content of this presentation is for informational purposes only and is intended only for students
attending Louisiana Tech University.
The author of this information does not make any claims as to the validity or accuracy of the information
or methods presented.
The procedures demonstrated here are potentially dangerous and could result in injury or damage.
Louisiana Tech University and the State of Louisiana, their officers, employees, agents or volunteers, are
not liable or responsible for any injuries, illness, damage or losses which may result from your using the
materials or ideas, or from your performing the experiments or procedures depicted in this presentation.
If you do not agree, then do not view this content.
Class Problem: Determine the best fit line of “recovery for recycling” versus “year” for 1960,
1970, 1980, 1990, 2000 and 2009.
a. Use Excel to set up a table to manually determine the slope m and the y-intercept b.
b. Plot the six raw data points versus the fit. Use markers only (with no lines) for the raw data
and lines only (no markers) for the fit.
Table ES-3. Generation, materials recovery, composting, combustion with energy recovery, and
discards of municipal solid waste, 1960-2009, in pounds per person per day
http://www.wastexchange.org/upload_publications/MSWintheU.S.2010.pdf
www.epa.gov
$$ m = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2} $$

$$ b = \frac{\sum y_i - m \sum x_i}{n} $$
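The slope and intercept formulas can also be checked outside Excel. The following is a minimal Python sketch, not the spreadsheet solution from the class; the function name `fit_line` and the five data points are illustrative assumptions and are not the EPA recycling values.

```python
def fit_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(xs)
    sum_x = sum(xs)                               # sum of x_i
    sum_y = sum(ys)                               # sum of y_i
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # sum of x_i * y_i
    sum_x2 = sum(x * x for x in xs)               # sum of x_i squared
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b


# Illustrative data only (assumed for this sketch)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

m, b = fit_line(xs, ys)
print(f"m = {m:.4f}, b = {b:.4f}")  # m = 0.6000, b = 2.2000
```

Each intermediate sum corresponds to one column total in the manual Excel table, so the same numbers can be compared column by column.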
solution
the “coefficient of determination,” more commonly referred to as r², will be used
to determine the “goodness of the fit”
coefficient of determination
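• the error at point $x_i$ is $y_i^{\text{fit}} - y_i$, where $y_i^{\text{fit}} = m \cdot x_i + b$ is the value of the best fit line $y^{\text{fit}} = m \cdot x + b$ at $x_i$
• since some errors are negative (fit lies below data point) and some are positive (fit lies above data point), we square the errors: $\text{error}^2 = \left( y_i^{\text{fit}} - y_i \right)^2$
• if we simply reported the $\text{error}^2$ term above, the number would vary in size depending on the problem being solved
• we would like a number that varies between 0 (poor fit) and 1 (perfect fit), so we normalize the error

$$ r^2 = 1 - \frac{\sum \left( y_i^{\text{fit}} - y_i \right)^2}{\sum \left( \bar{y} - y_i \right)^2} $$

where $\bar{y}$ is the average value of $y_i$; this normalizes $r^2$: $0 \le r^2 \le 1$

[Figure: data point $i$ at $(x_i, y_i)$ and the best fit line $y^{\text{fit}} = m \cdot x + b$ on $x$-$y$ axes, with the vertical distance $y_i^{\text{fit}} - y_i$ marked at $x_i$]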
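To make the definition concrete, here is a minimal Python sketch that evaluates r² from the squared errors and the spread of the data about its average. The function name `r_squared` and the data are assumptions for illustration; the values of m and b are the least-squares slope and intercept for this particular data set.

```python
def r_squared(xs, ys, m, b):
    """Coefficient of determination for the line y = m*x + b."""
    y_fit = [m * x + b for x in xs]       # fit value at each x_i
    y_bar = sum(ys) / len(ys)             # average of the y_i
    # sum of squared errors between the fit and the data
    ss_err = sum((yf - y) ** 2 for yf, y in zip(y_fit, ys))
    # sum of squared deviations of the data from its average
    ss_tot = sum((y_bar - y) ** 2 for y in ys)
    return 1 - ss_err / ss_tot


# Illustrative data only (assumed for this sketch)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, b = 0.6, 2.2   # least-squares slope and intercept for this data

print(f"r^2 = {r_squared(xs, ys, m, b):.4f}")  # r^2 = 0.6000
```

Here the squared errors sum to 2.4 and the squared deviations from the average sum to 6, so r² = 1 - 2.4/6 = 0.6.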
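alternate equation for r²
instead of using the form for r² presented on the previous slide, we use the form
below; this form does not rely on $m$ and $b$:

$$ r^2 = \frac{\left( n \sum x_i y_i - \sum x_i \sum y_i \right)^2}{\left[ n \sum x_i^2 - \left( \sum x_i \right)^2 \right] \cdot \left[ n \sum y_i^2 - \left( \sum y_i \right)^2 \right]} $$

$0 \le r^2 \le 1$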
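The sums-only form can be sketched in Python as follows. The function name `r_squared_from_sums` and the data are illustrative assumptions; the only extra column needed beyond those used for the slope and intercept is the sum of the squared y values.

```python
def r_squared_from_sums(xs, ys):
    """r^2 computed from column sums only (no slope or intercept needed)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = (n * sum_xy - sum_x * sum_y) ** 2
    denominator = (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    return numerator / denominator


# Illustrative data only (assumed for this sketch)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

print(f"r^2 = {r_squared_from_sums(xs, ys):.4f}")  # r^2 = 0.6000
```

For this data the numerator is 30² = 900 and the denominator is 50 · 30 = 1500, giving r² = 0.6, the same value the error-based definition gives for the least-squares line through these points.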
Class Problem: Use Excel to compute r² for the recycling problem completed earlier.
solution: adding r² to the earlier spreadsheet
• if r² is 0, then there is no apparent linear relationship between x and y
• if r² is 1, then
  o x perfectly determines y
  o the variation in y is wholly due to x
  o y depends on x and there are no other variables that affect y
repeat using built-in Excel tools
STEPS:
1. enter x and y data
2. create a scatter plot
3. right click on the markers and select “Add Trendline”
4. select “Linear”, “Display Equation on chart” and “Display R-squared value on chart”
[Figure: "recovery for recycling versus year", plotting recovery for recycling (lbs/(person*day)) against year, with the linear trendline y = 0.0212x - 41.537 and R² = 0.9378 displayed on the chart]
NOTE: when studying for the next exam, be sure you can solve
problems like the one today by hand and using Excel