living with the lab

linear regression: quality of the fit and automating the analysis in Excel

[Figure: three plots titled "heart rate versus exercise time", each showing heart rate (bpm) against cumulative exercise time (s); the panels are labeled "good fit", "better fit" and "best fit"]

good, better and best aren’t very quantitative words to describe the “quality of the fit”

© 2011 David Hall and the LWTL faculty team
The Living with the Lab label, the Louisiana Tech Logo, and this copyright notice should not be removed when any part of this work is used by others. This work may not be used for commercial purposes. Inquiries should be addressed to dhall@latech.edu. This presentation on linear regression is based partially on class notes created by Dr. Mark Barker at Louisiana Tech University.

DISCLAIMER
The content of this presentation is for informational purposes only and is intended only for students attending Louisiana Tech University. The author of this information does not make any claims as to the validity or accuracy of the information or methods presented. The procedures demonstrated here are potentially dangerous and could result in injury or damage. Louisiana Tech University and the State of Louisiana, their officers, employees, agents or volunteers, are not liable or responsible for any injuries, illness, damage or losses which may result from your using the materials or ideas, or from your performing the experiments or procedures depicted in this presentation. If you do not agree, then do not view this content.

Class Problem
Determine the best fit line of “recovery for recycling” versus “year” for 1960, 1970, 1980, 1990, 2000 and 2009.
a. Use Excel to set up a table to manually determine the slope m and the y-intercept b.
b.
Plot the six raw data points versus the fit. Use markers only (with no lines) for the raw data and lines only (no markers) for the fit.

Table ES-3. Generation, materials recovery, composting, combustion with energy recovery, and discards of municipal solid waste, 1960-2009, in pounds per person per day
http://www.wastexchange.org/upload_publications/MSWintheU.S.2010.pdf
www.epa.gov

$$m = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2} \qquad b = \frac{\sum y_i - m \sum x_i}{n}$$

solution

the “coefficient of determination,” more commonly referred to as $r^2$, will be used to determine the “goodness of the fit”

coefficient of determination

• the error at point $x_i$ is $y_i - y_i^{fit}$, where $y_i^{fit} = m \cdot x_i + b$
• since some errors are negative (fit lies above data point) and some are positive (fit lies below data point), we square the errors: $error^2 = \sum \left(y_i - y_i^{fit}\right)^2$
• if we simply reported the $error^2$ term above, the number would vary in size depending on the problem being solved
• we would like a number that varies between 0 (poor fit) and 1 (perfect fit), so we normalize the error

[Figure: data point $i$ at $(x_i, y_i)$ and the best fit line $y^{fit} = m \cdot x + b$ on $x$-$y$ axes; the vertical distance between the data point and the line is $y_i - y_i^{fit}$]

$$r^2 = 1 - \frac{\sum \left(y_i - y_i^{fit}\right)^2}{\sum \left(\bar{y} - y_i\right)^2}$$

where $\bar{y}$ is the average value of $y_i$; this normalizes $r^2$: $0 \le r^2 \le 1$

alternate equation for $r^2$

instead of using the form for $r^2$ presented on the previous slide, we use the form below; this form does not rely on $m$ and $b$:

$$r^2 = \frac{\left(n \sum x_i y_i - \sum x_i \sum y_i\right)^2}{\left(n \sum x_i^2 - \left(\sum x_i\right)^2\right) \cdot \left(n \sum y_i^2 - \left(\sum y_i\right)^2\right)} \qquad 0 \le r^2 \le 1$$

Class Problem
Use Excel to compute $r^2$ for the recycling problem completed earlier.
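The same hand calculation can be checked outside Excel. The Python sketch below is one possible way to do that, not part of the original class materials: it computes $m$ and $b$ from the sums, then computes $r^2$ both from the squared errors and from the alternate sums-only form so the two can be compared. The function names and the data values are hypothetical; the numbers are made up for illustration and are not the EPA recycling data.

```python
def linear_fit(xs, ys):
    """Return slope m and intercept b of the least-squares line."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b


def r2_from_errors(xs, ys, m, b):
    """r^2 = 1 - sum((y_i - y_fit_i)^2) / sum((y_bar - y_i)^2)."""
    y_bar = sum(ys) / len(ys)
    ss_err = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y_bar - y) ** 2 for y in ys)
    return 1 - ss_err / ss_tot


def r2_from_sums(xs, ys):
    """Alternate form of r^2 that does not rely on m and b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = (n * sum_xy - sum_x * sum_y) ** 2
    denominator = (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    return numerator / denominator


# Hypothetical data for illustration only (NOT the EPA recycling values).
xs = [0, 10, 20, 30, 40, 50]
ys = [62, 75, 84, 97, 103, 118]

m, b = linear_fit(xs, ys)
print("m =", m, "b =", b)
print("r^2 from errors:", r2_from_errors(xs, ys, m, b))
print("r^2 from sums:  ", r2_from_sums(xs, ys))
```

For a straight-line least-squares fit the two $r^2$ values should agree to within floating-point rounding, which gives a quick check on a spreadsheet built by hand.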
solution: adding $r^2$ to the earlier spreadsheet

• if $r^2$ is 0, then there is no apparent linear relationship between x and y
• if $r^2$ is 1, then
  o x perfectly determines y
  o the variation in y is wholly due to x
  o y depends on x and there are no other variables that affect y

$$r^2 = \frac{\left(n \sum x_i y_i - \sum x_i \sum y_i\right)^2}{\left(n \sum x_i^2 - \left(\sum x_i\right)^2\right) \cdot \left(n \sum y_i^2 - \left(\sum y_i\right)^2\right)} \qquad 0 \le r^2 \le 1$$

repeat using built-in Excel tools

STEPS:
1. enter x and y data
2. create a scatter plot
3. right click on the markers and select “Add Trendline”
4. select “Linear”, “Display Equation on chart” and “Display R-squared value on chart”

[Figure: scatter plot titled "recovery for recycling versus year", with recovery for recycling (lbs/(person*day)) on the vertical axis and year on the horizontal axis; the linear trendline is labeled y = 0.0212x - 41.537, R² = 0.9378]

NOTE: when studying for the next exam, be sure you can solve problems like the one today by hand and using Excel
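To compare a by-hand result with the trendline label that Excel draws on the chart, it can help to format the fit the same way. The sketch below is an illustration under stated assumptions, not a description of how Excel works internally: `trendline_label` is a hypothetical helper, the fixed number of decimal places is a simplification (Excel chooses its own number format), and the sample points are made up.

```python
def trendline_label(xs, ys, digits=4):
    """Return (equation, r_squared) strings in the style of a chart label.

    'digits' is an assumed fixed number of decimal places; Excel picks
    its own display format, so the digits shown may differ.
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    # Slope and intercept from the least-squares sums.
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n

    # r^2 from the form that does not rely on m and b.
    r2 = (n * sum_xy - sum_x * sum_y) ** 2 / (
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )

    sign = "+" if b >= 0 else "-"
    equation = f"y = {m:.{digits}f}x {sign} {abs(b):.{digits}f}"
    r_squared = f"R² = {r2:.{digits}f}"
    return equation, r_squared


# Made-up points that lie exactly on y = 2x - 3 (illustration only).
equation, r_squared = trendline_label([1, 2, 3], [-1, 1, 3])
print(equation)
print(r_squared)
```

Because the sample points lie exactly on a line, the label should report a slope of 2, an intercept of -3 and an R² of 1; real data such as the recycling values will give an R² below 1.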