living with the lab

introduction to linear regression
linear regression provides a predictable way to quantify the relationship between two variables, even when significant uncertainty and measurement error exist
environmental data, medical data, process parameters (images: http://mrg.bz/sWnKWI, http://mrg.bz/jR4UEX, http://mrg.bz/sMmqHk)

© 2011 David Hall and the LWTL faculty team
The Living with the Lab label, the Louisiana Tech Logo, and this copyright notice should not be removed when any part of this work is used by others. This work may not be used for commercial purposes. Inquiries should be addressed to dhall@latech.edu.
This presentation on linear regression is based partially on class notes created by Dr. Mark Barker at Louisiana Tech University.

DISCLAIMER
The content of this presentation is for informational purposes only and is intended only for students attending Louisiana Tech University. The author of this information does not make any claims as to the validity or accuracy of the information or methods presented. The procedures demonstrated here are potentially dangerous and could result in injury or damage. Louisiana Tech University and the State of Louisiana, their officers, employees, agents or volunteers, are not liable or responsible for any injuries, illness, damage or losses which may result from your using the materials or ideas, or from your performing the experiments or procedures depicted in this presentation. If you do not agree, then do not view this content.

collect some data to see how linear regression works
• we know that our heart rate increases as we begin to exercise
• heart rate is usually expressed in beats per minute (bpm)
• we can record our pulse over a short period of time to estimate heart rate . . .
we’ll collect over a 10-second period
• $\text{bpm} = (\text{beats over a 10-second period}) \cdot 6$
• the variation of heart rate during exercise is complex and depends on many factors (fitness, the level of exertion, the duration of exercise, what you’ve been eating/drinking, . . .)
• we will assume that heart rate is initially linear with the duration of exercise just to collect some data . . . this could serve as a starting point for a systematic study of heart rate during exercise

collect pulse after doing jumping jacks
1. measure pulse for 10 seconds (have a partner write down the number of beats)
2. do jumping jacks for 10 seconds (10 seconds of total exercise)
3. measure pulse for 10 seconds
4. do jumping jacks for 10 seconds (20 seconds of total exercise)
5. measure pulse for 10 seconds
6. do jumping jacks for 10 seconds (30 seconds of total exercise)
7. measure pulse for 10 seconds
8. do jumping jacks for 10 seconds (40 seconds of total exercise)
9. measure pulse for 10 seconds
collect heart rate five times
[figure: timeline alternating STOP (measure pulse) and jump intervals of 10 s each; jumping time (s) runs from 0 to 40, total time (s) runs from 0 to 90]

logistics
www.onlinestopwatch.com
• choose one or two people per table to do jumping jacks; this is voluntary . . . don’t do the jumping jacks if there is any reason why this activity could be harmful to you
• the people who are jumping should get away from tripping hazards and other people (clear a space around your table and keep yourself under control while exercising)
• your instructor will keep track of time and tell you when to jump and when to collect heart rate; a cell phone, watch or online stopwatch can be used
• we need about 7 to 10 sets of data from the entire class . . .
not everybody will get to exercise
• we’ll analyze and plot this data using Excel
• the heart rate collected will include some error
  o collect pulse as soon as you stop jumping
  o after 10 seconds, call out the number of pulses collected over 10 seconds to your partner(s) and start jumping again
• just be as accurate as possible

enter heart rate data into Excel
[blank table to fill in: one row for each time (s) = 0, 10, 20, 30, 40; one column per student, student 1 (bpm) through student 8 (bpm)]
• multiply the number of pulses collected over 10 seconds by 6 to get beats per minute (bpm)
• report bpm to your instructor
• build a spreadsheet on your computer along with the instructor

plot data for the entire class in Excel
• make a scatter plot using symbols only, with no lines
• time is the independent variable and is plotted on the x-axis
• heart rate is the dependent variable and is plotted on the y-axis
• the title of the plot is always listed as “y versus x” . . . which is “heart rate versus exercise time” for this problem
[figure: scatter plot titled “heart rate versus exercise time”; x-axis: cumulative exercise time (s), 0 to 50; y-axis: heart rate (bpm), 50 to 150]

make a hand plot for one data set
• your instructor will select one student’s data that is typical of the data for the entire class; we will analyze this data
• make a hand plot using your own paper as shown below (use proper format!!)
• draw a “best fit” line through the data; just use your judgment
[figure: hand plot titled “heart rate versus exercise time” with a “best fit line” drawn through the data]

cumulative exercise time (s) | heart rate (bpm)
0  | 67
10 | 82
20 | 86
30 | 96
40 | 120
use data from class . . .
not this data

find an equation to fit the data
• assume the data is linear
• pick two points from your data (or make up two points by picking from the line)
• compute the slope, $m = \frac{\text{rise}}{\text{run}} = \frac{\Delta y}{\Delta x}$
• write the equation in slope-intercept form as $y = m \cdot x + b$

example (use data from your class), using the points (10, 82) and (40, 120) from the example data:
find the slope:
$m = \frac{\Delta y}{\Delta x} = \frac{120 - 82}{40 - 10} = 1.27$
find the y-intercept by plugging in one of the data points (carrying the unrounded slope 38/30):
$b = y - m \cdot x = 120 - \frac{38}{30} \cdot 40 = 69.3$
write the equation:
$\text{heart rate} = 1.27 \cdot \text{time} + 69.3$
. . . where heart rate is in bpm and time is in seconds.

analysis of our equations
• compare your answer with others in the class
• if you chose the same two points to define your “best fit” line, then your equations should be the same
• choosing different points causes us to get different equations
• linear regression, which can be derived using calculus, gives us the same equation every time
• linear regression takes the guesswork out of finding best fit lines
http://earthobservatory.nasa.gov/IOTD/view.php?id=46145

understanding linear regression
[figure: data point $i$ at $(x_i, y_i)$, the best fit line $y = m \cdot x + b$, the fitted value $y_i^{fit} = m \cdot x_i + b$, and the vertical error $y_i - y_i^{fit}$ between them]
• linear regression generates the best line by minimizing the squares of the errors
• minimize the sum of $(y_i - y_i^{fit})^2$ over all data points to find optimum values of m and b
• we call this least squares linear regression

finding m and b
$m = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}$
$b = \frac{\sum y_i - m \sum x_i}{n}$
$y = m \cdot x + b$

x = cumulative exercise time (s), y = heart rate (bpm)
x   | y   | x·y   | x²
0   | 67  | 0     | 0
10  | 82  | 820   | 100
20  | 86  | 1720  | 400
30  | 96  | 2880  | 900
40  | 120 | 4800  | 1600
100 | 451 | 10220 | 3000   (sums: $\sum x_i$, $\sum y_i$, $\sum x_i \cdot y_i$, $\sum x_i^2$)

$m = \frac{5 \cdot 10220 - 100 \cdot 451}{5 \cdot 3000 - 100^2} = 1.2$
$b = \frac{451 - 1.2 \cdot 100}{5} = 66.2$
$\text{heart rate} = 1.2 \cdot \text{time} + 66.2$
Repeat the above procedure for the data set selected in
your class. Compare the m and b that you get with your classmates. Doing this by hand is good practice for the exam.

repeat for all of the class data
• reformat your spreadsheet to have single x and y columns as shown (5 lines for each student’s heart rate data)
• find the sums and plug them into the equations for m and b to find the best fit line; try to do these calculations in Excel . . . it’s tricky due to fixed cell references and the placement of parentheses
• create a plot of all data in Excel
• plot the best fit line without any symbols over the data points
• see the next page for an example
[spreadsheet layout: columns x = cumulative exercise time (s), y = heart rate (bpm), x·y, x²; x repeats 0, 10, 20, 30, 40 for each of student 1 through student 4; the last row holds the sums $\sum x_i$ = 400, $\sum y_i$, $\sum x_i \cdot y_i$, $\sum x_i^2$]

details of solving previous problem in Excel
don’t look at these tips unless you get stuck!!

x = cumulative exercise time (s), y = heart rate (bpm)
row | A         | B: x | C: y | D: x·y | E: x² | F: yfit
5   | student 1 | 0    | 67   | 0      | 0     | 70.5
6   |           | 10   | 82   | 820    | 100   | 80.7
7   |           | 20   | 86   | 1720   | 400   | 90.9
8   |           | 30   | 96   | 2880   | 900   | 101.0
9   |           | 40   | 120  | 4800   | 1600  | 111.2
10  | student 2 | 0    | 50   | 0      | 0     |
11  |           | 10   | 60   | 600    | 100   |
12  |           | 20   | 70   | 1400   | 400   |
13  |           | 30   | 80   | 2400   | 900   |
14  |           | 40   | 90   | 3600   | 1600  |
15  | student 3 | 0    | 82   | 0      | 0     |
16  |           | 10   | 92   | 920    | 100   |
17  |           | 20   | 105  | 2100   | 400   |
18  |           | 30   | 110  | 3300   | 900   |
19  |           | 40   | 115  | 4600   | 1600  |
20  | student 4 | 0    | 80   | 0      | 0     |
21  |           | 10   | 91   | 910    | 100   |
22  |           | 20   | 105  | 2100   | 400   |
23  |           | 30   | 118  | 3540   | 900   |
24  |           | 40   | 118  | 4720   | 1600  |
26  | sums      | 400  | 1817 | 40410  | 12000 |
28  | m =       |      | 1.0175 |
29  | b =       |      | 70.5   |

formula for m (cell C28): =(COUNT(B5:B24)*D26-B26*C26)/(COUNT(B5:B24)*E26-B26^2)
formula for yfit (cell F5): =C$28*B5+C$29
use these data points (the yfit column) to plot the best-fit line
[figure: “heart rate versus exercise time” scatter plot of all class data with the best fit line; x-axis: cumulative exercise time (s), 0 to 50; y-axis: heart rate (bpm), 50 to 150]
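
As a cross-check on the hand and Excel calculations, the same sums can be computed in a few lines of Python. This is only a sketch: the function name `least_squares_line` and the variable names are made up for illustration and are not part of the original presentation; the data values are the example numbers from the slides, not your class data.

```python
def least_squares_line(xs, ys):
    """Return (m, b) for the least squares line y = m*x + b.

    Uses the same sums as the slides:
      m = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2),  b = (Sy - m*Sx) / n
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b


# example data from the slides (hypothetical variable names)
times = [0, 10, 20, 30, 40]           # cumulative exercise time (s)
student_1 = [67, 82, 86, 96, 120]     # heart rate (bpm)
student_2 = [50, 60, 70, 80, 90]
student_3 = [82, 92, 105, 110, 115]
student_4 = [80, 91, 105, 118, 118]

# fit for one student (5 points)
m_one, b_one = least_squares_line(times, student_1)

# fit for all four students stacked into single x and y columns (20 points)
class_x = times * 4
class_y = student_1 + student_2 + student_3 + student_4
m_all, b_all = least_squares_line(class_x, class_y)

print(f"one student: heart rate = {m_one:.4f} * time + {b_one:.4f}")
print(f"whole class: heart rate = {m_all:.4f} * time + {b_all:.4f}")
```

If the sketch is right, the one-student fit should agree with the hand calculation (m = 1.2, b = 66.2) and the stacked fit should agree with the spreadsheet (m = 1.0175, b = 70.5), to within rounding.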