Lesson 1: Scatter Plots, Correlation, Causation, & Regression Lines

The Candy Grab Investigation

1. Measure the span of your hand to the nearest half centimeter. Hand span is the distance from the tip of the thumb to the tip of the pinkie finger on your fully stretched-out hand.
2. One student at a time, go to the front of the class and use your dominant hand to grab as many candies as possible from the container. You must grab the candies with your fingers pointing down (no scooping!) and hold the candies for 2 seconds before counting them. After counting, put the candy back into the container.
3. On the board, record your hand span and number of candies in the table.
4. While other students record their values on the board, copy the table onto a piece of paper and make a graph. Begin by constructing a set of coordinate axes. Label the horizontal axis “Hand span (cm)” and the vertical axis “Number of candies.” Choose an appropriate scale for each axis and plot each point from your class data table as accurately as you can on the graph.
5. What does the graph tell you about the relationship between hand span and number of candies? Summarize your observations in a sentence or two.

Vocabulary

A ____________________________ variable measures an outcome of a study. These variables are often called __________________________ variables.

An ___________________________ variable may help predict or explain changes in a _______________________________ variable. These variables are often called __________________________ variables.

Identify the explanatory variable and response variable for the following relationships, if possible. Explain your reasoning.

A. The candy activity you participated in.
B. The weight (in carats) and the price (in dollars) for a sample of diamonds.
C. The SAT math score and the SAT evidence-based reading and writing score for a sample of students.
Vocabulary

A ________________________________________ is the best way to display the relationship between two quantitative variables. The values of one variable appear on the horizontal axis, and the values of the other variable appear on the vertical axis. Each individual in the data set appears as a point on the graph.

Two variables have a _______________________________ association when above-average values of one variable tend to accompany above-average values of the other variable and when below-average values also tend to occur together.

Two variables have a _______________________________ association when above-average values of one variable tend to accompany below-average values of the other variable.

There is ____________________________________ association between two variables if knowing the value of one variable does not help us predict the value of the other variable.

How to describe a scatterplot:

Direction: A scatterplot can show a positive association, negative association, or no association.
Form: A scatterplot can show a linear form or a nonlinear form. The form is linear if the overall pattern follows a straight line. Otherwise, the form is nonlinear.
Strength: A scatterplot can show a weak, moderate, or strong association. An association is strong if the points don’t deviate much from the form identified. An association is weak if the points deviate quite a bit from the form identified.
Unusual features: Look for outliers that fall outside the overall pattern and for distinct clusters of points.

Describe each of the following.

A. The scatterplot created by the class in our activity.
B. The scatterplot shows the relationship between the duration (in minutes) of an eruption and the interval of time until the next eruption (in minutes) of Old Faithful during a particular month.
C. The scatterplot shows the relationship between the average income (gross domestic product per person, in dollars) and fertility rate (number of children per woman) in 187 countries.

For a linear association, mathematicians have defined a measure of the direction and strength of the association. This measure is called the correlation coefficient and is represented by r. This value falls between -1 and 1. The closer the value is to -1, the stronger the negative correlation. The closer the value is to 1, the stronger the positive correlation.

Remember these limitations of r:

• Correlation does not imply causation.
• The correlation is not resistant, so outliers can greatly change the value of r.
• The correlation should only be used to describe linear relationships.
• Correlation ignores the distinction between explanatory and response variables.
• The value of r does not have units and is not affected by changes in the unit of measurement of either variable.

Example
Here is the scatterplot showing the relationship between payrolls (in millions of dollars) and wins for MLB teams in 2016. For these data, r = 0.613. Interpret the value of r.

A correlation measures the relationship between two variables (x and y). When there is a CORRELATION, we are saying that as one variable changes, the other also tends to change, either positively or negatively. There is some type of "mutual activity" in their change.

CAUSATION is a special kind of relationship in which a change in one variable directly causes a change in the other variable.

Example: The correlation for the data shown is r = 0.97. Does an increase in skiing revenue cause more people to die by becoming tangled in their bedsheets?

Example: Most people love chocolate for its great taste. But does it also make you smarter?
A scatterplot like this one recently appeared in the New England Journal of Medicine. The explanatory variable is the chocolate consumption per person for a sample of countries. The response variable is the number of Nobel Prizes per 10 million residents of that country.

A. If people in the United States started eating more chocolate, could we expect more Nobel Prizes to be awarded to residents of the United States? Explain.
B. What effect does Switzerland have on the correlation? Explain.

Calculating Correlation

The formula for correlation is:

$r = \frac{1}{n-1} \sum \left( \frac{x - \bar{x}}{s_x} \right) \left( \frac{y - \bar{y}}{s_y} \right)$

Example
A student wonders if tall women tend to date taller men than do short women. She measures herself, her dormitory roommate, and the women in the adjoining dorm rooms. Then she measures the next man each woman dates. Here are the data (heights in inches):

Find the correlation. Check your work in the calculator.

Example: The following scatterplot shows the height (in inches) and number of steps needed for a random sample of 36 students to walk the length of a school hallway. The correlation is r = -0.632.

A. Explain why it isn’t correct to say that the correlation is -0.632 steps per inch.
B. What would happen to the correlation if number of steps was used as the explanatory variable and height was used as the response variable?
C. What would happen to the correlation if height was measured in centimeters instead of inches? Explain.

A ______________________________ line is a line that describes how a response variable y changes as an explanatory variable x changes. Regression lines are expressed in the form:

Exercise #1: The low and high temperatures, in degrees Fahrenheit, were recorded on 10 days in the month of April to try to establish a relationship between a day’s low temperature and its high temperature.
Low Temperature, x     26   28   30   32   34   35   37   38   41   45
High Temperature, y    49   50   57   54   60   58   64   66   63   72

(a) Enter the data into lists on your calculator and create a scatter plot using your graphing technology. What do you notice about the data?
(b) Use your calculator to find the equation for the line of best fit. Round the slope of the line to the nearest hundredth and the y-intercept to the nearest integer.
(c) Explain what the y-intercept of this model represents in terms of the low and high temperatures that are being modeled in this problem.
(d) How would you interpret the slope of this model in terms of how the low and high temperatures change with respect to each other?
(e) What is the correlation coefficient of the data? What does this indicate?

Exercise #2: Generally, the fuel efficiency of a car changes with the weight of the car. A survey of some cars with their weights and gas mileages is shown below.

Weight (1000s of lbs)         3.7   4.5   3.2   5.1   6.8   4.9   4.8   5.5
Mileage (miles per gallon)    38    26    48    24    18    30    28    21

(a) Find the equation for the line of best fit using your calculator. Round both coefficients to the nearest tenth. List what the variables x and y represent in this problem.
(b) Create a graph of the scatter plot for this data. Would you consider the correlation between weight and mileage to be positive or negative? Explain.
(c) Which parameter of the linear model indicates whether the correlation is positive or negative? Use this model to help explain your answer.
(d) If a car had a weight of 4,300 pounds, what would this model predict as its fuel efficiency? Round to the nearest integer. Use appropriate units and make sense of your answer.
(e) If we wanted to purchase a car that got 40 miles to a gallon, what weight of car, to the nearest 100 pounds, should we purchase? Solve algebraically.
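
The line of best fit that the calculator reports in Exercises #1 and #2 can also be computed by hand from the data. The following is a minimal Python sketch of that calculation, using the Exercise #1 temperature table as input. The function names `least_squares_line` and `predict` are hypothetical helpers chosen for this illustration, and the low temperature of 33 used in the last line is an arbitrary value inside the data range, not one from the worksheet. It is meant as a way to check calculator work, not as a description of how the calculator itself computes the line.

```python
# Least-squares line for the Exercise #1 temperature data.
# This is an illustrative sketch; the helper names are made up for this example.
low = [26, 28, 30, 32, 34, 35, 37, 38, 41, 45]    # explanatory variable x
high = [49, 50, 57, 54, 60, 58, 64, 66, 63, 72]   # response variable y


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (x_bar, y_bar).
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(slope, intercept, x):
    """Predicted response for a given value of the explanatory variable."""
    return slope * x + intercept


slope, intercept = least_squares_line(low, high)
print(f"slope = {slope:.2f}, y-intercept = {intercept:.0f}")
print(f"predicted high for a low of 33: {predict(slope, intercept, 33):.1f}")
```

Swapping in the weight and mileage lists from Exercise #2 gives the corresponding model for that exercise; a negative slope from `least_squares_line` goes with a negative association.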
Exercise #3: A pediatrician would like to determine the relationship between the weight of infant girls and their age. The pediatrician studies 100 newborn girls and finds their average weight at the end of 3-month intervals.

Age (months)               0     3      6      9      12     15
Average Weight (pounds)    7.2   12.2   15.1   19.4   21.5   26.3

(a) Using the linear regression command on your calculator, find the equation of the best fit line.
(b) Use your calculator to determine the linear correlation coefficient. Round to the nearest thousandth. How can you interpret this value in terms of the variation in weight due to age?

Exercise #4: Using the equation that your calculator produced in Exercise #3, predict the weight of a baby girl after 10 months. Round your answer to the nearest tenth of a pound.

The use of a model to predict outputs when the input is within the range of the known data is called interpolation. Interpolation tends to be fairly accurate.

Exercise #5: Using the equation that your calculator produced in Exercise #3, predict the weight of a baby girl after 2 years. Round your answer to the nearest tenth of a pound.

The use of a model to predict outputs when the input is outside of the range of the known input data is called extrapolation. Models are most helpful when they can be used to extrapolate, but extrapolated predictions tend to be less accurate.

Exercise #6: Biologists are trying to create a least-squares regression equation (another name for best fit line) relating the length of steelhead salmon to their weight. Seven salmon were measured and weighed, with the data given below.

Length (inches)    22     24     28     34      39      42      48
Weight (pounds)    3.43   4.46   7.08   14.21   22.19   31.22   35.67

(a) Determine the least-squares regression equation, in the form y = ax + b, for this data. Round all coefficients to the nearest hundredth.
(b) Using your equation from part (a), determine the expected weight of a salmon that is 30 inches long.
(c) Using your equation from part (a), determine the expected weight of a salmon that is 52 inches long.
(d) In which part, (b) or (c), did you use interpolation and in which part did you use extrapolation? Explain.

1. Bivariate Data Summary

Bivariate data – data that examines the relationship between two variables
• What individuals do the data describe?
• What are the variables and how are they measured?
• Are the variables quantitative or categorical?

Types of bivariate data
• Response variable – measures the outcome of a study
• Explanatory variable – attempts to explain (not cause) the response variable

To determine which is explanatory and which is response, think about which one seems to be a possible explanation of the other. If it is not obvious which one is explanatory and which is response, it may very well be that it doesn’t matter.

Scatterplots – show the relationship between two quantitative variables measured on the same individuals. The values of one variable appear on the horizontal (x) axis and the values of the other variable appear on the vertical (y) axis. Each individual appears as a point in the plot. The explanatory variable is placed on the x-axis and the response variable is placed on the y-axis. If there is no explanatory-response relationship, either variable can go on the horizontal axis. Scatterplots must be labeled (both axes). The intervals on each axis must be uniform.

Interpreting scatterplots
• Look for the overall pattern and deviations from that pattern.
• Form – does the data appear linear or curved or have distinct clusters?
• Direction – the term used is association
• Positive association – low values of the explanatory variable accompany low values of the response variable and high values of the explanatory variable accompany high values of the response variable
• Negative association – low values of the explanatory variable accompany high values of the response variable and high values of the explanatory variable accompany low values of the response variable
• Strength of an association – how closely the points follow a clear form. Both of the associations above are strongly linear.
• Outliers – points that fall outside the overall pattern of the relationship. As usual, outliers should be eliminated only if doing so can be justified.

© www.MasterMathMentor.com Illegal to post on the Internet

Using the calculator:
Place the data in your lists: STAT, EDIT
Set up your plots: 2nd, Y=
Then: ZOOM, 9: ZoomStat
Be sure that you have no graphs in your Y= list.

Correlation – measures the direction and strength of the linear relationship between two quantitative variables. The variable to denote correlation is r.

Facts about correlation:
1. r is a number between -1 and 1 inclusive. Positive values of r indicate a positive association between variables. Negative values of r indicate a negative association between variables.
2. Values of r near 1 or -1 mean a very strong association (the points are very close to forming a straight line). Values of r near 0 mean a very weak association. If r is exactly 1 or -1, the association is perfect (rarely happens).
3. Correlation makes no distinction between explanatory and response variables. It makes no difference which variable is x and which is y when calculating r.
4. When calculating r, both variables must be quantitative.
5. r does not change when we change the units of measurement of x, y, or both.
6. Correlation measures the strength of a linear relationship between variables.
It does not describe relationships that are curved, no matter how strong they appear to be.
7. r is non-resistant: it is strongly affected by outliers.
8. In general, we will make this claim:
   |r| ≥ 0.9: very strong association
   0.7 ≤ |r| < 0.9: fairly strong association
   0.5 ≤ |r| < 0.7: moderately strong association
   0.2 ≤ |r| < 0.5: fairly weak association
   |r| < 0.2: very weak association

The formula for correlation: $r = \frac{1}{n-1} \sum \left( \frac{x - \bar{x}}{s_x} \right) \left( \frac{y - \bar{y}}{s_y} \right)$. You are not responsible for this formula. We will find r using the calculator.

R1: Linear Regression Example                    Name ______________________________

A diver is investigating a wreck under the water and has to come up to the surface slowly. Following is a chart detailing his depth from the time he starts ascending.

time             depth (ft)
0 sec            240
30 sec           225
1 min            203
1 min 40 sec     189
2 min 20 sec     180
3 min            164
3 min 30 sec     155
4 min 40 sec     160
5 min 30 sec     130
6 min            125
6 min 30 sec     120

1. Draw a scatterplot which is suitable for predicting depth based on time. There is one point that lies outside the pattern of the plot. Circle it.
2. Find the equation of the least squares line of the data. Comment on the strength of the association.
3. Eliminate the point you circled and recalculate the equation of the least squares line. Sketch the line on the scatterplot above. Comment on the strength of the association compared to your answer with that circled point included.
4. Using the model above (point removed), explain the meaning of the slope of the line.
5. Describe the relationship between time and depth using r² to make your description more precise.
6. Using the model above (point removed), predict the depth of the diver at the following times. Comment on the confidence of your prediction.
   a) 2 minutes, 50 seconds
   b) 5 minutes
   c) 7 minutes, 10 seconds
7. Find the difference between the observed depth at 1 minute, 40 seconds and the predicted depth.
8. Explain why a linear model might not be the best to describe this data.
9. In the diagram above, place a square around the point that has the largest residual. Place a triangle around the point that has the second largest residual.
10. Using the linear model, how long would you predict it takes before the diver reaches the surface? In reality, do you believe it is a longer or shorter duration? Explain.
11. What is the exact sum of the residuals?
12. One more piece of data is added. It is found that the diver is at 40 feet, 7 minutes into his ascent. Explain the role of this new point and how it will affect the slope of the LSRL without actually calculating it.
13. New data is added to change the mean time to 6 minutes and 10 seconds with standard deviation 2 minutes and 5 seconds. The mean depth is now 115.5 feet with standard deviation 31.6 feet. If the association is still negative and r² = 0.68, find the slope of the regression line of time versus depth.
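
Question 13 depends on a link between the correlation and the least-squares slope: the slope equals r multiplied by the ratio of the response variable's standard deviation to the explanatory variable's standard deviation. The Python sketch below illustrates that link, assuming the z-score form of the correlation formula from the summary sheet. It uses the infant age and weight table from Exercise #3 rather than the diver data, and the helper names `mean`, `sample_sd`, and `correlation` are hypothetical, introduced only for this example.

```python
from math import sqrt

# Infant data from Exercise #3 (age is the explanatory variable).
ages = [0, 3, 6, 9, 12, 15]                      # months
weights = [7.2, 12.2, 15.1, 19.4, 21.5, 26.3]    # pounds


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """r = (1 / (n - 1)) * sum of products of the x and y z-scores."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)


r = correlation(ages, weights)

# Slope obtained from r and the two standard deviations.
slope_from_r = r * sample_sd(weights) / sample_sd(ages)

# Slope obtained directly from the least-squares formula, for comparison.
x_bar, y_bar = mean(ages), mean(weights)
slope_direct = (sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, weights))
                / sum((x - x_bar) ** 2 for x in ages))

print(f"r = {r:.3f}")
print(f"slope from r: {slope_from_r:.4f}, slope computed directly: {slope_direct:.4f}")
```

The two slopes agree, which is why summary statistics alone (r and the two standard deviations) are enough to find a slope. The sign of the slope always matches the sign of r, so when only r² is given the sign has to come from the stated direction of the association.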