Two-Variable Statistics

Desired Outcomes
By the end of this unit, participants will . . .
• have a quick overview of Unit 8: Two-Variable Statistics.
• know the difference between categorical and quantitative data.
• understand two-way tables and calculate marginal, joint, and conditional probability.
• understand how to find best-fit lines, Median-Median lines, and LSRLs.
• understand the correlation coefficient and the coefficient of determination.
• know the definition of a residual and how to construct residual plots and use them to evaluate the appropriateness of a model.
• have a quick overview of Non-Linear Data.
• have ideas for the unit project (performance task).

Materials You Have
• Outline located on the Wiki – where you can take notes and find electronic copies of the handouts and other activities
• Copy of PowerPoint on Wiki
• Handouts – for your use now (yes, you can write on them!!)

Unit 4 vs. Unit 8
• Please quickly look over the TENTATIVE outline for Unit 8
• Our goal for today is to review/teach the concepts and give you exposure to the activities for those concepts
• Your job is to ask questions as needed

Types of Data
• Qualitative – also called categorical
  – Group characteristics
  – Ex. What school do you work in?
• Quantitative
  – Numerical
  – Ex. What is your height?
Activity – Venn Diagram

Categorical Data: Two-Way Tables
People leaving a soccer match were asked if they supported Manchester United or Newcastle United. They were also asked if they were happy. The table below gives the results.

             Manchester United   Newcastle United
Happy               40                   8
Not Happy            2                  20

Categorical Data: Two-Way Tables
Marginal Distribution
• Counts vs. Percentages
• How many Manchester fans were surveyed?
• What is the probability that a randomly selected person is a fan of Newcastle?
• What is the probability that a randomly selected person left the game happy?
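As a rough illustration of the marginal questions above, the sketch below tallies the row and column totals from the four counts in the soccer table and turns two of them into probabilities. The variable names (counts, row_totals, column_totals) are illustrative choices and are not part of the handout.

```python
# Counts from the soccer survey (rows: mood, columns: team supported).
counts = {
    "Happy": {"Manchester United": 40, "Newcastle United": 8},
    "Not Happy": {"Manchester United": 2, "Newcastle United": 20},
}

# Marginal totals: add across each row, then down each column.
row_totals = {mood: sum(teams.values()) for mood, teams in counts.items()}
column_totals = {}
for teams in counts.values():
    for team, n in teams.items():
        column_totals[team] = column_totals.get(team, 0) + n
grand_total = sum(row_totals.values())

# Marginal probabilities divide a row or column total by the table total.
p_newcastle = column_totals["Newcastle United"] / grand_total
p_happy = row_totals["Happy"] / grand_total

print("Manchester fans surveyed:", column_totals["Manchester United"])
print(f"P(Newcastle fan) = {p_newcastle:.2f}")
print(f"P(left happy)    = {p_happy:.2f}")
```

Keeping the counts in a nested dictionary mirrors the layout of the two-way table, so the totals can be checked against the Total row and column by eye.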
             Manchester United   Newcastle United   Total
Happy               40                   8            48
Not Happy            2                  20            22
Total               42                  28            70

Categorical Data: Two-Way Tables
Joint Probability
• compound event: ______ AND ______, ______ OR ______
• percentages/probability based on the table total
• How many of those surveyed are happy Manchester United fans?
• What percentage of those surveyed are Newcastle fans and not happy?
• How likely is a person to be a Newcastle fan or not happy?

Categorical Data: Two-Way Tables
Conditional Probability
• How likely is one event to happen, given that another event has happened?
• percentages/probability based on the row or column total of the given event
• How likely is a person to be happy, given that they are a Newcastle fan?
• If a person left the game happy, how likely is it that he/she is a Manchester fan?

Categorical Data: Two-Way Tables
• M&M’s sheets
• In your group, devise at least one count or probability question for each type we discussed:
  – Marginal
  – Joint
  – Conditional

Thirst Dilemma
• In your groups of four, work through the Thirst Dilemma activity.
• Be prepared to report out on your answers!
• Group Roles: Survivor, Measurer, Reader, Recorder

Describing Bivariate Relationships
Strength
• Visually – how closely do the points fall to the line or curve?
• Numerical measure – the correlation coefficient (applies only to linear models), $0 \le |r| \le 1$
  0 – .5 weak
  .5 – .75 moderate
  .75 – 1 strong
Form
• Linear
• Nonlinear
  – exponential
  – quadratic
Direction
• Positive
• Negative (for linear and exponential models)
• Positive then negative, or negative then positive (for quadratic models)

Describe the Thirst Dilemma Data
• Strength
• Form
• Direction

Thirst Dilemma
[Scatter plot: Number of Drinks (x-axis, 0 to 14) vs. Height of Water (cm) (y-axis, 0 to 16)]
There is a strong, negative, linear relationship between the number of drinks and the height of the water. The more drinks, the lower the height of the water left in the bottle.

Thirst Dilemma
a. x is the number of drinks of water (independent variable)
b. y is the height of the water in cm (dependent variable)
c. data table

Number of Drinks   Height of Water (cm)
       0                  15
       1                  14.2
       2                  13.8
       3                  13.1
       4                  12.6
       5                  11
       6                  10.4
       7                   9.7
       8                   9.2

Thirst Dilemma
d. graph
[Scatter plot: Number of Drinks (x-axis, 0 to 14) vs. Height of Water (cm) (y-axis, 0 to 16)]
e. The more drinks, the lower the height of the water left in the bottle. The height goes down about ½ to ¾ of a cm for each drink.
f. 2 hours = 8 drinks; the height is approximately 9 cm.

Thirst Dilemma
[Scatter plot with the line y = -0.76x + 15.151: Number of Drinks (x-axis, 0 to 14) vs. Height of Water (cm) (y-axis, 0 to 16)]
g. The height of the water in the bottle decreases by 0.76 cm for each drink. We started with 15.151 cm of water in the bottle.
h. 7 cm ÷ 0.76 cm/drink = 9.21 ≈ 9
It would take about 9 drinks to bring the level down by 7 cm.

Thirst Dilemma
i. A linear function is the best fit because the data follows a strong, negative, linear trend.
j. y = -0.76x + 15.151
   0 = -0.76x + 15.151
   -15.151 = -0.76x
   19.94 = x
It would take about 20 drinks before the water is all gone. If you take a drink every 15 min, that would be four drinks per hour. So the water would last 20 ÷ 4 = 5 hours.
[Scatter plot with the line y = -0.76x + 15.151: Number of Drinks (x-axis, 0 to 14) vs. Height of Water (cm) (y-axis, 0 to 16)]
[Graph: Number of Drinks (x-axis, 0 to 22) vs. Height of Water (cm) (y-axis, 0 to 16), with the line y = -0.76x + 15.151]

Thirst Dilemma
k. The height would decrease more quickly.
l. [Graph: Number of Drinks (x-axis, 0 to 22) vs. Height of Water (cm) (y-axis, 0 to 16), with the lines y = -0.76x + 15.151 and y = -0.76x + 4]

Thirst Dilemma
m. Group 1 started out with more water, but took bigger drinks. Group 2 started out with less water but took smaller drinks.
n. Steep slope at first (drinking fast), and then a less steep slope.
o. Assuming they were using the same size bottle, Sam took bigger drinks because his slope shows that for each drink, the height went down 2.4 cm. For Julie, the height only went down .9 cm for each drink.

Extension
• What would the graphs look like with the following bottle shapes? Would they still be linear? Sketch the graph for each bottle.

Least Squares Regression Line (LSRL)
How does the calculator work its magic?
http://www.nctm.org/standards/content.aspx?id=26787

Least Squares Regression Line (LSRL)
residual = actual value – predicted value
It is the job of the LSRL to minimize the sum of the squared errors.
Why do we square the residuals???

Properties of the Least Squares Regression Line (LSRL)
• Minimizes the sum of the squares of the vertical distances from the data points to the line (the sum of the signed distances is 0, so we have to square them)
• Goes through (mean x, mean y)
• Slope is related to correlation, r

Correlation Coefficient (r)
r is the correlation coefficient. It describes the strength of the linear relationship between two quantities:
• for linear: x vs. y
• for exponential: x vs. log y
• for power: log x vs. log y
• no correlation coefficient for quadratic, just $R^2$

Correlation
Correlation goes from -1 to 1, inclusive
• Closer to 1 or -1 is strong
• Closer to 0 is very weak

Demos for Correlation and LSRL
Build a plot and see the correlation coefficient:
http://strader.cehd.tamu.edu/Mathematics/Statistics/LeastSquares/least_squares.html

Exponential
$y = ab^x$
$\log y = \log(ab^x)$
$\log y = \log a + \log b^x$
$\log y = \log a + x \log b$
$\log y = (\log b)x + \log a$
$Y = Mx + B$
Data is linearized!

Power
$y = ax^b$
$\log y = \log(ax^b)$
$\log y = \log a + \log x^b$
$\log y = \log a + b \log x$
$\log y = b \log x + \log a$
$Y = MX + B$ (with $X = \log x$)

Anscombe Data Sets
[Figures from Wikipedia]
So the moral of the story is . . . GRAPH THE DATA FIRST!!!

Coefficient of Determination ($r^2$)
The coefficient of determination describes the proportion of variation in y that can be explained by the linear relationship with x. It tells us how much of the error in prediction can be “explained” by the relationship with x.
http://hadm.sph.sc.edu/courses/J716/demos/leastsquares/leastsquaresdemo.html

Coefficient of Determination
• If there is no relationship between x and y, the best predictor for y is the average of the y-values, represented by the horizontal line $y = \bar{y}$.
• Residuals are represented by the vertical distance between a data point and the line.
• SSM = sum of the squared errors about the mean
• If there is a relationship between x and y, the average y-value is no longer the best predictor.
• SSE = sum of the squared errors (about the fitted line)

Coefficient of Determination
$r^2 = \frac{2051 - 162}{2051}$
$r^2 \approx 0.92$
SSM is the total error we started with, SSE is the error still left after we fit the LSRL to the data; SSM - SSE is the amount of error that was taken away.
So the coefficient of determination measures the proportion of variation in y that can be explained by the linear relationship with x.
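To make the SSM and SSE arithmetic concrete, here is a small sketch that fits the LSRL to the Thirst Dilemma data table and computes the coefficient of determination as (SSM - SSE)/SSM. It assumes that the 2051 and 162 in the slide example are SSM and SSE respectively. The function name fit_lsrl is hypothetical, and the slope formula is the standard least-squares one, not necessarily what any particular calculator does internally.

```python
def fit_lsrl(xs, ys):
    """Return (slope, intercept) of the least squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # line passes through (mean x, mean y)
    return slope, intercept

# Thirst Dilemma data table: number of drinks, height of water (cm).
drinks = [0, 1, 2, 3, 4, 5, 6, 7, 8]
heights = [15, 14.2, 13.8, 13.1, 12.6, 11, 10.4, 9.7, 9.2]

slope, intercept = fit_lsrl(drinks, heights)
mean_height = sum(heights) / len(heights)
predicted = [slope * x + intercept for x in drinks]

# SSM: squared errors about the mean; SSE: squared errors about the LSRL.
ssm = sum((y - mean_height) ** 2 for y in heights)
sse = sum((y - p) ** 2 for y, p in zip(heights, predicted))
r_squared = (ssm - sse) / ssm

# Same formula applied to the slide's numbers (assumed SSM = 2051, SSE = 162).
slide_r_squared = (2051 - 162) / 2051

print(f"LSRL: y = {slope:.2f}x + {intercept:.3f}")
print(f"Thirst Dilemma r^2 = {r_squared:.3f}")
print(f"Slide example r^2  = {slide_r_squared:.2f}")
```

Because the residuals about the LSRL sum to zero, only their squares carry information about the size of the error, which is why both SSM and SSE are sums of squares.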
Coefficient of Determination
Note that $r = \pm\sqrt{r^2}$, so the correlation coefficient is equal (up to sign) to the square root of the coefficient of determination. This is mathematically true, but the meanings of the two quantities are very different!

Quadratic Functions
• Fitting a quadratic function to a set of data uses a different process. So, there is no r-value. Only $R^2$ (note the capital!) is reported.
• $R^2$ has a similar meaning to $r^2$ and can be interpreted in the same way.

Interpreting Constants/Coefficients
Linear: $y = mx + b$
slope: the constant rate of change of y in relation to x. For every one unit increase in x, there are m units of increase (or decrease) in y.
y-intercept: the initial value or starting amount

Making Predictions
Predictions are reliable if you have a good model:
• Model fits the graph of the data
• Correlation coefficient is close to 1 (or -1)
• Residual plot shows no pattern

Predictions
Predictions are not reliable if:
• The correlation is weak (r-value close to 0)
• The residual plot shows a pattern
• You are making a prediction outside the domain of the data (e.g. using data from 1950–1990 to predict what is going to happen in 2015)

Linear Data Set Activity
This activity is a sheet that will be used for several concepts. There are many sets of data out there that you can use for more practice.
PART ONE: Graph the data and describe.

Best-Fit Lines
In Unit 4, we start with the “eye-ball” line of best fit.
• What are the flaws with this method?
• What do you do with an outlier?
• How do you know whose line is the best model?

PART THREE: Median-Median Line
When it comes to outliers, which measure is best to use? MEAN or MEDIAN
Just like the median, the median-median line is not sensitive to outliers.

Finding the Median-Median Line
1. Arrange the data so that the x-values are in ascending order.
2. Divide the data into three groups.
If the number of data values does not divide evenly, then divide so that the 1st and 3rd groups contain the same number of data values and the middle group contains only one more or one less value.
3. On the plot of the data, use vertical lines to divide the groups visually.
4. Look at the first group. Determine the median x-value and the median y-value and write them as an ordered pair. This is the summary point for the first group. Call it S1.
5. Plot S1 using a plus sign or a square instead of a dot, in order to distinguish it from the other points.
6. Repeat Steps 4 and 5 for the 2nd and 3rd groups of data points to find points S2 and S3.
7. Draw a line (lightly) through S1 and S3. Find the equation of this line.
8. Calculate the vertical distance between the line and point S2.
9. Now adjust the line connecting points S1 and S3 by sliding it one-third of this distance towards point S2 while keeping the same slope (the resulting line should be parallel to the first line).
10. Write the equation of this new line by adjusting the value of the y-intercept by the one-third amount. If you are sliding up, add the 1/3 amount to the y-intercept; if you are sliding down, subtract.
This is the equation of the Median-Median Line!

PART FOUR: Towards Finding the Least Squares Regression Line
Find the point $(\bar{x}, \bar{y})$ and graph it. Find a best-fit line that goes through this point. Try to make the distances between the line and the data points as small as possible.

PART FIVE: Compare
• Using your graphing calculator, find the LSRL.
• Of the three lines you found, which one best matches the data? Explain.

Residuals and Residual Plots
Residual: the difference between the actual y-value and the predicted y-value for a given x-value
$\text{residual} = y - \hat{y}$
Residual Plot: the plot of the x-values vs. their residuals
Visually speaking, a residual is a measure of the vertical distance between a data point and the model.

Residual Plot
• Why do you need to make a residual plot?
  – To evaluate the goodness of fit of the model
• What does a good residual plot look like?
  – Points are scattered, as if there is no correlation
  – There is a balance between positive and negative residuals
  – The values of the residuals are small compared to the size of the data

Examples
We want there to be no pattern and for the residuals to be small.

Examples
If there is a pattern, it indicates that the model is not good. This “curve” tells us that a nonlinear function is a better choice.

Example
This “cone” shape tells us that the errors in prediction are getting larger as x gets larger.

Examples of Residual Plots

PART SEVEN: Make a Residual Plot

Finding the Best Model: Linear, Exponential or Quadratic?
• Exponential
• Quadratic

Exponential: $y = ab^x$
• Growth (base > 1): $y = a(1 + r)^x$, where a is the initial value or amount and r is the growth rate
• Decay (base < 1): $y = a(1 - r)^x$, where a is the initial value or amount and r is the decay rate

Quadratic
• Vertex (min or max): $\left(-\frac{b}{2a},\ f\left(-\frac{b}{2a}\right)\right)$
• y-intercept: (0, c), the starting value or amount, e.g. initial height of a projectile
• x-intercept(s): e.g. time when object hits the ground

Performance Task: Two-Variable Statistics
1. On a piece of chart paper, make a table and a graph of your data.
2. Describe the strength, form, and direction of the relationship.
Answer the following discussion questions:
A. What type of function does your data model?
B. What is the algebraic equation that best models your data?
C. What is the meaning (in context) of each constant and coefficient in your equation?
D. Find the correlation coefficient and make a residual plot. How good is your model?
E. Answer any questions from your assigned activity.
F. What are some tips or suggestions for using this activity in the classroom?

Data Collection Investigations
Note: These are a mix of different types of functions.
Overhead Projector
Sitting in class, you have noticed that the image projected onto a screen from an overhead projector gets larger as the overhead projector is moved farther away from the screen.
Question: Is the relationship between the distance an overhead projector is from a screen and the height of the image projected on the screen linear or curved?
Equipment: Overhead projector, transparency with an image in focus, meter stick or ruler to measure the height of the image, tape measure to measure the distance from the projector to the screen.
Data Collection: Place the overhead as close to the screen as possible with the image in focus. Measure the distance from the screen to a fixed point on the projector. Also measure the height of the image on the screen. Move the overhead projector slightly away from the screen, focus the image, and take both measurements again. Repeat this process to collect at least 10 data points.
Analysis: Make a scatter plot of (distance, image height). Describe the relationship.

Pennies
If you take a jar containing a collection of 100 pennies and empty it onto a table, how many pennies would you expect to land heads? If you remove the pennies that show heads, return the remaining pennies to the jar, shake it up and empty the jar again, how many do you expect to land heads? What happens in the long run?
Question: What is the relationship between the number of times you have emptied the jar and the number of pennies that remain after you remove those that show heads?
Equipment: One hundred pennies, jar.
Data Collection: Take a jar containing a collection of 100 pennies, shake the jar to mix the pennies, and empty it onto a table. Remove the pennies that are showing heads and record the number of pennies remaining. Return the remaining pennies to the jar, shake it well, and empty it again. Continue this process until no pennies remain.
Analysis: Make a scatter plot of (# times you empty the jar, # pennies remaining). Describe the relationship.
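Before running the Pennies investigation with real coins, a quick simulation can suggest what to expect. The sketch below assumes each penny is a fair coin that lands heads with probability 0.5, which is an assumption and not something the activity states. The function name simulate_pennies and the fixed seed are illustrative. Under that assumption, roughly half of the pennies should remain after each emptying, which matches the decay form y = a(1 - r)^x with a = 100 and r = 0.5.

```python
import random

def simulate_pennies(start=100, p_heads=0.5, seed=1):
    """Simulate emptying the jar until no pennies remain.

    Returns a list where entry k is the number of pennies left after
    the jar has been emptied k times (entry 0 is the starting count).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    remaining = start
    history = [remaining]
    while remaining > 0:
        # A penny stays in play only if it does NOT land heads.
        remaining = sum(1 for _ in range(remaining) if rng.random() >= p_heads)
        history.append(remaining)
    return history

history = simulate_pennies()
for times_emptied, pennies_left in enumerate(history):
    model = 100 * 0.5 ** times_emptied  # decay model for comparison
    print(times_emptied, pennies_left, round(model, 1))
```

Changing the seed gives a different run, much as different groups in the room would collect slightly different data from the same procedure.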
Bouncing Ball
If you drop a ball from the ceiling of your math classroom, it will bounce higher than if you drop it from desk level. What is the relationship between the height of the drop and the height of the bounce?
Question: How is the height from which a ball is dropped related to the height of its first bounce?
Equipment: Bouncing ball, tape measure, tape.
Data Collection: Tape or hang the tape measure on a wall. Measure the height from which you plan to drop the ball. Drop the ball and measure the height of the first bounce. Error can be minimized by having two or three students sight the rebound height and averaging their results. Repeat this process until you collect at least 10 data points.
Analysis: Make a scatter plot of (drop height, rebound height). Describe the relationship. How high will your ball bounce if it is dropped from a height of 3 meters?
Other Questions to Consider: Do all balls bounce in the same way? You can try this investigation with different types of balls and make a comparison.

Circles
You have learned the relationship between the diameter of a circle and its circumference. Can you use data from circular objects to confirm this result?
Question: How is the circumference of a circle related to its diameter?
Equipment: Empty cans or jar lids, tape measure.
Data Collection: Measure the diameter and circumference of empty cans, jar lids, or other circular objects until you have collected at least 10 data points.
Analysis: Make a scatter plot of (diameter, circumference). Describe the relationship. Is this the relationship that you expected? Use your model to find the circumference of a can with a diameter of 3 centimeters, and compare it to the known result.

Index Card (part 1)
If you are sitting in the second row of a movie theater and someone sits directly in front of you, your view is probably not obstructed.
However, if you are sitting towards the back and the same thing happens, it will be significantly harder to see the movie screen, especially if you are not very tall. The following experiment investigates this issue by using a tape measure in place of the movie screen and an index card in place of the head of the person who is blocking your view.
Question: How does the distance you are away from the wall affect the length of the tape measure that is obscured by an index card?
Equipment: Index card, tape measure, tape.
Data Collection: Attach a tape measure horizontally to a wall. Have a student close one eye and hold an index card at arm's length. Record the student's distance from the wall and the length of the section of the tape measure that is obscured by the card. Have the student take one small step back (about 12 in or 30 cm), close one eye, and again record the distance from the wall and the length of the tape measure that is obscured. Repeat this process until you collect at least 10 data points.

Index Card (part 2)
Analysis: Make a scatter plot of (distance from wall, length of tape measure obscured). Describe the relationship.
Other Questions to Consider: How does the size of the card affect the data and therefore the scatter plot? (Experiment by simply rotating the card 90°. Compare results.) How does the length of the person’s arm affect the data and therefore the scatter plot? (Experiment by having a different person hold the index card. Compare results.)
Pendulum
If you swing a long pendulum, it takes more time to complete one swing than if you swing a short pendulum. This experiment allows you to investigate the relationship between the length of a pendulum and its period.
Question: How does the period of a pendulum depend on its length?
Equipment: Pendulum (constructed by tying a small nut or several washers onto a string at least two meters long), meter stick, stopwatch.
Data Collection: Vary the length of the string by about 20 cm from one trial to the next, and measure how the period of the pendulum (time to complete one swing across and back) changes. To measure the period, students should hold one end of the string stable, pull the weight slightly (about 20°) to one side, let the weight make 10 complete swings, record the time, and then divide by 10. Collect at least 10 data points. Be sure to include some long lengths as well as short lengths.
Analysis: Make a scatter plot of (length, period). Describe the relationship.

Road Map
Road maps provide a legend for computing straight-line distances as well as mileage between points along roads shown on the map. How do these distances compare in your state?
Question: How does the straight-line distance between two cities relate to the shortest travel distance between the cities?
Equipment: State road map, ruler.
Data Collection: Use the ruler to measure the straight-line distance between two cities and convert this distance to miles using the legend on the map. Compute the travel distance by adding the distances along the shortest route between the two cities. Repeat this process to collect at least 10 data points. Be sure to include a variety of distances.
Analysis: Make a scatter plot of (straight-line distance, travel distance). Describe the relationship. How many miles would you expect to travel between two cities that are exactly six inches apart on the map?
Other Questions to Consider: How do you think the scatter plot for Nevada would compare to the scatter plot for West Virginia? Is the relationship observed in your scatter plot the same for all states?