Chapter 3 Review: Two-Variable Statistics
Veronica Wright, Christy Treekhem, River Brooks

The Big Idea
• This chapter explains how scatterplots represent the relationship between two quantitative variables. A scatterplot makes it easy to see the direction, form, and strength of a relationship and to pick out outliers. Residual plots are a further tool for judging how the variables interact and whether a straight line suits the data. The LSRL can be used to predict results from the data, and the correlation coefficient r and the coefficient of determination r^2 measure how well the line fits, and therefore how far those predictions can be trusted.
• These ideas are used throughout statistics and well beyond it. Anyone reading a scatterplot is already applying several of them informally, most obviously when judging how a trend will continue based on the data collected so far. Economics, politics, manufacturing, and even sports all rely on this kind of reasoning.

Important Vocabulary
• Direction – the overall trend of the data on a scatterplot: positive if the response tends to increase as the explanatory variable increases, negative if it tends to decrease
• Scatterplot – a graph that shows the relationship between two quantitative variables measured on the same individuals
• Response variable – a variable that measures the outcome of a study, i.e. the dependent variable
• Explanatory variable – a variable that may explain or influence changes in the response variable, i.e. the independent variable
• Form – the shape that the data resembles when displayed on a scatterplot, e.g. linear, curved, exponential
• Strength – how closely the data points follow the form
• Outlier – a data point that falls well outside the overall pattern of the other points on a scatterplot
• Correlation coefficient – a measure of the direction and strength of the linear relationship between two quantitative variables, usually written r; it always lies between -1 and 1
• Regression line – a line that describes how the response variable changes when the explanatory variable changes
• Extrapolation – using the regression line to predict results beyond the range of the actual data; such predictions are often unreliable
• LSRL, the least-squares regression line – the line that makes the sum of the squared vertical distances from the data points to the line as small as possible: ŷ = a + bx
• Residual – the difference between an observed response and the response the regression line predicts for that point: residual = y - ŷ
• Residual plot – a scatterplot of the residuals against the explanatory variable
• Coefficient of determination – the fraction of the variability in the response variable that is accounted for by the LSRL; the closer it is to 1, the better the LSRL represents the data. It is usually written r^2 and always lies between 0 and 1
• Lurking variable – a variable other than the response and explanatory variables that may influence the relationship between them

Key Topics Covered in the Chapter
• How to graph and determine the relationship between independent (explanatory) and dependent (response) variables
• Correlation – how to find it, and what it means
• Regression line (best fit, LSRL) – how to find it, what it means, and how well it fits the data

Formulas You Ought to Know
• The regression line formula (LSRL):
  – ŷ = a + bx
  – With ŷ being the predicted response, a being the y-intercept, b being the slope, and x being the explanatory variable.
• The formula for the mean:
  – x̄ = (x1 + x2 + x3 + ... + xn)/n
• The formula for standard deviation:
  – s_x = sqrt( [ (x1 - x̄)^2 + (x2 - x̄)^2 + ... + (xn - x̄)^2 ] / (n - 1) )
• The formula for r (correlation coefficient):
  – r = [1/(n - 1)] × sum over all points of [ (x_i - x̄)/s_x ] × [ (y_i - ȳ)/s_y ]

Calculator Key Strokes
• In this unit, we use the calculator to find Sx, Sy, the mean of x, the mean of y, r, r^2, and the LSRL, and to graph the scatterplot and the residual plot.
• To find r^2, r, and the LSRL, do the following: [key sequence image not reproduced]
  – Enter the data sets into L1 and L2.
  – Insert your lists in the order (Explanatory List, Response List).
• To find Sx, Sy, the mean of x, or the mean of y, do all of the above, except press “2” instead of “8”.
  – Enter the data sets into L1 and L2.
  – Insert your lists in the order (Explanatory List, Response List).
• To plot the scatterplot, do the following: [key sequence image not reproduced]
  – Enter the data set.
  – Make sure the plot is On.
  – Choose the scatterplot.
• To find the residual plot, do all of the above, except change “Ylist” to “Resid”.
  – If you cannot find RESID in your Statlist, do the following: [key sequence image not reproduced] (scroll down to DiagnosticOn). It should then work, but MAKE SURE that you have already calculated the LSRL.

Example Problems
• A study shows that there is a positive correlation between the size of a hospital and the median number of days patients remain in the hospital. Does this mean you can shorten a stay by choosing a small hospital? Explain.
• No, correlation is not causation. Patients with minor conditions are more likely to be treated at small hospitals, while more serious cases tend to go to large ones, so the seriousness of the condition is a lurking variable that affects both hospital size and length of stay.
• The Standard and Poor's 500 index is an average of the prices of 500 stocks. There is a moderately strong correlation (r equals approximately 0.6) between how much this index changes in January and how much it changes during the entire year. If we looked instead at data on all 500 individual stocks, we would find a very different correlation. Would the correlation be higher or lower? Why?
• The correlation would be lower. Individual stocks are much more variable than an average of 500 stocks, and that extra variation weakens the relationship.
• A study of elementary school children ages 6-11 finds a high positive correlation between shoe size x and score y on a test of reading comprehension. What explains this correlation?
• Age is a lurking variable. Both shoe size and reading comprehension tend to increase with age.
• A college newspaper interviews a psychologist about student ratings of the teaching of faculty members. The psychologist says, “The evidence indicates that the correlation between research productivity and teaching rating of faculty members is close to zero.” The paper reports this as “Professor McDaniel said that good researchers tend to be poor teachers, and vice versa.” Explain why this is wrong, and explain the psychologist's meaning.
• Professor McDaniel did not say that good researchers make poor teachers; that would be a negative correlation. A correlation close to zero means there is little or no linear relationship between research productivity and teaching rating, so knowing one tells you almost nothing about the other.
• Explain why this is wrong: “There is a high correlation between gender of American workers and their income.”
• Gender is categorical, not quantitative, and correlation is defined only for two quantitative variables.
• Explain the error: “We found a high correlation (r=1.09) between students' ratings of teaching and ratings made by other faculty members.”
• r must be between -1 and 1, so a value of 1.09 is impossible.

Helpful Hints
• Some people can’t find RESID to get the residual plot, or get an error when plotting it. That’s because you need to calculate the LSRL first.
• If you can’t see anything when you plot your scatterplot, press Zoom -> 9.
• If the residual plot shows any clear pattern, a straight line is not the right model. A different model, perhaps a power or exponential one if the pattern is curved, would suit the data better.
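
Checking the Formulas Without a Calculator
• The formulas and vocabulary in this review can also be checked away from the calculator. The following is a minimal Python sketch, not part of the original slides. It computes the means, Sx and Sy, r, the LSRL, the residuals, and r^2 for a small made-up data set. The data (hours, scores) and every function name are hypothetical, and the slope and intercept come from the standard identities b = r(Sy/Sx) and a = ȳ - b·x̄, which the slides do not state.

```python
import math


def mean(values):
    """Arithmetic mean: (x1 + x2 + ... + xn) / n."""
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divides by n - 1), the calculator's Sx."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """Correlation coefficient r: average product of standardized values."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = sample_sd(xs), sample_sd(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (n - 1)


def lsrl(xs, ys):
    """Return (a, b) for y-hat = a + b*x using b = r*Sy/Sx, a = y-bar - b*x-bar."""
    b = correlation(xs, ys) * sample_sd(ys) / sample_sd(xs)
    a = mean(ys) - b * mean(xs)
    return a, b


def residuals(xs, ys, a, b):
    """Residual = observed y minus predicted y-hat, one per data point."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Made-up illustration data (hypothetical, not from the chapter).
hours = [1, 2, 3, 4, 5]    # explanatory variable
scores = [2, 4, 5, 4, 5]   # response variable

a, b = lsrl(hours, scores)
r = correlation(hours, scores)
resid = residuals(hours, scores, a, b)

print(f"LSRL: y-hat = {a:.2f} + {b:.2f}x")
print(f"r = {r:.3f}, r^2 = {r ** 2:.3f}")
print("residuals:", [round(e, 2) for e in resid])
```

• For this made-up data the sketch gives ŷ = 2.20 + 0.60x with r of about 0.775 and r^2 = 0.6. The residuals add up to zero, as they always do for an LSRL. Plotting them against hours and looking for a pattern is the same check the residual plot performs on the calculator.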