Chapter 10
Exploring Relationships Between Numerical Variables

Objectives
SWBAT:
1) Create and examine scatterplots of data
2) Describe the direction, form, and strength of an association
3) Use correlation to measure the strength of a linear association
4) Test for correlation

• In recent years, golfers have been driving the golf ball farther than before, partly due to equipment and partly due to the golfers themselves.
• However, trying to hit the ball as far as possible off the tee can result in less accuracy and a more challenging second shot played out of the rough.
• Each time a golfer tees off, they need to decide whether to hit the ball very far, which would leave a shorter and easier second shot but risk missing the fairway, or to ease up a bit and make sure to hit the ball down the middle even though it won’t go as far.
• In order to examine situations like the one on the previous slide, we need to expand beyond what we have learned so far, which has focused on only one variable at a time, such as home runs, rushing yards, or points per game.
• We will now begin to investigate the relationships between two numerical variables.
• To begin analyzing this situation, we’ll investigate the association between average driving distance and driving accuracy for the top 10 women golfers in the LPGA in 2009.
• Average driving distance is our explanatory variable and driving accuracy is our response variable.
• We will see if average driving distance can help explain accuracy.
• In other words, we want to see if the explanatory variable can help predict the value of the response variable.
• In order to display the relationship between two numerical variables, we will create a scatterplot (this is our only option).
• Some things to keep in mind when making scatterplots:
– 1) It is essential to include labels and clear, consistent scales on each axis.
– 2) Neither axis needs to start at 0.
– 3) The explanatory variable is plotted on the horizontal axis and the response variable is plotted on the vertical axis. (Note: the response variable is usually the one we are more interested in, either because we want to predict the response for a particular value of the explanatory variable or because we want to use the explanatory variable to explain changes in the response variable.)

Steps to Make a Scatterplot
1) Label and scale the horizontal (explanatory) axis and vertical (response) axis.
2) Draw a dot for each player to represent the ordered pair.
3) Finish the scatterplot by drawing dots for the remaining players.

Making a Scatterplot Using the TI-84
1) Enter the average driving distance values in L1 and the driving accuracy values in L2.
2) Open STAT PLOT by pressing 2nd Y=. Turn on Plot1 (make sure all other plots are off). Choose the first graph type, and enter L1 for Xlist and L2 for Ylist.
3) Press ZOOM, select 9: ZoomStat, and press ENTER to see the scatterplot in a nice window.
4) To see the values used by the graphing calculator for the scales, press WINDOW. On this screen, you can change the starting, ending, or increment values to something more convenient and then regraph.

Describing the Association Between Variables
• After constructing a scatterplot, there are several important characteristics of the association between the variables that should be considered:
1. the direction of the association
2. the form of the association
3. the strength of the association

Describing the Direction of an Association
• Here is a scatterplot showing the average driving distance and driving accuracy for the top 146 money winners on the LPGA tour in 2009.
• Based on the scatterplot, we can see there is a negative association between average driving distance and driving accuracy, meaning that players who drive the ball farther typically hit a smaller percentage of fairways; those who don’t hit the ball as far typically hit a higher percentage of fairways.
• Here is the relationship between a golfer’s average driving distance and her percentage of greens in regulation (GIR). A green in regulation means you hit the ball onto the putting surface (the green) in at least two shots under par.
• There is a positive association between average driving distance and GIR. Typically, players who drive the ball farther have a better chance to land on the putting surface in at least two shots under par.
• It is important to remember that a positive or negative association describes the overall tendency of the data, NOT an absolute relationship.
• Example: Even though players who drive the ball farther typically hit a higher percentage of greens, this is not always true.
• Lee drives the ball farther than Creamer, but Lee hits a lower percentage of GIR. This is the opposite of what we’d expect.
• Sometimes, two variables have no association.
• The scatterplot to the right shows the relationship between driving accuracy and putting average.
• Also included are the mean putting average and mean driving accuracy.
• In this case, knowing a golfer’s driving accuracy tells us nothing about how many putts she will average per round. Therefore, we can say there is no association between driving accuracy and putting average.
• To recap: if two variables have an association, then knowing the value of one variable will help you predict the value of the other variable. However, if there is no association, then knowing the value of one will not help you predict the value of the other.

Describing the Form of an Association
• The form of an association can be either linear or nonlinear.
• If the association is linear, then a line would be a reasonable way to model the overall relationship. But if an association is nonlinear, then a line won’t match the pattern of the scatterplot very well.
• Examples:

Describing the Strength of an Association
• The strength of an association describes the amount of scatter there is from the overall form of the data. In other words, it describes how closely the points on the scatterplot conform to the linear (or nonlinear) form.
• In a strong association, there isn’t much scatter and predictions of the response variable will be fairly precise.
• Examples of positive, linear associations with different amounts of strength:
• The same range of strengths also applies to negative linear associations.
• Here is an example of a strong negative linear association:
• Describe the strength of these associations:
– Strong association
– Very strong association
– Moderate association
• Let’s return to the initial question: is it better to drive the ball long or to drive it straight?
• Let’s look at two different scatterplots: one comparing average driving distance to scoring average, and one comparing driving accuracy to scoring average.
• Negative relationship: golfers who drive the ball farther tend to have lower scores (this is a good thing in golf!)
• Negative relationship: golfers who hit the fairway more often tend to have lower scores (again, lower scores are good!)
• Both scatterplots show negative associations, but which one shows the stronger relationship? In other words, which explanatory variable, average driving distance or driving accuracy, is a more reliable predictor of scoring average?
• Neither association is strong, but average driving distance as an explanatory variable conforms more closely to a linear pattern than driving accuracy does.
• Therefore, low scores in golf tend to be more strongly associated with average driving distance than with driving accuracy.

Measuring Strength: Correlation
• Judging the strength of a relationship is difficult to do simply by looking at a scatterplot.
• What might seem strong to one person might seem moderate to another.
• There is a numerical way to measure the strength of a linear association in a scatterplot, and that is called correlation.
• The correlation (r) is a measure of the strength and direction of a linear association between two numerical variables.
• The TI-84 will calculate correlation for us, so we’ll start by talking about some of the properties of correlation.
• Unfortunately, Park Place and Boardwalk are not part of correlation’s properties.

Properties of the Correlation (r)
5) The value of r has no units and does not depend on the units used to measure the variables. This makes it an ideal way to compare the strengths of the associations between different sets of variables. It also means that changing the units of the explanatory or response variable won’t affect the correlation.

To give you a visual, here are some examples of different correlations, using data from the 32 NFL teams in the 2008 regular season.
• Now it’s time for everyone’s favorite game... Guess the Correlation!!!
• http://www.rossmanchance.com/applets/

Something to be aware of: Correlation and association are NOT synonyms.
– Association is a more general word to describe the relationship between any two variables, whether numerical or categorical.
– Correlation is a specific measure of the strength and direction of a linear association between two numerical variables.

Calculating Correlation on the TI-84
1) Turn on the diagnostic feature. Press 2nd 0 to enter the CATALOG. Scroll down to DiagnosticOn. Press ENTER twice so it says Done on the home screen.
2) Enter the top 10 LPGA data in L1 and L2.
3) Press the STAT button, move to the CALC menu, and scroll down to select number 8: LinReg (a+bx). After choosing it, enter L1, L2 and press ENTER. The last line of the output gives the value of r.
In this case the correlation is r = -0.905.
• An applet on the book website also calculates correlation. It is appropriately titled “Correlation and Regression.”
• Book site
• Let’s come back to this comparison, in which we initially said average driving distance seems to be a better predictor of scoring average than driving accuracy. We made this statement based only on looking at the scatterplots.
• It turns out that the correlation between average driving distance and scoring average is r = -0.47, compared to r = -0.23 for driving accuracy and scoring average.
• This result confirms our initial suspicion.

FYI: Statistics 101
• In a more traditional statistics course, you may be asked to calculate the correlation by hand. Here is the formula: $r = \dfrac{\sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)}{n - 1}$
• The terms being multiplied in the numerator are the standardized scores for the x variable and the y variable (the products are then summed).
• Be cautioned: Correlation does NOT imply cause-and-effect.
• Just think of Happy Gilmore. He was able to crush the ball off the tee, but he still finished in last place early in the movie because other facets of his game were not polished. So even though there is a correlation between average driving distance and scoring average, there is no guarantee that increasing driving distance will result in lower scores.

Influential Points
• What effect do you think unusual observations have on correlation?
• Much like the mean and standard deviation, correlation is not a resistant measure, and as such it can be strongly affected by outliers. On a scatterplot, an outlier might be a point that seems out of place. Here is a scatterplot showing the number of stolen bases and home runs for the nine primary offensive players for the 2009 Boston Red Sox.
• The points on the left side of the graph seem to have a positive association, but Jacoby Ellsbury’s unusual value makes the overall relationship between stolen bases and home runs look negative.
In fact, the correlation here is r = -0.35.
• What would happen to the correlation if Ellsbury were omitted?
• Here is the scatterplot with Ellsbury omitted. The correlation is now r = 0.21.
• Be careful: It’s not a good idea to remove observations from a data set without a good reason.
• If a value was recorded incorrectly, correct it and keep it in the data set. However, if you don’t know the correct value, then remove it.
• Any observations that do not belong, such as rows that are totals or averages of all the observations, should be excluded.

Testing the Correlation
• In sports, teams play exhibition games before the regular season begins.
• Some people think a team’s PERFORMANCE in exhibition games is a good predictor of how well that team will do in the regular season. Other people think PERFORMANCE in exhibition games tells us nothing about the future.
• Here is a scatterplot showing the 2009 winning percentages (WP) for the 30 MLB teams during spring training and during the regular season.
• The correlation is r = 0.45. There seems to be a moderate, positive, linear association between spring training winning percentage and regular season winning percentage.
• Let’s keep in mind this is based only on PERFORMANCES for one season, so we need to account for RANDOM CHANCE before making any firm conclusions. It is possible that there really is no association.
• Some terminology:
– The true correlation between two variables (much like ABILITY) exists only in theory. To find the true correlation between spring training winning percentage and regular season winning percentage, we would need to repeat the 2009 spring training and regular season for each team millions and millions of times and then find the correlation.
– The observed correlation between two variables (much like PERFORMANCE) is based on a limited amount of data, such as one season.
Because it is based on a limited amount of data, the observed correlation will vary from the true correlation due to RANDOM CHANCE.
• To see if the observed data provide convincing evidence that there really is a positive association between winning percentage in spring training and winning percentage in the regular season, we will test the following hypotheses using the correlation as our test statistic.
– Null hypothesis: the true correlation is 0 (no association).
– Alternative hypothesis: the true correlation is greater than 0 (a positive association).

To simulate:
– Get 30 notecards (for the 30 teams)
– Write each of the 30 regular season winning percentages on the notecards
– Shuffle
– Randomly pair one of the regular season winning percentage cards with each of the spring training winning percentages
– Calculate the correlation

• Here are the results for 100 trials of the simulation:
• What is our p-value? 0%. None of the 100 simulated correlations was at least as large as the observed r = 0.45.
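
The notecard simulation above can also be sketched in code. The following is a minimal Python sketch, not the method used to produce the slide results. The function names `correlation` and `shuffle_test` are made up for illustration, and the six pairs of winning percentages are hypothetical values, NOT the 2009 MLB data. The correlation is computed as the sum of products of standardized scores divided by n - 1, and the p-value is estimated as the fraction of shuffles whose correlation is at least as large as the observed one.

```python
import math
import random


def correlation(xs, ys):
    """Correlation r: sum of products of standardized scores, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations (divide by n - 1).
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    total = sum(
        ((x - mean_x) / sd_x) * ((y - mean_y) / sd_y) for x, y in zip(xs, ys)
    )
    return total / (n - 1)


def shuffle_test(xs, ys, trials=100, seed=0):
    """Estimate a one-sided p-value by re-pairing the y values at random."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = correlation(xs, ys)
    cards = list(ys)  # the "notecards" holding the response values
    at_least_as_large = 0
    for _ in range(trials):
        rng.shuffle(cards)  # shuffle, then pair cards with xs in order
        if correlation(xs, cards) >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / trials


# Hypothetical winning percentages for six made-up teams (NOT real MLB data).
spring = [0.40, 0.45, 0.50, 0.55, 0.60, 0.65]
regular = [0.42, 0.47, 0.44, 0.55, 0.52, 0.60]

observed_r, p_value = shuffle_test(spring, regular, trials=1000)
print(f"observed r = {observed_r:.2f}, estimated p-value = {p_value:.3f}")
```

With the real data, `spring` and `regular` would each hold 30 values, one per team. A small estimated p-value means that shuffled pairings rarely produce a correlation as large as the observed one, which is the same reasoning as the notecard version.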