Section 12.1 Scatter Plots and Correlation
With the quality added value you’ve come to expect from D.R.S., University of Cordele
HAWKES LEARNING SYSTEMS Students Matter. Success Counts. Copyright © 2013 by Hawkes Learning Systems/Quant Systems, Inc. All rights reserved.
HAWKES LEARNING SYSTEMS Regression, Inference, and Model Building math courseware specialists

Types of Relationships
Plot the (x, y) data points and think about whether x and y are somehow related.
[Figure: four example scatter plots]
• Strong Linear Relationship
• Weak Linear Relationship
• Non-Linear Relationship
• No Relationship

Example 12.3: Determining Whether a Scatter Plot Would Have a Positive Slope, Negative Slope, or Not Follow a Straight-Line Pattern
Determine whether the points in a scatter plot for the two variables are likely to have a positive slope, negative slope, or not follow a straight-line pattern.
a. The number of hours you study for an exam and the score you make on that exam _________________
b. The price of a used car and the number of miles on the odometer _________________
c. The pressure on a gas pedal and the speed of the car _________________
d. Shoe size and IQ for adults _________________

Scatter Plots and Correlation
The Pearson correlation coefficient, ρ, is the parameter that measures the strength of a linear relationship between two quantitative variables in a population.
ρ is the Greek letter “rho”. Practice writing the rho character here:
The correlation coefficient for a sample is denoted by r. It always takes a value between −1 and 1, inclusive: −1 ≤ r ≤ 1.
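To make the range of r concrete, here is a minimal Python sketch that computes the sample correlation coefficient with the standard computational formula. The function name `pearson_r` and the two tiny data sets are illustrative assumptions, not part of the course materials; in class the calculator does this work.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient for paired data."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 pairs")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("r is undefined when one variable never changes")
    return numerator / denominator

# Hypothetical data: a perfectly rising line and a perfectly falling line.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]
falling = [10, 8, 6, 4, 2]

print(pearson_r(x, rising))   # lands at the upper end of the range
print(pearson_r(x, falling))  # lands at the lower end of the range
```

Real data rarely sits exactly on a line, so in practice r falls strictly between the two extremes.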
Population parameter ρ, Sample statistic r
ρ (Greek letter rho) is the population parameter for the Correlation Coefficient.
r is the sample statistic for the Correlation Coefficient.

$$r = \frac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{n\sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n\sum y_i^2 - \left(\sum y_i\right)^2}}$$

We use our sample r to estimate the population’s ρ, just as in other experiments we used our sample mean x̄ to estimate the population’s mean, μ.

Scatter Plots and Correlation
• –1 ≤ r ≤ 1
• Close to –1 means a strong negative correlation.
• Close to 0 means little or no linear correlation.
• Close to 1 means a strong positive correlation.

Some of these examples are based on the data set of 2015 Major League Baseball statistics as listed on another handout. The easiest way to get this data is to have the five lists of data loaded onto your calculator from another source. The hardest way is to type in the data yourself: 5 lists x 30 teams = 150 data values.

Is there a correlation between Team Payroll and Games Won?
• Scatter Plot
2ND STAT PLOT (on Y= key, top left)
L1 is payroll in $millions
L2 is games won
ZOOM 9:ZStat. Do the dots seem to line up in a straight-line pattern? { Yes No }

Is there a correlation between Team Payroll and Games Won?
• Scatter Plot
• Correlation Coefficient
2ND STAT PLOT (on Y= key, top left)
STAT, TESTS, ALPHA F for LinRegTTest (ALPHA E on TI-83/Plus)
L1 is payroll in $millions
L2 is games won
ZOOM 9:ZStat. Do the dots seem to line up in a straight-line pattern?
{ Yes No }
Again: L1 is payroll in $millions, L2 is games won
VARS, Y-VARS, 1, 1 to put the Y1 into RegEQ
The correlation coefficient is r = _________, which seems { strong, weak }

Is there a correlation between Games Won and Attendance?
• Scatter Plot
• Correlation Coefficient
2ND STAT PLOT (on Y= key, top left)
STAT, TESTS, ALPHA F for LinRegTTest (ALPHA E on TI-83/Plus)
Which two lists do you use? L___ is Games Won, L___ is Attendance
ZOOM 9:ZStat. Do the dots seem to line up in a straight-line pattern? { Yes No }
VARS, Y-VARS, 1, 1 to put the Y1 into RegEQ
The correlation coefficient is r = _________, which seems { strong, weak }

LinRegTTest Inputs
Here are the inputs:
• Xlist and Ylist: the two data lists of interest
• Freq: 1 (unless…)
• β & ρ: ≠0
– This is the Alternative Hypothesis. Always ≠0 for M2205 Ch. 12.
• RegEq: VARS, right arrow to Y-VARS, 1, 1
– Just put it in
– It will be used later
• Highlight “Calculate”
• Press ENTER

LinRegTTest Outputs, first screen
• t = the t test statistic value for this test (the formula is coming soon)
• p = the p-value for this t test statistic
• df = n – 2 in this kind of a test
• a: later, for Regression

LinRegTTest Outputs, second screen
• b: later, for Regression
• s: much later, for advanced Regression
• r² = how much of the output variable (weight) is explained by the input variable (girth)
• r = the correlation coefficient for the sample
– Close to +1? Strong positive relationship
– Or −1? Strong negative relationship

Correlation does not imply Causation!
If there seems to be a Correlation, it doesn’t necessarily mean that changes in one variable cause changes in the other variable.
1. There might be a lurking variable that affects both.
2. Or the two might be completely unrelated. The mathematical indication of a strong correlation is merely coincidental.
Extreme examples can be seen at the Spurious Correlations web site (www.tylervigen.com)
Testing the Correlation Coefficient for Significance Using Hypothesis Testing
(Now they’re getting into the Hypothesis Testing we saw a brief preview of earlier in this set of slides.)

Testing Linear Relationships for Significance
Significant Linear Relationship (Two-Tailed Test). This is the one we use the most.
H0: ρ = 0 (Implies that there is no significant linear relationship)
Ha: ρ ≠ 0 (Implies that there is a significant linear relationship)

Testing Linear Relationships for Significance (cont.)
Significant Negative Linear Relationship (Left-Tailed Test). Be aware that this one exists.
H0: ρ ≥ 0 (Implies that there is no significant negative linear relationship)
Ha: ρ < 0 (Implies that there is a significant negative linear relationship)

Testing Linear Relationships for Significance (cont.)
Significant Positive Linear Relationship (Right-Tailed Test). Be aware that this one exists.
H0: ρ ≤ 0 (Implies that there is no significant positive linear relationship)
Ha: ρ > 0 (Implies that there is a significant positive linear relationship)

Test Statistic for a Hypothesis Test for a Correlation Coefficient
The test statistic for testing the significance of the correlation coefficient is given by

$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$

where r is the sample correlation coefficient and n is the number of data pairs in the sample. (TI-84 LinRegTTest will calculate this value for us.)
The number of degrees of freedom for the t-distribution of the test statistic is given by n − 2.
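As a quick check on the test statistic formula, the following minimal Python sketch turns a sample r and a count of data pairs n into t and its degrees of freedom. The function name `t_statistic` and the values r = 0.5, n = 27 are made up for illustration; LinRegTTest reports the same quantity from the raw lists.

```python
import math

def t_statistic(r, n):
    """Return (t, df) where t = r / sqrt((1 - r^2) / (n - 2)) and df = n - 2."""
    if n <= 2:
        raise ValueError("need more than 2 data pairs")
    if not -1 < r < 1:
        raise ValueError("r must be strictly between -1 and 1")
    df = n - 2
    t = r / math.sqrt((1 - r ** 2) / df)
    return t, df

# Hypothetical sample: r = 0.5 from n = 27 data pairs.
t, df = t_statistic(0.5, 27)
print(f"t = {t:.3f}, df = {df}")  # prints: t = 2.887, df = 25
```

Note that the same r produces a larger |t| as n grows, which is why a moderate correlation can be significant in a large sample but not in a small one.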
Rejection Regions for Testing Linear Relationships
Significant Linear Relationship (Two-Tailed Test): Reject the null hypothesis, H0, if $|t| \ge t_{\alpha/2}$.
Significant Negative Linear Relationship (Left-Tailed Test): Reject the null hypothesis, H0, if $t \le -t_{\alpha}$.
Significant Positive Linear Relationship (Right-Tailed Test): Reject the null hypothesis, H0, if $t \ge t_{\alpha}$.
But we will use the p-value method because LinRegTTest gives us a p-value and the experiment specifies the α (alpha).

Hypothesis Test for significant r
Null Hypothesis: ρ = 0 “No relationship”
Alternative: ρ ≠ 0 “There is a significant relationship!”
There’s some α level of significance specified in advance, like α = .01 or α = .05
A t value is calculated. Then “what is the p-value of this t?” (Area beyond t, is it a small probability?)
And if p-value < α, reject the null hypothesis
– If so, then we say “Yes, significant relationship!”
Disregard most of the by-hand detail that is in the online Help.

Example 12.7: Performing a Hypothesis Test to Determine if the Linear Relationship between Two Variables Is Significant
Use a hypothesis test to determine if the linear relationship between the number of parking tickets a student receives during a semester and his or her GPA during the same semester is statistically significant at the 0.05 level of significance. Refer to the data presented in the following table.

GPA and Number of Parking Tickets
Number of Tickets   0    0    0    0    1    1    1    2    2    2    3    3    5    7    8
GPA                 3.6  3.9  2.4  3.1  3.5  4.0  3.6  2.8  3.0  2.2  3.9  3.1  2.1  2.8  1.7
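For anyone who wants to cross-check the calculator on this data set, here is a rough Python sketch of the p-value method applied to the table above. It is not how the course does it (the course uses LinRegTTest). The helper names are hypothetical, r is computed with the equivalent deviations-from-the-mean form, and the two-tailed p-value is approximated by numerically integrating the t density with Simpson's rule, which is an assumption made to keep the sketch free of outside libraries.

```python
import math

# Data from the Example 12.7 table.
tickets = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 5, 7, 8]
gpa = [3.6, 3.9, 2.4, 3.1, 3.5, 4.0, 3.6, 2.8, 3.0, 2.2, 3.9, 3.1, 2.1, 2.8, 1.7]
alpha = 0.05

def pearson_r(xs, ys):
    """Sample correlation via deviations from the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Area beyond |t| in both tails, using Simpson's rule on [0, |t|]."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    return 1 - 2 * area_zero_to_t

n = len(tickets)
df = n - 2
r = pearson_r(tickets, gpa)
t = r / math.sqrt((1 - r ** 2) / df)
p_value = two_tailed_p(t, df)

print(f"r = {r:.3f}, t = {t:.3f}, df = {df}, p = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

The printed values should agree with the LinRegTTest output up to rounding; if they do not, check the list entries first.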
Example 12.7: Performing a Hypothesis Test to Determine if the Linear Relationship between Two Variables Is Significant (cont.)
Solution
Step 1: State the null and alternative hypotheses.
We wish to test the claim that a significant linear relationship exists between the number of parking tickets a student receives during a semester and his or her GPA during the same semester. Thus, the hypotheses are stated as follows.
H0: ρ = 0 (Population Correl. Coeff. = 0: No correlation.)
Ha: ρ ≠ 0 (Population Correl. Coeff. ≠ 0: Yes, correlation.)

Step 2: Determine which distribution to use for the test statistic, and state the level of significance.
We will use the t-test statistic presented previously in this section along with a significance level of α = 0.05 to perform this hypothesis test.

Step 3: Gather data and calculate the necessary sample statistics. (Do LinRegTTest)

Example 12.7 Hypothesis Test, concluded
Compare p = _____ vs. α = ______
Decision: { Reject / Fail to Reject } the Null Hypothesis.
Conclusion about Significant Linear Relationship:
Conclusion in Plain English:

Example 12.8: Performing a Hypothesis Test to Determine if the Linear Relationship between Two Variables Is Significant
An online retailer wants to research the effectiveness of its mail-out catalogs. The company collects data from its eight largest markets with respect to the number of catalogs (in thousands) that were mailed out one fiscal year versus sales (in thousands of dollars) for that year. The results are as follows.
Number of Catalogs Mailed and Sales
Number of Catalogs (in Thousands)   2     3     3     3     4     4     5     6
Sales (in Thousands)                $126  $98   $255  $394  $107  $122  $334  $403

Example 12.8: Performing a Hypothesis Test to Determine if the Linear Relationship between Two Variables Is Significant (cont.)
Use a hypothesis test to determine if the linear relationship between the number of catalogs mailed out and sales is statistically significant at the 0.01 level of significance.
Step 1: Hypotheses:
H0: ___________ meaning _____________________.
Ha: ___________ meaning _____________________.
Step 2: Decision to use the t distribution and level of significance _____ = 0.01
Step 3: Gather data and calculate the necessary sample statistics.
Using a TI-83/84 Plus calculator, enter the values for the numbers of catalogs mailed (x) in L1 and the sales values (y) in L2. Run LinRegTTest.
Step 4: Conclusion: { Reject / Fail to Reject } the Null Hypothesis.
Interpretation:

Coefficient of Determination
The coefficient of determination, r², is a measure of the proportion of the variation in the response variable (y) that can be associated with the variation in the explanatory variable (x).
This too is reported to you in the LinRegTTest outputs.
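The definition of r² can be illustrated with a tiny Python sketch. The function name `coefficient_of_determination` and the sample r values are hypothetical, chosen only to show that squaring discards the sign of r, so a positive and a negative correlation of the same strength are associated with the same share of the variation in y.

```python
def coefficient_of_determination(r):
    """Square the sample correlation coefficient to get r^2."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1, inclusive")
    return r ** 2

# Hypothetical correlation coefficients, not taken from the examples.
for r in (0.8, -0.8, 0.3):
    r2 = coefficient_of_determination(r)
    print(f"r = {r:+.1f}  ->  r^2 = {r2:.2f}  ({r2:.0%} of the variation in y)")
```

A fairly strong-looking r can still leave much of the variation unaccounted for, which is one reason LinRegTTest reports r and r² side by side.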
Example 12.9: Calculating and Interpreting the Coefficient of Determination
If the correlation coefficient for the relationship between the numbers of rooms in houses and their prices is r = 0.65, how much of the variation in house prices can be associated with the variation in the numbers of rooms in the houses?
Solution
Recall that the coefficient of determination tells us the amount of variation in the response variable (house price) that is associated with the variation in the explanatory variable (number of rooms).
Thus, the coefficient of determination for the relationship between the numbers of rooms in houses and their prices will tell us the proportion or percentage of the variation in house prices that can be associated with the variation in the numbers of rooms in the houses. Also, recall that the coefficient of determination is equal to the square of the correlation coefficient.
Since we know that the correlation coefficient for these data is r = 0.65, we can calculate the coefficient of determination as
r² = _____
Thus, approximately _____% of the variation in house prices can be associated with the variation in the numbers of rooms in the houses.

Correlation Coefficient in Excel

More with Excel
That’s about all that can be done with basic Excel.
There is an advanced feature on the Data tab: the Data Analysis add-in. It gets into the Regression topic in the next lesson.