Chapter 4
Scatterplots and Correlation
Slide 1

Section 4.1 Scatter Diagrams and Correlation
• Scatterplots
• Linear Correlation Coefficient
Slide 2

Association between two variables: Size of diamond and price of ring
The source of the data is a full-page advertisement placed in the Straits Times newspaper issue of February 29, 1992, by a Singapore-based retailer of diamond jewelry. The variables are the size of the diamond in carats (1 carat = 0.2 gram) and the price of ladies’ rings (single diamond stone) in Singapore dollars.

Carats   Singapore dollars
0.20     495
0.16     328
0.17     350
0.19     385
0.25     642
…        …

How would you describe the association between the two variables?
Slide 3

SCATTERPLOT: Diamond rings data
N = 48

Variable            Average    s.d.      Min     Max
X  Carat            0.20       0.056     0.12    0.35
Y  Price in US $    865.144    213.64    385     1879

Figure: Diamond carats vs Price in US$ (x-axis: Carat; y-axis: Price in US dollars)
Slide 4

Terminology
Response variable (dependent variable): measures the outcome of the study.
Explanatory variable (independent variable): explains or causes changes in the response variable.
Example: Carat = explanatory variable, Price = response variable.
Slide 5

Slide 6

EXAMPLE Interpreting a Scatter Diagram
The data are based on a study of rock drilling. The researchers wanted to determine whether the time it takes to dry drill a distance of 5 feet in rock increases with the depth at which the drilling begins.
• Depth, x, is the explanatory variable.
• Time, y (in minutes), to drill five feet is the response variable.
Draw a scatter diagram of the data.
Source: Penner, R., and Watts, D.G. “Mining Information.” The American Statistician, Vol. 45, No. 1, Feb. 1991, p. 6.
Slide 7

Slide 8

Interpreting scatter plots
1. Look for the overall pattern and for striking deviations.
2. Describe the form, direction and strength of the relationship:
   a. Form: roughly linear if the points follow a straight line, or nonlinear…
   b. Direction: positive or negative?
   c. Strength: how closely the points follow a clear form.
3. Check for the presence of outliers, individual values that fall outside the overall pattern.
4. Two variables are positively (negatively) associated if an increase in one variable corresponds to an increase (decrease) in the other variable.
Slide 9

Various Types of Relations in a Scatter Diagram
Slide 10

Example: 2000 Presidential Elections
Did the butterfly ballots confuse voters? Did voters for Al Gore instead cast their votes for other candidates?
Bush spokesman Ari Fleischer stated on Nov. 9, 2000 that "Palm Beach County is a Pat Buchanan stronghold and that's why Pat Buchanan received 3,407 votes there."
What is the level of support that Pat Buchanan enjoys in Palm Beach County?
The published election results show the association between the vote totals for Pat Buchanan and the total population for Florida counties.
Slide 11

• Is the association positive or negative?
• Is the form of the relationship almost linear?
• Is an outlier present?
Slide 12

Another example: The statistics of poverty and inequality
Data from the U.N.E.S.C.O. 1990 Demographic Year Book. For 97 countries in the world, data are given for birth rates and for an index of the Gross National Product.
Slide 13

Note: More information can be added to a graph by putting a categorical variable ON the scatter plot, either
• as a label of the points, or
• as a symbol instead of the points themselves, or
• by the use of color (a different color for each category), as in the previous graph.
Slide 14

Linearization using Mathematical Transformations
The previous plot shows a non-linear association! Sometimes we can make an association linear by applying a transformation to the variables. Possible transformations are, for example, “ln”, “exp”, “sqrt”. Here we consider the natural log of GNP.
Figure: Birth rate vs Log G.N.P.
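
The effect of the log transformation can be sketched in a few lines of Python. The numbers below are made up for illustration and are NOT the U.N.E.S.C.O. data; the helper `correlation` is a hypothetical name, and it computes the correlation coefficient r that is defined later in this chapter. Under those assumptions, the sketch compares how close to linear the association is before and after taking the natural log of GNP.

```python
import math

def correlation(xs, ys):
    """Sample correlation coefficient r of two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical values for illustration only (not the slide's data):
# birth rate falls quickly at low GNP and slowly at high GNP.
gnp = [200, 500, 1000, 2500, 5000, 10000, 20000]
birth_rate = [48, 43, 40, 34, 31, 28, 24]

# Transform the explanatory variable with the natural log ("ln").
log_gnp = [math.log(g) for g in gnp]

r_raw = correlation(gnp, birth_rate)
r_log = correlation(log_gnp, birth_rate)

print(f"r using GNP:     {r_raw:.3f}")
print(f"r using ln(GNP): {r_log:.3f}")
```

For these made-up values the association with ln(GNP) is much closer to a straight line than the association with GNP itself, which is the point of the transformation.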
Slide 15

Measure of Linear Association
If there is a strong linear association between the variables, then the cloud of points on the scatter plot will be close to a line.
Figure: Birth rate (1,000 pop) vs Log G.N.P.
Slide 16

The Correlation Coefficient r
The correlation coefficient r measures the direction and the strength of the linear relationship between two variables.
• It is a value between -1 and 1.
• The closer r is to 1 or -1, the stronger the linear association is.
• Positive values of r imply a positive association; negative values imply a negative association.
• Values of r close to 0 imply a weak linear association.
• Sample r is defined as:

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$

where the X data have average $\bar{x}$ and standard deviation $s_x$, and the Y data have average $\bar{y}$ and standard deviation $s_y$.
Slide 17

EXAMPLE Determining the Linear Correlation Coefficient
Determine the linear correlation coefficient of the drilling data.
Slide 18

Table columns: (xi - 126.25)/sx    (yi - 6.9858)/sy    product
Slide 19

$$ r = \frac{\sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)}{n-1} = \frac{8.501037}{12 - 1} = 0.773 $$
Slide 20

Properties of r
• The correlation coefficient r varies between -1 and 1.
• r = 0 means there is no linear association between X and Y.
• If r = 1 or r = -1, then the points in a scatter plot lie on a straight line.
• Positive r indicates a positive association between X and Y.
• Negative r indicates a negative association between X and Y.
• Both variables X and Y must be quantitative.
• The correlation between X and Y is the same as the correlation between Y and X.
• r does not change if we change the units of measurement for X and Y.
• The correlation measures only the linear relationship between two variables.
• r can be strongly affected by the presence of outliers.
Slide 21

Example of correlation
Negative association: r = -0.74
Figure: Birth rate (1,000 pop) vs Log G.N.P.
Slide 22

Diamond rings data
N = 48

Variable            Average    s.d.      Min     Max
X  Carat            0.20       0.056     0.12    0.35
Y  Price in US $    865.144    213.64    385     1879

Strong positive association: r = 0.989
Figure: Diamond carats vs Price in US$ (x-axis: Carat; y-axis: Price in US dollars)
Slide 23

Positive Correlation
In each plot there are 100 points. The correlation coefficient measures the amount of clustering around a line.
If r is close to 1, then points lie close to a straight line!!
Slide 24

Negative Correlation
Negative correlation: as x increases, y tends to decrease.
If r is close to -1, then points lie close to a straight line!!
Slide 25

Match the correlation with the plot!
Match the diagrams with the following correlations:
-0.93, -0.75, -0.20, 0.27, 0.63, 1.0
Slide 26

Change of scale
These are the low and high temperatures in Boulder (CO) for the month of April 1996. The first scatter plot uses degrees Fahrenheit and the second plot uses degrees centigrade. Notice that °C = 5/9 × (°F - 32).
First plot: r = 0.74. Second plot: r = ?
Are the correlations between low and high temperatures in the two graphs different?
Slide 27

Different correlations?
In which diagram below is the correlation coefficient the largest? The smallest?
Slide 28

Outliers and nonlinear association
How are the data sets different?
Slide 29

Plot the data: the nature of the association between x and y is very different.
The correlation coefficient can be misleading in the presence of outliers or non-linear association. Check the scatter plot of the data.
• For each of these: r = 0.82.
• Perfect association! Why is r not equal to 1?
• Outliers change the value of r. What would the value of r be without the outliers?
Slide 30

Which of the following diagrams should be summarized by r?
(1) (2) (3)
Slide 31

Correlation does not mean Causation!!
Slide 32

Example
Ice cream sales and crime rates have a very high correlation. Does this mean that local governments should shut down all ice cream shops?
Ans: There is another variable: temperature! As air temperatures rise, both ice cream sales and crime rates rise.
Here, temperature is a lurking variable. Two variables can be related through a lurking variable even though there is no causal relation between them.
Slide 33

SCATTERPLOT and CORRELATION using Excel
Slide 34

To graph a Scatterplot
– Highlight the two data columns
– Use the Chart Wizard
– Choose: XY (Scatter)
– Follow the dialog window steps appropriately (label axes, etc.)
Slide 35

Computing the Correlation coefficient
The correlation coefficient is computed using the Correlation tool in the Data Analysis ToolPak:
Click on TOOLS > DATA ANALYSIS > Correlation
Or you can use the function:
= CORREL(data range X, data range Y)
Example: If the X values are in B2:B25 and the Y values are in C2:C25, the correlation between the X data and the Y data is obtained as follows:
= CORREL(B2:B25, C2:C25)
Slide 36

SCATTERPLOT and CORRELATION using TI-83
Slide 37

Create the two Lists
To input data into the STAT list editor:
• Enter STAT edit mode by pressing [STAT] [1].
• Enter the data in the L1 and L2 lists, pressing [ENTER] after each entry.
• Press [2nd] [MODE] to QUIT and return to the home screen.
Example:
L1: {7,2,4,2,5}
L2: {8,4,6,2,7}
Slide 38

Graph the ScatterPlot
• Press [2nd] [Y=] to access the STAT PLOT editor.
• Press [ENTER] to edit Plot1.
• Press [ENTER] to turn ON Plot1.
• Scroll down and highlight the scatter plot graph type (first option in the first row). Press [ENTER] to select the scatter plot graph type.
• Scroll down and make sure Xlist: is set to L1 and Ylist: is set to L2. To input L1, press [2nd] [1]. To input L2, press [2nd] [2].
• Press [GRAPH] to display the scatter plot. You may have to change the WINDOW settings to view your graph.
Slide 39

Get the Correlation Coefficient r
• Turn on diagnostics with the DiagnosticOn command:
  – [2nd] [0] opens the CATALOG
  – Scroll down to DiagnosticOn and press [ENTER] twice.
• Press [STAT], then [►] to reach the CALC menu.
• Scroll down to 4: LinReg(ax+b).
• Press [ENTER] twice; the output screen lists r.
Slide 40
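
As a cross-check on the Excel and TI-83 steps, the same computation can be sketched in Python directly from the definition of the sample correlation coefficient. The function name `correlation` is a hypothetical choice made for this illustration; the data are the example lists L1 and L2 from the TI-83 slides. The sketch also checks two of the properties of r listed earlier: swapping X and Y, or changing the units of X, leaves r unchanged.

```python
import math

def correlation(xs, ys):
    """Sample r: sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divide by n - 1).
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

# Example lists from the TI-83 slides.
L1 = [7, 2, 4, 2, 5]
L2 = [8, 4, 6, 2, 7]

r = correlation(L1, L2)
print(f"r = {r:.3f}")  # prints: r = 0.930

# Property: the correlation of X with Y equals the correlation of Y with X.
r_swapped = correlation(L2, L1)

# Property: r does not change under a change of units, here the
# Fahrenheit-to-centigrade formula applied to X purely as an illustration.
L1_rescaled = [5 / 9 * (x - 32) for x in L1]
r_rescaled = correlation(L1_rescaled, L2)

print(f"swapped: {r_swapped:.3f}, rescaled: {r_rescaled:.3f}")
```

The function divides by the standard deviations, so it fails with a division by zero if all values in either list are identical; r is undefined in that case.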