Chapter 5
Summarizing Bivariate Data

Terms
A multivariate data set consists of measurements or observations on each of two or more variables.
The classroom data set introduced in the slides for Chapter 3 is a multivariate data set. The data set includes observations on the variables age, weight, height, gender, vision (correction method), and smoke (status).
Age, weight and height are numerical variables, while gender, vision and smoke are categorical variables.

Copyright (c) 2001 Brooks/Cole, a division of Thomson Learning, Inc.

Terms
A bivariate data set consists of measurements or observations on each of two variables.
For the rest of this chapter we will concentrate on bivariate data sets where both variables are numeric.

Scatterplots
A scatterplot is a plot of pairs of observed values (both quantitative) of two different variables.
When one of the variables is considered to be a response variable (y) and the other an explanatory variable (x), the explanatory variable is usually plotted on the x axis.

Example
A sample of one-way Greyhound bus fares from Rochester, NY to cities less than 750 miles away was taken by going to Greyhound’s website. The following table gives the destination city, the distance and the one-way fare. Distance should be on the x axis and fare on the y axis.

Destination City      Distance (miles)   Standard One-Way Fare ($)
Albany, NY                  240                 39
Baltimore, MD               430                 81
Buffalo, NY                  69                 17
Chicago, IL                 607                 96
Cleveland, OH               257                 61
Montreal, QU                480                 70.5
New York City, NY           340                 65
Ottawa, ON                  467                 82
Philadelphia, PA            335                 67
Potsdam, NY                 239                 47
Syracuse, NY                 95                 20
Toronto, ON                 178                 35
Washington, DC              496                 87

Example Scatterplot
[Figure: scatterplot titled "Greyhound Bus Fares Vs. Distance". x axis: Distance from Rochester, NY (miles), scaled from 50 to 650. y axis: Standard One-Way Fare, scaled from $10 to $100.]

Comments
The axes need not intersect at (0,0).
For each of the axes, the scale should be chosen so that the minimum and maximum values on the scale are convenient and the values to be plotted are between the two values.
Notice that for this example,
1. The x axis (distance) runs from 50 to 650 miles, where the data points are between 69 and 607.
2. The y axis (fare) runs from $10 to $100, where the data points are between $17 and $96.

Further Comments
1. It is possible that two points might have the same x value with different y values. Notice that Potsdam (239) and Albany (240) come very close to having the same x value but the y values are $8 apart. Clearly, the value of y is not determined solely by the x value (there are factors other than distance that affect the fare).
2. In this example, the y value tends to increase as x increases. We say that there is a positive relationship between the variables distance and fare.
3. It appears that the y value (fare) could be predicted reasonably well from the x value (distance) by finding a line that is close to the points in the plot.

Association
Positive Association - Two variables are positively associated when above-average values of one tend to accompany above-average values of the other and below-average values tend similarly to occur together. (That is, generally speaking, the y values tend to increase as the x values increase.)
Negative Association - Two variables are negatively associated when above-average values of one accompany below-average values of the other, and vice versa. (That is, generally speaking, the y values tend to decrease as the x values increase.)
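The definition of positive association can be checked directly on the Greyhound sample. The sketch below is only an illustration: the function name `association_counts` is made up for this example, and it simply counts how many cities lie on the same side of both averages and how many lie on opposite sides.

```python
# Greyhound sample from the table above: (distance in miles, one-way fare in dollars).
data = [
    (240, 39), (430, 81), (69, 17), (607, 96), (257, 61), (480, 70.5),
    (340, 65), (467, 82), (335, 67), (239, 47), (95, 20), (178, 35), (496, 87),
]


def association_counts(pairs):
    """Count points on the same side vs. opposite sides of the two means.

    Hypothetical helper for illustration; not part of the original slides.
    """
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    # A positive product means x and y are both above or both below average.
    same = sum(1 for x, y in pairs if (x - x_bar) * (y - y_bar) > 0)
    opposite = sum(1 for x, y in pairs if (x - x_bar) * (y - y_bar) < 0)
    return same, opposite


same_side, opposite_side = association_counts(data)
print(same_side, opposite_side)  # 12 1
```

Twelve of the thirteen cities fall on the same side of both averages (Cleveland is the exception), which is what a positive association looks like in the raw data.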
The Pearson Correlation Coefficient
A measure of the strength of the linear relationship between two variables is the Pearson correlation coefficient.
The Pearson sample correlation coefficient is defined by

\[ r = \frac{\sum \left( \frac{x - \bar{x}}{s_x} \right) \left( \frac{y - \bar{y}}{s_y} \right)}{n - 1} = \frac{\sum z_x z_y}{n - 1} \]

Example Calculation
Here \( z_x = (x - \bar{x})/s_x \) and \( z_y = (y - \bar{y})/s_y \).

   x      y       z_x       z_y     z_x z_y
  240    39     -0.5214   -0.7856    0.4096
  430    81      0.6357    0.8610    0.5473
   69    17     -1.5627   -1.6481    2.5755
  607    96      1.7135    1.4491    2.4831
  257    61     -0.4178    0.0769   -0.0321
  480    70.5    0.9402    0.4494    0.4225
  340    65      0.0876    0.2337    0.0205
  467    82      0.8610    0.9002    0.7751
  335    67      0.0571    0.3121    0.0178
  239    47     -0.5275   -0.4720    0.2489
   95    20     -1.4044   -1.5305    2.1494
  178    35     -0.8989   -0.9424    0.8472
  496    87      1.0376    1.0962    1.1374
  Sum                               11.6021

With \( \bar{x} = 325.615 \), \( s_x = 164.2125 \), \( \bar{y} = 59.0385 \) and \( s_y = 25.506 \),

\[ r = \frac{11.6021}{13 - 1} = 0.9668 \]

Some Correlation Pictures
[Figures: six slides of correlation pictures; the images are not reproduced here.]

Properties of r
1. The value of r does not depend on the unit of measurement for each variable.
2. The value of r does not depend on which of the two variables is labeled x.
3. The value of r is between -1 and +1.
4. The correlation coefficient is
   a) -1 only when all the points lie on a downward-sloping line, and
   b) +1 only when all the points lie on an upward-sloping line.
5. The value of r is a measure of the extent to which x and y are linearly related.

Linear Relations
The relationship y = a + bx is the equation of a straight line.
The value b, called the slope of the line, is the amount by which y increases when x increases by 1 unit.
The value of a, called the intercept (or sometimes the vertical intercept) of the line, is the height of the line above the value x = 0.

Example
[Figure: graph of the line y = 7 + 3x. The intercept is a = 7, and y increases by b = 3 when x increases by 1.]

Example
[Figure: graph of the line y = 17 - 4x. The intercept is a = 17, and y changes by b = -4 when x increases by 1.]

Least Squares Line
The most widely used criterion for measuring the goodness of fit of a line y = a + bx to bivariate data \( (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \) is the sum of the squared deviations about the line:

\[ \sum [y - (a + bx)]^2 = [y_1 - (a + bx_1)]^2 + [y_2 - (a + bx_2)]^2 + \cdots + [y_n - (a + bx_n)]^2 \]

The line that gives the best fit to the data is the one that minimizes this sum; it is called the least squares line or sample regression line.

Coefficients a and b
The slope of the least squares line is

\[ b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} \]

and the y intercept is

\[ a = \bar{y} - b\bar{x} \]

We write the equation of the least squares line as

\[ \hat{y} = a + bx \]

where the ^ above y emphasizes that ŷ (read as y-hat) is a prediction of y resulting from the substitution of a particular x value into the equation.

Calculating Formula for b

\[ b = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} \]
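The definition of b and the calculating formula are algebraically equivalent, and this can be checked numerically on the Greyhound fares from the earlier example. The following is a minimal sketch; the function names are chosen here for illustration and do not come from the slides.

```python
# Greyhound sample from the earlier example: distance (miles) and fare (dollars).
distances = [240, 430, 69, 607, 257, 480, 340, 467, 335, 239, 95, 178, 496]
fares = [39, 81, 17, 96, 61, 70.5, 65, 82, 67, 47, 20, 35, 87]


def slope_by_definition(xs, ys):
    """b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx


def slope_by_calculating_formula(xs, ys):
    """b = (sum(xy) - sum(x)sum(y)/n) / (sum(x^2) - sum(x)^2/n)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    return (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)


def intercept(xs, ys, b):
    """a = y_bar - b * x_bar."""
    return sum(ys) / len(ys) - b * sum(xs) / len(xs)


b = slope_by_definition(distances, fares)
b_check = slope_by_calculating_formula(distances, fares)
a = intercept(distances, fares, b)
print(f"y-hat = {a:.3f} + {b:.5f} x")
```

Both functions should agree to within floating-point rounding; the calculating formula only needs the column totals, which is why it is convenient for hand computation.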
Greyhound Example Continued

    x      y      x - x̄      (x - x̄)²      y - ȳ     (x - x̄)(y - ȳ)
   240    39     -85.615     7329.994    -20.038      1715.60
   430    81     104.385    10896.148     21.962      2292.45
    69    17    -256.615    65851.456    -42.038     10787.72
   607    96     281.385    79177.302     36.962     10400.41
   257    61     -68.615     4708.071      1.962      -134.59
   480    70.5   154.385    23834.609     11.462      1769.49
   340    65      14.385      206.917      5.962        85.75
   467    82     141.385    19989.609     22.962      3246.41
   335    67       9.385       88.071      7.962        74.72
   239    47     -86.615     7502.225    -12.038      1042.72
    95    20    -230.615    53183.456    -39.038      9002.87
   178    35    -147.615    21790.302    -24.038      3548.45
   496    87     170.385    29030.917     27.962      4764.22
  4233   767.5              323589.08                48596.19

(The original slide shows the fare total rounded to 768; the listed fares sum to 767.5, which is the value that gives the mean 59.0385 used below.)

Calculations
From the previous slide, we have

\[ \sum (x - \bar{x})(y - \bar{y}) = 48596.19 \quad \text{and} \quad \sum (x - \bar{x})^2 = 323589.08 \]

So

\[ b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{48596.19}{323589.08} = 0.15018 \]

Also n = 13, \( \sum x = 4233 \) and \( \sum y = 767.5 \), so

\[ \bar{x} = \frac{4233}{13} = 325.615 \quad \text{and} \quad \bar{y} = \frac{767.5}{13} = 59.0385 \]

This gives

\[ a = \bar{y} - b\bar{x} = 59.0385 - 0.15018(325.615) = 10.138 \]

The regression line is ŷ = 10.138 + 0.15018x.

Minitab Graph
The following graph is a copy of the output from a Minitab command to graph the regression line.
[Figure: Minitab regression plot of Standard Fare against Distance with the fitted line Standard Fare = 10.1380 + 0.150179 Distance; S = 6.80319, R-Sq = 93.5 %, R-Sq(adj) = 92.9 %.]

Greyhound Example Revisited

    x      y       x²        xy
   240    39      57600     9360
   430    81     184900    34830
    69    17       4761     1173
   607    96     368449    58272
   257    61      66049    15677
   480    70.5   230400    33840
   340    65     115600    22100
   467    82     218089    38294
   335    67     112225    22445
   239    47      57121    11233
    95    20       9025     1900
   178    35      31684     6230
   496    87     246016    43152
  4233   767.5  1701919   298506
Greyhound Example Revisited
Using the calculation formula we have n = 13, \( \sum x = 4233 \), \( \sum y = 767.5 \) (shown rounded to 768 on the original slide), \( \sum x^2 = 1701919 \) and \( \sum xy = 298506 \), and so

\[ b = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} = \frac{298506 - \frac{(4233)(767.5)}{13}}{1701919 - \frac{(4233)^2}{13}} = \frac{48596.19}{323589.1} = 0.15018 \]

As before, \( a = \bar{y} - b\bar{x} = 59.0385 - 0.15018(325.615) = 10.138 \), and the regression line is ŷ = 10.138 + 0.15018x.
Notice that we get the same result.

Three Important Questions
To examine how useful or effective the line is in summarizing the relationship between x and y, we consider the following three questions.
1. Is a line an appropriate way to summarize the relationship between the two variables?
2. Are there any unusual aspects of the data set that we need to consider before proceeding to use the regression line to make predictions?
3. If we decide that it is reasonable to use the regression line as a basis for prediction, how accurate can we expect predictions based on the regression line to be?

Terminology
The predicted or fitted values result from substituting each sample x value into the equation for the least squares line. This gives

\[ \hat{y}_1 = a + bx_1 \quad \text{(1st predicted value)} \]
\[ \hat{y}_2 = a + bx_2 \quad \text{(2nd predicted value)} \]
\[ \vdots \]
\[ \hat{y}_n = a + bx_n \quad \text{(nth predicted value)} \]

The residuals for the least squares line are the values

\[ y_1 - \hat{y}_1, \; y_2 - \hat{y}_2, \; \ldots, \; y_n - \hat{y}_n \]

Greyhound Example Continued

    x      y     Predicted value        Residual
                 ŷ = 10.1 + .150x       y - ŷ
   240    39        46.18               -7.181
   430    81        74.72                6.285
    69    17        20.50               -3.500
   607    96       101.30               -5.297
   257    61        48.73               12.266
   480    70.5      82.22              -11.724
   340    65        61.20                3.801
   467    82        80.27                1.728
   335    67        60.45                6.552
   239    47        46.03                0.969
    95    20        24.41               -4.405
   178    35        36.87               -1.870
   496    87        84.63                2.373

Residual Plot
A residual plot is a scatter plot of the data pairs (x, residual). The following plot was produced by Minitab from the Greyhound example.
[Figure: Minitab residual plot for the Greyhound data, residual against distance.]
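The (x, residual) pairs behind such a plot can be computed directly from the fitted line. The sketch below assumes the rounded coefficients a = 10.138 and b = 0.15018 from the earlier calculation; the function names are illustrative only.

```python
# Greyhound sample: distance (miles) and standard one-way fare (dollars).
distances = [240, 430, 69, 607, 257, 480, 340, 467, 335, 239, 95, 178, 496]
fares = [39, 81, 17, 96, 61, 70.5, 65, 82, 67, 47, 20, 35, 87]

# Rounded least squares coefficients taken from the slides (assumed here).
A = 10.138
B = 0.15018


def predict(x):
    """Predicted (fitted) value y-hat = a + b x."""
    return A + B * x


def residuals(xs, ys):
    """Residuals y - y-hat for each observation."""
    return [y - predict(x) for x, y in zip(xs, ys)]


resid = residuals(distances, fares)
for x, y, e in zip(distances, fares, resid):
    # Negative residual: the line overpredicts; positive: it underpredicts.
    print(f"{x:5d} {y:6.1f} {predict(x):8.2f} {e:8.3f}")
```

Plotting `distances` against `resid` with any plotting tool gives the residual plot; the residuals sum to approximately zero, as they should for a least squares line.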
Residual Plot – What to look for
Isolated points or patterns indicate potential problems.
Ideally the points should be randomly spread out above and below zero.
[Figure: residual plot with the residual on the vertical axis and x on the horizontal axis; the points are scattered randomly about 0.]
This residual plot indicates no systematic bias in using the least squares line to predict the y value. Generally this is the kind of pattern that you would like to see.
Note:
1. Values below 0 indicate overprediction.
2. Values above 0 indicate underprediction.

The Greyhound example continued
[Figure: residual plot for the Greyhound data, with regions annotated "Predicted fares are too high." and "Predicted fares are too low."]
For the Greyhound example, it appears that the line systematically predicts fares that are too high for cities close to Rochester and predicts fares that are too low for most cities between 200 and 500 miles.

More Residual Plots
Another common type of residual plot is a scatter plot of the data pairs (ŷ, residual). The following plot was produced by Minitab for the Greyhound data. Notice that this residual plot shows the same type of systematic problems with the model.
[Figure: Minitab residual plot for the Greyhound data, residual against ŷ.]

Coefficient of Determination
The coefficient of determination, denoted by \( r^2 \), gives the proportion of variation in y that can be attributed to an approximate linear relationship between x and y.

Definition formulae
The total sum of squares, denoted by SSTo, is defined as

\[ \text{SSTo} = (y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2 = \sum (y - \bar{y})^2 \]

The residual sum of squares, denoted by SSResid, is defined as

\[ \text{SSResid} = (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \cdots + (y_n - \hat{y}_n)^2 = \sum (y - \hat{y})^2 \]
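As an illustration, the two definition formulae can be applied directly to the Greyhound sample. This is a sketch under the assumption that the least squares line is refitted at full precision inside the function (rather than using the rounded a and b); the function name is hypothetical.

```python
# Greyhound sample: distance (miles) and standard one-way fare (dollars).
distances = [240, 430, 69, 607, 257, 480, 340, 467, 335, 239, 95, 178, 496]
fares = [39, 81, 17, 96, 61, 70.5, 65, 82, 67, 47, 20, 35, 87]


def sums_of_squares(xs, ys):
    """Return (SSTo, SSResid) computed from the definition formulae."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Least squares slope and intercept, kept at full precision.
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    ss_to = sum((y - y_bar) ** 2 for y in ys)
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return ss_to, ss_resid


ss_to, ss_resid = sums_of_squares(distances, fares)
r_squared = 1 - ss_resid / ss_to
print(f"SSTo = {ss_to:.2f}, SSResid = {ss_resid:.3f}, r^2 = {r_squared:.4f}")
```

SSResid can never exceed SSTo for a least squares line, so the ratio SSResid/SSTo is the fraction of the variation in y left unexplained by the line.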
Calculational formulae
SSTo and SSResid are generally found as part of the standard output from most statistical packages or can be obtained using the following computational formulas:

\[ \text{SSTo} = \sum y^2 - \frac{(\sum y)^2}{n} \]
\[ \text{SSResid} = \sum y^2 - a \sum y - b \sum xy \]

The coefficient of determination, \( r^2 \), can be computed as

\[ r^2 = 1 - \frac{\text{SSResid}}{\text{SSTo}} \]

Greyhound example revisited
n = 13, \( \sum y = 767.5 \), \( \sum y^2 = 53119.25 \), \( \sum xy = 298506 \), b = 0.150179 and a = 10.1380. (The original slide shows these sums rounded to 768 and 53119.)

\[ \text{SSTo} = \sum y^2 - \frac{(\sum y)^2}{n} = 53119.25 - \frac{(767.5)^2}{13} = 7807.23 \]
\[ \text{SSResid} = \sum y^2 - a \sum y - b \sum xy = 53119.25 - 10.1380(767.5) - 0.150179(298506) \approx 509.117 \]

(The value 509.117 is obtained with unrounded values of a and b. This formula is sensitive to rounding: with the rounded coefficients shown, the arithmetic gives about 509.0.)

Greyhound example revisited

\[ r^2 = 1 - \frac{\text{SSResid}}{\text{SSTo}} = 1 - \frac{509.117}{7807.23} = 0.9348 \]

We can say that 93.5% of the variation in the fare (y) can be attributed to the least squares linear relationship between distance (x) and fare.

More on variability
The standard deviation about the least squares line is denoted \( s_e \) and given by

\[ s_e = \sqrt{\frac{\text{SSResid}}{n - 2}} \]

\( s_e \) is interpreted as the “typical” amount by which an observation deviates from the least squares line.

Greyhound example revisited

\[ s_e = \sqrt{\frac{\text{SSResid}}{n - 2}} = \sqrt{\frac{509.117}{11}} = \$6.80 \]

The “typical” deviation of actual fare from the prediction is $6.80.

Minitab output for Regression

    Regression Analysis: Standard Fare versus Distance

    The regression equation is
    Standard Fare = 10.1 + 0.150 Distance

    Predictor      Coef   SE Coef       T       P
    Constant     10.138     4.327    2.34   0.039
    Distance    0.15018   0.01196   12.56   0.000

    S = 6.803    R-Sq = 93.5%    R-Sq(adj) = 92.9%

    Analysis of Variance
    Source           DF       SS       MS        F       P
    Regression        1   7298.1   7298.1   157.68   0.000
    Residual Error   11    509.1     46.3
    Total            12   7807.2

Annotations on the slide: the regression equation is the least squares regression line; the Constant coefficient is a and the Distance coefficient is b; S is \( s_e \); R-Sq is \( r^2 \); the Residual Error SS is SSResid; the Total SS is SSTo.
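The quantities labelled in the Minitab output can be reproduced from the column totals alone using the computational formulas. The sketch below is one possible way to do this; the function name and the dictionary keys are assumptions made for the example, and the coefficients are kept unrounded because the SSResid formula is sensitive to rounding.

```python
import math

# Greyhound sample: distance (miles) and standard one-way fare (dollars).
distances = [240, 430, 69, 607, 257, 480, 340, 467, 335, 239, 95, 178, 496]
fares = [39, 81, 17, 96, 61, 70.5, 65, 82, 67, 47, 20, 35, 87]


def regression_summary(xs, ys):
    """Compute a, b, SSTo, SSResid, r^2 and s_e from the column totals."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    a = sum_y / n - b * sum_x / n
    ss_to = sum_y2 - sum_y ** 2 / n
    ss_resid = sum_y2 - a * sum_y - b * sum_xy
    return {
        "a": a,
        "b": b,
        "ss_to": ss_to,
        "ss_resid": ss_resid,
        "r_squared": 1 - ss_resid / ss_to,
        "s_e": math.sqrt(ss_resid / (n - 2)),
    }


summary = regression_summary(distances, fares)
for name, value in summary.items():
    print(f"{name:10s} {value:12.4f}")
```

The printed values can be compared line by line with the Coef column, S, R-Sq and the SS column of the Minitab output.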
The Greyhound problem with additional data
The sample of fares and mileages from Rochester was extended to cover a total of 20 cities throughout the country. The resulting data and a scatterplot are given on the next few slides.

Extended Greyhound Fare Sample

City                 Distance   Standard Fare
Buffalo, NY              69        17
New York City           340        65
Cleveland, OH           257        61
Baltimore, MD           430        81
Washington, DC          496        87
Atlanta, GA             998       115
Chicago, IL             607        96
San Francisco          2861       159
Seattle, WA            2848       159
Philadelphia, PA        335        67
Orlando, FL            1478       109
Phoenix, AZ            2569       149
Houston, TX            1671       129
New Orleans, LA        1381       119
Syracuse, NY             95        20
Albany, NY              240        39
Potsdam, NY             239        47
Toronto, ON             178        35
Ottawa, ON              467        82
Montreal, QU            480        70.5

Extended Greyhound Fare Sample
[Figure: scatterplot of Standard Fare (0 to 150) against Distance (0 to 3000).]

Extended Greyhound Fare Sample
[Figure: Minitab regression plot of Standard Fare against Distance with the fitted line Standard Fare = 46.0582 + 0.0435354 Distance; S = 17.4230, R-Sq = 84.9 %, R-Sq(adj) = 84.1 %.]
Minitab reports the correlation coefficient r = 0.921, \( r^2 = 0.849 \), \( s_e = \$17.42 \) and the regression line
Standard Fare = 46.058 + 0.043535 Distance
Notice that even though the correlation coefficient is reasonably high and 84.9% of the variation in the fare is explained, the linear model is not very usable.
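The correlation coefficient reported by Minitab can be reproduced from the z-score definition of r given earlier in the chapter. The following is a minimal sketch; `sample_std` and `pearson_r` are names chosen for this example.

```python
import math

# Extended Greyhound sample: (distance in miles, standard fare in dollars).
extended = [
    (69, 17), (340, 65), (257, 61), (430, 81), (496, 87), (998, 115),
    (607, 96), (2861, 159), (2848, 159), (335, 67), (1478, 109),
    (2569, 149), (1671, 129), (1381, 119), (95, 20), (240, 39),
    (239, 47), (178, 35), (467, 82), (480, 70.5),
]


def sample_std(values):
    """Sample standard deviation with divisor n - 1."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def pearson_r(pairs):
    """r = sum(z_x * z_y) / (n - 1), using z-scores of x and y."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    n = len(pairs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_x, s_y = sample_std(xs), sample_std(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in pairs) / (n - 1)


r = pearson_r(extended)
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
```

A value of r this close to 1 measures only the strength of the linear trend; it does not by itself show that a straight line is the right shape for the data, which is the point of the slide above.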
Nonlinear Regression Example

City                 Distance   Log10(distance)   Standard Fare
Buffalo, NY              69        1.83885             17
New York City           340        2.53148             65
Cleveland, OH           257        2.40993             61
Baltimore, MD           430        2.63347             81
Washington, DC          496        2.69548             87
Atlanta, GA             998        2.99913            115
Chicago, IL             607        2.78319             96
San Francisco          2861        3.45652            159
Seattle, WA            2848        3.45454            159
Philadelphia, PA        335        2.52504             67
Orlando, FL            1478        3.16967            109
Phoenix, AZ            2569        3.40976            149
Houston, TX            1671        3.22298            129
New Orleans, LA        1381        3.14019            119
Syracuse, NY             95        1.97772             20
Albany, NY              240        2.38021             39
Potsdam, NY             239        2.37840             47
Toronto, ON             178        2.25042             35
Ottawa, ON              467        2.66932             82
Montreal, QU            480        2.68124             70.5

Extended Greyhound Fare Sample
Nonlinear Regression
From the scatterplot of the extended sample we can see that the plot does not look linear; it appears to have a curved shape. We sometimes replace one or more of the variables with a transformation of that variable and then perform a linear regression on the transformed variables. This can sometimes lead to developing a useful prediction equation.
For this particular data, the shape of the curve is almost logarithmic, so we might try to replace the distance with log10(distance) (the logarithm to the base 10 of the distance).

Nonlinear Regression Example
Minitab provides the following output.

    Regression Analysis: Standard Fare versus Log10(Distance)

    The regression equation is
    Standard Fare = - 163 + 91.0 Log10(Distance)

    Predictor      Coef   SE Coef        T       P
    Constant    -163.25     10.59   -15.41   0.000
    Log10(Di     91.039     3.826    23.80   0.000

    S = 7.869    R-Sq = 96.9%    R-Sq(adj) = 96.7%

    Analysis of Variance
    Source           DF      SS      MS        F       P
    Regression        1   35068   35068   566.30   0.000
    Residual Error   18    1115      62
    Total            19   36183

    Unusual Observations
    Obs   Log10(Di   Standard      Fit   SE Fit   Residual   St Resid
     11       3.17     109.00   125.32     2.43     -16.32     -2.18R

    R denotes an observation with a large standardized residual

Annotations on the slide: high \( r^2 \), with 96.9% of the variation attributed to the model; typical error = $7.87, which is reasonably good. The only outlier is Orlando and, as you’ll see from the next two slides, it is not too bad.

Nonlinear Regression Example
Looking at the plot of the residuals against distance, we see some problems. The model overestimates fares for middle distances (1000 to 2000 miles) and underestimates for longer distances (more than 2000 miles).
[Figure: plot of the residuals against distance.]

Nonlinear Regression Example
When we look at how the prediction curve looks on a graph that has the Standard Fare and Log10(Distance) axes, we see the result looks reasonably linear.
[Figure: Minitab regression plot of Standard Fare (0 to 150) against Log10(Distance) (2.0 to 3.5) with the fitted line Standard Fare = -163.246 + 91.0389 Log10(Distance); S = 7.86930, R-Sq = 96.9 %, R-Sq(adj) = 96.7 %.]

Nonlinear Regression Example
When we look at how the prediction curve looks on a graph that has the Standard Fare and Distance axes, we see the result appears to work fairly well.
[Figure: Standard Fare (0 to 150) against Distance (0 to 3000) with the curve labelled "Prediction Model".]
By and large, this prediction model for the fares appears to work reasonably well.
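The transformed fit can be reproduced by applying the ordinary least squares formulas to the pairs (log10(distance), fare). The sketch below assumes the 20-city sample listed earlier; `fit_line` and `predict_fare` are illustrative names, not Minitab commands.

```python
import math

# Extended Greyhound sample: (distance in miles, standard fare in dollars).
extended = [
    (69, 17), (340, 65), (257, 61), (430, 81), (496, 87), (998, 115),
    (607, 96), (2861, 159), (2848, 159), (335, 67), (1478, 109),
    (2569, 149), (1671, 129), (1381, 119), (95, 20), (240, 39),
    (239, 47), (178, 35), (467, 82), (480, 70.5),
]


def fit_line(us, ys):
    """Least squares intercept a and slope b for y-hat = a + b u."""
    n = len(us)
    u_bar, y_bar = sum(us) / n, sum(ys) / n
    b = (sum((u - u_bar) * (y - y_bar) for u, y in zip(us, ys))
         / sum((u - u_bar) ** 2 for u in us))
    return y_bar - b * u_bar, b


# Transform the explanatory variable, then fit a straight line as usual.
log_distances = [math.log10(d) for d, _ in extended]
fares = [f for _, f in extended]
a, b = fit_line(log_distances, fares)


def predict_fare(distance):
    """Predicted fare from the log model: a + b * log10(distance)."""
    return a + b * math.log10(distance)


print(f"Standard Fare = {a:.3f} + {b:.4f} Log10(Distance)")
print(f"Predicted fare for 1478 miles (Orlando): {predict_fare(1478):.2f}")
```

The model is still linear in its coefficients; only the x variable has been transformed, which is why the same formulas for a and b apply. Predictions are made on the original distance scale by taking the logarithm first.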