Chapter 2 Looking at Data - Relationships

Relations Among Variables
• Response variable - Outcome measurement (or characteristic) of a study. Also called: dependent variable, outcome, and endpoint. Labelled as y.
• Explanatory variable - Condition that explains or causes changes in response variables. Also called: independent variable and predictor. Labelled as x.
• Theories are usually generated about relationships among variables, and statistical methods can be used to test them.
• Research questions are stated such as: Do changes in x cause changes in y?

Scatterplots
• Identify the explanatory and response variables of interest, and label them as x and y
• Obtain a set of individuals and observe the pair (x_i, y_i) for each individual. There will be n pairs.
• Statistical convention has the response variable (y) placed on the vertical (up/down) axis and the explanatory variable (x) placed on the horizontal (left/right) axis. (Note: economists reverse axes in price/quantity demand plots)
• Plot the n pairs of points (x, y) on the graph

France August, 2003 Heat Wave Deaths
• Individuals: 13 cities in France
• Response: Excess Deaths (%), Aug 1-19, 2003 vs 1999-2002
• Explanatory Variable: Change in Mean Temp in period (C)
• Data:

City          Dth03   Dth9902   %chng (y)   Degchg (x)
Lille           200     192.3           4          4.0
Marseilles      571     456.8          25          4.3
Grenoble        148     115.6          28          6.3
Rennes          156     114.7          36          5.6
Toulouse        315     231.6          36          6.6
Bordeaux        318     222.4          43          6.2
Strasbourg      253     167.5          51          5.9
Nice            341     222.9          53          4.3
Poitiers        184     102.8          79          7.3
Lyon            447     248.3          80          6.8
Le Mans         204     112.1          82          7.0
Dijon           168      87.0          93          7.4
Paris          1854     766.1         142          6.7

[Figure: 2003 France Heat Wave Mortality. Scatterplot of Excess Mortality (%) against Change in Mean Temp (Celsius), with one point marked "Possible Outlier".]

Example - Pharmacodynamics of LSD
• Response (y) - Math score (mean among 5 volunteers)
• Explanatory (x) - LSD tissue concentration (mean of 5 volunteers)
• Raw Data and scatterplot of Score vs LSD
concentration:

LSD Conc (x)   Score (y)
        1.17       78.93
        2.97       58.20
        3.26       67.47
        4.69       37.47
        5.83       45.65
        6.00       32.92
        6.41       29.97

[Figure: scatterplot of SCORE against LSD_CONC.]
Source: Wagner, et al (1968)

Manufacturer Production/Cost Relation
y = Total Cost, x = Amount Produced, n = 48 months (not in order)

Month   Prod    Cost     Month   Prod    Cost     Month   Prod    Cost
    1   46.75   92.64       17   36.54   91.56       33   32.26   66.71
    2   42.18   88.81       18   37.03   84.12       34   30.97   64.37
    3   41.86   86.44       19   36.60   81.22       35   28.20   56.09
    4   43.29   88.80       20   37.58   83.35       36   24.58   50.25
    5   42.12   86.38       21   36.48   82.29       37   20.25   43.65
    6   41.78   89.87       22   38.25   80.92       38   17.09   38.01
    7   41.47   88.53       23   37.26   76.92       39   14.35   31.40
    8   42.21   91.11       24   38.59   78.35       40   13.11   29.45
    9   41.03   81.22       25   40.89   74.57       41    9.50   29.02
   10   39.84   83.72       26   37.66   71.60       42    9.74   19.05
   11   39.15   84.54       27   38.79   65.64       43    9.34   20.36
   12   39.20   85.66       28   38.78   62.09       44    7.51   17.68
   13   39.52   85.87       29   36.70   61.66       45    8.35   19.23
   14   38.05   85.23       30   35.10   77.14       46    6.25   14.92
   15   39.16   87.75       31   33.75   75.47       47    5.45   11.44
   16   38.59   92.62       32   34.29   70.37       48    3.79   12.69

[Figure: Production (x) / Cost (y) Relation. Scatterplot of Total Cost against Total Production.]

Correlation
• Numerical measure to summarize the strength of the linear (straight-line) association between two variables
• Bounded between -1 and +1 (Labelled as r)
– Values near -1 ⇒ Strong Negative association
– Values near 0 ⇒ Weak or no association
– Values near +1 ⇒ Strong Positive association
• Not affected by linear transformation of either x or y (for example, a change of units); only the sign of r flips if one variable is multiplied by a negative constant
• Does not distinguish between response and explanatory variable (x and y can be interchanged)

$$ r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) = \frac{\mathrm{COV}(x,y)}{s_x s_y} \qquad \mathrm{COV}(x,y) = \frac{1}{n-1} \sum (x_i - \bar{x})(y_i - \bar{y}) $$

Excess French Heatwave Deaths
$\bar{x} = 6.03$, $s_x = 1.16$, $\bar{y} = 57.85$, $s_y = 36.46$, $n = 13$

City          Degchg (x)   %chng (y)   x-xbar   y-ybar   (x-xbar)(y-ybar)
Lille                4.0           4    -2.03   -53.85           109.3155
Marseilles           4.3          25    -1.73   -32.85            56.8305
Grenoble             6.3          28     0.27   -29.85            -8.0595
Rennes               5.6          36    -0.43   -21.85             9.3955
Toulouse             6.6          36     0.57   -21.85           -12.4545
Bordeaux             6.2          43     0.17   -14.85            -2.5245
Strasbourg           5.9          51    -0.13    -6.85             0.8905
Nice                 4.3          53    -1.73    -4.85             8.3905
Poitiers             7.3          79     1.27    21.15            26.8605
Lyon                 6.8          80     0.77    22.15            17.0555
Le Mans              7.0          82     0.97    24.15            23.4255
Dijon                7.4          93     1.37    35.15            48.1555
Paris                6.7         142     0.67    84.15            56.3805
Total               78.4       752.0      0.0      0.0              333.7

$$ \mathrm{COV}(x,y) = \frac{333.7}{13 - 1} = 27.81 \qquad r = \frac{27.81}{(1.16)(36.46)} = \frac{27.81}{42.29} = 0.66 $$

Least-Squares Regression
• Goal: Fit a line that “best fits” the relationship between the response variable and the explanatory variable
• Equation of a straight line: y = a + bx
– a - y-intercept (value of y when x = 0)
– b - slope (amount y increases as x increases by 1 unit)
• Prediction: Often want to predict what y will be at a given level of x (e.g., how much will it cost to fill an order of 1000 t-shirts?)
• Extrapolation: Using a fitted line outside the range of the explanatory variable observed in the sample: BAD IDEA

Least-Squares Regression
• y = a + bx is a deterministic equation
• Sample data don’t fall on a straight line, but rather around one
• Obtain the equation that “best fits” a sample of data points
• Error - Difference between observed response and predicted response (from equation)
• Least Squares criterion: Choose the line that minimizes the sum of squared errors. Resulting regression line:

$$ \hat{y} = a + bx \qquad b = r \frac{s_y}{s_x} \qquad a = \bar{y} - b\bar{x} $$

Excess French Heatwave Deaths
$\bar{x} = 6.03$, $s_x = 1.16$, $\bar{y} = 57.85$, $s_y = 36.46$, $r = 0.66$

$$ b = 0.66 \left( \frac{36.46}{1.16} \right) = 0.66(31.43) = 20.74 \qquad a = 57.85 - 20.74(6.03) = 57.85 - 125.06 = -67.21 $$

$$ \hat{y} = -67.21 + 20.74x $$

For each 1C increase in mean temp, excess mortality increases by about 20 percentage points.

[Figure: 2003 France Heat Wave Mortality. Scatterplot of Excess Mortality (%) against Change in Mean Temp (Celsius).]

Effect of an Outlier (Paris)
• Re-fitting the model without Paris, which had a very high excess mortality (Using EXCEL):

$$ r^* = 0.76 \qquad \hat{y}^* = -52.78 + 17.34x^* $$

[Figure: Heat Wave Mortality (No Paris). Scatterplot of Excess Mortality against Temp Change.]

Squared Correlation
• The squared correlation represents the fraction of the variation in the response variable that is “explained” by the explanatory variable
• Represents the improvement (reduction in sum of squared errors) by using x (and fitted equation y-hat) to predict y as opposed to ignoring x (and simply using the sample mean y-bar) to predict y
• $0 \le r^2 \le 1$
– Values near 0 ⇒ x does not help predict y (regression line flat)
– Values near 1 ⇒ x predicts y well (data near regression line)

$$ r^2 = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} $$

Residual Analysis
• Residuals: Difference between observed responses and their predicted values: $e = y - \hat{y}$
• Useful to plot the residuals versus the level of the explanatory variable (x)
• Outliers: Large (positive or negative) residuals.
Values of y that are inconsistent with prediction
• Influential observations: Cases where the level of the explanatory variable is far away from the other individuals (extreme x values)

France Heatwave Mortality

  x      y    yhat   e=y-yhat
4.0      4   16.04     -12.04
4.3     25   22.22       2.78
6.3     28   63.39     -35.39
5.6     36   48.98     -12.98
6.6     36   69.56     -33.56
6.2     43   61.33     -18.33
5.9     51   55.15      -4.15
4.3     53   22.22      30.78
7.3     79   83.98      -4.98
6.8     80   73.68       6.32
7.0     82   77.80       4.20
7.4     93   86.03       6.97
6.7    142   71.62      70.38

(The yhat values come from the unrounded least-squares coefficients, so they differ slightly from values computed with a = -67.21 and b = 20.74.)

[Figure: Residual Plot. Residual against Temp Change (x); the point with the largest residual is labelled "Paris (outlier)".]

Miscellaneous Topics
• Lurking Variable: Variable not included in the regression analysis that may influence the association between y and x. The resulting association between y and x is sometimes referred to as spurious.
• Association does not imply causation (it is one of various steps in demonstrating cause-and-effect)
• Do not extrapolate outside the range of x observed in the study
• Some relationships are not linear, so the correlation may be low even when the relation is strong
• Correlations based on averages across individuals tend to be higher than those based on individuals

Causation
• Association between x and y demonstrated
• Time order confirmed (x “occurs” before y)
• Alternative explanations are considered and explained away:
– Lurking variables - Another variable causes both x and y
– Confounding - Two explanatory variables are highly related, and which causes y cannot be determined
• Dose-Response Effect
• Plausible cause
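
The hand calculations for the French heat-wave example (correlation, least-squares line, squared correlation, residuals, and the refit without Paris) can be reproduced with a short script. This is a minimal sketch in plain Python, not the procedure used to prepare the slides (which mention EXCEL for the refit). The function name fit_line and all variable names are illustrative assumptions, and the data are transcribed from the heat-wave table.

```python
from math import sqrt

# Data transcribed from the heat-wave table (13 French cities, Paris last).
# x = change in mean temperature (C), y = excess deaths (%).
x = [4.0, 4.3, 6.3, 5.6, 6.6, 6.2, 5.9, 4.3, 7.3, 6.8, 7.0, 7.4, 6.7]
y = [4, 25, 28, 36, 36, 43, 51, 53, 79, 80, 82, 93, 142]


def fit_line(xs, ys):
    """Return (r, a, b) for the least-squares line y-hat = a + b*x.

    fit_line is a hypothetical helper name; it follows the formulas
    r = COV(x, y) / (s_x * s_y), b = r * s_y / s_x, a = y-bar - b * x-bar.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = sqrt(sum((xi - x_bar) ** 2 for xi in xs) / (n - 1))
    s_y = sqrt(sum((yi - y_bar) ** 2 for yi in ys) / (n - 1))
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys)) / (n - 1)
    r = cov / (s_x * s_y)
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return r, a, b


# Fit using all 13 cities.
r, a, b = fit_line(x, y)

# Fitted values and residuals e = y - y-hat.
y_hat = [a + b * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, y_hat)]

# Squared correlation as explained variation over total variation.
y_bar = sum(y) / len(y)
r_squared = (sum((fi - y_bar) ** 2 for fi in y_hat)
             / sum((yi - y_bar) ** 2 for yi in y))

# Refit without Paris (the last observation).
r_np, a_np, b_np = fit_line(x[:-1], y[:-1])

print(f"All cities: r = {r:.2f}, y-hat = {a:.2f} + {b:.2f} x, r^2 = {r_squared:.2f}")
print(f"No Paris:   r = {r_np:.2f}, y-hat = {a_np:.2f} + {b_np:.2f} x")
print(f"Largest residual (Paris): {max(residuals):.2f}")
```

Because the script carries full precision, its slope and intercept for all 13 cities (about 20.59 and -66.31) differ slightly from the hand-calculated 20.74 and -67.21, which start from r rounded to 0.66. The correlation still rounds to 0.66, and the refit without Paris rounds to r = 0.76 with intercept -52.78 and slope 17.34.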