Chapter 3
ASSOCIATION: CONTINGENCY, CORRELATION, AND REGRESSION

3.1 The Association between Two Categorical Variables

Response and Explanatory Variables
  Response variable (dependent, y): the outcome variable
  Explanatory variable (independent, x): defines the groups to be compared
  Response/Explanatory examples:
  1. Grade on test / Amount of study time
  2. Yield of corn / Amount of rainfall

Association
  An association exists when a value of one variable is more likely to occur with certain values of the other variable
  Data analysis with two variables:
  1. Tell whether there is an association
  2. Describe that association

Contingency Table
  Displays two categorical variables
  The rows list the categories of one variable; the columns list the categories of the other
  Entries in the table are frequencies

Contingency Table
  What is the response (outcome) variable? The explanatory variable?
  What proportion of organic foods contain pesticides? Of conventionally grown foods?
  What proportion of all sampled foods contain pesticides?

Proportions & Conditional Proportions
  Side-by-side bar charts show conditional proportions and allow for easy comparison
  If there were no association, the conditional proportions would be the same in every group
  Because there is an association, the conditional proportions differ

3.2 The Association between Two Quantitative Variables

Internet Usage & GDP Data Set

  Country          INTERNET     GDP
  Algeria              0.65    6.09
  Argentina           10.08   11.32
  Australia           37.14   25.37
  Austria             38.7    26.73
  Belgium             31.04   25.52
  Brazil               4.66    7.36
  Canada              46.66   27.13
  Chile               20.14    9.19
  China                2.57    4.02
  Denmark             42.95   29
  Egypt                0.93    3.52
  Finland             43.03   24.43
  France              26.38   23.99
  Germany             37.36   25.35
  Greece              13.21   17.44
  India                0.68    2.84
  Iran                 1.56    6
  Ireland             23.31   32.41
  Israel              27.66   19.79
  Japan               38.42   25.13
  Malaysia            27.31    8.75
  Mexico               3.62    8.43
  Netherlands         49.05   27.19
  New Zealand         46.12   19.16
  Nigeria              0.1     0.85
  Norway              46.38   29.62
  Pakistan             0.34    1.89
  Philippines          2.56    3.84
  Russia               2.93    7.1
  Saudi Arabia         1.34   13.33
  South Africa         6.49   11.29
  Spain               18.27   20.15
  Sweden              51.63   24.18
  Switzerland         30.7    28.1
  Turkey               6.04    5.89
  United Kingdom      32.96   24.16
  United States       50.15   34.32
  Vietnam              1.24    2.07
  Yemen                0.09    0.79

Scatterplot
  Graph of two quantitative variables
  Horizontal axis: explanatory variable, x
  Vertical axis: response variable, y

Interpreting Scatterplots
  The overall pattern includes the trend, direction, and strength of the relationship
  Trend: linear, curved, clusters, no pattern
  Direction: positive, negative, no direction
  Strength: how closely the points fit the trend
  Also look for outliers from the overall trend

Used-car Dealership
  What association would we expect between the age of the car and its mileage?
  a) Positive
  b) Negative
  c) No association

Linear Correlation, r
  Measures the strength and direction of the linear association between x and y
  Correlation coefficient: $r = \frac{1}{n-1} \sum \left(\frac{x - \bar{x}}{s_x}\right)\left(\frac{y - \bar{y}}{s_y}\right)$

Measuring Strength & Direction of a Linear Relationship
  Positive r => positive association
  Negative r => negative association
  r close to +1 or -1 indicates a strong linear association
  r close to 0 indicates a weak linear association

3.3 Can We Predict the Outcome of a Variable?
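One way to make this section's question concrete is to fit a line to the Internet use and GDP data from Section 3.2. The sketch below is illustrative only. It assumes GDP is the explanatory variable x and Internet use is the response y (the slides do not say which is which), and it relies on standard textbook formulas that are not written out on the slides: r as the averaged product of z-scores, slope b = r(s_y / s_x), and intercept a = ȳ − b·x̄. The helper names `correlation` and `least_squares` are hypothetical.

```python
from statistics import mean, stdev

# Values copied from the Internet Usage & GDP data set, in country order
# (Algeria ... Yemen). ASSUMPTION: GDP is x, Internet use is y.
gdp = [6.09, 11.32, 25.37, 26.73, 25.52, 7.36, 27.13, 9.19, 4.02, 29.0,
       3.52, 24.43, 23.99, 25.35, 17.44, 2.84, 6.0, 32.41, 19.79, 25.13,
       8.75, 8.43, 27.19, 19.16, 0.85, 29.62, 1.89, 3.84, 7.1, 13.33,
       11.29, 20.15, 24.18, 28.1, 5.89, 24.16, 34.32, 2.07, 0.79]
internet = [0.65, 10.08, 37.14, 38.7, 31.04, 4.66, 46.66, 20.14, 2.57, 42.95,
            0.93, 43.03, 26.38, 37.36, 13.21, 0.68, 1.56, 23.31, 27.66, 38.42,
            27.31, 3.62, 49.05, 46.12, 0.1, 46.38, 0.34, 2.56, 2.93, 1.34,
            6.49, 18.27, 51.63, 30.7, 6.04, 32.96, 50.15, 1.24, 0.09]


def correlation(x, y):
    """Correlation r: sum of products of z-scores, divided by n - 1."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    return sum(((xi - mx) / sx) * ((yi - my) / sy)
               for xi, yi in zip(x, y)) / (n - 1)


def least_squares(x, y):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    b = correlation(x, y) * stdev(y) / stdev(x)  # slope
    a = mean(y) - b * mean(x)                    # y-intercept
    return a, b


r = correlation(gdp, internet)
a, b = least_squares(gdp, internet)

# Residual = observed y minus predicted y-hat, one per country.
residuals = [yi - (a + b * xi) for xi, yi in zip(gdp, internet)]

print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
print(f"y-hat = {a:.2f} + {b:.2f} x")
print(f"sum of residuals = {sum(residuals):.2e}")
```

Swapping the two lists leaves r unchanged but gives a different slope, which is the contrast between slope and correlation drawn later in this section.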
Regression Line
  Predicts y, given x: $\hat{y} = a + bx$
  The y-intercept and slope are a and b
  Only an estimate: actual data vary
  Describes the relationship between x and the estimated means of y

Residuals
  Prediction errors: the vertical distance between a data point and the regression line
  A large residual indicates an unusual observation
  Each residual is $y - \hat{y}$
  For the least squares line, the sum of the residuals is always zero
  Goal: minimize the distance from the data to the regression line

Least Squares Method
  Residual sum of squares: $\sum (\text{residuals})^2 = \sum (y - \hat{y})^2$
  The least squares regression line minimizes the sum of squared vertical distances between the points and their predictions

Regression Analysis
  Identify the response and explanatory variables
  The response variable is y
  The explanatory variable is x

Can Anthropologists Predict Height Using Remains?
  Regression equation: $\hat{y} = 61.4 + 2.4x$
  $\hat{y}$ is the predicted height (cm) and x is the length of a femur, or thighbone (cm)
  Predict the height for a femur length of 50 cm: $\hat{y} = 61.4 + 2.4(50) = 181.4$ cm

Interpreting the y-Intercept and Slope
  y-intercept: the y-value when x = 0; helps plot the line
  Slope: the change in y for a 1-unit increase in x
  For $\hat{y} = 61.4 + 2.4x$, a 1 cm increase in femur length means a 2.4 cm increase in predicted height

Slope Values: Positive, Negative, Zero

Slope and Correlation
  Slope, b:
    Doesn’t tell strength
    Has units
    Changes if x and y are swapped
  Correlation, r:
    Describes strength
    No units
    Same if x and y are swapped

Squared Correlation, $r^2$
  Proportional reduction in error
  Proportion of the variation in y-values explained by the linear relationship of y to x
  A correlation of r = .9 means $r^2 = (.9)^2 = .81 = 81\%$
  81% of the variation in y is explained by x

3.4 What Are Some Cautions in Analyzing Associations?

Extrapolation
  (Figure source: Neil Weiss, Elementary Statistics, 7th Edition)
  Extrapolation: predicting y for x-values outside the range of the data
  Riskier the farther x is from the observed range
  No guarantee that the trend holds

Outliers and Influential Points
  A regression outlier lies far from the trend of the rest of the data
  An observation is influential if both:
  1. Its x-value is low or high compared to the rest of the data
  2. It is a regression outlier

Correlation Does Not Imply Causation
  A strong correlation between x and y means a strong linear association between the variables
  It does not mean that x causes y
  Ex. 95.6% of cancer patients have eaten pickles, so do pickles cause cancer?

Lurking Variables & Confounding
  1. Ice cream sales & drowning => temperature
  2. Reading level & shoe size => age
  Confounding: two explanatory variables are both associated with the response variable and with each other
  Lurking variables: not measured in the study, but may confound

Simpson’s Paradox Example
  Simpson’s Paradox: the association between two variables reverses after a third variable is included
  Probability of death for smokers = 139/582 = 24%
  Probability of death for nonsmokers = 230/732 = 31%
  Break out the data by age
  Associations look quite different after adjusting for the third variable
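
The reversal in Simpson's paradox can be sketched in a few lines. The pooled counts below (139/582 and 230/732) come from the slide. The age breakdown is not reproduced in the text, so the by-age counts here are HYPOTHETICAL numbers chosen only to show how a reversal can arise. The names `rate` and `pooled_rate` are illustrative helpers.

```python
def rate(deaths, total):
    """Conditional proportion of deaths within one group."""
    return deaths / total


# Pooled (deaths, total) counts taken from the slide.
slide_counts = {"smoker": (139, 582), "nonsmoker": (230, 732)}
slide_rates = {g: rate(d, n) for g, (d, n) in slide_counts.items()}

# HYPOTHETICAL (deaths, total) counts by age group; NOT from the slides.
# Smokers are mostly in the younger, low-risk group.
hypothetical = {
    "younger": {"smoker": (20, 400), "nonsmoker": (6, 200)},
    "older": {"smoker": (60, 100), "nonsmoker": (250, 500)},
}


def pooled_rate(strata, group):
    """Death rate for one group after ignoring (pooling over) age."""
    deaths = sum(s[group][0] for s in strata.values())
    total = sum(s[group][1] for s in strata.values())
    return rate(deaths, total)


for group, p in slide_rates.items():
    print(f"slide data, {group}: {p:.0%}")

for age, groups in hypothetical.items():
    for group, (d, n) in groups.items():
        print(f"hypothetical, {age}, {group}: {rate(d, n):.0%}")

for group in ("smoker", "nonsmoker"):
    print(f"hypothetical, pooled, {group}: {pooled_rate(hypothetical, group):.1%}")
```

In the hypothetical counts, smokers have the higher death rate within each age group but the lower rate once the groups are pooled, because age is associated with both smoking status and death.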