Association

Variables
–Response – an outcome variable whose values exhibit variability.
–Explanatory – a variable that we use to try to explain the variability in the response.

Association
There is an association between two variables if values of one variable are more likely to occur with certain values of a second variable.

Picturing Association
Two Categorical (Qualitative).
–Cross-tabs table, mosaic plot.
Two Numerical (Quantitative).
–Scatter diagram.

Categorical Data
Who? – Students in a statistics class at Penn State University.
What?
– “With whom is it easiest to make friends?” Opposite sex, same sex, no difference.
– Gender. Male, female.

Cross-tabs Table
With whom is it easiest to make friends?

          Same Sex   Opposite Sex   No Diff   Total
Female       16          58            63      137
Male         13          15            40       68
Total        29          73           103      205

Bar Graph
With whom is it easiest to make friends?

[Bar graph of Answer: No Diff 50.2%, Opposite 35.6%, Same 14.1%]

Frequencies
Level      Count   Prob
No Diff     103    0.50244
Opposite     73    0.35610
Same         29    0.14146
Total       205    1.00000
N Missing 0, 3 Levels

Percentages
With whom is it easiest to make friends?

Count (Row %)   Same Sex     Opposite Sex   No Diff      Total
Female          16 (11.7%)   58 (42.3%)     63 (46.0%)   137 (100%)
Male            13 (19.1%)   15 (22.1%)     40 (58.8%)    68 (100%)
Total           29           73            103           205

Mosaic Plot

[Mosaic plot: Answer (Same, Opposite, No Diff) by Gender (Female, Male); vertical axis from 0.00 to 1.00]

Interpretation
More than 50% of males say no difference, while less than 50% of females say no difference.
Females are about twice as likely as males to say opposite sex (42.3% vs. 22.1%).
Males are about 1.6 times as likely as females to say same sex (19.1% vs. 11.7%).

Scatter Plot
Statistics is about … variation.
Recognize, quantify and try to explain variation.
Variation in two quantitative variables is displayed in a scatter plot.

Scatter Plot
Numerical variable on the vertical axis, y, is the response variable.
Numerical variable on the horizontal axis, x, is the explanatory variable.

Scatter Plot
Example: Body mass (kg) and Bite force (N) for Canidae.
–y, Response: Bite force (N)
–x, Explanatory: Body mass (kg)
–Cases: 28 species of Canidae.

[Scatter plot: Bivariate Fit of BFca (N) By Body Mass (kg); vertical axis BFca (N), horizontal axis Body Mass (kg)]

Positive Association
– Above average values of Bite force are associated with above average values of Body mass.
– Below average values of Bite force are associated with below average values of Body mass.

Scatter Plot
Example: Outside temperature and amount of natural gas used.
– Response: Natural gas used (1000 ft³).
– Explanatory: Outside temperature (°C).
– Cases: 26 days.

[Scatter plot: Gas versus Temp]

Negative Association
–Above average values of gas are associated with below average temperatures.
–Below average values of gas are associated with above average temperatures.

Association
Positive
–As x goes up, y tends to go up.
Negative
–As x goes up, y tends to go down.

Correlation
Linear Association
– How closely do the points on the scatter plot represent a straight line?
– The correlation coefficient gives the direction of and quantifies the strength of the linear association between two quantitative variables.

Correlation
Standardize y: $z_y = \frac{y - \bar{y}}{s_y}$
Standardize x: $z_x = \frac{x - \bar{x}}{s_x}$

[Scatter plot: Bite Force vs Body Mass of Canidae; Standardized Bite Force versus Standardized Body Mass]

Correlation Coefficient
$r = \frac{\sum z_x z_y}{n - 1}$
$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1) s_x s_y}$

Correlation Coefficient
Body mass and Bite force
$r = \frac{\sum z_x z_y}{n - 1} = \frac{26.4796}{27}$
r = 0.9807

Correlation Coefficient
There is a very strong positive correlation, linear association, between the body mass and bite force for the various species of Canidae.
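
As a rough check on the correlation slides, the formula $r = \frac{\sum z_x z_y}{n - 1}$ can be sketched in Python. The function name `correlation` and the three-point toy data are illustrative assumptions and are not part of the Canidae data set; only the sum 26.4796 and n = 28 come from the slides.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation as the sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    z_x = [(x - x_bar) / s_x for x in xs]
    z_y = [(y - y_bar) / s_y for y in ys]
    return sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)


# Hypothetical toy data lying exactly on a rising straight line,
# so the correlation should be 1 (up to floating-point rounding).
toy_r = correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

# Values quoted on the slides: the z-score products sum to 26.4796
# for n = 28 species of Canidae.
canidae_r = 26.4796 / (28 - 1)

print(toy_r, round(canidae_r, 4))
```

Standardizing first is what makes r unit-free: the z-scores have no units, so neither does their average product.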
JMP
Analyze – Multivariate methods – Multivariate
Y, Columns
– Body mass
– BF ca (Bite force at the canine)

Multivariate Correlations

                 Body Mass (kg)   BFca (N)
Body Mass (kg)       1.0000        0.9807
BFca (N)             0.9807        1.0000

[Scatterplot Matrix: Body Mass (kg) and BFca (N)]

Correlation Properties
The sign of r indicates the direction of the association.
The value of r is always between –1 and +1.
Correlation has no units.
Correlation is not affected by changes of center or scale.

Algebra Review
The equation of a straight line: y = mx + b
– m is the slope – the change in y over the change in x – or rise over run.
– b is the y-intercept – the value where the line cuts the y axis.

[Graph of the line y = 3x + 2 for x from -5 to 5]

Review
y = 3x + 2
–x = 0, y = 2 (y-intercept)
–x = 3, y = 11
– Change in y (+9) divided by the change in x (+3) gives the slope, 3.

Linear Regression
Example: Body mass (kg) and Bite force (N) for Canidae.
–y, Response: Bite force (N)
–x, Explanatory: Body mass (kg)
–Cases: 28 species of Canidae.

Correlation Coefficient
Body mass and Bite force
$r = \frac{\sum z_x z_y}{n - 1} = \frac{26.4796}{27}$
r = 0.9807

Correlation Coefficient
There is a strong correlation, linear association, between the body mass and bite force for the various species of Canidae.

Linear Model
The linear model is the equation of a straight line through the data.
A point on the straight line through the data gives a predicted value of y, denoted $\hat{y}$.

Residual
The difference between the observed value of y and the predicted value of y, $\hat{y}$, is called the residual.
Residual = $y - \hat{y}$

[Regression plot: BF ca (N) versus Body mass (kg), with a residual labeled]

Line of “Best Fit”
There are lots of straight lines that go through the data.
The line of “best fit” is the line for which the sum of squared residuals is the smallest – the least squares line.
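
To make the least squares idea concrete, the following sketch compares the sum of squared residuals for the least squares line with that of nearby lines. The data are made-up toy numbers (not the Canidae measurements), and the names `sum_sq_residuals` and `least_squares` are hypothetical; the slope and intercept use the standard closed-form least squares solution.

```python
from statistics import mean


def sum_sq_residuals(a, b, xs, ys):
    """Sum of squared residuals (y - y_hat) for the line y_hat = a + b * x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def least_squares(xs, ys):
    """Closed-form least squares intercept a and slope b."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


# Made-up toy data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares(xs, ys)
best = sum_sq_residuals(a, b, xs, ys)

# Nudging the intercept or slope away from the least squares values
# should never make the sum of squared residuals smaller.
nudged = [
    sum_sq_residuals(a + da, b + db, xs, ys)
    for da in (-0.5, 0.0, 0.5)
    for db in (-0.2, 0.0, 0.2)
]
print(best, min(nudged))
```

Trying a handful of nudged lines is only a spot check, not a proof, but it shows what "smallest sum of squared residuals" means in practice.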
Line of “Best Fit”
Some positive and some negative residuals but they sum to zero.
Passes through the point $(\bar{x}, \bar{y})$.

Line of “Best Fit”
$\hat{y} = a + bx$
Least squares
–slope: $b = r \frac{s_y}{s_x}$
–intercept: $a = \bar{y} - b\bar{x}$

Least Squares Estimates
Body mass, x: $\bar{x} = 9.207$ kg, $s_x = 8.016$ kg
Bite force, y: $\bar{y} = 154.029$ N, $s_y = 109.760$ N
r = 0.9807

Least Squares Estimates
$b = 0.9807 \cdot \frac{109.760}{8.016} = 13.428$
$a = 154.029 - 13.428(9.207) = 30.397$
$\hat{y} = 30.397 + 13.428x$

Interpretation
Slope – for a 1 kg increase in body mass, the bite force increases, on average, 13.428 N.
Intercept – there is not a reasonable interpretation of the intercept in this context because one wouldn’t see a Canidae with a body mass of 0 kg.

[Plot: Bite Force vs Body Mass; BF ca (N) versus Body mass (kg) with the line $\hat{y} = 30.397 + 13.428x$]

Prediction
Least squares line: $\hat{y} = 30.397 + 13.428x$
x = 25: $\hat{y} = 30.397 + 13.428(25) = 366.1$ N

Residual
Body mass, x = 25 kg
Bite force, y = 351.5 N
Predicted, $\hat{y}$ = 366.1 N
Residual, $y - \hat{y}$ = 351.5 – 366.1 = –14.6 N

Residuals
Residuals help us see if the linear model makes sense.
Plot residuals versus the explanatory variable.
– If the plot is a random scatter of points, then the linear model is the best we can do.

[Plot of Residuals vs Body Mass; Residual versus Body mass (kg)]

Interpretation of the Plot
The residuals are scattered randomly. This indicates that the linear model is an appropriate model for the relationship between body mass and bite force for Canidae.

$r^2$ or $R^2$
The square of the correlation coefficient gives the proportion of the variation in y that is accounted for, or explained, by the linear relationship with x.

Body mass and Bite force
r = 0.9807
$r^2 = (0.9807)^2 = 0.962$ or 96.2%
96.2% of the variation in bite force can be explained by the linear relationship with body mass.

Regression Conditions
Quantitative variables – both variables should be quantitative.
Linear model – does the scatter diagram show a reasonably straight line?
Outliers – watch out for outliers as they can be very influential.

Regression Cautions
Beware of extraordinary points.
Don’t extrapolate beyond the data.
Don’t infer x causes y just because there is a good linear model relating the two variables.

Extraordinary Points
https://netfiles.uiuc.edu/jimarden/www/cuwu/datalist.html
–Scatter Plots
–Check – Blank Plot and click Update
–Add point

Don’t Extrapolate
Explanatory (x) – Average outdoor temperature (°C).
Response (y) – Amount of natural gas used (1000 cu ft).
$\hat{y} = 6.85 - 0.393x$

[Scatter plot: Gas versus Temp]

Don’t Extrapolate
Explanatory (x = 20) – Average outdoor temperature (°C).
Response (y) – Amount of natural gas used (1000 cu ft).
$\hat{y} = 6.85 - 0.393(20)$
$\hat{y} = -1.01$
A negative amount of gas used is not possible.

Correlation ≠ Causation
Don’t confuse correlation with causation.
– There is a strong positive correlation between the number of crimes committed in communities and the number of 2nd graders in those communities.
Beware of lurking variables.
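
Looking back at the Canidae regression and the natural gas extrapolation, the arithmetic can be retraced in a few lines. This is a sketch that uses only the summary statistics and fitted equations quoted on the slides; the function names `fit_from_summary` and `predict` are hypothetical.

```python
def fit_from_summary(x_bar, s_x, y_bar, s_y, r):
    """Least squares intercept and slope from summary statistics."""
    b = r * s_y / s_x        # slope: b = r * s_y / s_x
    a = y_bar - b * x_bar    # intercept: a = y_bar - b * x_bar
    return a, b


def predict(a, b, x):
    """Predicted response y_hat = a + b * x."""
    return a + b * x


# Canidae example: summary statistics as quoted on the slides.
a, b = fit_from_summary(x_bar=9.207, s_x=8.016, y_bar=154.029, s_y=109.760, r=0.9807)
y_hat = predict(a, b, 25)    # predicted bite force (N) at 25 kg
residual = 351.5 - y_hat     # observed minus predicted
r_squared = 0.9807 ** 2      # proportion of variation explained

# Natural gas example: the slides' fitted line evaluated at 20 degrees C,
# which is the extrapolation the slides warn against.
gas_at_20 = predict(6.85, -0.393, 20)

print(round(b, 3), round(a, 3), round(y_hat, 1), round(residual, 1))
print(round(r_squared, 3), round(gas_at_20, 2))
```

Because the slides round the slope to 13.428 before computing the intercept, the unrounded intercept here comes out near 30.394 rather than 30.397; the prediction at 25 kg still rounds to 366.1 N. The negative gas prediction at 20 °C is the warning sign that the line is being used outside the range where it was fitted.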