Today's Agenda
- r², the coefficient of determination
- The bivariate normal assumption
- Diagnostic plots: residuals and Cook's distance
- R output (moved to Week 3)

Syllabus note: We are ahead of schedule in regression, so we're taking the time to add more examples and details, like Cook's distance and residuals.

r², the coefficient of determination

r² is simply the Pearson correlation coefficient r, but squared. So why all the fuss about it?

When x and y are correlated, we say that some of the variation in y is explained by x. The proportion explained is r². It is called the coefficient of determination because it represents how well a value of y can be determined by x.

Abstract case 1: If there were a perfect correlation between x and y (r = -1 or +1), then the relationship between them could be described perfectly by a line. Once you have the regression equation, knowing x allows you to determine what y is, without any error. In these cases, r² is 1, meaning that 100% of the variance in y is explained by x.

Abstract case 2: If there were NO correlation between x and y, so that r = 0, then there is no linear relationship between x and y. Knowing x and using the regression equation of that (lack of) relationship would tell you literally nothing about y. In these cases, r² is 0, so none of the variance in y is explained by x.

Medical example: On page 4 of 8 of this paper (Pak J Physiol 2010;6(1): http://www.pps.org.pk/PJP/6-1/Talay.pdf) there are several scatterplots describing the correlation between resting heart rate (RHR) and several other possibly related variables.

Consider the first scatterplot, Figure 1A. In this figure, a regression of body-mass index (BMI, y) as a function of resting heart rate (RHR, x) is shown.

(Figure: scatterplot of heart rate, x, against body-mass index, y.)

Here, the sample correlation is r = 0.305, and there is strong evidence that the population correlation is positive because p < 0.01. r² = 0.305² = 0.0930, so 9.3% of the variation in BMI can be explained by RHR. Also, 9.3% of the variation in RHR can be explained by BMI. Why? Correlation works in both directions.

In Figure 1B, the sample shows that some variation in waist-to-hip ratio (WHR) is explained by (and explains) RHR: 0.230² = 0.0529, or 5.3% of the variation.

If body-mass index explains 9.3% of the variation in RHR, and waist-to-hip ratio explains 5.3% of the variation, could they together explain 9.3 + 5.3 = 14.6%? Sadly, no. Since BMI and WHR measure very similar things, there is going to be a lot of overlap in the variation that they explain.
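To make the "both directions" point concrete, here is a minimal sketch in base R. The numbers are simulated to loosely mimic the RHR/BMI example (they are not the paper's data; the sample size and coefficients are made up for illustration). It just shows that r² is the squared Pearson correlation and comes out the same whichever variable plays the role of y.

  set.seed(302)
  rhr <- rnorm(200, mean = 75, sd = 10)           # hypothetical resting heart rates
  bmi <- 22 + 0.05 * rhr + rnorm(200, sd = 1.5)   # hypothetical, weakly related BMI values

  r <- cor(rhr, bmi)                # Pearson correlation r
  r^2                               # coefficient of determination, r-squared

  cor(bmi, rhr)^2                   # same value: correlation works in both directions

  summary(lm(bmi ~ rhr))$r.squared  # r-squared from regressing BMI on RHR...
  summary(lm(rhr ~ bmi))$r.squared  # ...matches the one from regressing RHR on BMI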
But what is this 'variation'? Let's dig deeper!

Recall that the regression equation without the error term, α + βx, is called the least squares line. The 'squares' being referred to are the squared errors. Mathematically, it is the line through the data that produces the smallest sum of squared errors (SSE),

SSE = Σ ε² = Σ (y - (α + βx))², summed over all observations,

where epsilon, ε, is the error term that we ignored earlier. The sum of squares error, SSE, is the amount of variation that is left unexplained by the model.

We use squared errors because...
- Otherwise negative and positive errors would cancel.
- This way, the regression equation will favour creating many small errors instead of one big one.*
- In calculus, the derivative of x² is easy to find.

* Also why Pearson correlation is sensitive to extreme values.

The error term is in any model we use, even the null model, which is a fancy term for not regressing at all:

y = α + ε, or equivalently y = ȳ + ε.

In the null model, every value of y is predicted to be the average of all observed y values. So α is the sample mean of y, ȳ (y-bar).

The total squared difference from the mean of y is called the sum of squares total, or SST:

SST = Σ (y - ȳ)².

In the figure, SST is the total squared length of all the vertical red lines (the distances from each point to ȳ).

If we fit a regression line, (most of the) errors become smaller. Most importantly, the squared errors get smaller. The coefficient of determination, r², measures how much smaller the squared errors get.

Here, the correlation is very strong (r is large), and there are barely any errors at all. So SSE would be much smaller than SST, and r² is also large.

The relationship between r², SSE, and SST is:

r² = (SST - SSE) / SST = 1 - SSE/SST.

SST is the total amount of variation in y. SSE is the amount of variation in y left unexplained by x. When r² is zero, SSE is the same as SST. When r² is one, SSE disappears completely. (A short R sketch checking this identity appears after the outlier example below.)

So we now have two different interpretations of r-squared:
1. The square of the correlation coefficient.
2. The proportion of the sum of squares total (SST) that is removed from the error term.

Interpretation #1 is specific to correlation. Interpretation #2 works for simple regression, but also for ANOVA, multiple regression, and general linear models!

R-squared is truly the go-anywhere animal.

Bivariate Normality (and some diagnostics)

Regression produces a line that minimizes the sum of squared errors, so a small number of extreme values (outliers) can have a strong effect on a model. Consider this Pearson r:

More specifically, regression is sensitive to violations of the assumption of bivariate normality. The regression model assumes:

1. The distributions of the x and y variables are normal. If you were to take a histogram of all the x values, that histogram should resemble a normal curve.

2. The distribution of y, conditional on x, is normal. If you were to take a histogram of all the error terms, that histogram should ALSO resemble a normal curve. Any observations that produce errors too large to fit in that curve are potentially influential outliers.

In this diagram, the red line is the regression on all 54 points. The blue line is the regression without the 4 red points.
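Here is a minimal base-R sketch of the same idea, using simulated data rather than the actual 54-point data set pictured in the slides (the true line, the number of suspect points, and their placement are made up for illustration). It checks the r² = 1 - SSE/SST identity promised above, then refits without the 4 suspect points, mirroring the red/blue comparison.

  set.seed(302)
  x <- c(runif(50, 0, 10), 0.5, 1.0, 1.2, 0.8)   # 50 ordinary points plus 4 at the low end of x
  y <- 5 + 2 * x + rnorm(54, sd = 2)             # simulated 'true' line: y = 5 + 2x
  y[51:54] <- y[51:54] - 15                      # give the 4 extra points very large errors

  fit_all <- lm(y ~ x)                           # the 'red line': all 54 points

  SST <- sum((y - mean(y))^2)                    # total variation around the null model (y-bar)
  SSE <- sum(resid(fit_all)^2)                   # variation left unexplained by the regression
  1 - SSE / SST                                  # matches summary(fit_all)$r.squared

  fit_drop <- lm(y[1:50] ~ x[1:50])              # the 'blue line': the 4 suspect points removed
  coef(fit_all)                                  # left end of the line pulled down by the outliers
  coef(fit_drop)                                 # much closer to the simulated line y = 5 + 2x

Comparing coef(fit_all) to coef(fit_drop) is exactly the manual "remove the suspect points and refit" check; Cook's distance, below, automates the one-at-a-time version.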
These 4 points are near the lower end of x and have very large error terms associated with them, so they 'pull' the left end of the regression line down.

Another word for these errors is residuals, literally the residue, or portion left over from the model. Here is a scatterplot of the residuals over x, a.k.a. a residual plot.

The outliers are clearly visible from the residual plot, and from the histogram below it: their residuals are twice as large as any other observation's.

One way to measure how much an outlier is affecting the model is to remove that one point and see how much the model changes. We can see a big difference between the blue and red lines above, but that comparison came from removing 4 points manually. A more systematic (and therefore quick, easy, and often more reliable) method is to remove one observation at a time and see how much the model changes.

Cook's distance is a regression deletion diagnostic. It works by comparing a model fit with every observation to one fit with only the observation in question removed (deleted). The higher Cook's distance is for a value, the more that particular value is influencing the model. If there are one or two values that have undue leverage on the model, Cook's distance will find them. This is true even if the residual plot fails to show them (which can happen if the observation is 'pulling' the line hard enough that its own residual ends up small).

This is Cook's distance for all 54 data points. Note that although all 4 problem points have high Cook's distance compared to the rest, two of them are not obvious problems. Cook's distance has a hard time identifying influential observations when there are several of them. (A short R sketch of these diagnostics appears at the end of these notes.)

Dealing with outliers is like selecting an acceptable Type I error rate: there are conventions and guidelines in place, but it is a case-by-case judgement call. One question to ask, considering things other than your model, is "does this observation belong in my data set?"

If the outlier is the result of a typo, it's not the same as the rest of your sample and it should go. If other information about that observation is nonsense, such as joke answers in a survey, then that's also justification to remove that outlier observation. If it just happens to be an extreme value, but otherwise everything seems fine with it, then it is best to keep it.

Don't rush to finish your model. Look for outliers first.

Next Tuesday:
- Diagnostics and regression in R.
- Correlation vs. causality.

Read: Rubin on causality, Sections 1-3 only, for next Tuesday.

Sources:
- xkcd.com/605, "My Hobby: Extrapolating."
- Sand crab photo by Regiane Cardillo, Brasil.
- http://www.pps.org.pk/PJP/6-1/Talay.pdf , Pak J Physiol (2010) 6(1).
- Mandarin duck and parrot on tortoise: photographer unknown.
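Finally, as a small preview of next Tuesday's session on diagnostics in R, here is a minimal sketch of the residual plot, the residual histogram, and Cook's distance discussed above, using base R's resid() and cooks.distance(). It runs on the same simulated data as the earlier sketch, not the course's 54-point example, and the 4/n cut-off at the end is just one common rule of thumb, not a hard rule.

  set.seed(302)
  x <- c(runif(50, 0, 10), 0.5, 1.0, 1.2, 0.8)   # 4 influential points at the low end of x
  y <- 5 + 2 * x + rnorm(54, sd = 2)
  y[51:54] <- y[51:54] - 15                      # ...with very large (negative) errors

  fit <- lm(y ~ x)

  plot(x, resid(fit), main = "Residual plot")    # residuals (errors) plotted over x
  abline(h = 0, lty = 2)

  hist(resid(fit))                               # should look roughly normal if the
                                                 # conditional-normality assumption holds

  d <- cooks.distance(fit)                       # one deletion diagnostic per observation
  plot(d, type = "h", main = "Cook's distance")
  which(d > 4 / length(y))                       # flag unusually influential observations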