Understanding Sum of Squares

Six facts about least squares regression:

1. The residuals sum to 0:
$$\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} (Y_i - \hat{Y}_i) = \sum_{i=1}^{n} [Y_i - (b_0 + b_1 X_i)] = 0$$

2. The sum of squared residuals is minimized:
$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} [Y_i - (b_0 + b_1 X_i)]^2$$

3. The sum of the observed values equals the sum of the fitted values:
$$\sum_{i=1}^{n} Y_i = \sum_{i=1}^{n} \hat{Y}_i$$

4. The sum of the residuals multiplied by each predictor value is 0:
$$\sum_{i=1}^{n} X_i e_i = \sum_{i=1}^{n} X_i (Y_i - \hat{Y}_i) = \sum_{i=1}^{n} X_i [Y_i - (b_0 + b_1 X_i)] = 0$$

5. The sum of the residuals multiplied by each fitted value is 0:
$$\sum_{i=1}^{n} \hat{Y}_i e_i = \sum_{i=1}^{n} \hat{Y}_i (Y_i - \hat{Y}_i) = \sum_{i=1}^{n} \hat{Y}_i [Y_i - (b_0 + b_1 X_i)] = 0$$

6. The regression line always goes through the point $(\bar{X}, \bar{Y})$.

These are found on page 24 of the text. I am not going to prove them; it is not too difficult, just messy.

Partitioning the Sum of Squares (page 63)

One idea of regression is to partition the inherent variation among the Y values into two pieces. Let us look at our data using X = study hours to predict Y = GPA, studygpa.sas7bdat. Open the program chap2b.sas.

proc means data=mydata.studygpa;
var gpa study;
run; quit;

symbol1 value=dot I=R;
proc gplot data=mydata.studygpa;
plot gpa*study / vref=3.24;
label gpa='GPA';
label study='Hours of study per week';
title 'GPA vs. Hours studied';
run; quit;

We have:
$$Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$$

The Total Sum of Squares (or Sum of Squares Total, SSTO) is just the total variation in the Y values:
$$SSTO = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$

The error part we are already familiar with; it is just the sum of the squared residuals:
$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

The Regression Sum of Squares (or Sum of Squares due to Regression, SSR) is the leftover piece:
$$SSR = SSTO - SSE = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 - \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 .$$

Because of the six facts we went over earlier, it is also true that
$$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i + \hat{Y}_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 ,$$
that is, SSTO = SSR + SSE.

Analysis of Variance (ANOVA) Table

Degrees of Freedom (df)
The SSTO has n - 1 degrees of freedom (we lost a degree when we estimated $\bar{Y}$). The SSR has one degree of freedom because we are using one predictor. The SSE has n - 2 degrees of freedom (which I like to think of as because we estimated the intercept and the slope).

Mean Squares (MS)
The mean squares are just the sums of squares divided by the appropriate degrees of freedom. We summarize these quantities in the Analysis of Variance table.

Source of Variation | Sum of Squares (SS)                            | df    | Mean Square (MS)    | F
Regression          | $SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$ | 1     | $MSR = SSR / 1$     | $F = MSR / MSE$
Error               | $SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$     | n - 2 | $MSE = SSE / (n-2)$ |
Total               | $SSTO = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$      | n - 1 |                     |

The F-statistic in the final column is the t-test statistic for the slope squared, and its p-value is exactly the same as the two-sided p-value for the slope.

For the study–GPA data, the ANOVA table is:
[SAS output: ANOVA table for the regression of GPA on hours of study]

The MSR is estimating the quantity
$$\sigma^2 + \beta_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2 ,$$
and the MSE is estimating $\sigma^2$. So when we calculate $F = MSR / MSE$, we are estimating the ratio
$$\frac{\sigma^2 + \beta_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2}{\sigma^2} .$$
Notice that if the slope $\beta_1 = 0$, then this ratio is just 1. So the larger the F statistic is, the more likely we are to reject the null hypothesis.

Coefficient of Determination R²

One measure of the goodness of model fit is the coefficient of determination $R^2$. This number is always between 0 and 1 because it is defined as
$$R^2 = \frac{SSR}{SSTO} = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}, \qquad 0 \le R^2 \le 1.$$
The closer the value of $R^2$ is to 1, the better the fit.

Interpretation: The way that we interpret $R^2$ is to say:
"[Insert the $R^2 \times 100$ value here] percent of the variation in Y is explained by variation in X."
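The ANOVA table, F statistic, and $R^2$ for the study–GPA data can be produced with proc reg. A minimal sketch (assuming, as above, that gpa is the response and study is the predictor in mydata.studygpa):

proc reg data=mydata.studygpa;
   model gpa = study;   /* prints the ANOVA table, the F test, and R-square */
run; quit;

The F value and its p-value in this output should match the square of the t statistic for the slope and its two-sided p-value.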
So for the study–GPA data, we say that 11.77% of the variation in Y = GPA is explained by variation in X = hours of study.

In some sense the test of the hypothesis $H_0: \beta_1 = 0$ answers the question "Is X useful in predicting Y?" The $R^2$ value answers the question "How useful is X in predicting Y?" There are no hard and fast rules about how high $R^2$ has to be in order to be deemed a good fit. Higher is better, but in some research areas low $R^2$ values are common, whereas in other fields values much closer to 1 are expected.

Correlation Coefficient r

The literature is full of references to the correlation coefficient. It is the square root of $R^2$, with the same sign as the slope, so it is a number between -1 and 1 and it indicates the direction of the relationship. We can get it by taking $\sqrt{R^2}$ and attaching the sign of the slope, or by using proc corr.

Notice that the p-value is the same as for the test of the slope we did earlier. The equivalent test of hypothesis is
$$H_0: \rho = 0 \qquad H_A: \rho \ne 0,$$
where $\rho$ is the correlation between X and Y.

A nice thing about the correlation and the coefficient of determination is that it does not matter which variable is X and which is Y.

You can play the Guess the Correlation game: http://istics.net/stat/correlations/

We can calculate a confidence interval for the correlation coefficient as well:

proc corr data=mydata.studygpa fisher;
var gpa study;
run; quit;

Interpretation:

Spearman Rank Correlation

This is sometimes used if the data are not normally distributed, with severe departures. Basically, we rank each variable from low to high (within each variable) and then calculate the correlation between the ranks.

Example: cigweight.sas7bdat. A study is conducted to investigate the relationship between cigarette smoking during pregnancy and the weights of newborn infants. A sample of 15 women smokers kept accurate records of the number of cigarettes smoked during their pregnancies, and the weights of their children were recorded at birth.

Woman | Cigs per day | Rank | Baby weight (lbs) | Rank
1     | 12           | 1    | 7.7               | 5
2     | 15           | 2    | 8.1               | 9
3     | 35           | 13   | 6.9               | 4
4     | 21           | 7    | 8.2               | 10
5     | 20           | 5.5  | 8.6               | 13.5
6     | 17           | 3    | 8.3               | 11.5
7     | 19           | 4    | 9.4               | 15
8     | 46           | 15   | 7.8               | 6
9     | 20           | 5.5  | 8.3               | 11.5
10    | 25           | 8.5  | 5.2               | 1
11    | 39           | 14   | 6.4               | 3
12    | 25           | 8.5  | 7.9               | 7
13    | 30           | 12   | 8.0               | 8
14    | 27           | 10   | 6.1               | 2
15    | 29           | 11   | 8.6               | 13.5

The procedure for this nonparametric test (a sketch of the calculation follows the SAS examples below):
1. State the null hypothesis: $H_0: \rho_s = 0$; $H_A: \rho_s \ne 0$, where $\rho_s$ is the population rank correlation coefficient.
2. Rank each column of values and break ties in the usual way (tied values get the average of the ranks they would occupy).
3. Calculate the Pearson correlation coefficient with the ranks.

proc print data=mydata.cigweight;
run; quit;

proc corr data=mydata.cigweight;
var rankcigs rankweight;
run; quit;

Or we can have SAS calculate the Spearman rank correlation directly:

proc corr data=mydata.cigweight spearman;
var cigs weight;
run; quit;

proc corr data=mydata.cigweight pearson spearman;
var cigs weight;
run; quit;
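As a sketch of steps 2 and 3 of the procedure above, we can let proc rank do the ranking and then take the Pearson correlation of the ranks. The output data set and the new rank variable names below (ranked, rankcigs2, rankweight2) are just illustrative:

proc rank data=mydata.cigweight out=ranked ties=mean;
   var cigs weight;
   ranks rankcigs2 rankweight2;   /* ties get the average rank, matching the hand-computed table */
run;

proc corr data=ranked pearson;
   var rankcigs2 rankweight2;     /* Pearson correlation of the ranks = Spearman correlation */
run; quit;

This should reproduce the value that proc corr reports directly when given the spearman option.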
Misunderstandings of a high R² or a high correlation value:

1. Correlation does NOT mean causation. Just because two variables are highly related does not mean one causes the other. It might mean that one causes the other, but typically there is some other variable that is affecting both X and Y. Such a variable is called a confounder. For example, monthly ice cream sales and the number of drownings are correlated, but that does not mean ice cream causes drowning; the underlying variable is temperature. Temperature is the confounder.

Other misunderstandings are listed in the text on page 75:

1. A high coefficient of determination indicates that useful predictions can be made. Often the prediction intervals for Y are so large that they are useless.
2. A high coefficient of determination indicates that the estimated regression line is a good fit. There may be curvature in the plot; we could be underfitting or overfitting in patterns. This often happens in time series data.
3. A coefficient of determination near zero indicates that X and Y are not related. This may happen when there is a relationship between X and Y, but it is not a linear relationship.

[Figure: "Non-linear Relationship" — scatterplot of Y versus X showing a strong curved (non-linear) pattern]

Practice Homework

Examine the data ch1copier2.sas7bdat (context explained on page 35).

1. Use proc reg to determine the 99% confidence interval for the slope for predicting minutes from number of calls.
2. Use proc reg to determine the coefficient of determination. Interpret its value in context.
3. Use proc corr to calculate the Pearson correlation coefficient. Give its p-value. Verify that the p-value is the same as for the test of the slope with proc reg.
4. Give the 95% confidence interval for the Pearson correlation coefficient.
5. Use proc corr to calculate the Spearman correlation coefficient. Give its p-value. Does it lead to the same conclusion that the Pearson correlation coefficient does?
6. Give the 95% confidence interval for the mean number of minutes when the number of calls is 5.
7. Give the 95% prediction interval for the number of minutes when the number of calls is 5.
8. Explain the difference between these two intervals.
9. Make a graph that shows the scatterplot using number of calls to predict minutes, the regression line, the 95% confidence interval for the mean minutes for a given number of calls, and the 95% prediction interval for the minutes given the number of calls.
10. Examine the ANOVA table below. Fill in the missing values (using a calculator).

Analysis of Variance
Source          | DF | Sum of Squares | Mean Square | F Value | Pr > F
Model           | 1  | 2.11505        |             |         | 0.1961
Error           |    | 14.80495       | 1.13884     |         |
Corrected Total | 14 | 16.92000       |             |         |

Use the data nc20101000.sas7bdat.

1. Use the variable weeks (weeks of gestation) to predict the birth weight in pounds (tpounds). Give and interpret the 95% confidence interval for the slope.
2. Give the coefficient of determination and interpret its value.
3. Make a graph that shows the scatterplot using weeks to predict tpounds, the regression line, the 95% confidence interval for the mean pounds for given weeks of gestation, and the 95% prediction interval for the total pounds given the weeks of gestation.
4. Give the 95% prediction interval of birth weight for a mother who carries 35 weeks.
5. Give the Spearman correlation coefficient for weeks and tpounds.
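For the interval questions above, the general proc reg pattern looks like the sketch below. This is only a sketch: the variable names minutes and calls are assumptions about the ch1copier2 data set, and getting the intervals at a specific value (such as 5 calls) additionally requires appending a new observation with that X value and a missing Y, which is not shown here.

/* Sketch only: variable names minutes and calls are assumed, not verified */
proc reg data=mydata.ch1copier2;
   model minutes = calls / clb alpha=0.01;   /* 99% confidence limits for the intercept and slope */
run; quit;

proc reg data=mydata.ch1copier2;
   model minutes = calls / clm cli;          /* 95% CI for the mean response and 95% PI, per observation */
run; quit;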