Simple Linear Regression - Solutions

1 Relationship Between Eighth Grade IQ and Ninth Grade Math Score

For a statistics class project, students examined the relationship between x = 8th grade IQ and y = 9th grade math scores for 20 students. The data are displayed below.

Student   Math Score   IQ    Abstract Reas
1         33           95    28
2         31           100   24
3         35           100   29
4         38           102   30
5         41           103   33
6         37           105   32
7         37           106   34
8         39           106   36
9         43           106   38
10        40           109   39
11        41           110   40
12        44           110   43
13        40           111   41
14        45           112   42
15        48           112   46
16        45           114   44
17        31           114   41
18        47           115   47
19        43           117   42
20        48           118   49

Use Minitab on the dataset Finals found in the Datasets folder in ANGEL. Do Stat > Regression > Regression and enter the variable math score in the Response window and IQ in the Predictors window. Click 'Storage' and then 'Residuals' and 'Fits'. These will be stored in columns C3 and C4 and named RESI1 and FITS1. Your output should look as follows:

Regression Analysis: Math Score versus IQ

The regression equation is
Math Score = - 21.0 + 0.567 IQ

Predictor     Coef  SE Coef      T      P
Constant    -21.04    16.00  -1.32  0.205
IQ          0.5666   0.1475   3.84  0.001

S = 3.98537   R-Sq = 45.0%   R-Sq(adj) = 42.0%

Analysis of Variance

Source          DF      SS      MS      F      P
Regression       1  234.30  234.30  14.75  0.001
Residual Error  18  285.90   15.88
Total           19  520.20

a. Explain this equation. Discuss slope as change in Y per unit change in X in the context of the variables used in this problem.

The slope indicates that for a one-unit change in X, Y will change by the amount and direction of the slope. So here, for a 1-unit increase in eighth grade IQ, the predicted ninth grade math score increases by 0.567 points. The intercept, -21.0, is the predicted math score at an IQ of 0; this lies far outside the observed IQ range (95 to 118) and has no practical interpretation.

b. Create a scatter plot of the measurements by Graph > Scatterplot > Simple, and select IQ as the predictor (x-variable) and math score as the response (y-variable). Describe the relationship between math score and IQ.
[Figure: Scatterplot of Math Score vs IQ. x-axis: IQ (95 to 120); y-axis: Math Score (30 to 50).]

There is a positive relationship between math score (the response variable) and IQ (the explanatory variable): students with higher IQs tend to have higher math scores.

c. One of the students with a high IQ (number 17) appears to be an outlier. With a sample size of only 20, this can affect our normality assumption. Also, the constant variance assumption could be compromised. We can visually check for constant variance using a residual plot and test for normality using a probability plot. To get a residual plot, go to Graph > Scatterplot > Simple and enter RESI1 as the y-variable and FITS1 as the x-variable. Click OK. For the probability plot check of normality, go to Graph > Probability Plot > Single and enter RESI1 in the Graph variables window. This provides a test of the null hypothesis that the data follow a normal distribution. Based on these two graphs and what you have learned about hypothesis testing, what interpretations do you come to regarding the assumptions of constant variance and normality?

[Figure: Scatterplot of RESI1 vs FITS1. x-axis: FITS1 (32 to 46); y-axis: RESI1 (-15 to 5).]

From the residual plot, we can see the one outlier (lower right), while the remaining residuals appear to be scattered randomly around 0. This indicates a possible violation of constant variance. With a small sample size, the effect of outliers can be more extreme.

[Figure: Probability Plot of RESI1, Normal - 95% CI. x-axis: RESI1; y-axis: Percent. Legend: Mean 2.486900E-15, StDev 3.879, N 20, AD 0.723, P-Value 0.050.]

From the graph legend we have a p-value of 0.050, which equals our usual level of significance for hypothesis testing. Again, with such a small sample, we might be concerned that the normality assumption is not satisfied. Putting these two graph results together, we might want to investigate this outlier more thoroughly (e.g., is this a data entry error or a real score?).
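For readers working outside Minitab, the stored fits and residuals and the outlier check above can be sketched in plain Python. This is a minimal sketch and not part of the original activity: the names iq, math_score, and fit_line are illustrative assumptions, the data are transcribed from the table at the start of this handout, and the Anderson-Darling test itself is not reproduced.

```python
# Data transcribed from the handout's table (students 1-20).
iq = [95, 100, 100, 102, 103, 105, 106, 106, 106, 109,
      110, 110, 111, 112, 112, 114, 114, 115, 117, 118]
math_score = [33, 31, 35, 38, 41, 37, 37, 39, 43, 40,
              41, 44, 40, 45, 48, 45, 31, 47, 43, 48]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x.

    fit_line is a hypothetical helper written for this sketch.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


b0, b1 = fit_line(iq, math_score)

# Counterparts of Minitab's stored columns FITS1 and RESI1.
fits = [b0 + b1 * xi for xi in iq]
resid = [yi - fi for yi, fi in zip(math_score, fits)]

# Student numbers are 1-based; locate the largest residual in absolute value.
worst = max(range(len(resid)), key=lambda i: abs(resid[i])) + 1

print(f"Math Score = {b0:.2f} + {b1:.4f} IQ")
print(f"student 1: fit = {fits[0]:.4f}, residual = {resid[0]:.4f}")
print(f"largest |residual|: student {worst}, residual = {resid[worst - 1]:.2f}")
```

The coefficients and the first student's fit and residual should agree with the Minitab output up to rounding, and the student flagged by the largest absolute residual should be the same one that stands out in the residual plot.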
[NOTE: the AD test statistic refers to Anderson-Darling, which is a test of normality. There are other such tests, however.]

d. The least squares regression line for predicting math score from IQ is given in the above output. What is the fitted regression line (i.e., regression equation)?

The regression equation is Math Score = - 21.0 + 0.567 IQ

e. What do the values in the FITS and RESI columns represent?

The fits (FITS1) are the values of the response (math score) obtained when the observed values of the predictor (IQ) are entered into the regression equation. The residuals (RESI1) are the observed response values, Y, minus the fitted values. For example, the first student had an observed math score of 33 and a predicted math score of 32.7921, giving a residual of 33 - 32.7921 = 0.2079.

f. Based on the output, what is the test of the slope for this regression equation? That is, provide the null and alternative hypotheses, the test statistic, the p-value of the test, and state your decision and conclusion.

Ho: β1 = 0
Ha: β1 ≠ 0

The test statistic is t = 3.84 with a p-value of 0.001. Since this p-value is less than 0.05, we reject Ho and conclude that eighth grade IQ is a statistically significant linear predictor of ninth grade math scores.

2 Although outliers should never be deleted without a reason, there are several reasons why it may be legitimate to conduct an analysis without them. Delete the data point for row 17 (click on the cell in row 17 with the IQ of 114, enter *, and then click on any other cell; this "enters" the asterisk in that previous cell) and re-calculate the regression line for the remainder of the data.
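The deletion and refit can also be sketched in plain Python. This is a minimal sketch under stated assumptions, not Minitab's own procedure: None stands in for Minitab's * missing-value marker, all variable names are illustrative, and the data are transcribed from the table at the start of this handout. It also computes the slope's t statistic (slope divided by its standard error), the same quantity used in the slope test of part f.

```python
import math

# Data transcribed from the handout's table. None marks student 17's deleted
# IQ and stands in for Minitab's "*" (an assumption of this sketch).
iq = [95, 100, 100, 102, 103, 105, 106, 106, 106, 109,
      110, 110, 111, 112, 112, 114, None, 115, 117, 118]
math_score = [33, 31, 35, 38, 41, 37, 37, 39, 43, 40,
              41, 44, 40, 45, 48, 45, 31, 47, 43, 48]

# Keep complete cases only, so 19 of the 20 rows are used.
pairs = [(x, y) for x, y in zip(iq, math_score) if x is not None]
n = len(pairs)

x_bar = sum(x for x, _ in pairs) / n
y_bar = sum(y for _, y in pairs) / n
sxx = sum((x - x_bar) ** 2 for x, _ in pairs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
ss_total = sum((y - y_bar) ** 2 for _, y in pairs)

# Least-squares coefficients.
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Analysis-of-variance pieces: SS(Total) = SS(Regression) + SS(Error).
ss_regression = slope * sxy
ss_error = ss_total - ss_regression
s = math.sqrt(ss_error / (n - 2))   # residual standard deviation, "S"
se_slope = s / math.sqrt(sxx)       # standard error of the slope
t_slope = slope / se_slope          # test statistic for Ho: slope = 0
r_sq = ss_regression / ss_total

# The refit line evaluated at IQ = 114, the IQ of the deleted student.
pred_114 = intercept + slope * 114

print(f"{n} cases used")
print(f"Math Score = {intercept:.2f} + {slope:.5f} IQ")
print(f"S = {s:.5f}, R-Sq = {100 * r_sq:.1f}%, T(slope) = {t_slope:.2f}")
print(f"prediction at IQ 114: {pred_114:.3f}")
```

Computing the p-value would need the t distribution with n - 2 degrees of freedom, which the standard library does not provide, so the sketch stops at the test statistic.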
Minitab should produce the following output (Student 17 deleted):

Regression Analysis: Math Score versus IQ

The regression equation is
Math Score = - 32.2 + 0.676 IQ

19 cases used, 1 cases contain missing values

Predictor      Coef  SE Coef      T      P
Constant     -32.18    10.51  -3.06  0.007
IQ          0.67601  0.09718   6.96  0.000

S = 2.56190   R-Sq = 74.0%   R-Sq(adj) = 72.5%

Analysis of Variance

Source          DF      SS      MS      F      P
Regression       1  317.58  317.58  48.39  0.000
Residual Error  17  111.58    6.56
Total           18  429.16

a. Use the regression line with Student 17 deleted to estimate the math score for an individual who has an eighth grade IQ of 114. Do you think this estimate could be achieved by anybody?

The fitted regression equation is Math Score = - 32.2 + 0.676 IQ. Substitute 114 for IQ in this equation to get a predicted math score of about 44.9:

Math Score = - 32.2 + 0.676 * (114) = 44.864

With the unrounded coefficients, the prediction is 44.881.

It is certainly possible, given the range of math scores, for someone to achieve this score.

b. What do the values of R² represent (just use the latest output) and how does this compare to the R² value from the first analysis in this activity? (Explain it using the variables from this data.)

R² is the coefficient of determination and, in simple terms, gives the proportion of the variation in the response (Y) variable that is explained by the predictor (X) variable. For our example: with the observation deleted, 74.0% of the variability in ninth grade math score is explained by eighth grade IQ, compared to 45.0% when the outlier is included.

c. What is the correlation between Math Score and IQ for both data sets, including and excluding the outlier?

The correlation is equal to the square root of R² and takes the sign of the slope (so it can take any value in the range -1 ≤ r ≤ 1). The correlation is commonly represented as a decimal value.
Thus, the correlation between ninth grade math score and eighth grade IQ is equal to the square root of the coefficient of determination (R²).

Outlier deleted: r = √0.740 = 0.860, and it is positive since the slope of the regression equation is positive.
Outlier included: r = √0.450 = 0.671, again positive since the slope is positive.

d. Use Minitab to find the correlation between Math Score and IQ (you can pick whether to include the outlier or not) by going to Stat > Basic Statistics > Correlation and entering both variables into the Variables box. Does this correlation value agree with the value you found in part c?

Yes, the values are the same.

e. How does the fit of the regression line of the original data (i.e., with the outlier) compare (visually and statistically) to the fit of the regression line to the data with the outlier removed? To do this, use the current data with the outlier removed and go to Stat > Regression > Fitted Line Plot. Select IQ as the Predictor (x-variable) and math score as the Response (y-variable). Once the graph is created, you can double-click the title, which opens an "Edit title" box. Type in the box under Text: Outlier Deleted. Now add the IQ of 114 back into the data (i.e., replace the * with 114) and repeat these steps, labeling the graph Outlier Included. Now compare the fit of the regression line between the two sets of data. Pay particular attention to the differences in R², the slope, and how the line fits each set of data. You may want to repeat the residual plot and probability plot!

NOTE: R² changes from 45.0% (outlier included) to 74.0% (outlier deleted). The regression equation changes as well: the slope is more positive (0.676 versus 0.567). The scatter plot looks tighter around the regression line because the outlier is no longer there. The residual and probability plots agree better with the assumptions.

Special Note: Just because the removal of an outlier or outliers improves our results, this does not give the researcher carte blanche to simply remove data until the results are what one wants.
You should always report any manipulations of the data such as these in your research, possibly providing two reports and letting the reader decide: one with the outlier(s) and another without.

[Figure: Fitted line plot, Outlier Removed. Math Score = - 32.18 + 0.6760 IQ. S = 2.56190, R-Sq = 74.0%, R-Sq(adj) = 72.5%. x-axis: IQ (95 to 120); y-axis: Math Score (30 to 50).]

[Figure: Fitted line plot, Outlier Included. Math Score = - 21.04 + 0.5666 IQ. S = 3.98537, R-Sq = 45.0%, R-Sq(adj) = 42.0%. x-axis: IQ (95 to 120); y-axis: Math Score (30 to 50).]

[Figure: Scatterplot of RESI2 vs FITS2. x-axis: FITS2 (30 to 48); y-axis: RESI2 (-5.0 to 5.0).]

[Figure: Probability Plot of RESI2, Normal - 95% CI. x-axis: RESI2; y-axis: Percent. Legend: Mean -8.97528E-15, StDev 2.490, N 19, AD 0.144, P-Value 0.962.]

f. Facts about correlation. Answer the following questions about correlation (r).

a) What is the strongest the correlation can ever be? 1.0 (or -1.0 for a negative relationship)
b) If there is no linear relationship, r is equal to 0.
c) The correlation coefficient ranges from -1.0 to 1.0.
d) If the points fall in an almost perfect, negative linear pattern, r is close to: -1.0
e) If the points fall in an almost perfect, positive linear pattern, r is close to: 1.0
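The link between r and R² used in parts c and d, and the bounds listed in the facts above, can be checked numerically. This is a minimal sketch in plain Python and not part of the original activity: pearson_r is a hypothetical helper written for illustration, and the data are transcribed from the table at the start of this handout.

```python
import math

# Data transcribed from the handout's table (students 1-20).
iq = [95, 100, 100, 102, 103, 105, 106, 106, 106, 109,
      110, 110, 111, 112, 112, 114, 114, 115, 117, 118]
math_score = [33, 31, 35, 38, 41, 37, 37, 39, 43, 40,
              41, 44, 40, 45, 48, 45, 31, 47, 43, 48]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


# All 20 students.
r_all = pearson_r(iq, math_score)

# Drop student 17 (list index 16) to get the 19 remaining students.
iq_19 = iq[:16] + iq[17:]
math_19 = math_score[:16] + math_score[17:]
r_19 = pearson_r(iq_19, math_19)

print(f"outlier included: r = {r_all:.3f}, r squared = {r_all ** 2:.3f}")
print(f"outlier deleted:  r = {r_19:.3f}, r squared = {r_19 ** 2:.3f}")
```

Squaring each r should recover the R-Sq values in the two Minitab outputs up to rounding, which is why taking the square root of R² (with the sign of the slope) gives the same answer as Stat > Basic Statistics > Correlation.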