CEP933 Key to Assignment 3: Multiple Regression
Dr. Betsy Becker, Dr. Christine Schram, Maribel A. Sevilla, Wei Pan

CEP 933 Assignment Three Key
Due March 1 and 2, 2000

Alternate models for explaining variation in teacher community.

In this homework you will return to our SPSS school-level data set and explore further explanations for level of teacher community (tchcomm). In assignment two you first explored the role of socioeconomic status (f1ses) as a predictor of teacher community (tchcomm). You also considered other possible predictors including teacher angst (tchangst), teacher organizational influence (tchinfl), job satisfaction (tchhappy), and principal leadership (prinlead). The variables in NELS:88 that were used are:

H&T                                                         NELS:88
socio-economic status                                       SES (f1ses)
teacher success/failure due to factors beyond my control    tchangst
teacher influence on school matters                         tchinfl
teacher contentment with their job                          tchhappy
principal leadership                                        prinlead
teacher community (outcome)                                 tchcomm

In this analysis you are free to use whatever additional variables you wish to create a more complete model, as long as those variables satisfy the assumptions of multiple regression. We suggest that you begin by examining Hannaway and Talbert’s regression models in Table 2. The additional NELS variables that are similar to those in Hannaway and Talbert’s analysis are:

School size           g10enrol (not really fully continuous)
Principal autonomy    prncinfl
Metro status          g10urban (categorical)

A measure of district size is not available. We have used metro status (g10urban) in Homework 1; it is a categorical variable that you will need to recode as dummy variables if you wish to use it in a regression model. Also note that the variable g10enrol is not measured on an interval scale. Other variables from NELS may be added as well. Use any appropriate variables from the school data set to find what you think is the best final model for predicting teacher community.
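Recoding a categorical variable such as g10urban into dummy variables, as the note above requires, can be sketched as follows. This is a minimal Python illustration rather than the SPSS recode used in class. The category labels (urban, suburban, rural), the choice of rural as the reference category, and the function name dummy_code are all assumptions made for the example; the actual g10urban codes should be taken from the SPSS variable information.

```python
# Hypothetical category labels; the real g10urban codes come from the data set.
LEVELS = ("urban", "suburban", "rural")
REFERENCE = "rural"  # assumed reference category: all dummies equal 0


def dummy_code(value, levels=LEVELS, reference=REFERENCE):
    """Return 0/1 indicators, one for each non-reference level."""
    if value not in levels:
        raise ValueError(f"unknown category: {value!r}")
    return {level: int(value == level) for level in levels if level != reference}


# Four made-up schools, just to show the coding.
schools = ["urban", "rural", "suburban", "urban"]
coded = [dummy_code(s) for s in schools]
for school, row in zip(schools, coded):
    print(school, row)
```

A variable with three categories needs only two dummies; the omitted category becomes the comparison group for both slopes.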
You can use any of the substantive variables in this data set to try to explain variation in levels of tchcomm across schools. You will need to run a number of different models before you settle on a final model! Remember that this final model should be complete (i.e., well-specified, including all important variables), but also parsimonious (not including any useless variables). There is no single “right answer” for this assignment.

Note: We realize that you do not know a lot about the definitions of the variables and how they were measured. Use the H&T article, the variable information in the class packet, and the SPSS Utilities, Variables menus to get names and values of the variables; these will be sufficient for you to do this assignment.

1. (10 points) Explain why you have included each of the predictor variables you investigated for your final model. You should give a substantive explanation of why each one might be important. Draw a path diagram (arrow diagram) to illustrate your ideas. (Include all of the variables that you initially believed could be important to the model, including those that were not significant statistically in your final model.)

Model I: We started our analysis with the variables g10enrol, tchinfl, prncinfl, g10urban, and f1ses from the NELS data. This set of variables is similar to the variables used in Hannaway and Talbert’s regression models in Table 2. We recoded g10urban into two dummy variables (Urban and Suburban). In addition, we included tchhappy, one of the variables from assignment 2.

Model II: We added some school climate variables that may affect tchcomm, such as prinlead (an indicator of principal leadership in the school) and conflict (an indicator of disagreement between teachers and administrators in the school).
We also included tchmoral (an indicator of teacher morale), teachatt (teacher attitudes toward students) and discipln (an indicator of student discipline in school), because we believe that in schools with less conflict between teachers and students, where teachers have a positive attitude toward students and where teacher morale is high, the school as a whole may develop a strong sense of community.

Model III: We also included variables to account for differences due to location (we recoded g10regon into the dummies south, west, and northeast) and type of school (the dummy public, equal to 1 if the school is public).

[Path diagram relating the predictors g10enrol, prncinfl, tchinfl, Urban, Suburban, f1ses, prinlead, tchhappy, conflict, tchmoral, teachatt, discipln, south, west, northeast, and public to the outcome tchcomm.]

2. Discuss the structural assumptions you are making for the model and each of the predictors in it. Verify, as best you can, that these assumptions are reasonable.

a) No relevant Xs are excluded.
b) No irrelevant Xs are included.
c) No measurement error. We assume this is OK for most variables. The lowest reliability for any school level variable in our course pack is .77 (others are higher).
d) No multicollinearity.

3. (10 points) Write a paragraph explaining how you used regression analyses to develop your final model. Be sure to answer the following questions in your paragraph: How did you select a first model to run? Did you keep all of the variables in that model? How did you decide whether to add or remove variables from that initial model? How did you decide you had arrived at a “final model”? (You do not need to list the variables in each model. Instead, describe the process in general, as if you were telling someone else how to approach the problem of model-building in multiple regression.)

(The output is shown in question 5.)

1. Selecting the first model - MODEL I. We used the set of variables from Hannaway & Talbert, plus tchhappy and prncinfl, as shown in Model I in the output. We have already observed that these variables are normally distributed and that they have a linear relationship with tchcomm. We assume g10enrol to be an interval variable. The results show that urban and suburban differences were not significant. All the other X’s were significant and R² = .372. This model explains roughly one third of the variation in tchcomm, but we hope to explain more.

2. Adding variables for location - MODEL II. We created the dummies urban and suburban to see if there are differences in tchcomm due to the location of the school. The results show that there are significant differences for urban but not for suburban schools. All the other X’s were significant and R² = .376. There is very little change in R². In fact, the adjusted R² values (.368 and .371) are nearly identical, so we still hope to be able to explain much more of the variability in tchcomm.

3. Selecting variables for the third model - MODEL III. We selected the theoretically critical variables to be included in the model (prinlead, conflict, tchmor, teachatt, discipln). Then we ran correlations between these variables and tchcomm to see whether any of the new variables has a strong bivariate relationship with tchcomm and whether there is multicollinearity (see the Correlation Table). We observed that principal leadership (prinlead) and teacher contentment (tchhappy) had the strongest correlations with teacher community (tchcomm). Also, we found that the variable conflict has a strong correlation with teacher morale (tchmor) and teacher attitude toward students (teachatt).
In the first case, r = -.407, indicating that schools where there are conflicts between teachers and administration have low teacher morale; in the second, schools with high levels of conflict between teachers and administration also have teachers with more negative attitudes toward students, r = .464 (see letter B in the Correlation Table). We will observe these variables for signs of multicollinearity in our regression analysis in Model IV. It is important to notice that finding a small bivariate correlation for any X does NOT mean the slope in a multiple regression will be small, because as more variables are “controlled for” the partial relationship between that X and Y may change.

Correlations (Sig. is 2-tailed)

                     Tchcomm   tchhappy      f1ses   g10enrol      Urban   Suburban   prinlead   conflict     tchmor   teachatt discipline
Tchcomm    r           1.000     .485**     .323**    -.242**       .014      -.001     .572**    -.203**     .320**    -.230**      -.012
           Sig.            .       .000       .000       .000       .664       .969       .000       .000       .000       .000       .700
           N             964        962        964        963        964        964        964        957        958        962        957
tchhappy   r          .485**      1.000     .314**    -.184**       .020      -.008     .371**    -.124**     .210**    -.205**      -.038
           Sig.         .000          .       .000       .000       .526       .806       .000       .000       .000       .000       .244
           N             962        963        963        962        963        963        963        956        957        961        956
f1ses      r          .323**     .314**      1.000    -.156**       .030      .077*     .114**     -.071*     .212**    -.228**    -.159**
           Sig.         .000       .000          .       .000       .347       .017       .000       .027       .000       .000       .000
           N             964        963        966        965        966        966        966        959        960        964        959
g10enrol   r         -.242**    -.184**    -.156**      1.000     .253**      -.031    -.089**       .044     -.065*     .123**       .060
           Sig.         .000       .000       .000          .       .000       .337       .006       .176       .044       .000       .064
           N             963        962        965        965        965        965        965        958        959        963        958
Urban      r            .014       .020       .030     .253**      1.000    -.750**     .098**       .041       .041       .016      -.024
           Sig.         .664       .526       .347       .000          .       .000       .002       .205       .207       .620       .461
           N             964        963        966        965        966        966        966        959        960        964        959
Suburban   r           -.001      -.008      .077*      -.031    -.750**      1.000      -.050      -.059       .001      -.018       .009
           Sig.         .969       .806       .017       .337       .000          .       .124       .067       .970       .571       .775
           N             964        963        966        965        966        966        966        959        960        964        959
prinlead   r          .572**     .371**     .114**    -.089**     .098**      -.050      1.000    -.265**     .279**    -.207**      .076*
           Sig.         .000       .000       .000       .006       .002       .124          .       .000       .000       .000       .018
           N             964        963        966        965        966        966        966        959        960        964        959
conflict   r         -.203**    -.124**     -.071*       .044       .041      -.059    -.265**      1.000    -.407**     .464**      -.045
           Sig.         .000       .000       .027       .176       .205       .067       .000          .       .000       .000       .166
           N             957        956        959        958        959        959        959        959        953        958        952
tchmor     r          .320**     .210**     .212**     -.065*       .041       .001     .279**    -.407**      1.000    -.331**     .127**
           Sig.         .000       .000       .000       .044       .207       .970       .000       .000          .       .000       .000
           N             958        957        960        959        960        960        960        953        960        958        953
teachatt   r         -.230**    -.205**    -.228**     .123**       .016      -.018    -.207**     .464**    -.331**      1.000       .055
           Sig.         .000       .000       .000       .000       .620       .571       .000       .000       .000          .       .089
           N             962        961        964        963        964        964        964        958        958        964        957
discipline r           -.012      -.038    -.159**       .060      -.024       .009      .076*      -.045     .127**       .055      1.000
           Sig.         .700       .244       .000       .064       .461       .775       .018       .166       .000       .089          .
           N             957        956        959        958        959        959        959        952        953        957        959

**. Correlation is significant at the 0.01 level (2-tailed).
*. Correlation is significant at the 0.05 level (2-tailed).

Once we had included all important X’s to explain tchcomm, we began to eliminate predictors if they showed non-significant slopes in many different models. The cutoff we used was, at first, a more lenient 0.1. Later, when we were more certain about the variables to include in our model, we used 0.05.

4. Selecting variables for MODEL IV. We observed that adding the new variables in Model III changed the significance level for urban to non-significant. We decided to drop urban and suburban. Also, we observed that, due to multicollinearity, the variables conflict, teachatt, and discipln are not significant either.
We decided to drop these variables for the next model.

5. Selecting variables for MODEL V. As urban and suburban are not significant, it is possible that other school characteristics may influence the results. We decided to create a new dummy variable, public, to indicate the type of school (public versus non-public). We observed that public is significant, but prncinfl is no longer significant.

6. Selecting variables for MODEL VI. We included variables that may not be highly related to the outcome. We recoded g10regon into the dummies south, west and northeast to capture regional variability. We observed that there are no significant differences for west and northeast, and that prncinfl is not significant either. So we decided to drop west, northeast and prncinfl for the last model.

7. The FINAL MODEL. It explains 50% of the variability of tchcomm in our sample, and all the predictors (X’s) are significant. The standard error of the estimate is 4.491, which is reasonable.

8. Interactions. Now we could investigate potential interactions. We would do this in two ways: we could look at correlations by subgroups, or at plots of a particular X versus tchcomm with cases marked differently for the subgroups where interactions might occur. We would be particularly interested in urban versus suburban and public versus private differences. If a plot or the correlations suggested a different relationship for one subgroup, we could compute an interaction term and add it to the model. For this assignment, we did not include interactions.

4. Write the population model for your final set of predictors. Be sure to identify all terms in your model.
Population Model I:

$$
y_i = \beta_0 + \beta_1\,\mathrm{g10enrol}_i + \beta_2\,\mathrm{tchinfl}_i + \beta_3\,\mathrm{f1ses}_i + \beta_4\,\mathrm{tchhappy}_i + \beta_5\,\mathrm{prinlead}_i + \beta_6\,\mathrm{teachmor}_i + \beta_7\,\mathrm{public}_i + \beta_8\,\mathrm{south}_i + \varepsilon_i
$$

$y_i$ is the teacher community score for the i-th school.
$\beta_0$ is the intercept, or the teacher community score when all predictors equal 0.
$\beta_1$ is the slope for g10enrol, or the change in teacher community per one-unit change in g10enrol, holding all other variables constant.
$\beta_2$ is the slope for tchinfl, or the change in teacher community per one-unit change in tchinfl, holding all other variables constant.
$\beta_3$ is the slope for f1ses, or the change in teacher community per one-unit change in f1ses, holding all other variables constant.
$\beta_4$ is the slope for tchhappy, or the change in teacher community per one-unit change in tchhappy, holding all other variables constant.
$\beta_5$ is the slope for prinlead, or the change in teacher community per one-unit change in prinlead, holding all other variables constant.
$\beta_6$ is the slope for teachmor, or the change in teacher community per one-unit change in teachmor, holding all other variables constant.
$\beta_7$ is the slope for public, or the difference in teacher community between public and non-public schools, holding all other variables constant.
$\beta_8$ is the slope for south, or the difference in teacher community between schools located in the south and schools in other regions, holding all other variables constant.

g10enrol_i - Tenth-grade enrollment in the i-th school
tchinfl_i - Teacher influence score in the i-th school
f1ses_i - Socio-economic index in the i-th school
tchhappy_i - Teacher contentment in the i-th school
prinlead_i - Principal leadership score in the i-th school
teachmor_i - Teacher morale score in the i-th school
public_i - Dummy variable for public schools = 1, and non-public = 0
south_i - Dummy variable for school region, south = 1, and other regions = 0
$\varepsilon_i$ - Error term for the i-th school (unexplained variability in tchcomm for the i-th school)
i = 1 to 868 - Index of schools

5. Use SPSS to estimate the parameters in the model in part 3. Show your output.

Model Summary

Model    R        R Square    Adjusted R Square    Std. Error of the Estimate
1        .714a    .509        .505                 4.4910

a. Predictors: (Constant), tchhappy, g10enrol, f1ses, tchinfl, teachmor, prinlead, public, south

ANOVA(b)

Model 1       Sum of Squares     df    Mean Square          F    Sig.
Regression         19789.977      8       2473.747    122.649    .000
Residual           19080.112    946         20.169
Total              38870.088    954

b. Dependent Variable: Teacher Community (High values=lots o' community)

Coefficients(a)

              Unstandardized Coefficients    Standardized Coefficients
Model 1            B      Std. Error              Beta                        t    Sig.
(Constant)    -8.046           1.298                                     -6.199    .000
g10enrol       -.249            .089             -.072                   -2.795    .005
tchinfl         .553            .078              .205                    7.099    .000
f1ses          1.009            .302              .091                    3.336    .001
tchhappy        .298            .042              .193                    7.030    .000
prinlead        .286            .023              .341                   12.415    .000
teachmor        .819            .195              .102                    4.199    .000
public        -1.384            .484             -.084                   -2.861    .004
south          1.470            .328              .110                    4.487    .000

a. Dependent Variable: Teacher Community (High values=lots o' community)

6. Verify that the statistical assumptions of the random part of the model have been met.

Error assumptions of the model:

a) Linear relationships between Xs and Y. See correlation table. OK.

b) Error assumptions:

a. Errors follow a normal distribution.

[Histogram of the regression standardized residuals. Dependent variable: Teacher Community (High values=lots o' community). Vertical axis: frequency. Std. Dev = 1.00, Mean = 0.00, N = 955.]

b. Errors are independent of all predictors. (To produce this graph, we saved the residuals so that we could plot them against each X. There is one plot for each predictor, but we show just one of them.)

[Scatterplot: SOCIO-ECONOMIC STATUS COMPOSITE (vertical axis) versus standardized residual (horizontal axis).]

The correlation between the errors and the predictor equals zero.

c. Errors have equal variance (no heteroskedasticity).

[Scatterplot: regression standardized residual (vertical axis) versus regression standardized predicted value (horizontal axis). Dependent variable: Teacher Community (High values=lots o' community).]

The spread looks pretty similar for all the predicted values.

7. Has the slope estimate for f1ses changed from the bivariate model you estimated in Assignment 2 to the model you estimated in part 5? Explain any differences. (This includes explaining why f1ses is not in the model if you have omitted it!)

Yes. In the bivariate regression, f1ses explains 10% of the variability in tchcomm (R² = 0.104), while in the multivariate regression f1ses uniquely explains about 0.6% of the variability in tchcomm. This was computed by taking the difference between the R² of the multivariate model with f1ses and the R² for the model without f1ses.

8. What is the most important predictor of tchcomm, according to your model? What proportion of the variation in tchcomm can be attributed to that variable alone? How much variation in tchcomm is explained by your full final model?

The most important predictor is prinlead, with a standardized beta of 0.341 in our final model.
This variable alone explains 33% of the variability in tchcomm, as shown by the bivariate regression of tchcomm on prinlead: R² = .327, adjusted R² = .326.

9. (10 points) Maribel believes that teacher community is related to levels of principal leadership because principals who are good leaders encourage their teachers to work together. Wei says that principals described as good leaders and high teacher community are both found in schools where the teachers are happy with their jobs, and that teacher community in such schools is high because the teachers are content. For this reason Wei says that tchhappy is the more important predictor of teacher community. Given your final model (and the other analyses that led to it), comment on the disagreement between Maribel and Wei.

In our model, prinlead is more important (standardized beta for prinlead = 0.341, standardized beta for tchhappy = 0.193). If we run the model excluding prinlead, R² decreases to .429 and adjusted R² equals .425. If we run the model excluding only tchhappy, R² decreases to .481 and adjusted R² equals .477. So the absence of prinlead reduces the model's capacity to explain variability in tchcomm more than the absence of tchhappy does. We tend to agree with Maribel on this controversy.
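The comparison above turns on how much R² falls when a single predictor is removed. As a rough cross-check, that drop can be approximated from the final-model output alone: for one predictor it equals t²(1 − R²)/df_residual, provided the reduced model is fit to the same cases. The Python sketch below is only an illustration under that assumption; the key itself refit the models in SPSS, where listwise deletion can change the sample slightly. The function names are hypothetical, and the inputs are copied from the ANOVA and Coefficients tables in question 5.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a model with k predictors fit to n cases."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def r2_drop_from_t(t, r2_full, df_residual):
    """R-squared lost when one predictor is dropped, computed from its
    t statistic in the full model (its squared semipartial correlation)."""
    return t * t * (1.0 - r2_full) / df_residual


# Values copied from the SPSS output for the final model.
SS_REGRESSION = 19789.977
SS_TOTAL = 38870.088
N_CASES = 955          # total df of 954, plus one
N_PREDICTORS = 8
DF_RESIDUAL = N_CASES - N_PREDICTORS - 1

T_VALUES = {"prinlead": 12.415, "tchhappy": 7.030, "f1ses": 3.336}

r2_full = SS_REGRESSION / SS_TOTAL
drops = {name: r2_drop_from_t(t, r2_full, DF_RESIDUAL)
         for name, t in T_VALUES.items()}

print(f"full model: R2 = {r2_full:.3f}, "
      f"adjusted R2 = {adjusted_r2(r2_full, N_CASES, N_PREDICTORS):.3f}")
for name, drop in drops.items():
    reduced = r2_full - drop
    adj = adjusted_r2(reduced, N_CASES, N_PREDICTORS - 1)
    print(f"without {name}: R2 falls by {drop:.3f} to about {reduced:.3f} "
          f"(adjusted about {adj:.3f})")
```

Because the drop depends only on the t statistic once the full model is fixed, the predictor with the largest absolute t is also the one whose removal costs the most R², which is the ordering the answer to question 9 relies on.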