Dr. Betsy Becker
Dr. Christine Schram
CEP933 Key to
Assignment 3: Multiple Regression
Maribel A. Sevilla
Wei Pan
CEP 933 Assignment Three Key
Due March 1 and 2, 2000
Alternate models for explaining variation in teacher community. In this homework you will return to our
SPSS school-level data set and explore further explanations for level of teacher community (tchcomm). In
assignment two you first explored the role of socioeconomic status (f1ses) as a predictor of teacher
community (tchcomm). You also considered other possible predictors including teacher angst (tchangst),
teacher organizational influence (tchinfl), job satisfaction (tchhappy), and principal leadership (prinlead).
The variables in NELS:88 that were used are:
H&T                                                        NELS:88
socio-economic status                                      SES (f1ses)
teacher success/failure due to factors beyond my control   tchangst
teacher influence on school matters                        tchinfl
teacher contentment with their job                         tchhappy
principal leadership                                       prinlead
teacher community (outcome)                                tchcomm
In this analysis you are free to use whatever additional variables you wish to create a more complete
model, as long as those variables satisfy the assumptions of multiple regression. We suggest that you
begin by examining Hannaway and Talbert’s regression models in Table 2. The additional NELS
variables that are similar to those in Hannaway and Talbert’s analysis are:
School size           g10enrol (not really fully continuous)
Principal autonomy    prncinfl
Metro status          g10urban (categorical)
A measure of district size is not available. We have used metro status (g10urban) in Homework 1— it is a
categorical variable that you will need to recode as dummy variables if you wish to use it in a regression
model. Also note that the variable g10enrol is not measured on an interval scale. Other variables from
NELS may be added as well.
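The dummy-variable recoding mentioned above can be sketched as follows, in Python rather than SPSS. This is only an illustration: the category codes (1 = urban, 2 = suburban, 3 = rural) and the helper name make_dummies are assumptions, not values taken from the NELS:88 codebook.

```python
# Hypothetical category codes for g10urban (assumed, not from the NELS:88 codebook).
URBAN, SUBURBAN, RURAL = 1, 2, 3

def make_dummies(g10urban_codes):
    """Recode a 3-category variable into two 0/1 dummies; rural is the reference group."""
    urban = [1 if code == URBAN else 0 for code in g10urban_codes]
    suburban = [1 if code == SUBURBAN else 0 for code in g10urban_codes]
    return urban, suburban

codes = [1, 2, 3, 2, 1]          # five illustrative schools
urban, suburban = make_dummies(codes)
print(urban)     # [1, 0, 0, 0, 1]
print(suburban)  # [0, 1, 0, 1, 0]
```

A variable with k categories needs only k - 1 dummies; the omitted category (rural in this sketch) becomes the reference group against which the dummy slopes are interpreted.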
Use any appropriate variables from the school data set to find what you think is the best final model for
predicting teacher community. That is, you can use any of the substantive variables in this data set to try
to explain variation in levels of tchcomm across schools. You will need to run a number of different
models before you settle on a final model!!! Remember that this final model should be complete (i.e.,
well-specified, including all important variables), but also parsimonious (not including any useless
variables). There is no single “right answer” for this assignment.
Note: We realize that you do not know a lot about the definitions of the variables and how they were
measured. Use the H&T article and variable information in the class packet and the SPSS Utilities,
Variables menus to get names and values of the variables – these will be sufficient for you to do this
assignment.
1. (10 points) Explain why you have included each of the predictor variables you investigated for your
final model. You should give a substantive explanation of why each one might be important. Draw a
path diagram (arrow diagram) to illustrate your ideas. (Include all of the variables that you initially
believed could be important to the model, including those that were not significant statistically in your
final model).
Model I: We started our analysis with the variables g10enrol, tchinfl, prncinfl, g10urban, f1ses
from the NELS data. This set of variables is similar to the variables used in Hannaway and
Talbert’s regression models in Table 2. We recoded g10urban into two dummy variables
(Urban and Suburban). In addition, we included tchhappy, one of the variables from Assignment 2.
Model II: We added some school climate variables that may affect tchcomm, such as
prinlead (an indicator of principal leadership in the school) and conflict (an indicator of
disagreement between teachers and administrators in the school). We also included tchmoral
(a teacher morale indicator), teachatt (teacher attitudes toward students), and discipln (an
indicator of student discipline in school), because we believe that schools with less conflict
between teachers and students, where teachers have a positive attitude toward students
and high morale, may develop a strong sense of community as a whole.
Model III: We also included variables to account for differences due to location (we recoded
g10regon into the dummies south, west, and northeast) and type of school (the dummy public,
equal to 1 if the school is public).
[Path diagram relating the predictors g10enrol, prncinfl, tchinfl, Urban, Suburban, f1ses, prinlead,
tchhappy, conflict, tchmoral, teachatt, discipln, south, west, northeast, and public to the outcome tchcomm]
2. Discuss the structural assumptions you are making for the model and each of the predictors in it.
Verify, as best you can, that these assumptions are reasonable.
a) No relevant Xs are excluded.
b) No irrelevant Xs are included.
c) No measurement error. We assume this is OK for most variables. The lowest
reliability for any school level variable in our course pack is .77 (others are higher).
d) No multicollinearity.
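One common way to look at assumption (d) is the variance inflation factor (VIF). The sketch below is not part of the original SPSS analysis; it uses Python with NumPy and the pairwise correlations among conflict, tchmor, and teachatt printed in the Correlation Table of this key. Treating those pairwise correlations as one correlation matrix is an assumption, since their Ns differ slightly.

```python
import numpy as np

# Pairwise correlations among conflict, tchmor and teachatt, as printed in the
# Correlation Table of this key (assumed usable as a single correlation matrix
# even though the pairwise Ns differ slightly).
names = ["conflict", "tchmor", "teachatt"]
R = np.array([
    [1.000, -0.407, 0.464],
    [-0.407, 1.000, -0.331],
    [0.464, -0.331, 1.000],
])

# The VIF of predictor j is the j-th diagonal element of the inverse
# of the predictors' correlation matrix.
vif = np.diag(np.linalg.inv(R))
for name, value in zip(names, vif):
    print(f"{name}: VIF = {value:.2f}")
```

A VIF of 1 means a predictor shares no variance with the others in the set. A frequently quoted rule of thumb treats values above roughly 10 as a warning sign, although the threshold is a convention rather than a test.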
3. (10 points) Write a paragraph explaining how you used regression analyses to develop your
final model. Be sure to answer the following questions in your paragraph: How did you
select a first model to run? Did you keep all of the variables in that model? How did you
decide whether to add or remove variables from that initial model? How did you decide you
had arrived at a “final model”? (You do not need to list the variables in each model. Instead,
describe the process in general, as if you were telling someone else how to approach the
problem of model-building in multiple regression.) (OUTPUT In question 5)
1. Selecting the first model - MODEL I: We used the set of variables from Hannaway &
Talbert, plus tchhappy and prncinfl, as shown in Model I in the output.
We had already observed that these variables are approximately normally distributed and
that they have a linear relationship with tchcomm. We treat g10enrol as an interval
variable.
The results show that urban and suburban differences were not significant. All the
other X’s were significant and $R^2 = .372$. This model explains roughly one
third of the variation in tchcomm, but we hope to explain more.
2. Adding variables for location - MODEL II. We created the dummies urban and
suburban to see if there are differences in tchcomm due to the location of the
school. The results show that there are significant differences for urban but not for
suburban schools. All the other X’s were significant and $R^2 = .376$. There is
very little change in $R^2$. In fact, the adjusted $R^2$ values (.368 and .371) are
nearly identical, so we still hope to explain much more of the variability in tchcomm.
3. Selecting variables for the third model - MODEL III. We selected the theoretically
critical variables to be included in the model (prinlead, conflict, tchmor, teachatt,
discipln). Then we computed the correlations between these variables and tchcomm to see
whether any of the new variables has a strong bivariate relationship with tchcomm and
whether there is multicollinearity (see the Correlation Table below).
We observed that principal leadership (prinlead) and teacher contentment
(tchhappy) had the strongest correlation with teacher community (tchcomm) (See
the Correlation Table).
Also, we found that the variable conflict is fairly strongly correlated with teacher morale
(tchmor) and teacher attitude toward students (teachatt). In the first case, r = -.407,
indicating that schools where there is conflict between teachers and
administration have low teacher morale. In the second case, r = .464, indicating that
schools with high levels of conflict between teachers and administration also have
teachers with more negative attitudes toward students (see letter B in the Correlation Table).
We will watch these variables for signs of multicollinearity in our regression
analysis in Model IV.
It is important to notice that finding a small bivariate correlation for any X does
NOT mean the slope in a multiple regression will be small, because as more
variables are “controlled for” the partial relationship between that X and Y may
change.
Correlations

                            tchcomm    tchhappy   f1ses      g10enrol   Urban      Suburban   prinlead   conflict   tchmor     teachatt   discipline
tchcomm    r                1.000      .485**     .323**     -.242**    .014       -.001      .572**     -.203**    .320**     -.230**    -.012
           Sig. (2-tailed)  .          .000       .000       .000       .664       .969       .000       .000       .000       .000       .700
           N                964        962        964        963        964        964        964        957        958        962        957
tchhappy   r                .485**     1.000      .314**     -.184**    .020       -.008      .371**     -.124**    .210**     -.205**    -.038
           Sig. (2-tailed)  .000       .          .000       .000       .526       .806       .000       .000       .000       .000       .244
           N                962        963        963        962        963        963        963        956        957        961        956
f1ses      r                .323**     .314**     1.000      -.156**    .030       .077*      .114**     -.071*     .212**     -.228**    -.159**
           Sig. (2-tailed)  .000       .000       .          .000       .347       .017       .000       .027       .000       .000       .000
           N                964        963        966        965        966        966        966        959        960        964        959
g10enrol   r                -.242**    -.184**    -.156**    1.000      .253**     -.031      -.089**    .044       -.065*     .123**     .060
           Sig. (2-tailed)  .000       .000       .000       .          .000       .337       .006       .176       .044       .000       .064
           N                963        962        965        965        965        965        965        958        959        963        958
Urban      r                .014       .020       .030       .253**     1.000      -.750**    .098**     .041       .041       .016       -.024
           Sig. (2-tailed)  .664       .526       .347       .000       .          .000       .002       .205       .207       .620       .461
           N                964        963        966        965        966        966        966        959        960        964        959
Suburban   r                -.001      -.008      .077*      -.031      -.750**    1.000      -.050      -.059      .001       -.018      .009
           Sig. (2-tailed)  .969       .806       .017       .337       .000       .          .124       .067       .970       .571       .775
           N                964        963        966        965        966        966        966        959        960        964        959
prinlead   r                .572**     .371**     .114**     -.089**    .098**     -.050      1.000      -.265**    .279**     -.207**    .076*
           Sig. (2-tailed)  .000       .000       .000       .006       .002       .124       .          .000       .000       .000       .018
           N                964        963        966        965        966        966        966        959        960        964        959
conflict   r                -.203**    -.124**    -.071*     .044       .041       -.059      -.265**    1.000      -.407**    .464**     -.045
           Sig. (2-tailed)  .000       .000       .027       .176       .205       .067       .000       .          .000       .000       .166
           N                957        956        959        958        959        959        959        959        953        958        952
tchmor     r                .320**     .210**     .212**     -.065*     .041       .001       .279**     -.407**    1.000      -.331**    .127**
           Sig. (2-tailed)  .000       .000       .000       .044       .207       .970       .000       .000       .          .000       .000
           N                958        957        960        959        960        960        960        953        960        958        953
teachatt   r                -.230**    -.205**    -.228**    .123**     .016       -.018      -.207**    .464**     -.331**    1.000      .055
           Sig. (2-tailed)  .000       .000       .000       .000       .620       .571       .000       .000       .000       .          .089
           N                962        961        964        963        964        964        964        958        958        964        957
discipline r                -.012      -.038      -.159**    .060       -.024      .009       .076*      -.045      .127**     .055       1.000
           Sig. (2-tailed)  .700       .244       .000       .064       .461       .775       .018       .166       .000       .089       .
           N                957        956        959        958        959        959        959        952        953        957        959

**. Correlation is significant at the 0.01 level (2-tailed).
*. Correlation is significant at the 0.05 level (2-tailed).
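To illustrate how a bivariate relationship changes once another predictor is controlled for, the sketch below applies the textbook two-predictor formula for standardized slopes to three correlations from the table (tchcomm with prinlead, tchcomm with tchhappy, and prinlead with tchhappy). It is written in Python, the function name is our own, and it covers only a two-predictor model, not the final model of this key.

```python
def std_betas_two_predictors(r_y1, r_y2, r_12):
    """Standardized slopes for y regressed on two predictors, from correlations only."""
    denom = 1.0 - r_12 ** 2
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    return beta1, beta2

# Correlations taken from the Correlation Table:
# r(tchcomm, prinlead) = .572, r(tchcomm, tchhappy) = .485, r(prinlead, tchhappy) = .371
b_prinlead, b_tchhappy = std_betas_two_predictors(0.572, 0.485, 0.371)
print(round(b_prinlead, 3), round(b_tchhappy, 3))  # 0.455 0.316
```

Both standardized slopes are smaller than the corresponding bivariate correlations because prinlead and tchhappy overlap (r = .371); with uncorrelated predictors the slopes would equal the correlations.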
Having included all the X’s we considered important for explaining tchcomm, we began to
eliminate predictors that showed non-significant slopes in many different
models. The cutoff we used was, at first, more lenient: $\alpha = 0.1$. Later, when we
were more certain about the variables to include in our model, we used $\alpha = 0.05$.
4. Selecting variables for MODEL IV. We observed that adding the new variables in
Model III made urban non-significant, so we decided to
drop urban and suburban. We also observed that, due to multicollinearity, the
variables conflict, teachatt, and discipln are not significant. We decided to
drop these variables for the next model.
5. Selecting variables for MODEL V. As urban and suburban are not significant, it is
possible that other school characteristics may influence the results. We decided to
create a new dummy variable public to indicate different types of school, e.g.
public versus non-public schools. We observed that public is significant, but
prncinfl is not significant anymore.
6. Selecting variables for MODEL VI. We included variables that may not be highly
related to the outcome. We recoded g10regon into the dummies south, west,
and northeast to capture regional variability. We observed no significant
differences for west and northeast, and prncinfl is not significant either. So we
decided to drop west, northeast, and prncinfl for the last model.
7. The FINAL MODEL. It explains about 50% of the variability of tchcomm in our sample
($R^2 = .509$, adjusted $R^2 = .505$), and all the predictors (X’s) are significant. The
standard error of the estimate is 4.491, which is reasonable.
8. Interactions. Now we could investigate potential interactions. We would do this in
two ways: we could look at correlations by subgroups, or at plots of a particular X versus
tchcomm with cases marked differently for the subgroups where interactions
might occur. We would be particularly interested in urban versus suburban and
public versus private differences. If a plot or the subgroup correlations suggested a different
relationship for one subgroup, we could compute an interaction term and add it to
the model.
For this assignment, we did not include interactions.
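Computing an interaction term, as described in step 8, amounts to multiplying the two variables and adding the product as an extra predictor. The sketch below uses Python with NumPy on small, noise-free, made-up data (not NELS:88); the variable names x and d are placeholders for a continuous predictor and a 0/1 dummy such as public.

```python
import numpy as np

# Illustrative, noise-free data (not NELS:88): x is a continuous predictor,
# d is a 0/1 dummy (for example public vs. non-public).
x = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0])
d = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
y = 1.0 + 2.0 * x + 3.0 * d + 4.0 * x * d   # slope of x is 2 when d = 0 and 6 when d = 1

# Design matrix: intercept, x, d, and the product term x*d.
X = np.column_stack([np.ones_like(x), x, d, x * d])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coef, 3))   # intercept, slope of x, dummy shift, interaction
```

The coefficient on the product term is the difference between the two subgroups' slopes, which is what a subgroup plot with visibly different slopes would be hinting at.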
4. Write the population model for your final set of predictors. Be sure to identify all terms in your model.
Population Model I:
$$y_i = \beta_0 + \beta_1\,\mathrm{g10enrol}_i + \beta_2\,\mathrm{tchinfl}_i + \beta_3\,\mathrm{f1ses}_i + \beta_4\,\mathrm{tchhappy}_i + \beta_5\,\mathrm{prinlead}_i + \beta_6\,\mathrm{teachmor}_i + \beta_7\,\mathrm{public}_i + \beta_8\,\mathrm{south}_i + \varepsilon_i$$
$y_i$ is the teacher community score for the $i$th school
$\beta_0$ is the intercept, or the teacher community score when all other variables equal 0
$\beta_1$ is the slope for g10enrol, or the change in teacher community relative to one unit
change in g10enrol holding all other variables constant
$\beta_2$ is the slope for tchinfl, or the change in teacher community relative to one unit
change in tchinfl holding all other variables constant
$\beta_3$ is the slope for f1ses, or the change in teacher community relative to one unit
change in f1ses holding all other variables constant
$\beta_4$ is the slope for tchhappy, or the change in teacher community relative to one unit
change in tchhappy holding all other variables constant
$\beta_5$ is the slope for prinlead, or the change in teacher community relative to one unit
change in prinlead holding all other variables constant
$\beta_6$ is the slope for teachmor, or the change in teacher community relative to one unit
change in teachmor holding all other variables constant
$\beta_7$ is the slope for public, or the difference in teacher community between public and
non-public schools holding all other variables constant
$\beta_8$ is the slope for south, or the difference in teacher community between schools located
in the south and schools in other regions holding all other variables constant.
$\mathrm{g10enrol}_i$ - Tenth-grade enrollment in the $i$th school
$\mathrm{tchinfl}_i$ - Teacher influence score in the $i$th school
$\mathrm{f1ses}_i$ - Socio-economic index in the $i$th school
$\mathrm{tchhappy}_i$ - Teacher contentment in the $i$th school
$\mathrm{prinlead}_i$ - Principal leadership score in the $i$th school
$\mathrm{teachmor}_i$ - Teacher morale score in the $i$th school
$\mathrm{public}_i$ - Dummy variable for public schools = 1, and non-public = 0
$\mathrm{south}_i$ - Dummy variable for school region, south = 1, and other regions = 0
$\varepsilon_i$ - Error term for the $i$th school. Unexplained variability in tchcomm for the $i$th school
i = 1 to 868 - Index of schools
5. Use SPSS to estimate the parameters in the model in part 4. Show your output.
Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .714(a)   .509       .505                4.4910

a. Predictors: (Constant), tchhappy, g10enrol, f1ses, tchinfl, teachmor, prinlead, public, south
ANOVA(b)

Model 1      Sum of Squares   df    Mean Square   F         Sig.
Regression   19789.977        8     2473.747      122.649   .000
Residual     19080.112        946   20.169
Total        38870.088        954

b. Dependent Variable: Teacher Community (High values=lots o' community)
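The Model Summary figures can be cross-checked from the ANOVA table alone. The sketch below is a Python illustration that is not part of the SPSS output; it recomputes $R^2$, adjusted $R^2$, F, and the standard error of the estimate from the printed sums of squares and degrees of freedom.

```python
import math

# Sums of squares and degrees of freedom as printed in the ANOVA table.
ss_reg, df_reg = 19789.977, 8
ss_res, df_res = 19080.112, 946
ss_tot, df_tot = 38870.088, 954

r2 = ss_reg / ss_tot                                  # proportion of variance explained
adj_r2 = 1 - (ss_res / df_res) / (ss_tot / df_tot)    # penalizes the number of predictors
f_stat = (ss_reg / df_reg) / (ss_res / df_res)        # mean square ratio
se_est = math.sqrt(ss_res / df_res)                   # standard error of the estimate

print(round(r2, 3), round(adj_r2, 3), round(f_stat, 2), round(se_est, 3))
# 0.509 0.505 122.65 4.491
```

The recomputed values agree with the SPSS Model Summary (.509, .505, 4.4910) and with the printed F of 122.649 up to rounding.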
Coefficients(a)

               Unstandardized Coefficients   Standardized Coefficients
Model 1        B          Std. Error         Beta                        t          Sig.
(Constant)     -8.046     1.298                                          -6.199     .000
g10enrol       -.249      .089               -.072                       -2.795     .005
tchinfl        .553       .078               .205                        7.099      .000
f1ses          1.009      .302               .091                        3.336      .001
tchhappy       .298       .042               .193                        7.030      .000
prinlead       .286       .023               .341                        12.415     .000
teachmor       .819       .195               .102                        4.199      .000
public         -1.384     .484               -.084                       -2.861     .004
south          1.470      .328               .110                        4.487      .000

a. Dependent Variable: Teacher Community (High values=lots o' community)
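To show how the estimated equation is used, the sketch below plugs the unstandardized B coefficients from the table into a small Python function. The function name predict_tchcomm and the predictor values for the example school are hypothetical, chosen only to illustrate the arithmetic; they are not cases from the data set.

```python
# Unstandardized coefficients (B) copied from the Coefficients table.
COEF = {
    "g10enrol": -0.249, "tchinfl": 0.553, "f1ses": 1.009, "tchhappy": 0.298,
    "prinlead": 0.286, "teachmor": 0.819, "public": -1.384, "south": 1.470,
}
INTERCEPT = -8.046

def predict_tchcomm(school):
    """Predicted teacher community for one school, given all eight predictor values."""
    return INTERCEPT + sum(COEF[name] * school[name] for name in COEF)

# Hypothetical predictor values (illustration only, not a real NELS:88 school).
school = {"g10enrol": 3, "tchinfl": 10, "f1ses": 0.0, "tchhappy": 12,
          "prinlead": 20, "teachmor": 3, "public": 1, "south": 0}
same_but_private = dict(school, public=0)

gap = predict_tchcomm(school) - predict_tchcomm(same_but_private)
print(round(gap, 3))   # -1.384, the public slope
```

Changing only the public dummy from 0 to 1 shifts the prediction by exactly the public coefficient, which is the "holding all other variables constant" interpretation given in part 4.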
6. Verify that the statistical assumptions of the random part of the model have been met.
Error assumptions of the model:
a) Linear relationships between Xs and Y. See correlation table. OK.
b) Error assumptions:
a. Errors follow normal distribution.
[Figure: Histogram of the regression standardized residuals (horizontal axis: Regression Standardized Residual; vertical axis: Frequency). Dependent Variable: Teacher Community (High values=lots o' community). Std. Dev = 1.00, Mean = 0.00, N = 955.00]
b. Errors are independent of all predictors.
(To produce this graph, we saved the residuals, so we could plot residuals
versus each X. There is one plot for each predictor (X’s), but we are
showing just one of the plots).
[Figure: Scatterplot of SOCIO-ECONOMIC STATUS COMPOSITE (vertical axis, -2.0 to 2.0) against Standardized Residual (horizontal axis, -6 to 6)]
The correlation between errors and predictor equals zero.
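One caveat is worth keeping in mind when reading this check: least-squares residuals are uncorrelated with every predictor in the fitted sample by construction, so a zero correlation cannot by itself confirm the assumption, and the shape of the plot carries the useful information. The sketch below illustrates this in Python with NumPy on simulated data (not NELS:88), deliberately fitting a straight line to a curved relationship.

```python
import numpy as np

rng = np.random.default_rng(0)                  # simulated data, not NELS:88
x = rng.normal(size=500)
y = x ** 2 + rng.normal(scale=0.1, size=500)    # the true relationship is curved

X = np.column_stack([np.ones_like(x), x])       # straight-line model: intercept + x
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

corr_linear = np.corrcoef(resid, x)[0, 1]       # essentially zero by construction
corr_curved = np.corrcoef(resid, x ** 2)[0, 1]  # large: the misfit shows up here

print(f"corr(resid, x)   = {corr_linear:.6f}")
print(f"corr(resid, x^2) = {corr_curved:.3f}")
```

Even though the straight-line model is clearly wrong here, the residuals still correlate with x at essentially zero; only the curved pattern in a residual plot (or the correlation with x squared) reveals the problem.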
c. Errors have equal variance (no heteroskedasticity)
[Figure: Scatterplot of Regression Standardized Residual (vertical axis, -6 to 6) against Regression Standardized Predicted Value (horizontal axis, -4 to 3). Dependent Variable: Teacher Community (High values=lots o' community)]
The spread looks pretty similar for all the predicted values.
7. Has the slope estimate for f1ses changed from the bivariate model you estimated in Assignment 2 to
the model you estimated in part 5? Explain any differences. (This includes explaining why f1ses is not
in the model if you have omitted it!!).
Yes. In the bivariate regression, f1ses explains 10% of the variability in tchcomm ($R^2 = 0.104$), while in
the multiple regression f1ses uniquely explains about 0.6% of the variability in tchcomm. This was
computed by taking the difference between the $R^2$ of the multivariate model with f1ses
and the $R^2$ for the model without f1ses.
8. What is the most important predictor of tchcomm, according to your model? What proportion of the
variation in tchcomm can be attributed to that variable alone? How much variation in tchcomm is
explained by your full final model?
The most important predictor is prinlead, with a standardized beta of 0.341 in our final
model. This variable alone explains 33% of the variability in tchcomm, as shown by the
bivariate regression of tchcomm on prinlead: $R^2 = .327$, adjusted $R^2 = .326$.
9. (10 points) Maribel believes that teacher community is related to levels of principal leadership
because principals who are good leaders encourage their teachers to work together. Wei says that
principals described as good leaders and high teacher community are both found in schools where the
teachers are happy with their jobs, and that teacher community in such schools is high because the
teachers are content. For this reason Wei says that tchhappy is the more important predictor of
teacher community. Given your final model (and the other analyses that led to it), comment on the
disagreement between Maribel and Wei.
In our model, prinlead is more important (standardized beta for prinlead = 0.341,
standardized beta for tchhappy = 0.193). If we run the model excluding prinlead,
$R^2$ decreases to .429 and adjusted $R^2$ equals .425. If we run the model excluding
only tchhappy, $R^2$ decreases to .481 and adjusted $R^2$ equals .477. So the
absence of prinlead reduces the model's capacity to explain
tchcomm variability more than the absence of tchhappy does. We tend to agree with Maribel on
this controversy.
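The drops in $R^2$ quoted above can be approximated without rerunning the regressions, because removing a single predictor lowers $R^2$ by $t^2(1 - R^2)/df_{residual}$, where t is that predictor's t statistic in the full model. The Python sketch below applies this identity to the printed output; the function name is our own, and small discrepancies from the rerun models are expected from rounding and from any change in the number of complete cases.

```python
def delta_r2_from_t(t, r2_full, df_resid):
    """Drop in R^2 when one predictor is removed, from its t statistic in the full model."""
    return t ** 2 * (1.0 - r2_full) / df_resid

R2_FULL, DF_RESID = 0.509, 946     # from the Model Summary and ANOVA tables

drop_prinlead = delta_r2_from_t(12.415, R2_FULL, DF_RESID)
drop_tchhappy = delta_r2_from_t(7.030, R2_FULL, DF_RESID)

print(f"prinlead: {drop_prinlead:.3f} -> R^2 without it ~ {R2_FULL - drop_prinlead:.3f}")
print(f"tchhappy: {drop_tchhappy:.3f} -> R^2 without it ~ {R2_FULL - drop_tchhappy:.3f}")
```

The approximation gives roughly .43 without prinlead and .48 without tchhappy, in line with the reported values, and shows that the unique contribution of prinlead is about three times that of tchhappy.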