Preliminary and Incomplete: Comments Welcome
THE IMPACT OF HIGH-STAKES TESTING ON STUDENT ACHIEVEMENT:
EVIDENCE FROM CHICAGO∗
Brian A. Jacob
John F. Kennedy School of Government
Harvard University
June 2001
∗ I would like to thank the Chicago Public Schools, the Illinois State Board of Education and the Consortium on
Chicago School Research for providing the data used in this study. I am grateful to Anthony Bryk, Carolyn Hill,
Robert LaLonde, Lars Lefgren, Steven Levitt, Helen Levy, Susan Mayer, Melissa Roderick, Robin Tepper and
seminar participants at the University of Chicago for helpful comments and suggestions. Funding for this research
was provided by the Spencer Foundation. All remaining errors are my own.
THE IMPACT OF HIGH-STAKES TESTING ON STUDENT ACHIEVEMENT:
EVIDENCE FROM CHICAGO
Abstract
School reforms designed to hold students and teachers accountable for student achievement have
become increasingly popular in recent years. Yet there is little empirical evidence on how such
policies impact student or teacher behavior, or how they ultimately affect student achievement.
This study utilizes detailed administrative data on student achievement in Chicago to examine
the impact of a high-stakes testing policy introduced in 1996. I find that math and reading scores
on the high-stakes exam increased sharply following the introduction of the accountability
policy, but that there was little if any effect on a state-administered, low-stakes exam. At the
same time, science and social studies scores on both exams leveled off or declined under the new
regime. There is also evidence that the policy increased retention rates in primary grades that
were not subject to the promotional criteria.
1. Introduction
School reforms designed to hold students and teachers accountable for student
achievement have become increasingly popular in recent years. Statutes in 19 states explicitly
link student promotion to performance on a state or district assessment (ECS 2000). The largest
school districts in the country, including New York City, Los Angeles, Chicago and Washington,
D.C., have recently implemented policies requiring students to attend summer school and/or
repeat a grade if they do not demonstrate sufficient mastery of basic skills. At the same time, 20
states reward teachers and administrators on the basis of exemplary student performance and 32
states sanction school staff on the basis of poor student performance. Many states and districts
have passed legislation allowing the takeover or closure of schools that do not show
improvement (ECS 2000).
Despite the increasing popularity of high-stakes testing, there is little evidence on how
such policies influence student or teacher behavior, or how they ultimately affect student
achievement. Standard incentive theory suggests that high-stakes testing will increase student
achievement in the measured subjects and exams, but also may lead teachers to shift resources
away from low-stakes subjects and to neglect infra-marginal students. Such high-powered
incentives may also cause students and teachers to ignore less easily observable dimensions of
student learning such as critical thinking in order to teach the narrowly defined basic skills that
are tested on the high-stakes exam (Holmstrom 1991).
Chicago was one of the first large, urban school districts to implement a comprehensive
high-stakes accountability policy. Beginning in 1996, Chicago Public Schools in which fewer
than 15 percent of students met national norms in reading were placed on probation, which
entailed additional resources as well as oversight. If student performance did not improve in
these schools, teachers and administrators were subject to reassignment or dismissal. At the
same time, the CPS took steps to end “social promotion,” the practice of passing students to the
next grade regardless of their academic ability. Students in third, sixth and eighth grades (the
“gate” grades) were required to meet minimum standards in reading and mathematics on the
Iowa Test of Basic Skills (ITBS) in order to advance to the next grade.
This paper utilizes detailed administrative data on student achievement in Chicago to
examine the impact of the accountability policy on student achievement.[1] I first compare
achievement trends before and after the introduction of high-stakes testing, conditioning on
changes in observable student characteristics such as demographics and prior achievement. I
then compare the achievement trends for students and schools that were likely to be strongly influenced by high-stakes testing (e.g., low-achieving students and schools on probation) with trends among students and schools less affected by the policy. To the extent that these high-achieving students and schools were not influenced by the policy, this difference-in-difference strategy will eliminate any common, unobserved time-varying factors that may have
influenced overall achievement levels in Chicago. In order to assess the generalizability of
achievement changes in Chicago, I compare achievement trends on the Iowa Test of Basic Skills
(ITBS), the exam used for promotion and probation decisions, with comparable trends on a state-mandated, low-stakes exam, the Illinois Goals Assessment Program (IGAP). Using a panel of
Illinois school data during the 1990s, I am able to compare IGAP achievement trends in Chicago
with other urban districts in Illinois, thereby controlling for unobserved statewide factors, such
as improvements in the economy or changes in state or federal educational policies. Finally, I
examine several unintended consequences of the policy, such as changes in retention rates and
achievement levels in grades and subjects not directly influenced by the policy.
I find that ITBS math and reading scores increased sharply following the introduction of
high-stakes testing, but that the pre-existing trend of IGAP scores did not change. There is some evidence that test preparation and short-term effort may account for a portion of the differential ITBS-IGAP gains. While high-stakes testing did not have any impact on the likelihood that students would transfer to a private school, move out of the CPS to a different public school district (in the suburbs or out of the state) or drop out of school, it appears to have increased retention rates in the primary grades and decreased relative achievement in science and social studies.

[1] In this analysis, I do not focus on the effect of the specific treatments that accompanied the accountability policy, such as the resources provided to probation schools, summer school and grade retention. For an evaluation of the probation, summer school and grade retention programs, see Jacob and Lefgren (2001).
The remainder of this paper is organized as follows. Section 2 lays out a conceptual
framework for analyzing high-stakes testing, providing background on the Chicago reforms and
reviewing the previous literature on accountability and achievement. Section 3 discusses the
empirical strategy and Section 4 describes the data. Sections 5 to 9 present the main findings.
Section 10 summarizes the major findings and discusses several policy implications.
2. A Conceptual Framework for Analyzing High-Stakes Testing
2.1 Background on High-Stakes Testing in Chicago
In 1996 the CPS introduced a comprehensive accountability policy in an effort to raise
academic achievement. The first component of the policy focused on holding students
accountable for learning, ending a common practice known as “social promotion” whereby
students are advanced to the next grade regardless of their ability or achievement. Under this
policy, students in third, sixth and eighth grades are required to meet minimum standards in
reading and mathematics on the Iowa Test of Basic Skills (ITBS) in order to advance to the next
grade. Students who do not meet the standard are required to attend a six-week summer school
program, after which they retake the exams. Those who pass move on to the next grade.
Students who again fail to meet the standard are required to repeat the grade, with the exception
of 15-year-olds who attend newly created “transition” centers.
One of the most striking features of Chicago’s social promotion policy was its scope.
Table 2.1 illustrates the number of students affected by the policy each year.[2] The ITBS scores
are measured in terms of “grade equivalents” (GEs) that reflect the years and months of learning
a student has mastered. The exam is nationally normed so that a student at the 50th percentile in
the nation scores at the eighth month of her current grade – i.e., an average third grader will
score a 3.8. In the first full year of the policy, the promotion standards for third, sixth and eighth
grade were 2.8, 5.3, and 7.0 respectively, which roughly corresponded to the 20th percentile in
the national achievement distribution. The promotional criteria were raised for eighth graders in
1997-98 and 1998-99, and for all three grades in 1999-2000.
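The promotion rule itself is simple enough to summarize in a few lines. The sketch below is an illustrative reconstruction of the rule as described in the text, using the 1996-97 cutoffs; the function and its simplifications (e.g., the transition-center case is only noted, not modeled) are mine, not an official CPS algorithm.

# ITBS grade-equivalent (GE) cutoffs for promotion in 1996-97 (from the text).
PROMOTION_CUTOFFS = {3: 2.8, 6: 5.3, 8: 7.0}

def promotion_outcome(grade: int, reading_ge: float, math_ge: float,
                      august_retest: bool = False) -> str:
    """Outcome for a gate-grade student: both reading and math must meet
    the cutoff. June failers attend six weeks of summer school and retake
    the exam; August failers repeat the grade (15-year-olds instead attend
    newly created transition centers, a case omitted here)."""
    cutoff = PROMOTION_CUTOFFS.get(grade)
    if cutoff is None:
        return "not a gate grade"
    if reading_ge >= cutoff and math_ge >= cutoff:
        return "promoted"
    return "retained" if august_retest else "summer school, then August retest"

# A third grader scoring 2.6 in reading and 3.1 in math in June:
print(promotion_outcome(3, 2.6, 3.1))  # summer school, then August retest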
Because many Chicago students are in special education and bilingual programs, and are
thereby exempt from standardized testing, only 70 to 80 percent of the students in the system
were directly affected by the accountability policies.[3] Of those who were subject to the policy,
nearly 50 percent of third graders and roughly one-third of sixth and eighth graders failed to
meet the promotional criteria and were required to attend summer school in 1996-97. Of those
who failed to meet the promotional criteria in June, approximately two-thirds passed in August.
As a result, roughly 20 percent of third grade students and 10 to 15 percent of sixth and eighth
grade students were eventually held back in the Fall.
In conjunction with the social promotion policy, the CPS instituted a policy designed to
hold teachers and schools accountable for student achievement. Under this policy, schools in
which fewer than 15 percent of students scored at or above national norms on the ITBS reading
exam are placed on probation. If they do not exhibit sufficient improvement, these schools may
be reconstituted, which involves the dismissal or reassignment of teachers and school
administrators. In 1996-97, 71 elementary schools serving over 45,000 students were placed on
academic probation.[4]
2.2 Prior Research on High-Stakes Testing
Studies of high school graduation tests provide mixed evidence on the effect of high-stakes testing. While several studies found a positive association between student achievement and minimum competency testing (Bishop 1998, Frederiksen 1994, Neill 1998, Winfield 1990),
a recent study with better controls for prior student achievement finds no effect (Jacob 2001).
There is also evidence that mandatory high school graduation exams increase the probability of
dropping out of school, particularly among low-achieving students (Catterall 1987, Kreitzer 1989, Catterall 1989, MacMillan 1990, Griffin 1996, Reardon 1996, Jacob 2001). Craig and Sheu (1992) found modest improvements in student achievement after the implementation of a school-based accountability policy in South Carolina in 1984, but Ladd (1999) found few achievement effects for a school-based accountability and incentive program in Dallas during the early 1990s.

[2] Note that Fall enrollment figures increase beginning in 1997-98 because students who were retained in previous years are counted in the next Fall enrollment for the same grade.

[3] Section 5 examines whether the number of students in bilingual and special education programs, or otherwise excluded from testing, has changed since the introduction of high-stakes testing.
The recent experience in Texas provides additional evidence regarding the impact of
high-stakes accountability systems. In the early 1990s, Texas implemented a two-tiered
accountability system, including a graduation requirement for high school as well as a system of
rewards and sanctions for schools based on student performance. Texas students made dramatic
gains on the Texas Assessment of Academic Skills (TAAS) from 1994 to 1998 (Haney 2000,
Klein 2000). In their 1998 study of state-level NAEP trends, Grissmer and Flanagan (1998)
found that Texas was one of the fastest improving states during the 1990s and that conditional on
family characteristics, Texas made the largest gains of all states. However, in an examination of student achievement in Texas, Klein et al. (2000) found that TAAS gains were several times larger than NAEP gains during the 1990s, and that over a four-year period the average NAEP gains in Texas exceeded those of the nation in only one of three comparisons: fourth grade math, but not fourth grade reading or eighth grade math.[5]
Finally, there is some evidence that high-stakes testing leads teachers to neglect non-tested subjects and less easily observable dimensions of student learning, leading to test-specific achievement gains that are not broadly generalizable. Linn and Dunbar (1990) found that many states made smaller gains on the National Assessment of Educational Progress (NAEP) than on their own achievement exams, presumably due at least in part to the larger incentives placed on the state exams.[6] Koretz et al. (1991) specifically examined the role of high-stakes incentives on student achievement in two state testing programs in the 1980s. They found that scores dropped sharply when a new form of test was introduced and then rose steadily over the next several years as teachers and students became more familiar with the exam, suggesting considerable test-specific preparation. Koretz and Barron (1998) found that gains on the state-administered exam in Kentucky during the early 1990s were substantially larger than gains on the nationally administered NAEP exam and that the NAEP gains were roughly comparable to the national average. Stecher and Barron (1999) documented that teachers in Kentucky allocated instructional time differently across grades to match the emphasis of the testing program.

[4] Probation schools received some additional resources and were more closely monitored by CPS staff. Jacob and Lefgren (2001) examined the impact of monitoring and resources provided to schools on probation, using a regression discontinuity design that compared the performance of students in schools that just made the probation cutoff with those that just missed the cutoff. They found that the additional resources and monitoring provided by probation had no impact on math or reading achievement.

[5] It is difficult to compare TAAS and NAEP scores because the exams were given in different years. Fourth grade reading is the only grade-subject for which a direct comparison is possible since Texas students took both the TAAS and NAEP in 1994 and 1998. Fourth and eighth grade students took the NAEP math exams in 1992 and 1996, but did not start taking the TAAS until 1994.
3. Empirical Strategy
Because the CPS instituted the accountability policy district-wide in 1996, it is difficult
to disentangle the effect of high-stakes testing from shifts in the composition of Chicago public
school students (e.g., an increase in middle-class students attending public schools in Chicago),
changes in state or federal education policy (e.g., the federal initiative to reduce class sizes) or
changes in the economy (e.g., reduced unemployment rates in the late 1990s). This section
describes several different sources of variation I exploit to identify the effect of the policy.
To determine the effect of high-stakes testing on student achievement, I estimate the
following education production function:
(3.1)   y_{isdt} = (HighStakes)_{dt} δ + X_{isdt} β_1 + Z_{sdt} β_2 + γ_t + η_d + φ_{dt} + ε_{isdt}

where y is an achievement score for individual i in school s in district d at time t, X is a vector of student characteristics, Z is a vector of school and district characteristics and ε is a stochastic error term. Unobservable factors at the state and national level are captured by time (γ_t), district (η_d) and district-by-time (φ_{dt}) effects.

[6] Cannell (1987) noted that a disproportionate number of states and districts report being above the national norm on their own tests, dubbing this phenomenon the "Lake Wobegon" effect (Cannell 1987, Linn 1990, Shepard 1990, Shepard 1988).
Using detailed administrative data for each student, I am able to control for observable
changes in student composition, including race, socio-economic status and prior achievement.
Because achievement data is available back to 1990, six years prior to the introduction of the
policy, I am also able to account for pre-existing achievement trends within the CPS. This short, interrupted time-series design (Ashenfelter 1978) accounts for continuing improvement stemming from earlier reform efforts.[7] In the absence of any unobserved time or district*time effects, this simple pre-post design will provide an unbiased estimate of the policy effect.
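To make the estimation concrete, the following is a minimal sketch of how equation (3.1) might be taken to data. The file name, column names, and control set are hypothetical stand-ins for the confidential CPS records, not the paper's actual code; note that with a single district the η_d and φ_{dt} terms drop out, so the within-Chicago specification reduces to a linear trend plus a post-policy indicator.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical student-year panel: one row per student per test year.
df = pd.read_csv("cps_student_panel.csv")

# high_stakes = 1 for test years 1996 and later; year_trend is a linear
# pre-existing trend (full year dummies would be collinear with the
# post indicator in a single-district panel).
model = smf.ols(
    "score ~ high_stakes + year_trend + prior_score + C(race)"
    " + free_lunch + school_poverty",
    data=df,
)
# Cluster standard errors at the school level, as in the paper's tables.
result = model.fit(cov_type="cluster", cov_kwds={"groups": df["school_id"]})
print(result.params["high_stakes"])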
In order to control for unobservable changes over time that affect student achievement,
one might utilize a difference-in-difference strategy which compares the achievement trends of
students who were likely to have been strongly influenced by the policy with the trends of
students who were less likely to be influenced by the policy. While no students were
exogenously excluded from the policy, it is likely that low-achieving students were more
strongly influenced by the policy than their peers, since they were disproportionately at-risk for
retention and were disproportionately located in schools at risk for probation. Thus, if we find
that after 1996 achievement levels of low-achieving students increased more sharply than those of high-achieving students, we might conclude that the policy had a positive impact on achievement.
This model can be implemented by estimating the following specification:

(3.2)   y_{isdt} = (HighStakes)_{dt} δ_1 + (Low)_{dt} δ_2 + (HighStakes × Low)_{dt} δ_3 + X_{isdt} β_1 + Z_{sdt} β_2 + γ_t + η_d + φ_{dt} + ε_{isdt}
where Low is a binary variable that indicates a low prior achievement level.

[7] The inclusion of a linear trend implicitly assumes that any previous reforms that raised student performance would have continued with the same marginal effectiveness in the future. If this assumption is not true, the estimates may be biased downwards. In addition, this aggregate trend assumes that there are no school-level composition changes in Chicago. I test this assumption by including school-specific fixed effects and school-specific trends in certain specifications.
There are two major drawbacks to this strategy. First, it relies on the assumption that, in the absence of the policy, unobserved time trends would have been equal across achievement groups (i.e., γ_{Low} = γ_{High} = γ). Second, it assumes a homogeneous policy effect (i.e., δ_{Low} = δ_{High} = δ). In practice, the policy effect is likely to be a function not only of the incentives generated by the policy, but also of the capacity to respond to those incentives (i.e., δ = δ(a), where a is a measure of prior achievement). It is likely that higher achieving students and schools have a greater capacity to respond to the policy (i.e., more parental support, higher quality teachers, more effective principal leadership, etc.), in which case this comparison will underestimate the effect of the policy. Moreover, high-achieving students may differ from low-achieving students in other ways that might be expected to influence achievement gains (e.g., risk preference). As we see in Section 6, it turns out that the policy effects are larger in low-achieving schools, but there is no consistent pattern in policy effects across student prior achievement groups, which suggests that incentives and capacity may have offset each other.
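A correspondingly minimal sketch of the difference-in-difference specification (3.2), under the same hypothetical data layout as before; the interaction coefficient plays the role of δ_3, and the bottom-quartile cutoff used to define "low" is purely illustrative.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cps_student_panel.csv")  # hypothetical file

# Flag students with low prior achievement (illustrative cutoff).
df["low"] = (df["prior_score"] < df["prior_score"].quantile(0.25)).astype(int)

# high_stakes * low expands to both main effects plus their interaction,
# whose coefficient corresponds to delta_3 in equation (3.2).
model = smf.ols(
    "score ~ high_stakes * low + year_trend + prior_score + C(race) + free_lunch",
    data=df,
)
result = model.fit(cov_type="cluster", cov_kwds={"groups": df["school_id"]})
print(result.params["high_stakes:low"])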
To assess the generalizability of achievement gains under the policy, I compare
achievement trends on the ITBS with performance trends on the Illinois Goals Assessment
Program (IGAP), a state-mandated achievement exam that is not used for accountability
purposes in Chicago. Because these exams are measured in different metrics, I standardize the
outcome measures using the mean and standard deviation from the base year.
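As a concrete illustration of this standardization, the short sketch below converts each exam to standard deviation units relative to its 1993 base-year distribution; the column names are hypothetical.

import pandas as pd

def standardize_to_base_year(df: pd.DataFrame, score_col: str,
                             base_year: int = 1993) -> pd.Series:
    """Express scores in standard deviation units of the base-year cohort."""
    base = df.loc[df["year"] == base_year, score_col]
    return (df[score_col] - base.mean()) / base.std()

# df["itbs_z"] = standardize_to_base_year(df, "itbs_score")
# df["igap_z"] = standardize_to_base_year(df, "igap_score")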
Regardless of whether one uses the ITBS or IGAP as an outcome, the primary
disadvantage of comparing the achievement of Chicago students in different years is that it does
not take into account changes that were taking place outside the CPS during this period. It is
possible that state or national education reforms, or changes in the economy, were responsible
for the increases in student performance in Chicago during the late nineties. Indeed, Grissmer
and Flanagan (1998) show that student achievement increased in many states during the 1990s.
If this were the case, the estimates from equation 3.1 would overstate the impact of high-stakes
testing in Chicago.
Using a panel of IGAP achievement data on schools throughout Illinois during the 1990s,
I am able to address this concern. By comparing the change over time in Chicago with the
change over time in the rest of Illinois, one can eliminate any unobserved common time trends.
By comparing Chicago to other low-income, urban districts in Illinois, we eliminate any
unobserved trends operating within districts like Chicago. To implement this strategy, I estimate
the following model:

(3.3)   y_{sdt} = (Post) δ_1 + (Chicago) δ_2 + (Urban) δ_3 + (Chicago × Post) δ_4 + (Urban × Post) δ_5 + X β_1 + Z β_2 + ε_{sdt}
where y is the average reading or math score for school s in district d at time t, Post is an
indicator that takes on the value of one after 1996 (when high-stakes testing was introduced in
Chicago) and zero otherwise, and Chicago and Urban are binary variables that indicate whether
the school is in Chicago or any low-income, urban district in Illinois. X captures school average
demographics and Z reflects district level demographic variables. This basic model can be
extended to account for pre-existing achievement trends in each district and to examine the
variation in policy effects across the distribution of school achievement (i.e., whether the
differential gain between low-achieving schools in Chicago and Illinois is larger than the
differential gains between high-achieving schools in Chicago and Illinois).
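A minimal sketch of estimating equation (3.3) on the school-level IGAP panel follows. File and variable names are hypothetical; the coefficient on the Chicago-by-post interaction corresponds to δ_4, and the urban-by-post interaction to δ_5.

import pandas as pd
import statsmodels.formula.api as smf

schools = pd.read_csv("illinois_school_panel.csv")  # hypothetical file
schools["post"] = (schools["year"] >= 1996).astype(int)

# post * (chicago + urban) expands to the main effects plus the two
# interactions, matching delta_1 through delta_5 in equation (3.3).
model = smf.ols(
    "mean_score ~ post * (chicago + urban) + pct_free_lunch + pct_black"
    " + pct_hispanic + pct_lep + district_spending",
    data=schools,
)
result = model.fit(
    cov_type="cluster", cov_kwds={"groups": schools["district_id"]}
)
print(result.params[["post:chicago", "post:urban"]])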
4. Data
This study utilizes detailed administrative data from the Chicago Public Schools as well
as the Illinois Board of Education. CPS student records include information on a student’s
school, home address, demographic and family background characteristics, special education and
bilingual placement, free lunch status, standardized test scores, grade retention and summer
school attendance. CPS personnel and budget files provide information on the financial
resources and teacher characteristics in each school and school files provide aggregate
information on the school population, including daily attendance rates, student mobility rates and
racial and SES composition.
The measure of achievement used in Chicago is the Iowa Test of Basic Skills (ITBS), a standardized, multiple-choice exam developed and published by the Riverside Company. The exam is nationally normed and student scores are reported in a grade equivalent metric that reflects the number of years and months of learning. A student at the 50th percentile in the nation scores at the eighth month of her current grade – i.e., an average third grader will score a 3.8. However, because grade equivalents present a number of well-known shortcomings for comparisons over time or across grades,[8] I use an alternative outcome metric derived from an item-response model. By taking advantage of the common items across different forms and levels of the exam, these measures provide an effective way to compare students in different grade levels or taking different forms of the exam.[9]

The primary sample used in this analysis consists of students who were in grades three to eight for the first time in 1993 to 1999 and who were tested and included for official reporting purposes.[10] I limit my analysis to first-time students because the implementation of the social promotion policy caused a large number of low-performing students in third, sixth and eighth grade to be retained, which substantially changed the student composition in these and subsequent grades beginning in 1997-98.[11] By limiting the analysis to cohorts beginning in 1993, I am able to control for at least three years of prior achievement scores for all students.[12] I do not examine cohorts after 1999 because the introduction of the social promotion policy substantially changed the composition of these groups. For example, nearly 20 percent of the original 2000 sixth grade cohort was retained in 1997 when they experienced high-stakes testing as third graders.[13]

Table 4.1 presents summary statistics for the sample. Like many urban districts across the country, Chicago has a large population of minority and low-income students. In the sixth grade group, for example, roughly 56 percent of students are Black, 30 percent are Hispanic and over 70 percent receive free or reduced-price lunch. The percent of students in special education and bilingual programs is substantially lower in this sample than in the entire CPS population because many students in these programs are not tested and thus are not shown here. We see that a relatively small number of students in these grades transfer to a private school (one percent), move out of Chicago (five percent) or drop out of school (one percent). The rates among eighth graders are substantially higher than in the other elementary grades, though not particularly large in comparison to high school rates.

[8] There are three major difficulties in using grade equivalents (GE) for over-time or cross-grade comparisons. First, different forms of the exam are administered each year. Because the forms are not correctly equated over time, one might confound changes in test performance over time with changes in form difficulty. Second, because the grade equivalents are not a linear metric, a score of 5.3 on level 12 of the exam does not represent the same thing as a score of 5.3 on level 13. Thus, it is difficult to accurately compare the ability of two students if they are in different grades. Finally, the GE metric is not linear within test level because the scale spreads out more at the extremes of the score distribution. For example, one additional correct response at the top or bottom of the scale can translate into a gain of nearly one full GE, whereas an additional correct answer in the middle of the scale would result in only a fraction of this increase.

[9] Models based on Item Response Theory (IRT) assume that the probability that student i answers question j correctly is a function of the student's ability and the item's difficulty. In practice, one estimates a simple logit model in which the outcome is whether or not student i correctly answers question j. The explanatory variables include an indicator variable for each question and each student. The difficulty of the question is given by the coefficient on the appropriate indicator variable and the student's ability is measured by the coefficient on the student indicator variable. The resulting metric is calibrated in terms of logits. This metric can be used for cross-grade comparisons as long as the exams in question share a number of common questions (Wright 1979). I thank the Consortium on Chicago School Research for providing the Rasch measures used in this analysis.

[10] As explained in greater detail in Section 5, certain bilingual and special education students are not tested or, if tested, not included for official reporting purposes.

[11] While focusing on first-timers allows a consistent comparison across time, it is still possible that the composition changes generated by the social promotion policy could have affected the performance of students in later cohorts. For example, if first-timers in the 1998 and 1999 cohorts were in classes with a large number of low-achieving students who had been retained in the previous year (in addition to the low-achieving students in the grade for the first time), they might perform lower than otherwise expected. This would bias the estimates downward.

[12] Because a new series of ITBS forms was introduced in 1993, focusing on these cohorts also permits the cleanest comparisons of changes over time. For students missing prior test score information, I set the missing test score to zero and include a variable indicating that the score for this period was missing.

[13] For this reason, the eighth grade sample only includes cohorts from 1993 to 1998. The 1999-2000 cohort is not included in the third grade sample because there is evidence that retention rates in first and second grade increased substantially after the introduction of high-stakes testing even though these grades were not directly affected by the social promotion policy.
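To illustrate the Rasch logic sketched in footnote [9], the toy estimation below recovers ability and difficulty parameters from long-format item responses. The data layout is hypothetical, and a production analysis would use a dedicated IRT package rather than a dummy-variable logit; this is only a sketch of the footnote's description.

import pandas as pd
import statsmodels.formula.api as smf

# One row per student-item pair: correct (0/1), student_id, item_id.
responses = pd.read_csv("item_responses.csv")  # hypothetical file

# Joint logit with student and item dummies: student coefficients are
# ability estimates in logits; item coefficients (sign reversed) are
# difficulties. Common items across forms anchor the scale so that
# students taking different forms or levels remain comparable.
model = smf.logit("correct ~ C(student_id) + C(item_id)", data=responses)
fit = model.fit(disp=False)
print(fit.params.filter(like="student_id").head())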
5. The Impact of High-Stakes Testing on Testing Patterns
While the new accountability policies in Chicago are designed to increase student
achievement, high-stakes testing creates incentives for students and teachers that may change
test-taking patterns. For example, if students are required to pass the ITBS in order to move to
the next grade, they may be less inclined to miss the exam. On the other hand, under the
probation policy teachers have an incentive to dissuade low-achieving students from taking the
exam.[14] In a recent descriptive analysis of testing patterns in Chicago, Easton et al. (2000, 2001) found that the percent of CPS students who are tested and included[15] for reporting purposes declined during the 1990s, particularly following the introduction of high-stakes testing. However, they attribute this decline to changing demographics (specifically, an increase in bilingual students in the system) and administrative changes in the testing policy[16] rather than to the new high-stakes testing program. Shifts in testing patterns are important not only because they may indicate that some students are being neglected under the policy, but also because such shifts may bias estimates of the achievement gains under the policy. Building on the work of Easton and his colleagues, this analysis not only controls for changes in observable student characteristics over time and pre-existing trends within the CPS, but also separately considers the trends for gate versus non-gate grades and low- versus high-achieving students.

[14] Schools are not explicitly judged on the percentage of their students who take the exams, although it is likely that a school with an unusually high fraction of students who miss the exam would fall under suspicion.

[15] Students in Chicago fall into one of three testing categories: (1) tested and included, (2) tested and excluded, and (3) not tested. Included versus excluded refers to whether the student's test scores count for official reporting purposes. Students who are excluded from reporting are not subject to the social promotion policy and their scores do not contribute to the determination of their school's probation status. The majority of students in grades two through eight take the ITBS, but roughly ten percent of these students are excluded from reporting on the basis of participation in special education or bilingual programs. A third group of students is not required to take the ITBS at all because of a bilingual or special education classification. Students with the least severe special education placements and those who have been in bilingual programs for more than four years are generally tested and included. Those with more severe disabilities and those who have been in bilingual programs three to four years are often tested but excluded. Finally, those students with severe learning disabilities and those who have only been in bilingual programs for one or two years are generally not tested.

[16] Prior to 1997, the ITBS scores of all bilingual students who took the standardized exams were included for official reporting purposes. During this time, CPS testing policy required students enrolled in bilingual programs for more than three years to take the ITBS, but teachers were given the option to test other bilingual students. According to school officials, many teachers were reluctant to test bilingual students, fearing that their low scores would reflect poorly on the school. Beginning in 1997, CPS began excluding the ITBS scores of students who had been enrolled in bilingual programs for three or fewer years to encourage teachers to test these students for diagnostic purposes. In 1999, the CPS began excluding the scores of fourth-year bilingual students as well, but also began requiring third-year bilingual students to take the ITBS exams.
5.1 Trends in Testing and Exclusion Rates in Gate Grades
Table 5.1 presents OLS estimates of the relationship between the probability of (a)
missing the ITBS and (b) being excluded from reporting conditional on taking the ITBS.[17]
Controls include prior achievement, race, gender, race*gender interactions, neighborhood
poverty, age, household composition, a special education indicator and a linear time trend.
Robust standard errors that account for the correlation of errors within school are shown in
parentheses. Estimates for Black students are shown separately because these students were
unlikely to have been affected by the changes in the bilingual policy.
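A hedged sketch of the Table 5.1 linear probability models follows, with school-clustered standard errors; all file and variable names are hypothetical placeholders for the controls listed above.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cps_testing_status.csv")  # hypothetical file

# Linear probability model of missing the ITBS; per footnote [17], a
# probit (smf.probit with the same formula) gives virtually identical
# results.
model = smf.ols(
    "missed_itbs ~ high_stakes + prior_score + C(race) * female"
    " + neighborhood_poverty + age + special_ed + year_trend",
    data=df,
)
result = model.fit(cov_type="cluster", cov_kwds={"groups": df["school_id"]})
print(result.params["high_stakes"])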
The results suggest that high-stakes testing increased test-taking rates, particularly among
eighth grade students. For Black eighth grade students, the policy appears to have decreased the
probability of missing the ITBS between 2.8 and 5.7 percentage points, a decline of 50 to 100
percent given the baseline rate of 5.7 percent. These effects were largest among high-risk
students and schools (tables available from author upon request). In contrast, high-stakes testing
appears to have had little if any impact on the probability of being excluded from test reporting.
The probability of being excluded did increase slightly for Black students in eighth grade under
high-stakes testing, although this may be biased upward due to the increase in test-takers.[18]
There were no effects for third and sixth grade students.
[17] Probit estimates yield virtually identical results, so linear probability estimates are presented for ease of interpretation.

[18] If the students induced to take the ITBS under the high-stakes regime were more likely to be excluded, the exclusion estimates would be biased upward. To control for possible unobservable characteristics that are correlated with exclusion and the probability of taking the exam under the high-stakes testing regime, I include five variables indicating whether the student was excluded from test reporting in each of the previous five years that he/she took the exam. To further test the sensitivity of the exclusion estimates to changes in the likelihood of taking the ITBS, I estimate each model under the alternative assumptions that exclusion rates among those not tested are equal to, twice as high as, and 50 percent as high as the rate among those tested in the same school prior to high-stakes testing. The results are robust to these assumptions.
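The bounding exercise in footnote [18] can be made concrete with a small sketch: impute exclusion propensities for non-tested students as multiples of their school's pre-policy exclusion rate among tested students. Everything here (column names, function) is an illustrative reconstruction under those stated assumptions, not the paper's code.

import pandas as pd

def bounded_exclusion_rate(df: pd.DataFrame, multiplier: float) -> float:
    """Overall exclusion rate when each non-tested student is assigned
    multiplier * (their school's pre-policy exclusion rate among tested)."""
    tested = df[df["tested"] == 1]
    not_tested = df[df["tested"] == 0]
    imputed = (not_tested["school_base_rate"] * multiplier).clip(upper=1.0)
    return (tested["excluded"].sum() + imputed.sum()) / len(df)

# Footnote [18] considers multipliers of 1.0, 2.0 and 0.5:
# for m in (1.0, 2.0, 0.5):
#     print(m, bounded_exclusion_rate(df, m))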
5.2 Trends in Testing and Exclusion Rates in Non-Gate Grades
While the social promotion policy only directly affected students in third, sixth and
eighth grades, the probation policy may have affected students in all grades since schools are
judged on the basis of student ITBS scores in grades three through eight. Because ITBS scores
do not have direct consequences for students in non-gate grades, they provide a cleaner test of
behavioral responses on the part of teachers. Probation provides teachers an incentive to exclude
low-achieving students and to encourage high-achieving students to take the exam.
While the probation policy may have influenced teacher behavior in each year after 1996,
it may be difficult to identify the effects after the first year of the policy because the introduction
of the social promotion policy in grades three and six in 1997 influenced the composition of
students in the fourth, fifth, and seventh grades beginning in the 1997-98 school year. To avoid
these selection problems, I limit the analysis to the first cohort to experience the probation
policy. Table 5.3 presents OLS estimates of the probability of (a) not being tested and (b) being excluded, for
Black students in fourth, fifth and seventh grade in 1997. The average effect shown in the top
row suggests that the accountability policy decreased the likelihood of missing the ITBS, but did
not significantly increase the probability of being excluded (with the exception of fifth grade,
where the effect was statistically significant, but substantively small). The second panel shows
that the largest decreases in missing the ITBS took place in low achieving schools. In the third
panel, we see that the likelihood of being excluded increased for low-achieving students but
decreased for high-achieving students.
6. The Impact of High-Stakes Testing on Math and Reading Achievement
Having examined the effect of high-stakes testing on test-taking and exclusion patterns,
we now turn to the effect on math and reading achievement. Figures 6.1 and 6.2 show the trends
in ITBS math and reading scores for third, sixth and eighth grades from 1990 to 2000. Because
the social promotion policy was instituted for eighth graders one year before it was instituted for
third and sixth graders, I have rescaled the horizontal axis to reflect the years before or after the
introduction of high-stakes testing. Recall that the sample includes only students who were in
third, sixth or eighth grade for the first time (i.e., no retained students). Also note that the trend
for eighth grade students ends in 1998 because the composition of eighth grade cohorts changed
substantially in 1999 when the first wave of sixth graders to experience the grade retentions in
1997 reached the eighth grade. The achievement trends rise sharply after the introduction of
high-stakes testing for both math (Figure 6.1) and reading (Figure 6.2). In mathematics,
achievement increased slightly in the six years prior to the policy, but rose dramatically
following the implementation of the accountability policy. Achievement levels are roughly 0.25
logits higher under high-stakes. Given a baseline annual gain of 0.50 logits, achievement levels
have risen by the equivalent of one-half of a year’s learning after implementation of high-stakes
testing. The slope of the line is steeper under the new regime because later cohorts were exposed
to the accountability policy for a longer period before taking the exam and/or students and
teachers may have become more efficient at responding to the policy over time.[19]
In estimating the impact of the high-stakes testing policy on student achievement, it is
important to control for student demographics and prior achievement scores since there is
evidence that the composition of the Chicago student population may have been changing over
this period. Controlling for prior achievement does not present a problem in estimating the
policy effects for the 1997 cohort since this group only experienced the heightened incentives in
the 1996-97 school year. However, as already noted, later cohorts experienced the incentives
generated by the accountability policies prior to entering the gate grade. For example, the 1999
sixth grade cohort attended fourth and fifth grades (1997 and 1998) under the new regime. To
the extent that the probation policy influenced student achievement in non-gate grades, or
students, parents or teachers changed behavior in non-gate grades in anticipation of the new
promotional standards in the gate-grades, one would expect student test scores to be higher in
these non-gate grades in comparison to the counterfactual no-policy state.
This presents two potential problems in the estimation. First, these pre-test scores would be endogenous in the sense that they were caused by the policy. If the policy increased achievement in these grades, controlling for these prior test scores would decrease the estimated effect of the policy.[20] Second, the inclusion of these endogenous pre-test scores will influence the estimation of the coefficients on the prior achievement (and thus all other) variables. For this reason, Table 6.1 shows OLS estimates with various prior achievement controls.[21] The first column for each subject includes controls for three prior years of test score information (t-1, t-2 and t-3). The second column for each subject controls for achievement scores at time t-2 and t-3. The third column for each subject controls only for test scores at t-3.
The preferred estimates (in the shaded cells) suggest that high-stakes testing has
significantly increased student achievement in virtually all subjects, grades and years. For example,
consider the results for eighth grade reading. The estimated effect for the 1996 cohort is 0.132,
meaning that eighth graders in 1995-96 scored 0.132 logits higher than observationally similar
eighth graders in earlier years. To get a sense of the magnitude of this effect, the bottom of the
table presents the average gain for third, sixth and eighth graders in 1993 to 1995 (with the standard deviation in parentheses). Given the average annual learning gain of roughly 0.50, high-stakes testing increased the reading achievement of eighth graders in 1996 by more than 25 percent of the annual learning gain. The impact on math achievement was roughly 20 percent of the annual learning gain.[22]

[19] With these data, we are not able to distinguish between an implementation effect and longer exposure to the treatment.

[20] Alternatively, these estimates could simply be interpreted as the impact of the policy in the gate year alone.

[21] To the extent that high-stakes testing has significantly changed the educational production function, we might expect the coefficient estimates to be significantly different for cohorts in the high-stakes testing regime. In this case, one might be concerned that including the post-policy cohorts in the estimation might change the estimation of the policy effect. To check this, I first estimated a model that included only pre-policy cohorts. I then used the coefficient estimates to calculate the policy effect for post-policy cohorts. The results were virtually identical.
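The conversion behind the magnitudes reported above is simple arithmetic: dividing the estimated effect by the average annual learning gain expresses it in years of learning. For the eighth grade reading estimate, 0.132 logits / 0.50 logits per year ≈ 0.26 years, i.e., just over 25 percent of a typical annual gain.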
The one exception occurs in third grade reading in 1997 and 1999. As I suggest in the next section, this may be due to form effects. Note that it is not problematic to include controls for first and second grade achievement insofar as first and second grade teachers have not adjusted instruction since the introduction of high-stakes testing, a reasonable assumption given that these grades are not counted in determining a school's probation status.
6.2 The Sensitivity of High-Stakes Testing Effects
To test the sensitivity of the findings presented in the previous section, Tables 6.2 and
6.3 present comparable estimates for a variety of different specifications and samples. Column 1
shows the baseline estimates for all students who took the test and whose scores were included
for test reporting. Columns 2 to 7 show that the results are robust to using excluded as well as
included students and to a variety of assumptions regarding the hypothetical scores of students
who missed the exam. Columns 8 and 9 suggest that the original estimates are robust to the
inclusion of a single linear trend as well as school-specific trends and school-specific intercepts.
A potentially more serious concern involves the fact that the CPS administered several different forms of the ITBS exam during the 1990s. While the exams are generally comparable, there is evidence that some forms are slightly easier than others, particularly for certain grades and subjects and at certain points in the ability distribution. This can be seen in the achievement trends shown in Figures 6.1 and 6.2. Some of the year-to-year swings that appear too large to be explained by a change in the student population or other events taking place in the school system are likely due to form effects. To the extent that the timing of different forms coincides with the introduction of the accountability policy, our estimates may be biased. For example, in Table 6.1 the baseline estimates for the 1997 and 1999 reading effects are noticeably smaller than the 1998 effect in third and sixth grade. It seems unlikely that any change in the policy could have been responsible for this pattern, particularly because it is not increasing or decreasing monotonically. This suggests that the reading section of Form M (the form given in 1997 and 1999) was more difficult than the comparable section in earlier forms. One way to test the sensitivity of the findings to potential form effects is to compare achievement in years in which the same form was administered. Fortunately, Chicago students took ITBS Form L in 1994, 1996 and 1998.[23] Column 10 presents estimates from a sample limited to these three cohorts, which suggest that form effects are not driving the results presented earlier.

[22] This table also provides a sensitivity check for the model. The results to the right of each shaded cell within each panel are alternative estimates that use different prior achievement controls. In eighth grade reading, for example, the specification that controls for three years of prior achievement data yields an estimate for the 1996 cohort of .132. Specifications that include only two or one year of prior test score information yield estimates of .133 and .114 respectively. Reading across the rows, it appears that the sixth and eighth grade estimates are not too sensitive to the prior achievement controls.

[23] This is the only form that was given under both low- and high-stakes regimes. Form K was administered in 1993 and 1995. Form M was administered in 1997 and 1999. While Chicago students took Form K again in 1999-2000, it is difficult to compare this cohort to earlier cohorts because of composition changes caused by the retention of low-achieving students in earlier cohorts.
6.3 The Heterogeneity of Effects Across Student and School Risk Level
Having examined the aggregate achievement trends in the CPS, we now examine the
heterogeneity of gains across students and schools. Figures 6.3 and 6.4 show ITBS math and reading trends at different points on the distribution for third, sixth and eighth grade students. Achievement increased by roughly the same absolute amount at the different points on the distribution, suggesting that there was not a significant difference between low- and high-achieving students. Table 6.4 shows OLS estimates of the differential effects across students and schools. Because the pattern of effects is similar across cohorts, for the sake of simplicity I
combine all three cohorts that experienced high-stakes testing and control for all prior
achievement scores, which means that these estimates only capture the effect of high-stakes
testing during the gate grade.
The first row shows that the average effects for all students and schools are positive and
significant. Estimates in the middle panel show that students in low-achieving schools made
larger gains under high-stakes testing than their peers in higher-achieving schools. The bottom
panel shows that the relationship between prior student achievement and the policy effect varies
considerably by subject and grade. Overall, there is not strong evidence that low-achieving
students do better under the policy, perhaps because these students did not have the capacity to
respond to the incentives as well as their peers. Because it appears that the accountability policy
influenced students and schools across the ability distribution, the difference-in-difference estimates will not recover the full impact of the policy.
6.4 The Impact of High-Stakes Testing in Non-Gate Grades
While the social promotion policy clearly focuses on the three gate grades, it is useful to
examine achievement trends in the non-gate grades as well. Because the probation policy
provides incentives for teachers in non-gate as well as gate grades, the achievement trends in
non-gate grades provide a better measure of the independent incentive effect of the probation
policy.[24] As noted earlier, this analysis focuses on the 1996-97 cohort because the social
promotion policy changed the composition of students in later years. Figures 6.5 and 6.6 show
math and reading achievement trends for students in grades four, five and seven who took the
ITBS and were included in test reporting. Math achievement (Figure 6.5) appears to have
continued a steady upward trend that began prior to the introduction of high-stakes testing. In
contrast, reading achievement (Figure 6.6) jumped sharply in 1997 in both fifth and seventh grades, but there does not appear to be a significant increase in fourth grade. This is consistent with the fact that the school accountability policy focuses mainly on reading.

[24] Teachers and students in these grades may be anticipating and responding to the promotional standards in the next grade as well. For example, interviews with students and teachers indicate that many students worry about the cutoffs in earlier years and that many teachers use the upcoming, next-year exams as motivation and classroom management techniques.
Table 6.5 shows OLS estimates of the policy effects in fourth, fifth and seventh grade.
The aggregate estimates in the top row tell the same story as the graphs—there are significant
positive effects in all subjects and grades except for fourth grade reading. The results in the
middle panel suggest that students in lower achieving schools made larger improvements under
high-stakes testing, although the pattern of effects across student prior achievement levels is
more mixed. An analysis of policy effects within student*school achievement cells shows that in
the lowest achieving schools, students above the 25th percentile showed the largest increases in
reading achievement, again consistent with the focus of the accountability policy.
7. How Generalizable Were the Test Score Gains in Chicago?
In the previous section, we saw that ITBS math and reading scores in Chicago increased
dramatically following the introduction of high-stakes testing. This section examines whether
the observed ITBS gains reflect a more general increase in student achievement by comparing
student performance trends on the ITBS with comparable trends on the Illinois Goals
Assessment Program (IGAP), a state-mandated achievement exam that is not used for
accountability purposes in Chicago. The data for this analysis is drawn from school “report
cards” compiled by the Illinois State Board of Education (ISBE), which provide average IGAP
scores by grade and subject as well as background information on schools and districts. The
analysis is limited to the period from 1990 to 1998 because Illinois introduced a new exam in
1999.
7.1 A Comparison of ITBS and IGAP Achievement Trends in Chicago Over Time[25]
Figure 7.1 shows the achievement trends on the math IGAP exam for all students in
Chicago from 1990 to 1998. IGAP math scores increased steadily in all grades over this period,
but no significant discontinuity is evident in 1997.[26] Figures 7.2 to 7.4 show both IGAP and
ITBS trends for first-time students in grades three, six and eight whose scores were included in
ITBS reporting. Scores are standardized using the 1993 means and standard deviations. In third
and sixth grade, both the ITBS and IGAP trend upward over this period, although they increase
most steeply at different points. IGAP scores are increasing most sharply from 1994 to 1996
while ITBS scores rise sharply from 1996 to 1998. In the eighth grade, the ITBS and IGAP
trends track each other more closely, although it still appears that ITBS gains outpaced IGAP
scores two and three years after the introduction of high-stakes testing.
[25] This analysis is limited to math because equating problems make it difficult to compare IGAP reading scores over time (Pearson 1998).

[26] In 1993, Illinois switched the scaling of the exam slightly, which accounts for the change from 1992 to 1993.
Table 7.1 shows OLS estimates of the ITBS and IGAP effects. The sample is limited to
the period from 1994 to 1998 because student-level IGAP data were not available before this time. With no control for pre-existing trends in columns 1 and 5, it appears that math achievement increased by a similar magnitude on the ITBS and IGAP. However, when we account for pre-existing achievement trends in columns 2 and 6, ITBS achievement scores appear to have
increased significantly more than IGAP scores among third and sixth grade students. In eighth
grade, however, the ITBS and IGAP remain comparable once we control for pre-existing trends.
(It is important to note that in the eighth grade model, the prior achievement trend is only based
on two data points – 1994 and 1995). Thus, in the third and sixth grades it appears that the
positive impact of high-stakes testing is not generalizable beyond the high-stakes exam. On the
other hand, in eighth grade, the high-stakes testing effects appear more consistent across the
ITBS and IGAP.
7.2 A Comparison of IGAP Trends Across School Districts in Illinois
Having examined the ITBS and IGAP math trends in Chicago during the 1990s, we now
compare achievement trends in Chicago to those in other districts in Illinois. This cross-district
comparison not only allows us to examine reading as well as math scores (since the scoring
changes in reading applied to all districts), but also allows us to control for unobservable factors
operating in Illinois. Table 7.2 presents summary statistics for the sample in 1990.[27] Columns 1 and 2 show the statistics for Chicago and non-urban districts in Illinois. Column 3 shows the statistics for urban districts in Illinois excluding Chicago.[28] The City of Chicago serves a significantly more disadvantaged population than the average district in the state, surpassing even other urban districts in terms of the percent of students receiving free lunch and the percent of Black and Hispanic students in the district.
Figures 7.5 and 7.6 show changes over time in the average difference in achievement
scores between Chicago and other urban districts. While it appears that Chicago has narrowed
the achievement gap with other urban districts in Illinois, the trend appears to have begun prior
to the introduction of high-stakes testing. One possible exception is in eighth grade reading,
where it appears that Chicago students may have improved more rapidly than students in other
urban districts following the introduction of high-stakes testing. Tables 7.3 and 7.4 show OLS
estimates from the specifications in equation 3.3 that control for racial composition of the school,
the percent of students receiving free or reduced price lunch, the percent of Limited English
Proficient (LEP) students, school mobility rates, per-pupil expenditures in the district and the
percent of teachers with at least a Masters degree in the district. The sample used in the
[27] I delete roughly one-half of one percent of observations that were missing test score or background information. To identify the comparison districts, I first identify districts in the top decile in terms of the percent of students receiving free or reduced-price lunch, percent minority students, and total enrollment, and in the bottom decile in terms of average student achievement (averaged over third, sixth and eighth grade reading and math scores), based on 1990 data. Not surprisingly, Chicago falls in the bottom of all four categories. Of the 840 elementary districts in 1990, Chicago ranks first in terms of enrollment, 12th in terms of percent of low-income and minority students and 830th in student achievement. Other districts that appear at the bottom of all categories include East St. Louis, Chicago Heights, East Chicago Heights, Calumet, Joliet, Peoria and Aurora. I then use the 34 districts (excluding Chicago) that fall into the bottom decile in at least three out of four of the categories. I have experimented with several different inclusion criteria and the results are not sensitive to the choice of the urban comparison group.
regressions only included information from 1993 to 1998 in order to minimize any confounding
effects from the change in scoring. Recall that Chicago is coded as an urban district in these
models, so that the coefficient on the Chicago indicator variable reflects the difference between
Chicago and all other urban comparison districts. Once we take into account the pre-existing
trends (columns 2, 4 and 6), there is no significant difference between Chicago and the urban
comparison districts in either reading or mathematics. This is true for schools across the
achievement distribution (figures and tables available upon request).
8. Explaining the Differential Gains on the ITBS and the IGAP
There are a variety of reasons why achievement gains could differ across exams under
high-stakes testing, including differential student effort, ITBS-specific test preparation and
cheating.29 Unfortunately, because many of these factors such as effort and test preparation are
not directly observable, it is difficult to attribute a specific portion of the gains to a particular
cause. Instead we now seek to provide some evidence regarding the potential influence of
several factors.
8.1 The Role of Guessing
Prior to the introduction of high-stakes testing, roughly 20 to 30 percent of students left
items blank on the ITBS exams despite the fact that there was no penalty for guessing. One
explanation for the relatively large ITBS gains since 1996 is that the accountability policy
encouraged students to guess more often rather than leaving items blank. If we believe that the
increased test scores were due solely to guessing, we might expect the percent of questions
answered to increase, but the percent of questions answered correctly (as a percent of all
answered questions) to remain constant or perhaps even decline. However, from 1994 to 1998,
the percent of questions answered correctly increased by roughly 9 percent at the same time that
the percent left blank declined by 36 percent, suggesting that the higher completion rates were
not due entirely to guessing. Even if we were to assume that the increase in item completion is
due entirely to random guessing, however, Table 8.1 shows that guessing could only explain 10
to 20 percent of the observed ITBS gains among those students who were most likely to have left
items blank in the past.
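The guessing bound can be reproduced with simple arithmetic. The sketch below mirrors the Table 8.1 decomposition for the eighth grade math row; the 0.25 factor assumes four answer choices per item, and the variable names are mine.

    # Back-of-the-envelope version of the Table 8.1 decomposition:
    # how much of the observed ITBS gain could pure random guessing explain?
    extra_answered = 1.63     # (1) additional questions answered in 1998
    gain_per_correct = 0.08   # (3) GE gain per additional correct response
    actual_gain = 0.27        # (5) observed gain, in grade equivalents

    expected_correct = extra_answered * 0.25                  # (2) expected lucky guesses
    gain_from_guessing = expected_correct * gain_per_correct  # (4) ~ 0.03 GE
    share = gain_from_guessing / actual_gain                  # (6) ~ 0.12

    print(f"gain from guessing: {gain_from_guessing:.3f} GE, {share:.0%} of observed gain")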
8.2 The Role of Test Preparation
Another explanation for the differential effects involves the content of the exams. If the
ITBS and IGAP emphasize different topics or skills, and teachers have aligned their curriculum
to the ITBS in response to the accountability policy, then we might see disproportionately large
increases on the ITBS. As we can see in Table 8.2, while the two exams have the same general
format, the IGAP appears to place greater emphasis on critical thinking and problem-solving
skills. For example, the IGAP math exam has fewer straight computation questions, and even
these questions are asked in the context of a sentence or word problem. Similarly, with its long
passages, multiple correct answers and questions comparing passages, the IGAP reading exam
appears to be more difficult and more heavily weighted toward critical thinking skills than the
ITBS exam.
To the extent that the disproportionately large ITBS gains were driven by ITBS-specific
curriculum alignment or test preparation, we might expect to see the largest gains on the ITBS
items that are (a) easy to teach to and (b) relatively more common on the ITBS than the IGAP.
Table 8.3 presents OLS estimates of the relationship between high-stakes testing and ITBS math
29 Jacob and Levitt (2001) found that instances of classroom cheating increased substantially following the introduction of high-stakes testing. However, they estimate that cheating can explain at most 10 percent of the observed ITBS gains during this period.
achievement by item type.30 The sample consists of tested and included students in the 1996 and
1998 cohorts, both of which took ITBS Form L. The dependent variable is the proportion of
students who answered the item correctly in the particular year. We see that students made the
largest gains on items involving computation and number concepts and made the smallest gains
on problems involving estimation, data analysis and problem-solving skills.
Using the coefficients in Table 8.3, we can adjust for the composition differences
between the ITBS and IGAP exam. Suppose that under high-stakes testing students were 2.3
percentage points more likely to correctly answer IGAP math items involving critical thinking or
problem solving skills and 4.2 percentage points more likely to answer IGAP math items
requiring basic skills. Scores on the IGAP math exam range from 1 to 500. Assuming a linear
relationship between the proportion correct and the score, a one percentage point increase in the
proportion correct translates into a five point gain (which, given the 1994 standard deviation of
roughly 84 points, translates into a gain of 0.06 standard deviations). If the distribution of
question types on the IGAP resembled that on the ITBS, with one-third of the items devoted to
basic skill material, then IGAP scores would have been predicted to increase by an additional
two points.31 Given the fact that IGAP scores rose by roughly 15 points between 1996 and 1998,
it appears that the different compositions of items on the ITBS and IGAP exams cannot explain
much of the observed ITBS-IGAP difference.
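The adjustment in this paragraph amounts to reweighting the item-type gains by each exam's content mix. A minimal sketch, assuming the ITBS basic-skills share of one-third given in the text and, as my own reading of the Table 8.2 content counts, a basic-skills (computation) share of roughly 10 of 70 items on the IGAP:

    gain_basic = 4.2       # pp gain on basic-skills items under high stakes
    gain_critical = 2.3    # pp gain on critical-thinking/problem-solving items
    points_per_pp = 5.0    # one pp correct ~ 5 IGAP points (SD ~ 84, so ~ 0.06 SD)

    share_itbs = 1 / 3     # basic-skills share of ITBS items (from the text)
    share_igap = 10 / 70   # assumed basic-skills share of IGAP items

    extra = (share_itbs - share_igap) * (gain_basic - gain_critical) * points_per_pp
    print(f"predicted additional IGAP gain: {extra:.1f} points")  # ~1.8 of the ~15 observed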
8.3. The Role of Effort
If the consequences associated with ITBS performance led students to concentrate harder
during the exam or caused teachers to ensure optimal testing conditions for the exam, and these
factors were not present during the IGAP testing, student achievement on the ITBS may have
exceeded IGAP achievement even if the content of the exams were identical. Because we cannot
30 Because it is difficult to classify items in the reading exam, I limit the analysis to the math exam.
directly observe effort, however, it is difficult to determine its role in the recent ITBS
improvements. While increased guessing cannot explain a significant portion of the ITBS gains,
other forms of effort may play a larger role.
Insofar as there is a tendency for children to “give up” toward the end of the exam—
either leaving items blank or filling in answers randomly—an increase in effort may lead to a
disproportionate increase in performance on items at the end of the exam, when students are tired
and more inclined to give up. One might describe this type of effort as test stamina – i.e., the
ability to continue working and concentrating throughout the entire exam. Table 8.4 presents
OLS estimates of the relationship between item position and achievement gains from 1994 to
1998, using the same sample as Table 8.3. The analysis is limited to the reading exam because
the math exam is divided into several sections, so that item position is highly correlated with
item type. Conditional on item difficulty, student performance on the last 20 percent of items
increased 2.7 percentage points more than performance on the first 20 percent of items. In fact,
conditional on the item difficulty, it appears that all of the improvement on the ITBS reading
exam came at the end of the test. This suggests that effort played a significant role in the ITBS
gains seen under high-stakes testing.
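The test-stamina regression has a simple structure: item-level gains regressed on an end-of-exam indicator, conditional on item difficulty. The sketch below uses synthetic item data; the names (p_correct, last_fifth, difficulty) and the two-period setup are assumptions of mine, with the 2.7 percentage point figure from the text built in purely for illustration.

    # Illustrative item-position regression in the spirit of Table 8.4.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    items = pd.DataFrame({
        "position": np.tile(np.arange(1, 50), 2),  # 49 reading items in each of two years
        "year98": np.repeat([0, 1], 49),           # 0 = pre-policy cohort, 1 = 1998 cohort
    })
    items["last_fifth"] = (items["position"] > 39).astype(int)
    items["difficulty"] = np.tile(rng.uniform(0.3, 0.8, 49), 2)  # same items both years
    items["p_correct"] = (items["difficulty"] + 0.001 * items["year98"]
                          + 0.027 * items["year98"] * items["last_fifth"]
                          + rng.normal(0, 0.005, len(items)))

    m = smf.ols("p_correct ~ year98*last_fifth + difficulty", data=items).fit()
    print(m.params["year98:last_fifth"])  # recovers the ~.027 end-of-exam differential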
9. Unintended Consequences of High-Stakes Testing
Having addressed the impact of the accountability policy on math and reading
achievement, we now examine how high-stakes testing has affected other student outcomes.
While there is little evidence that the accountability policy affected the probability of
transferring to private schools, moving out of the district, or dropping out of school,32 it appears
to have influenced primary grade retentions and achievement scores in science and social
studies.
31 All figures are for the average of third, sixth and eighth grades.
9.1 The Impact of High-Stakes Testing on the Early Primary Grades
Table 9.1 presents OLS estimates that control for a variety of student and school
demographics as well as pre-existing trends.33 34 While enrollment rates decreased slightly over
this period, there does not appear to be any dramatic change following the introduction of highstakes testing. In contrast, retention rates appear to have increased sharply under high-stakes
testing. The retention rates were quite low among the 1993 cohort, ranging from 1.1 percent for
kindergartners to 4.2 percent for first graders. First graders in 1997-1999 were roughly 50
percent more likely to be retained than comparable peers in 1993 to 1996. Second graders were
nearly twice as likely to be retained after 1997.
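Per footnote 33, the retention estimates come from a linear probability model. A minimal sketch on synthetic data, with hypothetical variable names; the post-policy indicator plays the role of the post-1997 cohort effects in Table 9.1.

    # Sketch of a linear probability model for early-grade retention.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(2)
    n = 5000
    df = pd.DataFrame({"year": rng.integers(1993, 2000, n)})
    df["trend"] = df["year"] - 1993                # pre-existing trend
    df["post"] = (df["year"] >= 1997).astype(int)  # post-policy cohorts
    df["retained"] = rng.binomial(1, 0.03 + 0.02 * df["post"])  # illustrative jump

    lpm = smf.ols("retained ~ post + trend", data=df).fit(cov_type="HC1")
    print(lpm.params["post"])  # change in retention probability after 1997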
9.2 The Impact of High-Stakes Testing on Science and Social Studies Achievement
Figure 9.1 shows fourth grade ITBS achievement trends in all subjects from 1995 to
1997.35
We see that science and social studies scores increased substantially from 1995 to
1996, most likely because 1995 was the first year that these exams were mandatory for fourth
graders. Over this same period, there was little change in math or reading scores. Following the
introduction of high-stakes testing in 1996-97, math and reading scores rose by roughly 0.20
grade equivalents, but there was no change in science or social studies achievement. Figure 9.2
shows similar trends for eighth grade students from 1995 to 1998. The patterns for eighth grade
32 Figures and tables available from the author upon request.
33 Probit estimates yield nearly identical results, so the estimates from a linear probability model are presented for ease of interpretation.
34 Roderick et al. (2000) found that retention rates in kindergarten, first and second grades started to rise in 1996 and jumped sharply in 1997 among first and second graders. This analysis builds on the previous work by examining enrollment and special education trends (not shown here) as well as retention, controlling for observable student characteristics and pre-existing trends and examining the heterogeneity across students.
35 Fourth graders did not take the ITBS science and social studies exams before 1995. After 1997, the composition of fourth grade cohorts changed due to the introduction of grade retention for third grade students. In general, a difficulty in studying trends in science and social studies achievement in Chicago is that the exams in these subjects are not given every year and, in many cases, are administered to different grades than the math and reading exams. The CPS has consistently required students in grades two to eight to take the reading and math portions of the ITBS, but has frequently changed which grades are required to take the science and social studies exams. The only consistent series consists of fourth and eighth grade students from 1995 to 1999.
are less clear, partly because of the noise introduced by the changes in test forms. If we focus on
years in which the same form was administered, 1996 and 1998, we see that the average social
studies achievement stayed relatively constant under high-stakes testing in comparison to scores
in the other subjects, which rose by about 0.30 GEs.
IGAP trends reveal a similar pattern. Figure 9.3 shows the social studies trends for
Chicago and other urban districts in Illinois. We see that Chicago was improving relative to
other urban districts in social studies from 1993 to 1996 in both the fourth and seventh grades. In
1997, this improvement stopped, and even reversed somewhat, leaving Chicago students further
behind their urban counterparts in social studies in 1997. On the other hand, Figure 9.4 suggests
that Chicago students continued to gain on their urban counterparts in science achievement
through 1997.
Table 9.2 presents OLS regression estimates that control for student and school
characteristics. We see that students in both grades had higher math and reading scores after the
introduction of high-stakes testing, but that science achievement stayed relatively flat and social
studies achievement declined slightly. Table 9.3 shows the OLS estimates of the relationship
between high-stakes testing and IGAP science and social studies scores. Conditional on prior
achievement trends, Chicago fourth graders in 1997 performed significantly worse than the
comparison group in science as well as social studies. The effect of 16.4 for social studies is
particularly large, nearly one-third of a standard deviation. The point estimate for seventh grade
social studies is negative, though not statistically significant, and the effect for seventh grade
science is zero.
Overall, it appears that social studies and science achievement have decreased relative to
reading and math achievement. The decrease appears larger for social studies than science and
somewhat larger for fourth grade in comparison to seventh or eighth grade. The decline in social
studies achievement is consistent with teachers substituting time away from this subject to focus
on reading. One reason that we might not have seen a similar drop in science scores is that
science is often taught by a separate teacher, whereas social studies is more often taught by the
regular teacher who is also responsible for reading and possibly mathematics instruction as well.
10. Conclusions
This analysis has generated a number of significant findings: (1) ITBS scores in
mathematics and reading increased substantially following the introduction of high-stakes testing
in Chicago; (2) the pattern of ITBS gains suggests that low-achieving schools made somewhat
larger gains under the policy, but that there was no consistent pattern regarding the differential
gains of low- versus high-achieving students; (3) in contrast to ITBS scores, IGAP math trends
in third and sixth grade showed little if any change following the introduction of high-stakes
testing; in eighth grade, IGAP math trends increased less sharply than comparable ITBS trends;
(4) a cross-district comparison of IGAP scores in Illinois during the 1990s suggests that high-stakes testing has not affected math or reading achievement in any grade in Chicago; (5) the
introduction of high-stakes testing for math and reading coincided with a relative decline in
science and social studies achievement, as measured by both the ITBS and IGAP; (6) while the
probability that a student will transfer to a private school, leave the district, or drop out of
elementary school did not increase following the introduction of the accountability policy,
retention rates in first and second grade have increased substantially under high-stakes testing,
despite the fact that these grades are not directly influenced by the social promotion policy.
This analysis highlights both the strengths and weaknesses of high-stakes accountability
policies in education. On one hand, the results presented here suggest that students and teachers
have responded to the high-stakes testing program in Chicago. The likelihood of missing the
ITBS declined sharply following the introduction of the accountability policy, particularly
among low-achieving students. ITBS scores increased dramatically beginning in 1997. Survey
and interview data further indicate that many elementary students and teachers are working
harder (Engel and Roderick 2001, Tepper et al. 2001). All of these changes are most apparent
in the lowest performing schools.
On the other hand, there is evidence that teachers and students have responded to the
policy by concentrating on the most easily observable, high-stakes topics and neglecting areas
with little consequences or topics that are less easily observed. The differential achievement
trends on ITBS and IGAP suggest that the large ITBS gains may have been driven by test-day
motivation, test preparation or cheating and therefore do not reflect greater student
understanding of mathematics or greater facility in reading comprehension. Interviews and
survey data confirm that the majority of teachers have responded to high-stakes testing with
low-effort strategies such as increasing test preparation, aligning curriculum more closely with the
ITBS and referring low-achieving students to after-school programs, but have not substantially
changed their classroom instruction (Tepper et al. 2001). It appears that teachers and students
have shifted resources away from non-tested subjects, particularly social studies, which may be a
source of concern if these areas are important for future student development and growth.
Finally, there is evidence that the accountability policy increased retention rates in first and second
grade, even though these grades are not directly subject to the promotional criteria.
These results suggest two different avenues for policy intervention. One alternative is to
expand the scope of the accountability measures – e.g., basing student promotion and school
probation decisions on IGAP as well as ITBS scores, including science and social studies in the
accountability system along with math and reading, etc. While this solution maintains the core
of the accountability concept, it may be expensive to develop and administer a variety of
additional accountability measures. Increasing the number of measures used for the
accountability policy may dilute the incentive associated with any one of the measures.
Another alternative is to mandate particular curriculum programs or instructional
practices. For example, in Fall 2001 the CPS will require 200 of the lowest performing
elementary schools to adopt one of several comprehensive reading programs offered by the CPS.
Unfortunately, there is little evidence that such mandated programmatic reforms have significant
effects on student learning (Jacob and Lefgren 2001b). More generally, this approach has many
of the same drawbacks that initially led to the adoption of a test-based accountability system,
including the difficulty of monitoring the activities of classroom teachers, the heterogeneity of
needs and problems and the inherent uncertainty of how to help students learn.
Regardless of the path chosen by Chicago, this analysis illustrates the difficulties
associated with accountability-oriented reform strategies in education, including the difficulty of
monitoring teacher and student behavior, the shortcomings of standardized achievement
measures and the importance of student and school capacity in responding to policy initiatives.
This study suggests that educators and policymakers must carefully examine the incentives
generated by high-stakes accountability policies in order to ensure that the policies foster long-term student learning and do not simply raise test scores in the short run.
References
Becker, W. E. and S. Rosen (1992). “The Learning Effect of Assessment and Evaluation in High
School.” Economics of Education Review 11(2): 107-118.
Bishop, J. (1998). Do Curriculum-Based External Exit Exam Systems Enhance Student
Achievement? Philadelphia, Consortium for Policy Research in Education, University of
Pennsylvania, Graduate School of Education: 1-32.
Bishop, J. H. (1990). “Incentives for Learning: Why American High School Students Compare
So Poorly to their Counterparts Overseas.” Research in Labor Economics 11: 17-51.
Bishop, J. H. (1996). Signaling, Incentives and School Organization in France, the Netherlands,
Britain and the United States. Improving America's Schools: The Role of Incentives. E.
A. Hanushek and D. W. Jorgenson. Washington, D.C., National Research Council: 111-145.
Bryk, A. S. and S. W. Raudenbush (1992). Hierarchical Linear Models. Newbury Park, Sage
Publications.
Bryk, A. S., P. B. Sebring, et al. (1998). Charting Chicago School Reform. Boulder, CO,
Westview Press.
Cannell, J. J. (1987). Nationally Normed Elementary Achievement Testing in America's Public
Schools: How All Fifty States Are Above the National Average. Daniels, W. V., Friends
for Education.
Carnoy, M., S. Loeb, et al. (2000). Do Higher State Test Scores in Texas Make for Better High
School Outcomes? American Educational Research Association, New Orleans, LA.
Catterall, J. (1987). Toward Researching the Connections between Tests Required for High
School Graduation and the Inclination to Drop Out. Los Angeles, University of
California, Center for Research on Evaluation, Standards and Student Testing: 1-26.
Catterall, J. (1989). “Standards and School Dropouts: A National Study of Tests Required for
High School Graduation.” American Journal of Education 98(November): 1-34.
Easton, J. Q., T. Rosenkranz, et al. (2001). Annual CPS Test Trend Review, 2000. Chicago, IL,
Consortium on Chicago School Research.
Easton, J. Q., T. Rosenkranz, et al. (2000). Annual CPS Test Trend Review, 1999. Chicago,
Consortium on Chicago School Research.
ECS (2000). ECS State Notes, Education Commission of the States (www.ecs.org).
Feldman, S. (1996). Do Legislators Maximize Votes? Working paper. Cambridge, MA.
Frederiksen, N. (1994). The Influence of Minimum Competency Tests on Teaching and
Learning. Princeton, Educational Testing Services, Policy Information Center.
Gardner, J. A. (1982). Influence of High School Curriculum on Determinants of Labor Market
Experience. Columbus, The National Center for Research in Vocational Education, Ohio
State University.
Goodlad, J. (1983). A Place Called School. New York, McGraw-Hill.
Griffin, B. W. and M. H. Heidorn (1996). “An Examination of the Relationship between
Minimum Competency Test Performance and Dropping Out of High School.”
Educational Evaluation and Policy Analysis 18(3): 243-252.
Grissmer, D. and A. Flanagan (1998). Exploring Rapid Achievement Gains in North Carolina
and Texas. Washington, D.C., National Education Goals Panel.
Haney, W. (2000). “The Myth of the Texas Miracle in Education.” Education Policy Analysis
Archives 8(41).
Heubert, J. P. and R. M. Hauser, Eds. (1999). High Stakes: Testing for Tracking, Promotion and
Graduation. Washington, D.C., National Academy Press.
Jacob, B. A. (2001). “Getting Tough? The Impact of Mandatory High School Graduation
Exams on Student Outcomes.” Educational Evaluation and Policy Analysis
Forthcoming.
Jacob, B. A. and L. Lefgren (2001a). “Making the Grade: The Impact of Summer School and
Grade Retention on Student Outcomes.” Working Paper.
Jacob, B. A. and L. Lefgren (2001b). “The Impact of a School Probation Policy on Student
Achievement in Chicago.” Working Paper.
Jacob, B. A. and S. D. Levitt (2001). “The Impact of High-Stakes Testing on Cheating: Evidence
from Chicago.” Working Paper.
Jacob, B. A., M. Roderick, et al. (2000). The Impact of Chicago's Policy to End Social
Promotion on Student Achievement. American Educational Research Association, New
Orleans.
Kang, S. and J. Bishop (1984). The Impact of Curriculum on the Non-College Bound Youth's
Labor Market Outcomes. High School Preparation for Employment. L. Hotchkiss, J.
Bishop and S. Kang. Columbus, The National Center for Research in Vocational
Education, Ohio State University: 95-135.
Klein, S. P., L. S. Hamilton, et al. (2000). What Do Test Scores in Texas Tell Us? Santa Monica,
CA, RAND.
Koretz, D. (1999). Foggy Lenses: Limitations in the Use of Achievement Tests as Measures of
Educators' Productivity. Devising Incentives to Promote Human Capital, Irvine, CA.
Koretz, D., R. L. Linn, et al. (1991). The Effects of High-Stakes Testing: Preliminary Evidence
About Generalization Across Tests. American Educational Research Association,
Chicago.
Koretz, D. M. and S. I. Barron (1998). The Validity of Gains in Scores on the Kentucky
Instructional Results Information System (KIRIS). Santa Monica, RAND.
Kreitzer, A. E., G. F. Madaus, et al. (1989). Competency Testing and Dropouts. Dropouts from
School: Issues, Dilemmas and Solutions. L. Weis, E. Farrar and H. G. Petrie. Albany,
State University of New York Press: 129-152.
Ladd, H. F. (1999). “The Dallas School Accountability and Incentive Program: An Evaluation of
its Impacts on Student Outcomes.” Economics of Education Review 18: 1-16.
Lillard, D. R. and P. P. DeCicca (Forthcoming). “Higher Standards, More Dropouts? Evidence
Within and Across Time.” Economics of Education Review.
Linn, R. L. (2000). “Assessments and Accountability.” Educational Researcher 29(2): 4-16.
Linn, R. L. and S. B. Dunbar (1990). “The Nation's Report Card Goes Home: Good News and
Bad News About Trends in Achievement.” Phi Delta Kappan 72(2): 127-133.
Linn, R. L., M. E. Graue, et al. (1990). “Comparing State and District Results to National
Norms: The Validity of the Claim that 'Everyone is Above Average'.” Educational
Measurement: Issues and Practice 9(3): 5-14.
Little, R. J. A. and D. B. Rubin (1987). Statistical Analysis with Missing Data. New York, John
Wiley & Sons.
MacIver, D. J., D. A. Reuman, et al. (1995). “Social Structuring of the School: Studying What Is,
Illuminating What Could Be.” Annual Review of Psychology 46: 375-400.
MacMillan, D. L., I. H. Balow, et al. (1990). A Study of Minimum Competency Tests and Their
Impact: Final Report. Washington, D.C., Office of Special Education and Rehabilitation
Services, U.S. Department of Education.
Madaus, G. and V. Greaney (1985). “The Irish Experience in Competency Testing: Implications
for American Education.” American Journal of Education 93: 268-294.
McNeil, L. and A. Valenzuela (1998). The Harmful Impact of the TAAS System of Testing in
Texas: Beneath the Accountability Rhetoric. High Stakes K-12 Testing Conference,
Teachers College, Columbia University.
Meyer, J. W. and B. Rowan (1978). The Structure of Educational Organizations. Environments
and Organizations. M. Meyer. San Francisco, CA, Jossey-Bass: 78-109.
Meyer, R. (1982). Job Training in the Schools. Job Training for Youth. R. Taylor, H. Rosen and
F. Pratzner. Columbus, National Center for Research in Vocational Education, Ohio State
University.
Neill, M. and K. Gayler (1998). Do High Stakes Graduation Tests Improve Learning Outcomes?
Using State-Level NAEP Data to Evaluate the Effects of Mandatory Graduation Tests.
High Stakes K-12 Testing Conference, Teachers College, Columbia University.
Pearson, D. P. and T. Shanahan (1998). “The Reading Crisis in Illinois: A Ten Year
Retrospective of IGAP.” Illinois Reading Council Journal 26(3): 60-67.
Powell, A. G. (1996). Motivating Students to Learn: An American Dilemma. Rewards and
Reform: Creating Educational Incentives that Work. S. H. Fuhrman and J. A. O'Day. San
Francisco, Jossey-Bass Inc.: 19-59.
Powell, A. G., E. Farrar, et al. (1985). The Shopping Mall High School: Winners and Losers in
the Educational Marketplace. Boston, Houghton Mifflin.
Reardon, S. (1996). Eighth Grade Minimum Competency Testing and Early High School
Dropout Patterns. American Educational Research Association Annual Meeting, New
York.
Roderick, M. and M. Engel (2000). The Grasshopper and the Ant: Motivational Responses of
Low-Achieving Students to High-Stakes Testing. American Educational Research
Association, New Orleans.
Roderick, M., J. Nagaoka, et al. (2000). Update: Ending Social Promotion. Chicago, IL,
Consortium on Chicago School Research.
Roderick, M., J. Nagaoka, et al. (2001). Helping Students Meet the Demands of High-Stakes
Testing: Is There a Role for Summer School? Annual Meeting of the American
Educational Research Association, Seattle, WA.
Sedlak, M. W., C. W. Wheeler, et al. (1986). Selling Students Short: Classroom Bargains and
Academic Reform in American High Schools. New York, Teachers College Press.
Shepard, L. A. (1988). Should Instruction be Measurement-Driven? American Educational
Research Association, New Orleans.
Shepard, L. A. (1990). “Inflated Test Score Gains: Is the Problem Old Norms or Teaching the
Test?” Educational Measurement: Issues and Practice 9(3): 15-22.
Sizer, T. R. (1984). Horace's Compromise: The Dilemma of the American High School. Boston,
Houghton Mifflin.
Smith, B. A. and S. Degener (2001). Staying After to Learn: A First Look at Lighthouse. Annual
Meeting of the American Educational Research Association, Seattle, WA.
Stecher, B. M. and S. I. Barron (1999). Quadrennial Milepost Accountability Testing in
Kentucky. Los Angeles, Center for the Study of Evaluation, University of California.
Stevenson, H. W. and J. W. Stigler (1992). The Learning Gap: Why Our Schools are Failing and
What We Can Learn from Japanese and Chinese Education. New York, Simon and
Schuster.
Tepper, R., M. Roderick, et al. (Forthcoming). The Impact of High-Stakes Testing on Classroom
Instructional Practices in Chicago. Chicago, IL, Consortium on Chicago School
Research.
Tepper, R. L. (2001). The Influence of High-Stakes Testing on Instructional Practice in Chicago.
American Educational Research Association, Seattle, WA.
Tomlinson, T. M. and C. T. Cross (1991). “Student Effort: The Key to Higher Standards.”
Educational Leadership(September): 69-73.
Tyack, D. B. and L. Cuban (1995). Tinkering Toward Utopia: A Century of Public School
Reform. Cambridge, MA, Harvard University Press.
Waller, W. (1932). Sociology of Teaching. New York, Wiley.
Weick, K. E. (1976). “Educational Organizations as Loosely Coupled Systems.” Administrative
Science Quarterly 21: 1-19.
Winfield, L. F. (1990). “School Competency Testing Reforms and Student Achievement:
Exploring a National Perspective.” Educational Evaluation and Policy Analysis 12(2):
157-173.
Wright, B. and M. H. Stone (1979). Best Test Design. Chicago, IL, MESA Press.
Table 2.1 Descriptive Statistics on the Social Promotion Policy in Chicago
                                                         3rd Grade   6th Grade   8th Grade (b)
1995-96
  Total Fall Enrollment                                      --          --        29,664
  Promotional Cutoffs for ITBS (Grade Equivalents)           --          --           6.8
  Percent Subject to Policy (% tested and included)          --          --           .82
  Percent Failed to Meet Promotional Criteria (a)            --          --           .33
    Percent Failed Reading                                   --          --           .07
    Percent Failed Math                                      --          --           .13
    Percent Failed Reading and Math                          --          --           .13
  Percent Retained or in Transition Center Next Year (a)     --          --           .07
1996-97
  Total Fall Enrollment                                   34,775      31,385        29,437
  Promotional Cutoffs for ITBS (Grade Equivalents)           2.8         5.3           7.0
  Percent Subject to Policy (% tested and included)          .71         .81           .80
  Percent Failed to Meet Promotional Criteria (a)            .49         .35           .30
    Percent Failed Reading                                   .06         .05           .06
    Percent Failed Math                                      .17         .17           .13
    Percent Failed Reading and Math                          .26         .13           .11
  Percent Retained or in Transition Center Next Year (a)     .21         .13           .13
1997-98
  Total Fall Enrollment                                   39,400      33,334        31,045
  Promotional Cutoffs for ITBS (Grade Equivalents)           2.8         5.3           7.2
  Percent Subject to Policy (% tested and included)          .71         .81           .80
  Percent Failed to Meet Promotional Criteria (a)            .43         .31           .34
    Percent Failed Reading                                   .02         .07           .08
    Percent Failed Math                                      .20         .13           .11
    Percent Failed Reading and Math                          .19         .11           .14
  Percent Retained or in Transition Center Next Year (a)     .23         .13           .14
1998-99
  Total Fall Enrollment                                   40,957      33,173        29,838
  Promotional Cutoffs for ITBS (Grade Equivalents)           2.8         5.3           7.4
  Percent Subject to Policy (% tested and included)          .69         .80           .79
  Percent Failed to Meet Promotional Criteria (a)            .40         .29           .31
    Percent Failed Reading                                   .05         .06           .07
    Percent Failed Math                                      .19         .14           .12
    Percent Failed Reading and Math                          .16         .09           .12
  Percent Retained or in Transition Center Next Year (a)     .19         .12           .10
1999-2000
  Total Fall Enrollment                                   40,691      31,149        31,387
  Promotional Cutoffs for ITBS (Grade Equivalents)           2.8         5.5           7.7
  Percent Subject to Policy (% tested and included)          .69         .79           .77
  Percent Failed to Meet Promotional Criteria (a)            .41         .31           .40
    Percent Failed Reading                                   .03         .04           .09
    Percent Failed Math                                      .22         .17           .14
    Percent Failed Reading and Math                          .16         .10           .17
  Percent Retained or in Transition Center Next Year (a)     .12         .09           .12
Notes: Those students who were not retained or placed in a transition center (1) were promoted, (2) were
placed in a non-graded special education classroom, or (3) left the CPS. a Conditional on being subject to
the accountability policy. b The eighth grade sample includes students in transition centers whose
transcripts indicated they were in either eighth or ninth grade.
Table 4.1: Summary Statistics
Variables
3rd Grade
6th Grade
8th Grade
0.178
0.076
0.087
0.122
3.471
(.962)
3.082
(1.122)
-1.247
(1.116)
-1.271
(1.091)
0.012
0.116
6.363
(1.277)
6.025
(1.484)
0.691
(.918)
0.087
(.970)
0.012
0.111
7.972
(1.287)
7.881
(1.758)
1.539
(.777)
1.123
(.909)
0.058
Moved out of CPS
0.053
0.049
0.069
Dropped Out
0.010
0.010
0.054
0.274
0.374
0.226
0.207
0.387
0.399
0.353
0.387
0.393
0.110
0.161
0.122
0.263
0.313
0.298
0.151
0.154
0.172
0.178
0.162
0.186
0.298
0.210
0.222
Student Demographics
Male
0.488
0.479
0.473
Black
0.654
0.552
0.554
Hispanic
0.205
0.304
0.299
Black Male
0.316
0.259
0.255
Hispanic Male
0.103
9.219
(.590)
0.048
0.150
12.210
(.483)
0.035
0.146
14.166
(.524)
0.028
Special Education
0.855
0.315
(.742)
-0.261
(.687)
0.036
0.876
0.236
(.236)
-0.275
(.689)
0.024
0.901
0.233
(.704)
-0.275
(.684)
0.018
Free lunch
0.661
0.642
0.586
Reduced price lunch
0.065
0.072
0.068
Student Outcomes
Missed ITBS
Excluded from ITBS
ITBS Math Score (GE)
ITBS Reading Score (GE)
ITBS Math Score (Rasch)
ITBS Reading Score (Rasch)
Transferred to private school
Prior Achievement Indicators
School < 15% at norms
School 15-25% at norms
School > 25% at norms
Student < 10%
Student 10-25%
Student 25-35%
Student 35-50%
Student > 50%
Age
Living in foster care
Living with at least one parent
Neighborhood Poverty
Neighborhood Social Status
Currently in bilingual program
0.048
0.095
0.076
Past bilingual participation
0.116
0.192
0.210
% Limited English Proficient
12.245
15.858
15.502
% Low Income
82.459
81.974
81.263
% Daily Attendance
92.563
92.733
92.555
Mobility Rate
29.865
28.418
28.522
School Enrollment
767.588
780.305
772.123
> 85% African-American
0.545
0.450
0.445
> 85% Hispanic
0.081
0.126
0.119
> 85% African-American + Hispanic
70-85% African-American +
Hispanic
< 70% African-American + Hispanic
0.103
0.124
0.134
0.082
0.098
0.094
0.189
0.203
0.208
4848.064
4924.203
4912.159
Crime Composite
0.176
0.086
0.084
% School Age Children
0.235
0.232
0.232
% Black
0.589
0.505
0.508
% Hispanic
0.169
0.223
0.223
22448.200
23093.400
23183.110
% Own Home
0.396
0.402
0.405
% Lived in Same House for 5 Years
0.577
0.567
0.569
Mean Education Level
12.099
12.084
12.083
% Managers/Professionals
0.173
0.170
0.170
Poverty Rate
0.282
0.262
0.261
Unemployment Rate
0.419
0.405
0.404
Female Headed HH
0.442
0.402
0.401
165,143
169,377
139,683
School Demographics
Census Tract Characteristics
Census Tract Population
Median HH Income
Number of observations
Notes: The sample includes all tested and included students in cohorts 1993 to 1999 (1993 to 1998 for
eighth grade) who were not missing demographic information. Sample sizes for the first two statistics—
missing ITBS and excluded from ITBS—are larger than shown in this table because they are taken from
the entire sample.
Table 5.1: OLS Estimates of the Probability of Not Being Tested or Being Excluded from
Reporting
Dependent Variable
Excluded
Not Tested
(Conditional on Tested)
Full
Black
Full
Black
-.016
(.006)
-.029
(.008)
-.144
-.008
(.005)
-.016
(.006)
-.008
(.007)
.058
(.006)
.064
(.007)
.148
(.010)
.003
(.002)
-.000
(.002)
-.001
(.003)
3rd Grade
1997 Cohort
1998 Cohort
1999 Cohort
(.014)
Baseline Rates
Number of Observations
.055
.054
228,734
122,251
190,762
117,238
.37
.09
.56
.68
1997 Cohort
-.011
(.003)
1998 Cohort
-.018
(.004)
1999 Cohort
-.019
(.005)
-.011
(.003)
-.009
(.004)
-.008
(.005)
.018
(.002)
.010
(.002)
.020
(.003)
.005
(.002)
.003
(.002)
-.001
(.003)
R-Squared
6th Grade
Baseline Rates
Number of Observations
R-Squared
.045
.086
207,578
110,350
193,671
106,947
.28
.08
.79
.84
-.019
(.004)
-.032
(.006)
-.040
(.008)
-.028
(.005)
-.043
(.008)
-.057
(.010)
.005
(.002)
.023
(.003)
.018
(.003)
.006
(.003)
.010
(.003)
.009
(.004)
8th Grade
1996 Cohort
1997 Cohort
1998 Cohort
Baseline Rates
Number of Observations
R-Squared
.057
.080
172,205
92,781
159,559
89,042
.28
.09
.84
.86
Notes: The sample includes all first-time students in these grades from 1993 to 1999 (1993-98
for eighth grade). Control variables include a year trend, race, gender, race*gender interactions,
age, household composition, and an indicator of previous special education placement along with
up to three years of prior reading and math achievement (linear, square and cubic terms).
Missing test scores are set to zero and a variable is included indicating the score is missing.
Robust standard errors that account for the correlation of errors within school are presented in
parentheses.
Table 5.2: OLS Estimates of the Probability of Not Being Tested or Being Excluded from
Reporting, by Prior Achievement Level (Black Students Only)
                                      Missed Test                     Excluded
                                 3rd      6th      8th          3rd      6th      8th

Average Effect                  -.011    -.011    -.022         .002     .005     .006
                               (.005)   (.003)   (.005)       (.001)   (.002)   (.002)

Average Effect by School Prior Achievement
  < 15% students scored         -.015    -.015    -.032         .004     .009     .006
  above the 50th percentile    (.005)   (.004)   (.006)       (.003)   (.002)   (.003)
  15-25% students scored        -.006    -.010    -.021         .001     .004     .006
  above the 50th percentile    (.005)   (.004)   (.005)       (.002)   (.002)   (.003)
  > 25% students scored         -.013    -.004    -.007         .001     .001     .004
  above the 50th percentile    (.007)   (.005)   (.005)       (.002)   (.003)   (.004)

Average Effect by Student Prior Achievement Level
  < 10th percentile             -.005    -.015    -.043         .016     .021     .025
                               (.007)   (.004)   (.007)       (.004)   (.004)   (.004)
  10-25th percentile            -.004    -.013    -.019         .003     .006     .010
                               (.004)   (.003)   (.005)       (.003)   (.002)   (.003)
  25-35th percentile            -.012    -.008    -.019         .002    -.004    -.002
                               (.005)   (.004)   (.005)       (.003)   (.002)   (.003)
  35-50th percentile            -.007    -.009    -.014        -.005    -.004    -.006
                               (.004)   (.004)   (.005)       (.002)   (.002)   (.003)
  > 50th percentile             -.004    -.005    -.013        -.004    -.007    -.006
                               (.004)   (.003)   (.005)       (.002)   (.002)   (.002)
Notes: Each cell contains an estimate from a separate regression. The estimate reflects the
coefficient on a variable indicating whether the student was part of a post-policy cohort. School
prior achievement is based on 1995 reading scores. Student prior achievement is based on the
average of all non-missing test scores in the years t-2 and t-3 for sixth and eighth graders and t-1
and t-2 for third graders. The sample includes all black first-time students in these grades from
1993 to 1999 (1993-98 for eighth grade). Control variables include a year trend, race, gender,
race*gender interactions, age, household composition, and an indicator of previous special
education placement along with up to three years of prior reading and math achievement (linear,
square and cubic terms). Missing test scores are set to zero and a variable is included indicating
the score is missing. Robust standard errors that account for the correlation of errors within
school are presented in parentheses.
Table 5.3: OLS Estimates of the Probability of Not Being Tested or Being Excluded from
Reporting (Black Students)
                                      Missed Test                     Excluded
                                 4th      5th      7th          4th      5th      7th

Average Effect                  -.007    -.010    -.013         .001     .008     .003
                               (.003)   (.003)   (.004)       (.002)   (.002)   (.002)

Average Effect by School Prior Achievement
  < 15% students scored         -.009    -.013    -.022         .001     .008     .002
  above the 50th percentile    (.005)   (.005)   (.005)       (.003)   (.003)   (.003)
  15-25% students scored        -.003    -.007    -.013        -.003     .009     .003
  above the 50th percentile    (.004)   (.004)   (.005)       (.003)   (.002)   (.003)
  > 25% students scored         -.009    -.009    -.003        -.004     .005     .004
  above the 50th percentile    (.005)   (.004)   (.006)       (.004)   (.003)   (.004)

Average Effect by Student Prior Achievement Level
  < 10th percentile             -.008    -.005    -.020         .012     .025     .022
                               (.006)   (.006)   (.007)       (.005)   (.004)   (.005)
  10-25th percentile            -.000    -.009    -.010         .002     .010     .005
                               (.005)   (.004)   (.005)       (.004)   (.003)   (.003)
  25-35th percentile            -.005    -.008    -.015        -.002     .003    -.002
                               (.005)   (.004)   (.005)       (.004)   (.003)   (.003)
  35-50th percentile            -.007    -.011    -.010        -.003    -.004    -.010
                               (.004)   (.004)   (.005)       (.003)   (.003)   (.003)
  > 50th percentile             -.005    -.003    -.013        -.007    -.004    -.009
                               (.004)   (.004)   (.004)       (.003)   (.002)   (.003)
Notes: Each cell contains an estimate from a separate regression. The estimate reflects the
coefficient on a variable indicating whether the student was part of a post-policy cohort. School
prior achievement is based on 1995 achievement scores. Student prior achievement is based on
the average of all non-missing test scores in the years t-2 and t-3 for sixth and eighth graders and
t-1 and t-2 for third graders. The sample includes all black first-time students in these grades
from 1993 to 1999 (1993-98 for eighth grade). Control variables include a year trend, race,
gender, race*gender interactions, age, household composition, and an indicator of previous
special education placement along with up to three years of prior reading and math achievement
(linear, square and cubic terms). Missing test scores are set to zero and a variable is included
indicating the score is missing. Robust standard errors that account for the correlation of errors
within school are presented in parentheses.
Table 6.1: OLS Estimates of ITBS Math and Reading Achievement
Dependent Variable
ITBS reading score controlling for
achievement in the following periods:
t-1,t-2,t-3
2
2.1
rd
3 Grade
1997 Cohort
1998 Cohort
1999 Cohort
t-2, t-3
ITBS math score controlling for
achievement in the following periods:
t-3
t-1,t-2,t-3
t-2, t-3
t-3
2.2
2.3
.013
(.011)
.102
(.012)
.025
(.014)
.055
(.012)
.213
(.013)
.062
(.015)
.077
(.011)
.242
(.014)
.204
(.014)
.115
(.013)
.142
(.014)
.136
(.015)
.176
(.014)
.250
(.016)
.186
(.016)
.194
(.014)
.291
(.016)
.338
(.016)
.045
(.007)
.099
(.008)
.055
(.007)
.061
(.007)
.142
(.009)
.093
(.008)
.053
(.008)
.165
(.009)
.105
(.009)
.069
(.007)
.126
(.008)
.067
(.008)
.093
(.008)
.164
(.009)
.123
(.009)
.097
(.009)
.196
(.010)
.167
(.011)
.132
(.008)
.158
(.009)
.165
(.009)
.133
(.008)
.182
(.009)
.231
(.009)
.114
(.008)
.189
(.010)
.264
(.010)
.095
(.006)
.115
(.007)
.170
(.008)
.078
(.007)
.132
(.008)
.233
(.009)
.069
(.007)
.139
(.009)
.272
(.010)
1 Year Gain
2 Year Gain
3 Year Gain
1 Year Gain
2 Year Gain
3 Year Gain
.599
(.824)
.503
(.647)
.484
(.593)
1.444
(1.011)
.945
(.674)
1.013
(.619)
.606
(.772)
.659
(.479)
.472
(.412)
1.354
(.913)
1.323
(.586)
.938
(.509)
6th Grade
1997 Cohort
1998 Cohort
1999 Cohort
8th Grade
1996 Cohort
1997 Cohort
1998 Cohort
Baseline Gains
3rd Grade
6th Grade
8th Grade
-1.432
(.773)
1.490
(.637)
-2.027
(.726)
1.594
(.578)
Notes: The sample includes all first-time students in these grades from 1993 to 1999 (1993-98 for eighth grade). Control variables include race, gender, race*gender interactions, age, household
composition, and an indicator of previous special education placement along with up to three years of prior reading and math achievement (linear, square and cubic terms). Missing test scores are set to
zero and a variable is included indicating the score is missing. Robust standard errors that account for the correlation of errors within school are presented in parentheses. Baseline gains are based on the
average gains for the 1993 to 1995 cohorts. The end point in the gain scores is always the grade in question. For example, one year gains for 6th graders refers to the 6th grade score – 5th grade score; two
year gains are calculated as 6th grade score – 4th grade score and three year gains are calculated as 6th grade score – 3rd grade score.
Table 6.2: OLS Estimates of ITBS Reading Achievement Using Various Specifications and Samples
Specification
Missing Missing Missing
Form L
School
With
Data
Data
Data
Black
w/
Included
Included Imputed Imputed Imputed Aggregate Specific FE Form L
Black
Aggregate
&
Baseline
Only
and
Year
at
at
at
&
Included
Year
Excluded
Trends
Trend
Excluded 0%ile in 25%ile in 50%ile in
Trend
school
school
school
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(9)
(10)
3rd Grade
1997 Cohort
1998 Cohort
1999 Cohort
2.4
2.5
.013
(.011)
.213
(.013)
.204
(.014)
.008
(.011)
.195
(.013)
.064
(.015)
.032
(.013)
.154
(.016)
.217
(.016)
.030
(.013)
.145
(.015)
.205
(.015)
.050
(.017)
.222
(.019)
.336
(.018)
.033
(.012)
.190
(.014)
.160
(.013)
.022
(.012)
.193
(.013)
.082
(.015)
.091
(.015)
.358
(.020)
.218
(.027)
.090
(.008)
.349
(.011)
.214
(.014)
.234
(.014)
.199
(.026)
.045
(.007)
.142
(.009)
.105
(.009)
.039
(.007)
.117
(.009)
.079
(.009)
.051
(.009)
.166
(.011)
.116
(.011)
.045
(.009)
.132
(.011)
.089
(.011)
.055
(.008)
.151
(.010)
.144
(.011)
.045
(.007)
.129
(.009)
.099
(.009)
.038
(.007)
.118
(.009)
.080
(.009)
.033
(.010)
.167
(.013)
.163
(.016)
.037
(.006)
.170
(.008)
.168
(.010)
.141
(.008)
.072
(.013)
.132
(.008)
.182
(.009)
.264
(.010)
.124
(.007)
.165
(.009)
.234
(.010)
.150
(.011)
.199
(.013)
.290
(.012)
.139
(.010)
.180
(.013)
.256
(.012)
.127
(.010)
.187
(.012)
.273
(.013)
.123
(.008)
.167
(.009)
.238
(.010)
.121
(.008)
.160
(.009)
.226
(.010)
.154
(.010)
.208
(.015)
.227
(.020)
.150
(.006)
.204
(.009)
.227
(.012)
.141
(.008)
--
.170
(.009)
--
6th Grade
1997 Cohort
1998 Cohort
1999 Cohort
8th Grade
1996 Cohort
1997 Cohort
1998 Cohort
Notes: The sample includes all first-time students in these grades from 1993 to 1999 (1993-98 for eighth grade). Control variables
include race, gender, race*gender interactions, age, household composition, and an indicator of previous special education placement
along with up to three years of prior reading and math achievement (linear, square and cubic terms). Missing test scores are set to
zero and a variable is included indicating the score is missing. Robust standard errors that account for the correlation of errors within
school are presented in parentheses.
Table 6.3: OLS Estimates of ITBS Math Achievement Using Various Specifications and Samples
Specification
Missing Missing Missing
Form L
School
With
Data
Data
Data
Black
w/
Included
Included Imputed Imputed Imputed Aggregate Specific FE Form L
Black
Aggregate
&
Baseline
Only
and
Year
at
at
at
&
Included
Year
Excluded
Trends
Trend
Excluded 0%ile in 25%ile in 50%ile in
Trend
school
school
school
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(9)
(10)
.115
(.013)
.250
(.016)
.338
(.016)
.122
(.014)
.256
(.015)
.256
(.016)
.127
(.016)
.238
(.019)
.347
(.019)
.126
(.016)
.233
(.018)
.335
(.019)
.162
(.017)
.303
(.020)
.526
(.020)
.152
(.014)
.270
(.016)
.367
(.015)
.137
(.014)
.255
(.015)
.280
(.015)
.218
(.017)
.428
(.023)
.346
(.029)
.215
(.007)
.418
(.011)
.341
(.010)
.290
(.016)
.175
(.028)
.069
(.007)
.164
(.009)
.167
(.011)
.062
(.007)
.144
(.009)
.135
(.010)
.067
(.009)
.170
(.012)
.178
(.013)
.057
(.009)
.140
(.011)
.139
(.012)
.080
(.009)
.180
(.011)
.201
(.013)
.069
(.008)
.158
(.009)
.158
(.011)
.065
(.008)
.147
(.009)
.141
(.011)
.096
(.010)
.257
(.015)
.280
(.019)
.098
(.005)
.255
(.006)
.281
(.009)
.172
(.009)
.152
(.017)
.095
(.006)
.132
(.008)
.272
(.010)
.087
(.006)
.114
(.008)
.246
(.010)
.099
(.008)
.148
(.011)
.284
(.012)
.086
(.008)
.122
(.011)
.248
(.012)
.093
(.010)
.144
(.011)
.290
(.013)
.089
(.006)
.116
(.008)
.253
(.010)
.086
(.006)
.113
(.009)
.243
(.010)
.201
(.008)
.274
(.013)
.363
(.018)
.198
(.004)
.268
(.007)
.361
(.009)
.057
(.007)
--
.128
(.009)
--
3rd Grade
1997 Cohort
1998 Cohort
1999 Cohort
6th Grade
1997 Cohort
1998 Cohort
1999 Cohort
8th Grade
1996 Cohort
1997 Cohort
1998 Cohort
Notes: The sample includes all first-time students in these grades from 1993 to 1999 (1993-98 for eighth grade). Control variables include race, gender, race*gender interactions, age, household
composition, and an indicator of previous special education placement along with up to three years of prior reading and math achievement (linear, square and cubic terms). Missing test scores are set to
zero and a variable is included indicating the score is missing. Robust standard errors that account for the correlation of errors within school are presented in parentheses.
Table 6.4: OLS Estimates of High-Stakes Testing Effects in Gate Grades
                                      Reading                          Math
                                 3rd      6th      8th          3rd      6th      8th
                                Grade    Grade    Grade        Grade    Grade    Grade

Aggregate Effects
  All Students                   .046     .066     .150         .132     .088     .125
                                (.009)   (.005)   (.006)       (.011)   (.006)   (.005)

Effects by School Prior Achievement
  < 15% students in the          .075     .108     .177         .180     .116     .134
  school scored above the       (.017)   (.012)   (.013)       (.026)   (.012)   (.009)
  50th percentile
  15-25% students in the         .059     .064     .147         .117     .072     .140
  school scored above the       (.013)   (.006)   (.011)       (.016)   (.010)   (.009)
  50th percentile
  > 25% students in the          .015     .044     .139         .113     .087     .106
  school scored above the       (.015)   (.009)   (.009)       (.017)   (.010)   (.008)
  50th percentile

Effects by Student Prior Achievement
  < 10th percentile              .041     .120     .103         .161     .055     .063
                                (.013)   (.008)   (.012)       (.016)   (.008)   (.008)
  10-25th percentile             .071     .060     .146         .153     .067     .108
                                (.011)   (.007)   (.007)       (.014)   (.007)   (.004)
  25-35th percentile             .081     .053     .166         .131     .095     .144
                                (.013)   (.009)   (.009)       (.015)   (.008)   (.007)
  35-50th percentile             .038     .044     .147         .111     .103     .146
                                (.013)   (.008)   (.009)       (.015)   (.009)   (.007)
  > 50th percentile             -.084     .030     .125         .052     .129     .136
                                (.015)   (.010)   (.009)       (.015)   (.008)   (.008)
Notes: Each cell contains an estimate from a separate regression. The estimate reflects the
coefficient on a variable indicating whether the student was part of a post-policy cohort. School
prior achievement is based on 1995 achievement scores. Student prior achievement is based on
the average of all non-missing test scores in the years t-2 and t-3 for sixth and eighth graders and
t-1 and t-2 for third graders. The sample includes all first-time students in these grades from
1993 to 1999 (1993-98 for eighth grade). Control variables include race, gender, race*gender
interactions, age, household composition, and an indicator of previous special education
placement along with up to three years of prior reading and math achievement (linear, square and
cubic terms). Missing test scores are set to zero and a variable is included indicating the score is
missing. Robust standard errors that account for the correlation of errors within school are
presented in parentheses.
Table 6.5: OLS Estimates of High-Stakes Testing Effects for Non-Gate Grades in 1997
Reading
Math
th
th
th
th
4
5
7
4
5th
7th
Grade
Grade
Grade
Grade
Grade
Grade
Aggregate Effects
1997 Effect
-.015
(.008)
.089
(.007)
.068
(.008)
.105
(.006)
.071
(.004)
.040
(.008)
.130
(.010)
.093
(.007)
.019
(.017)
.106
(.008)
.102
(.013)
.144
(.020)
.083
(.013)
.093
(.015)
-.032
(.012)
.072
(.007)
.098
(.010)
.083
(.014)
.059
(.010)
.112
(.009)
-.016
(.014)
.048
(.008)
.116
(.009)
.017
(.014)
.051
(.011)
.076
(.011)
1998 Effect
By School Prior
Achievement
< 15% students in the
school scored above
the 50th percentile
15-25% students in
the school scored
above the 50th
percentile
> 25% students in the
school scored above
the 50th percentile
By Student Prior
Achievement
-.054
.088
.092
.158
.072
.049
< 10th percentile
(.015)
(.009)
(.012)
(.015)
(.009)
(.010)
-.014
.109
.052
.135
.065
.065
10-25th percentile
(.012)
(.008)
(.008)
(.014)
(.008)
(.007)
-.001
.103
.076
.082
.069
.096
25-35th percentile
(.015)
(.010)
(.010)
(.013)
(.010)
(.009)
.003
.077
.107
.045
.058
.115
35-50th percentile
(.013)
(.010)
(.010)
(.013)
(.010)
(.011)
.012
-.010
.173
-.015
.036
.157
> 50th percentile
(.014)
(.010)
(.011)
(.013)
(.010)
(.011)
Notes: Each cell contains an estimate from a separate regression. The estimate reflects the
coefficient on a variable indicating whether the student was part of a post-policy cohort. School
prior achievement is based on 1995 achievement scores. Student prior achievement is based on
the average of all non-missing test scores in the years t-2 and t-3. The sample includes all first-time students in these grades from 1993 to 1997. Control variables include race, gender,
race*gender interactions, age, household composition, and an indicator of previous special
education placement along with up to three years of prior reading and math achievement (linear,
square and cubic terms). Missing test scores are set to zero and a variable is included indicating
the score is missing. Robust standard errors that account for the correlation of errors within
school are presented in parentheses.
Table 7.1: OLS Estimates of ITBS and IGAP Math Achievement
Iowa Test of Basic Skills
(High-Stakes Exam)
(1)
2.6
rd
3 Grade
(2)
Illinois Goals Assessment Program
(Low-Stakes Exam)
(3)
(5)
(6)
(7)
(8)
2.10
2.11
2.12
2.13
2.14
-.131
(.019)
-.140
(.028)
.226
(.015)
-.174
(.029)
2.7
2.8
.146
(.012)
.264
(.013)
.039
(.016)
.162
(.024)
.249
(.014)
.151
(.024)
.184
(.012)
.272
(.015)
.563
.564
.392
.392
.559
.569
.389
.401
.091
(.008)
.194
(.010)
.027
(.012)
.175
(.018)
.178
(.010)
.157
(.018)
.159
(.009)
.231
(.011)
-.037
(.014)
.009
(.022)
.217
(.012)
-.002
(.022)
.757
.757
.660
.660
.718
.722
.646
.649
.132
(.007)
.153
(.008)
.279
(.008)
.337
(.012)
.493
(.020)
.976
(.027)
.078
(.008)
--
--
--
.183
(.015)
.297
(.025)
.621
(.033)
.083
(.011)
.182
(.011)
.107
(.009)
.171
(.010)
.208
(.012)
.148
(.013)
--
R-Squared
.764
.766
.767
.743
.743
.742
Allow for pre-existing
trend
No
Yes
No
Yes
No
Yes
No
Yes
1994-98
1994-98
Form L
(94,96,98)
Form L
(94,96,98)
1994-98
1994-98
Form L
(94,96,98)
Form L
(94,96,98)
1997 Cohort
1998 Cohort
R-Squared
2.9
(4)
6th Grade
1997 Cohort
1998 Cohort
R-Squared
8th Grade
1996 Cohort
1997 Cohort
1998 Cohort
Cohorts
Notes: The sample includes all first-time students in these grades from 1994 to 1998. Control variables include race, gender, race*gender interactions, age, household composition, and an indicator of
previous special education placement along with up to three years of prior reading and math achievement (linear, square and cubic terms). Missing test scores are set to zero and a variable is included
indicating the score is missing. Robust standard errors that account for the correlation of errors within school are presented in parentheses.
Table 7.2 Summary Statistics for Illinois Public Schools in 1990
Non-Urban
Chicago
Districts
(1)
(2)
School Test Scores
4th Grade Science
7th Grade Science
4th Grade Social Studies
7th Grade Social Studies
3rd Grade Math
6th Grade Math
8th Grade Math
3rd Grade Reading
6th Grade Reading
8th Grade Reading
Average Test Score
Average % Tested
School Demographics
Average School Enrollment
% Black
% Hispanic
% Asian
% Native American
% LEP
% Free or Reduced Price
Lunch
% Avg. Daily Attendance
% Mobility
District Characteristics
Pupil-Teacher Ratio
Log(Avg. Teacher Salary)
Log(Per Pupil Expenditures)
Teacher Experience (years)
% Teachers with B.A.
% Teachers with M.A.+
Number of Schools
Number of Districts
Average District Enrollment
166
182
163
185
190
191
206
157
184
208
190
82
278
279
279
277
Other Urban
Districts
(3)
279
291
275
285
278
286
90
211
205
212
207
233
219
222
205
221
220
221
85
795
55.7
29.4
3.0
0.2
15.3
487
6.0
4.4
3.0
0.1
2.3
506
45.8
14.9
1.1
0.1
7.0
71.8
91.9
33.7
17.6
95.6
14.8
56.2
93.5
32.6
20.3
10.7
8.7
16.6
56.4
42.9
470
1
306,500
19.8
10.5
8.4
15.3
56.6
43.4
3,055
833
3,769
20.2
10.5
8.5
16.0
56.6
211
242
34
7,089
295
Notes: The figures presented above are averages for all schools weighted by the school
enrollment.
Table 7.3: OLS Estimates of IGAP Math Achievement in Illinois from 1993 to 1998
                            3rd Grade              6th Grade              8th Grade
                           (1)      (2)           (3)      (4)           (5)      (6)

Chicago*(1997-98)         13.4     -3.1           9.0     -2.8           3.6     -0.5
                          (4.7)    (3.5)         (2.6)    (2.4)         (2.5)    (3.3)
Urban*(1997-98)            4.1      0.6           3.3      8.3           4.9      6.5
                          (2.5)    (3.0)         (1.8)    (2.4)         (2.6)    (3.2)
1997-98                    3.8     -2.6           4.0     -6.8           7.1     -0.4
                          (1.4)    (1.4)         (1.1)    (1.0)         (1.0)    (1.0)
Chicago                  -19.4    -41.8         -12.4    -26.6          -2.2     -6.0
                          (5.7)    (7.8)         (4.0)    (5.6)         (3.4)    (5.5)
Urban                      7.0      2.3           5.2     13.1           9.8     12.3
                          (6.6)    (7.6)         (4.8)    (6.1)         (2.8)    (6.1)
Trend*Chicago                       6.0                    4.2                    1.5
                                   (1.1)                  (0.8)                  (1.2)
Trend*Urban                         1.3                   -1.6                   -0.5
                                   (0.8)                  (0.8)                  (1.1)
Trend                               2.6                    4.1                    2.8
                                   (0.6)                  (0.4)                  (0.4)

R-Squared                      .67                    .76                    .81
Number of Obs               13,751                 10,939                  8,199

Difference between
Chicago and Other
Urban Districts in        13.4     -3.1           9.0     -2.8           3.6     -0.5
Illinois

Difference between
Chicago and Non-
Urban Districts in        17.5     -2.5          12.3      5.5           8.5      6.0
Illinois
Notes: The following control variables are also included in the regressions shown above: percent black, percent
Hispanic, percent Asian, percent Native American, percent low-income, percent Limited English Proficient, average
daily attendance, mobility rate, school enrollment, pupil-teacher ratio, log(average teacher salary), log(per pupil
expenditures), percent of teachers with a BA degree, and the percent of teachers with a MA degree or higher.
Robust standard errors that account for correlation within schools across years are shown in parentheses. The
regressions are weighted by the inverse square of the number of students enrolled in the school.
Table 7.4: OLS Estimates of IGAP Reading Achievement in Illinois from 1993 to 1998
                            3rd Grade              6th Grade              8th Grade
                           (1)      (2)           (3)      (4)           (5)      (6)

Chicago*(1997-98)         14.8     -5.9           7.9     -9.6           7.9      2.1
                          (4.2)    (2.7)         (2.8)    (3.5)         (2.2)    (4.2)
Urban*(1997-98)            5.1     -5.3           4.3      8.7           5.3      4.2
                          (2.2)    (2.5)         (2.2)    (3.4)         (2.2)    (3.6)
1997-98                   -6.4     -0.2         -21.6    -15.3         -20.9     -2.0
                          (1.2)    (1.1)         (0.9)    (1.1)         (0.9)    (0.9)
Chicago                  -12.5    -43.0           2.2    -25.1          14.6      1.3
                          (3.6)    (7.4)         (3.3)    (6.1)         (4.7)    (6.9)
Urban                      6.8     -8.8           8.1     14.4          11.2      8.8
                          (4.9)    (6.6)         (4.9)    (7.5)         (3.5)    (7.7)
Trend*Chicago                       7.2                    5.9                    1.6
                                   (1.1)                  (1.1)                  (1.5)
Trend*Urban                         3.6                   -1.4                    0.3
                                   (0.8)                  (1.1)                  (1.4)
Trend                              -2.0                   -2.3                   -7.0
                                   (0.4)                  (0.5)                  (0.3)

R-Squared                      .76                    .78                    .77
Number of Obs               13,749                 10,942                  8,200

Difference between
Chicago and Other
Urban Districts in        14.8     -5.9           7.9     -9.6           7.9      2.1
Illinois

Difference between
Chicago and Non-
Urban Districts in        19.9    -11.2          12.2     -1.9          13.2      6.3
Illinois
Notes: The following control variables are also included in the regressions shown above: percent black, percent
Hispanic, percent Asian, percent Native American, percent low-income, percent Limited English Proficient, average
daily attendance, mobility rate, school enrollment, pupil-teacher ratio, log(average teacher salary), log(per pupil
expenditures), percent of teachers with a BA degree, and the percent of teachers with a MA degree or higher.
Robust standard errors that account for correlation within schools across years are shown in parentheses. The
regressions are weighted by the inverse square of the number of students enrolled in the school.
Table 8.1: Estimates of the achievement gain associated with greater student guessing for students
who scored in the bottom 10th percentile in prior grades

              Number of       Additional       Estimated gain    Achievement gain               % gain
              additional      correct          associated with   associated with                attributable to
              questions       responses due    additional        additional          Actual    increased
              answered in     to random        correct           guessing            gain      guessing in 1998
              1998            guessing         response
                  (1)         (2)=(1)*0.25          (3)          (4)=(1)*0.25*(3)      (5)      (6)=(4)/(5)
3rd Grade
  Reading         .34             0.09              .14               0.01             .06         0.20
  Math            .82             0.21              .07               0.01             .09         0.16
6th Grade
  Reading         .46             0.12              .24               0.03             .29         0.10
  Math            .99             0.25              .08               0.02             .19         0.10
8th Grade
  Reading         .96             0.24              .25               0.06             .36         0.17
  Math           1.63             0.41              .08               0.03             .27         0.12
Notes: The sample includes all first-time students in these grades in 1994 and 1998. Control variables include race, gender,
race*gender interactions, age, household composition, and an indicator of previous special education placement along with up to three
years of prior reading and math achievement (linear, square and cubic terms). Missing test scores are set to zero and a variable is
included indicating the score is missing. Robust standard errors that account for the correlation of errors within school are presented
in parentheses.
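To make the column arithmetic concrete, the short calculation below reproduces the third-grade reading row of Table 8.1. The values are taken directly from the table; 0.25 is the chance of answering a four-choice item correctly by guessing.

    # Worked example of Table 8.1's decomposition for 3rd grade reading.
    additional_questions = 0.34   # (1) additional questions answered in 1998
    gain_per_correct     = 0.14   # (3) estimated gain per additional correct response
    actual_gain          = 0.06   # (5) actual achievement gain

    extra_correct = additional_questions * 0.25            # (2): 0.085, shown as 0.09
    gain_from_guessing = extra_correct * gain_per_correct  # (4): 0.012, shown as 0.01
    share_of_gain = gain_from_guessing / actual_gain       # (6): 0.198, shown as 0.20
    print(extra_correct, gain_from_guessing, share_of_gain)

In other words, even for the lowest-achieving students, increased random guessing accounts for at most roughly 10-20 percent of the observed gains.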
Table 8.2: A Comparison of Eighth Grade ITBS and IGAP Exams

Math

Structure
    ITBS: 135 multiple-choice questions; 4 possible answers per question; no penalty
    for wrong answers; five sessions of 20-45 minutes each.
    IGAP: 70 multiple-choice questions; 5 possible answers per question; no penalty
    for wrong answers; two 40-minute sessions.

Content
    ITBS: Computation (43), Number Concepts (32), Estimation (24), Problem-Solving (20),
    Data Analysis (16).
    IGAP: Computation (10), Ratios & Percentages (10), Measurement (10), Algebra (10),
    Geometry (10), Data Analysis (10), Estimation (10).

Format
    ITBS: Computation problems do not have words. The Data Interpretation section
    consists of a graph or figure followed by several questions.
    IGAP: All questions are written as word problems, including the computation
    problems. One question per graph or figure.

Reading

    ITBS: 7 passages followed by 3-10 multiple-choice questions each; 49 total
    questions; 2 narrative passages, 4 expository passages, 1 poetry passage; one
    correct answer per question.
    IGAP: 2 passages followed by 18 multiple-choice questions each, plus 4 questions
    that ask the student to compare the two passages; 40 questions total; 1 narrative
    passage, 1 expository passage; multiple correct answers per question.
Notes: Information on the ITBS is taken from the Form L exam. Information on the IGAP is based on practice books.
Table 8.3: OLS Estimates of the Relationship between Item Type and Achievement Gain
on ITBS Math Exam from 1996 to 1998

                                            Dependent Variable = Proportion of Students
                                            Answering the Item Correctly on the ITBS Math Exam
                                                       (1)            (2)
1998 Cohort                                           .023           .015
                                                     (.011)         (.012)
Basic Skills * 1998                                   .019
                                                     (.006)
Number Concepts * 1998                                               .019
                                                                    (.010)
Estimation * 1998                                                    .006
                                                                    (.010)
Data Analysis * 1998                                                 .008
                                                                    (.011)
Math Computation * 1998                                              .029
                                                                    (.009)
25-35% answered item correctly                        .015           .013
  prior to high-stakes testing * 1998                (.012)         (.012)
35-45% answered item correctly                        .019           .019
  prior to high-stakes testing * 1998                (.012)         (.012)
45-55% answered item correctly                        .018           .016
  prior to high-stakes testing * 1998                (.012)         (.012)
55-65% answered item correctly                        .015           .012
  prior to high-stakes testing * 1998                (.013)         (.013)
65-75% answered item correctly                        .010           .009
  prior to high-stakes testing * 1998                (.014)         (.014)
75-85% answered item correctly                       -.002          -.005
  prior to high-stakes testing * 1998                (.014)         (.014)
85-100% answered item correctly                       .003          -.003
  prior to high-stakes testing * 1998                (.024)         (.024)
Number of Observations                                 692            692
R-Squared                                             .956           .957
Notes: The sample consists of all tested and included students in 1996 and 1998. The units of
observation are item*year proportions, reflecting the proportion of students answering the item
correctly in that year. The omitted item category in column 1 is critical thinking skills. The
omitted category in column 2 is problem-solving.
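The item-level analyses in Tables 8.3 and 8.4 treat each item-year proportion correct as one observation. A minimal sketch of a column 1-style specification follows; the data file and column names (an item-type flag and a baseline-difficulty band) are hypothetical, and the paper's exact conditioning set may differ.

    # Hedged sketch: item-by-year proportions regressed on a 1998 indicator and
    # its interactions with item type and baseline difficulty.
    import pandas as pd
    import statsmodels.formula.api as smf

    items = pd.read_csv("itbs_math_items.csv")        # one row per item per year
    items["y1998"] = (items["year"] == 1998).astype(int)

    # basic_skills = 1 for basic-skills items, 0 for critical-thinking items (the
    # omitted category in column 1); base_band bins items by the share answering
    # the item correctly prior to high-stakes testing.
    result = smf.ols("prop_correct ~ y1998 * (basic_skills + C(base_band))",
                     data=items).fit()
    print(result.params.filter(like="y1998"))         # the 1998 shift and interactions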
Table 8.4: OLS Estimates of the Relationship between Item Position and Achievement
Gain on the ITBS Reading Exam from 1994 to 1998

                                            Dependent Variable = Proportion of Students
                                            Answering the Item Correctly on the ITBS Reading Exam
                                                      Total
Intercept                                              .004
                                                      (.021)
2nd Quintile of the Exam                               .002
                                                      (.014)
3rd Quintile of the Exam                               .013
                                                      (.015)
4th Quintile of the Exam                               .017
                                                      (.015)
5th Quintile of the Exam                               .027
                                                      (.017)
25-35% answered item correctly                         .025
  prior to high-stakes testing                        (.020)
35-45% answered item correctly                         .036
  prior to high-stakes testing                        (.019)
45-55% answered item correctly                         .049
  prior to high-stakes testing                        (.019)
55-65% answered item correctly                         .046
  prior to high-stakes testing                        (.021)
65-75% answered item correctly                         .051
  prior to high-stakes testing                        (.025)
75-100% answered item correctly                        .043
  prior to high-stakes testing                        (.030)
Number of Observations                                  258
R-Squared                                               .95
Notes: The sample consists of all tested and included students in 1994 and 1998. The units of
observation are item*year proportions, reflecting the proportion of students answering the item
correctly in that year. The omitted category is the first quintile of the exam.
Table 9.1: OLS Estimates of the Probability of Not Being Enrolled, Being Retained, or Being Placed in
Special Education in the Following Year

                                      Dependent Variables
                              Not Enrolled              Retained
                             (1)        (2)          (3)        (4)
Kindergarten
  1997 Cohort              -0.003      0.010       -0.001      0.001
                           (.002)     (.003)       (.001)     (.002)
  1998 Cohort               0.000      0.019        0.002      0.005
                           (.002)     (.003)       (.002)     (.002)
  1999 Cohort               0.007      0.031        0.008      0.011
                           (.002)     (.004)       (.003)     (.003)
1st Grade
  1997 Cohort              -0.006      0.005        0.013      0.013
                           (.002)     (.003)       (.003)     (.003)
  1998 Cohort              -0.004      0.009        0.020      0.021
                           (.002)     (.003)       (.003)     (.004)
  1999 Cohort               0.003      0.022        0.024      0.024
                           (.002)     (.004)       (.004)     (.005)
2nd Grade
  1997 Cohort              -0.002      0.009        0.023      0.018
                           (.002)     (.002)       (.003)     (.003)
  1998 Cohort               0.001      0.017        0.020      0.014
                           (.002)     (.003)       (.002)     (.003)
  1999 Cohort               0.010      0.030        0.017      0.008
                           (.002)     (.004)       (.002)     (.003)
Trend                        No         Yes          No         Yes
Notes: The sample includes all first-time students in these grades from 1993 to 1999 (1993-98 for eighth grade). Control variables
include race, gender, race*gender interactions, age, household composition, and an indicator of previous special education placement
along with up to three years of prior reading and math achievement (linear, square and cubic terms). Missing test scores are set to
zero and a variable is included indicating the score is missing. Robust standard errors that account for the correlation of errors within
school are presented in parentheses.
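A linear probability model of this form can be sketched in a few lines. This is a stylized rendering of the Table 9.1 specification, with a hypothetical student file and placeholder column names rather than the author's actual variables.

    # Hedged sketch: retention in the following year regressed on cohort dummies
    # and the controls listed in the notes, for first-time students in one grade.
    import pandas as pd
    import statsmodels.formula.api as smf

    kids = pd.read_csv("first_time_students.csv")     # hypothetical student file

    # Up to three years of prior reading and math scores enter as cubics; missing
    # scores are set to zero and flagged, as described in the notes.
    priors = " + ".join(
        f"{s}_lag{k} + I({s}_lag{k} ** 2) + I({s}_lag{k} ** 3) + {s}_lag{k}_miss"
        for s in ("read", "math") for k in (1, 2, 3))

    formula = ("retained ~ C(cohort, Treatment(reference=1996)) + race * gender + "
               "age + hh_comp + prior_sped + " + priors)

    # OLS on a binary outcome with standard errors clustered by school.
    result = smf.ols(formula, data=kids).fit(
        cov_type="cluster", cov_kwds={"groups": kids["school_id"]})
    print(result.params.filter(like="cohort"))        # the 1997-99 cohort effects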
Table 9.2: OLS Estimates of the Relationship Between High-Stakes Testing and ITBS Scores

                                       Dependent Variable = ITBS Score in
                                 4th Grade                               8th Grade
                                                Social                                  Social
                      Math    Reading  Science  Studies       Math    Reading  Science  Studies
High-Stakes Regime    .165     .156    -.023    -.061         .159     .128     .043    -.036
                     (.010)   (.010)   (.017)   (.016)       (.013)   (.017)   (.023)   (.023)
Notes: Each cell contains an estimate from a separate regression. The estimate reflects the coefficient on a variable indicating whether
the student was part of a post-policy cohort. The fourth grade sample includes the 1996 and 1997 cohorts. The eighth grade sample
includes the 1996 and 1998 cohorts. Both samples are limited to students who were in the grade for the first time and were tested and
included. Control variables include race, gender, race*gender interactions, age, household composition, and an indicator of previous
special education placement along with up to three years of prior reading and math achievement (linear, square and cubic terms).
Missing test scores are set to zero and a variable is included indicating the score is missing. Robust standard errors that account for
the correlation of errors within school are presented in parentheses.
Table 9.3: OLS Estimates of the Relationship Between High-Stakes Testing and IGAP Science and Social Studies Scores

                                      Science                           Social Studies
                             4th Grade        7th Grade          4th Grade        7th Grade
                            (1)     (2)      (3)     (4)        (5)     (6)      (7)     (8)
Chicago*(1997)              6.2    -7.3     10.1     0.6       -1.3   -16.4      7.7    -5.0
                           (3.6)   (2.9)    (2.2)   (3.2)      (4.1)   (3.6)    (2.4)   (4.2)
Trend                       No      Yes      No      Yes        No      Yes      No      Yes
R-Squared                   .81     .81      .83     .84        .79     .79      .79     .79
Number of Observations    11,105  11,105    6,832   6,832     11,103  11,103    6,834   6,834
Notes: The following control variables are also included in the regressions shown above: percent black, percent Hispanic, percent Asian, percent Native
American, percent low-income, percent Limited English Proficient, average daily attendance, mobility rate, school enrollment, pupil-teacher ratio, log(average
teacher salary), log(per pupil expenditures), percent of teachers with a BA degree, and the percent of teachers with a MA degree or higher. Robust standard
errors that account for correlation within schools across years are shown in parentheses. The regressions are weighted by the inverse square of the number of
students enrolled in the school.
Figure 6.1: Trends in ITBS Math Scores in Gate Grades in Chicago
[Line graph. Y-axis: deviation from 1990 mean (adjusted Rasch metric). X-axis: years before/after high-stakes testing (-6 to 5). Series: 3rd grade, 6th grade, 8th grade.]
Figure 6.2: Trends in ITBS Reading Scores in Gate Grades in Chicago
[Line graph. Y-axis: deviation from 1990 mean (adjusted Rasch metric). X-axis: years before/after high-stakes testing (-6 to 5). Series: 3rd grade, 6th grade, 8th grade.]
Figure 6.3: Trends in ITBS Math Scores at Different Points on the Ability Distribution
[Line graph. Y-axis: ITBS math score (difference from 1994 mean). X-axis: year (1994, 1996, 1998). Series: 10th, 25th, 75th, and 90th percentiles.]
Figure 6.4: Trends in ITBS Reading Scores at Different Points on the Ability Distribution
[Line graph. Y-axis: ITBS reading score (difference from 1994 mean). X-axis: year (1994, 1996, 1998). Series: 10th, 25th, 75th, and 90th percentiles.]
Figure 6.5: Trends in ITBS Math Achievement in Non-Gate Grades in Chicago
[Line graph. Y-axis: deviation from 1990 mean (Rasch metric). X-axis: years before/after high-stakes testing (-6 to 2). Series: 4th grade, 5th grade, 7th grade.]
Figure 6.6: Trends in ITBS Reading Scores in Non-Gate Grades in Chicago
[Line graph. Y-axis: deviation from 1990 mean (Rasch metric). X-axis: years before/after high-stakes testing (-6 to 2). Series: 4th grade, 5th grade, 7th grade.]
Figure 7.1: Trends in IGAP Math Scores in Gate Grades in Chicago
[Line graph. Y-axis: IGAP score. X-axis: years before/after high-stakes testing (-6 to 3). Series: 3rd grade, 6th grade, 8th grade.]
Figure 7.2: Trends in Third Grade ITBS and IGAP Math Scores in Chicago
[Line graph. Y-axis: average school mean (1993 standard deviation units). X-axis: years before/after high-stakes testing (-3 to 2). Series: IGAP, ITBS.]
Figure 7.3: Trends in Sixth Grade ITBS and IGAP Math Scores in Chicago
[Line graph. Y-axis: average school mean (1993 standard deviation units). X-axis: years before/after high-stakes testing (-3 to 2). Series: IGAP, ITBS.]
Figure 7.4: Trends in Eighth Grade ITBS and IGAP Math Scores in Chicago
[Line graph. Y-axis: average school mean (1993 standard deviation units). X-axis: years before/after high-stakes testing (-2 to 3). Series: IGAP, ITBS.]
Figure 7.5: Trends in the Difference Between Chicago and Other Urban Districts on the IGAP Math Exam
[Line graph. Y-axis: Chicago mean IGAP minus urban comparison district mean IGAP. X-axis: years before/after high-stakes testing (-3 to 3). Series: 3rd grade, 6th grade, 8th grade.]
Figure 7.6: Trends in the Difference Between Chicago and Other Urban Districts on the IGAP Reading Exam
[Line graph. Y-axis: Chicago mean IGAP minus urban comparison district mean IGAP. X-axis: years before/after high-stakes testing (-3 to 3). Series: 3rd grade, 6th grade, 8th grade.]
Figure 9.1: Trends in Fourth Grade ITBS Achievement in Chicago
[Line graph. Y-axis: grade equivalents. X-axis: year (1995-1997). Series: math, reading, science, social studies.]
Figure 9.2: Trends in Eighth Grade ITBS Achievement in Chicago
[Line graph. Y-axis: grade equivalents. X-axis: year (1995-1998). Series: math, reading, science, social studies.]
Figure 9.3: Trends in IGAP Social Studies Scores in Illinois (Difference between Chicago and Other Urban Districts)
[Line graph. Y-axis: difference (Chicago minus other urban districts). X-axis: year (1993-1997). Series: 4th grade, 7th grade.]
Figure 9.4: Trends in IGAP Science Scores in Illinois (Difference between Chicago and Other Urban Districts)
[Line graph. Y-axis: difference (Chicago minus other urban districts). X-axis: year (1993-1997). Series: 4th grade, 7th grade.]