Preliminary and Incomplete: Comments Welcome

THE IMPACT OF HIGH-STAKES TESTING ON STUDENT ACHIEVEMENT: EVIDENCE FROM CHICAGO∗

Brian A. Jacob
John F. Kennedy School of Government
Harvard University

June 2001

∗ I would like to thank the Chicago Public Schools, the Illinois State Board of Education and the Consortium on Chicago School Research for providing the data used in this study. I am grateful to Anthony Bryk, Carolyn Hill, Robert LaLonde, Lars Lefgren, Steven Levitt, Helen Levy, Susan Mayer, Melissa Roderick, Robin Tepper and seminar participants at the University of Chicago for helpful comments and suggestions. Funding for this research was provided by the Spencer Foundation. All remaining errors are my own.

THE IMPACT OF HIGH-STAKES TESTING ON STUDENT ACHIEVEMENT: EVIDENCE FROM CHICAGO

Abstract

School reforms designed to hold students and teachers accountable for student achievement have become increasingly popular in recent years. Yet there is little empirical evidence on how such policies impact student or teacher behavior, or how they ultimately affect student achievement. This study utilizes detailed administrative data on student achievement in Chicago to examine the impact of a high-stakes testing policy introduced in 1996. I find that math and reading scores on the high-stakes exam increased sharply following the introduction of the accountability policy, but that there was little if any effect on a state-administered, low-stakes exam. At the same time, science and social studies scores on both exams leveled off or declined under the new regime. There is also evidence that the policy increased retention rates in primary grades that were not subject to the promotional criteria.

1. Introduction

School reforms designed to hold students and teachers accountable for student achievement have become increasingly popular in recent years. Statutes in 19 states explicitly link student promotion to performance on a state or district assessment (ECS 2000). The largest school districts in the country, including New York City, Los Angeles, Chicago and Washington, D.C., have recently implemented policies requiring students to attend summer school and/or repeat a grade if they do not demonstrate sufficient mastery of basic skills. At the same time, 20 states reward teachers and administrators on the basis of exemplary student performance and 32 states sanction school staff on the basis of poor student performance. Many states and districts have passed legislation allowing the takeover or closure of schools that do not show improvement (ECS 2000).

Despite the increasing popularity of high-stakes testing, there is little evidence on how such policies influence student or teacher behavior, or how they ultimately affect student achievement. Standard incentive theory suggests that high-stakes testing will increase student achievement in the measured subjects and exams, but may also lead teachers to shift resources away from low-stakes subjects and to neglect infra-marginal students. Such high-powered incentives may also cause students and teachers to ignore less easily observable dimensions of student learning such as critical thinking in order to teach the narrowly defined basic skills that are tested on the high-stakes exam (Holmstrom 1991).

Chicago was one of the first large, urban school districts to implement a comprehensive high-stakes accountability policy.
Beginning in 1996, Chicago Public Schools in which fewer than 15 percent of students met national norms in reading were placed on probation, which entailed additional resources as well as oversight. If student performance did not improve in these schools, teachers and administrators were subject to reassignment or dismissal. At the same time, the CPS took steps to end “social promotion,” the practice of passing students to the next grade regardless of their academic ability. Students in third, sixth and eighth grades (the “gate” grades) were required to meet minimum standards in reading and mathematics on the Iowa Test of Basic Skills (ITBS) in order to advance to the next grade.

This paper utilizes detailed administrative data on student achievement in Chicago to examine the impact of the accountability policy on student achievement.1 I first compare achievement trends before and after the introduction of high-stakes testing, conditioning on changes in observable student characteristics such as demographics and prior achievement. I then compare the achievement trends for students and schools that were likely to be strongly influenced by high-stakes testing (e.g., low-achieving students and probation schools) with trends among students and schools less affected by the policy. To the extent that high-achieving students and schools were not influenced by the policy, this difference-in-difference strategy will eliminate any common, unobserved time-varying factors that may have influenced overall achievement levels in Chicago. In order to assess the generalizability of achievement changes in Chicago, I compare achievement trends on the Iowa Test of Basic Skills (ITBS), the exam used for promotion and probation decisions, with comparable trends on a state-mandated, low-stakes exam, the Illinois Goals Assessment Program (IGAP). Using a panel of Illinois school data during the 1990s, I am able to compare IGAP achievement trends in Chicago with other urban districts in Illinois, thereby controlling for unobserved statewide factors, such as improvements in the economy or changes in state or federal educational policies. Finally, I examine several unintended consequences of the policy, such as changes in retention rates and achievement levels in grades and subjects not directly influenced by the policy.

1 In this analysis, I do not focus on the effect of the specific treatments that accompanied the accountability policy such as the resources provided to probation schools, summer school and grade retention. For an evaluation of the probation, summer school and grade retention programs, see Jacob and Lefgren (2001).

I find that ITBS math and reading scores increased sharply following the introduction of high-stakes testing, but that the pre-existing trend of IGAP scores did not change. There is some evidence that test preparation and short-term effort may account for a portion of the differential ITBS-IGAP gains. While high-stakes testing did not have any impact on the likelihood that students would transfer to a private school, move out of the CPS to a different public school district (in the suburbs or out of the state) or drop out of school, it appears to have increased retention rates in the primary grades and decreased relative achievement in science and social studies.

The remainder of this paper is organized as follows.
Section 2 lays out a conceptual framework for analyzing high-stakes testing, providing background on the Chicago reforms and reviewing the previous literature on accountability and achievement. Section 3 discusses the empirical strategy and Section 4 describes the data. Sections 5 to 9 present the main findings. Section 10 summarizes the major findings and discusses several policy implications.

2. A Conceptual Framework for Analyzing High-Stakes Testing

2.1 Background on High-Stakes Testing in Chicago

In 1996 the CPS introduced a comprehensive accountability policy in an effort to raise academic achievement. The first component of the policy focused on holding students accountable for learning, ending a common practice known as “social promotion” whereby students are advanced to the next grade regardless of their ability or achievement. Under this policy, students in third, sixth and eighth grades are required to meet minimum standards in reading and mathematics on the Iowa Test of Basic Skills (ITBS) in order to advance to the next grade. Students who do not meet the standard are required to attend a six-week summer school program, after which they retake the exams. Those who pass move on to the next grade. Students who again fail to meet the standard are required to repeat the grade, with the exception of 15-year-olds, who attend newly created “transition” centers.

One of the most striking features of Chicago’s social promotion policy was its scope. Table 2.1 illustrates the number of students affected by the policy each year.2 The ITBS scores are measured in terms of “grade equivalents” (GEs) that reflect the years and months of learning a student has mastered. The exam is nationally normed so that a student at the 50th percentile in the nation scores at the eighth month of her current grade – i.e., an average third grader will score a 3.8. In the first full year of the policy, the promotion standards for third, sixth and eighth grade were 2.8, 5.3, and 7.0 respectively, which roughly corresponded to the 20th percentile in the national achievement distribution. The promotional criteria were raised for eighth graders in 1997-98 and 1998-99, and for all three grades in 1999-2000.

Because many Chicago students are in special education and bilingual programs, and are thereby exempt from standardized testing, only 70 to 80 percent of the students in the system were directly affected by the accountability policies.3 Of those who were subject to the policy, nearly 50 percent of third graders and roughly one-third of sixth and eighth graders failed to meet the promotional criteria and were required to attend summer school in 1996-97. Of those who failed to meet the promotional criteria in June, approximately two-thirds passed in August. As a result, roughly 20 percent of third grade students and 10 to 15 percent of sixth and eighth grade students were eventually held back in the Fall.

In conjunction with the social promotion policy, the CPS instituted a policy designed to hold teachers and schools accountable for student achievement. Under this policy, schools in which fewer than 15 percent of students scored at or above national norms on the ITBS reading exam are placed on probation. If they do not exhibit sufficient improvement, these schools may be reconstituted, which involves the dismissal or reassignment of teachers and school administrators. In 1996-97, 71 elementary schools serving over 45,000 students were placed on academic probation.4

2 Note that Fall enrollment figures increase beginning in 1997-98 because students who were retained in previous years are counted in the next Fall enrollment for the same grade.

3 Section 5 examines whether the number of students in bilingual and special education programs, or otherwise excluded from testing, has changed since the introduction of high-stakes testing.

4 Probation schools received some additional resources and were more closely monitored by CPS staff. Jacob and Lefgren (2001) examined the impact of monitoring and resources provided to schools on probation, using a regression discontinuity design that compared the performance of students in schools that just made the probation cutoff with those that just missed the cutoff. They found that the additional resources and monitoring provided by probation had no impact on math or reading achievement.
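To make the promotion mechanics described above concrete, the following is a minimal sketch of the gate-grade decision rule, using the 1996-97 cutoffs from the text; the function and its arguments are illustrative rather than an official CPS algorithm.

```python
# 1996-97 promotional cutoffs in ITBS grade equivalents (roughly the 20th
# percentile nationally), as reported in the text.
CUTOFFS = {3: 2.8, 6: 5.3, 8: 7.0}

def promotion_status(grade: int, reading_ge: float, math_ge: float,
                     passed_august_retest: bool = False) -> str:
    """Apply the gate-grade rule: meet both cutoffs in June, or attend the
    six-week summer program and pass the August retest; otherwise repeat
    the grade (15-year-olds instead attend transition centers)."""
    cutoff = CUTOFFS[grade]
    if reading_ge >= cutoff and math_ge >= cutoff:
        return "promoted in June"
    if passed_august_retest:
        return "promoted after summer school"
    return "retained"

print(promotion_status(3, 2.5, 3.1))  # below the 2.8 reading cutoff -> "retained"
```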
2.2 Prior Research on High-Stakes Testing

Studies of high school graduation tests provide mixed evidence on the effect of high-stakes testing. While several studies found a positive association between student achievement and minimum competency testing (Bishop 1998, Frederiksen 1994, Neill 1998, Winfield 1990), a recent study with better controls for prior student achievement finds no effect (Jacob 2001). There is also evidence that mandatory high school graduation exams increase the probability of dropping out of school, particularly among low-achieving students (Catterall 1987, Kreitzer 1989, Catterall 1989, MacMillan 1990, Griffin 1996, Reardon 1996, Jacob 2001). Craig and Sheu (1992) found modest improvements in student achievement after the implementation of a school-based accountability policy in South Carolina in 1984, but Ladd (1999) found few achievement effects for a school-based accountability and incentive program in Dallas during the early 1990s.

The recent experience in Texas provides additional evidence regarding the impact of high-stakes accountability systems. In the early 1990s, Texas implemented a two-tiered accountability system, including a graduation requirement for high school as well as a system of rewards and sanctions for schools based on student performance. Texas students made dramatic gains on the Texas Assessment of Academic Skills (TAAS) from 1994 to 1998 (Haney 2000, Klein 2000). In their 1998 study of state-level NAEP trends, Grissmer and Flanagan (1998) found that Texas was one of the fastest improving states during the 1990s and that conditional on family characteristics, Texas made the largest gains of all states. However, in an examination of student achievement in Texas, Klein et al. (2000) found that TAAS gains were several times larger than NAEP gains during the 1990s, and that over a four-year period the average NAEP gains in Texas exceeded those of the nation in only one of the three comparisons—fourth grade math, but not fourth grade reading or eighth grade math.5

Finally, there is some evidence that high-stakes testing leads teachers to neglect non-tested subjects and less easily observable dimensions of student learning, leading to test-specific achievement gains that are not broadly generalizable. Linn and Dunbar (1990) found that many states made smaller gains on the National Assessment of Educational Progress (NAEP) than on their own achievement exams, presumably due at least in part to the larger incentives placed on the state exams.6

5 It is difficult to compare TAAS and NAEP scores because the exams were given in different years. Fourth grade reading is the only grade-subject for which a direct comparison is possible since Texas students took both the TAAS and NAEP in 1994 and 1998. Fourth and eighth grade students took the NAEP math exams in 1992 and 1996, but did not start taking the TAAS until 1994.

6 Cannell (1987) noted that a disproportionate number of states and districts report being above the national norm on their own tests, dubbing this phenomenon the “Lake Wobegon” effect (see also Linn 1990, Shepard 1990, Shepard 1988).
Koretz et al. (1991) specifically examined the role of high-stakes incentives on student achievement in two state testing programs in the 1980s. They found that scores dropped sharply when a new form of test was introduced and then rose steadily over the next several years as teachers and students became more familiar with the exam, suggesting considerable test-specific preparation. Koretz and Barron (1998) found that gains on the state-administered exam in Kentucky during the early 1990s were substantially larger than gains on the nationally-administered NAEP exam and that the NAEP gains were roughly comparable to the national average. Stecher and Barron (1999) documented that teachers in Kentucky allocated instructional time differently across grades to match the emphasis of the testing program.

3. Empirical Strategy

Because the CPS instituted the accountability policy district-wide in 1996, it is difficult to disentangle the effect of high-stakes testing from shifts in the composition of Chicago public school students (e.g., an increase in middle-class students attending public schools in Chicago), changes in state or federal education policy (e.g., the federal initiative to reduce class sizes) or changes in the economy (e.g., reduced unemployment rates in the late 1990s). This section describes several different sources of variation I exploit to identify the effect of the policy.

To determine the effect of high-stakes testing on student achievement, I estimate the following education production function:

(3.1)   $y_{isdt} = (\mathit{HighStakes})_{dt}\,\delta + X_{isdt}\beta_1 + Z_{sdt}\beta_2 + \gamma_t + \eta_d + \phi_{dt} + \varepsilon_{isdt}$

where $y$ is an achievement score for individual $i$ in school $s$ in district $d$ at time $t$, $X$ is a vector of student characteristics, $Z$ is a vector of school and district characteristics and $\varepsilon$ is a stochastic error term. Unobservable factors at the state and national level are captured by time ($\gamma_t$), district ($\eta_d$) and time*district ($\phi_{dt}$) effects. Using detailed administrative data for each student, I am able to control for observable changes in student composition, including race, socio-economic status and prior achievement. Because achievement data is available back to 1990, six years prior to the introduction of the policy, I am also able to account for pre-existing achievement trends within the CPS. This short, interrupted time-series design (Ashenfelter 1978) accounts for continuing improvement stemming from earlier reform efforts.7 In the absence of any unobserved time or district*time effects, this simple pre-post design will provide an unbiased estimate of the policy.

7 The inclusion of a linear trend implicitly assumes that any previous reforms that raised student performance would have continued with the same marginal effectiveness in the future. If this assumption is not true, the estimates may be biased downwards. In addition, this aggregate trend assumes that there are no school-level composition changes in Chicago. I test this assumption by including school-specific fixed effects and school-specific trends in certain specifications.
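As a concrete illustration, a specification in the spirit of equation (3.1) might be estimated as below. This is a sketch only: the DataFrame and column names are hypothetical, and because the Chicago analysis uses a single district, the district and district*time effects are absorbed and a linear trend stands in for the year effects, as the text describes.

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per student-year; all column names here are hypothetical.
df = pd.read_csv("cps_student_panel.csv")

# Pre-post version of equation (3.1): a post-1996 policy indicator, student
# covariates X (demographics and prior scores), school covariates Z, and a
# linear time trend capturing pre-existing improvement.
model = smf.ols(
    "score ~ high_stakes + black + hispanic + male + free_lunch"
    " + lag_score1 + lag_score2 + lag_score3 + school_poverty + trend",
    data=df,
)
# Cluster standard errors by school to allow within-school error correlation.
result = model.fit(cov_type="cluster", cov_kwds={"groups": df["school_id"]})
print(result.params["high_stakes"])  # the policy effect, delta
```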
In order to control for unobservable changes over time that affect student achievement, one might utilize a difference-in-difference strategy which compares the achievement trends of students who were likely to have been strongly influenced by the policy with the trends of students who were less likely to be influenced by the policy. While no students were exogenously excluded from the policy, it is likely that low-achieving students were more strongly influenced by the policy than their peers, since they were disproportionately at-risk for retention and were disproportionately located in schools at risk for probation. Thus, if we find that after 1996 achievement levels of low-achieving students increased more sharply than those of high-achieving students, we might conclude that the policy had a positive impact on achievement. This model can be implemented by estimating the following specification:

(3.2)   $y_{isdt} = (\mathit{HighStakes})_{dt}\,\delta_1 + (\mathit{Low})_i\,\delta_2 + (\mathit{HighStakes} \times \mathit{Low})_{idt}\,\delta_3 + X_{isdt}\beta_1 + Z_{sdt}\beta_2 + \gamma_t + \eta_d + \phi_{dt} + \varepsilon_{isdt}$

where $\mathit{Low}$ is a binary variable that indicates a low prior achievement level. There are two major drawbacks to this strategy. First, it relies on the assumption that, in the absence of the policy, unobserved time trends would have been equal across achievement groups (i.e., $\gamma_{Low} = \gamma_{High} = \gamma$). Second, it assumes a homogeneous policy effect (i.e., $\delta_{Low} = \delta_{High} = \delta$). In this case, the policy effect is likely to be a function of not only the incentives generated by the policy, but the capacity to respond to the incentives (i.e., $\delta = \delta(a)$ where $a$ is a measure of prior achievement). It is likely that higher achieving students and schools have a greater capacity to respond to the policy (i.e., more parental support, higher quality teachers, more effective principal leadership, etc.), in which case this comparison will underestimate the effect of the policy. Moreover, high-achieving students may differ from low-achieving students in other ways that might be expected to influence achievement gains (e.g., risk preference). As we see in Section 6, it turns out that the policy effects are larger in low-achieving schools, but there is no consistent pattern in policy effects across student prior achievement groups, which suggests that incentives and capacity may have offset each other.
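Continuing the sketch above, the difference-in-difference version in the spirit of equation (3.2) simply adds the low-prior-achievement indicator and its interaction with the policy dummy (column names again hypothetical):

```python
# Equation (3.2) analogue: 'low' flags students with low prior achievement.
# The coefficient on the interaction (delta_3) measures the differential
# policy effect for the students most exposed to the incentives.
did_model = smf.ols(
    "score ~ high_stakes * low + black + hispanic + male + free_lunch"
    " + school_poverty + trend",
    data=df,
)
did_result = did_model.fit(cov_type="cluster", cov_kwds={"groups": df["school_id"]})
print(did_result.params["high_stakes:low"])  # delta_3
```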
To assess the generalizability of achievement gains under the policy, I compare achievement trends on the ITBS with performance trends on the Illinois Goals Assessment Program (IGAP), a state-mandated achievement exam that is not used for accountability purposes in Chicago. Because these exams are measured in different metrics, I standardize the outcome measures using the mean and standard deviation from the base year.

Regardless of whether one uses the ITBS or IGAP as an outcome, the primary disadvantage of comparing the achievement of Chicago students in different years is that it does not take into account changes that were taking place outside the CPS during this period. It is possible that state or national education reforms, or changes in the economy, were responsible for the increases in student performance in Chicago during the late nineties. Indeed, Grissmer and Flanagan (1998) show that student achievement increased in many states during the 1990s. If this were the case, the estimates from equation 3.1 would overstate the impact of high-stakes testing in Chicago. Using a panel of IGAP achievement data on schools throughout Illinois during the 1990s, I am able to address this concern. By comparing the change over time in Chicago with the change over time in the rest of Illinois, one can eliminate any unobserved common time trends. By comparing Chicago to other low-income, urban districts in Illinois, we eliminate any unobserved trends operating within districts like Chicago. To implement this strategy, I estimate the following model:

(3.3)   $y_{sdt} = (\mathit{Post})\,\delta_1 + (\mathit{Chicago})\,\delta_2 + (\mathit{Urban})\,\delta_3 + (\mathit{Chicago} \times \mathit{Post})\,\delta_4 + (\mathit{Urban} \times \mathit{Post})\,\delta_5 + X\beta_1 + Z\beta_2 + \varepsilon_{sdt}$

where $y$ is the average reading or math score for school $s$ in district $d$ at time $t$, $\mathit{Post}$ is an indicator that takes on the value of one after 1996 (when high-stakes testing was introduced in Chicago) and zero otherwise, and $\mathit{Chicago}$ and $\mathit{Urban}$ are binary variables that indicate whether the school is in Chicago or any low-income, urban district in Illinois. $X$ captures school average demographics and $Z$ reflects district level demographic variables. This basic model can be extended to account for pre-existing achievement trends in each district and to examine the variation in policy effects across the distribution of school achievement (i.e., whether the differential gain between low-achieving schools in Chicago and Illinois is larger than the differential gains between high-achieving schools in Chicago and Illinois).
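A sketch of the school-level comparison in equation (3.3) follows; the panel file and columns are hypothetical. Because Chicago schools are also coded as urban, the Chicago*Post coefficient directly measures Chicago's post-1996 change relative to the other urban districts.

```python
# School-by-year panel of average IGAP scores; columns are hypothetical.
schools = pd.read_csv("igap_school_panel.csv")

panel_model = smf.ols(
    "igap_score ~ post + chicago + urban + chicago:post + urban:post"
    " + pct_black + pct_hispanic + pct_free_lunch + pct_lep"
    " + mobility_rate + district_spending",
    data=schools,
)
panel_result = panel_model.fit(
    cov_type="cluster", cov_kwds={"groups": schools["district_id"]}
)
print(panel_result.params["chicago:post"])  # delta_4: Chicago vs. other urban districts
```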
4. Data

This study utilizes detailed administrative data from the Chicago Public Schools as well as the Illinois State Board of Education. CPS student records include information on a student's school, home address, demographic and family background characteristics, special education and bilingual placement, free lunch status, standardized test scores, grade retention and summer school attendance. CPS personnel and budget files provide information on the financial resources and teacher characteristics in each school, and school files provide aggregate information on the school population, including daily attendance rates, student mobility rates and racial and SES composition.

The measure of achievement used in Chicago is the Iowa Test of Basic Skills (ITBS), a standardized, multiple-choice exam developed and published by the Riverside Company. The exam is nationally normed and student scores are reported in a grade equivalent metric that reflects the number of years and months of learning. A student at the 50th percentile in the nation scores at the eighth month of her current grade – i.e., an average third grader will score a 3.8. However, because grade equivalents present a number of well-known shortcomings for comparisons over time or across grades,8 I use an alternative outcome metric derived from an item-response model. By taking advantage of the common items across different forms and levels of the exam, these measures provide an effective way to compare students in different grade levels or taking different forms of the exam.9

The primary sample used in this analysis consists of students who were in grades three to eight for the first time in 1993 to 1999 and who were tested and included for official reporting purposes.10 I limit my analysis to first-time students because the implementation of the social promotion policy caused a large number of low-performing students in third, sixth and eighth grade to be retained, which substantially changed the student composition in these and subsequent grades beginning in 1997-98.11 By limiting the analysis to cohorts beginning in 1993, I am able to control for at least three years of prior achievement scores for all students.12 I do not examine cohorts after 1999 because the introduction of the social promotion policy substantially changed the composition of these groups. For example, nearly 20 percent of the original 2000 sixth grade cohort was retained in 1997 when they experienced high-stakes testing as third graders.13

Table 4.1 presents summary statistics for the sample. Like many urban districts across the country, Chicago has a large population of minority and low-income students. In the sixth grade group, for example, roughly 56 percent of students are Black, 30 percent are Hispanic and over 70 percent receive free or reduced price lunch. The percent of students in special education and bilingual programs is substantially lower in this sample than for the entire CPS population because many students in these programs are not tested, and are thus not shown here. We see that a relatively small number of students in these grades transfer to a private school (one percent), move out of Chicago (five percent) or drop out of school (one percent). The rates among eighth graders are substantially higher than in the other elementary grades, though not particularly large in comparison to high school rates.

8 There are three major difficulties in using grade equivalents (GE) for over-time or cross-grade comparisons. First, different forms of the exam are administered each year. Because the forms are not correctly equated over time, one might confound changes in test performance over time with changes in form difficulty. Second, because the grade equivalents are not a linear metric, a score of 5.3 on level 12 of the exam does not represent the same thing as a score of 5.3 at level 13. Thus, it is difficult to accurately compare the ability of two students if they are in different grades. Finally, the GE metric is not linear within test level because the scale spreads out more at the extremes of the score distribution. For example, one additional correct response at the top or bottom of the scale can translate into a gain of nearly one full GE whereas an additional correct answer in the middle of the scale would result in only a fraction of this increase.

9 Models based on Item Response Theory (IRT) assume that the probability that student i answers question j correctly is a function of the student ability and the item difficulty. In practice, one estimates a simple logit model in which the outcome is whether or not student i correctly answers question j. The explanatory variables include an indicator variable for each question and each student. The difficulty of the question is given by the coefficient on the appropriate indicator variable and the student ability is measured by the coefficient on the student indicator variable. The resulting metric is calibrated in terms of logits. This metric can be used for cross-grade comparisons as long as the exams in question share a number of common questions (Wright 1979). I thank the Consortium on Chicago School Research for providing the Rasch measures used in this analysis.

10 As explained in greater detail in Section 5, certain bilingual and special education students are not tested or, if tested, not included for official reporting purposes.

11 While focusing on first-timers allows a consistent comparison across time, it is still possible that the composition changes generated by the social promotion policy could have affected the performance of students in later cohorts. For example, if first-timers in the 1998 and 1999 cohorts were in classes with a large number of low-achieving students who had been retained in the previous year (in addition to the low-achieving students in the grade for the first time), they might perform lower than otherwise expected. This would bias the estimates downward.

12 Because a new series of ITBS forms was introduced in 1993, focusing on these cohorts also permits the cleanest comparisons of changes over time. For students missing prior test score information, I set the missing test score to zero and include a variable indicating the score for this period was missing.

13 For this reason, the eighth grade sample only includes cohorts from 1993 to 1998. The 1999-2000 cohort is not included in the third grade sample because there is evidence that retention rates in first and second grade increased substantially after the introduction of high-stakes testing even though these grades were not directly affected by the social promotion policy.
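The item-response model described in footnote 9 can be written down compactly. The sketch below estimates a Rasch-style logit on a long-format file of item responses; the file and column names are hypothetical, and in practice large samples call for specialized conditional maximum likelihood routines rather than a brute-force logit.

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per (student, item) pair with a 0/1 'correct' outcome; hypothetical file.
long = pd.read_csv("itbs_item_responses.csv")

# As described in footnote 9: P(correct) is a function of student ability and
# item difficulty, each entered as a set of indicator variables.
rasch = smf.logit("correct ~ C(student_id) + C(item_id)", data=long).fit()

# Student abilities, in logits, are the coefficients on the student indicators;
# item difficulties are the coefficients on the item indicators.
abilities = rasch.params.filter(like="C(student_id)")
```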
5. The Impact of High-Stakes Testing on Testing Patterns

While the new accountability policies in Chicago are designed to increase student achievement, high-stakes testing creates incentives for students and teachers that may change test-taking patterns. For example, if students are required to pass the ITBS in order to move to the next grade, they may be less inclined to miss the exam. On the other hand, under the probation policy teachers have an incentive to dissuade low-achieving students from taking the exam.14 In a recent descriptive analysis of testing patterns in Chicago, Easton et al. (2000, 2001) found that the percent of CPS students who are tested and included15 for reporting purposes has declined during the 1990s, particularly following the introduction of high-stakes testing. However, they attribute this decline to changing demographics (specifically, an increase in bilingual students in the system) and administrative changes in the testing policy16 rather than to the new high-stakes testing program. Shifts in testing patterns are important not only because they may indicate if some students are being neglected under the policy, but also because they may bias estimates of the achievement gains under the policy. Building on the work of Easton and his colleagues, this analysis not only controls for changes in observable student characteristics over time and pre-existing trends within the CPS, but also separately considers the trends for gate versus non-gate grades and low versus high achieving students.

14 Schools are not explicitly judged on the percentage of their students who take the exams, although it is likely that a school with an unusually high fraction of students who miss the exam would fall under suspicion.

15 Students in Chicago fall into one of three testing categories: (1) tested and included, (2) tested and excluded, and (3) not tested. Included versus excluded refers to whether the student's test scores count for official reporting purposes. Students who are excluded for reporting are not subject to the social promotion policy and their scores do not contribute to the determination of their school's probation status. The majority of students in grades two through eight take the ITBS, but roughly ten percent of these students are excluded from reporting on the basis of participation in special education or bilingual programs. A third group of students is not required to take the ITBS at all because of a bilingual or special education classification. Students with the least severe special education placements and those who have been in bilingual programs for more than four years are generally tested and included. Those with more severe disabilities and those who have been in bilingual programs three to four years are often tested but excluded. Finally, those students with severe learning disabilities and those who have only been in bilingual programs for one or two years are generally not tested.

16 Prior to 1997, the ITBS scores of all bilingual students who took the standardized exams were included for official reporting purposes. During this time, CPS testing policy required students enrolled in bilingual programs for more than three years to take the ITBS, but teachers were given the option to test other bilingual students. According to school officials, many teachers were reluctant to test bilingual students, fearing that their low scores would reflect poorly on the school. Beginning in 1997, CPS began excluding the ITBS scores of students who had been enrolled in bilingual programs for three or fewer years to encourage teachers to test these students for diagnostic purposes. In 1999, the CPS began excluding the scores of fourth year bilingual students as well, but also began requiring third-year bilingual students to take the ITBS exams.
5.1 Trends in Testing and Exclusion Rates in Gate Grades

Table 5.1 presents OLS estimates of the relationship between high-stakes testing and the probability of (a) missing the ITBS and (b) being excluded from reporting conditional on taking the ITBS.17 Controls include prior achievement, race, gender, race*gender interactions, neighborhood poverty, age, household composition, a special education indicator and a linear time trend. Robust standard errors that account for the correlation of errors within schools are shown in parentheses. Estimates for Black students are shown separately because these students were unlikely to have been affected by the changes in the bilingual policy.

The results suggest that high-stakes testing increased test-taking rates, particularly among eighth grade students. For Black eighth grade students, the policy appears to have decreased the probability of missing the ITBS by between 2.8 and 5.7 percentage points, a decline of 50 to 100 percent given the baseline rate of 5.7 percent. These effects were largest among high-risk students and schools (tables available from author upon request). In contrast, high-stakes testing appears to have had little if any impact on the probability of being excluded from test reporting. The probability of being excluded did increase slightly for Black students in eighth grade under high-stakes testing, although this may be biased upward due to the increase in test-takers.18 There were no effects for third and sixth grade students.

17 Probit estimates yield virtually identical results so linear probability estimates are presented for ease of interpretation.

18 If the students induced to take the ITBS under the high-stakes regime were more likely to be excluded, the exclusion estimates would be biased upward. To control for possible unobservable characteristics that are correlated with exclusion and the probability of taking the exam under the high-stakes testing regime, I include five variables indicating whether the student was excluded from test reporting in each of the previous five years that he/she took the exam. To further test the sensitivity of the exclusion estimates to changes in the likelihood of taking the ITBS, I estimate each model under the alternative assumptions that exclusion rates among those not tested are equal to, twice as high as, and 50 percent as high as the rate among those tested in the same school prior to high-stakes testing. The results are robust to these assumptions.
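Continuing with the hypothetical student DataFrame from the Section 3 sketch, the Table 5.1 specification is a linear probability model with school-clustered standard errors, along these lines:

```python
# Linear probability model for missing the ITBS (Table 5.1); the binary
# outcome and the covariates mirror the controls listed in the text, with
# hypothetical column names. OLS is used for ease of interpretation, as in
# the paper; probit gives virtually identical results.
lpm = smf.ols(
    "missed_itbs ~ high_stakes + lag_score1 + black + hispanic + male"
    " + black:male + hispanic:male + neighborhood_poverty + age"
    " + special_ed + trend",
    data=df,
)
lpm_result = lpm.fit(cov_type="cluster", cov_kwds={"groups": df["school_id"]})
print(lpm_result.params["high_stakes"])
```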
5.2 Trends in Testing and Exclusion Rates in Non-Gate Grades

While the social promotion policy only directly affected students in third, sixth and eighth grades, the probation policy may have affected students in all grades since schools are judged on the basis of student ITBS scores in grades three through eight. Because ITBS scores do not have direct consequences for students in non-gate grades, they provide a cleaner test of behavioral responses on the part of teachers. Probation provides teachers an incentive to exclude low-achieving students and to encourage high-achieving students to take the exam. While the probation policy may have influenced teacher behavior in each year after 1996, it may be difficult to identify the effects after the first year of the policy because the introduction of the social promotion policy in grades three and six in 1997 influenced the composition of students in the fourth, fifth, and seventh grades beginning in the 1997-98 school year. To avoid these selection problems, I limit the analysis to the first cohort to experience the probation policy.

Table 5.3 presents OLS estimates of the probability of not being tested and of being excluded for Black students in fourth, fifth and seventh grade in 1997. The average effect shown in the top row suggests that the accountability policy decreased the likelihood of missing the ITBS, but did not significantly increase the probability of being excluded (with the exception of fifth grade, where the effect was statistically significant, but substantively small). The second panel shows that the largest decreases in missing the ITBS took place in low-achieving schools. In the third panel, we see that the likelihood of being excluded increased for low-achieving students but decreased for high-achieving students.

6. The Impact of High-Stakes Testing on Math and Reading Achievement

Having examined the effect of high-stakes testing on test-taking and exclusion patterns, we now turn to the effect on math and reading achievement.
Figures 6.1 and 6.2 show the trends in ITBS math and reading scores for third, sixth and eighth grades from 1990 to 2000. Because the social promotion policy was instituted for eighth graders one year before it was instituted for third and sixth graders, I have rescaled the horizontal axis to reflect the years before or after the introduction of high-stakes testing. Recall that the sample includes only students who were in third, sixth or eighth grade for the first time (i.e., no retained students). Also note that the trend for eighth grade students ends in 1998 because the composition of eighth grade cohorts changed substantially in 1999, when the first wave of sixth graders to experience the grade retentions in 1997 reached the eighth grade.

The achievement trends rise sharply after the introduction of high-stakes testing for both math (Figure 6.1) and reading (Figure 6.2). In mathematics, achievement increased slightly in the six years prior to the policy, but rose dramatically following the implementation of the accountability policy. Achievement levels are roughly 0.25 logits higher under high-stakes testing. Given a baseline annual gain of 0.50 logits, achievement levels have risen by the equivalent of one-half of a year's learning after implementation of high-stakes testing. The slope of the line is steeper under the new regime because later cohorts were exposed to the accountability policy for a longer period before taking the exam and/or students and teachers may have become more efficient at responding to the policy over time.19

In estimating the impact of the high-stakes testing policy on student achievement, it is important to control for student demographics and prior achievement scores since there is evidence that the composition of the Chicago student population may have been changing over this period. Controlling for prior achievement does not present a problem in estimating the policy effects for the 1997 cohort since this group only experienced the heightened incentives in the 1996-97 school year. However, as already noted, later cohorts experienced the incentives generated by the accountability policies prior to entering the gate grade. For example, the 1999 sixth grade cohort attended fourth and fifth grades (1997 and 1998) under the new regime. To the extent that the probation policy influenced student achievement in non-gate grades, or students, parents or teachers changed behavior in non-gate grades in anticipation of the new promotional standards in the gate grades, one would expect student test scores to be higher in these non-gate grades in comparison to the counterfactual no-policy state.

This presents two potential problems in the estimation. First, these pre-test scores would be endogenous in the sense that they were caused by the policy. If the policy increased achievement in these grades, controlling for these prior test scores would decrease the estimated effect of the policy.20 Second, the inclusion of these endogenous pre-test scores will influence the estimation of the coefficients on the prior achievement (and thus all other) variables. For this reason, Table 6.1 shows OLS estimates with various prior achievement controls.21 The first column for each subject includes controls for three prior years of test score information (t-1, t-2 and t-3). The second column for each subject controls for achievement scores at time t-2 and t-3. The third column for each subject controls only for test scores at t-3.

19 With this data, we are not able to distinguish between an implementation effect and longer exposure to the treatment.

20 Alternatively, these estimates could simply be interpreted as the impact of the policy in the gate year alone.

21 To the extent that high-stakes testing has significantly changed the educational production function, we might expect the coefficient estimates to be significantly different for cohorts in the high-stakes testing regime. In this case, one might be concerned that including the post-policy cohorts in the estimation might change the estimation of the policy effect. To check this, I first estimated a model that included only pre-policy cohorts. I then used the coefficient estimates to calculate the policy effect for post-policy cohorts. The results were virtually identical.
The preferred estimates (in the shaded cells) suggest that high-stakes testing has significantly increased student achievement in all subjects, grades and years. For example, consider the results for eighth grade reading. The estimated effect for the 1996 cohort is 0.132, meaning that eighth graders in 1995-96 scored 0.132 logits higher than observationally similar eighth graders in earlier years. To get a sense of the magnitude of this effect, the bottom of the table presents the average gain for third, sixth and eighth graders in 1993 to 1995 (with the standard deviation in parentheses). Given the average annual learning gain of roughly 0.50, high-stakes testing increased the reading achievement of eighth graders in 1996 by more than 25 percent of the annual learning gain. The impact on math achievement was roughly 20 percent of the annual learning gain.22 The one exception occurs in third grade reading in 1997 and 1999. Insofar as first and second grade teachers have not adjusted instruction since the introduction of high-stakes testing (perhaps a reasonable assumption given that these grades are not counted in determining a school's probation status), it is not problematic to include controls for first and second grade achievement. As I suggest in the next section, this exception may be due to form effects.

22 This table also provides a sensitivity check for the model. The results to the right of each shaded cell within each panel are alternative estimates that use different prior achievement controls. In eighth grade reading, for example, the specification that controls for three years of prior achievement data yields an estimate for the 1996 cohort of .132. Specifications that include only two or one year of prior test score information yield estimates of .133 and .114 respectively. Reading across the rows, it appears that the sixth and eighth grade estimates are not too sensitive to the prior achievement controls.
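The magnitude calculation in the text is simply the estimated policy effect expressed as a fraction of the average annual learning gain; as a worked example with the numbers reported above:

$$\frac{\hat{\delta}_{1996}}{\text{average annual gain}} = \frac{0.132}{0.50} \approx 0.26,$$

that is, just over a quarter of a year's learning, matching the "more than 25 percent of the annual learning gain" reported for eighth grade reading.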
6.2 The Sensitivity of High-Stakes Testing Effects

To test the sensitivity of the findings presented in the previous section, Tables 6.2 and 6.3 present comparable estimates for a variety of different specifications and samples. Column 1 shows the baseline estimates for all students who took the test and whose scores were included for test reporting. Columns 2 to 7 show that the results are robust to using excluded as well as included students and to a variety of assumptions regarding the hypothetical scores of students who missed the exam. Columns 8 and 9 suggest that the original estimates are robust to the inclusion of a single linear trend as well as school-specific trends and school-specific intercepts.

A potentially more serious concern involves the fact that CPS administered several different forms of the ITBS exam during the 1990s. While the exams are generally comparable, there is evidence that some forms are slightly easier than others, particularly for certain grades and subjects and at certain points in the ability distribution. This can be seen in the achievement trends shown in Figures 6.1 and 6.2. Some of the year-to-year swings that appear too large to be explained by a change in the student population or other events taking place in the school system are likely due to form effects. To the extent that the timing of different forms coincides with the introduction of the accountability policy, our estimates may be biased. For example, in Table 6.1 the baseline estimates for the 1997 and 1999 reading effects are noticeably smaller than the 1998 effect in third and sixth grade. It seems unlikely that any change in the policy could have been responsible for this pattern, particularly because it is not increasing or decreasing monotonically. This suggests that the reading section of Form M (the form given in 1997 and 1999) was more difficult than the comparable section in earlier forms. One way to test the sensitivity of the findings to potential form effects is to compare achievement in years in which the same form was administered. Fortunately, Chicago students took ITBS Form L in 1994, 1996 and 1998.23 Column 10 presents estimates from a sample limited to these three cohorts, which suggest that form effects are not driving the results presented earlier.

23 This is the only form that was given under both low and high-stakes regimes. Form K was administered in 1993 and 1995. Form M was administered in 1997 and 1999. While Chicago students took Form K again in 1999-2000, it is difficult to compare this cohort to earlier cohorts because of composition changes caused by the retention of low-achieving students in earlier cohorts.

6.3 The Heterogeneity of Effects Across Student and School Risk Level

Having examined the aggregate achievement trends in the CPS, we now examine the heterogeneity of gains across students and schools. Figures 6.3 and 6.4 show ITBS math and reading trends at different points on the distribution for third, sixth and eighth grade students. Achievement increased by roughly the same absolute amount at the different points on the distribution, suggesting that there was not a significant difference between low and high-achieving students. Table 6.4 shows OLS estimates of the differential effects across students and schools. Because the pattern of effects is similar across cohorts, for the sake of simplicity I combine all three cohorts that experienced high-stakes testing and control for all prior achievement scores, which means that these estimates only capture the effect of high-stakes testing during the gate grade. The first row shows that the average effects for all students and schools are positive and significant. Estimates in the middle panel show that students in low-achieving schools made larger gains under high-stakes testing than their peers in higher-achieving schools. The bottom panel shows that the relationship between prior student achievement and the policy effect varies considerably by subject and grade.
Overall, there is not strong evidence that low-achieving students do better under the policy, perhaps because these students did not have the capacity to respond to the incentives as well as their peers. Because it appears that the accountability policy influenced students and schools across the ability distribution, the difference-in-difference estimates will not capture the full impact of the policy.

6.4 The Impact of High-Stakes Testing in Non-Gate Grades

While the social promotion policy clearly focuses on the three gate grades, it is useful to examine achievement trends in the non-gate grades as well. Because the probation policy provides incentives for teachers in non-gate as well as gate grades, the achievement trends in non-gate grades provide a better measure of the independent incentive effect of the probation policy.24 As noted earlier, this analysis focuses on the 1996-97 cohort because the social promotion policy changed the composition of students in later years. Figures 6.5 and 6.6 show math and reading achievement trends for students in grades four, five and seven who took the ITBS and were included in test reporting. Math achievement (Figure 6.5) appears to have continued a steady upward trend that began prior to the introduction of high-stakes testing. In contrast, reading achievement (Figure 6.6) jumped sharply in 1997 in both fifth and seventh grades, but there does not appear to be a significant increase in fourth grade. This is consistent with the fact that the school accountability policy focuses mainly on reading.

Table 6.5 shows OLS estimates of the policy effects in fourth, fifth and seventh grade. The aggregate estimates in the top row tell the same story as the graphs—there are significant positive effects in all subjects and grades except for fourth grade reading. The results in the middle panel suggest that students in lower achieving schools made larger improvements under high-stakes testing, although the pattern of effects across student prior achievement levels is more mixed. An analysis of policy effects within student*school achievement cells shows that in the lowest achieving schools, students above the 25th percentile showed the largest increases in reading achievement, again consistent with the focus of the accountability policy.

24 Teachers and students in these grades may be anticipating and responding to the promotional standards in the next grade as well. For example, interviews with students and teachers indicate that many students worry about the cutoffs in earlier years and that many teachers use the upcoming, next-year exams as motivation and classroom management techniques.

7. How Generalizable Were the Test Score Gains in Chicago?

In the previous section, we saw that ITBS math and reading scores in Chicago increased dramatically following the introduction of high-stakes testing. This section examines whether the observed ITBS gains reflect a more general increase in student achievement by comparing student performance trends on the ITBS with comparable trends on the Illinois Goals Assessment Program (IGAP), a state-mandated achievement exam that is not used for accountability purposes in Chicago. The data for this analysis is drawn from school “report cards” compiled by the Illinois State Board of Education (ISBE), which provide average IGAP scores by grade and subject as well as background information on schools and districts. The analysis is limited to the period from 1990 to 1998 because Illinois introduced a new exam in 1999.
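Because the ITBS and IGAP are scored on different scales, the trend comparison that follows standardizes each series against its 1993 base year, along the lines of this sketch (column names hypothetical):

```python
import pandas as pd

def standardize_to_base(scores: pd.Series, years: pd.Series,
                        base_year: int = 1993) -> pd.Series:
    """Express scores in standard deviation units of the base-year distribution."""
    base = scores[years == base_year]
    return (scores - base.mean()) / base.std()

# Put both exams on a common scale before plotting their trends.
df["itbs_z"] = standardize_to_base(df["itbs_score"], df["year"])
df["igap_z"] = standardize_to_base(df["igap_score"], df["year"])
```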
7.1 A Comparison of ITBS and IGAP Achievement Trends in Chicago Over Time25

Figure 7.1 shows the achievement trends on the math IGAP exam for all students in Chicago from 1990 to 1998. IGAP math scores increased steadily in all grades over this period, but no significant discontinuity is evident in 1997.26 Figures 7.2 to 7.4 show both IGAP and ITBS trends for first-time students in grades three, six and eight whose scores were included in ITBS reporting. Scores are standardized using the 1993 means and standard deviations. In third and sixth grade, both the ITBS and IGAP trend upward over this period, although they increase most steeply at different points. IGAP scores increase most sharply from 1994 to 1996 while ITBS scores rise sharply from 1996 to 1998. In the eighth grade, the ITBS and IGAP trends track each other more closely, although it still appears that ITBS gains outpaced IGAP gains two and three years after the introduction of high-stakes testing.

25 This analysis is limited to math because equating problems make it difficult to compare IGAP reading scores over time (Pearson 1998).

26 In 1993, Illinois switched the scaling of the exam slightly, which accounts for the change from 1992 to 1993.

Table 7.1 shows OLS estimates of the ITBS and IGAP effects. The sample is limited to the period from 1994 to 1998 because student-level IGAP data was not available before this time. With no control for pre-existing trends in columns 1 and 5, it appears that math achievement has increased by a similar magnitude on the ITBS and IGAP. However, when we account for pre-existing achievement trends in columns 2 and 6, ITBS achievement scores appear to have increased significantly more than IGAP scores among third and sixth grade students. In eighth grade, however, the ITBS and IGAP remain comparable once we control for pre-existing trends. (It is important to note that in the eighth grade model, the prior achievement trend is only based on two data points – 1994 and 1995.) Thus, in the third and sixth grades it appears that the positive impact of high-stakes testing is not generalizable beyond the high-stakes exam. On the other hand, in eighth grade, the high-stakes testing effects appear more consistent across the ITBS and IGAP.

7.2 A Comparison of IGAP Trends Across School Districts in Illinois

Having examined the ITBS and IGAP math trends in Chicago during the 1990s, we now compare achievement trends in Chicago to those in other districts in Illinois. This cross-district comparison not only allows us to examine reading as well as math scores (since the scoring changes in reading applied to all districts), but also allows us to control for unobservable factors operating in Illinois. Table 7.2 presents summary statistics of the sample in 1990.27 Columns 1 and 2 show the statistics for Chicago and non-urban districts in Illinois. Column 3 shows the statistics for urban districts in Illinois excluding Chicago.28 The City of Chicago serves a significantly more disadvantaged population than the average district in the state, surpassing even other urban districts in terms of the percent of students receiving free lunch and the percent of Black and Hispanic students in the district. Figures 7.5 and 7.6 show changes over time in the average difference in achievement scores between Chicago and other urban districts. While it appears that Chicago has narrowed the achievement gap with other urban districts in Illinois, the trend appears to have begun prior to the introduction of high-stakes testing.
One possible exception is in eighth grade reading, where it appears that Chicago students may have improved more rapidly than students in other urban districts following the introduction of high-stakes testing.

Tables 7.3 and 7.4 show OLS estimates from the specifications in equation 3.3 that control for the racial composition of the school, the percent of students receiving free or reduced price lunch, the percent of Limited English Proficient (LEP) students, school mobility rates, per-pupil expenditures in the district and the percent of teachers with at least a Master's degree in the district. The sample used in the regressions only included information from 1993 to 1998 in order to minimize any confounding effects from the change in scoring. Recall that Chicago is coded as an urban district in these models, so that the coefficient on the Chicago indicator variable reflects the difference between Chicago and all other urban comparison districts. Once we take into account the pre-existing trends (columns 2, 4 and 6), there is no significant difference between Chicago and the urban comparison districts in either reading or mathematics. This is true for schools across the achievement distribution (figures and tables available upon request).

27 I delete roughly one-half of one percent of observations that were missing test score or background information.

28 To identify the comparison districts, I first identify districts in the top decile in terms of the percent of students receiving free or reduced price lunch, percent minority students, and total enrollment, and in the bottom decile in terms of average student achievement (averaged over third, sixth and eighth grade reading and math scores), based on 1990 data. Not surprisingly, Chicago falls in the bottom of all four categories. Of the 840 elementary districts in 1990, Chicago ranks first in terms of enrollment, 12th in terms of percent of low-income and minority students and 830th in student achievement. Other districts that appear at the bottom of all categories include East St. Louis, Chicago Heights, East Chicago Heights, Calumet, Joliet, Peoria and Aurora. I then use the 34 districts (excluding Chicago) that fall into the bottom decile in at least three out of four of the categories. I have experimented with several different inclusion criteria and the results are not sensitive to the choice of the urban comparison group.

8. Explaining the Differential Gains on the ITBS and the IGAP

There are a variety of reasons why achievement gains could differ across exams under high-stakes testing, including differential student effort, ITBS-specific test preparation and cheating.29 Unfortunately, because many of these factors such as effort and test preparation are not directly observable, it is difficult to attribute a specific portion of the gains to a particular cause. Instead we now seek to provide some evidence regarding the potential influence of several factors.

29 Jacob and Levitt (2001) found that instances of classroom cheating increased substantially following the introduction of high-stakes testing. However, they estimate that it could only explain at most 10 percent of the observed ITBS gains during this period.

8.1 The Role of Guessing

Prior to the introduction of high-stakes testing, roughly 20 to 30 percent of students left items blank on the ITBS exams despite the fact that there was no penalty for guessing. One explanation for the relatively large ITBS gains since 1996 is that the accountability policy encouraged students to guess more often rather than leaving items blank. If we believe that the increased test scores were due solely to guessing, we might expect the percent of questions answered to increase, but the percent of questions answered correctly (as a percent of all answered questions) to remain constant or perhaps even decline.
8. Explaining the Differential Gains on the ITBS and the IGAP

There are a variety of reasons why achievement gains could differ across exams under high-stakes testing, including differential student effort, ITBS-specific test preparation and cheating.29 Unfortunately, because many of these factors, such as effort and test preparation, are not directly observable, it is difficult to attribute a specific portion of the gains to a particular cause. Instead, we now provide some evidence on the potential influence of several of these factors.

8.1 The Role of Guessing

Prior to the introduction of high-stakes testing, roughly 20 to 30 percent of students left items blank on the ITBS exams despite the fact that there was no penalty for guessing. One explanation for the relatively large ITBS gains since 1996 is that the accountability policy encouraged students to guess rather than leave items blank. If the increased test scores were due solely to guessing, we would expect the percent of questions answered to increase, but the percent of questions answered correctly (as a percent of all answered questions) to remain constant or perhaps even decline. In fact, from 1994 to 1998 the percent of questions answered correctly increased by roughly 9 percent at the same time that the percent left blank declined by 36 percent, suggesting that the higher completion rates were not due entirely to guessing. Even if we assume that the entire increase in item completion reflects random guessing, Table 8.1 shows that guessing could explain only 10 to 20 percent of the observed ITBS gains among the students most likely to have left items blank in the past.

29 Jacob and Levitt (2001) found that instances of classroom cheating increased substantially following the introduction of high-stakes testing. However, they estimate that cheating can explain at most 10 percent of the observed ITBS gains during this period.
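The upper bound in Table 8.1 is simple arithmetic, reproduced below for one row of the table. The function name and variables are mine; the one-in-four chance-success rate is the factor the table applies in column (2).

    def guessing_share(extra_answered, gain_per_item, actual_gain, p_chance=0.25):
        """Share of the observed gain attributable to random guessing:
        additional items answered x chance of a correct guess x achievement
        gain per additional correct response, divided by the actual gain
        (columns (1), (3), (5) and (6) of Table 8.1)."""
        return extra_answered * p_chance * gain_per_item / actual_gain

    # Third grade reading row: .34 additional items answered, a gain of
    # .14 per additional correct response, and an actual gain of .06.
    print(round(guessing_share(0.34, 0.14, 0.06), 2))  # ~0.20, i.e. at most 20%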
8.2 The Role of Test Preparation

Another explanation for the differential effects involves the content of the exams. If the ITBS and IGAP emphasize different topics or skills, and teachers have aligned their curriculum to the ITBS in response to the accountability policy, then we might see disproportionately large increases on the ITBS. As Table 8.2 shows, while the two exams have the same general format, the IGAP appears to place greater emphasis on critical thinking and problem-solving skills. For example, the IGAP math exam has fewer straight computation questions, and even these questions are asked in the context of a sentence or word problem. Similarly, with its long passages, multiple correct answers and questions comparing passages, the IGAP reading exam appears to be more difficult and more heavily weighted toward critical thinking skills than the ITBS exam.

To the extent that the disproportionately large ITBS gains were driven by ITBS-specific curriculum alignment or test preparation, we would expect the largest gains on ITBS items that are (a) easy to teach to and (b) relatively more common on the ITBS than the IGAP. Table 8.3 presents OLS estimates of the relationship between high-stakes testing and ITBS math achievement by item type.30 The sample consists of tested and included students in the 1996 and 1998 cohorts, both of which took ITBS Form L. The dependent variable is the proportion of students who answered the item correctly in the particular year. Students made the largest gains on items involving computation and number concepts and the smallest gains on items involving estimation, data analysis and problem-solving skills.

Using the coefficients in Table 8.3, we can adjust for the composition differences between the ITBS and IGAP exams. Suppose that under high-stakes testing students were 2.3 percentage points more likely to answer correctly IGAP math items involving critical thinking or problem-solving skills, and 4.2 percentage points more likely to answer correctly IGAP math items requiring basic skills. Scores on the IGAP math exam range from 1 to 500. Assuming a linear relationship between the proportion correct and the score, a one percentage point increase in the proportion correct translates into a five-point gain (which, given the 1994 standard deviation of roughly 84 points, amounts to 0.06 standard deviations). If the distribution of question types on the IGAP resembled that on the ITBS, with one-third of the items devoted to basic-skills material, IGAP scores would have been predicted to increase by an additional two points.31 Given that IGAP scores rose by roughly 15 points between 1996 and 1998, the different composition of items on the ITBS and IGAP exams cannot explain much of the observed ITBS-IGAP difference.

8.3 The Role of Effort

If the consequences attached to ITBS performance led students to concentrate harder during the exam, or led teachers to ensure optimal testing conditions, and these factors were not present during IGAP testing, student achievement on the ITBS may have exceeded IGAP achievement even if the content of the two exams were identical. Because we cannot directly observe effort, however, it is difficult to determine its role in the recent ITBS improvements. While increased guessing cannot explain a significant portion of the ITBS gains, other forms of effort may play a larger role. Insofar as children tend to "give up" toward the end of an exam, either leaving items blank or filling in answers randomly, an increase in effort may lead to a disproportionate improvement on items at the end of the exam, when students are tired and more inclined to give up. One might describe this type of effort as test stamina, i.e., the ability to continue working and concentrating throughout the entire exam.

Table 8.4 presents OLS estimates of the relationship between item position and achievement gains from 1994 to 1998, using the same sample as Table 8.3. The analysis is limited to the reading exam because the math exam is divided into several sections, so that item position is highly correlated with item type. Conditional on item difficulty, student performance on the last 20 percent of items increased 2.7 percentage points more than performance on the first 20 percent of items. Indeed, conditional on item difficulty, it appears that all of the improvement on the ITBS reading exam came at the end of the test. This suggests that effort played a significant role in the ITBS gains seen under high-stakes testing.

30 Because it is difficult to classify items on the reading exam, I limit this analysis to the math exam.
31 All figures are for the average of third, sixth and eighth grades.
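Table 8.4's item-level design can be sketched as follows. This is a minimal illustration under assumed inputs: itbs_reading_items.csv and its columns are hypothetical, with one row per Form L reading item.

    import pandas as pd
    import statsmodels.formula.api as smf

    # One row per ITBS reading item: proportion answering correctly in
    # 1994 and 1998, position quintile on the exam, and a bin for the
    # item's pre-policy difficulty (column names are illustrative).
    items = pd.read_csv("itbs_reading_items.csv")
    items["gain"] = items["prop_correct_1998"] - items["prop_correct_1994"]

    # Gains by exam quintile, conditional on pre-policy item difficulty;
    # the first quintile is the omitted category, as in Table 8.4.
    m = smf.ols("gain ~ C(position_quintile) + C(difficulty_bin)",
                data=items).fit()
    print(m.params.filter(like="position_quintile"))

Conditional on difficulty, the paper finds the last quintile of items improved roughly 2.7 percentage points more than the first.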
9. Unintended Consequences of High-Stakes Testing

Having addressed the impact of the accountability policy on math and reading achievement, we now examine how high-stakes testing has affected other student outcomes. While there is little evidence that the accountability policy affected the probability of transferring to a private school, moving out of the district, or dropping out of school,32 it appears to have influenced primary grade retentions and achievement scores in science and social studies.

9.1 The Impact of High-Stakes Testing on the Early Primary Grades

Table 9.1 presents OLS estimates that control for a variety of student and school demographics as well as pre-existing trends.33,34 While enrollment rates decreased slightly over this period, there does not appear to be any dramatic change following the introduction of high-stakes testing. In contrast, retention rates appear to have increased sharply under high-stakes testing. Retention rates were quite low in the 1993 cohort, ranging from 1.1 percent for kindergartners to 4.2 percent for first graders. First graders in 1997 to 1999 were roughly 50 percent more likely to be retained than comparable peers in 1993 to 1996, and second graders were nearly twice as likely to be retained after 1997.

32 Figures and tables available from the author upon request.
33 Probit estimates yield nearly identical results, so estimates from a linear probability model are presented for ease of interpretation.
34 Roderick et al. (2000) found that retention rates in kindergarten, first and second grade started to rise in 1996 and jumped sharply in 1997 among first and second graders. The analysis here builds on that work by examining enrollment and special education trends (not shown) as well as retention, controlling for observable student characteristics and pre-existing trends, and examining heterogeneity across students.
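Per footnote 33, the estimates in Table 9.1 come from linear probability models. A stylized version for first graders is sketched below; the file cps_grade1_panel.csv and its columns are hypothetical, and the actual models include the full control set listed in the table notes.

    import pandas as pd
    import statsmodels.formula.api as smf

    # One row per first-time first grader, 1993-1999 cohorts, with an
    # indicator for being retained in the following year.
    df = pd.read_csv("cps_grade1_panel.csv")

    df["trend"] = df["year"] - 1993
    for y in (1997, 1998, 1999):
        df[f"cohort{y}"] = (df["year"] == y).astype(int)

    # Linear probability model: post-policy cohort dummies net of the
    # pre-existing trend, with errors clustered by school.
    m = smf.ols("retained ~ trend + cohort1997 + cohort1998 + cohort1999",
                data=df).fit(cov_type="cluster",
                             cov_kwds={"groups": df["school_id"]})
    print(m.params.filter(like="cohort").round(3))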
9.2 The Impact of High-Stakes Testing on Science and Social Studies Achievement

Figure 9.1 shows fourth grade ITBS achievement trends in all subjects from 1995 to 1997.35 Science and social studies scores increased substantially from 1995 to 1996, most likely because 1995 was the first year that these exams were mandatory for fourth graders; over the same period, there was little change in math or reading scores. Following the introduction of high-stakes testing in 1996-97, math and reading scores rose by roughly 0.20 grade equivalents, but there was no change in science or social studies achievement. Figure 9.2 shows similar trends for eighth grade students from 1995 to 1998. The patterns for eighth grade are less clear, partly because of the noise introduced by changes in test forms. If we focus on years in which the same form was administered, 1996 and 1998, average social studies achievement stayed relatively constant under high-stakes testing while scores in the other subjects rose by about 0.30 grade equivalents.

IGAP trends reveal a similar pattern. Figure 9.3 shows the social studies trends for Chicago and other urban districts in Illinois. Chicago was improving relative to other urban districts in social studies from 1993 to 1996 in both the fourth and seventh grades. In 1997, this improvement stopped, and even reversed somewhat, leaving Chicago students further behind their urban counterparts in social studies in 1997. Figure 9.4, on the other hand, suggests that Chicago students continued to gain on their urban counterparts in science achievement through 1997.

Table 9.2 presents OLS regression estimates that control for student and school characteristics. Students in both grades had higher math and reading scores after the introduction of high-stakes testing, but science achievement stayed relatively flat and social studies achievement declined slightly. Table 9.3 shows OLS estimates of the relationship between high-stakes testing and IGAP science and social studies scores. Conditional on prior achievement trends, Chicago fourth graders in 1997 performed significantly worse than the comparison group in science as well as social studies. The effect of -16.4 for social studies is particularly large, nearly one-third of a standard deviation. The point estimate for seventh grade social studies is negative, though not statistically significant, while the effect for seventh grade science is essentially zero.

Overall, it appears that social studies and science achievement decreased relative to reading and math achievement. The decrease appears larger for social studies than for science, and somewhat larger for fourth grade than for seventh or eighth grade. The decline in social studies achievement is consistent with teachers shifting time away from this subject to focus on reading. One reason we might not see a similar drop in science scores is that science is often taught by a separate teacher, whereas social studies is more often taught by the regular classroom teacher who is also responsible for reading, and possibly mathematics, instruction.

35 Fourth graders did not take the ITBS science and social studies exams before 1995. After 1997, the composition of fourth grade cohorts changed due to the introduction of grade retention for third grade students. In general, a difficulty in studying trends in science and social studies achievement in Chicago is that the exams in these subjects are not given every year and, in many cases, are administered to different grades than the math and reading exams. The CPS has consistently required students in grades two to eight to take the reading and math portions of the ITBS, but has frequently changed which grades are required to take the science and social studies exams. The only consistent series consists of fourth and eighth grade students from 1995 to 1999.

10. Conclusions

This analysis has generated a number of significant findings: (1) ITBS scores in mathematics and reading increased substantially following the introduction of high-stakes testing in Chicago; (2) the pattern of ITBS gains suggests that low-achieving schools made somewhat larger gains under the policy, but there was no consistent pattern in the differential gains of low- versus high-achieving students; (3) in contrast to ITBS scores, IGAP math trends in third and sixth grade showed little if any change following the introduction of high-stakes testing, while in eighth grade IGAP math trends increased less sharply than comparable ITBS trends; (4) a cross-district comparison of IGAP scores in Illinois during the 1990s suggests that high-stakes testing did not affect math or reading achievement in any grade in Chicago; (5) the introduction of high-stakes testing for math and reading coincided with a relative decline in science and social studies achievement, as measured by both the ITBS and the IGAP; and (6) while the probability that a student transfers to a private school, leaves the district, or drops out of elementary school did not increase following the introduction of the accountability policy, retention rates in first and second grade increased substantially under high-stakes testing, despite the fact that these grades are not directly subject to the social promotion policy.

This analysis highlights both the strengths and weaknesses of high-stakes accountability policies in education. On one hand, the results presented here suggest that students and teachers have responded to the high-stakes testing program in Chicago. The likelihood of missing the ITBS declined sharply following the introduction of the accountability policy, particularly among low-achieving students. ITBS scores increased dramatically beginning in 1997. Survey and interview data further indicate that many elementary students and teachers are working harder (Engel and Roderick 2001; Tepper et al. 2001). All of these changes are most apparent in the lowest performing schools.
On the other hand, there is evidence that teachers and students have responded to the policy by concentrating on the most easily observable, high-stakes topics and neglecting areas that carry few consequences or are less easily observed. The differential achievement trends on the ITBS and IGAP suggest that the large ITBS gains may have been driven by test-day motivation, test preparation or cheating, and therefore may not reflect greater student understanding of mathematics or greater facility in reading comprehension. Interview and survey data confirm that the majority of teachers have responded to high-stakes testing with low-effort strategies, such as increasing test preparation, aligning curriculum more closely with the ITBS and referring low-achieving students to after-school programs, but have not substantially changed their classroom instruction (Tepper et al. 2001). It appears that teachers and students have shifted resources away from non-tested subjects, particularly social studies, which may be a source of concern if these areas are important for future student development and growth. Finally, the accountability policy appears to have increased retention rates in first and second grade.

These results suggest two different avenues for policy intervention. One alternative is to expand the scope of the accountability measures, for example, basing student promotion and school probation decisions on IGAP as well as ITBS scores, or including science and social studies in the accountability system along with math and reading. While this solution maintains the core of the accountability concept, it may be expensive to develop and administer a variety of additional accountability measures, and increasing the number of measures used for the accountability policy may dilute the incentive associated with any one of them. Another alternative is to mandate particular curriculum programs or instructional practices. For example, in Fall 2001 the CPS will require 200 of the lowest performing elementary schools to adopt one of several comprehensive reading programs offered by the CPS. Unfortunately, there is little evidence that such mandated programmatic reforms have significant effects on student learning (Jacob and Lefgren 2001b). More generally, this approach has many of the same drawbacks that initially led to the adoption of a test-based accountability system, including the difficulty of monitoring the activities of classroom teachers, the heterogeneity of needs and problems, and the inherent uncertainty about how to help students learn.

Regardless of the path Chicago chooses, this analysis illustrates the difficulties associated with accountability-oriented reform strategies in education, including the difficulty of monitoring teacher and student behavior, the shortcomings of standardized achievement measures, and the importance of student and school capacity in responding to policy initiatives. This study suggests that educators and policymakers must carefully examine the incentives generated by high-stakes accountability policies in order to ensure that the policies foster long-term student learning and do not simply raise test scores in the short run.
References

Becker, W. E. and S. Rosen (1992). “The Learning Effect of Assessment and Evaluation in High School.” Economics of Education Review 11(2): 107-118.

Bishop, J. (1998). Do Curriculum-Based External Exit Exam Systems Enhance Student Achievement? Philadelphia, Consortium for Policy Research in Education, University of Pennsylvania, Graduate School of Education: 1-32.

Bishop, J. H. (1990). “Incentives for Learning: Why American High School Students Compare So Poorly to their Counterparts Overseas.” Research in Labor Economics 11: 17-51.

Bishop, J. H. (1996). Signaling, Incentives and School Organization in France, the Netherlands, Britain and the United States. Improving America's Schools: The Role of Incentives. E. A. Hanushek and D. W. Jorgenson. Washington, D.C., National Research Council: 111-145.

Bryk, A. S. and S. W. Raudenbush (1992). Hierarchical Linear Models. Newbury Park, Sage Publications.

Bryk, A. S., P. B. Sebring, et al. (1998). Charting Chicago School Reform. Boulder, CO, Westview Press.

Cannell, J. J. (1987). Nationally Normed Elementary Achievement Testing in America's Public Schools: How All Fifty States Are Above the National Average. Daniels, WV, Friends for Education.

Carnoy, M., S. Loeb, et al. (2000). Do Higher State Test Scores in Texas Make for Better High School Outcomes? American Educational Research Association, New Orleans, LA.

Catterall, J. (1987). Toward Researching the Connections between Tests Required for High School Graduation and the Inclination to Drop Out. Los Angeles, University of California, Center for Research on Evaluation, Standards and Student Testing: 1-26.

Catterall, J. (1989). “Standards and School Dropouts: A National Study of Tests Required for High School Graduation.” American Journal of Education 98(November): 1-34.

Easton, J. Q., T. Rosenkranz, et al. (2000). Annual CPS Test Trend Review, 1999. Chicago, IL, Consortium on Chicago School Research.

Easton, J. Q., T. Rosenkranz, et al. (2001). Annual CPS Test Trend Review, 2000. Chicago, IL, Consortium on Chicago School Research.

ECS (2000). ECS State Notes. Education Commission of the States (www.ecs.org).

Feldman, S. (1996). Do Legislators Maximize Votes? Working paper. Cambridge, MA.

Frederiksen, N. (1994). The Influence of Minimum Competency Tests on Teaching and Learning. Princeton, Educational Testing Service, Policy Information Center.

Gardner, J. A. (1982). Influence of High School Curriculum on Determinants of Labor Market Experience. Columbus, The National Center for Research in Vocational Education, Ohio State University.

Goodlad, J. (1983). A Place Called School. New York, McGraw-Hill.

Griffin, B. W. and M. H. Heidorn (1996). “An Examination of the Relationship between Minimum Competency Test Performance and Dropping Out of High School.” Educational Evaluation and Policy Analysis 18(3): 243-252.

Grissmer, D. and A. Flanagan (1998). Exploring Rapid Achievement Gains in North Carolina and Texas. Washington, D.C., National Education Goals Panel.

Haney, W. (2000). “The Myth of the Texas Miracle in Education.” Education Policy Analysis Archives 8(41).

Heubert, J. P. and R. M. Hauser, Eds. (1999). High Stakes: Testing for Tracking, Promotion and Graduation. Washington, D.C., National Academy Press.

Jacob, B. A. (2001). “Getting Tough? The Impact of Mandatory High School Graduation Exams on Student Outcomes.” Educational Evaluation and Policy Analysis, forthcoming.

Jacob, B. A. and L. Lefgren (2001a). “Making the Grade: The Impact of Summer School and Grade Retention on Student Outcomes.” Working paper.

Jacob, B. A. and L. Lefgren (2001b). “The Impact of a School Probation Policy on Student Achievement in Chicago.” Working paper.

Jacob, B. A. and S. D. Levitt (2001). “The Impact of High-Stakes Testing on Cheating: Evidence from Chicago.” Working paper.

Jacob, B. A., M. Roderick, et al. (2000). The Impact of Chicago's Policy to End Social Promotion on Student Achievement. American Educational Research Association, New Orleans.

Kang, S. and J. Bishop (1984). The Impact of Curriculum on the Non-College Bound Youth's Labor Market Outcomes. High School Preparation for Employment. L. Hotchkiss, J. Bishop and S. Kang. Columbus, The National Center for Research in Vocational Education, Ohio State University: 95-135.

Klein, S. P., L. S. Hamilton, et al. (2000). What Do Test Scores in Texas Tell Us? Santa Monica, CA, RAND.

Koretz, D. (1999). Foggy Lenses: Limitations in the Use of Achievement Tests as Measures of Educators' Productivity. Devising Incentives to Promote Human Capital, Irvine, CA.

Koretz, D., R. L. Linn, et al. (1991). The Effects of High-Stakes Testing: Preliminary Evidence About Generalization Across Tests. American Educational Research Association, Chicago.

Koretz, D. M. and S. I. Barron (1998). The Validity of Gains in Scores on the Kentucky Instructional Results Information System (KIRIS). Santa Monica, RAND.

Kreitzer, A. E., G. F. Madaus, et al. (1989). Competency Testing and Dropouts. Dropouts from School: Issues, Dilemmas and Solutions. L. Weis, E. Farrar and H. G. Petrie. Albany, State University of New York Press: 129-152.

Ladd, H. F. (1999). “The Dallas School Accountability and Incentive Program: An Evaluation of its Impacts on Student Outcomes.” Economics of Education Review 18: 1-16.

Lillard, D. R. and P. P. DeCicca (forthcoming). “Higher Standards, More Dropouts? Evidence Within and Across Time.” Economics of Education Review.

Linn, R. L. (2000). “Assessments and Accountability.” Educational Researcher 29(2): 4-16.

Linn, R. L. and S. B. Dunbar (1990). “The Nation's Report Card Goes Home: Good News and Bad About Trends in Achievement.” Phi Delta Kappan 72(2): 127-133.

Linn, R. L., M. E. Graue, et al. (1990). “Comparing State and District Results to National Norms: The Validity of the Claim that 'Everyone is Above Average'.” Educational Measurement: Issues and Practice 9(3): 5-14.

Little, R. J. A. and D. B. Rubin (1987). Statistical Analysis with Missing Data. New York, John Wiley & Sons.

MacIver, D. J., D. A. Reuman, et al. (1995). “Social Structuring of the School: Studying What Is, Illuminating What Could Be.” Annual Review of Psychology 46: 375-400.

MacMillan, D. L., I. H. Balow, et al. (1990). A Study of Minimum Competency Tests and Their Impact: Final Report. Washington, D.C., Office of Special Education and Rehabilitation Services, U.S. Department of Education.

Madaus, G. and V. Greaney (1985). “The Irish Experience in Competency Testing: Implications for American Education.” American Journal of Education 93: 268-294.

McNeil, L. and A. Valenzuela (1998). The Harmful Impact of the TAAS System of Testing in Texas: Beneath the Accountability Rhetoric. High Stakes K-12 Testing Conference, Teachers College, Columbia University.

Meyer, J. W. and B. Rowan (1978). The Structure of Educational Organizations. Environments and Organizations. M. Meyer. San Francisco, CA, Jossey-Bass: 78-109.

Meyer, R. (1982). Job Training in the Schools. Job Training for Youth. R. Taylor, H. Rosen and F. Pratzner. Columbus, National Center for Research in Vocational Education, Ohio State University.

Neill, M. and K. Gayler (1998). Do High Stakes Graduation Tests Improve Learning Outcomes? Using State-Level NAEP Data to Evaluate the Effects of Mandatory Graduation Tests. High Stakes K-12 Testing Conference, Teachers College, Columbia University.

Pearson, D. P. and T. Shanahan (1998). “The Reading Crisis in Illinois: A Ten Year Retrospective of IGAP.” Illinois Reading Council Journal 26(3): 60-67.

Powell, A. G. (1996). Motivating Students to Learn: An American Dilemma. Rewards and Reform: Creating Educational Incentives that Work. S. H. Fuhrman and J. A. O'Day. San Francisco, Jossey-Bass: 19-59.

Powell, A. G., E. Farrar, et al. (1985). The Shopping Mall High School: Winners and Losers in the Educational Marketplace. Boston, Houghton Mifflin.

Reardon, S. (1996). Eighth Grade Minimum Competency Testing and Early High School Dropout Patterns. American Educational Research Association Annual Meeting, New York.

Roderick, M. and M. Engel (2000). The Grasshopper and the Ant: Motivational Responses of Low-Achieving Students to High-Stakes Testing. American Educational Research Association, New Orleans.

Roderick, M., J. Nagaoka, et al. (2000). Update: Ending Social Promotion. Chicago, IL, Consortium on Chicago School Research.

Roderick, M., J. Nagaoka, et al. (2001). Helping Students Meet the Demands of High-Stakes Testing: Is There a Role for Summer School? Annual Meeting of the American Educational Research Association, Seattle, WA.

Sedlak, M. W., C. W. Wheeler, et al. (1986). Selling Students Short: Classroom Bargains and Academic Reform in American High Schools. New York, Teachers College Press.

Shepard, L. A. (1988). Should Instruction be Measurement-Driven? American Educational Research Association, New Orleans.

Shepard, L. A. (1990). “Inflated Test Score Gains: Is the Problem Old Norms or Teaching the Test?” Educational Measurement: Issues and Practice 9(3): 15-22.

Sizer, T. R. (1984). Horace's Compromise: The Dilemma of the American High School. Boston, Houghton Mifflin.

Smith, B. A. and S. Degener (2001). Staying After to Learn: A First Look at Lighthouse. Annual Meeting of the American Educational Research Association, Seattle, WA.

Stecher, B. M. and S. I. Barron (1999). Quadrennial Milepost Accountability Testing in Kentucky. Los Angeles, Center for the Study of Evaluation, University of California.

Stevenson, H. W. and J. W. Stigler (1992). The Learning Gap: Why Our Schools are Failing and What We Can Learn from Japanese and Chinese Education. New York, Simon and Schuster.

Tepper, R., M. Roderick, et al. (forthcoming). The Impact of High-Stakes Testing on Classroom Instructional Practices in Chicago. Chicago, IL, Consortium on Chicago School Research.

Tepper, R. L. (2001). The Influence of High-Stakes Testing on Instructional Practice in Chicago. American Educational Research Association, Seattle, WA.

Tomlinson, T. M. and C. T. Cross (1991). “Student Effort: The Key to Higher Standards.” Educational Leadership (September): 69-73.

Tyack, D. B. and L. Cuban (1995). Tinkering Toward Utopia: A Century of Public School Reform. Cambridge, MA, Harvard University Press.

Waller, W. (1932). Sociology of Teaching. New York, Wiley.

Weick, K. E. (1976). “Educational Organizations as Loosely Coupled Systems.” Administrative Science Quarterly 21: 1-19.

Winfield, L. F. (1990). “School Competency Testing Reforms and Student Achievement: Exploring a National Perspective.” Educational Evaluation and Policy Analysis 12(2): 157-173.

Wright, B. and M. H. Stone (1979). Best Test Design. Chicago, IL, MESA Press.
Table 2.1: Descriptive Statistics on the Social Promotion Policy in Chicago

                                                         3rd Grade   6th Grade   8th Grade(b)
1995-96
  Total Fall Enrollment                                      --          --        29,664
  Promotional Cutoff for ITBS (Grade Equivalents)            --          --          6.8
  Percent Subject to Policy (% tested and included)          --          --          .82
  Percent Failed to Meet Promotional Criteria(a)             --          --          .33
  Percent Failed Reading                                     --          --          .07
  Percent Failed Math                                        --          --          .13
  Percent Failed Reading and Math                            --          --          .13
  Percent Retained or in Transition Center Next Year(a)      --          --          .07
1996-97
  Total Fall Enrollment                                   34,775      31,385      29,437
  Promotional Cutoff for ITBS (Grade Equivalents)            2.8         5.3         7.0
  Percent Subject to Policy (% tested and included)          .71         .81         .80
  Percent Failed to Meet Promotional Criteria(a)             .49         .35         .30
  Percent Failed Reading                                     .06         .05         .06
  Percent Failed Math                                        .17         .17         .13
  Percent Failed Reading and Math                            .26         .13         .11
  Percent Retained or in Transition Center Next Year(a)      .21         .13         .13
1997-98
  Total Fall Enrollment                                   39,400      33,334      31,045
  Promotional Cutoff for ITBS (Grade Equivalents)            2.8         5.3         7.2
  Percent Subject to Policy (% tested and included)          .71         .81         .80
  Percent Failed to Meet Promotional Criteria(a)             .43         .31         .34
  Percent Failed Reading                                     .02         .07         .08
  Percent Failed Math                                        .20         .13         .11
  Percent Failed Reading and Math                            .19         .11         .14
  Percent Retained or in Transition Center Next Year(a)      .23         .13         .14
1998-99
  Total Fall Enrollment                                   40,957      33,173      29,838
  Promotional Cutoff for ITBS (Grade Equivalents)            2.8         5.3         7.4
  Percent Subject to Policy (% tested and included)          .69         .80         .79
  Percent Failed to Meet Promotional Criteria(a)             .40         .29         .31
  Percent Failed Reading                                     .05         .06         .07
  Percent Failed Math                                        .19         .14         .12
  Percent Failed Reading and Math                            .16         .09         .12
  Percent Retained or in Transition Center Next Year(a)      .19         .12         .10
1999-2000
  Total Fall Enrollment                                   40,691      31,149      31,387
  Promotional Cutoff for ITBS (Grade Equivalents)            2.8         5.5         7.7
  Percent Subject to Policy (% tested and included)          .69         .79         .77
  Percent Failed to Meet Promotional Criteria(a)             .41         .31         .40
  Percent Failed Reading                                     .03         .04         .09
  Percent Failed Math                                        .22         .17         .14
  Percent Failed Reading and Math                            .16         .10         .17
  Percent Retained or in Transition Center Next Year(a)      .12         .09         .12

Notes: Students who were not retained or placed in a transition center (1) were promoted, (2) were placed in a non-graded special education classroom, or (3) left the CPS.
a Conditional on being subject to the accountability policy.
b The eighth grade sample includes students in transition centers whose transcripts indicated they were in either eighth or ninth grade.
44 Table 4.1: Summary Statistics Variables 3rd Grade 6th Grade 8th Grade 0.178 0.076 0.087 0.122 3.471 (.962) 3.082 (1.122) -1.247 (1.116) -1.271 (1.091) 0.012 0.116 6.363 (1.277) 6.025 (1.484) 0.691 (.918) 0.087 (970) 0.012 0.111 7.972 (1.287) 7.881 (1.758) 1.539 (.777) 1.123 (.909) 0.058 Moved out of CPS 0.053 0.049 0.069 Dropped Out 0.010 0.010 0.054 0.274 0.374 0.226 0.207 0.387 0.399 0.353 0.387 0.393 0.110 0.161 0.122 0.263 0.313 0.298 0.151 0.154 0.172 0.178 0.162 0.186 0.298 0.210 0.222 Student Demographics Male 0.488 0.479 0.473 Black 0.654 0.552 0.554 Hispanic 0.205 0.304 0.299 Black Male 0.316 0.259 0.255 Hispanic Male 0.103 9.219 (.590) 0.048 0.150 12.210 (.483) 0.035 0.146 14.166 (.524) 0.028 Special Education 0.855 0.315 (.742) -0.261 (.687) 0.036 0.876 0.236 (.236) -0.275 (.689) 0.024 0.901 0.233 (.704) -0.275 (.684) 0.018 Free lunch 0.661 0.642 0.586 Reduced price lunch 0.065 0.072 0.068 Student Outcomes Missed ITBS Excluded from ITBS ITBS Math Score (GE) ITBS Reading Score (GE) ITBS Math Score (Rasch) ITBS Reading Score (Rasch) Transferred to private school Prior Achievement Indicators School < 15% at norms School 15-25% at norms School > 25% at norms Student < 10% Student 10-25% Student 25-35% Student 35-50% Student > 50% Age Living in foster care Living with at least one parent Neighborhood Poverty Neighborhood Social Status 45 Currently in bilingual program 0.048 0.095 0.076 Past bilingual participation 0.116 0.192 0.210 % Limited English Proficient 12.245 15.858 15.502 % Low Income 82.459 81.974 81.263 % Daily Attendance 92.563 92.733 92.555 Mobility Rate 29.865 28.418 28.522 School Enrollment 767.588 780.305 772.123 > 85% African-American 0.545 0.450 0.445 > 85% Hispanic 0.081 0.126 0.119 > 85% African-American + Hispanic 70-85% African-American + Hispanic < 70% African-American + Hispanic 0.103 0.124 0.134 0.082 0.098 0.094 0.189 0.203 0.208 4848.064 4924.203 4912.159 Crime Composite 0.176 0.086 0.084 % School Age Children 0.235 0.232 0.232 % Black 0.589 0.505 0.508 % Hispanic 0.169 0.223 0.223 22448.200 23093.400 23183.110 % Own Home 0.396 0.402 0.405 % Lived in Same House for 5 Years 0.577 0.567 0.569 Mean Education Level 12.099 12.084 12.083 % Managers/Professionals 0.173 0.170 0.170 Poverty Rate 0.282 0.262 0.261 Unemployment Rate 0.419 0.405 0.404 Female Headed HH 0.442 0.402 0.401 165,143 169,377 139,683 1 School Demographics Census Tract Characteristics Census Tract Population Median HH Income Number of observations Notes: The sample includes all tested and included students in cohorts 1993 to 1999 (1993 to 1998 for eighth grade) who were not missing demographic information. Sample sizes for the first two statistics— missing ITBS and excluded from ITBS—are larger than shown in this table because they are taken from the entire sample. 
46 Table 5.1: OLS Estimates of the Probability of Not Being Tested or Being Excluded from Reporting Dependent Variable Excluded Not Tested (Conditional on Tested) Full Black Full Black -.016 (.006) -.029 (.008) -.144 -.008 (.005) -.016 (.006) -.008 (.007) .058 (.006) .064 (.007) .148 (.010) .003 (.002) -.000 (.002) -.001 (.003) 3rd Grade 1997 Cohort 1998 Cohort 1999 Cohort (.014) Baseline Rates Number of Observations .055 .054 228,734 122,251 190,762 117,238 .37 .09 .56 .68 1997 Cohort -.011 (.003) 1998 Cohort -.018 (.004) 1999 Cohort -.019 (.005) -.011 (.003) -.009 (.004) -.008 (.005) .018 (.002) .010 (.002) .020 (.003) .005 (.002) .003 (.002) -.001 (.003) R-Squared 6th Grade Baseline Rates Number of Observations R-Squared .045 .086 207,578 110,350 193,671 106,947 .28 .08 .79 .84 -.019 (.004) -.032 (.006) -.040 (.008) -.028 (.005) -.043 (.008) -.057 (.010) .005 (.002) .023 (.003) .018 (.003) .006 (.003) .010 (.003) .009 (.004) 8th Grade 1996 Cohort 1997 Cohort 1998 Cohort Baseline Rates Number of Observations R-Squared .057 .080 172,205 92,781 159,559 89,042 .28 .09 .84 .86 Notes: The sample includes all first-time students in these grades from 1993 to 1999 (1993-98 for eighth grade). Control variables include a time year, race, gender, race*gender interactions, age, household composition, and an indicator of previous special education placement along with up to three years of prior reading and math achievement (linear, square and cubic terms). Missing test scores are set to zero and a variable is included indicating the score is missing. Robust standard errors that account for the correlation of errors within school are presented in parentheses. 47 Table 5.2: OLS Estimates of the Probability of Not Being Tested or Being Excluded from Reporting, by Prior Achievement Level (Black Students Only) Missed Test Excluded rd th th rd 3 6 8 3 6th 8th -.011 (.005) -.011 (.003) -.022 (.005) .002 (.001) .005 (.002) .006 (.002) < 15% students scored above the 50th percentile -.015 (.005) -.015 (.004) -.032 (.006) .004 (.003) .009 (002) .006 (.003) 15-25% students scored above the 50th percentile -.006 (.005) -.010 (.004) -.021 (.005) .001 (.002) .004 (.002) .006 (.003) > 25% students scored above the 50th percentile -.013 (.007) -.004 (.005) -.007 (.005) .001 (.002) .001 (.003) .004 (.004) < 10th percentile -.005 (.007) -.015 (.004) -.043 (.007) .016 (.004) .021 (.004) .025 (.004) 10-25th percentile -.004 (.004) -.013 (.003) -.019 (.005) .003 (.003) .006 (.002) .010 (.003) 25-35th percentile -.012 (.005) -.008 (.004) -.019 (.005) .002 (.003) -.004 (.002) -.002 (.003) 35-50th percentile -.007 (.004) -.009 (.004) -.014 (.005) -.005 (.002) -.004 (.002) -.006 (.003) > 50th percentile -.004 (.004) -.005 (.003) -.013 (.005) -.004 (.002) -.007 (.002) -.006 (.002) Average Effect Average Effect by School Prior Achievement Average Effect by Student Prior Achievement Level Notes: Each cell contains an estimate from a separate regression. The estimate reflects the coefficient on a variable indicating whether the student was part of a post-policy cohort. School prior achievement is based on 1995 reading scores. Student prior achievement is based on the average of all non-missing test scores in the years t-2 and t-3 for sixth and eighth graders and t-1 and t-2 for third graders. The sample includes all black first-time students in these grades from 1993 to 1999 (1993-98 for eighth grade). 
Control variables include a year trend, race, gender, race*gender interactions, age, household composition, and an indicator of previous special education placement along with up to three years of prior reading and math achievement (linear, square and cubic terms). Missing test scores are set to zero and a variable is included indicating the score is missing. Robust standard errors that account for the correlation of errors within school are presented in parentheses. 48 Table 5.3: OLS Estimates of the Probability of Not Being Tested or Being Excluded from Reporting (Black Students) Missed Test Excluded th th th th 4 5 7 4 5th 7th -.007 (.003) -.010 (.003) -.013 (.004) .001 (.002) .008 (.002) .003 (.002) < 15% students scored above the 50th percentile -.009 (.005) -.013 (.005) -.022 (.005) .001 (.003) .008 (003) .002 (.003) 15-25% students scored above the 50th percentile -.003 (.004) -.007 (.004) -.013 (.005) -.003 (.003) .009 (.002) .003 (.003) > 25% students scored above the 50th percentile -.009 (.005) -.009 (.004) -.003 (.006) -.004 (.004) .005 (.003) .004 (.004) < 10th percentile -.008 (.006) -.005 (.006) -.020 (.007) .012 (.005) .025 (.004) .022 (.005) 10-25th percentile -.000 (.005) -.009 (.004) -.010 (.005) .002 (.004) .010 (.003) .005 (.003) 25-35th percentile -.005 (.005) -.008 (.004) -.015 (.005) -.002 (.004) .003 (.003) -.002 (.003) 35-50th percentile -.007 (.004) -.011 (.004) -.010 (.005) -.003 (.003) -.004 (.003) -.010 (.003) > 50th percentile -.005 (.004) -.003 (.004) -.013 (.004) -.007 (.003) -.004 (.002) -.009 (.003) Average Effect Average Effect by School Prior Achievement Average Effect by Student Prior Achievement Level Notes: Each cell contains an estimate from a separate regression. The estimate reflects the coefficient on a variable indicating whether the student was part of a post-policy cohort. School prior achievement is based on 1995 achievement scores. Student prior achievement is based on the average of all non-missing test scores in the years t-2 and t-3 for sixth and eighth graders and t-1 and t-2 for third graders. The sample includes all black first-time students in these grades from 1993 to 1999 (1993-98 for eighth grade). Control variables include a year trend, race, gender, race*gender interactions, age, household composition, and an indicator of previous special education placement along with up to three years of prior reading and math achievement (linear, square and cubic terms). Missing test scores are set to zero and a variable is included indicating the score is missing. Robust standard errors that account for the correlation of errors within school are presented in parentheses. 
49 Table 6.1: OLS Estimates of ITBS Math and Reading Achievement Dependent Variable ITBS reading score controlling for achievement in the following periods: t-1,t-2,t-3 2 2.1 rd 3 Grade 1997 Cohort 1998 Cohort 1999 Cohort t-2, t-3 ITBS math score controlling for achievement in the following periods: t-3 t-1,t-2,t-3 t-2, t-3 t-3 2.2 2.3 .013 (.011) .102 (.012) .025 (.014) .055 (.012) .213 (.013) .062 (.15) .077 (.011) .242 (.014) .204 (.014) .115 (.013) .142 (.014) .136 (.015) .176 (.014) .250 (.016) .186 (.016) .194 (.014) .291 (.016) .338 (.016) .045 (.007) .099 (.008) .055 (.007) .061 (.007) .142 (.009) .093 (.008) .053 (.008) .165 (.009) .105 (.009) .069 (.007) .126 (.008) .067 (.008) .093 (.008) .164 (.009) .123 (.009) .097 (.009) .196 (.010) .167 (.011) .132 (.008) .158 (.009) .165 (.009) .133 (.008) .182 (.009) .231 (.009) .114 (.008) .189 (.010) .264 (.010) .095 (.006) .115 (.007) .170 (.008) .078 (.007) .132 (.008) .233 (.009) .069 (.007) .139 (.009) .272 (.010) 1 Year Gain 2 Year Gain 3 Year Gain 1 Year Gain 2 Year Gain 3 Year Gain .599 (.824) .503 (.647) .484 (.593) 1.444 (1.011) .945 (.674) 1.013 (.619) .606 (.772) .659 (.479) .472 (.412) 1.354 (.913) 1.323 (.586) .938 (.509) 6th Grade 1997 Cohort 1998 Cohort 1999 Cohort 8th Grade 1996 Cohort 1997 Cohort 1998 Cohort Baseline Gains 3rd Grade 6th Grade 8th Grade -1.432 (.773) 1.490 (.637) -2.027 (.726) 1.594 (.578) Notes: The sample includes all first-time students in these grades from 1993 to 1999 (1993-98 for eighth grade). Control variables include race, gender, race*gender interactions, age, household composition, and an indicator of previous special education placement along with up to three years of prior reading and math achievement (linear, square and cubic terms). Missing test scores are set to zero and a variable is included indicating the score is missing. Robust standard errors that account for the correlation of errors within school are presented in parentheses. Baseline gains are based on the average gains for the 1993 to 1995 cohorts. The end point in the gain scores is always the grade in question. For example, one year gains for 6th graders refers to the 6th grade score – 5th grade score; two year gains are calculated as 6th grade score – 4th grade score and three year gains are calculated as 6th grade score – 3rd grade score. 
50 Table 6.2: OLS Estimates of ITBS Reading Achievement Using Various Specifications and Samples Specification Missing Missing Missing Form L School With Data Data Data Black w/ Included Included Imputed Imputed Imputed Aggregate Specific FE Form L Black Aggregate & Baseline Only and Year at at at & Included Year Excluded Trends Trend Excluded 0%ile in 25%ile in 50%ile in Trend school school school (1) (2) (3) (4) (5) (6) (7) (8) (9) (9) (10) 3rd Grade 1997 Cohort 1998 Cohort 1999 Cohort 2.4 2.5 .013 (.011) .213 (.013) .204 (.014) .008 (.011) .195 (.013) .064 (.015) .032 (.013) .154 (.016) .217 (.016) .030 (.013) .145 (.015) .205 (.015) .050 (.017) .222 (.019) .336 (.018) .033 (.012) .190 (.014) .160 (.013) .022 (.012) .193 (.013) .082 (.015) .091 (.015) .358 (.020) .218 (.027) .090 (.008) .349 (.011) .214 (.014) .234 (.014) .199 (.026) .045 (.007) .142 (.009) .105 (.009) .039 (.007) .117 (.009) .079 (.009) .051 (.009) .166 (.011) .116 (.011) .045 (.009) .132 (.011) .089 (.011) .055 (.008) .151 (.010) .144 (.011) .045 (.007) .129 (.009) .099 (.009) .038 (.007) .118 (.009) .080 (.009) .033 (.010) .167 (.013) .163 (.016) .037 (.006) .170 (.008) .168 (.010) .141 (.008) .072 (.013) .132 (.008) .182 (.009) .264 (.010) .124 (.007) .165 (.009) .234 (.010) .150 (.011) .199 (.013) .290 (.012) .139 (.010) .180 (.013) .256 (.012) .127 (.010) .187 (.012) .273 (.013) .123 (.008) .167 (.009) .238 (.010) .121 (.008) .160 (.009) .226 (.010) .154 (.010) .208 (.015) .227 (.020) .150 (.006) .204 (.009) .227 (.012) .141 (.008) -- .170 (.009) -- 6th Grade 1997 Cohort 1998 Cohort 1999 Cohort 8th Grade 1996 Cohort 1997 Cohort 1998 Cohort Notes: The sample includes all first-time students in these grades from 1993 to 1999 (1993-98 for eighth grade). Control variables include race, gender, race*gender interactions, age, household composition, and an indicator of previous special education placement along with up to three years of prior reading and math achievement (linear, square and cubic terms). Missing test scores are set to zero and a variable is included indicating the score is missing. Robust standard errors that account for the correlation of errors within school are presented in parentheses. 
51 Table 6.3: OLS Estimates of ITBS Math Achievement Using Various Specifications and Samples Specification Missing Missing Missing Form L School With Data Data Data Black w/ Included Included Imputed Imputed Imputed Aggregate Specific FE Form L Black Aggregate & Baseline Only and Year at at at & Included Year Excluded Trends Trend Excluded 0%ile in 25%ile in 50%ile in Trend school school school (1) (2) (3) (4) (5) (6) (7) (8) (9) (9) (10) .115 (.013) .250 (.016) .338 (.016) .122 (.014) .256 (.015) .256 (.016) .127 (.016) .238 (.019) .347 (.019) .126 (.016) .233 (.018) .335 (.019) .162 (.017) .303 (.020) .526 (.020) .152 (.014) .270 (.016) .367 (.015) .137 (.014) .255 (.015) .280 (.015) .218 (.017) .428 (.023) .346 (.029) .215 (.007) .418 (.011) .341 (.010) .290 (.016) .175 (.028) .069 (.007) .164 (.009) .167 (.011) .062 (.007) .144 (.009) .135 (.010) .067 (.009) .170 (.012) .178 (.013) .057 (.009) .140 (.011) .139 (.012) .080 (.009) .180 (.011) .201 (.013) .069 (.008) .158 (.009) .158 (.011) .065 (.008) .147 (.009) .141 (.011) .096 (.010) .257 (.015) .280 (.019) .098 (.005) .255 (.006) .281 (.009) .172 (.009) .152 (.017) .095 (.006) .132 (.008) .272 (.010) .087 (.006) .114 (.008) .246 (.010) .099 (.008) .148 (.011) .284 (.012) .086 (.008) .122 (.011) .248 (.012) .093 (.010) .144 (.011) .290 (.013) .089 (.006) .116 (.008) .253 (.010) .086 (.006) .113 (.009) .243 (.010) .201 (.008) .274 (.013) .363 (.018) .198 (.004) .268 (.007) .361 (.009) .057 (.007) -- .128 (.009) -- 3rd Grade 1997 Cohort 1998 Cohort 1999 Cohort 6th Grade 1997 Cohort 1998 Cohort 1999 Cohort 8th Grade 1996 Cohort 1997 Cohort 1998 Cohort Notes: The sample includes all first-time students in these grades from 1993 to 1999 (1993-98 for eighth grade). Control variables include race, gender, race*gender interactions, age, household composition, and an indicator of previous special education placement along with up to three years of prior reading and math achievement (linear, square and cubic terms). Missing test scores are set to zero and a variable is included indicating the score is missing. Robust standard errors that account for the correlation of errors within school are presented in parentheses. 52 Table 6.4: OLS Estimates of High-Stakes Testing Effects in Gate Grades Reading Math rd th th rd 3 6 8 3 6th 8th Grade Grade Grade Grade Grade Grade Aggregate Effects .046 .066 .150 .132 .088 .125 All Students (.009) (.005) (.006) (.011) (.006) (.005) Effects by School Prior Achievement < 15% students in the .075 .108 .177 .180 .116 .134 school scored above (.017) (.012) (.013) (.026) (.012) (.009) the 50th percentile 15-25% students in the school scored .059 .064 .147 .117 .072 .140 above the 50th (.013) (.006) (.011) (.016) (.010) (.009) percentile > 25% students in the .015 .044 .139 .113 .087 .106 school scored above (.015) (.009) (.009) (.017) (.010) (.008) the 50th percentile Effects by Student Prior Achievement .041 .120 .103 .161 .055 .063 < 10th percentile (.013) (.008) (.012) (.016) (.008) (.008) .071 .060 .146 .153 .067 .108 10-25th percentile (.011) (.007) (.007) (.014) (.007) (.004) .081 .053 .166 .131 .095 .144 25-35th percentile (.013) (.009) (.009) (.015) (.008) (.007) .038 .044 .147 .111 .103 .146 35-50th percentile (.013) (.008) (.009) (.015) (.009) (.007) -.084 .030 .125 .052 .129 .136 > 50th percentile (.015) (.010) (.009) (.015) (.008) (.008) Notes: Each cell contains an estimate from a separate regression. 
The estimate reflects the coefficient on a variable indicating whether the student was part of a post-policy cohort. School prior achievement is based on 1995 achievement scores. Student prior achievement is based on the average of all non-missing test scores in the years t-2 and t-3 for sixth and eighth graders and t-1 and t-2 for third graders. The sample includes all first-time students in these grades from 1993 to 1999 (1993-98 for eighth grade). Control variables include race, gender, race*gender interactions, age, household composition, and an indicator of previous special education placement along with up to three years of prior reading and math achievement (linear, square and cubic terms). Missing test scores are set to zero and a variable is included indicating the score is missing. Robust standard errors that account for the correlation of errors within school are presented in parentheses. 53 Table 6.5: OLS Estimates of High-Stakes Testing Effects for Non-Gate Grades in 1997 Reading Math th th th th 4 5 7 4 5th 7th Grade Grade Grade Grade Grade Grade Aggregate Effects 1997 Effect -.015 (.008) .089 (.007) .068 (.008) .105 (.006) .071 (.004) .040 (.008) .130 (.010) .093 (.007) .019 (.017) .106 (.008) .102 (.013) .144 (.020) .083 (.013) .093 (.015) -.032 (.012) .072 (.007) .098 (.010) .083 (.014) .059 (.010) .112 (.009) -.016 (.014) .048 (.008) .116 (.009) .017 (.014) .051 (.011) .076 (.011) 1998 Effect By School Prior Achievement < 15% students in the school scored above the 50th percentile 15-25% students in the school scored above the 50th percentile > 25% students in the school scored above the 50th percentile By Student Prior Achievement -.054 .088 .092 .158 .072 .049 (.015) (.009) (.012) (.015) (.009) (.010) -.014 .109 .052 .135 .065 .065 10-25th percentile (.012) (.008) (.008) (.014) (.008) (.007) -.001 .103 .076 .082 .069 .096 25-35th percentile (.015) (.010) (.010) (.013) (.010) (.009) .003 .077 .107 .045 .058 .115 35-50th percentile (.013) (.010) (.010) (.013) (.010) (.011) .012 -.010 .173 -.015 .036 .157 > 50th percentile (.014) (.010) (.011) (.013) (.010) (.011) Notes: Each cell contains an estimate from a separate regression. The estimate reflects the coefficient on a variable indicating whether the student was part of a post-policy cohort. School prior achievement is based on 1995 achievement scores. Student prior achievement is based on the average of all non-missing test scores in the years t-2 and t-3. The sample includes all firsttime students in these grades from 1993 to 1997. Control variables include race, gender, race*gender interactions, age, household composition, and an indicator of previous special education placement along with up to three years of prior reading and math achievement (linear, square and cubic terms). Missing test scores are set to zero and a variable is included indicating the score is missing. Robust standard errors that account for the correlation of errors within school are presented in parentheses. 
< 10th percentile 54 Table 7.1: OLS Estimates of ITBS and IGAP Math Achievement Iowa Test of Basic Skills (High-Stakes Exam) (1) 2.6 rd 3 Grade (2) Illinois Goals Assessment Program (Low-Stakes Exam) (3) (5) (6) (7) (8) 2.10 2.11 2.12 2.13 2.14 -.131 (.019) -.140 (.028) .226 (.015) -.174 (.029) 2.7 2.8 .146 (.012) .264 (.013) .039 (.016) .162 (.024) .249 (.014) .151 (.024) .184 (.012) .272 (.015) .563 .564 .392 .392 .559 .569 .389 .401 .091 (.008) .194 (.010) .027 (.012) .175 (.018) .178 (.010) .157 (.018) .159 (.009) .231 (.011) -.037 (.014) .009 (.022) .217 (.012) -.002 (.022) .757 .757 .660 .660 .718 .722 .646 .649 .132 (.007) .153 (.008) .279 (.008) .337 (.012) .493 (.020) .976 (.027) .078 (.008) -- -- -- .183 (.015) .297 (.025) .621 (.033) .083 (.011) .182 (.011) .107 (.009) .171 (.010) .208 (.012) .148 (.013) -- R-Squared .764 .766 .767 .743 .743 .742 Allow for pre-existing trend No Yes No Yes No Yes No Yes 1994-98 1994-98 Form L (94,96,98) Form L (94,96,98) 1994-98 1994-98 Form L (94,96,98) Form L (94,96,98) 1997 Cohort 1998 Cohort R-Squared 2.9 (4) 6th Grade 1997 Cohort 1998 Cohort R-Squared 8th Grade 1996 Cohort 1997 Cohort 1998 Cohort Cohorts Notes: The sample includes all first-time students in these grades from 1994 to 1998. Control variables include race, gender, race*gender interactions, age, household composition, and an indicator of previous special education placement along with up to three years of prior reading and math achievement (linear, square and cubic terms). Missing test scores are set to zero and a variable is included indicating the score is missing. Robust standard errors that account for the correlation of errors within school are presented in parentheses. 55 Table 7.2 Summary Statistics for Illinois Public Schools in 1990 Non-Urban Chicago Districts (1) (2) School Test Scores 4th Grade Science 7th Grade Science 4th Grade Social Studies 7th Grade Social Studies 3rd Grade Math 6th Grade Math 8th Grade Math 3rd Grade Reading 6th Grade Reading 8th Grade Reading Average Test Score Average % Tested School Demographics Average School Enrollment % Black % Hispanic % Asian % Native American % LEP % Free or Reduced Price Lunch % Avg. Daily Attendance % Mobility District Characteristics Pupil-Teacher Ratio Log(Avg. Teacher Salary) Log(Per Pupil Expenditures) Teacher Experience (years) % Teachers with B.A. % Teachers with M.A.+ Number of Schools Number of Districts Average District Enrollment 166 182 163 185 190 191 206 157 184 208 190 82 278 279 279 277 Other Urban Districts (3) 279 291 275 285 278 286 90 211 205 212 207 233 219 222 205 221 220 221 85 795 55.7 29.4 3.0 0.2 15.3 487 6.0 4.4 3.0 0.1 2.3 506 45.8 14.9 1.1 0.1 7.0 71.8 91.9 33.7 17.6 95.6 14.8 56.2 93.5 32.6 20.3 10.7 8.7 16.6 56.4 42.9 470 1 306,500 19.8 10.5 8.4 15.3 56.6 43.4 3,055 833 3,769 20.2 10.5 8.5 16.0 56.6 211 242 34 7,089 295 Notes: The figures presented above are averages for all schools weighted by the school enrollment. 
Table 7.3: OLS Estimates of IGAP Math Achievement in Illinois from 1993 to 1998

                              3rd Grade              6th Grade              8th Grade
                           (1)        (2)         (3)        (4)         (5)        (6)
Chicago*(1997-98)        13.4       -3.1         9.0       -2.8         3.6       -0.5
                         (4.7)      (3.5)       (2.6)      (2.4)       (2.5)      (3.3)
Urban*(1997-98)           4.1        0.6         3.3        8.3         4.9        6.5
                         (2.5)      (3.0)       (1.8)      (2.4)       (2.6)      (3.2)
1997-98                   3.8       -2.6         4.0       -6.8         7.1       -0.4
                         (1.4)      (1.4)       (1.1)      (1.0)       (1.0)      (1.0)
Chicago                 -19.4      -41.8       -12.4      -26.6        -2.2       -6.0
                         (5.7)      (7.8)       (4.0)      (5.6)       (3.4)      (5.5)
Urban                     7.0        2.3         5.2       13.1         9.8       12.3
                         (6.6)      (7.6)       (4.8)      (6.1)       (2.8)      (6.1)
Trend*Chicago              --        6.0          --        4.2          --        1.5
                                    (1.1)                  (0.8)                  (1.2)
Trend*Urban                --        1.3          --       -1.6          --       -0.5
                                    (0.8)                  (0.8)                  (1.1)
Trend                      --        2.6          --        4.1          --        2.8
                                    (0.6)                  (0.4)                  (0.4)
R-Squared                     .67                    .76                    .81
Number of Obs               13,751                 10,939                  8,199
Difference between Chicago and
  Other Urban Districts  13.4       -3.1         9.0       -2.8         3.6       -0.5
Difference between Chicago and
  Non-Urban Districts    17.5       -2.5        12.3        5.5         8.5        6.0

Notes: The following control variables are also included in the regressions shown above: percent black, percent Hispanic, percent Asian, percent Native American, percent low-income, percent Limited English Proficient, average daily attendance, mobility rate, school enrollment, pupil-teacher ratio, log(average teacher salary), log(per pupil expenditures), percent of teachers with a BA degree, and the percent of teachers with a MA degree or higher. Robust standard errors that account for correlation within schools across years are shown in parentheses. The regressions are weighted by the inverse square of the number of students enrolled in the school.

Table 7.4: OLS Estimates of IGAP Reading Achievement in Illinois from 1993 to 1998

                              3rd Grade              6th Grade              8th Grade
                           (1)        (2)         (3)        (4)         (5)        (6)
Chicago*(1997-98)        14.8       -5.9         7.9       -9.6         7.9        2.1
                         (4.2)      (2.7)       (2.8)      (3.5)       (2.2)      (4.2)
Urban*(1997-98)           5.1       -5.3         4.3        8.7         5.3        4.2
                         (2.2)      (2.5)       (2.2)      (3.4)       (2.2)      (3.6)
1997-98                  -6.4       -0.2       -21.6      -15.3       -20.9       -2.0
                         (1.2)      (1.1)       (0.9)      (1.1)       (0.9)      (0.9)
Chicago                 -12.5      -43.0         2.2      -25.1        14.6        1.3
                         (3.6)      (7.4)       (3.3)      (6.1)       (4.7)      (6.9)
Urban                     6.8       -8.8         8.1       14.4        11.2        8.8
                         (4.9)      (6.6)       (4.9)      (7.5)       (3.5)      (7.7)
Trend*Chicago              --        7.2          --        5.9          --        1.6
                                    (1.1)                  (1.1)                  (1.5)
Trend*Urban                --        3.6          --       -1.4          --        0.3
                                    (0.8)                  (1.1)                  (1.4)
Trend                      --       -2.0          --       -2.3          --       -7.0
                                    (0.4)                  (0.5)                  (0.3)
R-Squared                     .76                    .78                    .77
Number of Obs               13,749                 10,942                  8,200
Difference between Chicago and
  Other Urban Districts  14.8       -5.9         7.9       -9.6         7.9        2.1
Difference between Chicago and
  Non-Urban Districts    19.9      -11.2        12.2       -1.9        13.2        6.3

Notes: The following control variables are also included in the regressions shown above: percent black, percent Hispanic, percent Asian, percent Native American, percent low-income, percent Limited English Proficient, average daily attendance, mobility rate, school enrollment, pupil-teacher ratio, log(average teacher salary), log(per pupil expenditures), percent of teachers with a BA degree, and the percent of teachers with a MA degree or higher. Robust standard errors that account for correlation within schools across years are shown in parentheses. The regressions are weighted by the inverse square of the number of students enrolled in the school.
Table 8.1: Estimates of the Achievement Gain Associated with Greater Student Guessing for Students Who Scored in the Bottom 10th Percentile in Prior Grades

Columns: (1) Number of additional questions answered in 1998; (2) Additional correct responses due to guessing in 1998, = (1)*0.25; (3) Achievement gain associated with an additional correct response; (4) Estimated gain associated with random guessing, = (1)*0.25*(3); (5) Actual gain; (6) Percent of gain attributable to increased guessing, = (4)/(5).

                   (1)     (2)     (3)     (4)     (5)     (6)
3rd Grade
  Reading          .34    0.09     .14    0.01     .06    0.20
  Math             .82    0.21     .07    0.01     .09    0.16
6th Grade
  Reading          .46    0.12     .24    0.03     .29    0.10
  Math             .99    0.25     .08    0.02     .19    0.10
8th Grade
  Reading          .96    0.24     .25    0.06     .36    0.17
  Math            1.63    0.41     .08    0.03     .27    0.12

Notes: The sample includes all first-time students in these grades in 1994 and 1998. Control variables include race, gender, race*gender interactions, age, household composition, and an indicator of previous special education placement along with up to three years of prior reading and math achievement (linear, square and cubic terms). Missing test scores are set to zero and a variable is included indicating the score is missing. Robust standard errors that account for the correlation of errors within school are presented in parentheses.

Table 8.2: A Comparison of Eighth Grade ITBS and IGAP Exams

Math, ITBS:
• Structure: 70 multiple-choice questions; 5 possible answers; no penalty for wrong answers; two 40-minute sessions.
• Content: Computation (43), Number Concepts (32), Estimation (24), Problem-Solving (20), Data Analysis (16).
• Format: Computation problems do not have words. The Data Interpretation section consists of a graph or figure followed by several questions.

Math, IGAP:
• Structure: 135 multiple-choice questions; 4 possible answers; no penalty for wrong answers; five sessions of 20-45 minutes each.
• Content: Computation (10), Ratios & Percentages (10), Measurement (10), Algebra (10), Geometry (10), Data Analysis (10), Estimation (10).
• Format: All questions are written as word problems, including the computation problems. One question per graph or figure.

Reading, ITBS:
• 7 passages followed by 3-10 multiple-choice questions each; 49 total questions.
• 2 narrative passages, 4 expository passages, 1 poetry passage.
• One correct answer per question.

Reading, IGAP:
• 2 passages followed by 18 multiple-choice questions each, plus 4 questions that ask the student to compare the two passages; 40 questions total.
• 1 narrative passage, 1 expository passage.
• Multiple correct answers per question.

Notes: Information on the ITBS is taken from the Form L exam. Information on the IGAP is based on practice books.
Table 8.3: OLS Estimates of the Relationship between Item Type and Achievement Gain on the ITBS Math Exam from 1996 to 1998
Dependent Variable = Proportion of Students Answering the Item Correctly on the ITBS Math Exam

                                                 (1)            (2)
1998 Cohort                                   .023 (.011)    .015 (.012)
Basic Skills*1998                             .019 (.006)      --
Number Concepts*1998                            --           .019 (.010)
Estimation*1998                                 --           .006 (.010)
Data Analysis*1998                              --           .008 (.011)
Math Computation*1998                           --           .029 (.009)
25-35% answered item correctly
  prior to high-stakes testing*1998           .015 (.012)    .013 (.012)
35-45% answered item correctly
  prior to high-stakes testing*1998           .019 (.012)    .019 (.012)
45-55% answered item correctly
  prior to high-stakes testing*1998           .018 (.012)    .016 (.012)
55-65% answered item correctly
  prior to high-stakes testing*1998           .015 (.013)    .012 (.013)
65-75% answered item correctly
  prior to high-stakes testing*1998           .010 (.014)    .009 (.014)
75-85% answered item correctly
  prior to high-stakes testing*1998          -.002 (.014)   -.005 (.014)
85-100% answered item correctly
  prior to high-stakes testing*1998           .003 (.024)   -.003 (.024)
Number of Observations                           692            692
R-Squared                                       .956           .957

Notes: The sample consists of all tested and included students in 1996 and 1998. The units of observation are item*year proportions, reflecting the proportion of students answering the item correctly in that year. The omitted item category in column 1 is critical thinking skills. The omitted category in column 2 is problem-solving.

Table 8.4: OLS Estimates of the Relationship between Item Position and Achievement Gain on the ITBS Reading Exam from 1994 to 1998
Dependent Variable = Proportion of Students Answering the Item Correctly on the ITBS Reading Exam

                                                Total
Intercept                                     .004 (.021)
2nd Quintile of the Exam                      .002 (.014)
3rd Quintile of the Exam                      .013 (.015)
4th Quintile of the Exam                      .017 (.015)
5th Quintile of the Exam                      .027 (.017)
25-35% answered item correctly
  prior to high-stakes testing                .025 (.020)
35-45% answered item correctly
  prior to high-stakes testing                .036 (.019)
45-55% answered item correctly
  prior to high-stakes testing                .049 (.019)
55-65% answered item correctly
  prior to high-stakes testing                .046 (.021)
65-75% answered item correctly
  prior to high-stakes testing                .051 (.025)
75-100% answered item correctly
  prior to high-stakes testing                .043 (.030)
Number of Observations                           258
R-Squared                                        .95

Notes: The sample consists of all tested and included students in 1994 and 1998. The units of observation are item*year proportions, reflecting the proportion of students answering the item correctly in that year. The omitted category is the first quintile of the exam.

Table 9.1: OLS Estimates of the Probability in the Following Year of Not Being Enrolled, Being Retained and Being Placed in Special Education

                         Not Enrolled                     Retained
                      (1)             (2)             (3)             (4)
Kindergarten
  1997 Cohort     -0.003 (.002)    .010 (.003)    -0.001 (.001)    .001 (.002)
  1998 Cohort      0.000 (.002)    .019 (.003)     0.002 (.002)    .005 (.002)
  1999 Cohort      0.007 (.002)    .031 (.004)     0.008 (.003)    .011 (.003)
1st Grade
  1997 Cohort     -0.006 (.002)    .005 (.003)     0.013 (.003)    .013 (.003)
  1998 Cohort     -0.004 (.002)    .009 (.003)     0.020 (.003)    .021 (.004)
  1999 Cohort      0.003 (.002)    .022 (.004)     0.024 (.004)    .024 (.005)
2nd Grade
  1997 Cohort     -0.002 (.002)    .009 (.002)     0.023 (.003)    .018 (.003)
  1998 Cohort      0.001 (.002)    .017 (.003)     0.020 (.002)    .014 (.003)
  1999 Cohort      0.010 (.002)    .030 (.004)     0.017 (.002)    .008 (.003)
Trend                  No             Yes             No              Yes

Notes: The sample includes all first-time students in these grades from 1993 to 1999 (1993-98 for eighth grade). Control variables include race, gender, race*gender interactions, age, household composition, and an indicator of previous special education placement, along with up to three years of prior reading and math achievement (linear, square and cubic terms). Missing test scores are set to zero and a variable is included indicating the score is missing. Robust standard errors that account for the correlation of errors within school are presented in parentheses.
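The cohort effects in Table 9.1 read as student-level linear probability models. A minimal sketch of the implied specification, with illustrative notation (the exact functional form is not spelled out in the table):

    D_{ist} = \alpha + \sum_{c=1997}^{1999} \theta_c \,\mathbf{1}\{t = c\} + \lambda t + X_{ist}'\gamma + \varepsilon_{ist}

where D_{ist} indicates that student i in school s is not enrolled (or is retained) in the year following year t, the \theta_c are the reported cohort estimates, the linear trend \lambda t appears only in the "Trend = Yes" columns, and X_{ist} collects the student controls listed in the notes.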
Table 9.2: OLS Estimates of the Relationship Between High-Stakes Testing and ITBS Scores
Dependent Variable = ITBS Score in:

                                   Math           Reading        Science        Social Studies
4th Grade: High-Stakes Regime   .165 (.010)    .156 (.010)   -.023 (.017)    -.061 (.016)
8th Grade: High-Stakes Regime   .159 (.013)    .128 (.017)    .043 (.023)    -.036 (.023)

Notes: Each cell contains an estimate from a separate regression. The estimate reflects the coefficient on a variable indicating whether the student was part of a post-policy cohort. The fourth grade sample includes the 1996 and 1997 cohorts. The eighth grade sample includes the 1996 and 1998 cohorts. Both samples are limited to students who were in the grade for the first time and were tested and included. Control variables include race, gender, race*gender interactions, age, household composition, and an indicator of previous special education placement, along with up to three years of prior reading and math achievement (linear, square and cubic terms). Missing test scores are set to zero and a variable is included indicating the score is missing. Robust standard errors that account for the correlation of errors within school are presented in parentheses.

Table 9.3: OLS Estimates of the Relationship Between High-Stakes Testing and IGAP Science and Social Studies Scores

                                Science                            Social Studies
                       4th Grade        7th Grade         4th Grade         7th Grade
                      (1)     (2)      (3)     (4)       (5)      (6)      (7)     (8)
Chicago*(1997)        6.2    -7.3     10.1     0.6      -1.3    -16.4      7.7    -5.0
                     (3.6)   (2.9)    (2.2)   (3.2)     (4.1)    (3.6)    (2.4)   (4.2)
Trend                 No      Yes      No      Yes       No       Yes      No      Yes
R-Squared             .81     .81      .83     .84       .79      .79      .79     .79
Number of Obs.      11,105  11,105   6,832   6,832     11,103  11,103    6,834   6,834

Notes: The following control variables are also included in the regressions shown above: percent black, percent Hispanic, percent Asian, percent Native American, percent low-income, percent Limited English Proficient, average daily attendance, mobility rate, school enrollment, pupil-teacher ratio, log(average teacher salary), log(per-pupil expenditures), percent of teachers with a BA degree, and percent of teachers with an MA degree or higher. Robust standard errors that account for correlation within schools across years are shown in parentheses. The regressions are weighted by the inverse square of the number of students enrolled in the school.
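Tables 9.2 and 9.3 mirror the earlier designs: Table 9.3 repeats the school-level IGAP specification of Tables 7.3 and 7.4 with a single Chicago*(1997) interaction, while Table 9.2 uses student-level data. A sketch of the latter, again with illustrative notation rather than the paper's own:

    Y_{ist} = \alpha + \theta\,\mathrm{HS}_t + X_{ist}'\gamma + \varepsilon_{ist}

where Y_{ist} is the ITBS score in the relevant subject and HS_t equals one for the post-policy cohort (1997 for fourth graders, 1998 for eighth graders). Because X_{ist} includes up to three years of prior achievement, \theta is best read as the cross-cohort change in achievement conditional on students' prior performance rather than a raw level difference.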
[Figure 6.1: Trends in ITBS Math Scores in Gate Grades in Chicago. Plots the deviation from the 1990 mean (adjusted Rasch metric) against years before/after high-stakes testing (-6 to +5), separately for 3rd, 6th and 8th grades.]

[Figure 6.2: Trends in ITBS Reading Scores in Gate Grades in Chicago. Same axes and series as Figure 6.1.]

[Figure 6.3: Trends in ITBS Math Scores at Different Points on the Ability Distribution. Plots ITBS math scores (difference from the 1994 mean) in 1994, 1996 and 1998 for the 10th, 25th, 75th and 90th percentiles.]

[Figure 6.4: Trends in ITBS Reading Scores at Different Points on the Ability Distribution. Same axes and series as Figure 6.3.]

[Figure 6.5: Trends in ITBS Math Achievement in Non-Gate Grades in Chicago. Plots the deviation from the 1990 mean (Rasch metric) against years before/after high-stakes testing (-6 to +2) for 4th, 5th and 7th grades.]

[Figure 6.6: Trends in ITBS Reading Scores in Non-Gate Grades in Chicago. Same axes and series as Figure 6.5.]

[Figure 7.1: Trends in IGAP Math Scores in Gate Grades in Chicago. Plots IGAP scores (150-250 scale) against years before/after high-stakes testing for 3rd, 6th and 8th grades.]

[Figure 7.2: Trends in Third Grade ITBS and IGAP Math Scores in Chicago. Plots average school means (1993 standard deviation units) against years before/after high-stakes testing for the IGAP and the ITBS.]

[Figure 7.3: Trends in Sixth Grade ITBS and IGAP Math Scores in Chicago. Same axes and series as Figure 7.2.]

[Figure 7.4: Trends in Eighth Grade ITBS and IGAP Math Scores in Chicago. Same axes and series as Figure 7.2.]

[Figure 7.5: Trends in the Difference Between Chicago and Other Urban Districts on the IGAP Math Exam. Plots the Chicago mean IGAP minus the urban comparison district mean against years before/after high-stakes testing for 3rd, 6th and 8th grades.]

[Figure 7.6: Trends in the Difference Between Chicago and Other Urban Districts on the IGAP Reading Exam. Same axes and series as Figure 7.5.]

[Figure 9.1: Trends in Fourth Grade ITBS Achievement in Chicago. Plots grade equivalents from 1995 to 1997 for math, reading, science and social studies.]

[Figure 9.2: Trends in Eighth Grade ITBS Achievement in Chicago. Plots grade equivalents from 1995 to 1998 for math, reading, science and social studies.]

[Figure 9.3: Trends in IGAP Social Studies Scores in Illinois (Difference between Chicago and Other Urban Districts). Plots the Chicago minus other-urban difference from 1993 to 1997 for 4th and 7th grades.]

[Figure 9.4: Trends in IGAP Science Scores in Illinois (Difference between Chicago and Other Urban Districts). Same axes and series as Figure 9.3.]