Draft. Please do not cite or circulate. Does making a test high stakes for students artificially inflate achievement gaps by race and gender? Evidence from the California High School Exit Exam Nicole Arshan* Sean Reardon Stanford University March 2011 *Questions or correspondences regarding this paper may be addressed to Nicole Arshan, Center for Education Policy Analysis, Stanford University School of Education, 520 Galvez Mall, CERAS Building, 5th Floor, Stanford, CA, 94305 or narshan@stanford.edu. Draft. Please do not cite or circulate. Abstract Draft. Please do not cite or circulate. Standardized tests given to large groups of students are popular among both policymakers and researchers for understanding student achievement and attaching accountability to schools, teachers and even students. To use tests for any of these purposes, policymakers and researchers rest on the assumption that these exams provide an accurate, unbiased measure of student knowledge; essentially, that these exam scores consist of a student’s “true ability” plus an acceptably small amount of random error. Several critiques of these exams, however, suggest that this error may not be random but instead systematically correlated with test conditions, school or student behavior. These tests can broadly be broken into two types: those used purely for research and assessment and those used as a policy lever by having implications attached to their results. Exams given with no direct repercussions include the National Assessment of Educational Progress (NAEP, the so-called “Nation’s Report Card”) and other exams given by the Department of Education’s National Center for Education Statistics (NCES). These solely evaluative exams provide nationally representative student level data (such as the “High School and Beyond” dataset) used to research and better understand the relationship between any number of student, school and even family level factors and academic achievement. While some have issued concerns that students may not exhibit much effort on such exams, thereby systematically underestimating student ability, researchers generally find that these solely evaluative exams should be strong representations of student ability (Linn, Koretz, & Baker, 1996; Baumert & Demmrich, 2001). The second type of standardized tests, those with stakes attached to their outcomes, are integral to the current policy environment emphasizing accountability and standards. The rationale behind using standardized tests as a policy level is straightforward enough: if exams Draft. Please do not cite or circulate. accurately measure student knowledge, then these scores can be used to identify, reward and possibly even duplicate success; identify, intervene or correct failure; and provide incentives for all parties to work harder and raise achievement. Unfortunately, attaching repercussions to exams can also change results and behavior in unwanted ways, including the narrowing of curriculum and even cheating (Jacob & Levitt, 2003). Finally, a third school of research questions whether scores on standardized tests provide equally valid measures of student ability for all students. A large body of social psychological literature suggests that stereotype threat –an individual’s underperformance on a task when they fear confirming a negative stereotype about their group – may cause exams to systematically underestimate the ability of students from disadvantaged groups (Steele & Aronson, 1995; Steele, 1997; Schmader, Johns, & Forbes, 2008; Walton & Spencer, 2009). These stereotype threat effects may be greatest when a student faces direct repercussions as a result of the exam (Aronson, Lustina, Good, Keough, Steele, & Brown, 1999). One recent study finds indication of such systematic bias on the California High School Exit Exam (CAHSEE). Reardon, Atteberry & Arshan found the CAHSEE had negative effects on graduation rates for low achieving students – but only for women, students of color and English Language Learners (Reardon, Atteberry, Arshan, & Kurlaender, 2009). Upon investigation, the authors found that these students were less likely to pass the “high stakes” high school exit exam on the initial administration than their observationally similar white and male peers, even when controlling for prior achievement on a “low stakes” exam used for state and federal accountability, but without direct repercussions to the students. Since the robust measures of prior achievement on the low stakes exams included in the regression should control for the knowledge component, these results suggest that the noise in this exam may not have been Draft. Please do not cite or circulate. random, but instead correlated with observable student characteristics such as race, gender and ELL status. Possibly student effort is correlated with both student subgroup and personal repercussions to the exam, leading all estimates of achievement using “low stakes” exams – including those used for purely for research and those used for accountability but without direct repercussions to the student – to systematically underestimate the performance of certain subgroups. On the other hand, stereotype threat may indicate that exams with direct repercussions for students – such as high school exit exams and college entrance exams – may implicitly set a higher performance requirement for disadvantaged students than for their more privileged peers. But while the Reardon et. al findings raise concerns, they require further exploration to be fully understood. First, there are several concerns in comparing exam scores that need to be carefully evaluated before believing these results, in particular measurement error and differences in content on the two exams being compared. Second, while stereotype threat provides a clean theoretical link between differential student underperformance for students at risk to be stereotyped, it is not the only factor that may influence test taking ability. In fact, one experimental study that told students in the “treatment” condition that the exam they took would predict performance on Florida’s high school exit exam. The authors found no difference between African American student performance in the treatment and control groups, but a large performance boost for white students in the treatment group (Kellow & Jones, 2008). Access to student-level longitudinal administrative data in three large urban districts in California – the same data used for the Reardon et. al. finding -- provides us the opportunity to further explore and understand these troubling findings. Students in California high schools take two kinds of tests, both aligned to California state standards. One, the California Standards Test Draft. Please do not cite or circulate. (CST) is given on a yearly basis and has repercussions for the school; I therefore consider this exam as “low-stakes” for students. The other, the California High School Exit Exam (CAHSEE) determines whether the student will earn a high school diploma; I therefore consider the CAHSEE “high-stakes” for students. The CAHSEE is initially administered in the 10th grade and measures content similar to the 8th-10th grade CST exams. This paper looks at the high stakes/ low stakes performance gap on both by race and gender and by gender within race. We first ask whether this high stakes/ low stakes achievement gap can be explained by differences in schools attended by students of different subgroups or by measurement error, then whether the observed patterns suggest a clear stereotype threat or effort interpretation. Framework Test scores and measurement error We start with the assumption that student i has true skill π. We cannot observe this true skill, but only estimate it with error π using test score π. (1) ππ = πΜπ = ππ + ππ Tests that fail to meet these assumptions and therefore measure something other than knowledge and random noise display “construct irrelevance” (Messick, 1989; Haladyna & Downing, 2004). Construct relevance suggests that the error term, Ο, is non-random and should therefore be broken into several parts. While Ο will still contain some random error Ο΅, there will be two additional sources of measurement error that are neither random nor part of the ability, η, the test seeks to measure: one correlated to the content measured in the exam, γ, and another correlated to the individual’s ability to demonstrate their knowledge on this particular exam, δ: (2) ππ = ππ + πΏπ + πΎπ + ππ Draft. Please do not cite or circulate. Tests may be construct irrelevant for several reasons. First, γ will be nonzero if the exam measures content other than η. Take the example of an exam that conflates reading ability with math ability. In this instance, two students with the same true ability in math (ππ΄πππ‘β = ππ΅πππ‘β ) would have different math scores (ππ΄πππ‘β ≅ ππ΅πππ‘β ) if student A had a higher reading ability than student B, (ππ΄πππ‘β > ππ΅πππ‘β ). Construct irrelevance may also conflate a student’s test taking ability or the effort they expend on the exam with their true knowledge. A student who received a great deal of test prep, is particularly relaxed under stressful circumstances (such as taking a timed exam), and makes an effort to perform well is likely to score higher than a student with the same true knowledge of the material but greater test anxiety, less familiarity with the test taking process or less willingness to expend the effort required to be successful in the exam. The relationship between group membership and an unbiased exam score An accurate, unbiased test will display measurement invariance, which requires that students’ test scores be independent of their group membership J, conditional on their ability (Wicherts, Dolan, & Hessen, 2005): (3) π ⊥ π½ | π Measurement invariance implies that, while student race or gender may be correlated with exam scores, the relationship between group membership and exam score must affect this achievement gap only through true differences in ability, not through the exam’s measurement error (see Error! Reference source not found.). Measurement invariance does not, therefore, require knowledge, η, to be orthogonal to J, merely the errorπ. Group membership such as race or gender, through some feature of society, may lead directly to lower achievement. For example, the black-white test score gap can at least partially be explained by segregation (Card & Rothstein, 2007). African American students tend to attend schools in urban areas with higher Draft. Please do not cite or circulate. percentage of poor and minority students (Logan, Oakley, & Stowell, 2008). Schools in urban areas with high concentrations of African American and poor students have a harder time recruiting and retaining teachers (Lankford, Loeb, & Wycoff, 2002). Lower ability of the teaching staff will lead to lower achievement for the (disproportionately African American) students in the school (Rivkin, Hanushek, & Kain, 2005). These different ability levels, therefore, would appropriately be reflected by achievement gaps in an accurate and unbiased standardized test, as the group membership is related to ability, not measurement error. We do, however, require that measurement error be unrelated to J. Achievement gaps between groups should therefore be entirely explained by the differential knowledge between the groups, not the different test circumstances, as reflected by Ο. That is, if the test score is composed of true knowledge and measurement error (substituting Equation (1) into Equation (3)): (4) ( ππ + πΏπ + πΎπ + ππ ) ⊥ π½ | ππ (5) (πΏπ + πΎπ ) ⊥ π½ | ππ If a test fails to meet this criterion of measurement invariance, then we know that the test is construct irrelevant in a way correlated with membership in group J. Either the test measures content other than η that one or more groups J hold less knowledge on or test circumstances prevent students from fully demonstrating their knowledge on the exam in a way the disadvantages some groups as compared to others. One example of ππΎπ½ ≅ 0 would be an exam that measures ability to do long division using word problems written in English. While the ability to construct a math equation based on written information may be part of η, English Language Learners fully capable of performing this task in their native tongue will score lower than native English speakers of the same mathematics ability. Social psychologists and Draft. Please do not cite or circulate. psychometricians alike worry that ππΏπ½ ≅ 0 due to the prevalence of stereotype threat, a social psychological phenomenon wherein individuals underperform relative to their true ability when faced with the possibility of confirming a negative stereotype about their group-- the idea that women are bad at math, or that African Americans are not academically successful (Steele & Aronson, 1995; Steele, 1997; Ferrara, 2006). Differences in tests When comparing two exams both measure η – one with high stakes for the students and one without direct repercussions for the students-- we may therefore say that, if neither exam displays construct irrelevance (and are therefore measurement invariant), that π»ππβ (6) πΈ(ππ |ππ ) = πΈ(πππΏππ€ |ππ ). As demonstrated in Equation 5, one way to check for construct irrelevance is to test for measurement invariance. If we estimate π»ππβ (7) ππ = π½πππΏππ€ + π±π + ππ , then we should fail to reject the null: (8) π»0 : π = 0 ∀ π½. Theoretically, then, we would like to believe that we would not see an achievement gap on an exam between students who have demonstrated the same ability on an earlier, similar exam. However, even setting aside the issue of differing stakes on these exams, there are several reasons we may see achievement gaps persist even when controlling for concurrent achievement, including schools attended by students of different groups and measurement error. Schools may differ in their ability or motivation to help students achieve their best possible score on an exam. Depending on the incentives, schools may act in ways to superficially raise scores without meaningful gains to student achievement. One study of accountability policy Draft. Please do not cite or circulate. in Chicago found that while schools raised elementary school students’ math scores on tests with stakes for the school, these gains were not reflected in exams measuring similar content when the exam had no stakes for the school. Evidence pointed to teachers emphasizing specific skills or questions they knew would be on the high stakes test, as compared to the low stakes test and greater student effort on the test with meaningful implications for the school (Jacob, 2005). There is evidence that some teachers are willing to go so far as to cheat to help students score better on an exam than they are capable (Jacob & Levitt, 2003). Furthermore, school segregation is well documented. A study analyzing 6 of the largest districts in California (including two of the three used in this paper) in the 1988-1989 school year found that “approximately half of all high school students would need to change schools for the racial and ethnic composition of the high schools to reflect the racial and ethnic composition of the state (Rumberger & Willms, 1992, p. 378).” There is evidence that test prep does not happen equally at all schools; one study found that teachers at low SES schools with high proportions of minorities were more likely to teach narrowly to items they thought would be easiest to get kids to show improvement on in elementary schools in California (Shen, 2008). It may be, therefore, that students from more disadvantaged groups may perform better on one test than the other if their schools put more emphasis on preparing them for one as compared to the other. Secondly, measurement error in πππΏππ€ may bias our estimates of χ towards the unconditional achievement gap between groups. If we assume that the measurement error in πππΏππ€ is consistent throughout the distribution of scores and uncorrelated with group membership, then the attenuation bias in β may therefore cause us to conflate regression to the group conditional mean with an achievement gap on the two exams that mimics the gap in measured average ability between the groups (Rothstein, 2010). Unfortunately, most Draft. Please do not cite or circulate. standardized achievement tests do not have consistent measurement error throughout the scores and therefore violate the classical errors in variance assumptions required to make a prediction of the direction of bias in χ. Instead, standardized tests tend to measure scores in the center and near cut scores with more accuracy than scores at the top and bottom of the scale. We therefore worry that χ may not reflect a high stakes/ low stakes achievement gap, but instead may either be a reflection of schools attended by the students, or measurement error causing bias in the results. If there is a remaining high stakes/ low stakes achievement gap, there may be several explanations. We discuss anxiety and effort, two likely sources of differential performance, and their connections to group membership, below Anxiety and stereotype threat Stereotype threat neatly connects anxiety in a high stakes testing circumstance to group membership. Students who experience stereotype threat underperform relative to their true ability when faced with the possibility of confirming a negative stereotype about their group-the idea that women are bad at math, or that African Americans are not academically successful (Steele & Aronson, 1995; Steele, 1997). Stereotype threat is, essentially, an additional source of test anxiety that impairs the working memory of an individual faced with a misalignment between their sense of self, sense of group, and sense of ability (Schmader, Johns, & Forbes, 2008). Stereotype threat seems to have the greatest impact when there are acute repercussions to students’ performance (Aronson, Lustina, Good, Keough, Steele, & Brown, 1999). Steele and Aronson (1995) first advanced stereotype threat in their study of Stanford undergrads, claiming that people who feel at risk to be negatively stereotyped will perform worse for fear of confirming or being judged by negative stereotypes associated with their groups. They found that black students primed to think a difficult exam was a measure of their ability Draft. Please do not cite or circulate. performed worse than white students with similar SAT scores in the same condition. In the control group black and white students of similar SAT scores performed similarly well. There was evidence to show that black students primed with negative stereotypes showed more anxiety, less accuracy and performed more slowly. Subsequent research has found similar results in women with math exams (Brown & Josephs, 1999; Gonzales, Blanton, & Williams, 2002; O'Brien & Crandall, 2003; Shih, Pittinsky, & Ambady, 1999; Spencer, Steele, & Quinn, 1999), the elderly with tests of memory (Hess, Auman, Colcombe, & Rahhal, 2003), and one study on Latinos and verbal exams (Gonzales, Blanton, & Williams, 2002). Moreover, stereotype threat can be induced in groups that are not normally at risk -- such as when a group of white men who excel at math were primed with the stereotype that Asians are better at math (Aronson, Lustina, Good, Keough, Steele, & Brown, 1999). Furthermore, individuals may benefit from “stereotype lift” and perform better than their typical ability when operating under the assumption that their group benefits from a positive stereotype (Walton & Cohen, 2002). The negative effects of stereotype threat tend to be strongest for those with the highest "domain identification,” (Aronson, Lustina, Good, Keough, Steele, & Brown, 1999; Steele, 1997; Keller, 2002) that is, those to whom the exam is most important. Steele, in particular, would argue that the domain identification is a measure of how important the skill being tested is to an at-risk student’s self-perception. Aronson et. al (1999) argue that domain identification may, more broadly, indicate that stereotype threat will be activated when it has repercussions for the at-risk student. They give the example of a woman taking the GRE to apply for graduate school in Art History. This student likely has little self-perception at stake in the math GRE, but doing poorly could have direct repercussions: “It therefore may be more correct to say that high motivation—a sense that something important is at stake—is the necessary factor in stereotype Draft. Please do not cite or circulate. threat, not high identification per se (Aronson, Lustina, Good, Keough, Steele, & Brown, 1999, p. 43).” This concern over the repercussions of an exam has direct relevance to our study, where we see that at-risk students under-perform on a high stakes exam, as compared to an exam measuring similar content but the results of which only indirectly impact the students. While stereotype threat has been studied extensively, there is less direct evidence of stereotype threat outside of laboratory settings; experiments outside of the laboratory have – for obvious ethnical reasons – focused on interventions designed to mitigate the effects of stereotype threat. One experimental field study found that asking students to report their gender and ethnicity did not affect test scores on the AP Calculus AB exam or the Computerized Placement Test (used for placement primarily at community colleges) (Stricker & Ward, 2004). The authors concede, however, that the students may not have needed the treatment to activate stereotype threat on these tests, which already carry significant consequences and may threaten at-risk identities. Additionally, a subsequent re-examination of the data from this study argues for a relaxation of the authors’ fairly stringent statistical requirements; doing so produces a finding that is both statistically and, the authors argue, practically significant (Danaher & Crandall, 2008). Instead, most field work on stereotype threat has demonstrated that interventions targeted towards alleviating it can be successful in raising the academic performance of at-risk students. In fact, a number of studies have found that techniques such as mentoring or values affirmation can significantly reduce the effects of stereotype threat (Good, Aronson, & Inzlicht, 2003; Walton & Spencer, 2009; Cohen, Garcia, Purdie-Vaughns, Apfel, & Brzustoski, 2009). Taking the example of the high school exit exam, a female student may think of herself as a weak math student – a perception in line with society’s stereotypes that women struggle with mathematics as compared to men. While this perception of negative ability may dissuade her Draft. Please do not cite or circulate. from pursuing a math and science career, it should not impact her working memory and therefore her ability to perform well on a math exam -- unless the exam becomes important to her sense of self. The high stakes of the exit exam – making the safe assumption that she values a high school diploma -- will therefore misalign to her weaker sense of her own and her group’s ability. This misalignment will cause anxiety and distraction while she takes the exam, leading to underperformance as compared to an exam where she did not feel any pressure to perform well. If stereotype threat is causing a high stakes/ low stakes achievement gap, we would therefore expect women to outperform men on the ELA section and underperform on the math section. Students of color will most likely underperform on both sections, though Asian students should over perform on the mathematics section and women of color may or may not underperform on the ELA section. As mentioned in the introduction, one experiment tested just such a condition and found the opposite of this hypothesis (Kellow & Jones, 2008). Two groups of Florida 9th graders took the same exam, but only the treatment group believed it to predict performance on the 10th grade FCAT, which is used as a high school exit exam in Florida. White students performed higher in the treatment than control condition, whereas African American students performed similarly in both conditions. The authors proposed that stereotype lift may have caused the observed pattern – that linking the high stakes to the exam caused higher performance for white students. Their survey data, however, indicates that they may not have adequately lifted the source stereotype threat for students; African American students in both conditions scored higher on measures of anxiety and anticipated stereotype threat than white students, with no race by treatment interaction effect. These results (and by extension the high stakes/ low stakes gaps we observe) Draft. Please do not cite or circulate. may therefore be caused not by differential ability to perform given the test circumstances, but instead by differential effort by students in the different test conditions. Effort One other explanation for a high stakes/ low stakes performance gap would be differential effort put forth by students on the different exams. Social psychological theory suggests that effort may differ according to a student’s perception of the “stakes” of the test. Several experiments have attempted to pay students for performance on standardized tests to change their outcomes. Social psychological literature suggest that the distinction that we made between what qualifies a test as “high” or “low” stakes may not be perceived in the same way by all students. In particular, conceiving of a test as having “high stakes” for an individual and discounting the importance of the “low stakes” exam research may be a particularly Western conception. Marcus and Kitayama, summarizing a large field of research on this area, argue that Westerners, in particular Americans and males, tend to hold an independent view of the self, one that places primacy on fixed individual characteristics (1991). In contrast, East Asians are more likely to hold an interdependent view of the self, which places a greater importance on relationships, especially in group members with whom one shares a common fate. They link this view of self to motivation and effort. An independent person would see only the high school exit exam as high stakes, since the repercussions for the exam do not fall directly to them and may not work as hard as they would if the exam were personally meaningful. Interdependent people, however, would see the fate of their school as more closely linked to their own fate, and might therefore put forth more effort on an exam that had stakes for the school. When comparing tendencies of American racial and ethnic groups to act collectively, one study found that Asian and African Draft. Please do not cite or circulate. Americans scored higher on a scale of collectivism. These results seem driven largely by higher scores for men (Coon & Kemmelmeier, 2001). Attempts to pay students to motivate harder work have shown mixed results. A series of experiments in the United States found significant positive impacts on student test scores when paying students for educational inputs, such as reading books or turning in homework, with the effects concentrated among male, African American and Hispanic students; trials that offered money to students for educational outputs such as grades or test scores showed no or negative effects (Fryer, 2010). Another cash incentive program in a low income, primarily white, Ohio community found an increase of .15 sd when paying elementary school students to pass the state standardized tests, though there was no effect on the reading, social science or science exam scores (Bettinger, 2010). This experiment showed no significant differences in effects by gender. A third study, investigating the possibility that students were not putting forth enough effort on a low stakes test (NAEP) offered students $1 per correct answer, with no effects, though the authors did not examine the results separately by race or gender (Linn, Koretz, & Baker, 1996). While differential student effort therefore seems to be an unlikely source of a high stakes/ low stakes achievement gap, there is some theoretical reason to believe that perhaps women, white and Hispanic students may not put forth the same effort on state sponsored exams as male, Asian and African American students. Data, measurements and descriptive statistics Data We use longitudinal, administrative, student-level data from three of California’s largest school districts. The analytic sample includes 41,290 students in the graduating classes of 20062011 who were subject to a high school exit exam policy. Students in California high schools Draft. Please do not cite or circulate. take two kinds of tests, both aligned to California state standards. One, the California Standards Test (CST) is given on a yearly basis and has repercussions for the school; we therefore consider this exam as “low-stakes” for students. The other, the California High School Exit Exam (CAHSEE) determines whether the student will earn a high school diploma; we therefore consider the CAHSEE “high-stakes” for students. The CAHSEE is initially administered in the 10th grade and measures content similar to the 8th-10th grade CST exams. These exams and the content overlap between them will be discussed in more detail below. We exclude from our analyses students classified as special education students (roughly 10 percent of students), because these students were not subject to the CAHSEE requirement in most of the years covered by our analyses. For ease of interpretation, I standardize all exam outcome scores by sample’s within-cohort mean and standard deviation. The administrative data also provides demographic and academic covariates such as race/ ethnicity, gender, English Language Learner classification (ELL) and free, reduced price lunch eligibility and school attended. The California Standards Test and the California High School Exit Exam In the Spring of 10th grade, students in California take two exams covering material aligned to the California State Standards in mathematics and English Language Arts: the California Standards Test (CST) and the California High School Exit Exam (CAHSEE). While these two exams cover similar materials, their stakes for students are sharply different. While the CST may, in some cases, be used as one of many factors determining class placement, the CAHSEE’s role in students’ future is distinct and very clear. If students do not pass both sections of the exam, they will not earn a high school diploma. This section gives some background on both exams, and then discusses similarities and differences between the two exams in: stakes for students and schools; measurement error; and content. Draft. Please do not cite or circulate. California schools administer the California Standards Test (CST) to students in grades 2-11 annually as part of the state’s Standardized Testing and Reporting (STAR) program, designed to evaluate student achievement on the state’s content standards in four areas – English Language Arts (ELA), mathematics, history/ social sciences and science. Of interest to our study are the exams for ELA and math. The ELA exam is aligned to grade level content standards for all grades. The math exam is aligned to grade level content standards through the 7th grade, and then becomes an “end of year” exam aligned to the class a student has taken during the school year. Both exams are multiple choice, though students have an additional writing assessment as part of the 4th and 7th grade ELA exam, which is administered on a separate date. The writing assessment is given in March; the multiple choice exams may be administered between March and May. Students receive a scale score between 150-600. The scale score places students in one of five categories: advanced, proficient, basic, below basic and far below basic. In lieu of the CSTs, some special education students take the California Modified Assessment (CMA) and some Spanish speaking ELL students take the Standards-Based Test in Spanish (STS) in lieu of or in addition to the CSTs for ELA and Mathematics. The California High School Exit Exam has two sections: English Language Arts (ELA) and mathematics. The math section gives a multiple choice exam covering the California math content standards for sixth and seventh grade and a small amount of Algebra 1. The ELA section covers state content standards from 8th-10th grades and utilizes a multiple-choice format along with one essay. Both tests are administered in English, regardless of a student’s primary language, though English Language Learners are permitted to use the same modifications they use on a daily basis in the classroom. Exams are scored on a scale of 275-450 and students must score a minimum of 350 on each section to earn a high school diploma. Students scoring above Draft. Please do not cite or circulate. the passing score may also be classified as Proficient or Advanced, with a score of 380 indicating proficiency on both exams, though the proficiency score is set higher for math than ELA – roughly 420, as compared to 422. The CAHSEE is first administered to students in the spring of tenth grade, and students have at least five subsequent opportunities to retake the sections they have not yet passed (twice in eleventh grade and twelfth grade, and at least once following the end of the twelfth grade school year). Both exams are used for state and federal accountability. For state accountability, the ELA and math CSTs contribute roughly 27% and 18% of a high school’s Academic Performance Index (API), respectively. For Federal accountability, these two exams also contribute to calculating a school’s Annual Yearly Progress (AYP) through their API score (California uses a minimum API score – 680 in 2009-2010-- or one point growth on API as their fourth indicator to make AYP). The CAHSEE passing rate contributes 9% for each section to a high school’s API. For AYP, the CASHEE is used to meet both proficiency and graduation rates. Schools should therefore have no incentives to prepare students for one exam at the expense of the other; if anything, the alignment of incentives should encourage schools to emphasize the material and skills common to both exams. Students, on the other hand, are less likely to link their CST scores to immediate or obvious repercussions as compared to their CAHSEE scores. The CSTs are only used in a few instances, including class placement, reclassification for English Language Learners and eligibility for gifted programs in early elementary school and eligibility for a small number of magnet high schools. Importantly, in each of these instances, the test is one of many factors in determining a school’s decision and the role the CST plays is fluid, with no minimum score required and publicized (the exception being in early elementary school, where eligibility for the Draft. Please do not cite or circulate. gifted program typically has a minimum score or percentile as one requirement). Students should rarely, therefore, feel compelling personal stakes in their performance on the CSTs. The CAHSEE, on the other hand, provides a clear cut score and personal repercussions to falling below the score: failure to graduate. The two exams share much in common. Both are written by the Educational Testing Service (ETS). The exams are untimed, allow for the same accommodations for English Language Learners and are rigorously checked for cultural bias in the test content by ETS using Differential Item Functioning (DIF). The CST gets a second check for cultural bias by the Human Resources Research Organization (HumROO), which the State of California contracted as an independent evaluator of the exam. As mentioned above, both sections on both tests are aligned to state course or grade level standards. California Content Standards for English Language Arts consist of the same five content substrands in each grade: Word Analysis, Reading Comprehension, Literary Response and Analysis, Writing Strategies and Writing Conventions. Each grade’s CST includes a multiple choice assessment of grade level material for each content substrand. The 4th and 7th grade CSTs have an additional writing assignment, given on a separate day from the multiple choice exam. The ELA CAHSEE exam consists of 45 multiple choice questions and a single writing assignment. The multiple choice questions primarily cover 9th and 10th grade state standards, with a small number of 8th grade standards included. Both exams place assign roughly 50% of points on writing and 50% on the other three content substrands. Due to the writing assessment offered in the CAHSEE, the exit exam’s multiple choice section places comparatively less weight on the two writing substrands than does the CST. Draft. Please do not cite or circulate. The comparison of material on the math exams is less straightforward, given the end of course nature of the CST exam for students in the 8-10th grades. In the 2nd-7th grades, CST math exams are aligned to the same five content substrands: Number Sense; Algebra and Functions; Measurement and Geometry; and Mathematical Reasoning (though this final substrand is tested in ways “embedded” in the other four content substrands). Seventy-five percent of the CASHEE’s math standards are aligned to grade 7 content standards, with 10% coming from the Statistics, Data Analysis, and Probability strand of 6th grade standards and 15% coming from Algebra I. In the 8th -10th grades, most non-Special Education students take some combination of the General Math, Algebra I, Geometry and Algebra II exams (less than 5% of students take any other exam in a given grade). The General Math Exam (typically taken in the 8th grade) aligns closely to the material tested in the CASHEE, as it primarily covers material from 7th grade content standards, with a few questions from the Statistics, Data Analysis, and Probability strand of 6th grade standards. As mentioned above, measurement error in exams may play a role in biasing our estimates in unknown directions, especially if the error varies by subgroup or throughout the spectrum of exam scores. ETS, which writes both exams, provides technical documentation for both exams, but the precise information they provide on each varies slightly. Table 1 offers a summary of the Standard Errors of Measurement (SEMs) for different subgroups and the Conditional Standard Errors of Measurement (CSEMs) at different cut points. These estimates were taken from the ETS Technical Reports for the 2008-2009 CAHSEE (the most recent year available online, posted in April 2010) and corresponding year for the CST. The CAHSEE technical reports offer the same estimates for each exam administration; I presented the estimates for the March 2009 administration, which had the largest number of students taking the exam Draft. Please do not cite or circulate. (N=374, 364), though results look similar in other administrations. The table confirms that measurement error tends to be slightly larger for higher achieving groups. Female, white, Asian, non-English Language Learner and non-Special Education students have slightly lower SEMs than male, African American, Hispanic, English Language Learner and Special Education students, though with the exception of the ELL/ non-ELL distinction, these differences are generally quite small. Additionally, these SEMs are provided for the entire test taking sample; our sample excludes Special Education students, who have by far the largest SEMs. To the extent that Special Education classification is correlated with subgroup, this should reduce the already small differences in measurement accuracy between the groups. Of greater concern is the difference in accuracy by achievement level. Both exams become less accurate near the top of the distribution, with the CAHSEE in particular rising in measurement error for students at the Advanced level. Descriptive Statistics I begin with 100,910 students in the classes scheduled to graduate between 2006-2011 in three districts. As mentioned earlier, I exclude the 11,474 students who have ever been classified as Special Education, as Special Education students are not required to pass the CASHEE, though these students generally take the exam for state and federal accountability purposes. A small number are dropped for missing ELL status (322) and, to simplify the analysis, I drop students whose race/ ethnicity is listed as Native American or “unknown” (2,956). We lose a large number – 26,693 – by dropping students missing a CAHSEE score in the spring of 10th grade or a CST score or math exam indicator between the 8th-10th grade. The majority of these missing scores are missing in the 8th and/or 9th grade, indicating that our sample is more representative of students who remain in the district for three years, as opposed to those who Draft. Please do not cite or circulate. transfer in for high school. As the accountability systems include indicators for percent of students taking the exams, schools have no incentive to prevent low achieving students from taking a standardized test. We are left with an analytic sample of 59,455. Table 2 provides basic descriptive statistics for the sample. Overall, roughly half of the sample is female, 15% ELL and 70% eligible for Free and Reduced Lunch. Hispanic and Asian students make up the largest subgroups with roughly 24,500 and 18,500 respectively, with white and African American students contributing around 9,500 and 7,000, respectively. Female students are less likely to be classified as ELL than their male peers, and outscore men by roughly .2 standard deviations on both the 10th grade ELA CASHEE and CST exams, though men outscore women by about .1 standard deviation on the CAHSEE math exam. White students are the highest achieving racial/ ethnic group, scoring roughly .6 standard deviations above their cohorts mean on the 10th grade ELA exams and .45 standard deviations above their cohorts mean on the math CAHSEE exam. White students are also substantially less likely to receive free/ reduced price lunch than other students and to be classified as an ELL student, as compared to Asian and Hispanic students. Asian students are the next highest performing group, scoring.23 above the cohort mean on the 10th grade ELA CST exam and .16 above the mean on the ELA CAHSEE. Asian student’s math scores are comparable to white students’ score on the CAHSEE math, scoring .41 above the mean. Asian students are, however, both more likely to be ELL than their black and white peers and more likely to be eligible for free/ reduced lunch than their white peers. Hispanic and black students are the lowest achieving racial/ ethnic subgroups, falling between .25 and .44 below the cohort mean on each exam, though a larger portion of Hispanic students are classified as ELL as compared to black students, both groups are more likely to be eligible for free/ reduced price lunch than their white peers. Students eligible for free/ reduced Draft. Please do not cite or circulate. price lunch score below the mean in all exams, with the third of the sample not eligible for free lunch scoring roughly half a standard deviation above the mean on each exam. Students classified as English Language Learners are by far the lowest performing subgroup, scoring 1.1 standard deviations below the mean on the 10th grade ELA CST and the CAHSEE math, and 1.5 standard deviations below the mean on the CAHSEE ELA exam. Figures 3a and 3b provide kernel density plots of 10th grade ELA and Math CASHEE scores by racial/ ethnic group, with lines demarcating the scores at which a student is classified as basic (a passing score on the exam), proficient and advanced. It demonstrates both the differential performance by race and the ceiling effects regarding the CAHSEE. White and Asian students’ distributions are shifted to the right of the graph, with large bumps in the upper tail of the CAHSEE for these two groups. There are therefore a much larger percentage of white and Asian students measured with more error at the top of the distribution and a very small number of white students at the lower tail end of the distribution. Table 3 provides descriptive statistics that break down the mean ELA and math CAHSEE scores by race and gender. The first column for each score is the mean for the full sample, the second column is the mean score for students who are non-ELL and non-Free Lunch eligible. Reducing the sample to non-ELL and non-Free Lunch students reduces the sample size by about 25% for white students, 70% for Asian students, 86% for Hispanic students and 75% for black students. Limiting the sample to more privileged students also increases the mean achievement for each racial and gender group, with white students’ means improving by about .15 standard deviation on both exams, and non-white students’ means increasing between .3 and .5 standard deviations on each exam. The most dramatic improvements in mean scores come in the Hispanic students, on both the ELA and Math section. Once we limit the sample to non-ELL, non-Free Draft. Please do not cite or circulate. Lunch students, white women continue to have the highest ELA score (.918), though Asian women outperform white men (.820 as compared to .666); Asian men and women have the highest mean scores on the math exam (.798 and .801), with white males trailing at .658. The only groups whose more privileged samples fall below the overall cohort means are black male on ELA (-.105) and Hispanic and black women on math (-.076 and -.223). Analysis I use student level longitudinal administrative data from three large, urban districts in California to examine a high stakes/low stakes test gap between groups of students, in particular girls, students of color and English Language Learners in comparison to boys, whites, and nonEnglish Language Learners and the interaction of race and gender in these effects. I begin with a model predicting students’ CASHEE scores based on both their prior achievement and group membership. Following these main results, I differences between schools in preparing students for the exam, statistical concerns in comparing the exam scores, and differential content between the two exams. Basic Model and Results I begin by estimating the high stakes CAHSEE score Y on exam E (math or ELA) if student i in cohort c in district d as a function of their low stakes achievement CST scores. π πΈ ππππ = ππππ’ππ π© + π πππ π + πΌππ + ππ In the above equation Β is a vector of coefficients on the prior achievement proxies – the 8th, 9th and 10th grade CST scores (as well as their quadratic and cubic terms) for the relevant exam. I determined quadratic and cubic terms by comparing the fit of models using linear terms and adding polynomial terms through the quintic term. While the quadratic and cubic terms were consistently statistically significant, the quartic and quintic terms were only occasionally Draft. Please do not cite or circulate. statistically significant, depending on the covariates included and they did not tend to increase the model fit, as measured by R2. Finally, the cubic term allows for a flexible enough functional form that this basic model only predicts students to score under or over the possible range for CAHSEE scores for .06% of cases. Models estimating the CAHSEE math score will also include indicator variables for the actual end-of-course math exam taken. Β is a vector of indicator variables for group membership for women, students of color, ELL students and free/ reduced lunch eligibility; all models are run both with and without race by gender interactions. Because our data come from multiple school districts and cohorts of students, I include district-by-cohort fixed effects πΆππ . The outcome of interest is the vector π½, which describes the differences in average differences in high- and low-stakes standardized test scores between the subgroups defined by πΏ, in this case a vector of dummies indicating ELL status, free/ reduced lunch eligibility and raceby-gender dummies. The subgroups for which π½ is negative and statistically significant demonstrate a subgroup achievement gap between predicted performance based on prior (and concurrent) low stakes test scores and the high stakes CAHSEE in comparison to that of the reference group – white, male, non-ELL, non-Free/ Reduced Lunch students with an average score on the 8th, 9th and 10th grade CST. I estimate the predicted CAHSEE scores both by math and ELA exam for the entire sample, then separately by prior achievement level. I define prior achievement level as the student’s proficiency level on the 9th grade ELA CST exam. I use the ELA proficiency score in the 9th grade both because it loosely aligns to proficiency scores on the ELA CAHSEE and to avoid issues either of selecting on the dependant variable or of selection bias caused by the different end of course math exams taken by students. Subdividing students by prior achievement Draft. Please do not cite or circulate. provides several advantages. First, as the distribution of students in the reference group (white, male, non-ELL, non-Free Lunch) is disproportionately high in comparison to that of their nonwhite, ELL and free lunch eligible classmates, subdividing by prior achievement prevents us from extrapolating beyond the region of common support for these students. Second, focusing on students around the middle of the distribution will provide us with the most accurate test scores. Finally, comparing the results across the distribution of ability may provide us some insight as to whether the high-stakes/ low stakes gap varies by student ability. Figures 4a & 4b describe the distribution of ELA and math CAHSEE scores by 9th grade ELA proficiency level. The “Far Below Basic” group is the smallest, with a wide distribution centered towards the bottom of the scores. The “Advanced” students have a long tail, but are clustered disproportionately at the top of the score distribution, where test scores contain the most measurement error. The middle three distributions, on the other hand, both contain a large sample size and are centered between 350 and 400, where scores are measured more accurately. These three groups should, therefore, provide us with the most reliable results. Main Results Results for ELA can be seen in Table 4. Once conditioning on the same prior and concurrent achievement, race/ ethnicity, and ELL and Free/ reduced price lunch status, women score .054 above men on the ELA CAHSEE. When looking at the estimates by prior achievement, it appears that this gender gap is driven by Basic, Proficient and Advanced students; these coefficients are positive and statistically significant, whereas the coefficient on gender is negative and statistically significant for the Far Below Basic students and essentially zero for Below Basic students. Hispanic and Asian students underperform by about a third of a standard deviation, with these results driven primarily by students in the middle of the Draft. Please do not cite or circulate. distribution. Overall, black students underperform observationally similar white students by .86 standard deviations, with these results being driven by low achieving students; coefficients on black are negative for all five achievement groups, but the point estimate becomes smaller as the achievement group increases, with the coefficient on the Advanced group the smallest and not statistically different from zero, despite having a comparable standard error to the other groups. The second set of models in Table 4 presents race by gender interactions for the CASHEE ELA. Gender patterns within race look similar to the models that do not include race by gender interactions. For each subgroup, women outperform observationally similar men of the same race/ ethnicity on the CASHEE ELA and for each of these groups, this outperformance seems driven by higher achieving students, with race by gender coefficients being negative and statistically significant only for the top three achievement groups. The race by gender coefficients are negative for the Far Below Basic sample for women of all four groups (though this coefficient is statistically significant only for Hispanic women). For the Below Basic sample, the gender coefficients are negative and statistically insignificant for black and Hispanic women and positive but statistically insignificant for white and Asian women. The results of the F-test for the gender coefficients are less clear. When analyzing the full sample, we can reject the hypothesis that the race by gender coefficients are all equal, despite the fact that they are all positive. Post hoc t-tests (not shown for any post hoc t-tests) indicate that the race by gender interaction coefficients for white and Asian women are similar to each other, but different from those of black and Hispanic women, who are more similar to each other. When we look within prior ability samples, however, the relationship between race and gender looks more similar for each racial group – we can only reject the null that the race by gender interactions are Draft. Please do not cite or circulate. the same for the “Proficient” group, and post hoc t-tests indicate that the only difference is between Asian and Hispanic women. A lack of compelling difference in these interaction terms should not be confused with women of different races having similar high stakes/ low stakes gaps, however. For both the overall sample and the middle three proficiency categories, post-hoc t-tests indicate that white women generally outperform observationally similar women of other races on the high stakes test. Hispanic and Asian women also tend to outperform black women. The interaction models indicate that, for the overall sample, Asian and black men continue to underperform relative to their white male counterparts, with these differences driven by the lower middle part of the distribution. When using the overall sample, Hispanic men do not underperform relative to observationally similar white men, though their underperformance is statistically significant in the middle achievement samples. Post hoc t-tests indicate that Hispanic men tend to underperform less than black men. For math, men outperform similar women by roughly .1, with the high stakes/ low stakes performance gap increasing by about a third within students of the middle three proficiency categories. Asian and Hispanic students exhibit similar patterns of underperformance on the CAHSEE, despite very different overall performance levels on the exam. On average, as seen in the descriptive Table 3, Asian students outscore Hispanic students by about .35 standard deviations – an achievement gap that jumps to about .8 when limiting the sample to non-ELL, non-Free lunch students. In these models, however, we see that both groups underperform observationally similar white students by about .05 standard deviations, with underperformance being strongest in the Below Basic and Basic category for both groups. Post hoc t-tests suggest that Asian students exhibit less underperformance that Hispanic students for the two lowest Draft. Please do not cite or circulate. achieving groups, but point to similar underperformance on the three highest achieving groups. Black students exhibit the most underperformance on the math exam, however, with a coefficient of -.14. Their underperformance is significant for all five performance levels and, much like Asian and Hispanic students, estimates are highest in the Below Basic and Basic categories. Post hoc t-tests indicate that their underperformance is significantly greater than that of all three other racial groups. Similarly to the ELA results, race by gender coefficients indicate that the relationship between race, gender and underperformance on the CAHSEE are similar with each racial group. There is some indication that high achieving Asian women exhibit less underperformance relative to Asian men than white or Hispanic women in relation to observationally similar men, but given both the inaccuracy of the test at the high end and the number of t-tests performed, these results are not terribly compelling. Looking at the overall performance of students of different genders and races, however, a more interesting pattern emerges when looking at these results. Specifically, post hoc t-tests indicate that Hispanic and Asian women typically exhibit differential underperformance from each other, though Hispanic and Asian Males generally exhibit similar levels of underperformance – so while the gender relationship is similar within race, the observed Hispanic – Asian difference in the earlier model was driven by a differential gender gap. Threat of school differences While there is some reason to believe that schools could play an intervening role in student performance, it is more difficult to believe that these particular effects are driven solely by differential school behavior. First, school segregation is unlikely to drive the gender results. Second, as discussed above, schools have incentives to prepare students for both exams. Draft. Please do not cite or circulate. Regardless, to eliminate the possibility that a high stakes/ low stakes achievement gap may be driven by differences in high schools attended by students, I will add school by cohort fixed effects πΆππ and estimate: πΈ π πππ π = ππππ’π¬π π© + π πππ π + πΌπ π + ππ Our coefficient of interest θ would now represent the achievement gap between groups for students with the same levels of prior achievement who attend the same school. If θ remains negative and statistically significant then between-school differences cannot be responsible for the observed high stakes/ low stakes test gap, though these school fixed effects do not eliminate the possibility that the high stakes/ low stakes achievement gap varies between schools. Tables 6 and 7 provide the estimates controlling for school by cohort fixed effects predicting the ELA and Math CAHSEE. The point estimates change very little when controlling for school effects in either math or ELA. Measurement error I next address the concern that these results are driven by regression to the mean. As noted above, regression to the mean may cause estimates of θ to be artificially high by conflating measurement error with group membership. I use two techniques to deal with the threat of measurement error: shrunken estimates and differences. First, Tables 8 and 9 present estimates using 10th Grade CST test scores that have been shrunken to their estimated score. As with all other models, the outcome variable is the ELA and Math CAHSEE score, standardized by cohort. Controls other than those shown include "shrunken" 10th grade Math CST scores, as well as the quadratic and cubic versions of these shrunken scores. The shrunken scores are a weighted average of the students observed 10th grade score (or that score squared or cubed) and their predicted score. The predicted score is Draft. Please do not cite or circulate. calculated by regressing the 10th grade Math CST test scores (or those scores squared or cubed) on 8th & 9th grade test scores (and those scores squared and cubed), dummies for math CST taken in the 8th-10th grade, the demographic controls to be included in the model (ell, free lunch, race and gender or race by gender) and district by cohort fixed effects. The "shrunken" score weights the observed score by .85 and the predicted score by .15, arrived on by averaging the published test reliability of roughly .94 and the observed R2 of roughly .75 when predicting the 10th grade CST scores from the 8th and 9th grade scores. Students are reclassified into predicted "shrunken" proficiency categories based on where the shrunken (linear) ELA score would place a student according to the actual cut scores for that cohort. Estimates from these models change very little from the OLS models. Finally, I move to a Two Stage Least Squares model. I use students’ 8th and 9th grade CST scores (and, in the case of the CASHEE math, the math exam taken in each grade) to instrument for the 10th grade CST score, thus yielding estimates of π that are not biased by the measurement error. The results for these models are presented in Table 10. The overall pattern for these results looks fairly similar to the OLS results, with a few notable differences. The over performance by women in ELA appears, in these models, to be driven exclusively by white and Asian women. White women no longer appear to underperform on the math section, though the point estimate is negative. Hispanic men do not display a high stakes/ low stakes achievement gap on either section of the exam, though the interaction term on Hispanic women is significant and negative for the math exam. African American and Asian students continue to underperform on both sections of the exam, with the gender interaction term being negative for both groups on the math exam, and positive for women on the ELA exam. Draft. Please do not cite or circulate. Discussion Overall, these results are suggestive, though not conclusive, of a differential underperformance on the high stakes high school exit exam by race and gender, dependant on the stakes attached to an exam. Women tend to perform higher than their low stakes test scores would suggest on the ELA, but lower on the math – a result consistent with a stereotype threat interpretation. Asian men’s underperformance on the math exam, however, makes a stereotype threat interpretation more difficult. On the other hand, Asian and African American students underperform on the Math section, with men underperforming on the ELA section, as well, a story more in line with differential effort due to a collectivist culture. This interpretation is problematic, as well -- why would women exert differential effort on exams of different subjects? These puzzling results raise doubt about the accuracy and unbiasedness of the exams used for so many important purposes. While the effect sizes are small, a difference of only a few points may prevent a student from gaining a high school diploma, entrance to the college of their choice, or a scholarship. Educators and policymakers should therefore use caution when considering test scores as part of an accountability system, lest small differences in bias unfairly exacerbate greater inequality in educational opportunity. But, moreover, they call for further investigation to understand the source of differential student performance given the stakes of the exam. Effort may be hard to counteract, given the weak findings from the Fryer and Bettinger experiments paying American children for high test scores (Bettinger, 2010; Fryer, 2010). Experiments aimed at lifting the effects of stereotype threat, on the other hand, have had more success; a series of short self –affirmation writing assignments lifted the GPAs of African American students in the treatment group by .4 over the course of 2 years (Cohen, Garcia, Purdie-Vaughns, Apfel, & Brzustoski, 2009). A clearer Draft. Please do not cite or circulate. understanding of the causes of these high stakes/ low stakes gaps would help guide interventions to give us the most accurate test results possible. Draft. Please do not cite or circulate. Figure 1: The presumed relationship between group membership and test scores Race School or societal features Lower achievement Lower test scores Figure 2: The presumed relationship between high and low stakes test scores Race School or societal features Lower achievement Lower low stakes test scores Lower high stakes test scores Draft. Please do not cite or circulate. Figure 3a: Distribution of Racial/ Ethnic Groups on ELA CAHSEE 0 100 200 300 Lines: 'Pass/ Basic=350' 'Proficient~380' 'Advanced ~400' 250 300 350 ELA CAHSEE Score White Hispanic 400 450 Asian Black Figure 3b: Distribution of Racial/ Ethnic Groups on Math CAHSEE 0 100 200 300 Lines: 'Pass/ Basic=350' 'Proficient~380' 'Advanced ~420' 250 300 350 Math CAHSEE Score White Hispanic 400 Asian Black 450 Draft. Please do not cite or circulate. Figure 4a: Distribution of ELA CAHSEE 0 100 200 300 400 By 9th Grade ELA Proficieny Levels Lines: 'Pass/ Basic=350' 'Proficient~380' 'Advanced ~400' 250 300 350 ELA CAHSEE Score Far Below Basic Basic Advanced 400 450 Below Basic Proficient Figure 4b: Distribution of Math CAHSEE 0 100 200 300 400 By 9th Grade ELA Proficieny Levels Lines: 'Pass/ Basic=350' 'Proficient~380' 'Advanced ~420' 250 300 350 Math CAHSEE Score Far Below Basic Basic Advanced 400 Below Basic Proficient 450 Draft. Please do not cite or circulate. Table 1: Measurement Error in the CASHEE and CST ELA CAHSEE MC Only MC + Essay OVERALL Below Basic Basic/ Pass Proficient Advanced Male Female 3.39 3.26 ELA CST 8th 9th 10th Grade Grade Grade 3.77 3.75 3.58 CSEMs at Cut Points 15 8 14 8 14 12 17 15 13 14 17 13 13 14 17 SEMs for Subgroups 4.06 3.88 3.86 3.57 3.84 3.59 3.69 3.72 American Indian3.36 Asian 3.06 Pacific Islander3.39 Filipino 3.13 Hispanic 3.51 African American 3.53 White 2.96 3.99 3.8 4.02 3.8 4.06 4.17 3.72 3.9 3.47 3.72 3.54 3.77 3.91 3.45 3.67 3.48 3.82 3.69 3.97 3.81 3.45 3.82 3.3 3.72 3.57 3.95 3.78 3.38 Non-ELL ELL 3.16 3.83 3.85 4.33 4.02 3.98 3.95 SPED Non-SPED 3.81 3.27 4.47 3.9 3.97 3.59 4.03 3.63 3.86 3.79 Non Free Lunch 3.56 3.64 3.49 Free Lunch 3.8 3.99 3.68 Source: 2008-2009 ETS Technical Reports for the California High School Exit Exam and California Standards Tests Draft. Please do not cite or circulate. Table 2: Demographic Breakdown and Mean Achievement Scores (standardized), by subgroup % Free Lunch 70% Mean ELA CST 10 0.00 (1.00) Mean ELA Mean Math CAHSEE CAHSEE 0.00 0.00 (1.00) (1.00) % Female 52% % ELL 15% 59,541 0% 17% 28,587 68% -0.10 (1.01) -0.11 (1.00) 0.05 (1.01) Female 100% 13% 30,954 71% 0.09 (0.98) 0.11 (0.99) -0.05 (0.99) By Race/ Ethnicity White 50% 1% 9,508 29% 0.59 (1.01) 0.62 (0.89) 0.45 (0.91) Asian 51% 17% 18,460 70% 0.23 (0.96) 0.16 (1.00) 0.41 (0.96) Hispanic 52% 23% 24,452 84% -0.32 (0.90) -0.30 (0.92) -0.36 (0.90) Black 57% 1% 7,121 75% -0.29 (0.90) -0.24 (0.91) -0.44 (0.87) By ELL Status Non-English Language Learners 53% 0% 50,790 66% 0.18 (0.95) 0.20 (0.90) 0.15 (0.96) English Language Learners 45% 100% 8,751 89% -1.04 (0.55) -1.18 (0.66) -0.86 (0.77) By Free-Reduced Lunch Status Non-Free Lunch 50% 5% 18,040 0% 0.47 (1.01) 0.48 (0.95) 0.40 (0.96) Free Lunch 53% Overall By Gender Male 19% 100% -0.21 -0.21 41,501 (0.92) (0.95) Note: The sample excludes students ever classified as Special Education. Test scores standardized within cohort. -0.18 (0.96) Draft. Please do not cite or circulate. Table 3: Descriptives by Race and Gender, for all students and non-ELL, non-FL Sample ELA CAHSEE Full Sample Non ELL, Non-FL Only Math CAHSEE Full Sample Non ELL, Non-FL Only White Males 0.497 0.666 0.509 0.658 sd (0.903) 4,734 (0.832) 3,427 (0.913) 4,734 (0.863) 3,427 0.745 0.918 0.383 0.558 N (0.863) 4,768 (0.788) 3,335 (0.901) 4,768 (0.847) 3,335 Asian Males 0.026 0.513 0.423 0.798 sd (0.983) 9,020 (0.825) 2,614 (0.962) 9,020 (0.824) 2,614 0.294 0.820 0.399 0.801 (0.997) 9,422 (0.797) 2,582 (0.968) 9,422 (0.834) 2,582 N White Females sd N Asian Females sd N Hispanic Males -0.404 0.105 -0.307 0.065 sd N (0.930) 11,743 (0.879) 1,649 (0.920) 11,743 (0.897) 1,649 Hispanic Females -0.196 0.305 -0.402 -0.076 sd N (0.909) 12,668 (0.852) 1,704 (0.871) 12,668 (0.869) 1,704 Black Males -0.369 -0.105 -0.391 -0.172 sd N (0.901) 3,048 (0.954) 800 (0.883) 3,048 (0.891) 800 Black Females -0.138 0.187 -0.474 -0.223 (0.906) 4,052 (0.923) 964 (0.857) 4,052 (0.895) 964 Note: The sample excludes students ever classified as Special Education, as well as students classified as English Language Learners or eligible for Free/ Reduced Price lunch in the 10th grade. Test scores standardized within cohort. Draft. Please do not cite or circulate. Table 4: OLS Models Predicting 10th Grade ELA CAHSEE Score, By 9th Grade ELA CST Proficency Level Full Sample Female 0.054 *** (0.004) Far Below Basic -0.048 * (0.022) Below Basic -0.008 (0.011) Basic 0.040 *** (0.007) Proficient 0.094 *** (0.007) Advanced Full Sample 0.077 *** -0.031 *** -0.082 (0.060) -0.139 *** (0.027) -0.072 *** (0.013) -0.034 ** (0.012) 0.031 ** (0.012) Asian x Female -0.032 *** (0.009) 0.079 *** (0.007) Hispanic -0.037 *** (0.007) -0.064 (0.055) -0.114 *** (0.025) -0.056 *** (0.012) -0.054 *** (0.011) -0.208 *** (0.059) -0.189 *** (0.027) -0.106 *** (0.014) -0.079 *** (0.014) -0.022 -0.071 *** (0.011) 0.047 *** -0.190 *** Free Lunch N Adjusted R2 -0.043 *** -0.268 *** (0.029) -0.068 -0.175 *** (0.014) -0.066 *** -0.164 *** (0.011) -0.050 *** -0.155 *** (0.038) -0.017 * 0.014 (0.023) -0.050 (0.011) (0.007) -0.027 (0.064) (0.018) -0.256 * (0.109) -0.045 *** -0.190 *** (0.007) -0.044 *** -0.137 *** (0.034) (0.050) -0.012 Black x Female ELL -0.079 (0.070) (0.009) 0.028 *** -0.086 *** 0.022 (0.046) -0.010 (0.006) (0.008) -0.010 (0.113) (0.013) Hispanic x Female Black Basic Proficient Advanced (0.008) (0.010) (0.007) Below Basic 0.101 *** White x Female Asian Far Below Basic -0.058 * (0.029) -0.197 ** (0.070) -0.099 ** (0.031) -0.015 (0.014) -0.171 *** (0.035) -0.049 -0.022 (0.056) (0.028) -0.268 *** (0.029) -0.069 -0.176 *** (0.014) -0.066 *** 0.062 ** (0.021) -0.061 *** (0.017) 0.038 ** (0.012) -0.044 ** (0.016) 0.037 *** (0.009) -0.094 *** (0.020) 0.038 * (0.017) -0.164 *** (0.011) -0.050 *** 0.095 *** (0.016) -0.048 ** (0.016) 0.121 *** (0.012) -0.038 * (0.015) 0.067 *** (0.012) -0.081 *** (0.021) 0.100 *** (0.022) -0.155 *** (0.038) -0.018 * 0.099 *** (0.015) 0.025 (0.016) 0.109 *** (0.013) -0.006 (0.018) 0.091 *** (0.018) -0.021 (0.028) 0.097 ** (0.033) -0.256 * (0.109) -0.045 *** (0.005) (0.035) (0.018) (0.009) (0.009) (0.009) (0.005) (0.035) (0.018) (0.009) (0.009) (0.009) 59,541 3,074 8,803 19,005 16,396 12,263 59,541 3,074 8,803 19,005 16,396 12,263 0.778 0.305 0.355 0.41 0.414 0.406 0.779 0.304 0.355 0.41 0.414 0.406 0.000 *** 0.940 0.614 0.733 P-Value on F-Test: All Gender Interactions Equal NOTE: All samples include controls for students 8th -10th grade CST scores for the relevant subject and these scores squared and cubed. All test scores are standardized by cohort 0.019 * 0.86 Draft. Please do not cite or circulate. Table 5: OLS Models Using CST Scores to Predict 10th Grade Math CAHSEE Score, By 9th Grade ELA CST Proficency Level Full Sample Female Far Below Basic -0.109 *** (0.004) -0.113 *** (0.019) Below Basic -0.131 *** (0.010) Basic -0.133 *** (0.006) Proficient -0.122 *** (0.007) Advanced Full Sample -0.103 *** -0.043 *** 0.085 (0.052) -0.082 ** (0.026) -0.049 *** (0.013) -0.021 (0.011) -0.023 * (0.010) Asian x Female -0.047 *** (0.008) -0.095 *** (0.006) Hispanic -0.058 *** (0.006) -0.012 (0.048) -0.115 *** (0.023) -0.062 *** (0.012) -0.028 ** (0.010) -0.018 (0.010) Hispanic x Female -0.047 *** (0.008) -0.123 *** (0.006) Black -0.142 *** (0.007) -0.123 * (0.051) -0.197 *** (0.026) -0.134 *** (0.013) -0.100 *** (0.013) -0.083 *** (0.015) Black x Female -0.140 *** (0.010) -0.106 *** (0.010) ELL -0.153 *** (0.006) Free Lunch N Adjusted R2 -0.011 * -0.104 *** (0.024) -0.061 *** (0.012) -0.096 *** (0.010) Basic Proficient Advanced (0.007) (0.009) (0.006) Below Basic -0.082 *** White x Female Asian Far Below Basic -0.024 -0.053 (0.034) (0.088) -0.011 * 0.107 (0.060) -0.127 ** (0.044) 0.014 (0.055) -0.134 *** (0.025) -0.141 * (0.060) -0.029 (0.048) -0.104 *** (0.024) -0.142 ** (0.043) -0.096 ** (0.032) -0.108 *** (0.022) -0.117 *** (0.029) -0.135 *** (0.014) -0.194 *** (0.033) -0.146 *** (0.027) -0.062 *** (0.012) -0.137 *** (0.020) -0.048 ** (0.017) -0.137 *** (0.012) -0.064 *** (0.016) -0.133 *** (0.009) -0.142 *** (0.019) -0.123 *** (0.017) -0.096 *** (0.010) -0.139 *** (0.015) -0.046 ** (0.015) -0.092 *** (0.011) -0.029 * (0.014) -0.135 *** (0.011) -0.095 *** (0.019) -0.145 *** (0.020) -0.081 *** (0.012) -0.037 ** (0.013) -0.056 *** (0.011) 0.005 (0.015) -0.122 *** (0.015) -0.070 ** (0.023) -0.102 *** (0.027) -0.024 -0.057 (0.034) (0.088) -0.001 -0.024 -0.015 0.004 -0.005 -0.024 -0.015 0.004 (0.005) (0.031) (0.017) (0.009) (0.008) (0.008) (0.005) (0.031) (0.017) (0.009) (0.008) (0.008) 59,455 3,056 8,769 18,983 16,386 12,261 59,455 3,056 8,769 18,983 16,386 12,261 0.810 0.433 0.523 0.651 0.699 0.683 0.810 0.433 0.523 0.651 0.700 0.683 0.009 ** 0.234 0.669 0.910 0.010 * 0.003 ** P-Value on F-Test: All Gender Interactions Equal 0.016 * -0.153 *** (0.006) -0.050 (0.098) 0.016 * NOTE: All samples include controls for students 8th -10th grade CST scores for the relevant subject and these scores squared and cubed. All test scores are standardized within the cohort.M odels also include dummy variables for the math exam taken in each grade. Draft. Please do not cite or circulate. Table 6: OLS Models Predicting 10th Grade ELA CAHSEE Score, By 9th Grade ELA CST Proficency Level Using School by Cohort Fixed Effects Full Sample Female Far Below Basic Below Basic -0.043 -0.007 (0.024) (0.011) 0.053 *** (0.004) Basic 0.040 *** (0.007) Proficient 0.093 *** (0.007) Advanced Full Sample Far Below Basic 0.085 *** (0.010) -0.045 *** (0.007) -0.007 (0.065) -0.141 *** (0.029) -0.076 *** (0.014) -0.047 *** (0.013) 0.005 (0.013) Asian x Female -0.038 *** (0.010) 0.070 *** (0.007) Hispanic -0.049 *** (0.007) -0.006 (0.060) -0.110 *** (0.027) -0.068 *** (0.013) -0.066 *** (0.012) -0.022 (0.014) Hispanic x Female -0.020 * (0.009) 0.028 *** (0.006) Black Adjusted R2 0.039 0.043 (0.118) (0.048) 0.012 (0.074) -0.129 *** (0.036) (0.035) 0.008 (0.052) (0.023) 0.019 -0.087 ** (0.068) (0.033) (0.053) (0.015) (0.030) (0.015) 0.062 ** (0.021) -0.063 *** (0.019) 0.035 ** (0.012) -0.056 ** (0.017) 0.038 *** (0.010) 0.104 *** (0.017) -0.051 ** (0.017) 0.112 *** (0.012) -0.044 ** (0.016) 0.063 *** (0.013) 0.106 *** (0.015) 0.012 (0.017) 0.093 *** (0.014) -0.007 (0.019) 0.078 *** (0.019) -0.116 -0.177 *** -0.112 *** -0.095 *** -0.060 ** -0.082 *** -0.097 -0.153 *** -0.102 *** -0.096 *** -0.055 -0.064 -0.029 -0.015 -0.015 -0.019 -0.012 -0.075 -0.037 -0.021 -0.022 -0.031 -0.037 -0.017 (0.060) (0.029) 0.051 *** -0.188 *** (0.007) N Advanced -0.009 (0.012) Free Lunch Proficient -0.099 *** Black x Female ELL Basic (0.009) White x Female Asian Below Basic 0.094 *** -0.046 *** -0.249 *** (0.031) -0.061 -0.170 *** (0.014) -0.045 * -0.166 *** (0.011) -0.051 *** -0.126 ** (0.039) -0.020 * -0.332 ** (0.125) -0.050 *** -0.188 *** (0.007) -0.046 *** -0.248 *** (0.031) -0.063 -0.170 *** (0.014) -0.046 * 0.042 * (0.018) -0.166 *** (0.011) -0.051 *** 0.104 *** (0.023) -0.127 ** (0.039) -0.021 * 0.095 ** (0.035) -0.333 ** (0.125) -0.050 *** (0.005) (0.038) (0.018) (0.010) (0.009) (0.010) (0.005) (0.038) (0.018) (0.010) (0.009) (0.010) 54,793 2,901 8,202 17,608 15,035 11,047 54,793 2,901 8,202 17,608 15,035 11,047 0.782 0.328 0.374 0.419 0.427 0.424 0.782 0.328 0.374 0.419 0.427 0.424 0.000 *** 0.891 0.589 0.736 0.033 * 0.718 P-Value on F-Test: All Gender Interactions Equal NOTE: All models include controls for students 8th -10th grade CST scores for the relevant subject and these scores squared and cubed. All test scores are standardized by cohort Draft. Please do not cite or circulate. Table 7: OLS Models Using CST Scores to Predict 10th Grade Math CAHSEE Score, By 9th Grade ELA CST Proficency Level using School by Cohort Fixed Effects Full Sample Female Far Below Basic -0.105 *** (0.004) -0.099 *** (0.020) Below Basic -0.126 *** (0.011) Basic -0.134 *** (0.007) Proficient -0.118 *** (0.007) Advanced Full Sample -0.095 *** -0.042 *** 0.123 * (0.056) -0.089 ** (0.028) -0.058 *** (0.013) -0.019 -0.014 (0.012) (0.011) Asian x Female -0.043 *** (0.009) -0.093 *** (0.007) Hispanic -0.057 *** (0.006) 0.032 (0.051) -0.105 *** (0.025) -0.069 *** (0.012) -0.040 *** (0.011) -0.031 ** (0.011) Hispanic x Female -0.044 *** (0.008) -0.121 *** (0.006) Black -0.136 *** -0.058 -0.174 *** -0.137 *** -0.095 *** -0.089 *** -0.008 -0.055 -0.027 -0.014 -0.014 -0.016 Black x Female -0.135 *** (0.011) -0.098 *** (0.011) ELL -0.151 *** (0.006) Free Lunch N Adjusted R2 -0.014 ** -0.078 ** (0.026) -0.055 *** (0.013) -0.095 *** (0.010) Proficient Advanced -0.033 -0.090 (0.035) (0.101) -0.152 *** (0.006) -0.015 ** 0.003 (0.101) 0.151 * (0.064) -0.101 * (0.044) 0.069 (0.058) -0.126 *** (0.026) -0.064 (0.065) -0.018 (0.051) -0.078 ** (0.026) -0.130 ** (0.045) -0.097 ** (0.034) -0.111 *** (0.022) -0.106 *** (0.031) -0.129 *** (0.014) -0.172 *** (0.035) -0.134 *** (0.028) -0.055 *** (0.013) -0.139 *** (0.020) -0.060 *** (0.018) -0.135 *** (0.012) -0.070 *** (0.016) -0.137 *** (0.009) -0.150 *** (0.020) -0.115 *** (0.017) -0.095 *** (0.010) -0.019 * -0.133 *** (0.016) -0.042 ** (0.016) -0.089 *** (0.011) -0.039 ** (0.015) -0.132 *** (0.012) -0.089 *** (0.021) -0.140 *** (0.021) -0.067 *** (0.013) -0.026 (0.014) -0.047 *** (0.011) -0.002 (0.016) -0.118 *** (0.016) -0.073 ** (0.026) -0.093 ** (0.029) -0.033 -0.095 (0.035) (0.101) -0.010 -0.007 -0.009 0.002 -0.014 -0.007 -0.009 0.002 (0.005) (0.033) (0.017) (0.009) (0.009) (0.009) (0.005) (0.033) (0.017) (0.009) (0.009) (0.009) 54,707 2,883 8,168 17,586 15,025 11,045 54,707 2,883 8,168 17,586 15,025 11,045 0.814 0.456 0.536 0.662 0.704 0.69 0.814 0.456 0.535 0.662 0.704 0.007 ** 0.204 0.902 0.717 0.022 * P-Value on F-Test: All Gender Interactions Equal -0.019 * Basic (0.007) (0.009) (0.007) Below Basic -0.071 *** White x Female Asian Far Below Basic 0.69 0.003 ** NOTE: All samples include controls for students 8th -10th grade CST scores for the relevant subject and these scores squared and cubed. All test scores are standardized within the cohort.Models also include dummy variables for the math exam taken in each grade. Draft. Please do not cite or circulate. Table 8: Using Shrunken ELA CST10 to Predict 10th Grade ELA CAHSEE Score, By Shrunken 9th Grade ELA CST Proficency Level Full Sample Female 0.068 *** (0.004) Far Below Basic Below Basic -0.038 0.008 (0.024) (0.011) Basic 0.050 *** (0.007) Proficient 0.102 *** (0.007) Advanced Full Sample 0.074 *** -0.072 *** -0.094 (0.067) -0.148 *** (0.028) -0.108 *** (0.013) -0.065 *** (0.012) -0.007 (0.013) Asian x Female -0.085 *** (0.010) 0.099 *** (0.008) Hispanic -0.071 *** (0.007) -0.045 (0.062) -0.105 *** (0.026) -0.071 *** (0.012) -0.072 *** (0.011) -0.054 *** (0.014) Hispanic x Female -0.056 *** (0.009) 0.044 *** (0.007) Black -0.143 *** -0.217 *** -0.227 *** -0.125 *** -0.084 *** -0.075 *** -0.009 -0.065 -0.028 -0.014 -0.015 -0.020 Black x Female -0.134 *** (0.012) 0.057 *** (0.012) ELL -0.397 *** (0.007) Free Lunch N Adjusted R2 -0.077 *** -0.355 *** (0.031) -0.078 * -0.274 *** (0.013) -0.075 *** -0.269 *** (0.011) -0.059 *** -0.290 *** (0.045) -0.039 *** Basic Proficient Advanced (0.009) (0.010) (0.007) Below Basic 0.116 *** White x Female Asian Far Below Basic -0.157 (0.138) -0.069 *** -0.397 *** (0.007) -0.078 *** -0.031 0.023 (0.126) (0.049) -0.111 (0.078) -0.147 *** (0.035) -0.011 0.030 (0.053) (0.023) -0.051 (0.072) -0.089 ** (0.032) -0.038 -0.003 (0.030) (0.015) -0.200 * (0.078) -0.212 *** (0.036) -0.088 -0.003 (0.062) (0.029) -0.351 *** (0.031) -0.077 * -0.276 *** (0.013) -0.075 *** 0.060 ** (0.021) -0.105 *** (0.018) 0.049 *** (0.013) -0.066 *** (0.017) 0.047 *** (0.010) -0.123 *** (0.020) 0.054 ** (0.018) -0.269 *** (0.011) -0.059 *** 0.095 *** (0.017) -0.083 *** (0.016) 0.131 *** (0.013) -0.066 *** (0.016) 0.085 *** (0.012) -0.074 *** (0.022) 0.082 *** (0.023) -0.290 *** (0.045) -0.040 *** 0.112 *** (0.016) -0.014 (0.017) 0.124 *** (0.014) -0.055 ** (0.020) 0.112 *** (0.020) -0.070 * (0.031) 0.102 ** (0.036) -0.157 (0.138) -0.069 *** (0.005) (0.039) (0.018) (0.009) (0.009) (0.010) (0.005) (0.039) (0.018) (0.009) (0.009) (0.010) 59,455 2,884 9,211 19,545 16,482 11,333 59,455 2,884 9,213 19,544 16,484 11,330 0.739 0.213 0.25 0.315 0.321 0.303 0.739 0.214 0.249 0.315 0.322 0.303 0.000 0.825 0.658 0.95 0.046 * 0.895 P-Value on F-Test: All Gender Interactions Equal NOTE: M odels predict the CAHSEE ELA score, standardized by cohort. Controls other than those shown include "shrunken" 10th grade ELA CST scores, as well as the quadratic and cubic versions of these shrunken scores. The shrunken scores are a weighted average of the students observed 10th grade score (or that score squared or cubed) and their predicted score. The predicted score is calculated by regressing the 10th grade ELA CST test scores (or those scores squared or cubed) on 8th & 9th grade test scores (and those scores squared and cubed), the demographic controls to be included in the model (ell, free lunch, race and gender or race by gender) and district by cohort fixed effects. The "shrunken" score weights the observed score by .85 and the predicted score by .15. Students are reclassified into predicted "shrunken" proficiency categories based on where the shrunken (linear) score would place a student according to the actual cut scores for that cohort. Draft. Please do not cite or circulate. Table 9:Using Shrunken Math CST10 to Predict 10th Grade Math CAHSEE Score, By Shrunken 9th Grade ELA CST Proficency Level Full Sample Female Far Below Basic -0.122 *** (0.004) -0.152 *** (0.021) Below Basic -0.159 *** (0.011) Basic -0.163 *** (0.007) Proficient -0.147 *** (0.007) Advanced Full Sample -0.114 *** -0.047 *** 0.116 -0.020 (0.061) (0.028) -0.032 * (0.014) -0.013 (0.012) -0.021 * (0.011) Asian x Female -0.058 *** (0.009) -0.094 *** (0.007) Hispanic -0.087 *** (0.007) -0.020 (0.055) -0.095 *** (0.026) -0.067 *** (0.013) -0.039 *** (0.011) -0.025 * (0.012) Hispanic x Female -0.071 *** (0.009) -0.145 *** (0.006) Black -0.208 *** -0.168 ** -0.202 *** -0.164 *** -0.129 *** -0.095 *** -0.008 -0.059 -0.028 -0.015 -0.014 -0.017 Black x Female -0.200 *** (0.012) -0.129 *** (0.012) ELL -0.298 *** (0.006) Free Lunch N Adjusted R2 -0.165 *** (0.027) -0.031 *** -0.112 *** (0.013) -0.141 *** (0.011) -0.086 * (0.044) Basic Proficient Advanced (0.008) (0.010) (0.007) Below Basic -0.089 *** White x Female Asian Far Below Basic -0.057 (0.116) -0.299 *** (0.006) 0.158 * (0.070) -0.171 *** (0.048) 0.026 (0.065) -0.176 *** (0.027) -0.164 * (0.070) -0.071 (0.055) -0.164 *** (0.027) -0.205 *** (0.048) -0.052 (0.035) -0.122 *** (0.023) -0.111 *** (0.032) -0.158 *** (0.015) -0.199 *** (0.036) -0.196 *** (0.028) -0.113 *** (0.013) -0.183 *** (0.022) -0.047 ** (0.018) -0.153 *** (0.013) -0.075 *** (0.017) -0.166 *** (0.010) -0.175 *** (0.021) -0.161 *** (0.018) -0.141 *** (0.011) -0.163 *** (0.016) -0.045 ** (0.016) -0.103 *** (0.012) -0.034 * (0.015) -0.172 *** (0.012) -0.116 *** (0.021) -0.181 *** (0.022) -0.087 * (0.044) -0.098 *** (0.013) -0.051 *** (0.015) -0.046 *** (0.012) -0.003 (0.016) -0.136 *** (0.017) -0.065 * (0.026) -0.145 *** (0.031) -0.060 (0.116) -0.006 -0.021 -0.009 0.004 0.009 -0.008 -0.022 -0.009 0.003 0.009 (0.005) (0.035) (0.018) (0.009) (0.009) (0.009) (0.005) (0.035) (0.018) (0.009) (0.009) (0.009) 59,455 2,884 9,211 19,545 16,482 11,333 59,455 2,884 9,213 19,544 16,484 11,330 0.762 0.336 0.405 0.56 0.633 0.615 0.762 0.341 0.404 0.56 0.633 0.616 0.000 *** 0.184 0.16 0.643 0.000 *** 0.000 *** P-Value on F-Test: All Gender Interactions Equal -0.031 *** -0.007 (0.112) NOTE: M odels predict the CAHSEE M ath score, standardized by cohort. Controls other than those shown include "shrunken" 10th grade M ath CST scores, as well as the quadratic and cubic versions of these shrunken scores. The shrunken scores are a weighted average of the students observed 10th grade score (or that score squared or cubed) and their predicted score. The predicted score is calculated by regressing the 10th grade M ath CST test scores (or those scores squared or cubed) on 8th & 9th grade test scores (and those scores squared and cubed), dummies for math CST taken in the 8th-10th grade, the demographic controls to be included in the model (ell, free lunch, race and gender or race by gender) and district by cohort fixed effects. The "shrunken" score weights the observed score by .85 and the predicted score by .15. Students are reclassified into predicted "shrunken" proficiency categories based on where the shrunken (linear) ELA score would place a student according to the actual cut scores for that cohort. Draft. Please do not cite or circulate. Table 10: 2SLS Estimates ELA Female Math 0.026 *** -0.050 *** (0.005) White x Female (0.007) 0.033 ** -0.024 (0.012) Asian -0.090 *** (0.008) Asian x Female -0.110 *** (0.011) (0.017) -0.053 *** (0.012) 0.071 *** (0.012) -0.015 0.003 -0.016 0.004 (0.008) (0.011) (0.011) (0.015) Hispanic x Female -0.002 -0.062 *** (0.008) Black -0.030 ** (0.010) Black x Female -0.008 (0.014) (0.011) -0.117 *** (0.014) -0.007 -0.147 *** (0.009) Free Lunch N -0.034 *** -0.146 *** (0.009) -0.035 *** -0.105 *** (0.020) -0.048 * (0.014) ELL (0.016) -0.049 *** (0.009) Hispanic -0.040 * (0.020) -0.285 *** (0.012) -0.030 *** -0.286 *** (0.012) -0.030 *** (0.006) (0.006) (0.009) (0.009) 59,455 59,455 59,455 59,455 P-Value on F-Test: All Gender Interactions Equal 0.000 *** 0.295 NOTE: Models use 8th -9th grade CST scores to instrument for the 10th grade CST score in the relevant subject. Both predictor and instrument variables include quadratic and cubic terms. CAHSEE scores are standardized within the cohort. Models also include dummy variables for the math exam taken in each grade as an instrument. Draft. Please do not cite or circulate. Works Cited Aronson, J., Lustina, M. J., Good, C., Keough, K., Steele, C. M., & Brown, J. (1999). When white men can’t do math: Necessary and sufficient factors in stereotype threat. Journal of Experimental Social Psychology , 35, 29-46. Baumert, J., & Demmrich, A. (2001). Test motivation in the assessment of student skills:The effects of incentives on motivation and performance. European Journal of Psychology of Education , 441-462. Ben-Zeev, T., Fein, S., & Inzlicht, M. (2005). Arousal and stereotype threat. Journal of Experimental Social Psychology , 41, 174-181. Bettinger, E. P. (2010, September). Paying to Learn: The Effect of Financial Incentives on Elementary School Test Scores. NBER Working Paper No. 16333 . Bifulco, R., Ladd, H. F., & Ross, S. (2007). Public school choice and integration: Evidence from Durham, North Carolina. University of Connecticut Economics Working Papers , Number 2007-41. Brown, R. P., & Josephs, R. A. (1999). A burden of proof: stereotype relvance and gender differences in math performance. Journal of Personality and Social Psychology , 76 (2), 246-257. Card, D., & Rothstein, J. (2007). Racial segregation and the black–white test score gap. Journal of Public Economics , 91, 2158–2184. Cohen, G. L., & Garcia, J. (In Press). Recursive processes in self-affirmation: Intervening to close the minority achivement gap. Science . Cohen, G. L., Garcia, J., Purdie-Vaughns, V., Apfel, N., & Brzustoski, P. (2009). Recursive processes in self-affirmation: Intervening to close the minority achievement gap. Science , 324 (5925), 400-403. Coon, H. M., & Kemmelmeier, M. (2001). Cultural orientations in the United States : (Re)examining differences among ethnic groups. Journal of Cross-Cultural Psychology , 32 (3), 348-364. Draft. Please do not cite or circulate. Cullen, J. B., Jacob, B. A., & Levitt, S. D. (2005). The impact of school choice on student outcomes: An analysis of Chicago Public Schools. Journal of Public Economics , 729-760. Danaher, K., & Crandall, C. S. (2008). Stereotype threat in applied settings re-examined. Journal of Applied Social Psychology , 38 (6), 1639–1655. Ferrara, S. (2006). Toward a psychology of large-scale educational achievement testing: Some features and capabilities. Educational Measurement: Issues and Practice , 25 (4), 2-5. Fryer, J. R. (2010, April). Financial incentives and student achievement: Evidence from randomized trials. NBER Working Paper No. 15898 . Gonzales, P. M., Blanton, H., & Williams, K. J. (2002). The effects of stereotype threat and doubleminority status on the test performance of latino women. Personality and Social Psychology Bulletin , 28, 659-670. Good, C., Aronson, J., & Inzlicht, M. (2003). Improving adolescents’ standardized test performance: An intervention to reduce the effects of stereotype threat. Applied Developmental Psychology , 24, 645– 662. Haladyna, T. M., & Downing, S. M. (2004). Construct-irrelevant variance in high-stakes testing. Educational Measurement: Issues and Practice , 23 (1), 17-27. Jacob, B. A., & Levitt, S. D. (2003). Rotten apples: An investigation os the prevalence and predictors of teacher cheating. The Quarterly Journal of Economics , 118 (3), 843-877. Keller, J. (2002). Blatant stereotype threat and women’s math performance: Self-handicapping as a strategic means to cope with obtrusive negative performance expectations. Sex Roles , 47 (3/4), 193-198. Draft. Please do not cite or circulate. Kellow, J. T., & Jones, B. D. (2008). The effects of stereotype on the achievement gap: Reexamining the academic performance of African American high school students. Journal of Black Psychology , 34 (1), 94-120. Lankford, H., Loeb, S., & Wycoff, J. (2002). Teacher sorting and the plight of the urban school. Education Evaluation and Policy Analysis , 24, 37-62. Linn, R. L., Koretz, D., & Baker, E. L. (1996). Assessing the validity of the national assessment of educational progress: NAEP technical review panel white paper. Los Angeles: Center for the Study of Evaluation. : CSE Technical Report 416. Logan, J. R., Oakley, D., & Stowell, J. (2008). School segregation in metropolitan regions, 19702000: The impacts of policy choices on education . American Journal of Sociology , 1611-1644. Marcus, H. R., & Kitayama, S. (1991). Culture and the self: Implications for cognition, emotion, and motivation. Psychological Review , 98 (2), 224-253. Messick, S. (1989). Meaning and values in test validation: The science and ethics of assessment. Educational Researcer , 18 (5), 5-11. O'Brien, L. T., & Crandall, C. S. (2003). Stereotype Threat and Arousal: Effects on Women's Math Performance. Personality and Social Psychology Bulletin , 29, 782-789. Reardon, S. F., Atteberry, A., Arshan, N., & Kurlaender, M. (2009). Effect of the California High School Exit Exam on student persistance, Achievement and graduation. Institute for Research on Education Policy and Practice, Working Paper #2009-12 . Rivkin, S. G., Hanushek, E. A., & Kain, J. F. (2005). Teachers, schools and academic achievement. Econometrica , 417–458. Draft. Please do not cite or circulate. Rumberger, R. W., & Willms, J. D. (1992). The impact of racial and ethnic segregation on the achievement gap in California high school. Education Evaluation and Policy Analysis , 377-396. Schmader, T., Johns, M., & Forbes, C. (2008). An integrated process model of stereotype threat effects on performance. Psychological Review , 115 (2), 336–356. Shih, M., Pittinsky, T., & Ambady, N. (1999). Sterotype susceptibility: Identity saliance and shifts in quantitative performance. Psychological Science , 10 (1), 80-83. Smith, J. L. (2004). Understanding the process of stereotype threat: A review of mediational variables and new performance goal directions. Educational Psychology Review , 16 (3), 177-206. Spencer, S. J., Steele, C. M., & Quinn, D. M. (1999). Stereotype threat and women's math performance. Journal of Experimental Social Psychology , 35, 4-28. Steele, C. M. (1997). A threat in the air: How sterotypes shape intellectual identity and performance. American Psychologist , 52 (6), 613-629. Steele, C. M., & Aronson, J. (1995). Stereotype threat and the intellectual test performance of African Americans. Journal of Personality and Social Psychology , 67 (5), 797-811. Stricker, L. J., & Ward, W. C. (2004). Stereotype threat, inquiring about test takers' ethnicity and gender, and standardized test performance. Journal of Applied Social Psychology , 34 (4), 665-693. Walton, G. M., & Cohen, G. L. (2002). Stereotype lift. Journal of Experimental Social Psychology , 39, 456-467. Walton, G. M., & Spencer, S. J. (2009). Latent ability: Grades and test scores systematically underestimate the intellectual ability of negatively sterotyped students. Psychological Science , 20 (9), 1132-1139. Draft. Please do not cite or circulate. Wicherts, J. M., Dolan, C. V., & Hessen, D. J. (2005). Stereotype threat and group differences in test performance: A question of measurement invariance. Journal of Personality and Social Psychology , 89 (5), 696–716.