Nicole Arshan - Association for Education Finance and Policy

Draft. Please do not cite or circulate.
Does making a test high stakes for students artificially inflate achievement gaps by race and
gender? Evidence from the California High School Exit Exam
Nicole Arshan*
Sean Reardon
Stanford University
March 2011
*Questions or correspondences regarding this paper may be addressed to Nicole Arshan, Center
for Education Policy Analysis, Stanford University School of Education, 520 Galvez Mall,
CERAS Building, 5th Floor, Stanford, CA, 94305 or narshan@stanford.edu.
Abstract
Standardized tests given to large groups of students are popular among both policymakers
and researchers for understanding student achievement and attaching accountability to schools,
teachers and even students. To use tests for any of these purposes, policymakers and researchers
rest on the assumption that these exams provide an accurate, unbiased measure of student
knowledge; essentially, that these exam scores consist of a student’s “true ability” plus an
acceptably small amount of random error. Several critiques of these exams, however, suggest
that this error may not be random but may instead be systematically correlated with test
conditions or with school or student behavior.
These tests can broadly be broken into two types: those used purely for research and
assessment and those used as a policy lever by having implications attached to their results.
Exams given with no direct repercussions include the National Assessment of Educational
Progress (NAEP, the so-called “Nation’s Report Card”) and other exams given by the
Department of Education’s National Center for Education Statistics (NCES). These solely
evaluative exams provide nationally representative student level data (such as the “High School
and Beyond” dataset) used to research and better understand the relationship between any
number of student, school and even family level factors and academic achievement. While some
have raised concerns that students may not exert much effort on such exams, thereby
systematically underestimating student ability, researchers generally find that these solely
evaluative exams should be strong representations of student ability (Linn, Koretz, & Baker,
1996; Baumert & Demmrich, 2001).
The second type of standardized test, with stakes attached to outcomes, is
integral to the current policy environment emphasizing accountability and standards. The
rationale behind using standardized tests as a policy lever is straightforward enough: if exams
accurately measure student knowledge, then these scores can be used to identify, reward and
possibly even duplicate success; identify, intervene or correct failure; and provide incentives for
all parties to work harder and raise achievement. Unfortunately, attaching repercussions to exams
can also change results and behavior in unwanted ways, including the narrowing of curriculum
and even cheating (Jacob & Levitt, 2003).
Finally, a third school of research questions whether scores on standardized tests provide
equally valid measures of student ability for all students. A large body of social psychological
literature suggests that stereotype threat –an individual’s underperformance on a task when they
fear confirming a negative stereotype about their group – may cause exams to systematically
underestimate the ability of students from disadvantaged groups (Steele & Aronson, 1995;
Steele, 1997; Schmader, Johns, & Forbes, 2008; Walton & Spencer, 2009). These stereotype
threat effects may be greatest when a student faces direct repercussions as a result of the exam
(Aronson, Lustina, Good, Keough, Steele, & Brown, 1999).
One recent study finds evidence of such systematic bias on the California High School
Exit Exam (CAHSEE). Reardon and colleagues found the CAHSEE had negative effects
on graduation rates for low-achieving students – but only for women, students of color and
English Language Learners (Reardon, Atteberry, Arshan, & Kurlaender, 2009). Upon
investigation, the authors found that these students were less likely to pass the “high stakes” high
school exit exam on the initial administration than their observationally similar white and male
peers, even when controlling for prior achievement on a “low stakes” exam used for state and
federal accountability, but without direct repercussions to the students. Since the robust measures
of prior achievement on the low stakes exams included in the regression should control for the
knowledge component, these results suggest that the noise in this exam may not have been
random, but instead correlated with observable student characteristics such as race, gender and
ELL status. It may be that student effort is correlated with both student subgroup and the personal
repercussions of the exam, leading all estimates of achievement using “low stakes” exams –
including those used purely for research and those used for accountability but without direct
repercussions to the student – to systematically underestimate the performance of certain
subgroups. On the other hand, stereotype threat may indicate that exams with direct
repercussions for students – such as high school exit exams and college entrance exams – may
implicitly set a higher performance requirement for disadvantaged students than for their more
privileged peers.
But while the Reardon et al. findings raise concerns, they require further exploration to be
fully understood. First, there are several concerns in comparing exam scores that need to be
carefully evaluated before believing these results, in particular measurement error and
differences in content on the two exams being compared. Second, while stereotype threat
provides a clean theoretical explanation for differential underperformance by students at
risk of being stereotyped, it is not the only factor that may influence test performance. In fact, one
experimental study told students in the “treatment” condition that the exam they were taking would
predict their performance on Florida’s high school exit exam. The authors found no difference
between African American student performance in the treatment and control groups, but a large
performance boost for white students in the treatment group (Kellow & Jones, 2008).
Access to student-level longitudinal administrative data in three large urban districts in
California – the same data used for the Reardon et al. finding – provides us the opportunity to
further explore and understand these troubling findings. Students in California high schools take
two kinds of tests, both aligned to California state standards. One, the California Standards Test
(CST), is given on a yearly basis and has repercussions for the school; we therefore consider this
exam “low-stakes” for students. The other, the California High School Exit Exam (CAHSEE),
determines whether the student will earn a high school diploma; we therefore consider the
CAHSEE “high-stakes” for students. The CAHSEE is initially administered in the 10th grade and
measures content similar to the 8th-10th grade CST exams. This paper examines the high stakes/
low stakes performance gap both by race and gender and by gender within race. We first ask
whether this high stakes/ low stakes achievement gap can be explained by differences in the schools
attended by students of different subgroups or by measurement error, and then whether the
observed patterns suggest a clear stereotype threat or effort interpretation.
Framework
Test scores and measurement error
We start with the assumption that student i has true skill η. We cannot observe this true skill,
but only estimate it with error φ using test score Y:
(1) Y_i = η̂_i = η_i + φ_i
Tests that fail to meet this assumption, and therefore measure something other than
knowledge and random noise, display “construct irrelevance” (Messick, 1989; Haladyna &
Downing, 2004). Construct irrelevance implies that the error term, φ, is non-random and should
therefore be broken into several parts. While φ will still contain some random error ε, there will
be two additional sources of measurement error that are neither random nor part of the ability, η,
the test seeks to measure: one correlated with the content measured in the exam, γ, and another
correlated with the individual’s ability to demonstrate their knowledge on this particular exam, δ:
(2) Y_i = η_i + δ_i + γ_i + ε_i
Tests may be construct irrelevant for several reasons. First, γ will be nonzero if the exam
measures content other than η. Take the example of an exam that conflates reading ability with
math ability. In this instance, two students with the same true ability in math (η_A^Math = η_B^Math)
would have different math scores (Y_A^Math ≠ Y_B^Math) if student A had a higher reading ability than
student B (Y_A^Math > Y_B^Math). Construct irrelevance may also conflate a student’s test taking
ability or the effort they expend on the exam with their true knowledge. A student who received
a great deal of test prep, is particularly relaxed under stressful circumstances (such as taking a
timed exam), and makes an effort to perform well is likely to score higher than a student with the
same true knowledge of the material but greater test anxiety, less familiarity with the test taking
process or less willingness to expend the effort required to be successful in the exam.
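The decomposition in Equation (2) can be made concrete with a short simulation. The variance shares below are hypothetical, chosen only to illustrate how the non-ability components δ, γ, and ε dilute the ability signal in an observed score:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Components of Equation (2); all variances are illustrative assumptions.
eta = rng.normal(0, 1.0, n)    # true ability
delta = rng.normal(0, 0.3, n)  # test-taking ability / effort component
gamma = rng.normal(0, 0.2, n)  # off-construct content component
eps = rng.normal(0, 0.4, n)    # purely random error

y = eta + delta + gamma + eps  # observed score: Y = eta + delta + gamma + eps

# Share of observed-score variance that reflects true ability.
reliability = np.var(eta) / np.var(y)
print(round(reliability, 2))
```

Only ε is benign noise in this framework; if δ or γ were correlated with group membership, the score would be biased for that group even though overall reliability looked acceptable.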
The relationship between group membership and an unbiased exam score
An accurate, unbiased test will display measurement invariance, which requires that students’
test scores be independent of their group membership J, conditional on their ability (Wicherts,
Dolan, & Hessen, 2005):
(3) Y ⊥ J | η
Measurement invariance implies that, while student race or gender may be correlated with
exam scores, the relationship between group membership and exam score must affect this
achievement gap only through true differences in ability, not through the exam’s measurement
error. Measurement invariance does not, therefore,
require knowledge, η, to be orthogonal to J, merely the error φ. Group membership such as race
or gender, through some feature of society, may lead directly to lower achievement. For
example, the black-white test score gap can at least partially be explained by segregation (Card
& Rothstein, 2007). African American students tend to attend schools in urban areas with higher
percentages of poor and minority students (Logan, Oakley, & Stowell, 2008). Schools in urban
areas with high concentrations of African American and poor students have a harder time
recruiting and retaining teachers (Lankford, Loeb, & Wycoff, 2002). Lower ability of the
teaching staff will lead to lower achievement for the (disproportionately African American)
students in the school (Rivkin, Hanushek, & Kain, 2005). These different ability levels,
therefore, would appropriately be reflected by achievement gaps in an accurate and unbiased
standardized test, as the group membership is related to ability, not measurement error.
We do, however, require that measurement error be unrelated to J. Achievement gaps
between groups should therefore be entirely explained by the differential knowledge between the
groups, not the different test circumstances, as reflected by φ. That is, if the test score is
composed of true knowledge and measurement error (substituting Equation (1) into Equation
(3)):
(4) (η_i + δ_i + γ_i + ε_i) ⊥ J | η_i
(5) (δ_i + γ_i) ⊥ J | η_i
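What conditions (4) and (5) demand can be seen in a small simulation where, unlike in real data, we observe η directly and can check whether any score gap remains after conditioning on it. The group effect sizes below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
J = rng.integers(0, 2, n).astype(float)  # group indicator

# Group membership may shift true ability; this does NOT violate invariance.
eta = rng.normal(-0.3 * J, 1.0)
eps = rng.normal(0, 0.5, n)

y_fair = eta + eps               # error independent of J: invariant
y_biased = eta + eps - 0.2 * J   # delta correlated with J: violates (5)

def gap_given_ability(y, eta, J):
    """OLS of y on [1, eta, J]; returns the coefficient on J,
    i.e. the score gap conditional on true ability."""
    X = np.column_stack([np.ones_like(eta), eta, J])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[2]

print(gap_given_ability(y_fair, eta, J))    # close to zero
print(gap_given_ability(y_biased, eta, J))  # close to -0.2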
If a test fails to meet this criterion of measurement invariance, then we know that the test is
construct irrelevant in a way correlated with membership in group J. Either the test measures
content other than η, on which one or more groups hold less knowledge, or test circumstances
prevent students from fully demonstrating their knowledge on the exam in a way that
disadvantages some groups as compared to others. One example of r_γJ ≠ 0 would be an exam
that measures ability to do long division using word problems written in English. While the
ability to construct a math equation based on written information may be part of η, English
Language Learners fully capable of performing this task in their native tongue will score lower
than native English speakers of the same mathematics ability. Social psychologists and
psychometricians alike worry that r_δJ ≠ 0 due to the prevalence of stereotype threat, a social
psychological phenomenon wherein individuals underperform relative to their true ability when
faced with the possibility of confirming a negative stereotype about their group – the idea that
women are bad at math, or that African Americans are not academically successful (Steele &
Aronson, 1995; Steele, 1997; Ferrara, 2006).
Differences in tests
When comparing two exams that both measure η – one with high stakes for the students and
one without direct repercussions for the students – we may therefore say that, if neither exam
displays construct irrelevance (and both are therefore measurement invariant),
(6) E(Y_i^High | η_i) = E(Y_i^Low | η_i).
As demonstrated in Equation 5, one way to check for construct irrelevance is to test for
measurement invariance. If we estimate
π»π‘–π‘”β„Ž
(7) π‘Œπ‘–
= π›½π‘Œπ‘–πΏπ‘œπ‘€ + π‘±πŒ + πœ™π‘– ,
then we should fail to reject the null:
(8) 𝐻0 : πœ’ = 0 ∀ 𝐽.
Theoretically, then, we would like to believe that we would not see an achievement gap on an
exam between students who have demonstrated the same ability on an earlier, similar exam.
However, even setting aside the issue of differing stakes on these exams, there are several
reasons we may see achievement gaps persist even when controlling for concurrent achievement,
including schools attended by students of different groups and measurement error.
Schools may differ in their ability or motivation to help students achieve their best
possible score on an exam. Depending on the incentives, schools may act in ways to superficially
raise scores without meaningful gains to student achievement. One study of accountability policy
in Chicago found that while schools raised elementary school students’ math scores on tests with
stakes for the school, these gains were not reflected in exams measuring similar content when the
exam had no stakes for the school. Evidence pointed to teachers emphasizing specific skills or
questions they knew would be on the high stakes test but not the low stakes test, and to
greater student effort on the test with meaningful implications for the school (Jacob, 2005).
There is evidence that some teachers are willing to go so far as to cheat to help students score
better on an exam than they otherwise could (Jacob & Levitt, 2003). Furthermore, school segregation
is well documented. A study analyzing six of the largest districts in California (including two of
the three used in this paper) in the 1988-1989 school year found that “approximately half of all
high school students would need to change schools for the racial and ethnic composition of the
high schools to reflect the racial and ethnic composition of the state” (Rumberger & Willms,
1992, p. 378). There is evidence that test prep does not happen equally at all schools; one study
of California elementary schools found that teachers at low-SES schools with high proportions of
minority students were more likely to teach narrowly to the items on which they thought students
could most easily show improvement (Shen, 2008). It may be, therefore, that students from more
disadvantaged groups may perform better on one test than the other if their schools put more
emphasis on preparing them for one as compared to the other.
Secondly, measurement error in Y_i^Low may bias our estimates of χ towards the
unconditional achievement gap between groups. If we assume that the measurement error in
Y_i^Low is consistent throughout the distribution of scores and uncorrelated with group
membership, then the attenuation bias in β may therefore cause us to conflate regression to the
group conditional mean with an achievement gap on the two exams that mimics the gap in
measured average ability between the groups (Rothstein, 2010). Unfortunately, most
standardized achievement tests do not have consistent measurement error throughout the scores
and therefore violate the classical errors-in-variables assumptions required to make a prediction of
the direction of bias in χ. Instead, standardized tests tend to measure scores in the center and near
cut scores with more accuracy than scores at the top and bottom of the scale.
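The attenuation mechanism can be sketched as follows: even when both tests satisfy measurement invariance, estimating Equation (7) with a noisy low-stakes score leaves a spurious group coefficient that mimics the true-ability gap. All parameter values here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
J = rng.integers(0, 2, n).astype(float)

# A true ability gap between groups (allowed under invariance).
eta = rng.normal(-0.5 * J, 1.0)

# Both tests are unbiased: classical error, independent of J.
y_low = eta + rng.normal(0, 0.6, n)
y_high = eta + rng.normal(0, 0.6, n)

# Estimate Equation (7): y_high = beta * y_low + chi * J + phi.
X = np.column_stack([np.ones(n), y_low, J])
coef, *_ = np.linalg.lstsq(X, y_high, rcond=None)
beta_hat, chi_hat = coef[1], coef[2]

# beta_hat is attenuated below 1, and chi_hat is pulled away from zero
# toward the unconditional gap, despite both tests being fair.
print(round(beta_hat, 2), round(chi_hat, 2))
```

Under the classical-error assumption the direction of this bias is predictable; as the text notes, real standardized tests violate that assumption, so even the sign of the bias in χ is uncertain.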
We therefore worry that χ may not reflect a high stakes/ low stakes achievement gap, but
instead may either be a reflection of schools attended by the students, or measurement error
causing bias in the results. If there is a remaining high stakes/ low stakes achievement gap, there
may be several explanations. We discuss anxiety and effort, two likely sources of differential
performance, and their connections to group membership, below.
Anxiety and stereotype threat
Stereotype threat neatly connects anxiety in a high stakes testing circumstance to group
membership. Students who experience stereotype threat underperform relative to their true
ability when faced with the possibility of confirming a negative stereotype about their group – the
idea that women are bad at math, or that African Americans are not academically successful
(Steele & Aronson, 1995; Steele, 1997). Stereotype threat is, essentially, an additional source of
test anxiety that impairs the working memory of an individual faced with a misalignment
between their sense of self, sense of group, and sense of ability (Schmader, Johns, & Forbes,
2008). Stereotype threat seems to have the greatest impact when there are acute repercussions to
students’ performance (Aronson, Lustina, Good, Keough, Steele, & Brown, 1999).
Steele and Aronson (1995) first advanced the theory of stereotype threat in their study of Stanford
undergraduates, claiming that people who feel at risk of being negatively stereotyped will perform worse
for fear of confirming or being judged by negative stereotypes associated with their groups. They
found that black students primed to think a difficult exam was a measure of their ability
performed worse than white students with similar SAT scores in the same condition. In the
control group black and white students of similar SAT scores performed similarly well. There
was evidence to show that black students primed with negative stereotypes showed more anxiety,
less accuracy and performed more slowly. Subsequent research has found similar results in
women with math exams (Brown & Josephs, 1999; Gonzales, Blanton, & Williams, 2002;
O'Brien & Crandall, 2003; Shih, Pittinsky, & Ambady, 1999; Spencer, Steele, & Quinn, 1999),
the elderly with tests of memory (Hess, Auman, Colcombe, & Rahhal, 2003), and one study on
Latinos and verbal exams (Gonzales, Blanton, & Williams, 2002). Moreover, stereotype threat
can be induced in groups that are not normally at risk -- such as when a group of white men who
excel at math were primed with the stereotype that Asians are better at math (Aronson, Lustina,
Good, Keough, Steele, & Brown, 1999). Furthermore, individuals may benefit from “stereotype
lift” and perform better than their typical ability when operating under the assumption that their
group benefits from a positive stereotype (Walton & Cohen, 2002).
The negative effects of stereotype threat tend to be strongest for those with the highest
"domain identification,” (Aronson, Lustina, Good, Keough, Steele, & Brown, 1999; Steele,
1997; Keller, 2002) that is, those to whom the exam is most important. Steele, in particular,
would argue that the domain identification is a measure of how important the skill being tested is
to an at-risk student’s self-perception. Aronson et al. (1999) argue that domain identification
may, more broadly, indicate that stereotype threat will be activated when it has repercussions for
the at-risk student. They give the example of a woman taking the GRE to apply for graduate
school in Art History. This student likely has little self-perception at stake in the math GRE, but
doing poorly could have direct repercussions: “It therefore may be more correct to say that high
motivation—a sense that something important is at stake—is the necessary factor in stereotype
threat, not high identification per se” (Aronson, Lustina, Good, Keough, Steele, & Brown, 1999,
p. 43). This concern over the repercussions of an exam has direct relevance to our study, where
we see that at-risk students under-perform on a high stakes exam, as compared to an exam
measuring similar content but the results of which only indirectly impact the students.
While stereotype threat has been studied extensively, there is less direct evidence of
stereotype threat outside of laboratory settings; experiments outside of the laboratory have – for
obvious ethical reasons – focused on interventions designed to mitigate the effects of stereotype
threat. One experimental field study found that asking students to report their gender and
ethnicity did not affect test scores on the AP Calculus AB exam or the Computerized Placement
Test (used for placement primarily at community colleges) (Stricker & Ward, 2004). The authors
concede, however, that the students may not have needed the treatment to activate stereotype
threat on these tests, which already carry significant consequences and may threaten at-risk
identities. Additionally, a subsequent re-examination of the data from this study argues for a
relaxation of the authors’ fairly stringent statistical requirements; doing so produces a finding
that is both statistically and, the authors argue, practically significant (Danaher & Crandall,
2008). Instead, most field work on stereotype threat has demonstrated that interventions targeted
towards alleviating it can be successful in raising the academic performance of at-risk students.
In fact, a number of studies have found that techniques such as mentoring or values affirmation
can significantly reduce the effects of stereotype threat (Good, Aronson, & Inzlicht, 2003;
Walton & Spencer, 2009; Cohen, Garcia, Purdie-Vaughns, Apfel, & Brzustoski, 2009).
Taking the example of the high school exit exam, a female student may think of herself
as a weak math student – a perception in line with society’s stereotypes that women struggle with
mathematics as compared to men. While this perception of negative ability may dissuade her
from pursuing a math and science career, it should not impact her working memory and therefore
her ability to perform well on a math exam -- unless the exam becomes important to her sense of
self. The high stakes of the exit exam – making the safe assumption that she values a high school
diploma -- will therefore misalign to her weaker sense of her own and her group’s ability. This
misalignment will cause anxiety and distraction while she takes the exam, leading to
underperformance as compared to an exam where she did not feel any pressure to perform well.
If stereotype threat is causing a high stakes/ low stakes achievement gap, we would therefore
expect women to outperform men on the ELA section and underperform on the math section.
Students of color will most likely underperform on both sections, though Asian students should
overperform on the mathematics section and women of color may or may not underperform on
the ELA section.
As mentioned in the introduction, one experiment tested just such a condition and found
the opposite of this hypothesis (Kellow & Jones, 2008). Two groups of Florida 9th graders took
the same exam, but only the treatment group believed it to predict performance on the 10th grade
FCAT, which is used as a high school exit exam in Florida. White students performed higher in
the treatment than control condition, whereas African American students performed similarly in
both conditions. The authors proposed that stereotype lift may have caused the observed pattern
– that linking the high stakes to the exam caused higher performance for white students. Their
survey data, however, indicate that they may not have adequately manipulated the source of stereotype
threat for students; African American students in both conditions scored higher on measures of
anxiety and anticipated stereotype threat than white students, with no race by treatment
interaction effect. These results (and by extension the high stakes/ low stakes gaps we observe)
may therefore be caused not by differential ability to perform given the test circumstances, but
instead by differential effort by students in the different test conditions.
Effort
One other explanation for a high stakes/ low stakes performance gap would be
differential effort put forth by students on the different exams. Social psychological theory
suggests that effort may differ according to a student’s perception of the “stakes” of the test.
Several experiments have attempted to pay students for performance on standardized tests to
change their outcomes.
The social psychological literature suggests that the distinction we made between what
qualifies a test as “high” or “low” stakes may not be perceived in the same way by all students.
In particular, conceiving of a test as having “high stakes” only for an individual and discounting the
importance of the “low stakes” exam may be a particularly Western conception. Markus
and Kitayama, summarizing a large field of research on this area, argue that Westerners, in
particular Americans and males, tend to hold an independent view of the self, one that places
primacy on fixed individual characteristics (1991). In contrast, East Asians are more likely to
hold an interdependent view of the self, which places a greater importance on relationships,
especially in group members with whom one shares a common fate. They link this view of self
to motivation and effort. An independent person would see only the high school exit exam as
high stakes; since the repercussions of the school-accountability exam do not fall directly on them,
they may not work as hard on it as they would if the exam were personally meaningful. Interdependent people, however,
would see the fate of their school as more closely linked to their own fate, and might therefore
put forth more effort on an exam that had stakes for the school. When comparing tendencies of
American racial and ethnic groups to act collectively, one study found that Asian and African
Americans scored higher on a scale of collectivism. These results seem driven largely by higher
scores for men (Coon & Kemmelmeier, 2001).
Attempts to pay students to motivate harder work have shown mixed results. A series of
experiments in the United States found significant positive impacts on student test scores when
paying students for educational inputs, such as reading books or turning in homework, with the
effects concentrated among male, African American and Hispanic students; trials that offered
money to students for educational outputs such as grades or test scores showed no or negative
effects (Fryer, 2010). Another cash incentive program in a low income, primarily white, Ohio
community found an increase of .15 sd when paying elementary school students to pass the state
standardized tests, though there was no effect on the reading, social science or science exam
scores (Bettinger, 2010). This experiment showed no significant differences in effects by gender.
A third study, investigating the possibility that students were not putting forth enough effort on a
low stakes test (NAEP) offered students $1 per correct answer, with no effects, though the
authors did not examine the results separately by race or gender (Linn, Koretz, & Baker, 1996).
While differential student effort therefore seems to be an unlikely source of a high stakes/
low stakes achievement gap, there is some theoretical reason to believe that perhaps women,
white and Hispanic students may not put forth the same effort on state sponsored exams as male,
Asian and African American students.
Data, measurements and descriptive statistics
Data
We use longitudinal, administrative, student-level data from three of California’s largest
school districts. The analytic sample includes 41,290 students in the graduating classes of 2006-
2011 who were subject to a high school exit exam policy. Students in California high schools
take two kinds of tests, both aligned to California state standards. One, the California Standards
Test (CST) is given on a yearly basis and has repercussions for the school; we therefore consider
this exam as “low-stakes” for students. The other, the California High School Exit Exam
(CAHSEE) determines whether the student will earn a high school diploma; we therefore
consider the CAHSEE “high-stakes” for students. The CAHSEE is initially administered in the
10th grade and measures content similar to the 8th-10th grade CST exams. These exams and the
content overlap between them will be discussed in more detail below. We exclude from our
analyses students classified as special education students (roughly 10 percent of students),
because these students were not subject to the CAHSEE requirement in most of the years
covered by our analyses. For ease of interpretation, we standardize all exam outcome scores by the
sample’s within-cohort mean and standard deviation. The administrative data also provide
demographic and academic covariates such as race/ethnicity, gender, English Language Learner
(ELL) classification, free or reduced-price lunch eligibility, and school attended.
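The within-cohort standardization can be sketched as follows; the column names are hypothetical, since the administrative files’ actual layout is not shown here:

```python
import pandas as pd

# Toy records; "cohort" and "cst_math" are assumed column names.
df = pd.DataFrame({
    "cohort": [2006, 2006, 2006, 2007, 2007, 2007],
    "cst_math": [310.0, 350.0, 390.0, 300.0, 345.0, 420.0],
})

def standardize_within_cohort(scores: pd.Series, cohort: pd.Series) -> pd.Series:
    """z-score each exam outcome within its graduating cohort."""
    g = scores.groupby(cohort)
    return (scores - g.transform("mean")) / g.transform("std")

df["cst_math_z"] = standardize_within_cohort(df["cst_math"], df["cohort"])
```

After this step each cohort has mean zero and unit standard deviation, so scores are comparable across cohorts even if the underlying scale drifts from year to year.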
The California Standards Test and the California High School Exit Exam
In the Spring of 10th grade, students in California take two exams covering material
aligned to the California State Standards in mathematics and English Language Arts: the
California Standards Test (CST) and the California High School Exit Exam (CAHSEE). While
these two exams cover similar materials, their stakes for students are sharply different. While the
CST may, in some cases, be used as one of many factors determining class placement, the
CAHSEE’s role in students’ future is distinct and very clear. If students do not pass both sections
of the exam, they will not earn a high school diploma. This section gives some background on
both exams, and then discusses similarities and differences between the two exams in: stakes for
students and schools; measurement error; and content.
California schools administer the California Standards Test (CST) to students in grades
2-11 annually as part of the state’s Standardized Testing and Reporting (STAR) program,
designed to evaluate student achievement on the state’s content standards in four areas – English
Language Arts (ELA), mathematics, history/ social sciences and science. Of interest to our study
are the exams for ELA and math. The ELA exam is aligned to grade level content standards for
all grades. The math exam is aligned to grade level content standards through the 7th grade, and
then becomes an “end of year” exam aligned to the class a student has taken during the school
year. Both exams are multiple choice, though students have an additional writing assessment as
part of the 4th and 7th grade ELA exam, which is administered on a separate date. The writing
assessment is given in March; the multiple choice exams may be administered between March
and May. Students receive a scale score between 150 and 600. The scale score places students in one
of five categories: advanced, proficient, basic, below basic and far below basic. Some special education students take the California Modified Assessment (CMA) in lieu of the CSTs, and some Spanish-speaking ELL students take the Standards-Based Test in Spanish (STS) in lieu of or in addition to the CSTs for ELA and mathematics.
The California High School Exit Exam has two sections: English Language Arts (ELA)
and mathematics. The math section gives a multiple choice exam covering the California math
content standards for sixth and seventh grade and a small amount of Algebra 1. The ELA section
covers state content standards from 8th-10th grades and utilizes a multiple-choice format along
with one essay. Both tests are administered in English, regardless of a student’s primary
language, though English Language Learners are permitted to use the same modifications they
use on a daily basis in the classroom. Exams are scored on a scale of 275-450 and students must
score a minimum of 350 on each section to earn a high school diploma. Students scoring above
the passing score may also be classified as Proficient or Advanced, with a score of roughly 380 indicating proficiency on both exams, though the Advanced cut score is set higher for math than for ELA: roughly 420, as compared to 400. The CAHSEE is first administered to students in the spring of
tenth grade, and students have at least five subsequent opportunities to retake the sections they have not yet passed (twice each in eleventh and twelfth grade, and at least once following the end of the twelfth grade school year).
Both exams are used for state and federal accountability. For state accountability, the
ELA and math CSTs contribute roughly 27% and 18% of a high school’s Academic Performance
Index (API), respectively. For Federal accountability, these two exams also contribute to
calculating a school's Adequate Yearly Progress (AYP) through their API score (California uses a minimum API score, 680 in 2009-2010, or one point of growth on API as its fourth indicator to make AYP). The CAHSEE passing rate contributes 9% for each section to a high school's API. For AYP, the CAHSEE is used to measure both proficiency and graduation rates. Schools should
therefore have no incentives to prepare students for one exam at the expense of the other; if
anything, the alignment of incentives should encourage schools to emphasize the material and
skills common to both exams.
Students, on the other hand, are less likely to link their CST scores to immediate or
obvious repercussions as compared to their CAHSEE scores. The CSTs are only used in a few
instances, including class placement, reclassification for English Language Learners and
eligibility for gifted programs in early elementary school and eligibility for a small number of
magnet high schools. Importantly, in each of these instances, the test is one of many factors in
determining a school's decision, and the role the CST plays is fluid, with no minimum score required or publicized (the exception being in early elementary school, where eligibility for the
gifted program typically has a minimum score or percentile as one requirement). Students should
rarely, therefore, feel compelling personal stakes in their performance on the CSTs. The
CAHSEE, on the other hand, provides a clear cut score and personal repercussions to falling
below the score: failure to graduate.
The two exams share much in common. Both are written by the Educational Testing
Service (ETS). The exams are untimed, allow for the same accommodations for English
Language Learners and are rigorously checked for cultural bias in the test content by ETS using
Differential Item Functioning (DIF). The CST gets a second check for cultural bias by the
Human Resources Research Organization (HumRRO), which the State of California contracted
as an independent evaluator of the exam. As mentioned above, both sections on both tests are
aligned to state course or grade level standards. California Content Standards for English
Language Arts consist of the same five content substrands in each grade: Word Analysis,
Reading Comprehension, Literary Response and Analysis, Writing Strategies and Writing
Conventions. Each grade’s CST includes a multiple choice assessment of grade level material for
each content substrand. The 4th and 7th grade CSTs have an additional writing assignment, given
on a separate day from the multiple choice exam. The ELA CAHSEE exam consists of 45
multiple choice questions and a single writing assignment. The multiple choice questions
primarily cover 9th and 10th grade state standards, with a small number of 8th grade standards
included. Both exams assign roughly 50% of points to writing and 50% to the other three
content substrands. Due to the writing assessment offered in the CAHSEE, the exit exam’s
multiple choice section places comparatively less weight on the two writing substrands than does
the CST.
The comparison of material on the math exams is less straightforward, given the end of
course nature of the CST exam for students in the 8-10th grades. In the 2nd-7th grades, CST math
exams are aligned to the same five content substrands: Number Sense; Algebra and Functions; Measurement and Geometry; Statistics, Data Analysis, and Probability; and Mathematical Reasoning (though this final substrand is tested in ways "embedded" in the other four content substrands). Seventy-five percent of the
CAHSEE's math standards are aligned to grade 7 content standards, with 10% coming from the
Statistics, Data Analysis, and Probability strand of 6th grade standards and 15% coming from
Algebra I. In the 8th -10th grades, most non-Special Education students take some combination of
the General Math, Algebra I, Geometry and Algebra II exams (less than 5% of students take any
other exam in a given grade). The General Math Exam (typically taken in the 8th grade) aligns
closely to the material tested in the CAHSEE, as it primarily covers material from 7th grade
content standards, with a few questions from the Statistics, Data Analysis, and Probability strand
of 6th grade standards.
As mentioned above, measurement error in exams may play a role in biasing our
estimates in unknown directions, especially if the error varies by subgroup or throughout the
spectrum of exam scores. ETS, which writes both exams, provides technical documentation for
both exams, but the precise information they provide on each varies slightly. Table 1 offers a
summary of the Standard Errors of Measurement (SEMs) for different subgroups and the
Conditional Standard Errors of Measurement (CSEMs) at different cut points. These estimates
were taken from the ETS Technical Reports for the 2008-2009 CAHSEE (the most recent year
available online, posted in April 2010) and corresponding year for the CST. The CAHSEE
technical reports offer the same estimates for each exam administration; I present the estimates
for the March 2009 administration, which had the largest number of students taking the exam
(N = 374,364), though results look similar in other administrations. The table confirms that measurement error tends to be slightly smaller for higher achieving groups. Female, white, Asian,
non-English Language Learner and non-Special Education students have slightly lower SEMs
than male, African American, Hispanic, English Language Learner and Special Education
students, though with the exception of the ELL/ non-ELL distinction, these differences are
generally quite small. Additionally, these SEMs are provided for the entire test taking sample;
our sample excludes Special Education students, who have by far the largest SEMs. To the
extent that Special Education classification is correlated with subgroup, this should reduce the
already small differences in measurement accuracy between the groups. Of greater concern is the
difference in accuracy by achievement level. Both exams become less accurate near the top of
the distribution, with the CAHSEE in particular rising in measurement error for students at the
Advanced level.
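For reference, under classical test theory the SEM relates to a test's reliability as SEM = SD · sqrt(1 − reliability). A quick illustration: the reliability of roughly .94 is the published CST figure cited later in the text, while the scale-score SD used here is purely hypothetical:

```python
import math

# Classical test theory: SEM = SD * sqrt(1 - reliability).
# reliability ~.94 is the published figure cited later in the text;
# the scale-score SD of 25 is a hypothetical stand-in.
sd = 25.0
reliability = 0.94
sem = sd * math.sqrt(1 - reliability)  # about 6.1 scale-score points
```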
Descriptive Statistics
I begin with 100,910 students in the classes scheduled to graduate between 2006-2011 in
three districts. As mentioned earlier, I exclude the 11,474 students who have ever been classified
as Special Education, as Special Education students are not required to pass the CAHSEE,
though these students generally take the exam for state and federal accountability purposes. A
small number are dropped for missing ELL status (322) and, to simplify the analysis, I drop
students whose race/ ethnicity is listed as Native American or “unknown” (2,956). We lose a
large number – 26,693 – by dropping students missing a CAHSEE score in the spring of 10th
grade or a CST score or math exam indicator between the 8th-10th grade. The majority of these
missing scores are missing in the 8th and/or 9th grade, indicating that our sample is more
representative of students who remain in the district for three years, as opposed to those who
transfer in for high school. As the accountability systems include indicators for percent of
students taking the exams, schools have no incentive to prevent low achieving students from
taking a standardized test. We are left with an analytic sample of 59,455.
Table 2 provides basic descriptive statistics for the sample. Overall, roughly half of the
sample is female, 15% ELL and 70% eligible for Free and Reduced Lunch. Hispanic and Asian
students make up the largest subgroups with roughly 24,500 and 18,500 respectively, with white
and African American students contributing around 9,500 and 7,000, respectively. Female
students are less likely to be classified as ELL than their male peers, and outscore men by
roughly .2 standard deviations on both the 10th grade ELA CAHSEE and CST exams, though
men outscore women by about .1 standard deviation on the CAHSEE math exam. White students
are the highest achieving racial/ethnic group, scoring roughly .6 standard deviations above their cohort's mean on the 10th grade ELA exams and .45 standard deviations above their cohort's mean
on the math CAHSEE exam. White students are also substantially less likely than other students to receive free/reduced price lunch, and less likely than Asian and Hispanic students to be classified as ELL. Asian students are the next highest performing group, scoring .23
above the cohort mean on the 10th grade ELA CST exam and .16 above the mean on the ELA
CAHSEE. Asian students' math scores are comparable to white students' scores on the CAHSEE
math, scoring .41 above the mean. Asian students are, however, both more likely to be ELL than
their black and white peers and more likely to be eligible for free/ reduced lunch than their white
peers. Hispanic and black students are the lowest achieving racial/ ethnic subgroups, falling
between .25 and .44 below the cohort mean on each exam. Though a larger portion of Hispanic students are classified as ELL as compared to black students, both groups are more likely to be eligible for free/reduced price lunch than their white peers. Students eligible for free/reduced
price lunch score below the mean in all exams, with the third of the sample not eligible for free
lunch scoring roughly half a standard deviation above the mean on each exam. Students
classified as English Language Learners are by far the lowest performing subgroup, scoring 1.1
standard deviations below the mean on the 10th grade ELA CST and the CAHSEE math, and 1.5
standard deviations below the mean on the CAHSEE ELA exam.
Figures 3a and 3b provide kernel density plots of 10th grade ELA and Math CAHSEE
scores by racial/ ethnic group, with lines demarcating the scores at which a student is classified
as basic (a passing score on the exam), proficient and advanced. They demonstrate both the
differential performance by race and the ceiling effects regarding the CAHSEE. White and Asian
students’ distributions are shifted to the right of the graph, with large bumps in the upper tail of
the CAHSEE for these two groups. There is therefore a much larger percentage of white and
Asian students measured with more error at the top of the distribution and a very small number
of white students at the lower tail end of the distribution.
Table 3 provides descriptive statistics that break down the mean ELA and math CAHSEE
scores by race and gender. The first column for each score is the mean for the full sample; the second column is the mean score for students who are non-ELL and non-Free Lunch eligible.
Reducing the sample to non-ELL and non-Free Lunch students reduces the sample size by about
25% for white students, 70% for Asian students, 86% for Hispanic students and 75% for black
students. Limiting the sample to more privileged students also increases the mean achievement
for each racial and gender group, with white students’ means improving by about .15 standard
deviation on both exams, and non-white students’ means increasing between .3 and .5 standard
deviations on each exam. The most dramatic improvements in mean scores come among Hispanic students, on both the ELA and math sections. Once we limit the sample to non-ELL, non-Free
Lunch students, white women continue to have the highest ELA score (.918), though Asian
women outperform white men (.820 as compared to .666); Asian men and women have the
highest mean scores on the math exam (.798 and .801), with white males trailing at .658. The
only groups whose more privileged samples fall below the overall cohort means are black males on ELA (-.105) and Hispanic and black women on math (-.076 and -.223).
Analysis
I use student level longitudinal administrative data from three large, urban districts in
California to examine a high stakes/low stakes test gap between groups of students, in particular girls, students of color and English Language Learners in comparison to boys, whites, and non-English Language Learners, as well as the interaction of race and gender in these effects. I begin with a model predicting students' CAHSEE scores based on both their prior achievement and group membership. Following these main results, I address differences between schools in preparing students for the exam, statistical concerns in comparing the exam scores, and differential content between the two exams.
Basic Model and Results
I begin by estimating the high stakes CAHSEE score Y on exam E (math or ELA) for student i in cohort c in district d as a function of their low stakes achievement CST scores:

Y^E_icd = CST^E_icd Β + X_icd θ + α_cd + e_i
In the above equation Β is a vector of coefficients on the prior achievement proxies – the
8th, 9th and 10th grade CST scores (as well as their quadratic and cubic terms) for the relevant
exam. I determined quadratic and cubic terms by comparing the fit of models using linear terms
and adding polynomial terms through the quintic term. While the quadratic and cubic terms were
consistently statistically significant, the quartic and quintic terms were only occasionally
statistically significant, depending on the covariates included, and they did not tend to increase
the model fit, as measured by R2. Finally, the cubic term allows for a flexible enough functional
form that this basic model only predicts students to score under or over the possible range for
CAHSEE scores for .06% of cases. Models estimating the CAHSEE math score will also include
indicator variables for the actual end-of-course math exam taken. X is a vector of indicator variables for group membership for women, students of color, ELL students and free/reduced lunch eligibility; all models are run both with and without race by gender interactions. Because our data come from multiple school districts and cohorts of students, I include district-by-cohort fixed effects α_cd.
The outcome of interest is the vector θ, which describes differences between the subgroups defined by X in the average gap between high- and low-stakes standardized test scores; in this case X is a vector of dummies indicating ELL status, free/reduced lunch eligibility and race-by-gender. The subgroups for which θ is negative and statistically significant
demonstrate a subgroup achievement gap between predicted performance based on prior (and
concurrent) low stakes test scores and the high stakes CAHSEE in comparison to that of the
reference group – white, male, non-ELL, non-Free/ Reduced Lunch students with an average
score on the 8th, 9th and 10th grade CST.
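The estimating equation can be sketched on simulated data. Everything below (the sample size, coefficients, single group dummy, and data-generating process) is illustrative rather than the paper's actual data; the fixed effects enter as one dummy per district-by-cohort cell:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000

# Hypothetical stand-ins for the administrative data
cst10 = rng.normal(0.0, 1.0, n)               # low-stakes 10th grade CST score
female = rng.integers(0, 2, n).astype(float)  # one example group dummy from X
dc = rng.integers(0, 6, n)                    # six district-by-cohort cells

# Simulated truth: a cubic in the CST score, a group gap theta = .05,
# and a district-by-cohort intercept alpha
alpha = np.array([0.1, -0.2, 0.0, 0.3, -0.1, 0.2])
cahsee = (0.8 * cst10 + 0.1 * cst10**2 - 0.02 * cst10**3
          + 0.05 * female + alpha[dc] + rng.normal(0.0, 0.3, n))

# Design matrix: cubic in the prior score, the group dummy, and one
# dummy per district-by-cohort cell (fixed effects; no separate constant)
fe = (dc[:, None] == np.arange(6)).astype(float)
X = np.column_stack([cst10, cst10**2, cst10**3, female, fe])
beta, *_ = np.linalg.lstsq(X, cahsee, rcond=None)
theta_hat = beta[3]  # estimated high-stakes/low-stakes gap for the group
```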
I estimate the predicted CAHSEE scores both by math and ELA exam for the entire
sample, then separately by prior achievement level. I define prior achievement level as the
student’s proficiency level on the 9th grade ELA CST exam. I use the ELA proficiency score in
the 9th grade both because it loosely aligns to proficiency scores on the ELA CAHSEE and to
avoid issues either of selecting on the dependent variable or of selection bias caused by the
different end of course math exams taken by students. Subdividing students by prior achievement
provides several advantages. First, as the score distribution of students in the reference group (white, male, non-ELL, non-Free Lunch) is disproportionately high in comparison to that of their non-white, ELL and free lunch eligible classmates, subdividing by prior achievement prevents us
from extrapolating beyond the region of common support for these students. Second, focusing on
students around the middle of the distribution will provide us with the most accurate test scores.
Finally, comparing the results across the distribution of ability may provide us some insight as to
whether the high-stakes/ low stakes gap varies by student ability.
Figures 4a & 4b describe the distribution of ELA and math CAHSEE scores by 9th grade
ELA proficiency level. The “Far Below Basic” group is the smallest, with a wide distribution
centered towards the bottom of the scores. The “Advanced” students have a long tail, but are
clustered disproportionately at the top of the score distribution, where test scores contain the
most measurement error. The middle three distributions, on the other hand, both contain a large
sample size and are centered between 350 and 400, where scores are measured more accurately.
These three groups should, therefore, provide us with the most reliable results.
Main Results
Results for ELA can be seen in Table 4. After conditioning on prior and concurrent achievement, race/ethnicity, and ELL and free/reduced price lunch status, women
score .054 above men on the ELA CAHSEE. When looking at the estimates by prior
achievement, it appears that this gender gap is driven by Basic, Proficient and Advanced
students; these coefficients are positive and statistically significant, whereas the coefficient on
gender is negative and statistically significant for the Far Below Basic students and essentially
zero for Below Basic students. Hispanic and Asian students underperform by about a third of a
standard deviation, with these results driven primarily by students in the middle of the
distribution. Overall, black students underperform observationally similar white students by .86
standard deviations, with these results being driven by low achieving students; coefficients on
black are negative for all five achievement groups, but the point estimate becomes smaller as the
achievement group increases, with the coefficient on the Advanced group the smallest and not
statistically different from zero, despite having a comparable standard error to the other groups.
The second set of models in Table 4 presents race by gender interactions for the CAHSEE ELA. Gender patterns within race look similar to the models that do not include race by gender interactions. For each subgroup, women outperform observationally similar men of the same race/ethnicity on the CAHSEE ELA, and for each of these groups this outperformance seems driven by higher achieving students, with race by gender coefficients being positive and statistically significant only for the top three achievement groups. The race by gender
coefficients are negative for the Far Below Basic sample for women of all four groups (though
this coefficient is statistically significant only for Hispanic women). For the Below Basic sample,
the gender coefficients are negative and statistically insignificant for black and Hispanic women
and positive but statistically insignificant for white and Asian women.
The results of the F-test for the gender coefficients are less clear. When analyzing the full
sample, we can reject the hypothesis that the race by gender coefficients are all equal, despite the
fact that they are all positive. Post hoc t-tests (results not shown, here or for any subsequent post hoc t-tests) indicate that the
race by gender interaction coefficients for white and Asian women are similar to each other, but
different from those of black and Hispanic women, who are more similar to each other. When we
look within prior ability samples, however, the relationship between race and gender looks more
similar for each racial group – we can only reject the null that the race by gender interactions are
the same for the “Proficient” group, and post hoc t-tests indicate that the only difference is
between Asian and Hispanic women.
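The F-test of equal race-by-gender coefficients can be formed by comparing residual sums of squares from restricted and unrestricted models. A schematic sketch on simulated data (cell labels, sample size and effect sizes are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
g = rng.integers(0, 4, n)                     # four hypothetical race-by-gender cells
y = 0.1 * (g == 1) + rng.normal(0.0, 1.0, n)  # only cell 1 differs from the rest

# Unrestricted model: intercept plus dummies for cells 1-3
D = np.column_stack([np.ones(n), (g[:, None] == np.arange(1, 4)).astype(float)])
b, *_ = np.linalg.lstsq(D, y, rcond=None)
rss_u = np.sum((y - D @ b) ** 2)

# Restricted model imposes equal cell means: intercept only
rss_r = np.sum((y - y.mean()) ** 2)

# F statistic for the q = 3 equality restrictions
q, dof = 3, n - 4
F = ((rss_r - rss_u) / q) / (rss_u / dof)
```

A post hoc t-test of any single pairwise contrast then compares two cell coefficients against the standard error of their difference.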
A lack of compelling difference in these interaction terms should not be confused with
women of different races having similar high stakes/ low stakes gaps, however. For both the
overall sample and the middle three proficiency categories, post-hoc t-tests indicate that white
women generally outperform observationally similar women of other races on the high stakes
test. Hispanic and Asian women also tend to outperform black women.
The interaction models indicate that, for the overall sample, Asian and black men
continue to underperform relative to their white male counterparts, with these differences driven
by the lower middle part of the distribution. When using the overall sample, Hispanic men do not
underperform relative to observationally similar white men, though their underperformance is
statistically significant in the middle achievement samples. Post hoc t-tests indicate that Hispanic
men tend to underperform less than black men.
For math, men outperform similar women by roughly .1, with the high stakes/ low stakes
performance gap increasing by about a third within students of the middle three proficiency
categories. Asian and Hispanic students exhibit similar patterns of underperformance on the
CAHSEE, despite very different overall performance levels on the exam. On average, as seen in
the descriptive Table 3, Asian students outscore Hispanic students by about .35 standard
deviations – an achievement gap that jumps to about .8 when limiting the sample to non-ELL,
non-Free lunch students. In these models, however, we see that both groups underperform
observationally similar white students by about .05 standard deviations, with underperformance
being strongest in the Below Basic and Basic category for both groups. Post hoc t-tests suggest
that Asian students exhibit less underperformance than Hispanic students for the two lowest
achieving groups, but point to similar underperformance on the three highest achieving groups.
Black students exhibit the most underperformance on the math exam, however, with a coefficient
of -.14. Their underperformance is significant for all five performance levels and, much like
Asian and Hispanic students, estimates are largest in magnitude in the Below Basic and Basic categories. Post
hoc t-tests indicate that their underperformance is significantly greater than that of all three other
racial groups.
Similarly to the ELA results, race by gender coefficients indicate that the relationship between race, gender and underperformance on the CAHSEE is similar within each racial group.
There is some indication that high achieving Asian women exhibit less underperformance
relative to Asian men than white or Hispanic women in relation to observationally similar men,
but given both the inaccuracy of the test at the high end and the number of t-tests performed,
these results are not terribly compelling. A more interesting pattern emerges, however, when looking at the overall performance of students of different genders and races. Specifically, post hoc t-tests indicate that Hispanic and Asian women typically exhibit
differential underperformance from each other, though Hispanic and Asian males generally
exhibit similar levels of underperformance – so while the gender relationship is similar within
race, the observed Hispanic – Asian difference in the earlier model was driven by a differential
gender gap.
Threat of school differences
While there is some reason to believe that schools could play an intervening role in
student performance, it is more difficult to believe that these particular effects are driven solely
by differential school behavior. First, school segregation is unlikely to drive the gender results.
Second, as discussed above, schools have incentives to prepare students for both exams.
Regardless, to eliminate the possibility that a high stakes/ low stakes achievement gap may be
driven by differences in high schools attended by students, I will add school by cohort fixed effects α_sc and estimate:

Y^E_isc = CST^E_isc Β + X_isc θ + α_sc + e_i
Our coefficient of interest θ would now represent the achievement gap between groups
for students with the same levels of prior achievement who attend the same school. If θ remains
negative and statistically significant then between-school differences cannot be responsible for
the observed high stakes/ low stakes test gap, though these school fixed effects do not eliminate
the possibility that the high stakes/ low stakes achievement gap varies between schools. Tables 6
and 7 provide the estimates controlling for school by cohort fixed effects predicting the ELA and
Math CAHSEE. The point estimates change very little when controlling for school effects in
either math or ELA.
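School-by-cohort fixed effects can be absorbed either with one dummy per cell or, equivalently, by demeaning the outcome and regressors within each cell (the "within" transformation). A sketch on simulated data, with all names and parameters hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n, ncells = 3000, 40
g = rng.integers(0, ncells, n)                  # hypothetical school-by-cohort cells
x = rng.normal(0.0, 1.0, n) + 0.2 * g / ncells  # prior score, correlated with cell
alpha = rng.normal(0.0, 0.5, ncells)            # cell intercepts
y = 0.9 * x + alpha[g] + rng.normal(0.0, 0.3, n)

def demean(v, g, ncells):
    """Subtract each cell's mean: the 'within' transformation."""
    means = (np.bincount(g, weights=v, minlength=ncells)
             / np.bincount(g, minlength=ncells))
    return v - means[g]

# Within estimator: demean y and x by cell, then OLS on the residuals
b_within = np.linalg.lstsq(demean(x, g, ncells)[:, None],
                           demean(y, g, ncells), rcond=None)[0][0]

# Equivalent dummy-variable regression with one intercept per cell
D = (g[:, None] == np.arange(ncells)).astype(float)
b_dummy = np.linalg.lstsq(np.column_stack([x, D]), y, rcond=None)[0][0]
```

By the Frisch-Waugh-Lovell theorem the two approaches return the same slope, which is why the within transformation is the standard shortcut when the number of cells is large.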
Measurement error
I next address the concern that these results are driven by regression to the mean. As
noted above, regression to the mean may cause estimates of θ to be artificially high by conflating
measurement error with group membership. I use two techniques to deal with the threat of
measurement error: shrunken estimates and differences.
First, Tables 8 and 9 present estimates using 10th grade CST test scores that have been shrunken toward their predicted values. As with all other models, the outcome variable is the ELA and
Math CAHSEE score, standardized by cohort. Controls other than those shown include
"shrunken" 10th grade Math CST scores, as well as the quadratic and cubic versions of these
shrunken scores. The shrunken scores are a weighted average of the student's observed 10th grade score (or that score squared or cubed) and their predicted score. The predicted score is
calculated by regressing the 10th grade Math CST test scores (or those scores squared or cubed)
on 8th & 9th grade test scores (and those scores squared and cubed), dummies for math CST
taken in the 8th-10th grade, the demographic controls to be included in the model (ELL, free lunch,
race and gender or race by gender) and district by cohort fixed effects. The "shrunken" score
weights the observed score by .85 and the predicted score by .15, arrived at by averaging the
published test reliability of roughly .94 and the observed R2 of roughly .75 when predicting the
10th grade CST scores from the 8th and 9th grade scores. Students are reclassified into predicted
"shrunken" proficiency categories based on where the shrunken (linear) ELA score would place a
student according to the actual cut scores for that cohort. Estimates from these models change
very little from the OLS models.
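The shrinkage step can be sketched with simulated scores in place of the administrative data. The actual first-stage regression described above also includes polynomial terms, exam dummies, demographics, and fixed effects, which are omitted here:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Hypothetical scores: the 8th and 9th grade CSTs predict the 10th
cst8 = rng.normal(0.0, 1.0, n)
cst9 = rng.normal(0.0, 1.0, n)
cst10 = 0.5 * cst8 + 0.4 * cst9 + rng.normal(0.0, 0.4, n)

# Predicted 10th grade score from a regression on the earlier scores
Z = np.column_stack([np.ones(n), cst8, cst9])
b, *_ = np.linalg.lstsq(Z, cst10, rcond=None)
predicted = Z @ b

# "Shrunken" score: observed weighted .85, predicted weighted .15,
# the weights described in the text
shrunken = 0.85 * cst10 + 0.15 * predicted
```

Because the predicted component strips out idiosyncratic noise, the shrunken scores have a slightly smaller variance than the observed scores while preserving their mean.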
Finally, I move to a Two Stage Least Squares model. I use students’ 8th and 9th grade
CST scores (and, in the case of the CAHSEE math, the math exam taken in each grade) to instrument for the 10th grade CST score, thus yielding estimates of θ that are not biased by the
measurement error. The results for these models are presented in Table 10. The overall pattern
for these results looks fairly similar to the OLS results, with a few notable differences. The overperformance by women in ELA appears, in these models, to be driven exclusively by white and
Asian women. White women no longer appear to underperform on the math section, though the
point estimate is negative. Hispanic men do not display a high stakes/ low stakes achievement
gap on either section of the exam, though the interaction term on Hispanic women is significant
and negative for the math exam. African American and Asian students continue to underperform
on both sections of the exam, with the gender interaction term being negative for both groups on
the math exam, and positive for women on the ELA exam.
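The 2SLS logic can be illustrated on simulated data with classical measurement error, showing the attenuation in OLS and its removal when earlier scores serve as instruments. All quantities below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000

# Latent achievement plus classical measurement error in each test
ability = rng.normal(0.0, 1.0, n)
cst8 = ability + rng.normal(0.0, 0.4, n)   # instruments: earlier scores
cst9 = ability + rng.normal(0.0, 0.4, n)
cst10 = ability + rng.normal(0.0, 0.4, n)  # mismeasured regressor
female = rng.integers(0, 2, n).astype(float)
cahsee = 0.8 * ability + 0.05 * female + rng.normal(0.0, 0.3, n)

ones = np.ones(n)

# OLS: measurement error in cst10 attenuates its coefficient toward zero
ols = np.linalg.lstsq(np.column_stack([ones, cst10, female]),
                      cahsee, rcond=None)[0]

# 2SLS: first stage projects cst10 on the instruments; second stage
# replaces cst10 with its fitted values
Zmat = np.column_stack([ones, cst8, cst9, female])
first = np.linalg.lstsq(Zmat, cst10, rcond=None)[0]
fitted = Zmat @ first
tsls = np.linalg.lstsq(np.column_stack([ones, fitted, female]),
                       cahsee, rcond=None)[0]
```

The instruments share the signal (latent achievement) but not the 10th grade noise, so the 2SLS slope recovers the structural coefficient that OLS attenuates.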
Discussion
Overall, these results are suggestive, though not conclusive, of a differential
underperformance on the high stakes high school exit exam by race and gender, dependent on the
stakes attached to an exam. Women tend to perform higher than their low stakes test scores
would suggest on the ELA, but lower on the math – a result consistent with a stereotype threat
interpretation. Asian men’s underperformance on the math exam, however, makes a stereotype
threat interpretation more difficult. On the other hand, Asian and African American students
underperform on the Math section, with men underperforming on the ELA section, as well, a
story more in line with differential effort due to a collectivist culture. This interpretation is problematic as well: why would women exert differential effort on exams of different subjects?
These puzzling results raise doubt about the accuracy and unbiasedness of the exams used for
so many important purposes. While the effect sizes are small, a difference of only a few points
may prevent a student from gaining a high school diploma, entrance to the college of their
choice, or a scholarship. Educators and policymakers should therefore use caution when
considering test scores as part of an accountability system, lest small differences in bias unfairly
exacerbate greater inequality in educational opportunity.
Moreover, these results call for further investigation to understand the source of differential
student performance given the stakes of the exam. Effort may be hard to counteract, given the
weak findings from the Fryer and Bettinger experiments paying American children for high test
scores (Bettinger, 2010; Fryer, 2010). Experiments aimed at lifting the effects of stereotype
threat, on the other hand, have had more success; a series of short self-affirmation writing
assignments lifted the GPAs of African American students in the treatment group by .4 over the
course of 2 years (Cohen, Garcia, Purdie-Vaughns, Apfel, & Brzustoski, 2009). A clearer
understanding of the causes of these high stakes/ low stakes gaps would help guide interventions
to give us the most accurate test results possible.
Figure 1: The presumed relationship between group membership and test scores
[Path diagram: Race → school or societal features → lower achievement → lower test scores]

Figure 2: The presumed relationship between high and low stakes test scores
[Path diagram: Race → school or societal features → lower achievement → lower low-stakes test
scores and lower high-stakes test scores]
Figure 3a: Distribution of Racial/Ethnic Groups on the ELA CAHSEE
[Density plot of ELA CAHSEE scores (roughly 250-450) for White, Asian, Hispanic, and Black
students, with vertical reference lines at Pass/Basic = 350, Proficient ≈ 380, and Advanced ≈ 400.]

Figure 3b: Distribution of Racial/Ethnic Groups on the Math CAHSEE
[Density plot of Math CAHSEE scores (roughly 250-450) for White, Asian, Hispanic, and Black
students, with vertical reference lines at Pass/Basic = 350, Proficient ≈ 380, and Advanced ≈ 420.]
Figure 4a: Distribution of ELA CAHSEE Scores, by 9th Grade ELA Proficiency Level
[Density plot of ELA CAHSEE scores (roughly 250-450) for students at each 9th grade ELA proficiency
level (Far Below Basic, Below Basic, Basic, Proficient, Advanced), with vertical reference lines at
Pass/Basic = 350, Proficient ≈ 380, and Advanced ≈ 400.]

Figure 4b: Distribution of Math CAHSEE Scores, by 9th Grade ELA Proficiency Level
[Density plot of Math CAHSEE scores (roughly 250-450) for students at each 9th grade ELA
proficiency level, with vertical reference lines at Pass/Basic = 350, Proficient ≈ 380, and
Advanced ≈ 420.]
Table 1: Measurement Error in the CAHSEE and CST

[Table reports conditional standard errors of measurement (CSEMs) at each proficiency cut point
(Below Basic, Basic/Pass, Proficient, Advanced) and standard errors of measurement (SEMs) for
subgroups defined by gender, race/ethnicity (American Indian, Asian, Pacific Islander, Filipino,
Hispanic, African American, White), ELL status, special education status, and free lunch status,
for the ELA CAHSEE (multiple choice only; multiple choice plus essay) and the ELA CST in grades
8, 9, and 10.]

Source: 2008-2009 ETS Technical Reports for the California High School Exit Exam and California
Standards Tests
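Table 1's SEMs come from the ETS technical reports, but the underlying relationship is the standard
classical test theory one: the standard error of measurement equals the score SD scaled by the
square root of one minus the test's reliability. A minimal sketch, using illustrative numbers that
are not taken from the reports:

```python
import math

def sem(sd, reliability):
    """Classical test theory standard error of measurement: SD * sqrt(1 - r)."""
    return sd * math.sqrt(1.0 - reliability)

# Illustrative only: a test with scale-score SD 40 and reliability 0.92
# has an SEM of roughly 11.3 scale points.
example = sem(40.0, 0.92)
```

Higher reliability shrinks the SEM toward zero; a perfectly reliable test (r = 1) would have no
measurement error under this model.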
Table 2: Demographic Breakdown and Mean Achievement Scores (standardized), by Subgroup

                      % Female   % ELL   % Free Lunch      N     Mean ELA    Mean ELA    Mean Math
                                                                 CST 10      CAHSEE      CAHSEE
Overall                  52%      15%        70%        59,541   0.00 (1.00)  0.00 (1.00)  0.00 (1.00)
By Gender
  Male                    0%      17%        68%        28,587  -0.10 (1.01) -0.11 (1.00)  0.05 (1.01)
  Female                100%      13%        71%        30,954   0.09 (0.98)  0.11 (0.99) -0.05 (0.99)
By Race/Ethnicity
  White                  50%       1%        29%         9,508   0.59 (1.01)  0.62 (0.89)  0.45 (0.91)
  Asian                  51%      17%        70%        18,460   0.23 (0.96)  0.16 (1.00)  0.41 (0.96)
  Hispanic               52%      23%        84%        24,452  -0.32 (0.90) -0.30 (0.92) -0.36 (0.90)
  Black                  57%       1%        75%         7,121  -0.29 (0.90) -0.24 (0.91) -0.44 (0.87)
By ELL Status
  Non-ELL                53%       0%        66%        50,790   0.18 (0.95)  0.20 (0.90)  0.15 (0.96)
  ELL                    45%     100%        89%         8,751  -1.04 (0.55) -1.18 (0.66) -0.86 (0.77)
By Free/Reduced Lunch Status
  Non-Free Lunch         50%       5%         0%        18,040   0.47 (1.01)  0.48 (0.95)  0.40 (0.96)
  Free Lunch             53%      19%       100%        41,501  -0.21 (0.92) -0.21 (0.95) -0.18 (0.96)

Note: The sample excludes students ever classified as Special Education. Test scores are
standardized within cohort; standard deviations in parentheses.
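The within-cohort standardization noted under Table 2 can be sketched as follows (a minimal
illustration on synthetic scores; the function and variable names are ours, not the paper's code):

```python
import numpy as np

def standardize_within_cohort(scores, cohorts):
    """Z-score each test score relative to its own cohort's mean and SD."""
    scores = np.asarray(scores, dtype=float)
    cohorts = np.asarray(cohorts)
    z = np.empty_like(scores)
    for c in np.unique(cohorts):
        mask = cohorts == c
        z[mask] = (scores[mask] - scores[mask].mean()) / scores[mask].std()
    return z

# Toy example: two cohorts whose raw scales differ
scores = [350, 370, 390, 300, 340, 380]
cohorts = [2006, 2006, 2006, 2007, 2007, 2007]
z = standardize_within_cohort(scores, cohorts)
```

After this transformation each cohort has mean 0 and SD 1, so scores are comparable across cohorts
even if the underlying scales drift from year to year.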
Table 3: Descriptives by Race and Gender, for All Students and the Non-ELL, Non-FL Sample

                            ELA CAHSEE                   Math CAHSEE
                     Full Sample   Non-ELL,       Full Sample   Non-ELL,
                                   Non-FL Only                  Non-FL Only
White Males             0.497         0.666          0.509         0.658
  sd                   (0.903)       (0.832)        (0.913)       (0.863)
  N                     4,734         3,427          4,734         3,427
White Females           0.745         0.918          0.383         0.558
  sd                   (0.863)       (0.788)        (0.901)       (0.847)
  N                     4,768         3,335          4,768         3,335
Asian Males             0.026         0.513          0.423         0.798
  sd                   (0.983)       (0.825)        (0.962)       (0.824)
  N                     9,020         2,614          9,020         2,614
Asian Females           0.294         0.820          0.399         0.801
  sd                   (0.997)       (0.797)        (0.968)       (0.834)
  N                     9,422         2,582          9,422         2,582
Hispanic Males         -0.404         0.105         -0.307         0.065
  sd                   (0.930)       (0.879)        (0.920)       (0.897)
  N                    11,743         1,649         11,743         1,649
Hispanic Females       -0.196         0.305         -0.402        -0.076
  sd                   (0.909)       (0.852)        (0.871)       (0.869)
  N                    12,668         1,704         12,668         1,704
Black Males            -0.369        -0.105         -0.391        -0.172
  sd                   (0.901)       (0.954)        (0.883)       (0.891)
  N                     3,048           800          3,048           800
Black Females          -0.138         0.187         -0.474        -0.223
  sd                   (0.906)       (0.923)        (0.857)       (0.895)
  N                     4,052           964          4,052           964

Note: The sample excludes students ever classified as Special Education; the restricted columns
further exclude students classified as English Language Learners or eligible for Free/Reduced
Price Lunch in the 10th grade. Test scores standardized within cohort.
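The cell statistics in Table 3 are simple within-group summaries of the standardized scores. A
sketch of how such descriptives could be computed (synthetic data; the group labels and function
name are illustrative, not from the paper's code):

```python
import numpy as np

def group_descriptives(scores, groups):
    """Mean, SD, and N of standardized scores within each race-by-gender cell."""
    scores = np.asarray(scores, dtype=float)
    groups = np.asarray(groups)
    out = {}
    for g in np.unique(groups):
        x = scores[groups == g]
        out[g] = (x.mean(), x.std(), x.size)
    return out

# Toy example with three hypothetical race-by-gender cells
stats = group_descriptives(
    [0.5, 0.7, -0.4, -0.2, -0.3],
    ["white_m", "white_m", "hisp_m", "hisp_m", "black_m"],
)
```

Each dictionary entry mirrors one three-row block of Table 3: a group mean, its standard deviation,
and the cell count.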
Table 4: OLS Models Predicting 10th Grade ELA CAHSEE Score, by 9th Grade ELA CST Proficiency Level

[Table reports coefficients and standard errors on a female indicator (first specification) and on
race, race-by-female, ELL, and free lunch indicators (second specification), estimated for the full
sample and separately within each 9th grade ELA CST proficiency level. In the full sample, the
coefficient on female is 0.054*** (0.004). N (adjusted R²) by column: Full Sample 59,541 (0.778);
Far Below Basic 3,074 (0.305); Below Basic 8,803 (0.355); Basic 19,005 (0.410); Proficient 16,396
(0.414); Advanced 12,263 (0.406). Each column also reports the p-value on an F-test that all gender
interactions are equal.]

NOTE: All models include controls for students' 8th-10th grade CST scores in the relevant subject,
plus those scores squared and cubed. All test scores are standardized by cohort.
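Table 4's note describes models that control for cubic polynomials in prior CST scores alongside
demographic indicators and their interactions. A minimal sketch of that kind of design matrix, fit
by least squares on synthetic data (the data-generating process, coefficient values, and variable
names here are illustrative assumptions, not the paper's code or estimates):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
female = rng.integers(0, 2, n)           # demographic indicator
black = rng.integers(0, 2, n)            # one illustrative race indicator
cst = rng.normal(size=n)                 # prior (standardized) CST score
# Outcome with a small positive female gap, echoing the ELA pattern
cahsee = 0.9 * cst + 0.05 * female + rng.normal(scale=0.3, size=n)

# Design matrix: intercept, female, black, black-by-female interaction,
# and cubic controls for the prior test score
X = np.column_stack([
    np.ones(n), female, black, female * black,
    cst, cst**2, cst**3,
])
beta, *_ = np.linalg.lstsq(X, cahsee, rcond=None)
female_coef = beta[1]
```

The coefficient of interest is the one on the female indicator after the polynomial prior-score
controls absorb the predictive content of earlier tests, which is the logic of Tables 4-7.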
Table 5: OLS Models Using CST Scores to Predict 10th Grade Math CAHSEE Score, by 9th Grade ELA CST
Proficiency Level

[Table reports coefficients and standard errors on a female indicator (first specification) and on
race, race-by-female, ELL, and free lunch indicators (second specification), estimated for the full
sample and separately within each 9th grade ELA CST proficiency level. In the full sample, the
coefficient on female is -0.109*** (0.004). N (adjusted R²) by column: Full Sample 59,455 (0.810);
Far Below Basic 3,056 (0.433); Below Basic 8,769 (0.523); Basic 18,983 (0.651); Proficient 16,386
(0.699); Advanced 12,261 (0.683). Each column also reports the p-value on an F-test that all gender
interactions are equal.]

NOTE: All models include controls for students' 8th-10th grade CST scores in the relevant subject,
plus those scores squared and cubed. All test scores are standardized within cohort. Models also
include dummy variables for the math exam taken in each grade.
Table 6: OLS Models Predicting 10th Grade ELA CAHSEE Score, by 9th Grade ELA CST Proficiency Level,
Using School-by-Cohort Fixed Effects

[Table reports coefficients and standard errors on a female indicator (first specification) and on
race, race-by-female, ELL, and free lunch indicators (second specification), estimated for the full
sample and separately within each 9th grade ELA CST proficiency level. In the full sample, the
coefficient on female is 0.053*** (0.004). N (adjusted R²) by column: Full Sample 54,793 (0.782);
Far Below Basic 2,901 (0.328); Below Basic 8,202 (0.374); Basic 17,608 (0.419); Proficient 15,035
(0.427); Advanced 11,047 (0.424). Each column also reports the p-value on an F-test that all gender
interactions are equal.]

NOTE: All models include controls for students' 8th-10th grade CST scores in the relevant subject,
plus those scores squared and cubed. All test scores are standardized by cohort.
Table 7: OLS Models Using CST Scores to Predict 10th Grade Math CAHSEE Score, by 9th Grade ELA CST
Proficiency Level, Using School-by-Cohort Fixed Effects

[Table reports coefficients and standard errors on a female indicator (first specification) and on
race, race-by-female, ELL, and free lunch indicators (second specification), estimated for the full
sample and separately within each 9th grade ELA CST proficiency level. In the full sample, the
coefficient on female is -0.105*** (0.004). N (adjusted R²) by column: Full Sample 54,707 (0.814);
Far Below Basic 2,883 (0.456); Below Basic 8,168 (0.536); Basic 17,586 (0.662); Proficient 15,025
(0.704); Advanced 11,045 (0.690). Each column also reports the p-value on an F-test that all gender
interactions are equal.]

NOTE: All models include controls for students' 8th-10th grade CST scores in the relevant subject,
plus those scores squared and cubed. All test scores are standardized within cohort. Models also
include dummy variables for the math exam taken in each grade.
Table 8: Using Shrunken ELA CST10 to Predict 10th Grade ELA CAHSEE Score, by Shrunken 9th Grade ELA
CST Proficiency Level

[Table reports coefficients and standard errors on a female indicator (first specification) and on
race, race-by-female, ELL, and free lunch indicators (second specification), estimated for the full
sample and separately within each shrunken 9th grade ELA CST proficiency level. In the full sample,
the coefficient on female is 0.068*** (0.004). N (adjusted R²) by column: Full Sample 59,455
(0.739); Far Below Basic 2,884 (0.213); Below Basic 9,211 (0.250); Basic 19,545 (0.315); Proficient
16,482 (0.321); Advanced 11,333 (0.303). Each column also reports the p-value on an F-test that all
gender interactions are equal.]

NOTE: Models predict the CAHSEE ELA score, standardized by cohort. Controls other than those shown
include "shrunken" 10th grade ELA CST scores, as well as the quadratic and cubic versions of these
shrunken scores. The shrunken scores are a weighted average of the student's observed 10th grade
score (or that score squared or cubed) and their predicted score. The predicted score is calculated
by regressing the 10th grade ELA CST scores (or those scores squared or cubed) on 8th and 9th grade
test scores (and those scores squared and cubed), the demographic controls included in the model
(ELL, free lunch, race, and gender or race-by-gender), and district-by-cohort fixed effects. The
"shrunken" score weights the observed score by .85 and the predicted score by .15. Students are
reclassified into predicted "shrunken" proficiency categories based on where the shrunken (linear)
score would place them according to the actual cut scores for that cohort.
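The "shrunken" scores described in the note are a fixed-weight blend of each student's observed
score and a model-based prediction. A schematic of that weighting and the reclassification step
(the 0.85/0.15 weights come from the note; the cut scores and input values below are illustrative,
loosely based on the cut lines shown in the figures, not the actual cohort cuts):

```python
import numpy as np

OBS_WEIGHT, PRED_WEIGHT = 0.85, 0.15   # weights given in the table note

def shrink(observed, predicted):
    """Blend each observed 10th grade score with its model-based prediction."""
    return OBS_WEIGHT * np.asarray(observed) + PRED_WEIGHT * np.asarray(predicted)

def classify(shrunken, cuts=(350, 380, 400)):
    """Map shrunken scores onto proficiency categories via (illustrative) cut scores."""
    labels = ["Below Basic", "Basic", "Proficient", "Advanced"]
    return [labels[np.searchsorted(cuts, s, side="right")] for s in shrunken]

observed = np.array([345.0, 360.0, 410.0])
predicted = np.array([355.0, 350.0, 390.0])   # e.g., fitted values from prior scores
s = shrink(observed, predicted)
```

Pulling each observed score 15% of the way toward its prediction damps test-day noise before the
scores (and the resulting proficiency categories) enter the regressions in Tables 8 and 9.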
Table 9: Using Shrunken Math CST10 to Predict 10th Grade Math CAHSEE Score, by Shrunken 9th Grade
ELA CST Proficiency Level

[Table reports coefficients and standard errors on a female indicator (first specification) and on
race, race-by-female, ELL, and free lunch indicators (second specification), estimated for the full
sample and separately within each shrunken 9th grade ELA CST proficiency level. In the full sample,
the coefficient on female is -0.122*** (0.004). N (adjusted R²) by column: Full Sample 59,455
(0.762); Far Below Basic 2,884 (0.336); Below Basic 9,211 (0.405); Basic 19,545 (0.560); Proficient
16,482 (0.633); Advanced 11,333 (0.615). Each column also reports the p-value on an F-test that all
gender interactions are equal.]

NOTE: Models predict the CAHSEE Math score, standardized by cohort. Controls other than those shown
include "shrunken" 10th grade Math CST scores, as well as the quadratic and cubic versions of these
shrunken scores. The shrunken scores are a weighted average of the student's observed 10th grade
score (or that score squared or cubed) and their predicted score. The predicted score is calculated
by regressing the 10th grade Math CST scores (or those scores squared or cubed) on 8th and 9th
grade test scores (and those scores squared and cubed), dummies for the math CST taken in the
8th-10th grades, the demographic controls included in the model (ELL, free lunch, race, and gender
or race-by-gender), and district-by-cohort fixed effects. The "shrunken" score weights the observed
score by .85 and the predicted score by .15. Students are reclassified into predicted "shrunken"
proficiency categories based on where the shrunken (linear) ELA score would place them according to
the actual cut scores for that cohort.
Table 10: 2SLS Estimates

[Table reports 2SLS coefficients and standard errors on female, race, race-by-female, ELL, and free
lunch indicators for the ELA and Math CAHSEE. In the full sample, the coefficient on female is
0.026*** (0.005) for ELA and -0.050*** (0.007) for Math. N = 59,455 for each model. P-value on the
F-test that all gender interactions are equal: 0.000*** (ELA), 0.295 (Math).]

NOTE: Models use 8th-9th grade CST scores to instrument for the 10th grade CST score in the
relevant subject. Both predictor and instrument variables include quadratic and cubic terms. CAHSEE
scores are standardized within cohort. Models also include dummy variables for the math exam taken
in each grade as instruments.
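Table 10's estimator can be sketched as a two-step procedure: project the (noisily measured) 10th
grade score onto the earlier-grade instruments, then use the fitted values in the CAHSEE equation.
The synthetic example below is a manual 2SLS illustration of why this removes attenuation from
test-day noise; it is not the authors' code, and a real application would also correct the
second-stage standard errors:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
ability = rng.normal(size=n)
cst8 = ability + rng.normal(scale=0.5, size=n)   # instruments: earlier scores
cst9 = ability + rng.normal(scale=0.5, size=n)
cst10 = ability + rng.normal(scale=0.5, size=n)  # endogenous, noisy regressor
cahsee = ability + rng.normal(scale=0.5, size=n) # depends on ability, not CST-10 noise

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Stage 1: project the endogenous regressor onto the instruments
Z = np.column_stack([np.ones(n), cst8, cst9])
cst10_hat = Z @ ols(Z, cst10)

# Stage 2: replace CST 10 with its fitted value
beta_2sls = ols(np.column_stack([np.ones(n), cst10_hat]), cahsee)[1]

# For comparison, naive OLS on the noisy score is attenuated toward zero
beta_ols = ols(np.column_stack([np.ones(n), cst10]), cahsee)[1]
```

In this setup the OLS slope is biased toward roughly 0.8 by the measurement noise in the 10th grade
score, while the instrumented estimate recovers a slope near 1.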
Works Cited

Aronson, J., Lustina, M. J., Good, C., Keough, K., Steele, C. M., & Brown, J. (1999). When white
men can't do math: Necessary and sufficient factors in stereotype threat. Journal of Experimental
Social Psychology, 35, 29-46.

Baumert, J., & Demmrich, A. (2001). Test motivation in the assessment of student skills: The
effects of incentives on motivation and performance. European Journal of Psychology of Education,
441-462.

Ben-Zeev, T., Fein, S., & Inzlicht, M. (2005). Arousal and stereotype threat. Journal of
Experimental Social Psychology, 41, 174-181.

Bettinger, E. P. (2010, September). Paying to learn: The effect of financial incentives on
elementary school test scores. NBER Working Paper No. 16333.

Bifulco, R., Ladd, H. F., & Ross, S. (2007). Public school choice and integration: Evidence from
Durham, North Carolina. University of Connecticut Economics Working Papers, Number 2007-41.

Brown, R. P., & Josephs, R. A. (1999). A burden of proof: Stereotype relevance and gender
differences in math performance. Journal of Personality and Social Psychology, 76(2), 246-257.

Card, D., & Rothstein, J. (2007). Racial segregation and the black-white test score gap. Journal
of Public Economics, 91, 2158-2184.

Cohen, G. L., Garcia, J., Purdie-Vaughns, V., Apfel, N., & Brzustoski, P. (2009). Recursive
processes in self-affirmation: Intervening to close the minority achievement gap. Science,
324(5925), 400-403.

Coon, H. M., & Kemmelmeier, M. (2001). Cultural orientations in the United States: (Re)examining
differences among ethnic groups. Journal of Cross-Cultural Psychology, 32(3), 348-364.

Cullen, J. B., Jacob, B. A., & Levitt, S. D. (2005). The impact of school choice on student
outcomes: An analysis of Chicago Public Schools. Journal of Public Economics, 729-760.

Danaher, K., & Crandall, C. S. (2008). Stereotype threat in applied settings re-examined. Journal
of Applied Social Psychology, 38(6), 1639-1655.

Ferrara, S. (2006). Toward a psychology of large-scale educational achievement testing: Some
features and capabilities. Educational Measurement: Issues and Practice, 25(4), 2-5.

Fryer, R. G., Jr. (2010, April). Financial incentives and student achievement: Evidence from
randomized trials. NBER Working Paper No. 15898.

Gonzales, P. M., Blanton, H., & Williams, K. J. (2002). The effects of stereotype threat and
double-minority status on the test performance of Latino women. Personality and Social Psychology
Bulletin, 28, 659-670.

Good, C., Aronson, J., & Inzlicht, M. (2003). Improving adolescents' standardized test
performance: An intervention to reduce the effects of stereotype threat. Applied Developmental
Psychology, 24, 645-662.

Haladyna, T. M., & Downing, S. M. (2004). Construct-irrelevant variance in high-stakes testing.
Educational Measurement: Issues and Practice, 23(1), 17-27.

Jacob, B. A., & Levitt, S. D. (2003). Rotten apples: An investigation of the prevalence and
predictors of teacher cheating. The Quarterly Journal of Economics, 118(3), 843-877.

Keller, J. (2002). Blatant stereotype threat and women's math performance: Self-handicapping as a
strategic means to cope with obtrusive negative performance expectations. Sex Roles, 47(3/4),
193-198.

Kellow, J. T., & Jones, B. D. (2008). The effects of stereotype on the achievement gap: Reexamining
the academic performance of African American high school students. Journal of Black Psychology,
34(1), 94-120.

Lankford, H., Loeb, S., & Wycoff, J. (2002). Teacher sorting and the plight of the urban school.
Education Evaluation and Policy Analysis, 24, 37-62.

Linn, R. L., Koretz, D., & Baker, E. L. (1996). Assessing the validity of the National Assessment
of Educational Progress: NAEP technical review panel white paper. Los Angeles: Center for the
Study of Evaluation, CSE Technical Report 416.

Logan, J. R., Oakley, D., & Stowell, J. (2008). School segregation in metropolitan regions,
1970-2000: The impacts of policy choices on education. American Journal of Sociology, 1611-1644.

Markus, H. R., & Kitayama, S. (1991). Culture and the self: Implications for cognition, emotion,
and motivation. Psychological Review, 98(2), 224-253.

Messick, S. (1989). Meaning and values in test validation: The science and ethics of assessment.
Educational Researcher, 18(5), 5-11.

O'Brien, L. T., & Crandall, C. S. (2003). Stereotype threat and arousal: Effects on women's math
performance. Personality and Social Psychology Bulletin, 29, 782-789.

Reardon, S. F., Atteberry, A., Arshan, N., & Kurlaender, M. (2009). Effect of the California High
School Exit Exam on student persistence, achievement, and graduation. Institute for Research on
Education Policy and Practice, Working Paper #2009-12.

Rivkin, S. G., Hanushek, E. A., & Kain, J. F. (2005). Teachers, schools and academic achievement.
Econometrica, 417-458.

Rumberger, R. W., & Willms, J. D. (1992). The impact of racial and ethnic segregation on the
achievement gap in California high schools. Education Evaluation and Policy Analysis, 377-396.

Schmader, T., Johns, M., & Forbes, C. (2008). An integrated process model of stereotype threat
effects on performance. Psychological Review, 115(2), 336-356.

Shih, M., Pittinsky, T., & Ambady, N. (1999). Stereotype susceptibility: Identity salience and
shifts in quantitative performance. Psychological Science, 10(1), 80-83.

Smith, J. L. (2004). Understanding the process of stereotype threat: A review of mediational
variables and new performance goal directions. Educational Psychology Review, 16(3), 177-206.

Spencer, S. J., Steele, C. M., & Quinn, D. M. (1999). Stereotype threat and women's math
performance. Journal of Experimental Social Psychology, 35, 4-28.

Steele, C. M. (1997). A threat in the air: How stereotypes shape intellectual identity and
performance. American Psychologist, 52(6), 613-629.

Steele, C. M., & Aronson, J. (1995). Stereotype threat and the intellectual test performance of
African Americans. Journal of Personality and Social Psychology, 67(5), 797-811.

Stricker, L. J., & Ward, W. C. (2004). Stereotype threat, inquiring about test takers' ethnicity
and gender, and standardized test performance. Journal of Applied Social Psychology, 34(4),
665-693.

Walton, G. M., & Cohen, G. L. (2002). Stereotype lift. Journal of Experimental Social Psychology,
39, 456-467.

Walton, G. M., & Spencer, S. J. (2009). Latent ability: Grades and test scores systematically
underestimate the intellectual ability of negatively stereotyped students. Psychological Science,
20(9), 1132-1139.

Wicherts, J. M., Dolan, C. V., & Hessen, D. J. (2005). Stereotype threat and group differences in
test performance: A question of measurement invariance. Journal of Personality and Social
Psychology, 89(5), 696-716.