Running head: ESTIMATING ACHIEVEMENT GAPS Estimating Achievement Gaps from Test Scores Reported in Ordinal "Proficiency" Categories Andrew D. Ho Harvard Graduate School of Education Sean F. Reardon Stanford University Author Note This paper was supported in part by a grant from the Institute of Education Sciences (#R305A070377) and a fellowship from the National Academy of Education and the Spencer Foundation. The authors benefited from the research assistance of Demetra Kalogrides and Erica Greenberg of Stanford University and Katherine Furgol of the University of Iowa. ESTIMATING ACHIEVEMENT GAPS 1 Abstract Test scores are commonly reported in a small number of ordered categories. These contexts include state accountability testing, Advanced Placement tests, and English proficiency tests. This paper introduces and evaluates methods for estimating achievement gaps on a familiar standard-deviation-unit metric using data from these ordered categories alone. These methods hold two practical advantages over alternative achievement gap metrics. First, they require only categorical proficiency data, which are often available where means and standard deviations are not. Second, they result in gap estimates that are invariant to score scale transformations, providing a stronger basis for achievement gap comparisons over time and across jurisdictions. We find three candidate estimation methods that recover full-distribution gap estimates well when only censored data are available. Keywords: achievement gaps, proficiency, nonparametric statistics, ordinal statistics ESTIMATING ACHIEVEMENT GAPS 2 Estimating Achievement Gaps from Test Scores Reported in Ordinal "Proficiency" Categories Achievement gaps are among the most visible large-scale educational statistics. Closing achievement gaps among traditionally advantaged and disadvantaged groups is an explicit goal of state and federal education policies, including current and proposed authorizations of the Elementary and Secondary Education Act (Miller & McKeon, 2007). Gaps and gap trends are thus a commonplace topic of national and state report cards, newspaper articles, scholarly articles, and major research reports (e.g., Education Week, 2010; Jencks & Phillips, 1998; Magnuson & Waldfogel, 2008; Vanneman, Hamilton, Baldwin Anderson, & Rahman, 2009). Researchers selecting an achievement gap metric face three issues. First, average-based gaps—effect sizes or simple differences in averages—are variable under plausible transformations of the test score scale (Ho, 2007; Reardon, 2008a; Seltzer, Frank, & Bryk, 1994; Spencer, 1983). Second, gaps based on percentages above a cut score, such as differences in “proficiency” or passing rates, vary substantially under alternative cut scores (Ho, 2008; Holland, 2002). Third, researchers often face a practical challenge: Although they may wish to use an average-based gap metric, the necessary data may be unavailable. This last situation has become common even as the reporting requirements of the No Child Left Behind Act (NCLB) have led to large amounts of easily accessible test score data. The emphasis of NCLB on measuring proficiency rates over average achievement has led states and districts to report “censored data”: test score results in terms of categorical achievement levels, typically given labels like “below basic,” “basic,” “proficient,” and “advanced.” These censored data are often reported in lieu of traditional distributional statistics like means and standard deviations. A recent Center on Education Policy (2007) report noted that state-level black and white means and standard deviations required for estimating black-white achievement gaps were available in only 24 states for reading and 25 states for mathematics. Moreover, many ESTIMATING ACHIEVEMENT GAPS 3 of these states only made these statistics available upon formal request. Without access to basic distributional statistics, much less full distributional information, research linking changes in policies and practices to changes in achievement gaps becomes substantially compromised in the absence of alternative methodological approaches. This paper develops and evaluates a set of methods for estimating achievement gaps when standard distributional statistics are unavailable. The first half of this paper reviews traditional gap measures and their shortcomings and then presents alternative gap measures in an ordinal, or nonparametric framework. Links to a large literature in nonparametric statistics and signal detection theory are emphasized. This nonparametric approach generally assumes full information about the test score distributions of both groups. The second half of the paper introduces and evaluates methods for estimating achievement gaps using censored, unanchored data. This describes most readily available state testing data under NCLB, where only a small number of categories are defined, and the cut scores delineating categories are either unknown or not locatable on an interval scale. The contribution of the paper is a toolbox of transformationinvariant gap estimation methods that either overcomes or circumvents the aforementioned theoretical and practical challenges: transformation-dependence, cut-score-dependence, and the scarcity of standard distributional statistics. Traditional Achievement Gap Measures and Their Shortcomings A test score gap is a comparative statistic describing the difference between two distributions. Typically, the target of inference is the difference between central tendencies. Three “traditional” gap metrics dominate this practice of gap reporting. The first is the test’s score scale, where gaps are most often expressed as a difference in group averages. For a student ESTIMATING ACHIEVEMENT GAPS 4 test score, , a typically higher scoring reference group, , and a typically lower scoring focal group, , the difference in averages, , follows: . (1) The second traditional metric expresses the gap in terms of standard deviation units. This metric allows for standardized interpretations when the test score scale is not well established and affords aggregation and comparison across tests with differing score scales (Hedges & Olkin, 1985). Sometimes described as Cohen’s , this effect size expresses and quadratic average of both groups’ standard deviations, in terms of a , as follows: . (2) The third traditional metric, the percentage-above-cut (PAC) metric, has become particularly widespread under NCLB, which mandates state selection of cut scores delineating “proficiency.” Schools with insufficient percentages of proficient students face the threat of sanctions. The relevance of the cut score and the mandated reporting of disaggregated proficiency percentages lead to a readily available gap statistic: the difference in percentages of proficient students. If and are the percentages of groups and above a given cut score, the PAC-based gap is . (3) The PAC-based gap in Equation 3 is known to be dependent upon the location of the cut score (Ho, 2008; Holland, 2002). If the two distributions are normal and share a common variance, however, this cut-score dependence can be eliminated by a transformation of PACs onto the standard-deviation-unit metric using an inverse normal transformation. The resulting ESTIMATING ACHIEVEMENT GAPS gap estimate, denoted 5 , for the “difference in transformed percentages-above-cut” (DTPAC), is written: Φ Φ This method implicitly assumes the test scores in groups . and (4) are both normally distributed with equal variance. The resulting gap can be interpreted in terms of standard deviation units. If the distributions meet this strict normal, equal-variance assumption, Equation 4 returns the same effect size regardless of cut-score location. Moreover, this common effect size will be equal to Cohen’s . More formal demonstrations of the logic of this transformation are widespread (e.g., Hedges and Olkin, 1985; Ho, 2009). Table 1 displays these four gap measures as they describe the White-Black fourth-grade reading test score gap in Alabama. 1 Data are from the National Assessment of Educational Progress (NAEP) in 2005 and 2007. Gap trends are also shown as the simple gap change from 2005 to 2007. Row 1 of Table 1 shows that Alabama’s White-Black gap decreased 6.41 scale score points: from 32.0 to 25.6 between 2005 and 2007. Row 2 of Table 1 shows that the - based gap for Alabama decreased from 0.88 standard deviation units in 2005 to 0.74 standard deviation units in 2007, a decrease of 0.14 standard deviation units. Rows 3-5 show -based gaps for Alabama on the three NAEP cut scores. The NAEP cut scores are the lower borders of three NAEP achievement levels: Basic, Proficient, and Advanced. The gaps in the percentages above these cut scores are denoted in Rows 3, 4, and 5 as , , and respectively. The differences between gaps across the three cut scores are stark. Gaps appear to be much larger at Basic and Proficient cut scores than at the Advanced cut score. Gaps decrease 1 We chose Alabama for this illustrative example for two reasons. First, Alabama was one of only 7 out of a possible 200 state-subject-grade combinations with statistically significant NAEP White-Black gap trends between 2005 and 2007 (Vanneman, Hamilton, Baldwin Anderson, & Rahman, 2009). Second, the White-Black gap in Alabama in 2007 was similar in magnitude to that of the U.S. as a whole. ESTIMATING ACHIEVEMENT GAPS 6 5.52 percentage points at the Basic cut score but increase 1.45 and 1.34 percentage points at the Proficient and Advanced cut scores. These seeming contradictions cloud the overall finding of “a statistically significant gap closure” and encourage inferences about interactions between gap trends and particular regions of the score distributions. Finally, the DTPAC approach shown in Rows 6-8 yields results that differ substantially across cut scores, revealing violations of the normal, equal variance assumption in practice. For example, the DTPAC-based gap trend at the “Advanced” cut score shows twice the decline of the gap trend at other cut scores. Note that the relatively large gap trend for Alabama over other states makes sign-reversal—the most dramatic dependence—less likely across metrics. Alabama thus provides a somewhat conservative example of the degree to which interpretations may change under gap metric selection. Each of the four metrics compared in Table 1 has shortcomings. The first two averagebased metrics depend on the assumption that the test score scale has equal-interval properties (Reardon, 2008b). If equal-interval differences do not share the same meaning throughout all levels of the test score distribution, nonlinear transformations become permissible, and distortions of averages and Cohen-type effect sizes will result. In educational measurement, arguments for strict equal interval properties are difficult to support (Kolen & Brennan, 2004; Lord, 1980; Spencer, 1983; Zwick, 1992). And the magnitude of differences under plausible scale transformations can be of practical significance. Ho (2007) has shown that can vary by more than 0.10 from baseline values under plausible monotone transformations. Further, estimates of based on different reported test score metrics of the Early Childhood Longitudinal Study—Kindergarten Cohort (ECLS-K) reveal cross-metric gap trend differences as large as 0.10 (author calculations from Pollack, Narajian, Rock, Atkins-Burnett, & Hausken, 2005). This range is sufficient to call many gap comparisons and gap trends into question. ESTIMATING ACHIEVEMENT GAPS 7 Gap inferences based on PAC-based metrics are subject to a different kind of distortion. Holland (2002) demonstrates that gap magnitude is confounded by the location of the cut score. With two normal distributions, for example, PAC-based gaps are maximized when the cut score is at the midpoint between the modes of the two distributions. In Table 1, this explains the greater magnitudes of the gaps in Row 3 over Rows 4 and 5, as the NAEP Basic cut score is generally more central than the higher Proficient and extremely high Advanced cut scores. As a corollary, when the scores of both groups are increasing, gap trends based on low cut scores will show gap closure, and gap trends based on high cut scores will show gap increases. Although it is tempting to interpret the gap trends in Rows 3, 4, and 5 as a story about gap trends increasing for high-scoring students but decreasing for low-scoring students, there is an alternative interpretation: proportionately more white students than black students crossed the Advanced NAEP cut score simply because more of them were near the cut score initially. The inverse normal transformation of the percentages-above-cuts helps to address the predictable relationship between estimated gaps and the locations of cut scores, but it rests on the assumption of normality and equal variance, which is rarely satisfied in practice (Ho, 2008, 2009). As Table 1 illustrates, when this assumption is not met, estimated gaps and gap trends may be highly sensitive to the location of the cut score used to estimate them. Taken together, these shortcomings raise serious concerns about the gap and gap trend metrics in Table 1. The first two rows assume not only equal-interval scale properties but also, for gap trends, the maintenance of equal-interval properties over time. The gaps and gap trends in the next three rows confound the movement of score distributions with the density of students adjacent to the cut score. Unadjusted and Δ statistics will either appear contradictory across cut scores or encourage inaccurate inferences about gaps and gap trends at different ESTIMATING ACHIEVEMENT GAPS 8 regions of the distribution. The gaps and trends in the last three rows illustrate the consequences of a violation of the strict normal, equal variance distributional assumption. These shortcomings motivate an alternative approach to achievement gap reporting. An Ordinal Framework for Gap Trend Reporting The literature on ordinal distributional comparisons contains attractive alternatives to traditional gap metrics. When the scale-dependence of gap statistics is a concern, gaps can be derived from transformation-invariant representations like the probability-probability (PP) plot (Ho, 2009; Livingston, 2006; Wilk & Gnanadesikan, 1968). The PP plot is best described by considering the two Cumulative Distribution Functions (CDFs), the proportions of students ( and ) at or below a given score and in groups , which return and , respectively. The left panel of Figure 1 shows the normal CDFs of Alabama’s White and Black test score distributions on the 2005 Grade 4 NAEP Reading test as an example. These are labeled generically as a lower-scoring focal distribution, distribution, , and the higher scoring reference . The vertical axis expresses the proportion at or below a given NAEP scale score . The left panel of Figure 1 shows that, for the basic cut score of 208, 33% of the reference group is at or below Basic, whereas 69% of the focal group is at or below Basic. Due to the construction of the PP plot from paired cumulative proportions, all statistics generated from a PP plot are transformation-invariant. One useful statistic is the area under the PP curve, which is equal to Pr , denoted for short: the probability that a randomly drawn group a student has a greater score than a randomly drawn group b student. This statistic has a substantial background in the nonparametric and ordinal statistics literature (e.g., Cliff, 1993; McGraw & Wong, 1992; Vargha & Delaney, 2000). In signal detection theory and medical testing, a mathematically equivalent expression is known as the Area Under the Curve ESTIMATING ACHIEVEMENT GAPS 9 (AUC) of the Receiver Operating Characteristic (ROC) Curve, where the ROC curve is a PP plot with a particular interpretation. In this literature, the two distributions are usually those of healthy and sick populations along some test criterion, and the interpretation of AUC is as a summary measure of the diagnostic capability of the criterion (Swets & Pickett, 1982). The use of these approaches for expressing achievement gaps in education is fairly limited (exceptions include Livingston, 2006; Neal, 2006; Reardon, 2008b). Although the interpretation of may be appealing, a Cohen-like effect size that avoids the proportion metric allows for better equal-interval properties and interpretation in terms of standard deviation units. For this purpose, Ho and Haertel (2006) and Ho (2009) propose the statistic, a nonlinear monotonic transformation of √2Φ The : . statistic has several useful properties. First, (5) is equal to the Cohen effect size when the two test score distributions are normal, even if they have unequal variances. 2 However, even in these circumstances, Cohen’s will not. The implicit condition equating to will vary under scale transformations whereas is respective normality (Ho, 2009). That is, the two distributions need not be normal in the metric in which they are observed, but there must be a common transformation of that metric that would render both distributions normal. This is a more flexible assumption than assuming distributions are normal on their extant common scale. It is, in fact, departures from respective equivariant normality, not just 2 The statistic arises from this relationship between the parameters of two normal distributions and : the area under the PP curve for the two normal distributions. When both distributions are normal with mean and variance , , and , the relationship follows (Downton, 1973): parameters , Φ . √2 Solving for yields . Equivalent expressions to have proposed in the ROC literature (e.g., Simpson and Fitter, 1973), where it is commonly known as . However, in the context of medical tests, AUC-type measures are most commonly used (Pepe, 2003). ESTIMATING ACHIEVEMENT GAPS 10 equivariant normality, that lead to disagreements between gaps estimated from different cut scores. In general, distributional assumptions in an ordinal framework are best described as respective or transformation-inducible. In the ROC literature, where the concern is sensitivity and specificity of diagnostic tests, the transformation-inducible normality assumption has been described as “binormal” (Swets & Pickett, 1982). In the context of gaps between test score distributions, we retain the descriptor “respective normal” to make more explicit that the distributions need not be normal and need not be restricted to two in number. The statistic can be understood as the difference in mean test scores between two groups, both with normal test score distributions, that would correspond to a PP plot with an area under the curve of . As shown in Equation 5, can be computed directly from the area under the PP curve. It is thus broadly interpretable as a transformation-invariant analogue of Cohen’s even when distributions are not respectively normal. When the full CDFs are known for both groups, the calculation of nonparametric gap statistics like and follows in straightforward fashion from the PP plot. When only censored, PAC-type data are available, however, these statistics cannot be calculated exactly. The single-cut-score statistics and are estimable, but, as Table 1 shows, they can vary widely across alternative cut scores. The next section describes the use of PAC data as observed points to estimate a PP curve. These curves allow for nonparametric gap estimates from ordered categorical data alone. Estimating Ordinal Gaps from Censored Data In order to estimate the gap measure using censored data, we apply the PP framework described in Figure 1. Extending previous notation, we begin with an interest in the gap between two groups, and . The th student score in group is given by . Suppose there are 1 ESTIMATING ACHIEVEMENT GAPS 11 distinct achievement levels common to both groups. These are delineated by , so that a student from group for 2… cumulative proportion of students in group achieves Level 1 if 1 if ; and Level cut scores ; Level . The CDF if returns the at or below cut score , denoted . Note that these proportions are simply the complements of the PAC statistics described above: 1 1 . If we had the full data from the test score distributions (that is, if we knew and ), we would be able to compute any gap measure we like. We could plot the full PP curve (Figure 1) describing the proportion of members of group with scores below given percentiles of group : . The problem arises when we do not know or (6) but instead have access only to the and proportions of each group above cut scores. That is, we know only course, the associated , because 1 (and, of ) for some small number of cut scores . Usefully, the representation of the PP plot, , allows for the possibility of an estimate of the PP plot, , from the PAC data. In fact, the points, , , , ,…, , , fall on the curve described by , by definition. The points 0,0 and 1,1 can be added given the logic that some score exists below all observed score points, and some score exists above all observed score points. Figure 2 shows these 2 points for the previously used example of Alabama Grade 4 NAEP Reading in 2005. The point defined by the NAEP Basic cut score is highlighted, where 33% of reference group is below Basic and 69% of the focal group is below Basic. The other two points are defined by cumulative proportions for the Proficient and Advanced cut scores respectively. ESTIMATING ACHIEVEMENT GAPS Our strategy will be to use these 12 2 points to estimate the function square. If these points provide enough information to estimate reliable estimates of within the unit reliably, then we can obtain , as the area under , and reliable estimates of from . We √2Φ denote this version of , estimated from censored data alone, as . The √2Φ contrasting target statistic, estimated from the full distributions, is . In the next section, we describe nine candidate methods that attempt to minimize the distance between and : obtaining usable gap statistics from censored data alone. The criteria for evaluation of these methods have both theoretical and statistical motivations. First, symmetry is a desirable property. Logically, the distance between groups and should be the same, whether the expression is “group over group ” or “group group .” Under symmetry, the following expression will hold: corollary, following Equation 5, a a statistic calculated using statistic calculated using 1 under . As a will have the opposite sign of . Second, the function, , should be monotonically nondecreasing on the unit interval, following the theoretical restrictions on PP curves. Third, the estimate of should be unbiased over a range of realistic situations: the average distance should be zero. Finally, the magnitude of the distance between the estimate and the target should be as small as possible over a range of realistic situations. This will be evaluated using the root mean square deviation (RMSD) between and . The nine candidate methods follow. Piecewise Linear Interpolation (PLI) A graphically simple approach is to fit a linear spline function to the essentially “connecting the dots” to estimate . Computing , the integral of interval, is then a straightforward sum of areas of rectangles and triangles: 2 points, over the unit ESTIMATING ACHIEVEMENT GAPS 13 1 2 · 0 and where , (7) 1. The PLI approach is also notable because of its equivalence to the so-called midrank convention (Conover, 1973), a conventional nonparametric approach to adjusting when a pair of full distributions has ties. Ties result in unconnected PP points on a PP plot: the same problem addressed by this paper. The midrank convention adjusts as ; this is equivalent to Equation 7 if the censored follows: distributions are treated as the full distributions of interest. Although this method has the advantage of being relatively simple, the linear spline function is unlikely to accurately describe the underlying distributional shape. In particular, the integral will be biased toward 0.5 if the true function has a regular shape—concave up everywhere or concave down everywhere—because the linear spline will tend to truncate portions of the area between and the 45-degree line. Polynomial Fitted Curve (PFC) A second straightforward approach is to fit a the 2 points: ∑ -order polynomial (where 1) to . To obtain the area under the curve, the appropriate integral over the unit interval follows: 1 . When the number of points is small, as in the common testing scenario where (8) 3, one simple approach is to fit a cubic polynomial using ordinary least-squares. Symmetry can be achieved by refitting the function after reversing and , then averaging as follows, ESTIMATING ACHIEVEMENT GAPS 14 Φ Φ √2 . A variation of this approach incorporates the known variance of the cumulative proportions, , into a weighted least-squares regression. The equation can be further constrained by ensuring that the function passes through (0,0) and (1,1), points that should exist on a PP curve regardless of sampling or measurement error. Algebraic reconfiguration of the regression equations under these constraints allows for estimation of regression parameters through standard statistical software programs. We evaluate both free and weighted-constrained polynomial fitted curves and designate them PFCf and PFCwc respectively. One anticipated drawback of the PFC approaches is that theoretically impossible PP curves can result, including those with negative slopes and those that break the vertical boundaries of the unit square. Monotone Cubic Interpolation (MCI) A third approach that combines some of the advantages of the PLI and PFC approaches fits a piecewise cubic spline through the data using the Fritsch-Carlson (1980) method. The Fritsch-Carlson method guarantees a function that is monotonic, differentiable everywhere, and passes through each data point. For the purpose of fitting PP curves, this affords three primary advantages. First, the estimated curve, , passes through each of the K+2 points, like PLI but unlike PFC. Second, the function is monotonic, resolving the problem of negative slopes and unbounded PP curves that can arise under the PFC approach. Third, the curve is smooth everywhere on the unit interval, potentially resolving the bias that may arise with PLI. Given the MCI-fitted curve and coefficients returned by the Fritsch-Carlson algorithm, we can compute: (9) ESTIMATING ACHIEVEMENT GAPS 15 A drawback of the MCI approach is asymmetry. Like the PLI method, MCI will return asymmetrical gaps when groups and are switched on the axes. As with the PFC approaches, we resolve this by averaging. Fitted Curve to Transformed Data An alternative to fitting the PP points directly is to transform the two axes and fit the transformed data points. We can then transform the fitted line back into the original metric and integrate in order to compute . If the transformation results in a more familiar or easily estimable functional relationship between the variables, such as a line, then we can obtain more accurate estimates of and . We call this method by the mnemonic acronym TFIT (Transform-Fit-Inverse Transform). In particular, we investigate two potential transformations: one using the probit (inverse-normal) function that is denoted PTFIT and one using the logit function that is denoted LTFIT. Both functions are monotonic mappings of the domain 0,1 to the range ∞, ∞ . Due to the infinite mappings of (0,0) and (1,1), we exclude these two theoretical points and apply TFIT only to the observed data points. Probit transform-fit-inverse transform (PTFIT). By transforming each and transformed points. That is, we can fit a Φ where using the probit function, we can fit a curve to the -order polynomial of the form Φ , . Moreover, should be odd such that the fitted curve goes toward (10) ∞, ∞ and ∞, ∞ . Such a curve will approach 0,0 and 1,1 when inverse-transformed back to PP space. When 3, as is standard in NAEP and common in many state accountability systems, the ESTIMATING ACHIEVEMENT GAPS 16 linear fit is the only option. This estimated line can be transformed back into PP space and evaluated numerically as the following integral: Φ . Φ Symmetry may be obtained by fitting a principal axis regression line and obtaining and . However, preliminary results showed marked improvement with a weighted least squares approach. Each PP point may be weighted by the inverse of the variance in the transformed space. When plotting group on group , as in a typical PP plot, an estimate of the standard error of each point in the transformed space is given by the delta method: ̂ Here, ̂ 1 is the normal density function, Φ ̂ / Φ ̂ (11) . is the probit function, and 1/ Φ ̂ is the slope of the probit function at ̂ . Fitting Equation 10 while weighting each point by the inverse of the square of Equation 11, we obtain a weighted least squares estimate of the slope and intercept. Due to the asymmetry of the approach, we can achieve an average by repeating the process and plotting group appropriate estimate of , and on group . A geometric average of the slopes provides the can be estimated knowing that the line must pass through the intersections of each individual line. These may be transformed back into PP space for numeric integration as described before. When PTFIT is constrained to a linear fit, a more theoretically appealing estimate of follows from the fact that a line in probit-transformed (normal-normal) space defines two respectively normal distributions. The averaged and can be then used to obtain directly. It is relatively straightforward to show that, if two distributions and are respectively ESTIMATING ACHIEVEMENT GAPS 17 normal and can be transformed to have normal parameters transformed PP plot will be a line with slope Thus, we can express as a function of , , and intercept , and , the probit(e.g., Pepe, 2003). and in a quasi-Cohen expression: ̂ 1 2 2 . (12) Fitting a line through the probit-transformed PP points therefore implicitly assumes that the two distributions are respectively normal. With enough cut scores (at least 4), one could fit a higher-order odd polynomial through the PP points. Such a procedure would not imply respective normality, and numerical integration procedures would be required. Logit transform-fit-inverse transform (LTFIT). A similar procedure uses the logit function in place of the probit function. The logit and probit functions are very similar, but the logistic function has thicker tails and might be more stable when fitting extreme proportions, particularly when cut scores are very high or low for certain groups. In this case, let a standard logistic function be Λ logistic, the logit, follows: Λ , and the inverse- . Following an analogous estimate of variance as that shown in Equation 11, we can transform each proportion using the logit function, then weight each point in a linear regression by the following factor: Λ ̂ 1 Here, ̂ ̂ (13) . is a standard logistic density function given by . The geometric average slope and appropriate intercept are calculated as before, then numerical ESTIMATING ACHIEVEMENT GAPS integration obtains computation of 18 after the line is transformed back into PP space. Note that no direct is available as it was in the linearly fit PTFIT case. Average Normal Shift (ANS) The Normal Shift (NS) approach was introduced by Furgol, Ho, and Zimmerman (2010) as a method of estimating from censored data. The authors adapt a maximum-likelihood- based algorithm from Wolynetz (1979) that estimates a mean and variance from censored data with known cut scores assuming an underlying normal distribution. With cut scores for state tests, cut scores are either unavailable or lack strong equal-interval properties for meaningful use. Therefore, the authors established cut scores by assuming the reference distribution, standard normal, Φ, leading to cut scores defined by Φ for scores anchor the cumulative proportions for the focal group, the mean and variance, ̂ and , was 1 … . These cut , and these are used to estimate , via the Wolynetz algorithm. Given the assumed standard normal parameters of the reference distribution, the appropriate effect size estimate is simply ̂ 1 /2 . (14) A weakness of the NS model is that, like the PFC and MCI approaches, gap estimates are not symmetric under the choice of reference group. We resolve this by averaging with its negative when the groups are reversed, and we contrast this approach with the Furgol, Ho, and Zimmerman approach by describing this as the Average Normal Shift (ANS) approach. Both approaches assume respective normality but allow for variances to differ across the groups. It is similar to the linear PTFIT approach in its assumptions but uses a maximum likelihood approach on the CDFs instead of a weighted regression on transformed cumulative percentages. Receiver Operating Characteristic Fit (ROCFIT) ESTIMATING ACHIEVEMENT GAPS 19 A previous section described the interpretation of a PP plot as a ROC curve in signal detection theory. Within this literature, maximum likelihood estimates of the parameters for the ROC curve have been developed by Dorfman and Alf (1969) under the binormal or respectively normal assumption. We use the algorithm as implemented in the Stata command, “rocfit”; it is also available in the R package “pROC.” In Stata, the area under the curve, or as the scalar e(area), and the statistic, , is stored , is stored as the scalar, e(da). The ROCFIT approach can be considered a more formal version of the linearly constrained PTFIT. It fits the probit-transformed PP points in normal-normal space using a maximum likelihood approach. It enjoys the property of symmetry. A similar maximum likelihood extension of linearly constrained LTFIT in the ROC context was proposed by Ogilvie and Creelman (1968); we do not evaluate this bilogistic approach due to the lack of available software and the lack of promise of results to be shown. Average Difference in Transformed Percents-Above-Cut (ADTPAC) , an achievement gap measure that expresses the A previous section described difference between groups and by taking the difference of probit-transformed PACs. Table 1 showed that this measure of the gap may differ across the cut scores due to departures from respective normality. If the two test score distributions are normal with equal standard deviations, however, this measure will be the same regardless of which score is used as the cut score. Moreover, it will be equal to Cohen’s and, in this special case, to . Given that deviations from normality are expected, a simple method for obtaining a gap estimate is to average across the estimates of . We use an improved approach that takes advantage of the same weighting principles introduced for PFCwc and the two TFIT approaches. Equation 11 can be used to obtain an ESTIMATING ACHIEVEMENT GAPS 20 ̂ approximation of the variance of the difference in transformed PACs: ̂ . The inverse of these variances can be used as weights, , to obtain a weighted average difference of transformed PACs: . Here, and the calculation of each ∑ (15) . Due to the inverse normal transformations in , this approach implicitly assumes that the two distributions are respectively normal with equal standard deviations. The variation in the assumed to be sampling variation. The average is thus an estimate of gaps over is obtained without directly estimating the PP curve, . Table 2 summarizes the nine proposed methods of estimating from the observed censored data. Note that most of the methods guarantee monotonicity, and few of them are inherently symmetric. For the asymmetric methods, we find some approach to taking an average of gaps estimated both ways in order to avoid the arbitrariness of the choice. Due to the ordinal framework, the implied distributional assumptions are not traditional but respective. The respective normal assumption implies that some shared transformation can render both distributions normal. Likewise, the respective logistic assumption implies that some shared transformation can shape both distributions to the derivative of the logistic function: the shape of a Hubbert curve. The respective distributional assumptions for the TFIT approaches apply only for linear models in the transformed space: when 1. The TFIT and PFC approaches are thus a much larger family of approaches when greater numbers of cut scores allow for higher-order polynomial fits. Evaluating Approaches to Ordinal Gap Estimation ESTIMATING ACHIEVEMENT GAPS 21 This section uses simulated and real data to compare approaches as they attempt to recover full-distribution gap estimates, , using censored data alone. As Table 2 describes, there are strong a priori reasons to discount seemingly straightforward approaches, such as the anticipated bias of the PLI approach and the possibility of unbounded or decreasing functions with PFC approaches. The first subsection compares the performance of different approaches across simulated scenarios. The second subsection compares recovery of in the real data context of NAEP White-Black achievement gaps in 2003, 2005, and 2007. Recovery of in Controlled Scenarios The initial goal is to identify the best gap estimation approaches in controlled scenarios where respective distributional shapes are known. This section presents three simulation scenarios: an equivariant normal scenario, an unequal-variance normal scenario, and a skewed scenario using lognormal distributions. For each of these, we 1) draw two samples from generating distributions with known parameters, 2) record using these two full samples, 3) define a set of centered, plausible cut scores, 4) apply these cut scores to the two samples to obtain cumulative proportions and PACs, 5) apply each approach in Table 2 to these cumulative proportions to obtain values, and 6) repeat this many times to evaluate bias and variance under sampling. We add the gap between the generating distributions as an additional factor to understand how the magnitudes of bias and variance vary for gaps between 0 and 1.5 standard deviation units in size. For these scenarios, we draw 2000 students for the reference group and 500 students for the focal group , approximating the median sample sizes used for state NAEP. Following the NAEP design and the designs of many state testing programs, we censor the data using three cut scores. Larger numbers of cut scores will increase the similarity between the censored and ESTIMATING ACHIEVEMENT GAPS 22 full distributions and dampen the differences between estimation approaches. To establish generic cut score locations, we use a symmetric approach with respect to both distributions: The three cut scores result in unweighted cross-group averages of PACs as follows: 80% above Basic, 50% above Proficient, and 20% Advanced (cumulative proportions of 0.2, 0.5, and 0.8). The cut scores are obtained through an approach akin to mixture modeling that results in the appropriate PACs (or cumulative proportions) for the mixture of both distributions. For example, when two normal distributions with unit variance are centered on 0 and 1 respectively, the cut scores -0.45, 0.50, and 1.45 result in 20%, 50%, and 80% PACs for the unweighted mixture of the two distributions. This approach also has the advantage of keeping the amount of cut-score information constant—in the sense that the combined cumulative proportions are always the same—even as the gap between distributions shifts from 0 to 1.5. Figure 3 shows the generating distributions mapped into PP space along with the PP points that would be generated in the population. These cut scores censor the distributions, leaving three pairs of cumulative proportions. Following the example in the previous paragraph, for a Basic cut score of -0.45, the normal CDF shows that 32.6% of a low-scoring, scoring, 0,1 distribution scores below Basic, and 7.4% of a high- 1,1 distribution scores below Basic. Note that these proportions average to 0.20, as expected by the cut score approach detailed in the previous paragraph. Clearly, these percentages may change in any given sample due to sampling. The sampled PP point will thus vary around the point .074, .326 on a PP plot; this point can be found in Figure 3. The other two cut scores define two more PP points, and these, with the two theoretical points at the origin and 1, 1 where appropriate, are the data that the methods in the previous section use to obtain . The second, unequal variance scenario increases the variance of the ESTIMATING ACHIEVEMENT GAPS 23 generating distribution for the low-scoring group by 50%, and the third, skewed scenario uses the lognormal distribution to impart respective positive skew. These are also shown in Figure 3 and are described in greater detail in the next subsections. The criteria for the recovery of are bias, the average of over all replications, and the root mean squared deviation (RMSD), the square root of the average of over all replications. We use 5000 replications for each distance between generating distributions, drawing 2000 for the reference group and 500 for the focal group for every replication. The distance between the generating distributions is varied between 0 and 1.5 at intervals of .02. This allows comparison of approaches across a range of plausible gap magnitudes and across distributional scenarios likely to arise in practice. Note that appropriate criterion over , because is the is a transformation-dependent statistic that cannot be fully specified within an ordinal framework. Although happens to be equal to in the two normal scenarios that follow, this does not change the fact that a transformation can distort but not . The normal, equivariant scenario. The most straightforward model for test scores is the normal model, and the equalvariance assumption is an appropriate baseline assumption in the absence of other information. Figure 3 displays the population PP curves that result from the normal, equivariant model when the mean difference is 0, 0.5, 1.0, and 1.5 standard deviation units. The figure displays the population normal, equivariant PP curves as black, solid lines above the diagonal, and the “observed” points from the cut score algorithm in the population are also shown. As expected, the curves bulge from the diagonal as the mean difference increases. The observed points, however, stay on a line with slope -1, as expected from the cut score algorithm that keeps the ESTIMATING ACHIEVEMENT GAPS 24 cumulative proportion of the mixture of distributions constant over mean differences. The goal of each of the nine proposed approaches is to approximate the full curve using the five observed points alone. The top half of Figure 4 shows the bias—the average of over 5000 replications—over the range of mean differences and for each approach. When the mean difference is zero, the PP curve is the line , and all nine approaches estimate this easily. As the mean difference increases, the PLI approach is the most biased, underestimating the full gap by almost 10%. This is not surprising given that the linear approach truncates area under any concave curve. The PFCf approach also underestimates , and its weighted, constrained counterpart, PFCwc, is positively biased. The MCI and LTFIT approaches show slight negative bias when gaps are very large, and PTFIT, ANS, ROCFIT, and ADTPAC perform very well in a scenario that matches their assumptions perfectly. The bottom half of Figure 4 shows the RMSD, and there is a clear distinction between the PLI/PFC approaches and the others. The weighted, constrained PFC approach can be seen to improve upon the free PFC approach when gaps are small. However, when gaps are large, one or both groups may have cumulative proportions close to 0 or 1. In these cases, the PFCwc approach places heavy weight on these extreme points while also forcing the curve to go through the origin and 1, 1 . It is clear that this results in poor performance. Although there is a substantive and statistical justification to weight points and force the curve through the origin and 1, 1 , there appears to be little advantage to doing so. The normal, unequal-variance scenario. Returning back to Figure 3, the unequal variance scenario can be seen mapped into PP space and reversed over the diagonal to unclutter the figure. The generating distribution for the ESTIMATING ACHIEVEMENT GAPS 25 low-scoring focal group has a variance of 1.5 compared to the unit variance of the reference group’s generating distribution. This difference in variance, equivalent to increasing the standard deviation by 22.5%, represents a fairly high variance difference in practice, but differences in observed variances are not uncommon. For example, the average absolute WhiteBlack variance ratio for 2009 NAEP was 1.15 across 172 state-subject-grade combinations, and 4 state-subject-grade combinations exceeded an absolute ratio of 1.5. As expected of the cutscore-selection algorithm, comparing the “observed” PP points across the population PP curves reveals alignment on a line with slope of -1. The top half of Figure 5 shows the bias plotted on the standardized mean difference as defined by in the population. The results are very similar to Figure 4 in spite of the notable variance differences. The PLI approach remains negatively biased to a similar degree, and the weighted, constrained PFC approach begins performing more poorly than its free counterpart regardless of gap size. The ROCFIT, PTFIT, and ANS account for variance differences explicitly and therefore perform without bias. A notable difference from Figure 4 is that ADTPAC begins to show negative bias when gaps are large. This is a reminder that ADTPAC assumes respective normality with equal variances, and its performance worsens when this assumption is not met. The bottom half of Figure 5 shows the RMSD for the normal, unequalvariance scenario. It is worth noting that the RMSD for recovery by the six best approaches hovers around .015, a fairly small amount of variability for the estimation of gaps when the only three paired cumulative proportions are available. The lognormal, skewed scenario. To challenge the assumptions of respective normal approaches like ANS, ROCFIT, and PTFIT, which account for respective normality and unequal variances, we use respectively ESTIMATING ACHIEVEMENT GAPS 26 skewed lognormal distributions. We define a random variable whose log is distributed 0,0.3 . Such a distribution has mean 1.05, a standard deviation of 0.32, and positive skew of 0.95. To generate a gap, we shift one distribution above another such that varies from 0 to 1.5 standard deviation units. Unlike the previous two scenarios, this is not equivalent to shifting from 0 to 1.5, as only when distributions are normal. Cut scores are generated as before. These PP curves are also plotted back in Figure 3. A useful conceptual point is that two respectively lognormal distributions are not equivalent to two shifted normal distributions on the same scale that are transformed by the exponential function. This latter construction is ordinally equivalent to the respective normal distributions presented in the first scenario. Respectively lognormal distributions cannot be transformed to normal with a single transformation unless their CDFs completely overlap. The top half of Figure 6 shows the bias plotted on the metric as before. The performances of the PFCwc and ADTPAC approaches are notably worse. It is clear at this point that the PLI and PFC approaches are flawed under even the most typical scenarios; they will not be considered further. Taking Figures 5 and 6 together, the poor performance of ADTPAC under variance differences and skewness indicates its inability to adequately estimate the full distribution through weighted averaging of transformed PACs. In contrast, although ANS, ROCFIT, and PTFIT show biased recovery of , their bias in this case is very small at around .004 for the largest gaps. In this scenario, the ANS approach outperforms ROCFIT and PTFIT by very small amounts. The LTFIT and MCI approaches have a larger amount of negative bias at around -.015; this is still less than 1% of the largest gaps. The bottom half of Figure 6 shows the RMSD, where the decline in ADTPAC performance is also apparent. The efficiency of recovery of the five best approaches continues to ESTIMATING ACHIEVEMENT GAPS 27 hover at around .015 and increases to just over .022 when population gaps are very large. These approaches outperform seemingly attractive alternatives like PLI by a considerable margin and suggest that gap recovery is possible even when respective normal assumptions are not met. Recovery of in Real Data Scenarios This subsection assesses the performance of these approaches in real data scenarios. We use the full distributions of plausible values from NAEP state distributions, averaging over the five available sets of plausible values as recommended by Mislevy, Johnson, and Muraki (1992) to obtain and . The state distributions correspond to White and Black students in 2003, 2005, and 2007, for Reading and Mathematics in Grades 4 and 8. Out of 600 possible statesubject-grade-year combinations (50 states by 2 subjects by 2 grades by 3 years), 490 have sufficient sampling of Black students to allow for the reporting of achievement gaps. We calculate the nonparametric gap measure, , for these 490 White-Black gaps; these are the targets that the approaches attempt to recover under censored data scenarios. The criteria are bias and RMSD averaging over these 490 trials. The full distributions clearly cannot have their standardized mean differences, variances, or skew manipulated as in Figures 4-6, as these distributions are real. Their properties remain the same as those actually reported. However, the factor of cut score location can be usefully introduced into this analysis, as recovery of gaps is expected to depend on the location of cut scores in the distributions. We vary cut score location along two dimensions, the breadth of the cut scores and the stringency of the cut scores. The cut scores are indexed by the average cumulative percentages as before, except instead of fixing the average cumulative percentages at 20%, 50%, and 80%, they are varied systematically. The breadth dimension has average cumulative percentages varying from 5%, 50%, and 95% (broad cut scores) to 45%, 50%, and ESTIMATING ACHIEVEMENT GAPS 28 55% (narrow cut scores). The stringency dimension has average cumulative percentages varying from 5%, 30%, and 55% (low cut scores leading to low cumulative percentages and high PACs) to 45%, 70%, and 95% (high cut scores leading to high cumulative percentages and low PACs). Unlike NAEP reporting, where there are common cut scores for each subject-grade combination, this approach allows each pair of distributions to have its own trio of cut scores. This is done to ensure that the interpretation of “broad” or “stringent” is consistent across pairs of distributions. If a common set of cut scores were used, broad or stringent cut scores for one pair of distributions would be less broad or stringent for another. Note also that some approximation of the results from the actual NAEP cut scores is located reasonably high along the stringency dimension, where the unweighted average cumulative proportions between White and Black students approach 45%, 70%, and 95% (55% basic and above, 30% proficient and above, 5% advanced) across state-subject-grade combinations. Recovery of gaps depending on cut score breadth. The top half of Figure 7 shows the bias of the 6 best-performing metrics in their recovery of real-data gaps across broad and narrow cut scores. As noted previously, the PLI and PFC approaches perform poorly in common scenarios and are not considered further. The top half of Figure 7 shows that the overall bias of these six candidate methods can be very low. The LTFIT and MCI approaches do not perform well when cut scores are narrow. However, the four best metrics, ADTPAC, PTFIT, ANS, and ROCFIT have bias less than .02. Focusing on these four approaches, the ADTPAC approach performs relatively poorly, and PTFIT does not perform as well as ANS or ROCFIT particularly when cut scores are broadly spaced. The lowest bias across all methods occurs close to cumulative proportions of 20%, 50%, and 80%. This suggests that the bias scenarios in Figures 4-6 are optimistic. However, for ANS and ROCFIT in particular, ESTIMATING ACHIEVEMENT GAPS 29 the bias ranges from -.007 to +.013, a very small bias given that the median White-Black NAEP gaps are generally about .75 standard deviations in size. The bottom half of Figure 7 shows the RMSD across cut score breadth. As before, the MCI and LTFIT approaches perform poorly when cut scores are more narrowly spaced. Within the top four approaches, the ADTPAC and PTFIT approaches perform relatively poorly when cut scores are extreme. The poor ADTPAC performance is consistent with the findings in Figures 5 and 6. The performance of ANS and ROCFIT are indistinguishable along the RMSD criterion. The overall efficiency when cut scores are neither broad nor narrow is quite good, with RMSDs bottoming out at around .009, a small percentage of White-Black gaps in practice. Recovery of gaps depending on cut score stringency. The top half of Figure 8 shows the bias in recovery across cut score stringency. The symmetry of these curves suggests that methods perform best when cut scores are central with respect to the unweighted mixture of both distributions. The MCI and LTFIT approaches continue to perform worse than their counterparts, with negative bias. The absolute bias of the PTFIT, ANS, and ROCFIT approaches are similar, and ADTPAC bias is negative when cut scores are low. The bottom half of Figure 8 shows the RMSD of the approaches and results in similar conclusions. The LTFIT and MCI approaches perform relatively poorly. The ANS and ROCFIT approaches perform the best, with slightly better efficiency than PTFIT. ADTPAC does not perform as well outside of the region where it happens to show no bias. Focusing on the righthand portion of the graph, where cut scores are closer to their real-world NAEP counterparts, the RMSDs are between .025 and .029. This may still be considered surprisingly low given how little information exists about the lower half of the respective distributions. When the basic cut ESTIMATING ACHIEVEMENT GAPS 30 score is lower, as it often is in practice, Figure 7 suggests that performance will improve. Further, because state cut scores are usually lower or much lower than NAEP cut scores, the RMSDs are likely to be closer to those seen towards the center of Figure 8. Discussion These results suggest three promising candidates for the estimation of gaps under censored data scenarios. The two best approaches are ROCFIT, implemented by Stata in a command motivated by signal detection theory, and ANS, a simple adaptation of a maximum likelihood estimation procedure developed by Furgol, Ho, and Zimmerman (2010). Both result in very small amounts of bias and RMSD across a range of simulated and real-data scenarios. The ROCFIT approach is symmetrical and estimates a PP curve directly, a comparative theoretical advantage over ANS, which is asymmetrical and estimates normal CDFs. In addition, the ANS implementation in R does not have documentation and is not widely available. Both packages also allow for the estimation of standard errors; the ROC approaches to standard error estimation are reviewed by Pepe (2003). For those who do not have access to ROCFIT approaches in Stata, the PTFIT approach is intuitive, easy to implement with standard routines in statistical packages, and shows little loss in performance across scenarios. There may be greater possibilities for PTFIT when more cut scores are available, and higher-order polynomials can be fit to data on the probit-transformed axes. The magnitudes of the bias and RMSD for all three of these methods are rarely over .02 and are usually much less, an impressive result under the real-data and lognormal scenarios, where the respective normal assumption is threatened or violated outright. These results suggest that the estimation approach is robust to deviations from respective normality across a range of cut score locations. ESTIMATING ACHIEVEMENT GAPS 31 The applicability of these approaches extends beyond gap estimation for censored state testing data. Tests reported on score scales with few ordinal categories, such as Advanced Placement exams, which report scores on a 1-5 integer scale, and some exams for English Learners are also natural applications for these gap estimation approaches. In these cases, the data are treated as censored even if the grain-size of the data is the finest available. The argument in favor of the use of this framework is that some continuous scale underlies the observed scale. Similarly, a case where ceiling or floor effects compress a theoretically distinguishable score range into a single undifferentiated score point—this is a censored data problem. These are cases where an ANS-, PTFIT-, or ROCFIT-estimated statistic may in fact be preferable to effect sizes calculated from means and standard deviations on the established score scale. A small number of technical issues remain. The effects of sample size, sample size ratio across groups, and the overall number of cut scores are of interest. We do not spend time on them here because the findings will be straightforward: more is better. Increasing cut scores and sample size beyond the levels here will also mute the differences between methods that were our primary interest. The adequate recovery of when there are only three cut scores suggests that a higher benchmark for the minimum number of cut scores is not necessary. When sample sizes are smaller, cut scores are extreme, group differences are large, or some combination of these instances, there is an increased likelihood that the highest or lowest score category will have no student representation from one group or another. In these situations, a number of the methods proposed here will fail, including PTFIT and ANS, which would both attempt to take an inverse-normal transformation of 0 or 1. A simple correction involves adding a student or a fraction of a student to the highest or lowest score bin. ESTIMATING ACHIEVEMENT GAPS 32 Measurement error is known to attenuate Cohen-type effect sizes by inflating standard deviations. The same issues arise in PP plots, as measurement error will attenuate PP curves toward the main diagonal. The NAEP examples are adjusted for measurement error through the plausible values methodology (Mislevy, Johnson, & Muraki, 1992), however, gap comparison across tests, times, or groups with different degrees of measurement error must acknowledge or adjust for attenuation. An ad hoc disattenuation approach treats statistics like their counterparts and divides by the square root of the reliability estimate; this is discussed briefly in Ho (2009). Finally, it may seem straightforward to extend these analyses from gaps to trends. After all, any two distributions on the same scale can be expressed as a PP plot; it may not seem to matter whether they are Groups and or Times 1 and 2. However, we recommend caution in using these methods for descriptive trend analyses for a few reasons. First, if cross-sectional, within-grade trends are the target of inference, these are much smaller in magnitude, and the degree of bias and variance reported here will have a greater impact on substantive interpretations. Second, trends rely on the year-to-year linking of score scales, a source of error that this ordinal framework does not currently address. Third, discrete score scales are typically aligned imperfectly from year to year due to linking functions. A score bin in one year may not exist in the next, or two bins may be combined that once were separate. This has little effect on the means of the full distributions, but a small shift in the scale can lead to a large percentage of students crossing one of a small number of cut scores. This is not an issue with within-year gap measures, where score points align, but this issue may severely distort PP-based trend measures. Interestingly, this latter problem with trends does not necessarily generalize to a problem with gap trends. As Ho (2009) has noted, one can express a gap trend as a “change in gap” or a ESTIMATING ACHIEVEMENT GAPS 33 “difference in changes.” These are equivalent in an average-based framework but not in an ordinal framework. A “difference in changes” formulation subjects a gap trend to not one but two opportunities for the distortions noted in the previous paragraph. However, a change-in-gap formulation, where gaps are estimated within each year and then subtracted from each other, manages to avoid the problems of year-to-year equating and misaligned score bins. This is the recommended approach to tracking gaps over time. With widespread reporting of test scores in ordinal achievement levels, researchers interested in achievement gaps are increasingly faced with censored data scenarios. This paper evaluates ordinal approaches for estimating achievement gaps using censored data alone and finds three approaches—ROCFIT, ANS, and PTFIT—whose performance justifies recommendation. These estimates represent dramatic improvement over any gap estimate derived from a single cut score. They also recover gaps well over a range of scenarios, in both an absolute sense and relative to alternative ordinal approaches. The resulting estimates are interpretable on a familiar Cohen-type metric and are transformation-invariant: particularly useful properties for gap comparisons across different tests, times, grades, and jurisdictions. ESTIMATING ACHIEVEMENT GAPS 34 References Center on Education Policy. (2007). Answering the question that matters most: Has student achievement increased since No Child Left Behind? Retrieved November 1, 2008, from http://www.cepdc.org/index.cfm?fuseaction=document.showDocumentByID&nodeID=1&DocumentID =200 Cliff, N. (1993). Dominance statistics: Ordinal analyses to answer ordinal questions. Psychological Bulletin, 114, 494-509. Conover, W. J. (1973). Rank tests for one sample, two sample, and k samples without the assumption of a continuous distribution function. The Annals of Statistics, 1, 1106-1125. Dorfman, D. D., & Alf, E. (1969). Maximum likelihood estimation of parameters of signal detection theory and determination of confidence intervals-rating method data. Journal of Mathematical Psychology, 6, 487-496. Downton, F. (1973). The estimation of Pr(Y>X) in the normal case. Technometrics, 15, 551-558. Education Week. (2010, January 14). State of the states: Sources and notes. Education Week, 29(17), 49-50. Retrieved June 1, 2010, from http://www.edweek.org/ew/articles/2010/01/14/17sources.h29.html Fritsch, F. N., & Carlson, R. E. (1980). Monotone piecewise cubic interpolation. Society for Industrial and Applied Mathematics: Journal on Numerical Analysis, 17, 238-246. Furgol, K. E., Ho, A. D., & Zimmerman, D. L. (2010). Estimating trends from censored assessment data under No Child Left Behind. Educational and Psychological Measurement, 70(5), 760-776. ESTIMATING ACHIEVEMENT GAPS 35 Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic Press. Ho, A. D. (2007). Describing the pliability of growth statistics under transformations of the vertical scale. Paper presented at the 2007 annual meeting of the National Council on Measurement in Education. Ho, A. D. (2008). The problem with "proficiency": Limitations of statistics and policy under No Child Left Behind. Educational Researcher, 37(6), 351-360. Ho, A. D. (2009). A nonparametric framework for comparing trends and gaps across tests. Journal of Educational and Behavioral Statistics, 34, 201-228. Ho, A. D., & Haertel, E. H. (2006). Metric-Free Measures of Test Score Trends and Gaps with Policy-Relevant Examples (CSE Report No. 665). Los Angeles, CA: Center for the Study of Evaluation, National Center for Research on Evaluation, Standards, and Student Testing, Graduate School of Education & Information Studies. Holland, P. (2002). Two measures of change in the gaps between the CDFs of test score distributions. Journal of Educational and Behavioral Statistics, 27, 3-17. Jencks, C., & Phillips, M. (Eds.). (1998). The Black-White Test Score Gap. Washington D.C.: Brookings Institution Press. Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: methods and practices (2nd ed.). New York: Springer-Verlag. Livingston, S. A. (2006). Double P-P plots for comparing differences between two groups. Journal of Educational and Behavioral Statistics, 31, 431-435. Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum. ESTIMATING ACHIEVEMENT GAPS 36 Magnuson, K., & Waldfogel, J. (Eds.). (2008). Steady Gains and Stalled Progress: Inequality and the Black-White Test Score Gap. New York: Russell Sage Foundation. Massachusetts Department of Education. (2002). 2001 MCAS Technical Report. Retrieved October 10, 2009, from http://www.doe.mass.edu/mcas/tech/01techrpt.pdf McGraw, K. O., & Wong, S. P. (1992). A common language effect size statistic. Psychological Bulletin, 111, 361-365. Miller, G., & McKeon, H. (2007, August 28). Miller-McKeon Discussion Draft: Amendments to Title I. Retrieved June 1, 2010, from http://republicans.edlabor.house.gov/Media/File/PDFs/Discussion_Draft_Title_I.pdf Mislevy, R. J., Johnson, E. G., & Muraki, E. (1992). Scaling procedures in NAEP. Journal of Educational Statistics, 17, 131-154. Neal, D. A. (2006). Why has Black-White skill convergence stopped? In E. A. Hanushek & F. Welch (Eds.), Handbook of the Economics of Education (Vol. 1, pp. 511-576): Elsevier. Ogilvie, J. C., & Creelman, C. D. (1968). Maximum-likelihood estimation of receiver operating characteristic curve parameters. Journal of Mathematical Psychology, 5, 377-391. Pepe, M. S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. New York: Oxford University Press. Pollack, J. M., Narajian, M., Rock, D. A., Atkins-Burnett, S., & Hausken, E. G. (2005). Early Childhood Longitudinal Study-Kindergarten Class of 1998-99 (ECLS-K), Psychometric Report for the Fifth Grade (NCES Report No. 2006-036). Washington, DC: U.S. Department of Education, National Center for Education Statistics. Reardon, S. F. (2008a). Differential Growth in the Black-White Achievement Gap During Elementary School Among Initially High- and Low-Scoring Students. Stanford, CA: ESTIMATING ACHIEVEMENT GAPS 37 Working Paper Series, Institute for Research on Educational Policy and Practice, Stanford University. Reardon, S. F. (2008b). Thirteen Ways of Looking at the Black-White Test Score Gap. Stanford, CA: Working Paper Series, Institute for Research on Educational Policy and Practice, Stanford University. Seltzer, M. H., Frank, K. A., & Bryk, A. S. (1994). The metric matters: the sensitivity of conclusions about growth in student achievement to choice of metric. Educational Evaluation and Policy Analysis, 16(1), 41-49. Simpson, A. J., & Fitter, M. J. (1973). What is the best index of detectability? Psychological Bulletin, 80, 481-488. Spencer, B. D. (1983). Test scores as social statistics: Comparing distributions. Journal of Educational Statistics, 8(4), 249-269. Swets, J. A., & Pickett, R. M. (1982). Evaluation of Diagnostic Systems: Methods from Signal Detection Theory. New York: Academic Press. Vanneman, A., Hamilton, L., Baldwin Anderson, J., & Rahman, T. (2009). Achievement Gaps: How Black and White Students in Public Schools Perform in Mathematics and Reading on the National Assessment of Educational Progress, (NCES 2009-455). National Center for Education Statistics, Institute of Education Sciences, U.S. Department of Education. Washington, DC. Retrieved June 9, 2010, from http://nces.ed.gov/nationsreportcard/pdf/studies/2009455.pdf Vargha, A., & Delaney, H. D. (2000). A critique and modification of the common language effect size measure of McGraw and Wong. Journal of Educational and Behavioral Statistics, 25, 101-132. ESTIMATING ACHIEVEMENT GAPS 38 Wilk, M. B., & Gnanadesikan, R. (1968). Probability plotting methods for the analysis of data. Biometrika, 55, 1-17. Wolynetz, M. S. (1979). Algorithm AS 138: Maximum likelihood estimation from confined and censored normal data. Applied Statistics, 28, 185-195. Zwick, R. (1992). Statistical and psychometric issues in the measurement of educational achievement trends: Examples from the National Assessment of Educational Progress. Journal of Educational Measurement, 20, 299-326. ESTIMATING ACHIEVEMENT GAPS 39 TABLE 1. White-Black gaps and gap trends on four different metrics; Alabama, Grade 4, 20052007. 2005 32.00 2007 25.59 Change, -6.41 0.88 0.74 -0.14 35.36 29.84 -5.52 23.85 25.31 +1.45 6.09 7.43 +1.34 0.92 0.79 -0.13 0.95 0.82 -0.13 1.01 0.76 -0.26 Note: Authors’ calculations from data from Alabama’s Grade 4 Reading administration of the National Assessment of Educational Progress, 2005-2007, obtained at http://nces.ed.gov/ nationsreportcard/naepdata/dataset.aspx. denotes the average-based metric; denotes Cohen’s ; denotes the percentagedenotes the transformed above-cut metric; percentage-above-cut metric. ESTIMATING ACHIEVEMENT GAPS FIGURE 1. Illustrating the construction of a Probability-Probability (PP) plot from the paired cumulative proportions of distributions. Note: Data from idealized normal distributions using parameters from Alabama’s Grade 4 Reading administration of the National Assessment of Educational Progress, 2005. 40 ESTIMATING ACHIEVEMENT GAPS 41 FIGURE 2. The 5 points on a PP plot from paired cumulative proportions defined by 3 cut scores. (0.33, 0.69) Note: Data from Alabama’s Grade 4 Reading administration of the National Assessment of Educational Progress, 2005. The PP pair for the Basic cut score is referenced. ESTIMATING ACHIEVEMENT GAPS 42 TABLE 2. Characteristics of Proposed Methods of Estimating Properties Method PLI Monotonicity Symmetry 9 Respective Distributional Assumptions 9 Notes Probable bias toward zero gap. PFCf PFCc Lack of monotonicity may lead to substantial bias and variance. Implemented by Matlab’s “pchip” spline option. MCI 9 PTFIT 9 Normal when 1 4 requires linear constraint. LTFIT 9 Logistic when 1 4 requires linear constraint. ANS 9 Normal Maximum Likelihood. Not readily available. ROCFIT 9 9 Normal Maximum Likelihood. Implemented by Stata’s “rocfit” command. ADTPAC 9 9 Normal, Equal Variance Simple to implement. ESTIMATING ACHIEVEMENT GAPS FIGURE 3. Generating distributions and the observed PP points in the population for the equivariant normal, unequal variance (reversed over the diagonal for visual clarity), and lognormal scenarios. 43 ESTIMATING ACHIEVEMENT GAPS 44 FIGURE 4. Bias and Root Mean Squared Deviation of gap estimation approaches plotted on the size of the true gap in a normal, equivariance scenario. ESTIMATING ACHIEVEMENT GAPS 45 FIGURE 5. Bias and Root Mean Squared Deviation of gap estimation approaches plotted on the size of the true gap in a normal, unequal-variance scenario. ESTIMATING ACHIEVEMENT GAPS 46 FIGURE 6. Bias and Root Mean Squared Deviation of gap estimation approaches plotted on the size of the true gap in a lognormal scenario. ESTIMATING ACHIEVEMENT GAPS 47 FIGURE 7. Bias and Root Mean Squared Deviation of gap estimation approaches for WhiteBlack state achievement gaps on the National Assessment of Educational Progress plotted on the breadth of three cut scores. ESTIMATING ACHIEVEMENT GAPS 48 FIGURE 8. Bias and Root Mean Squared Deviation of gap estimation approaches for WhiteBlack state achievement gaps on the National Assessment of Educational Progress plotted on the stringency of three cut scores.