An Alternate Estimate of AYP
A Foray into the Bayesian Approach
Kimberly S. Maier
Michigan State University
Coauthors: Tapabrata Maiti, Sarat Dass, & Chae Young Lim

Content of Today’s Talk
- Quick review of NCLB requirements and state accountability systems, with a particular focus on the reliability and validity of estimating AYP status.
- A sketch of a conventional estimator of AYP status.
- Development and description of an alternate estimator of AYP status.
- Summary of a comparative evaluation of the competing estimators.
- Brief discussion of the possible extensions to the proposed estimator.
- Examples draw on data from Michigan.

NCLB and Adequate Yearly Progress
- The reauthorization of the Elementary and Secondary Education Act of 1965, NCLB, was a major overhaul.
- Other than requiring that measures of AYP follow “established technical and professional standards” (NCLB, 2002), details for establishing and implementing the supporting accountability systems were vague.
- State education agencies were responsible for fleshing out the details of their respective accountability systems.

NCLB and Adequate Yearly Progress
- The law stipulates that the measure of Adequate Yearly Progress (AYP) include both student performance on assessments and other non-assessment academic indicators. For instance, Michigan incorporates:
  - Student performance on standardized tests
  - High school graduation rates (80%)
  - Elementary and middle school attendance rates (90%)
- Assessment measures of AYP dominate state accountability systems.
- For a school to demonstrate AYP, all students and specified student subgroups must meet all AYP targets for each component.

The Demands of NCLB
- NCLB pushed change at all levels (states, districts, schools, and classrooms) in substantial ways; for example:
  - Organizational and administrative changes to support AYP measurement and tracking
  - Given the high-stakes nature of the law, NCLB influences pedagogical and curricular choices in the classroom
- The development of accountability systems required states to address a substantial number of technical and/or methodological challenges.

Technical/Methodological Challenges of NCLB
Some examples of these technical challenges include:
- Definition and development of the accountability ‘measure’
  - Assessments for measuring student achievement
  - Fleshing out the components of the measure
  - Setting standards for components
  - Determining intermediate milestones for standards for the years prior to the 2013-14 deadline
- Development of valid and reliable assessments
  - Choice of scaling method (e.g., IRT)
  - Standard setting procedures
  - Vertical and horizontal equating across assessments
- Definition and development of procedures for “reliable and valid” determination of AYP (NCLB, 2002)
This talk will focus on the last point: the procedure for a valid and reliable determination of AYP.

To Facilitate the Present Discussion…
- Focus on the validity and reliability of the procedure for making decisions about AYP rather than the validity and reliability of the assessments.
- Determination of AYP is considered as a classification problem (‘meeting’ or ‘achieving’ AYP).
- Estimates of AYP status consider only the assessment component.
Assumptions about the assessments:
- Have adequate validity and reliability
- No differential item functioning (DIF)
- Scaling model is appropriate
- Standard setting produced acceptable standards
- Equating procedure produced scores that are comparable across forms within a subject (e.g., reading, math)

Valid and Reliable AYP Classification
- In order for the procedure of AYP status determination to be valid and reliable, AYP classification based on student assessment performance must be valid and reliable.
- Errors in classification of students based on performance were studied long before NCLB was put in place (e.g., Brennan & Kane, 1977; Livingston & Lewis, 1995; Yen, 1997).
- Large-sample theory estimators of AYP that are typically used are more accurate and precise as sample sizes increase (Yen, 1997).
  - This makes estimates of AYP for small groups difficult or impossible due to unacceptable levels of error (separate from the FERPA issue).
  - Heterogeneity of school and district sizes impacts standard errors and influences the comparative reliability of AYP decisions across student subgroups, schools, and districts.

State Solutions for Classification Error
The uncertainty associated with AYP status has been conceptualized either as a sampling-related issue or a psychometric issue:
- About 80% of states use a conventional confidence interval (CI) approach to express uncertainty (U.S. Dept. of Education, 2010).
- Far fewer states take the approach of using the standard error of measurement (SEM) that is related to the assessment.
- The CI approach produces intervals that are a function of the proportion of students who demonstrated proficiency via the assessment.
- The SEM approach produces test- and score-specific intervals for AYP status.
Here, we will focus on the CI approach.

The Challenge of Heterogeneity
For example, consider the distribution of the number of 4th graders within school districts across the state of Michigan (2010):
[Figure not reproduced.]

The Challenge of Heterogeneity
The standard deviation of student scores is also highly variable across the state:
[Figure not reproduced.]
Large differences in district sizes and their variability make conventional estimators heterogeneous, which in turn compromises the usefulness of the estimator for making inferences and drawing conclusions.

AYP Status Estimators
- In order to facilitate the development of the alternate estimator, an estimator that represents a conventional approach to estimation of AYP status will be developed first.
- Given the latitude that NCLB gave states about the procedure for determining AYP, it’s important to note that the following estimator is not one used by all states, but merely serves as a representative example.

A Conventional Estimator of AYP Status
- Consider an individual school $i$ that has $m_i$ students who have a score on a standardized assessment.
- Of those $m_i$ students, $p_i$ is the proportion who were classified as proficient; that is, the scores of these students on the assessment met or exceeded the state-specific standard that year (note that this standard is a cut-point of an assessment, distinct from the AYP proportion proficient target).
- Thus, $100p_i\%$ is the percent proficient for the $i$th school.
- The quantity $p_i$ is unknown and is estimated for school $i$ using $\hat{p}_i$, the observed proportion of students meeting the state standard (assessment cut-point).
- To make a determination of school $i$’s AYP status, $\hat{p}_i$ is compared to the target $p_0$.

A Conventional Estimator of AYP Status
A student’s score on a test can be assumed to be a random variable (recall the ideas of true score and measurement error).
Further, an estimate of the proportion of proficient students depends on all the individual students’ scores, each having measurement error.
It follows, then, that $\hat{p}_i$ also has measurement error and is a random quantity.

A Conventional Estimator of AYP Status
Now consider $k = 1, 2, \ldots, K$ subgroups within an individual school $i$, each having $m_{ik}$ students and proportion proficient $p_{ik}$.
Again, $p_{ik}$ is unobserved and estimated by $\hat{p}_{ik}$, which is compared to the target $p_{k0}$ to determine AYP status (here $p_{k0}$ is specified as a group-specific target, but this is not necessary).
Defining the AYP score as a comparative quantity, for school $i$,

$$\text{AYP score} = 100 \sum_{k=1}^{K} \frac{m_{ik}}{m_i}\left(\hat{p}_{ik} - p_{k0}\right),$$

which is compared to the threshold value of 0. School $i$ is declared to have met AYP if the score is greater than or equal to 0.

A Conventional Estimator of AYP Status

$$\text{AYP score} = 100 \sum_{k=1}^{K} \frac{m_{ik}}{m_i}\left(\hat{p}_{ik} - p_{k0}\right)$$

- The key statistic used to determine school $i$’s AYP score is the proportion of proficient students in each of the $K$ subgroups.
- The mean AYP score is obtained by replacing each of the statistics $\hat{p}_{ik}$ by its unknown true value $p_{ik}$.
- More precisely, a school $i$ has truly met AYP if the AYP score computed with the unknown true values is greater than or equal to 0.

A Conventional Estimator of AYP Status
Due to the uncertainty associated with the $\hat{p}_{ik}$s, there are two types of errors associated with the AYP status classification of a school:
- False-positive: a school is declared to have met AYP when the true mean score is below 0.
- False-negative: a school is declared not to have met AYP when the true mean score is at or above 0.
False-positives and false-negatives can be minimized when the unknown true values $p_{ik}$ are estimated accurately.

Conventional Estimation of AYP Uncertainty
The conventional confidence interval approach for quantifying uncertainty associated with an estimate of AYP status for school $i$ is a function of the standard error of the observed proportion proficient.
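
The AYP score defined above, together with the standard error that the confidence-interval approach relies on, can be sketched in Python. This is a minimal illustration, not any state's actual procedure: the function names are hypothetical, the subgroups are assumed to be disjoint and to cover the whole school (so that $m_i = \sum_k m_{ik}$), and the standard error is assumed to be the usual binomial form $\sqrt{\hat{p}(1-\hat{p})/m}$.

```python
from math import sqrt


def ayp_score(subgroups):
    """AYP score for one school.

    subgroups: list of (m_ik, n_proficient, p_k0) tuples, one per subgroup.
    Assumption: subgroups are disjoint, so m_i is the sum of the m_ik.
    """
    m_i = sum(m_ik for m_ik, _, _ in subgroups)
    total = 0.0
    for m_ik, n_proficient, p_k0 in subgroups:
        p_hat_ik = n_proficient / m_ik          # observed proportion proficient
        total += (m_ik / m_i) * (p_hat_ik - p_k0)
    return 100.0 * total


def standard_error(p_hat, m):
    """Binomial standard error of an observed proportion (assumed form)."""
    return sqrt(p_hat * (1.0 - p_hat) / m)


if __name__ == "__main__":
    # Hypothetical school: two subgroups, common target of 0.70.
    school = [(120, 90, 0.70), (30, 18, 0.70)]
    score = ayp_score(school)
    print("AYP score:", score, "met AYP:", score >= 0)
    print("SE for the small subgroup:", standard_error(18 / 30, 30))
```

In this made-up example the small subgroup falls below its target, but the school-level score is still non-negative because the score weights each subgroup by its share of students.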
Most accountability systems that use this procedure are interested in the upper bound of a confidence interval, which prioritizes the minimization of false-negatives:

$$\left[\,0,\ \hat{p}_i + z_{0.95}\,\mathrm{SE}(\hat{p}_i)\,\right],$$

where $z_{0.95}$ is the z-value such that $\Phi(z_{0.95}) = 0.95$ and $\Phi$ is the cumulative distribution function of the standard normal distribution (the CI uses the normal approximation). This approximation is valid when group sizes are large.

Conventional Estimation of AYP Uncertainty
- For small group sizes, the standard error of the proportion estimate is only a crude approximation to the true SE, producing a large bias that results in a higher upper confidence limit.
- Furthermore, if this upper confidence criterion is used to determine AYP status for $n > 1$ schools simultaneously, it will produce a significant upward bias, even in the case of large group sizes.

Conventional Estimation of AYP Uncertainty
To demonstrate this upward bias, consider the estimation of the proportion of schools in a district that meet AYP, $\theta$, with $0 \le \theta \le 1$. For an AYP target $c$,

$$\theta = \frac{1}{n}\sum_{i=1}^{n} I(p_i \ge c),$$

where $I$ is the indicator function, which takes the value 1 if $p_i \ge c$ and 0 otherwise.
The upper CI approach gives rise to an estimate of $\theta$, $\hat{\theta}_U$, given by

$$\hat{\theta}_U = \frac{1}{n}\sum_{i=1}^{n} I\left(\hat{p}_i + z_{0.95}\,\mathrm{SE}(\hat{p}_i) \ge c\right).$$

Conventional Estimation of AYP Uncertainty
- First, the estimate $\hat{\theta}_U$ has an upward bias resulting from the introduction of the term $z_{0.95}\,\mathrm{SE}(\hat{p}_i)$, which includes some schools that have true proportions less than $c$.
- Second, if group size is small, the confidence interval may also be wider due to the large variance of small subgroups, thus accentuating the upward bias.

An Alternative Estimator of AYP Status
This particular approach for estimating AYP status and its uncertainty increases the probability of false-positives.
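
A small sketch may help make the upward bias of the upper-bound criterion concrete. It compares the upper-CI estimate of the proportion of schools meeting AYP with a plain plug-in count that omits the $z_{0.95}\,\mathrm{SE}(\hat{p}_i)$ term. The function names, the plug-in comparison, the binomial form of the standard error, and the three observed proportions are assumptions made for illustration only; they are not taken from the talk.

```python
from math import sqrt
from statistics import NormalDist

# z-value with Phi(z) = 0.95 under the standard normal distribution
Z_95 = NormalDist().inv_cdf(0.95)


def theta_plugin(p_hats, c):
    """Share of schools whose observed proportion meets the target c
    (hypothetical comparison estimator without the CI adjustment)."""
    return sum(1 for p_hat in p_hats if p_hat >= c) / len(p_hats)


def theta_upper(p_hats, sizes, c, z=Z_95):
    """Share of schools whose upper confidence bound meets the target c."""
    count = 0
    for p_hat, m in zip(p_hats, sizes):
        se = sqrt(p_hat * (1.0 - p_hat) / m)   # assumed binomial SE
        if p_hat + z * se >= c:
            count += 1
    return count / len(p_hats)


if __name__ == "__main__":
    observed = [0.60, 0.65, 0.72]   # made-up observed proportions
    target = 0.70
    print("plug-in:", theta_plugin(observed, target))
    print("upper CI, m = 30:", theta_upper(observed, [30] * 3, target))
    print("upper CI, m = 200:", theta_upper(observed, [200] * 3, target))
```

Because the added term is never negative, the upper-CI count can only match or exceed the plug-in count, and the gap grows as group sizes shrink and the standard errors widen.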
A viable alternate estimator would:
- Minimize false-positives and false-negatives
- Not depend on restrictive assumptions that may not be appropriate for the data at hand (e.g., normal approximation for computation of standard errors)
- Be able to incorporate auxiliary information (student- and school-level covariates; this discussion is beyond the scope of today’s talk)

A Bayes Estimator of AYP Status
First, assume for the moment that $m_i$ is large so that $\hat{p}_i$ is approximately normally distributed with mean $p_i$ and variance $\sigma_i^2 = p_i(1 - p_i)/m_i$.
Consider the hierarchical model given by:

$$\hat{p}_i \overset{\text{ind}}{\sim} N(p_i, \sigma_i^2) \qquad (1)$$

$$\operatorname{logit}(p_i) \overset{\text{iid}}{\sim} N(\mu, \tau^2), \qquad (2)$$

for $i = 1, 2, \ldots, n$; in (1), ‘ind’ refers to independent and in (2), ‘iid’ refers to independent and identically distributed.
The logit transformation of $p_i$ is $\operatorname{logit}(p_i) = \log\dfrac{p_i}{1 - p_i}$.

A Bayes Estimator of AYP Status
Using Bayes’ theorem, the posterior distribution of each $p_i$ is determined by the model specification of (1) and (2), and is given, up to a proportionality constant, by

$$\pi(p_i \mid \hat{p}_i, \mu, \tau) \propto \pi(\hat{p}_i \mid p_i)\,\pi(p_i \mid \mu, \tau) \qquad (3)$$

where

$$\pi(\hat{p}_i \mid p_i) = \left(2\pi\sigma_i^2\right)^{-1/2} \exp\left\{-\frac{(\hat{p}_i - p_i)^2}{2\sigma_i^2}\right\} \qquad (4)$$

and

$$\pi(p_i \mid \mu, \tau) = \left(\frac{\partial \operatorname{logit}(p_i)}{\partial p_i}\right)\left(2\pi\tau^2\right)^{-1/2} \exp\left\{-\frac{(\operatorname{logit}(p_i) - \mu)^2}{2\tau^2}\right\}. \qquad (5)$$

A Bayes Estimator of AYP Status
The alternative Bayes estimator of $\theta$ is

$$\hat{\theta}_B = \frac{1}{n}\sum_{i=1}^{n} P^*(p_i \ge c),$$

where the probability $P^*$ is computed with respect to the posterior distribution of $p_i$ given $\hat{p}_i$ in (3).
Probabilistic statements about $\theta$ are now possible using this approach.

Comparison of non-Bayes and Bayes
A simulation procedure was used to compare the performance of $\hat{\theta}_B$ and $\hat{\theta}_U$. The design of the simulation study:
- 500 replicates
- AYP target value, $c = 0.70$
- Number of students in a school, $m_i \in \{30, 200\}$
- Number of schools in a district, $n \in \{10, 15\}$
- True proportion of schools meeting AYP in the district, $\theta \in \{0.5, 0.599, 0.705, 0.813\}$
- Variance for the prior distribution of $p_i$, $\tau^2 = 1$
True and observed proportions, $p_i$ and $\hat{p}_i$, were generated based on the parameter specifications.
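
As one way to evaluate the Bayes estimator under model (1) and (2), the sketch below integrates the unnormalized posterior in (3) over a grid on $(0, 1)$ and averages the resulting probabilities $P^*(p_i \ge c)$. It is an illustration under stated assumptions rather than the procedure used in the talk: the hyperparameters $\mu$ and $\tau$ are treated as known inputs, the integration uses a simple midpoint rule, and all function names are hypothetical.

```python
from math import exp, log, pi, sqrt


def unnormalized_posterior(p, p_hat, m, mu, tau):
    """Likelihood (4) times prior (5), evaluated at a single value of p."""
    var = p * (1.0 - p) / m                       # sigma_i^2 = p(1 - p)/m
    likelihood = exp(-(p_hat - p) ** 2 / (2.0 * var)) / sqrt(2.0 * pi * var)
    jacobian = 1.0 / (p * (1.0 - p))              # d logit(p) / dp
    z = (log(p / (1.0 - p)) - mu) / tau
    prior = jacobian * exp(-0.5 * z * z) / sqrt(2.0 * pi * tau * tau)
    return likelihood * prior


def prob_meets_target(p_hat, m, c, mu, tau, grid_size=2000):
    """Posterior probability P*(p >= c) by midpoint-rule integration."""
    width = 1.0 / grid_size
    total = 0.0
    above = 0.0
    for j in range(grid_size):
        p = (j + 0.5) * width                     # midpoint of the j-th cell
        weight = unnormalized_posterior(p, p_hat, m, mu, tau)
        total += weight
        if p >= c:
            above += weight
    return above / total


def theta_bayes(p_hats, sizes, c, mu, tau):
    """Average of the per-school posterior probabilities of meeting c."""
    probs = [prob_meets_target(p_hat, m, c, mu, tau)
             for p_hat, m in zip(p_hats, sizes)]
    return sum(probs) / len(probs)


if __name__ == "__main__":
    mu = log(0.7 / 0.3)   # assumed prior mean on the logit scale
    print(theta_bayes([0.60, 0.72, 0.80], [30, 30, 30], 0.70, mu, 1.0))
```

Each school contributes a probability between 0 and 1 rather than a hard 0/1 indicator, which is what allows probabilistic statements about the district-level proportion.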
Numerical techniques were used; $P^*(p_i \ge c)$ could not be derived in a closed form due to the use of nonconjugate densities involving $p_i$.

Performance of the Estimators
- The Bayes estimator outperformed the conventional estimator for each of the three performance indices.
- The contribution of $z_{0.95}\,\mathrm{SE}(\hat{p}_i)$ to increased bias for the conventional estimator is shown in the table. Increased bias contributes to a larger MSE.
- The conventional estimate is based on the maximum likelihood estimator $\hat{p}_i$, which uses information only from the $i$th school, thus increasing the variance of the conventional estimator due to the inclusion of small schools.
- In contrast, the $i$th term of the Bayes estimator is derived from a combination of $\hat{p}_i$ and the overall mean $\mu$, which has the effect of reducing the variance of $\hat{\theta}_B$ (and increasing its stability).

Flexibility of the Bayes Estimator
The previous model can be extended:
- To handle small sample sizes $m_i$
- To incorporate distributions other than the normal distribution for $p_i$
By using proper extensions of the Bayes estimator, small sample sizes can be addressed and the true proportion can be more flexibly modeled.
By treating these challenges appropriately, the Bayes approach minimizes false-positives and false-negatives.

Extensions to the Bayesian Approach
A comparative reduction of variance as seen in Table 2 for $\hat{\theta}_B$ is typical for an estimator that uses a combination of the $i$th raw score and the overall mean score $\mu$.
This concept is commonly known as shrinkage estimation (for example, see Efron & Morris, 1972a; 1972b).
In general, this technique works well if a vector of parameters is the object of estimation; in our context, this vector could comprise several schools within a district.
The pooling of information is induced by the hierarchical model (1) & (2), which produces the shrinkage estimator $\hat{\theta}_B$.
This idea of “borrowing strength” can be further exploited to consider regression-based and regression-within-clustering-based shrinkage methods for further improvement of the estimation of $\theta$.

References
Brennan, R. L., & Kane, M. T. (1977). An index of dependability of mastery tests. Journal of Educational Measurement, 14(3), 277-289.
Efron, B., & Morris, C. (1972a). Limiting the risk of Bayes and empirical Bayes estimators, Part II: The empirical Bayes case. Journal of the American Statistical Association, 67, 130-139.
Efron, B., & Morris, C. (1972b). Empirical Bayes on vector observations: An extension of Stein's method. Biometrika, 59, 335-347.
Livingston, S. A., & Lewis, C. (1995). Estimating the consistency and accuracy of classifications based on test scores. Journal of Educational Measurement, 32(2), 179-197.
No Child Left Behind Act of 2001, Pub. L. No. 107-110, 115 Stat. 1425 (2002).
U.S. Department of Education (2010). State and Local Implementation of the No Child Left Behind Act, Volume IV - Accountability Under NCLB: Final Report. Washington, D.C.: Author.
Yen, W. M. (1997). The technical quality of performance assessments: Standard errors of percents of pupils reaching standards. Educational Measurement: Issues and Practice, 16(3), 5-15.

For Further Reading
Brooks, S. P. (1998). Markov chain Monte Carlo method and its application. The Statistician, 47(1), 69-100.
Casella, G., & George, E. I. (1992). Explaining the Gibbs sampler. The American Statistician, 46(3), 167-174.
Chib, S., & Greenberg, E. (1995). Understanding the Metropolis-Hastings algorithm. The American Statistician, 49(4), 327-335.
Gelfand, A. E., Hills, S. E., Racine-Poon, A., & Smith, A. F. M. (1990). Illustration of Bayesian inference in Normal data models using Gibbs sampling. Journal of the American Statistical Association, 85(412), 972-985.
Gill, J. (2002). Bayesian Methods: A Social and Behavioral Sciences Approach. Boca Raton, FL: Chapman & Hall/CRC.
Jackman, S. (2000). Estimation and inference via Bayesian simulation: An introduction to Markov chain Monte Carlo. American Journal of Political Science, 44(2), 375-404.
Ntzoufras, I. (2010). Bayesian Modeling Using WinBUGS. New York: Wiley.
Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398), 528-540.
Western, B., & Jackman, S. (1994). Bayesian inference for comparative research. The American Political Science Review, 88(2), 412-423.
Western, B. (1999). Bayesian analysis for sociologists: An introduction. Sociological Methods and Research, 28(1), 7-34.

Software for Bayesian Approach
- Approaches that work well with Gibbs sampling:
  - WinBUGS: http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml
- Programming approaches are recommended when the procedure involves intractable posterior distributions:
  - R
  - Matlab
  - C++/Java
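
To illustrate what a programming approach to an intractable posterior can look like, here is a minimal random-walk Metropolis sketch for the posterior of a single school's proportion under the hierarchical model in (1) and (2). Python is used only for illustration; it is not one of the tools listed above. The hyperparameters $\mu$ and $\tau$ are assumed known, the sampler works on the logit scale (so the Jacobian in (5) drops out), $\hat{p}_i$ is assumed to lie strictly between 0 and 1, and the function names, step size, and chain length are hypothetical choices, not settings from the talk.

```python
import random
from math import exp, log


def log_target(eta, p_hat, m, mu, tau):
    """Log posterior density of eta = logit(p), up to an additive constant."""
    p = 1.0 / (1.0 + exp(-eta))
    if p <= 0.0 or p >= 1.0:
        return float("-inf")                      # guard against rounding to 0 or 1
    var = p * (1.0 - p) / m                       # sigma_i^2 from model (1)
    log_lik = -0.5 * log(var) - (p_hat - p) ** 2 / (2.0 * var)
    log_prior = -((eta - mu) ** 2) / (2.0 * tau * tau)
    return log_lik + log_prior


def metropolis_prob_meets_target(p_hat, m, c, mu, tau, n_iter=6000,
                                 burn_in=1000, step=0.3, seed=0):
    """Estimate P*(p >= c) as the post-burn-in share of draws with p >= c."""
    rng = random.Random(seed)
    eta = log(p_hat / (1.0 - p_hat))              # start at the observed logit
    current = log_target(eta, p_hat, m, mu, tau)
    threshold = log(c / (1.0 - c))                # p >= c  <=>  eta >= logit(c)
    hits = 0
    for t in range(n_iter):
        proposal = eta + rng.gauss(0.0, step)     # symmetric random-walk proposal
        candidate = log_target(proposal, p_hat, m, mu, tau)
        if rng.random() < exp(min(0.0, candidate - current)):
            eta, current = proposal, candidate    # accept the move
        if t >= burn_in and eta >= threshold:
            hits += 1
    return hits / (n_iter - burn_in)


if __name__ == "__main__":
    mu = log(0.7 / 0.3)   # assumed prior mean on the logit scale
    print(metropolis_prob_meets_target(0.72, 30, 0.70, mu, 1.0))
```

Averaging this probability over the schools in a district gives a sampling-based counterpart to $\hat{\theta}_B$; in practice the step size and chain length would need tuning and convergence checks.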