EFFECTS OF SIMULATED RATER BIAS ON PREDETERMINED STANDARD SETTING JUDGMENTS

A Thesis

Presented to the faculty of the Department of Psychology
California State University, Sacramento

Submitted in partial satisfaction of the requirements for the degree of

MASTER OF ARTS
in
Psychology (Industrial/Organizational Psychology)

by

Charles Howard Strike

FALL 2013

Approved by:

Gregory M. Hurtz, Ph.D., Committee Chair
Lawrence S. Meyers, Ph.D., Second Reader
Timothy Gaffney, Ph.D., Third Reader

Student: Charles Howard Strike

I certify that this student has met the requirements for format contained in the University format manual, and that this thesis is suitable for shelving in the Library and credit is to be awarded for the thesis.

Jianjian Qin, Ph.D., Graduate Coordinator, Department of Psychology

Abstract of EFFECTS OF SIMULATED RATER BIAS ON PREDETERMINED STANDARD SETTING JUDGMENTS by Charles Howard Strike

This study used Monte Carlo simulations to explore the usefulness of different strategies for converting item-level proportion-correct standard-setting judgments into a θ-metric test cutoff score that can be used with item response theory (IRT) scoring. Simulated Angoff ratings, consisting of 1,000 independent 100-item by 15-rater matrices, were generated at five points along the θ continuum, ranging from -2 to +2, at five levels of rater bias relative to the item characteristic curves. A total of 37,500,000 ratings were generated as the basis of the analyses. These simulated proportion-correct ratings were converted to the IRT θ scale using test-level and item-level methods developed by Kane (1987). Overwhelmingly, Kane's weighted Method 1 and Method 3 performed best in recovering the original θ values.

ACKNOWLEDGMENTS

I would like to acknowledge my Committee Chair, Gregory M. Hurtz, Ph.D., for all of his efforts reviewing my work, and for the support and encouragement given throughout this process. My second and third readers, Lawrence S. Meyers, Ph.D., and Timothy Gaffney, Ph.D., provided important encouragement and feedback toward the successful completion of this project, and I am thankful for their patience and support. I would also like to thank my family and friends who have encouraged and supported me throughout my educational efforts and who have allowed me to get to where I am today. I want to thank my boss, Kelli Johnson, for allowing me the time off work to complete this process. Thank you to everyone who contributed to this project and provided me the motivation to complete it.

TABLE OF CONTENTS

Acknowledgments
List of Tables
List of Figures

Chapter
1. INTRODUCTION
   Latent Trait Theory
   Standard Setting
   Standard Setting Judgments
   Standard Setting Evaluation
2. ESTIMATION OF TEST SCORES
   Test Score Estimation
   Classical Test Theory
   Item Response Theory
   Rasch Model
   One Parameter Model
   Two Parameter Logistic Model
   Three Parameter Logistic Model
   IRT Assumptions
   Item and Test Information
   Test Characteristic Curve
   Standard Error of Measurement
   Invariance
   Parameter Estimation
3. ANGOFF METHOD AND THE LATENT TRAIT CONTINUUM
   Angoff Method
4. MONTE CARLO SIMULATIONS
   Monte Carlo
   Beta Distribution
5. METHOD
   Purpose of the Study
   Minimum Passing Level
   Length of Test and the Number of Items
   Simulated Rating Data
   IRT to CTT Conversion
   Data Analysis
6. RESULTS
   Unbiased Ratings
   Biased Ratings
7. DISCUSSION
   Unbiased Ratings
   Biased Ratings
   Conclusion
References

LIST OF TABLES

1. Unbiased Ratings Low Variability
2. Unbiased Ratings Medium Variability
3. Unbiased Ratings High Variability
4. .15 Above the ICC Low Variability
5. .15 Above the ICC Medium Variability
6. .15 Above the ICC High Variability
7. .15 Below the ICC Low Variability
8. .15 Below the ICC Medium Variability
9. .15 Below the ICC High Variability
10. .35 Above the ICC Low Variability
11. .35 Above the ICC Medium Variability
12. .35 Above the ICC High Variability
13. .35 Below the ICC Low Variability
14. .35 Below the ICC Medium Variability
15. .35 Below the ICC High Variability

LIST OF FIGURES

1. Item characteristic curve
2. Test characteristic curve

Chapter 1

INTRODUCTION

Exams are important in the selection of employees, especially when the jobs require licensure and certification testing. Performance-based standards must be established so that candidates taking those exams can be accurately assessed. Over time, several procedures have been developed to derive these performance-based standards (Kane, 1987). Candidates who take these exams must demonstrate a specified amount of knowledge in order to meet the specified criteria. Having this knowledge demonstrates that the candidate can successfully perform the task that he or she is licensed or certified to do. To determine the amount of knowledge needed to perform the job successfully, the exam creators must use standard-setting methods with specific procedures to determine the minimum score necessary for the candidate to perform at the required standard (Reckase, 2006). This minimum score is referred to as a cutoff score, which is used to make decisions that affect the lives of real people. This means that these scores must be valid and credible, and when they do not meet such standards, they are subject to legal defensibility reviews (Hurtz, Muh, Pierce, & Hertz, 2012). These scores result in standards that must be comparable over time. They also must be able to equate pass/fail decisions so that no test-taker is given an advantage or disadvantage over others because of poor exam construction (Hurtz et al., 2012).

The focus of these standard setting methods is to define how much of a latent trait an examinee needs in order to perform successfully on an exam. For the purposes of this study, the standard derived from the standard setting method is the desired ability level along the latent trait continuum, and the cut-score is the operational definition of that ability level for the specific exam (Hurtz et al., 2012).
The most relevant research to the present study is Hurtz, Jones, and Jones (2008), which evaluated these methods and provided ideas for further research. CTT conversions at the item level with different weighting schemes were evaluated using fixed values of θ* as representations of the judges' minimum-competency conceptualizations. The items were drawn from a published 75-item exam by Hambleton (as cited in Hurtz et al., 2008). The population standard deviation was set at .09, and rater bias was examined as over- and underestimation of +/- .10 relative to each ICC (Hurtz et al., 2008). The present study is an expansion of that study and used different methods of determining rater bias, agreement, and examinee ability levels, with a randomly generated item sample instead of the previous study's Hambleton 75-item exam.

Latent Trait Theory

In testing situations, the defining characteristics of examinees, called traits, can predict examinee performance. Scores are estimated based on these traits and then used to predict performance. These traits are unobservable and cannot be directly measured, so they are referred to as latent traits (Hambleton & Cook, 1977). To determine the relationship between the observable test score and the unobservable latent traits, a mathematical model is used, referred to as a latent trait model. This model relies on assumptions, and the assumptions are used to see how well the model fits the test data according to standard IRT practice. If the assumptions of a selected model are not met, a different model should be used (Hambleton & Cook, 1977).

Latent trait theory was originally developed as a mental test theory. It began with the dichotomous response model, in which each item on an exam is scored either correct or incorrect. The methods have advanced since their inception, but the dichotomous response model is the focus of the present study. An examinee's ability is hypothesized and can only be measured by examining responses to concrete entities, called items (Samejima, 1988).

Standard Setting

Standard setting is a major part of testing because of the need for examinees to demonstrate the knowledge they possess, usually determined by answering a specific number of questions correctly. The knowledge that the examinee possesses is the unobservable latent trait, and his or her score is the observable portion of the model. Using this perspective can assist the researcher in finding a theoretical framework for practically comparing the credibility of different standard setting methods (Hurtz et al., 2012). This information is used in certification, licensure, and selection exams where the examinee has to demonstrate a minimum amount of knowledge to pass.

One purpose is to set several different passing grades, or bands. Examinees are grouped into one of the different passing grades, and the employer considers the top band or bands when making employment selections. All candidates within a band are considered equal and are selected at random. Once the number of candidates in a single band is exhausted, the employer drops down to the next lowest band (Cascio, Alexander, & Barrett, 1988). This is the type used most commonly in civil service, such as in the State of California. Another common method is top-down selection, in which the employer selects the examinee who scores highest on the exam first and moves down the list until the openings are filled.
When determining which examinees have sufficient knowledge of the subject matter, the test needs to be fitted with a cut score, a standard of knowledge necessary to pass the examination. Standard setting refers to the process of establishing cut scores on examinations. In these situations, standard setting helps to create categories such as pass/fail, allow/deny a license, or award/withhold a credential. To set standards, a system of rules or procedures must be followed to assign a number that differentiates between two or more levels of performance. Additionally, more than two categories, such as basic, proficient, and advanced, are sometimes recommended and are commonly used to imply differing degrees of achievement (Cizek, 2006). How the standards are set must be considered early in the process so that the method matches the purpose of the test, the selected test items, and the test format. The standard setting process should be able to identify relevant sources of evidence affecting the validity of the assigned categories.

Performance standards are used to set passing scores. The desired competence level on the latent trait serves as the standard, which is translated into a specific number of correct items on an exam. Standard-setting procedures give participants the opportunity to use personal judgments about organizational policy to set a specific position on a score scale (Cizek, 2006). Several factors influence passing scores, the most important being how to determine the expertise or knowledge required for the purpose of the exam. This includes licensing medical personnel, where judgments need to be made about public health and safety. Even though standard setting is supposed to be objective, value judgments in addition to technical and empirical considerations influence the decision (Cizek, 2006).

The need for establishing the standard must be determined. In some cases a passing score is not appropriate, and participants involved in the standard setting process need to know what the purpose is. Once the participants know this information, they can make informed decisions regarding what is required to pass or fail. To make sure that the participants know what is necessary, they need to take part in an orientation process outlining the purpose and other information needed to set the standard for that specific exam (Cizek, 2006). One example involves revising standards, specifically the amount of mathematical expertise a particular elementary school needs to be comparable to other schools. The orientation process informs the participants of this need as well as any economic and social issues that may affect the final standard. Additionally, examples of the competencies defined by the content standards are provided. With licensure and certification examinations, participants receive information about the consequences of incorrect credentialing decisions, such as licensing an unsafe practitioner being more dangerous than failing to license a competent person (Cizek, 2006). Organizational and legal policies have a large impact on the rejection of a standard-setting panel's recommended cut-score in favor of a higher performance standard. An additional component of participant orientation is giving the participants information regarding examinee latent traits, or their ability level when taking the exam.
This involves giving participants information about examinees and their ability levels to help balance the exam, in addition to organizational information, to give a clear picture of what an examinee is required to do in order to pass the exam.

Standard setting can occur at different times during the development of an exam. It can occur after administration of a live test form with consequences for the examinees, or it can be done after pilot testing, before the test is given as a live exam. Standards set using normative p values based on live exam results are more accurate and stable because the examinees are operating under real exam conditions and the p values given to the judges are more accurate. Real exam data are more useful for standard setting procedures because data collected during a "no-stakes" test administration do not reflect adequate motivation on the part of the examinees (Cizek, 2006).

Generally, standards are set through many different methods depending on the purpose and the resources available. They involve using subject matter experts (SMEs) to set these standards in some form of workshop. This gathers experts who know the material and the knowledge necessary to qualify as proficient for the purpose of the exam. The workshops should be structured in a way that allows SMEs to efficiently review the information and the purpose of the exam so that the standards can be set as accurately and reliably as possible. Typical best practices involve gathering well-calibrated items that match the exam plan in the form of one or more complete test forms. After the items are rated, the ratings need to be converted and linked to some kind of standard. An argument can be made that they should be linked to the latent trait scale, because doing so can make the standard setting process more efficient and precise. Additionally, fewer resources are used and the standards set are more consistent and reliable (Hurtz et al., 2012).

Standard Setting Judgments

It is crucially important to properly identify and train qualified participants. The panel of judges needs to be representative and large enough to ensure the judgments can be replicated. The expertise of the participants needs to be documented, which limits the pool of judges to those with sufficient experience, so the panels can only be representative of the experts in the field (Cizek, 2006). The selected participants are provided with additional information. They get feedback consisting of normative, reality, or impact information, designed to help judges make decisions in each iterative round (Cizek, 2006). Additionally, they need to be able to consider examinee ability and identify how an examinee may perform on a specific exam based on the latent trait being tested. These judgments must be made based on the latent trait model used, and the judges must be properly trained on what they need to rate.

Standard Setting Evaluation

All standard setting procedures must include documentation of every step taken during the process. The method, test design, test purpose, agency goals, and participant characteristics all must be examined (Cizek, 2006). The process must be externally evaluated to ensure that it was done correctly and that few if any deviations from the established principles occurred. Any deviation must be reasonable, specified in advance, and consistent with the prescribed goals (Cizek, 2006).
Internal evaluation must also be conducted and documented to determine participant agreement, whether any one participant had undue influence over the process, and how participant agreement was achieved. A minimum of two evaluations is recommended, involving examination of how the judges reached their ratings, usually with a standard set of evaluative questions decided upon before the standard setting session (Cizek, 2006).

Many different standard setting methods exist, and the Standards for Educational and Psychological Testing (1999) do not endorse one specific method. The Standards include methods that make judgments about the test content or about the test takers. The participants need help to make informed judgments that can be reproduced and are fair to each examinee. The method selected needs to fit the purpose and format of the examination and the scoring model of the exam (Cizek, 2006). The way that standards are set relies heavily on individuals making judgments against specified criteria, most commonly a score that a minimally competent candidate would have to achieve in order to demonstrate the requisite knowledge deemed necessary to perform a function. These individuals are most commonly making judgments based on a Classical Test Theory (CTT) score, which, as outlined in the next chapter, can be problematic. With the newer emphasis on Item Response Theory (IRT), these judgments must be transformed into the values necessary for computation of θ and the item parameters.

Chapter 2

ESTIMATION OF TEST SCORES

Test Score Estimation

The standards are set based on the process of estimating examinee results. The test theories discussed here, Classical Test Theory (CTT), Item Response Theory (IRT), and Rasch Modeling (RM), each have their own way of estimating examinee scores.

Classical Test Theory

Classical psychometric theory assumes that an obtained score from an exam is composed of two parts, the examinee's true score and an error component. Other assumptions include the following: the true score and the error score are not correlated; error scores from one measure are not correlated with error scores obtained on other measures; and error scores and true scores are not correlated when obtained from different measures. These assumptions imply that error scores are random and unpredictable. This does not consider systematic errors made during repeated testing (Guion, 1998).

Practically, an examinee's actual score is of more interest than the true score, and the actual score is a combination of a systematic score and random error. This means that there are two error scores: an individual error affecting one examinee, and a systematic error affecting every examinee who takes the test. Both error scores influence the measure that is used and comprise the total variance in a set of scores in the form of systematic causes and random error (Guion, 1998).

CTT is interested in the true score and estimates attributes by using a linear combination of test item responses composed of the true score, the observed score, and the error (Ellis & Mead, 2002). The true score is an examinee's expected score on a test over repeated administrations or across parallel test forms. The observed score is expressed as:

X = T + E

The unobserved score, or error, E, is the difference between an observed score and a true score (Ellis & Mead, 2002).
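To make the decomposition concrete, the following is a minimal simulation sketch in Python (the true-score mean and error spread are hypothetical values, not taken from the thesis). Observed scores are generated as a fixed true score plus random error on each administration, and averaging over many simulated administrations approximates the true score, the property formalized next.

```python
import numpy as np

rng = np.random.default_rng(2013)

n_examinees = 1000
n_administrations = 50  # hypothetical number of repeated testings

# True scores T: one fixed value per examinee (hypothetical distribution).
T = rng.normal(loc=70.0, scale=10.0, size=n_examinees)

# Observed scores X = T + E, with fresh random error E each administration.
E = rng.normal(loc=0.0, scale=5.0, size=(n_administrations, n_examinees))
X = T + E

# The mean observed score across administrations approximates T.
print(np.allclose(X.mean(axis=0), T, atol=4.0))    # approximately True

# Error assumptions: mean error near 0, and T uncorrelated with E.
print(round(E.mean(), 3))                          # near 0
print(round(np.corrcoef(T, E[0])[0, 1], 3))        # near 0
```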
Observed scores are random variables with unknown distributions, and the mean of the theoretical distribution of a group of observed scores represents the true score concept:

E(X) = T

T cannot be observed, but its properties allow useful inferences to be made (Ellis & Mead, 2002). Examinees can have a low score because their true score is low, their error score is low, or a combination of both conditions. The lowest-scoring examinees on any given test most likely have low T and low E scores, indicating that observed scores on repeated examinations would be higher because the error scores vary on each administration. Both high- and low-scoring examinees receive scores closer to the mean on repeated measurements, a concept known as regression to the mean (Kachigan, 1986). The true score can only be estimated because of its mathematical abstraction, and the researcher must use the estimate to see how well it fits the particular model with regard to the practicality of the model, score, and fit (Lord, 1980).

Another assumption of CTT is that the expected value, or mean error score, for a population of examinees is zero:

\mu_E = 0

Additionally, the correlation between true score and error for a population of examinees is assumed to be zero:

\rho_{TE} = 0

Finally, the correlation between error on test one (E1) and error on test two (E2) is assumed to be zero:

\rho_{E_1 E_2} = 0

Practitioners want to know the true score but can only use the observed score, leaving the practitioner to determine the relationship between the two scores (Ellis & Mead, 2002). The reliability index (\rho_{XT}) refers to the correlation between observed scores and true scores in a population of examinees, expressed as the ratio of the standard deviations of true scores and observed scores:

\rho_{XT} = \sigma_T / \sigma_X

The reliability index is unknown because the standard deviation of the true score distribution is unknown (Ellis & Mead, 2002). The consistency of a set of measures with respect to a specific trait, together with individual systematic errors, determines reliability, and random error has little to no effect on reliability. The more reliable the test is, the less random error the test possesses, with smaller error variances indicating that the test is reliable (Guion, 1998).

Reliability can be determined by repeated testing, but a more useful way is to use parallel forms, which eliminate the practice effect of answering the same questions repeatedly. Parallel test forms are composed of different items of similar difficulty covering the same information. Reliability is determined by how similar an examinee's scores are on each form. Practically, a single score that depicts the examinee's behavior is desired, so the scores across the test forms are averaged and the result is interpreted as if the examinee took a single exam. This score is more reliable because it is based on a larger sample of behavior and is more representative of an examinee's true behavior (Lord, 1980). Usually, when determining test reliability, examinees are tested with the same test twice or with two parallel tests. However, strictly parallel tests are hard to achieve, because they require the examinee's true score and error variance to be the same across forms. This is hard to do because the items have to be different but similar in difficulty, and difficulty can only be approximated based on how much information is available about the items.
This depends on the resources available before the exam is put into use, and many times this is not practical. Approximation leads to a correlation between examinee scores on parallel tests, which is known as the reliability coefficient (Ellis & Mead, 2002). The reliability coefficient is an estimate of the square of the reliability index, which estimates the proportion of the total variance that is systematic (Guion, 1998). Mathematically, the reliability coefficient is the ratio of true score variance to observed score variance:

\rho_{X_1 X_2} = \sigma_T^2 / \sigma_X^2

This implies that if a test had perfect reliability there would be no error, which is theoretically possible but usually not achieved in the real world. Item analysis using CTT is designed to maximize internal consistency estimates of reliability using coefficient alpha, expressed as a decimal between 0 and 1 (Ellis & Mead, 2002). The closer the value gets to 1, the less random error variance the exam has and the more reliable it is determined to be (Guion, 1998).

CTT item analysis is used to determine the item characteristics of difficulty and discrimination. Item difficulty refers to how well items fit a target population and ranges from 0 to 1, with values near the limits providing little or no useful information. Item difficulty is related to the total test score, which determines item variance. Information about examinee differences and total test score variance is maximized when p_i = .50, assuming the inter-item correlations are held constant. Item discrimination is the determination of an examinee's knowledge: examinees who have not mastered the material should not get the item correct, and examinees who have mastered the material should get the item correct. Item discrimination refers to the difference between the percentage correct for each of these groups. Item discrimination indices include the D index, the point-biserial correlation, and the biserial correlation (Ellis & Mead, 2002).

The D index is the difference in the proportion of examinees passing an item between the upper (P_u) and lower (P_l) groups:

D = P_u - P_l

where the groups are defined by upper and lower percentages of the total score distribution, usually the top and bottom 27% (Ellis & Mead, 2002). The point-biserial index shows the relationship between the total test score and examinees' performance on an item, computed by this formula:

r_{pbis} = \frac{M_+ - M_T}{S_T} \sqrt{p / q}

in which M_+ is the mean test score of the examinees passing the item, M_T is the mean of the test scores, S_T is the standard deviation of the test scores, p is the item difficulty, and q is 1 - p (Ellis & Mead, 2002). The biserial correlation assumes that the latent variable underlying the item response is continuous and normally distributed, and is computed by this formula:

r_{bis} = \frac{M_+ - M_T}{S_T} \left( \frac{p}{Y} \right)

in which Y is the height of the standard normal density at the z-score dividing the area under the curve proportionately between p and q (Ellis & Mead, 2002). The r_{pbis} is always smaller than r_{bis}, as shown by this equation:

r_{pbis} = \frac{Y}{\sqrt{pq}} \, r_{bis}

Choosing which index to use depends on how practical the information is. The D index is easier to calculate, but the correlational indices may give better information depending on the characteristics of the analysis. If the items have moderate difficulty, very little difference is observed among the three methods (Ellis & Mead, 2002).
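As a concrete illustration of the three indices, here is a small Python sketch (a hypothetical helper, not from the thesis) that computes D, the point-biserial, and the biserial correlation for one dichotomously scored item, assuming responses coded 0/1 and a vector of total test scores:

```python
import numpy as np
from scipy.stats import norm

def discrimination_indices(item, total, tail=0.27):
    """Compute the D index, point-biserial, and biserial correlation
    for one dichotomous item (0/1) against total test scores."""
    item, total = np.asarray(item), np.asarray(total)
    n = len(total)
    order = np.argsort(total)
    k = int(round(tail * n))
    p_l = item[order[:k]].mean()        # proportion correct, lower 27%
    p_u = item[order[-k:]].mean()       # proportion correct, upper 27%
    d_index = p_u - p_l

    p = item.mean()                     # item difficulty
    q = 1.0 - p
    m_plus = total[item == 1].mean()    # mean score of examinees passing the item
    m_t, s_t = total.mean(), total.std()
    r_pbis = (m_plus - m_t) / s_t * np.sqrt(p / q)
    y = norm.pdf(norm.ppf(q))           # normal density at the p/q split point
    r_bis = (m_plus - m_t) / s_t * (p / y)
    return d_index, r_pbis, r_bis
```

For a moderately difficult item the three returned values will rank items similarly, echoing the comparison above.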
When item difficulties are in extreme ranges, when different examinee populations are sampled, or when the developer prefers indices of discrimination and difficulty that are independent of each other, r_bis is the preferred method. A disadvantage of r_bis is that it can yield coefficients over 1.00 if the underlying assumptions are violated. If items with high internal consistency are desired, r_pbis is the better choice (Ellis & Mead, 2002). Additionally, r_pbis gives more information about the test's predictive validity because it works well with moderately difficult items. These last two methods provide more information than the D index, which can discard a third or more of the data.

CTT is based on weak assumptions easily met by most test data sets, and its models have been applied to a wide variety of test development and test score analysis problems, but the item difficulty index, the item discrimination index, the observed score, and the true score are completely sample and administration dependent (Ellis & Mead, 2002; Hambleton & Swaminathan, 1985; Lord, 1980). Other shortcomings exist when dealing with CTT: examinee characteristics and test characteristics cannot be interpreted independently. Examinee ability is test dependent and cannot be compared outside of a particular test. Test item difficulty is examinee dependent, meaning that items are rated easy or difficult because of the characteristics of the people taking the test (Hambleton, Swaminathan, & Rogers, 1991). Calculating examinee ability depends on test item difficulty, and all of the item statistics, such as item discrimination, reliability, and validity, depend on the group of examinees taking that particular examination. This means that these statistics change after every exam administration.

Group-dependent item indices cannot be used when tests are constructed for examinee populations whose characteristics differ from those of the examinees who provided the indices. This makes comparisons extremely difficult or impossible because the scores depend on the test in addition to being based on two different scales (Hambleton et al., 1991). This occurs even when examinees take the same or parallel tests, because examinees possess different ability levels and the amount of error differs. It is more desirable to have examinees answer some items correctly and some incorrectly, because this provides information about the examinee's ability and can give a precise ability level. If two examinee scores contain equal amounts of error, test difficulty can be matched with approximate ability levels (Hambleton et al., 1991).

CTT also has issues with defining reliability and the standard error of measurement (SEM). The SEM is a function of test score reliability and variance, as shown by the following formula:

SE = S_x \sqrt{1 - r_{xx}}

where SE is the standard error, S_x is the test score standard deviation, and r_xx is the test reliability. An assumption of CTT is that the SEM is the same for all examinees, where reliability is the correlation between test scores on parallel forms (Hambleton et al., 1991). Several methods of finding this correlation exist, but the problem is meeting the definition of parallel tests, which is very difficult if not impossible using CTT. Reliability coefficients, such as alpha, can provide lower-bound estimates of reliability, or reliability estimates with unknown biases.
This can result in exam scores not being precise measures for examinees with different ability levels, which means that the assumption of equal errors of measurement for all examinees is implausible (Hambleton et al., 1991). CTT is oriented to the test rather than the item, and the classical true score model can predict examinee responses to any given item in a linear fashion, but the accuracy suffers. CTT provides less-than-ideal solutions to many testing problems such as test design, identification of biased items, adaptive testing, and test score equating. In seeking an alternative test theory, researchers need one with item characteristics that are not group dependent, examinee scores that are not test dependent, a reliability measure that does not require parallel tests, and a model that precisely measures ability.

Item Response Theory

The concepts of IRT began in 1906, when Binet and Simon plotted performance levels in relation to an independent variable. Thurstone developed a method of paired comparisons in 1928 that can be used to scale a collection of stimuli. Richardson derived relationships between IRT models and classical item parameters in 1936, providing a method of obtaining IRT parameter estimates. In 1952, Lord developed the two-parameter normal ogive model, and Birnbaum developed the logistic models and supplied the necessary statistical foundations. Rasch developed three item response models in 1960, spurring further research and the development of computer programs to assist in the underlying statistical analysis of the Rasch model.

IRT became necessary because of the numerous CTT deficiencies. IRT is based on the theory that the probability of an examinee's answer on an item can be determined by a function of the examinee's position in the distribution of the latent trait being observed. This can be displayed graphically as an item characteristic curve (ICC; see Figure 1) (Fischer & Molenaar, 1995).

[Figure 1. Item characteristic curve (ICC) for a hypothetical item, plotting the probability of a correct response (0 to 1) against ability level θ (-3 to +3).]

The ICC is a plot of the level of performance on some task or tasks against an independent measure. A smooth nonlinear curve is fitted to the data so that minor irregularities in the data pattern are removed, which makes it easier to design and analyze tests. The ICC provides the probability that examinees with a given ability level answer each item correctly, and the probability value is independent of the number of examinees at that ability level (Hambleton & Swaminathan, 1985).

IRT deals with predicting examinee performance by defining examinee characteristics, consisting of traits or abilities (Hambleton & Swaminathan, 1985). Tests contain multiple items, and when the item scores are summed, the test score is found. To describe test scores coming from a specific group of examinees, statistics that show individual item scores are used. Georg Rasch (1980), in developing the Rasch Model (RM), wanted to use invariant comparison to describe items and examinees by their parameters. This allows computation of the probability of any examinee's response to any item, even if similar examinees have never taken similar items before. The relationship between examinee ability level and response to an item is known as the item response function (Lord, 1980).
To predict or explain item and test performance, examinee scores are estimated using trait and ability scores. The item response model chosen specifies the relationship between observable examinee test performance and the unobservable traits or abilities being measured (Hambleton & Swaminathan, 1985). Many different models are available for selection and there is no one "correct" model, requiring the use of goodness-of-fit tests. Each model has specific mathematical functions that describe the observable and unobservable quantities by specifying assumptions about the test data. These models can be unidimensional or multidimensional, measuring one underlying trait or more than one. They can be linear or nonlinear, and can use dichotomous scoring (correct or not) or polytomous scoring (multiple response categories) (Hambleton & Swaminathan, 1985).

Item response models are defined by the mathematical form of the item characteristic function and the number of specified parameters (Hambleton et al., 1991). There are one or more parameters describing the item and the examinee, and their utility is determined by assessing how well the model fits the data. Once a model is found that fits the test data, and all parameters are held constant except for item difficulty, examinee ability estimates and item indices can be determined. These ability estimates and item indices are not test or group dependent. This means that ability estimates obtained from different items, and item parameter estimates obtained from different examinees, are the same. This is the biggest advantage of IRT over CTT. Another IRT bonus is that estimates of standard errors for individual ability estimates can be obtained instead of a single estimate of error for all examinees (Hambleton et al., 1991).

Rasch Model

The Rasch Model (RM) makes more stringent assumptions than other IRT models by stating an ideal measurement model and using data to see whether the data fit that model. The probability of an examinee making a specific response is derived from a logistic function of the person and item parameters, which means that higher-ability examinees have a higher probability of a correct answer (Fischer & Molenaar, 1995). The continuum of total score assessments is used to determine where an examinee is located using the scores, which are counts of discrete observations representing an observable outcome between an item and an examinee (Fischer & Molenaar, 1995). The RM is restrictive because it holds strongly to a specific model that confines each item to the same discrimination, rather than changing the model to fit the data. Guessing behavior is not directly modeled, but it is theoretically included in the error structure and must be evaluated by the researcher.

One Parameter Model

Unidimensional and multidimensional models can be used for both dichotomously and polytomously scored data. Commonly used logistic models are the one-, two-, and three-parameter logistic models. The ICCs for the one-parameter logistic (1PL) model are given by the following equation (Hambleton et al., 1991):

P_i(\theta) = \frac{e^{(\theta - b_i)}}{1 + e^{(\theta - b_i)}}, \qquad i = 1, 2, \ldots, n

where P_i(θ) is the probability that a randomly chosen examinee with ability θ answers item i correctly, b_i is the difficulty parameter for item i, n is the number of items in the test, and e is a transcendental number whose value is approximately 2.718. P_i(θ) is an S-shaped curve with values between 0 and 1 over the ability scale.
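A minimal Python sketch of the 1PL item characteristic function follows (the difficulty value is hypothetical); it makes visible the property of b_i discussed next:

```python
import numpy as np

def p_1pl(theta, b):
    """1PL probability of a correct response at ability theta
    for an item with difficulty b."""
    return np.exp(theta - b) / (1.0 + np.exp(theta - b))

theta = np.linspace(-3, 3, 7)
print(np.round(p_1pl(theta, b=0.0), 3))  # S-shaped rise from near 0 to near 1

# At theta equal to b the probability of a correct response is .5.
print(p_1pl(0.0, b=0.0))  # 0.5
```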
The b_i parameter is the point on the ability scale where the probability of a correct response is 0.5. It also indicates the position of the ICC in relation to the ability scale: the larger this parameter, the higher the ability level required for an examinee to have a 50% chance of getting the item correct (Hambleton et al., 1991). The more difficult the item, the higher it sits on the scale. The ability values from a particular group are standardized to a mean of 0 and a standard deviation of 1. The values of b_i are theoretically defined on a scale ranging from negative infinity to positive infinity, but typically the values vary only from about -2.0 to +2.0 (Hambleton et al., 1991). The b parameter is expressed as a z-score, and the mean and standard deviation are based on the distribution for the ability being measured. With a normal distribution, values between +/- 1.64 are reasonable depending on the use of the exam, because this stays away from the extremes, where little additional information is provided (Ellis & Mead, 2002).

The key assumption of the 1PL model is that item difficulty is the main characteristic influencing examinee performance. Other factors such as guessing behavior are considered but, similar to the Rasch model, are absorbed into the residuals unless they harm the model fit too much. In IRT, the choice to use this model depends on the data being analyzed and how it will be applied; the data set and the purpose are the main factors in the selection of other models (Hambleton et al., 1991).

Two Parameter Logistic Model

Lord first based his two-parameter item response model on the cumulative normal distribution, or normal ogive, but this was later replaced by the two-parameter logistic (2PL) model of Birnbaum (Lord & Novick, 1968). The logistic model is an explicit function of item and ability parameters. Item characteristic curves for the 2PL model are given by the equation (Hambleton et al., 1991):

P_i(\theta) = \frac{e^{D a_i (\theta - b_i)}}{1 + e^{D a_i (\theta - b_i)}}

where P_i(θ), e, and b_i are defined in the same way as for the 1PL model. The factor D is a scaling factor introduced to make the logistic function as close as possible to the normal ogive function (Hambleton et al., 1991). The parameter a_i is the item discrimination parameter, usually ranging from .5 to 2.0, with values below .5 limiting an item's information and values above 2.0 possibly indicating a problem with its estimation (Ellis & Mead, 2002). The a parameter is proportional to the slope of the ICC at b_i on the ability scale. Items with steeper slopes are more desirable because they do a better job of sorting examinees into different ability levels. Higher item discrimination values (a_i) result in item characteristic functions that rise more steeply as the examinee's ability increases. This provides the opportunity to use differently discriminating items, but this model makes no allowance for guessing behavior (Hambleton et al., 1991).
Three Parameter Logistic Model

The three-parameter logistic (3PL) model uses three parameters to describe the ICC: the discrimination parameter a, the difficulty parameter b, and the pseudo-guessing parameter c. It is written as (Ellis & Mead, 2002):

P_i(\theta) = c_i + (1 - c_i) \frac{1}{1 + \exp\{-D a_i (\theta - b_i)\}}

where P_i(θ) is the probability that an examinee with ability θ answers item i correctly; a_i is proportional to the slope of the ICC at its point of inflection; c_i is the height of the lower asymptote of the ICC; and D, the same scaling constant as in the 2PL model, is 1.7 (Ellis & Mead, 2002).

Parameter c is the probability that a person completely lacking in ability will answer the item correctly. It is called the guessing parameter or the pseudo-chance score level (Lord, 1980). Theoretically this ranges from 0 to 1, but practical c parameters are frequently lower than the probability of random guessing, depending on the available number of response options. Large c values degrade the item's ability to discriminate between low- and high-ability examinees. The c parameter influences the shape of the ICC, which must be fitted between the c parameter and 1.0 (Ellis & Mead, 2002).
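The 3PL function is easy to express directly. The sketch below (parameter values again hypothetical) also shows that the 2PL is the special case c = 0, and the 1PL above additionally fixes a = 1 with D = 1:

```python
import numpy as np

def p_3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response; the 2PL is the special
    case c = 0, and the 1PL further sets a = 1 with D = 1."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

# Lower asymptote: even a very low-ability examinee succeeds with
# probability approaching c (here c = .20, a hypothetical value).
print(round(p_3pl(-6.0, a=1.2, b=0.5, c=0.20), 3))   # ~0.20

# At theta = b the probability is halfway between c and 1.
print(p_3pl(0.5, a=1.2, b=0.5, c=0.20))              # 0.6 = c + (1 - c)/2
```

Note that with a nonzero c, the probability at θ = b is no longer .5 but c + (1 - c)/2, which is why large c values blur the interpretation of b as a 50% threshold.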
IRT Assumptions

Local independence is the first assumption for IRT and Rasch models. It requires that any two items be uncorrelated when examinee ability level, θ, is held constant. Local independence is obtained only when each ability dimension, in addition to non-ability dimensions such as examinee personality and test-taking behaviors that influence performance, has been taken into account (Hambleton et al., 1991). Unidimensionality, a special type of local independence, refers to the test items measuring only a single ability. This assumption is never strictly met because of other external factors affecting examinee performance; it is practically met by allowing one dominant ability to explain examinee performance. If more than one ability explains examinee performance, the model is multidimensional (Hambleton et al., 1991). If the guessing parameter c is equal to 0, the tetrachoric intercorrelation matrix is of unit rank with θ as the common factor (Lord, 1980). The major distinction among commonly used IRT models is in the number and type of item characteristics assumed to affect examinee performance (Hambleton et al., 1991).

Item and Test Information

IRT uses the concept of information instead of the CTT concept of reliability. IRT makes it possible to assess information functions for individual items instead of a single reliability estimate for an entire test (Ellis & Mead, 2002). The item information function is formed by looking at each item and its conditional variance at each ability level. More information is provided as the slope increases and the variance decreases. This also lowers the SEM, which is used as a tool to discard items that are not performing well; higher-SEM items are discarded because they do not provide as much information. Item information functions can be shaped in a number of ways, depending on how the test is constructed. They provide maximum information at b_i for the one- and two-parameter models. For the three-parameter model, the maximum information occurs slightly above b_i, as shown by the following equation (Hambleton et al., 1991):

\theta_{max} = b_i + \frac{1}{D a_i} \ln \left[ \frac{1 + \sqrt{1 + 8 c_i}}{2} \right]

The maximum value of the information is constant across items for the 1PL model, but for the 2PL it is proportional to the square of the item discrimination parameter, so larger values of a provide greater information (Hambleton et al., 1991). The following equation provides the maximum value for the 3PL model:

I(\theta, u_i)_{max} = \frac{D^2 a_i^2}{8 (1 - c_i)^2} \left[ 1 - 20 c_i - 8 c_i^2 + (1 + 8 c_i)^{3/2} \right]

The closer the guessing parameter c_i is to zero, the more information is obtained. The item information functions determine the test information function, which is given by (Hambleton & Swaminathan, 1985):

I(\theta, \hat{\theta}) = \sum_{i=1}^{n} \frac{(P_i')^2}{P_i Q_i}

The quality and number of items influence the information provided by the test information function. The test information function is defined for a set of test items at each point on the ability scale, and the contribution of each item is independent of the other items. The amount of information provided at each ability level is negatively related to the error associated with ability estimates. Each item's contribution is additive, making it easy to determine the impact of each item (Hambleton & Swaminathan, 1985).

Test Characteristic Curve

The test characteristic curve (TCC) describes the relationship between the true score and the ability scale. Given an ability level, the researcher can determine the corresponding true score from the TCC. For a one- or two-parameter model applied to a complete test, the left tail of the curve approaches zero as the ability score decreases, and the upper tail approaches the number of items in the test as the ability score increases (Baker, 2001).

[Figure 2. Test characteristic curve (TCC) for a hypothetical 15-item test, plotting the expected true score (0 to 15) against ability level θ (-3 to +3).]

This graph shows the assumption that a true score of zero corresponds to an ability of negative infinity, and a true score of N, the number of items on the test, corresponds to an ability of positive infinity. In 3PL models the left end of the curve trails off at the level implied by the c parameters, showing that low-level examinees can achieve a score higher than zero by guessing at the item level; at the test level, this feature appears only when aggregating across items. As the ability level of the examinee approaches positive infinity, the TCC shows a true score of N. This allows test developers to transform examinee ability into true scores and gives the examinee a way to interpret his or her own ability level (Baker, 2001).
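To tie these pieces together, here is a short sketch over a hypothetical three-item bank (redefining p_3pl for self-containment). It approximates P_i' by central differences rather than an analytic derivative, sums items into the test information function and the TCC, and converts information into a conditional standard error, anticipating the next section:

```python
import numpy as np

def p_3pl(theta, a, b, c, D=1.7):
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

# Hypothetical three-item bank: columns are a, b, c.
items = np.array([[1.0, -1.0, 0.15],
                  [1.4,  0.0, 0.20],
                  [0.8,  1.0, 0.25]])

def item_information(theta, a, b, c, h=1e-5):
    """I_i(theta) = (P')^2 / (P * Q), with P' by central differences."""
    p = p_3pl(theta, a, b, c)
    dp = (p_3pl(theta + h, a, b, c) - p_3pl(theta - h, a, b, c)) / (2 * h)
    return dp ** 2 / (p * (1.0 - p))

theta = np.linspace(-3, 3, 13)
info = sum(item_information(theta, a, b, c) for a, b, c in items)  # test information
tcc = sum(p_3pl(theta, a, b, c) for a, b, c in items)              # expected true score
se = 1.0 / np.sqrt(info)  # conditional standard error of the ability estimate

print(np.round(info, 2))
print(np.round(tcc, 2))   # climbs from about sum(c) toward the number of items
print(np.round(se, 2))    # smallest where the test is most informative
```

The printed TCC values illustrate the point made above: the low end levels off near the sum of the c parameters rather than at zero.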
Standard Error of Measurement

The standard error of measurement (SEM) describes how examinee test scores fluctuate over repeated testing because of the error component. Through this process, confidence intervals are generated to help interpret test scores. Within the CTT framework, the SEM is the test score standard deviation times the square root of 1 minus the reliability (Embretson & Reise, 2000):

SEM = \sigma (1 - r_{xx})^{1/2}

Assuming that measurement error is normally distributed and equal across all score levels allows confidence intervals to be constructed. This means that the same confidence interval applies to each score level and the true score is derived linearly. These apply only to a particular population because the computations use population statistics. The raw score mean and standard deviation must be estimated for a population to use the standard score conversion, and the standard error must be computed using variance and reliability estimates (Embretson & Reise, 2000).

In IRT, the SEM can be estimated for any ability level because it does not depend on population distributions. There is empirical and theoretical evidence indicating that SEM values differ depending on examinee ability or score levels, so traditional computations of the SEM are not adequate (Woodruff, 1990). This means that the SEM is conditional: it changes as ability or score levels change. Trait scores are estimated separately for each score or response pattern, yielding smaller standard errors when the items are most appropriate for a specific trait score level and when the items have high discrimination (Embretson & Reise, 2000). Information is defined for both the item and the total scale, and the item information function shows the contribution that an item makes along the θ continuum (Ellis & Mead, 2002). Item information is the reciprocal of the error variance, or squared standard error, so smaller error variances provide more information. Confidence intervals around an examinee's score are constructed using the variability among test scores, or the conditional standard error of measurement (CSEM) (Ellis & Mead, 2002). It is useful to know the CSEM for a given ability level, but it is more useful to use item and test information when determining which items to select for the test. Test developers specify the desired test information function and use item analysis to select items whose summed item information functions approximate the desired test information function (Ellis & Mead, 2002). The SEM is used to detect differences between two people's scores, to see whether an examinee's score differs from some true score, or to assess whether scores discriminate differently between demographic groups or other groups defined by different score ranges (Guion, 1998).

Invariance

A major component of IRT is the property of invariance of item and ability parameters. Invariance refers to the item and ability parameters remaining the same regardless of the examinees or the specific test administration. Item parameters are independent of the examinee ability distribution, and examinee parameters are independent of the set of test items (Hambleton et al., 1991). Assuming the model fits the data, the same ICC is obtained for a test item regardless of examinee ability and population. This property parallels linear regression: when the regression model fits the data, the same regression line is found even if the distribution of the predictor variable changes (Hambleton et al., 1991). The probability that examinees at a specific ability level answer item i correctly depends only on θ. When the model holds, a_i, b_i, and c_i do not change when the group being tested changes, so the three parameters are invariant (Lord, 1980).
Parameter Estimation

Any time IRT is applied to test data, the parameters of the chosen IRT model are estimated. The examinees' responses are used to estimate the item parameters and the examinee ability levels (Hambleton et al., 1991). Several methods of parameter estimation can be used. If θ is known, the data points necessary to estimate the unknown parameters are the same as the item parameters in the model, assuming perfect model fit. In practical applications the model does not exactly fit the data, so the goal is to find the parameter values that best fit the curve, using a maximum likelihood criterion (Hambleton et al., 1991).

Assuming examinee responses are independent and θ is known, an item is administered to many examinees and a likelihood function of the N examinee responses is obtained. To obtain the maximum likelihood estimates (MLE) of the parameters a, b, and c in this case, the values corresponding to the maximum of a surface in three dimensions must be found (Hambleton et al., 1991). When the ability of each examinee is known, each item may be considered separately without reference to the other items, and the process is repeated once for each item (Hambleton et al., 1991).

A common and more difficult problem is that both θ and the item parameters are unknown. To determine these values, all of the examinee responses must be examined. Assuming local independence, for the three-parameter model a total of 3n + N parameters (items plus people) must be estimated (Hambleton et al., 1991). The simplest approach is to select an arbitrary metric for the ability scale, for example setting the mean to zero and the standard deviation to 1. To set the initial values of the ability parameters, the logarithm of the ratio of the number of correct responses to the number of incorrect responses is obtained for each examinee. These values are standardized and used to estimate the item parameters. Once the estimates of the item parameters are obtained, the ability parameters are estimated. This process is repeated until the values remain consistent between two successive estimations, resulting in an approximation of the item parameters and ability estimates (Hambleton et al., 1991).

A different approach uses Bayesian estimates of the parameters with a priori distributions. Estimating the item and ability parameters simultaneously can produce inconsistent joint maximum likelihood estimates. To resolve this issue, the item parameters are estimated without reference to the ability parameters: using a random sample of examinees and specifying an ability distribution allows the ability parameters to be integrated out of the likelihood function (Hambleton et al., 1991). The resulting marginal maximum likelihood estimates of the item parameters remain consistent as the number of examinees increases, allowing them to be treated as known values and used to estimate examinee ability. Bayesian estimation also helps to resolve the problem of poor c estimates producing less accurate estimation, and using more items allows for better parameter estimates (Hambleton et al., 1991).

Theoretically, high-ability examinees should never get an easy item wrong, but careless mistakes occur. The logistic function reaches its asymptotes more slowly than the normal ogive, so such mistakes have less of an impact. Ability is difficult to measure accurately, so a small-sample frequency has to be inferred from the model by using an observable quantity with known parameter distributions (Lord, 1980). Predictions are made using the estimated values, and their accuracy is checked by fitting them to the observed data. Several real-world issues come into play, such as examinees becoming tired, sick, or uncooperative before completing the testing. Omission of items, not finishing, skipping back and forth through the test, and poor item quality also affect model fit (Lord, 1980).
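As a simplified illustration of the likelihood machinery, the sketch below estimates a single examinee's θ given known item parameters by maximizing the log-likelihood over a grid (the item bank is hypothetical and p_3pl is redefined). Operational programs instead use Newton-type iterations and the joint or marginal schemes described above, so this shows only the simplest special case:

```python
import numpy as np

def p_3pl(theta, a, b, c, D=1.7):
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def theta_mle(responses, items, grid=np.linspace(-4, 4, 801)):
    """Maximum likelihood estimate of theta for one examinee,
    with item parameters treated as known (columns: a, b, c)."""
    loglik = np.zeros_like(grid)
    for u, (a, b, c) in zip(responses, items):
        p = p_3pl(grid, a, b, c)
        # Local independence: the log-likelihood is a sum over items.
        loglik += u * np.log(p) + (1 - u) * np.log(1.0 - p)
    return grid[np.argmax(loglik)]

# Hypothetical four-item bank and one examinee's 0/1 response pattern.
items = np.array([[1.0, -1.0, 0.15],
                  [1.4,  0.0, 0.20],
                  [0.8,  1.0, 0.25],
                  [1.2, -0.5, 0.20]])
print(theta_mle(responses=[1, 1, 0, 1], items=items))
```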
The standards are set based on test scores, and these scores need to be given meaning, because a number by itself is not useful unless there is a way to make it a useful instrument of measurement (Angoff, 1971). For this to occur, a scale structure must be defined, a process called scaling. These scores must also be interpreted, meaning that norms or other interpretive guides should be established so that the scores can be used for assessment. Additionally, because most exams have multiple forms, the scores on different forms must be calibrated or equated (Angoff, 1971). These are separate issues, but scaling is the key component for standard setting and the Angoff method. Mental ability is not directly observable: a score of zero on a mental ability test does not signify the absence of mental ability, and equal differences between scores may not represent equal differences in ability (Angoff, 1971). Several types of scaling are available. The raw score scale consists of the number of items answered correctly. This scale can help identify problems within the test, but it is not generalizable. If more than one form of the test exists, the raw score scale cannot be used to compare scores across forms because of the natural variations between forms and administrations (Angoff, 1971). The percentage-mastery scale suggests that if an examinee has received a score of 85, that examinee has "mastered" 85% of the material. This can be problematic because knowledge is hard to quantify and the percentage mastered is difficult to determine. This method is also flawed when multiple forms of an exam are in use (Angoff, 1971). The next scaling method involves linear transformations, or standard scores. The exam is administered to a reference group, either drawn randomly from a population or defined as a population with specific characteristics. The raw-score mean is then assigned a desired scaled-score value, and a uniform change in unit size determines the standard deviation of the scaled scores. The raw-score mean and standard deviation are placed in a linear scaling equation in which the standard-score deviate for any scaled score equals the standard-score deviate for the corresponding raw score in the reference group (Angoff, 1971). The percentile rank scale involves finding the percentage of individuals who receive scores below the midpoint of each score or score level: the frequencies for all scores below the selected score are totaled, half of the frequencies at the selected score are added, and that total is divided by the total number of cases (Angoff, 1971). The normalized scale, or normalized standard scores, involves transforming the scores into equally spaced units independent of the test characteristics. Plotting the distribution results in an S-shaped curve, similar to the ICC used in IRT analyses. The percentile-derived linear scale is the one most useful to the Angoff standard setting method. This scaling method deals with norms, meaning that the standard of performance is set as observed in samples from the population (Angoff, 1971). For example, a minimum passing score may be set at some number, such as 70, with the expectation that a certain percentage of examinees, such as 65%, will pass, representing the minimum acceptable performance. In order to determine these numbers, a systematic process using the "minimally acceptable person" is employed.
This involves reviewing the exam item by item and deciding whether this hypothetical person, someone who possesses only the minimum acceptable level of ability, would answer each question correctly or incorrectly in a purely theoretical sense. The number of items that this person can answer correctly is the raw score that the "minimally acceptable person" would earn on the exam. Another approach is to ask judges to rate the probability that any "minimally acceptable person" would answer an item correctly; this asks the judges to conceptualize multiple people instead of just one (Angoff, 1971). This original version of the Angoff method led many other researchers to develop their own versions of the Angoff standard setting method, many of which arose from practical innovations that have advanced the method further. Discussions of the competence of the minimally acceptable person occur on a continuum of the latent trait, referring to the ability being tested on that particular exam. The participants need to be given operational definitions of excellent, medium, and poor performance so that the judges can make accurate decisions. These decisions are the participants' judgments of whether the minimally acceptable person will succeed on a particular item based on the operational definitions, and they are aggregated in order to define the cutoff score (Hurtz et al., 2012). Typical Angoff participants are not given the ICCs for each item; however, this procedure of defining performance and estimating success can be translated into the establishment of the horizontal axis (ability) and the height of the ICC (probability of getting the item correct). The problem with most Angoff procedures is that unless the exam is scored using IRT, the process of aggregating ratings into a cutoff score only involves estimating the success or failure of the examinee answering the item, and cannot be linked to the latent construct the exam is measuring (Hurtz et al., 2012). Hurtz et al. (2012) proposed a method to tackle this problem in order to maintain consistent standards across multiple test forms and across time. An Angoff standard setting workshop is assembled with 10 or more subject matter experts (SMEs). These SMEs are given items to rate that are well representative of the exam; the items must be properly calibrated, must perform properly according to the goal of the exam, and can come from single or multiple test forms. The Angoff participants provide ratings on each item. These ratings must be converted from proportion-correct values to the latent scale; Hurtz et al. (2012) recommend a method that maximizes the fit between the ratings and the ICCs, based on the Monte Carlo results of Hurtz et al. (2008), in order to define a preliminary standard. The next step is to review information about the latent population so that an expected passing rate can be estimated under the preliminary standard. If any adjustments are necessary, the SEM can be derived from the ratings (Hurtz et al., 2012). Additionally, the CSEM at the preliminary standard's threshold can be used to compute a 95% confidence interval to guide further adjustments. The adjusted standard is applied to all forms, because the latent scale is independent of the sample of items, without the need to convene a new standard setting workshop. For exams scored using IRT, the resulting θ* is used as the operational cutoff score, which can also be scaled to the proper score reporting metric.
If the exam is being scored using CTT, the standard must be converted to a percent-correct cutoff score by using the TCC (Hurtz et al., 2012). Over time, changes to the exam plan, the qualification requirements, the requirements for successful job performance, or the job itself may require the standard to be updated. In such cases, confirmatory studies should be conducted to evaluate the standard and determine whether changes are needed or whether it remains valid for its intended use.

Chapter 4

MONTE CARLO SIMULATIONS

Monte Carlo

Monte Carlo simulations allow for the control of multiple variations of statistical data in order to simulate real-world data. When evaluating different conditions using Monte Carlo methodology, the researcher obtains a statistical model subject to the laws of chance. The conditions of this model can be manipulated to match whatever real-world conditions need to be evaluated (Kalos & Whitlock, 1986). Monte Carlo calculation requires a sequence of random events. The most basic example is a single elementary event such as flipping a coin: each possible outcome is associated with a probability between 0 and 1. When dealing with more than one elementary event, such as flipping two coins, the probability of a particular combination of outcomes is known as the joint probability. Summing joint probabilities over the outcomes of some of the events yields a marginal distribution, and the probability of one event given the outcome of another is a conditional probability (Kalos & Whitlock, 1986). This reasoning extends to any number of elementary events. The key to evaluating the behavior of statistics is the sampling distribution: the values that a specific statistic can take with respect to a given population and the probabilities associated with those values. The bias of a statistic can be evaluated by examining the expected value of its sampling distribution, its variability, and its functional form, in order to evaluate the efficiency of that statistic and to make inferences about the population. How can a statistic be evaluated when the conditions necessary for a mathematical theory to be valid do not exist, or when no strong theory exists? Monte Carlo simulations allow researchers to approximate that statistic's sampling distribution and evaluate its behavior in random samples by drawing random samples from known populations of simulated data (Mooney, 1997). Every Monte Carlo experiment involves generating a random sample (Fishman, 1996). Most of the time the outcome of a random event can be expressed as a numerical value; in computer simulations the outcome of a random choice is often a logical event. Covariance is used to measure the dependence of two random variables: the covariance is 0 if the variables are independent, and it can be positive or negative, making the variance of a linear combination of dependent variables larger or smaller than the corresponding variance for independent variables (Kalos & Whitlock, 1986). Statistical analyses use measured variables to describe and make inferences about social phenomena. A characteristic is estimated with an estimator computed from observed data (Mooney, 1997). To evaluate a given statistic, the sampling distribution is needed. This distribution consists of the range of values that the statistic can take in random samples from a specific population and the probabilities associated with that range of values (Mooney, 1997).
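This simulation logic is easy to demonstrate. The sketch below approximates the sampling distribution of a sample proportion by repeated sampling from a known pseudo-population; all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pseudo-population: coin flips with a known probability of heads (.6).
# Draw many random samples of size 50 and record the statistic each time;
# the recorded values approximate the statistic's sampling distribution.
sample_props = np.array([rng.binomial(1, 0.6, size=50).mean()
                         for _ in range(10_000)])

print(sample_props.mean())  # expected value (close to .6, so little bias)
print(sample_props.std())   # variability of the statistic across samples
```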
If a statistic is biased, this can be determined by looking at the expected value of its sampling distribution, its variability, and its functional form, as a way to assess the statistic's efficiency (Mooney, 1997). Random samples from known populations of artificially generated, simulated data are used to determine the behavior of the statistic in question, because sampling real people in real-world situations many times over is impractical and inefficient. Statistics in random samples can be evaluated by generating many random samples and observing the resulting behavior. The simulated data form a pseudo-population that resembles the real world in all relevant respects (Mooney, 1997). In a basic Monte Carlo procedure, the pseudo-population is specified in symbolic terms so that samples can be generated from it, using the same sampling strategy and sample size as the statistical situation being investigated. Next, the estimator θ̂ is calculated from the pseudo-sample and stored in a vector, and this process is repeated as many times as the number of desired trials. Finally, a relative frequency distribution of the resulting θ̂ values is constructed. This is the Monte Carlo estimate of the sampling distribution under the specified conditions of the pseudo-population and sampling procedures (Mooney, 1997). After a variable is defined in terms of its distribution function, the probability density function (PDF) is used to map the probability of x falling between two values of X for continuous random variables. The inverse distribution function takes a probability value α and determines the value x such that Pr(X ≤ x) = α (Mooney, 1997). Parameters determine the location, scale, and/or shape of the distribution, and each distribution function has specific requirements concerning the range of possible values of X. The chosen distribution determines the range; some distributions have infinite range and others are truncated at one or both ends. Because of this, it is important to know the mean, variance, skewness, and kurtosis; the skewness and kurtosis describe how normal or non-normal the distribution is. Researchers need to choose a distribution that yields the range, shape, and variation that match the simulated variables and processes and fit the design of the experiment (Mooney, 1997).

Beta Distribution

Rater bias in standard setting can be evaluated by using the beta distribution (Hurtz, Jones, & Jones, 2008; Reckase, 2006). This is a flexible distribution bounded by 0 and 1 with two adjustable parameters, a and b. Its PDF can be highly right-skewed, uniform, approximately normal, highly left-skewed, or even bimodal with varying degrees of interior dip (Mooney, 1997). The a and b parameters determine the shape of the distribution. If a or b falls below a value of 1, the PDF rises sharply at the corresponding end of the distribution. If both parameters are below 1, the PDF is bimodal, and as a or b decreases toward 0 the height of the bimodal distribution increases. If a and b have the same value, the distribution is symmetrical (Mooney, 1997). Monte Carlo simulations were used in the generation of the Angoff ratings because they can be systematically varied to match the specified conditions and thereby evaluate how different judges may perform in the real world.
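The flexibility of the beta family is easy to see numerically. A minimal sketch (shape values illustrative):

```python
from scipy import stats

# Representative (a, b) pairs and the PDF shapes they produce on [0, 1].
shapes = {
    "right-skewed":       (2.0, 8.0),
    "uniform":            (1.0, 1.0),
    "near-normal":        (8.0, 8.0),
    "left-skewed":        (8.0, 2.0),
    "bimodal (U-shaped)": (0.5, 0.5),   # both parameters below 1
}
for label, (a, b) in shapes.items():
    d = stats.beta(a, b)
    print(f"{label:20s} mean={d.mean():.3f}  sd={d.std():.3f}")
```

Symmetric cases (a = b) center at .5, while unequal parameters shift the mass toward one bound, which is what makes the distribution useful for simulating ratings that over- or understate the ICCs.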
Chapter 5

METHOD

Purpose of the Study

The purpose of the present study is to use the methods outlined in this paper to evaluate the Angoff standard setting method. The method involves subjective ratings based on the information provided to the participants: the exam on which the standard is being set and the mathematical conversions of the generalizable θ values from the exam. However, most subject matter experts are unfamiliar with IRT and cannot make accurate ratings based on the θ statistic, so θ must be converted to a CTT proportion-correct statistic in order to train the Angoff raters without expending extensive resources. Following the standard setting meeting, the ratings the judges give must then be converted back into the θ metric so that the ratings can be generalized to other settings. This study is designed to examine, with simulated ratings, how rater agreement affects this conversion. What happens if the members of the Angoff rating panel do not agree with one another? What happens if the judges rate the items as too easy? What happens if they rate the items as too difficult? Theoretically the conversion from CTT back to IRT should recover the same θ value; does this happen if the Angoff raters rate the items inaccurately? This study is designed to explore different conversion methods and how they affect simulated Angoff ratings. The present study expands on the Hurtz et al. (2008) research by using a wider range of θ values (adding -2 and +2 conditions) and different bias conditions, such as 15 or 35 percent above and below the ICC instead of ±.10 on the ICC. Additionally, the present study used a simulated exam of 100 items instead of an actual exam of 75 items.

Minimum Passing Level

When evaluating the Angoff method, the minimum passing level (MPL) should fit the IRT model. The fit needs to be examined because the procedures used to generate the ratings differ from the procedures used to gather the examinee data, and the raters possess different characteristics than the examinees (Kane, 1987). Even if examinee performance data fit an IRT model, the ratings may still not fit the model, because the Angoff method can yield different results and does not provide the same fit to every IRT model. To account for this, the average MPLs for the individual items are combined into a passing score by summing the average item MPLs over items (Kane, 1987). The MPLs found through the Angoff method are interpreted as true score estimates for minimally competent examinees. IRT item parameters are estimated for the test items using examinee response data. If there is some value θ* on the θ scale that represents the examinee minimal competency level, then Pi(θ*), the value of the ICC for item i at θ*, is the minimally competent examinee's expected observed score on that item (Kane, 1987). When the raters' MPLs fit the selected IRT model, the value of Pi(θ*) is equal to the expected MPL for each item i. Different raters may assign different MPLs to different items, in addition to random error. For a fixed value of θ*, the expected MPL over the entire rater population for each item will equal Pi(θ*) (Kane, 1987); the expected MPL over the population of raters will deviate from Pi(θ*) only if different standards are being used, rather than the MPLs merely varying randomly.
The unbiased estimate of the sampling variance for individual raters on Item i is given by the following equation (Kane, 1987):

$$\hat{\sigma}_i^2(M_{ir}) = \frac{\sum_r (M_{ir} - M_{iR})^2}{k - 1}$$

where k is the number of raters sampled, M_ir is the MPL for Rater r on Item i, and M_iR is the average MPL on Item i over the k raters (Kane, 1987). The sampling variance of the mean MPL over k raters can then be estimated as:

$$\hat{\sigma}_i^2(M_{iR}) = \hat{\sigma}_i^2(M_{ir})/k$$

The distribution of the average rating over samples of raters should be approximately normal, especially if k is large. Assuming that the ratings fit the model, implying that M_ir is an unbiased estimate of Pi(θ*), and that the average rating is normally distributed (Kane, 1987), the quantity

$$Z_{iR} = \frac{M_{iR} - P_i(\theta^*)}{\sigma_i(M_{iR})}$$

should be normally distributed with a mean of 0 and a standard deviation of 1 for some value of θ*. Assuming that the Z_iR for the n items are independently distributed, the overall fit of the ratings to the model can be examined using the statistic:

$$\sum_i Z_{iR}^2 = \sum_i \left[ \frac{M_{iR} - P_i(\theta^*)}{\sigma_i(M_{iR})} \right]^2$$

which is distributed as a chi-square with n - 1 degrees of freedom under the null hypothesis (Kane, 1987). The independence assumption may not hold when the same raters review every item, because the error due to rater differences is then correlated across items; the assumption can still be acceptable if the correlated error is minimal in comparison to the random error. If, after examination, the ratings do not fit the model, different models should be examined so that the researcher can combine the MPLs over raters and items to obtain a passing score, or estimate the expected error in the passing score (Kane, 1987).

Length of Test/Number of Items

In simulating Angoff-style ratings for a set of items, an exam that contains items with known IRT properties is necessary to get usable results. The previous study used an exam length of 75 questions; to determine whether the length of the exam has an effect, this study used a length of 100 questions. To make the scope of the study more generalizable, a specific exam was not used; instead, the items' IRT parameters were simulated. The items fit the 3PL IRT model, and a set of simulated ratings around the ICCs was generated. The IRT a parameter was generated with a mean of 1.56 and a standard deviation of .29, with a minimum of .49 and a maximum of 2.24. The IRT b parameter was generated with a mean of -.032 and a standard deviation of .566, with a minimum of -1.69 and a maximum of 1.76. The IRT c parameter was generated with a mean of .18 and a standard deviation of .03, with a minimum of .09 and a maximum of .24. These values were selected based on research from Plake and Kane (1991), and further modified during personal conversations with Gregory M. Hurtz, Ph.D., so as to replicate item properties on exams given in an applied setting.

Simulated Rating Data

The beta distribution defined for each item was used to draw simulated ratings. The beta distribution was selected because its parameters can be manipulated and because it allows adjustment of the lower and upper bounds within the 0 to 1 range. This allows the simulated rating distributions to fall above the c parameter, because ratings falling below it do not correspond to a value on the θ scale. The upper limit of the items was fixed at 1.00, and three population standard deviations were used: .05, .10, and .15. This choice was made to explore the effects of rater agreement on the conversion method.
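The sketch below illustrates one way to draw such bounded, beta-distributed ratings for a single item. It is a minimal sketch, assuming the standard mean/variance parameterization of a beta rescaled onto [c, 1.00] (formalized in the equations that follow); the function and parameter names are illustrative, and the study's actual SPSS syntax may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(42)

def beta_ab(mu, sd, lo, hi):
    """Solve for beta shape parameters A, B giving a target mean and SD
    after rescaling the unit-interval beta onto [lo, hi]."""
    m = (mu - lo) / (hi - lo)          # mean on the unit interval
    v = (sd / (hi - lo)) ** 2          # variance on the unit interval
    s = m * (1.0 - m) / v - 1.0        # A + B (requires v < m * (1 - m))
    return m * s, (1.0 - m) * s

# One item: pseudo-guessing c = .18, ICC height at theta* of .62,
# rater-agreement condition SD = .05, bounds [c, 1.00].
c, p_icc, sd = 0.18, 0.62, 0.05
A, B = beta_ab(p_icc, sd, lo=c, hi=1.0)
ratings = c + (1.0 - c) * rng.beta(A, B, size=15)   # 15 simulated raters
```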
The previous study focused on evaluating rater agreement at the ICC and at conditions .10 above and below the ICCs. The following conditions were selected in order to see whether extending the rater agreement range in both directions would have an effect: ratings were simulated with strong agreement (.05), a midrange value (.10), and somewhat weaker agreement (.15) to see how the level of agreement would affect the conversions. The rv.beta function in SPSS was used to generate samples from the beta distribution. It requires two parameters, A and B, that determine the mean and variance of the distribution:

$$\mu = \frac{A}{A + B}$$

$$\sigma^2 = \frac{AB}{(A + B)^2 (A + B + 1)}$$

except in situations where the lower limit L is greater than 0 and/or the upper limit U is less than 1, in which case the mean and variance are determined by:

$$\mu = L + (U - L)\frac{A}{A + B}$$

$$\sigma^2 = (U - L)^2 \frac{AB}{(A + B)^2 (A + B + 1)}$$

The simulation of these data requires the lower limit to be set at the item's ci parameter, and the variance is set to the square of the population standard deviation for each condition: .05², .10², and .15². The value of the mean is set to simulate ratings that reflect no bias, 85% of the raters overestimating the ICCs, 65% of the raters overestimating the ICCs, 65% of the raters underestimating the ICCs, or 85% of the raters underestimating the ICCs. Fifteen raters were simulated 1500 times for each of the 100-item draws at each of the five a priori θ* values (-2.00, -1.00, .00, 1.00, and 2.00). Ratings were thus generated around five population means: equal to Pi(θ*), positioned so that 85% of ratings fell above Pi(θ*), so that 85% fell below Pi(θ*), so that 65% fell above Pi(θ*), and so that 65% fell below Pi(θ*). Whenever a generated value exceeded 1 it was set to .99, and whenever a value fell below an item's c parameter it was set to c + .01.

IRT to CTT Conversion

The most commonly used judgmental standard-setting procedures, like the Angoff method, were developed using CTT. They use number-correct or proportion-correct cutoff scores obtained by averaging or summing the judged ratings. Newer procedures are being developed, but these do not have a direct link between judged p values and the operational cutoff score (Hurtz et al., 2008). The judgments from the older procedures require a transformation to find a comparable cut score on the θ metric. Because the older procedures are still in use, a transformation method must be selected. One approach is to compute the number-correct cutoff score and convert it to a θ value using the TCC: after aggregating the judged p values along the vertical axis of the TCC, the corresponding θ value is the cutoff score, symbolized by θ*. Kane (1987) explored alternative methods of transforming judged p values into a θ cut score. Kane's Method 1 takes the mean of the ratings for each item and converts it to the θ scale using the ICC: the judges' mean rating for an item is located on the ICC, the corresponding θ value is read off, and these item-level θ values are averaged to obtain the cut score (Kane, 1987). Each proportion-correct mean is located on the ICC at a corresponding θ value, as in the following equation, and the mean of these individual item values becomes θ*:

$$M_{iR} = P_i(\hat{\theta}^*_{iR})$$

where M_iR is the mean proportion-correct value across raters R for item i, and P_i(θ̂*_iR) is the height of the ICC for that item at θ̂*_iR (Kane, 1987, equation 5).
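A minimal sketch of this item-level inversion, assuming 3PL ICCs and using a numerical root finder (names are illustrative; this is not the study's SPSS syntax):

```python
import numpy as np
from scipy.optimize import brentq

def p_3pl(theta, a, b, c):
    # 3PL ICC (scaling constant 1.7 assumed for illustration).
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

def invert_icc(m, a, b, c):
    """Find theta such that P(theta) = m. Requires c < m < 1, which the
    truncation rules above (c + .01 and .99) guarantee."""
    return brentq(lambda t: p_3pl(t, a, b, c) - m, -8.0, 8.0)

def kane_method1(mean_ratings, a, b, c):
    # Unweighted Method 1: average the item-level theta values.
    thetas = [invert_icc(m, ai, bi, ci)
              for m, ai, bi, ci in zip(mean_ratings, a, b, c)]
    return float(np.mean(thetas))
```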
Kane improved on this method by using a weighted mean of the individual θ̂*_iR values, which minimizes the sampling variance because items with high interrater agreement are given higher weights (Hurtz et al., 2008). The weight for each item is given by the following equation (Kane, 1987, equation 14):

$$w_i = \frac{1/\sigma_i^2(\hat{\theta}^*_{iR})}{\sum_i 1/\sigma_i^2(\hat{\theta}^*_{iR})}$$

and the weights sum to 1 across items. In this formula, σ_i²(θ̂*_iR) represents the variance in θ̂*_iR, which is determined by two factors: the variance of the original proportion-correct ratings and the slope of the item's ICC. Hurtz et al. (2008) found that this variance should be restricted to a minimum value of .0025 in order to avoid spurious effects when computing the item-level weights; this minimum variance was adopted in the current study. Kane then developed Method 2, in which the cut score is determined by averaging across items and raters, locating that value along the TCC, and taking the corresponding θ value as the θ* cut score (Hurtz et al., 2008). This is expressed in the following equation (Kane, 1987, equation 15):

$$\sum_i M_{iR} = \sum_i P_i(\hat{\theta}^*)$$

An approximation formula for Method 2 was developed using the individual item values:

$$\hat{\theta}^* \cong \frac{1}{\sum_i P_i'(\hat{\theta}^*_{iR})} \sum_i P_i'(\hat{\theta}^*_{iR})\, \hat{\theta}^*_{iR}$$

The accuracy of this equation depends on the assumption that the values of θ̂*_iR for each item are close to θ* (Kane, 1987). A weighted version of Method 2 was also developed, in which higher weights are applied to items with more agreement among the judges (Kane, 1987, equation 21):

$$\hat{\theta}^*_w \cong \frac{1}{\sum_i \left[ P_i'(\hat{\theta}^*_{iR}) / \sigma_i(M_{iR}) \right]} \sum_i \left[ \frac{P_i'(\hat{\theta}^*_{iR})}{\sigma_i(M_{iR})} \right] \hat{\theta}^*_{iR}$$

Kane's (1987) Method 3 determines the θ* cut score by finding the value that maximizes the fit between the mean of the proportion-correct ratings for each item and that item's ICC; in other words, it maximizes the fit to the judges' predictions of test-taker performance (Hurtz et al., 2008). An approximation formula for Method 3 is similar to the weighted Method 2 formula, except that it uses the reciprocals of the θ*-metric rater variances as the weights, as in the following equation (Kane, 1987):

$$\hat{\theta}^* \cong \frac{1}{\sum_i \left[ P_i'^2(\hat{\theta}^*_{iR}) / \sigma_i^2(M_{iR}) \right]} \sum_i \left[ \frac{P_i'^2(\hat{\theta}^*_{iR})}{\sigma_i^2(M_{iR})} \right] \hat{\theta}^*_{iR}$$

Its accuracy likewise depends on how close the values of θ̂*_iR are to θ* (Kane, 1987). Kane (1987) concluded that the conversion is more effective when made at the item level instead of the test level, and that the aggregation should be conducted with an optimal weighting scheme, based on theoretical assumptions about model fit, so that the error variance is minimized. These techniques have been evaluated with simulated data, using several fixed a priori values of θ* along the competence continuum at locations of the judges' conceptualizations of minimum competence (Hurtz et al., 2008). In expanding on previous research, this study compared the results of each of the following methods: Method 1 (θ̂1*), Method 1 weighted (θ̂1w*), Method 2 (θ̂2*), the Method 2 approximation formula (θ̂2̄*), Method 2 weighted (θ̂2̄w*), and Method 3 (θ̂3*).
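The weighted aggregation can be sketched as follows. This assumes the delta-method relation between rating variance and θ-metric variance implied by the formulas above, σ_i²(θ̂*_iR) ≈ σ_i²(M_iR)/P_i'²(θ̂*_iR); the variance floor of .0025 is the one adopted in the study, while everything else (names, the use of the variance of the mean rating) is an illustrative assumption.

```python
import numpy as np
from scipy.optimize import brentq

def p_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

def invert_icc(m, a, b, c):
    return brentq(lambda t: p_3pl(t, a, b, c) - m, -8.0, 8.0)

def kane_weighted_theta(ratings, a, b, c, min_var=0.0025):
    """Weighted mean of item-level theta estimates (Method 1 weighted style)."""
    k = ratings.shape[0]                               # number of raters
    m_iR = ratings.mean(axis=0)                        # mean rating per item
    theta_i = np.array([invert_icc(m, ai, bi, ci)
                        for m, ai, bi, ci in zip(m_iR, a, b, c)])
    eps = 1e-4                                         # numerical ICC slope
    slope = (p_3pl(theta_i + eps, a, b, c)
             - p_3pl(theta_i - eps, a, b, c)) / (2 * eps)
    var_m = ratings.var(axis=0, ddof=1) / k            # variance of the mean rating
    var_theta = np.maximum(var_m / slope ** 2, min_var)  # theta metric, floored
    w = (1.0 / var_theta) / (1.0 / var_theta).sum()    # weights sum to 1 (eq. 14)
    return float((w * theta_i).sum())
```

The variance floor matters because items whose raters agree almost perfectly would otherwise receive near-infinite weight and dominate the cut score.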
The present study took advantage of the directions for future research from Hurtz et al. (2008) by using more items, defined by randomization instead of being tied to a specific exam, three standard deviation conditions, and a different operationalization of bias. The first step in this study involved randomly generating IRT parameters for 100 items; these were randomly generated in SPSS using the beta distribution. The next step simulated the Angoff ratings. A total of fifteen simulated raters were used, and three levels of rater agreement were created by setting the standard deviations to .05, .10, and .15; these values were picked based on the previous study and were selected to cover a greater range than previous research. The bias conditions were established through the Monte Carlo simulation, where the rater judgments were manipulated across five conditions. The first was no bias, or ratings as would be expected from accurate rater panels. The remaining conditions were set as if the pool of raters rated the exam as easier or harder than it theoretically is, which would result in cut scores indicating that the examinees have higher or lower ability levels than would theoretically occur. These conditions were simulated as if the raters rated the exam as 15 or 35 percent harder or easier than would be expected from more accurate rater panels. This should determine whether the rater judgments drive the mathematical conversion, or whether it is robust enough to compensate for rater bias and disagreement. The bias was achieved by shifting the mean of the ICC by the bias condition using the cumulative distribution function (CDF) of the Monte Carlo beta distribution, and then transforming back into a cut score by using the inverse distribution function (IDF) of the Monte Carlo beta distribution. The simulated rater judgments were then generated around the new mean of the ICC, rating the exam as if it were, for example, fifteen percent easier than in reality. Additionally, the θ values examined were set to five discrete points: -2.00, -1.00, 0.00, 1.00, and 2.00. The goal of the present study was to examine how accurate (possessing less error) and how consistent the simulated rater judgments are under the conditions specified.

H1: The θ* resulting from simulated rater judgments will be more consistent and have less error when the simulated ratings have no bias.

H2: The θ* resulting from simulated rater judgments will be more consistent and have less error when the simulated ratings have more agreement.

H3: The θ* resulting from simulated rater judgments will be more consistent and have less error when the θ* value is closer to 0.00.

H4: Method 1 weighted and Method 3 will perform the best in recovering the original θ* values.

Data Analysis

All calculations for the rating conversions were performed using SPSS syntax command files written by Gregory M. Hurtz of California State University, Sacramento (Hurtz et al., 2008), and modified by this researcher to fit the present conditions under manipulation. The input files used the simulated item parameters defined above, in addition to the simulated ratings from each judge. The cut score conversions were performed to find the θ cut scores from the proportion-correct cut scores.
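As a reference for the bias manipulation described above, the following is one plausible reading of the CDF/IDF shifting; it is a sketch only, not the study's actual SPSS implementation, and the shift size and beta parameters are illustrative.

```python
from scipy import stats

# Illustrative beta shape parameters for one item's rating distribution
# and the ICC height P_i(theta*) expressed on the same 0-1 rating metric.
A, B = 8.0, 5.0
p_icc = 0.62
bias = 0.15                       # e.g., the "15 percent easier" condition

q = stats.beta.cdf(p_icc, A, B)               # CDF position of the ICC height
q_shifted = min(max(q + bias, 0.001), 0.999)  # shift by the bias condition
biased_mean = stats.beta.ppf(q_shifted, A, B) # IDF back to the rating metric
# 'biased_mean' then serves as the population mean for that item's ratings.
```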
The previous study (Hurtz et al., 2008) set variance restrictions for Method 1 weighted and Method 3 in order to eliminate spurious effects, and these restrictions were included in this analysis.

Chapter 6

RESULTS

Unbiased Ratings

Low Variability (.05)

In Table 1, when the standard deviation is set to .05 and the a priori θ* value is set to 0, in the unbiased condition all six methods did a good job of recovering the original θ* value, as shown by the low bias, with means ranging from -.013 to .003. The RMSE means range from .004 to .015, showing that there is low error in the mathematical conversions. Additionally, the ranges of the estimated θ* values across replications run from .023 through .042. The means of the error index were low, .043 for all six equations, with a small standard deviation, .001. The means of the consistency index were high, .939 for all six equations, with a standard deviation of .002, which is a good indication that the conversion equations would consistently show these values with other simulated ratings. Going forward, values of the error and consistency indexes below .5 are classified as low, values from .5 to .8 are considered moderate, and values above .8 are considered high.

When the standard deviation is set to .05 and the a priori value is set to -1, in the unbiased condition, Method 1 weighted and Method 3 did the best job of recovering the original θ* values, as shown by the low bias, with means of .043. The unweighted Method 2 did the next best job, with a bias mean of .056. The rest of the methods did a similar job of recovering the original θ*, with bias means of .100 to .159; that is, the differences among these bias means are small enough that any of the methods would recover the original θ* values about equally well. The RMSE means mirror the bias values and show that Method 1 weighted and Method 3 have the lowest error in recovering the estimated θ* values. Additionally, the means of the estimated θ* values across replications range from -.841 through -.957. The means of the error index were low, ranging from .044 through .059, with standard deviations in the .002 to .003 range. The means of the consistency index were high, ranging from .917 through .938, with standard deviations in the .002 to .004 range.

When the standard deviation is set to .05 and the a priori value is set to 1, in the unbiased condition, Method 1 weighted and Method 3 did the best job of recovering the original θ* values, as shown by the low bias, with means of -.048. The weighted Method 2 did the next best job, with a bias mean of -.150. Method 1 followed with a bias mean of -.282, the unweighted Method 2 showed a bias mean of -.322, and the approximation formula for Method 2 did the worst job, with a bias mean of -.641. The RMSE means mirror the bias values and show that Method 1 weighted and Method 3 have the lowest error in recovering the estimated θ* values. Additionally, the means of the estimated θ* values across replications range from .678 (unweighted Method 2) through .952 (Method 1 weighted and Method 3). The means of the error index were low, ranging from .099 through .223, with standard deviations in the .005 to .010 range.
The means of the consistency index were moderately high, ranging from .715 through .889, with standard deviations in the .005 to .014 range.

When the standard deviation is set to .05 and the a priori value is set to -2, in the unbiased condition, Method 2 did the best job of recovering the original θ* value, with a bias mean of .739. Method 1 weighted and Method 3 showed a bias mean of 1.011, doing a better job than Method 2 weighted, whose bias mean was 1.047. The approximation formula for Method 2 did the worst job of recovering the θ* values, with a bias mean of 1.123. The RMSE means mirror the bias values and show that the unweighted Method 2 has the lowest error in recovering the estimated θ* values. Additionally, the means of the estimated θ* values across replications range from -1.261 through -.877. The means of the error index were low, ranging from .084 through .126, with standard deviations in the .003 to .006 range. The means of the consistency index were high, ranging from .823 through .890, with standard deviations in the .004 to .010 range.

When the standard deviation is set to .05 and the a priori value is set to 2, in the unbiased condition, Method 1 weighted and Method 3 did the best job of recovering the original θ* values, as shown by bias means of -.954. The approximation formula for Method 2 did the worst job, with a bias mean of -2.013. The RMSE means mirror the bias means and show that Method 1 weighted and Method 3 have the least error. Additionally, the means of the estimated θ* values across replications range from -.013 through 1.046. The means of the error index were moderate, ranging from .315 through .406, with standard deviations in the .005 to .011 range. The means of the consistency index were moderately low, ranging from .423 through .655, with standard deviations in the .009 to .015 range.
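Tables 1 through 6 summarize recovery across the simulated replications using these indices. For reference, a minimal sketch of how the descriptive, bias, and RMSE columns can be computed from a set of replicated cut-score estimates (the error and consistency indices follow Hurtz et al., 2008, and are not reproduced here):

```python
import numpy as np

def recovery_summary(theta_hats, theta_star):
    """Summarize cut-score recovery for one method in one condition."""
    x = np.asarray(theta_hats)          # estimated theta* across replications
    return {
        "mean":  x.mean(),
        "sd":    x.std(ddof=1),
        "range": x.max() - x.min(),
        "bias":  x.mean() - theta_star,                           # signed error
        "rmse":  float(np.sqrt(((x - theta_star) ** 2).mean())),  # total error
    }
```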
Table 1

Summary of Cut Scores Across Methods for Different A Priori θ* Values, Where Ratings Are at the Item Characteristic Curves (ICCs) on Average (Low Variability, SD = .05)

           Mean      SD   Range    Bias    RMSE  Error M  Error SD  Consistency M  Consistency SD
θ* = -2
p         0.246   0.002   0.020
θ̂1*      -1.046   0.016   0.118   0.954   0.954    0.104     0.003          0.858           0.004
θ̂1w*     -0.989   0.044   0.304   1.011   1.012    0.111     0.006          0.847           0.010
θ̂2*      -1.261   0.017   0.120   0.739   0.739    0.084     0.003          0.890           0.004
θ̂2̄*      -0.877   0.024   0.156   1.123   1.123    0.126     0.005          0.823           0.008
θ̂2̄w*     -0.953   0.021   0.143   1.047   1.047    0.115     0.004          0.840           0.006
θ̂3*      -0.989   0.044   0.304   1.011   1.012    0.111     0.006          0.847           0.010
θ* = -1
p         0.300   0.002   0.010
θ̂1*      -0.841   0.012   0.077   0.159   0.160    0.059     0.003          0.917           0.004
θ̂1w*     -0.957   0.008   0.053   0.043   0.044    0.044     0.002          0.938           0.002
θ̂2*      -0.944   0.008   0.050   0.056   0.056    0.045     0.002          0.937           0.002
θ̂2̄*      -0.856   0.013   0.085   0.144   0.145    0.056     0.003          0.920           0.004
θ̂2̄w*     -0.900   0.009   0.057   0.100   0.101    0.050     0.002          0.930           0.003
θ̂3*      -0.957   0.008   0.053   0.043   0.044    0.044     0.002          0.938           0.002
θ* = 0
p         0.606   0.002   0.010
θ̂1*      -0.003   0.006   0.042  -0.003   0.007    0.043     0.001          0.939           0.002
θ̂1w*      0.002   0.004   0.023   0.002   0.004    0.043     0.001          0.939           0.002
θ̂2*      -0.006   0.006   0.030  -0.006   0.008    0.043     0.001          0.939           0.002
θ̂2̄*      -0.013   0.006   0.038  -0.013   0.014    0.043     0.001          0.938           0.002
θ̂2̄w*      0.003   0.004   0.026   0.003   0.005    0.043     0.001          0.939           0.002
θ̂3*       0.002   0.004   0.023   0.002   0.004    0.043     0.001          0.939           0.002
θ* = 1
p         0.832   0.005   0.030
θ̂1*       0.718   0.017   0.105  -0.282   0.282    0.132     0.007          0.846           0.008
θ̂1w*      0.952   0.006   0.039  -0.048   0.049    0.099     0.005          0.889           0.005
θ̂2*       0.678   0.018   0.120  -0.322   0.322    0.140     0.008          0.835           0.010
θ̂2̄*       0.359   0.020   0.114  -0.641   0.642    0.223     0.010          0.715           0.014
θ̂2̄w*      0.850   0.008   0.051  -0.150   0.151    0.110     0.005          0.876           0.006
θ̂3*       0.952   0.006   0.039  -0.048   0.049    0.099     0.005          0.889           0.005
θ* = 2
p         0.719   0.009   0.050
θ̂1*       0.415   0.023   0.136  -1.585   1.585    0.372     0.007          0.531           0.012
θ̂1w*      1.046   0.068   0.373  -0.954   0.956    0.315     0.011          0.655           0.015
θ̂2*       0.301   0.026   0.170  -1.699   1.700    0.382     0.007          0.502           0.013
θ̂2̄*      -0.013   0.021   0.132  -2.013   2.014    0.406     0.005          0.423           0.009
θ̂2̄w*      0.478   0.027   0.189  -1.522   1.522    0.366     0.008          0.546           0.013
θ̂3*       1.046   0.068   0.373  -0.954   0.956    0.315     0.011          0.655           0.015

Note. RMSE = root mean square error; θ̂1* = converted cutoff score using unweighted Method 1; θ̂1w* = converted cutoff score using weighted Method 1; θ̂2* = converted cutoff score using unweighted Method 2; θ̂2̄* = converted cutoff score using the approximation formula for unweighted Method 2; θ̂2̄w* = converted cutoff score using weighted Method 2; θ̂3* = converted cutoff score using Method 3.

Medium Variability (.10)

As shown in Table 2, when the standard deviation is set to .10 and the a priori value is set to 0, in the unbiased condition, all six methods did a similar job of recovering the original θ* values, with bias means ranging from -.003 to .003; that is, the differences among the bias means are small enough that any of the methods would recover the original θ* values about equally well. The RMSE means mirror the bias means, ranging from .008 through .012. Additionally, the means of the estimated θ* values across replications range from -.003 to .003. The means of the error index were low, at .088, with a standard deviation of .002. The means of the consistency index were high, at .875 to .876, with a standard deviation of .004.
When the standard deviation is set to .10 and the a priori value is set to -1, in the unbiased condition, Method 1 weighted and Method 3 did the best job of recovering the original θ* value, as shown by their bias mean of .140. Method 2 weighted did the next best job, with a bias mean of .253, while Method 1 and Method 2 followed with bias means of .366 and .343 respectively. The approximation formula for Method 2 did the worst job, with a bias mean of .621. The RMSE means mirror the bias means and show the same results. Additionally, the means of the estimated θ* values range from -.860 to -.379. The means of the error index were low, ranging from .148 to .249, with standard deviations ranging from .005 to .007. The means of the consistency index were moderate, ranging from .627 to .791, with standard deviations ranging from .007 to .011.

When the standard deviation is set to .10 and the a priori value is set to 1, in the unbiased condition, Method 1 weighted and Method 3 did the best job of recovering the original θ* values, as shown by their bias means of -.187. Method 2 weighted did the next best job, with a bias mean of -.363, and Method 1 and Method 2 followed with bias means of -.497 and -.530 respectively. The approximation formula for Method 2 did the worst job, with a bias mean of -.827. The RMSE means mirror the bias means. Additionally, the means of the estimated θ* values range from .173 to .813. The means of the error index were low, ranging from .188 to .319, with standard deviations ranging from .007 to .009. The means of the consistency index were moderate, ranging from .567 to .785, with low standard deviations.

When the standard deviation is set to .10 and the a priori value is set to -2, in the unbiased condition, Method 1 weighted and Method 3 did an adequate job of recovering the original θ* values, as shown by their bias means of 1.395. The rest of the methods did worse; the approximation formula for Method 2 did the worst job, with a bias mean of 2.053. The RMSE values mirror the bias means. Additionally, the means of the estimated θ* values range from -.605 to .053. The means of the error index were low, ranging from .378 to .408, with low standard deviations. The means of the consistency index were also low, ranging from .414 to .444, also with low standard deviations. The proportion-correct cutoff score generated in this condition was affected by too few low ratings being possible in the sampling when attempting to achieve the .10 standard deviation of rater variability at the -2 θ* level.

When the standard deviation is set to .10 and the a priori value is set to 2, in the unbiased condition, Method 1 weighted and Method 3 did the best job of recovering the original θ* values, as shown by their bias means of -1.527. The rest of the methods did worse; the approximation formula for Method 2 did the worst, with a bias mean of -1.969. The RMSE values mirror the bias means. Additionally, the means of the estimated θ* values range from .031 to .472. The means of the error index were low, ranging from .362 to .377, with low standard deviations. The means of the consistency index were moderately low, ranging from .469 to .549, with low standard deviations.
Table 2

Summary of Cut Scores Across Methods for Different A Priori θ* Values, Where Ratings Are at the Item Characteristic Curves (ICCs) on Average (Medium Variability, SD = .10)

           Mean      SD   Range    Bias    RMSE  Error M  Error SD  Consistency M  Consistency SD
θ* = -2
p         0.554   0.009   0.060
θ̂1*      -0.179   0.020   0.143   1.821   1.821    0.398     0.006          0.414           0.008
θ̂1w*     -0.605   0.070   0.469   1.395   1.397    0.378     0.008          0.444           0.015
θ̂2*      -0.141   0.023   0.150   1.859   1.859    0.400     0.006          0.415           0.007
θ̂2̄*       0.053   0.018   0.112   2.053   2.053    0.408     0.005          0.429           0.007
θ̂2̄w*     -0.332   0.028   0.202   1.668   1.668    0.391     0.007          0.415           0.010
θ̂3*      -0.605   0.070   0.469   1.395   1.397    0.378     0.008          0.444           0.015
θ* = -1
p         0.373   0.005   0.030
θ̂1*      -0.634   0.017   0.104   0.366   0.367    0.186     0.007          0.728           0.010
θ̂1w*     -0.860   0.018   0.127   0.140   0.141    0.148     0.005          0.791           0.007
θ̂2*      -0.657   0.016   0.100   0.343   0.344    0.181     0.007          0.735           0.011
θ̂2̄*      -0.379   0.012   0.091   0.621   0.621    0.249     0.007          0.627           0.010
θ̂2̄w*     -0.747   0.013   0.078   0.253   0.253    0.164     0.005          0.764           0.008
θ̂3*      -0.860   0.018   0.127   0.140   0.141    0.148     0.005          0.791           0.007
θ* = 0
p         0.608   0.003   0.020
θ̂1*       0.002   0.011   0.081   0.002   0.012    0.088     0.002          0.876           0.004
θ̂1w*      0.002   0.007   0.045   0.002   0.008    0.088     0.002          0.876           0.004
θ̂2*      -0.003   0.009   0.050  -0.003   0.010    0.088     0.002          0.875           0.004
θ̂2̄*       0.001   0.009   0.066   0.001   0.010    0.088     0.002          0.876           0.004
θ̂2̄w*      0.003   0.007   0.048   0.003   0.009    0.088     0.002          0.876           0.004
θ̂3*       0.002   0.007   0.045   0.002   0.008    0.088     0.002          0.876           0.004
θ* = 1
p         0.774   0.006   0.040
θ̂1*       0.503   0.020   0.150  -0.497   0.497    0.240     0.008          0.704           0.011
θ̂1w*      0.813   0.020   0.179  -0.187   0.188    0.188     0.007          0.785           0.008
θ̂2*       0.470   0.020   0.140  -0.530   0.531    0.247     0.009          0.692           0.013
θ̂2̄*       0.173   0.019   0.130  -0.827   0.827    0.319     0.009          0.567           0.014
θ̂2̄w*      0.637   0.016   0.108  -0.363   0.363    0.214     0.007          0.746           0.010
θ̂3*       0.813   0.020   0.179  -0.187   0.188    0.188     0.007          0.785           0.008
θ* = 2
p         0.675   0.009   0.050
θ̂1*       0.227   0.022   0.130  -1.773   1.773    0.370     0.006          0.506           0.010
θ̂1w*      0.472   0.114   0.683  -1.527   1.532    0.362     0.008          0.549           0.021
θ̂2*       0.178   0.025   0.150  -1.822   1.822    0.372     0.006          0.497           0.010
θ̂2̄*       0.031   0.020   0.112  -1.969   1.969    0.377     0.005          0.469           0.009
θ̂2̄w*      0.228   0.030   0.190  -1.772   1.772    0.370     0.006          0.506           0.011
θ̂3*       0.472   0.114   0.683  -1.527   1.532    0.362     0.008          0.549           0.021

Note. RMSE = root mean square error; θ̂1* = converted cutoff score using unweighted Method 1; θ̂1w* = converted cutoff score using weighted Method 1; θ̂2* = converted cutoff score using unweighted Method 2; θ̂2̄* = converted cutoff score using the approximation formula for unweighted Method 2; θ̂2̄w* = converted cutoff score using weighted Method 2; θ̂3* = converted cutoff score using Method 3.

High Variability (.15)

According to Table 3, when the standard deviation is set to .15 and the a priori θ* value is set to 0, in the unbiased condition, Method 2 weighted did the best job of recovering the original θ* values, as shown by its bias mean of .007. Method 1 did the next best, with a bias mean of -.010. The other four methods did a similar job of recovering the original θ* values, with bias means ranging from -.026 to .026; that is, the differences among the bias means are small enough that any of the methods would recover the original θ* values about equally well. The RMSE means mirror the bias means. Additionally, the means of the estimated θ* values across replications range from -.026 to .026. The means of the error index were low, at .136, with low standard deviations.
The means of the consistency index were high, ranging from .806 to .808, with low standard deviations.

When the standard deviation is set to .15 and the a priori θ* value is set to -1, in the unbiased condition, Method 1 weighted and Method 3 did the best job of recovering the original θ* values, with a bias mean of .349. Method 2 and Method 2 weighted did the next best job, with bias means of .471 and .428 respectively. Method 1 and the approximation formula for Method 2 performed the worst, with bias means of .504 and .731. The RMSE means mirror the bias means. Additionally, the means of the estimated θ* values across replications range from -.269 to -.651. The means of the error index were low, ranging from .224 to .300, with small standard deviations. The means of the consistency index were moderate, ranging from .553 to .672, with small standard deviations.

When the standard deviation is set to .15 and the a priori θ* value is set to 1, in the unbiased condition, Method 1 weighted and Method 3 did the best job of recovering the original θ* values, with bias means of -.433. Method 2 weighted did the next best job, with a bias mean of -.585. Method 1 and Method 2 followed, with bias means of -.661 and -.682, and the approximation formula for Method 2 did the worst job, with a bias mean of -.895. The RMSE means mirror the bias means. The means of the estimated θ* values range from .105 to .567. The means of the error index were low, ranging from .275 to .353, with low standard deviations. The means of the consistency index were moderate, ranging from .513 to .667, with low standard deviations.

When the standard deviation is set to .15 and the a priori θ* value is set to -2, in the unbiased condition, Method 1 weighted and Method 3 did the best job of recovering the original θ* values, with bias means of 1.482. The rest of the methods did an equally poor job of recovering the original θ* values, with bias means ranging from 1.812 to 2.031. The RMSE values mirror the bias values. The means of the estimated θ* values range from -.518 to .031. The means of the error index were moderate, ranging from .373 to .375, with low standard deviations. The means of the consistency index were low, ranging from .445 to .473, with low standard deviations. The proportion-correct cutoff score generated in this condition was affected by too few low ratings being possible in the sampling when attempting to achieve the .15 standard deviation of rater variability at the -2 θ* level.

When the standard deviation is set to .15 and the a priori θ* value is set to 2, in the unbiased condition, Method 1 weighted and Method 3 did the best job of recovering the original θ* values, with bias means of -1.781. The rest of the methods did an equally poor job of recovering the original θ* values, with bias means ranging from -1.861 to -1.954. The RMSE means mirror the bias means. The means of the estimated θ* values range from .046 to .219. The means of the error index were low, ranging from .362 to .366, with low standard deviations. The means of the consistency index were moderate, ranging from .488 to .515, with low standard deviations.
Table 3

Summary of Cut Scores Across Methods for Different A Priori θ* Values, Where Ratings Are at the Item Characteristic Curves (ICCs) on Average (High Variability, SD = .15)

           Mean      SD   Range    Bias    RMSE  Error M  Error SD  Consistency M  Consistency SD
θ* = -2
p         0.603   0.009   0.060
θ̂1*      -0.042   0.020   0.121   1.958   1.958    0.374     0.006          0.464           0.008
θ̂1w*     -0.518   0.104   0.659   1.482   1.486    0.374     0.007          0.445           0.011
θ̂2*      -0.014   0.024   0.140   1.986   1.987    0.375     0.006          0.467           0.008
θ̂2̄*       0.031   0.018   0.114   2.031   2.031    0.375     0.006          0.473           0.008
θ̂2̄w*     -0.188   0.039   0.255   1.812   1.813    0.373     0.006          0.450           0.009
θ̂3*      -0.518   0.104   0.659   1.482   1.486    0.374     0.007          0.445           0.011
θ* = -1
p         0.413   0.006   0.041
θ̂1*      -0.496   0.019   0.122   0.504   0.504    0.250     0.007          0.628           0.011
θ̂1w*     -0.651   0.024   0.157   0.349   0.350    0.224     0.006          0.672           0.010
θ̂2*      -0.529   0.018   0.120   0.471   0.472    0.244     0.008          0.638           0.012
θ̂2̄*      -0.269   0.013   0.085   0.731   0.731    0.300     0.007          0.553           0.011
θ̂2̄w*     -0.572   0.015   0.102   0.428   0.428    0.236     0.006          0.651           0.010
θ̂3*      -0.651   0.024   0.157   0.349   0.350    0.224     0.006          0.672           0.010
θ* = 0
p         0.600   0.005   0.031
θ̂1*      -0.010   0.014   0.087  -0.010   0.017    0.136     0.003          0.807           0.005
θ̂1w*      0.026   0.018   0.095   0.026   0.031    0.136     0.003          0.808           0.005
θ̂2*      -0.021   0.013   0.080  -0.021   0.025    0.136     0.004          0.806           0.005
θ̂2̄*      -0.026   0.013   0.078  -0.026   0.029    0.136     0.004          0.806           0.005
θ̂2̄w*      0.008   0.012   0.082   0.007   0.015    0.136     0.003          0.808           0.005
θ̂3*       0.026   0.018   0.095   0.026   0.031    0.136     0.003          0.808           0.005
θ* = 1
p         0.725   0.008   0.048
θ̂1*       0.339   0.022   0.133  -0.661   0.662    0.312     0.008          0.598           0.013
θ̂1w*      0.567   0.032   0.368  -0.433   0.434    0.275     0.009          0.667           0.013
θ̂2*       0.318   0.023   0.150  -0.682   0.682    0.315     0.009          0.591           0.014
θ̂2̄*       0.105   0.020   0.115  -0.895   0.896    0.353     0.008          0.513           0.013
θ̂2̄w*      0.415   0.021   0.138  -0.585   0.586    0.299     0.008          0.623           0.013
θ̂3*       0.567   0.032   0.368  -0.433   0.434    0.275     0.009          0.667           0.013
θ* = 2
p         0.656   0.009   0.061
θ̂1*       0.139   0.023   0.151  -1.861   1.861    0.364     0.006          0.503           0.009
θ̂1w*      0.219   0.066   0.593  -1.781   1.782    0.362     0.006          0.515           0.014
θ̂2*       0.124   0.025   0.160  -1.876   1.876    0.364     0.006          0.500           0.009
θ̂2̄*       0.046   0.018   0.104  -1.954   1.954    0.366     0.006          0.488           0.009
θ̂2̄w*      0.134   0.023   0.143  -1.866   1.866    0.364     0.006          0.502           0.009
θ̂3*       0.219   0.066   0.593  -1.781   1.782    0.362     0.006          0.515           0.014

Note. RMSE = root mean square error; θ̂1* = converted cutoff score using unweighted Method 1; θ̂1w* = converted cutoff score using weighted Method 1; θ̂2* = converted cutoff score using unweighted Method 2; θ̂2̄* = converted cutoff score using the approximation formula for unweighted Method 2; θ̂2̄w* = converted cutoff score using weighted Method 2; θ̂3* = converted cutoff score using Method 3.

Biased Ratings

Low Variability (.05)

.15 above the ICC. According to Table 4, when the standard deviation is set to .05 and the a priori θ* value is set to 0 in the bias condition where the simulated ratings are set to .15 above the ICC, all six methods did a similar job of recovering the original θ* values, with bias means ranging from .092 to .113; that is, the differences among the bias means are small enough that any of the methods would recover the original θ* values about equally well. The RMSE values mirror the bias means except for Method 2, which had an RMSE mean of .304. The means of the estimated θ* values across replications range from .092 to .113.
The means of the error index were low, at .048, with a low standard deviation, and the means of the consistency index were high, ranging from .933 to .934, with a low standard deviation.

When the standard deviation is set to .05 and the a priori θ* value is set to -1 in the bias condition where the simulated ratings are set to .15 above the ICC, Method 2 did the best job of recovering the original θ* value, as shown by its bias mean of .251. The other five methods did a similar job of recovering the original θ*, with bias means ranging from .393 to .411; that is, the differences among these bias means are small enough that any of the methods would recover the original θ* values about equally well. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from -.589 to -.749. The means of the error index were low, ranging from .082 to .113, with low standard deviations. The means of the consistency index were high, ranging from .833 to .881, with low standard deviations.

When the standard deviation is set to .05 and the a priori θ* value is set to 1 in the bias condition where the simulated ratings are set to .15 above the ICC, Method 1 weighted and Method 3 did the best job of recovering the original θ* values, as shown by their bias mean of .045. Method 2 weighted did the next best job, with a bias mean of -.066. The approximation formula for Method 2 did the poorest job, with a bias mean of -.644. The means of the estimated θ* values across replications range from .356 to 1.045. The means of the error index were low, ranging from .098 to .242, with low standard deviations. The means of the consistency index were moderately high, ranging from .689 to .893, with low standard deviations.

When the standard deviation is set to .05 and the a priori θ* value is set to -2 in the bias condition where the simulated ratings are set to .15 above the ICC, Method 2 did the best job of recovering the original θ* value, as shown by its bias mean of 1.192. The rest of the methods did a poor job of recovering the original θ* value, with bias means ranging from 1.376 to 1.459. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from -.808 to -.541. The means of the error index were low, ranging from .121 to .156, with low standard deviations, and the means of the consistency index were high, ranging from .769 to .827, with low standard deviations.

When the standard deviation is set to .05 and the a priori θ* value is set to 2 in the bias condition where the simulated ratings are set to .15 above the ICC, Method 1 weighted and Method 3 did the best job of recovering the original θ* values, as shown by their bias mean of -1.084. The rest of the methods did a poor job of recovering the original θ* value, with bias means ranging from -1.537 to -2.037. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from -.036 to .916. The means of the error index were low, ranging from .333 to .409, with low standard deviations, and the means of the consistency index were moderately low, ranging from .415 to .628, with low standard deviations.
Table 4

Summary of Cut Scores Across Methods for Different A Priori θ* Values, Where Ratings Are .15 Above the Item Characteristic Curves (ICCs) on Average (Low Variability, SD = .05)

           Mean      SD   Range    Bias    RMSE  Error M  Error SD  Consistency M  Consistency SD
θ* = -2
p         0.331   0.001   0.010
θ̂1*      -0.624   0.005   0.030   1.376   1.376    0.142     0.001          0.792           0.002
θ̂1w*     -0.541   0.015   0.091   1.459   1.459    0.156     0.003          0.769           0.005
θ̂2*      -0.808   0.006   0.030   1.192   1.192    0.121     0.001          0.827           0.002
θ̂2̄*      -0.573   0.006   0.038   1.427   1.427    0.150     0.002          0.778           0.003
θ̂2̄w*     -0.574   0.008   0.052   1.426   1.426    0.150     0.002          0.779           0.003
θ̂3*      -0.541   0.015   0.091   1.459   1.459    0.156     0.003          0.769           0.005
θ* = -1
p         0.347   0.001   0.010
θ̂1*      -0.607   0.005   0.031   0.393   0.393    0.109     0.002          0.840           0.002
θ̂1w*     -0.603   0.014   0.093   0.397   0.397    0.110     0.003          0.838           0.005
θ̂2*      -0.749   0.006   0.030   0.251   0.251    0.082     0.002          0.881           0.002
θ̂2̄*      -0.589   0.006   0.043   0.411   0.411    0.113     0.002          0.833           0.003
θ̂2̄w*     -0.590   0.007   0.043   0.410   0.410    0.113     0.002          0.833           0.003
θ̂3*      -0.603   0.014   0.093   0.397   0.397    0.110     0.003          0.838           0.005
θ* = 0
p         0.644   0.002   0.010
θ̂1*       0.113   0.007   0.050   0.113   0.114    0.048     0.001          0.933           0.002
θ̂1w*      0.110   0.005   0.030   0.110   0.110    0.048     0.001          0.933           0.002
θ̂2*       0.092   0.005   0.020   0.092   0.304    0.048     0.001          0.934           0.002
θ̂2̄*       0.094   0.005   0.030   0.094   0.094    0.048     0.001          0.934           0.002
θ̂2̄w*      0.113   0.004   0.025   0.113   0.113    0.048     0.001          0.933           0.002
θ̂3*       0.110   0.005   0.030   0.110   0.110    0.048     0.001          0.933           0.002
θ* = 1
p         0.851   0.005   0.030
θ̂1*       0.861   0.018   0.118  -0.139   0.140    0.120     0.006          0.864           0.008
θ̂1w*      1.045   0.007   0.050   0.045   0.046    0.098     0.005          0.893           0.005
θ̂2*       0.760   0.020   0.140  -0.240   0.241    0.139     0.008          0.840           0.010
θ̂2̄*       0.356   0.023   0.141  -0.644   0.645    0.242     0.011          0.689           0.016
θ̂2̄w*      0.933   0.009   0.065  -0.066   0.067    0.109     0.005          0.879           0.006
θ̂3*       1.045   0.007   0.050   0.045   0.046    0.098     0.005          0.893           0.005
θ* = 2
p         0.718   0.008   0.060
θ̂1*       0.463   0.022   0.141  -1.537   1.537    0.372     0.007          0.537           0.011
θ̂1w*      0.916   0.075   0.419  -1.084   1.087    0.333     0.011          0.628           0.017
θ̂2*       0.299   0.024   0.150  -1.701   1.701    0.385     0.007          0.497           0.012
θ̂2̄*      -0.036   0.020   0.133  -2.037   2.037    0.409     0.005          0.415           0.009
θ̂2̄w*      0.374   0.026   0.161  -1.626   1.626    0.379     0.007          0.516           0.012
θ̂3*       0.916   0.075   0.419  -1.084   1.087    0.333     0.011          0.628           0.017

Note. RMSE = root mean square error; θ̂1* = converted cutoff score using unweighted Method 1; θ̂1w* = converted cutoff score using weighted Method 1; θ̂2* = converted cutoff score using unweighted Method 2; θ̂2̄* = converted cutoff score using the approximation formula for unweighted Method 2; θ̂2̄w* = converted cutoff score using weighted Method 2; θ̂3* = converted cutoff score using Method 3.

Medium Variability (.10)

.15 above the ICC. According to Table 5, when the standard deviation is set to .10 and the a priori θ* value is set to 0 in the bias condition where the simulated ratings are set to .15 above the ICC, all six methods did a similar job of recovering the original θ*, with bias means ranging from .241 to .318; that is, the differences among the bias means are small enough that any of the methods would recover the original θ* values about equally well. The RMSE values mirror the bias means. The means of the estimated θ* values across replications range from .241 to .318. The means of the error index were low, ranging from .125 to .126, with a low standard deviation, and the means of the consistency index were high, ranging from .833 to .836, with low standard deviations.
When the standard deviation is set to .10 and the a priori θ* value is set to -1 in the bias condition where the simulated ratings are set to .15 above the ICC, method 1 weighted and method 3 did the best job of recovering the original θ* values, as shown by the bias mean of .784. Method 2 did the worst job of recovering the original θ* value, with a bias mean of 1.003. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from -.216 to .012. The means of the error index were low, ranging from .236 to .245, with low standard deviations, and the means of the consistency index were moderate, ranging from .651 to .654, with low standard deviations.

When the standard deviation is set to .10 and the a priori θ* value is set to 1 in the bias condition where the simulated ratings are set to .15 above the ICC, method 1 weighted and method 3 did the best job of recovering the original θ* values, as shown by the bias mean of -.146. The approximation formula for method 2 did the worst job of recovering the original θ* value, as shown by the bias mean of -.957. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from .132 to .929. The means of the error index were low, ranging from .221 to .366, with low standard deviations, and the means of the consistency index were moderate, ranging from .498 to .753, with low standard deviations.

When the standard deviation is set to .10 and the a priori θ* value is set to -2 in the bias condition where the simulated ratings are set to .15 above the ICC, method 1 weighted and method 3 did the best job of recovering the original θ* values, as shown by the bias mean of 1.679. The rest of the methods did a similarly poor job of recovering the original θ* values, with bias means ranging from 1.909 to 2.135; that is, the bias means of the remaining methods fall within a narrow range, so the choice among them makes little practical difference. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from -.321 to .135. The means of the error index were low, ranging from .315 to .319, with low standard deviations, and the means of the consistency index were moderate, ranging from .523 to .566, with low standard deviations. The proportion-correct cutoff score generated in this condition was affected by not enough low ratings being possible in the sampling when attempting to achieve the .10 standard deviation of rater variability at the -2 θ* level.

When the standard deviation is set to .10 and the a priori θ* value is set to 2 in the bias condition where the simulated ratings are set to .15 above the ICC, method 1 weighted and method 3 did the best job of recovering the original θ* values, as shown by the bias mean of -1.653. The rest of the methods did a similarly poor job of recovering the original θ* values, with bias means ranging from -1.839 to -1.964; that is, the bias means of the remaining methods fall within a narrow range, so the choice among them makes little practical difference. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from .036 to .347.
The means of the error index were low, ranging from .363 to .368, with low standard deviations, and the means of the consistency index were moderately low, ranging from .483 to .533, with low standard deviations.

Table 5

Summary of Cut Scores Across Methods for Different A Priori θ* Values, Where Ratings Are .15 Above the Item Characteristic Curves (ICCs) on Average (Medium Variability, SD = .10)

[Table 5 reports the mean, standard deviation, range, bias, RMSE, error index (M and SD), and consistency index (M and SD) of the proportion-correct cutoff score (p) and of each of the six converted cutoff scores, at each a priori θ* value (-2, -1, 0, 1, and 2).]

Note. RMSE = root mean square error. The six converted cutoff scores are those from unweighted Method 1, weighted Method 1, unweighted Method 2, the approximation formula for unweighted Method 2, weighted Method 2, and Method 3.

High Variability (.15), .15 above the ICC. According to Table 6, when the standard deviation is set to .15 and the a priori θ* value is set to 0 in the bias condition where the simulated ratings are set to .15 above the ICC, the approximation formula for method 2 did the best job of recovering the original θ* values, as shown by the bias mean of .233.
The rest of the methods did a similar job of recovering the original θ* values, with bias means ranging from .277 to .327; that is, the bias means fall within a narrow range, suggesting that the original θ* value would be recovered about equally well whichever method is selected. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from .233 to .327. The means of the error index were low, ranging from .181 to .185, with low standard deviations, and the means of the consistency index were high, ranging from .754 to .765, with low standard deviations.

When the standard deviation is set to .15 and the a priori θ* value is set to -1 in the bias condition where the simulated ratings are set to .15 above the ICC, method 1 weighted and method 3 did the best job of recovering the original θ* values, as shown by the bias mean of .766. The rest of the methods did a similarly poor job of recovering the original θ* value, with bias means ranging from .982 to 1.218; that is, the bias means of the remaining methods fall within a narrow range, so the choice among them makes little practical difference. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from -.234 to .218. The means of the error index were low, ranging from .283 to .308, with low standard deviations, and the means of the consistency index were moderate, ranging from .543 to .621.

When the standard deviation is set to .15 and the a priori θ* value is set to 1 in the bias condition where the simulated ratings are set to .15 above the ICC, the simulated ratings failed to generate. This may be due to the restricted range at the top end of the distribution combined with the high variability in the rater agreement.

When the standard deviation is set to .15 and the a priori θ* value is set to -2 in the bias condition where the simulated ratings are set to .15 above the ICC, all six methods did a similarly poor job of recovering the original θ* values, with bias means ranging from 1.946 to 2.102; that is, the bias means fall within a narrow range, so the choice of method makes little practical difference. The RMSE values mirror the bias means. The means of the estimated θ* values across replications range from -.054 to .103. The means of the error index were low, ranging from .346 to .350, with low standard deviations, and the means of the consistency index were moderate, ranging from .498 to .522, with low standard deviations.

When the standard deviation is set to .15 and the a priori θ* value is set to 2 in the bias condition where the simulated ratings are set to .15 above the ICC, the simulated ratings failed to generate. This may be due to the restricted range at the top end of the distribution combined with the high variability in the rater agreement.
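A plausible mechanistic account of these generation failures follows from the moment constraints on a bounded random variable: any distribution confined to [0, 1] with mean μ has a standard deviation strictly below √(μ(1 − μ)). The Python sketch below makes this concrete, assuming ratings are drawn from beta distributions moment-matched to the target mean and standard deviation (in the spirit of the beta-distribution approach discussed in Chapter 4); the ICC values used are purely illustrative, not values taken from the simulations.

```python
def beta_params(mu, sd):
    """Moment-match a beta(alpha, beta) distribution to a target mean and SD.

    Returns (alpha, beta), or None when no distribution on (0, 1) can
    achieve that combination, i.e., when sd**2 >= mu * (1 - mu).
    """
    var = sd ** 2
    if var >= mu * (1.0 - mu):
        return None  # target SD exceeds the maximum possible for this mean
    nu = mu * (1.0 - mu) / var - 1.0  # precision parameter, alpha + beta
    return mu * nu, (1.0 - mu) * nu

# Hypothetical ICC heights for items that are easy for examinees near the
# upper theta* values, with ratings biased .15 above the curve and a
# target rater SD of .15:
for icc in (0.70, 0.80, 0.84):
    mu = icc + 0.15
    print(f"ICC = {icc:.2f}, target mean = {mu:.2f} ->", beta_params(mu, 0.15))
```

As the biased rating mean approaches the ceiling, the second shape parameter collapses toward zero and then no parameter solution exists at all, which is consistent with the failures observed in the θ* = 1 and θ* = 2 conditions at the .15 level of rater variability.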
Table 6

Summary of Cut Scores Across Methods for Different A Priori θ* Values, Where Ratings Are .15 Above the Item Characteristic Curves (ICCs) on Average (High Variability, SD = .15)

[Table 6 reports the mean, standard deviation, range, bias, RMSE, error index (M and SD), and consistency index (M and SD) of the proportion-correct cutoff score (p) and of each of the six converted cutoff scores, at each a priori θ* value; no values are reported for θ* = 1 or θ* = 2 because the simulated ratings failed to generate in those conditions.]

Note. RMSE = root mean square error. The six converted cutoff scores are those from unweighted Method 1, weighted Method 1, unweighted Method 2, the approximation formula for unweighted Method 2, weighted Method 2, and Method 3.

Low Variability (.05), .15 below the ICC. According to Table 7, when the standard deviation is set to .05 and the a priori θ* value is set to 0 in the bias condition where the simulated ratings are set to .15 below the ICC, all six methods did a similar job of recovering the original θ* value, with bias means ranging from -.083 to -.110; that is, the bias means fall within a narrow range, suggesting that the original θ* value would be recovered about equally well whichever method is selected. The RMSE values mirror the bias means. The means of the estimated θ* values across replications range from -.110 to -.083. The means of the error index were low, ranging from .051 to .052, with low standard deviations, and the means of the consistency index were high, ranging from .925 to .926, with low standard deviations.

When the standard deviation is set to .05 and the a priori θ* value is set to -1 in the bias condition where the simulated ratings are set to .15 below the ICC, method 2 did the best job of recovering the original θ* value, as shown by the bias mean of -.009. Method 1 weighted and method 3 did the next best job, as shown by the bias mean of -.011, and the approximation formula for method 2 follows, with a bias mean of .082.
The other two methods did a similar job, as shown by the bias means of .137 and .140; that is, the bias means fall within a narrow range, suggesting that the original θ* value would be recovered about equally well whichever method is selected. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from -1.011 to -.860. The means of the error index were low, ranging from .070 to .089, with low standard deviations, and the means of the consistency index were high, ranging from .875 to .904, with low standard deviations.

When the standard deviation is set to .05 and the a priori θ* value is set to 1 in the bias condition where the simulated ratings are set to .15 below the ICC, method 1 weighted and method 3 did the best job of recovering the original θ* values, as shown by the bias means of -.131. Method 2 weighted did the next best job, as shown by the bias mean of -.234. Method 2 and method 1 followed, with bias means of -.399 and -.416, respectively. The approximation formula for method 2 did the worst job, with a bias mean of -.646. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from .353 to .869. The means of the error index were low, ranging from .101 to .204, with low standard deviations, and the means of the consistency index were moderately high, ranging from .738 to .886, with low standard deviations.

When the standard deviation is set to .05 and the a priori θ* value is set to -2 in the bias condition where the simulated ratings are set to .15 below the ICC, method 2 did the best job of recovering the original θ* values, as shown by the bias mean of .779. The rest of the methods did a similarly poor job of recovering the original θ* values, with bias means ranging from 1.019 to 1.049; that is, the bias means of the remaining methods fall within a narrow range, so the choice among them makes little practical difference. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from -1.221 to -.910. The means of the error index were low, ranging from .085 to .111, with low standard deviations, and the means of the consistency index were high, ranging from .836 to .888.

When the standard deviation is set to .05 and the a priori θ* value is set to 2 in the bias condition where the simulated ratings are set to .15 below the ICC, method 1 weighted and method 3 did the best job of recovering the original θ* values, as shown by the bias means of -.849. The rest of the methods did a similarly poor job of recovering the original θ* values, with bias means ranging from -1.428 to -1.983. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from .353 to 1.151. The means of the error index were low, ranging from .302 to .399, with low standard deviations, and the means of the consistency index were moderately low, ranging from .437 to .675, with low standard deviations.
Table 7

Summary of Cut Scores Across Methods for Different A Priori θ* Values, Where Ratings Are .15 Below the Item Characteristic Curves (ICCs) on Average (Low Variability, SD = .05)

[Table 7 reports the mean, standard deviation, range, bias, RMSE, error index (M and SD), and consistency index (M and SD) of the proportion-correct cutoff score (p) and of each of the six converted cutoff scores, at each a priori θ* value (-2, -1, 0, 1, and 2).]

Note. RMSE = root mean square error. The six converted cutoff scores are those from unweighted Method 1, weighted Method 1, unweighted Method 2, the approximation formula for unweighted Method 2, weighted Method 2, and Method 3.

Medium Variability (.10), .15 below the ICC. According to Table 8, when the standard deviation is set to .10 and the a priori value is set to 0 in the bias condition where the simulated ratings are set to .15 below the ICC, all six methods did a similar job of recovering the original θ* values, with bias means ranging from -.104 to .009; that is, the bias means fall within a narrow range, suggesting that the original θ* value would be recovered about equally well whichever method is selected. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from -.104 to .009. The means of the error index were low, ranging from .126 to .135, and the means of the consistency index were high, ranging from .809 to .818.
When the standard deviation is set to .10 and the a priori value is set to -1 in the bias condition where the simulated ratings are set to .15 below the ICC, method 1 weighted and method 3 did the best job of recovering the original θ* values, as shown by the bias means of .237. Method 2 weighted did the next best job, as shown by the bias mean of .584. The rest of the methods did a similar job, with bias means ranging from .885 to 1.005; that is, the bias means of the remaining methods fall within a narrow range, so the choice among them makes little practical difference. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from -.764 to .005. The means of the error index were low, ranging from .338 to .392, with low standard deviations, and the means of the consistency index were low, ranging from .441 to .514.

When the standard deviation is set to .10 and the a priori value is set to 1 in the bias condition where the simulated ratings are set to .15 below the ICC, method 1 weighted and method 3 did the best job of recovering the original θ* values, as shown by the bias means of -.278. Method 2 weighted did the next best job, as shown by the bias mean of -.438. The rest of the methods did a similar job, with bias means ranging from -.609 to -.800; that is, the bias means of the remaining methods fall within a narrow range, so the choice among them makes little practical difference. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from .200 to .722. The means of the error index were low, ranging from .189 to .288, with low standard deviations, and the means of the consistency index were moderate, ranging from .613 to .780.

When the standard deviation is set to .10 and the a priori value is set to -2 in the bias condition where the simulated ratings are set to .15 below the ICC, method 1 weighted and method 3 did the best job of recovering the original θ* values, as shown by the bias means of 1.864. The rest of the methods did a similarly poor job of recovering the original θ* values, with bias means ranging from 1.977 to 2.059; that is, the bias means of the remaining methods fall within a narrow range, so the choice among them makes little practical difference. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from -.137 to .059. The means of the error index were low, ranging from .361 to .364, with low standard deviations, and the means of the consistency index were low, ranging from .468 to .495, with low standard deviations. The proportion-correct cutoff score generated in this condition was affected by not enough low ratings being possible in the sampling when attempting to achieve the .10 standard deviation of rater variability at the -2 θ* level.
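The floor effect noted in the preceding paragraph can be illustrated directly: when the target mean of the ratings sits near a lower bound, draws below that bound are impossible, so the realized mean of the sampled ratings, and with it the proportion-correct cutoff, is pulled upward. A minimal Python sketch follows; the normal draws, the .20 floor (standing in for a hypothetical chance-level lower bound), and the target mean and SD are all illustrative assumptions, not values or mechanics taken from the simulations.

```python
import numpy as np

rng = np.random.default_rng(2013)

target_mean, target_sd, floor = 0.20, 0.10, 0.20  # illustrative values only
draws = rng.normal(target_mean, target_sd, size=1_000_000)
ratings = np.clip(draws, floor, 1.0)  # ratings below the floor are pushed up to it

print(f"target mean:   {target_mean:.3f}")
print(f"realized mean: {ratings.mean():.3f}")  # about .240, noticeably above target
print(f"realized SD:   {ratings.std():.3f}")   # about .058, well below the target .10
```

When the intended spread cannot be realized on the low side, the sampled ratings end up with a higher mean and a smaller standard deviation than targeted, which would push the proportion-correct cutoff, and every θ* conversion derived from it, upward at the -2 θ* level.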
When the standard deviation is set to .10 and the a priori value is set to 2 in the bias condition where the simulated ratings are set to .15 below the ICC, all of the methods did a similarly poor job of recovering the original θ* value, with bias means ranging from -1.424 to -1.958; that is, the bias means fall within a narrow range, so the choice of method makes little practical difference. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from .042 to .576. The means of the error index were low, ranging from .356 to .373, with low standard deviations, and the means of the consistency index were moderately low, ranging from .476 to .570, with low standard deviations.

Table 8

Summary of Cut Scores Across Methods for Different A Priori θ* Values, Where Ratings Are .15 Below the Item Characteristic Curves (ICCs) on Average (Medium Variability, SD = .10)

[Table 8 reports the mean, standard deviation, range, bias, RMSE, error index (M and SD), and consistency index (M and SD) of the proportion-correct cutoff score (p) and of each of the six converted cutoff scores, at each a priori θ* value (-2, -1, 0, 1, and 2).]

Note. RMSE = root mean square error. The six converted cutoff scores are those from unweighted Method 1, weighted Method 1, unweighted Method 2, the approximation formula for unweighted Method 2, weighted Method 2, and Method 3.

High Variability (.15), .15 below the ICC.
According to Table 9, when the standard deviation is set to .15 and the a priori value is set to 0 in the bias condition where the simulated ratings are set to .15 below the ICC, the approximation formula for method 2 did the best job of recovering the original θ* values, as shown by the bias mean of -.051. Method 1 did the next best job, as shown by the bias mean of -.083, followed closely by method 2, with a bias mean of -.085. Method 1 weighted and method 3 did the worst job, as shown by the bias means of -.152. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from -.152 to -.051. The means of the error index were low, ranging from .178 to .183, and the means of the consistency index were moderately high, ranging from .737 to .740, with low standard deviations.

When the standard deviation is set to .15 and the a priori value is set to -1 in the bias condition where the simulated ratings are set to .15 below the ICC, method 1 weighted and method 3 did the best job of recovering the original θ* values, as shown by the bias means of .552. Method 2 weighted did the next best job, as shown by the bias mean of .774. The rest of the methods did a similar job of recovering the original θ* values, with bias means ranging from .919 to 1.024; that is, the bias means of the remaining methods fall within a narrow range, so the choice among them makes little practical difference. The RMSE values mirror the bias means. The means of the estimated θ* values across replications range from -.448 to .024. The means of the error index were low, ranging from .371 to .391, and the means of the consistency index were low, ranging from .435 to .449, with low standard deviations.

When the standard deviation is set to .15 and the a priori value is set to 1 in the bias condition where the simulated ratings are set to .15 below the ICC, method 1 weighted and method 3 did the best job of recovering the original θ* values, as shown by the bias means of -.493. The rest of the methods did a similar job of recovering the original θ* values, with bias means ranging from -.650 to -.891; that is, the bias means of the remaining methods fall within a narrow range, so the choice among them makes little practical difference. The RMSE values mirror the bias means. The means of the estimated θ* values across replications range from .109 to .507. The means of the error index were low, ranging from .264 to .323, and the means of the consistency index were moderate, ranging from .555 to .675, with low standard deviations.

When the standard deviation is set to .15 and the a priori value is set to -2 in the bias condition where the simulated ratings are set to .15 below the ICC, all six methods did a poor job of recovering the original θ* values, with bias means ranging from 1.997 to 2.067. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from -.002 to .067. The means of the error index were low, ranging from .359 to .360, and the means of the consistency index were moderately low, ranging from .490 to .500, with low standard deviations.
The proportion-correct cutoff score generated in this condition was affected by not enough low ratings being possible in the sampling when attempting to achieve the .15 standard deviation of rater variability at the -2 θ* level.

When the standard deviation is set to .15 and the a priori value is set to 2 in the bias condition where the simulated ratings are set to .15 below the ICC, all six methods did a similar job of recovering the original θ* values, with bias means ranging from -1.689 to -1.942; that is, the bias means fall within a narrow range, so the choice of method makes little practical difference. The RMSE values mirror the bias means. The means of the estimated θ* values across replications range from .058 to .311. The means of the error index were low, ranging from .357 to .361, and the means of the consistency index were moderate, ranging from .496 to .536, with low standard deviations.

Table 9

Summary of Cut Scores Across Methods for Different A Priori θ* Values, Where Ratings Are .15 Below the Item Characteristic Curves (ICCs) on Average (High Variability, SD = .15)

[Table 9 reports the mean, standard deviation, range, bias, RMSE, error index (M and SD), and consistency index (M and SD) of the proportion-correct cutoff score (p) and of each of the six converted cutoff scores, at each a priori θ* value (-2, -1, 0, 1, and 2).]

Note. RMSE = root mean square error. The six converted cutoff scores are those from unweighted Method 1, weighted Method 1, unweighted Method 2, the approximation formula for unweighted Method 2, weighted Method 2, and Method 3.
Low Variability (.05), .35 above the ICC. According to Table 10, when the standard deviation is set to .05 and the a priori value is set to 0 in the bias condition where the simulated ratings are set to .35 above the ICC, the approximation formula for method 2 did the best job of recovering the original θ* value, as shown by the bias mean of .086. The rest of the methods did a similar job, with bias means ranging from .106 to .147; that is, the bias means fall within a narrow range, suggesting that the original θ* value would be recovered about equally well whichever method is selected. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from .086 to .147. The means of the error index were low, ranging from .059 to .061, and the means of the consistency index were high, ranging from .917 to .918, with low standard deviations.

When the standard deviation is set to .05 and the a priori value is set to -1 in the bias condition where the simulated ratings are set to .35 above the ICC, method 2 did the best job of recovering the original θ* value, as shown by the bias mean of .253. The rest of the methods did a similar job of recovering the original θ* value, with bias means ranging from .396 to .413; that is, the bias means of the remaining methods fall within a narrow range, so the choice among them makes little practical difference. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from -.747 to -.587. The means of the error index were low, ranging from .083 to .113, and the means of the consistency index were high, ranging from .833 to .839, with low standard deviations.

When the standard deviation is set to .05 and the a priori value is set to 1 in the bias condition where the simulated ratings are set to .35 above the ICC, method 1 weighted and method 3 did the best job of recovering the original θ* value, as shown by the bias means of .033. Method 2 weighted did the next best job, as shown by the bias mean of -.174. Method 1 followed, as shown by the bias mean of -.283. Method 2 is next, with a bias mean of -.439, and the approximation formula for method 2 did the worst job, as shown by the bias mean of -.902. The RMSE values mirror the bias means. The means of the estimated θ* values across replications range from .098 to 1.033. The means of the error index were low, ranging from .172 to .372, and the means of the consistency index were moderately high, ranging from .486 to .812, with low standard deviations.
When the standard deviation is set to .05 and the a priori value is set to -2 in the bias condition where the simulated ratings are set to .35 above the ICC, all six methods did a similar job of recovering the original θ* values, with bias means ranging from 1.192 to 1.459; that is, the bias means fall within a narrow range, suggesting that the original θ* value would be recovered about equally well whichever method is selected. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from -.313 to .136. The means of the error index were low, ranging from .314 to .318, and the means of the consistency index were moderate, ranging from .525 to .570, with low standard deviations.

When the standard deviation is set to .05 and the a priori value is set to 2 in the bias condition where the simulated ratings are set to .35 above the ICC, all six methods did a similar job of recovering the original θ* value, with bias means ranging from -1.319 to -1.987. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from .013 to .681. The means of the error index were low, ranging from .361 to .376, and the means of the consistency index were moderate, ranging from .469 to .575, with low standard deviations.

Table 10

Summary of Cut Scores Across Methods for Different A Priori θ* Values, Where Ratings Are .35 Above the Item Characteristic Curves (ICCs) on Average (Low Variability, SD = .05)

[Table 10 reports the mean, standard deviation, range, bias, RMSE, error index (M and SD), and consistency index (M and SD) of the proportion-correct cutoff score (p) and of each of the six converted cutoff scores, at each a priori θ* value (-2, -1, 0, 1, and 2).]

Note. RMSE = root mean square error. The six converted cutoff scores are those from unweighted Method 1, weighted Method 1, unweighted Method 2, the approximation formula for unweighted Method 2, weighted Method 2, and Method 3.
Medium Variability (.10), .35 above the ICC. According to Table 11, when the standard deviation is set to .10 and the a priori value is set to 0 in the bias condition where the simulated ratings are set to .35 above the ICC, all six methods did a similar job of recovering the original θ* value, with bias means ranging from .370 to .477; that is, the bias means fall within a narrow range, suggesting that the original θ* value would be recovered about equally well whichever method is selected. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from .370 to .477. The means of the error index were low, ranging from .115 to .119, and the means of the consistency index were high, ranging from .848 to .857, with low standard deviations.

When the standard deviation is set to .10 and the a priori value is set to -1 in the bias condition where the simulated ratings are set to .35 above the ICC, method 1 weighted and method 3 did the best job of recovering the original θ* values, as shown by the bias means of .815. The rest of the methods did a similar job of recovering the original θ* values, with bias means ranging from .892 to 1.038; that is, the bias means of the remaining methods fall within a narrow range, so the choice among them makes little practical difference. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from -.185 to .038. The means of the error index were low, ranging from .228 to .237, and the means of the consistency index were moderate, ranging from .664 to .666, with low standard deviations.

When the standard deviation is set to .10 and the a priori value is set to 1 in the bias condition where the simulated ratings are set to .35 above the ICC, method 1 weighted and method 3 did the best job of recovering the original θ* values, as shown by the bias means of -.146. Method 2 weighted did the next best job, as shown by the bias mean of -.498. Method 1 did the next best job, as shown by the bias mean of -.579. Method 2 followed, as shown by the bias mean of -.664. The approximation formula for method 2 did the worst job, as shown by the bias mean of -.957. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from .043 to .854. The means of the error index were low, ranging from .297 to .402, with low standard deviations, and the means of the consistency index were moderate, ranging from .436 to .663, with low standard deviations.
When the standard deviation is set to .10 and the a priori value is set to -2 in the bias condition where the simulated ratings are set to .35 above the ICC, all six methods did a poor job of recovering the original θ* values, with bias means ranging from 1.687 to 2.110. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from -.313 to .136. The means of the error index were low, ranging from .314 to .318, and the means of the consistency index were moderate, ranging from .525 to .570, with low standard deviations. The proportion-correct cutoff score generated in this condition was affected by not enough low ratings being possible in the sampling when attempting to achieve the .10 standard deviation of rater variability at the -2 θ* level.

When the standard deviation is set to .10 and the a priori value is set to 2 in the bias condition where the simulated ratings are set to .35 above the ICC, all six methods did a poor job of recovering the original θ* values, with bias means ranging from -1.855 to -1.976; that is, the bias means fall within a narrow range, so the choice of method makes little practical difference. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from .024 to .145. The means of the error index were low, ranging from .366 to .368, and the means of the consistency index were moderate, ranging from .482 to .501, with low standard deviations.
Table 11

Summary of Cut Scores Across Methods for Different A Priori θ* Values, Where Ratings Are .35 Above the Item Characteristic Curves (ICCs) on Average (Medium Variability, SD = .10)

[Table 11 reports the mean, standard deviation, range, bias, RMSE, error index (M and SD), and consistency index (M and SD) of the proportion-correct cutoff score (p) and of each of the six converted cutoff scores, at each a priori θ* value (-2, -1, 0, 1, and 2).]

Note. RMSE = root mean square error. The six converted cutoff scores are those from unweighted Method 1, weighted Method 1, unweighted Method 2, the approximation formula for unweighted Method 2, weighted Method 2, and Method 3.

High Variability (.15), .35 above the ICC. According to Table 12, when the standard deviation is set to .15 and the a priori value is set to 0 in the bias condition where the simulated ratings are set to .35 above the ICC, all six methods did a similar job of recovering the original θ* values, with bias means ranging from .048 to .083; that is, the bias means fall within a narrow range, suggesting that the original θ* value would be recovered about equally well whichever method is selected. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from .048 to .083. The means of the error index were low, ranging from .354 to .355, and the means of the consistency index were moderate, ranging from .503 to .508, with low standard deviations.
When the standard deviation is set to .15 and the a priori value is set to -1 in the bias condition where the simulated ratings are set to .35 above the ICC, all six methods did a similar job of recovering the original θ* values, with bias means ranging from 1.077 to 1.319; that is, the bias means fall within a narrow range, so the choice of method makes little practical difference. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from .077 to .360. The means of the error index were low, ranging from .236 to .249, and the means of the consistency index were moderate, ranging from .653 to .697, with low standard deviations.

When the standard deviation is set to .15 and the a priori value is set to 1 in the bias condition where the simulated ratings are set to .35 above the ICC, the simulated ratings failed to generate. This may be due to the restricted range at the top end of the distribution combined with the high variability in the rater agreement.

When the standard deviation is set to .15 and the a priori value is set to -2 in the bias condition where the simulated ratings are set to .35 above the ICC, all six methods did a similar job of recovering the original θ* values, with bias means ranging from 2.019 to 2.120; that is, the bias means fall within a narrow range, so the choice of method makes little practical difference. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from .019 to .120. The means of the error index were low, ranging from .342 to .344, and the means of the consistency index were moderate, ranging from .514 to .530, with low standard deviations.

When the standard deviation is set to .15 and the a priori value is set to 2 in the bias condition where the simulated ratings are set to .35 above the ICC, the simulated ratings failed to generate. This may be due to the restricted range at the top end of the distribution combined with the high variability in the rater agreement.
Table 12

Summary of Cut Scores Across Methods for Different A Priori θ* Values, Where Ratings Are .35 Above the Item Characteristic Curves (ICCs) on Average (High Variability, SD = .15)

[Table 12 reports the mean, standard deviation, range, bias, RMSE, error index (M and SD), and consistency index (M and SD) of the proportion-correct cutoff score (p) and of each of the six converted cutoff scores, at each a priori θ* value; no values are reported for θ* = 1 or θ* = 2 because the simulated ratings failed to generate in those conditions.]

Note. RMSE = root mean square error. The six converted cutoff scores are those from unweighted Method 1, weighted Method 1, unweighted Method 2, the approximation formula for unweighted Method 2, weighted Method 2, and Method 3.

Low Variability (.05), .35 below the ICC. According to Table 13, when the standard deviation is set to .05 and the a priori value is set to 0 in the bias condition where the simulated ratings are set to .35 below the ICC, all six methods did a similar job of recovering the original θ* values, with bias means ranging from -.150 to -.185; that is, the bias means fall within a narrow range, suggesting that the original θ* value would be recovered about equally well whichever method is selected. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from -.185 to -.150. The means of the error index were low, ranging from .054 to .055, and the means of the consistency index were high, ranging from .920 to .921, with low standard deviations.

When the standard deviation is set to .05 and the a priori value is set to -1 in the bias condition where the simulated ratings are set to .35 below the ICC, method 1 weighted, method 3, and method 2 did the best job of recovering the original θ* values, with bias means of -.049 to -.050. Method 2 weighted did the next best job, as shown by the bias mean of .051. The approximation formula for method 2 and method 1 are last, with bias means of .111 and .123. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from -1.049 to -.877.
The means of the error index were low, ranging from .073 to .094, and the means of the consistency index were high, ranging from .871 to .900, with low standard deviations.

When the standard deviation is set to .05 and the a priori value is set to 1 in the bias condition where the simulated ratings are set to .35 below the ICC, method 1 weighted and method 3 did the best job of recovering the original θ* values, as shown by the bias means of -.214. Method 2 weighted did the next best job, as shown by the bias mean of -.309. Method 2 and method 1 did the next best job, as shown by the bias means of -.467 and -.484. The approximation formula for method 2 did the worst, as shown by the bias mean of -.670. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from .330 to .786. The means of the error index were low, ranging from .102 to .193, and the means of the consistency index were high, ranging from .751 to .882, with low standard deviations.

When the standard deviation is set to .05 and the a priori value is set to -2 in the bias condition where the simulated ratings are set to .35 below the ICC, method 2 did the best job of recovering the original θ* values, as shown by the bias mean of .776. The rest of the methods did a similar job, with bias means ranging from 1.012 to 1.088; that is, the bias means of the remaining methods fall within a narrow range, so the choice among them makes little practical difference. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from -1.224 to -.912. The means of the error index were low, ranging from .085 to .117, and the means of the consistency index were high, ranging from .836 to .888, with low standard deviations for both the error and consistency indices.

When the standard deviation is set to .05 and the a priori value is set to 2 in the bias condition where the simulated ratings are set to .35 below the ICC, method 1 weighted and method 3 did the best job of recovering the original θ* values, as shown by the bias means of -.185. The rest of the methods did a similar job, with bias means ranging from -1.411 to -1.975. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from .025 to 1.185. The means of the error index were low, ranging from .298 to .396, and the means of the consistency index were moderate, ranging from .443 to .681.
Table 13
Summary of Cut Scores Across Methods for Different A Priori θ* Values, Where Ratings Are .35 Below the Item Characteristic Curves (ICCs) on Average (Low Variability, SD = .05)

                     Mean     SD  Range    Bias   RMSE  Error M  Error SD  Consistency M  Consistency SD
θ* = -2
p                   0.251  0.002  0.010
Method 1           -0.988  0.015  0.095   1.012  1.012    0.107     0.003          0.852           0.004
Method 1 (wtd)     -0.962  0.039  0.330   1.038  1.039    0.111     0.006          0.847           0.009
Method 2           -1.224  0.016  0.100   0.776  0.776    0.085     0.003          0.888           0.004
Method 2 (appr)    -0.912  0.021  0.149   1.088  1.088    0.117     0.004          0.836           0.007
Method 2 (wtd)     -0.951  0.020  0.153   1.049  1.049    0.112     0.004          0.845           0.006
Method 3           -0.962  0.039  0.330   1.038  1.039    0.111     0.006          0.847           0.009
θ* = -1
p                   0.279  0.002  0.010
Method 1           -0.877  0.014  0.088   0.123  0.124    0.094     0.003          0.868           0.004
Method 1 (wtd)     -1.049  0.016  0.097  -0.049  0.052    0.073     0.002          0.900           0.004
Method 2           -1.050  0.012  0.080  -0.050  0.051    0.073     0.002          0.900           0.003
Method 2 (appr)    -0.889  0.018  0.103   0.111  0.113    0.092     0.004          0.871           0.006
Method 2 (wtd)     -0.949  0.015  0.090   0.051  0.053    0.084     0.003          0.883           0.004
Method 3           -1.049  0.016  0.097  -0.049  0.052    0.073     0.002          0.900           0.004
θ* = 0
p                   0.541  0.002  0.010
Method 1           -0.185  0.007  0.048  -0.185  0.185    0.055     0.001          0.920           0.002
Method 1 (wtd)     -0.150  0.004  0.025  -0.150  0.150    0.054     0.001          0.921           0.002
Method 2           -0.176  0.006  0.030  -0.176  0.177    0.054     0.001          0.920           0.002
Method 2 (appr)    -0.172  0.006  0.038  -0.172  0.172    0.054     0.001          0.921           0.002
Method 2 (wtd)     -0.162  0.004  0.029  -0.162  0.162    0.054     0.001          0.921           0.002
Method 3           -0.150  0.004  0.025  -0.150  0.150    0.054     0.001          0.921           0.002
θ* = 1
p
Method 1            0.516  0.011  0.082  -0.484  0.485    0.143     0.006          0.824           0.008
Method 1 (wtd)      0.786  0.006  0.036  -0.214  0.214    0.102     0.004          0.882           0.005
Method 2            0.533  0.016  0.100  -0.467  0.467    0.139     0.007          0.829           0.010
Method 2 (appr)     0.330  0.017  0.113  -0.670  0.670    0.193     0.009          0.751           0.012
Method 2 (wtd)      0.691  0.007  0.041  -0.309  0.309    0.111     0.005          0.870           0.006
Method 3            0.786  0.006  0.036  -0.214  0.214    0.102     0.004          0.882           0.005
θ* = 2
p                   0.793  0.005  0.030
Method 1            0.337  0.019  0.127  -1.662  1.662    0.371     0.007          0.521           0.011
Method 1 (wtd)      1.185  0.045  0.309  -0.185  0.185    0.298     0.009          0.681           0.011
Method 2            0.278  0.025  0.160  -1.722  1.722    0.376     0.007          0.506           0.012
Method 2 (appr)     0.025  0.019  0.121  -1.975  1.975    0.396     0.005          0.443           0.009
Method 2 (wtd)      0.589  0.021  0.157  -1.411  1.411    0.349     0.008          0.580           0.012
Method 3            1.185  0.045  0.309  -0.185  0.185    0.298     0.009          0.681           0.011

Note. RMSE = root mean square error; p = mean proportion-correct cutoff score; Method 1 = converted cutoff score using unweighted Method 1; Method 1 (wtd) = converted cutoff score using weighted Method 1; Method 2 = converted cutoff score using unweighted Method 2; Method 2 (appr) = converted cutoff score using the approximation formula for unweighted Method 2; Method 2 (wtd) = converted cutoff score using weighted Method 2; Method 3 = converted cutoff score using Method 3 (Kane, 1987).

Medium Variability (.10) .35 below the ICC. According to Table 14, when the standard deviation is set to .10 and the a priori value is set to 0 in the bias condition where the simulated ratings are set to .35 below the ICC, all six methods did a similar job of recovering the original θ* values, with bias means ranging from -.070 to -.237; the narrow range of bias means indicates that any of the methods would recover the original θ* values about equally well. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from -.070 to -.237. The means of the error index were low, ranging from .142 to .161, and the means of the consistency index were moderately high, ranging from .768 to .789, with low standard deviations.
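Each converted cutoff score compared in these tables starts from the same raw material: the panel's proportion-correct judgments, which Kane's (1987) methods map onto the θ scale. As a point of reference for how a test-level conversion of this kind can operate, the sketch below solves for the θ at which the test characteristic curve (the sum of the ICCs) equals the summed ratings. This is a simplified illustration under assumed 3PL item parameters, not the study's exact implementation of any of the six methods:

    import numpy as np

    def icc_3pl(theta, a, b, c):
        """Three-parameter logistic ICC with the D = 1.7 scaling constant."""
        return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

    def tcc_cutoff(ratings, a, b, c, lo=-4.0, hi=4.0, tol=1e-6):
        """Bisect for the theta where the test characteristic curve equals
        the summed item ratings. The TCC is monotone in theta, so bisection
        is safe as long as the target lies between TCC(lo) and TCC(hi); a
        target below the TCC's floor (the summed guessing parameters) has no
        solution, one face of the truncation problem at extreme theta*."""
        target = np.sum(ratings)
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if np.sum(icc_3pl(mid, a, b, c)) < target:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    # Illustrative 100-item exam with error-free ratings taken at theta* = -1.
    rng = np.random.default_rng(1)
    a = rng.uniform(0.5, 2.0, 100)    # assumed discriminations
    b = rng.uniform(-2.0, 2.0, 100)   # assumed difficulties
    c = np.full(100, 0.2)             # assumed guessing parameters
    ratings = icc_3pl(-1.0, a, b, c)  # a perfectly calibrated panel
    print(tcc_cutoff(ratings, a, b, c))  # approximately -1.0

With the ratings shifted a constant amount below the ICCs, the same inversion lands below θ*, which is the pattern the bias conditions in Tables 13 through 15 quantify.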
When the standard deviation is set to .10 and the a priori value is set to -1 in the bias condition where the simulated ratings are set to .35 below the ICC, Method 1 weighted and Method 3 did the best job of recovering the original θ* values, with bias means of .295. Method 2 weighted did the next best job, with a bias mean of .649. The rest of the methods did a similar job, with bias means ranging from .886 to 1.039. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from -.705 to .039. The means of the error index were low, ranging from .365 to .400, and the means of the consistency index were low, ranging from .428 to .470, with low standard deviations.

When the standard deviation is set to .10 and the a priori value is set to 1 in the bias condition where the simulated ratings are set to .35 below the ICC, Method 1 weighted and Method 3 did the best job of recovering the original θ* values, with bias means of -.389. Method 2 weighted did the next best job, with a bias mean of -.524. The rest of the methods did a similar job, with bias means ranging from -.681 to -.840. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from .160 to .611. The means of the error index were low, ranging from .190 to .273, and the means of the consistency index were moderate, ranging from .628 to .772, with low standard deviations.

When the standard deviation is set to .10 and the a priori value is set to -2 in the bias condition where the simulated ratings are set to .35 below the ICC, all six methods did a similarly poor job of recovering the original θ* value, with bias means ranging from 1.961 to 2.065. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from -.039 to .065. The means of the error index ranged from .359 to .361, and the means of the consistency index were moderately low, ranging from .483 to .499, with low standard deviations. The proportion-correct cutoff score generated in this condition was inflated because the sampling could not produce enough low ratings while maintaining the .10 standard deviation of rater variability at the -2 θ* level.
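This floor effect follows from the mechanics of sampling bounded ratings. The method chapters describe generating ratings from a beta distribution, and a method-of-moments parameterization makes the constraint explicit: a beta variable on [0, 1] can only achieve a target standard deviation s around a target mean m when s^2 < m(1 - m), so targets pushed toward 0 by the .35-below-the-ICC shift at θ* = -2 either fail outright or pile probability against the floor and raise the realized mean. A small sketch with illustrative target values (the exact generator used in the study may differ):

    import numpy as np

    def beta_params(mean, sd):
        """Method-of-moments alpha and beta for a target mean and SD.

        Only feasible when sd**2 < mean * (1 - mean); near the 0 or 1
        boundary, large SDs are unattainable."""
        var = sd ** 2
        if var >= mean * (1.0 - mean):
            raise ValueError("target SD unattainable at this mean")
        nu = mean * (1.0 - mean) / var - 1.0
        return mean * nu, (1.0 - mean) * nu

    rng = np.random.default_rng(2)

    # Mid-scale target: easily met.
    alpha_, beta_ = beta_params(0.50, 0.10)
    draws = rng.beta(alpha_, beta_, size=10_000)
    print(round(draws.mean(), 3), round(draws.std(), 3))  # ~0.50, ~0.10

    # Near the floor the variance budget shrinks: a mean of .02 allows
    # var < .0196, so an SD of .15 (var = .0225) is impossible.
    try:
        beta_params(0.02, 0.15)
    except ValueError as err:
        print(err)

In the simulation this shows up exactly as described above: at θ* = -2 the rating targets sit so close to zero that holding the .10 or .15 standard deviation forces higher draws, dragging the average proportion-correct cutoff upward.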
When the standard deviation is set to .10 and the a priori value is set to 2 in the bias condition where the simulated ratings are set to .35 below the ICC, all six methods did a similarly poor job of recovering the original θ* values, with bias means ranging from -1.378 to -1.953. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from .047 to .622. The means of the error index ranged from .351 to .370, and the means of the consistency index were moderately low, ranging from .481 to .580, with low standard deviations.

Table 14
Summary of Cut Scores Across Methods for Different A Priori θ* Values, Where Ratings Are .35 Below the Item Characteristic Curves (ICCs) on Average (Medium Variability, SD = .10)

                     Mean     SD  Range    Bias   RMSE  Error M  Error SD  Consistency M  Consistency SD
θ* = -2
p                   0.633  0.010  0.060
Method 1            0.031  0.021  0.135   2.031  2.031    0.360     0.006          0.494           0.009
Method 1 (wtd)     -0.039  0.045  0.319   1.961  1.961    0.361     0.006          0.483           0.011
Method 2            0.065  0.025  0.160   2.065  2.065    0.359     0.006          0.499           0.009
Method 2 (appr)     0.058  0.018  0.119   2.057  2.058    0.360     0.006          0.498           0.009
Method 2 (wtd)      0.015  0.022  0.136   2.015  2.015    0.360     0.006          0.491           0.009
Method 3           -0.039  0.045  0.319   1.961  1.961    0.361     0.006          0.483           0.011
θ* = -1
p                   0.570  0.009  0.050
Method 1           -0.113  0.019  0.126   0.886  0.887    0.394     0.006          0.427           0.007
Method 1 (wtd)     -0.705  0.046  0.290   0.295  0.299    0.365     0.008          0.470           0.013
Method 2           -0.100  0.023  0.140   0.900  0.901    0.395     0.006          0.428           0.007
Method 2 (appr)     0.039  0.017  0.112   1.039  1.039    0.400     0.005          0.438           0.006
Method 2 (wtd)     -0.351  0.022  0.145   0.649  0.650    0.383     0.007          0.427           0.010
Method 3           -0.705  0.046  0.290   0.295  0.299    0.365     0.008          0.470           0.013
θ* = 0
p                   0.560  0.005  0.030
Method 1           -0.125  0.012         -0.125  0.126    0.152     0.005          0.779           0.007
Method 1 (wtd)     -0.237  0.008         -0.237  0.237    0.142     0.004          0.789           0.006
Method 2           -0.127  0.013         -0.127  0.128    0.151     0.005          0.779           0.008
Method 2 (appr)    -0.070  0.010         -0.070  0.070    0.161     0.005          0.768           0.007
Method 2 (wtd)     -0.209  0.007         -0.208  0.209    0.143     0.004          0.788           0.006
Method 3           -0.237  0.008         -0.237  0.237    0.142     0.004          0.789           0.006
θ* = 1
p                   0.723  0.006  0.040
Method 1            0.319  0.017  0.120  -0.681  0.681    0.236     0.008          0.693           0.011
Method 1 (wtd)      0.611  0.014  0.120  -0.389  0.389    0.190     0.006          0.772           0.007
Method 2            0.312  0.019  0.110  -0.688  0.688    0.238     0.008          0.691           0.013
Method 2 (appr)     0.160  0.017  0.110  -0.840  0.840    0.273     0.008          0.628           0.012
Method 2 (wtd)      0.476  0.012  0.080  -0.524  0.525    0.207     0.006          0.744           0.009
Method 3            0.611  0.014  0.120  -0.389  0.389    0.190     0.006          0.772           0.007
θ* = 2
p                   0.669  0.009  0.070
Method 1            0.194  0.022  0.162  -1.806  1.806    0.365     0.006          0.508           0.009
Method 1 (wtd)      0.622  0.102  0.680  -1.378  1.382    0.351     0.008          0.580           0.017
Method 2            0.161  0.025  0.190  -1.839  1.839    0.366     0.006          0.502           0.010
Method 2 (appr)     0.047  0.019  0.127  -1.953  1.954    0.370     0.005          0.481           0.009
Method 2 (wtd)      0.266  0.028  0.187  -1.734  1.734    0.363     0.006          0.522           0.010
Method 3            0.622  0.102  0.680  -1.378  1.382    0.351     0.008          0.580           0.017

Note. RMSE = root mean square error; p = mean proportion-correct cutoff score; Method 1 = converted cutoff score using unweighted Method 1; Method 1 (wtd) = converted cutoff score using weighted Method 1; Method 2 = converted cutoff score using unweighted Method 2; Method 2 (appr) = converted cutoff score using the approximation formula for unweighted Method 2; Method 2 (wtd) = converted cutoff score using weighted Method 2; Method 3 = converted cutoff score using Method 3 (Kane, 1987).

High Variability (.15) .35 below the ICC.
According to Table 15, when the standard deviation is set to .15 and the a priori value is set to 0 in the bias condition where the simulated ratings are set to .35 below the ICC, the approximation formula for Method 2 did the best job of recovering the original θ* values, with a bias mean of -.131. Method 1 did the next best job, with a bias mean of -.174, followed by Method 2 at -.198 and Method 2 weighted at -.270. Method 1 weighted and Method 3 did the worst job, with bias means of -.328. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from -.328 to -.131. The means of the error index were low, ranging from .203 to .225, with low standard deviations, and the means of the consistency index were moderate, ranging from .672 to .696, with low standard deviations.

When the standard deviation is set to .15 and the a priori value is set to -1 in the bias condition where the simulated ratings are set to .35 below the ICC, Method 1 weighted and Method 3 did the best job of recovering the original θ* value, with bias means of .627. The rest of the methods did a similar job, with bias means ranging from .838 to 1.046. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from -.373 to .046. The means of the error index were low, ranging from .382 to .386, with low standard deviations, and the means of the consistency index were moderately low, ranging from .429 to .459, with low standard deviations.

When the standard deviation is set to .15 and the a priori value is set to 1 in the bias condition where the simulated ratings are set to .35 below the ICC, Method 1 weighted and Method 3 did the best job of recovering the original θ* value, with bias means of -.622. The rest of the methods did a similar job, with bias means ranging from -.747 to -.947. The means of the estimated θ* values across replications range from .053 to .378. The means of the error index were low, ranging from .262 to .305, with low standard deviations, and the means of the consistency index were moderate, ranging from .574 to .666, with low standard deviations.

When the standard deviation is set to .15 and the a priori value is set to -2 in the bias condition where the simulated ratings are set to .35 below the ICC, all six methods did a similarly poor job of recovering the original θ* values, with bias means ranging from 2.043 to 2.074. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from .043 to .074.
The means of the error index were low, ranging from .358 to .359, with low standard deviations, and the means of the consistency index were moderately low, ranging from .497 to .501, with low standard deviations. The proportion-correct cutoff score generated in this condition was inflated because the sampling could not produce enough low ratings while maintaining the .15 standard deviation of rater variability at the -2 θ* level.

When the standard deviation is set to .15 and the a priori value is set to 2 in the bias condition where the simulated ratings are set to .35 below the ICC, all six methods did a similarly poor job of recovering the original θ* value, with bias means ranging from -1.943 to -1.685. The RMSE means mirror the bias means. The means of the estimated θ* values across replications range from .057 to .315. The means of the error index were low, ranging from .354 to .358, with low standard deviations, and the means of the consistency index were moderately low, ranging from .499 to .540, with low standard deviations.

Table 15
Summary of Cut Scores Across Methods for Different A Priori θ* Values, Where Ratings Are .35 Below the Item Characteristic Curves (ICCs) on Average (High Variability, SD = .15)

                     Mean     SD  Range    Bias   RMSE  Error M  Error SD  Consistency M  Consistency SD
θ* = -2
p                   0.637  0.009  0.059
Method 1            0.062  0.020  0.129   2.062  2.062    0.359     0.006          0.499           0.009
Method 1 (wtd)      0.043  0.022  0.148   2.043  2.043    0.359     0.006          0.497           0.010
Method 2            0.074  0.024  0.160   2.074  2.074    0.358     0.006          0.501           0.009
Method 2 (appr)     0.064  0.017  0.108   2.064  2.064    0.359     0.006          0.500           0.009
Method 2 (wtd)      0.053  0.019  0.122   2.053  2.054    0.359     0.006          0.498           0.009
Method 3            0.043  0.022  0.148   2.043  2.043    0.359     0.006          0.497           0.010
θ* = -1
p                   0.596  0.009  0.064
Method 1           -0.044  0.021  0.136   0.956  0.956    0.385     0.005          0.448           0.007
Method 1 (wtd)     -0.373  0.061  0.428   0.627  0.630    0.382     0.006          0.429           0.010
Method 2           -0.032  0.024  0.160   0.968  0.969    0.385     0.005          0.450           0.007
Method 2 (appr)     0.046  0.018  0.116   1.046  1.046    0.386     0.005          0.459           0.007
Method 2 (wtd)     -0.162  0.027  0.173   0.838  0.839    0.384     0.006          0.436           0.008
Method 3           -0.373  0.061  0.428   0.627  0.630    0.382     0.006          0.429           0.010
θ* = 0
p                   0.532  0.006  0.041
Method 1           -0.175  0.015  0.096  -0.174  0.175    0.218     0.006          0.680           0.008
Method 1 (wtd)     -0.328  0.012  0.081  -0.328  0.328    0.203     0.005          0.696           0.008
Method 2           -0.198  0.016  0.110  -0.198  0.198    0.214     0.006          0.684           0.009
Method 2 (appr)    -0.131  0.012  0.082  -0.131  0.132    0.225     0.006          0.672           0.008
Method 2 (wtd)     -0.270  0.011  0.076  -0.270  0.271    0.207     0.005          0.692           0.008
Method 3           -0.328  0.012  0.081  -0.328  0.328    0.203     0.005          0.696           0.008
θ* = 1
p                   0.664  0.008  0.050
Method 1            0.159  0.019  0.120  -0.841  0.841    0.289     0.007          0.606           0.010
Method 1 (wtd)      0.378  0.021  0.153  -0.622  0.622    0.262     0.007          0.666           0.010
Method 2            0.148  0.021  0.130  -0.852  0.853    0.291     0.007          0.603           0.011
Method 2 (appr)     0.053  0.017  0.117  -0.947  0.947    0.305     0.007          0.574           0.011
Method 2 (wtd)      0.253  0.016  0.109  -0.747  0.747    0.277     0.007          0.634           0.010
Method 3            0.378  0.021  0.153  -0.622  0.622    0.262     0.007          0.666           0.010
θ* = 2
p                   0.648  0.009  0.067
Method 1            0.114  0.022  0.172  -1.886  1.886    0.357     0.006          0.508           0.009
Method 1 (wtd)      0.315  0.072  0.656  -1.685  1.685    0.354     0.006          0.540           0.014
Method 2            0.105  0.025  0.180  -1.895  1.895    0.357     0.006          0.507           0.009
Method 2 (appr)     0.057  0.018  0.117  -1.943  1.943    0.358     0.006          0.499           0.009
Method 2 (wtd)      0.153  0.024  0.171  -1.847  1.847    0.356     0.006          0.515           0.010
Method 3            0.315  0.072  0.656  -1.685  1.685    0.354     0.006          0.540           0.014

Note. RMSE = root mean square error; p = mean proportion-correct cutoff score; Method 1 = converted cutoff score using unweighted Method 1; Method 1 (wtd) = converted cutoff score using weighted Method 1; Method 2 = converted cutoff score using unweighted Method 2; Method 2 (appr) = converted cutoff score using the approximation formula for unweighted Method 2; Method 2 (wtd) = converted cutoff score using weighted Method 2; Method 3 = converted cutoff score using Method 3 (Kane, 1987).
Chapter 7

DISCUSSION

The present study replicated previous research (Hurtz et al., 2008) and expanded the range of simulated rater agreement and bias, in addition to expanding the range of θ*. The study compared the performance of Kane's (1987) methods for converting proportion-correct standard-setting judgments to a value on the θ scale used in scoring IRT examinations, using the restricted variance range from Hurtz et al. (2008). Overall, across all of the conditions, Method 1 weighted and Method 3 had the most success in recovering the original θ* values: they most often showed the least bias and the smallest error index values. The differences emerge in individual conditions, such as when bias is introduced or when the a priori θ* values are moved toward the extremes.

Unbiased Ratings

When the a priori θ* value is set to 0 and the simulated ratings have low variability, meaning that the simulated raters largely agree, there is very little difference among the six methods. As the a priori θ* values move away from 0 to 1 and negative 1, Method 1 weighted and Method 3 perform the best at recovering the original θ* values. As the a priori θ* values move to negative 2 and positive 2, none of the six methods gets close to the original θ* values, showing that at the extreme ranges the simulated ratings suffer from truncation and their variability is not conducive to recovering those values. With the a priori θ* value at negative 2, Method 2 comes closest, at only -1.261. This may reflect the difficulty of rating easy items: the simulated "expert" raters do not have much leeway in rating such items, items of this difficulty should be answered correctly by nearly any examinee, and identifying the minimally acceptable candidate is correspondingly harder. When the a priori θ* value is set to positive 2, Method 1 weighted and Method 3 come out ahead, but not by much, with a recovered θ* value of 1.046; items in this difficulty range are too difficult for raters to discern the minimally acceptable candidate. Additionally, these extreme ranges are less useful for most exams, because it is rare for an exam's purpose to be to pass nearly every candidate, or to withhold a license from nearly all of them.

When the variability among the simulated raters increases to a medium level, there is again little difference among the methods when the a priori θ* value is set to 0. As in the previous condition, Method 1 weighted and Method 3 perform the best when the a priori θ* value moves to negative and positive 1. With simulated raters mimicking the real performance of expert individuals, this suggests that these methods work best in the applied sense, because these θ* values are the ones most applicable when constructing real-world exams.
As the a priori θ* values move toward the extremes of negative and positive 2, Method 1 weighted and Method 3 are still the front runners at recovering the original θ* values, even though they do not get close to negative or positive 2. This again reflects that items in this range provide little information about the examinees who answer them correctly or incorrectly. At these extreme ability levels it is hard for human raters to discern the minimally acceptable candidate, and it is equally difficult for simulated ratings mimicking real raters.

As the variability of the simulated raters increases further, the same pattern emerges with the a priori θ* value set at 0. With the increased variability in the simulated ratings, as the a priori θ* value moves to negative and positive 1, Method 1 weighted and Method 3 emerge as the front runners, even though they do not get as close to the original θ* values as in the lower variability conditions. The same occurs when the a priori θ* values move to the further extremes of negative and positive 2. Method 1 weighted and Method 3 appear to be the most robust at recovering the original θ* values across conditions, suggesting that they are the best to use, especially when the simulated ratings are unbiased.

Throughout the unbiased conditions, the error and consistency indexes are strongest at the a priori θ* value of 0 and trend lower as the a priori θ* values approach the extremes, suggesting that the simulated ratings in those conditions would be less replicable if the process were performed again. This also confirms the lack of test and item information when the θ* values move out to the extremes, where both the simulated ratings and the conversion methods have difficulty.

Biased Ratings

In the first biased condition, where the simulated ratings are moved .15 above the ICC, when the a priori θ* value is set at 0 and there is low variability in the simulated ratings, all six methods do a similar job of recovering the original θ* values; however, they do not get as close to 0 as in the unbiased conditions. As expected, the recovered values are higher than in the unbiased conditions, ranging from .092 to .113. With bias introduced, as the a priori θ* value moves to negative 1, Method 2 gets closer than Method 1 weighted and Method 3 to the original θ* value, but not by much, with respective recovered values of -.749 and -.603. When the a priori θ* value moves to positive 1, however, Method 1 weighted and Method 3 come forth as the best methods to use, with values falling slightly above 1. The same pattern emerges at the negative and positive 2 a priori θ* values, suggesting that Method 1 weighted and Method 3 do better when the simulated ratings show the items as harder for examinees, but do not function as well when the simulated ratings suggest the items are easier; in those easier cases, Method 2 seems to perform best. Even so, the best-performing methods still land far from the original θ* values. For this condition, the consistency and error indexes indicate more replicable values in the negative 2 to positive 1 range, declining at the positive 2 a priori θ* value.
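These bias conditions amount to shifting every judgment a fixed amount above or below the item's true ICC value at θ* before adding rater noise. A minimal sketch of generating one 15-rater-by-100-item matrix under such a uniform bias, using normally distributed rater noise clipped to the unit interval as a stand-in for the study's beta-based sampling (the offset, noise model, and parameter values are illustrative):

    import numpy as np

    def simulate_ratings(icc_values, offset, sd, n_raters, rng):
        """One raters-by-items matrix of simulated Angoff judgments.

        icc_values: true probability correct for each item at theta*.
        offset: uniform bias relative to the ICC (+.15, +.35, -.15,
                or -.35 in the study's conditions).
        sd: target rater variability around the biased center.
        """
        center = np.clip(icc_values + offset, 0.01, 0.99)   # biased rating centers
        noise = rng.normal(0.0, sd, size=(n_raters, icc_values.size))
        return np.clip(center + noise, 0.0, 1.0)            # judgments stay in [0, 1]

    rng = np.random.default_rng(3)
    icc_at_theta = rng.uniform(0.3, 0.9, size=100)  # illustrative ICC values at theta*
    ratings = simulate_ratings(icc_at_theta, offset=0.15, sd=0.10, n_raters=15, rng=rng)
    print(ratings.shape)                            # (15, 100)
    print(ratings.mean() - icc_at_theta.mean())     # shift near +0.15 where clipping is slack

Averaging such a matrix over raters yields the item-level proportion-correct judgments that feed the conversions, and the clipping at the bounds is where the floor and ceiling effects described above enter: the farther the biased centers sit from the middle of the [0, 1] scale, the more the achieved means drift back toward it.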
In the bias condition where the simulated ratings are .15 above the ICC and the simulated rater variability is increased to medium, even at the a priori θ* value of 0 the recovered θ* values are higher than in the low variability condition, although there is little difference among the methods. When the a priori θ* value moves to negative 1 and negative 2, none of the methods does a decent job of recovering the original θ* values; the bias condition is doing its job, and the simulated ratings portray the items as more difficult. At the a priori θ* value of positive 1, Method 1 weighted and Method 3 perform well in recovering the original θ* value, making up for some of the deficiencies seen in the unbiased condition at this variability level. Once the a priori value moves to positive 2, the bias condition takes over, and all six methods do a poor job of recovering the original θ* value: biasing the ratings upward restricts the range, leaving the methods little room to maneuver, so they cannot reach the a priori θ* value of 2. The error and consistency indexes follow the pattern of the unbiased condition, showing stronger reliability at the a priori θ* value of 0 and degrading as that value moves to the extremes.

Once the variability of the simulated ratings moves to high, in the bias condition where the simulated ratings are generated .15 above the ICC, none of the methods comes close to recovering the original θ* values, although Method 1 weighted and Method 3 perform the best. When the a priori θ* values move into the positive range, the simulated ratings fail to generate, suggesting that when the simulated raters disagree this much, little can be done to obtain accurate ratings of the items.

In contrast, in the bias condition where the simulated ratings are generated .15 below the ICC, with low variability and the a priori value set to 0, all six methods do a similar job of recovering the original θ* value, but they follow the bias condition with negative values. When the θ* value is set to negative 1, all six methods produce values around negative 1 when attempting to recover the original θ* value. This is better performance than in the unbiased condition, with Method 1 weighted, Method 3, and Method 2 performing the best. As the a priori θ* value moves to negative 2, the recovered values remain similar to the negative 1 condition, showing that with the reduced range and the bias toward rating the items easier, there is not much room for the simulated ratings to go lower. As the a priori θ* value moves to positive 1, Method 1 weighted and Method 3 still get closest to recovering the original θ* value, showing their robustness across the bias conditions. Interestingly, when the a priori θ* value moves to positive 2, Method 1 weighted and Method 3 come close to recovering the original θ* value, at 1.151, suggesting that these methods work well at the upper ranges when the simulated rating bias is set to rate the items easier: the bias allows for more variability and thus lets the equations do their work effectively. This is somewhat dampened by the error and consistency indexes showing more issues at the positive 2 range, while demonstrating that these ratings could be replicated more consistently when the a priori θ* values are below positive 2.
In the bias condition where the simulated ratings are set at .15 below the ICC, the variability is set to medium, and the a priori θ* value is set to 0, all of the methods do a decent job of recovering the original θ* value. As the a priori θ* value moves to more extreme values, Method 1 weighted and Method 3 do the best job of recovering the original θ* values; however, even as the best performers they do not come close when the a priori θ* value moves to positive or negative 2. The error and consistency indexes are strongest at the a priori θ* value of 0 and decline as the a priori value moves to the extremes, reaching their lowest values at positive and negative 2.

In the bias condition where the simulated ratings are set to .15 below the ICC and the variability is set to high, the only condition in which any of the methods comes close to recovering the original θ* value is when the a priori θ* value is set to 0. In the rest of the θ* conditions, the high variability among the ratings keeps the methods from performing at their best, suggesting that as the θ* values move away from 0, agreement among the simulated raters becomes more important to recovering the original θ* values.

In the bias condition where the simulated ratings are set to .35 above the ICC, in the low variability condition, denoting that the simulated ratings judge the items as more difficult, all six methods do a decent job of recovering the a priori θ* value of 0. At the positive 1 a priori θ* value, Method 1 weighted and Method 3 emerge as the best methods to use, but at the negative 1 a priori θ* value, Method 2 performs the best, suggesting that the rater bias hinders Method 1 weighted and Method 3 in recovering the original θ* value. At the positive and negative 2 values, all of the methods do poorly in recovering the original θ* values, although Method 1 weighted and Method 3 do better at positive 2. This suggests that these two methods are the best ones to use when the practitioner knows the makeup of the raters in an applied setting. The error and consistency indexes are strongest in the negative 1 to positive 1 range and degrade as the a priori θ* value moves to the extremes. This is in part due to the restricted range at the extremes; given the direction of the bias condition, one would expect better performance at the positive a priori θ* values.

In the bias condition where the simulated ratings are set to .35 above the ICC, in the medium variability condition, the only condition in which Method 1 weighted and Method 3 do a decent job of recovering the original θ* value is when it is set to positive 1. In the rest of the conditions, because the simulated ratings are set so high, all six methods do poorly when trying to recover the original θ* values. The error and consistency values follow a pattern similar to previous conditions, strongest at 0 and degrading as the a priori θ* value moves to the extremes.

In the bias condition where the simulated ratings are set to .35 above the ICC, in the high variability condition, the only a priori value the methods come close to recovering is 0. In the negative conditions, none of the methods comes close, and in the positive conditions, the ratings fail to generate.
This is due to the high variability, combined with the simulated ratings sitting so far above the ICC, leaving no room at those extreme ranges for the equations to work as expected.

In the bias condition where the simulated ratings are set to .35 below the ICC, all six methods do best when the a priori θ* value is set to 0, and the recovered values are negative, as expected. At the negative 1 a priori θ* value, Method 1 weighted and Method 3 do a good job of recovering the original θ* value, because there is enough variability at that value for the equations to come close. Those methods also do a decent job at the positive 1 a priori θ* value, although they come up a little short of recovering it exactly, which is to be expected when the simulated ratings are set so low. At the positive 2 a priori θ* value, those two methods do the best job of recovering the θ* value, because of the simulated rater agreement. At the negative 2 a priori θ* value, Method 2 emerges as the best candidate to use, even though its recovered value is closer to negative 1.

In the bias condition where the simulated ratings are set to .35 below the ICC and the variability increases to medium, each of the methods does a decent job of recovering the original θ* value at the a priori θ* value of 0. Method 1 weighted and Method 3 perform best at the negative and positive 1 a priori θ* values, and none of the methods does a decent job at the negative and positive 2 a priori θ* values. As the simulated rater variability increases, it becomes harder to recover the original θ* values at the extreme ranges. In the bias condition where the simulated ratings are set to .35 below the ICC and the variability increases to high, the familiar pattern emerges: the methods perform similarly at the a priori θ* value of 0 and all do poorly at the extreme ranges.

Another issue that arose concerned the proportion-correct cutoff scores generated in the medium and high variability conditions at the -2 θ* values. Because the simulation tried to fit the standard deviation conditions, it forced the sampling of higher ratings, which drove the average proportion-correct cutoff scores up to values that would not be expected based on the shape of the ICCs.

Conclusion

Based on these results, Method 1 weighted and Method 3 are, for the most part, the most robust and useful methods. However, this applies only when one knows how the population of raters is going to judge the items; if the raters can be expected to judge items as easy or as hard, these methods can come close to recovering the original θ* values. Rater agreement is key to this process. As the variability in simulated rater agreement increases, each of the methods becomes less useful. For Angoff standard-setting panels, this means rater orientation and training become crucial in order to get the raters calibrated. When they disagree, especially at the extreme +/- 2 a priori θ* values, no method will recover the original θ* value with enough accuracy to be useful for exam purposes. It is also important to examine the purpose of the examination. If the exam serves a low screen-in purpose, such as determining whether a candidate has the basic ability to read or write, the Angoff method in conjunction with IRT may significantly overestimate the minimum competence threshold on the latent scale in the -2 θ* conditions.
Likewise, if the goal of the examination is to eliminate potential mistakes, as with a surgeon or police officer where public safety is at stake, this process may underestimate the minimum competence threshold on the latent scale in the 2 θ* conditions. If the purpose of the examination is the usual one of differentiating between qualified and unqualified candidates in the middle ability ranges (approximately +/- 1), then Method 1 weighted and Method 3 are the best to use when invoking an Angoff standard-setting method in conjunction with IRT.

For future research, expanding the range of θ* values might not be as useful, since increasing the variability of the simulated ratings leads to less accuracy in recovering the original θ* values. Investigating Hurtz et al.'s (2008) variance restrictions for Method 1 weighted and Method 3, and adjusting them to reflect the expanded theta range, may produce better performance from those methods. Additionally, the uniform bias conditions may have an effect on each method's performance; giving each rater an individual bias, with some rating easier and some rating harder, might prove more informative for applied settings. Another suggestion would be to constrain the proportion-correct cutoff scores when using simulated ratings in order to maintain the shape of the ICC and generate more realistic cutoff scores at the lower ability levels. Another possibility would be to use an actual exam with actual raters, which would allow the researcher to calibrate the ratings instead of letting the simulation run its course. The items used in this study were simulated, which may artificially flatter the conversion methods; applying this procedure to multiple actual exams with actual raters could give a better indication of how these methods perform with real data. When the θ* ability level is in the range that most likely captures minimum competence thresholds, roughly -1 to 1, these conversions work well, especially Method 1 weighted and Method 3. Method 2 comes closest to recovering the original θ* values in extreme conditions where the simulated ratings around a particular θ* value run contrary to expectation, such as in the positive bias conditions at negative a priori θ* values. When using Angoff standard-setting panels in conjunction with IRT, the underlying mathematics will convert the CTT Angoff ratings back into an IRT metric, but the quality of those conversions depends on the makeup of the raters and the amount of agreement among them, and on the purpose and goal of the exam.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508-600). Washington, DC: American Council on Education.

Baker, F. B. (2001). The basics of item response theory. College Park, MD: ERIC Clearinghouse on Assessment and Evaluation, University of Maryland.

Cascio, W. F., Alexander, R. F., & Barrett, G. V. (1988). Setting cutoff scores: Legal, psychometric, and professional issues and guidelines. Personnel Psychology, 41, 1-24.
Cizek, G. J. (2006). Standard setting. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 225-258). Mahwah, New Jersey: Lawrence Erlbaum Associates.

Ellis, B. B., & Mead, A. D. (2002). Item analysis: Theory and practice using classical and modern test theory. In S. Rogelberg (Ed.), Handbook of research methods in industrial and organizational psychology (pp. 324-343). Malden, Massachusetts: Blackwell.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Ferdous, A. A., & Plake, B. S. (2005). The use of subsets of test questions in an Angoff standard-setting method. Educational and Psychological Measurement, 65(2), 185-201.

Ferdous, A. A., & Plake, B. S. (2007). Item selection strategy for reducing the number of items rated in an Angoff standard setting study. Educational and Psychological Measurement, 67(2), 193-206.

Ferdous, A. A., & Plake, B. S. (2008). Item response theory based approaches for computing minimum passing scores from an Angoff-based standard-setting study. Educational and Psychological Measurement, 68(5), 778-796.

Fischer, G. H., & Molenaar, I. W. (1995). Rasch models: Foundations, recent developments, and applications. New York: Springer-Verlag.

Fishman, G. S. (1996). Monte Carlo: Concepts, algorithms, and applications. New York: Springer.

Guion, R. M. (1998). Assessment, measurement, and prediction for personnel decisions. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Hambleton, R. K., & Cook, L. L. (1977). Latent trait models and their use in the analysis of educational test data. Journal of Educational Measurement, 14(2), 75-96.

Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer-Nijhoff Publishing.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. London: Sage Publications.

Hurtz, G. M., Jones, J. P., & Jones, C. N. (2008). Conversion of proportion-correct standard-setting judgments to cutoff scores on the item response theory θ scale. Applied Psychological Measurement, 32(5), 385-406.

Hurtz, G. M., Muh, V., Pierce, M., & Hertz, N. (2012). The Angoff method through the lens of latent trait theory: Theoretical and practical benefits of setting standards on the latent scale (where they belong). Paper presented at the SIOP Conference, San Diego, California.

Kachigan, S. K. (1986). Statistical analysis: An interdisciplinary introduction to univariate and multivariate methods. New York: Radius Press.

Kalos, M. H., & Whitlock, P. A. (1986). Monte Carlo methods (Vol. 1: Basics). New York: Wiley-Interscience.

Kane, M. T. (1987). On the use of IRT models with judgmental standard setting procedures. Journal of Educational Measurement, 24(4), 333-345.

Livingston, S. A., & Zieky, M. J. (1982). Passing scores: A manual for setting standards of performance on educational and occupational tests. Princeton, NJ: Educational Testing Service.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, New Jersey: Lawrence Erlbaum Associates.

Mooney, C. Z. (1997). Monte Carlo simulation. London: Sage.

Norcini, J. J., Shea, J. A., & Ping, J. C. (1988). A note on the application of multiple matrix sampling to standard setting. Journal of Educational Measurement, 25(2), 159-164.

Plake, B. S., & Kane, M. T. (1991). Comparison of methods for combining the minimum passing levels for individual items into a passing score for a test. Journal of Educational Measurement, 28(3), 249-256.

Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: The University of Chicago Press.
Reckase, M. (2006). A conceptual framework for a psychometric theory for standard setting with examples of its use for evaluating the functioning of two standard setting methods. Educational Measurement: Issues and Practice, 25(2), 4-18.

Samejima, F. (1988). Comprehensive latent trait theory. Behaviormetrika, 24, 1-24.

Williams, V. S., Pommerich, M., & Thissen, D. (1998). A comparison of developmental scales based on Thurstone methods and item response theory. Journal of Educational Measurement, 35(2), 93-107.

Woodruff, D. (1990). Conditional standard error of measurement in prediction. Journal of Educational Measurement, 27(3), 191-208.