EFFECTS OF SIMULATED RATER BIAS ON PREDETERMINED STANDARD SETTING JUDGMENTS A Thesis

advertisement
EFFECTS OF SIMULATED RATER BIAS ON PREDETERMINED STANDARD
SETTING JUDGMENTS
A Thesis
Presented to the faculty of the Department of Psychology
California State University, Sacramento
Submitted in partial satisfaction of
the requirements for the degree of
MASTER OF ARTS
in
Psychology
(Industrial/Organizational Psychology)
by
Charles Howard Strike
FALL
2013
EFFECTS OF SIMULATED RATER BIAS ON PREDETERMINED STANDARD
SETTING JUDGMENTS
A Thesis
by
Charles Howard Strike
Approved by:
__________________________________, Committee Chair
Gregory M. Hurtz, Ph.D.
__________________________________, Second Reader
Lawrence S. Meyers, Ph.D.
__________________________________, Second Reader
Timothy Gaffney, Ph.D.
____________________________
Date
ii
Student: Charles Howard Strike
I certify that this student has met the requirements for format contained in the University
format manual, and that this thesis is suitable for shelving in the Library and credit is to
be awarded for the thesis.
__________________________, Graduate Coordinator
Jianjian Qin, Ph.D.
Department of Psychology
iii
___________________
Date
Abstract
of
EFFECTS OF SIMULATED RATER BIAS ON PREDETERMINED STANDARD
SETTING JUDGMENTS
by
Charles Howard Strike
This study is designed to explore the usefulness of different strategies used to convert
item-level proportion correct standard-setting judgments in to a θ-metric test cutoff score
that can be used with item response theory (IRT) scoring using Monte Carlo simulations.
Simulated Angoff ratings, consisting of 1000 independent 100 item by 15 rater matrices
were generated at five points along the θ continuum, ranging from negative 2 to positive
2, at five levels of rater bias in regards to the item characteristics curves. A total of
37,500,000 ratings were generated as the basis of the analyses. These simulated
proportion-correct ratings were converted to the IRT θ scale using test-level and itemlevel methods developed by Kane (1987). Overwhelmingly, Kane’s method 1 weighted
and method 3 performed the best when recovering the original θ values.
_______________________, Committee Chair
Gregory M. Hurtz, Ph.D.
_______________________
Date
iv
ACKNOWLEDGMENTS
I would like to acknowledge my Committee Chair, Gregory M. Hurtz, Ph. D., for all of
his efforts reviewing my work, and the support and encouragement given throughout this
process. My second and third readers, Lawrence S. Meyers, Ph.D., and Timothy
Gaffney, Ph. D. were important in providing encouragement and feedback in order for
successful completion of this project, and I am thankful for their patience and support. I
would also like to thank my family and friends who have encouraged me and supported
me throughout my educational efforts and who have allowed me to get to where I am
today. I want to thank my boss Kelli Johnson for allowing me the time off work to get
this process completed. Thank you to everyone who contributed to this project and
providing me the motivation to complete this project.
v
TABLE OF CONTENTS
Page
Acknowledgments......................................................................................................... v
List of Tables .............................................................................................................. ix
List of Figures ............................................................................................................... x
Chapter
1. INTRODUCTION ……………….…………………………….………………… 1
Latent Trait Theory ........................................................................................... 2
Standard Setting……………………………………. ....................................... 3
Standard Setting Judgments .............................................................................. 7
Standard Setting Evaluation .............................................................................. 8
2. ESTIMATION OF TEST SCORES ..................................................................... 10
Test Score Estimation ..................................................................................... 10
Classical Test Theory…………………………………. ................................. 10
Item Response Theory .................................................................................... 18
Rasch Model ................................................................................................... 21
One Parameter Model…………………………………… ............................. 22
Two Parameter Logistic Model ...................................................................... 23
Three Parameter Logistic Model…………………………………. ............... 24
IRT Assumptions ............................................................................................ 25
Item and Test Information…………………………………… .......................25
vi
Test Characteristic Curve ................................................................................ 27
Standard Error of Measurement…………………………………. ................. 28
Invariance…………………………………. ................................................... 30
Parameter Estimation ...................................................................................... 31
3. ANGOFF METHOD AND THE LATENT TRAIT CONTINUUM ................... 34
Angoff Method ............................................................................................... 34
4. MONTE CARLO SIMULATIONS ...................................................................... 39
Monte Carlo .................................................................................................... 39
Beta Distribution ............................................................................................. 42
5. METHOD ............................................................................................................. 43
Purpose of the Study ....................................................................................... 43
Minimum Passing Level…………………………………. ............................ 44
Length of Test and the Number of Items ........................................................ 46
Simulated Rating Data .................................................................................... 47
IRT to CTT Conversion .................................................................................. 48
Data Analysis .................................................................................................. 54
6. RESULTS ............................................................................................................. 55
Unbiased Ratings ............................................................................................ 55
Biased Ratings ................................................................................................ 68
7. DISCUSSION ..................................................................................................... 115
vii
Unbiased Ratings ...........................................................................................115
Biased Ratings ...............................................................................................117
Conclusion .....................................................................................................124
References ..................................................................................................................127
viii
LIST OF TABLES
Tables
Page
1.
Unbiased Ratings Low Variability…….……………...……………..…………...59
2.
Unbiased Ratings Medium Variability.……………..……………………….......63
3.
Unbiased Ratings High Variability…….…………………..…………………….67
4.
.15 Above the ICC Low Variability….…………………..…………………........71
5.
.15 Above the ICC Medium Variability.………………..……………………….75
6.
.15 Above the ICC High Variability….………………….…………………........79
7.
15 Below the ICC Low Variability…….………………………...………………83
8.
.15 Below the ICC Medium Variability…………………………….………........87
9.
.15 Below the ICC High Variability…….…………………………..…………...91
10.
.35 Above the ICC Low Variability.…………………………………………......95
11.
.35 Above the ICC Medium Variability………………………………………....99
12.
.35 Above the ICC High Variability.………………………………………..….102
13.
.35 Below the ICC Low Variability.…………………………………………....106
14.
.35 Below the ICC Medium Variability…………………………………….......110
15.
.35 Below the ICC High Variability.……………………………………….......114
ix
LIST OF FIGURES
Figures
Page
1.
Item characteristic curve.............…………….…………………………………19
2.
Test characteristic curve ……………………...………………………………...28
Chapter 1
INTRODUCTION
Exams are important in the selection of employees, especially when the jobs
require licensure and certification testing. Performance based standards have to be
established so candidates taking those exams can be accurately assessed. Over time,
several procedures have been developed in order to get these performance-based
standards (Kane, 1987). Candidates who take these exams must demonstrate a specified
amount of knowledge in order to meet the specified criteria. Having this knowledge
demonstrates that the candidate can successfully perform the task that he or she is
licensed or certified to do so. To determine the amount of knowledge needed to perform
the job successfully, the exam creators must use standard-setting methods using specific
procedures to determine the minimum score necessary for the candidate to perform at the
required standard (Reckase, 2006)
This minimum score is referred to as a cutoff score, which is used to make
x
2
decisions that affect the lives of real people. This means that these scores must be valid
and credible, and when they do not meet such standards, they are subject to legal
defensibility reviews (Hurtz, Muh, Pierce, & Hertz, 2012). These scores result in
standards that must be able to be compared over time. They also must be able to equate
pass/fail decisions so that any one test-taker is not given an advantage or disadvantage
over others because of poor exam construction (Hurtz et al., 2012). The focus of these
standard setting methods is to define how much of a latent trait an examinee needs to
have in order to perform successfully on an exam. For the purposes of this study, the
standard derived from the standard setting method is the desired ability level along the
latent trait continuum, where the cut-score is the operational definition of the ability level
for the specific exam (Hurtz et al., 2012).
The most relevant research to the present study was a study by Hurtz, Jones, and
Jones (2008), which evaluated these methods and provided ideas for further research.
CTT conversions at the item level with different weighting schemes were evaluated using
fixed values of θ* as representations of the judges minimum competency
conceptualizations. The items used were from a published exam by Hambelton (as cited
by Hurtz et al., 2008) consisting of 75 items. The population standard deviation was set
at .09, and examined rater bias as over and under estimation of +/- .10 of each ICC
(Hurtz, et al., 2008). The present study is an expansion of the previous study and used
different methods of determining rater bias, agreement, and examinee ability levels, with
a randomly generated item sample instead of the previous study’s use of the Hambleton
3
seventy-five item exam.
Latent Trait Theory
In testing situations, the defining characteristics of examinees, called traits, can
predict examinee performance. Scores are estimated based on these traits and then used
to predict performance. These traits are unobservable and cannot be directly measured,
so they are referred to as latent traits (Hambleton & Cook, 1977). In order to determine
the relationship between the observable test score and the unobservable latent traits, a
mathematical model is used, referred to as a latent trait model. This model relies on
assumptions, and the assumptions are used to see how well the model fits the test data
according to standard IRT practice. If the assumptions of a selected model are not met, a
different model should be used (Hambleton & Cook, 1977). Latent trait theory was
originally developed as a mental test theory. It began with the dichotomous response
model, referring to each item on an exam being scored either correct or incorrect. The
methods have advanced since their inception, but the dichotomous response model is the
focus of the present study. An examinee’s ability is hypothesized and can only be
measured by examining responses to concrete entities, called items (Samejima, 1988).
Standard Setting
Standard setting is a major part of testing, because of the need for examinees to
demonstrate the knowledge they possess, usually determined by answering a specific
number of questions correctly. The knowledge that the examinee possesses is the
unobservable latent trait and his or her score is the observable portion of the model.
4
Using this perspective can assist the researcher in finding a theoretical framework in
order to practically compare the credibility of different standard setting methods (Hurtz et
al., 2012). This information is used in certification, licensure and selection exams where
the examinee has to demonstrate a minimum amount of knowledge to pass. One purpose
is to set several different passing grades or bands. Examinees are grouped into one of the
different passing grades, and the employer considers the top band or bands when making
employment selections. All candidates in each band are considered as equal and are
selected at random. Once the number of candidates in a single band is exhausted, the
employer drops down to the next lowest band (Cascio, Alexander, & Barrett, 1988). This
is the type used most commonly in civil service such as in the State of California.
Another common method is top-down selection, in which the employer selects the
examinee that scores the highest on the exam first and moves down the list until the
openings are filled.
When determining which examinees have sufficient knowledge of the subject
matter, the test needs to be fitted with a cut score, or a standard of knowledge necessary
to pass the examination. Standard setting refers to the process of establishing cut scores
on examinations. In these situations, standard setting helps to create categories such as
pass/fail, allow/deny a license, or award/withhold a credential. To set standards, a system
of rules or procedures must be followed to assign a number that differentiates between
two or more levels of performance. Additionally, more than two categories such as basic,
proficient, and advanced are recommended, commonly used to imply differing degrees of
5
achievement (Cizek, 2006). Determining how the standards are set must be considered
early in the process so that it matches the purpose of the test, the selected test item, or the
test format. The standard setting process should be able to identify relevant sources of
evidence affecting the validity of the assigned categories.
Performance standards are used to set passing scores. The desired competence
level, or latent trait, serves as the standard, which is translated to a specific number of
correct items on an exam. Standard-setting procedures give participants the opportunity
to use personal judgments about organizational policy to set a specific position on a score
scale (Cizek, 2006). Several factors influence passing scores; the most important being
how to determine the expertise or knowledge required for the purpose of the exam. This
includes licensing medical personnel, where judgments need to be made about public
health and safety. Even though standard setting is supposed to be objective, value
judgments in addition to the technical and empirical considerations influence the decision
(Cizek, 2006).
The need for establishing the standard must be determined. In some cases, a
passing score is not appropriate, and participants involved in the standard setting process
need to know what the purpose is. Once the participants know this information, it allows
them to make informed decisions regarding what is required to pass or fail. To make sure
that the participants know what is necessary, they need to participate in an orientation
process to outline the purpose and other necessary information needed to set the standard
for that specific exam (Cizek, 2006). One example involves revising standards,
6
specifically the amount of mathematical expertise a particular elementary school needs to
be comparable to other schools. The orientation process informs the participants of this
need as well as any economical and social issues that may make a difference on the final
standard. Additionally, examples of the competencies defined by the content standards
are provided. With licensure and certification examinations, participants receive
information about the consequences of incorrect credentialing decisions; such as
licensing an unsafe practitioner as more dangerous than failing to license a competent
person (Cizek, 2006).
Organizational and legal policies have a large impact on the rejection of a
standard-setting panel’s recommended cut-score in favor of a higher performance
standard. An additional component to participant orientation is to give the participants
information regarding examinee latent traits, or their ability level when taking the exam.
This involves giving participants information about examinees and their ability levels to
help balance out the exam in addition to organizational information to give a clear picture
of what an examinee is required to do in order to pass the exam.
Standard setting can occur at different times during the development of an exam.
It can occur after administration of a live test form with consequences for the examinees,
or it can be done after pilot testing before it is given as a live exam. Standards set using
normative p values based on live exam results are more accurate and stable because the
examinees are operating under real exam conditions and the p values given to the judges
are more accurate. Real exam data is more useful for the standard setting procedures,
7
because data collected during a “no-stakes” test administration does not provide adequate
motivation for the examinees (Cizek, 2006).
Generally, standards are set through many different methods depending on the
purpose and the resources available. They involve using subject matter experts (SMEs)
to set these standards in some form of a workshop. This gathers experts who know the
material and the necessary knowledge to qualify as proficient for the purpose of the
exam. The workshops should be structured in a way that allows SMEs to efficiently
review the information and the purpose of the exam so that the standards can be set as
accurately and reliably as possible. Typical best practices involve gathering wellcalibrated items that match the exam plan in the form of one or more complete test forms.
After the items are rated, they need to be converted and linked to some kind of standard.
An argument can be made that they should be linked to the latent trait scale because it
can make the standard setting process more efficient and precise. Additionally, less
resources are used and the standards set are more consistent and reliable (Hurtz, et al,
2012)
Standard Setting Judgments
It is crucially important to properly identify and train qualified participants. The
judges need to be representative and large enough to ensure the judgments can be
replicated. The expertise of the participants needs to be documented, which limits the
pool of judges to those with sufficient experience, so the panels can only be
representative of the experts in the field (Cizek, 2006). The selected participants are
8
provided with additional information. They get feedback consisting of normative, reality,
or impact information. This is designed to help judges make decisions in each iterative
round (Cizek, 2006). Additionally, they need to be able to consider examinee ability, and
identify how an examinee may perform on a specific exam based on the latent trait being
tested. These judgments must be made based on the latent trait model used, and the
judges must be properly trained on what they need to rate.
Standard Setting Evaluation
All standard setting procedures must have documentation regarding all steps taken
during the process. The method, test design, test purpose, agency goals, and the
participant’s characteristics all must be examined (Cizek, 2006). The process must be
externally evaluated to ensure that the process was done correctly, and few if any
deviations of the established principles occurred. Any deviation must be reasonable,
specified in advance, and consistent with the prescribed goals (Cizek, 2006). Internal
evaluation must be conducted and documented to determine participant agreement and if
any one participant had undue influence over the process, and how participant agreement
was achieved. A minimum of two evaluations are recommended, involving examining
how the judges reached their ratings, usually involving a standard set of evaluative
questions decided upon before the standard setting session (Cizek, 2006).
Many different standard setting methods exist, and the Standards for Educational
and Psychological Testing (Standards) (1999) does not endorse one specific method.
The Standards include methods that make judgments about the test content or about the
9
test takers. The participants need help to make informed judgments that can be
reproduced and are fair to each examinee. The method selected needs to fit the purpose
and format of the examination and the scoring model of the exam (Cizek, 2006).
The way that standards are set relies heavily on individuals making judgments on
specified criteria, most commonly a score that a minimally competent candidate would
have to achieve in order to demonstrate that they have the requisite knowledge deemed
necessary to perform a function. These individuals are most commonly making
judgments based on a Classical Test Theory (CTT) based score, which as outlined above,
can be problematic. With the new emphasis on Item Response Theory (IRT), these
judgments must be transformed into the values necessary for computation of θ and the
item parameters.
10
Chapter 2
ESTIMATION OF TEST SCORES
Test Score Estimation
The standards are set based on the process of estimating examinee results. The
test theories discussed here, Classical Test Theory (CTT), Item Response Theory (IRT),
and Rasch Modeling (RM) all have their own ways of estimating examinee scores.
Classical Test Theory
Classical psychometric theory assumes that an obtained score from an exam is
composed of two parts, the examinee’s true score and an error component. Other
assumptions include the true score and the error score are not correlated, error scores
from one measure are not correlated with error scores obtained in other measures, and
error scores and true scores are not correlated when obtained from different measures.
These assumptions imply that error scores are random and unpredictable. This does not
consider systematic errors made during repeated testing (Guion, 1998). Practically, an
examinee’s actual score is of more interest than the true score, and the actual score is a
combination of a systematic score and random error. This means that there are two error
scores. The first is individual error affecting one examinee and the second is a systematic
error that affects every examinee that takes the test. Both error scores influence the
measure that is used, and comprise the total variance in a set of scores in the form of
systematic causes and random error (Guion, 1998).
11
CTT is interested in the true score and estimates attributes by using a linear
combination of test item responses composed of the true score, the observed score, and
the error (Ellis & Mead, 2002). The true score is an examinee’s expected score on a test
over repeated administrations or across parallel test forms. The observed score is
expressed as:
𝑋 =𝑇+𝐸
The unobserved score, or error, E, is composed of the difference between an observed
and a true score (Ellis & Mead, 2002). Observed scores are random variables with
unknown distributions, and the mean of the theoretical distribution of a group of observed
scores represents the true score concept or:
∈ (𝑋) = 𝑇
T cannot be observed but its properties allow useful inferences to be made (Ellis & Mead,
2002). Examinees can have a low score because their true score is low, their error score is
low, or a combination of both conditions. The lowest scoring examinees on any given
test most likely have low T and low E scores, indicating that observed scores on repeated
examinations would be higher because the error scores vary on each administration. Both
high and low scoring examinees receive scores closer to the mean on repeated
measurements, a concept known as regression to the mean (Kachigan, 1986). The true
score can only be estimated because of its mathematical abstraction, and the researcher
must use the estimate to see how well it fits the particular model in regards to the
practicality of the model, score and fit (Lord, 1980).
12
Another assumption of CTT is that the expected value or mean error score for a
population of examinees is zero, which is represented by the following equation:
𝜇𝐸 = 0
Additionally, the correlation between true score and error for a population of examinees
is assumed to be zero shown by this equation:
𝜌𝑇𝐸 = 0
Finally, the correlation between error on test one (E1) and error on test two (E2) is
assumed to be zero, or:
𝜌𝐸1 𝐸2 = 0
Practitioners want to know the true score, but can only use the observed score, leaving
the practitioner to determine the relationship between both scores (Ellis & Mead, 2002).
The reliability index (ρXT) refers to the correlation between observed scores and true
scores from a population of examinees, expressed as the ratio of standard deviations of
true scores and observed scores:
𝜌𝑋𝑇 = 𝜎𝑇 /𝜎𝑋
The reliability index is unknown because the standard deviation of the true score
distribution is unknown (Ellis & Mead, 2002). The consistency of a set of measures in
regards to a specific trait and individual systematic errors determines reliability, and
random error has little to no effect on reliability. The more reliable the test is, the less
random error the test possesses, with smaller error variances indicating that the test is
reliable (Guion, 1998).
13
Reliability can be determined by repeated testing, but a more useful way is to use
parallel forms to eliminate the practice effect of answering the same questions repeatedly.
Parallel test forms are composed of different items of similar difficulty covering the same
information. The reliability is determined by how similar an examinee’s scores are on
each form. Practically, a single score that depicts the examinee’s behavior is desired, so
the scores across the test forms are averaged and the result is interpreted as if the
examinee took a single exam. This score is more reliable because it is based on a larger
sample of behavior and is more representative of an examinee’s true behavior (Lord,
1980).
Usually, when determining test reliability, examinees are tested on the same test
twice or by using two parallel tests. However, strictly parallel tests are hard to achieve,
because that refers to the occasion when the examinee’s true score and error variance are
the same. This is hard to do because the items have to be different but similar in
difficulty. However, difficulty can only be approximated based on how much
information you have on the items. This depends on the resources that you have before
the exam is put into use, and many times this is not practical. Approximation leads to a
correlation between examinee scores on parallel tests, which is known as the reliability
coefficient (Ellis & Mead, 2002). The reliability coefficient is an estimate of the square
of the reliability index, which estimates the proportion of the total systematic variance.
(Guion, 1998). Mathematically speaking, the reliability coefficient is the ratio of true
score variance to observed score variance:
14
𝜌𝑋1𝑋2 = 𝜎𝑇2 /𝜎𝑋2
This implication means that if a test had perfect reliability, there would be no error,
which is theoretically possible, but is usually not achieved in the real world. Item
analysis using CTT is designed to maximize internal consistency estimates of reliability
using coefficient alpha, expressed as a decimal between 0 and 1 (Ellis & Mead, 2002).
The closer the value gets to 1, the exam in question has less random error variance and is
determined to be reliable (Guion, 1998).
CTT item analysis is used to determine item characteristics of difficulty and
discrimination. Item difficulty refers to how well items fit a target population and ranges
from 0 to 1, with values nearer the limits providing little or no useful information. Item
difficulty is related to the total test score, which determines item variance. Information
about examinee differences and total test score variance is maximized when pi = .50
assuming the inter-item correlations are held constant. Item discrimination is the
determination of an examinee’s knowledge, in that examinees who have not mastered the
material will not get the item correct and examinees who have mastered the material will
get the item correct. Item discrimination refers to the difference between the percentage
correct for each of these groups. Item discrimination indices include the D index, the
point biserial correlation, and the biserial correlation (Ellis & Mead, 2002). The D index
refers to the difference in the proportion of examinees passing an item for overall upper
(Pu) and lower (Pi) groups:
𝐷 = 𝑃𝑢 − 𝑃𝑖
15
that are defined by the upper and lower percentages, usually defined as the top and
bottom 27% of the distribution (Ellis & Mead, 2002). The point biserial indices show the
relationship between the total test score and examinees’ performance on an item,
computed by this formula:
𝑟𝑝𝑏𝑖𝑠 =
(𝑀+ − 𝑀𝑇 )
√𝑝⁄𝑞
𝑆𝑇
in which M+ is the mean of the scores for the examinees successfully passing an item; MT
refers to the mean of the test scores; p is the item difficulty; and q is 1 – p (Ellis & Mead,
2002). T he biserial correlation assumes that the latent variable underlying the item
response is continuous and possesses a normal distribution and is computed by this
formula:
𝑟𝑏𝑖𝑠 =
(𝑀+ − 𝑀𝑇 )
(𝑝⁄𝑌)
𝑆𝑇
in which Y is the height of the standard normal distribution at the z-score separating the
area under the curve proportionately between p and q (Ellis & Mead, 2002). The rpbis is
always smaller than rbis shown by this equation:
𝑟𝑝𝑏𝑖𝑠 =
𝑌
√𝑝𝑞
𝑟𝑏𝑖𝑠
Choosing which distribution to use depends on how practical the information is. The D
index is easier to calculate, but the correlational indices may give better information
depending on the characteristics of the analysis. If the items have moderate difficulty,
very little difference can be observed between the three methods (Ellis & Mead, 2002).
16
When item difficulties are in extreme ranges, or when different examinees are sampled,
or the developer prefers to use indices of discrimination and difficulty that are
independent of each other, rbis is the preferred method. A disadvantage with rbis is that it
can yield coefficients over 1.00 if the underlying assumptions are violated. If items with
high internal consistency are desired, rpbis would be a better choice (Ellis & Mead, 2002).
Additionally, rpbis gives more information about the test’s predictive validity because it
works well with moderately difficult items. These last two methods provide more
information than the D index, which can discard a third or more of the data.
CTT is based on weak assumptions easily met by most test data sets, and the
models have been applied to multiple varieties of test development and test score analysis
problems, but the item difficulty index, the item discrimination index, the observed score,
and the true score are completely sample and administration dependent (Ellis & Mead,
2002; Hambleton & Swaminathan, 1985; Lord, 1980). Other shortcomings exist when
dealing with CTT, such as examinee characteristics and test characteristics cannot be
interpreted independently. Examinee ability is test dependent, and cannot be compared
outside of a particular test. Test item difficulty is examinee dependent, meaning that
items are rated easy or difficult because of the characteristics of the people taking the test
(Hambleton, Swaminathan, & Rogers, 1991). Calculating examinee ability is dependent
on the test item difficulty, and all of the item statistics such as item discrimination,
reliability, and validity are dependent on the group of examinees taking that particular
examination. This means that these statistics change after every exam administration.
17
Group-dependent item indices cannot be used when tests are constructed for
examinee populations if they have different characteristics to the examinees who
provided the indices. This makes comparisons extremely difficult or impossible because
the scores are dependent on the test in addition to being based on two different scales
(Hambleton et al, 1991). This even occurs when examinees take the same or parallel
tests, because examinees possess different ability levels and the amount of error is
different. It is more desirable to have examinees answer some number of items correct
and some number of items incorrect, because it provides information about the
examinee’s ability, and can give a precise ability level. If two examinee scores contain
equal amounts of error, it allows test difficulty to be matched with approximate ability
levels (Hambleton et al., 1991). CTT also has issues with defining reliability and the
standard error of measurement (SEM). SEM is a function of test score reliability and
variance, as shown by the following formula:
𝑆𝐸 = 𝑆𝑥 √1 − 𝑟𝑥𝑥
where SE is the standard error, Sx is the test score standard deviation, and rxx is the test
reliability. An assumption of CTT is that it is the same for all examinees, where
reliability is the correlation between test scores on parallel forms (Hambleton et al.,
1991). Several methods of finding this correlation exist, but the problem is meeting the
definition of parallel tests, which is very difficult using CTT if not impossible.
Reliability coefficients, such as alpha, can provide lower bound estimates of reliability, or
reliability estimates with unknown biases. This can result in scores on an exam not being
18
precise measures of examinees with different ability levels. This means that the
assumption of equal errors of measurement for all examinees is implausible (Hambleton
et al., 1991). CTT is oriented to the test rather than the item, and the classical true score
model can predict examinee responses to any given item in a linear fashion, but the
accuracy suffers. CTT provides less-than-ideal solutions to many testing problems such
as test design, identification of biased items, adaptive testing and test score equating.
Finding an alternative test theory, researchers need to look at finding one that has items
characteristics that are not group-dependent, examinee scores that are not test dependent,
a reliability measure that does not require parallel tests, and finding a model that
precisely measures ability scores.
Item Response Theory
The concepts of IRT began in 1906 when Binet and Simon plotted performance
levels in relation to an independent variable. Thurstone developed a method of paired
comparisons in 1928 that can be used to scale a collection of stimuli. Richardson was
able to derive relationships between IRT models and classical item parameters providing
a method of obtaining IRT parameter estimates in 1936. In 1952, Lord developed the
two-parameter normal ogive model, and Birnbaum developed the logistic models and
supplied necessary statistical foundations. Rasch developed three item response models
in 1960, spurring on more research and the development of computer programs to assist
in the underlying statistical analysis of the Rasch model.
IRT became necessary because of the numerous CTT deficiencies. IRT is based
19
on the theory that the probability of an examinees answer on an item can be determined
by a function of the examinee’s position in the distribution of the latent trait being
obeserved. This can be displayed graphically as an item characteristic curve (ICC) (see
Fig 1) (Fischer & Molenaar, 1995):
Probability of a Correct Response
Figure 1. Item Characteristic Curve (ICC) for a hypothetical item.
1.00
0.90
0.80
0.70
0.60
0.50
0.40
0.30
0.20
0.10
0.00
-3.00 -2.50 -2.00 -1.50 -1.00 -0.50 0.00 0.50 1.00 1.50 2.00 2.50 3.00
θ (Ability Level)
The ICC is a plot of the level of performance on some task or tasks against an
independent measure. A smooth nonlinear curve is fitted to the data so minor
irregularities in the data pattern are removed, which makes it easier to design and analyze
tests. The ICC provides the probability of examinees with a given ability level answering
each item correctly, and the probability value is independent of the number of examinees
at that ability level (Hambleton & Swaminathan, 1985).
IRT deals with predicting examinee performance by defining examinee
20
characteristics, consisting of traits or abilities (Hambleton & Swaminathan, 1985). Tests
contain multiple items and when each of the item scores are summed, the test score is
found. To describe test scores coming from a specific group of examinees, statistics that
show individual item scores are used. Georg Rasch (1980), in developing the Rasch
Model (RM), wanted to use invariant comparison, to describe items and examinees by
their parameters. This allows computation of the probability of any examinee’s response
to any item even if similar examinees have never taken similar items before. The
relationship between examinee ability level and response to an item is known as the item
response function (Lord, 1980).
To predict or explain item and test performance, examinee scores are estimated
using trait and ability scores. The item response model chosen specifies the relationship
between examinee test performance, which is observable, and the unobservable traits or
abilities being measured (Hambleton & Swaminathan, 1985). Many different models are
available for selection and there is no one “correct” model, requiring the use of goodness
of fit tests. Each model has specific mathematical functions that describe the observable
and unobservable quantities by specifying assumptions about test data. These models can
be unidimensional or multidimensional; measuring one underlying trait or more than one.
They can be linear or non-linear in addition to using dichotomous scoring, either correct
or not, or polytomous, having multiple responses (Hambleton & Swaminathan, 1985).
Item response models are defined by the mathematical form of the item
characteristic function and the number of specified parameters (Hambleton et al., 1991).
21
There are one or more parameters describing the item and the examinee, and their utility
is determined by assessing how well the model fits the data. Once a model is found that
fits the test data, and all parameters are held constant except for item difficulty, examinee
ability estimates and item indices can be determined. These ability estimates and item
indices are not test or group dependent. This means that even ability estimates obtained
from different items and item parameter estimates obtained from different examinees are
the same. This is the biggest advantage of using IRT over CTT. Another IRT bonus is
that estimates of standard errors for individual ability estimates can be obtained instead of
a single estimate of error for all examinees (Hambleton et al., 1991).
Rasch Model
The Rasch Model (RM) makes more stringent assumptions than other IRT models
by stating an ideal measurement model and using data to see if it fits that model. The
probability of an examinee making a specific response is derived from a logistic function
of the person and item parameters, which means that higher ability level examinees have
higher probability of a correct answer (Fischer & Molenaar, 1995). The continuum of
total score assessments is used to determine where an examinee is located using the
scores, which are counts of discrete observations representing an observable outcome
between an item and an examinee (Fischer & Molenaar, 1995). The RM is restrictive,
because it holds strong to a specific model that confines each item to the same
discrimination, rather than changing the model to fit the data. Guessing behavior is not
directly modeled, but it is theoretically included in the error structure and must be
22
evaluated by the researcher.
One Parameter Model
Unidimensional and multidimensional models can be used for both
dichotomously and polytomously scored data. Commonly used logistic models are the
one-, two-, and three-parameter logistic models. The ICCs for the one-parameter logistic
(1PL) model are given by the following equation (Hambleton et al., 1991):
𝑃𝑖 (𝜃) =
𝑒 (𝜃−𝑏𝑖 )
𝑖 = 1, 2, … 𝑛
1 + 𝑒 (𝜃−𝑏𝑖 )
where Pi(θ) is the probability that a randomly chosen examinee with ability θ answers
item i correctly, bi is the item i difficulty parameter, n is the number of items in the test, e
is a transcendental number whose value is 2.718 and Pi(θ) is an S-shaped curve with
values between 0 and 1 over the ability scale. The bi parameter is the point on the ability
scale where the probability of a correct response is 0.5. It also indicates the position of
the ICC in relation to the ability scale. The larger this parameter is, the higher the ability
level required for an examinee to have a 50% chance of getting the item correct
(Hambleton et al., 1991). The more difficult the item the higher it is on the scale. The
ability values from a particular group are standardized to a mean of 0 and a standard
deviation of 1. The values of bi are theoretically defined on a scale that ranges from
negative infinity to positive infinity but typically the values only vary from about -2.0 to
+2.0 (Hambleton et al., 1991). The b parameter is expresssed as a z-score, and the mean
and standard deviation are based on the distribution for the ability being measured. With
a normal distribution, values between +/- 1.64 are reasonable depending on the use of the
23
exam, because this stays away from the extremes where little additional information is
provided (Ellis & Mead, 2002). The key assumption of the 1PL model is that item
difficulty is the main characteristic influencing examinee performance. Other factors
such as guessing behavior are considered, but similarly to the Rasch model, are absorbed
into the residuals unless they harm the model fit too much. In IRT, the choice made to
use this model depends on the data being analyzed and how it will be applied – the data
set and the purpose is the main factor in selection of other models (Hambleton et al.,
1991).
Two Parameter Logistic Model
Lord first based his two-parameter item response model on the cumulative normal
distribution, or the normal ogive, but this was later replaced by the two-parameter logistic
model (2PL) by Birnbaum (Lord & Novick, 1968). The logistic model is an explicit
function of item and ability parameters. Item characteristic curves for the 2PL model are
given by the equation (Hambleton et al., 1991):
𝑃𝑖 (𝜃) =
𝑒 𝐷𝑎𝑖 (𝜃−𝑏𝑖 )
1 + 𝑒 𝐷𝑎𝑖 (𝜃−𝑏𝑖 )
where the parameters Pi(θ), e and bi are defined the same way as the 1PL model. The
factor D is a scaling factor introduced to make the logistic function as close as possible to
the normal ogive function (Hambleton et al., 1991). The parameter ai is the item
discrimination parameter, usually ranging from .5 to 2.0, with values below .5 limiting an
item’s information, and values above 2.0 may indicate a problem with its estimation
(Ellis & Mead, 2002). The a parameter is proportional to the slope of the ICC at bi on the
24
ability scale. It is more desirable to have items with steeper slopes because they do a
better job of sorting examinees into different ability levels. Higher item discrimination
values (ai) result in item characteristic functions that increase as the examinee’s ability
increases. This provides the opportunity to use differently discriminating items, but this
model does not provide an allowance for guessing behavior (Hambleton et al., 1991).
Three Parameter Logistic Model
The three-parameter logistic (3PL) model uses three parameters to describe the
ICC: a, which is the discrimination parameter, the difficulty parameter is denoted as b,
and the pseudo-guessing parameter is c. It is written as (Ellis & Mead, 2002):
𝜌𝑖 (𝜃) = 𝑐𝑖 + (1 − 𝑐𝑖 )
1
[1 + exp{−𝐷𝑎𝑖 (𝜃 − 𝑏𝑖 )}]
where Pi(θ) is probability that an examinee with ability θ answers item I correctly; ai is
proportional to the slope of the ICC at its point of inflection; ci is height of the lower
asymptote of the ICC; and D, the same scaling constant as in the 2PL model, is 1.7 (Ellis
& Mead, 2002). Parameter c is the probability that a person completely lacking in ability
will answer the item correctly. It is called the guessing parameter or the pseudo-chance
score level (Lord, 1980). Theoretically, this ranges from 0 to 1, but practical c
parameters are frequently lower than the probability of random guessing, depending on
the available number of response options. Large c values degrade the item’s ability to
discriminate between low- and high-ability examinees. The c parameter influences the
shape of the ICC, which must be fitted between the c parameter and 1.0 (Ellis & Mead,
2002).
25
IRT Assumptions
Local independence is the first assumption for IRT and Rasch models. This
requires that any two items are uncorrelated if examinee ability level, or θ, is held
constant. Local independence is only obtained when each ability dimension in addition
to non-ability dimensions, such as examinee personality, and test-taking behaviors that
influence performance have been taken into account (Hambleton et al., 1991).
Unidimensionality, a special type of local independence, refers to the test items only
measuring a single ability. This assumption is not strictly met because of other external
factors affecting examinee performance. This assumption is practically met by allowing
for one dominant ability to explain examinee performance. If more than one ability
explains examinee performance, the model is multidimensional (Hambleton et al., 1991).
If the guessing parameter, or c, is equal to 0, the tetrachoric intercorrelations matrix is of
unit rank with θ as the common factor (Lord, 1980). The major distinction among
commonly used IRT models is in the number and type of item characteristics assumed to
affect examinee performance (Hambleton et al., 1991).
Item and Test Information
IRT uses the concept of information instead of the CTT concept of reliability.
IRT makes it possible to assess information functions for individual items instead of a
single reliability estimate for an entire test (Ellis & Mead, 2002). The item information
function is formed by looking at each item and its conditional variance at each ability
level. More information is provided as the slope increases and the variance decreases.
26
This also lowers the SEM, which is used as a tool to discard items that are not performing
as well. Higher SEM items are discarded, as they do not provide as much information.
Item information functions can be shaped in a number of ways, depending on how the
test is constructed. They provide the maximum information at bi for the one and two
parameter models. For the three-parameter model, the maximum information slightly
higher than b1, as shown by the following equation (Hambleton et al, 1991):
𝜃𝑚𝑎𝑥 = 𝑏𝑖 +
1
1 1
1𝑛 [ + √1 + 8𝑐𝑖 ]
𝐷𝑎𝑖
2 2
The maximum value of the information is held constant for the 1PL model, but for the
2PL it is proportional to the square of the item discrimination parameter, so for larger
values of a, greater information is provided (Hambleton et al., 1991). The following
equation provides the maximum value for the 3PL models:
𝐼(𝜃, 𝑢𝑖 )𝑚𝑎𝑥
3
𝐷2 𝑎𝑖2
2
2]
(1
)
=
[1
−
20𝑐
−
8𝑐
+
+
8𝑐
𝑖
𝑖
𝑖
8(1 − 𝑐𝑖2 )
The closer the guessing parameter, ci, is to zero, the more information is obtained. Item
information functions determine the test information function, which is found by the
following equation (Hambleton & Swaminathan, 1985):
𝑛
𝐼(𝜃, 𝜃̂) = ∑{(𝑃𝑖′ )2 ⁄𝑃𝑖 𝑄𝑖 }
𝑖=1
The quality and number of items influences the information provided from the test
information function. The test information function is defined for a set of test items at
each point on the ability scale, and the contribution of each item is independent of the
27
other items. The amount of information provided at each ability level is negatively
correlated with the error associated with ability estimates. Each item’s contribution is
additive making it easy to determine the impact of each item (Hambleton &
Swaminathan, 1985).
Test Characteristic Curve
The test characteristic curve (TCC) describes the relationship that exists between
a true score and the ability scale. When given an ability level, the researcher can
determine a corresponding true score by using the TCC. If the decision is made to use a
one- or two-parameter model for a complete test, the left tail of the curve nears zero as
the ability score decreases. The upper tail nears the number of items in the test as the
ability score increases as shown here (Baker, 2001):
28
Figure 2. Test Characteristic Curve (TCC) for a hypothetical 15-item test.
16.00
14.00
Expected True Score
12.00
10.00
8.00
6.00
4.00
2.00
0.00
-3.00 -2.50 -2.00 -1.50 -1.00 -0.50 0.00 0.50 1.00 1.50 2.00 2.50 3.00
q (Ability Level)
This graph shows the assumption that a true score of zero matches up with an
ability score of negative infinity. A true score of N, or the number of items on the test
would show an ability level of positive infinity. 3PL models will show the left end of
the curve trailing off at the level of the c parameter, showing that low-level examinees
can achieve a score higher than zero by guessing at the item level. At the test level, this
feature is only shown when aggregating across items. As the ability level of the
examinee gets closer to positive infinity, the TCC shows that examinees will have a
true score of N. This allows test developers to transform examinee ability into true
scores, and give the examinee a method to determine their own ability level (Baker,
2001).
29
Standard Error of Measurement
Standard error of measurement (SEM) is used to describe how examinee test
scores fluctuate over repeated testing because of the error component. Through this
process, confidence intervals are generated to help interpret test scores. When working
within the CTT framework, SEM can be computed by taking the square root of 1 minus
reliability (1-rn)1/2, times the standard deviation of the test variance as follows (Embretson
& Reise, 2000):
SE𝑀𝑠𝑚𝑡 = 𝜎(1 − 𝑟𝑛 )1/2
Assuming that measurement error is normally distributed equally for all score
levels it allows for confidence intervals to be constructed. This means that the same
confidence score applies to each score level and the true score is derived linearly. These
apply only to a particular population because the computations are computed using
population statistics. The raw score mean and standard deviation must be estimated for a
population to use the standard score conversion, and the standard error must be computed
using variance and reliability estimates (Embretson & Reise, 2000).
SEM in IRT can be estimated for any ability level because it is not dependent on
population distributions. There is empirical and theoretical evidence indicating that the
SEM values are different depending on examinee ability or score levels, and traditional
computations of SEM are not adequate (Woodruff, 1990). This means that the SEM is
conditional because it changes when ability or score levels change. Trait scores are
estimated separately and each score or response pattern, yielding smaller standard errors
30
when the items are most appropriate for a specific trait score level in addition to when the
items have high discrimination (Embretson & Reise, 2000). Information is defined for
both the item and the total scale, and the item information function shows the
contribution that an item makes along the θ continuum (Ellis & Mead, 2002). Item
information is found by obtaining the reciprocal of the error variance, or the squared
standard error. Smaller error scores provide more information. Confidence intervals
around an examinee’s score are constructed by using the variability among test scores, or
the conditional standard error of measurement (CSEM) (Ellis & Mead, 2002).
It is useful to know the CSEM for a given ability level, but it is more useful to use
item and test information when determining which items to select for the test. Test
developers specify the desired test information function, and use item analysis to select
items where the summation of those item information functions approximates the desired
test information function (Ellis & Mead, 2002). SEM is used to detect the differences
between two people’s scores, to see if an examinee’s score differs from some true score,
or to assess if scores can discriminate differently between different demographic groups
or other groups defined by different score ranges (Guion, 1998).
Invariance
A major component of IRT is the property of invariance regarding item and
ability parameters. Invariance refers to the item and ability parameters remaining the
same regardless of the examinees or a specific test administration. Item specific
parameters are independent of the examinee’s ability distribution and examinee specific
31
parameters are independent of the set of test items (Hambleton et al., 1991). Assuming
the model fits the data, the same ICC is obtained for a test item, regardless of examinee
ability and population. This is a property derived from linear regression, when the
regression model fits the data; the same regression line is found, even if the distribution
of the predictor variable changes (Hambleton et al., 1991). The probability that
examinees at a specific ability level answer item i correctly is dependent on θ. When the
model holds, ai, bi, and ci do not change when the group being tested changes, resulting in
the three parameters being invariant (Lord, 1980).
Parameter Estimation
Any time IRT is applied to test data, the parameters for the IRT model chosen are
estimated. The examinee’s response is used to estimate the item parameters and the
examinee ability level (Hambleton et al, 1991). Several methods of parameter estimation
can be used. If θ is known, the data points that are necessary to estimate the unknown
parameters are the same as the item parameters in the model, assuming perfect model fit.
In practical applications, the model does not exactly fit the data, so the goal is to find the
parameter values that will best fit the curve, using a maximum likelihood criterion
(Hambleton et al., 1991).
Assuming examinee responses are independent, and θ is known, an item is
administered to many examinees, and a likelihood function of N examinee responses is
obtained. In this case, the likelihood function is multi-faceted. To obtain the maximum
likelihood estimation (MLE) of the parameters a, b, and c, when θ is unknown, the values
32
corresponding to the maximum value of a surface in three dimensions must be found
(Hambleton et al., 1991). When the ability of each examinee is known, each item may be
considered separately without reference to the other items, and is repeated once for each
item (Hambleton et al., 1991).
A common and difficult problem is that both θ and the item parameters are
unknown. In order to determine these values, all of the examinee responses must be
examined. Assuming local independence, for the three-parameter model, a total of 3n +
N (items + people) parameters need to be estimated (Hambleton et al., 1991). The easiest
way to do this is to select an arbitrary scale for the ability scale, for example, setting the
mean to zero and the standard deviation to 1. In order to set the initial values for the
ability parameters, a logarithm of the ratio of the number of correct responses to the
number of incorrect responses must be obtained. These values are standardized and used
to estimate the item parameters. Once the estimate of the item parameters is obtained, the
ability parameters are estimated. This process is repeated until the values remain
consistent between two successive estimations. This results in an approximation of the
item parameters and ability estimates (Hambleton et al.,1991).
A different approach uses Bayesian estimates of the parameters using a priori
distributions. This results in both the item and ability parameters being simultaneously
estimated, which can result in inconsistent joint maximum likelihood estimates. To
resolve this issue, the item parameters are estimated without referring to the ability
parameters. Using a random selection of examinees and specifying an ability parameter
33
distribution allows the ability parameters to be integrated out of the likelihood function
(Hambleton et al., 1991). The resulting marginal maximum likelihood estimates of the
item parameters remain consistent as the number of examinees increases, allowing them
to be treated as known values and used to estimate examinee ability. Using Bayesian
estimation helps to resolve the issue of poor c estimates resulting in less accurate
estimations, and using more items allow for better parameter estimates (Hambleton et al.,
1991).
Theoretically, high-ability examinees should never get an easy item wrong, but
careless mistakes exist. The logistic function reaches the asymptotes more slowly than
the normal ogive and mistakes have less of an impact. Ability is difficult to accurately
measure, so a small-sample frequency has to be inferred from the model by using an
observable quantity with known parameter distributions (Lord, 1980). Predictions are
made using the estimated values and observed data is used to see how accurate they are
by fitting them to the observed data. Several real world issues come into play, such as
examinees becoming tired, sick or uncooperative before completing the testing.
Omission of items, not finishing, skipping back and forth through the test and poor item
quality also have an effect on model fit (Lord, 1980).
34
Chapter 3
ANGOFF METHOD AND THE LATENT TRAIT CONTINUUM
Angoff Method
A common method of standard setting is the Angoff method. The standards are
set based on test scores, and these scores need to be given meaning because a number in
itself is not useful unless there is a way to make them useful instruments of measurement
(Angoff, 1971). In order for this to occur, a scale structure must be defined, called
scaling. These scores must also be interpreted, and this means that norms or other
interpretive guides should be established so that these scores can be used for assessment.
Additionally, because most exams have multiple forms, there is a need to calibrate or
equate the scores on different forms (Angoff, 1971). These are separate issues, but
scaling is the key component for standard setting and the Angoff method. Mental ability
is not observable, meaning that a score of zero on a mental ability test does not signify
the absence of mental ability. Additionally, equal differences between scores may not be
equal representations of different ability units (Angoff, 1971).
Several types of scaling are available. The raw score scale consists of the number
of items answered correctly. This scale can help identify problems within the test, but it
is not generalizeable. If more than one form of the test exists, the raw score scale cannot
be used to compare scores from different forms because of the natural variations between
forms and administrations (Angoff, 1971). The percentage-mastery scale scores suggest
35
if an examinee has received a score of 85, that examinee has “mastered” 85% of the
material. This can be problematic because it is hard to quantify knowledge and difficult
to determine the percentage mastered.
This method is also flawed when multiple forms of an exam are in use (Angoff,
1971). The next scaling method involves linear transformation or standard scores. This
involves administering the exam to a reference group, either randomly drawn from a
population or the reference group could be a population with defined characteristics.
Following this, the raw score mean is transformed at a desired scaled score involving a
uniform change in unit size to find the standard deviation of the scaled score. The raw
score mean and standard deviation are placed in a linear scaling equation where the
standard score deviate for any scaled score equals the standard-score deviate for a
corresponding raw score for the reference group (Angoff, 1971).
The percentile rank scale involves finding the percentage of individuals who
receive scores located below the midpoint of each score or score level. The frequencies
for all of the scores below the selected score are totaled, added to half of the frequencies
at the selected score, and that total is divided by the total number of cases (Angoff, 1971).
The normalized scale, or normalized standard scores, involves transforming the scores
into units independent of the test characteristics and equally spaced. Plotting the
distribution results in an S shaped curve, similar to the ICC used in IRT analyses. The
percentile-derived linear scale is the one that is most useful to the Angoff standard setting
method. This scaling method deals with norms, meaning that the standard of
36
performance is set as observed in samples from the population (Angoff, 1971). An
example is that a minimum passing score is set at some number, such as 70, and it is
expected that a certain percentage of examinees will pass, such as 65, representing the
minimum acceptable performance. In order to determine these numbers, a systematic
process using the “minimally acceptable person” is employed. This involves reviewing
the exam item by item and determining if this hypothetical person, someone who would
possess the minimum aptitude to answer the question correctly, in the purely theoretical
realm of correct or incorrect. The number of items that this person can answer is the raw
score that the “minimally acceptable person” would earn on the exam. Another method
is asking judges to rate the probability that any “minimally acceptable person” would
answer an item correctly. This means asking the judges to conceptualize multiple people
instead of just one (Angoff, 1971). This original version of the Angoff method led to
many other researchers developing their own versions of the Angoff standard setting
method. Many of these arose from practical innovations, which have advanced the
method further.
Discussions of the competence of the minimally acceptable person occur on a
continuum of the latent trait, referring to the ability being tested on that particular exam.
The participants need to be given operational definitions of what is excellent, medium,
and poor performance, so that the judges can make accurate decisions. These decisions
are the participants judgments that the minimally acceptable person will have success on
a particular item based on the operational definitions, and are aggregated in order to
37
define the cutoff score (Hurtz, et al, 2012). Typical Angoff participants are not given the
ICCs for each item, however this procedure of defining performance and estimating
success can be translated into the establishment of the horizontal axis (ability) and the
height of the ICC (probability of getting the item correct). The problem with most
Angoff procedures is that unless the exam is scored using IRT, the process of aggregating
ratings into a cutoff score only involves estimating the success or failure of the examinee
answering the item, and cannot be linked to the latent construct the exam is measuring
(Hurtz et al, 2012).
Hurtz et al. (2012) have proposed a method to tackle this problem in order to
maintain consistent standards across multiple test forms and across time. An Angoff
standard setting workshop is developed with at least 10 or more SMEs. These SMEs are
given items to rate that should be well representative of the exam. These items must be
calibrated properly and perform properly according to the goal of the exam. This can
consist of single or multiple test forms. The Angoff participants provide ratings on each
item. These ratings must be converted from proportion-correct values to the latent scale
being observed, or Hurtz et al. (2012) recommends a method that maximizes the fit
between the ratings and the ICCs based on the Monte Carlo results of Hurtz et al. (2008)
in order to define a preliminary standard.
The next step is to review information regarding the latent population so that an
expected passing rate can be estimated using the preliminary standard. If any
adjustments are necessary, the SEM can be derived from the ratings (Hurtz et al., 2012).
38
Additionally the CSEM at the preliminary standard’s threshold can be used to compute a
95% confidence interval to make further adjustments. This adjusted standard is applied
to all forms because the latent scale is independent of the sample of items without the
need to establish a new standard setting workshop. With exams being scored using IRT,
the resulting θ* is used as the operational cutoff score, which can also be scaled to the
proper score reporting metric. If the exam is being scored using CTT, the standard must
be converted to a percent-cutoff score by using the TCC (Hurtz et al, 2012). Over time
the exam plan, qualification requirements, requirements for successful job performance
or the job itself, or other reasons may require the standard to be updated. In these cases,
confirmatory studies should be held to evaluate the standard to see if changes need to be
made or if it is still valid for its intended use.
39
Chapter 4
MONTE CARLO SIMULATIONS
Monte Carlo
Monte Carlo simulations allow for the control of multiple variations of statistical
data in order to simulate real world data. When evaluating different conditions using
Monte Carlo methodology, the researcher obtains a statistical model, subject to the laws
of chance. The conditions of this model can be manipulated to whatever real world
conditions that need to be evaluated (Kalos & Whitlock, 1986). Calculating Monte Carlo
figures requires the use of a sequence of random events. The most basic example is a
single elementary event of flipping a coin. Each time you flip it, the possible outcome is
associated with a probability ranging from 0 to 1. When dealing with more than one
elementary event, such as flipping two coins, the probability of the same outcome is
known as the joint probability. If the goal is to determine the probability of multiple
elementary events, the joint probabilities are combined into a marginal distribution
resulting in a conditional probability (Kalos & Whitlock, 1986). This process can be
extrapolated to any number of elementary events.
The key to evaluating the behavior of statistics is the sampling distribution,
referring to the values that a specific statistic can have with respect to a given population
and the probabilities associated with those values. The bias of a statistic can be evaluated
by examining the expected value of the sampling distribution, variability, and functional
40
form to evaluate the efficiency of that statistic and to make inferences about the
population. How can a statistic be evaluated when the conditions necessary for a
mathematical theory to be valid do not exist or when no strong theory exists? Monte
Carlo simulations allow researchers to understand that specific statistic’s sampling
distribution and evaluate its behavior in random samples by using random samples from
known populations of simulated data (Mooney, 1997)
Every Monte Carlo experiment deals with generating a random sample (Fishman,
1996). Most of the time, the random event outcome can be expressed as a numerical
value. When dealing with computer simulations, the random choice outcome usually is a
logical event. Covariance is used to measure the independence of two random variables.
The covariance is 0 if the variables are independent. Covariance can be positive or
negative, resulting in the variance of a linear combination of dependent variables being
larger or smaller than the resulting variance of the independent variables (Kalos &
Whitlock, 1986).
Statistical analyses are used to describe and make inferences using measured
variables about social phenomena. A characteristic is estimated with an estimator
computed from observed data (Mooney, 1997). To evaluate a given statistic, the
sampling distribution is needed. This distribution consists of the range of values that the
statistic can possess in random samples from a specific population and the resulting
probabilities associated with that value range (Mooney, 1997). If a statistical bias exists,
it can be determined by looking at the expected sampling distribution, variability, and
41
functional form, as a way to determine the statistic’s efficiency (Mooney, 1997).
Random samples from known populations of simulated data are used to determine
the behavior of the statistic in question, which is artificially generated, because sampling
data multiple times from real people in real world situations is very impractical and
inefficient. Statistics in random samples can be evaluated by generating many random
samples and observing the resulting behavior. The multiple random samples create a
pseudo-population that resembles the real world in all relevant aspects (Mooney, 1997).
When performing a basic Monte Carlo procedure, the pseudo-population is specified in
symbolic terms, so that samples can be generated by sampling from this pseudopopulation by using the same sampling strategy and size that reflects the statistical
situation being investigated. Next, the estimator ̂𝜃 is calculated from the pseudo-sample
and stored in a vector ̂𝜃, repeating this process as many times as the number of desired
trials. Finally, a relative frequency distribution of the resulting 𝜃̂𝑡 values is constructed.
This is the Monte Carlo estimate of the sampling distribution with respect to the specified
conditions of the pseudo-population and sampling procedures (Mooney, 1997).
After defining a variable in terms of its distribution function, the probability
density function (PDF) is used to map the probability of x falling between two values of
X for continuous random variables. The inverse distribution function incorporates
probability values, a, to determine the value of x so that Pr(X ≤ x) = α (Mooney, 1997).
Parameters determine the location, scale, and/or shape of the distribution, and each
distribution function has specific requirements pertaining to a range of possible values of
42
X. The chosen distribution determines its range, some having infinite range and others
having the range truncated at one or both ends. Because of this, it is important to know
the mean, variance, skewness, and kurtosis. The skewness and kurtosis describe how
normal or non-normal the distribution is. Researchers need to consider which
distribution will yield the proper range, shape and variation that will match the types of
simulated variables and processes and fit the design of the experiment (Mooney, 1997).
Beta Distribution
Evaluating rater bias in standard setting can be done by using the beta distribution
(Hurtz, Jones, & Jones, 2008; Reckase, 2006). This is a flexible distribution bounded by
0 and 1, and has two adjustable parameters, a and b. It has flexibility and a range of
PDFs that are highly right-skewed, uniform, approaching normality, highly left-skewed,
or even bimodal distributions with varying levels of interior dip (Mooney, 1997). The a
and b parameters determine the shape of the distribution. If a or b falls below a value of
1, the PDF curves downward on one end. If both parameters are below one, the PDF is
bimodal. As a or b decreases toward 0, the height of the bimodal distribution is
increased. If a and b have the same value, the distribution is symmetrical (Mooney,
1997). Monte Carlo simulations were used in the generation of the Angoff ratings
because they can be systematically varied to match the specified conditions to evaluate
how different judges may perform in the real world.
43
Chapter 5
METHOD
Purpose of the Study
The purpose of the present study is to use the methods outlined in this paper to
evaluate the Angoff standard setting method. This method involves subjective ratings
based on the information provided to the participants. This information involves the
exam that the standard is being set on, and the mathematical conversions of the
generalizeable θ values from the exam. However, most subject matter experts are
unfamiliar with IRT and cannot make accurate ratings based on the θ statistic. This needs
to be converted to a CTT proportion correct statistic so that training the Angoff raters can
be accomplished in a manner that will not expend extensive resources. Following the
standard setting meeting, the ratings that the judges give must then be converted back
into the θ metric so that the ratings can be generalized to other settings. This study is
designed to check how simulated rater agreement affects how this occurs using simulated
ratings.
What happens if the Angoff rating panel does not agree with each other? What
happens if the judges rate the items too easy? What happens if they rate the items too
difficult? Theoretically the conversion from CTT back to IRT should result in finding the
same θ value, however if the Angoff raters rate the items inaccurately, does this happen?
This study is designed to explore different conversion methods and how they affect
44
simulated Angoff ratings. The present study has expanded the Hurtz et al. (2008)
research by using an expanded range of θ by going to more extremes (adding a -2 and 2
conditions) in addition to using different bias conditions, such as 15 or 35 percent above
and below the ICC instead of +/- of .10 on the ICC. Additionally, this present study used
a simulated exam based on 100 items, instead of an actual exam using 75 items.
Minimum Passing Level
When evaluating the Angoff method, the minimum passing level (MPL) should fit
the IRT model. The fit needs to be examined because the procedures used to generate the
ratings differ from the procedures used to gather the examinee data. Additionally, the
raters possess different characteristics than the examinees (Kane, 1987). Even if
examinee performance data fits an IRT model, the ratings may still not fit the model
because the Angoff method can yield different results, and do not provide the same fit to
every IRT model. To account for this, the average MPLs for each individual item are
combined into a passing score by summing the average item MPLs over each individual
item (Kane, 1987).
The MPLs found through the Angoff method are interpreted as true score
estimates for minimally competent examinees. IRT model item parameters are estimated
for test items using examinee response data. First, if there is some value, θ*, on the θ
scale that is the examinee minimal competency level, then Pi(θ*) is the value for θ* of the
ICC for item i, which is the indicator of the minimally competent examinee’s expected
observed score (Kane, 1987). When the raters’ MPLs fit the selected IRT model, the
45
value of Pi(θ*) is equal to the expected MPL for each item i. Different raters may assign
different MPLs to different items in addition to random error. The expected MPL over
the entire rater population for each item will equal Pi(θ*) when the item has a fixed value
of θ* (Kane, 1987). The expected MPL over the population of raters will only deviate
from Pi(θ*) if there are different standards used with random variations. The unbiased
estimate of the sampling variance for individual raters on Item i is given by the following
equation (Kane, 1987):
𝜎̂𝑖2 =
∑𝑟(𝑀𝑖𝑟 − 𝑀𝑖𝑅 ) 2
𝑘−1
where k is the number of raters sampled, Mir is the MPL for Rater r on Item i, and MiR is
the average MPL on Item i for k raters (Kane, 1987). The sampling variance for the
mean MPL over k raters can be estimated as:
𝜎̂𝑖2 (𝑀𝑖𝑅 ) = 𝜎̂𝑖2 (𝑀𝑖𝑟 )/𝑘
The average rating distribution over samples of raters should be approximately normal,
especially if k is large. Assuming the ratings fit the model, implying that Mir is an
unbiased estimate of Pi(θ*), and that the average rating is normally distributed (Kane,
1987), the following equation:
𝑍𝑖𝑅 =
𝑀𝑖𝑅 − 𝑃𝑖 (𝜃 ∗ )
𝜎𝑖 (𝑀𝑖𝑅 )
should be normally distributed with mean of zero and a standard deviation of 1 for some
value of θ*. Assuming that ZiR for the n items are independently distributed, the overall
fit of the ratings to the model can be examined using the statistic:
46
2
∑ 𝑍𝑖𝑅
= ∑[(𝑀𝑖𝑅 − 𝑃𝑖 (𝜃 ∗ ))/𝜎𝑖 (𝑀𝑖𝑅 )]
𝑖
2
𝑖
which is distributed as a chi-square with n-1 degrees of freedom under the null hypothesis
(Kane, 1987). Sometimes the independence assumption cannot be met if the same raters
review each item because the error due to rater differences is correlated across all items.
The independence assumption can be met if the correlated error is minimal in comparison
to the random error. After examination, if the ratings do not fit the model, different
models should be examined so that the researcher can combine the MPLs over raters and
items to obtain a passing score, or to estimate the expected error in the passing score
(Kane, 1987).
Length of Test/Number of Items
In simulating Angoff style ratings for a set of items, and exam that contains items
with IRT properties is necessary to get useable results. The previous study used an exam
length of 75 questions. In order to determine if the length of the exam has an effect, this
study used a length of 100 questions. To make the scope of the study more generalizable,
a specific exam was not used, and instead the items’ IRT parameters were simulated.
These items fit the 3PL IRT model, and a set of simulated ratings around the ICCs was
generated. The IRT a parameter was generated with a mean of 1.56 and a standard
deviation of .29 with a minimum of .49 and a maximum of 2.24. The IRT b parameter
was generated with a mean of -.032 and a standard deviation of .566 with a minimum of 1.69 and a maximum of 1.76. The IRT c parameter was generated with a mean of .18 and
a standard deviation of .03, with a minimum of .09 and a maximum of .24. These values
47
were selected based on research from Plake and Kane (1991), and further modified
during personal conversations with Gregory M. Hurtz, Ph D, as to replicate item
properties on exams given in an applied setting.
Simulated Rating Data
The beta distribution defined for each item was used to draw simulated ratings.
The beta distribution was selected because of the ability to manipulate parameters as well
as allowing for adjustments of the lower and upper bounds within the 0 to 1 range. This
allows the simulated rating distributions to fall above the c parameter because those
ratings falling below it do not correspond to a value on the θ scale.
The upper limit of the items was fixed at 1.00 and three population standard
deviations were used. The population standard deviations were set at .05, .10, and .15.
This choice was made to explore the effects of rater agreement on the conversion method.
The previous study focused on evaluating rater agreement at the ICC, and conditions of
.10 above and below the ICCs. The following conditions were selected in order to see if
extending the rater agreement both positively and negatively would have an effect. The
ratings were simulated to see if raters with strong agreement (.05) to somewhat weaker
agreement (.15) with a mid range value (.10) included in the analysis would affect the
conversions. The rv.beta function in SPSS was used to generate samples from the beta
distribution, requiring two parameters A and B that are used to determine the mean and
variance in the distribution:
𝜇=
𝐴
𝐴+𝐵
48
𝜎2 =
𝐴𝐵
(𝐴 +
𝐵)2 (𝐴
+ 𝐵 + 1)
except in situations where the lower limit is greater than 0 and/or the upper limit is less
than 1, the mean and variance are determined by:
𝜇 = 𝐿 + (𝑈 − 𝐿)
𝜎 2 = (𝑈 − 𝐿)2
𝐴
𝐴+𝐵
𝐴𝐵
(𝐴 +
𝐵)2 (𝐴
+ 𝐵 + 1)
The simulation of this data requires the lower limit to be set at the ci IRT parameter, and
the variance is set to the square of the three population standard deviations for each
condition, .052, .102, and .152. The value of the mean is set to simulate ratings that reflect
no bias, 85% of the raters overestimating the ICCs, 65% of the raters overestimating the
ICCs, 65% of the raters underestimating the ICCs, and 85% of the raters underestimating
the ICCs.
Fifteen raters were simulated 1500 times for each of the 100-item draws at each of
the five a priori θ* values (-2.00, -1.00, .00, 1.00, and 2.00). Ratings were generated
around five population means, either equaling Pi(θ*), falling .85 above Pi(θ*), falling .85
below Pi(θ*), falling .65 above Pi(θ*), and falling .65 below Pi(θ*). Whenever a value
was found over one, it was set to equal .99, and when a value fell below an item’s c
parameter it was set to c + .01.
49
IRT to CTT Conversion
Most commonly used judgmental standard-setting procedures like the Angoff
method were developed using CTT. They use the number or proportion correct cutoff
scores obtained by averaging or summing the judged ratings. New procedures are being
developed, and do not have a direct link between judged p values and the operational
cutoff score (Hurtz et al., 2008). The judgments from older procedures require a
transformation to find a comparable cut-score on the θ metric. Because the older
procedures are still being used, a transformation method needs to be selected.
This is done by computing the number-correct cutoff score and converting it to a
θ value using the TCC. After aggregating the judged p values along the vertical axis of
the TCC, the related θ value is the cutoff score, symbolized by θ*.
Kane (1987) explored alternative methods by transforming judged p values into a
θ cut score. Kane’s Method 1 uses the mean of the ratings for each item and converts
them to the θ scale by using the ICC. The mean ratings of the judges are used with the
ICC to obtain a corresponding θ value, which are averaged to obtain a cut score (Kane,
1987). Proportion correct means are located on the ICC and have a corresponding θ
value, and the mean of these individual items becomes θ* as shown by the following
equation:
𝑀𝑖𝑅 = 𝑃𝑖 (θ̂∗ 𝑖𝑅 )
where MiR is the mean proportion-correct value across raters R, for an individual item I,
and 𝑃𝑖 (θ̂∗ 𝑖𝑅 ) is the height of the ICC for that item at θ̂∗ 𝑖𝑅 (Kane, 1987, equation 5).
50
Kane was able to improve this method by using a weighted mean of the individual
values, which minimizes the sampling variance because items with high interrater
agreement are given higher weights (Hurtz et al., 2008). The weights for each item can
be found by the following equation (Kane, 1987, equation 14):
1
∗
𝜎𝑖2 (θ̂
𝑖𝑅 )
𝑤𝑖 =
1
∑ 2 ∗
𝜎𝑖 (θ̂
𝑖𝑅 )
∗
and the weights sum to 1 across items. For this formula, 𝜎𝑖2 (θ̂
𝑖𝑅 ) represents the variance
in θ̂∗ 𝑖𝑅 that can be found by examining two factors, the variance of the original
proportion-correct ratings and the slope of an item’s ICC. Hurtz, et al, (2008) found that
the variance should be restricted to a minimum value of .0025 in order to avoid spurious
effects when computing item level weights. This minimum variance was adopted in the
current study.
Kane then developed Method 2, where the cut score is determined by averaging
across items and raters, then located along the TCC and the corresponding θ value is
found as the θ* cut score (Hurtz, et al, 2008). This is demonstrated by the following
equation (Kane, 1987, equation 15):
∑ 𝑀𝑖𝑅 = ∑ 𝑃𝑖 (θ̂∗ )
𝑖
𝑖
An approximation formula for Method 2 was developed using individual item values, and
is denoted by the following equation:
51
θ̂∗ ≅
1
∗
∗
̂
∑ 𝑃𝑖′ (θ̂
𝑖𝑅 ) θ𝑖𝑅
∗
∑𝑖 𝑃𝑖′ (θ̂
𝑖𝑅 ) 𝑖
The accuracy of this equation is dependent on assuming that the values of θ∗𝑖𝑅 for each
item are close to θ* (Kane, 1987). A weighted version of Method 2 was also developed.
Higher weights are applied to items where there is more agreement among judges. Kane
further developed Method 3 by determining the θ* cut score by finding the value where
the fit between the mean of the proportion-correct ratings for the item and the item’s ICC
is maximized. This method is used to maximize the fit to the judges predictions for testtaker performance (Hurtz, et al., 2008). Method 2 weighted is denoted by the following
equation (Kane, 1987, equation 21):
∗ ≅
θ̂
𝑤
1
∗
𝑃′ (θ̂
)
∑𝑖 [ 𝑖 𝑖𝑅 ]
(𝑀
𝜎𝑖 𝑖𝑅 )
∑[
𝑖
∗
𝑃𝑖′ (θ̂
𝑖𝑅 ) ̂
] θ∗
𝜎𝑖 (𝑀𝑖𝑅 ) 𝑖𝑅
Kane’s (1987) Method 3 attempts to maximize the fit between the mean of the
proportion-correct ratings for each item and that item’s ICC. There is an approximation
formula for method 3 that is similar to Method 2 weighted, except that it uses the
reciprocals of the θ* metric rater variances, as calculated in the following equation (Kane,
1987):
𝜃̂ ∗ ≅
∗
𝑃𝑖′2 (𝜃̂𝑖𝑅
) ∗
̂
∑
[
2 (𝑀 ) ] 𝜃𝑖𝑅
′2 ̂ ∗
𝜎𝑖
𝑃𝑖 (𝜃𝑖𝑅 )
𝑖𝑅
∑𝑖 [ 2
] 𝑖
𝜎𝑖 (𝑀𝑖𝑅 )
1
∗
The accuracy is based on how close the values of 𝜃̂𝑖𝑅
are the θ* (Kane, 1987). Kane
(1987) concluded that the conversion is more effective if made at the item level instead of
52
the test level. The aggregation process should be conducted with an optimal weighting
scheme so that the error variance can be minimized, based on theoretical assumptions
about model fit. These techniques have been evaluated with simulated data, and using
several fixed a priori values of θ* along the competence continuum at locations of the
judge’s minimum competence conceptualizations (Hurtz, et al., 2008). In expanding on
previous research, this study compared the results of each of the following methods:
∗
̂1∗ ), method 1 weighted, (θ̂
̂∗
method 1, (θ
1𝑤 ), method 2, (θ2 ), the method 2 approximation
∗
̂∗̅ ), method 2 weighted (θ̂
̂∗
formula, (θ
2𝑤 ), and method 3 (θ3 ).
2
The present study took advantage of the directions for future research based on
the past study, Hurtz et al, (2008), and used more items, defined by randomization instead
of being tied to a specific exam, three standard deviation conditions, and a different
estimation of bias. The first step in this study involved randomly generating IRT
parameters for 100 items. They were randomly generated using the beta distribution
using SPSS. The next step simulated the Angoff ratings. For this study, a total of fifteen
simulated raters were used and three different determinations of rater agreement were
selected by setting the standard deviations to .05, .10, and .15. These values were picked
based on the previous study, and were selected to give a greater range to previous
research. The bias conditions were established through the Monte Carlo simulation,
where the rater judgments were manipulated by setting five conditions. The first was no
bias, or ratings as would be expected with accurate rater panels. The rest of the
conditions were set as if the pool of raters rated the exam as easier as or harder than it
53
theoretically is, which would result in cut scores that would indicate that the examinees
have higher or lower ability levels than would theoretically occur. These conditions were
simulated as if the raters rated the exam as 15 or 35 percent harder or easier than what
would be expected with more accurate rater panels. This should determine if the rater
judgments drive the mathematical conversion, or if it is robust enough to compensate for
rater bias and disagreement. This was achieved by shifting the mean of the ICC by the
bias condition using the cumulative distribution function (CDF) of the Monte Carlo beta
distribution, and then transformed back into a cut score by using the inverse distribution
function (IDF) of the Monte Carlo beta distribution. The simulated rater judgments were
generated based on the new mean of the ICC, and rating the exam as if it was, for
example, fifteen percent easier than in reality. Additionally, the range of theta values
being examined were condensed from the previous study to a truncated range of -2.00, 1.00, 0.00, 1.00, and 2.00. The goal of this present study was to examine how accurate
(possessing less error) and how consistent the simulated rater judgments are based on the
conditions specified.
H1: The θ* resulting from simulated rater judgments will be more consistent and have
less error when the simulated ratings have no bias
H2: The θ* resulting from simulated rater judgments will be more consistent and have
less error when the simulated ratings have more agreement.
H3: The θ* resulting from simulated rater judgments will be more consistent and have
less error when the θ* value is closer to 0.00.
54
H4: Method 1 weighted and Method 3 will perform the best in recovering the original θ*
values.
Data Analysis
All calculations for the ratings conversions were performed using SPSS syntax
command files, written by Gregory M. Hurtz, from California State University
Sacramento (Hurtz et al., 2008) and modified by this researcher to fit the present
conditions under manipulation. The input files used the simulated item parameters as
defined above, in addition to the simulated ratings from each judge. The cut score
conversions were performed to find the θ cut scores from the proportion-correct cut
scores. The previous study (Hurtz et al., 2008) set variance restrictions for method 1
weighted and method 3 in order to eliminate spurious effects, and these restrictions were
included in this analysis.
55
Chapter 6
RESULTS
Unbiased Ratings
Low Variability (.05)
In Table 1, when the standard deviation is set to .05, and the a priori θ* value is
set to 0, in the unbiased condition all 6 methods did a good job of recovering the original
θ* value as shown by the low bias, with means ranging from -.013 to .003. The RMSE
mean values range from .004 to .015 showing that there is low error in the mathematical
conversions. Additionally the means of the estimated θ* values across replications range
from .023 through .042. The means of the error index were low, .043 for all six
equations, with a small standard deviation, .001. The means of the consistency index
were high, .939 for all six equations, and the standard deviation of .002, which is also a
good indication of how the conversion equations would consistently show these values
with other simulated ratings. Going forward, values for the error and consistency indexes
below .5 are classified as low, values ranging from .5 to .8 are considered moderate, and
values above .8 are considered high.
When the standard deviation is set to .05 and the a priori value is set to -1, in the
unbiased condition, method 1 weighted and method 3 did the best job of recovering the
original θ* values as shown by the low bias, with means of .043. The unweighted
method 2 did the next best job of recovering the original θ* values as shown by the bias
56
mean of .056. The rest of the methods do a similar job of recovering the original θ* as
shown by the bias means of .100 to .159, i.e., the range of the bias means show that there
are small differences in determining which is the best method, suggesting that you would
get a recovery of the original θ* values that could be equally applied regardless of the
selected method. The RMSE mean values mirror the bias values and show that method 1
weighted and method 3 have the lowest error in recovering the estimated θ* values.
Additionally the means of the estimated θ* values across replications range from -.841
through -.957. The means of the error index were low, ranging from .044 through .059,
with standard deviations in the .002 to .003 range. The means of the consistency index
were high, ranging from .917 through .938, with standard deviations in the .002 to .004
range.
When the standard deviation is set to .05 and the a priori value is set to 1, in the
unbiased condition, method 1 weighted and method 3 did the best job of recovering the
original θ* values as shown by the low bias, with means of -.048. The weighted method
2 did the next best job of recovering the original θ* values as shown by the bias mean of .150. Method 1 followed with bias means of -.282, the unweighted method 2 showed a
bias mean of -.322, and the approximation formula for method 2 did the worst job with a
bias mean of -.641. The RMSE mean values mirror the bias values and show that method
1 weighted and method 3 have the lowest error in recovering the estimated θ* values.
Additionally the means of the estimated θ* values across replications range from .678
(unweighted method 2) through .952 (method 1 weighted and method 3). The means of
57
the error index were low, ranging from .099 through .223, with standard deviations in the
.002 to .003 range. The means of the consistency index were moderately high, ranging
from .715 through .889, with standard deviations in the .002 to .004 range.
When the standard deviation is set to .05 and the a priori value is set to -2, in the
unbiased condition, method 2 did the best job of recovering the original θ* value with a
bias mean of .739. Method 1 weighted and Method 3 showed a bias mean of 1.011,
doing a better job than Method 2 weighted as shown by a bias mean of 1.047. The
approximation formula for method 2 did the worst job of recovering the θ* values with a
bias mean of 1.123. The RMSE mean values mirror the bias values and show that the
unweighted method 2 has the lowest error in recovering the estimated θ* values.
Additionally the means of the estimated θ* values across replications range from -1.261
through -.877. The means of the error index were low, ranging from .084 through .126,
with standard deviations in the .003 to .006 range. The means of the consistency index
were high, ranging from .823 through .890, with standard deviations in the .004 to .010
range.
When the standard deviation is set to .05 and the a priori value is set to 2, in the
unbiased condition, method 1 weighted and method 3 did the best job of recovering the
original θ* values as shown by the bias means of -.954. The approximation formula for
method 2 did the worst job with a bias mean of -2.013. The RMSE mean values mirror
the bias means and show that method 1 weighted and method 3 have the least error.
Additionally the means of the estimated θ* values across replications range from -.013
58
through 1.046. The means of the error index were moderate, ranging from .315 through
.406, with standard deviations in the .005 to .011 range. The means of the consistency
index were moderately low, ranging from .423 through .655, with standard deviations in
the .009 to .015 range.
59
Table 1
Summary of Cut Scores Across Methods for Different A Priori θ* Values, Where Ratings
are At the Item Characteristic Curves (ICCs) on Average (Low Variability SD .05)
Mean
SD
Range
Bias
RMSE
Error M
Error SD
Consistency M
Consistency SD
0.246
-1.046
-0.989
-1.261
-0.877
-0.953
-0.989
0.002
0.016
0.044
0.017
0.024
0.021
0.044
0.020
0.118
0.304
0.120
0.156
0.143
0.304
0.954
1.011
0.739
1.123
1.047
1.011
0.954
1.012
0.739
1.123
1.047
1.012
0.104
0.111
0.084
0.126
0.115
0.111
0.003
0.006
0.003
0.005
0.004
0.006
0.858
0.847
0.890
0.823
0.840
0.847
0.004
0.010
0.004
0.008
0.006
0.010
0.300
-0.841
-0.957
-0.944
-0.856
-0.900
-0.957
0.002
0.012
0.008
0.008
0.013
0.009
0.008
0.010
0.077
0.053
0.050
0.085
0.057
0.053
0.159
0.043
0.056
0.144
0.100
0.043
0.160
0.044
0.056
0.145
0.101
0.044
0.059
0.044
0.045
0.056
0.050
0.044
0.003
0.002
0.002
0.003
0.002
0.002
0.917
0.938
0.937
0.920
0.930
0.938
0.004
0.002
0.002
0.004
0.003
0.002
0.606
-0.003
0.002
-0.006
-0.013
0.003
0.002
0.002
0.006
0.004
0.006
0.006
0.004
0.004
0.010
0.042
0.023
0.030
0.038
0.026
0.023
-0.003
0.002
-0.006
-0.013
0.003
0.002
0.007
0.004
0.008
0.014
0.005
0.004
0.043
0.043
0.043
0.043
0.043
0.043
0.001
0.001
0.001
0.001
0.001
0.001
0.939
0.939
0.939
0.938
0.939
0.939
0.002
0.002
0.002
0.002
0.002
0.002
0.832
0.718
0.952
0.678
0.359
0.850
0.952
0.005
0.017
0.006
0.018
0.020
0.008
0.006
0.030
0.105
0.039
0.120
0.114
0.051
0.039
-0.282
-0.048
-0.322
-0.641
-0.150
-0.048
0.282
0.049
0.322
0.642
0.151
0.049
0.132
0.099
0.140
0.223
0.110
0.099
0.007
0.005
0.008
0.010
0.005
0.005
0.846
0.889
0.835
0.715
0.876
0.889
0.008
0.005
0.010
0.014
0.006
0.005
θ* = -2
p
𝜃̂1∗
∗
𝜃̂1𝑤
̂
𝜃2∗
𝜃̂̅2∗
∗
̂
𝜃2𝑤
̅̅̅̅
𝜃̂3∗
θ* = -1
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂̅̅̅̅
2𝑤
̂
𝜃3∗
θ* = 0
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂2𝑤
̅̅̅̅
̂
𝜃3∗
θ* = 1
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂̅2∗
∗
̂
𝜃2𝑤
̅̅̅̅
𝜃̂3∗
θ* = 2
p
𝜃̂1∗
∗
𝜃̂1𝑤
̂
𝜃2∗
𝜃̂2̅∗
∗
̂
𝜃2𝑤
̅̅̅̅
𝜃̂3∗
0.719
0.009
0.050
0.415
0.023
0.136 -1.585
1.585
0.372
0.007
0.531
0.012
1.046
0.068
0.373 -0.954
0.956
0.315
0.011
0.655
0.015
0.301
0.026
0.170 -1.699
1.700
0.382
0.007
0.502
0.013
-0.013
0.021
0.132 -2.013
2.014
0.406
0.005
0.423
0.009
0.478
0.027
0.189 -1.522
1.522
0.366
0.008
0.546
0.013
1.046
0.068
0.373 -0.954
0.956
0.315
0.011
0.655
0.015
∗
Note. RMSE = root mean square error, 𝜃̂1∗ = converted cutoff score using unweighted Method 1; 𝜃̂1𝑤
= converted cutoff
score using weighted Method 1; 𝜃̂2∗ = converted cutoff score using unweighted Method 2; 𝜃̂2̅∗ = converted cutoff score
∗
̂∗
using approximation formula for unweighted Method 2; 𝜃̂̅̅̅̅
2𝑤 = converted cutoff score using weighted method 2; 𝜃3 = converted
cutoff score using Method 3.
60
Medium Variability (.10)
As shown by Table 2, when the standard deviation is set to .10 and the a priori
value is set to 0, in the unbiased condition, all six methods did a similar job of recovering
the original θ* values as shown by the range of the bias means of -.003 to .003, i.e., the
range of the bias means show that there are small differences in determining which is the
best method, suggesting that you would get a recovery of the original θ* values that could
be equally applied regardless of the selected method. The RMSE mean values mirror the
bias means ranging from .008 through .012. Additionally the means of the estimated θ*
values across replications, range from -.003 to .003. The means of the error index were
low, at .088, with a standard deviation of .002. The means of the consistency index were
high, at .875 to .876 with a standard deviation of .004.
When the standard deviation is set to .10 and the a priori value is set to -1, in the
unbiased condition, method 1 weighted and method 3 did the best job of recovering the
original θ* value as shown by the bias mean of .140. Method 2 weighted did the next
best job as shown by the bias value of .253, while method 1 and method 2 followed as
shown by the bias means of .366 and .343 respectively. The approximation formula for
method 2 did the worst job as shown by the bias mean of .621. The RMSE means mirror
the bias means and show the same results. Additionally, the means of the estimated θ*
values range from -.860 to -.379. The means of the error index were low, ranging from
.148 to .249 with a standard deviation ranging from .005 to .007. The means of the
consistency index were moderate, ranging from .627 to .791, with standard deviations
61
ranging from .007 to .011.
When the standard deviation is set to .10, in the unbiased condition, and the a
priori value is set to 1, in the unbiased condition, method 1 weighted and method 3 did
the best job of recovering the original θ* values as shown by the bias means of -.187.
Method 2 weighted did the next best job as shown by the bias mean of -.363, and method
1 and method 2 followed with bias means of -.497 and -.530 respectively. The
approximation formula for method 2 did the worst job as shown by the bias mean of .827. The RMSE means mirror the bias means. Additionally, the means of the estimated
θ* values range from .173 to .813. The means of the error index were low, ranging from
.188 to .247, with standard deviations ranging from .007 to .009. The means of the
consistency index were moderate, ranging from .567 to .785, with low standard
deviations.
When the standard deviation is set to .10, in the unbiased condition, and the a
priori value is set to -2, method 1 and method 3 did an adequate job of recovering the
original θ* values as shown by the bias means of 1.395. The rest of the methods did
worse; the approximation formula of method 2 did the worst job, with a bias mean of
2.053. The RMSE values mirror the bias means. Additionally, the means of the
estimated θ* values range from -.605 to .053. The means of the error index were low,
ranging from .378 to .408, with low standard deviations. The means of the consistency
index were also low, ranging from .414 to .444, also with low standard deviations. The
proportion-correct cutoff score generated in this condition was affected by not enough
62
low ratings being possible in the sampling when attempting to achieve the .10 standard
deviation of rater variability at the -2 θ* level.
When the standard deviation is set to .10, in the unbiased condition, and the a
priori value is set to 2, method 1 weighted and method 3 did the best job of recovering
the original θ* values as shown by the bias means of -1.527. The rest of the methods did
worse; the approximation formula for method 2 did the worst with a bias mean of -1.969.
The RMSE values mirror the bias means. Additionally, the means of the estimated θ*
values range from .031 to .472. The means of the error index were low ranging from
.362 to .370, with low standard deviations. The means of the consistency index were
moderately low, ranging from .497 to .549 with low standard deviations.
63
Table 2
Summary of Cut Scores Across Methods for Different A Priori θ* Values, Where Ratings
are At the Item Characteristic Curves (ICCs) on Average (Medium Variability SD .10)
Mean
SD
Range
Bias
RMSE
Error M
Error SD
Consistency M
Consistency SD
θ* = -2
p
𝜃̂1∗
∗
𝜃̂1𝑤
̂
𝜃2∗
𝜃̂̅2∗
∗
̂
𝜃2𝑤
̅̅̅̅
𝜃̂3∗
0.554
-0.179
-0.605
-0.141
0.053
-0.332
-0.605
0.009
0.020
0.070
0.023
0.018
0.028
0.070
0.060
0.143
0.469
0.150
0.112
0.202
0.469
1.821
1.395
1.859
2.053
1.668
1.395
1.821
1.397
1.859
2.053
1.668
1.397
0.398
0.378
0.400
0.408
0.391
0.378
0.006
0.008
0.006
0.005
0.007
0.008
0.414
0.444
0.415
0.429
0.415
0.444
0.008
0.015
0.007
0.007
0.010
0.015
0.373
-0.634
-0.860
-0.657
-0.379
-0.747
-0.860
0.005
0.017
0.018
0.016
0.012
0.013
0.018
0.030
0.104
0.127
0.100
0.091
0.078
0.127
0.366
0.140
0.343
0.621
0.253
0.140
0.367
0.141
0.344
0.621
0.253
0.141
0.186
0.148
0.181
0.249
0.164
0.148
0.007
0.005
0.007
0.007
0.005
0.005
0.728
0.791
0.735
0.627
0.764
0.791
0.010
0.007
0.011
0.010
0.008
0.007
0.608
0.002
0.002
-0.003
0.001
0.003
0.002
0.003
0.011
0.007
0.009
0.009
0.007
0.007
0.020
0.081
0.045
0.050
0.066
0.048
0.045
0.002
0.002
-0.003
0.001
0.003
0.002
0.012
0.008
0.010
0.010
0.009
0.008
0.088
0.088
0.088
0.088
0.088
0.088
0.002
0.002
0.002
0.002
0.002
0.002
0.876
0.876
0.875
0.876
0.876
0.876
0.004
0.004
0.004
0.004
0.004
0.004
0.774
0.503
0.813
0.470
0.173
0.637
0.813
0.006
0.020
0.020
0.020
0.019
0.016
0.020
0.040
0.150
0.179
0.140
0.130
0.108
0.179
-0.497
-0.187
-0.530
-0.827
-0.363
-0.187
0.497
0.188
0.531
0.827
0.363
0.188
0.240
0.188
0.247
0.319
0.214
0.188
0.008
0.007
0.009
0.009
0.007
0.007
0.704
0.785
0.692
0.567
0.746
0.785
0.011
0.008
0.013
0.014
0.010
0.008
θ* = -1
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂̅̅̅̅
2𝑤
̂
𝜃3∗
θ* = 0
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂2𝑤
̅̅̅̅
̂
𝜃3∗
θ* = 1
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂̅2∗
∗
̂
𝜃2𝑤
̅̅̅̅
𝜃̂3∗
θ* = 2
p
𝜃̂1∗
∗
𝜃̂1𝑤
̂
𝜃2∗
𝜃̂2̅∗
∗
̂
𝜃2𝑤
̅̅̅̅
𝜃̂3∗
0.675
0.009
0.050
0.227
0.022
0.130 -1.773
1.773
0.370
0.006
0.506
0.010
0.472
0.114
0.683 -1.527
1.532
0.362
0.008
0.549
0.021
0.178
0.025
0.150 -1.822
1.822
0.372
0.006
0.497
0.010
0.031
0.020
0.112 -1.969
1.969
0.377
0.005
0.469
0.009
0.228
0.030
0.190 -1.772
1.772
0.370
0.006
0.506
0.011
0.472
0.114
0.683 -1.527
1.532
0.362
0.008
0.549
0.021
∗
Note. RMSE = root mean square error, 𝜃̂1∗ = converted cutoff score using unweighted Method 1; 𝜃̂1𝑤
= converted cutoff
score using weighted Method 1; 𝜃̂2∗ = converted cutoff score using unweighted Method 2; 𝜃̂2̅∗ = converted cutoff score
∗
̂∗
using approximation formula for unweighted Method 2; 𝜃̂̅̅̅̅
2𝑤 = converted cutoff score using weighted method 2; 𝜃3 = converted
cutoff score using Method 3.
64
High Variability (.15)
According to Table 3, when the standard deviation is set to .15, and the a priori θ*
value is set to 0, in the unbiased condition, method 2 weighted did the best job of
recovering the original θ* values as shown by the bias mean of .007. Method 1 did the
next best with a bias mean of -.010. The other four methods do a similar job of
recovering the original θ* values with bias means ranging from -.026 to .026, i.e., the
range of the bias means show that there are small differences in determining which is the
best method, suggesting that you would get a recovery of the original θ* values that could
be equally applied regardless of the selected method. The RMSE means mirror the bias
means. Additionally, the means of the estimated θ* values across replications, range
from -.026 to .026. The means of the error index were low at .136 with low standard
deviations. The means of the consistency index were high ranging from .806 to .808 with
low standard deviations.
When the standard deviation is set to .15 and the a priori θ* value is set to -1, in
the unbiased condition, method 1 weighted and method 3 did the best job of recovering
the original θ* values with a bias mean of .349. Method 2 and method 2 weighted did the
next best job with means of .471 and .428 respectively. Method 1 and the approximation
formula for method 2 performed the worst with bias means of .504 and .731. The RMSE
means mirror the bias means. Additionally, the means of the estimated θ* values across
replications range from -.269 to -.651. The means of the error index were very low,
ranging from .224 to .300 with small standard deviations. The means of the consistency
65
index were moderate, ranging from .553 to .672, with small standard deviations.
When the standard deviation is set to .15 and the a priori θ* value is set to 1, in
the unbiased condition, method 1 weighted and method 3 did the best job of recovering
the original θ* values with bias means of -.433. Method 2 weighted did the next best job,
with a bias mean of -.585. Method 1 and method 2 followed with bias means of -.661
and -.682, and the approximation formula for method 2 did the worst job with a bias
mean of -.895. The RMSE means mirror the bias means. The means of the estimated θ*
values range from .105 to .567. The means of the error index were low ranging from
.275 to .353 with low standard deviations. The means of the consistency index were
moderate ranging from .513 to .667 with low standard deviations.
When the standard deviation is set to .15 and the a priori θ* value is set to -2, in
the unbiased condition, method 1 weighted and method 3 did the best job of recovering
the original θ* values with bias means of 1.482. The rest of the methods did an equally
poor job of recovering the original θ* values with bias means ranging from 1.182 to
2.031. The RMSE values mirror the bias values. The means of the estimated θ* values
range from -.518 to .031. The means of the error index range were moderate from .373 to
.375 with low standard deviations. The means of the consistency index were low,
ranging from .445 to .473 with low standard deviations. The proportion-correct cutoff
score generated in this condition was affected by not enough low ratings being possible in
the sampling when attempting to achieve the .15 standard deviation of rater variability at
the -2 θ* level.
66
When the standard deviation is set to .15 and the a priori θ* value is set to 2, in
the unbiased condition, method 1 weighted and method 3 did the best job of recovering
the original θ* values with bias means of -1.781. The rest of the methods did an equally
poor job of recovering the original θ* values with bias means ranging from -1.861 to 1.954. The RMSE means mirror the bias means. The means of the estimated θ* values
range from .046 to .219. The means of the error index were low, ranging from .362 to
.366 with low standard deviations. The means of the consistency index were moderate
ranging from .488 to .515 with low standard deviations.
67
Table 3
Summary of Cut Scores Across Methods for Different A Priori θ* Values, Where Ratings
are At the Item Characteristic Curves (ICCs) on Average (High Variability – SD .15)
Mean
SD
Range
Bias
RMSE
Error M
Error SD
Consistency M
Consistency SD
0.603
-0.042
-0.518
-0.014
0.031
-0.188
-0.518
0.009
0.020
0.104
0.024
0.018
0.039
0.104
0.060
0.121
0.659
0.140
0.114
0.255
0.659
1.958
1.482
1.986
2.031
1.812
1.482
1.958
1.486
1.987
2.031
1.813
1.486
0.374
0.374
0.375
0.375
0.373
0.374
0.006
0.007
0.006
0.006
0.006
0.007
0.464
0.445
0.467
0.473
0.450
0.445
0.008
0.011
0.008
0.008
0.009
0.011
0.413
-0.496
-0.651
-0.529
-0.269
-0.572
-0.651
0.006
0.019
0.024
0.018
0.013
0.015
0.024
0.041
0.122
0.157
0.120
0.085
0.102
0.157
0.504
0.349
0.471
0.731
0.428
0.349
0.504
0.350
0.472
0.731
0.428
0.350
0.250
0.224
0.244
0.300
0.236
0.224
0.007
0.006
0.008
0.007
0.006
0.006
0.628
0.672
0.638
0.553
0.651
0.672
0.011
0.010
0.012
0.011
0.010
0.010
0.600
-0.010
0.026
-0.021
-0.026
0.008
0.026
0.005
0.014
0.018
0.013
0.013
0.012
0.018
0.031
0.087
0.095
0.080
0.078
0.082
0.095
-0.010
0.026
-0.021
-0.026
0.007
0.026
0.017
0.031
0.025
0.029
0.015
0.031
0.136
0.136
0.136
0.136
0.136
0.136
0.003
0.003
0.004
0.004
0.003
0.003
0.807
0.808
0.806
0.806
0.808
0.808
0.005
0.005
0.005
0.005
0.005
0.005
0.725
0.339
0.567
0.318
0.105
0.415
0.567
0.008
0.022
0.032
0.023
0.020
0.021
0.032
0.048
0.133
0.368
0.150
0.115
0.138
0.368
-0.661
-0.433
-0.682
-0.895
-0.585
-0.433
0.662
0.434
0.682
0.896
0.586
0.434
0.312
0.275
0.315
0.353
0.299
0.275
0.008
0.009
0.009
0.008
0.008
0.009
0.598
0.667
0.591
0.513
0.623
0.667
0.013
0.013
0.014
0.013
0.013
0.013
θ* = -2
p
𝜃̂1∗
∗
𝜃̂1𝑤
̂
𝜃2∗
𝜃̂̅2∗
∗
̂
𝜃2𝑤
̅̅̅̅
𝜃̂3∗
θ* = -1
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂̅̅̅̅
2𝑤
̂
𝜃3∗
θ* = 0
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂2𝑤
̅̅̅̅
̂
𝜃3∗
θ* = 1
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂̅2∗
∗
̂
𝜃2𝑤
̅̅̅̅
𝜃̂3∗
θ* = 2
p
𝜃̂1∗
∗
𝜃̂1𝑤
̂
𝜃2∗
𝜃̂2̅∗
∗
̂
𝜃2𝑤
̅̅̅̅
𝜃̂3∗
0.656
0.009
0.061
0.139
0.023
0.151
-1.861 1.861
0.364
0.006
0.503
0.009
0.219
0.066
0.593
-1.781 1.782
0.362
0.006
0.515
0.014
0.124
0.025
0.160
-1.876 1.876
0.364
0.006
0.500
0.009
0.046
0.018
0.104
-1.954 1.954
0.366
0.006
0.488
0.009
0.134
0.023
0.143
-1.866 1.866
0.364
0.006
0.502
0.009
0.219
0.066
0.593
-1.781 1.782
0.362
0.006
0.515
0.014
∗
Note. RMSE = root mean square error, 𝜃̂1∗ = converted cutoff score using unweighted Method 1; 𝜃̂1𝑤
= converted cutoff
score using weighted Method 1; 𝜃̂2∗ = converted cutoff score using unweighted Method 2; 𝜃̂2̅∗ = converted cutoff score
∗
̂∗
using approximation formula for unweighted Method 2; 𝜃̂̅̅̅̅
2𝑤 = converted cutoff score using weighted method 2; 𝜃3 = converted
cutoff score using Method 3.
68
Biased Ratings
Low Variability (.05)
.15 above the ICC. According to Table 4, when the standard deviation is set to
.05, and the a priori θ* value is set to 0 in the bias condition where the simulated ratings
are set to .15 above the ICC, all six methods did a similar job recovering the original θ*
values as shown by the bias means ranging from .092 to .114, i.e., the range of the bias
means show that there are small differences in determining which is the best method,
suggesting that you would get a recovery of the original θ* values that could be equally
applied regardless of the selected method. The RMSE values mirror the bias means
except for method 2 with a mean of .304. The means of the estimated θ* values across
replications range from .092 to .113. The means of the error index were low at .048 with
a low standard deviation, and the means of the consistency index were high, ranging from
.933 to .944 with a low standard deviation.
When the standard deviation is set to .05 and the a priori θ* value is set to -1 in
the bias condition where the simulated ratings are set to .15 above the ICC, method 2 did
the best job of recovering the original θ* value as shown by the bias mean of .251. The
other five methods did a similar job of recovering the original θ* with bias means ranging
from .393 to .411, i.e., the range of the bias means show that there are small differences
in determining which is the best method, suggesting that you would get a recovery of the
original θ* values that could be equally applied regardless of the selected method. The
RMSE means mirror the bias means. The means of the estimated θ* values across
69
replications range from -.589 to -.749. The means of the error index were low ranging
from .082 to .113 with a low standard deviation. The means of the consistency index
were high ranging from .833 to .881 with a low standard deviation.
When the standard deviation is set to .05 and the a priori θ* value is set to 1 in the
bias condition where the simulated ratings are set to .15 above the ICC, method 1
weighted and method 3 did the best job of recovering the original θ* values as shown by
the bias mean of .045. Method 2 weighted did the next best job of recovering the original
θ* values as shown by the bias mean of -.066. The approximation formula for method 2
did the poorest job of recovering the original θ* values as shown by the bias mean of .644. The means of the estimated θ* values across replications range from .356 to 1.045.
The means of the error index were low, ranging from .098 to .242, with a low standard
deviation. The means of the consistency index were moderately high, ranging from .689
to .893 with a low standard deviation.
When the standard deviation is set to .05 and the a priori θ* value is set to -2 in
the bias condition where the simulated ratings are set to .15 above the ICC, method 2 did
the best job of recovering the original θ* value as shown by the bias mean of 1.192. The
rest of the methods did a poor job of recovering the original θ* value with bias means
ranging from 1.376 to 1.459. The RMSE means mirror the bias means. The means of the
estimated θ* values across replications range from -.808 to -.541. The means of the error
index were low ranging from .121 to .156 with low standard deviations, and the means of
the consistency index were high, ranging from .769 to .827 with low standard deviations.
70
When the standard deviation is set to .05 and the a priori θ* value is set to 2 in the
bias condition where the simulated ratings are set to .15 above the ICC, method 1
weighted and method 3 did the best job of recovering the original θ* values as shown by
the bias mean of -1.084. The rest of the methods did a poor job of recovering the original
θ* value as shown by the bias means ranging from -1.537 to -2.037. The RMSE mean
values mirror the bias mean values. The means of the estimated θ* values across
replications range from -.036 to .916. The means of the error index were low, ranging
from .333 to .409 with a low standard deviation, and the means of the consistency index
were moderately low, ranging from .415 to .628 with low standard deviations.
71
Table 4
Summary of Cut Scores Across Methods for Different A Priori θ* Values, Where Ratings
are .15 Above the Item Characteristic Curves (ICCs) on Average (Low Variability – SD
.05)
Mean
SD
Range
Bias
RMSE
Error M
Error SD
Consistency M
Consistency SD
0.331
-0.624
-0.541
-0.808
-0.573
-0.574
-0.541
0.001
0.005
0.015
0.006
0.006
0.008
0.015
0.010
0.030
0.091
0.030
0.038
0.052
0.091
1.376
1.459
1.192
1.427
1.426
1.459
1.376
1.459
1.192
1.427
1.426
1.459
0.142
0.156
0.121
0.150
0.150
0.156
0.001
0.003
0.001
0.002
0.002
0.003
0.792
0.769
0.827
0.778
0.779
0.769
0.002
0.005
0.002
0.003
0.003
0.005
0.347
-0.607
-0.603
-0.749
-0.589
-0.590
-0.603
0.001
0.005
0.014
0.006
0.006
0.007
0.014
0.010
0.031
0.093
0.030
0.043
0.043
0.093
0.393
0.397
0.251
0.411
0.410
0.397
0.393
0.397
0.251
0.411
0.410
0.397
0.109
0.110
0.082
0.113
0.113
0.110
0.002
0.003
0.002
0.002
0.002
0.003
0.840
0.838
0.881
0.833
0.833
0.838
0.002
0.005
0.002
0.003
0.003
0.005
0.644
0.113
0.110
0.092
0.094
0.113
0.110
0.002
0.007
0.005
0.005
0.005
0.004
0.005
0.010
0.050
0.030
0.020
0.030
0.025
0.030
0.113
0.110
0.092
0.094
0.113
0.110
0.114
0.110
0.304
0.094
0.113
0.110
0.048
0.048
0.048
0.048
0.048
0.048
0.001
0.001
0.001
0.001
0.001
0.001
0.933
0.933
0.934
0.934
0.933
0.933
0.002
0.002
0.002
0.002
0.002
0.002
0.851
0.861
1.045
0.760
0.356
0.933
1.045
0.005
0.018
0.007
0.020
0.023
0.009
0.007
0.030
0.118
0.050
0.140
0.141
0.065
0.050
-0.139
0.045
-0.240
-0.644
-0.066
0.045
0.140
0.046
0.241
0.645
0.067
0.046
0.120
0.098
0.139
0.242
0.109
0.098
0.006
0.005
0.008
0.011
0.005
0.005
0.864
0.893
0.840
0.689
0.879
0.893
0.008
0.005
0.010
0.016
0.006
0.005
θ* = -2
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂̅̅̅̅
2𝑤
̂
𝜃3∗
θ* = -1
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂2𝑤
̅̅̅̅
𝜃̂3∗
θ* = 0
p
𝜃̂1∗
∗
𝜃̂1𝑤
̂
𝜃2∗
𝜃̂2̅∗
∗
̂
𝜃2𝑤
̅̅̅̅
𝜃̂3∗
θ* = 1
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂̅̅̅̅
2𝑤
̂
𝜃3∗
θ* = 2
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂2𝑤
̅̅̅̅
𝜃̂3∗
0.718
0.008
0.060
0.463
0.022
0.141
-1.537
1.537
0.372
0.007
0.537
0.011
0.916
0.075
0.419
-1.084
1.087
0.333
0.011
0.628
0.017
0.299
0.024
0.150
-1.701
1.701
0.385
0.007
0.497
0.012
-0.036
0.020
0.133
-2.037
2.037
0.409
0.005
0.415
0.009
0.374
0.026
0.161
-1.626
1.626
0.379
0.007
0.516
0.012
0.916
0.075
0.419
-1.084
1.087
0.333
0.011
0.628
0.017
∗
Note. RMSE = root mean square error, 𝜃̂1∗ = converted cutoff score using unweighted Method 1; 𝜃̂1𝑤
= converted cutoff
score using weighted Method 1; 𝜃̂2∗ = converted cutoff score using unweighted Method 2; 𝜃̂2̅∗ = converted cutoff score
∗
̂∗
using approximation formula for unweighted Method 2; 𝜃̂2𝑤
̅̅̅̅ = converted cutoff score using weighted method 2; 𝜃3 = converted
cutoff score using Method 3.
72
Moderate Variability (.10)
.15 above the ICC. According to Table 5, when the standard deviation is set to
.10 and the a priori θ* value is set to 0 in the bias condition where the simulated ratings
are set to .15 above the ICC, all six methods did a similar job recovering the original θ*
with bias means ranging from .241 to .318, i.e., the range of the bias means show that
there are small differences in determining which is the best method, suggesting that you
would get a recovery of the original θ* values that could be equally applied regardless of
the selected method. The RMSE values mirror the bias means. The means of the
estimated θ* values across replications range from .241 to .318. The means of the error
index were low, at .003, with a low standard deviation, and the means of the consistency
index were high, ranging from .833 to .836 with low standard deviations.
When the standard deviation is set to .10 and the a priori θ* value is set to -1 in
the bias condition where the simulated ratings are set to .15 above the ICC, method 1
weighted and method 3 did the best job of recovering the original θ* values as shown by
the bias mean of .784. Method 2 did the worst job of recovering the original θ* value
with a bias mean of 1.003. The RMSE means mirror the bias means. The means of the
estimated θ* values across replications range from -.216 to .012. The means of the error
index were low, ranging from .236 to .245 with low standard deviations, and the means
of the consistency index were moderate, ranging from .651 to .654 with low standard
deviations.
When the standard deviation is set to .10 and the a priori θ* value is set to 1 in the
73
bias condition where the simulated ratings are set to .15 above the ICC, method 1
weighted and method 3 did the best job of recovering the original θ* values as shown by
the bias mean of -.146. The approximation formula for method 2 did the worst job of
recovering the original θ* value as shown by the bias mean of -.957. The RMSE means
mirror the bias means. The means of the estimated θ* values across replications range
from .132 to .929. The means of the error index were low ranging from .221 to .366 with
low standard deviations, and the means of the consistency index were moderate, ranging
from .498 to .753 with low standard deviations.
When the standard deviation is set to .10 and the a priori θ* value is set to -2 in
the bias condition where the simulated ratings are set to .15 above the ICC, method 1
weighted and method 3 did the best job of recovering the original θ* values as shown by
the bias means of 1.679. The rest of the methods do a similarly poor job of recovering
the original θ* values with bias means ranging from 1.909 to 2.135, i.e., the range of the
bias means show that there are small differences in determining which is the best method,
suggesting that you would get a recovery of the original θ* values that could be equally
applied regardless of the selected method. The RMSE means mirror the bias means. The
means of the estimated θ* values across replications range from -.321 to .135. The
means of the error index were low, ranging from .315 to .319 with low standard
deviations, and the means of the consistency index were moderate, ranging from .523 to
.566 with low standard deviations. The proportion-correct cutoff score generated in this
condition was affected by not enough low ratings being possible in the sampling when
74
attempting to achieve the .10 standard deviation of rater variability at the -2 θ* level.
When the standard deviation is set to .10 and the a priori θ* value is set to 2 in the
bias condition where the simulated ratings are set to .15 above the ICC, method 1
weighted and method 3 did the best job of recovering the original θ* values as shown by
the bias mean of -1.653. The rest of the methods do a similarly poor job of recovering
the original θ* values with bias means ranging from -1.839 to -1.964, i.e., the range of the
bias means show that there are small differences in determining which is the best method,
suggesting that you would get a recovery of the original θ* values that could be equally
applied regardless of the selected method. The RMSE means mirror the bias means. The
means of the estimated θ* values across replications range from .036 to .347. The means
of the error index were low, ranging from .363 to .368, with low standard deviations, and
the means of the consistency index were moderately low, ranging from .483 to .533 with
low standard deviations.
75
Table 5
Summary of Cut Scores Across Methods for Different A Priori θ* Values, Where Ratings
are .15 Above the Item Characteristic Curves (ICCs) on Average (Medium Variability SD
.10)
Mean
SD
Range
Bias
RMSE
Error M
Error SD
Consistency M
Consistency SD
0.660
0.109
-0.321
0.135
0.103
-0.091
-0.321
0.008
0.017
0.038
0.023
0.016
0.018
0.038
0.050
0.108
0.289
0.140
0.094
0.112
0.289
2.109
1.679
2.135
2.103
1.909
1.679
2.109
1.680
2.135
2.103
1.909
1.680
0.315
0.319
0.315
0.315
0.315
0.319
0.005
0.006
0.005
0.005
0.005
0.006
0.566
0.523
0.569
0.565
0.545
0.523
0.007
0.009
0.007
0.007
0.008
0.009
0.610
0.012
-0.216
0.003
-0.009
-0.135
-0.216
0.005
0.009
0.024
0.012
0.008
0.012
0.024
0.030
0.062
0.157
0.070
0.046
0.078
0.157
1.012
0.784
1.003
0.991
0.865
0.784
1.012
0.784
1.004
0.991
0.865
0.784
0.245
0.236
0.245
0.244
0.238
0.236
0.005
0.004
0.005
0.004
0.004
0.004
0.653
0.651
0.654
0.654
0.653
0.651
0.006
0.007
0.006
0.006
0.006
0.007
0.706
0.249
0.318
0.263
0.241
0.292
0.318
0.004
0.010
0.017
0.011
0.010
0.009
0.017
0.020
0.073
0.120
0.070
0.066
0.055
0.120
0.249
0.318
0.263
0.241
0.292
0.318
0.249
0.318
0.513
0.242
0.292
0.318
0.125
0.126
0.125
0.125
0.125
0.126
0.003
0.003
0.003
0.003
0.003
0.003
0.834
0.836
0.835
0.833
0.836
0.836
0.004
0.004
0.004
0.005
0.004
0.004
0.767
0.536
0.929
0.448
0.132
0.644
0.929
0.007
0.023
0.044
0.022
0.020
0.022
0.044
0.050
0.147
0.285
0.150
0.117
0.134
0.285
-0.579
-0.146
-0.664
-0.957
-0.498
-0.146
0.580
0.161
0.664
0.957
0.499
0.161
0.283
0.221
0.300
0.366
0.262
0.221
0.009
0.008
0.010
0.008
0.009
0.008
0.655
0.753
0.625
0.498
0.689
0.753
0.013
0.011
0.014
0.014
0.012
0.011
θ* = -2
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂̅̅̅̅
2𝑤
̂
𝜃3∗
θ* = -1
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂2𝑤
̅̅̅̅
𝜃̂3∗
θ* = 0
p
𝜃̂1∗
∗
𝜃̂1𝑤
̂
𝜃2∗
𝜃̂2̅∗
∗
̂
𝜃2𝑤
̅̅̅̅
𝜃̂3∗
θ* = 1
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂̅̅̅̅
2𝑤
̂
𝜃3∗
θ* = 2
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂2𝑤
̅̅̅̅
𝜃̂3∗
0.657
0.009
0.060
0.161
0.024
0.156
-1.839
1.839
0.365
0.006
0.504
0.009
0.347
0.098
0.666
-1.653
1.656
0.363
0.006
0.533
0.017
0.129
0.026
0.150
-1.871
1.872
0.366
0.006
0.498
0.009
0.036
0.019
0.121
-1.964
1.964
0.368
0.006
0.483
0.009
0.160
0.025
0.176
-1.840
1.840
0.365
0.006
0.503
0.010
0.347
0.098
0.666
-1.653
1.656
0.363
0.006
0.533
0.017
∗
Note. RMSE = root mean square error, 𝜃̂1∗ = converted cutoff score using unweighted Method 1; 𝜃̂1𝑤
= converted cutoff
score using weighted Method 1; 𝜃̂2∗ = converted cutoff score using unweighted Method 2; 𝜃̂2̅∗ = converted cutoff score
∗
̂∗
using approximation formula for unweighted Method 2; 𝜃̂2𝑤
̅̅̅̅ = converted cutoff score using weighted method 2; 𝜃3 = converted
cutoff score using Method 3.
76
High Variability (.15)
.15 above the ICC. According to Table 6, when the standard deviation is set to
.15 and the a priori θ* value is set to 0 in the bias condition where the simulated ratings
are set to .15 above the ICC, the approximation formula for method 2 did the best job of
recovering the original θ* values as shown by the bias mean of .233. The rest of the
methods did a similar job of recovering the original θ* values with bias means ranging
from .277 to .327, i.e., the range of the bias means show that there are small differences
in determining which is the best method, suggesting that you would get a recovery of the
original θ* values that could be equally applied regardless of the selected method. The
RMSE means mirror the bias means. The means of the estimated θ* values across
replications range from .233 to .327. The means of the error index were low ranging
from .181 to .185 with low standard deviations, and the means of the consistency index
were high ranging from .754 to .765 with low standard deviations.
When the standard deviation is set to .15 and the a priori θ* value is set to -1 in
the bias condition where the simulated ratings are set to .15 above the ICC, method 1
weighted and method 3 do the best job of recovering the original θ* values as shown by
the bias mean of .766. The rest of the methods to a similarly poor job of recovering the
original θ* value with bias means ranging from .982 to 1.218, i.e., the range of the bias
means show that there are small differences in determining which is the best method,
suggesting that you would get a recovery of the original θ* values that could be equally
applied regardless of the selected method. The RMSE means mirror the bias means. The
77
means of the estimated θ* values across replications range from -.234 to .218. The
means of the error index were low, ranging from .283 to .308, with low standard
deviations, and the means of the consistency index were moderate, ranging from .543 to
.621.
When the standard deviation is set to .15 and the a priori θ* value is set to 1 in the
bias condition where the simulated ratings are set to .15 above the ICC, the simulated
ratings failed to generate. This may be due to the restricted range at the top end of the
distribution and the variability in the rater agreement being so high.
When the standard deviation is set to .15 and the a priori θ* value is set to -2 in
the bias condition where the simulated ratings are set to .15 above the ICC, all six
methods do a similarly poor job of recovering the original θ* values as shown by the bias
means ranging from 1.946 to 2.102, i.e., the range of the bias means show that there are
small differences in determining which is the best method, suggesting that you would get
a recovery of the original θ* values that could be equally applied regardless of the
selected method. The RMSE values mirror the bias means. The range of the estimated
θ* values across replications range from -.054 to .103. The means of the error index were
low, ranging from .346 to .350 with low standard deviations, and the means of the
consistency index were moderate, ranging from .498 to .522 with low standard
deviations.
When the standard deviation is set to .15 and the a priori θ* value is set to 2 in the
bias condition where the simulated ratings are set to .15 above the ICC, the simulated
78
ratings failed to generate. This may be due to the restricted range at the top end of the
distribution and the variability in the rater agreement being so high.
79
Table 6
Summary of Cut Scores Across Methods for Different A Priori θ* Values, Where Ratings
are .15 Above the Item Characteristic Curves (ICCs) on Average (High Variability SD
.15)
Mean
SD
Range
Bias
RMSE
Error M
Error SD
Consistency M
Consistency SD
0.648
0.102
-0.054
0.103
0.069
0.038
-0.054
0.010
0.021
0.047
0.026
0.018
0.021
0.047
0.063
0.140
0.402
0.170
0.108
0.134
0.402
2.102
1.946
2.103
2.069
2.038
1.946
2.102
1.946
2.104
2.069
2.039
1.946
0.346
0.350
0.346
0.347
0.347
0.350
0.005
0.006
0.005
0.005
0.005
0.006
0.522
0.498
0.522
0.517
0.512
0.498
0.008
0.011
0.008
0.008
0.008
0.011
0.690
0.209
-0.234
0.218
0.133
-0.018
-0.234
0.007
0.017
0.030
0.019
0.012
0.017
0.030
0.045
0.105
0.206
0.130
0.092
0.103
0.206
1.209
0.766
1.218
1.133
0.982
0.766
1.209
0.767
1.218
1.133
0.982
0.767
0.284
0.308
0.283
0.286
0.292
0.308
0.004
0.006
0.004
0.005
0.005
0.006
0.620
0.543
0.621
0.609
0.584
0.543
0.006
0.010
0.006
0.006
0.008
0.010
0.728
0.305
0.277
0.327
0.233
0.287
0.277
0.005
0.016
0.015
0.015
0.014
0.013
0.015
0.032
0.102
0.099
0.090
0.088
0.075
0.099
0.305
0.277
0.327
0.233
0.287
0.277
0.306
0.278
0.572
0.233
0.287
0.278
0.182
0.182
0.181
0.185
0.182
0.182
0.004
0.004
0.004
0.005
0.004
0.004
0.763
0.760
0.765
0.754
0.761
0.760
0.006
0.006
0.006
0.007
0.006
0.006
θ* = -2
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂̅̅̅̅
2𝑤
̂
𝜃3∗
θ* = -1
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂2𝑤
̅̅̅̅
𝜃̂3∗
θ* = 0
p
𝜃̂1∗
∗
𝜃̂1𝑤
̂
𝜃2∗
𝜃̂2̅∗
∗
̂
𝜃2𝑤
̅̅̅̅
𝜃̂3∗
θ* = 1
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂̅̅̅̅
2𝑤
̂
𝜃3∗
θ* = 2
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂2𝑤
̅̅̅̅
𝜃̂3∗
∗
Note. RMSE = root mean square error, 𝜃̂1∗ = converted cutoff score using unweighted Method 1; 𝜃̂1𝑤
= converted cutoff
∗
∗
̂
̂
score using weighted Method 1; 𝜃2 = converted cutoff score using unweighted Method 2; 𝜃2̅ = converted cutoff score
∗
̂∗
using approximation formula for unweighted Method 2; 𝜃̂2𝑤
̅̅̅̅ = converted cutoff score using weighted method 2; 𝜃3 = converted
cutoff score using Method 3.
80
Low Variability (.05)
.15 below the ICC. According to Table 7, when the standard deviation is set to
.05 and the a priori θ* value is set to 0 in the bias condition where the simulated ratings
are set to .15 below the ICC, all six methods did a similar job of recovering the original
θ* value as shown by the bias means ranging from -.083 to -.110, i.e., the range of the
bias means show that there are small differences in determining which is the best method,
suggesting that you would get a recovery of the original θ* values that could be equally
applied regardless of the selected method. The RMSE values mirror the bias means. The
means of the estimated θ* values across replications range from -.083 to -.110. The
means of the error index were low and range from .051 to .052 with low standard
deviations, and the means of the consistency index were high ranging from .925 to .926
with low standard deviations.
When the standard deviation is set to .05 and the a priori θ* value is set to -1 in
the bias condition where the simulated ratings are set to .15 below the ICC, method 2 did
the best job of recovering the original θ* value as shown by the bias mean of -.009.
Method 1 weighted and method 3 did the next best job as shown by the bias mean of .011, and the approximation formula for method 2 follows with a bias mean of .082. The
other two methods do a similar job as shown by the bias means of .137 and .140, i.e., the
range of the bias means show that there are small differences in determining which is the
best method, suggesting that you would get a recovery of the original θ* values that could
be equally applied regardless of the selected method. The RMSE means mirror the bias
81
means. The means of the estimated θ* values across replications range from -.860 to 1.011. The means of the error index were low, ranging from .070 to .089 with low
standard deviations, and the means of the consistency index were high ranging from .875
to .904 with low standard deviations.
When the standard deviation is set to .05 and the a priori θ* value is set to 1 in the
bias condition where the simulated ratings are set to .15 below the ICC, method 1
weighted and method 3 did the best job of recovering the original θ* values as shown by
the bias means of -.131. Method 2 weighted did the next best job as shown by the bias
mean of -.234. Method 2 and Method 1 followed with bias means of -.399 and -.416
respectively. The approximation formula for method 2 did the worst job with a bias
mean of -.646. The RMSE means mirror the bias means. The means of the estimated θ*
values across replications range from .353 to .869. The means of the error index were
low ranging from .101 to .204 with low standard deviations, and the means of the
consistency index were moderately high ranging from .738 to .886 with low standard
deviations.
When the standard deviation is set to .05 and the a priori θ* value is set to -2 in
the bias condition where the simulated ratings are set to .15 below the ICC, method 2 did
the best job of recovering the original θ* values as shown by the bias mean of .779. The
rest of the methods do a similarly poor job of recovering the original θ* values as shown
by the bias means ranging from 1.019 to 1.049, i.e., the range of the bias means show that
there are small differences in determining which is the best method, suggesting that you
82
would get a recovery of the original θ* values that could be equally applied regardless of
the selected method. The RMSE means mirror the bias means. The means of the
estimated θ* values across replications range from -.910 to -1.221. The means of the
error index were low, ranging from .085 to .111 with low standard deviations, and the
means of the consistency index were high, ranging from .836 to .888.
When the standard deviation is set to .05 and the a priori θ* value is set to 2 in the
bias condition where the simulated ratings are set to .15 below the ICC, method 1
weighted and method 3 do the best job of recovering the original θ* values as shown by
the bias means of -.849. The rest of the methods do a similarly poor job of recovering the
original θ* values as shown by the bias means ranging from -1.428 to -1.983, i.e., the
range of the bias means show that there are small differences in determining which is the
best method, suggesting that you would get a recovery of the original θ* values that could
be equally applied regardless of the selected method. The RMSE means mirror the bias
means. The means of the estimated θ* values across replications range from .353 to
1.151. The means of the error index were low, ranging from .302 to .399 with low
standard deviations, and the means of the consistency index were moderately low,
ranging from .437 to .675 with low standard deviations.
83
Table 7
Summary of Cut Scores Across Methods for Different A Priori θ* Values, Where Ratings
are .15 Below the Item Characteristic Curves (ICCs) on Average (Low Variability SD
.05)
Mean
SD
Range
Bias
RMSE
Error M
Error SD
Consistency M
Consistency SD
0.251
-0.987
-0.981
-1.221
-0.910
-0.951
-0.981
0.002
0.016
0.042
0.017
0.022
0.021
0.042
0.010
0.098
0.270
0.110
0.162
0.136
0.270
1.013
1.019
0.779
1.090
1.049
1.019
1.013
1.020
0.779
1.090
1.049
1.020
0.107
0.108
0.085
0.117
0.111
0.108
0.003
0.006
0.003
0.004
0.004
0.006
0.852
0.851
0.888
0.836
0.845
0.851
0.004
0.010
0.004
0.007
0.006
0.010
0.286
-0.860
-1.011
-1.009
-0.863
-0.918
-1.011
0.002
0.014
0.013
0.011
0.016
0.013
0.013
0.010
0.083
0.075
0.070
0.108
0.079
0.075
0.140
-0.011
-0.009
0.137
0.082
-0.011
0.141
0.017
0.014
0.138
0.083
0.017
0.089
0.070
0.070
0.088
0.081
0.070
0.003
0.002
0.002
0.003
0.003
0.002
0.875
0.904
0.903
0.875
0.887
0.904
0.004
0.003
0.003
0.005
0.004
0.003
0.569
-0.110
-0.083
-0.103
-0.102
-0.090
-0.083
0.002
0.007
0.004
0.006
0.006
0.004
0.004
0.010
0.054
0.024
0.030
0.042
0.028
0.024
-0.110
-0.083
-0.103
-0.102
-0.090
-0.083
0.110
0.083
0.103
0.102
0.090
0.083
0.052
0.051
0.051
0.051
0.051
0.051
0.001
0.001
0.001
0.001
0.001
0.001
0.925
0.926
0.926
0.926
0.926
0.926
0.002
0.002
0.002
0.002
0.002
0.002
0.812
0.584
0.869
0.601
0.353
0.766
0.869
0.005
0.012
0.006
0.017
0.019
0.008
0.006
0.030
0.081
0.049
0.100
0.127
0.050
0.049
-0.416
-0.131
-0.399
-0.646
-0.234
-0.131
0.416
0.131
0.400
0.647
0.234
0.131
0.144
0.101
0.140
0.204
0.111
0.101
0.006
0.004
0.008
0.009
0.005
0.004
0.826
0.886
0.832
0.738
0.872
0.886
0.008
0.005
0.010
0.013
0.006
0.005
θ* = -2
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂̅̅̅̅
2𝑤
̂
𝜃3∗
θ* = -1
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂2𝑤
̅̅̅̅
𝜃̂3∗
θ* = 0
p
𝜃̂1∗
∗
𝜃̂1𝑤
̂
𝜃2∗
𝜃̂2̅∗
∗
̂
𝜃2𝑤
̅̅̅̅
𝜃̂3∗
θ* = 1
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂̅̅̅̅
2𝑤
̂
𝜃3∗
θ* = 2
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂2𝑤
̅̅̅̅
𝜃̂3∗
0.714
0.008
0.050
0.353
0.020
0.144
-1.647
1.647
0.372
0.007
0.522
0.011
1.151
0.051
0.366
-0.849
0.850
0.302
0.010
0.675
0.012
0.287
0.025
0.160
-1.713
1.713
0.378
0.007
0.505
0.013
0.017
0.020
0.119
-1.983
1.983
0.399
0.005
0.437
0.010
0.571
0.022
0.159
-1.428
1.429
0.352
0.008
0.574
0.012
1.151
0.051
0.366
-0.849
0.850
0.302
0.010
0.675
0.012
∗
Note. RMSE = root mean square error, 𝜃̂1∗ = converted cutoff score using unweighted Method 1; 𝜃̂1𝑤
= converted cutoff
score using weighted Method 1; 𝜃̂2∗ = converted cutoff score using unweighted Method 2; 𝜃̂2̅∗ = converted cutoff score
∗
̂∗
using approximation formula for unweighted Method 2; 𝜃̂2𝑤
̅̅̅̅ = converted cutoff score using weighted method 2; 𝜃3 = converted
cutoff score using Method 3.
84
Medium Variability (.10)
.15 below the ICC. According to Table 8, when the standard deviation is set to
.10 and the a priori value is set to 0 in the bias condition where the simulated ratings are
set to .15 below the ICC, all 6 methods did a similar job of recovering the original θ*
values as shown by the bias means ranging from -.104 to .009, i.e., the range of the bias
means show that there are small differences in determining which is the best method,
suggesting that you would get a recovery of the original θ* values that could be equally
applied regardless of the selected method. The RMSE means mirror the bias means. The
means of the estimated θ* values across replications range from -.104 to .009. The
means of the error index were low ranging from .126 to .135 and the means of the
consistency index were high, ranging from .809 to .818.
When the standard deviation is set to .10 and the a priori value is set to -1 in the
bias condition where the simulated ratings are set to .15 below the ICC, method 1
weighted and method 3 did the best job of recovering the original θ* values as shown by
the bias means of .237. Method 2 weighted did the next best job as shown by the bias
mean of .584. The rest of the methods do a similar job, as shown by the bias means
ranging from .885 to 1.005, i.e., the range of the bias means show that there are small
differences in determining which is the best method, suggesting that you would get a
recovery of the original θ* values that could be equally applied regardless of the selected
method. The RMSE means mirror the bias means. The means of the estimated θ* values
across replications range from -.764 to .005. The means of the error index were low,
85
ranging from .338 to .392, with low standard deviations, and the means of the consistency
index were low, ranging from .441 to .514.
When the standard deviation is set to .10 and the a priori value is set to 1 in the
bias condition where the simulated ratings are set to .15 below the ICC, method 1
weighted and method 3 did the best job of recovering the original θ* values as shown by
the bias means of -.278. Method 2 weighted did the next best job as shown by the bias
mean of -.438. The rest of the methods do a similar job as shown by the bias means
ranging from -.609 to -.800, i.e., the range of the bias means show that there are small
differences in determining which is the best method, suggesting that you would get a
recovery of the original θ* values that could be equally applied regardless of the selected
method. The RMSE means mirror the bias means. The means of the estimated θ* values
across replications range from .200 to .722. The means of the error index were low
ranging from .189 to .288 with low standard deviations, and the means of the consistency
index were moderate ranging from .613 to .780.
When the standard deviation is set to .10 and the a priori value is set to -2 in the
bias condition where the simulated ratings are set to .15 below the ICC, method 1
weighted and method 3 did the best job of recovering the original θ* values as shown by
the bias means of 1.864, the rest of the methods did a similarly poor job of recovering the
original θ* values as shown by the bias means ranging from 1.977 to 2.059, i.e., the range
of the bias means show that there are small differences in determining which is the best
method, suggesting that you would get a recovery of the original θ* values that could be
86
equally applied regardless of the selected method. The RMSE means mirror the bias
means. The means of the estimated θ* values across replications, range from -.137 to
.059. The means of the error index were low, ranging from .361 to .364 with low
standard deviations, and the means of the consistency index were low ranging from .468
to .495 with low standard deviations. The proportion-correct cutoff score generated in
this condition was affected by not enough low ratings being possible in the sampling
when attempting to achieve the .10 standard deviation of rater variability at the -2 θ*
level.
When the standard deviation is set to .10 and the a priori value is set to 2 in the
bias condition where the simulated ratings are set to .15 below the ICC, all of the
methods did a similarly poor job of recovering the original θ* value with bias means
ranging from -1.424 to -1.958, i.e., the range of the bias means show that there are small
differences in determining which is the best method, suggesting that you would get a
recovery of the original θ* values that could be equally applied regardless of the selected
method. The RMSE means mirror the bias means. The means of the estimated θ* values
across replications range from .168 to .576. The means of the error index were low,
ranging from .356 to .373 with low standard deviations, and the means of the consistency
index were moderately low, ranging from .476 to .570 with low standard deviations.
87
Table 8
Summary of Cut Scores Across Methods for Different A Priori θ* Values, Where Ratings
are .15 Below the Item Characteristic Curves (ICCs) on Average (Medium Variability SD
.10)
Mean
SD
Range
Bias
RMSE
Error M
Error SD
Consistency M
Consistency SD
0.631
0.033
-0.137
0.059
0.053
-0.023
-0.137
0.009
0.020
0.064
0.025
0.018
0.022
0.064
0.050
0.127
0.426
0.140
0.108
0.139
0.426
2.033
1.863
2.059
2.053
1.977
1.863
2.033
1.864
2.059
2.053
1.977
1.864
0.361
0.364
0.361
0.361
0.362
0.364
0.006
0.006
0.006
0.006
0.006
0.006
0.492
0.468
0.496
0.495
0.484
0.468
0.008
0.012
0.009
0.009
0.009
0.012
0.565
-0.115
-0.764
-0.112
0.005
-0.416
-0.764
0.008
0.018
0.030
0.022
0.016
0.020
0.030
0.060
0.126
0.181
0.150
0.109
0.131
0.181
0.885
0.237
0.888
1.005
0.584
0.237
0.885
0.238
0.888
1.005
0.584
0.238
0.384
0.338
0.385
0.392
0.362
0.338
0.006
0.008
0.006
0.005
0.007
0.008
0.441
0.514
0.441
0.446
0.458
0.514
0.008
0.012
0.008
0.007
0.011
0.012
0.597
-0.036
-0.104
-0.030
0.009
-0.087
-0.104
0.004
0.012
0.008
0.012
0.010
0.007
0.008
0.030
0.075
0.050
0.060
0.063
0.043
0.050
-0.036
-0.104
-0.030
0.009
-0.087
-0.104
0.038
0.104
0.032
0.013
0.088
0.104
0.129
0.126
0.130
0.135
0.126
0.126
0.004
0.004
0.004
0.004
0.004
0.004
0.815
0.818
0.815
0.809
0.818
0.818
0.005
0.005
0.006
0.005
0.005
0.005
0.749
0.388
0.722
0.391
0.200
0.563
0.722
0.006
0.017
0.016
0.020
0.017
0.014
0.016
0.050
0.118
0.149
0.140
0.114
0.099
0.149
-0.612
-0.278
-0.609
-0.800
-0.438
-0.278
0.612
0.279
0.609
0.800
0.438
0.279
0.243
0.189
0.243
0.288
0.209
0.189
0.008
0.006
0.009
0.009
0.007
0.006
0.691
0.780
0.692
0.613
0.746
0.780
0.011
0.008
0.013
0.013
0.009
0.008
θ* = -2
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂̅̅̅̅
2𝑤
̂
𝜃3∗
θ* = -1
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂2𝑤
̅̅̅̅
𝜃̂3∗
θ* = 0
p
𝜃̂1∗
∗
𝜃̂1𝑤
̂
𝜃2∗
𝜃̂2̅∗
∗
̂
𝜃2𝑤
̅̅̅̅
𝜃̂3∗
θ* = 1
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂̅̅̅̅
2𝑤
̂
𝜃3∗
θ* = 2
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂2𝑤
̅̅̅̅
𝜃̂3∗
0.672 0.009
0.060
0.201 0.022
0.144
-1.799
1.799
0.368
0.006
0.505
0.009
0.576 0.110
0.700
-1.424
1.428
0.356
0.008
0.570
0.019
0.168 0.026
0.150
-1.832
1.832
0.369
0.006
0.499
0.010
0.042 0.019
0.134
-1.958
1.958
0.373
0.005
0.476
0.009
0.258 0.028
0.170
-1.742
1.742
0.366
0.006
0.516
0.010
0.576 0.110
0.700
-1.424
1.428
0.356
0.008
0.570
0.019
∗
Note. RMSE = root mean square error, 𝜃̂1∗ = converted cutoff score using unweighted Method 1; 𝜃̂1𝑤
= converted cutoff
score using weighted Method 1; 𝜃̂2∗ = converted cutoff score using unweighted Method 2; 𝜃̂2̅∗ = converted cutoff score
∗
̂∗
using approximation formula for unweighted Method 2; 𝜃̂2𝑤
̅̅̅̅ = converted cutoff score using weighted method 2; 𝜃3 = converted
cutoff score using Method 3.
88
High Variability (.15)
.15 below the ICC. According to Table 9, when the standard deviation is set to
.15 and the a priori value is set to 0 in the bias condition where the simulated ratings are
set to .15 below the ICC, the approximation formula for method 2 did the best job of
recovering the original θ* values as shown by the bias mean of -.051. Method 1 did the
next best job as shown by the bias mean of -.083. Method 2 did the next best job as
shown by the bias mean of -.085. Method 1 weighted and method 3 did the worst job as
shown by the bias means of -.152. The RMSE means mirror the bias means. The means
of the estimated θ* values across replications range from -.051 to -.152. The means of
the error index were low ranging from .178 to .183, and the means of the consistency
index were moderately high, ranging from .737 to .740 with low standard deviations.
When the standard deviation is set to .15 and the a priori value is set to -1 in the
bias condition where the simulated ratings are set to .15 below the ICC, method 1
weighted and method 3 did the best job of recovering the original θ* values as shown by
the bias means of .552. Method 2 weighted did the next best job as shown by the bias
mean of .774. The rest of the methods do a similar job of recovering the original θ*
values as shown by the bias means ranging from .919 to 1.024, i.e., the range of the bias
means show that there are small differences in determining which is the best method,
suggesting that you would get a recovery of the original θ* values that could be equally
applied regardless of the selected method. The RMSE values mirror the bias means. The
means of the estimated θ* values across replications range from -.448 to .024. The
89
means of the error index were low, ranging from .371 to .391 and the means of the
consistency index were low and range from .435 to .449 with low standard deviations.
When the standard deviation is set to .15 and the a priori value is set to 1 in the
bias condition where the simulated ratings are set to .15 below the ICC, method1
weighted and method 3 did the best job of recovering the original θ* values as shown by
the bias means of -.0493. The rest of the methods did a similar job of recovering the
original θ* values as shown by the bias means ranging from -.650 to -.891, i.e., the range
of the bias means show that there are small differences in determining which is the best
method, suggesting that you would get a recovery of the original θ* values that could be
equally applied regardless of the selected method. The RMSE values mirror the bias
means. The range of the estimated θ* values across replications range from .109 to .507.
The means of the error index were low ranging from .264 to .323 and the means of the
consistency index were moderate, ranging from .555 to .675 with low standard
deviations.
When the standard deviation is set to .15 and the a priori value is set to -2 in the
bias condition where the simulated ratings are set to .15 below the ICC, all six methods
did a poor job of recovering the original θ* values as shown by the bias means ranging
from 1.997 to 2.067. The RMSE means mirror the bias means. The range of the
estimated θ* values across replications range from -.002 to .067. The means of the error
index were low ranging from .359 to .360 and the means of the consistency index were
moderately low ranging from .490 to .500 with low standard deviations. The proportion-
90
correct cutoff score generated in this condition was affected by not enough low ratings
being possible in the sampling when attempting to achieve the .15 standard deviation of
rater variability at the -2 θ* level.
When the standard deviation is set to .15 and the a priori value is set to 2 in the
bias condition where the simulated ratings are set to .15 below the ICC, all six methods
did a similar job of recovering the original θ* values as shown by the bias means ranging
from -.1689 to -1.882, i.e., the range of the bias means show that there are small
differences in determining which is the best method, suggesting that you would get a
recovery of the original θ* values that could be equally applied regardless of the selected
method. The RMSE values mirror the bias means. The range of the estimated θ* values
across replications range from .058 to .311. The means of the error index were low
ranging from .357 to .361 and the means of the consistency index were moderate, ranging
from .496 to .536 with low standard deviations.
91
Table 9
Summary of Cut Scores Across Methods for Different A Priori θ* Values, Where Ratings
are .15 Below the Item Characteristic Curves (ICCs) on Average (High Variability SD
.15)
Mean
SD
Range
Bias
RMSE
Error M
Error SD
Consistency M
Consistency SD
0.634
0.040
-0.002
0.067
0.055
0.028
-0.002
0.010
0.021
0.033
0.026
0.018
0.021
0.033
0.064
0.145
0.297
0.170
0.127
0.164
0.297
2.040
1.997
2.067
2.054
2.028
1.997
2.040
1.998
2.067
2.054
2.028
1.998
0.359
0.360
0.359
0.359
0.359
0.360
0.006
0.006
0.006
0.006
0.006
0.006
0.496
0.490
0.500
0.498
0.494
0.490
0.009
0.010
0.009
0.009
0.009
0.010
0.578
-0.081
-0.448
-0.078
0.024
-0.226
-0.448
0.009
0.020
0.042
0.023
0.017
0.021
0.042
0.062
0.135
0.259
0.160
0.109
0.125
0.259
0.919
0.552
0.922
1.024
0.774
0.552
0.919
0.554
0.922
1.024
0.774
0.554
0.388
0.371
0.388
0.391
0.381
0.371
0.006
0.007
0.006
0.005
0.006
0.007
0.440
0.446
0.441
0.449
0.435
0.446
0.007
0.011
0.007
0.007
0.009
0.011
0.576
-0.083
-0.152
-0.085
-0.051
-0.128
-0.152
0.006
0.013
0.011
0.015
0.012
0.010
0.011
0.036
0.090
0.075
0.090
0.081
0.063
0.075
-0.083
-0.152
-0.085
-0.051
-0.128
-0.152
0.084
0.152
0.086
0.053
0.129
0.152
0.181
0.178
0.181
0.183
0.178
0.178
0.005
0.004
0.005
0.004
0.004
0.004
0.739
0.740
0.739
0.737
0.740
0.740
0.007
0.006
0.007
0.006
0.006
0.006
0.693
0.227
0.507
0.228
0.109
0.349
0.507
0.008
0.020
0.023
0.022
0.017
0.017
0.023
0.050
0.129
0.193
0.130
0.113
0.104
0.193
-0.773
-0.493
-0.772
-0.891
-0.650
-0.493
0.774
0.493
0.773
0.892
0.651
0.493
0.304
0.264
0.303
0.323
0.285
0.264
0.007
0.007
0.008
0.007
0.007
0.007
0.595
0.675
0.595
0.555
0.634
0.675
0.012
0.011
0.013
0.012
0.011
0.011
θ* = -2
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂̅̅̅̅
2𝑤
̂
𝜃3∗
θ* = -1
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂2𝑤
̅̅̅̅
𝜃̂3∗
θ* = 0
p
𝜃̂1∗
∗
𝜃̂1𝑤
̂
𝜃2∗
𝜃̂2̅∗
∗
̂
𝜃2𝑤
̅̅̅̅
𝜃̂3∗
θ* = 1
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂̅̅̅̅
2𝑤
̂
𝜃3∗
θ* = 2
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂2𝑤
̅̅̅̅
𝜃̂3∗
0.651 0.009
0.070
0.118 0.022
0.141
-1.882
1.882
0.360
0.005
0.505
0.009
0.311 0.071
0.600
-1.689
1.691
0.357
0.006
0.536
0.014
0.111 0.026
0.160
-1.889
1.889
0.360
0.005
0.504
0.009
0.058 0.018
0.116
-1.942
1.942
0.361
0.005
0.496
0.009
0.157 0.024
0.143
-1.843
1.843
0.359
0.006
0.511
0.009
0.311 0.071
0.600
-1.689
1.691
0.357
0.006
0.536
0.014
∗
Note. RMSE = root mean square error, 𝜃̂1∗ = converted cutoff score using unweighted Method 1; 𝜃̂1𝑤
= converted cutoff
score using weighted Method 1; 𝜃̂2∗ = converted cutoff score using unweighted Method 2; 𝜃̂2̅∗ = converted cutoff score
∗
̂∗
using approximation formula for unweighted Method 2; 𝜃̂2𝑤
̅̅̅̅ = converted cutoff score using weighted method 2; 𝜃3 = converted
cutoff score using Method 3.
92
Low Variability (.05)
.35 above the ICC. According to According to Table 10, when the standard
deviation is set to .05 and the a priori value is set to 0 in the bias condition where the
simulated ratings are set to .35 above the ICC, the approximation formula for method 2
did the best job of recovering the original θ* value as shown by the bias mean of .086.
The rest of the methods did a similar job as shown by the bias means ranging from .106
to .147, i.e., the range of the bias means show that there are small differences in
determining which is the best method, suggesting that you would get a recovery of the
original θ* values that could be equally applied regardless of the selected method The
RMSE means mirror the bias means. The range of the estimated θ* values across
replications range from .086 to .147. The means of the error index were low, ranging
from .059 to .061, and the means of the consistency index were high, ranging from .917
to .918 with low standard deviations.
When the standard deviation is set to .05 and the a priori value is set to -1 in the
bias condition where the simulated ratings are set to .35 above the ICC, method 2 did the
best job of recovering the original θ* value as shown by the bias mean of .253. The rest
of the methods did a similar job of recovering the original θ* value as shown by the bias
means ranging from .396 to .413, i.e., the range of the bias means show that there are
small differences in determining which is the best method, suggesting that you would get
a recovery of the original θ* values that could be equally applied regardless of the
selected method. The RMSE means mirror the bias means. The range of the estimated
93
θ* values across replications range from -.587 to -.747. The means of the error index
were low, ranging from .083 to .113, and the means of the consistency index were high,
ranging from .833 to .839 with low standard deviations.
When the standard deviation is set to .05 and the a priori value is set to 1 in the
bias condition where the simulated ratings are set to .35 above the ICC, method 1
weighted and method 3 did the best job of recovering the original θ* value as shown by
the bias means of .033. Method 2 weighted did the next best job as shown by the bias
mean of -.174. Method 1 followed as shown by the bias mean of -.283. Method 2 is next
with a bias mean of -.432, and the approximation formula for method 2 did the worst job
as shown by the bias mean of -.902. The RMSE values mirror the bias means. The
means of the estimated θ* values across replications range from .098 to 1.033. The
means of the error index were low ranging from .172 to .372 and the means of the
consistency index were moderately high ranging from .486 to .812 with low standard
deviations.
When the standard deviation is set to .05 and the a priori value is set to -2 in the
bias condition where the simulated ratings are set to .35 above the ICC, all six methods
did a similar job of recovering the original θ* values as shown by the bias means ranging
from 1.192 to 1.459, i.e., the range of the bias means show that there are small
differences in determining which is the best method, suggesting that you would get a
recovery of the original θ* values that could be equally applied regardless of the selected
method. The RMSE means mirror the bias means. The means of the estimated θ* values
94
across replications range from -.313 to .136. The means of the error index were low,
ranging from .314 to .318 and the means of the consistency index were moderate, ranging
from .525 to .570 with low standard deviations.
When the standard deviation is set to .05 and the a priori value is set to 2 in the
bias condition where the simulated ratings are set to .35 above the ICC, all six methods
did a similar job of recovering the original θ* value as shown by the bias means ranging
from -1.319 to -1.987, i.e., the range of the bias means show that there are small
differences in determining which is the best method, suggesting that you would get a
recovery of the original θ* values that could be equally applied regardless of the selected
method. The RMSE means mirror the bias means. The means of the estimated θ* values
across replications range from .013 to .681. The means of the error index were low,
ranging from .361 to .376 and the means of the consistency index were moderate, ranging
from .469 to .575 with low standard deviations.
95
Table 10
Summary of Cut Scores Across Methods for Different A Priori θ* Values, Where Ratings
are .35 Above the Item Characteristic Curves (ICCs) on Average (Low Variability SD
.05)
Mean
SD
Range
Bias
RMSE
Error M
Error SD
Consistency M
Consistency SD
0.331
0.110
-0.313
0.136
0.103
-0.087
-0.313
0.001
0.018
0.036
0.023
0.015
0.017
0.036
0.010
0.116
0.229
0.150
0.095
0.099
0.229
1.376
1.459
1.192
1.427
1.426
1.459
1.376
1.459
1.192
1.427
1.426
1.459
0.314
0.318
0.314
0.314
0.314
0.318
0.005
0.006
0.005
0.005
0.005
0.006
0.567
0.525
0.570
0.566
0.546
0.525
0.007
0.010
0.007
0.007
0.008
0.010
0.347
-0.604
-0.602
-0.747
-0.587
-0.588
-0.602
0.001
0.006
0.014
0.006
0.006
0.007
0.014
0.010
0.033
0.100
0.030
0.036
0.055
0.100
0.396
0.398
0.253
0.413
0.412
0.398
0.396
0.398
0.253
0.413
0.412
0.398
0.109
0.110
0.083
0.113
0.113
0.110
0.002
0.003
0.001
0.002
0.002
0.003
0.839
0.838
0.881
0.833
0.833
0.838
0.002
0.005
0.002
0.003
0.003
0.005
0.649
0.147
0.125
0.106
0.086
0.136
0.125
0.002
0.008
0.005
0.007
0.007
0.004
0.005
0.020
0.044
0.036
0.050
0.059
0.030
0.036
0.147
0.125
0.106
0.086
0.136
0.125
0.147
0.125
0.107
0.086
0.136
0.125
0.061
0.059
0.059
0.060
0.060
0.059
0.002
0.002
0.002
0.002
0.002
0.002
0.917
0.918
0.918
0.917
0.918
0.918
0.002
0.003
0.003
0.003
0.003
0.003
0.801
0.717
1.033
0.561
0.098
0.826
1.033
0.006
0.022
0.009
0.022
0.023
0.012
0.009
0.040
0.146
0.058
0.140
0.161
0.078
0.058
-0.283
0.033
-0.439
-0.902
-0.174
0.033
0.284
0.034
0.440
0.902
0.175
0.034
0.223
0.172
0.256
0.372
0.202
0.172
0.009
0.006
0.010
0.010
0.008
0.006
0.740
0.812
0.689
0.486
0.770
0.812
0.011
0.007
0.014
0.016
0.009
0.007
θ* = -2
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂2𝑤
̅̅̅̅
𝜃̂3∗
θ* = -1
p
𝜃̂1∗
∗
𝜃̂1𝑤
̂
𝜃2∗
𝜃̂̅2∗
∗
̂
𝜃2𝑤
̅̅̅̅
𝜃̂3∗
θ* = 0
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂̅̅̅̅
2𝑤
̂
𝜃3∗
θ* = 1
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂2𝑤
̅̅̅̅
̂
𝜃3∗
θ* = 2
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂̅2∗
∗
̂
𝜃2𝑤
̅̅̅̅
𝜃̂3∗
0.668 0.009
0.060
0.220 0.022
0.178
-1.780
1.780
0.371
0.006
0.504
0.009
0.681 0.080
0.489
-1.319
1.322
0.361
0.008
0.575
0.015
0.159 0.025
0.180
-1.841
1.842
0.373
0.006
0.493
0.010
0.013 0.019
0.138
-1.987
1.987
0.376
0.005
0.469
0.009
0.240 0.024
0.163
-1.760
1.760
0.371
0.006
0.507
0.010
0.681 0.080
0.489
-1.319
1.322
0.361
0.008
0.575
0.015
∗
Note. RMSE = root mean square error, 𝜃̂1∗ = converted cutoff score using unweighted Method 1; 𝜃̂1𝑤
= converted cutoff
score using weighted Method 1; 𝜃̂2∗ = converted cutoff score using unweighted Method 2; 𝜃̂̅2∗ = converted cutoff score
∗
̂∗
using approximation formula for unweighted Method 2; 𝜃̂2𝑤
̅̅̅̅ = converted cutoff score using weighted method 2; 𝜃3 = converted
cutoff score using Method 3.
96
Medium Variability (.10)
.35 above the ICC. According to Table 11, when the standard deviation is set to
.10 and the a priori value is set to 0 in the bias condition where the simulated ratings are
set to .35 above the ICC, all six methods did a similar job of recovering the original θ*
value as shown by the bias means ranging from .370 to .477, i.e., the range of the bias
means show that there are small differences in determining which is the best method,
suggesting that you would get a recovery of the original θ* values that could be equally
applied regardless of the selected method. The RMSE means mirror the bias means. The
means of the estimated θ* values across replications range from .370 to .477. The means
of the error index were low, ranging from .115 to .119, and the means of the consistency
index were high, ranging from .848 to .857 with low standard deviations.
When the standard deviation is set to .10 and the a priori value is set to -1 in the
bias condition where the simulated ratings are set to .35 above the ICC, method 1
weighted and method 3 did the best job of recovering the original θ* values as shown by
the bias means of .815. The rest of the methods did a similar job of recovering the
original θ* values as shown by the bias means ranging from .892 to 1.038, i.e., the range
of the bias means show that there are small differences in determining which is the best
method, suggesting that you would get a recovery of the original θ* values that could be
equally applied regardless of the selected method. The RMSE means mirror the bias
means. The means of the estimated θ* values across replications range from -.185 to
.038. The means of the error index were low, ranging from .228 to .237, and the means
97
of the consistency index were moderate, ranging from .664 to .666 with low standard
deviations.
When the standard deviation is set to .10 and the a priori value is set to 1 in the
bias condition where the simulated ratings are set to .35 above the ICC, method 1
weighted and method 3 did the best job of recovering the original θ* values as shown by
the bias means of -.146. Method 2 weighted did the next best job of recovering the
original θ* values as shown by the bias mean 0f -.498. Method 1 did the next best job as
shown by the bias mean of -.579. Method 2 did the next best job, as shown by the bias
mean of -.664. The approximation formula for Method 2 did the worst job, as shown by
the bias mean of -.957. The RMSE means mirror the bias means. The means of the
estimated θ* values across replications range from .043 to .854. The means of the error
index were low, ranging from .297 to .402 with low standard deviations, and the means
of the consistency index were moderate, ranging from .436 to .663 with low standard
deviations.
When the standard deviation is set to .10 and the a priori value is set to -2 in the
bias condition where the simulated ratings are set to .35 above the ICC, all six methods
did a poor job of recovering the original θ* values as shown by the bias means ranging
from 1.687 to 2.110, i.e., the range of the bias means show that there are small
differences in determining which is the best method, suggesting that you would get a
recovery of the original θ* values that could be equally applied regardless of the selected
method. The RMSE means mirror the bias means. The means of the estimated θ* values
98
across replications range from -.313 to .136. The means of the error index were low,
ranging from .314 to .318, and the means of the consistency index were moderate ranging
from .525 to .570 with low standard deviations. The proportion-correct cutoff score
generated in this condition was affected by not enough low ratings being possible in the
sampling when attempting to achieve the .10 standard deviation of rater variability at the
-2 θ* level.
When the standard deviation is set to .10 and the a priori value is set to 2 in the
bias condition where the simulated ratings are set to .35 above the ICC , all six methods
did a poor job of recovering the original θ* values as shown by the bias means ranging
from -1.855 to -1.976, i.e., the range of the bias means show that there are small
differences in determining which is the best method, suggesting that you would get a
recovery of the original θ* values that could be equally applied regardless of the selected
method. The RMSE means mirror the bias means. The means of the estimated θ* values
across replications range from .024 to .145. The means of the error index were low,
ranging from .366 to .368, and the means of the consistency index were moderate ranging
from .482 to .501 with low standard deviations.
99
Table 11
Summary of Cut Scores Across Methods for Different A Priori θ* Values, Where Ratings
are .35 Above the Item Characteristic Curves (ICCs) on Average (Medium Variability SD
.10)
Mean
SD
Range
Bias
RMSE
Error M
Error SD
Consistency M
Consistency SD
θ* = -2
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂̅̅̅̅
2𝑤
̂
𝜃3∗
0.660
0.110
-0.313
0.136
0.103
-0.087
-0.313
0.009
0.018
0.036
0.023
0.015
0.017
0.036
0.060
0.116
0.229
0.150
0.095
0.099
0.229
1.687
2.110
2.136
2.103
1.912
2.110
1.687
2.110
2.136
2.103
1.913
2.110
0.314
0.318
0.314
0.314
0.314
0.318
0.005
0.006
0.005
0.005
0.005
0.006
0.567
0.525
0.570
0.566
0.546
0.525
0.007
0.010
0.007
0.007
0.008
0.010
0.520
0.038
-0.185
0.030
0.016
-0.108
-0.185
0.005
0.010
0.022
0.012
0.008
0.012
0.022
0.030
0.059
0.140
0.070
0.044
0.071
0.140
1.038
0.815
1.030
1.016
0.892
0.815
1.038
0.815
1.030
1.016
0.892
0.815
0.238
0.228
0.237
0.236
0.230
0.228
0.005
0.004
0.005
0.004
0.004
0.004
0.666
0.664
0.666
0.666
0.666
0.664
0.006
0.006
0.006
0.006
0.006
0.006
0.762
0.444
0.477
0.431
0.370
0.458
0.477
0.004
0.016
0.018
0.012
0.012
0.010
0.018
0.030
0.100
0.122
0.080
0.072
0.072
0.122
0.444
0.477
0.432
0.370
0.458
0.477
0.445
0.477
0.432
0.370
0.458
0.477
0.116
0.115
0.116
0.119
0.115
0.115
0.003
0.003
0.003
0.004
0.003
0.003
0.855
0.857
0.855
0.848
0.856
0.857
0.004
0.004
0.004
0.005
0.004
0.004
0.731
0.421
0.854
0.336
0.043
0.502
0.854
0.008
0.025
0.067
0.024
0.021
0.027
0.067
0.050
0.156
0.492
0.140
0.124
0.178
0.492
-0.579
-0.146
-0.664
-0.957
-0.498
-0.146
0.580
0.161
0.664
0.957
0.499
0.161
0.355
0.297
0.366
0.402
0.343
0.297
0.008
0.011
0.008
0.006
0.009
0.011
0.553
0.663
0.527
0.436
0.577
0.663
0.013
0.017
0.014
0.011
0.013
0.017
θ* = -1
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂2𝑤
̅̅̅̅
𝜃̂3∗
θ* = 0
p
𝜃̂1∗
∗
𝜃̂1𝑤
̂
𝜃2∗
𝜃̂2̅∗
∗
̂
𝜃2𝑤
̅̅̅̅
𝜃̂3∗
θ* = 1
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂̅̅̅̅
2𝑤
̂
𝜃3∗
θ* = 2
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂2𝑤
̅̅̅̅
𝜃̂3∗
0.653
0.009
0.060
0.121
0.021
0.129 -1.879 1.879
0.366
0.006
0.497
0.009
0.145
0.049
0.337 -1.855 1.856
0.366
0.006
0.501
0.012
0.118
0.025
0.160 -1.882 1.882
0.366
0.006
0.497
0.009
0.024
0.018
0.134 -1.976 1.976
0.368
0.006
0.482
0.009
0.105
0.021
0.144 -1.895 1.895
0.366
0.006
0.494
0.009
0.145
0.049
0.337 -1.855 1.856
0.366
0.006
0.501
0.012
∗
Note. RMSE = root mean square error, 𝜃̂1∗ = converted cutoff score using unweighted Method 1; 𝜃̂1𝑤
= converted cutoff
score using weighted Method 1; 𝜃̂2∗ = converted cutoff score using unweighted Method 2; 𝜃̂̅2∗ = converted cutoff score
∗
̂∗
using approximation formula for unweighted Method 2; 𝜃̂2𝑤
̅̅̅̅ = converted cutoff score using weighted method 2; 𝜃3 = converted
cutoff score using Method 3.
100
High Variability (.15)
.35 above the ICC. According to Table 12, when the standard deviation is set to
.15 and the a priori value is set to 0 in the bias condition where the simulated ratings are
set to .35 above the ICC, all six methods did a similar job of recovering the original θ*
values as shown by the bias means ranging from .048 to .083, i.e., the range of the bias
means show that there are small differences in determining which is the best method,
suggesting that you would get a recovery of the original θ* values that could be equally
applied regardless of the selected method. The RMSE means mirror the bias means. The
means of the estimated θ* values across replications range from .048 to .083. The means
of the error index were low, ranging from .354 to .355, and the means of the consistency
index were moderate, ranging from .503 to .508 with low standard deviations.
When the standard deviation is set to .15 and the a priori value is set to -1 in the
bias condition where the simulated ratings are set to .35 above the ICC, all six methods
did a similar job of recovering the original θ* values as shown by the bias means ranging
from 1.077 to 1.319, i.e., the range of the bias means show that there are small
differences in determining which is the best method, suggesting that you would get a
recovery of the original θ* values that could be equally applied regardless of the selected
method. The RMSE means mirror the bias means. The means of the estimated θ* values
across replications range from .077 to .360. The means of the error index were low
ranging from .236 to .249, and the means of the consistency index were moderate ranging
from .653 to .697 with low standard deviations.
101
When the standard deviation is set to .15 and the a priori value is set to 1 in the
bias condition where the simulated ratings are set to .35 above the ICC, the simulated
ratings failed to generate. This may be due to the restricted range at the top end of the
distribution and the variability in the rater agreement being so high.
When the standard deviation is set to .15 and the a priori value is set to -2 in the
bias condition where the simulated ratings are set to .25 above the ICC, all six methods
did a similar job of recovering the original θ* values as shown by the bias means ranging
from 2.019 to 2.120, i.e., the range of the bias means show that there are small
differences in determining which is the best method, suggesting that you would get a
recovery of the original θ* values that could be equally applied regardless of the selected
method. The RMSE means mirror the bias means. The means of the estimated θ* values
across replications range from .019 to .120. The means of the error index were low,
ranging from .342 to .344, and the means of the consistency index were moderate,
ranging from .514 to .530 with low standard deviations.
When the standard deviation is set to .15 and the a priori value is set to 2 in the
bias condition where the simulated ratings are set to .35 above the ICC, the simulated
ratings failed to generate. This may be due to the restricted range at the top end of the
distribution and the variability in the rater agreement being so high.
102
Table 12
Summary of Cut Scores Across Methods for Different A Priori θ* Values, Where Ratings
are .35 Above the Item Characteristic Curves (ICCs) on Average (High Variability SD
.15)
Mean
SD
Range
Bias
RMSE
0.652
0.120
0.019
0.116
0.082
0.071
0.019
0.010
0.021
0.028
0.027
0.019
0.020
0.028
0.065
0.125
0.213
0.170
0.129
0.126
0.213
2.120
2.019
2.116
2.082
2.071
2.019
2.120
2.019
2.116
2.082
2.071
2.019
0.739
0.319
0.077
0.360
0.277
0.192
0.077
0.007
0.017
0.023
0.020
0.016
0.015
0.023
0.043
0.111
0.142
0.130
0.099
0.093
0.142
1.319
1.077
1.360
1.276
1.191
1.077
0.640
0.073
0.048
0.083
0.056
0.060
0.048
0.010
0.021
0.022
0.026
0.019
0.020
0.022
0.056
0.127
0.132
0.150
0.107
0.120
0.132
0.073
0.048
0.083
0.056
0.060
0.048
Error M
Error SD
Consistency M
Consistency SD
0.342
0.344
0.342
0.343
0.343
0.344
0.006
0.006
0.006
0.006
0.006
0.006
0.530
0.514
0.529
0.524
0.522
0.514
0.008
0.010
0.009
0.009
0.009
0.010
1.319
1.077
1.361
1.277
1.192
1.077
0.237
0.249
0.236
0.238
0.242
0.249
0.004
0.005
0.004
0.004
0.004
0.005
0.692
0.653
0.697
0.687
0.674
0.653
0.006
0.008
0.006
0.006
0.006
0.008
0.076
0.052
0.087
0.059
0.064
0.052
0.354
0.355
0.354
0.355
0.355
0.355
0.006
0.006
0.006
0.006
0.006
0.006
0.507
0.503
0.508
0.504
0.505
0.503
0.009
0.010
0.010
0.010
0.010
0.010
θ* = -2
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂2𝑤
̅̅̅̅
𝜃̂3∗
θ* = -1
p
𝜃̂1∗
∗
𝜃̂1𝑤
̂
𝜃2∗
𝜃̂̅2∗
∗
̂
𝜃2𝑤
̅̅̅̅
𝜃̂3∗
θ* = 0
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂̅̅̅̅
2𝑤
̂
𝜃3∗
θ* = 1
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂2𝑤
̅̅̅̅
̂
𝜃3∗
θ* = 2
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂̅2∗
∗
̂
𝜃2𝑤
̅̅̅̅
𝜃̂3∗
∗
Note. RMSE = root mean square error, 𝜃̂1∗ = converted cutoff score using unweighted Method 1; 𝜃̂1𝑤
= converted cutoff
score using weighted Method 1; 𝜃̂2∗ = converted cutoff score using unweighted Method 2; 𝜃̂2̅∗ = converted cutoff score
∗
̂∗
using approximation formula for unweighted Method 2; 𝜃̂̅̅̅̅
2𝑤 = converted cutoff score using weighted method 2; 𝜃3 = converted
cutoff score using Method 3.
103
Low Variability (.05)
.35 below the ICC. According to Table 13, when the standard deviation is set to
.05 and the a priori value is set to 0 in the bias condition where the simulated ratings are
set to .35 below the ICC, all six methods did a similar job of recovering the original θ*
values as shown by the bias means ranging from -.150 to -.185, i.e., the range of the bias
means show that there are small differences in determining which is the best method,
suggesting that you would get a recovery of the original θ* values that could be equally
applied regardless of the selected method. The RMSE means mirror the bias means. The
means of the estimated θ* values across replications range from -.150 to -.185. The
means of the error index were low, ranging from .054 to .055, and the means of the
consistency index were high, ranging from .920 to .921 with low standard deviations.
When the standard deviation is set to .05 and the a priori value is set to -1 in the
bias condition where the simulated ratings are set to .35 below the ICC, Method 1
weighted, Method 3, and Method 2 did the best job of recovering the original θ* values
as shown by the bias means ranging from -.049 to -.050. Method 2 weighted, did the
next best job as shown by the bias mean of .51. The approximation formula for method 2
and method 1 are last with bias means of .111 and .123. The RMSE means mirror the
bias means. The means of the estimated θ* values across replications range from -1.049 .877. The means of the error index were low, ranging from .073 to .094, and the means
of the consistency index were high, ranging from .871 to .900 with low standard
deviations.
104
When the standard deviation is set to .05 and the a priori value is set to 1 in the
bias condition where the simulated ratings are set to .35 below the ICC, Method 1
weighted and Method 3 did the best job of recovering the original θ* values as shown by
the bias means of -.214. Method 2 weighted did the next best job as shown by the bias
mean of -.309. Method 2 and Method 1 did the next best job as shown by the bias means
of -.467 and -.484. The approximation formula for method 2 did the worst as shown by
the bias mean of -.670. The RMSE means mirror the bias means. The means of the
estimated θ* values across replications range from .330 to .786. The means of the error
index were low, ranging from .102 to .193 and the means of the consistency index were
high, ranging from .751 to .882 with low standard deviations.
When the standard deviation is set to .05 and the a priori value is set to -2 in the
bias condition where the simulated ratings are set to .35 below the ICC, Method 2 did the
best job of recovering the original θ* values as shown by the bias mean of .776. The rest
of the methods did a similar job as shown by the bias means ranging from 1.012 to 1.088,
i.e., the range of the bias means show that there are small differences in determining
which is the best method, suggesting that you would get a recovery of the original θ*
values that could be equally applied regardless of the selected method. The RMSE
means mirror the bias means. The means of the estimated θ* values across replications
range from -1.224 to -.912. The means of the error index were low, ranging from .085 to
.117 and the means of the consistency index were high ranging from .836 to .888 with
low standard deviations for both the error and consistency indices.
105
When the standard deviation is set to .05 and the a priori value is set to 2 in the
bias condition where the simulated ratings are set to .35 below the ICC, Method 1
weighted and Method 3 did the best job of recovering the original θ* values as shown by
the bias means of -.185. The rest of the methods did a similar job as shown by the bias
means ranging from -1.975 to -1.411, i.e., the range of the bias means show that there are
small differences in determining which is the best method, suggesting that you would get
a recovery of the original θ* values that could be equally applied regardless of the
selected method. The RMSE means mirror the bias means. The means of the estimated
θ* values across replications range from .025 to 1.185. The means of the error index
were low, ranging from .298 to .396 and the means of the consistency index were
moderate, ranging from .443 to .681.
106
Table 13
Summary of Cut Scores Across Methods for Different A Priori θ* Values, Where Ratings
are .35 Below the Item Characteristic Curves (ICCs) on Average (Low Variability SD
.05)
Mean
SD
Range
Bias
RMSE
Error M
Error SD
Consistency M
Consistency SD
θ* = -2
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂̅̅̅̅
2𝑤
̂
𝜃3∗
0.251
-0.988
-0.962
-1.224
-0.912
-0.951
-0.962
0.002
0.015
0.039
0.016
0.021
0.020
0.039
0.010
0.095
0.330
0.100
0.149
0.153
0.330
1.012
1.038
0.776
1.088
1.049
1.038
1.012
1.039
0.776
1.088
1.049
1.039
0.107
0.111
0.085
0.117
0.112
0.111
0.003
0.006
0.003
0.004
0.004
0.006
0.852
0.847
0.888
0.836
0.845
0.847
0.004
0.009
0.004
0.007
0.006
0.009
0.279
-0.877
-1.049
-1.050
-0.889
-0.949
-1.049
0.002
0.014
0.016
0.012
0.018
0.015
0.016
0.010
0.088
0.097
0.080
0.103
0.090
0.097
0.123
-0.049
-0.050
0.111
0.051
-0.049
0.124
0.052
0.051
0.113
0.053
0.052
0.094
0.073
0.073
0.092
0.084
0.073
0.003
0.002
0.002
0.004
0.003
0.002
0.868
0.900
0.900
0.871
0.883
0.900
0.004
0.004
0.003
0.006
0.004
0.004
0.541
-0.185
-0.150
-0.176
-0.172
-0.162
-0.150
0.002
0.007
0.004
0.006
0.006
0.004
0.004
0.010
0.048
0.025
0.030
0.038
0.029
0.025
-0.185
-0.150
-0.176
-0.172
-0.162
-0.150
0.185
0.150
0.177
0.172
0.162
0.150
0.055
0.054
0.054
0.054
0.054
0.054
0.001
0.001
0.001
0.001
0.001
0.001
0.920
0.921
0.920
0.921
0.921
0.921
0.002
0.002
0.002
0.002
0.002
0.002
0.516
0.786
0.533
0.330
0.691
0.786
0.011
0.006
0.016
0.017
0.007
0.006
0.082
0.036
0.100
0.113
0.041
0.036
-0.484
-0.214
-0.467
-0.670
-0.309
-0.214
0.485
0.214
0.467
0.670
0.309
0.214
0.143
0.102
0.139
0.193
0.111
0.102
0.006
0.004
0.007
0.009
0.005
0.004
0.824
0.882
0.829
0.751
0.870
0.882
0.008
0.005
0.010
0.012
0.006
0.005
θ* = -1
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂2𝑤
̅̅̅̅
𝜃̂3∗
θ* = 0
p
𝜃̂1∗
∗
𝜃̂1𝑤
̂
𝜃2∗
𝜃̂2̅∗
∗
̂
𝜃2𝑤
̅̅̅̅
𝜃̂3∗
θ* = 1
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂̅̅̅̅
2𝑤
̂
𝜃3∗
θ* = 2
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂2𝑤
̅̅̅̅
𝜃̂3∗
0.793
0.005
0.030
0.337
0.019
0.127 -1.662
1.662
0.371
0.007
0.521
0.011
1.185
0.045
0.309 -0.185
0.185
0.298
0.009
0.681
0.011
0.278
0.025
0.160 -1.722
1.722
0.376
0.007
0.506
0.012
0.025
0.019
0.121 -1.975
1.975
0.396
0.005
0.443
0.009
0.589
0.021
0.157 -1.411
1.411
0.349
0.008
0.580
0.012
1.185
0.045
0.309 -0.185
0.185
0.298
0.009
0.681
0.011
∗
Note. RMSE = root mean square error, 𝜃̂1∗ = converted cutoff score using unweighted Method 1; 𝜃̂1𝑤
= converted cutoff
score using weighted Method 1; 𝜃̂2∗ = converted cutoff score using unweighted Method 2; 𝜃̂̅2∗ = converted cutoff score
∗
̂∗
using approximation formula for unweighted Method 2; 𝜃̂2𝑤
̅̅̅̅ = converted cutoff score using weighted method 2; 𝜃3 = converted
cutoff score using Method 3.
107
Medium Variability (.10)
.35 below the ICC. According to Table 14, when the standard deviation is set to
.10 and the a priori value is set to 0 in the bias condition where the simulated ratings are
set to .35 below the ICC, all six methods did a similar job of recovering the original θ*
values as shown by the bias means ranging from -.070 to -.237, i.e., the range of the bias
means show that there are small differences in determining which is the best method,
suggesting that you would get a recovery of the original θ* values that could be equally
applied regardless of the selected method. The RMSE means mirror the bias means. The
means of the estimated θ* values across replications range from -.070 to -.237. The
means of the error index were low ranging from .142 to .161, and the means of the
consistency index were moderately high, ranging from .768 to .789 with low standard
deviations.
When the standard deviation is set to .10 and the a priori value is set to -1 in the
bias condition where the simulated ratings are set to .35 below the ICC, Method 1
weighted and Method 3 did the best job of recovering the original θ* values as shown by
the bias means of .295. Method 2 weighted did the next best job as shown by the bias
mean of .649. The rest of the methods did a similar job as shown by the bias means
ranging from .886 to 1.039, i.e., the range of the bias means show that there are small
differences in determining which is the best method, suggesting that you would get a
recovery of the original θ* values that could be equally applied regardless of the selected
method. The RMSE means mirror the bias means. The means of the estimated θ* values
108
across replications range from -.705 to .039. The means of the error index were low,
ranging from .365 to .400, and the means of the consistency index were low, ranging
from .428 to .470, with low standard deviations.
When the standard deviation is set to .10 and the a priori value is set to 1 in the
bias condition where the simulated ratings are set to .35 below the ICC, Method 1
weighted and Method 3 did the best job of recovering the original θ* values as shown by
the bias means of -.389. Method 2 weighted did the next best job as shown by the bias
mean of -.524. The rest of the methods did a similar job as shown by the bias means
ranging from -.681 to -.840, i.e., the range of the bias means show that there are small
differences in determining which is the best method, suggesting that you would get a
recovery of the original θ* values that could be equally applied regardless of the selected
method. The RMSE means mirror the bias means. The range of the estimated θ* values
across replications range from .160 to .611. The means of the error index were low,
ranging from .190 to .273, and the means of the consistency index were moderate,
ranging from .628 to .772 with low standard deviations.
When the standard deviation is set to .10 and the a priori value is set to -2 in the
bias condition where the simulated ratings are set to .35 below the ICC, all six methods
did a similar job of recovering the original θ* value as shown by the bias means ranging
from 1.961 to 2.065, i.e., the range of the bias means show that there are small
differences in determining which is the best method, suggesting that you would get a
recovery of the original θ* values that could be equally applied regardless of the selected
109
method. The RMSE means mirror the bias means. The range of the estimated θ* values
across replications, range from -.039 to .065. The means of the error index were low,
ranging from .359 to .361, and the means of the consistency index were moderately low,
ranging from .483 to .499 with low standard deviations. The proportion-correct cutoff
score generated in this condition was affected by not enough low ratings being possible in
the sampling when attempting to achieve the .10 standard deviation of rater variability at
the -2 θ* level.
When the standard deviation is set to .10 and the a priori value is set to 2 in the
bias condition where the simulated ratings are set to .35 below the ICC, all six methods
did a similar job of recovering the original θ* values as shown by the bias means, ranging
from -1.378 to -1.953, i.e., the range of the bias means show that there are small
differences in determining which is the best method, suggesting that you would get a
recovery of the original θ* values that could be equally applied regardless of the selected
method. The RMSE means mirror the bias means. The range of the estimated θ* values
across replications, range from .047 to .622. The means of the error index were low,
ranging from .351 to .370, and the means of the consistency index were moderately low,
ranging from .481 to .580 with low standard deviations.
110
Table 14
Summary of Cut Scores Across Methods for Different A Priori θ* Values, Where Ratings
are .35 Below the Item Characteristic Curves (ICCs) on Average (Medium Variability SD
.10)
Mean
SD
Range
Bias
RMSE
Error M
Error SD
Consistency M
Consistency SD
0.633
0.031
-0.039
0.065
0.058
0.015
-0.039
0.010
0.021
0.045
0.025
0.018
0.022
0.045
0.060
0.135
0.319
0.160
0.119
0.136
0.319
2.031
1.961
2.065
2.057
2.015
1.961
2.031
1.961
2.065
2.058
2.015
1.961
0.360
0.361
0.359
0.360
0.360
0.361
0.006
0.006
0.006
0.006
0.006
0.006
0.494
0.483
0.499
0.498
0.491
0.483
0.009
0.011
0.009
0.009
0.009
0.011
0.570
-0.113
-0.705
-0.100
0.039
-0.351
-0.705
0.009
0.019
0.046
0.023
0.017
0.022
0.046
0.050
0.126
0.290
0.140
0.112
0.145
0.290
0.886
0.295
0.900
1.039
0.649
0.295
0.887
0.299
0.901
1.039
0.650
0.299
0.394
0.365
0.395
0.400
0.383
0.365
0.006
0.008
0.006
0.005
0.007
0.008
0.427
0.470
0.428
0.438
0.427
0.470
0.007
0.013
0.007
0.006
0.010
0.013
0.560
-0.125
-0.237
-0.127
-0.070
-0.209
-0.237
0.005
0.012
0.008
0.013
0.010
0.007
0.008
0.030
-0.162
-0.262
-0.160
-0.100
-0.231
-0.262
-0.125
-0.237
-0.127
-0.070
-0.208
-0.237
0.126
0.237
0.128
0.070
0.209
0.237
0.152
0.142
0.151
0.161
0.143
0.142
0.005
0.004
0.005
0.005
0.004
0.004
0.779
0.789
0.779
0.768
0.788
0.789
0.007
0.006
0.008
0.007
0.006
0.006
0.723
0.319
0.611
0.312
0.160
0.476
0.611
0.006
0.017
0.014
0.019
0.017
0.012
0.014
0.040
0.120
0.120
0.110
0.110
0.080
0.120
-0.681
-0.389
-0.688
-0.840
-0.524
-0.389
0.681
0.389
0.688
0.840
0.525
0.389
0.236
0.190
0.238
0.273
0.207
0.190
0.008
0.006
0.008
0.008
0.006
0.006
0.693
0.772
0.691
0.628
0.744
0.772
0.011
0.007
0.013
0.012
0.009
0.007
θ* = -2
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂̅̅̅̅
2𝑤
̂
𝜃3∗
θ* = -1
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂2𝑤
̅̅̅̅
𝜃̂3∗
θ* = 0
p
𝜃̂1∗
∗
𝜃̂1𝑤
̂
𝜃2∗
𝜃̂2̅∗
∗
̂
𝜃2𝑤
̅̅̅̅
𝜃̂3∗
θ* = 1
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂̅̅̅̅
2𝑤
̂
𝜃3∗
θ* = 2
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂2𝑤
̅̅̅̅
𝜃̂3∗
0.669
0.009
0.070
0.194
0.022
0.162
-1.806
1.806
0.365
0.006
0.508
0.009
0.622
0.102
0.680
-1.378
1.382
0.351
0.008
0.580
0.017
0.161
0.025
0.190
-1.839
1.839
0.366
0.006
0.502
0.010
0.047
0.019
0.127
-1.953
1.954
0.370
0.005
0.481
0.009
0.266
0.028
0.187
-1.734
1.734
0.363
0.006
0.522
0.010
0.622
0.102
0.680
-1.378
1.382
0.351
0.008
0.580
0.017
∗
Note. RMSE = root mean square error, 𝜃̂1∗ = converted cutoff score using unweighted Method 1; 𝜃̂1𝑤
= converted cutoff
score using weighted Method 1; 𝜃̂2∗ = converted cutoff score using unweighted Method 2; 𝜃̂̅2∗ = converted cutoff score
∗
̂∗
using approximation formula for unweighted Method 2; 𝜃̂2𝑤
̅̅̅̅ = converted cutoff score using weighted method 2; 𝜃3 = converted
cutoff score using Method 3.
111
High Variability (.15)
.35 below the ICC. According to Table 15, when the standard deviation is set to
.15 and the a priori value is set to 0 in the bias condition where the simulated ratings are
set to .35 below the ICC, the approximation formula for method 2 did the best job of
recovering the original θ* values as shown by the bias mean of -.131. Method 1 did the
next best job as shown by the bias mean of -.174. Method 2 did the next best job as
shown by the bias mean of -.198. Method 2 weighted did the next best job as shown by
the bias mean of -.270. Method 1 weighted and Method 3 did the worst job as shown by
the bias means of -.328. The RMSE means mirror the bias means. The range of the
estimated θ* values across replications, range from -.131 to -.328. The means of the error
index were low, ranging from .203 to .225 with low standard deviations, and the means
of the consistency index were moderate, ranging from .672 to .696 with low standard
deviations.
When the standard deviation is set to .15 and the a priori value is set to -1 in the
bias condition where the simulated ratings are set to .35 below the ICC, Method 1
weighted and Method 3 did the best job of recovering the original θ* value as shown by
the bias mean of .627. The rest of the methods did a similar job as shown by the bias
means ranging from .838 to 1.046, i.e., the range of the bias means show that there are
small differences in determining which is the best method, suggesting that you would get
a recovery of the original θ* values that could be equally applied regardless of the
selected method. The RMSE means mirror the bias means. The means of the estimated
112
θ* values across replications range from -.373 to .046. The means of the error index were
low, ranging from .382 to .386, with low standard deviations, and the means of the
consistency index were moderately low, ranging from .429 to .459 with low standard
deviations.
When the standard deviation is set to .15 and the a priori value is set to 1 in the
bias condition where the simulated ratings are set to .35 below the ICC, Method 1
weighted and Method 3 did the best job of recovering the original θ* value as shown by
the bias means of -.622. The rest of the methods did a similar job as shown by the bias
means ranging from -.747 to -.947, i.e., the range of the bias means show that there are
small differences in determining which is the best method, suggesting that you would get
a recovery of the original θ* values that could be equally applied regardless of the
selected method. The means of the estimated θ* values across replications, range from
.053 to .378. The means of the error index were low, ranging from .262 to .305 with low
standard deviations, and the means of the consistency index were moderate, ranging from
.574 to .666 with low standard deviations.
When the standard deviation is set to .15 and the a priori value is set to -2 in the
bias condition where the simulated ratings are set to .35 below the ICC, all six methods
did a similar job of recovering the original θ* values as shown by the bias means ranging
from 2.043 to 2.074, i.e., the range of the bias means show that there are small
differences in determining which is the best method, suggesting that you would get a
recovery of the original θ* values that could be equally applied regardless of the selected
113
method. The RMSE means mirror the bias means. The range of the estimated θ* values
across replications, range from .043 to .074. The means of the error index were low,
ranging from .358 to .359, with low standard deviations, and the means of the consistency
index were moderately low, ranging from .497 to .501 with low standard deviations. The
proportion-correct cutoff score generated in this condition was affected by not enough
low ratings being possible in the sampling when attempting to achieve the .15 standard
deviation of rater variability at the -2 θ* level.
When the standard deviation is set to .15 and the a priori value is set to 2 in the
bias condition where the simulated ratings are set to .35 below the ICC, all six methods
did a similar job of recovering the original θ* value as shown by the bias means ranging
from -1.943 -1.685. The RMSE means mirror the bias means. The means of the
estimated θ* values across replications, range from .057 to .315. The means of the error
index were low, ranging from .354 to .358 with low standard deviations, and the means
of the consistency index were moderately low, ranging from .499 to .540 with low
standard deviations.
114
Table 15
Summary of Cut Scores Across Methods for Different A Priori θ* Values, Where Ratings
are .35 Below the Item Characteristic Curves (ICCs) on Average (High Variability SD
.15)
Mean
SD
Range
Bias
RMSE
Error M
Error SD
Consistency M
Consistency SD
0.637
0.062
0.043
0.074
0.064
0.053
0.043
0.009
0.020
0.022
0.024
0.017
0.019
0.022
0.059
0.129
0.148
0.160
0.108
0.122
0.148
2.062
2.043
2.074
2.064
2.053
2.043
2.062
2.043
2.074
2.064
2.054
2.043
0.359
0.359
0.358
0.359
0.359
0.359
0.006
0.006
0.006
0.006
0.006
0.006
0.499
0.497
0.501
0.500
0.498
0.497
0.009
0.010
0.009
0.009
0.009
0.010
0.596
-0.044
-0.373
-0.032
0.046
-0.162
-0.373
0.009
0.021
0.061
0.024
0.018
0.027
0.061
0.064
0.136
0.428
0.160
0.116
0.173
0.428
0.956
0.627
0.968
1.046
0.838
0.627
0.956
0.630
0.969
1.046
0.839
0.630
0.385
0.382
0.385
0.386
0.384
0.382
0.005
0.006
0.005
0.005
0.006
0.006
0.448
0.429
0.450
0.459
0.436
0.429
0.007
0.010
0.007
0.007
0.008
0.010
0.532
-0.175
-0.328
-0.198
-0.131
-0.270
-0.328
0.006
0.015
0.012
0.016
0.012
0.011
0.012
0.041
0.096
0.081
0.110
0.082
0.076
0.081
-0.174
-0.328
-0.198
-0.131
-0.270
-0.328
0.175
0.328
0.198
0.132
0.271
0.328
0.218
0.203
0.214
0.225
0.207
0.203
0.006
0.005
0.006
0.006
0.005
0.005
0.680
0.696
0.684
0.672
0.692
0.696
0.008
0.008
0.009
0.008
0.008
0.008
0.664
0.159
0.378
0.148
0.053
0.253
0.378
0.008
0.019
0.021
0.021
0.017
0.016
0.021
0.050
0.120
0.153
0.130
0.117
0.109
0.153
-0.841
-0.622
-0.852
-0.947
-0.747
-0.622
0.841
0.622
0.853
0.947
0.747
0.622
0.289
0.262
0.291
0.305
0.277
0.262
0.007
0.007
0.007
0.007
0.007
0.007
0.606
0.666
0.603
0.574
0.634
0.666
0.010
0.010
0.011
0.011
0.010
0.010
θ* = -2
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂̅̅̅̅
2𝑤
̂
𝜃3∗
θ* = -1
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂2𝑤
̅̅̅̅
𝜃̂3∗
θ* = 0
p
𝜃̂1∗
∗
𝜃̂1𝑤
̂
𝜃2∗
𝜃̂2̅∗
∗
̂
𝜃2𝑤
̅̅̅̅
𝜃̂3∗
θ* = 1
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂̅̅̅̅
2𝑤
̂
𝜃3∗
θ* = 2
p
𝜃̂1∗
∗
𝜃̂1𝑤
𝜃̂2∗
𝜃̂2̅∗
∗
𝜃̂2𝑤
̅̅̅̅
𝜃̂3∗
0.648
0.009
0.067
0.114
0.022
0.172
-1.886
1.886
0.357
0.006
0.508
0.009
0.315
0.072
0.656
-1.685
1.685
0.354
0.006
0.540
0.014
0.105
0.025
0.180
-1.895
1.895
0.357
0.006
0.507
0.009
0.057
0.018
0.117
-1.943
1.943
0.358
0.006
0.499
0.009
0.153
0.024 0.171
-1.847 1.847
0.356
0.006
0.515
0.010
0.315
0.072 0.656
-1.685 1.685
0.354
0.006
0.540
0.014
∗
Note. RMSE = root mean square error, 𝜃̂1∗ = converted cutoff score using unweighted Method 1; 𝜃̂1𝑤
= converted cutoff
score using weighted Method 1; 𝜃̂2∗ = converted cutoff score using unweighted Method 2; 𝜃̂̅2∗ = converted cutoff score
∗
̂∗
using approximation formula for unweighted Method 2; 𝜃̂2𝑤
̅̅̅̅ = converted cutoff score using weighted method 2; 𝜃3 = converted
cutoff score using Method 3.
115
Chapter 7
DISCUSSION
This present study replicated previous research (Hurtz et al., 2008), and expanded
on the range of simulated rater agreement and bias, in addition to expanding the range of
θ*. This study compared the performance of Kane’s (1987) methods to convert
proportion-correct standard-setting judgments to a value on the θ scale used in scoring
IRT examinations, using the restricted range based on Hurtz et al., 2008. Overall,
throughout all of the conditions, method 1 weighted and method 3 had the most success
in recovering the original θ* values because they most often showed the least amount of
bias and had the smallest error index ratings. The differences occur when investigating
individual conditions, such as when bias is introduced, or when the a priori θ* values are
moved more toward the extreme ratings.
Unbiased Ratings
When the a priori θ* value is set to 0, and the simulated ratings have low
variability, meaning that the simulated raters have more agreement, there is very little
difference between the six methods. As the a priori θ* values are moved away from 0, to
1 and negative 1, methods 1 weighted and method 3 perform the best when recovering
the original θ* values. As the a priori θ* values are moved to negative 2 and positive 2,
all six methods do not get close to the original θ* values, showing that at the extreme
ranges, the simulated ratings suffer from truncation and the variability is not conducive to
116
recovering those values. With the a priori θ* value at negative 2, method 2 comes
closest, with only a -1.261 value. This can be due to the difficulty with rating easier
items, as the simulated “expert” raters do not have a lot of leeway in figuring out how to
rank the items. Additionally, items of this difficulty should be relatively easy for any
exam taker to get correct, and finding the minimally acceptable candidate is more
difficult. When the a priori θ* value is set to positive 2, method 1 weighted and method 3
come out ahead, but not by much with a 1.046 recovered θ* value. This is due to items in
this difficulty range being too difficult to discern the minimally acceptable candidate.
Additionally, these extreme ranges are not as useful for most exams, because it is rare
when the purpose of the exam is to get as many candidates as possible or restrict as few
as possible from receiving a license.
When the variability among the simulated raters increases to a medium level,
there is little difference in the methods when the a priori θ* value is set to 0. As in the
previous condition, method 1 weighted and method 3 do perform the best when the a
priori θ* value is moved to negative and positive 1. This suggests that with simulated
raters mimicking the real performance of expert individuals, suggests that these methods
work the best in the applied sense because these θ* values are more applicable when
constructing real world exams. As the a priori θ* values move toward the extremes, as in
negative and positive 2, method 1 weighted and method 3 are still the front runners when
recovering the original θ* values, even though they do not get close to the negative or
positive 2. This is again reflected in that items in this range do not give a lot of
117
information about the examinees getting these items correct or incorrect. This kind of
information is hard for human raters to discern the minimally acceptable candidate
because of the extreme examinee ability levels and is equally difficult for simulated
ratings mimicking real raters.
As the variability of the simulated raters increase, the same pattern emerges with
the a priori θ* value set at 0. With the increased variability in the simulated ratings, as
the a priori θ* value is moved to negative and positive 1, method 1 and method 3 emerge
as the frontrunners, even though they do not get as close to the original θ* values as with
the previous variability conditions. The same occurs when the a priori θ* values are
moved to even further extremes of negative and positive 2. Method 1 weighted and
method 3 do appear to be the most robust in recovering the original θ* values across
conditions, suggesting that they are the best to use, especially when the simulated ratings
are unbiased. Throughout the unbiased conditions, the error and consistency indexes are
strongest at the a priori θ* value of 0 and trend lower when the a priori θ* values
approach the extremes, suggesting that the simulated ratings at those conditions are less
replicable if performed again. This also confirms the lack of test and item information
when the θ* ranges are moved out to the extremes, suggesting that both the simulated
ratings and the conversion methods have difficulty with those values.
Biased Ratings
In the first biased condition, where the simulated ratings are moved .15 above the
ICC, and when the a priori θ* values are set at 0, and there is low variability with the
118
simulated ratings, all six methods do a similar job of recovering the original θ* values,
however they do not get as close to 0 as in the unbiased conditions. As expected, they are
higher than the unbiased conditions with values ranging from .92 to .113. With bias
being introduced, as the a priori θ* value is moved to negative one, method 2 gets closer
than method 1 weighted and method 3 when attempting to recover the original θ* values,
but not by much with the respective θ* values of -.749 and -.603. However, when the a
priori θ* value moves to positive 1, method 1 weighted and method 3 come forth as the
best methods to use, with values falling slightly above 1. The same pattern emerges with
the negative and positive 2 a priori θ* values suggesting that method 1 weighted and
method 3 do better when the simulated ratings show the items as harder for examinees,
but do not function as well when the simulated ratings suggests that the items are easier.
In those easier cases, method 2 seems to perform the best. Even with those method
performing the best, they still are far away from the original θ* values. For this
condition, the consistency and error indexes indicate more replicable values in the
negative 2 to positive 1 range and decline in the positive 2 a priori θ* value.
In the bias condition where the simulated ratings are .15 above the ICC and the
simulated rater variability is increased to medium, even at the a priori θ* value of 0, the
recovered θ* values are higher than the low variability condition, even though there is
little difference between the methods. When the a priori θ* value moves into the
negative 1 and 2 values, none of the methods do a decent job of recovering the original
θ* values, suggesting that the bias condition is doing its job and showing that the
119
simulated ratings are showing that the items are being rated as more difficult. In the a
priori θ* value of positive 1, method 1 weighted and method 3 perform well in recovering
the original θ* value, making up for some of the deficiencies in the previous unbiased
condition with this variability level. Once the a priori value moves to positive 2, the bias
condition is in effect as all six methods do a poor job of recovering the original θ* value
as with the bias of rating the items as more difficult, restricts the range and the methods
are restricted because there is not a lot of room to maneuver and thus cannot recover the a
priori θ* value of 2. The error and consistency indexes follow the pattern of the unbiased
condition, showing stronger reliability at the a priori θ* value of 0 and degrading as that
value is moved to the extremes.
Once the variability of the simulated ratings is moved to high, in the bias
condition where the simulated ratings are generated at .15 above the ICC, none of the
methods come close to recovering the original θ* values, even though method 1 weighted
and method 3 perform the best. When the a priori θ* values are moved into the positive
range, the simulated ratings fail to generate, suggesting when the simulated raters have
this much disagreement, there isn’t much that can be done to get accurate ratings of the
items.
In contrast, in the bias condition when the simulated ratings are generated .15
below the ICC, in the low variability condition, and the a priori value is set to 0, all six
methods do a similar job of recovering the original θ* value, but they follow the bias
condition with negative values. When the θ* value is set to negative 1, all six methods
120
range around values of negative one when attempting to recover the original θ* value.
This shows better performance than the unbiased condition, with method 1 weighted,
method 3 and method 2 performing the best. As the a priori θ* value is moved to
negative 2, the values remain similar to the negative 1 condition, showing that with the
reduced range, and the bias of rating the items as easier, there isn’t a lot of room for the
simulated ratings to go too much lower. As the a priori θ* value is move to positive 1,
method 1 weighted and method 3 still get closest to recovering the original θ* value,
showing their robustness with the bias conditions. Interestingly enough, when moving
the a priori θ* value to positive 2, method 1 weighted and method 3 come close to
recovering the original θ* values at 1.151, suggesting that these methods work well at the
upper ranges when the simulated rating bias is set to rate the items easier. This allows for
more variability and thus allows the equations to do their work effectively. This is
somewhat dampened by the error and consistency index showing more issues at the
positive 2 range, but demonstrating that these ratings could be replicated more
consistently when the a priori θ* values are set to below positive 2.
In the bias condition where the simulated ratings are set at .15 below the ICC and
the variability is set to medium, and the a priori θ* value is set to 0, all of the methods do
a decent job of recovering the original θ* value. As the a priori θ* value is set to more
extreme values, method 1 weighted and method 3 do the best job of recovering the
original θ* values, however they do not come close to recovering the original θ* values
when the a priori θ* value is moved to positive or negative 2, even though they are the
121
best performers. The error and consistency indexes are strongest at the a priori θ* value
of 0 and are lower as the a priori value is moved to the extremes, resulting in the lowest
values at positive or negative 2.
In the bias condition where the simulated ratings are set to .15 below the ICC and
the variability is set to high, the only condition where any of the methods come close to
recovering the original θ* values is when the a priori θ* value is set to 0. In the rest of
the θ* value conditions, the high variability among the ratings does not let the θ* values
perform to their best efforts, suggesting that as the θ* values are moved away from 0,
agreement amongst the simulated raters is more important to recovering the original θ*
values.
In the bias condition where the simulated ratings are set to .35 above the ICC, in
the low variability conditions, denoting that the simulated ratings are judging the items as
more difficult, all six methods do a decent job of recovering the a priori θ* values at 0. In
the positive 1 a priori θ* value, method 1 weighted and method 3 emerge as the best
methods to use, but in the negative 1 a priori θ* value, method 2 performs the best,
suggesting that the rater bias affects method 1 weighted and method 3 in recovering the
original θ* value. As the positive and negative 2 values are investigated, each of the
methods do poorly in recovering the original θ* values, although method 1 weighted and
method 3 do better in the positive 2 value. This suggests that these two methods are the
best ones to use when the practitioner knows the makeup of the raters in an applied
setting. The error and consistency indexes are strongest in the negative 1 to positive 1
122
range, and degrade when the a priori θ* value is moved to the extremes. This is in part
due to the restricted range at the extremes and due to the bias condition, one would
expect to get better performance in the positive a priori θ* values.
In the bias condition where the simulated ratings are set to .35 above the ICC, in
the medium variability condition, the only condition where method 1 weighted and
method 3 do a decent job of recovering the original θ* value is when it is set to positive
1. In the rest of the conditions, due to the simulated ratings being set so high, all six
methods do poorly when trying to recovering the original θ* values. The error and
consistency values have a similar pattern to previous conditions where they are strongest
at 0 and degrade when the a priori θ* value is moved to the extremes.
In the bias condition where the simulated ratings are set to .35 above the ICC, in
the high variability condition, the only a priori value that the methods come close to
recovering is when it is set to 0. In the negative conditions, none of the methods come
close, and in the positive conditions, the ratings fail to generate. This is due to the high
variability combined with the simulated ratings being so far above the ICC, that there is
not enough room at those extreme ranges in those conditions for the equations to work as
expected.
In the bias condition where the simulated ratings are set to .35 below the ICC, all
six methods do the best when the a priori θ* value is set to 0, and the values are negative
as expected. In the negative 1 a priori θ* value, method 1 weighted and method 3 do a
good job of recovering the original θ* value, because the there is enough variability at
123
that value so those equations can come close to recovering the original θ* value.
Conversely, those methods do a decent job at the positive 1 a priori θ* value, although
they do come up a little short of recovering it exactly, which is to be expected when the
simulated ratings are set so low. In the positive 2 a priori θ* value, those two methods do
the best job of recovering the θ* value, because of the simulated rater agreement. In the
negative 2 a priori θ* value, method 2 emerges as the best candidate to use, even though
it is closer to negative 1.
In the bias condition where the simulated ratings are set to .35 below the ICC and
the variability increases to medium, each of the methods do a decent job of recovering the
original θ* values at the a priori θ* value of 0. For method 1 weighted and method 3,
they perform best at the negative and positive 1 a priori θ* values, and none of the
methods do a decent job at the negative and positive 2 a priori θ* values. As the
simulated rater variability increases, it is harder to recover the original θ* values at the
extreme ranges.
In the bias condition where the simulated ratings are set to .35 below the ICC and
the variability increases to high, the similar pattern emerges where the methods are
similar at the a priori θ* value of 0 and then all do poorly at the extreme ranges.
Another issue that arose was the simulation of the proportion-correct cutoff scores
that were generated in the medium and high variability conditions at the -2 θ values.
Because the simulation tried to fit the standard deviation conditions, it forced the
124
sampling of higher ratings, which drove up the average proportion-correct cutoff scores
to values that would not be expected based on the shape of the ICCs.
Conclusion
Based on these results, for the most part, method 1 weighted and method 3 are the
most robust and useful methods. However, this only really applies when knowing how
the population of raters is going to judge the items. If expecting the raters to judge items
easy or hard, these methods can come close to recovering the original θ* values. Rater
agreement is key to this process. As the simulated rater agreement increases in
variability, each of the methods become less useful. This suggests for Angoff standard
setting panels, rater orientation and training becomes crucial in order to get each of the
raters calibrated. When they disagree, especially in the extreme a priori +/- 2 θ* values,
any method chosen is not going to recover the original θ* value with any value that will
be useful for exam purposes. It is also important to examine the purpose of the
examination. If desiring a lowly screen in purpose for the exam, such as determining if a
candidate has the basic ability to read or write, the Angoff method in conjunction with
IRT may significantly overestimate the minimum competence threshold on the latent
scale for the -2 θ* conditions. Also, if the goal of the examination is to eliminate
potential mistakes, such as that with a surgeon or police officer, where the public safety is
in danger, in the 2 θ* conditions this process may lead to underestimation of the
minimum competence threshold on the latent scale. If the purpose of the examination is
for normal testing to differentiate between qualified and unqualified candidates in the mid
125
ability ranges (approximately +/- 1), then method 1 weighted and method 3 are the best to
use when invoking an Angoff standard setting method in conjunction with IRT.
For directions in future research, expanding the range of θ* values might not be as
useful, and increasing the variability of the simulated ratings leads to less accuracy in the
recovery of the original θ* values. Investigating Hurtz et al.’s (2008) variance
restrictions for method 1 weighted and method 3 and adjusting them to reflect the
expanded theta range may result in better performance by those methods. Additionally,
the uniform bias conditions may have an effect on each of the method’s performance.
Adjusting the bias so that each rater has an individual bias condition, some rating easier
and some rating harder might prove more useful in an applied setting. Another
suggestion would be to constrain the proportion-correct cutoff scores when using
simulated ratings in order to maintain the shape of the ICC and have more realistic cutoff
scores generated at the lower ability levels. Another possibility would be to use an actual
exam in addition to actual raters, which would allow the researcher to calibrate the
ratings instead of letting the simulation run its course.
The items used for this study were simulated, and that may result in artificial
embellishments of the conversion methods. Using this procedure with multiple actual
exams with actual raters instead of simulating those ratings could give a better indication
of how these methods perform with actual data, instead of simulated data. When that θ*
ability level is in the range that most likely captures most minimum competence
thresholds, ranging from -1 to 1, these conversions work well, especially, method 1
126
weighted and method 3. Method 2 does emerge in extreme conditions of coming closest
to recovering the original θ* values when the simulated ratings around a particular θ*
value is unexpected, such as in the positive bias conditions when looking at negative a
priori θ* values. When using Angoff standard setting panels in conjunction with IRT, the
underlying mathematics will convert the CTT Angoff ratings back into and IRT metric,
but the quality of those conversions depend on the makeup of the raters and the amount
of agreement that the raters have, and is dependent on the purpose and goal of the exam.
127
References
American Educational Research Association, American Psychological Association,
National Council on Measurement in Education. (1999). Standards for
educational and psychological testing. Washington, DC: American Educational
Research Association.
Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.),
Educational measurement (2nd ed., pp. 508-600). Washington, DC: American
Council on Education.
Baker, F. B. (2001). The basics of item response theory. College Park, MD: ERIC
Clearinghouse on Assessment and Evaluation, University of Maryland.
Cascio, W. F., Alexander, R. F., & Barrett, G. V. (1988). Setting cutoff scores: Legal,
psychometirc, and professional issues and guidelines. Personell Psychology, 41,
1-24.
Cizek, G. J. (2006). Standard Setting. In S. M. Downing, & T. M. Haladyna (Eds.),
Handbook of test development (pp. 225-258). Mahway, New Jersey: Lawrence
Erlbaum Associates.
Ellis, B. B., & Mead, A. D. (2002). Item analysis: Theory and practice using classical and
modern test theory. In S. Rogelberg (Ed.), Handbook of Research Methods in
Industrial and Organizational Psychology (pp. 324-343). Malden, Massachusetts:
Blackwell.
128
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah,
New Jersey: Lawrence Erlbaum Associates.
Ferdous, A. A., & Plake, B. (2008). Item response theory based approaches for
computing minimum passing scores from and Angoff-based standard-setting
study. Educational and Psychological Measurement, 68(5), 778-796.
Ferdous, A. A., & Plake, B. S. (2005). The use of subsets of test questions in an Angoff
standard-setting method. Educational and Psychological Measurement, 65(2),
185-201.
Ferdous, A. A., & Plake, B. S. (2007). Item selection strategy for reducing the number of
items raten in an Angoff standard setting study. Educational and Psychological
Measurement, 67(2), 193-206.
Fischer, G. H., & Molenaar, I. W. (1995). Rasch Models: Foundations, recent
development, and applications. New York: Springer-Verlag.
Fishman, G. S. (1996). Monte Carlo: Concepts, algorithms, and applications. New York:
Springer.
Guion, R. M. (1998). Assessment, measurement, and prediction for personnel decisions.
Mahwah, New Jersey: Lawrence Erlbaum Associated.
Hambleton, R. K., & Cook, L. L. (1977). Latent trait models and their use in the analysis
of educational test data. Journal of Educational Measurement, 14(2), 75-96.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and
applications. Boston: Kluwer-Nijhoff Publishing.
129
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991a). Fundamentals of item
response theory. London: Sage Publications.
Hurtz, G. M., Jones, J. P., & Jones, C. N. (2008). Conversion of proportion-correct
standard-setting judgments to cutoff score on the item response theory θ scale.
Applied Psychological Measurement, 32(5), 385-406.
Hurtz, G. M., Muh, V., Pierce, M., & Hertz, N. (2012). The Angoff method through the
lens of latent trait theory: Theoretical and practical benefits of setting standards on
the latent scale (where they belong). SIOP Conference. San Diego, California.
Kachigan, S. K. (1986). Statistical analysis: An interdisciplinary introduction to
univariate and multivariate methods. New York: Radius Press.
Kalos, M. H., & Whitlock, P. A. (1986). Monte Carlo methods. (Vol. I: Basics). New
York: Wiley-Interscience.
Kane, M. T. (1987). On the use of IRT models with judgmental standard setting
procedures. Journal of Educational Measurement, 24(4), 333-345.
Livingston, S. A., & Zieky, M. J. (1982). Performance on educational and occupational
tests. Princeton, NJ: Educational Testing Service.
Lord, F. M. (1980). Applications of item response theory to practical testing problems.
Hillsdale, New Jersey: Lawrence Erlbaum Associates.
Mooney, C. Z. (1997). Monte carlo simulation. London: Sage.
130
Norcini, J. J., Shea, J. A., & Ping, J. C. (1998). A note on the application of multiple
matrix sampling to standard setting. Journal of Educational Measurement, 25(2),
159-164.
Plake, B. S., & Kane, M. T. (1991). Comparison of methods for combining the minimum
passing levels for individual items into a passing score for a test. Journal of
Educational Measurement, 28(3), 249-256.
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests.
Chicago: The University of Chicago Press.
Reckase, M. (2006). A conceptual framework for a psychometric theory for standard
setting with examples of its use for evaluating the functioning of two standard
setting methods. Educational Measurement: Issues and Practice, 25(2), 4-18.
Samejima, F. (1988). Comprehensive latent trait theory. Behaviormetrika, 24, 1-24.
Williams, V. S., Pommerich, M., & Thissen, D. (1998). A comparison of developmental
scales based on Thurstone methods and item response theory. Journal of
Educational Measurement, 35(2), 93-107.
Woodruff, D. (1990). Conditional standard error of measurement in prediction. Journal
of Educational Measurement, 27(3), 191-208.
Download