A COMPARISON OF THE ANGOFF AND THE BOOKMARK PASSING SCORE
METHODS IN A LICENSURE EXAMINATION
A Thesis
Presented to the faculty of the Department of Psychology
California State University, Sacramento
Submitted in partial satisfaction of
the requirements for the degree of
MASTER OF ARTS
in
Psychology
(Industrial/Organizational Psychology)
by
Maria Avalos
SUMMER
2012
© 2012
Maria Avalos
ALL RIGHTS RESERVED
A COMPARISON OF THE ANGOFF AND THE BOOKMARK PASSING SCORE
METHODS IN A LICENSURE EXAMINATION
A Thesis
by
Maria Avalos
Approved by:
__________________________________, Committee Chair
Lawrence S. Meyers, Ph.D.
__________________________________, Second Reader
Gregory M. Hurtz, Ph.D.
__________________________________, Third Reader
Robert L. Holmgren, Ph.D.
____________________________
Date
Student: Maria Avalos
I certify that this student has met the requirements for format contained in the University
format manual, and that this thesis is suitable for shelving in the Library and credit is to
be awarded for the thesis.
__________________________, Graduate Coordinator
Jianjian Qin, Ph.D.
Department of Psychology
___________________
Date
Abstract
of
A COMPARISON OF THE ANGOFF AND THE BOOKMARK PASSING SCORE
METHODS IN A LICENSURE EXAMINATION
by
Maria Avalos
In the current economic environment, state licensing entities need standard
setting methods that allow them to set reliable passing scores and to save time during the
exam development process. The Bookmark and Angoff methods were explored. Data were
obtained from two state licensure examinations in two 2-day workshops. Each examination
had 100 four-option multiple-choice items. Each exam was divided into two sets of items,
resulting in four sets of 50 items each. Each set of items was pass pointed using the two
methods. Results suggested that the Angoff method produced higher cut scores than the
Bookmark method. In addition, SMEs felt more confident about their Angoff cut scores
than their Bookmark cut scores.
_______________________, Committee Chair
Lawrence S. Meyers, Ph.D.
_______________________
Date
ACKNOWLEDGEMENTS
I thank all the valuable people in my life who made the completion of my thesis a
reality. I thought this day would never come.
I thank my committee chair, Dr. Lawrence Meyers, for his valuable guidance
throughout the development of my thesis. Thank you for spending all that time reviewing
my thesis and for being so patient. Thank you for sharing all of your knowledge.
I also thank Dr. Greg Hurtz, my second reader, for reviewing my thesis and
sharing your valuable expertise in passing score research. Thank you for those valuable
comments and suggestions to improve my thesis.
I thank my third reader, Bob Holmgren, supervisor at the Office of Professional
Examination Services, for taking time from his busy schedule to review my thesis. I also
want to thank you for understanding the importance of completing my thesis and
education.
Finally, I thank my family for understanding and supporting me through all of
those days, nights, weekends, and important dates that I did not spend with them because
I had to work on my thesis in order to complete it.
I thank my husband Osvaldo for being so patient and pushing me to keep working
and not quit “a la mitad del camino” (in the middle of the road). I want to thank my son
Oliver, who, being so little, was able to understand that mommy had to do homework
and could not always play with him.
TABLE OF CONTENTS
Page
Acknowledgments....................................................................................................... vi
List of Tables .............................................................................................................. ix
Chapter
1. VALIDITY AND EMPLOYMENT SELECTION …………………………….. 1
Civil Service System .................................................................................................... 1
Legislative Mandates Against Discrimination in Employment ................................... 3
Litigation in Employment Discrimination ................................................................... 4
Professional Standards for Validity ............................................................................. 8
Validity ...................................................................................................................... 10
2. LICENSURE TESTING ....................................................................................... 19
Minimal Acceptable Competence .............................................................................. 20
State of California and Licensure .............................................................................. 22
Job Analysis ............................................................................................................... 25
Subject Matter Experts............................................................................................... 31
3. STANDARD SETTING METHODS ................................................................... 33
Criterion-Referenced Passing Score at Office of Professional Examination
Services ..................................................................................................................... 33
Criterion Methods Based on Test Taker Performance ............................................... 34
Nedelsky’s Method Based on Subject Matter Expert Judgments of Test Content .... 35
Angoff’s Method........................................................................................................ 36
Ebel’s Method Based on Subject Matter Expert Judgments of Test Content ............ 41
Direct Consensus Method .......................................................................................... 42
The Item Descriptor Method ...................................................................................... 42
Bookmark Method ..................................................................................................... 43
Purpose of this Study ................................................................................................. 54
4. METHOD ............................................................................................................. 56
Participants................................................................................................................. 56
Materials .................................................................................................................... 57
Research Design ........................................................................................................ 59
Procedures.................................................................................................................. 60
5. RESULTS ............................................................................................................. 67
6. DISCUSSION ....................................................................................................... 76
Appendix A. Agenda ................................................................................................ 81
Appendix B. Angoff Rating Sheet ............................................................................ 82
Appendix C. Bookmark Rating Sheet ....................................................................... 83
Appendix D. Evaluation Questionnaire .................................................................... 84
Appendix E. MAC Table .......................................................................................... 85
Appendix F. Power Point Presentation ..................................................................... 86
References ................................................................................................................... 87
LIST OF TABLES
Tables
Page
1. Summary of Experimental Design .............................................................................. 59
2. Cut Scores, Standard Deviations, Confidence Intervals, and Reliability Between
Rounds ........................................................................................................................ 68
3. Summary of Angoff and Bookmark Final Cut Scores for each Set of
Items ............................................................................................................................ 70
4. Summary of McNemar Test for Significance of the Proportions of Candidates
Passing by the Angoff Method versus the Bookmark Method ................................... 73
5. Summary about Standard Setting Process .................................................................. 74
Chapter 1
VALIDITY AND EMPLOYMENT SELECTION
Civil Service System
In the history of employment testing, discrimination against members of protected
classes has been a major problem that has deprived individuals of employment
opportunities. Before the civil rights movement, employers were using procedures for
selecting job applicants that were not linked to a job analysis of the profession. Selection
procedures illegally discriminated against minority groups on the basis of race, color,
religion, sex, or national origin. Those selection procedures were not job-related. After
many court cases, the government began to require content validity evidence for employee
selection procedures that showed unfair treatment of minority groups.
Origin of the Civil Service System Around the World
According to Kaplan and Saccuzzo (2005), most of the major changes in
employment testing have happened in the United States during the last century. The use
of testing, however, might have its origins in China more than 4,000 years ago. The use of
test instruments by the Han Dynasty (206 B.C.E. to 220 C.E.) was quite popular in the
areas of civil law, military, and agriculture. During the Ming Dynasty (1368-1644 C.E.),
a testing program was developed in which special testing centers were created in different
geographic locations.
The Western countries learned about testing programs through the Chinese
culture. In 1832, the English East India Company started to select employees for overseas
duties using the Chinese method. After 1855, the British, French, and German
governments adopted the same method of testing for their civil service (Kaplan &
Saccuzzo, 2005).
Civil Service System in the United States
According to Meyers (2006), the spoils system started with George Washington
and Thomas Jefferson. Under the spoils system, newly elected administrations appointed
to government administrative posts ordinary citizens who had contributed to the winners'
campaigns, regardless of their ability to perform the jobs. These new appointees replaced
the persons appointed by the previous administration. The spoils system reached its peak
during the administration of Andrew Jackson from 1829 to 1837. The spoils system was in
conflict with equal employment under the law because making employment decisions
based on race, sex, or national origin rather than on people's qualifications is
discriminatory and unjust. The civil service
system was created to correct these problems. Equal opportunity needed federal
legislation and psychometric input to be effective (Meyers, 2006).
Guion (1998) explains that in 1871 the U.S. Congress authorized the creation of a
Civil Service Commission to establish competitive merit examinations as a way to reform
the spoils system and bring stability and competence to government, but it was soon
ended by President Grant (p. 9). Meyers (2006) states that the Civil Service Commission
required that all examinations be job-related. The Pendleton Act created a permanent
civil service system in 1883. The government was required to select personnel using
competitive merit examinations, but discrimination in employment did not stop there. The
civil service originally covered fewer than 15,000 jobs, but coverage rapidly expanded over
the years to all federal jobs. The civil service system was not intended to remove
discrimination in employment; it simply helped to ensure that applicants taking job
examinations were evaluated on the basis of merit (Meyers, 2006).
Legislative Mandates Against Discrimination in Employment
Civil Rights Act 1964
Congress passed the Civil Rights Act in 1964 to make discrimination explicitly
illegal in employment (Guion, 1998). Title VII of the Civil Rights Act of 1964 addresses
Equal Employment Opportunity and prohibits making employment-related decisions
based on an employee’s race, color, religion, sex, or national origin. The Act prohibits the
use of selection procedures that result in adverse impact unless the employer
demonstrates that they are job-related. An updated Civil Rights Act was passed in 1991 and
addressed the burden of proof in disparate impact cases brought under Title VII (Meyers,
2006).
Age Discrimination in Employment Act 1967
The Age Discrimination in Employment Act of 1967 (ADEA) followed the Civil
Rights Act and addresses discrimination against individuals who are at least 40 years of
age in the employment process. The ADEA prohibits discrimination in employment testing.
Applicants in this group must be provided with equal employment opportunities. When
an older worker files a discrimination complaint under the ADEA, the employer must present
evidence showing that any age requirement in the employment process is job-related. Employers
must have documented support to defend their employment practices as being job-related.
“ADEA covers employers having 20 or more employees, employment agencies, and
labor unions” (U.S. Department of Labor, 2000).
The Uniform Guidelines on Employee Selection Procedures
The Uniform Guidelines on Employee Selection Procedures published in the
Federal Register in 1978 were developed by the Equal Employment Opportunity
Commission (EEOC), the Civil Service Commission, and the Labor and Justice
Departments (U.S. Department of Labor, 2000). The Guidelines describe the 80% rule to
identify adverse impact, which compares rates of selection or hiring of protected class
groups to the majority group. Adverse impact is observed when the protected groups are
selected at a rate that is less than 80% of the majority group. When meaningful group
differences are found, there is a prima facie case that the employer engaged in illegal
discrimination. The employer is then required to show documentation of the validity of
the selection procedures as being job-related (Meyers, 2006).
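To make the four-fifths rule concrete, the following is a minimal Python sketch of the comparison described above. It is offered only as an illustration; the applicant and hire counts are hypothetical and do not come from the cases or guidelines discussed in this thesis.

```python
# Minimal sketch of the Guidelines' four-fifths (80%) rule described above.
# The group counts are hypothetical illustration data.

def selection_rate(selected, applicants):
    """Proportion of applicants in a group who were selected."""
    return selected / applicants

def adverse_impact_ratio(protected_rate, majority_rate):
    """Ratio of the protected group's selection rate to the majority group's rate."""
    return protected_rate / majority_rate

# Hypothetical example: 30 of 100 protected-group applicants hired,
# 60 of 120 majority-group applicants hired.
protected = selection_rate(30, 100)                 # 0.30
majority = selection_rate(60, 120)                  # 0.50
ratio = adverse_impact_ratio(protected, majority)   # 0.60

# Under the 80% rule of thumb, a ratio below 0.80 is taken as evidence of
# adverse impact that the employer must then justify with validity evidence.
flag = "adverse impact indicated" if ratio < 0.80 else "no adverse impact indicated"
print(f"Adverse impact ratio: {ratio:.2f} ({flag})")
```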
“The Guidelines cover all employers employing 15 or more people, labor
organizations, and employment agencies” (U.S. Department of Labor, 2000). In addition,
employers follow the Guidelines when making employment decisions based on tests and
inventories.
Litigation in Employment Discrimination
Griggs versus Duke Power Company
Several critical court decisions have had an impact on the employee selection
process. In the 1971 Supreme Court case of Griggs versus Duke Power Company,
African American employees charged that they were being denied promotions on the basis
of their lack of a high school education and their performance on an intelligence test.
These two employment practices were unrelated to any aspect of the job. The Court ruled
that employment selection examinations had to be job-related and that employers could
not use practices that resulted in adverse impact even in the absence of discriminatory
intent; the motive behind those practices was not relevant (Guion, 1998).
Albemarle Paper Company versus Moody
In the 1975 Supreme Court case Albemarle Paper Company versus Moody,
African American employees claimed that they were only allowed to work in lower-paid
jobs. The Court ruled that selection examinations must be significantly linked with
important elements of work behavior relevant to the job for which candidates are being
evaluated. The Court found that a job analysis was required to support the validation
study (Shrock & Coscarelli, 2000).
Kirkland versus New York State Department of Correctional Services
In the 1983 case of Kirkland versus New York State Department of Correctional
Services, the Court ruled that identifying critical tasks and knowledge as well as the
competency required to perform the various aspects of the job was an essential part of job
analysis. They concluded that the foundation of a content valid examination is the job
analysis (Kirkland v. New York Department of Correctional Services, 711 F.2d 1117,
1983).
Golden Rule versus Illinois
In the 1984 Golden Rule versus Illinois out-of-court case, the Golden Rule
Insurance Company and five individuals who had failed portions of the Illinois insurance
licensing exams sued the Educational Testing Service (ETS) regarding the development
of the tests. The fundamental issue was the discriminatory impact of the licensing exam
on minority groups. The key provision in the case was that preference should be given in
test construction, based on a job analysis, to the questions that showed smaller differences
in black and white candidates' performance. The main elements agreed to included:
• Collection of racial and ethnic data from all examinees on a voluntary basis.
• Assignment of an advisory committee to review licensing examinations.
• Classification of test questions into two groups: (I) questions for which (a) the correct-answer rates of black examinees, white examinees, and all examinees are not lower than 40% at the .05 level of statistical significance, and (b) the correct-answer rates of black examinees and white examinees differ by no more than 15% at the .05 level of statistical significance, and (II) all other questions.
• Creation of test forms using questions in the first group described above in preference to those questions in the second group, if possible to do so and still meet the specifications for the tests. Within each group, questions with the smallest black-white correct-answer rate differences would be used first, unless there was good reason to do otherwise.
• Annual release of a public report on the number of passing and failing candidates by race, ethnicity, and educational level.
• Disclosure of one test form per year and pretesting of questions (Linn & Drasgow, 1987).
ETS agreed to reduce the discrepancy in the scores received by black and white
candidates. ETS would “give priority to items for which the passing rates of blacks and
whites are similar and to use items in which the white passing rates are substantially
higher only if the available pool of Type 1 items is exhausted” (Murphy, 2005, p. 327).
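As a rough illustration of the item classification in the settlement, the following Python sketch applies only the 40% and 15% correct-answer-rate thresholds listed above; the statistical-significance qualifications in the agreement are omitted, and the item data are hypothetical.

```python
# Simplified sketch of the Golden Rule item classification described above.
# Only the correct-answer-rate thresholds are applied; the .05 significance
# tests in the actual agreement are not modeled. Item data are hypothetical.

def classify_item(p_black, p_white, p_all):
    """Return 'I' if the item meets the simplified Type I criteria, else 'II'."""
    rates_high_enough = min(p_black, p_white, p_all) >= 0.40
    small_difference = abs(p_white - p_black) <= 0.15
    return "I" if rates_high_enough and small_difference else "II"

# Hypothetical item pool: (item id, black, white, and overall correct-answer rates).
items = [
    ("item_01", 0.55, 0.60, 0.58),
    ("item_02", 0.35, 0.70, 0.55),
    ("item_03", 0.48, 0.66, 0.59),
]

for item_id, p_b, p_w, p_a in items:
    group = classify_item(p_b, p_w, p_a)
    # Within Group I, items with the smallest black-white difference are used first.
    print(item_id, "-> Group", group, "| black-white difference:", round(abs(p_w - p_b), 2))
```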
Ricci versus DeStefano
In the 2009 case Ricci versus DeStefano, White and Hispanic firefighters alleged
reverse discrimination under Title VII and the Fourteenth Amendment's promise of equal
protection under the law. When the city found that results from promotional examinations
showed that White candidates outperformed minority candidates, they decided to throw
out the results based on the statistical racial disparity. The Supreme Court held that by
discarding the exams, the City of New Haven violated Title VII of the Civil Rights Act of
1964. The city engaged in outright intentional discrimination (Roberts, 2010).
Lewis versus City of Chicago
In the 2011 case of Lewis versus City of Chicago, several African-American
applicants who scored in the qualified range and had not been hired as candidate
firefighters filed a charge of discrimination with the Equal Employment Opportunity
Commission (EEOC). The Court ruled that the City of Chicago had used the
discriminatory test results each time it made hiring decisions on the basis of that policy.
In addition, an employment practice with a disparate impact can be challenged not only
when the practice is adopted, but also when it is later applied to fill open positions. The
case was settled and the city must pay for discriminating against black firefighter
candidates dating back to a 1995 entrance exam. The city will also have to hire 111
qualified black candidates from the 1995 test by March 2012 and pay $30 million in
damages to the 6000 others who will not be selected from a lottery system to re-take tests
(Lewis v. City of Chicago, 7th Cir, 2011).
As a result of these Court decisions regarding the use of job-related selection
practices, illegal discrimination has been significantly reduced. Although the government's
efforts to eliminate discriminatory practices in the area of employment testing have
helped to reduce these illegal practices, they have not stopped them entirely.
Some other Court decisions have also created some controversy for test developers and
employers regarding the use of statistical data to evaluate the validity and reliability of
test items instead of the job-relatedness of the items. The development of regulations that
describe in detail how to avoid discriminatory practices against minority groups was
another tool for test developers and users to comply with the law and reduce illegal
discrimination.
Professional Standards for Validity
Uniform Guidelines on Employee Selection Procedures
The Uniform Guidelines on Employee Selection Procedures (Equal Employment
Opportunity Commission, Civil Service Commission, Department of Labor, &
Department of Justice, 1978) were developed by the Department of Labor, the EEOC, the
Civil Service Commission, and the Department of Justice in 1978. Their purpose was to
promote a clear set of principles to help employers comply with the laws that prohibit
illegal discrimination. The Uniform Guidelines established employment selection
practices that meet the law requirements for equal opportunity and that are job-related
using a job analysis (Brannick, Levine, & Morgeson, 2007, p. 169). The requirements for
conducting a job analysis include identifying the tasks and the knowledge, skills, and
abilities (KSAs) that comprise successful performance of the job. Only those critical
tasks and KSAs should be used as the basis of selection. The Uniform Guidelines
promote the use of valid selection practices and provide a framework for the proper use
of selection examinations. “The Uniform Guidelines remain the official statement of
public policy on assessment in employment” (Guion, 1998).
Standards for Educational and Psychological Testing
The Standards for Educational and Psychological Testing (Standards, 1999) were
developed by the American Educational Research Association (AERA), American
Psychological Association (APA), and the National Council on Measurement in
Education (NCME). “The Standards are the authoritative source of information on how
to develop, evaluate, and use tests and other assessment procedures in educational,
employment, counseling, and clinical settings” (U.S. Department of Labor, 2000).
Although developed as professional guidelines, they are consistent with applicable
regulations and are frequently cited in litigation involving testing practices.
The Standards (1999) reflect changes in the United States Federal law and
measurement trends affecting validity; testing individuals with disabilities or different
linguistic backgrounds; and new types of tests as well as new uses of existing tests. The
Standards are written for professionals and address professional and technical
issues of test development and use in education, psychology, and employment.
Principles for the Validation and Use of Personnel Selection Procedures
The Principles for the Validation and Use of Personnel Selection Procedures
(Principles, 2003) were developed by the Society for Industrial and Organizational
Psychology. The purpose of the Principles (2003) is to “specify established scientific
findings and generally accepted professional practice in the field of personnel selection
psychology in the choice, development, evaluation, and use of personnel selection
procedures designed to measure constructs related to work behavior with a focus on the
accuracy of the inferences that underlie employment decisions”. The Principles promote
the use of job analyses and provide assistance to professionals by providing guidelines
for the evaluation, development and use of testing instruments. The Principles have been
a source of technical information concerning employment decisions in court. The
Principles are intended to be consistent with the Standards and to
inform decision making related to the statutes, regulations, and case law regarding
employment cases (Brannick et al., 2007).
As a result of the different Court decisions and the passage of testing regulations,
employers are required to use job-related procedures for selection of employees in order
to avoid discrimination. Employers need to follow the established guidelines and base
their selection procedures on an established job analysis in order to provide content
validation evidence in the case of adverse impact.
Validity
State licensure examinations are developed to identify applicants who are qualified
to practice independently on the job. The state of California is mainly interested in
identifying the test taker who possesses the minimal acceptable competencies required
for the job. Licensure examinations have to be linked to an exam plan or exam outline
that resulted from an empirical study/occupational analysis in order to be legally
defensible and fair to the target population of applicants. The exam plan, items, and cut
scores must also be reviewed, discussed, and approved by subject matter experts (SMEs)
of the specified profession. These and other details of the test development process
are established to ensure the most important standard in psychological
testing: validity.
The conception of validity has evolved over time. The Standards (1950) included
criterion, content, and construct validity in their definition. Based on the Standards (1999),
the conceptualization of validity has become more fully explicated and expanded.
“Validity refers to the degree to which evidence and theory support the interpretation of
test scores entailed by proposed uses of tests” (American Educational Research
Association, American Psychological Association, & National Council on Measurement
in Education, 1999).
Validation is a process of accumulating evidence. Tests are not declared to be
valid until enough evidence has been produced to demonstrate their validity. The validity of
tests can only be declared for certain uses but cannot be declared in general. It is
necessary to accumulate evidence, based on a job analysis, to demonstrate that
competencies measured in the test are the same and equally important across different
situations. Validity is not a property of the test. Validity is a judgment tied to the existing
evidence to support the use of a test for a given purpose. Test validation experts should
say that there is sufficient evidence available to support using a test for the particular
purpose.
Validity is the most important professional standard that has been established to
ensure a legally defensible exam development process. There are multiple techniques for
collecting evidence of validity. The Standards propose five sources of validity evidence
in order to provide a sound valid argument.
Validity Evidence Based on Test Content
There are various sources of evidence that may be used to evaluate a proposed
interpretation of test scores for particular purposes. Evidence based on test content refers
to themes, wording, and format of the items, tasks, or questions on a test as well as
guidelines for procedures regarding administration and scoring. It may include logical or
empirical analyses of the adequacy with which the test content represents the content
domain and of the relevance of the content domain to the proposed interpretation of test
scores. It can also come from SMEs’ judgments of the relationship between parts of the
test and the construct. In a licensure test, the major facets of the specific occupation can
be specified, and SMEs in that occupation can be asked to assign test items to the
categories defined by those facets. SMEs can also judge the representativeness of the
chosen set of items. Some tests are based on systematic observations of behavior and the
use of expert judgments to assess the relative importance, criticality, and frequency of the
tasks. A job sample test can be constructed from a random or stratified random sampling
of tasks rated highly on these characteristics.
The appropriateness of a given content domain is related to specific inferences to
be made from test scores. Thus, when considering an available test for a purpose other
than that for which it was first developed, it is important to evaluate the appropriateness
of the original content domain for the new use.
Evidence about content-related validity can be used in part to address questions
about differences in the meaning or interpretation of test scores across relevant subgroups
of examinees. Of particular concern is the extent to which construct underrepresentation
or construct-irrelevant components may give an unfair advantage or disadvantage to one
or more subgroups of examinees. In licensure examinations, the representation of a
content domain should reflect the KSAs required to successfully perform the job. All the
items in a test should be linked to the tasks and KSAs of the minimal acceptable
competencies from an occupational analysis of the specified profession in order to be
legally defensible. SMEs must agree that the details of the occupational analysis are
representative of the population of candidates for which the test is developed. Careful
review of the construct and test content domain by a diverse panel of SMEs may point to
potential sources of irrelevant difficulty or easiness that require further investigation
(AERA et al., 1999).
Validity Evidence Based on Response Process
Evidence based on response processes comes from analyses of individual
responses and provides evidence concerning the fit between the construct and the detailed
nature of performance or response actually engaged in by examinees. The Standards
explain that response process issues should also include errors made by the raters of
examinees' performance. To the extent that identifiable rater errors are made, responses
of the test takers are confounded in the data collection procedure. Questioning test takers
about their performance strategies to particular items can yield evidence that enriches the
definition of a construct. Maintaining records of development of a response to a writing
task, through drafts or electronically monitored revisions, also provides evidence of
process. Wide individual differences in process can be revealing and may lead to
reconsideration of certain test formats. Studies of response process are not limited to the
examinee. Assessments often rely on raters to record and evaluate examinees’
performance. Relevant validity evidence includes the extent to which the processes of
raters are consistent with the intended interpretation of scores. Thus, validation may
include empirical studies of how raters record and evaluate data along with analyses of
the appropriateness of these processes to the intended interpretation or construct
definition (AERA et al., 1999).
Validity Evidence Based on Internal Structure
Evidence based on internal structure refers to analyses of a test that indicate
the degree to which the relationships among test items and test components conform to
the construct on which the proposed test score interpretations are based. The conceptual
framework for a test may imply a single dimension of behavior, or it may suggest several
components that are each expected to be homogeneous, but that are also distinct from
each other.
Studies of internal structure of tests are designed to show whether particular items
may function differently for identifiable subgroups of examinees. Differential item
functioning (DIF) occurs when different groups of examinees with similar overall ability,
or similar status on an appropriate criterion, have, on average, systematically different
responses to a particular item. Statistics such as reliability coefficients, item characteristic
curves (ICCs), inter-item correlations, item-total correlations, and factor structure are
some of the statistical sources that are part of the new Standards. ICCs provide invaluable
information regarding the capability of the individual items to distinguish between test
takers of different ability or performance levels. They also can indicate the difficulty
level of the item. Inter-item correlations inform us of the degree of shared variance of the
items and the item-total correlations describe the relationship between the individual item
and the total test score. Inter-item and item-total correlations also drive the reliability of
the test. A statistical analysis of the administered test data can address most of these areas
(AERA et al., 1999).
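To illustrate the item-level statistics just mentioned, the following is a minimal Python sketch that computes item difficulty (proportion correct), corrected item-total correlations, and coefficient alpha for a small, hypothetical 0/1 response matrix; it is not taken from the analyses reported later in this thesis.

```python
# Minimal sketch of common internal-structure statistics: item difficulty,
# corrected item-total correlations, and Cronbach's alpha.
# Rows are examinees, columns are items; 1 = correct, 0 = incorrect (hypothetical data).
import numpy as np

responses = np.array([
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
])

n_items = responses.shape[1]
total = responses.sum(axis=1)

# Item difficulty: proportion of examinees answering each item correctly.
difficulty = responses.mean(axis=0)

# Corrected item-total correlation: each item against the sum of the other items.
item_total = [
    np.corrcoef(responses[:, j], total - responses[:, j])[0, 1]
    for j in range(n_items)
]

# Cronbach's alpha from the item variances and the total-score variance.
item_var_sum = responses.var(axis=0, ddof=1).sum()
total_var = total.var(ddof=1)
alpha = (n_items / (n_items - 1)) * (1 - item_var_sum / total_var)

print("difficulty:", np.round(difficulty, 2))
print("corrected item-total r:", np.round(item_total, 2))
print("alpha:", round(alpha, 2))
```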
Factor structure of the test is another part of the internal structure evidence. Items
that are combined together to form a scale or subscale on a test should be selected based
on empirical evidence to support the combining of designated items. The means of
obtaining this evidence is factor analysis, which identifies items sharing sufficient variance
for a viable underlying dimension to be statistically extracted. Thus, the scoring of the
test must be tied to the factor structure of the test. When scoring involves a high level of
judgment on the part of those doing the scoring, measures of inter-rater agreement may
be more appropriate than internal consistency estimates. Test users are obligated to make
sure that these statistical analyses are performed and that the results are used to evaluate
and improve the test (AERA et al., 1999).
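As a brief illustration of this use of factor analysis, the sketch below fits an exploratory factor model to a small, randomly generated data set in which two clusters of hypothetical items were built to reflect two underlying dimensions; it is a sketch of the general technique, not of any analysis in this thesis, and it uses scikit-learn's FactorAnalysis for convenience.

```python
# Minimal sketch: checking whether designated items share enough variance to be
# combined into subscales, using exploratory factor analysis on hypothetical data.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# 200 hypothetical examinees, 6 items: items 0-2 are driven by one dimension,
# items 3-5 by another, each with added noise.
f1 = rng.normal(size=(200, 1))
f2 = rng.normal(size=(200, 1))
items = np.hstack([
    f1 + 0.5 * rng.normal(size=(200, 3)),
    f2 + 0.5 * rng.normal(size=(200, 3)),
])

# Extract two factors and inspect the loadings; items intended for the same
# subscale should load on the same factor.
fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(items)
print(np.round(fa.components_, 2))  # rows = factors, columns = items
```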
Validity Evidence Based on Relations to Other Variables
Evidence based on relations to other variables refers to the relationship of test
scores to variables external to the test. According to the Standards, these variables
include measures of some criteria that the test is expected to predict, as well as
relationships to other tests hypothesized to measure the same constructs, and tests
measuring related or different constructs. Measures such as performance criteria are often
used in employment settings. This evidence addresses the degree to which the relationships are
consistent with the construct underlying the proposed test interpretations. There are three
sources of evidence based on relations to other variables (AERA et al., 1999).
Convergent and divergent evidence. Convergent evidence is
provided by relationships between test scores and other measures intended to assess
similar constructs, whereas relationships between test scores and measures of different
constructs provide divergent evidence. Experimental and correlational evidence can be
involved.
Test-criterion correlations. Test-criterion correlations refer to how accurately
test scores predict criterion performance. The degree of accuracy depends on the purpose
for which the test was used. The criterion variable is a measure of a quality of primary
interest determined by users. The value of a test-criterion study depends on the reliability
and validity of the interpretation based on the criterion measure for a given testing
application. The two designs that evaluate test-criterion relationships are predictive study
and concurrent study.
Predictive study. A predictive study indicates how accurately test data can predict
criterion scores that are obtained at a later time. A predictive design preserves the
temporal difference between the test and later performance found in the practical situation,
but it does not by itself provide the information necessary to judge and compare the
effectiveness of the assignments used.
Concurrent study. A concurrent study obtains predictor and criterion information
at the same time. Test scores are sometimes used in allocating individuals to different
treatments, such as jobs within an institution. Evidence is needed to judge the suitability of
using a test when classifying or assigning a person to one job versus another. Evidence
about relations to other variables is also used to investigate questions of differential
prediction for groups. Differences can also arise from measurement error. Test
developers must be careful of criterion variables that are theoretically appropriate but
contain large amounts of error variance.
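The statistic at the center of both designs is the test-criterion correlation. The following minimal Python sketch computes one for a small set of hypothetical predictor scores and criterion ratings; in a predictive study the criterion would be gathered later, in a concurrent study at the same time, but the coefficient is computed the same way.

```python
# Minimal sketch of a test-criterion (validity) correlation with hypothetical data.
import numpy as np

test_scores = np.array([62, 71, 55, 80, 90, 68, 74, 59])              # predictor
job_performance = np.array([3.1, 3.8, 2.9, 4.2, 4.5, 3.3, 3.9, 3.0])  # criterion

# Pearson correlation between test scores and the criterion measure.
r = np.corrcoef(test_scores, job_performance)[0, 1]
print(f"test-criterion correlation: {r:.2f}")
```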
Validity generalization. Validity generalization is the “degree to which evidence
of validity based on test-criterion relations can be generalized to a new situation.” Meta-analytic analyses have shown that much of the variability in test-criterion correlations
may be due to statistical artifacts such as sampling fluctuations and variations across
validation studies in the ranges of test scores and in the reliability of criterion measures.
Statistical summaries of past validation in similar situations may be useful in estimating
test-criterion relationships in a new situation. Transportability and synthetic validity are
part of validity generalization (AERA, et al., 1999).
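The sketch below illustrates, in a bare-bones way, the meta-analytic reasoning described above: a sample-size-weighted mean correlation is computed across studies, and the portion of the between-study variance expected from sampling error alone is subtracted from the observed variance. The study correlations and sample sizes are hypothetical, and corrections for range restriction and criterion unreliability are omitted.

```python
# Bare-bones sketch of validity generalization logic with hypothetical study data.
import numpy as np

r = np.array([0.25, 0.18, 0.32, 0.21, 0.28])   # observed validity coefficients
n = np.array([80, 120, 60, 150, 100])          # study sample sizes

# Sample-size-weighted mean correlation and observed variance across studies.
mean_r = np.sum(n * r) / np.sum(n)
observed_var = np.sum(n * (r - mean_r) ** 2) / np.sum(n)

# Variance expected from sampling error alone (a common approximation).
sampling_error_var = (1 - mean_r ** 2) ** 2 / (n.mean() - 1)

# Residual variance: variability in correlations not explained by sampling error.
residual_var = max(observed_var - sampling_error_var, 0.0)

print(f"weighted mean r = {mean_r:.3f}")
print(f"observed variance = {observed_var:.4f}")
print(f"sampling error variance = {sampling_error_var:.4f}")
print(f"residual variance = {residual_var:.4f}")
```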
Transportability. Transportability has to do with the use of a test developed in
one context that is brought into a new context. In order to use the test in a new context,
sufficient documentation must be produced to justify the use of the test, such as evidence that the
two jobs share the same set of tasks and KSAs and that the candidate groups are
comparable.
Synthetic validity. Synthetic validity involves combining elements that have their
own associated validity evidence together to form a larger test. Elements are usually
fundamental ability modules, but there are needs to be sufficient documentation that the
fundamental elements are related to the domain that is being tested by the larger test.
These elements can be groups of related tasks and KSAs which can even be associated
with certain assessment strategies. A job analysis can reveal which tasks or task groups
are represented in a particular job. Results of a single local validation study may be quite
imprecise, but there are situations where a single study, carefully done, with adequate
sample size, provides sufficient evidence to support test use in a new situation.
Evidence Based on Consequences of Testing
In evidence based on consequences of testing, it is important to distinguish
between evidence that is directly relevant to validity and evidence that may inform decisions about
social policy but falls outside the realm of validity. A few of the benefits of tests are
placement of workers in suitable jobs or prevention of unqualified individuals from
entering a profession. Validity can be found by examining whether or not the proposed
benefits of the test were obtained. A fundamental purpose of validation is to indicate
whether these specific benefits are likely to be realized. This source of evidence is gained
by ruling out confounds which can unjustly cause group differences and invalidate the
test (AERA et al., 1999).
Chapter 2
LICENSURE TESTING
Tests are widely used in the licensing of persons for many professions. According
to the Standards (1999), licensing requirements are imposed by state governments to
ensure that licensed people possess knowledge and skills in sufficient degree to perform
important occupational activities safely. Tests in licensure are intended to provide the
state governments with a dependable mechanism for identifying practitioners who have
met particular standards. The standards are strict, but not so stringent as to unduly restrain
the right of qualified individuals to offer their services to the public. Licensing also
serves to protect the profession by excluding persons who are deemed to be not qualified
to do the work of the occupation. Qualifications for licensure typically include
educational requirements, supervised experience, and other specific criteria as well as
attainment of a passing score on one or more examinations. Tests are used in licensure in
a broad spectrum of professions, including medicine, psychology, teaching, real estate, and
cosmetology (AERA et al., 1999).
Licensure tests are developed to verify that candidates have mastered the
essential KSAs of a specified domain. The purpose of performance standards is to define
the degree of KSAs needed for safe and independent practice. The Standards (Standards
14.14, 14.15, and 15.16) propose that in order to protect the public, tests should be
consistent with the minimal KSAs required to practice safely when test takers obtain their
license. Knowledge that is attained after getting a license and experience on the job
should not be included on the licensure test.
Standard 14.17 proposes that the level of performance required for passing a
licensure test should depend on the KSAs necessary for minimal acceptable competence
in the profession and the proportion of people passing the test should not be adjusted. The
adjustment of cut scores lowers the degree of validity of the test.
Test development begins with an adequate description of the profession, so that
persons can be clearly identified as engaging in the activity. A definition of the nature
and requirements of the current profession is developed. A job analysis of the work
performed by the job incumbents is conducted. The essential KSAs are documented and
identified by qualified specialists in testing and SMEs in order to define test
specifications. Multiple-choice tests are one of the forms of testing that are used in
licensure as well as oral exams (AERA et al., 1999).
Minimal Acceptable Competence
In order to qualify to take a licensure examination, it is necessary that candidates
meet the education, training, and experience required for entry-level practice for the
profession. In addition, some professions allow candidates to demonstrate their
competence to perform the essential functions of the job with the intention to avoid a
situation where qualified candidates are denied a license because they did not meet the
educational requirement specified by the law (Schmitt & Shimberg, 1996).
According to the Standards (1999), the plan of a testing program for licensure
must include the description of the areas to be covered, the number of tests to be used,
and the method that is going to be used to combine the various scores and obtain the
overall result. An acceptable performance level is required on each examination. The
Standards (1999) explain that “defining the minimum level of knowledge and skill
required for licensure is one of the most important and difficult tasks facing those
responsible for licensing”. It is critical to verify the relevance of the cut score of a
licensure examination (Standard 14.17). The accuracy of the inference drawn from the
test depends on whether the standard for passing differentiates between qualified and
unqualified performance.
It is recommended that SMEs specify the performance standards required to be
demonstrated by the candidate. Performance standards must be high enough to ensure
protection of the public and the candidate, but not so high as to restrict the candidate from
the opportunity to obtain a license. The Standards (1999) state that some government
agencies establish cut scores, such as 75%, on licensure tests without the use of SMEs.
The resulting cut scores are meaningless because, without details about the test, the job
requirements, and how they are related, it is impossible to conduct an objective
standard setting (Standards 14.14 and 14.17).
Licensure testing programs need to be accurate in the selection of the cut score.
Computer-based tests may end when a decision about the candidate’s performance is
made or when the candidate reaches the allowed time. Consequently, a shorter test may
be provided for candidates whose performance exceeds or falls far below the minimum
performance required for a passing score. Mastery tests do not specify “how badly the
candidate failed, or how well the candidate passed”; providing scores that are higher or
lower than the cut score could be confusing (AERA et al., 1999). Nevertheless,
candidates who fail are likely to profit from information about the areas in which their
performance was weak. When feedback to candidates about performance is intended,
precision throughout the score range is needed (Standard 14.10).
State of California and Licensure
Civil service procedures are established by the State Constitution and the
Government Code. The State had two organizations that enacted regulations based on
these statutes. These organizations were the State Personnel Board (SPB) and the
Department of Personnel Administration (DPA). SPB had authority under Article VII of
the State Constitution to oversee the merit principle. Article VII of the State Constitution
says that all public employees should be selected based on merit and with the use of a
competitive examination. Article VII establishes the constitutional structure of
California's modern civil service system and eliminates the political spoils system. Based on
Article VII, the SPB’s responsibilities included civil service examinations, the formal
establishment of job classifications, and discipline. DPA was responsible for functions
including pay, day-to-day administration of the classification plan, benefits, all other
conditions of employment, and collective bargaining. In order to eliminate overlapping of
activities and have a more efficient system, the human resource management
responsibilities performed by SPB and DPA were consolidated into one organization.
Effective July 1, 2012, the name of the new organization is California Department of
Human Resources (CalHR). The CalHR preserves the merit principle in state government
as required by Article VII of the State Constitution (Department of Consumer Affairs,
2007).
State Licensing
Professional licenses are issued by state agencies known as boards or bureaus. In
the State of California, the Department of Consumer Affairs (DCA) is the main state
agency that regulates professional licenses. A number of statutes set criteria for the
licensing process in California. These include the California Government Code section
12944 of the California Fair Employment and Housing Act and the Business and
Professions Code section 139 (Office of Professional Examination Services).
The California Government Code section 12944 of the California Fair
Employment and Housing Act prohibits discrimination by any state agency or authority
in the State and Consumer Services Agency, which has the authority to grant licenses that
are prerequisites to employment eligibility or professional status. The Department of Fair
Employment and Housing can accept complaints against licensing boards where
discrimination is alleged based on race, creed, color, national origin or ancestry, sex, age,
medical condition, physical disability or mental disability unless such practice can be
demonstrated to be job-related.
Business and Professions Code section 139 establishes a policy that sets minimum
requirements for psychometrically sound examination validation, examination
development, and occupational analyses. Section 139 requires a review of the
appropriateness of prerequisites for admission to the examination. The annual review
applies to programs administered by the DCA and they must report to the director. The
review must include the method employed for ensuring that every licensing examination
is periodically evaluated. Section 139 ensures that the examinations are fair to candidates
and assess job-related competencies relevant to current and safe practice (Board
Licensing Examination Validation and Occupational Analysis, 2006).
The Department of Consumer Affairs (DCA) issues more than 200 professional
licenses in the state of California. The State establishes requirements and competence
levels for licensure examinations. The DCA licenses practitioners, investigates consumer
complaints, and takes action against violators of state law. The DCA also protects licensees
from unfair competition by unlicensed practitioners (Marino, 2007).
The California Department of Consumer Affairs (2011) has changed the way it
manages licensure over time. In 1876, the State established its first board, in the field of medicine. The
Medical Practice Act established entry-level standards for physicians, created licensing
examinations, and imposed fees for violations. Over the next thirty years, other
occupations were controlled by the State. In 1929, the Department of Professional and
Vocational Standards was established by combining 10 different boards. Accountants,
architects, barbers, cosmetologists, dentists, embalmers, optometrists, pharmacists,
physicians, and veterinarians were licensed by these boards. In 1970 the Consumer
Affairs Act changed the name to the Department of Consumer Affairs. Currently, the
DCA regulates more than 40 professions in the State.
Licensing professionals who pass licensure examinations and meet state
requirements guarantees that only competent practitioners are legally allowed to serve the
public. In addition, it ensures that consumers are allowed to try different alternatives if a
professional service is not done competently. Educational programs are provided to
licensees in order to continue their learning and maintain their professional competence
(Marino, 2007).
The Office of Professional Examination Services (OPES) provides examination-related
services to DCA's regulatory boards and bureaus. OPES ensures that licensure
examination programs are fair, valid, and legally defensible. OPES performs
occupational analyses, conducts exam item development, evaluates the performance of
examinations, and consults on issues pertaining to the measurement of minimum competency
standards for licensure (Department of Consumer Affairs, 2007).
Job Analysis
In order for an organization to have the most reliable information about a job and
make legal employment decisions, a job analysis should be conducted. A job analysis is a
comprehensive, rigorous approach to highlighting all the important aspects of a job.
Several definitions describe job analysis. The Standards (1999) define job analysis as a
general term referring to the investigation of positions or job classes to obtain descriptive
information about job duties and tasks, responsibilities, necessary worker characteristics,
working conditions, and/or other aspects of the work (Standard 14.8 and 14.10). The
Guidelines refer to job analysis as “a detailed statement of work behaviors and other
information relevant to the job” (Sec 14B and 14C).
Brannick et al. (2007) provide a detailed definition of job analysis in which they
describe additional elements. First, a systematic process is necessary to meet the
requirements of a job analysis. The job analyst specifies the method and the steps to be
involved in the job analysis. Second, the job must be broken up into smaller components.
Alternatively, the components might be different units, such as requirements for visual
tracking or problem solving. Third, the results of the job analysis may include any
number of different products, such as a job description, a list of tasks, or a job
specification. As a result, job analysis is defined as the systematic process of discovery of
the nature of a job by dividing it into smaller components, where the process results in
written products with the goal of describing what is done in the job and the capabilities
needed to successfully perform the job. The major objective is to describe worker
behavior in performing the job, along with details of the essential requirements.
Job Analysis and the Law
State agencies can face legal penalties as a result of unfair treatment in
employment practices. Employment laws address fairness in access to work as their main
topic. The overall principle is that individuals should be selected for a job based on merit
rather than on membership in a social group defined by such features as sex, religion, race,
age, or disability.
The Constitution. The Fifth and Fourteenth Amendments are occasionally
mentioned in court and protect people's life, liberty, and property by stating that people
should not be deprived of them without due process of law. The Fourteenth Amendment applies
to the states and the Fifth Amendment applies to the federal government. The language of
these amendments does not address specific employment procedures. The Fourteenth Amendment
has been used in reverse discrimination cases. Reverse discrimination refers to “claims of
unlawful discrimination brought by members of a majority group, such as the white male
group, which is also protected under the law” (Brannick et al., 2007).
Equal Pay Act. The Equal Pay Act (1963) requires employers to pay men and
women the same salary for the same job, that is, equal pay for equal work. Employers
cannot give two different titles to the same job, one for men and another for women. The
Equal Pay Act does not require equal pay for jobs that are different (Brannick et al.,
2007).
Civil Rights Act. In 1960, it was common for some jobs to be reserved for whites
and other jobs for blacks. An example of this practice is that of the electric power plant
Duke Power in which laborer jobs were given to blacks and management jobs were given
to whites. The Civil Rights Act of 1964 established that these employment practices were
illegal. As a result, Duke Power began to allow blacks to apply for the other jobs.
Originally, Duke Power required a high school diploma to apply for those other jobs. It
dropped the diploma requirement for those who could pass two tests. The passing score
was established at the median for those who completed high school. The result of this
procedure, although without intention, was that it excluded most blacks from moving to
upper level jobs. One of the affected workers sued the company because this was illegal.
In Griggs versus Duke Power Company (1971), the Court ruled that Duke Power’s
procedure was illegal because Duke never proved that the high school diploma and test
scores were related to job performance. The ruling stated that tests should be used to
assess how well individuals fit a specific job. The intent of the procedures was not
relevant; what mattered was the significance of the results of those procedures.
Title VII of the Civil Rights Act (1964) and its amendments (1972 and 1991)
prohibit employers from discriminating on the basis of race, color, sex, religion, or
national origin. The EEOC was established by the Civil Rights Act in order to implement
it. The Act applies to all conditions or privileges of employment practices. It is specified
that selection practices that do not adversely impact members of one of the protected groups are
legal unless covered by another law. Thus, in some cases, practices that look unfair are
legal. In some other cases, practices that reject a higher proportion of one protected group
than of another protected group may be illegal unless they are job related. These practices
are alleged to produce an adverse impact. In some cases, practices that seem fair at first
glance have an adverse impact when actually applied.
Age Discrimination in Employment Act (ADEA). The Age Discrimination in
Employment Act (1967) prohibits discrimination against people 40 years of age and
older. The ADEA establishes a protected class of anyone 40 years of age and older. It
promotes that companies make employment decisions based on ability rather than age.
As a result, it would be legal to prefer a 25-year-old candidate over a 35-year-old candidate
because of age, but it would not be legal to prefer a 25-year-old candidate over a 40-year-old
candidate because of age. The ADEA does not require companies to select less
qualified older workers over more qualified younger workers (Guion, 1998).
Rehabilitation Act. The Rehabilitation Act (1973) is intended to provide equal
employment opportunities regardless of handicap. The word handicap essentially refers to the
term disability. Applying only to federal contractors, the Rehabilitation Act states that
qualified candidates with handicaps should not be discriminated against under any program or
activity receiving federal assistance because of their handicaps (Brannick et al.,
2007).
Americans with Disabilities Act (ADA). The Americans with Disabilities Act
(1990) prohibits discrimination against people with disabilities. According to Brannick et
al. (2007) “Disability is broadly defined as referring to both physical and mental
impairments that limit a major life activity of the person such as walking or working.
Disability also includes those individuals who have a history of impairment and those
who are regarded as disabled, regardless of their actual level of impairment”.
The ADA requires employers to provide reasonable accommodation to people
with disabilities and protects qualified candidates with disabilities. “A qualified
individual with a disability is a person who, with or without reasonable accommodation,
can successfully complete the essential functions of the job” (Cizek, 2001).
The ADA is unclear about how to determine whether an accommodation is reasonable. Overall, accommodations are considered reasonable unless making the adjustments would be too expensive for the employer.
Enforcement of Equal Employment Opportunity Laws. There are two main
organizations that intend to enforce the equal employment opportunity (EEO) laws: the
EEOC and the U.S. Office of Federal Contract Compliance Programs (OFCCP). The OFCCP oversees only federal contractors, and the EEOC covers all other employers.
The first guidelines for personnel selection were established by the EEOC in 1970 and
the courts applied them in discrimination court cases. Businesspeople worried about the
application of the guidelines in court cases because it was not clear whether companies
had to spend large amounts of money to try to comply with the guidelines. Furthermore,
other organizations such as the U.S. Department of Labor (DOL) possess their own
guidelines. The Uniform Guidelines on Employee Selection Procedures were
implemented by five federal agencies: the EEOC, the Office of Personnel Management
(OPM), the DOL, the Department of Justice (DOJ), and the Department of the Treasury
(Brannick et al., 2007).
Professional Standards. In 1999, the American Educational Research
Association, the American Psychological Association and the National Council on
Measurement in Education, updated the Standards for Educational and Psychological
Testing. The Standards provide details about good practices in test development used in
the assessment of people. The Standards distinguish between test fairness and selection bias.
Selection bias is the technical view of the relationship of test scores and performance on
the job. On the other hand, test fairness is a nontechnical ethical component resulting
from social and political opinions. As a result, the Standards can address steps to avoid test bias, but they do not prescribe how to maintain test fairness.
In 2003, the Society for Industrial and Organizational Psychology published the
latest edition of the Principles for the Validation and Use of Personnel Selection
Procedures. The Principles illustrate good practice in the development and evaluation of
personnel selection tests.
Uses of Job Analysis. Job analyses are used for selection purposes. In personnel selection, employers collect job applicants' information in order to make an employment decision. Based on the information, employers decide whether the applicant is qualified to do the job. Another purpose of job analysis is to establish wages for employees. Differences in wages should be based on a seniority system, a merit system, measures of quantity of production, or some factor other than sex (Brannick et al., 2007).
Furthermore, job analyses are required for disability and job design purposes. The
ADA prohibits discrimination against candidates based on their disabilities. Employers
may be required to redesign the job so people with disabilities are not required to perform
nonessential activities. Reasonable accommodations must be provided to permit people
with disabilities to do the essential functions of the job. The EEOC provides guidelines to
decide if a function is essential. The employer has to evaluate if the position exists to
execute the function, the availability of other employees that could perform the function,
and if special KSAs are required to perform the function. Elements that can help to
decide if a function is essential are the employer’s opinion, a job description, how much
time is spent performing the function, the consequences of not having anyone perform the function, and past performance of the job. A job analysis is not required in order to meet
ADA requirements. Nevertheless, the EEOC suggests that employers would benefit if
they conduct a systematic study of the job position.
Subject Matter Experts
According to the Principles (2003), participation of SMEs in the development of
licensure exams is necessary and guarantees that the exams truly assess whether
candidates have the minimally acceptable KSAs necessary to perform tasks on the job
safely and competently. In addition, SMEs distinguish the work behaviors and other
activities required to perform the job effectively.
The selection of SMEs by state entities significantly influences the quality and
defensibility of the exams. Therefore, SMEs should demonstrate familiarity with the
essential job characteristics such as shift, equipment, and location. It is recommended that
the group of SMEs represent the current population of practitioners, geographic location,
ethnicity, gender, and practice setting. Detailed evidence of the methods used in the
development of the selection procedures based on the judgment of SMEs should be
documented (Society for Industrial and Organizational Psychology, 2003).
Chapter 3
STANDARD SETTING METHODS
According to the Standards (1999), the establishment of “one or more cut points
dividing the score range to partition the distribution of scores into categories” is a
significant part of the process of exam development. In licensure exams, these categories
define and differentiate candidates who are qualified to obtain a professional license
based on the minimum passing score established by the state. The cut scores influence the
validity of test interpretations by representing the rules and regulations to be used in those
interpretations. Cut scores are established through different methods and for different purposes, and several procedures need to be followed to make them legally defensible
(Standard 4.20).
State licensure programs typically use three or four performance levels to set cut scores on their licensure exams to characterize a test taker’s performance. The process used to set cut
scores is called Standard Setting. There are several acceptable methods to set standards.
Criterion-Referenced Passing Scores at the Office of Professional Examination
Services
The Office of Professional Examination Services (OPES) uses a criterion-referenced passing score for licensure examinations. The method “applies standards for competent practice to all candidates regardless of the form of the examination administered” (Office of Professional Examination Services, 2010, para. 1). A criterion-referenced passing score ensures that candidates passing the exam are qualified to
perform the job competently. OPES uses a modified Angoff method in standard setting.
The group process includes practitioners with different years of experience to represent
different aspects of the profession and entry-level competence. The process starts with
the development of definitions of the minimally acceptable level of competence for safe
practice. The group compares essential performance behaviors of the highly qualified,
minimally qualified, and unqualified candidate. The difficulty of licensure exams varies
from one exam to another. Thus, having an unchanging passing score for different forms
of the exam will not reflect the minimally acceptable competence, making it difficult to
legally defend the passing score. The use of a criterion-referenced method lowers the
passing score for an examination with a large number of difficult questions and raises the
passing score for an examination that has a small number of difficult questions. The
resulting passing score is intended to protect the public and the candidate because it is
based on the difficulty of the questions in the exam and not on performance with respect
to the group as is the case in a norm-referenced strategy.
Criterion Methods Based on Test Taker Performance
Contrasting Groups Method
In the contrasting groups method, the scores on a licensure exam of a qualified
group are compared to the scores of a nonqualified group. The nonqualified group should
not be qualified for the license but the group should still be demographically similar to
the qualified group when possible. Furthermore, the mandatory KSAs for successful
performance of the job should represent all the KSAs that would be reflected by the
nonqualified group (Meyers, 2009). One benefit of the contrasting groups method is that SMEs are required to make judgments about people with whom they are familiar, rather than about a hypothetical group of test takers. The task of teachers rating their students is an illustration of this situation (Hambleton, Jaeger, Plake, & Mills, 2000). The identification of examinees who are out of place or misclassified is also possible, because the overlap of the score distributions can be seen directly.
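The sources above do not prescribe how the cut score is located once the two score distributions are in hand. One common operationalization, sketched below in Python with hypothetical scores that are not drawn from this study, chooses the score that minimizes the total number of misclassified examinees across the two groups.

    # Illustrative sketch only; the scores are hypothetical, and the
    # minimum-misclassification rule is one common choice, not necessarily the
    # rule used in any particular contrasting groups study.
    def contrasting_groups_cut(qualified_scores, nonqualified_scores):
        best_cut, fewest_errors = None, float("inf")
        for cut in range(min(nonqualified_scores), max(qualified_scores) + 1):
            false_negatives = sum(1 for s in qualified_scores if s < cut)
            false_positives = sum(1 for s in nonqualified_scores if s >= cut)
            if false_negatives + false_positives < fewest_errors:
                best_cut, fewest_errors = cut, false_negatives + false_positives
        return best_cut

    print(contrasting_groups_cut([70, 75, 80, 85, 90], [50, 55, 60, 65, 72]))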
Marginal Group Method
A marginal group is represented by those candidates that possess the minimum
level of KSAs required to perform the job tasks safely. Only one group of candidates is
identified as marginally qualified to obtain a license. The candidates take the test and a
cut score is established at their common performance level (Meyers, 2009).
Incumbent Group Method
In this method, one group of already licensed practitioners takes the exam. The
resulting cut score is established such that most of the qualified job incumbents would
pass the exam. This is the least favored method because not enough information about the incumbents is available to set the cut score, resulting in the potentially inaccurate assumption that all members of the group are competent at the time of testing (Meyers, 2009).
Nedelsky’s Method Based on Subject Matter Expert Judgments of Test Content
The Nedelsky procedure was specifically designed for multiple-choice items
(Kane, 1998). The basic idea of the Nedelsky method is to understand the possible test
score of a candidate possessing less knowledge of the content domain than would be
required to be a successful licensee. SMEs rate each question in order to identify the options that even a nonqualified candidate would likely recognize as wrong. Having eliminated the obviously wrong answers, such candidates would have demonstrated some competence and would then guess among the remaining options of the question.
The Nedelsky method intends to obtain the score these candidates would achieve
by this approach. “An item guessing score is calculated for each item by dividing the
number of remaining choices into 1 and is then summed across the items yielding an
estimated test guessing score” (Meyers, 2009, p. 2). A measure of central tendency such
as the average of scores is used as the final cut score. Cizek (2001) states that the
Nedelsky method has been used in the medical environment because it is assumed that a
minimally qualified candidate must reject the wrong options as they would harm the
public.
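The arithmetic of the Nedelsky guessing score can be illustrated with the short Python sketch below; the option counts are hypothetical and do not come from any exam in this study.

    # For each item, record how many answer options remain after removing those
    # the reference candidate would recognize as wrong; the item guessing score
    # is 1 divided by that count, and the item scores are summed.
    remaining_options = [2, 3, 2, 4, 1]  # hypothetical counts for a 5-item test

    item_guessing_scores = [1 / n for n in remaining_options]
    test_guessing_score = sum(item_guessing_scores)
    print(round(test_guessing_score, 2))  # 2.58 for these hypothetical counts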
Angoff’s Method
According to Kane (1998), the Angoff method is the most common standard
setting method applied on high-stakes achievement tests. The Angoff method has been
applied to a large number of objective tests used for licensure exams, without major
criticism from the judges involved in the process.
William Angoff developed the method in 1971, and it is based on judgment about
whether a minimally qualified candidate at a particular level of achievement would select
the right answer to each question. The judges answer each question and subsequently talk
about performance behaviors that delineate a highly qualified, qualified, and unqualified
candidate. The job of the judges is to estimate the percentage of minimally qualified
candidates who would answer each question correctly. The judges are instructed to think
about a group of 100 minimally qualified candidates at each achievement level. The
questions are presented one at a time, in the order that they appear on the test. Then,
judges rate the difficulty of each question based on the percentage of minimally qualified
candidates who would get each question right (Nichols, Twig, & Mueller, 2010). Wang
(2003) states that judges usually use the range of 70% to 80% when using the Angoff
method. In addition, judges use 90% or 95% for easy items. In the case of hard items,
judges provide ratings of 50% or 60%, but never provide ratings lower than 50%.
At the end of the process, the averages of the judges’ ratings for each question on the test are determined. The performance standards on the total score scale are established by combining these question averages, and this computation can be repeated to set each performance standard for the test (Hambleton et al., 2000).
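As a concrete illustration of this aggregation, the Python sketch below averages hypothetical judge ratings for each item and then sums the item averages to place the standard on the raw-score scale; summing is one common way of combining the question averages and is an assumption here, not a detail taken from the sources cited above.

    # Rows are judges, columns are items; ratings are the judged probabilities
    # that a minimally qualified candidate answers each item correctly.
    ratings = [
        [0.70, 0.85, 0.55, 0.90],  # judge 1
        [0.75, 0.80, 0.60, 0.95],  # judge 2
        [0.65, 0.90, 0.50, 0.90],  # judge 3
    ]

    n_judges = len(ratings)
    item_averages = [sum(judge[i] for judge in ratings) / n_judges
                     for i in range(len(ratings[0]))]
    cut_score = sum(item_averages)  # expected raw score for the minimally qualified candidate
    print([round(a, 2) for a in item_averages], round(cut_score, 2))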
Literature on Components of the Angoff Method
Some key components contribute to better psychometric results when using the
Angoff method. One of these components is the use of SMEs in the passing score
process. It has been found that the participation of job incumbents, supervisors, or anyone
who is an expert in a profession produces more accurate cut scores than participation of
people who are not experts (Maurer, Alexander, Callahan, Bailey, & Dambrot, 1991).
The valuable expertise and knowledge of judges produces more reliable cut scores in the
judgment process than those produced by judges who do not possess the knowledge
necessary to rate the difficulty of the items.
Another component to the psychometric quality of the resulting passing score is
the training that SMEs receive for the standard setting process (Plake, Impara, & Irwin,
2000). Research has indicated that training SMEs on the KSAs required for the exam results in higher agreement about the difficulty of the items on the exam. Moreover,
Impara and Plake (1998) noted that the validity of the Angoff method is in question if
SMEs cannot perform the difficult task of estimating item difficulty. Thus, panelists need
to receive training in order to be familiar with the required KSAs as well as the standard
setting method used.
Although the use of performance data in Angoff method is not essential,
performance data can be another aspect of the Angoff method. Performance data
provided to the SMEs helps to avoid underestimation of the difficulty of the items
(Yudkowsky & Downing, 2008). According to Hurtz and Auerbach (2003), the use of
normative data during passing score studies did not have any effect on the variability of SMEs’ ratings. In addition, the use of normative data tended to lower the ratings of the items, producing lower cut scores.
The outcomes of using performance behavior descriptors during the Angoff
method have been explored. Judges have indicated that having pre-established behavioral
descriptors helped them focus on the task of making judgments about items (Chinn &
Hertz, 2002). Hurtz and Auerbach (2003) suggest that when the group develops its own
performance behaviors definitions of a minimally competent candidate, there is higher
agreement in the ratings. Furthermore, when descriptions of the candidate are less definitive, there is more variety in the descriptions given by the raters about the candidate than when the descriptions presented by the workshop facilitators are more definitive in terms of predicted performance. Thus, pre-established definitions of candidate performance shape judgments of competence in particular directions, with varying degrees of accuracy
(Giraud, Impara, & Plake, 2005).
Strengths and Weaknesses of the Angoff Method
Researchers have studied the strengths and weaknesses of the Angoff method for
standard setting. One of the strengths of the Angoff method is that the method can be
applied to a wide range of exam formats such as those containing multiple-choice
questions (Stephenson, 1998). In addition, the Angoff method was the most favored by
SMEs and the most time efficient method (Yudkowsky & Downing, 2008).
Furthermore, judges using the Angoff method express high confidence in their
ratings due to the perception that the standards are reasonable judgments (Alsmadi,
2007). It has been found that the Angoff method has valuable psychometric qualities.
Stephenson (1998) found strong intrajudge and interjudge reliabilities using a modified
Angoff method.
Although the Angoff method possesses valuable qualities, it has also received a
lot of criticism. The limitations of the Angoff method are associated with the subjectivity
of the judgment task (Alsmadi, 2007). The determination of the percentage of candidates who would answer each item correctly has been found to be complex and subjective (Plake et al., 2000).
Moreover, research about the validity of the method suggests that judges are
unable to estimate the difficulty of the items accurately for barely qualified candidates
(Goodwin, 1999; Plake & Impara, 2001). This leads them to set cut scores that are different from what they have in mind based on the KSAs of highly qualified candidates
(Skaggs, Hein, & Awuor, 2007). In addition, evidence indicates that judges find it
difficult to rate the questions before and after they have talked about the questions with
the rest of the group. As a result, judges overrate the probability of success on difficult
items and underrate the probability of success on easy items. It has been found that
judges are more likely to overestimate performance on difficult items than to
underestimate that of easy items (Clauser, Harik, & Margolis, 2009). Similarly, Kramer
et al. (2003) indicate that ratings resulting from the Angoff method were inconsistent and
resulted in low reliability.
Some organizations that invest in the exam development process consider that the
Angoff method is complex, time-consuming, and expensive. In addition, the Angoff
method may be less convenient when it is used for performance-based testing than when
it is used for written tests (Kramer et al., 2003).
The National Research Council (1999) disagrees with the psychometric value of
the Angoff method. The reasons presented are that the resulting standards seem
unrealistic because too few candidates are judged to be advanced relative to many other
common conceptions of advanced performance in a discipline. The next reason is that the
results differ significantly by type and difficulty of the items. According to the National
Research Council (1999), cut scores for constructed-response items are usually higher
than cut scores for multiple-choice items. Another reason for their disagreement is that
there is evidence showing that it is hard for judges to accurately calculate the chances that
qualified candidates would answer each item correctly (National Research Council,
1999).
Following the publication of the National Research Council judgment on the
Angoff method, Hambleton et al. (2000) presented a review of the qualities of the
method. They identified the simplicity of the Angoff method as one advantage.
Hambleton et al. (2000) described that the accumulated method allows the judges to
differentially assess the questions in the test, giving higher value to those questions they
feel are more important. However, judges occasionally consider that the method tends to
break the assessment into small, isolated components. Even though the purpose of the
weighting method is to provide a complete analysis of the whole assessment, judges
sometimes perceive that the extended Angoff method might not consider the holistic
nature of the performance. Also, questions have been raised about the capability of
judges to make the necessary rating judgments (Hambleton et al., 2000).
Researchers acknowledge the Angoff method as the most widely used method for setting standards across many settings. For many years, the Angoff
method has been applied to numerous objective tests used for licensure and state
programs.
Ebel’s Method Based on Subject Matter Expert Judgments of Test Content
Ebel’s (1972) method is applied in two phases in which judges classify the
difficulty and the importance of the items. The three levels of difficulty are easy,
medium, and difficult; and the four levels of importance are critical, important,
acceptable, and questionable. The results are 12 groups of items. SMEs are asked to
organize each item into one of these 12 groups. The next phase is to estimate the
percentage of items that a minimally acceptable candidate (MAC) should get right from
each group (Cizek, 2001). In each group, the estimated percentage is multiplied by the
total number of items in that group. The cut score is determined by obtaining the average
of each group of items (Meyers, 2009; Kane, 1998).
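The cell-by-cell arithmetic can be sketched in Python as follows; the cell counts and percentages are hypothetical, only some of the 12 possible cells are populated, and expressing the combined result as an expected raw score (or as a proportion of the total items) is an assumption about how the cell products are brought together rather than a detail drawn from the sources above.

    # Each (difficulty, importance) cell holds the number of items in the cell
    # and the percentage of those items a minimally acceptable candidate should
    # answer correctly.
    cells = {
        ("easy", "critical"):     (10, 0.90),
        ("medium", "important"):  (20, 0.70),
        ("hard", "acceptable"):   (15, 0.50),
        ("hard", "questionable"): (5, 0.40),
    }

    expected_correct = sum(n_items * pct for n_items, pct in cells.values())
    total_items = sum(n_items for n_items, _ in cells.values())
    print(round(expected_correct, 1), round(expected_correct / total_items, 3))  # 32.5 of 50 items, 0.65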
Direct Consensus Method
In the Direct Consensus Method (DCM), the test items are separated into groups
according to the content specifications of the exam plan. SMEs are then instructed to
identify the number of test items that a minimally qualified candidate should answer
correctly. The items’ scores are summed and the SMEs’ scores are then combined to get the pass
point for the first round of ratings. Then the groups of SMEs discuss the rationale for
their ratings for each group of items. In some cases, SMEs are provided with item
performance data. The purpose of the DCM is that SMEs reach consensus on the final cut
score through an effective process in which aspects such as scope of practice, content and
difficulty of each group of items, opinions of other SMEs, and review of performance
data are involved (Hambleton & Pitoniak, 2004).
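A minimal Python sketch of this first-round aggregation, using hypothetical content areas and judgments, is shown below; averaging the SME totals is assumed as the way the SME scores are combined, since the source above does not fix a specific rule.

    # For each content-area group, an SME states how many of its items a
    # minimally qualified candidate should answer correctly; totals are formed
    # per SME and then combined (here, averaged) into the first-round pass point.
    judgments = {
        "sme_1": {"area_A": 12, "area_B": 8, "area_C": 15},
        "sme_2": {"area_A": 11, "area_B": 9, "area_C": 14},
        "sme_3": {"area_A": 13, "area_B": 7, "area_C": 16},
    }

    sme_totals = {sme: sum(by_area.values()) for sme, by_area in judgments.items()}
    pass_point = sum(sme_totals.values()) / len(sme_totals)
    print(sme_totals, pass_point)  # hypothetical totals of 35, 34, and 36 give a pass point of 35.0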
The Item Descriptor Method
The Item Descriptor Method (IDM) emerged in 1991 (Meyers, 2009). According
to Almeida (2006), the IDM was developed when the Maryland State Department of
Education considered it necessary to update proficiency level descriptions in the Maryland
School Performance Program. The IDM is used when performance level descriptors are
linked to test items. In 1999, the method was modified in Philadelphia.
Item response theory (IRT) is used to order the items in a booklet placing one
item per page. SMEs are instructed to go through the different groups of items and then
link the items to the performance level descriptor that they believe reflects successful
performance.
The IDM shares characteristics with other methods such as the Angoff and the
Bookmark (to be described in the next section) methods. Common applications of these
methods require panelists to make judgments about items to identify a cut score, involve more than one round of judgments, utilize performance level descriptors as a basis for judgments about items, and, in most cases, provide SMEs with performance data
(Ferrara, Perie, & Johnson, 2002).
The application of standard setting methods in licensing exams depends on the
characteristics of the exams. Some standard setting methods are more suited to be used
with multiple-choice item exams; other standard setting methods might provide better
results for performance exams. Test users need to consider these and other qualities about
exams when setting standards for those exams.
Bookmark Method
The Bookmark method is currently one of the most popular standard-setting
methods across different testing settings such as educational and licensing. However,
research to support the validity of the method is limited (Olsen & Smith, 2008). The
Bookmark method represents a relatively new approach (Kane, 1998) and was developed
to address perceived limitations of the Angoff method, which has been the most
commonly applied procedure (Mitzel, Lewis, Patz, & Green, 2001).
The Bookmark method requires the use of IRT methodology. Ordinarily, the theta (ability) parameter is estimated for the situation in which we want to know the probability of answering the item correctly, that is, p(x = 1). This probability is a function of the a, b, c, and theta parameters (Meyers, 1998).
In the case of the hypothetical MAC candidate, the probability of correctly answering the question is set to 2/3, that is, the probability that the item will be answered correctly 2 out of 3 times, p(x = 2/3). Setting the c parameter equal to 0 in the 3PL model and solving for theta given p(x = 2/3) yields, for a given item, theta = b + 0.693/(1.7a). Items are then placed in an Ordered Item Booklet (OIB) based on the theta value needed to answer each question correctly 2 times in 3 (Meyers, 2009).
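The ordering step can be made concrete with the short Python sketch below, which uses hypothetical a and b parameters; with c fixed at 0, the ability needed for a 2/3 chance of success is theta = b + ln(2)/(1.7a), which is the expression given above, and the items are then sorted by that value from easiest to hardest.

    import math

    # Hypothetical item parameters (a, b) for three items; c is fixed at 0.
    items = {"item_1": (1.2, -0.5), "item_2": (0.8, 0.3), "item_3": (1.5, 1.1)}

    def theta_rp67(a, b):
        # Ability level at which the probability of a correct response is 2/3.
        return b + math.log(2) / (1.7 * a)

    ordered_item_booklet = sorted(items, key=lambda name: theta_rp67(*items[name]))
    print([(name, round(theta_rp67(*items[name]), 2)) for name in ordered_item_booklet])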
According to Lin (2006), the OIB is a representation of the actual exam and what
is being tested. In addition, the OIB is the medium used to make the cut score ratings of
the items in the examination. The items appear one per page on the OIB. The first item is
the easiest and the last item is the most difficult.
In various modifications of the Bookmark method, SMEs were shown p-values
for the items instead of the IRT values because the ordering of the items based on p-values was the same as the ordering that would result if the IRT model were used (Buckendahl, Smith, Impara, &
Plake, 2002; Skaggs & Hein, 2011).
The Bookmark method is intended to lessen the difficulty of estimating item difficulty for borderline examinees by having SMEs examine the OIB, in which test items
are arranged in order of difficulty (Skaggs et al., 2007). The judges receive an item map,
with the test items ordered in terms of their empirical difficulty level (Kane, 1998). SMEs
working in small groups with small sets of items start on page 1 of the OIB. Their
instructions are to identify the page where the chances of minimally acceptable
candidates answering the item correctly fall below 2/3 or the designated response
probability (RP) if some other value is chosen (Meyers, 2009). Moreover, SMEs of the
Bookmark method have to answer and discuss two questions: a) What does the item
measure? and b) What makes this item more difficult than the items that precede it in the
OIB? (Lee & Lewis, 2008).
The individual SME Bookmark estimates are often shared with the group, typically numerically, and discussion occurs regarding the SME Bookmark placements. Impact
data from examinee scores can be shared with the SMEs. There are usually two or three
rounds with the SME minimum passing scores reviewed and discussed for each round
(Olsen & Smith, 2008). Each round is intended to help increase consensus and reduce
differences among the SMEs.
During Round 1, SMEs in small groups examine each item in the OIB, discussing
what each item measures and what makes the item harder than those before it. After this
discussion, each SME determines a cut score by placing a bookmark in the OIB
according to their own judgment of what students should know and be able to do at each
of the performance levels (Eagan, 2008).
SMEs then engage in two more rounds of placements. In round 2, SMEs discuss
the rationale behind their original placement with other SMEs at their table. In round 3,
SMEs at all tables discuss their placements together. After each round of discussion,
SMEs may adjust or maintain their placements. Impact data, that is, the percentage of students in the state who would fall below each bookmark, is introduced to participants
during the third round. After the final round of placements, the recommended score is
calculated by taking the mean of all bookmark placements in the final round (Eagan,
2008).
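The final aggregation can be stated directly; the Python sketch below averages a set of hypothetical final-round bookmark placements, expressed here as OIB page numbers (an illustrative choice of scale, not a detail taken from the sources above).

    # Hypothetical final-round placements, one per SME.
    final_placements = [34, 36, 33, 37, 35, 36]

    recommended_cut = sum(final_placements) / len(final_placements)
    print(round(recommended_cut, 1))  # 35.2 for these hypothetical placements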
Based on the final cut scores set, performance level descriptors are then written by
the SMEs. Performance descriptors define the specific KSAs held by test takers at a
given performance level. Items prior to the bookmarks reflect the content that students at
this performance level are expected to be able to answer correctly with at least a 0.67
likelihood. The KSAs required to respond successfully to these items are then
synthesized to formulate the description for this performance level. Performance level
descriptors become a natural extension of the cut score setting procedure (Lin, 2006).
According to Peterson, Schulz, and Engelhard (2011), the Bookmark and Angoff
methods are different from one another in several ways. During the Angoff method,
SMEs review each item and rate the probability that a minimally qualified candidate would select the right answer. The cut score is the sum of the average SME ratings across items. In
the case of the Bookmark method, SMEs use a booklet with the items ordered from
easiest to most difficult. The final cut score is the score at which candidates have a
specified probability of answering the most difficult item in the booklet correctly
(Peterson et al., 2011).
Studies have compared the resulting cut scores from standard setting studies using
the Bookmark and the Angoff methods. In the study by Buckendahl et al. (2002), the
Bookmark method allowed teachers to focus on the likely performance of the barely
proficient student (BPS). Although there was a small difference in final cut scores
between the two methods, the standard deviation decreased for the second round of the Bookmark method compared with the Angoff method. The resulting narrower range of possible cut scores indicated a higher level of inter-judge agreement.
Wang (2003) provides evidence to justify the use of an item-mapping method
(Bookmark-based) for establishing cut scores for licensure examinations. The item-mapping method incorporates item performance in the process by graphically presenting item difficulties. It was noted that the item-mapping method sets lower cut scores than the
Angoff method. Another finding was that the predicted percentages of passing were
lower for the Angoff method than for the item-mapping method (Wang, 2003).
Green, Trimble, and Lewis (2003) compared the Bookmark method to the
Contrasting Groups method in order to establish credible student performance standards
by using this multiple procedure approach. The Bookmark method produced lower cut
scores in the Novice/Apprentice and Apprentice/Proficient levels. The Contrasting
Groups method produced lower cut scores at the Proficient/Distinguished level. The final cut scores set by the synthesis group (a subset of participants from the other methods) were closer to the Bookmark cut scores than to those of the other two methods. Olsen (2008) found that
Bookmark ratings were consistently below the results from the Angoff ratings.
Findings About Cut Scores and Rater Agreement
Research on the consistency of rater agreement (Wang, 2003) indicates similar overall levels of rater agreement, but different distribution patterns, across four exams.
Judges provided more consistent ratings in the Bookmark item-mapping method than in
the Angoff method. All ratings from the Bookmark item-mapping method reached rater
agreement higher than 0.95, whereas the rater agreements for the Angoff method ranged
from 0.796 to 0.922 (Wang, 2003). This was not the case for the study by Yin and
Sconing (2008) in which researchers found that cut scores were generally consistent for
an Angoff-based method (item rating) and a Bookmark-based method (mapmark).
Skaggs and Hein (2011) found similar cut scores when comparing variations of the Bookmark method.
This is also the case for the study by Olsen (2008), in which the results indicated similar
cut scores for the Modified Angoff and Bookmark methods. Lin (2006) found evidence
of similar cut scores being set with the Modified Angoff and Bookmark standard setting
procedures. In addition, better inter-rater agreement was found for the Bookmark cut
scores.
Providing impact or performance data between rounds influences SMEs’ ratings.
Buckendahl et al. (2002) found that the second round cut score in the Angoff method
dropped by a point and a half after performance data was given. In the case of Bookmark,
the second round cut score increased by two points (Buckendahl et al., 2002).
Lee and Lewis (2008) suggest that in order to decrease the standard error for cut
scores from the Bookmark method, increasing the number of small groups was more
efficient than increasing the number of participants. Increasing the number of participants
within groups also decreased the standard error, but the use of more groups was more
efficient. This new strategy contributes to the increase in reliability of the cut score.
Strengths and Weaknesses of the Bookmark Method
Karantonis and Sireci (2006) acknowledged that the extensive use and acceptance
of the Bookmark method indicate the logical appeal and practicality of the procedure.
Research has identified the advantages of the Bookmark method and these include: (a)
essentially no data entry, (b) the ability to similarly handle multiple choice and
constructed response items, (c) time efficiency, (d) defensible performance level
descriptions that are a natural outcome of setting the cut points, and (e) cut scores based
on a comprehensive understanding of the test content (Cizek, 2001).
Moreover, the Bookmark method meets Berk’s criteria for defensibility of
standard setting methods and is a relatively sound procedure in terms of technical
adequacy and practicability (Lin, 2006). The Bookmark method makes it possible to: (a) reduce
cognitive complexity; (b) connect performance descriptors with content of assessments;
(c) promote better understanding of expected student performance; (d) accommodate
multiple cut scores; (e) accommodate multiple test forms; and (f) obtain low standard
error of the cut scores (Lin, 2006).
Hambleton et al. (2000) noted that panelists respond positively to the Bookmark
method because item ratings are avoided and the method can handle both selected and
constructed item formats. Also, panelists decide how much knowledge and skills would
be reflected by basic, proficient, and advanced examinees (Hambleton et al., 2000).
In a single-passage Bookmark study, presenting the items in separate booklets, each referring to the same passage, rather than in a single booklet containing items about different passages, reduced the cognitive complexity of the judgment task (Skaggs & Hein, 2011). Thus, separate booklets should be used when
different passages or content areas are used in a test.
One of the main limitations of the Bookmark method is the disordinality of items
within the OIB. Item disordinality refers to the disagreement among SMEs on the
ordering of the items in the OIB (Lin, 2006). The disordinality of items affects panelists’
bookmark placements during a standard setting procedure (Skaggs & Tessema, 2001;
Lin, 2006). Typically, panelists do not agree on the way items are ordered in the OIB
because they may not be able to estimate item difficulty accurately. As a result,
variability of cut scores among panelists increases, and the standard error of the final score
increases accordingly. Panelists reported having a difficult time when trying to evaluate
the difficulty and order of the items in a booklet, because the test was comprised of
reading passages with multiple selected-response items. Several judges insisted that
several items should have been placed before other items at different locations in the
booklet (Skaggs & Tessema, 2001).
Karantonis and Sireci (2006) identified a potential bias toward underestimating cut scores, as well as problems in selecting the most appropriate response probability value for ordering the items presented to panelists in the item booklet of the Bookmark method. Similarly, Lin (2006) identified the choice of RP as one problem of the method. The RP value determines how the performance data are placed on the reporting scale in the final determination of performance standards; it is the ability level at which a test taker has, for example, a 50% or 75% chance of success. There is a concern that this simple variation
might considerably impact performance standards and the resulting item difficulty
locations (Hambleton et al., 2000). According to Karantonis and Sireci (2006), participants
felt more comfortable using an RP value of .67 than .50. The 50% chance was more
difficult to understand because it reflects an even chance, whereas a 67% chance was easier to understand because it refers to a mastery task statement. Furthermore, the exclusion of
important factors other than the difficulty parameter and restrictions of the IRT models
are also weaknesses of the Bookmark method (Lin, 2006). Some other factors about the
examination items such as previous performance data, importance to the job or exam plan
relevance should also be considered when ordering the items in the OIB.
The percentage failed in the context of defining minimal competency has been
studied (MacCann, 2008). The length of a test has been found to influence the percentage
of people identified below the level of minimal competence. MacCann (2008) proposed a
formula to adjust cut scores, which modifies the cut score to reduce the incidence of students who do not deserve to fail but who currently fail due to errors of measurement. The price to be paid for this result is that some examinees who deserved to fail on their true scores pass due to errors of measurement. On balance, this seems a good result, as the favorable shift is much larger than the unfavorable one at a relatively low level of
reliability. Thus, this technique reduces the number of students who failed due to errors
of measurement but deserved to pass on true scores. Although lowering the cut scores has
positive consequences in the reduction of false negatives, test users should consider the
cost of certifying competence in areas where incompetence could be life threatening such
as in medical fields (MacCann, 2008).
Negative bias has been found (Reckase, 2006) as a result of ordering the items from easy to hard, starting at the beginning of the booklet, and identifying the first item with a probability below .67. Reckase (2006) recommends that panelists also receive a booklet with items ordered from hard to easy and look for the first item with a probability above .67; this bookmark selection would probably be biased in cut score estimation in the opposite direction. Using the average of these two bookmark placements would probably reduce the bias in the estimated cut score. Furthermore, to reduce the standard error of the Bookmark method, several bookmark placements could be made using booklets with different subsets of items and the resulting cut scores averaged (Reckase, 2006).
Participants’ Experience in Standard Setting Studies using Bookmark Method
In the study of cognitive experiences during standard setting, judges have
indicated that even though they were not very clear about what the process would do at
the beginning of the study, after a few items were rated, they became more comfortable
with the Bookmark process (Wang, 2003). In addition, panelists have expressed high
levels of confidence in the passing score and with the process (Buckendahl et al., 2002).
Panelists are able to understand the task and show confidence about the standard setting
procedure (Karantonis & Sireci, 2006). Thus, judges concluded that their Bookmark final
cut scores were very close to their conception of appropriate standards (Green et al.,
2003). Judges agreed that the Bookmark item-mapping method set more realistic cut
scores than the Angoff method (Wang, 2003).
In studies focused on qualitative experiences in standard setting, participants
perceived that their own item ordering was unique and that it reflected their own BPS rather than generalizing to other teachers’ circumstances (Hein & Skaggs, 2009). Judges acknowledged and appreciated that the item-mapping method required less time than the Angoff method (Wang, 2003). Thus, judges were not
frustrated about going through all the items. Judges have perceived that they are allowed
to focus on the likely performance of the BPS without the challenge of characterizing that
performance relative to the absolute item difficulty (Buckendahl et al., 2002).
Studies regarding the cognitive experiences of raters about the Bookmark method
suggest that choosing a specific test item as their bookmark placement was perceived as a
nearly impossible task (Hein & Skaggs, 2009). Judges emphasized that the problem was
not a lack of understanding of the procedure, but the difficulty of the task itself. Research
also suggests that even though participants expressed understanding of the procedure,
some participants used alternative bookmark strategies during the remaining rounds of
bookmarking (Hein & Skaggs, 2009). Judges did not view their own professional
judgments about an appropriate placement as an adequate basis for making such
decisions. Instead, they viewed their placement as needing to be informed and justified
by some external factor, such as a state proficiency standard. Some participants showed
confusion and frustration about the item order. Participants considered that the order of
the items was incorrect and wanted to reorder the items (Hein & Skaggs, 2009).
In evaluations of the Bookmark method, its strengths outweigh its weaknesses. The Bookmark method remains a promising procedure for
standard setting with its own strengths and limitations. More research is needed for this
relatively new method for standard setting. The use of different sources of information in
order to adopt a cut score might benefit the resulting standard (Green et al., 2003).
Purpose of this Study
The increasing need for legally defensible, technically sound, and credible
standard setting procedures is a common challenge for state licensure exams. In addition,
in the current economic environment, government entities in charge of issuance of
licenses would benefit from standard setting methods that allow saving time and money
in the examination development process as well as protecting the public and the test
taker.
For these reasons, researchers need to study new methods for standard setting that
provide reliable cut scores. The Bookmark method is a new procedure that has been
positively accepted by test users. Despite the increasing popularity and evidence of
reduced complexity of the process, the Bookmark method for standard setting still lacks
evidence on its validity to support its status as a best practice in licensure testing contexts.
The purpose of this study is to evaluate the effectiveness of the Bookmark method
compared to the modified Angoff method. The aim of the study is to answer the
following question: Do the Bookmark and Angoff methods lead to the same results when
used on the same set of items for a state licensure exam?
Another goal of the study is to evaluate SMEs’ cognitive experiences during the
standard setting process and the resulting cut scores. This evaluation will be
accomplished with the use of a 17-item questionnaire about the effectiveness of the
standard setting methods.
Most research studies on Bookmark-based methods provide good evidence of
the reliability and validity of Bookmark-based methods (Peterson et al., 2011). The cut
scores for the same content area resulting from Angoff-based and Bookmark-based
methods clearly converged, providing evidence of reliability of Bookmark-based
methods. Procedural validity of the Bookmark-based methods was supported: panelists’ understanding of the tasks and instructions in Bookmark-based methods was higher than in Angoff-based methods, and panelists’ ratings of the reasonableness and defensibility of, and their confidence in, the final cut scores were high and higher than in the Angoff-based methods (Peterson et al., 2011).
As in the Skaggs, Hein, and Awuor (2007) study of passage-based tests, this study used separate booklets for the different content areas of the examination instead of a single ordered item booklet for the Bookmark method. For example, if the examination contained five content areas, there were five ordered item booklets. Also, the separate booklets produce more data points, so the reliability of the cut score should be higher; in addition, it should be easier to rate items that belong to the same content area because of the similar formatting that exists across the different content areas of the examination.
Chapter 4
METHOD
Participants
A total of fifteen SMEs participated in the standard setting studies. Seven SMEs
participated in workshop 1 and a different group of eight SMEs participated in workshop
2. All SMEs were current and active licensees. It was expected that most of the
participants of the study would be female because this reflects the majority of job practitioners. Some SMEs had several years of experience in their profession and other SMEs were newly licensed. The number of years being licensed ranged from 1.5 to 37
years. SMEs represented different practice specialties and geographic locations. All
SMEs were familiar with the issues that independent practitioners face on a day-to-day
basis as well as the competencies required of entry-level practitioners. Some SMEs had previous experience in exam development workshops and other SMEs had not participated in such workshops. It is important to have a sample of SMEs with different levels of experience,
ranging from being newly licensed to having many years of experience.
The names, age, and ethnicity of SMEs from the two workshops were kept
confidential by the Office of Professional Examination Services (OPES). Only the
county, years licensed, and work setting information about SMEs is provided. SMEs
were allowed to participate in the standard setting workshop only if they did not
participate in the exam construction workshop so that previous exposure to the items
would not interfere with the evaluation of the items in the standard setting workshop.
Materials
At the beginning of each workshop, SMEs received a folder with a copy of the
agenda for the workshop. An example of the agenda is provided in Appendix A. SMEs
also received information about the workshop building facilities, a copy of a PowerPoint
presentation, exam security agreement, and a copy of the two exam plans for the two
examinations to be rated.
The test developer showed a PowerPoint presentation to the SMEs as part of the
training that explained the two standard setting methods. After reviewing the exam plans
for the two exams, SMEs received a copy of the two licensure exams. Each exam had 100
four-option multiple-choice items. The 100 items for each exam were divided into two smaller
sets of 50 items each. All items were designed to assess minimal competence in different
content areas based on the exam plan. The items were previously selected by a different
group of SMEs during an exam construction workshop. During the exam construction
workshop, the test developer brings a pool of items for the SMEs to select the items that
they consider the best for that exam, making sure that the items do not overlap or test the
same area from the exam plan. Before the exam construction workshop, the test
developer reviews past performance of the items in order to ensure that they are
functioning well and that they differentiate those test takers who are qualified from those
who are not qualified to practice safely in the profession.
The data were collected using the Angoff and Bookmark methods. Although the
Bookmark method typically uses IRT-based difficulty estimates, the test developer used classical test theory item difficulties (p-values) from previous exam administrations, an approach that has been supported in recent research (Buckendahl et al., 2002; Davis-Becker,
Buckendahl, & Gerrow, 2011). The p-values were weighted based on sample size and the
following hypothetical example will illustrate the weighting of p-values. Item 99 was
administered on dates X, Y, and Z. For administration X the p-value is 0.20 and the
sample size is 10. For administration Y the p-value is 0.30 and sample size is 20. Finally,
for administration Z the p-value is 0.40 and the sample size is 80. The weighted average for Item 99 is (0.20*10 + 0.30*20 + 0.40*80) divided by (10 + 20 + 80), which is approximately 0.364.
In the current study, the p-values on which the Bookmark sorting was based came from prior test administrations with Ns of 70 to 198. The dates of the previous administrations ranged from November 2001 to November 2010. In addition, the items differed in the number of times they had been administered, which ranged from three to seven.
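The weighting can be reproduced directly from the hypothetical Item 99 example; the Python sketch below is a straightforward sample-size-weighted average and uses only the illustrative values given above.

    # (p-value, sample size) for each prior administration of hypothetical Item 99.
    administrations = [(0.20, 10), (0.30, 20), (0.40, 80)]

    weighted_p = (sum(p * n for p, n in administrations)
                  / sum(n for _, n in administrations))
    print(round(weighted_p, 3))  # 0.364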
Rating sheets containing three columns were provided to SMEs to write down
their ratings for the Angoff method (see Appendix B). Each rating sheet also had the
SME’s name, the date of the workshop, and the name of the profession. SMEs' ratings
were recorded in an Excel spreadsheet. The first column in the rating sheet contained the
item number, the second column was designed to enter the initial rating, and the last
column was designed to enter the final rating for each SME. They also received a rating
sheet for the Bookmark method which included their name and group number as it is
shown in Appendix C. It also included columns for the content area, initial placement,
second placement, and final placement for each of the two exams.
Research Design
The study consisted of 2 separate 2-day workshops. SMEs were current licensees
and they were recruited by their state board to represent different specialties, geographic
locations, and levels of experience. The design of the study is shown in Table 1. SMEs
rated subsets of 2 different licensure exams using two different standard setting methods:
the Bookmark and the Angoff methods. The goal was to compare the cut score that
resulted from the two methods. Thus, each subset of items was rated by the two methods.
For example, for subset 1, the Angoff cut score was obtained as well as the Bookmark cut
score. The research question was whether the Bookmark method produces similar cut
scores to the Angoff method on the same set of items. The first workshop was held on
January 21 – 22, 2011; and the second workshop was held on January 28 – 29, 2011. A
group of seven SMEs participated in each workshop. These two groups of SMEs were
independent of each other.
Table 1
Summary of Experimental Design

                          Day 1                             Day 2
                 Exam    Set    Method             Exam    Set    Method
Workshop 1         1      1     Angoff               2      2     Bookmark
                   1      2     Bookmark             2      1     Angoff
Workshop 2         2      1     Bookmark             1      2     Angoff
                   2      2     Angoff               1      1     Bookmark
On day 1 of workshop 1, SMEs rated the first set of items for exam 1 using the
Angoff method. Then, SMEs rated the second subset of items from the same exam using
the Bookmark method. On day 2, SMEs rated the second set of items from exam 2 using
the Bookmark method. Then, SMEs rated the first set for the same exam using the
Angoff Method.
On day 1 of workshop 2, a different group of SMEs rated the first set of items for
exam 2 using the Bookmark method. Exam 2 was rated on the first day in order to
counterbalance the order effects of type of exam. Each set of items was rated using the
two different standard setting methods in order to compare the results. Then, SMEs rated
the second set of items from the same exam using the Angoff method. On day 2, SMEs
rated the second set of items from exam 1 using the Angoff method. Then, SMEs rated
the first set for the same exam using the Bookmark method. A cut score was derived from
these workshops.
Procedures
General Training
SMEs provided judgments on item difficulty for two licensure exams in a
standard setting study. The two independent groups of SMEs had the same training at the
beginning of each workshop. As part of the training, the test developer explained the
exam security procedures to be followed in order to protect the confidentiality of the
exams. Then, SMEs were asked to sign the exam security agreement stating that they
would follow those procedures. Another part of the training was to review the exam plan
for each exam in order to become familiar with the required KSAs for these exams. After
reviewing the exam plans, the test developer presented a performance behaviors table
with definitions of the highly qualified, minimally acceptable, and unqualified candidate.
This table was directly linked to the exam plan and it was established during the
development of previous workshops by multiple focus groups. A short sample of the
performance behaviors table is provided in Appendix E.
At the end of this part of the meeting, SMEs were trained on the standard setting
method to be used first, followed by the other standard setting method. The training on
the standard setting method was done on the first day of the workshops. Thus, in
workshop 1, SMEs were trained on Angoff method first followed by the rating of the
items. Afterward, SMEs were trained on the Bookmark method followed by the rating of
the items. In workshop 2, SMEs were trained on the Bookmark method first followed by
the rating of the items; after that, SMEs were trained on the Angoff method followed by
the rating of the items. In the Angoff method, SMEs rated each item of the test. In the
Bookmark method, SMEs placed bookmarks on the OIB. Each standard setting method
consisted of more than one round of ratings. A round of ratings was the time period
within the standard setting process in which judgments were collected from each SME.
Angoff Method
For the Angoff method, SMEs received two booklets with 50 four-option multiple-choice items each and two rating sheets (one per exam). The items were ordered by content
area. The test developer trained the SMEs using a Microsoft PowerPoint presentation that
explained the Angoff method. An outline of the presentation is provided in Appendix F.
After the presentation, the SMEs answered a 10-item practice exam in order to become familiar with the process. The practice exam also helped to verify that SMEs were properly calibrated and understood what the Minimum Acceptable Competence (MAC) criterion was for the test taker. Having this concept in mind is critical because SMEs’
ratings are based on this entry-level standard.
As part of the practice process, the test developer provided the key for each item
after all SMEs had answered the 10 items. Then, SMEs were asked to rate the difficulty
of the items and to write them down on a rating sheet. The SMEs were instructed to rate
the items based on the percentage of people, out of a group of 100 candidates meeting the MAC criteria, who would answer the item correctly.
The initial ratings were made independently of the other SMEs; no discussion of the items was held at this point. The SMEs used a rating scale ranging from 25% to 95% to rate the
perceived difficulty of the items. Then, each SME provided the initial ratings to the test
developer and the test developer entered the ratings from each SME on an Excel
spreadsheet designed to compute the cut score. The spreadsheet contained SME’s names
across the columns and the items listed in the rows grouped by content areas for easier
identification. The test developer obtained the average rating for the item based on the
initial rating across raters and determined if the group of SMEs needed to discuss the
item.
In the case that the range of ratings for an item had a discrepancy of 20% or more, SMEs were asked to discuss that specific item. For example, one SME might consider the item easy and rate it at 80%, whereas another SME might consider the same item challenging and rate it at 45%. In this example, there is a rating discrepancy of 35%, which reflects SME disagreement about the difficulty of the item. The SMEs with the lowest and highest ratings are asked to explain the rationale for their ratings. The discussion is based on the evaluation of the item as a whole. SMEs evaluate the quality of the three wrong answers of the item and the required KSAs. SMEs consider which and how many of the wrong answers could be easily eliminated in order to get to the right answer.
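The 20% discrepancy screen can be illustrated with the following Python sketch; the item labels and ratings are hypothetical and are not taken from the study’s data.

    # An item is flagged for discussion when the spread between the highest and
    # lowest SME rating is 20 percentage points or more.
    ratings_by_item = {
        "item_1": [80, 75, 45],  # range of 35 points: flagged
        "item_2": [70, 75, 65],  # range of 10 points: not flagged
        "item_3": [90, 85, 70],  # range of 20 points: flagged
    }

    flagged = [item for item, ratings in ratings_by_item.items()
               if max(ratings) - min(ratings) >= 20]
    print(flagged)  # ['item_1', 'item_3']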
After some discussion with the group, SMEs have an opportunity to re-rate the
item and write down the final rating for the item. SMEs are asked to provide a final rating
based on the discussion, if there was one. In the case that ratings failed to fall within the
20% criterion again, SMEs were asked if the item was difficult, average, or easy in order
to verify if SMEs were still calibrated. Another method to verify that SMEs were using
the rating scale properly is to ask them to explain their rationale for why they considered
the item easy or difficult. Sometimes SMEs fail to use the rating scale appropriately and
tend to rate the items with an 80% when they really consider the item to be difficult. In
the case that no discussion was held, SMEs are asked to carry their initial rating over to the final column on their rating sheets. The test developer enters the final ratings on the Excel
spreadsheet.
After completing the 10-item practice exam, SMEs started to work on exam 1. They
answered 25 items from exam 1 and followed the same procedure used in the practice exam.
After rating all 25 items, they continued with the other 25 items from this set, for a
total of 50 items. SMEs answered and rated the items in two separate groups because
working with smaller groups of items reduces fatigue: SMEs do not have to hold the
content of as many items in memory while rating them. The recommended passing score was
the average of the final ratings across all raters.
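To make the spreadsheet logic concrete, the short Python sketch below is an illustration only; the study used an Excel workbook, and the ratings, item count, and rater count shown here are hypothetical. It flags items whose ratings span 20 percentage points or more and computes the recommended cut score as the mean of the final ratings.

```python
import numpy as np

# Hypothetical Angoff ratings: rows = items, columns = SMEs. Each value is the
# estimated percentage of 100 minimally competent candidates who would answer
# the item correctly (ratings restricted to the 25-95 scale).
initial_ratings = np.array([
    [80, 75, 70, 45, 65, 70, 75],   # item 1: range = 35, flag for discussion
    [60, 65, 60, 55, 65, 60, 65],   # item 2: range = 10, no discussion needed
])

# Flag items whose highest and lowest ratings differ by 20 points or more.
needs_discussion = initial_ratings.max(axis=1) - initial_ratings.min(axis=1) >= 20
print("Items to discuss:", (np.where(needs_discussion)[0] + 1).tolist())

# Final ratings would be re-entered after discussion; the initial values are
# reused here as a stand-in.
final_ratings = initial_ratings.astype(float)

# Recommended cut score: mean of the final ratings across all items and raters.
print(f"Recommended Angoff cut score: {final_ratings.mean():.2f}")
```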
Bookmark Method
For the Bookmark method, SMEs received several booklets with the items grouped by
content area for each exam: 5 booklets for exam 1 and 3 booklets for exam 2. Each
booklet presented one item per page, for a total of 50 items per exam. Within each
booklet, the items were ordered from easiest to most difficult based on classical test
theory p-values obtained from previous administrations.
The test developer gave a Microsoft PowerPoint presentation on the Bookmark method.
After the presentation, SMEs were assigned to one of 2 groups, with 3 or 4 people in
each group. SMEs answered a 10-item practice exam in order to become familiar with the
process; the items were the same items used in the Angoff method and were ordered from
easiest to most difficult. After all the items had been answered, the test developer
provided the key for each item. During round 1, SMEs were asked to go through the items
and, keeping the MAC candidate in mind, place their initial bookmark on the item they
believed would be the last item that candidate would answer correctly; beyond this
placement the candidate would be expected to answer the remaining items incorrectly. The
first placement was made independently, and no discussion was held among SMEs during
this round. SMEs recorded their initial placement on the bookmark rating sheet, reported
it to the test developer, and the test developer entered the data in the Excel
spreadsheet.
During round 2, performance data were provided and explained after the first bookmark
placement. SMEs were allowed ten minutes to discuss with their group the reasons for
their placements, based on the knowledge candidates would need in order to answer each
item correctly. They also discussed why the previous and following items would have been
more or less difficult for the MAC candidate. After the group discussions, SMEs placed
their second bookmark in the booklet and the test developer entered it in the Excel
spreadsheet. This time, SMEs took into account the information shared during their small
group discussion when making their placement.
For the final discussion, in round 3, one SME from each group summarized that group's
discussion, and SMEs had a last opportunity to move their final bookmark placement. In
all, there were three rounds of bookmark placements. Unlike the Angoff method, high
disagreement among SMEs was not required in order to hold these discussions; the
discussions are required steps of the Bookmark method and were therefore always held.
After the practice exercise, SMEs answered the 50 items from the exam one booklet at a
time. When they were done, the test developer provided the key to those items. SMEs
placed a separate bookmark in each booklet (5 bookmarks for exam 1 and 3 bookmarks for
exam 2), following the same procedure used for the practice exam.
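The mechanics of assembling an ordered item booklet (OIB) and averaging bookmark placements can be sketched as follows. The item identifiers, p-values, and placements are hypothetical, and the conversion of an average placement into a percentage cut score is not shown because the thesis does not detail that step.

```python
# Hypothetical p-values for one content-area booklet (proportion of previous
# examinees answering each item correctly).
p_values = {"item_07": 0.91, "item_03": 0.84, "item_12": 0.62, "item_09": 0.55}

# Order the booklet from easiest (highest p-value) to most difficult (lowest).
oib = sorted(p_values, key=p_values.get, reverse=True)
print("OIB order:", oib)

# Each SME bookmarks the position (1-based) of the last item the MAC candidate
# is expected to answer correctly; the cut score analysis starts from the mean
# of the final placements across SMEs.
final_placements = [2, 3, 2, 3]          # hypothetical placements for 4 SMEs
mean_placement = sum(final_placements) / len(final_placements)
print(f"Mean final bookmark placement: {mean_placement:.2f}")
```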
The same procedures were used in the second workshop. A different group of SMEs
participated, and the order of the standard setting methods was reversed to
counterbalance any order effects: the Bookmark method was used first on the first day,
and the Angoff method was used first on the second day.
Evaluation of Standard Setting Methods
At the end of the workshop, Subject Matter Experts completed a 17-item
questionnaire about the effectiveness of the two standard setting procedures. The
questionnaire is presented in Appendix D.
Chapter 5
RESULTS
The mean of the final round of ratings across SMEs was the final cut score for each
subset of items rated with the Angoff method. The resulting cut scores for each round of
ratings for each subset of items are shown in Table 2, along with standard deviations,
95% confidence intervals, and Cronbach's alpha reliability coefficients for each round
of ratings.
The mean of the final bookmark placement across SMEs was used as the cut score
for the set of items in which the Bookmark method was used. The resulting cut scores for
each round are shown in Table 2. Table 2 also shows the change in reliability of the cut
scores between rounds of ratings.
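The 95% confidence intervals reported in Table 2 appear consistent with a t-based interval around the mean of the SMEs' ratings (mean plus or minus t times SD divided by the square root of n). The sketch below reproduces that computation under this assumption, using hypothetical ratings.

```python
from math import sqrt
from scipy import stats

def cut_score_ci(ratings, confidence=0.95):
    """Mean cut score with a t-based confidence interval across SME ratings."""
    n = len(ratings)
    mean = sum(ratings) / n
    sd = stats.tstd(ratings)                       # sample SD (ddof = 1)
    half_width = stats.t.ppf((1 + confidence) / 2, df=n - 1) * sd / sqrt(n)
    return mean, sd, (mean - half_width, mean + half_width)

# Hypothetical final Angoff cut scores for 7 SMEs (each SME's mean rating).
sme_cut_scores = [70.0, 68.5, 72.0, 69.0, 73.5, 71.0, 67.8]
mean, sd, ci = cut_score_ci(sme_cut_scores)
print(f"M = {mean:.2f}, SD = {sd:.2f}, 95% CI = {ci[0]:.2f} - {ci[1]:.2f}")
```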
The reliability of the Angoff cut score for set 1 of exam 1 increased from 0.72 in the
initial round to 0.85 in the final round of ratings. Because the SMEs disagreed about
the difficulty of some of the items, there were negative inter-judge correlations for
the Bookmark ratings, and the reliability of the Bookmark cut score was negative
(Cronbach's alpha = -0.18) when all 7 raters were included in the analysis (M = 62.68,
SD = 7.88, 95% CI = 55.39 – 69.97). Thus, only 4 raters were used for the analysis. The
resulting reliability increased from 0.64 in the initial round to 0.83 in the second
round, and then dropped to 0.68 in the final round.
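The alpha coefficients reported here appear to treat the SMEs as the "items" of a scale and the rated test items (or booklets) as cases; that is the standard framing for inter-judge reliability, although the thesis does not spell it out. A minimal sketch under that assumption, with hypothetical ratings, is shown below.

```python
import numpy as np

def cronbach_alpha(ratings):
    """Cronbach's alpha with columns = SMEs (treated as scale items) and
    rows = rated test items or booklets (treated as cases)."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                            # number of judges
    judge_variances = ratings.var(axis=0, ddof=1)   # variance of each judge's ratings
    total_variance = ratings.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - judge_variances.sum() / total_variance)

# Hypothetical ratings of 5 items by 4 SMEs.
ratings = [[70, 75, 65, 70],
           [55, 60, 50, 55],
           [80, 85, 75, 80],
           [65, 60, 70, 65],
           [45, 50, 40, 45]]
print(f"Cronbach's alpha = {cronbach_alpha(ratings):.2f}")
```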
Table 2
Cut Scores, Standard Deviations, Confidence Intervals, and Reliability Between Rounds

Method          Round     Cut score   SD     95% CI          α
Exam 1, Set 1
  Angoff        Initial   70.66       4.11   66.85 – 74.47   0.72
                Final     70.26       2.94   67.53 – 72.98   0.85
  Bookmark      Initial   69.00       3.44   63.52 – 74.48   0.64
                Second    67.55       4.32   60.66 – 74.44   0.83
                Final     67.75       3.81   61.67 – 73.83   0.68
Exam 1, Set 2
  Angoff        Initial   77.09       3.44   73.89 – 80.27   0.76
                Final     77.87       2.80   74.92 – 80.81   0.79
  Bookmark      Initial   66.82       5.62   61.62 – 72.03   0.74
                Second    65.28       4.03   61.54 – 69.02   0.86
                Final     65.42       5.79   60.06 – 70.78   0.93
Exam 2, Set 1
  Bookmark      Initial   64.04       4.48   61.25 – 67.08   0.93
                Second    64.54       3.97   61.21 – 67.86   0.97
                Final     64.57       3.91   60.96 – 68.19   0.95
  Angoff        Initial   69.00       2.94   66.27 – 71.73   0.73
                Final     69.13       2.77   66.56 – 71.69   0.77
Exam 2, Set 2
  Bookmark      Initial   64.24       6.68   58.06 – 70.42   0.30
                Second    66.95       4.69   62.61 – 71.30   0.90
                Final     66.86       5.39   61.87 – 71.85   0.83
  Angoff        Initial   73.82       4.40   70.13 – 77.51   0.80
                Final     73.84       3.48   70.92 – 76.75   0.87

Note. The Bookmark cut score for set 1 of exam 1 was based on the analysis of 4 SMEs due
to low inter-judge correlations in the reliability analysis using 7 SMEs.
A summary of the final cut scores and intraclass correlation coefficients (ICCs) is
presented in Table 3. The reliability of the Angoff cut score for set 2 of exam 1
increased from 0.76 in the initial round to 0.79 in the final round. The reliability of
the Bookmark cut score for set 2 of exam 1 was 0.74 in the initial round, 0.86 in the
second round, and 0.93 in the final round.
The reliability of the Bookmark cut score for set 1 of exam 2 went from 0.93 in the
initial round to 0.97 in the second round, and then changed to 0.95 in the final round.
The reliability of the Angoff cut score for set 1 changed from 0.73 in the initial round
to 0.77 in the final round. The reliability of the Bookmark cut score for set 2 was 0.30
in the initial round, 0.90 in the second round, and 0.83 in the final round. The
reliability of the Angoff cut score for set 2 was 0.80 in the initial round and 0.87 in
the final round of ratings.
Table 3 summarizes the final cut scores for each set of items from exams 1 and 2 using
the Angoff and Bookmark methods. The table also provides the number of items, number of
raters, number of cases, intraclass correlation coefficients (ICCs), and 95% confidence
interval for each cut score. Two cut scores were produced for exam 1 and two for exam 2.
The Angoff method was used on one half of each exam and the Bookmark method on the other
half, and both methods were applied to the same sets of items. Thus, an Angoff cut score
and a Bookmark cut score were obtained for the same subset of items in order to
determine whether the two methods would provide similar cut scores.
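Table 3 reports single-measure and average-measure ICCs. The thesis does not state the ICC model explicitly; the sketch below assumes a two-way random-effects, absolute-agreement model (Shrout and Fleiss ICC(2,1) and ICC(2,k)) computed from a cases-by-raters matrix of hypothetical values.

```python
import numpy as np

def icc_two_way_random(Y):
    """ICC(2,1) and ICC(2,k) for an n-cases x k-raters matrix, following the
    two-way random-effects, absolute-agreement model of Shrout and Fleiss."""
    Y = np.asarray(Y, dtype=float)
    n, k = Y.shape
    grand = Y.mean()
    ss_rows = k * ((Y.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((Y.mean(axis=0) - grand) ** 2).sum()
    ss_error = ((Y - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                         # between-cases mean square
    msc = ss_cols / (k - 1)                         # between-raters mean square
    mse = ss_error / ((n - 1) * (k - 1))            # residual mean square
    icc_single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    icc_average = (msr - mse) / (msr + (msc - mse) / n)
    return icc_single, icc_average

# Hypothetical matrix: 5 cases (items or booklets) rated by 4 SMEs.
Y = [[70, 72, 68, 71],
     [55, 58, 52, 57],
     [80, 83, 78, 82],
     [65, 63, 66, 64],
     [45, 48, 42, 47]]
single, average = icc_two_way_random(Y)
print(f"ICC-Single = {single:.2f}, ICC-Average = {average:.2f}")
```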
Table 3
Summary of Angoff and Bookmark Final Cut Scores for Each Set of Items

Exam 1                   Set 1                       Set 2
Method                   Angoff       Bookmark       Bookmark     Angoff
Items                    1-50         1-50           51-100       51-100
Number of Raters         7            4              7            7
Cases                    50           5              5            50
Cut Score                70.26        67.75          65.42        77.87
ICC-Single               0.41         0.32           0.57         0.27
  95% CI                 0.29-0.54    -0.04-0.85     0.24-0.92    0.16-0.40
ICC-Average              0.83         0.65           0.90         0.72
  95% CI                 0.74-0.89    -0.17-0.96     0.68-1.00    0.58-0.82

Exam 2                   Set 1                       Set 2
Method                   Bookmark     Angoff         Angoff       Bookmark
Items                    1-50         1-50           51-100       51-100
Number of Raters         8            7              8            7
Cases                    3            50             50           3
Cut Score                64.57        69.13          73.84        66.86
ICC-Single               0.57         0.32           0.35         0.37
  95% CI                 0.16-0.98    0.21-0.45      0.25-0.48    0.02-0.97
ICC-Average              0.91         0.76           0.81         0.80
  95% CI                 0.61-0.99    0.65-0.82      0.72-0.88    0.10-0.99
The Angoff cut score for set 1 of exam 1 (M = 70.26, SD = 2.94) was higher than the
Bookmark cut score (M = 67.75, SD = 3.81). The Angoff cut score had a narrower
confidence interval (CI = 67.53 – 72.98) than the Bookmark cut score (CI = 61.67 –
73.83). The Bookmark cut score (67.75) fell within the Angoff confidence interval (67.53
– 72.98) and the Angoff cut score (70.26) fell within the Bookmark confidence interval
(61.67 – 73.83), suggesting no difference between the two cut scores.
The Angoff cut score for set 2 (M = 77.87, SD = 2.80) was higher than the Bookmark cut
score (M = 65.42, SD = 5.79). The Angoff cut score had a narrower confidence interval
(CI = 74.92 – 80.81) than the Bookmark cut score (CI = 60.06 – 70.78). In this case the
two confidence intervals did not overlap, suggesting a difference between the cut
scores.
The Angoff cut score for set 1 of exam 2 (M = 69.13, SD = 2.77) was higher than the
Bookmark cut score (M = 64.57, SD = 3.91). The Angoff cut score had a narrower
confidence interval (CI = 66.56 – 71.69) than the Bookmark cut score (CI = 60.96 –
68.19). The Angoff cut score did not fall within the Bookmark confidence interval and
the Bookmark cut score did not fall within the Angoff confidence interval, suggesting a
difference in cut scores.
The Angoff cut score for set 2 (M = 73.84, SD = 3.48) was higher than the Bookmark cut
score (M = 66.86, SD = 5.39). The Angoff cut score had a narrower confidence interval
(CI = 70.92 – 76.75) than the Bookmark cut score (CI = 61.87 – 71.85). Although neither
cut score fell within the other method's confidence interval, the two intervals
overlapped, suggesting no difference.
The data suggested a difference in cut scores for set 2 of exam 1 and set 1 of exam 2,
and no difference for set 1 of exam 1 and set 2 of exam 2. The confidence intervals for
the Angoff cut scores were narrower than those for the Bookmark cut scores, and the
Angoff cut scores were consistently higher than the Bookmark cut scores.
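The comparison logic used in the preceding paragraphs, namely whether each method's cut score falls inside the other method's confidence interval and whether the intervals overlap, can be sketched as follows, using the Table 2 values for set 1 of exam 1.

```python
def compare_cut_scores(angoff, bookmark):
    """Each argument is (cut_score, (ci_lower, ci_upper)). Returns the three
    checks used in the text: mutual containment and interval overlap."""
    a_cut, (a_lo, a_hi) = angoff
    b_cut, (b_lo, b_hi) = bookmark
    return {
        "angoff_cut_in_bookmark_ci": b_lo <= a_cut <= b_hi,
        "bookmark_cut_in_angoff_ci": a_lo <= b_cut <= a_hi,
        "intervals_overlap": max(a_lo, b_lo) <= min(a_hi, b_hi),
    }

# Set 1 of exam 1, using the values reported in Table 2.
print(compare_cut_scores((70.26, (67.53, 72.98)), (67.75, (61.67, 73.83))))
```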
McNemar's test was conducted to assess the significance of the difference between the
two methods. The results are presented in Table 4, which shows the proportions of
candidates passing and the differences between the two methods for each 50-item set of
the two exams. The results for exam 1 are based on 57 candidates. For exam 1, set 1,
approximately 37% of candidates passed according to the Angoff cut score while 63%
passed according to the Bookmark cut score, a difference of 26 percentage points. For
set 2, 56% of candidates passed according to the Angoff cut score and 93% passed
according to the Bookmark cut score, a difference of 37 percentage points. Both
differences were statistically significant, p < .001.
The results of the McNemar test for exam 2 are based on 99 candidates. For exam 2, set
1, 44% of candidates passed according to the Angoff cut score while 52% passed according
to the Bookmark cut score, a difference of 8 percentage points, which was statistically
significant, p = .007. For set 2, 47% passed according to the Angoff cut score and 62%
passed according to the Bookmark cut score, a difference of 15 percentage points, which
was statistically significant, p < .001.
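McNemar's test operates on the paired pass/fail decisions produced by the two cut scores. The sketch below uses statsmodels and simulated candidate scores; both are assumptions for illustration, since the thesis does not report its analysis software or reproduce the candidate-level data. It builds the 2 x 2 table of concordant and discordant decisions and runs the test.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Simulated percent-correct scores for 57 candidates on the same 50-item set
# (the actual candidate data are not reproduced here).
rng = np.random.default_rng(0)
scores = rng.integers(40, 100, size=57)

angoff_cut, bookmark_cut = 70, 67                  # rounded cut scores, as in Table 4
pass_angoff = scores >= angoff_cut
pass_bookmark = scores >= bookmark_cut

# Paired 2 x 2 table of decisions: rows = Angoff pass/fail, columns = Bookmark.
table = np.array([
    [np.sum(pass_angoff & pass_bookmark), np.sum(pass_angoff & ~pass_bookmark)],
    [np.sum(~pass_angoff & pass_bookmark), np.sum(~pass_angoff & ~pass_bookmark)],
])

result = mcnemar(table, exact=True)                # exact test on the discordant cells
print(f"Pass rate Angoff = {pass_angoff.mean():.2f}, "
      f"Bookmark = {pass_bookmark.mean():.2f}, p = {result.pvalue:.3f}")
```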
Table 4
Summary of McNemar Test for Significance of the Proportions of Candidates Passing by
the Angoff Method versus the Bookmark Method

Exam 1                       Set 1                     Set 2
Method                       Angoff      Bookmark      Bookmark    Angoff
Cut Score                    70          67            65          77
Proportion Passing (%)       37          63            93          56
Proportion Difference (%)         26                        37

Exam 2                       Set 1                     Set 2
Method                       Bookmark    Angoff        Angoff      Bookmark
Cut Score                    64          69            73          66
Proportion Passing (%)       52          44            47          62
Proportion Difference (%)          8                        15

Note. Exam 1 results are based on 57 candidates; exam 2 results are based on 99
candidates.
Workshop evaluation data were collected at the end of each workshop to obtain SMEs'
perceptions of the Angoff and Bookmark standard setting methods. SMEs were asked to
complete a 17-item survey. Table 5 shows the means, standard deviations, and 95%
confidence intervals for each item on the survey. SMEs rated how confident and
comfortable they were with the process and with their resulting cut scores. Ratings were
made on a five-point scale (5 = Strongly Agree; 4 = Agree; 3 = Undecided; 2 = Disagree;
1 = Strongly Disagree).
Table 5
Summary of SME Evaluations of the Standard Setting Process

Item                                                                                  M      SD     95% CI
Entire Process
 1. Training and practice exercises helped to understand how to perform the tasks    4.82   0.41   4.55 – 5.09
 2. Taking the test helped me to understand the required knowledge                   4.64   0.67   4.18 – 5.09
 3. I was able to follow the instructions and complete the rating sheets accurately  4.36   0.67   3.91 – 4.82
 4. The time provided for discussions was adequate                                   4.36   0.51   4.02 – 4.70
 5. The training provided a clear understanding of the purpose of the workshop       4.18   0.75   3.68 – 4.69
 6. The discussions after the first round of ratings were helpful to me              4.09   1.22   3.27 – 4.91
 7. The workshop facilitator clearly explained the task                              4.09   0.70   3.62 – 4.56
 8. The performance behavior descriptions were clear and useful                      4.09   0.70   3.62 – 4.56
Angoff
 9. I am confident about the defensibility of my final recommended cut scores
    using Angoff                                                                      4.45   0.52   4.10 – 4.81
Bookmark
10. I thought a few of the items in the booklet were out of order                    4.36   0.67   3.91 – 4.82
11. The information showing the statistics of the items was helpful to me            4.09   0.54   3.73 – 4.45
12. The discussions after the second round of ratings were helpful to me             3.64   1.36   2.72 – 4.55
13. Deciding where to place my bookmark was difficult                                3.55   1.21   2.73 – 4.36
14. Understanding of the task during each round of bookmark placements was clear     3.27   1.19   2.47 – 4.07
15. I am confident that my final bookmark approximated the proficient level on
    the state test                                                                    2.73   1.19   1.93 – 3.53
16. Overall, the order of the items in the booklet made sense                        2.64   1.03   1.95 – 3.33
17. I felt pressured to place my bookmark close to those favored by other SMEs       2.45   1.29   1.59 – 3.32

Note. The results are based on 11 SMEs. The other 4 SMEs did not return the evaluation
questionnaire.
Table 5 shows that SMEs perceived that training activities helped them
understand how to perform the tasks. Those activities included the use of the performance
behavior table, training and practice exercises, and taking the test. SMEs agreed that the
discussions after the first round of ratings were helpful. SMEs felt confident about the
defensibility of their final Angoff cut score.
Although SMEs understood the task during each round of placements, they were not
confident that their Bookmark cut score approximated the proficient level on the state
test. SMEs also perceived that some of the items in the OIB were out of order. They
found the discussion after the second round of ratings useful. Furthermore, SMEs agreed
that deciding where to place their bookmark was a difficult task, although they did not
feel pressured to place their bookmark close to those of other SMEs.
Chapter 6
DISCUSSION
The results of the present study suggest that the cut scores produced by the Angoff
method were consistently higher than those produced by the Bookmark method: when the two
methods were applied to the same sets of items from two state licensure exams, the
Bookmark method produced the lower cut score in every case. This differs from the
findings of other studies in which the Angoff and Bookmark cut scores for the same
content area converged, providing evidence of the validity of the Bookmark method
(Peterson et al., 2011).
The results of McNemar's test confirmed a statistically significant difference in the
pass rates produced by the two methods for each set. More candidates would pass the
licensure exams when using the Bookmark cut score than when using the Angoff cut score
for all 4 sets of items.
It was also found that the reliability of the cut scores generally improved between
rounds of ratings for both the Angoff and the Bookmark methods. For the Angoff method,
reliability coefficients ranged from 0.71 to 0.86; for the Bookmark method, they ranged
from 0.64 to 0.96. These results imply that discussion of the items during a passing
score workshop may improve the reliability of the cut scores.
In the study by Skaggs et al. (2007), separate booklets were used for the different
content areas of the exam in the Bookmark method. The same procedure was used in the
present study instead of having only one bookmark placement for the whole set of items;
for example, if an exam contained five content areas, there were five OIBs. Skaggs et
al. (2007) found that the additional data points produced by the separate booklets
resulted in high reliability of the cut score. The moderate reliability of the cut
scores in the present study may likewise be attributable to the easier task of rating
items from the same content area. Items from the same content area share similarities
such as formatting characteristics. This was the case in the present study, in which one
content area might have required the test taker to select a vocabulary word from a given
list, while another content area might have required the test taker to proofread a
paragraph. Furthermore, the use of a booklet for each content area during a Bookmark
passing score workshop would make the task of rating items easier for SMEs and allow
them to focus on the difficulty of the items rather than the difficulty of the process.
The results of the evaluation questionnaire suggest that training activities, including
the training and practice exercises and taking the test, helped SMEs understand how to
perform the standard setting tasks. This is consistent with the study by Peterson et al.
(2011), in which understanding of the tasks and instructions was high for both the
Bookmark and the Angoff methods. These results imply that test developers need to ensure
that SMEs understand their standard setting tasks by including training activities that
familiarize them with the process as well as with the exam requirements.
In addition, SMEs felt confident about the defensibility of their final Angoff cut
score, but they were not confident about the defensibility of the Bookmark cut score.
This is contrary to the findings of Buckendahl et al. (2002), in which panelists
expressed high levels of confidence in their final Bookmark cut score. In the present
study, it is possible that SMEs felt more confident about the Angoff results than the
Bookmark results because they were more familiar with the well-researched Angoff method.
In the present study, SMEs reported that deciding where to place their bookmarks in the
OIB was a difficult task. This is similar to the study by Hein and Skaggs (2009), in
which panelists understood the procedure yet still found selecting the item on which to
place their bookmark nearly impossible. This difficulty was likely caused by the SMEs'
perception that the items in the OIB were out of order, which parallels the findings of
Lin (2006), whose panelists thought it was difficult to place their bookmarks because
they perceived that the items were out of order. Several factors may create this
perception that the items are out of difficulty order. One is that SMEs may not be
considering the whole item when rating; that is, they may be ignoring the role of the
distracters (the wrong options) in each question. The distracters in an item can make
the item either easier or more challenging. Test developers should therefore ensure that
SMEs evaluate each item as a whole, including the influence of the distracters.
Based on the results of this study and the continuing need for evidence supporting the
validity of the Bookmark method, state entities should keep using the well-researched
Angoff method to set cut scores for licensure exams. The results of this study imply
that using the Bookmark method to set cut scores would allow more candidates to pass a
licensure exam than the Angoff method would. In that situation, state entities could be
licensing unqualified candidates and consequently harming the public.
Future research should compare Bookmark and Angoff cut scores using a larger number of
items in each content area and a larger number of SMEs, which would allow the panel to
be divided into more groups during the workshop.
Some limitations of the current study should also be addressed in order to provide
future directions. The first limitation was the small sample of raters. Future
researchers could select a larger, representative sample of raters and organize them
into smaller groups for the Bookmark method; this approach could improve the reliability
of the cut scores. The second limitation was that the study was conducted in only one
profession, which makes it difficult to generalize to other professions.
A third limitation was the use of classical test theory p-values, rather than an IRT
approach, to rank the items in the OIB. Combining classical test theory with an IRT
approach could provide more accurate statistical information for arranging the items in
the OIB by difficulty. In addition, the study would have benefited if the data used to
arrange the items for the Bookmark method had been based on a larger sample of
examinees. For this profession, the population of test takers is not large, so it was
difficult to obtain item information from a larger sample.
The fourth limitation was the number of items used in each booklet. When the sets of
items for each exam were created, the same number of items from each content area was
placed in each booklet, and as a result only a small number of items ended up in each
booklet. Future researchers could use a different approach to creating the sets of items
for the booklets.
Appendix A
Agenda
Board of California
EXAMS
PASSING SCORE WORKSHOP
Office of Professional Examination Services 2420 Del Paso Road, Suite 265 Sacramento, CA 95834
January 21 – 22, 2011
I.    Welcome and introductions
II.   Board business
      A. Examination security, self-certification
III.  About OPES facilities
      A. Security procedures (electronic devices)
      B. Workshop procedures (breaks and lunch)
IV.   PowerPoint presentation
      A. Angoff: use the full range of 25%-95% ("What percentage of minimally competent
         candidates WOULD answer this item correctly?")
      B. Bookmark: 2 groups, items ordered from easiest to most difficult, 3 rounds of
         bookmark placement up to the item where the performance level should be, based
         on item difficulty
V.    What are minimum competence standards?
VI.   What are performance behaviors?
      A. Ineffective, Minimally Acceptable Competence, and Highly Effective
VII.  Take examination 1
VIII. Assignment of ratings to examination 1 using Angoff
IX.   Assignment of ratings to examination 1 using Bookmark
X.    Take examination 2
XI.   Assignment of ratings to examination 2 using Bookmark
XII.  Assignment of ratings to examination 2 using Angoff
XIII. Wrap-up and adjourn
Appendix B
Angoff Rating Sheet
Name: «First_Name» «Last_Name»
MFT-CV Exam Rating Sheet
September 2008
Passing Score Workshop
Rating scale:
Guess: 25     Hard: 35, 45     Thinking: 55, 65, 75     Easy: 85, 95

Item Number     Initial Rating     Final Rating
(rows for items 1 through 15, each with blank Initial Rating and Final Rating columns)
Appendix C
Bookmark Rating Sheet
BOOKMARK RATING SHEET
WORKSHOP 1
NAME:            GROUP:            DATE:

PROFESSIONAL PRACTICE               INITIAL         SECOND          FINAL
                                    PLACEMENT       PLACEMENT       PLACEMENT
1 Reporting Proceedings
2 Transcribing Proceedings
3 Research and Language Skills
4 Transcript Management
5 Ethics

BOOKMARK RATING SHEET
WORKSHOP 1
NAME:            GROUP:            DATE:

ENGLISH                             INITIAL         SECOND          FINAL
                                    PLACEMENT       PLACEMENT       PLACEMENT
1 Grammar
2 Proofreading
3 Vocabulary
Appendix D
Evaluation Questionnaire
Standard Setting Methods Evaluation Questionnaire
Please use the following rating scale to evaluate the effectiveness of the two standard
setting methods used during the workshop. For each statement, circle the number that
best describes your answer.

1 = Strongly Disagree   2 = Disagree   3 = Undecided   4 = Agree   5 = Strongly Agree

Statement                                                                                      Rating
 1. The training provided a clear understanding of the purpose of the workshop                1  2  3  4  5
 2. The workshop facilitator clearly explained the task                                       1  2  3  4  5
 3. Training and practice exercises helped to understand how to perform the tasks             1  2  3  4  5
 4. Taking the test helped me to understand the required knowledge                            1  2  3  4  5
 5. The performance behavior descriptions were clear and useful                               1  2  3  4  5
 6. The time provided for discussions was adequate                                            1  2  3  4  5
 7. I was able to follow the instructions and complete the rating sheets accurately           1  2  3  4  5
 8. The discussions after the first round of ratings were helpful to me                       1  2  3  4  5
 9. The discussions after the second round of ratings were helpful to me                      1  2  3  4  5
10. The information showing the statistics of the items was helpful to me                     1  2  3  4  5
11. I am confident about the defensibility of the final recommended cut scores using Angoff   1  2  3  4  5
12. Understanding of the task during each round of bookmark placements was clear              1  2  3  4  5
13. I felt pressured to place my bookmark close to those favored by other SMEs                1  2  3  4  5
14. Deciding where to place my bookmark was difficult                                         1  2  3  4  5
15. Overall, the order of the items in the booklet made sense                                 1  2  3  4  5
16. I thought a few of the items in the booklet were out of order                             1  2  3  4  5
17. I am confident that my final bookmark approximated the proficient level on the state test 1  2  3  4  5
Comments/Suggestions:
Appendix E
MAC Table
Content Area 1 (39%)
  Unqualified:
    - Misapplies or is unaware of applicable code sections
    - Fails to ask for assistance when needed
  MAC (Minimally Competent):
    - Treats all parties impartially
    - Asks for assistance when needed
    - Complies with applicable codes
  Highly Qualified:
    - Explains and applies code sections correctly
    - Manages workload effectively
    - Effectively carries additional equipment

Content Area 2 (20%)
  Unqualified:
    - Unaware of filing procedures or protocol
    - Fails to adhere to redaction protocols
    - Fails to back up
  MAC (Minimally Competent):
    - Applies basic computer operating functions and capabilities
    - Adheres to redaction protocols
    - Utilizes multiple forms of backup
  Highly Qualified:
    - Creates organizational systems for job documents

Content Area 3 (11%)
  Unqualified:
    - Misspells frequently
    - Frequently misuses specialized vocabularies
    - Misapplies rules of punctuation, grammar, word, and number usage
    - Is unfamiliar with common idioms/slang
  MAC (Minimally Competent):
    - Corrects errors before submitting final transcript
    - Uses specialized vocabularies appropriately
    - Possesses general vocabulary
    - Uses reference sources to ensure accuracy
    - Is familiar with idioms/slang
  Highly Qualified:
    - Possesses extensive general vocabulary
    - Applies research methods to verify citations
    - Creates a word list
Appendix F
Power Point Presentation
PASSING SCORE WORKSHOP - January 2011

Goals of the Workshop
- Review scope of practice
- Review MAC table
- Take test
- Rate scorable and pretest items
- Obtain cut score for

Purpose of a Licensing Examination
- To identify candidates who are qualified to practice safely

Cycle of Examination Development
- Occupational analysis, examination outline, item development, item revision, exam
  construction, passing score

Review of the Examination
- Identify overlapping test items
- Determine if an item needs to be replaced
- Determine if items need to be separated from another item on the examination

The Angoff Process
- Review scope of practice, discuss the concept of minimum competence, and take the
  examination

Bookmark Method
- Large group divided into smaller groups; review exam specifications and KSAs
- OIB: questions ordered from easiest to most difficult based on p-values/item
  difficulty; take the exam
- Discuss in small groups the KSAs the MAC candidate needs to answer each item
  correctly, and possible reasons why each succeeding item is more difficult than the
  previous one (additional KSAs needed); provide performance data (p-values for all
  items)
- Round 1: first bookmark placement, up to the item through which the MAC candidate will
  answer all items correctly and beyond which the candidate will get all items incorrect
- Discuss the rationale for the first bookmark placement, based on the lowest and
  highest placements in the group
- Round 2: second bookmark placement based on the small group discussion
- Large group discussion: summary from one member of each small group

Bookmark Method
- 2 small groups (random); answer all 50 items; obtain the key for all items
- Discuss in your groups: What KSAs are needed to answer each item? Why was each
  succeeding item more difficult than the previous one (what additional knowledge was
  needed to answer the next item)?
- Individually, based on the MAC candidate, place the first bookmark in each booklet (by
  content area) on the item up to which candidates would get all the items correct and
  beyond which candidates would get all the items incorrect; record each rater's
  placement in the computer by group, booklet (content area), and round (R1, R2, R3)
- Provide performance data (p-values) and explain them to SMEs
- Discuss in the group, by booklet, the rationale for the lowest and highest first
  placements; discuss the same 2 questions; independently place the 2nd bookmark if
  minds changed after discussion
- Big group discussion with a summary of each small group for each booklet; last chance
  to change the bookmark; record in the spreadsheet
References
Almeida, M. D. (2006). Standard-setting procedures to establish cut-scores for
multiple-choice criterion-referenced tests in the field of education: A comparison
between Angoff and ID Matching methods. Manuscript submitted as a final paper for the
EPSE 529 course.
Alsmadi, A. A. (2007). A comparative study of two standard-setting techniques. Social
Behavior and Personality, 35, 479-486.
American Educational Research Association, American Psychological Association, &
National Council on Measurement in Education. (1999). Standards for
educational and psychological testing. Washington, DC: American Educational
Research Association.
Brannick, M. T., Levine, E. L., & Morgeson, F. P. (2007). Job analysis: Methods,
research, and applications for human resource management. Thousand Oaks,
CA: Sage Publications, Inc.
Buckendahl, C. W., Smith, R. W., Impara, J. C., & Plake, B. S. (2002). A comparison of
Angoff and Bookmark standard setting methods. Journal of Educational
Measurement, 39(3), 253-263.
California Department of Consumer Affairs. (2011). More about the Department of
Consumer Affairs. Retrieved from
http://www.dca.ca.gov/about_dca/moreabout.shtml
Chinn, R. N., & Hertz, N. R. (2002). Alternative approaches to standard setting for
licensing and certification examinations. Applied Measurement in Education,
15(1), 1-14.
Cizek, G. J. (Ed.). (2001). Setting performance standards: Concepts, methods, and
perspectives. Mahwah, NJ: Erlbaum.
Cizek, G. J. (2001). Conjectures on the rise and call of standard setting: An
introduction to context and practice. In G. J. Cizek (Ed.), Setting performance
standards (pp. 80-120). Baltimore: John University Press.
Clauser, B. E., Harik, P., & Margolis, M. J. (2009). An empirical examination of the
impact of group discussion and examinee performance information on judgments made in
the Angoff standard-setting procedure. Applied Measurement in Education, 22(1), 1-21.
Davis-Becker, S. L., Buckendahl, C. W., & Gerrow, J. (2011). Evaluating the Bookmark
standard setting method: The impact of random item ordering. International
Journal of Testing, 11, 24-37.
Eagan, K. (2008). Bookmark standard setting. Madison, WI: CTB/McGraw-Hill.
Equal Employment Opportunity Commission, Civil Service Commission, Department of
Labor, & Department of Justice. (1978). Uniform Guidelines on employee
selection procedures. Federal Register, 43, (166), 38290-38309.
Ferrara, S., Perie, M., & Johnson, E. (2002). Matching the Judgmental task with standard
setting panelist expertise: The Item-Descriptor (ID) matching method. Setting
performance standards: The item descriptor (ID) matching procedure. Paper
presented at the annual meeting of the American Educational Research
Association, New Orleans, LA.
Giraud, G., Impara, J. C., & Plake, B. S. (2005). Teachers’ conceptions of the target
examinee in Angoff standard setting. Applied Measurement in Education, 18(3),
223-232.
Goodwin, L. D. (1999). Relations between observed item difficulty levels and Angoff
minimum passing levels for a group of borderline examinees. Applied
Measurement in Education, 12(1), 13-28.
Green, D. R., Trimble, C. S., & Lewis, D. M. (2003). Interpreting the results of three
different standard setting procedures. Educational Measurement: Issues and
Practice, 22(1), 22-32.
Guion, R. M. (1998). Assessment, measurement, and prediction for personnel decisions.
Mahwah, NJ: Lawrence Erlbaum Associates.
Hambleton, R. K., Jaeger, R.M., Plake, B. S., & Mills, C. (2000). Setting performance
standards on complex educational assessments. Applied Psychological
Measurement, 24(4), 355-366.
Hambleton, R. K., & Pitoniak, M. J. (2004). Setting passing scores on the CBT version of
the uniform CPA examination: Comparison of several promising methods.
Manuscript for presentation at NCME.
Hein, S. F., & Skaggs, G. E. (2009). A qualitative investigation of panelists’ experiences
of standard setting using two variations of the bookmark method. Applied
Measurement in Education, 22, 207-228.
Hurtz, G. M. & Auerbach, M. A. (2003). A meta-analysis of the effects of modifications
to the Angoff method on cutoff scores and judgment consensus. Educational and
Psychological Measurement, 63(4), 584-601.
Impara, J. C. & Plake, B. S. (1998). Teachers’ ability to estimate item difficulty: A test of
the assumptions in the Angoff standard setting method. Journal of Educational
Measurement, 35(1), 69-81.
Kane, M. (1998). Choosing between examinee-centered and test-centered standard setting
methods. Educational Assessment, 5(3), 129-145.
Kaplan, R. M., & Saccuzzo, D. P. (2005). Psychological testing: Principles, applications,
and issues (6th ed.). Belmont, CA: Thomson Wadsworth.
Karantonis, A. & Sireci, S. G. (2006). The Bookmark standard-setting method: A
literature review. Educational Measurement: Issues and Practice, 25(1), 4-12.
Kirkland v. New York Department of Correctional Services, 711 F. 2nd 1117 (1983).
Kramer, A., Muiktjens, A., Jansen, K., Dusman, H., Tan, L., & Vleuten, C. V. (2003).
Comparison of a rational and an empirical standard setting procedure for an
OSCE. Medical Education, 37, 132-139.
Lee, G., & Lewis, D. M. (2008). A generalizability theory approach to standard error
estimates for bookmark standard settings. Educational and Psychological
Measurement, 68, (4), 603-620.
Lewis, A. L. Jr. v. City of Chicago, Illinois, 2011, 7th Cir. 5/13/2011.
Lin, J. (2006). The Bookmark standard setting procedure: strengths and weaknesses.
The Alberta Journal of Educational Research, 52(1), 36-52.
Linn, R. L. & Drasgow, F. (1987). Implications of the golden rule settlement for test
construction. Educational Measurement: Issues and Practice, 6, 13-17.
Marino, R. D. (2007, September). Welcome to DCA. Training at the meeting of
Department of Consumer Affairs, Sacramento, CA.
MacCann, R. G. (2008). A modification to Angoff and Bookmarking cut scores to
account for the imperfect reliability of test scores. Educational and Psychological
Measurement, 68(2), 197-214.
Maurer, T. J., Alexander, R. A., Callahan, C. M., Bailey, J. J., & Dambrot, F. H. (1991).
Methodological and psychometric issues in setting cutoff scores using the
Angoff method. Personnel Psychology, 44, 235-262.
Meyers, L. S. (2006). Civil Service, the Law, and adverse impact analysis in promoting
equal employment opportunity. Tests and Measurement: Adverse Impact
Handout. Sacramento State.
Meyers, L. S. (2009). Some procedures for setting cut scores handout. Tests and
Measurement.
Mitzel, H. C., Lewis, D. M., Patz, R. J., & Green, D. R. (2001). The Bookmark
procedure: Psychological perspectives. In G. J. Cizek (Ed.). Setting performance
standards: Concepts, methods, and perspectives (pp. 249-281). Mahwah, NJ:
Lawrence Erlbaum Associates.
Murphy, K. R., & Davidshofer, C. O. (2005). Psychological testing: Principles and
applications. New Jersey: Pearson Prentice Hall.
National Research Council. (1999). Setting reasonable and useful performance standards.
In J. W. Pellegrino, L. R. Jones, & K. J. Mitchell (Eds.), Grading the
nation’s report card: Evaluating NAEP and transforming the assessment of
educational progress (pp. 162-184). Washington, DC: National Academy Press.
Nichols, P., Twig, J., & Mueller, C. D. (2010). Standard-setting methods as measurement
processes. Educational Measurement: Issues and Practice, 29(1), 14-24.
Office of Professional Examination Services. (2010). Informational Series No. 4.
Criterion-Referenced Passing Scores. Department of Consumer Affairs.
Olsen, J. B. & Smith, R. (2008). Cross validating Modified Angoff and Bookmark
standard setting for a home inspection certification. Paper presented at the Annual
Meeting of the National Council on Measurement in Education.
Plake, B. S. & Impara, J. C. (2001). Ability of panelists to estimate item performance for
a target group of candidates: An issue in judgmental standard setting.
Educational Assessment, 7(2), 87-97.
Plake, B. S., Impara, J. C., & Irwin, P. M. (2000). Consistency of Angoff-based
predictions of item performance: Evidence of technical quality of results from the
Angoff standard setting method. Journal of Educational Measurement, 37(4),
347-355.
Peterson, C. H., Schulz, E. M., & Engelhard, G., Jr. (2011). Reliability and validity of
Bookmark-based methods for standard setting: Comparisons to Angoff-based methods in the
National Assessment of Educational Progress. Educational Measurement: Issues and
Practice, 30(2), 3-14.
Professional Credentialing Services. Introduction to Test Development for Credentialing:
Standard Setting and Equating. www.act.org/workforce. (American College Test)
Reckase, M. D. (2006). A conceptual framework for a psychometric theory for standard
setting with examples of its use for evaluating the functioning of two standard
setting methods. Educational Measurement: Issues and Practice, 4-18.
Roberts, R. N. (2010). Damned if you do and damned if you don’t: Title VII and public
employee promotion disparate treatment and disparate impact litigation. Public
Administration Review, 70(4), 582-590.
Schmitt, K. & Shimberg, B. (1996). Demystifying Occupational and Professional
Regulation: Answers to questions you may have been afraid to ask. Council on
Licensure, Enforcement and Regulation (CLEAR): Lexington, KY.
Shrock, S. A., & Coscarelli, W. C. (2000). Criterion-referenced test development:
Technical and legal guidelines for corporate training and certification. (2nd ed.).
Washington, DC: International Society for Performance Improvement.
Skaggs, G., & Tessema, A. (2001). Item disordinality with the Bookmark standard setting
procedure. Paper presented at the annual meeting of the National Council on
Measurement in Education, Seattle, WA.
Skaggs, G., Hein, S. F., & Awuor, R. (2007). Setting passing scores on passage-based
tests: A comparison of traditional and single-passage bookmark methods. Applied
Measurement in Education, 20(4), 405-426.
Skaggs, G., & Hein, S. F. (2011). Reducing the cognitive complexity associated with
standard setting: A comparison of the single-passage bookmark and Yes/No
methods. Educational and Psychological Measurement, 71(3), 571-592.
Society for Industrial and Organizational Psychology (2003). Principles for the
validation and use of personnel selection procedures. Bowling Green, OH:
Author.
Stephenson, A. S. (1998). Standard setting techniques: A comparison of methods based on
judgments about test questions and methods based on judgments about test-takers
(Doctoral dissertation). Department of Educational Psychology and Special Education,
Southern Illinois University.
U.S. Department of Labor, Employment and Training Administration. (2000). Testing
and Assessment: An employer’s guide to good practices.
Wang, N. (2003). Use of the Rasch IRT model in standard setting: An item-mapping
method. Journal of Educational Measurement, 40(3), 231-253.
Yudkowsky, R., & Downing, S. M. (2008). Simpler standards for local performance
examinations: The Yes/No Angoff and Whole-Test Ebel. Teaching and Learning
in Medicine, 20(3), 212-217.