A COMPARISON OF THE ANFOFF AND THE BOOKMARK PASSING SCORE METHODS IN A LICENSURE EXAMINATION A Thesis Presented to the faculty of the Department of Psychology California State University, Sacramento Submitted in partial satisfaction of the requirements for the degree of MASTER OF ARTS in Psychology (Industrial/Organizational Psychology) by Maria Avalos SUMMER 2012 © 2012 Maria Avalos ALL RIGHTS RESERVED ii A COMPARISON OF THE ANFOFF AND THE BOOKMARK PASSING SCORE METHODS IN A LICENSURE EXAMINATION A Thesis by Maria Avalos Approved by: __________________________________, Committee Chair Lawrence S. Meyers, Ph.D. __________________________________, Second Reader Gregory M. Hurtz, Ph.D. __________________________________, Third Reader Robert L. Holmgren, Ph.D. ____________________________ Date iii Student: Maria Avalos I certify that this student has met the requirements for format contained in the University format manual, and that this thesis is suitable for shelving in the Library and credit is to be awarded for the thesis. __________________________, Graduate Coordinator Jianjian Qin, Ph.D. Department of Psychology iv ___________________ Date Abstract of A COMPARISON OF THE ANFOFF AND THE BOOKMARK PASSING SCORE METHODS IN A LICENSURE EXAMINATION by Maria Avalos Under the current economic environment, state licensing entities need to find standard setting methods that allow to set reliable passing scores and to save time during the exam development process. Bookmark and Angoff methods were explored. Data were obtained from 2 state licensure exams in 2 2-day workshops. Each examination had 100 4-multiple choice items. Each exam was divided in 2 sets of items, resulting in 4 sets total with 50 items each. Each set of items was pass pointed using the two methods. Results suggested that the Angoff method produced higher cut scores than the Bookmark method. In addition, SMEs felt more confident about their Angoff cut score than the Bookmark cut score. _______________________, Committee Chair Lawrence S. Meyers, Ph.D. _______________________ Date v ACKNOWLEDGEMENTS I thank all the valuable people in my life that made the completion of my thesis a reality. I thought this day would never come. I thank my committee chair, Dr. Lawrence Meyers, for his valuable guidance throughout the development of my thesis. Thank you for spending all that time reviewing my thesis and being so patient. Thank you for sharing all of our knowledge. I also thank Dr. Greg Hurtz, my second reader for reviewing my thesis and sharing you valuable expertise in passing score research. Thank you for those valuable comments and suggestions to improve my thesis. I thank my third reader, Bob Holmgren, supervisor at the Office of Professional Examination Services, for taking time from his busy schedule to review my thesis. I also want to thank you for understanding the importance of completing my thesis and education. Finally, I thank my family for understanding and supporting me through all of those days, nights, weekends, and important dates that I did not spend with them because I had to work on my thesis in order to complete it. I thank my husband Osvaldo for being so patient and pushing me to keep working and not quit “a la mitad del camino” (in the middle of the road). I want to thank my son Oliver, that being so little, he was able to understand that mommy had to do homework and could not play with him sometimes. 
vi TABLE OF CONTENTS Page Acknowledgments....................................................................................................... vi List of Tables .............................................................................................................. ix Chapter 1. VALIDITY AND EMPLOYMENT SELECTION …………………………….. 1 Civil Service System .................................................................................................... 1 Legislative Mandates Against Discrimination in Employment ................................... 3 Litigation in Employment Discrimination ................................................................... 4 Professional Standards for Validity ............................................................................. 8 Validity ...................................................................................................................... 10 2. LICENSURE TESTING ....................................................................................... 19 Minimal Acceptable Competence .............................................................................. 20 State of California and Licensure .............................................................................. 22 Job Analysis ............................................................................................................... 25 Subject Matter Experts............................................................................................... 31 3. STANDARD SETTING METHODS ................................................................... 33 Criterion-Referenced Passing Score at Office of Professional Examination Services ..................................................................................................................... 33 Criterion Methods Based on Test Taker Performance ............................................... 34 Nedelsky’s Method Based on Subject Matter Expert Judgments of Test Content .... 35 Angoff’s Method........................................................................................................ 36 vii Ebel’s Method Based on Subject Matter Expert Judgments of Test Content ............ 41 Direct Consensus Method .......................................................................................... 42 The Item Descriptor Method ...................................................................................... 42 Bookmark Method ..................................................................................................... 43 Purpose of this Study ................................................................................................. 54 4. METHOD ............................................................................................................. 56 Participants................................................................................................................. 56 Materials .................................................................................................................... 57 Research Design ........................................................................................................ 59 Procedures.................................................................................................................. 60 5. RESULTS ............................................................................................................. 67 6. 
DISCUSSION ....................................................................................................... 76 Appendix A. Agenda ................................................................................................ 81 Appendix B. Angoff Rating Sheet ............................................................................ 82 Appendix C. Bookmark Rating Sheet ....................................................................... 83 Appendix D. Evaluation Questionnaire .................................................................... 84 Appendix E. MAC Table .......................................................................................... 85 Appendix F. Power Point Presentation ..................................................................... 86 References ................................................................................................................... 87 viii LIST OF TABLES Tables Page 1. Summary of Experimental Design ....................... .……………………………….59 2. Cut Scores, Standard Deviations, Confidence Intervals, and Reliability Between Rounds……………………………….…. ................ ……………………………. 68 3. Summary of Angoff and Bookmark Final Cut Scores for each Set of Items…………………… ................. ………….…………………………………. 70 4. Summary of McNemar Test for Significance of the Proportions of Candidates Passing by the Angoff Method versus the Bookmark Method.…………………. 73 5. Summary about Standard Setting Process…………….…………………………. 74 ix 1 Chapter 1 VALIDITY AND EMPLOYMENT SELECTION Civil Service System In the history of employment testing, discrimination against citizens of a protected class has been a major problem that had deprived individuals from employment opportunities. Before the civil rights movement, employers were using procedures for selecting job applicants that were not linked to a job analysis of the profession. Selection procedures illegally discriminated against minority groups based on race, color, religion, sex, or natural origin. Those selection procedures were not job-related. After many court cases, the government started to request content validity evidence for those employee selection procedures that demonstrate unfair treatment of minority groups. Origin of the Civil Service System Around the World According to Kaplan and Saccuzzo (2005), most of the major changes in employment testing have happened in the United States during the last century. The use of testing however might have its origins in China more than 4000 years ago. The use of test instruments by the Han Dynasty (206 B.C.E. to 220 C.E.) was quite popular in the areas of civil law, military, and agriculture. During the Ming Dynasty (1368-1644 C.E.), a testing program was developed in which special testing centers were created in different geographic locations. The Western countries learned about testing programs through the Chinese culture. In 1832, the English East India Company started to select employees for overseas 2 duties using the Chinese method. After 1855, the British, French, and German governments adopted the same method of testing for their civil service (Kaplan & Saccuzzo, 2005). Civil Service System in the United States According to Meyers (2006), the spoils system started with George Washington and Thomas Jefferson. During the spoils system, newly elected administrations appointed to government administrative posts ordinary citizens, (regardless of their ability to perform the jobs) who had contributed to the campaigns of the winners of the elections. 
These new appointees replaced those persons appointed by the previous administration. The spoils system reached its peak during administration of Andrew Jackson from 1829 to 1837. The spoils system was in conflict with equal employment under the law because making employment decisions based on race, sex, and national origin rather than on people’s qualifications are examples of discrimination and are unjust. The civil service system was created to correct these problems. Equal opportunity needed federal legislation and psychometric input to be effective (Meyers, 2006). Guion (1998) explains that in 1871, U.S. Congress authorized the creation of a Civil Service Commission to establish competitive merit examinations as a way to reform the spoils system and bring stability and competence to government but it was soon ended by President Grant (p. 9). Meyers (2006) states that the Civil Service Commission required that all examinations were job-related. The Pendleton Act created a permanent civil service system in 1883. The government was required to select personnel by using competitive merit examinations but discrimination in employment did not stop there. The 3 civil service originally covered fewer than 15,000 jobs but that rapidly expanded over the years to all federal jobs. The civil service system was not intended to remove discrimination in employment. It just helped to ensure that those applicants taking job examinations are evaluated on the basis of merit (Meyers, 2006). Legislative Mandates Against Discrimination in Employment Civil Rights Act 1964 Congress passed the Civil Rights Act in 1964 to make discrimination explicitly illegal in employment (Guion, 1998). Title VII of the Civil Rights Act of 1964 addresses Equal Employment Opportunity and prohibits making employment-related decisions based on an employee’s race, color, religion, sex, or national origin. The Act prohibits the use of selection procedures that result in adverse impact unless the employer demonstrates that it is job-related. An updated Civil Rights Act was passed in 1991 and addressed the burden of proof in disparate impact cases brought under Title VII (Meyers, 2006). Age Discrimination in Employment Act 1967 The Age Discrimination in Employment Act of 1967 (ADEA) followed the Civil Rights Act and it refers to the discrimination of individuals of at least 40 years of age in the employment process. ADEA prohibits discrimination in employment testing. Applicants of this group must be provided with equal employment opportunities. When an older worker files a discrimination report under the ADEA, the employer must present evidence showing that in the job process the requirement of age is job-related. Employers must have documented support to defend their employment practices as being job-related. 4 “ADEA covers employers having 20 or more employees, employment agencies, and labor unions” (U.S. Department of Labor, 2000). The Uniform Guidelines on Employee Selection Procedures The Uniform Guidelines on Employee Selection Procedures published in the Federal Register in 1978 were developed by the Equal Employment Opportunity Commission (EEOC), the Civil Service Commission, and the Labor and Justice Departments (U.S. Department of Labor, 2000). The Guidelines describe the 80% rule to identify adverse impact, which compares rates of selection or hiring of protected class groups to the majority group. Adverse impact is observed when the protected groups are selected at a rate that is less than 80% of the majority group. 
When meaningful group differences are found, there is a prima fascia case that the employer engaged in illegal discrimination. The employer is then required to show documentation of the validity of the selection procedures as being job-related (Meyers, 2006). “The Guidelines cover all employers employing 15 or more people, labor organizations, and employment agencies” (U.S. Department of Labor, 2000). In addition, employers follow the Guidelines when making employment decisions based on tests and inventories. Litigation in Employment Discrimination Griggs versus Duke Power Company Several critical court decisions have had an impact on the employee selection process. In the 1971 Supreme Court case of Griggs versus Duke Power Company, African American employees charged that they were being prevented from promotional 5 jobs and that it was based on their lack of a high school education and their performance on an intelligence test. These two employment practices were unrelated to any aspect of the job. The Court ruled that employment selection examinations had to be job-related and that employers could not use practices that resulted in adverse impact in the absence of intent and that the motive behind those practices are not relevant (Guion, 1998). Albemarle Paper Company versus Moody In the 1975 Supreme Court case Albemarle Paper Company versus Moody, African American employees claimed that they were only allowed to work in lower-paid jobs. The Court ruled that selection examinations must be significantly linked with important elements of work behavior relevant to the job which candidates are being evaluated. The Court found that a job analysis was required to support the validation study (Shrock & Coscarelli, 2000). Kirkland versus New York State Department of Correctional Services In the 1983 case of Kirkland versus New York State Department of Correctional Services, the Court ruled that identifying critical tasks and knowledge as well as the competency required to perform the various aspects of the job was an essential part of job analysis. They concluded that the foundation of a content valid examination is the job analysis (Kirkland v. New York Department of Correctional Services, 711 F. 2nd 1117, 1983). Golden Rule versus Illinois In the 1984 Golden Rule versus Illinois out-of-court case, the Golden Rule Insurance Company and five individuals who had failed portions of the Illinois insurance 6 licensing exams sued the Educational Testing Services (ETS) regarding the development of the tests. The fundamental issue was the discriminatory impact of the licensing exam on minority groups. The key provision in the case was that preference should be given in test construction, based in a job analysis, to the questions that showed smaller differences in black and white candidates’ performance. The main elements agreed to, included: Collection of racial and ethnic data from all examinees on a voluntary basis. Assignment of an advisory committee to review licensing examinations. Classification of test questions into two groups: (I) questions for which (a) the correct-answer rates of black examinees, white examinees, and all examinees are not lower than 40% at the .05 level of statistical significance, and (b) the correctanswer rates of black examinees and white examinees differ by no more than 15% at the .05 level of statistical significance, and (II) all other questions. 
Creation of test forms using questions in the first group described above in preference to those questions in the second group, if possible to do so and still meet the specifications for the tests. Within each group, questions with the smallest black-white correct-answer rate differences would be used first, unless there was good reason to do otherwise. Annual release of a public report on the number of passing and failing candidates by race, ethnicity, and educational level. Disclosure of one test form per year and pretesting of questions (Linn & Drasgow, 1987). 7 ETS agrees to reduce the discrepancy in the scores received by black and white candidates. ETS will “give priority to items for which the passing rates of blacks and whites are similar and to use items in which the white passing rates are substantially higher only if the available pool of Type 1 items is exhausted” (Murphy, 2005, p. 327). Ricci versus DeStefano In the 2009 case Ricci versus DeStefano, White and Hispanic firefighters alleged reversed discrimination in light of Title VII and also 14th Amendment’s promise of equal protection under the law. When the city found that results from promotional examinations showed that White candidates outperformed minority candidates, they decided to throw out the results based on the statistical racial disparity. The Supreme Court held that by discarding the exams, the City of New Haven violated Title VII of the Civil Rights Act of 1964. The city engaged in outright intentional discrimination (Roberts, 2010). Lewis versus City of Chicago In the 2011 case of Lewis versus City of Chicago, several African-American applicants who scored in the qualified range and had not been hired as candidate firefighters filed a charge of discrimination with the Equal Employment Opportunity Commission (EEOC). The Court ruled that the City of Chicago had used the discriminatory test results each time it made hiring decisions on the basis of that policy. In addition, an employment practice with a disparate impact can be challenged not only when the practice is adopted, but also when it is later applied to fill open positions. The case was settled and the city must pay for discriminating against black firefighter candidates dating back to a 1995 entrance exam. The city will also have to hire 111 8 qualified black candidates from the 1995 test by March 2012 and pay $30 million in damages to the 6000 others who will not be selected from a lottery system to re-take tests (Lewis v. City of Chicago, 7th Cir, 2011). As a result of these Court decisions regarding the use of job-related selection practices, illegal discrimination has significantly been reduced. Although the government efforts for eliminating discriminatory practices in the area of employment testing have had positive consequences to reduce these illegal practices, it has not totally stopped. Some other Court decisions have also created some controversy for test developers and employers regarding the use of statistical data to evaluate the validity and reliability of test items instead of the job-relatedness of the items. The development of regulations that describe in detail how to avoid discriminatory practices against minority groups was another tool for test developers and users to comply with the law and reduce illegal discrimination. 
Professional Standards for Validity Uniform Guidelines on Employee Selection Procedures The Uniform Guidelines on Employee Selection Procedures (Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, & Department of Justice, 1978) were developed by the Department of Labor, the EEOC, the Civil Service Commission, and the Department of Justice in 1978. Their purpose was to promote a clear set of principles to help employers comply with the laws that prohibit illegal discrimination. The Uniform Guidelines established employment selection practices that meet the law requirements for equal opportunity and that are job-related 9 using a job analysis (Brannick, Levine, & Morgeson, 2007, p. 169). The requirements for conducting a job analyses include identifying the tasks and knowledge, skills, and abilities (KSAs) that comprise successful performance of the job. Only those critical tasks and KSAs should be used as the basis of selection. The Uniform Guidelines promote the use of valid selection practices and provide a framework for the proper use of selection examinations. “The Uniform Guidelines remain the official statement of public policy on assessment in employment” (Guion, 1998). Standards for Educational and Psychological Testing The Standards for Educational and Psychological Testing (Standards, 1999) were developed by the American Educational Research Association (AERA), American Psychological Association (APA), and the National Council on Measurement in Education (NCME). “The Standards are the authoritative source of information on how to develop, evaluate, and use tests and other assessment procedures in educational, employment, counseling, and clinical settings” (U.S. Department of Labor, 2000). Although developed as professional guidelines, they are consistent with applicable regulations and are frequently cited in litigation involving testing practices. The Standards (1999) reflect changes in the United States Federal law and measurement trends affecting validity; testing individuals with disabilities or different linguistic backgrounds; and new types of tests as well as new uses of existing tests. The Standards are written for the professionals and addresses professional and technical issues of test development and use in education, psychology, and employment. 10 Principles for the Validation and Use of Personnel Selection Procedures The Principles for the Validation and use of Personnel Selection Procedures (Principles, 2003) were developed by the Society for Industrial and Organizational Psychology. The purpose of the Principles (2003) is to “specify established scientific findings and generally accepted professional practice in the field of personnel selection psychology in the choice, development, evaluation, and use of personnel selection procedures designed to measure constructs related to work behavior with a focus on the accuracy of the inferences that underlie employment decisions”. The Principles promote the use of job analyses and provide assistance to professionals by providing guidelines for the evaluation, development and use of testing instruments. The Principles have been a source of technical information concerning employment decisions in court. The Principles is intended to be consistent with the Standards. The Principles intend to inform decision making that is related to the statutes, regulation, and case law regarding employment cases (Brannick et al., 2007). 
As a result of the different Court decisions and the passage of testing regulations, employers are required to use job-related procedures for selection of employees in order to avoid discrimination. Employers need to follow the established guidelines and base their selection procedures on an established job analysis in order to provide content validation evidence in the case of adverse impact. Validity State licensure examinations are developed to identify applicants that are qualified to practice independently on the job. The state of California is mainly interested in 11 identifying the test taker who possesses the minimal acceptable competencies required for the job. Licensure examinations have to be linked to an exam plan or exam outline that resulted from an empirical study/occupational analysis in order to be legally defensible and fair to the target population of applicants. The exam plan, items, and cut scores must also be reviewed, discussed, and approved by subject matter experts (SMEs) of the specified profession. These and more details about the test development process are established to ensure existence of the most important standard in psychological testing which is validity. The conception of validity has evolved over time. The Standards (1950) included criterion, content, and construct validity in its definition. Based on the Standards (1999), the conceptualization of validity has become more fully explicated and expanded. “Validity refers to the degree to which evidence and theory support the interpretation of test scores entailed by proposed uses of tests” (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999). Validation is a process of accumulating evidence. Tests are not declared to be valid until enough evidence has been produced to demonstrate its validity. The validity of tests can only be declared for certain uses but cannot be declared in general. It is necessary to accumulate evidence, based in a job analysis, to demonstrate that competencies measured in the test are the same and equally important across different situations. Validity is not a property of the test. Validity is a judgment tied to the existing evidence to support the use of a test for a given purpose. Test validation experts should 12 say that there is sufficient evidence available to support using a test for the particular purpose. Validity is the most important professional standard that has been established to ensure a legally defensible exam development process. There are multiple techniques for collecting evidence of validity. The Standards propose five sources of validity evidence in order to provide a sound valid argument. Validity Evidence Based on Test Content There are various sources of evidence that may be used to evaluate a proposed interpretation of test scores for particular purposes. Evidence based on test content refers to themes, wording, and format of the items, tasks, or questions on a test as well as guidelines for procedures regarding administration and scoring. It may include logical or empirical analyses of the adequacy with which the test content represents the content domain and of the relevance of the content domain to the proposed interpretation of test scores. It can also come from SMEs’ judgments of the relationship between parts of the test and the construct. 
In a licensure test, the major facets of the specific occupation can be specified, and SMEs in that occupation can be asked to assign test items to the categories defined by those facets. SMEs can also judge the representativeness of the chosen set of items. Some tests are based on systematic observations of behavior and the use of expert judgments to assess the relative importance, criticality, and frequency of the tasks. A job sample test can be constructed from a random or stratified random sampling of tasks rated highly on these characteristics. 13 The appropriateness of a given content domain is related to specific inferences to be made from test scores. Thus, when considering an available test for a purpose other than that for which it was first developed, it is important to evaluate the appropriateness of the original content domain for the new use. Evidence about content-related validity can be used in part to address questions about differences in the meaning or interpretation of test scores across relevant subgroups of examinees. Of particular concern is the extent to which construct underrepresentation or construct-irrelevant components may give an unfair advantage or disadvantage to one or more subgroups of examinees. In licensure examinations, the representation of a content domain should reflect the KSAs required to successfully perform the job. All the items in a test should be linked to the tasks and KSAs of the minimal acceptable competencies from an occupational analysis of the specified profession in order to be legally defensible. SMEs must agree that the details of the occupational analysis are representative of the population of candidates for which the test is developed. Careful review of the construct and test content domain by a diverse panel of SMEs may point to potential sources of irrelevant difficulty or easiness that require further investigation (AERA et al., 1999). Validity Evidence Based on Response Process Evidence based on response processes comes from analyses of individual responses and provides evidence concerning the fit between the construct and the detailed nature of performance or response actually engaged in by examinees. The Standards explain that response process issues should also include errors made by the raters of 14 examinee’s performance. To the extent that identifiable rater errors are made, responses of the test takers are confounded in the data collection procedure. Questioning test takers about their performance strategies to particular items can yield evidence that enriches the definition of a construct. Maintaining records of development of a response to a writing task, through drafts or electronically monitored revisions, also provides evidence of process. Wide individual differences in process can be revealing and may lead to reconsideration of certain test formats. Studies of response process are not limited to the examinee. Assessments often rely on raters to record and evaluate examinees’ performance. Relevant validity evidence includes the extent to which the processes of raters are consistent with the intended interpretation of scores. Thus, validation may include empirical studies of how raters record and evaluate data along with analyses of the appropriateness of these processes to the intended interpretation or construct definition (AERA et al., 1999). 
Validity Evidence Based on Internal Structure Evidence based on internal structure refer to the analyses of a test that indicate the degree to which the relationships among test items and test components conform to the construct on which the proposed test score interpretations are based. The conceptual framework for a test may imply a single dimension of behavior, or it may suggest several components that are each expected to be homogeneous, but that are also distinct from each other. Studies of internal structure of tests are designed to show whether particular items may function differently for identifiable subgroups of examinees. Differential item 15 functioning (DIF) occurs when different groups of examinees with similar overall ability, or similar status on an appropriate criterion, have, on average, systematically different responses to a particular item. Test components such as reliability, item characteristic curves (ICC), inter-item correlations, item-total correlations, and factor structure are some statistical sources that are part of the new Standards. ICCs provide invaluable information regarding the capability of the individual items to distinguish between test takers of different ability or performance levels. They also can indicate the difficulty level of the item. Inter-item correlations inform us of the degree of shared variance of the items and the item-total correlations describe the relationship between the individual item and the total test score. Inter-item and item-total correlations also drive the reliability of the test. A statistical analysis of the administered test data can address most of these areas (AERA et al., 1999). Factor structure of the test is another part of the internal structure evidence. Items that are combined together to form a scale or subscale on a test should be selected based on empirical evidence to support the combining of designated items. The medium to obtain this evidence is factor analysis, which identifies items sharing sufficient variance for a viable underlying dimension to be statistically extracted. Thus, the scoring of the test must be tied to the factor structure of the test. When scoring involves a high level of judgment on the part of those doing the scoring, measures of inter-rater agreement, may be more appropriate than internal consistency estimates. Test users are obligated to make sure that these statistical analyses are performed and that the results are used to evaluate and improve the test (AERA et al., 1999). 16 Validity Evidence Based on Relations to Other Variables Evidence based on relations to other variables refers to the relationship of test scores to variables external to the test. According to the Standards, these variables include measures of some criteria that the test is expected to predict, as well as relationships to other tests hypothesized to measure the same constructs, and tests measuring related or different constructs. Measures such as performance criteria are often used in employment settings. It addresses the degree to which the relationships are consistent with the construct underlying the proposed test interpretations. There are three sources of evidence based on relations to other variables (AERA et al., 1999). Convergent and divergent evidence. Convergent and divergent evidence is provided by relationships between test scores and other measures intended to assess similar constructs, whereas relationships between test scores and measures of different constructs provide divergent evidence. 
Experimental and correlational evidence can be involved. Test-criterion correlations. Test-criterion correlations refers to how accurately test scores predict criterion performance. The degree of accuracy depends on the purpose for which the test was used. The criterion variable is a measure of a quality of primary interest determined by users. The value of a test-criterion study depends on the reliability and validity of the interpretation based on the criterion measure for a given testing application. The two designs that evaluate test-criterion relationships are predictive study and concurrent study. 17 Predictive study. A Predictive study indicates how accurately test data can predict criterion scores that are obtained at a later time. A highly predictive test can maintain temporal differences of the performance of the practical situation, but without providing the information necessary to judge and compare the effectiveness of assignments used. Concurrent study. A concurrent study obtains predictor and criterion information at the same time. Test scores are sometimes used in allocating individuals to different treatments, such as jobs within an institution. Evidence is needed to judge suitability of using a test when classifying or assigning a person to one job versus another. Evidence about relations to other variables is also used to investigate questions of differential prediction for groups. Differences can also arise from measurement error. Test developers must be careful of criterion variables that are theoretically appropriate but contain large amounts of error variance. Validity generalization. Validity generalization is the “degree to which evidence of validity based on test-criterion relations can be generalized to a new situation”. Metaanalytic analyses have shown that much of the variability in test-criterion correlations may be due to statistical artifacts such as sampling fluctuations and variations across validation studies in the ranges of test scores and in the reliability of criterion measures. Statistical summaries of past validation in similar situations may be useful in estimating test-criterion relationships in a new situation. Transportability and synthetic validity are part of validity generalization (AERA, et al., 1999). Transportability. Transportability has to do with the use of a test developed in one context that is brought into a new context. In order to use the test in a new context, 18 sufficient documentation must be produced to justify the use of the test such as that the two jobs share the same set of tasks and KSAs and that the candidate groups are comparable. Synthetic validity. Synthetic validity involves combining elements that have their own associated validity evidence together to form a larger test. Elements are usually fundamental ability modules, but there are needs to be sufficient documentation that the fundamental elements are related to the domain that is being tested by the larger test. These elements can be groups of related tasks and KSAs which can even be associated with certain assessment strategies. A job analysis can reveal which tasks or task groups are represented in a particular job. Results of a single local validation study may be quite imprecise, but there are situations where a single study, carefully done, with adequate sample size, provides sufficient evidence to support test use in a new situation. 
Evidence Based on Consequences of Testing In evidence based on consequences of testing, it is important to distinguish between evidence that is directly relevant to validity and that may inform decisions about social policy but falls outside the realm of validity. A few of the benefits of tests are placement of workers in suitable jobs or prevention of unqualified individuals from entering a profession. Validity can be found by examining whether or not the proposed benefits of the test were obtained. A fundamental purpose of validation is to indicate whether these specific benefits are likely to be realized. This source of evidence is gained by ruling out confounds which can unjustly cause group differences and invalidate the test (AERA et al., 1999). 19 Chapter 2 LICENSURE TESTING Tests are widely used in the licensing of persons for many professions. According to the Standards (1999), licensing requirements are imposed by state governments to ensure that licensed people possess knowledge and skills in sufficient degree to perform important occupational activities safely. Tests in licensure are intended to provide the state governments with dependable mechanism for identifying practitioners who have met particular standards. The standards are strict, but not so stringent as to unduly restrain the right of qualified individuals to offer their services to the public. Licensing also serves to protect the profession by excluding persons who are deemed to be not qualified to do the work of the occupation. Qualifications for licensure typically include educational requirements, supervised experience, and other specific criteria as well as attainment of a passing score on one or more examinations. Tests are used in licensure in a broad spectrum of professions, including medicine, psychology, teaching, real state, and cosmetology (AERA et al., 1999). Licensure tests are developed to identify that the candidates have mastered the essential KSAs of a specified domain. The purpose of performance standards is to define the degree of KSAs needed for safe and independent practice. The Standards (Standards 14.14, 14.15, and 15.16) propose that in order to protect the public, tests should be consistent with the minimal KSAs required to practice safely when test takers obtain their 20 license. Knowledge that is attained after getting a license and experience on the job should not be included on the licensure test. Standard 14.17 proposes that the level of performance required for passing a licensure test should depend on the KSAs necessary for minimal acceptable competence in the profession and the proportion of people passing the test should not be adjusted. The adjustment of cut scores lowers the degree of validity of the test. Test development begins with an adequate description of the profession, so that persons can be clearly identified as engaging in the activity. A definition of the nature and requirements of the current profession is developed. A job analysis of the work performed by the job incumbents is conducted. The essential KSAs are documented and identified by qualified specialists in testing and SMEs in order to define test specifications. Multiple-choice tests are one of the forms of testing that are used in licensure as well as oral exams (AERA et al., 1999). Minimal Acceptable Competence In order to qualify to take a licensure examination, it is necessary that candidates meet the education, training, and experience required for entry-level practice for the profession. 
In addition, some professions allow candidates to demonstrate their competence to perform the essential functions of the job with the intention to avoid a situation where qualified candidates are denied a license because they did not meet the educational requirement specified by the law (Schmitt & Shimberg, 1996). According to the Standards (1999), the plan of a testing program for licensure must include the description of the areas to be covered, the number of tests to be used, 21 and the method that is going to be used to combine the various scores and obtain the overall result. An acceptable performance level is required on each examination. The Standards (1999) explain that “defining the minimum level of knowledge and skill required for licensure is one of the most important and difficult tasks facing those responsible for licensing”. It is critical to verify the relevance of the cut score of a licensure examination (Standard 14.17). The accuracy of the inference drawn from the test depends on whether the standard for passing differentiates between qualified and unqualified performance. It is recommended that SMEs specify the performance standards required to be demonstrated by the candidate. Performance standards must be high enough to ensure protection of the public and the candidate, but not so high to restrict the candidate from the opportunity to obtain a license. The Standards (1999) state that some government agencies establish cut scores such as a 75% on licensure tests without the use of SMEs. The resulting cut scores are meaningless because without the existence of details about the test, job requirements, and how they are related it is impossible to have an objective standard setting (Standard 14.14 and 14.17). Licensure testing programs need to be accurate in the selection of the cut score. Computer-based tests may end when a decision about the candidate’s performance is made or when the candidate reached the allowed time. Consequently, a shorter test may be provided for candidates whose performance exceeds or falls far below the minimum performance required for a passing score. Mastery tests do not specify, “how badly the candidate failed, or how well the candidate passed”, providing scores that are higher or 22 lower than the cut score could be confusing (AERA et al., 1999). Nevertheless, candidates who fail are likely to profit from information about the areas in which their performance was weak. When feedback to candidates about performance is intended, precision throughout the score range is needed (Standard 14.10). State of California and Licensure Civil service procedures are established by the State Constitution and the Government Code. The State had two organizations that enacted regulations based on these statutes. These organizations were the State Personnel Board (SPB) and the Department of Personnel Administration (DPA). SPB had authority under Article VII of the State Constitution to oversee the merit principle. Article VII of the State Constitution says that all public employees should be selected based on merit and with the use of a competitive examination. Article VII establishes the constitutional structure of California’s modern civil system and eliminates the political spoils system. Based on Article VII, the SPB’s responsibilities included civil service examinations, the formal establishment of job classifications, and discipline. 
DPA was responsible for functions including pay, day-to-day administration of the classification plan, benefits, all other conditions of employment, and collective bargaining. In order to eliminate overlapping of activities and have a more efficient system, the human resource management responsibilities performed by SPB and DPA were consolidated into one organization. Effective July 1, 2012, the name of the new organization is California Department of Human Resources (CalHR). The CalHR preserves the merit principle in state government 23 as required by Article VII of the State Constitution (Department of Consumer Affairs, 2007). State Licensing Professional licenses are issued by state agencies known as boards or bureaus. In the State of California, the Department of Consumer Affairs (DCA) is the main state agency that regulates professional licenses. A number of statutes set criteria for the licensing process in California. These include the California Government Code section 12944 of the California Fair Employment and Housing Act and the Business and Professions Code section 139 (Office of Professional Examination Services). The California Government Code section 12944 of the California Fair Employment and Housing Act prohibits discrimination by any state agency or authority in the State and Consumer Services Agency, which has the authority to grant licenses that are prerequisites to employment eligibility or professional status. The Department of Fair Employment and Housing can accept complaints against licensing boards where discrimination is alleged based on race, creed, color, national origin or ancestry, sex, age, medical condition, physical disability or mental disability unless such practice can be demonstrated to be job-related. Business and Professions Code section 139 establishes a policy that sets minimum requirements for psychometrically sound examination validation, examination development, and occupational analyses. Section 139 requires a review of the appropriateness of prerequisites for admission to the examination. The annual review applies to programs administered by the DCA and they must report to the director. The 24 review must include the method employed for ensuring that every licensing examination is periodically evaluated. Section 139 ensures that the examinations are fair to candidates and assess job-related competencies relevant to current and safe practice (Board Licensing Examination Validation and Occupational Analysis, 2006). The Department of Consumer Affairs (DCA) issues more than 200 professional licenses in the state of California. The State establishes requirements and competence levels for licensure examinations. The DCA licenses practitioners, investigates consumer complaints, and controls violators of state’s law. The DCA also watches over licensees from unjust competition by unlicensed practitioners (Marino, 2007). The California Department of Consumer Affairs (2011) has changed its way to manage licensure. In 1876, the State established its first board in the medical ground. The Medical Practice Act established entry-level standards for physicians, created licensing examinations, and imposed fees for violations. Over the next thirty years, other occupations were controlled by the State. In 1929, the Department of Professional and Vocational Standards was established by combining 10 different boards. 
Accountants, architects, barbers, cosmetologists, dentists, embalmers, optometrists, pharmacists, physicians, and veterinarians were licensed by these boards. In 1970 the Consumer Affairs Act changed the name to the Department of Consumer Affairs. Currently, the DCA controls more than 40 professions in the State. Licensing professionals who pass licensure examinations and meet state requirements guarantees that only competent practitioners are legally allowed to serve the public. In addition, it ensures that consumers are allowed to try different alternatives if a 25 professional service is not done competently. Educational programs are provided to licensees in order to continue their learning and maintain their professional competence (Marino, 2007). The Office of Professional Examination Services (OPES) provides examinationrelated services to DCA’s regulatory boards and bureaus. OPES ensures that licensure examination programs are fair, valid, and legally defensible. The OPES perform occupational analysis, conduct exam item development, evaluate performance of examinations, and consult on issues pertaining the measurement of minimum competency standards for licensure (Department of Consumer Affairs, 2007). Job Analysis In order for an organization to have the most reliable information about a job and make legal employment decisions, a job analysis should be conducted. A job analysis is a comprehensive, rigorous approach to highlighting all the important aspects of a job. Several definitions describe job analysis. The Standards (1999) define job analysis as a general term referring to the investigation of positions or job classes to obtain descriptive information about job duties and tasks, responsibilities, necessary worker characteristics, working conditions, and/or other aspects of the work (Standard 14.8 and 14.10). The Guidelines refer to job analysis as “a detailed statement of work behaviors and other information relevant to the job” (Sec 14B and 14C). Brannick et al. (2007) provide a detailed definition of job analysis in which they describe additional elements. First, a systematic process is necessary to meet the requirements of a job analysis. The job analyst specifies the method and the steps to be 26 involved in the job analysis. Second, the job must be broken up into smaller components. Alternatively, the components might be different units, such as requirements for visual tracking or problem solving. Third, the results of the job analysis may include any number of different products, such as a job description, a list of tasks, or a job specification. As a result, job analysis is defined as the systematic process of discovery of the nature of a job by dividing it into smaller components, where the process results in written products with the goal of describing what is done in the job and the capabilities needed to successfully perform the job. The major objective is to describe worker behavior in performing the job, along with details of the essential requirements. Job Analysis and the Law State agencies receive several legal penalties as a result of unfair treatment in employment practices. Employment laws refer to fairness in access to work as the main topic. The overall principle is that individuals should be selected for a job based on merit rather than social group’s identification by such features as sex, religion, race, age, or disability. The Constitution. 
The Fifth and Fourteenth Amendments are occasionally mentioned in court and protect people’s life, liberty, and property by stating that people should not be denied of them without a legal process. The Fourteenth Amendment relates to the state and the Fifth Amendment applies to the federal government. The language of these laws does not specify specific employment procedures. The Fourteenth Amendment has been used in reverse discrimination cases. Reverse discrimination refers to “claims of 27 unlawful discrimination brought by members of a majority group, such as the white male group, which is also protected under the law” (Brannick et al., 2007). Equal Pay Act. The Equal Pay Act (1963) requires employers to pay men and women the same salary for the same job, that is, equal pay for equal work. Employers cannot give two different titles to the same job, one for men and another for women. The Equal Pay Act does not require equal pay for jobs that are different (Brannick et al., 2007). Civil Rights Act. In 1960, it was usual that some jobs were reserved for whites and other jobs for blacks. An example of this practice is that of the electric power plant Duke Power in which laborer jobs were given to blacks and management jobs were given to whites. The Civil Rights Act of 1964 established that these employment practices were illegal. As a result, the Duke Power began to allow blacks to apply for the other jobs. Originally, Duke Power required a high school diploma to apply for laborer jobs. It dropped the diploma requirement for those who could pass two tests. The passing score was established at the median for those who completed high school. The result of this procedure, although without intention, was that it excluded most blacks from moving to upper level jobs. One of the affected workers sued the company because this was illegal. In Griggs versus Duke Power Company (1971), the Court ruled that Duke Power’s procedure was illegal because Duke never proved that the high school diploma and test scores were related to job performance. The ruling stated that tests should be used to assess how well individuals fit a specific job. The intent of the procedures were not relevant, but the significance of the results of those procedures. 28 Title VII of the Civil Rights Act (1964) and its amendments (1972 and 1991) prohibit employers from discriminating on the basis of race, color, sex, religion, or national origin. The EEOC was established by the Civil Rights Act in order to implement it. The Act applies to all conditions or privileges of employment practices. It is specified that selection practices that do not impact members of one of the protected groups are legal unless covered by another law. Thus, in some cases, practices that look unfair are legal. In some other cases, practices that reject a higher proportion of one protected group than of another protected group may be illegal unless they are job related. These practices are alleged to produce an adverse impact. In some cases, practices that seem fair at first glance have an adverse impact if actually performed. Age Discrimination in Employment Act (ADEA). The Age Discrimination in Employment Act (1967) prohibits discrimination against people 40 years of age and older. The ADEA establishes a protected class of anyone 40 years of age and older. It promotes that companies make employment decisions based on ability rather than age. 
As a result, it would be legal to select a 25-year-old candidate to a 35-year-old candidate because of age, but it would not be legal to prefer a 25-year-old candidate to a 40-yearold candidate because of age. The ADEA does not require companies to select less qualified older workers over more qualified younger workers (Guion, 1998). Rehabilitation Act. The Rehabilitation Act (1973) intends to provide equal employment opportunities based on handicap. The word handicap basically refers to the term disability. Applying only to federal contractors, the Rehabilitation Act states that qualified candidates with handicaps should not be discriminated under any program or 29 activity receiving federal assistance because of his or her handicaps (Brannick et al., 2007). Americans with Disabilities Act (ADA). The Americans with Disabilities Act (1990) prohibits discrimination against people with disabilities. According to Brannick et al. (2007) “Disability is broadly defined as referring to both physical and mental impairments that limit a major life activity of the person such as walking or working. Disability also includes those individuals who have a history of impairment and those who are regarded as disabled, regardless of their actual level of impairment”. The ADA imposes employers to provide reasonable accommodation to people with disabilities and protects qualified candidates with disabilities. “A qualified individual with a disability is a person who, with or without reasonable accommodation, can successfully complete the essential functions of the job” (Cizek, 2001). The ADA is unclear in determining if an accommodation is reasonable or not. Overall, accommodations are considered to be reasonable unless it would result too expensive to the employer to make the adjustments. Enforcement of Equal Employment Opportunity Laws. There are two main organizations that intend to enforce the equal employment opportunity (EEO) laws: the EEOC and the U.S Office of Federal Contract Compliance Programs (OFCCP). The OFCCP manages only federal companies and the EEOC controls all the other companies. The first guidelines for personnel selection were established by the EEOC in 1970 and the courts applied them in discrimination court cases. Businesspeople worried about the application of the guidelines in court cases because it was not clear whether companies 30 had to spend large amounts of money to try to comply with the guidelines. Furthermore, other organizations such as the U.S. Department of Labor (DOL) possess their own guidelines. The Uniform Guidelines on Employee Selection Procedures were implemented by five federal agencies: the EEOC, the Office of Personnel Management (OPM), the DOL, the Department of Justice (DOJ), and the Department of the Treasury (Brannick et al., 2007). Professional Standards. In 1999, the American Educational Research Association, the American Psychological Association and the National Council on Measurement in Education, updated the Standards for Educational and Psychological Testing. The Standards provide details about good practices in test development used in the assessment of people. The Standards distinguish test fairness and selection bias. Selection bias is the technical view of the relationship of test scores and performance on the job. On the other hand, test fairness is a nontechnical ethical component resulting from social and political opinions. As a result, steps to avoid test bias can be addressed by the Standards, but not how to maintain test fairness. 
In 2003, the Society for Industrial and Organizational Psychology published the latest edition of the Principles for the Validation and Use of Personnel Selection Procedures. The Principles illustrate good practice in the development and evaluation of personnel selection tests. Uses of Job Analysis. Job analyses are used for selection purposes. In personnel selection, employers collect job applicant’s information in order to make an employment decision. Based on the information, employers decide if the applicant is qualified to do 31 the job. Another purpose of job analysis is to establish wages to employees. Differences in wages should be based on a seniority system, a merit system, measures of quantity of production, or some quality other than sex (Brannick et al., 2007). Furthermore, job analyses are required for disability and job design purposes. The ADA prohibits discrimination against candidates based on their disabilities. Employers may be required to redesign the job so people with disabilities are not required to perform not essential activities. Reasonable accommodations must be provided to permit people with disabilities to do the essential functions of the job. The EEOC provides guidelines to decide if a function is essential. The employer has to evaluate if the position exists to execute the function, the availability of other employees that could perform the function, and if special KSAs are required to perform the function. Elements that can help to decide if a function is essential are the employer’s opinion, a job description, how much time is spent in performing the function, outcome of the lack of people to perform the function, and past performance of the job. A job analysis is not required in order to meet ADA requirements. Nevertheless, the EEOC suggests that employers would benefit if they have a systematic study of the job position. Subject Matter Experts According to the Principles (2003) participation of SMEs in the development of licensure exams is necessary and guarantees that the exams truly assess whether candidates have the minimally acceptable KSAs necessary to perform tasks on the job safely and competently. In addition, SMEs distinguish the work behaviors and other activities required to perform the job effectively. 32 The selection of SMEs by state entities significantly influences the quality and defensibility of the exams. Therefore, SMEs should demonstrate familiarity with the essential job characteristics such as shift, equipment, and location. It is recommended that the group of SMEs represent the current population of practitioners, geographic location, ethnicity, gender, and practice setting. Detailed evidence of the methods used in the development of the selection procedures based on the judgment of SMEs should be documented (Society for Industrial and Organizational Psychology, 2003). 33 Chapter 3 STANDARD SETTING METHODS According to the Standards (2003), the establishment of “one or more cut points dividing the score range to partition the distribution of scores into categories” is a significant part of the process of exam development. In licensure exams, these categories define and differentiate candidates who are qualified to obtain a professional license based on the minimum passing score established by the state. The cut scores influence the validity of test interpretations by representing the rules and regulations to be used in those interpretations. 
Cut scores are established through different methods and for different purposes that need to follow several procedures to make them legally defensible (Standard 4.20). State licensure typically uses 3 to 4 performance levels to set cut scores on their licensure exams to characterize a test taker’s performance. The process used to set cut scores is called Standard Setting. There are several acceptable methods to set standards. Criterion-Referenced Passing Scores at the Office of Professional Examination Services The Office of Professional Examination Services (OPES) uses a criterionreferenced passing score for licensure examinations. The method “applies standards for competent practice to all candidates regardless of the form of the examination administered” (Office of Professional Examination Services, 2010, para.1). A criterionreferenced passing score ensures that candidates passing the exam are qualified to 34 perform the job competently. OPES uses a modified Angoff method in standard setting. The group process includes practitioners with different years of experience to represent different aspects of the profession and entry-level competence. The process starts with the development of definitions of the minimally acceptable level of competence for safe practice. The group compares essential performance behaviors of the highly qualified, minimally qualified, and unqualified candidate. The difficulty of licensure exams varies from one exam to another. Thus, having an unchanging passing score for different forms of the exam will not reflect the minimally acceptable competence making it difficult to legally defend the passing score. The use of a criterion-referenced method lowers the passing score for an examination with a large number of difficult questions and raises the passing score for an examination that has a small number of difficult questions. The resulting passing score is intended to protect the public and the candidate because it is based on the difficulty of the questions in the exam and not on performance with respect to the group as is the case in a norm-referenced strategy. Criterion Methods Based on Test Taker Performance Contrasting Groups Method In the contrasting groups method, the scores on a licensure exam of a qualified group are compared to the scores of a nonqualified group. The nonqualified group should not be qualified for the license but the group should still be demographically similar to the qualified group when possible. Furthermore, the mandatory KSAs for successful performance of the job should represent all the KSAs that would be reflected by the nonqualified group (Meyers, 2009). One benefit of the contrasting group method is that 35 SMEs are required to make judgments about people with whom they are familiar, rather than about a hypothetical group of test-takers. The task of teachers rating their students is an illustration of this situation (Hambleton, Jaeger, Plake, & Mills, 2000). The identification of raters that are out of place or misclassified is also possible, because overlap of score distributions can be noticed directly. Marginal Group Method A marginal group is represented by those candidates that possess the minimum level of KSAs required to perform the job tasks safely. Only one group of candidates is identified as marginally qualified to obtain a license. The candidates take the test and a cut score is established at their common performance level (Meyers, 2009). 
Incumbent Group Method In this method, one group of already licensed practitioners takes the exam. The resulting cut score is established such that most of the qualified job incumbents would pass the exam. This is the least favored method because not enough information about the incumbents is available in order to set the cut score, resulting in inaccurate statement that all members of the group are competent at the time of testing (Meyers, 2009). Nedelsky’s Method Based on Subject Matter Expert Judgments of Test Content The Nedelsky procedure was specifically designed for multiple-choice items (Kane, 1998). The basic idea of the Nedelsky method is to understand the possible test score of a candidate possessing less knowledge of the content domain than would be required to be a successful licensee. SMEs rate each question in order to remove the options that even a nonqualified candidate would possibly identify as the wrong answer. 36 As a result of the elimination of the obvious wrong answers, candidates would have demonstrated their competence and start to guess among the remaining options of the question. The Nedelsky method intends to obtain the score these candidates would achieve by this approach. “An item guessing score is calculated for each item by dividing the number of remaining choices into 1 and is then summed across the items yielding an estimated test guessing score” (Meyers, 2009, p. 2). A measure of central tendency such as the average of scores is used as the final cut score. Cizek (2001) states that the Nedelsky method has been used in the medical environment because it is assumed that a minimally qualified candidate must reject the wrong options as they would harm the public. Angoff’s Method According to Kane (1998), the Angoff method is the most common standard setting method applied on high-stakes achievement tests. The Angoff method has been applied to a large number of objective tests used for licensure exams, without major criticism from the judges involved in the process. William Angoff developed the method in 1971and it is based on judgment about whether a minimally qualified candidate at a particular level of achievement would select the right answer to each question. The judges answer each question and subsequently talk about performance behaviors that delineate a highly qualified, qualified, and unqualified candidate. The job of the judges is to estimate the percentage of minimally qualified candidates who would answer each question correctly. The judges are instructed to think 37 about a group of 100 minimally qualified candidates at each achievement level. The questions are presented one at a time, in the order that they appear on the test. Then, judges rate the difficulty of each question based on the percentage of minimally qualified candidates who would get each question right (Nichols, Twig, & Mueller, 2010). Wang (2003) states that judges usually use the range of 70% to 80% when using the Angoff method. In addition, judges use 90% or 95% for easy items. In the case of hard items, judges provide ratings of 50% or 60%, but never provide ratings lower than 50%. At the end of the process, the averages of the judges ratings for each question on the test are determined. The performance standards on the total score scale are established by combining these question averages. The averages can then be calculated to set each performance standard for the test (Hambleton et al., 2000). 
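Before turning to the literature on the Angoff method, a minimal sketch (in Python, with hypothetical SME judgments) of the Nedelsky guessing-score computation quoted earlier in this section may make the arithmetic concrete; the option counts below are illustrative only.

# Nedelsky method: for each item, SMEs eliminate the options that even a
# nonqualified candidate would recognize as wrong; the item guessing score is
# 1 divided by the number of options that remain.
remaining_options = [2, 3, 1, 4, 2]               # hypothetical counts for five items

item_guessing_scores = [1.0 / k for k in remaining_options]
test_guessing_score = sum(item_guessing_scores)   # estimated test guessing score

print(round(test_guessing_score, 2))              # 2.58 for these five items

As described above, each SME's estimated test guessing score would then be combined across SMEs with a measure of central tendency, such as the average, to obtain the final cut score.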
Literature on Components of the Angoff Method Some key components contribute to better psychometric results when using the Angoff method. One of these components is the use of SMEs in the passing score process. It has been found that the participation of job incumbents, supervisors, or anyone who is an expert in a profession produces more accurate cut scores than participation of people who are not experts (Maurer, Alexander, Callahan, Bailey, & Dambrot, 1991). The valuable expertise and knowledge of judges produces more reliable cut scores in the judgment process than those produced by judges who do not possess the knowledge necessary to rate the difficulty of the items. Another component to the psychometric quality of the resulting passing score is the training that SMEs receive for the standard setting process (Plake, Impara, & Irwin, 38 2000). Research has specified that training SMEs on the KSAs required for the exam, results in higher agreement about the difficulty of the items on the exam. Moreover, Impara and Plake (1998) noted that the validity of the Angoff method is in question if SMEs cannot perform the difficult task of estimating item difficulty. Thus, panelists need to receive training in order to be familiar with the required KSAs as well as the standard setting method used. Although the use of performance data in Angoff method is not essential, performance data can be another aspect of the Angoff method. Performance data provided to the SMEs helps to avoid underestimation of the difficulty of the items (Yudkowsky & Downing, 2008). According to Hurtz and Auerbach (2003), the use of normative data during passing score studies did not have any effect in the variability of SMEs’ ratings. In addition, the use of normative data had a tendency to lower the ratings of the items producing lower cut scores. The outcomes of using performance behavior descriptors during the Angoff method have been explored. Judges have indicated that having pre-established behavioral descriptors helped them focus on the task of making judgments about items (Chinn & Hertz, 2002). Hurtz and Auerbach (2003) suggest that when the group develops its own performance behaviors definitions of a minimally competent candidate, there is higher agreement in the ratings. Furthermore, when descriptions of the candidate are less definitive, there is more variety in the descriptions given by the raters about the candidate than when the description presented by the workshop facilitators are more definitive in terms of predicted performance. Thus, previous definitions of performance of the 39 candidate in certain ways and more or less accurately influence judgment of competence (Giraud, Impara, & Plake, 2005). Strengths and Weaknesses of the Angoff Method Researchers have studied the strengths and weaknesses of the Angoff method for standard setting. One of the strengths of the Angoff method is that the method can be applied to a wide range of exam formats such as those containing multiple-choice questions (Stephenson, 1998). In addition, the Angoff method was the most favored by SMEs and the most time efficient method (Yudkowsky & Downing, 2008). Furthermore, judges using the Angoff method express high confidence in their ratings due to the perception that the standards are reasonable judgments (Alsmadi, 2007). It has been found that the Angoff method has valuable psychometric qualities. Stephenson (1998) found strong intrajudge and interjudge reliabilities using a modified Angoff method. 
Although the Angoff method possesses valuable qualities, it has also received a lot of criticism. The limitations of the Angoff method are associated with the subjectivity of the judgment task (Alsmadi, 2007). It is clear that the determination of percentages about candidates who would answer each item correctly has been found to be complex and subjective (Plake et al., 2000). Moreover, research about the validity of the method suggests that judges are unable to estimate the difficulty of the items accurately for barely qualified candidates (Goodwin, 1999; Plake & Impara, 2001). This leads them to set cut scores that are different than what they have it in mind based on the KSAs of highly qualified candidates 40 (Skaggs, Hein, & Awuor, 2007). In addition, evidence indicates that judges find it difficult to rate the questions before and after they have talked about the questions with the rest of the group. As a result, judges overrate the probability of success on difficult items and underrate the probability of success on easy items. It has been found that judges are more likely to overestimate performance on difficult items than to underestimate that of easy items (Clauser, Harik, & Margolis, 2009). Similarly, Kramer et al. (2003) indicate that ratings resulting from the Angoff method were inconsistent and resulted in a low reliability. Some organizations that invest in exam development process consider that the Angoff method is complex, time-consuming, and expensive. In addition, the Angoff method may be less convenient when it is used for performance-based testing than when it is used for written tests (Kramer et al., 2003). The National Research Council (1999) disagrees with the psychometric value of the Angoff method. The reasons presented are that the resulting standards seem unrealistic because too few candidates are judged to be advanced relative to many other common conceptions of advanced performance in a discipline. The next reason is that the results differ significantly by type and difficulty of the items. According to the National Research Council (1999), cut scores for constructed-response items are usually higher than cut scores for multiple-choice items. Another reason for their disagreement is that there is evidence showing that it is hard for judges to accurately calculate the chances that qualified candidates would answer each item correctly (National Research Council, 1999). 41 Following the publication of the National Research Council judgment on the Angoff method, Hambleton et al. (2000) presented a review of the qualities of the method. They identified the simplicity of the Angoff method as one advantage. Hambleton et al. (2000) described that the accumulated method allows the judges to differentially assess the questions in the test, giving higher value to those questions they feel are more important. However, judges occasionally consider that the method tends to break the assessment into small, isolated components. Even though the purpose of the weighting method is to provide a complete analysis of the whole assessment, judges sometimes perceive that the extended Angoff method might not consider the holistic nature of the performance. Also, questions have been raised about the capability of judges to make the necessary rating judgments (Hambleton et al., 2000). The recognition of the Angoff method as the most widely used method for setting standards on a lot of settings is acknowledged by researchers. 
For many years, the Angoff method has been applied to numerous objective tests used for licensure and state programs. Ebel’s Method Based on Subject Matter Expert Judgments of Test Content Ebel’s (1972) method is applied in two phases in which judges classify the difficulty and the importance of the items. The three levels of difficulty are easy, medium, and difficult; and the four levels of importance are critical, important, acceptable, and questionable. The results are 12 groups of items. SMEs are asked to organize each item into one of these 12 groups. The next phase is to estimate the percentage of items that a minimally acceptable candidate (MAC) should get right from 42 each group (Cizek, 2001). In each group, the estimated percentage is multiplied by the total number of items in that group. The cut score is determined by obtaining the average of each group of items (Meyers, 2009; Kane, 1998). Direct Consensus Method In the Direct Consensus Method (DCM), the test items are separated into groups according to the content specifications of the exam plan. SMEs are then instructed to identify the number of test items that a minimally qualified candidate should answer correctly. The items’ scores are summed and then SMEs scores combined to get the pass point for the first round of ratings. Then the groups of SMEs discuss the rationale for their ratings for each group of items. In some cases, SMEs are provided with item performance data. The purpose of the DCM is that SMEs reach consensus on the final cut score through an effective process in which aspects such as scope of practice, content and difficulty of each group of items, opinions of other SMEs, and review of performance data are involved (Hambleton & Pitoniak, 2004). The Item Descriptor Method The Item Descriptor Method (IDM) emerged in 1991 (Meyers, 2009). According to Almeida (2006), the IDM was developed when the Maryland State Department of Education considered necessary to update proficiency level descriptions in the Maryland School Performance Program. The IDM is used when performance level descriptors are linked to test items. In 1999, the method was modified in Philadelphia. Item response theory (IRT) is used to order the items in a booklet placing one item per page. SMEs are instructed to go through the different groups of items and then 43 link the items to the performance level descriptor that they believe reflects successful performance. The IDM shares characteristics with other methods such as the Angoff and the Bookmark (to be described in the next section) methods. Common applications of these methods involve panelists to make judgments about items to identify a cut score, involve more than one round of judgments, utilize performance level descriptors as a basis for judgment about items, and in most cases, SMEs are provided with performance data (Ferrara, Perie, & Johnson, 2002). The application of standard setting methods in licensing exams depends on the characteristics of the exams. Some standard setting methods are more suited to be used with multiple-choice item exams; other standard setting methods might provide better results for performance exams. Test users need to consider these and other qualities about exams when setting standards for those exams. Bookmark Method The Bookmark method is currently one of the most popular standard-setting methods across different testing settings such as educational and licensing. However, research to support the validity of the method is limited (Olsen & Smith, 2008). 
The Bookmark method represents a relatively new approach (Kane, 1998) and was developed to address perceived limitations of the Angoff method, which has been the most commonly applied procedure (Mitzel, Lewis, Patz, & Green, 2001). The Bookmark method requires the use of IRT methodology. Ordinarily, the theta ability parameter is estimated for the situation in which we want to know the probability of answering an item correctly, that is, p(x = 1). This probability is a function of the a, b, c, and theta parameters (Meyers, 1998). For the hypothetical MAC candidate, the required probability of answering the question correctly is set at 2/3, that is, the probability that the item will be answered correctly 2 out of 3 times. Setting the c parameter equal to 0 in the 3PL model and solving for theta at this probability, the result for a given item is theta = b + 0.693/(1.7a). Items are then placed in an Ordered Item Booklet (OIB) based on the theta value needed to answer each question correctly 2 times in 3 (Meyers, 2009). According to Lin (2006), the OIB is a representation of the actual exam and of what is being tested. In addition, the OIB is the medium used to make the cut score ratings of the items in the examination. The items appear one per page in the OIB; the first item is the easiest and the last item is the most difficult. In various modifications of the Bookmark method, SMEs were shown p-values for the items instead of the IRT values, because the order of the items based on p-values was the same as the order that would result from the IRT model (Buckendahl, Smith, Impara, & Plake, 2002; Skaggs & Hein, 2011). The Bookmark method is intended to lessen the difficulty of estimating item difficulty for borderline examinees by having SMEs examine the OIB, in which the test items are arranged in order of difficulty (Skaggs et al., 2007). The judges receive an item map, with the test items ordered in terms of their empirical difficulty level (Kane, 1998). SMEs, working in small groups with small sets of items, start on page 1 of the OIB. Their instructions are to identify the page where the chance of minimally acceptable candidates answering the item correctly falls below 2/3, or below the designated response probability (RP) if some other value is chosen (Meyers, 2009). Moreover, SMEs in the Bookmark method have to answer and discuss two questions: a) What does the item measure? and b) What makes this item more difficult than the items that precede it in the OIB? (Lee & Lewis, 2008). The individual SME bookmark estimates are often shared with the group, and discussion occurs regarding the SME bookmark placements. Impact data from examinee scores can be shared with the SMEs. There are usually two or three rounds, with the SME minimum passing scores reviewed and discussed in each round (Olsen & Smith, 2008). Each round is intended to help increase consensus and reduce differences among the SMEs. During Round 1, SMEs in small groups examine each item in the OIB, discussing what each item measures and what makes the item harder than those before it. After this discussion, each SME determines a cut score by placing a bookmark in the OIB according to his or her own judgment of what students should know and be able to do at each of the performance levels (Eagan, 2008). SMEs then engage in two more rounds of placements. In round 2, SMEs discuss the rationale behind their original placement with other SMEs at their table.
In round 3, SMEs at all tables discuss their placements together. After each round of discussion, SMEs may adjust or maintain their placements. Impact data, that is the percentage of students in that state that would fall below each bookmark, is introduced to participants during the third round. After the final round of placements, the recommended score is 46 calculated by taking the mean of all bookmark placements in the final round (Eagan, 2008). Based on the final cut scores set, performance level descriptors are then written by the SMEs. Performance descriptors define the specific KSAs held by test takers at a given performance level. Items prior to the bookmarks reflect the content that students at this performance level are expected to be able to answer correctly with at least a 0.67 likelihood. The KSAs required to respond successfully to these items are then synthesized to formulate the description for this performance level. Performance level descriptors become a natural extension of the cut score setting procedure (Lin, 2006). According to Peterson, Schulz, and Engelhard (2011), the Bookmark and Angoff methods are different from one another in several ways. During the Angoff method, SMEs review each item and rate the probability of a minimally qualified candidate of selecting the right answer. The cut score is the sum of average of the SMEs’ ratings. In the case of the Bookmark method, SMEs use a booklet with the items ordered from easiest to most difficult. The final cut score is the score at which candidates have a specified probability of answering the most difficult item from the booklet correct (Peterson et al., 2011). Studies have compared the resulting cut scores from standard setting studies using the Bookmark and the Angoff methods. In the study by Buckendahl et al. (2002), the Bookmark method allowed teachers to focus on the likely performance of the barely proficient student (BPS). Although there was a small difference in final cut scores between the two methods, the standard deviation decreased for the second round of the 47 Bookmark method compared with the Angoff method. Thus, the Bookmark method produced a lower standard deviation. Consequently, a narrower range of possible cut scores was produced, indicating a higher level of inter-judge agreement. Wang (2003) provides evidence to justify the use of an item-mapping method (Bookmark-based) for establishing cut scores for licensure examinations. The itemmapping method incorporates item performance in the process by graphically presenting item difficulties. It was noted that item-mapping method sets lower cut scores than the Angoff method. Another finding was that the predicted percentages of passing were lower for the Angoff method than for the item-mapping method (Wang, 2003). Green, Trimble, and Lewis (2003), compared the Bookmark method to the Contrasting Groups method in order to establish credible student performance standards by using this multiple procedure approach. The Bookmark method produced lower cut scores in the Novice/Apprentice and Apprentice/Proficient levels. The Contrasting Groups method produced lower cut scores in the Proficient/Distinguished. The final cut score set by the synthesis group (subset of participants from the other methods) were closer to the Bookmark cut scores than the other two methods. Olson (2008) found that Bookmark ratings were consistently below the results from the Angoff ratings. 
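A minimal sketch (Python) may help tie together the pieces of the Bookmark method described above: the theta level at which a minimally acceptable candidate has a 2/3 chance of success, the ordering of items into the OIB, and the averaging of final bookmark placements. The item parameters and placements below are hypothetical, and the final mapping from the mean placement to an operational cut score varies across implementations; this sketch simply takes the item at the rounded mean placement.

import math

# Hypothetical 3PL item parameters with c fixed at 0, so that
# theta = b + ln(2) / (1.7 * a) is the ability at which P(correct) = 2/3.
items = {                         # item id: (a, b)
    "item_01": (1.2, -0.50),
    "item_02": (0.8,  0.10),
    "item_03": (1.5,  0.45),
    "item_04": (1.0,  1.20),
}

def theta_at_rp_two_thirds(a, b):
    """Ability at which the probability of a correct response equals 2/3."""
    return b + math.log(2) / (1.7 * a)

# Ordered item booklet (OIB): one item per page, easiest first.
oib = sorted(items, key=lambda name: theta_at_rp_two_thirds(*items[name]))

# Final-round bookmark placements (page numbers) from a hypothetical panel.
final_placements = [2, 3, 3, 2, 3]
mean_placement = sum(final_placements) / len(final_placements)   # 2.6

# One convention: locate the cut at the theta of the item at the rounded mean
# placement, which can then be translated to a raw or scaled score.
cut_item = oib[round(mean_placement) - 1]
cut_theta = theta_at_rp_two_thirds(*items[cut_item])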
Findings About Cut Scores and Rater Agreement Research on the consistency of rater agreement (Wang, 2003) indicates similar distributions of variation agreement, but different distribution patterns across four exams. Judges provided more consistent ratings in the Bookmark item-mapping method than in the Angoff method. All ratings from the Bookmark item-mapping method reached rater 48 agreement higher than 0.95, whereas the rater agreements for the Angoff method ranged from 0.796 to 0.922 (Wang, 2003). This was not the case for the study by Yin and Sconing (2008) in which researchers found that cut scores were generally consistent for an Angoff-based method (item rating) and a Bookmark-based method (mapmark). Skaggs and Hein (2011) found similar cut scores when comparing the Bookmark method. This is also the case for the study by Olsen (2008), in which the results indicated similar cut scores for the Modified Angoff and Bookmark methods. Lin (2006) found evidence of similar cut scores being set with the Modified Angoff and Bookmark standard setting procedures. In addition, better inter-rater agreement was found for the Bookmark cut scores. Providing impact of performance data between rounds influences SMEs’ ratings. Buckendahl et al. (2002) found that the second round cut score in the Angoff method dropped by a point and a half after performance data was given. In the case of Bookmark, the second round cut score increased by two points (Buckendahl et al., 2002). Lee and Lewis (2008) suggest that in order to decrease the standard error for cut scores from the Bookmark method, increasing the number of small groups was more efficient than increasing the number of participants. Increasing the number of participants within groups also decreased the standard error, but the use of more groups was more efficient. This new strategy contributes to the increase in reliability of the cut score. Strengths and Weaknesses of the Bookmark Method Karantonis and Sireci (2006) acknowledged that the extensive use and acceptance of the Bookmark method indicate the logical appeal and practicality of the procedure. 49 Research has identified the advantages of the Bookmark method and these include: (a) essentially no data entry, (b) the ability to similarly handle multiple choice and constructed response items, (c) time efficiency, (d) defensible performance level descriptions that are a natural outcome of setting the cut points, and (e) cut scores based on a comprehensive understanding of the test content (Cizek, 2001). Moreover, the Bookmark method meets Berk’s criteria for defensibility of standard setting methods and is a relatively sound procedure in terms of technical adequacy and practicability (Lin, 2006). The Bookmark method allows to: (a) reduce cognitive complexity; (b) connect performance descriptors with content of assessments; (c) promote better understanding of expected student performance; (d) accommodate multiple cut scores; (e) accommodate multiple test forms; and (f) obtain low standard error of the cut scores (Lin, 2006). Hambleton et al. (2000) noted that panelists respond positively to the Bookmark method because item ratings are avoided and the method can handle both selected and constructed item formats. Also, panelists decide how much knowledge and skills would be reflected by basic, proficient, and advanced examinees (Hambleton et al., 2000). In a single-passage Bookmark study, the presentation of the items in separate booklets reduced the cognitive complexity of the judgment task. 
The items in each booklet referred to the same passage instead of a single booklet with all the items about different passages (Skaggs & Hein, 2011). Thus, separate booklets should be used when different passages or content areas are used in a test. 50 One of the main limitations of the Bookmark method is the disordinality of items within the OIB. Item disordinality refers to the disagreement among SMEs on the ordering of the items in the OIB (Lin, 2006). The disordinality of items affects panelist’s bookmark placements during a standard setting procedure (Skaggs & Tessema, 2001; Lin, 2006). Typically, panelists do not agree on the way items are ordered in the OIB because they may not be able to estimate item difficulty accurately. As a result, variability of cut scores among panelists increase, and standard error of the final score increases accordingly. Panelists reported having a difficult time when trying to evaluate the difficulty and order of the items in a booklet, because the test was comprised of reading passages with multiple selected-response items. Several judges insisted that several items should have been placed before other items at different locations in the booklet (Skaggs & Tessema, 2001). Karantonis and Sireci (2006) identified potential bias to underestimate cut scores and problems in selecting the most appropriate response probability value for ordering the items presented to panelists in the item booklet of the Bookmark method (Karantonis & Sireci, 2006). Similarly, Lin (2006) identified the choice of RP, as one problem of the method. This is the approach to displaying the performance data on the reporting scale in the final determination of performance standards. This is the ability level at which a test taker has a 50% or 75% chance of success. There is a concern that this simple variation might considerably impact performance standards and the resulting item difficulty locations (Hambleton et al., 2000). According to Karatonis and Sireci (2006), participants felt more comfortable using an RP value of .67 than .50. The 50% chance was more 51 difficult to understand because it reflects an even chance and a 67% was easier to understand because it refers to a mastery task statement. Furthermore, the exclusion of important factors other than the difficulty parameter and restrictions of the IRT models are also weaknesses of the Bookmark method (Lin, 2006). Some other factors about the examination items such as previous performance data, importance to the job or exam plan relevance should also be considered when ordering the items in the OIB. The percentage failed in the context of defining minimal competency has been studied (MacCann, 2008). The length of a test has been found to influence the percentage of people identified below the level of minimal competence. MacCann (2008) proposed a formula to adjust cut scores, which modifies the cut score to reduce the incidence of those students who do not deserve to fail and that are currently failing due to errors of measurement. The price to be paid for this result is that examinees that deserved to fail on true scores passed due to errors of measurement. On balance, it seems good result as the favorable shift is much larger than the unfavorable one at a relatively low level of reliability. Thus, this technique reduces the number of students who failed due to errors of measurement but deserved to pass on true scores. 
Although lowering the cut scores has positive consequences in the reduction of false negatives, test users should consider the cost of certifying competence in areas where incompetence could be life threatening, such as medical fields (MacCann, 2008). Negative bias has been found (Reckase, 2006) as a result of ordering the items from easy to hard, starting at the beginning of the booklet, and identifying the items with probabilities below .67. Reckase (2006) recommends that panelists also receive a booklet with the items ordered from hard to easy and look for the first item with a probability above .67; this reversed bookmark selection would probably result in bias in cut score estimation in the opposite direction, so using the average of the two bookmark placements would probably reduce the bias in the estimated cut score. Furthermore, to reduce the standard error of the Bookmark method, several placements could be made using booklets built from different subsets of items and the resulting cut scores averaged (Reckase, 2006). Participants' Experience in Standard Setting Studies Using the Bookmark Method In a study of cognitive experiences during standard setting, judges indicated that even though they were not very clear about what the process would involve at the beginning of the study, after a few items were rated they became more comfortable with the Bookmark process (Wang, 2003). In addition, panelists have expressed high levels of confidence in the passing score and in the process (Buckendahl et al., 2002). Panelists are able to understand the task and show confidence in the standard setting procedure (Karantonis & Sireci, 2006). Thus, judges concluded that their final Bookmark cut scores were very close to their conception of appropriate standards (Green et al., 2003). Judges agreed that the Bookmark item-mapping method set more realistic cut scores than the Angoff method (Wang, 2003). In studies focused on qualitative experiences in standard setting, participants perceived that their own item ordering was unique and reflected their own BPS rather than generalizing to other teachers' circumstances (Hein & Skaggs, 2009). The smaller amount of time involved in the item-mapping method in comparison with the Angoff method was acknowledged and appreciated by the judges (Wang, 2003). Thus, judges were not frustrated about going through all the items. Judges have perceived that they are allowed to focus on the likely performance of the BPS without the challenge of characterizing that performance relative to absolute item difficulty (Buckendahl et al., 2002). Studies of the cognitive experiences of raters with the Bookmark method suggest that choosing a specific test item as the bookmark placement was perceived as a nearly impossible task (Hein & Skaggs, 2009). Judges emphasized that the problem was not a lack of understanding of the procedure, but the difficulty of the task itself. Research also suggests that even though participants expressed understanding of the procedure, some participants used alternative bookmark strategies during the remaining rounds of bookmarking (Hein & Skaggs, 2009). Judges did not view their own professional judgments about an appropriate placement as an adequate basis for making such decisions. Instead, they viewed their placement as needing to be informed and justified by some external factor, such as a state proficiency standard. Some participants showed confusion and frustration about the item order.
Participants considered that the order of the items was incorrect and wanted to reorder the items (Hein & Skaggs, 2009). In the evaluation of the Bookmark method, strengths of the Bookmark method outweigh its weaknesses. The Bookmark method remains a promising procedure for standard setting with its own strengths and limitations. More research is needed for this relatively new method for standard setting. The use of different sources of information in order to adopt a cut score might benefit the resulting standard (Green et al., 2003). 54 Purpose of this Study The increasing need of legally defensible, technically sound, and credible standard setting procedures is a common challenge for state licensure exams. In addition, in the current economic environment, government entities in charge of issuance of licenses would benefit from standard setting methods that allow saving time and money in the examination development process as well as protecting the public and the test taker. For these reasons, researchers need to study new methods for standard setting that provide reliable cut scores. The Bookmark method is a new procedure that has been positively accepted by test users. Despite the increasing popularity and evidence of reduced complexity of the process, the Bookmark method for standard setting still lacks evidence on its validity to support its status as a best practice in licensure testing contexts. The purpose of this study is to evaluate the effectiveness of the Bookmark method compared to the modified Angoff method. The aim of the study is to answer the following question: Do the Bookmark and Angoff methods lead to same results when used on the same set of items for a state licensure exam? Another goal of the study is to evaluate SMEs’ cognitive experiences during the standard setting process and the resulting cut scores. This evaluation will be accomplished with the use of a 17-item questionnaire about the effectiveness of the standard setting methods. Most of research studies on Bookmark-based methods provide good evidence of the reliability and validity of Bookmark-based methods (Peterson et al., 2011). The cut 55 scores for the same content area resulting from Angoff-based and Bookmark-based methods clearly converged, providing evidence of reliability of Bookmark-based methods. Procedural validity of the Bookmark-based methods was supported as panelist understanding of the tasks and instructions in Bookmark-based methods was higher than in Angoff-based methods and panelists ratings of the reasonableness and defensibleness and of their confidence in the final cut scores were high, and higher than in the Angoffbased methods (Peterson et al., 2011). As used in Skaggs, Hein, and Awuor (2007) study of passage-based tests, this study used separate booklets for the different content areas of the examination instead of a single ordered item booklet for the Bookmark method. For example, if the examination contains five content areas, there were five ordered item booklets. Also, by producing more data points resulting from the separate booklets, reliability of the cut score should be higher in that it should be easier to rate the items that belong to a same content area because of the similar formatting that exist in different content areas of the examination. 56 Chapter 4 METHOD Participants A total of fifteen SMEs participated in the standard setting studies. Seven SMEs participated in workshop 1 and a different group of eight SMEs participated in workshop 2. 
All SMEs were current and active licensees. It was expected that most of the participants would be female because this reflects the majority of practitioners in the profession. Some SMEs had many years of experience in their profession and others were newly licensed; the number of years licensed ranged from 1.5 to 37 years. SMEs represented different practice specialties and geographic locations. All SMEs were familiar with the issues that independent practitioners face on a day-to-day basis as well as with the competencies required of entry-level practitioners. Some SMEs had previous experience in exam development workshops and others had not participated in such workshops before. It is important to have a sample of SMEs with different levels of experience, ranging from newly licensed to having many years of experience. The names, ages, and ethnicities of the SMEs from the two workshops were kept confidential by the Office of Professional Examination Services (OPES); only county, years licensed, and work setting information about the SMEs is provided. SMEs were allowed to participate in the standard setting workshop only if they had not participated in the exam construction workshop, so that previous exposure to the items would not interfere with the evaluation of the items in the standard setting workshop. Materials At the beginning of each workshop, SMEs received a folder with a copy of the agenda for the workshop. An example of the agenda is provided in Appendix A. SMEs also received information about the workshop building facilities, a copy of a PowerPoint presentation, an exam security agreement, and a copy of the two exam plans for the two examinations to be rated. The test developer showed the PowerPoint presentation to the SMEs as part of the training that explained the two standard setting methods. After reviewing the exam plans for the two exams, SMEs received a copy of the two licensure exams. Each exam had 100 four-option multiple-choice items. The 100 items for each exam were divided into two smaller sets of 50 items each. All items were designed to assess minimal competence in different content areas based on the exam plan. The items had been selected previously by a different group of SMEs during an exam construction workshop. During an exam construction workshop, the test developer brings a pool of items and the SMEs select the items that they consider the best for that exam, making sure that the items do not overlap or test the same area of the exam plan. Before the exam construction workshop, the test developer reviews the past performance of the items to ensure that they are functioning well and that they differentiate test takers who are qualified from those who are not qualified to practice safely in the profession. The data were collected using the Angoff and Bookmark methods. Although the Bookmark method typically uses IRT-based difficulty estimates, the test developer used classical test theory item difficulties (p-values) from previous exam administrations, an approach that has been supported in recent research (Buckendahl et al., 2002; Davis-Becker, Buckendahl, & Gerrow, 2011). The p-values were weighted based on sample size; the following hypothetical example illustrates the weighting. Item 99 was administered on dates X, Y, and Z. For administration X, the p-value is 0.20 and the sample size is 10. For administration Y, the p-value is 0.30 and the sample size is 20.
Finally, for administration Z, the p-value is 0.40 and the sample size is 80. The weighted average for Item 99 is (0.20 × 10 + 0.30 × 20 + 0.40 × 80) divided by (10 + 20 + 80), which is approximately 0.364 (a computational sketch of this weighting appears just before the Procedures section). In the current study, the p-values on which the Bookmark sorting was based came from prior test administrations with Ns of 70 to 198. The dates of the previous administrations range from November 2001 to November 2010. In addition, the items differ in the number of times they have been administered, ranging from 3 to 7 administrations. Rating sheets containing three columns were provided to SMEs to write down their ratings for the Angoff method (see Appendix B). Each rating sheet also had the SME's name, the date of the workshop, and the name of the profession. SMEs' ratings were recorded in an Excel spreadsheet. The first column of the rating sheet contained the item number, the second column was designed for the initial rating, and the last column was designed for the final rating from each SME. SMEs also received a rating sheet for the Bookmark method, which included their name and group number, as shown in Appendix C; it also included columns for the content area, initial placement, second placement, and final placement for each of the two exams. Research Design The study consisted of 2 separate 2-day workshops. SMEs were current licensees and were recruited by their state board to represent different specialties, geographic locations, and levels of experience. The design of the study is shown in Table 1. SMEs rated subsets of 2 different licensure exams using two different standard setting methods: the Bookmark and the Angoff methods. The goal was to compare the cut scores that resulted from the two methods. Thus, each subset of items was rated with both methods. For example, for subset 1, an Angoff cut score was obtained as well as a Bookmark cut score. The research question was whether the Bookmark method produces cut scores similar to those of the Angoff method on the same set of items. The first workshop was held on January 21–22, 2011, and the second workshop was held on January 28–29, 2011. Seven SMEs participated in the first workshop and eight SMEs participated in the second workshop; these two groups of SMEs were independent of each other.

Table 1
Summary of Experimental Design

Workshop 1   Day 1   Exam 1   Set 1, Angoff; Set 2, Bookmark
             Day 2   Exam 2   Set 2, Bookmark; Set 1, Angoff
Workshop 2   Day 1   Exam 2   Set 1, Bookmark; Set 2, Angoff
             Day 2   Exam 1   Set 2, Angoff; Set 1, Bookmark

On day 1 of workshop 1, SMEs rated the first set of items for exam 1 using the Angoff method. Then, SMEs rated the second set of items from the same exam using the Bookmark method. On day 2, SMEs rated the second set of items from exam 2 using the Bookmark method. Then, SMEs rated the first set for the same exam using the Angoff method. On day 1 of workshop 2, a different group of SMEs rated the first set of items for exam 2 using the Bookmark method. Exam 2 was rated on the first day in order to counterbalance the order effects of type of exam. Each set of items was rated using the two different standard setting methods in order to compare the results. Then, SMEs rated the second set of items from the same exam using the Angoff method. On day 2, SMEs rated the second set of items from exam 1 using the Angoff method. Then, SMEs rated the first set for the same exam using the Bookmark method. A cut score was derived from these workshops.
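The p-value weighting described under Materials can be written out as a short computation. The following is a minimal sketch in Python; the administration values are the hypothetical ones from the Item 99 illustration.

# Sample-size-weighted p-value for one item across prior administrations
# (hypothetical administrations X, Y, and Z from the Item 99 illustration).
administrations = [(0.20, 10), (0.30, 20), (0.40, 80)]   # (p-value, sample size)

weighted_p = (sum(p * n for p, n in administrations)
              / sum(n for _, n in administrations))

print(round(weighted_p, 3))   # 0.364

# Within each content-area booklet, items would then be ordered from easiest
# (highest weighted p-value) to most difficult (lowest weighted p-value).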
Procedures General Training SMEs provided judgments on item difficulty for two licensure exams in a standard setting study. The two independent groups of SMEs had the same training at the beginning of each workshop. As part of the training, the test developer explained the exam security procedures to be followed in order to protect the confidentiality of the exams. Then, SMEs were asked to sign the exam security agreement stating that they would follow those procedures. Another part of the training was to review the exam plan for each exam in order to become familiar with the required KSAs for these exams. After 61 reviewing the exam plans, the test developer presented a performance behaviors table with definitions of the highly qualified, minimally acceptable, and unqualified candidate. This table was directly linked to the exam plan and it was established during the development of previous workshops by multiple focus groups. A short sample of the performance behaviors table is provided in Appendix E. At the end of this part of the meeting, SMEs were trained on the standard setting method to be used first, followed by the other standard setting method. The training on the standard setting method was done on the first day of the workshops. Thus, in workshop 1, SMEs were trained on Angoff method first followed by the rating of the items. Afterward, SMEs were trained on the Bookmark method followed by the rating of the items. In workshop 2, SMEs were trained on the Bookmark method first followed by the rating of the items; after that, SMEs were trained on the Angoff method followed by the rating of the items. In the Angoff method, SMEs rated each item of the test. In the Bookmark method, SMEs placed bookmarks on the OIB. Each standard setting method consisted of more than one round of ratings. A round of ratings was the time period within the standard setting process in which judgments were collected from each SME. Angoff Method For the Angoff method, SMEs received two booklets with 50 four-multiplechoice items each and 2 rating sheets (one per exam). The items were ordered by content area. The test developer trained the SMEs using a Microsoft PowerPoint presentation that explained the Angoff method. An outline of the presentation is provided in Appendix F. 62 After the presentation, the SMEs answered a 10-item-practice exam in order to become familiar with the process. The practice exam also helped to verify that SMEs were properly calibrated and understood what Minimum Acceptable Competence (MAC) criteria was for the test taker. Having this concept in mind is critical because SMEs’ ratings are based on this entry-level standard. As part of the practice process, the test developer provided the key for each item after all SMEs had answered the 10 items. Then, SMEs were asked to rate the difficulty of the items and to write them down on a rating sheet. The SMEs were instructed to rate the items based on the percentage of people that would answer the item correctly from a group of 100 qualified candidates based on MAC criteria. The initial ratings were independent from the other SMEs. Thus, no discussion of the items was held yet. The SMEs used a rating scale ranging from 25% - 95% to rate the perceived difficulty of the items. Then, each SME provided the initial ratings to the test developer and the test developer entered the ratings from each SME on an Excel spreadsheet designed to compute the cut score. 
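A minimal sketch (Python, with hypothetical raters and ratings) of the kind of computation such a spreadsheet performs follows; it also applies the 20-percentage-point discrepancy screen described in the next paragraph.

# Angoff ratings: estimated percentage of 100 minimally acceptable candidates
# who would answer each item correctly. Raters and values are hypothetical.
ratings = {
    "item_01": {"SME1": 80, "SME2": 75, "SME3": 45},   # range 35: flag for discussion
    "item_02": {"SME1": 65, "SME2": 70, "SME3": 60},
    "item_03": {"SME1": 90, "SME2": 85, "SME3": 95},
}

DISCREPANCY = 20   # discuss an item when the spread of ratings is 20 points or more

flagged_items = [item for item, r in ratings.items()
                 if max(r.values()) - min(r.values()) >= DISCREPANCY]

item_means = {item: sum(r.values()) / len(r) for item, r in ratings.items()}

# Recommended passing score: the average of the ratings across items and raters,
# expressed as a percentage of the total number of items.
cut_score = sum(item_means.values()) / len(item_means)

print(flagged_items)           # ['item_01']
print(round(cut_score, 2))     # 73.89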
The spreadsheet contained SME’s names across the columns and the items listed in the rows grouped by content areas for easier identification. The test developer obtained the average rating for the item based on the initial rating across raters and determined if the group of SMEs needed to discuss the item. In the case that the range of ratings of an item had a discrepancy of 20% or more, SMEs were asked to discuss that specific item. For example, one of the SMEs considers the item to be easy and rates the item with an 80%. On the other hand, another SME 63 considers the same item to be challenging and rates the item with a 45%. In this example, there is a rating discrepancy of 35% which reflects SMEs disagreement in the difficulty of the item. The SMEs with the lowest and highest ratings are asked to explain their rationale as to why they provided those ratings. The discussion is based on the evaluation of the item as a whole. SMEs evaluate the quality of the 3 wrong answers of the item and the required KSAs. SMEs think about which and how many of the wrong answers could be easily dropped in order to get to the right answer. After some discussion with the group, SMEs have an opportunity to re-rate the item and write down the final rating for the item. SMEs are asked to provide a final rating based on the discussion, if there was one. In the case that ratings failed to fall within the 20% criterion again, SMEs were asked if the item was difficult, average, or easy in order to verify if SMEs were still calibrated. Another method to verify that SMEs were using the rating scale properly is to ask them to explain their rationale for why they considered the item easy or difficult. Sometimes SMEs fail to use the rating scale appropriately and tend to rate the items with an 80% when they really consider the item to be difficult. In the case that no discussion was held, they are asked to pass their final rating over to the final column on their rating sheets. The test developer enters the final ratings on the Excel spreadsheet. After completing the 10-item-practice exam, SMEs started to work on exam 1. SMEs answered 25 items from exam 1 and followed the same procedure used in the 10item-practice exam. After rating all 25 items, they continued with the other 25 items from this set for a total of 50 items. SMEs answered the items and rated them in two separate 64 groups of items. It is easier to complete this task by answering small groups of items because SMEs get less fatigued by trying to remember the content of the items for them to rate them. The recommended passing score was the average of the final ratings of all the raters. Bookmark Method For the Bookmark method, SMEs received several booklets with the items grouped by content area for each exam: 5 booklets for exam 1 and 3 booklets for exam 2. Each booklet had one item per page for a total of 50 items per exam. The items were ordered from easiest to most difficult in each booklet based on classical theory p-values obtained from previous administrations. The test developer presented a Microsoft PowerPoint presentation on the Bookmark method. After the presentation, SMEs were assigned to one of 2 groups, with 3 or 4 people in each group. SMEs answered a 10-item-practice exam in order to become familiar with the process. The items were the same items used in the Angoff method. The items were ordered from easiest to most difficult. After answering all the items, the test developer provided the key for each item. 
During round 1, SMEs were asked to go through the items and place the initial bookmark in the item that they believed would be the last correct item that a candidate would get from that exam and beyond this placement the candidate would get all the items incorrect while thinking of the MAC candidate. The first placement was independent from other SMEs. There was no discussion held among SMEs in this round of placements. SMEs recorded their initial placement in the 65 bookmark rating sheet. SMEs were asked for their first placement and the test developer entered the data in the Excel spreadsheet. During round 2, performance data was provided and explained after the first bookmark placement. SMEs were allowed ten minutes to discuss with their group the reasons for their placements based on the knowledge that the candidates need to answer the item right. They also discussed why the previous and following item would have been more or less difficult for the MAC candidate. After group discussions, SMEs placed their second bookmark on the booklet and the test developer entered it in an Excel spreadsheet. This time, SMEs considered the information shared during their small group discussion in order to make their placement. For the final discussion, in round 3, one SME from each group summarized the discussions from their group. They had the last opportunity to move their final bookmark placement. There were three rounds of bookmark placements. High disagreement among SMEs on their placements was not required in order to hold these discussions, as it was the case in the Angoff method. The discussions were required steps of the Bookmark method, thus they had to be made. After the practice exercise, SMEs answered the 50 items from the exam one booklet at a time. When they were done, the test developer provided the key to those items. SMEs had to place a separate bookmark for each booklet, so 5 bookmarks for exam 1 and 3 bookmarks for exam 2. They used the same procedure used for the practice exam. 66 The same procedures were used for the second workshop. A different group of SMEs participated and the standard setting methods were reversed in this case to counterbalance any effects of the standard setting methods. So, the Bookmark method was used first instead of Angoff on the first day; and the Angoff method was used first on the second day. Evaluation of Standard Setting Methods At the end of the workshop, Subject Matter Experts completed a 17-item questionnaire about the effectiveness of the two standard setting procedures. The questionnaire is presented in Appendix D. 67 Chapter 5 RESULTS The mean of the final round of ratings across SMEs was the final cut score for each subset of items in which the Angoff method was used. The resulting cut scores for each round of ratings for each subset of items are shown in Table 2. Table 2 also provides standard deviations, 95% confidence intervals, and ICC coefficients for each round of ratings. The mean of the final bookmark placement across SMEs was used as the cut score for the set of items in which the Bookmark method was used. The resulting cut scores for each round are shown in Table 2. Table 2 also shows the change in reliability of the cut scores between rounds of ratings. The reliability of set 1 of exam 1 increased from 0.72 in the initial round to 0.85 in the final round of ratings when using the Angoff method. As a result of the disagreement of the SMEs in the difficulty of some of the items, there were negative inter-judge correlations. 
The reliability of the Bookmark cut score was negative (Cronbach's alpha = -0.18) when all 7 raters were included in the analysis (M = 62.68, SD = 7.88, 95% CI = 55.39 – 69.97). For this reason, only 4 raters were used in the analysis. The resulting reliability increased from 0.64 in the initial round to 0.83 in the second round and then changed to 0.68 in the final round.

Table 2

Cut Scores, Standard Deviations, Confidence Intervals, and Reliability Between Rounds

Method       Round     Cut score   SD     95% CI           α

Exam 1, Set 1
  Angoff     Initial   70.66       4.11   66.85 – 74.47    0.72
  Angoff     Final     70.26       2.94   67.53 – 72.98    0.85
  Bookmark   Initial   69.00       3.44   63.52 – 74.48    0.64
  Bookmark   Second    67.55       4.32   60.66 – 74.44    0.83
  Bookmark   Final     67.75       3.81   61.67 – 73.83    0.68

Exam 1, Set 2
  Angoff     Initial   77.09       3.44   73.89 – 80.27    0.76
  Angoff     Final     77.87       2.80   74.92 – 80.81    0.79
  Bookmark   Initial   66.82       5.62   61.62 – 72.03    0.74
  Bookmark   Second    65.28       4.03   61.54 – 69.02    0.86
  Bookmark   Final     65.42       5.79   60.06 – 70.78    0.93

Exam 2, Set 1
  Bookmark   Initial   64.04       4.48   61.25 – 67.08    0.93
  Bookmark   Second    64.54       3.97   61.21 – 67.86    0.97
  Bookmark   Final     64.57       3.91   60.96 – 68.19    0.95
  Angoff     Initial   69.00       2.94   66.27 – 71.73    0.73
  Angoff     Final     69.13       2.77   66.56 – 71.69    0.77

Exam 2, Set 2
  Bookmark   Initial   64.24       6.68   58.06 – 70.42    0.30
  Bookmark   Second    66.95       4.69   62.61 – 71.30    0.90
  Bookmark   Final     66.86       5.39   61.87 – 71.85    0.83
  Angoff     Initial   73.82       4.40   70.13 – 77.51    0.80
  Angoff     Final     73.84       3.48   70.92 – 76.75    0.87

Note. The Bookmark cut score for set 1 of exam 1 was based on the analysis of 4 SMEs because of the low inter-judge correlations in the reliability analysis using 7 SMEs.

The reliability of the cut score for set 2 of exam 1 changed from 0.76 in the initial round to 0.79 in the final round when the Angoff method was used, and was 0.74 in the initial round, 0.86 in the second round, and 0.93 in the final round when the Bookmark method was used. For exam 2, the reliability of the Bookmark cut score for set 1 went from 0.93 in the initial round to 0.97 in the second round and then changed to 0.95 in the final round, while the reliability of the Angoff cut score changed from 0.73 in the initial round to 0.77 in the final round. For set 2 of exam 2, the reliability of the Bookmark cut score was 0.30 in the initial round, 0.90 in the second round, and 0.83 in the final round, and the reliability of the Angoff cut score was 0.80 in the initial round and 0.87 in the final round of ratings.

A summary of the final cut scores and intraclass correlation coefficients (ICCs) is presented in Table 3. Table 3 shows the final cut score for each set of items from exams 1 and 2 using the Angoff and Bookmark methods, along with the number of items, the number of raters, the number of cases, the intraclass correlation coefficients (ICC), and the 95% confidence interval for each cut score. Two cut scores were produced for exam 1 and two for exam 2; the Angoff method was used on one half of each exam and the Bookmark method on the other half. Cut scores for the same set of items were also obtained using the two different methods, so an Angoff cut score and a Bookmark cut score were available for the same subset of items, making it possible to determine whether the two methods would produce similar cut scores.
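Table 3, below, reports both single-rater and average-rater ICCs. The thesis does not state which ICC model or software was used; as one common choice, the hypothetical Python sketch below computes two-way consistency ICCs (ICC(3,1) for a single rater and ICC(3,k) for the average of k raters) from the ANOVA mean squares of a cases-by-raters ratings matrix.

```python
import numpy as np

def icc_consistency(ratings: np.ndarray):
    """Two-way consistency ICCs for a cases-by-raters matrix:
    ICC(3,1) for a single rater and ICC(3,k) for the average of k raters."""
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)
    col_means = ratings.mean(axis=0)

    ms_rows = k * ((row_means - grand) ** 2).sum() / (n - 1)            # between-case mean square
    resid = ratings - row_means[:, None] - col_means[None, :] + grand
    ms_error = (resid ** 2).sum() / ((n - 1) * (k - 1))                 # residual mean square

    icc_single = (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)  # ICC(3,1)
    icc_average = (ms_rows - ms_error) / ms_rows                        # ICC(3,k); equals Cronbach's alpha
    return icc_single, icc_average

# Hypothetical Angoff final ratings: rows = items (cases), columns = SMEs (raters).
ratings = np.array([
    [70, 65, 75, 72, 68, 71, 69],
    [55, 60, 58, 62, 57, 59, 61],
    [80, 78, 85, 82, 79, 81, 83],
    [65, 70, 68, 66, 72, 67, 69],
    [45, 50, 48, 52, 47, 49, 51],
], dtype=float)

single, average = icc_consistency(ratings)
print(f"ICC-Single:  {single:.2f}")
print(f"ICC-Average: {average:.2f}")
```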
Table 3

Summary of Angoff and Bookmark Final Cut Scores for Each Set of Items

Exam 1
                      Set 1                       Set 2
                      Angoff       Bookmark       Bookmark      Angoff
Items                 1-50         1-50           51-100        51-100
Number of raters      7            4              7             7
Cases                 50           5              5             50
Cut score             70.26        67.75          65.42         77.87
ICC-Single            0.41         0.32           0.57          0.27
  95% CI              0.29 – 0.54  -0.04 – 0.85   0.24 – 0.92   0.16 – 0.40
ICC-Average           0.83         0.65           0.90          0.72
  95% CI              0.74 – 0.89  -0.17 – 0.96   0.68 – 1.00   0.58 – 0.82

Exam 2
                      Set 1                       Set 2
                      Bookmark     Angoff         Angoff        Bookmark
Items                 1-50         1-50           51-100        51-100
Number of raters      8            7              8             7
Cases                 3            50             50            3
Cut score             64.57        69.13          73.84         66.86
ICC-Single            0.57         0.32           0.35          0.37
  95% CI              0.16 – 0.98  0.21 – 0.45    0.25 – 0.48   0.02 – 0.97
ICC-Average           0.91         0.76           0.81          0.80
  95% CI              0.61 – 0.99  0.65 – 0.82    0.72 – 0.88   0.10 – 0.99

The Angoff cut score for set 1 of exam 1 (M = 70.26, SD = 2.94) was higher than the Bookmark cut score (M = 67.75, SD = 3.81). The Angoff cut score had a narrower confidence interval (CI = 67.53 – 72.98) than the Bookmark cut score (CI = 61.67 – 73.83). The Bookmark cut score (67.75) fell within the Angoff confidence interval and the Angoff cut score (70.26) fell within the Bookmark confidence interval, suggesting no difference between the methods for this set.

The Angoff cut score for set 2 of exam 1 (M = 77.87, SD = 2.80) was higher than the Bookmark cut score (M = 65.42, SD = 5.79). The Angoff cut score again had a narrower confidence interval (CI = 74.92 – 80.81) than the Bookmark cut score (CI = 60.06 – 70.78). In this case the two confidence intervals did not overlap at all, suggesting a clear difference between the methods.

The Angoff cut score for set 1 of exam 2 (M = 69.13, SD = 2.77) was higher than the Bookmark cut score (M = 64.57, SD = 3.91). The Angoff cut score had a narrower confidence interval (CI = 66.56 – 71.69) than the Bookmark cut score (CI = 60.96 – 68.19). Neither cut score fell within the other method's confidence interval, suggesting a difference in cut scores.

The Angoff cut score for set 2 of exam 2 (M = 73.84, SD = 3.48) was higher than the Bookmark cut score (M = 66.86, SD = 5.39). The Angoff cut score had a narrower confidence interval (CI = 70.92 – 76.75) than the Bookmark cut score (CI = 61.87 – 71.85). Neither cut score fell within the other method's confidence interval, although the two intervals overlapped slightly, so the evidence of a difference was weaker than for set 2 of exam 1 and set 1 of exam 2.

Overall, the data suggested a difference in cut scores for set 2 of exam 1 and set 1 of exam 2, and weaker or no evidence of a difference for set 1 of exam 1 and set 2 of exam 2. The confidence interval for the Angoff cut score was narrower than that for the Bookmark cut score in all four sets, and the Angoff cut scores were consistently higher than the Bookmark cut scores.

A McNemar test was conducted to assess the significance of the difference between the two methods in the proportion of candidates passing. The results are presented in Table 4, which shows the proportions of candidates passing and the differences between the two methods for each set (50 items) of the two exams. The results for exam 1 are based on 57 candidates. For exam 1, set 1, approximately 37% of candidates passed according to the Angoff cut score while 63% passed according to the Bookmark cut score, a difference of 26%. For set 2, 56% of candidates passed according to the Angoff cut score and 93% passed according to the Bookmark cut score, a difference of 37%. Both differences were statistically significant, p < .001. The results of the McNemar test for exam 2 are based on 99 candidates. For exam 2, set 1, 44% of candidates passed according to the Angoff cut score while 52% passed according to the Bookmark cut score, a difference of 8%, which was statistically significant, p = .007. For set 2, 47% passed according to the Angoff cut score and 62% passed according to the Bookmark cut score, a difference of 15%, which was statistically significant, p < .001.
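Table 4, below, summarizes these comparisons. As a hedged illustration of how a McNemar test on paired pass/fail decisions can be computed (the thesis does not name the software used, and the candidate scores below are simulated for the example), the following Python sketch counts the discordant cells of the 2 x 2 table and applies the continuity-corrected chi-square form of the test.

```python
import numpy as np
from scipy.stats import chi2

def mcnemar(pass_a: np.ndarray, pass_b: np.ndarray):
    """Continuity-corrected McNemar test for paired pass/fail decisions."""
    b = np.sum(pass_a & ~pass_b)   # passed under method A only
    c = np.sum(~pass_a & pass_b)   # passed under method B only
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = chi2.sf(stat, df=1)
    return stat, p_value

# Hypothetical candidate scores (percent correct) and the two cut scores for one set.
rng = np.random.default_rng(0)
scores = rng.normal(loc=72, scale=8, size=57)
angoff_cut, bookmark_cut = 70, 67

pass_angoff = scores >= angoff_cut
pass_bookmark = scores >= bookmark_cut

stat, p = mcnemar(pass_angoff, pass_bookmark)
print(f"Pass rate (Angoff):   {pass_angoff.mean():.0%}")
print(f"Pass rate (Bookmark): {pass_bookmark.mean():.0%}")
print(f"McNemar chi-square = {stat:.2f}, p = {p:.3f}")
```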
Table 4

Summary of McNemar Test for Significance of the Proportions of Candidates Passing by the Angoff Method versus the Bookmark Method

Exam 1
                             Set 1                      Set 2
                             Angoff      Bookmark       Bookmark     Angoff
Cut score                    70          67             65           77
Proportion passing (%)       37          63             93           56
Proportion difference (%)          26                         37

Exam 2
                             Set 1                      Set 2
                             Bookmark    Angoff         Angoff       Bookmark
Cut score                    64          69             73           66
Proportion passing (%)       52          44             47           62
Proportion difference (%)          8                          15

Note. Proportions passing are shown as percentages. Results for exam 1 are based on 57 candidates and results for exam 2 on 99 candidates.

Workshop evaluation data were collected at the end of each workshop to obtain the SMEs' perceptions of the Angoff and Bookmark standard setting methods. The SMEs were asked to complete a 17-item survey in which they rated how confident and comfortable they were with the process and with their resulting cut scores. Ratings were made on a five-point scale (5 = Strongly Agree, 4 = Agree, 3 = Undecided, 2 = Disagree, 1 = Strongly Disagree). Table 5 shows the mean, standard deviation, and 95% confidence interval for each item of the survey.

Table 5

Summary of Ratings about the Standard Setting Process

Entire process
1. Training and practice exercises helped to understand how to perform the tasks (M = 4.82, SD = 0.41, 95% CI = 4.55 – 5.09)
2. Taking the test helped me to understand the required knowledge (M = 4.64, SD = 0.67, 95% CI = 4.18 – 5.09)
3. I was able to follow the instructions and complete the rating sheets accurately (M = 4.36, SD = 0.67, 95% CI = 3.91 – 4.82)
4. The time provided for discussions was adequate (M = 4.36, SD = 0.51, 95% CI = 4.02 – 4.70)
5. The training provided a clear understanding of the purpose of the workshop (M = 4.18, SD = 0.75, 95% CI = 3.68 – 4.69)
6. The discussions after the first round of ratings were helpful to me (M = 4.09, SD = 1.22, 95% CI = 3.27 – 4.91)
7. The workshop facilitator clearly explained the task (M = 4.09, SD = 0.70, 95% CI = 3.62 – 4.56)
8. The performance behavior descriptions were clear and useful (M = 4.09, SD = 0.70, 95% CI = 3.62 – 4.56)

Angoff
9. I am confident about the defensibility of my final recommended cut scores using Angoff (M = 4.45, SD = 0.52, 95% CI = 4.10 – 4.81)

Bookmark
10. I thought a few of the items in the booklet were out of order (M = 4.36, SD = 0.67, 95% CI = 3.91 – 4.82)
11. The information showing the statistics of the items was helpful to me (M = 4.09, SD = 0.54, 95% CI = 3.73 – 4.45)
12. The discussions after the second round of ratings were helpful to me (M = 3.64, SD = 1.36, 95% CI = 2.72 – 4.55)
13. Deciding where to place my bookmark was difficult (M = 3.55, SD = 1.21, 95% CI = 2.73 – 4.36)
14. Understanding of the task during each round of bookmark placements was clear (M = 3.27, SD = 1.19, 95% CI = 2.47 – 4.07)
15. I am confident that my final bookmark approximated the proficient level on the state test (M = 2.73, SD = 1.19, 95% CI = 1.93 – 3.53)
16. Overall, the order of the items in the booklet made sense (M = 2.64, SD = 1.03, 95% CI = 1.95 – 3.33)
17. I felt pressured to place my bookmark close to those favored by other SMEs (M = 2.45, SD = 1.29, 95% CI = 1.59 – 3.32)

Note. The results are based on 11 SMEs; the other 4 SMEs did not return the evaluation questionnaire.

Table 5 shows that the SMEs perceived that the training activities helped them understand how to perform the tasks; those activities included the use of the performance behavior table, the training and practice exercises, and taking the test. The SMEs agreed that the discussions after the first round of ratings were helpful, and they felt confident about the defensibility of their final Angoff cut score. Although the SMEs indicated that they understood the task during each round of bookmark placements, they were not confident that their final bookmark approximated the proficient level on the state test. The SMEs also perceived that some of the items in the OIB were out of order. They found the discussions after the second round of ratings useful in the process. Furthermore, the SMEs agreed that deciding where to place their bookmark was a difficult task. They did not, however, feel pressured to place their bookmark close to those of the other SMEs.
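The mean, standard deviation, and confidence interval for each survey item in Table 5 can be computed directly from the 11 returned questionnaires. The thesis does not state the exact interval formula; the Python sketch below shows a t-based 95% confidence interval, one common choice, applied to a hypothetical set of 5-point ratings.

```python
import numpy as np
from scipy.stats import t

def summarize(ratings: np.ndarray):
    """Mean, SD, and t-based 95% CI for one survey item rated by n SMEs."""
    n = ratings.size
    m = ratings.mean()
    sd = ratings.std(ddof=1)
    half_width = t.ppf(0.975, df=n - 1) * sd / np.sqrt(n)
    return m, sd, (m - half_width, m + half_width)

# Hypothetical 5-point ratings from the 11 SMEs for one questionnaire item.
item_ratings = np.array([5, 5, 4, 5, 5, 5, 4, 5, 5, 5, 5], dtype=float)

m, sd, (lo, hi) = summarize(item_ratings)
print(f"M = {m:.2f}, SD = {sd:.2f}, 95% CI = {lo:.2f} - {hi:.2f}")
```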
Chapter 6

DISCUSSION

The results of the present study suggest that the cut scores produced by the Angoff method were consistently higher than those produced by the Bookmark method. Specifically, when the two methods were used on the same sets of items from two state licensure exams, the Bookmark method consistently produced lower cut scores. This differs from the findings of other research in which the Angoff and Bookmark cut scores for the same content area converged, providing evidence of the validity of the Bookmark method (Peterson et al., 2011). The results of the McNemar tests confirmed that the difference between the cut scores produced by the two methods was statistically significant for each set, and more candidates would pass the licensure exams under the Bookmark cut score than under the Angoff cut score for all 4 sets of items.

It was also found that the reliability of the cut scores improved between rounds of ratings for both the Angoff and the Bookmark methods. For the Angoff method, reliability coefficients ranged from 0.71 to 0.86; for the Bookmark method, they ranged from 0.64 to 0.96. In addition, these results imply that discussion of the items during a passing score workshop may improve the reliability of the cut scores.

In the study by Skaggs et al. (2007), separate booklets were used for the different content areas of the exam in the Bookmark method. The same procedure was used in the present study instead of a single bookmark placement for the whole set of items; for example, if an exam contained five content areas, there were five OIBs. Skaggs et al. (2007) found that the additional data points produced by the separate booklets resulted in high reliability of the cut score. The moderate reliability of the cut scores in the present study may likewise reflect the relatively easy task of rating items drawn from the same content area. Items from the same content area share similarities such as formatting characteristics; in the present study, for example, one content area might have required the test taker to select a vocabulary word from a given list, and another might have required the test taker to proofread a paragraph. Furthermore, using a separate booklet for each content area during a Bookmark passing score workshop would make the rating task easier for the SMEs and allow them to focus on the difficulty of the items rather than on the difficulty of the process.

The results of the evaluation questionnaire suggest that the training activities, including the training and practice exercises and taking the test, helped the SMEs understand how to perform the standard setting tasks. This is consistent with the study by Peterson et al. (2011), in which understanding of the tasks and instructions was high for both the Bookmark and the Angoff methods.
These results imply that test developers need to ensure that SMEs understand their standard setting tasks by including training activities that familiarize them with the process as well as with the exam requirements. In addition, the SMEs felt confident about the defensibility of their final Angoff cut score but not about the defensibility of their Bookmark cut score. This is contrary to the findings of Buckendahl et al. (2002), in which panelists expressed high levels of confidence in their final Bookmark cut score. In the present study, it is possible that the SMEs felt more confident about the Angoff results than the Bookmark results because they are more familiar with the well-researched Angoff method.

The SMEs in the present study also reported that deciding where to place their bookmarks in the OIB was difficult. This is similar to the study by Hein and Skaggs (2009), in which panelists were able to understand the procedure yet still found selecting the item at which to place their bookmark to be an almost impossible task. This difficulty is likely related to the SMEs' perception that the items in the OIB were out of order, which parallels the findings of Lin (2006), where panelists thought it was difficult to place their bookmarks because they perceived the items to be out of order. There may be several reasons for this perception that the items are out of order with respect to difficulty. One reason might be that SMEs are not considering the whole item when they rate it; that is, they are not considering the role of the distractors, or wrong options, in the questions. The distractors in an item can make the item either easier or more challenging. Test developers should therefore ensure that SMEs evaluate each item as a whole, including the influence of the distractors.

Based on the results of this study and the continuing need for evidence supporting the validity of the Bookmark method, state entities should keep using the well-researched Angoff method to set cut scores for licensure exams. The results of this study imply that using the Bookmark method to set cut scores would allow more candidates to pass a licensure exam than the Angoff method would; in that situation, state entities could be licensing unqualified candidates and consequently harming the public. Future research should compare Bookmark and Angoff cut scores based on a larger number of items in each content area and a larger number of SMEs, which would allow the SMEs to be divided into more groups during the workshop.

Some limitations of the current study should also be addressed in order to provide future directions. The first limitation was the limited sample size of raters. Future researchers could select a larger, representative sample of raters and organize them into smaller groups for the Bookmark method; this approach could improve the reliability of the cut scores. The second limitation was that the study was conducted in only one profession, which makes it difficult to generalize to other professions. A third limitation was the use of classical test theory p-values, rather than an IRT approach, to rank the items in the OIB. A combination of classical test theory and an IRT approach could provide more accurate statistical information for arranging the items in the OIB by difficulty.
In addition, the study would have benefited if the data used to arrange the items for the Bookmark method had been based on a larger sample of examinees. For this profession, however, the population of test takers is not large, which made it difficult to obtain item information from a larger sample. A fourth limitation was the number of items used in each booklet. When the sets of items for each exam were created, the same number of items from each content area was placed in each booklet, and as a result only a small number of items ended up in each booklet. Future researchers could use a different approach to creating the sets of items for the booklets.

Appendix A

Agenda

Board of California
EXAMS PASSING SCORE WORKSHOP
Office of Professional Examination Services
2420 Del Paso Road, Suite 265
Sacramento, CA 95834
January 21 – 22, 2011

I. Welcome and introductions
II. Board business
    A. Examination security, self-certification
III. About OPES facilities
    A. Security procedures (electronic devices)
    B. Workshop procedures (breaks and lunch)
IV. PowerPoint presentation
    A. Angoff: use the full range, 25%-95%; "What percentage of minimally competent candidates WOULD answer this item correctly?"
    B. Bookmark: 2 groups; items ordered from easiest to most difficult; 3 rounds of bookmark placement, up to the item where the performance level should fall, based on item difficulty
V. What are minimum competence standards?
VI. What are performance behaviors?
    A. Ineffective, Minimally Acceptable Competence, and Highly Effective
VII. Take examination 1
VIII. Assignment of ratings to examination 1 using Angoff
IX. Assignment of ratings to examination 1 using Bookmark
X. Take examination 2
XI. Assignment of ratings to examination 2 using Bookmark
XII. Assignment of ratings to examination 2 using Angoff
XIII. Wrap-up and adjourn

Appendix B

Angoff Rating Sheet

Name: «First_Name» «Last_Name»
MFT-CV Exam Rating Sheet
September 2008 Passing Score Workshop

Rating scale: Guess; Hard (25, 35, 45); Thinking (55, 65, 75); Easy (85, 95)

Item Number    Initial Rating    Final Rating
1-15 (one row per item)

Appendix C

Bookmark Rating Sheet

BOOKMARK RATING SHEET - WORKSHOP 1
NAME:            GROUP:            DATE:

PROFESSIONAL PRACTICE               INITIAL PLACEMENT    SECOND PLACEMENT    FINAL PLACEMENT
1. Reporting Proceedings
2. Transcribing Proceedings
3. Research and Language Skills
4. Transcript Management
5. Ethics

BOOKMARK RATING SHEET - WORKSHOP 1
NAME:            GROUP:            DATE:

ENGLISH                             INITIAL PLACEMENT    SECOND PLACEMENT    FINAL PLACEMENT
1. Grammar
2. Proofreading
3. Vocabulary

Appendix D

Evaluation Questionnaire

Standard Setting Methods Evaluation Questionnaire

Please use the following rating scale to evaluate the effectiveness of the two standard setting methods used during the workshop. For each statement, circle the number that best describes your answer.
1 = Strongly Disagree, 2 = Disagree, 3 = Undecided, 4 = Agree, 5 = Strongly Agree

Statement (each rated 1 2 3 4 5)
1. The training provided a clear understanding of the purpose of the workshop
2. The workshop facilitator clearly explained the task
3. Training and practice exercises helped to understand how to perform the tasks
4. Taking the test helped me to understand the required knowledge
5. The performance behavior descriptions were clear and useful
6. The time provided for discussions was adequate
7. I was able to follow the instructions and complete the rating sheets accurately
8. The discussions after the first round of ratings were helpful to me
9. The discussions after the second round of ratings were helpful to me
10. The information showing the statistics of the items was helpful to me
11. I am confident about the defensibility of the final recommended cut scores using Angoff
12. Understanding of the task during each round of bookmark placements was clear
13. I felt pressured to place my bookmark close to those favored by other SMEs
14. Deciding where to place my bookmark was difficult
15. Overall, the order of the items in the booklet made sense
16. I thought a few of the items in the booklet were out of order
17. I am confident that my final bookmark approximated the proficient level on the state test

Comments/Suggestions:

Appendix E

MAC Table

Content Area 1 (39%)
  Unqualified: Misapplies or is unaware of applicable code sections; fails to ask for assistance when needed
  MAC (Minimally Competent): Treats all parties impartially; asks for assistance when needed; complies with applicable codes
  Highly Qualified: Explains and applies code sections correctly; manages workload effectively; effectively carries additional equipment

Content Area 2 (20%)
  Unqualified: Unaware of filing procedures or protocol; fails to adhere to redaction protocols; fails to back up
  MAC (Minimally Competent): Applies basic computer operating functions and capabilities; adheres to redaction protocols; utilizes multiple forms of backup
  Highly Qualified: Creates organizational systems for job documents

Content Area 3 (11%)
  Unqualified: Misspells frequently; frequently misuses specialized vocabularies; misapplies rules of punctuation, grammar, word, and number usage; is unfamiliar with common idioms/slang
  MAC (Minimally Competent): Corrects errors before submitting final transcript; uses specialized vocabularies appropriately; possesses general vocabulary; uses reference sources to ensure accuracy; is familiar with idioms/slang
  Highly Qualified: Possesses extensive general vocabulary; applies research methods to verify citations; creates a word list

Appendix F

PowerPoint Presentation

PASSING SCORE WORKSHOP - January 2011

GOALS OF THE WORKSHOP
• Review scope of practice
• Review MAC table
• Take test
• Rate scorable and pretest items
• Obtain cut score for

PURPOSE OF A LICENSING EXAMINATION
To identify candidates who are qualified to practice safely.
Cycle of Examination Development Occupational analysis, Examination outline, Item development, Item revision, Exam Construction, Passing Score REVIEW OF THE EXAMINATION Identify overlap test items, Determine if an item needs to be replaced Determine if items need to be separated on the examination from another item The Angoff Process: Review scope of practice, Discuss concept of minimum competence, and take examination Bookmark Method: Large group divided into smaller groups, Review exam specifications and KSAs OIB-questions ordered from easiest to most difficult based on p-values/item difficulty, Take exam Discuss in small groups KSAs MAC needed to answer each item correct and possible reasons why each succeeding item was more difficult than the previous one-additional KSAs needed, Provide performance data: p-values for all items R1-First Bookmark placement up to the item to which MAC will answer all items correct and beyond that item will get all items incorrect Discuss rationale for first bookmark placement-based on lowest and highest placement from the group, R2-Second Bookmark placement based on small group discussion Large group discussion-small group summary from one member of group Bookmark Method 2 small groups (random), Answer all 50 items, Obtain key for all items Discuss in your groups: What KSAs are needed to answer each item Reasons for why each succeeding item was more difficult than the previous (which additional knowledge were needed to answer the next item) Individually, based on the MAC, place first bookmark in each booklet (by content area) on the item to which up to that point candidates would get all the items correct and beyond that point, candidates would get all the items incorrect, Record each rater’s placement in computer by group, booklet (content area), R1, R 2, R3 Provide performance data: p-values and explain to SMEs Discuss in group by booklet rationale for first placement for the lowest and highest bookmark in the group, Discuss same 2 questions, Independently place 2 nd bookmark if changed their mind after discussion, Big group discussion, summary of small groups for each booklet, last chance to change bookmark, record in spreadsheet 87 References Almeida, M. D. (2006). Standard-setting procedures to establish cut-scores for multiplechoice criterion referenced tests in the field of education: A comparison between Angoff and ID Matching methods. Manuscript submitted as a Final Paper on EPSE 529 Course. Alsmadi, A. A. (2007). A comparative study of two standard-setting techniques. Social Behavior and Personality, 35, 479-486. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association. Brannick, M. T., Levine, E. L., & Morgeson, F. P. (2007). Job analysis: Methods, research, and applications for human resource management. Thousand Oaks, CA: Sage Publications, Inc. Buckendahl, C. W., Smith, R. W., Impara, J. C., & Plake, B. S. (2002). A comparison of Angoff and Bookmark standard setting methods. Journal of Educational Measurement, 39(3), 253-263. California Department of Consumer Affairs. (2011). More about the Department of Consumer Affairs. Retrieved from http://www.dca.ca.gov/about_dca/moreabout.shtml 88 Chinn, R. N., & Hertz, N. R. (2002). Alternative approaches to standard setting for licensing and certification examinations. Applied Measurement in Education, 15(1), 1-14. 
Cizek, G. J. (2001). In G.J. Cizek (Ed.), The Bookmark procedure: Psychological perspectives. Setting performance standards: Concepts, methods, and perspectives (pp. 249-281). Mahwah, NJ: Erlbaum. Cizek, G. J. (2001). Setting performance standards. Conjectures on the Rise and Call of Standard Setting: An Introduction to Context and Practice (pp. 80-120). Baltimore: John University Press. Clauser, B. E., Harik, P., Margolis, M. J. (2009). An empirical examination of the impact of group discussion and examinee performance information on judgments made in the Angoff standard-setting procedure. Applied Measurement in Education, 22, 121. Davis-Becker, S. L., Buckendahl, C. W., & Gerrow, J. (2011). Evaluating the Bookmark standard setting method: The impact of random item ordering. International Journal of Testing, 11, 24-37. Eagan, K. (2008). Bookmark standard setting. Madison, WI: CTB/McGraw-Hill. Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, & Department of Justice. (1978). Uniform Guidelines on employee selection procedures. Federal Register, 43, (166), 38290-38309. 89 Ferrara, S., Perie, M., & Johnson, E. (2002). Matching the Judgmental task with standard setting panelist expertise: The Item-Descriptor (ID) matching method. Setting performance standards: The item descriptor (ID) matching procedure. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA. Giraud, G., Impara, J. C., & Plake, B. S. (2005). Teachers’ conceptions of the target examinee in Angoff standard setting. Applied Measurement in Education, 18(3), 223-232. Goodwin, L. D. (1999). Relations between observed item difficulty levels and Angoff minimum passing levels for a group of borderline examinees. Applied Measurement in Education, 12(1), 13-28. Green, D. R. Trimble, C. S., & Lewis, D. M. (2003). Interpreting the results of three different standard setting procedures. Educational Measurement: Issues and Practice, 22(1), 22-32. Guion, R. M. (1998). Assessment, measurement, and prediction for personnel decisions. Mahwah, NJ: Lawrence Earlbaum Associates. Hambleton, R. K., Jaeger, R.M., Plake, B. S., & Mills, C. (2000). Setting performance standards on complex educational assessments. Applied Psychological Measurement, 24(4), 355-366. Hambleton, R. K., & Pitoniak, M. J. (2004). Setting passing scores on the CBT version of the uniform CPA examination: Comparison of several promising methods. Manuscript for presentation at NCME. 90 Hein, S. F., & Skaggs, G. E. (2009). A qualitative investigation of panelists’ experiences of standard setting using two variations of the bookmark method. Applied Measurement in Education, 22, 207-228. Hurtz, G. M. & Auerbach, M. A. (2003). A meta-analysis of the effects of modifications to the Angoff method on cutoff scores and judgment consensus. Educational and Psychological Measurement, 63(4), 584-601. Impara, J. C. & Plake, B. S. (1998). Teachers’ ability to estimate item difficulty: A test of the assumptions in the Angoff standard setting method. Journal of Educational Measurement, 35(1), 69-81. Kane, M. (1998). Choosing between examinee-centered and test-centered standard setting methods. Educational Assessment, 5(3), 129-145. Kaplan, R. M., & Saccuzzo, D. P. (2005). Psychological testing: Principles, applications, and issues (6th ed.). Belmont, CA: Thomson Wadsworth. Karantonis, A. & Sireci, S. G. (2006). The Bookmark standard-setting method: A literature review. 
Educational Measurement: Issues and Practice, 25(1), 4-12. Kirkland v. New York Department of Correctional Services, 711 F. 2nd 1117 (1983). Kramer, A., Muiktjens, A., Jansen, K., Dusman, H., Tan, L., & Vleuten, C. V. (2003). Comparison of a rational and an empirical standard setting procedure for an OSCE. Medical Education, 37, 132-139. Lee, G., & Lewis, D. M. (2008). A generalizability theory approach to standard error estimates for bookmark standard settings. Educational and Psychological Measurement, 68, (4), 603-620. 91 Lewis, A. L. Jr. v. City of Chicago, Illinois, 2011, 7th Cir. 5/13/2011. Lin, J. (2006). The Bookmark standard setting procedure: strengths and weaknesses. The Alberta Journal of Educational Research, 52(1), 36-52. Linn, R. L. & Drasgow, F. (1987). Implications of the golden rule settlement for test construction. Educational Measurement: Issues and Practice, 6, 13-17. Marino, R. D. (2007, September). Welcome to DCA. Training at the meeting of Department of Consumer Affairs, Sacramento, CA. MacCann, R. G. (2008). A modification to Angoff and Bookmarking cut scores to account for the imperfect reliability of test scores. Educational and Psychological Measurement, 68(2), 197-214. Maurer, T. J., Alexander, R. A., Callahan, C. M., Bailey, J. J., & Dambrot, F. H. (1991). Methodological and psychometric issues in setting cutoff scores in using the Angoff method. Personnel Psychology, 44, 235-262. Meyers, L. S. (2006). Civil Service, the Law, and adverse impact analysis in promoting equal employment opportunity. Tests and Measurement: Adverse Impact Handout. Sacramento State. Meyers, L. S. (2009). Some procedures for setting cut scores handout. Tests and Measurement. Mitzel, H. C., Lewis, D. M., Patz, R. J., & Green, D. R. (2001). The Bookmark procedure: Psychological perspectives. In G. J. Cizek (Ed.). Setting performance standards: Concepts, methods, and perspectives (pp. 249-281). Mahwah, NJ: Lawrence Erlbaum Associates. 92 Murphy, K. R., & Davidshofer, C. O. (2005). Psychological testing: Principles and applications. New Jersey: Pearson Prentice Hall. National Research Council. (1999). Setting reasonable and useful performance standards. In J. W. Pelligrino, L. R. Jones, & Jones, & K. J. Mitchell (Eds.), Grading the nation’s report card: Evaluating NAEP and transforming the assessment of educational progress (pp. 162-184). Washington, DC: National Academy Press. Nichols, P., Twig, J., & Mueller, C. D. (2010). Standard-setting methods as measurement processes. Educational Measurement: Issues and Practice, 29(1), 14-24. Office of Professional Examination Services. (2010). Informational Series No. 4. Criterion-Referenced Passing Scores. Department of Consumer Affairs. Olsen, J. B. & Smith, R. (2008). Cross validating Modified Angoff and Bookmark standard setting for a home inspection certification. Paper presented at the Annual Meeting of the National Council on Measurement in Education. Plake, B. S. & Impara, J. C. (2001). Ability of panelists to estimate item performance for a target group of candidates: An issue in judgmental standard setting. Educational Assessment, 7(2), 87-97. Plake, B. S., Impara, J. C., & Irwin, P. M. (2000). Consistency of Angoff-based predictions of item performance: Evidence of technical quality of results from the Angoff standard setting method. Journal of Educational Measurement, 37(4), 347-355. 93 Peterson, C. H., Schulz, E. M., & Engelhard G. J. (2011). 
Reliability and validity of Bookmark-based methods for standard setting: Comparisons to Angoff-based methods in the National Assessment of Educational progress. Educational Measurement: Issues and Practice, 30(2), 3-14. Professional Credentialing Services. Introduction to Test Development for Credentialing: Standard Setting and Equating. www.act.org/workforce. (American College Test) Reckase, M. D. (2006). A conceptual framework for a psychometric theory for standard setting with examples of its use for evaluating the functioning of two standard setting methods. Educational Measurement: Issues and Practice, 4-18. Roberts, R. N. (2010). Damned if you do and damned if you don’t: Title VII and public employee promotion disparate treatment and disparate impact litigation. Public Administration Review, 70(4), 582-590. Schmitt, K. & Shimberg, B. (1996). Demystifying Occupational and Professional Regulation: Answers to questions you may have been afraid to ask. Council on Licensure, Enforcement and Regulation (CLEAR): Lexington, KY. Shrock, S. A., & Coscarelli, W. C. (2000). Criterion-referenced test development: Technical and legal guidelines for corporate training and certification. (2nd ed.). Washington, DC: International Society for Performance Improvement. Skaggs, G., & Tessema, A. (2001). Item disordinality with the Bookmark standard setting procedure. Paper presented at the annual meeting of the National Council on Measurement in Education, Seattle, WA. 94 Skaggs, G., Hein, S. F., & Awuor, R. (2007). Setting passing scores on passage-based tests: A comparison of traditional and single-passage bookmark methods. Applied Measurement in Education, 20(4), 405-426. Skaggs, G., & Hein, S. F. (2011). Reducing the cognitive complexity associated with standard setting: A comparison of the single-passage bookmark and Yes/No methods. Educational and Psychological Measurement, 71(3), 571-592. Society for Industrial and Organizational Psychology (2003). Principles for the validation and use of personnel selection procedures. Bowling Green, OH: Author. Stephenson, A. S. (1998). Standard setting techniques: A comparison of methods based on judgments about test questions and methods on test-takers. A Dissertation Submitted in Partial Fulfillment of the Requirements for the Doctor of Philosophy Degree. Department of Educational Psychology and Special Education in Graduate School, Southern Illinois University. U.S. Department of Labor, Employment and Training Administration. (2000). Testing and Assessment: An employer’s guide to good practices. Wang, N. (2003). Use of the Rasch IRT model in standard setting: An item-mapping method. Journal of Educational Measurement, 40(3), 231-253. Yudkowsky, R., & Downing, S. M. (2008). Simpler standards for local performance examinations: The Yes/No Angoff and Whole-Test Ebel. Teaching and Learning in Medicine, 20(3), 212-217.