للتقويم التربوي واالختبارات التعليمية for Educational Evaluation & Testing Evaluation of Selected General Cognitive Ability Tests Submitted to Submitted to Employment Selection & Assessment Program Ministry of Civil Service By Sabbarah for Educational Evaluation and Testing Prepared by Selection and Assessment Consulting Jerard F. Kehoe, Ph.D. James Longabaugh, MA August 26, 2013 TABLE OF CONTENTS Executive Summary 3 Section 1: Purpose of Report 5 Section 2: Description of Report 6 Section 3: Objectives of Study 2 8 Section 4: Methodology of Study 2 10 Section 5: Descriptive Results 12 Differential Aptitude Test for Personnel and Career Assessment (DAT PCA) Employee Aptitude Survey (EAS) General Ability Test Battery (GATB) Information about OPM’s New Civil Service Tests Professional Employment Test (PET) Power and Performance Measures (PPM) Verify Watson-Glaser Critical Thinking Appraisal Section 6: Integration, Evaluation and Interpretation of Results The Question of Validity Test Construct Considerations Item Content Considerations Approaches to Item / Test Security and Methods of Delivery Item and Test Development Strategies for Item and Bank Management Use of Test Results to Make Hiring Decisions Considerations of Popularity, Professional Standards, and Relevance to MCS Section 7: Recommendations and Suggestions 148 148 149 155 159 161 166 169 172 175 GCAT Plan and Specifications Choosing Test Constructs Specifying Item Content Methods of Item and Test Development Validation Strategies 175 175 178 182 184 Organizational and Operational Considerations Security Strategy Item Banking Staff Requirements User Support Materials Guiding Policies 199 199 200 201 204 205 Section 8: References 2 12 27 53 77 82 91 105 127 209 EXECUTIVE SUMMARY Background Information The purpose of this Study 2 report is to review and evaluate selected tests of general cognitive ability used for personnel selection in order to make recommendations to the Ministry of Civil Service (MCS) about its development of a civil service test battery. Seven tests of general mental ability (GMA) were selected for this review based on their expected relevance to MCS’s interests. The objective of Study 2 was to provide MCS with evaluative and instructive information about the development, validation and use of GMA tests for large scale applications relevant to MCS’s planned General Cognitive Ability Test (GCAT). The objective of Study 2 is to review and evaluate at seven existing GMA tests that represent models or comparisons relevant to MCS’s plan for GCAT. Selection and Assessment Consulting (SAC) reviewed and evaluated the batteries by gathering testrelated documents from publishers and from independent sources and by interviewing publishers’ testing experts for further detailed information. Findings Relevant to MCS What Do the Batteries Measure? 1. All batteries are comprised of between 4 and 10 subtests, each designed to measure a different cognitive ability (excluding psychomotor subtests in GATB). 2. Other than Watson-Glaser which is designed to measure critical thinking skills,, nearly 80% (39 of 49) of all subtests measure one of four core categories of cognitive ability – Verbal Ability, Quantitative Ability, Reasoning Ability and Spatial/Mechanical Ability. 3. Other than psychomotor and speeded subtests, 70% (24 of 34) of all subtests contain sufficient acquired knowledge content to be considered measures of crystallized ability. 4. 
Of the 24 subtests measuring crystallized ability, 18 (75%) are based on work-neutral item content; 6 are based on work-like item content. No subtest is designed specifically to measure acquired job knowledge. 5. Except for Watson-Glaser and PET, all batteries are designed to be applicable across a wide range of job families and job levels. This is achieved largely through work-neutral item content and moderately broad abilities that do not attempt to measure highly job-specific abilities. What Types of Validity Evidence Are Reported? 1. All batteries report some form of construct validity evidence, always in the form of correlations with subtests in other, similar known batteries. 2. All batteries except PPM, report predictive validity evidence usually against job performance criteria consisting of supervisory ratings of proficiency or other measures of training success. 3. No content validity evidence linking test content to job content is reported for any battery. How Are Batteries Used? 1. All batteries except Verify are available for use in proctored settings with a small (1-3) number of fixed forms that are modified only infrequently. 2. All batteries are available online in addition to a paper-pencil mode. 3. Verify and Watson-Glaser (UK) are the only batteries available online in an unproctored setting. With both batteries, unproctored administration is support by IRT-based production of randomized forms so that virtually all test takers receive a nearly unique, equivalent form. 4. For all batteries except Verify, scores are based on number-correct raw scores (corrected for guessing in two batteries). Most batteries transform raw scores to percentile rank scores based on occupation or applicant group norms. 3 Recommendations and Suggestions Recommendations and suggestions are grouped into two categories – those relating to the development of GCAT and those relating to the management of the civil service testing program. Key recommendations are summarized here. GCAT Development 1. It is recommended that MCS develop at least two subtests within each of the Verbal Ability, Quantitative Ability and Reasoning Ability categories as those categories are exemplified by most of the reviewed batteries (not Watson-Glaser). Modest job information should be gathered to determine if additional subtests are needed for Processing Speed and Accuracy and/ or for Spatial / Mechanical Ability. 2. Most, if not all, subtests should use item content associated with crystallized ability and, to the extent possible, that content should be in a work-like context to promote user acceptance and confidence. 3. Initial item and subtest development is likely to rely on classical test theory (CTT) for item retention and subtest construction. Yet, MCS should plan to migrate to an IRT-based item bank approach after implementation to prepare for the capability of producing many equivalent forms. 4. Initial item and subtest development should be sized to produce 4-6 equivalent forms of each subtest, with the exception of two forms of Processing Speed and Accuracy subtests. All forms should be in use from the beginning to establish a reputation for strong security. 5. Item developers are encouraged to develop items within a subtest with moderately diverse levels / difficulty / complexity around the level appropriate to the typical education level of the applicants. 
Items of moderate diversity in level/difficulty/complexity will create eventual opportunities using IRT-based forms production to tailor subtests to the level of the target job family. Program Management 1. Establish a proactive, well communicated security strategy from the beginning modeled after many aspects of SHL’s strategy for supporting Verify. This should include multiple forms, item refreshing, bank development, data forensics, web patrols and an “Honesty” agreement applicants are required to sign. 2. As soon as possible, MCS should evolve to a forms production strategy that is IRT-based and supported by large item banks. This approach is intended primarily to protect the integrity of the testing program and is not intended to encourage unproctored administration. Nevertheless, this approach would be capable of supporting unproctored administration, if necessary. SHL’s Verify approach is a model that is adherent to current professional standards. 3. Establish a significant research function (coupled with development) to provide the capability to refine and improve the testing program over time as empirical data accumulates to support validation efforts, to support the design of Phase 2, and to support investigations of program effectiveness. 4. Staff the support organization with the expertise and experience required to manage an effective personnel selection program. The operational leadership role should require extensive experience and expertise in personnel selection technology and program management. 5. Establish roles for managing relationships with key stakeholders including applicants, hiring organizations, the Saudi Arabian public including schools and training centers, and Testing Center staff. 6. Develop a full set of guiding policies to govern the manner in which applicants, hiring organizations, and Test Center staff may utilize and participate in the testing program. 4 SECTION 1: PURPOSE OF REPORT The purpose of this Study 2 report was to review and evaluate selected tests of general cognitive ability developed to be used for personnel selection applications. Seven test batteries were selected to help inform the Ministry of Civil Service (MCS) of the Kingdom of Saudi Arabia about the task of developing a national General Cognitive Ability Test (GCAT). Selection and Assessment Consulting (SAC) was requested by Sabbarah for Educational Testing and Evaluation (Sabbarah) to conduct Study 2. Study 2 is one of three studies planned to support MCS’s effort to develop GCAT for use in recruiting and employing applicants in a wide range of civil service jobs. The overall purpose of these studies is to provide MCS with evaluative and instructive information about the development, validation and use of cognitive ability tests for large scale applications relevant to MCS’s planned GCAT. The objective of Study 2 was to review and evaluate seven existing cognitive ability test batteries that represent models or comparisons relevant to MCS’s plan for GCAT. 
This review and evaluation provides a) descriptive information including usage, modes of delivery, user requirements, and available documentation, b) technical information regarding scoring, measurement/psychometric models, reliability and validity evidence, and information about the definitions of the tested abilities and methods for developing the items comprising the subtests, c) evaluative information for each test including its standing with respect to professional standards and principles, and issues associated with translation and adaptation, and d) suggestions and recommendations taken from the evaluative reviews regarding the development of GCAT and implementation issues such as security maintenance, item banking strategies, and test management issues relating to large scale testing programs. The information, suggestions, and recommendations contained within this Study 2 report are meant to assist test developers and designers, item writers and reviewers, and psychometricians in developing plans and specifications for GCAT so that MCS may use the resulting cognitive ability battery as a selection instrument across a wide range of civil service jobs. The descriptive information reported in Section 5 about each of the seven reviewed batteries provides the basis for the evaluative and integrative perspectives in Section 6 and provides part of the basis for the recommendations presented in Section 7. The recommendations rely on the integration of information about the selected batteries and SAC’s prior experience with large scale selection programs. Other information that informed the interpretations and evaluations in Section 6 and the recommendations and suggestions in Section 7 was derived from relevant published information that is informative to MCS, as well as information that was gained from interviewing test publishers. 5 SECTION 2: DESCRIPTION OF REPORT The descriptive and evaluative information about the selected batteries are provided in Sections 5 and 6, respectively. The recommendations are presented in Section 7. These three section constitute the primary contributions of Study 2. A. Section 5: Results, Overall Description and Uses a. Provides descriptions of the seven selected cognitive ability batteries as a summary of technical information and other test relevant information that was gathered. It is intended as descriptive information of test content and technical characteristics from the resources provided by the test publishers and alternate sources of test-relevant information. No significant evaluative information is provided in this section, except for summaries of evaluative reviews of the cognitive ability batteries previously published. Evaluative information is contained within Section 6. Information has been organized in a manner that it is clear and succinct for the reader. A complete summary of all descriptive information is provided separately in Section 5 for each of the seven reviewed batteries. Each of these battery descriptions is organized into nine major subsections: Overall Description and Uses Administrative Details Construct-Content Information Item and Test Development Criterion Validity Evidence Approach to Item / Test Security Translations / Adaptions User Support Resources Evaluative Reviews B. Section 6: Integration, Interpretation and Evaluation of Results a. Section 6 integrates and evaluates the information in Section 5 about each battery into overall observations and conclusions about this set of batteries. 
This section follows the presentation of extensive information about each target battery by integrating that information into comparative assessments of battery qualities, interpreting the potential implications for MCS’s civil service test system and evaluating the features of batteries as possible models for MCS civil service tests. Section 6 is organized into eight subsections. The question of validity Test construct considerations Item content considerations Approaches to item / test security and methods of delivery Item and test development Strategies for item management / banking Use of test results to make hiring decisions Considerations of popularity, professional standards and relevance to MCS interests C. Section 7: Recommendations and Suggestions a. This section provides recommendations and suggestions for MCS about specifications and planning for a national GCAT. This section provides suggestions about important organizational and operational considerations relating to management, staffing, security, item banking, technical support, administration, and the design of test guides, manuals, and reports. These recommendations are based on multiple sources including the reviews of the seven batteries, SAC’s own professional experience with personnel selection testing programs, and the professional literature relating to the design and management of personnel selection systems. This section provides these suggestions and recommendations organized in the following manner: 6 i. GCAT Plan and Specifications 1. Choosing Test Constructs 2. Specifying Item Content 3. Methods of Item and Test Development 4. Validation Strategies ii. Organizational and Operational Considerations 1. Security Strategy 2. Item Banking 3. Operational Staff Requirements 4. Design of User Materials 5. Strategies and Policies 7 SECTION 3: OBJECTIVES OF STUDY 2 Study 2 had several important objectives relating to reviewing and evaluating GMA tests. Provide detailed reviews of seven selected GMA tests used in employment selection. The information relevant for inclusion in the reviews were to consist of the GMA tests’ purposes, content dimensions, delivery platforms, type of scores, spread of uses, validity research, technical quality, availability of documentation, and other applicable and relevant information. List cognitive ability tests ranked by their popularity of use and compliance with the principals and professional standards for testing (e.g., APA, NCMA, AERA & SIOP). Identify the most important considerations to be taken from these tests with regard to the potential GCAT specifications and plans. Discuss important considerations regarding test specifications and plans with respect to needed resources, time, staffing, technical support, organization, management and administration of GMA tests used in an employment context. 
Provide detailed descriptions and comparisons of the batteries in terms of o o o o o o o o o o o Provide technical descriptions and comparison of the batteries in terms of o o o o o o o o o Content definition methodologies, Procedures for construct and content validity (if reported), Number and types of items for each subtest of the GMA tests, Psychometric model (whether IRT or CTT), Methods used in equating alternate forms, norming, and standardization, Total score and subtest scores, Test scale and methods of scaling, Reliability and validity of tests, and Practical aspects of selected GMA tests in terms of: Security strategies to minimize fraud during test development, production, management, maintenance, and administration, Item banking and item management, Designs of test (according to mode of delivery), Designs of test guides, manuals, supplementary materials, and technical reports, and Services provided to test users. Evaluate the batteries with respect to o o o o o 8 Purpose of use, Targeted populations, Targeted jobs or occupations, Mode of delivery, Time limits, Type of score, Availability of test reviews, Quality of technical reports and user manuals, Ease of use and interpretations, Costs associated with test and related materials, and Publishers of the seven selected GMA tests. Evaluation of tests according to professional standards (e.g., APA, NCME & AERA, and SIOP), Popularity and use of the seven selected GMA tests, Compliance with the principals and professional standards, Suggestions for translations and adaptations, Suggestions and recommendations for MCS’s development of GCAT plan and specifications, and o 9 A set of suggestions and recommendations with respect to important consideration in management, staffing, security, item banking, technical support, administration, design of test guides, manuals and reports. SECTION 4: METHODOLOGY OF STUDY 2 Study 2 used several tactics to gather the information (e.g., technical reports, interviews, alternate resources, etc…) necessary to fulfill the requirements of the study. The first phase of Study 2 involved the identification of GMA tests by SAC that had some initial basis for early consideration as possible Study 2 tests. These early tests were identified based on SAC’s own experience with such tests in the personnel selection domain, searches through sources of information about such tests, and early input from Sabbarah. This early search effort was guided by broad, high level requirements that the tests be GMA tests, produced by reputable developer/publishers, with some history of use for personnel selection across a range of job families. Originally, SAC identified18 cognitive ability tests, which then received a more detailed evaluation based on emerging guidelines and selection criteria, as well as input from Sabbarah. SAC provided a short list of 11 batteries with strong recommendation for three batteries. From these 11, Sabbarah identified the seven test batteries to be reviewed in Study 2. These are shown in Table 1 with their subtests and publishers. The criteria SAC developed for the final selection of batteries were developed during this initial battery identification process. These criteria were based on a variety of considerations, including, primarily, Sabbarah’s requirements of Study 2 and SAC’s own professional experience and judgment about important test characteristics given the Study 2 objectives. As an example, one recommendation criterion is availability of technical data. 
Other criteria, such as scope of cognitive abilities, became better understood as communications with Sabbarah further clarified the likely model for the new civil service exam in which a battery of several subtests would assess a range of cognitive abilities. Also, Sabbarah suggested an important criterion relating to the level of test content. The majority of candidates for Saudi Arabia civil service jobs are likely to have college degrees or some years of college experience. Also, an important job family is professional/managerial work. As result, Study 2 will be useful if it includes at least some tests developed at reading levels and/or content complexity appropriate to college educated candidates and professional/managerial jobs. It was from these criteria that the seven GMA tests were chosen for inclusion in Study 2. The criteria used to inform SAC’s recommendations for GMA tests to be included in Study 2 are shown here. A. Reputable Developer/Publisher a. Does the developer/publisher have a strong professional reputation for welldeveloped and documented assessment products? B. Sabbarah/MCS Interest a. Has Sabbarah expressed an interest in a particular test? C. Scope of Cognitive Abilities a. Do the cognitive abilities assessed cover a wide range? D. Availability of Data a. Is information about the test readily available, such as technical manuals, administrative documents, research reports, and the like? E. In Current Use a. Is the test in current use? F. Complementarity a. Does the test assess cognitive abilities that complement the specific abilities assessed by other included tests? G. Relevance to MCS Job Families a. Are the cognitive abilities assessed by the test relevant to the job families for which MCS’s new civil service exams will be used? H. Level of Test Content a. Is the reading level or complexity of content at a level appropriate to managerial/professional jobs? (Needed for some but not all included tests.) I. Special Considerations a. Are there special features/capabilities of the test that will provide significant value to Sabbarah that would be absent without this test in Study 2? J. Overall Recommendation a. Does the test have a High Value, Good Value, Marginal Value, or No Value? Table 1 describes the seven selected batteries reviewed in Study 2. 10 Table 1. The seven batteries included for review and evaluation in Study 2. GMA Test Battery Subtests Differential Aptitude Test for 1. Space Relations Personnel and Career 2. Abstract Reasoning Assessment (DAT for PCA) 3. Language Usage 4. Mechanical Reasoning 5. Numerical Ability 6. Verbal Reasoning Employee Aptitude Survey 1. EAS 1-Verbal Comprehension (EAS) 2. EAS 2-Numerical Ability 3. EAS 3-Visual Pursuit 4. EAS 4-Visual Speed and Accuracy 5. EAS 5-Space Visualization 6. EAS 6-Numerical Reasoning 7. EAS 7-Verbal Reasoning 8. EAS 8-Word Fluency 9. EAS 9-Manual Speed and Accuracy 10. EAS 10-Symbolic Reasoning General Aptitude Test Battery 1. Name Comparison (GATB Forms E & F) 2. Computation 3. Three-dimensional Space 4. Vocabulary 5. Object Matching 6. Arithmetic Reasoning 7. Mark Making 8. Place 9. Turn 10. Assemble 11. Disassemble Professional Employment Test 1. Data Interpretation (PET) 2. Reasoning 3. Quantitative Problem Solving 4. Reading Comprehension Power and Performance 1. Applied Power Measures (PPM) 2. Mechanical Understanding 3. Numerical Computation 4. Numerical Reasoning 5. Perceptual Reasoning 6. Processing Speed 7. Spatial Ability 8. Verbal Comprehension 9. Verbal Reasoning Verify 1. 
Verbal Reasoning 2. Numerical Reasoning 3. Inductive Reasoning 4. Mechanical Comprehension 5. Checking 6. Calculation 7. Reading Comprehension 8. Spatial Ability 9. Deductive Reasoning Watson-Glaser II 1. Recognize Assumptions 2. Evaluate Arguments 3. Draw Conclusions Publisher Pearson PSI United States Employment Service (USES) PSI Hogrefe SHL Pearson After the seven batteries were selected, SAC gathered technical manuals and other relevant sources of information related to each of the batteries. These sources included documents and information such as technical reports, sample materials, administration materials, user manuals, interpretive guides, and published research studies. In addition to these types of documents, SAC also recognized the need to interview publishers’ technical experts to gather detailed information that is generally not available in published material. This information included topics such as item/test 11 development, psychometrics modeling, validity support, development of alternate forms, item banking strategies, and item/test security to name a few topics. SAC was able to interview publishers’ experts for all batteries except GATB and Verify. This was a not a significant setback for GATB because far more published information is available about GATB than the other batteries. This was a modest setback for Verify, however, because in our view Verify is the best model in many respects for GCAT. Further, SHL declined to approve our effort to gather protected documentation about Verify such as a current technical manual and technical information about their IRT-based bank management process. SHL’s reason for not approving our access to protected information was that they were concerned about disclosure of their intellectual property. The information gathered from multiple sources was then disseminated into Section 5 which details information regarding each of the seven selected GMA tests. In addition to the publisher expert interviews, SAC also interviewed Principals at Sabbarah along with the Study Reviewer. This particular interview provided significant clarification about the overall development plans and use of GCAT for use as a selection instrument across a wide range of civil service jobs in the Kingdom of Saudi Arabia. From this particular interview, the scope of the project was slightly adjusted and modified to better complement the proposed development and uses of GCAT by MCS. Section 5 summarizes all descriptive information that was gathered for each of the seven batteries using the same organization of information for all batteries. This information informed Section 6 and Section 7 of this Study 2 report. The primary focus of Section 6 was to identify key similarities and differences among the reviewed batteries to identify those features that are likely to be most relevant and least relevant to MCS’s effort to develop GCAT. Section 6 placed high importance on the issues of test constructs and item content because these early decisions about GCAT will dictate many of the subsequent technical considerations. Section 7 presents recommendations about decisions and approaches to the development, validation, implementation and ongoing maintenance of the civil service testing system. While the battery-specific information in Sections 7 and 8 provided significant input into SECTION 7, the recommendations were also based on SAC’s professional experience and other published information about large scale personnel selection programs. 
DESCRIPTIVE RESULTS SECTION 5: DESCRIPTIVE RESULTS FOR DAT for PCA Overall Description and Uses Introduction The Differential Aptitude Test for Personnel and Career Assessment (DAT for PCA) is a cognitive ability battery composed of eight subtests intended to measure four abilities. The first edition of the DAT was originally published in 1947, and since then it has been widely used for educational placement and career guidance of students. It has also been used extensively for occupational assessments of adults and post-secondary students. The DAT for PCA was created specifically for assessing adults who are applicants for employment, candidates for training and development, career guidance, and adult education students. The DAT for PCA is an adapted short form of the DAT Form V that uses the same test items to measure the same DAT aptitudes. Verbal Reasoning was designed to measure the ability to understand concepts framed in words, to think constructively, find commonalities among different concepts, and to manipulate ideas at an abstract level. Numerical Ability was designed to measure an individual’s understanding of numerical relationships and capability to handle numerical concepts. Abstract Reasoning was designed to assess reasoning ability such as perceiving relationships that exist in abstract patterns. Mechanical Reasoning was developed to measure knowledge of basic mechanical principles, tools, and motions. Space Relations was designed to assess a person’s ability to visualize a three-dimensional object from a two-dimensional pattern, and to visualize how the pattern would look if rotated. Spelling and Language Usage was designed to measure how well individuals spell common English words. The 12 Clerical Speed and Accuracy subtest was designed to assess a person’s processing speed in simple perceptual tasks. Note: The DAT for PCA is a shortened form of the DAT Form V. Therefore, the DAT for PCA and DAT may be used interchangeably throughout this document. However, we have just learned that two subtests, Spelling and Clerical Speed and Accuracy, were recently removed from the DAT battery due to a low volume of use. We have decided to continue to include these two subtests in the remaining portion of this DAT Report as we feel they still provide examples of subtests relevant to administrative/clerical jobs in Saudi Arabia’s service sector of jobs. Purpose of Use DAT for PCA was designed specifically for assessing adults who are applicants for employment, candidates for training and development, career guidance, and adult education students. Other versions of the DAT are also used for educational placement and career guidance of students. The abilities that DAT was designed to assess are general reasoning abilities, mechanical operations and principles, verbal achievement, and clerical speed abilities. These abilities have been seen to be important for a broad spectrum of applications in personnel selection and career guidance. Each of the subtests was developed separately from one another and intended to measure distinct abilities. Much of the item content in the DAT tests is a mix of school-related topics, experience, verbal and nonverbal analogical reasoning, nonverbal spatial ability, and perception. DAT for PCA was developed to shorten the overall test length compared to DAT Form V and other earlier versions, improve content appropriateness, and enhance the ease of local scoring. 
Target Populations The populations targeted by the DAT are adults who are applicants for employment, candidates for training and development, career guidance, and adult education students. These adults may be applying for or working in a wide range of occupations such as professional, managerial, clerical, engineering, military, administrative, technical services, skilled trades, and even unskilled trades. Normative information is presented in the technical manual and supporting norm documents. Norms are for gender (i.e., males, females, and both sexes), and are presented in percentiles as well as stanines. Target Jobs / Occupations The DAT was designed to be used for a wide range of occupations such as professional, managerial, clerical, engineering, military, administrative, technical services, skilled trades, and even unskilled trades. It may be tailored to specific occupations by administering subtests in various combinations. The technical manual reports summary statistical information for several specific occupational groups. Spread of Uses The DAT was designed as test of cognitive abilities primarily for assessing adults who are applicants for employment, candidates for training and development, career guidance, and adult education students. However, other versions of the DAT have been used for different purposes and for other populations. For example, one version of the DAT is primarily used with primary school students as an instrument for career guidance. The DAT for PCA is designed solely for adult personnel selection and vocational planning. It was designed to be used across a wide range of occupations. The norm tables in the technical manual are related to gender, but other norm tables are available for student grade levels (e.g., students in year 11, A Level students, and students with higher levels of education) and for job families such as professional and managerial. The DAT is scored with number-correct, raw score totals for each subtest. Total scores may be tailored to different combinations of subtests. Administrative Details Administrative detail is summarized in Table 2 and briefly described below. 13 Table 2. Administrative features of the DAT for PCA. Subtest # Items Time Limit 35 15 minutes 30 15 minutes 30 12 minutes 45 20 minutes 25 20 minutes Verbal Reasoning 30 20 minutes Spelling 55 6 minutes Clerical Speed & Accuracy 100 6 minutes Space Relations Abstract Reasoning Language Usage Mechanical Reasoning Numerical Ability Scoring Rule # Correct Raw Score. Percentile scores and stanines reported based on selected norm groups. Methods of Delivery Paper-pencil, proctored; Online, proctored. Time Limits Compared to other versions of the DAT, the DAT for PCA subtests have fewer items and shorter time limits. The developers shortened it to encourage and facilitate use of the battery for adult assessment applications. The shortened time limits and fewer numbers of items are the only differences. The target constructs and item content remained the same. Note: This shortening did not apply to the speeded Clerical Speed and Accuracy subtest before it was removed from the battery. Number and Type of Items Table 2 shows the number of items for each of the eight subtests comprising the DAT for PCA, ranging from 25 to 100 items. It appears that much of the item content of the DAT tests are a mix of school-related topics, experience, verbal and nonverbal analogical reasoning, nonverbal spatial ability, and perception. 
Three of the subtests measure some form of reasoning (e.g., the Abstract Reasoning subtest is nonverbal), three are based on verbal content, and one each for numerical, spatial ability, and clerical speed. The reading level is specified at no greater than grade six. Therefore, this allows the DAT for PCA cognitive ability battery to be used across a very wide range of jobs except higher level jobs for which job complexity and applicant education levels warranted higher reading levels. Type of Score Number-correct subtest scores are added together to produce an overall total score and is compared to the total possible score. Raw scores can also be compared to norm group tables to produce percentile scores and stanines. The DAT for PCA was designed specifically for local scoring by the user; both hand scoring and Ready-Score self-scoring options are available. Test Scale and Methods of Scaling Norm group standards allow percentile and stanine scores to be produced from raw score totals for each of the eight subtests and for total scores. Each of the versions of the DAT can be used by tailoring combinations of subtests to specific occupations and jobs. 14 Method of Delivery The fixed forms of the DAT for PCA battery can be administered either by paper-pencil or by computer and Pearson requires that both methods of delivery be proctored. The number of items, types of items, and time limits are the same in both modes of administration. Scoring is the same for both methods. Cost Tables 3 shows the current prices published for the DAT for PCA administrative materials. No pricing is published for online administrations. Table 3. Published prices for DAT for PCA administrative materials. Item Unit Cost Individual Test Booklets Verbal Reasoning - Pkg. of 25 $190.00 Numerical Ability - Pkg. of 25 $190.00 Abstract Reasoning - Pkg. of 25 $190.00 Mechanical Reasoning - Pkg. of 25 $190.00 Space Relations - Pkg. of 25 $190.00 Language Usage - Pkg. of 25 $190.00 Answer Documents Verbal Reasoning - Pkg. of 50 $147.00 Numerical Ability - Pkg. of 50 $147.00 Abstract Reasoning - Pkg. of 50 $147.00 Mechanical Reasoning - Pkg. of 50 $147.00 Space Relations - Pkg. of 50 $147.00 Language Usage - Pkg. of 50 $147.00 Directions for Administration and Scoring Directions for Administration and Scoring - Pkg. of One $22.00 Scoring Key - Single Copy Verbal Reasoning $96.00 Numerical Ability $96.00 Abstract Reasoning $96.00 Mechanical Reasoning $96.00 Space Relations $96.00 Language Usage $96.00 *No cost information is available for Spelling or Clerical Speed and Accuracy subtests as they were recently removed from the batter. Construct – Content Information Intended Constructs The subtests that comprise the DAT for PCA are typical of other multiaptitude cognitive ability batteries. The DAT for PCA includes eight subtests which measure four major abilities. Three of the subtests are designed to measure reasoning (e.g., the Abstract Reasoning subtest is nonverbal), three are verbal, one, numerical, one for spatial abilities, and one for clerical speed. Like many other commonly used cognitive batteries DAT subtests may be combined in a manner that is tailored to specific occupations and jobs. The DAT developer intended to design a multiaptitude cognitive ability battery to measure an individual’s ability to learn or succeed in various work and school domains. The original DAT developed in 1947 was used primarily for educational placement and career guidance for students in grades eight through twelve. 
However, it was later adapted to be used with adults for personnel selection and career assessment. In order to use it as such, the DAT for PCA was shortened (i.e., 15 both number of items and time limits), the tests were repackaged, and it was enhanced for the ease of local scoring. Table 4 below shows the eight subtests comprising the battery and the four abilities it is intended to measure. Table 4. Subtests of the DAT for PCA and the abilities it is intended to assess. Ability Mechanical General Operations & Verbal Subtest Reasoning Principles Achievement Verbal X Reasoning Numerical Ability X Abstract Reasoning Mechanical Reasoning Space Relations Clerical Speed X X X Spelling X Language Usage Clerical Speed & Accuracy X X Item Content in the Subtests Item content for each subtest is described here. The reading level is specified at no higher than grade six, allowing the battery to be used across a wide range of occupations. Verbal Reasoning (VR) a. Measures the ability to understand concepts framed in words, to think constructively, find commonalities among different concepts, and to manipulate ideas at an abstract level. The items can test the examinee’s knowledge and also their ability to abstract and generalize relationships from their knowledge. i. The VR test consists of analogies that are double-ended in which both the first and last terms are missing. The participant then must choose from five alternative pairs to determine the best one pair that best completes the analogy. Content of the items can be varied from multiple subject areas. Rather than focusing on vocabulary recognition, the analogies instead focus on the ability to infer the relationship between the first pair of words and apply that relationship to a second pair of words. 1. Pearson notes that VR may be expected to predict future success in occupations such as business, law, education, journalism, and the sciences 16 2. Pearson notes that VR may be expected to predict future success in occupations such as business, law, education, journalism, and the sciences. B. Numerical Ability (NA) a. Measures an individual’s understanding of numerical relationships and capability to handle numerical concepts. i. Designed to avoid language elements to avoid language elements of usual arithmetic reasoning problems in which reading ability may affect the outcome. Being so, items are purely a measure of numerical ability. The examinee must perceive the difference between numerical forms. In order to assess ensure that reasoning rather than computation facility is stressed, the computational level of the items is below the grade level of examinees that the test is intended for. 1. Pearson notes that NA may be an important predictor of success in occupations such as mathematics, physics, chemistry, and engineering. It may also be important for jobs such as bookkeeper, statistician, and tool making, etc… C. Abstract Reasoning (AR) a. Assess reasoning ability such as perceiving relationships that exist in abstract patterns. It is a non-verbal measure of reasoning ability. It assesses how well examinees can reason with geometric figures or designs. i. Each of the AR items requires the examinee to use perception of operating principles in a series of changing diagrams. They must discover the principles governing the changes in the diagrams by designating which of the optional diagrams should logically follow. 17 1. 
Pearson notes that AR may predict success in fields such as mathematics, computer programming, drafting, and automobile repair. D. Mechanical Reasoning (MR) a. Measures basic mechanical principles, tools, and motions. i. Each of the MR test items consist of a pictorially presented mechanical situation and questions that is worded simply. These items are representative of simple principles that involved reasoning. 1. Pearson notes that MR may be expected to predict future job performance in occupations such as carpenter, mechanic, maintenance person, and assembler. E. Space Relations (SR) a. Assesses a person’s ability to visualize a three-dimensional object from a twodimensional pattern, and be to visualize how the pattern would look if rotated. i. Each SR test item presents one pattern, which is followed by four threedimensional figures. The participant must choose the one figure that can be created from the pattern. The test patterns are purported to be large and clear. Basically, the task is to judge how the objects would look if constructed and rotated. 1. Pearson notes that SR may be predictive of success in occupations such as drafting, clothing design, architecture, art, decorating, carpentry, and dentistry. 18 F. Spelling (SP) a. Measure of how well individuals can spell common English words. i. The examinee is presented with a list of words in which they must determine which of those words are correctly spelled and which are misspelled. Misspelled words are considered to be common and plausible spelling errors. 1. Pearson notes that SP may be predictive of future job performance in occupations such as stenography, journalism, and advertising, among nearly all occupations where use of the English language is a necessity. 2. No sample item available. G. Language Usage (LU) a. Items are representative of present-day formal writing. Examinees must distinguish between correct English language usage and incorrect usage. i. Measures the ability to detect errors in grammar, punctuation, and capitalization. 1. Pearson notes that LU may be predictive of future job performance in occupations such as stenography, journalism, and advertising, among nearly all occupations where use of the English language is a necessity. 19 Clerical Speed and Accuracy (CSA) b. Assesses a person’s response speed in simple perceptual tasks. i. Examinees must first select the combination that is marked in the test booklet, then keep in mind while searching the same combination in a group of similar combinations on an answer sheet, and then selecting the correct combination. 1. Pearson notes that CSA may be important in occupations such as filing, coding, stock room work, and other technical and scientific data roles. 2. No sample item available. Combinations of Subtests The technical manual reports several combinations of the eight subtests that can be combined to create composites for measuring specific abilities. These are: A. VR + NA a. Measures functions associated with general cognitive ability i. This combination of subtests reflects an ability to learn in an occupational or scholastic environment, especially from manuals, trainers, teachers, and mentors. B. MR + SR + AR a. Measure components of perceptual ability. This may be seen as the ability to visualize concrete objects and manipulate the visualizations, recognizing everyday physical forces and principles, and also reasoning and learning deriving from a nonverbal medium. i. 
Abilities of this type are important in dealing with things, not necessarily people or words. Could be used in skilled trade occupations. C. CSA + SP + LU a. Abilities that represent a set of skills which may be necessary for a wide range of office work. i. These abilities may be useful for jobs relating to clerical or secretary positions. Construct Validity Evidence The publisher reported correlations between the DAT for PCA subtests and other subtests included in other commercially available cognitive ability batteries such as the General Aptitude Test Battery (GATB) and the Armed Services Vocational Aptitude Battery (ASVAB), which were at one time the two most widely used multiple aptitude test batteries in the US. The GATB had been widely used in personnel assessment and selection by state employment services. The pattern of correlations in Table 5 below provides support for the DAT: (1) The DAT battery is highly related to the six GATB cognitive factors, (2) all of the DAT subtests, except for the Clerical Speed and Accuracy, were highly related with the GATB’s general intelligence factor, (3) each of the DAT subtests has its highest correlation with the appropriate GATB factor, and (4) the DAT’s Clerical Speed and Accuracy subtest correlated relatively high with the GATB perceptual tests and motor tests. 20 Table 5. Construct validity correlations between DAT and GATB subtests. VR + VR NA NA AR CSA MR SR SP LU G NA .73 VR + NA AR .63 .70 .71 CSA .41 .44 .46 .35 MR .68 .63 .71 .69 .27 SR .68 .67 .72 .70 .38 .72 V N SP LU G V N S .68 .81 .78 .76 .52 .53 .57 .66 .72 .64 .62 .58 .68 .80 .81 .76 .61 .59 .44 .51 .64 .58 .43 .63 .40 .44 .48 .46 .48 .42 .35 .55 .62 .57 .29 .58 .40 .53 .64 .55 .33 .68 .75 .70 .68 .64 .40 .80 .81 .58 .45 .94 .66 .70 .54 .57 .41 P Q .19 .40 .23 .42 .22 .44 .24 .39 .36 .61 .19 .21 .27 .29 .24 .50 .19 .39 .37 .57 .28 .51 .35 .62 S P .49 .46 .49 The DAT was also correlated with the ASVAB, which has been used in personnel selection, high school career counseling, and two major Department of Defense programs. The ASVAB Form 14 contains 10 subtests; (a) General Science, (b) Arithmetic Reasoning, (c) Word Knowledge, (d) Paragraph Comprehension, (e) Numerical Operations, (f) Coding Speed, (g) Automotive and Shop Information, (h) Mathematics Knowledge, (i) Mechanical Comprehension, and (j) Electronics Information. Eight of the tests are power tests, and the other two are speeded tests. This study found that (1) overall, the DAT is highly correlated with the ASVAB, (2) except for the Clerical Speed and Accuracy DAT subtest, all of the DAT subtests were moderately to highly correlated with all of the ASVAB’s power tests, (3) DAT’s Clerical Speed and Accuracy test was only correlated with the ASVAB’s speeded tests, and (4) the DAT composite of Verbal Reasoning and Numerical Ability was highly related to ASVAB’s school-related tests. Table 6 provides the results of the study. 21 Table 6. Construct validity correlations between DAT and ASVAB subtests. 
VR NA NA VR+NA AR CSA MR SR SP LU GS ARITH WK PC NO CS AS MK MC .75 VR+NA - - AR .69 .75 - CSA .11 .22 - .17 MR .62 .55 - .58 .08 SR .66 .62 - .61 .09 .71 SP .56 .56 - .46 .19 .50 .48 LU .76 .67 - .62 .11 .64 .64 .73 GS .72 .64 .73 .58 .05 .66 .61 .53 .68 ARITH .75 .79 .82 .65 .10 .62 .66 .54 .67 .72 WK .78 .67 .78 .62 .04 .63 .59 .60 .76 .82 .73 PC .72 .66 .74 .62 .07 .60 .59 .57 .72 .72 .71 .80 NO .23 .41 .33 .30 .35 .21 .27 .36 .28 .27 .35 .28 .29 CS .22 .35 .30 .28 .42 .12 .19 .36 .26 .16 .26 .20 .26 .58 AS .47 .40 .47 .39 -.03 .63 .50 .27 .39` .60 .52 .57 .49 .15 .04 MK .73 .78 .80 .66 .13 .58 .67 .54 .68 .67 .80 .68 .69 .34 .27 .45 MC .61 .57 .63 .57 .03 .73 .66 .39 .55 .66 .65 .63 .62 .21 .14 .66 .63 EI .48 .42 .49 .40 -.01 .59 .50 .35 .48 .59 .53 .57 .52 .14 .07 .66 .50 .68 Item and Test Development Item Development The DAT for PCA is a shortened form the DAT Form V with fewer items and shorter time limits, except for the Clerical Speed and Accuracy test which remained the same before it was removed from DAT for PCA. In order to shorten the DAT for PCA, the developer begin with a theoretical analysis of the psychometric effect on reliability resulting from shortening tests. In order to estimate reliability of the shorter tests comprising the DAT for PCA, the Spearman-Brown formula was applied to the known internal reliability coefficients for DAT Form V. From these values Pearson was able to determine how much the original DAT tests could be shorten while retaining acceptable reliability coefficients. Because DAT Form V is among the most reliable of all commercially available cognitive ability selection batteries we are aware of, Pearson found that the DAT Form V tests could be shortened significantly causing inadequate reliability. Table 7 reports these reliabilities for DAT Form V and DAT for PCA. 22 Table 7. Observed reliability coefficients for DAT Form V and estimated reliability coefficients for DAT for PCA. DAT Subtest Form V (Observed) DAT for PCA (Estimated) Verbal Reasoning .94 .90 Numerical Ability .92 .88 Abstract Reasoning .94 .90 Mechanical Reasoning .94 .90 Space Relations .95 .92 Spelling .96 .93 Language Usage .92 .87 The developer’s intent was to create items that would produce an overall test difficulty level appropriate for adults and students in post-secondary training programs. “Target difficulty” of test items shown in Table 8 is the average proportion of correct responses to a test’s items in the intended population, adjusted for the number of response alternatives. Table 8. Target difficulty for DAT for PCA subtests. DAT Subtest Response Alternatives Target Difficulty Verbal Reasoning 5 .60 Numerical Ability 5 .60 Abstract Reasoning 5 .60 Mechanical Reasoning 3 .67 Space Relations 4 .62 Spelling 2 .75 Language Usage 5 .60 * Response alternatives are the number of alternatives from which examinees may choose on each item. Items were also adapted for the DAT for PCA in order to enhance cultural transparency. Some of the previous items in the DAT were found to not be appropriate in other English-speaking countries. This may have been due to language differences, spelling conventions, systems of measurement, and cultural references. Care was also taken to avoid increasing gender differences that were known in other versions of the DAT. The developers used two sets of data to assemble the DAT for PCA, an initial try out sample and a subsequent pilot sample. 
The first sample was used to select items into the DAT for PCA while the second sample was used for subsequent item analysis and test statistics of the DAT for PCA subtests. The first sample consisted of item analysis statistics (i.e., item difficulty and discriminating power), for students in Grade 12 of the 1982 Form V standardization sample (Data available for both males and females, separately and combined; sample size not reported). The second sample consisted of a spaced sample of 1,512 males in the standardization sample, spanning Grades 10 through 12. They used these data sets to select DAT Form V items for use in the DAT for PCA, and also to evaluate the psychometric properties of the shortened DAT for PCA compared to the DAT Form V. Table 9 shows the internal consistency estimates provided by the second data set. Table 9. Internal consistency estimates for the DAT for PCA subtests. Subtest KR20 M SD SEM Verbal Reasoning .91 15.7 7.7 2.3 Numerical Ability .88 13.9 6.0 2.1 Abstract Reasoning .91 20.0 7.2 2.2 Mechanical Reasoning .91 32.5 8.5 2.6 Space Relations .93 22.0 9.0 2.4 Spelling .94 38.8 11.0 2.9 Language Usage .89 17.0 7.0 2.3 *Clerical Speed and Accuracy is not reported because it is a speeded test. As such, internal consistency measures of reliability are not appropriate. 23 Psychometric Model Predominantly, the DAT for PCA does not appear to employ IRT for item and test development or item management. Also, there is no item bank for this cognitive ability battery. Multiple Forms DAT for PCA does not have multiple forms. It was derived directly from the DAT Form V in which the content specifications were unchanged and the DAT for PCA subtests were composed of items directly from the DAT Form V. Therefore, it can be considered that the DAT for PCA and DAT Form V are measuring the same aptitudes. In order to determine the equivalency of the two versions, correlation coefficients were computed between raw scores on each test as a type of alternative forms reliability estimate. Clerical Speed and Accuracy did not have alternative forms reliability computed as it remained the same. The data source for the analysis was from the Grades 10 through 12 males’ item response data from the 1982 standardization of DAT Form V. To compute, each DAT test was scored twice (i.e., First, the Form V raw score was computed, and then the PCA for PCA raw score was computed by ignoring the responses to the Form V items not used in the new DAT for PCA test. The correlations are part-whole correlations because the constituent test items of the DAT for PCA are completely contained in the longer Form V. They cannot be interpreted as alternative form correlations because part-whole correlations are known to overstate the relationship between independently measured variables. But, the can be used to support the equivalence of a short form with a long form. The results in Table 10 show these part-whole correlations for the DAT Form V and DAT for PCA. Correlations may be very high as the DAT for PCA contains the same items as the DAT Form V. Table 10. Part-whole correlations of DAT Form V and DAT for PCA. DAT Subtest rxx Verbal Reasoning .98 Numerical Ability .97 Abstract Reasoning .98 Mechanical Reasoning .98 Space Relations .96 Spelling .97 Language Usage .97 Item Banking Approaches The DAT Form V tests were used as items banks for selection of the DAT for PCA items during development. However, the DAT for PCA is used as a single fixed form test, requiring proctored administration in all cases. 
Pearson has not develop a bank of items to support DAT for PRC and does not use any methods of randomized or adaptive test construction in the online version for DAT for PCA. Approach to Item / Test Security Test Security The information gained from the Pearson in regard to test security centered on steps taken to make sure that test content is not jeopardized within their proctored administration process. Pearson uses web patrols to monitor internet activity and other resources for any indication that test items or answer keys have been acquired and distributes or sold. Also, they require test users to agree that DAT for PCA will be proctored by a trained, qualified test administrator. Lastly, Pearson is prepared to develop new equivalent items to ensure item currency and security or the creation of customized forms as clients might require. 24 Criterion Validity Evidence Criterion Validity Table 11 summarizes criterion validity results for the DAT for PCA. The results support the conclusion that DAT for PCA total test scores predict future job performance for a number of occupations. Table 11. Criterion validity evidence in various occupations with the DAT for PCA. Group N Test r Skilled tradesman 87 VR NA MR .39 .25 .31 66 VR NA MR SR .43 .29 .36 .18 40 VR NA MR SR .35 .29 .35 .27 Electricians 26 VR NA MR SR .57 .36 .39 .05 Pipefitters 21 SR AR .32 .41 Skilled Tradesman 87 VR .37 Electricians, Mechanics 66 SR .18 Mechanics 40 SR .27 Electricians 26 SR .05 Pipefitters 21 SR AR .32 .53 Pipefitters 21 AR .47 Manager ranking of communication Pipefitters 21 AR .47 Manager ranking of planning and organization 87 VR NA MR SR .37 .30 .35 .24 66 VR NA MR SR .38 .35 .34 .17 40 VR NA MR SR .34 .32 .37 .00 Electricians, Mechanics Mechanics Skilled tradesman Electricians, Mechanics Mechanics 25 Criterion Sample Composition Manager ranking of mechanical performance Manager ranking of ability to take initiative Manager ranking of problem identification skill Avg. age 43 years, 20 years company tenure, 12.5 years in current job. Males, 83 White, 3 Black, 1 Hispanic. Electricians 26 VR NA MR SR Pipefitters 21 VR NA .32 .51 Skilled tradesman 87 VR NA MR .28 .31 .27 Electricians, Mechanics 66 VR NA MR .33 .35 .29 Mechanics 40 VR NA MR .31 .40 .27 Electricians 26 VR NA MR .39 .34 .34 Pipefitters 21 AR .39 87 VR NA MR SR .33 .31 .33 .25 66 VR NA MR SR .38 .35 .39 .22 40 VR NA MR SR .37 .40 .39 .19 Electricians 26 VR NA MR SR .40 .34 .39 .32 Pipefitters 21 SR AR .24 .44 87 VR NA MR SR .33 .30 .40 .29 66 VR NA MR SR .36 .33 .44 .24 40 VR NA MR SR .28 .34 .43 .12 Electricians 40 VR NA MR SR .51 .42 .49 .52 Pipefitters 21 SR AR .37 .49 Skilled tradesman Electricians, Mechanics Mechanics Skilled tradesman Electricians, Mechanics Mechanics 26 .44 .45 .31 .46 Manager ranking of problem resolution skill Manager ranking of overall performance Manager ranking of overall performance potential Translations / Adaptions No translations are available. The DAT for PCA is available only in English. User Support Resources Pearson provides resources for registered and qualified users. Resources supporting DAT products include: Technical Manuals User Guides Information Overviews Sample Report Evaluative Reviews Twelfth Mental Measurements Yearbook The reviews are generally positive regarding the DAT for PCA, and especially praising its longevity and wide range of uses. It is noted that there is substantial criterion validity for use as a general screening instrument for employment purposes. 
The authors purport it to being an excellent test for personnel and educational assessment. As noted previously, it is also continuously being revised and updated which enhances its future use. There are some weaknesses around test bias and normative information which are said could be improved with further work (Willson & Wang, 1995). While normative and validity evidence exists, authors of the review suspect that test users could find more useful information if the publisher were to continue such studies. All in all, the authors support future use of the DAT for both personnel selection and vocational guidance. The DAT is also reviewed and reported to have strong reliability and validity evidence, and is also purported to be one of most frequently used cognitive ability batteries (Wang, 1993). Wang (1993) states that it has a very high quality, credibility, and utility, which make such a well-founded battery. Also, new items have only added to its improvements and maintained its psychometric qualities. DESCRIPTIVE RESULTS FOR EAS Overall Description and Uses Introduction PSI's Employee Aptitude Survey (EAS) consists of 10 subtests and is intended to measure eight cognitive abilities. The EAS was originally published in 1963, and since then it has been used extensively for personnel selection and career guidance applications. Since its development, it has maintained the 10 original subtests which have been shown to be both reliable and valid predictors of future performance in many different occupations. These abilities as measured by the EAS have been found to be important for a wide variety of jobs. The Verbal Comprehension subtest was designed to assess the ability to understand written words and the ideas associated with them. The Numerical Ability subtest assesses the ability to add, subtract, multiply, and divide integers, decimals, and fractions. Visual Pursuit was created with the intent of measuring a person's ability to make rapid, accurate scanning movements with the eyes. Visual Speed and Accuracy measures the ability to compare numbers or patterns quickly and accurately. Space Visualization assesses the ability to imagine objects in threedimensional space and to manipulate objects mentally. Numerical Reasoning was created with the intent to measure a participant's ability to analyze logical numerical relationships and to discover 27 underlying principles. Verbal Reasoning measures the ability to combine separate pieces of information and to form conclusions on the basis of that information. Word Fluency was designed to assess an individual's ability to generate a number of words quickly without regard to meaning. Manual Speed and Accuracy measures the ability to make repetitive, fine finger movements rapidly and accurately. Symbolic Reasoning assesses the ability to apply general rules to specific problems and to derive logical answers. Purpose of Use The Employee Aptitude Survey (EAS) was designed as a personnel selection and career guidance instrument. In addition to selection and career guidance, the EAS can also be used for placement, promotions, and training and development. The abilities that the EAS was designed to assess are verbal comprehension, numbers, word fluency, reasoning, mechanical reasoning, space visualization, syntactic evaluation, and pursuit. 
Each of these abilities is seen as important for a wide ranging variety of jobs such as professional/managerial/supervisory, clerical, production/mechanical (skilled and semi-skilled), technical, sales, unskilled, protective services, and health professionals. Each of the subtests was developed separately and by multiple authors. Much of the content of the items appears to be work neutral. PSI has taken care to make sure that the EAS is easy to administer, score, and interpret. Target Populations The populations targeted for the EAS are candidates who are at least 16 years and older or working adults in wide ranging jobs and occupations such as professional/managerial/supervisory, clerical, production/mechanical (skilled and semi-skilled), technical, sales, unskilled, protective services, & health professionals. Normative information for the EAS is presented in the EAS Examiner’s Manual, with additional norm tables provided in the EAS Norms Report. A total of 85 norm tables, 65 job classifications, and 17 general or educational categories are available. Target Jobs/Occupations As previously mentioned, the EAS is not intended just for a few specific occupations, but instead a wide ranging field such as such as professional/managerial/supervisory, clerical, production/mechanical (skilled and semi-skilled), technical, sales, unskilled, protective services, and health professionals. As such, the EAS can be tailored using numerous combinations of the subtests to best suit the needs and requirements of the target occupation. Examples of the occupations the EAS can be used for selection with: A. Professional, Managerial, and Supervisory a. Jobs specializing in fields such engineering, accounting, personnel relations, and management. These are jobs that most likely require college or university level education. i. Examples of tasks would be scheduling, assigning, monitoring, and coordinating work. B. Clerical a. Jobs that are administrative in nature. i. Examples of tasks are preparing, checking, modifying, compiling, and maintaining documents/files. Other tasks may be coding and entering data. C. Production/Mechanical (Skilled and Semi-skilled) a. Jobs that require specific and standardized procedures. i. Examples of tasks are operating, monitoring, inspecting, troubleshooting, repairing, and installing equipment. These jobs may also involve calculations, computer operation, and quality control activities. D. Technical 28 E. F. G. H. a. Jobs that specialize in fields such as engineering, information management systems, applied sciences, and computer programming. These are jobs which normally require junior college or technical education. i. Examples of tasks are programming and computer operations. Sales a. Jobs that involve product marketing or selling of services. i. Example of a tasks is product demonstrations with potential clients. Unskilled a. Jobs requiring no advanced or technical education. Usually involve the performance of simple, routine, and repetitive tasks in structured environments. Protective Services a. Jobs focusing on promoting health, safety, and welfare of the public. Jobs may police officer, fire fighter, and security guard. Health Professional a. Jobs focusing on medical, dental, psychological, and other health services. These jobs usually require specialized training and education from colleges or universities. Spread of Uses The EAS was designed as a test of cognitive abilities primarily for use as personnel selection and career guidance. 
However, the EAS is also been used for placement, promotions, and training and development. It was primarily designed for individuals 16 years of age or older and working adults in a wide range of occupations. It is not intended solely for specific jobs or occupations. Examples of target occupations have been previously described in the section above. For this range of applications, PSI has assembled 85 norm tables. For many occupations, the technical manual provides total score and percentile norms. Administrative Details Administrative detail is summarized in Table 12 and briefly described below. Table 12. Administrative features of the EAS subtests. Subtest # Items Time Limit EAS 1-Verbal Comprehension 30 5 minutes EAS 2-Numerical Ability 75 2-10 minutes EAS 3-Visual Pursuit 30 5 minutes EAS 4-Visual Speed and Accuracy 150 5 minutes EAS 5-Space Visualization 50 5 minutes EAS 6-Numerical Reasoning 20 5 minutes EAS 7-Verbal Reasoning 30 5 minutes EAS 8-Word Fluency 75 5 minutes EAS 9-Manual Speed and Accuracy 750 5 minutes EAS 10-Symbolic Reasoning 30 5 minutes 29 Scoring Rule Total score with ability to link to percentile score. Methods of Delivery Paper-pencil; proctored. Online (ATLAS); proctored and unproctored. Time Limits A concept that guided the development of the EAS was maximum validity per minute of testing time, as reported in the technical manual. While several of the subtests are composed of more than 50 items (i.e., EAS 1-Verbal, EAS 2-Numerical Ability Comprehension, EAS 4-Visual Speed and Accuracy, and EAS 5-Space Visualization), each of the subtests is limited to five minutes. The exception is the EAS 2-Numerical Ability Comprehension, which has a 2 to 10 minute time limit depending on items used. Overall, EAS subtests have among the shortest time limits subtests among commercially available cognitive batteries used for personnel selection. Test time was a critical consideration when PSI developed the EAS. Number and Types of Items Table 12 shows the number of items for each of the 10 subtests, ranging from 20 to 750 items. None of the 10 subtests appear to have work related content in the items, and therefore contain workneutral item content. Three of the subtests are related to reasoning, two are numerical, two are verbal, and three are related to visual abilities. Each of the subtests is composed of multiple choice items which measure the previously discussed abilities. The needed reading level is not specified for any of the 10 subtests, but assumed to be appropriate for individuals who are at least 16 years old and work in unskilled occupations. Type of Score Scores on each of the subtests are computed as number-correct scores that are transformed to percentile scores based on relevant norm groups. The EAS may be scored three different ways; (a) computer automated scoring (using PSI’s ATLAS™ platform), (b) hand scoring using templates, and (c) on-site optical scanning/scoring. Scoring is computed by the total raw score of items answered correctly for each subtest. Each raw score determines a percentile score based on relevant PSI norm groups. Scores can also be banded, pass/fail, or used in ranking of test examinees. Each of the subtests can be either hand scored or machined scored from scannable test forms. For hand scoring, keys are provided for scoring both right and wrong responses. Scannable test forms for machine scoring are available for the eight tests that lend themselves to optical scanning (EAS 1 through 7 and EAS 10). 
Test Scale and Methods of Scaling The score reported for each of the EAS subtests is a “number-correct” raw score. Percentile scores are provided based on the norm groups. The EAS may be tailored to a target job by combining 3 or 4 of the subtests into a job-specific composite. Scores from each of the chosen subtests can be combined to form a composite based on SD-based weights, assigned unit weights, or other rationally or statistically determined weights. While not weighted differently, test items are in order of difficulty from easy to difficult on each form of the EAS. Method of Delivery The EAS may be delivered by proctored paper-pencil administration as well as online administration. For online use, the EAS is administered through PSI’s own web-delivery platform, ATLAS™. The ATLAS™ web-based talent assessment management system is an enterprise web-based Softwareas-a-Service (SaaS) solution for managing selection and assessment processes. This platform can function in a number of capacities (e.g., configuring tests, batteries, score reports, managing test inventory, manage and deploy proctored and unproctored tests, establish candidate workflow with or without applicant tracking systems, etc…). One of the important aspects of administering the EAS using PSI’s own web-delivery platform, ATLAS™, is that it can be unproctored. The EAS may also be administered individually or in a group. For online unproctored use, PSI has created a form that is 30 specifically used for unproctored use. This form can also be tailored to the client needs, in which case items may be randomized. Costs Test Resource Cost EAS Subtests Separately (pkg. 25; includes scoring sheet) EAS 1-Verbal Comprehension $112.50 EAS 2-Numerical Ability $112.50 EAS 3-Visual Pursuit $112.50 EAS 4-Visual Speed and Accuracy $112.50 EAS 5-Space Visualization $112.50 EAS 6-Numerical Reasoning $112.50 EAS 7-Verbal Reasoning $112.50 EAS 8-Word Fluency $112.50 EAS 9-Manual Speed and Accuracy $112.50 EAS 10-Symbolic Reasoning $112.50 Manuals EAS Technical Manual $350.00 *Other cost information was not available after extensive efforts to attainment. Construct-Content Information Intended Constructs The types of subtests in the EAS are similar to other commercially available tests of cognitive ability. It is comprised of 10 subtests designed to measure eight abilities. As shown in Table 13, EAS consists of three subtests that are related to reasoning, two are numerical, three are verbal/word related, and three are related to visual abilities, and one related to manual speed. EAS provides the ability to tailor the battery to specific jobs and occupations by administering various combinations of the subtests. Specific subtests may be best suited for differing jobs and not just all jobs in general. This was found from computed intercorrelations between each of the EAS subtests. Each of these abilities has been shown to be predictive of future job performance across a wide range of jobs. The developers of the EAS intended to develop a cognitive ability battery that would have the maximum validity per minute of testing time. As such, they created and combined multiple subtests of short time limits, each of which measures a specific ability that may be regarded as relevant to certain types of job tasks/activities. From numerous interviews with human resource managers, they identified three important considerations that had not been met by other personnel selection tests; ease of administration, scoring, and interpretation. 
Therefore, the developers designed several of the subtests and also adapted several other commonly used tests to be included in the EAS battery. The sole purpose was to create a cognitive test battery that could be administered to diverse populations and be used for a wide range of jobs. PSI defined each of the targeted abilities as follows. Verbal Comprehension is the ability to use words in thinking and communicating. Number the ability to handle numbers and to work with numerical material including the ability to perceive small details accurately and rapidly within materials. Word Fluency is the ability to produce words rapidly. Reasoning is the ability to discover relationships and to derive principles. 31 Mechanical Reasoning is the ability to apply and understand physical and mechanical principles. Space Visualization is the ability to visualize objects in three-dimensional space. Syntactic Evaluation is the ability to apply principles to arrive at a unique solution. Pursuit is the ability to make rapid, accurate scanning movements with the eyes. Table 13. Abilities measured by each of the EAS subtests. Pursuit Syntactic Evaluation Space Visualization Mechanical Reasoning Reasoning Word Fluency Number Subtest EAS 1-Verbal Comprehension EAS 2Numerical Ability EAS 3-Visual Pursuit EAS 4-Visual Speed and Accuracy EAS 5-Space Visualization EAS 6Numerical Reasoning EAS 7-Verbal Reasoning EAS 8-Word Fluency EAS 9-Manual Speed and Accuracy EAS 10Symbolic Reasoning Verbal Comprehension Ability X X X X X X X X X X X X X X Item Content The EAS utilizes 10 subtests and is intended to measure eight abilities that are important for a wide range of occupations. Examples of these occupations are professional/managerial/supervisory, clerical, production/mechanical (skilled and semi-skilled), technical, sales, unskilled, protective services, & health professionals. To accommodate the wide range of occupations, item content was generally neutral with respect to specific job content. The technical manual does not specifically state the reading level required to understand the item content. However, it is noted that the EAS is intended for use with individuals of at least 16 years of age, and for occupations that are semi-skilled and possibly even unskilled. Test items are multiple choice, but the number of options varies depending on the subtest and intended ability being measured. 32 Sample Items The EAS is comprised of 10 subtests: A. EAS 1-Verbal Comprehension a. Assess the ability to understand written words and the ideas associated with them. i. In this 30-item vocabulary test, the examinee must select the synonym for a designated word from the four possibilities presented. B. EAS 2-Numerical Ability a. Assesses the ability to add, subtract, multiply, and divide integers, decimals, and fractions. i. This test is designed to measure ability in addition, subtraction, multiplication, and division of whole numbers, decimals, and fractions. The examinee is to add, subtract, multiply, or divide to solve the problem and select a response from the five alternatives provided: four numerical alternatives and an "X" to indicate that the correct answer is not given. Integers, decimal fractions, and common fractions are included in separate tests that are separately timed. C. EAS 3-Visual Pursuit a. Measures a person's ability to make rapid, accurate scanning movements with the eyes. i. Consists of 30 items. 
The examinee is to visually trace designated lines through an entangled network resembling a schematic diagram. The answer, indicating the endpoint of the line, is selected from five alternatives. 33 D. EAS 4-Visual Speed and Accuracy a. Measures the ability to compare numbers or patterns quickly and accurately. i. The 150 items consist of pairs of number series that may include decimals, letters, or other symbols. The examinee has 5 minutes to review as many pairs as possible, indicating for each pair whether they are the same or different. E. EAS 5-Space Visualization a. Assesses the ability to imagine objects in three-dimensional space and to manipulate objects mentally. i. 50-item test consisting of pictures of piles of blocks. The examinee indicates for a specific block how many other blocks in the pile it touches. The decision to use the familiar blocks format was based on its known predictive validity for a wide variety of mechanical tasks, ranging from those of the design engineer to those of the package wrapper in a department store. 34 F. EAS 6-Numerical Reasoning a. Measure a participant's ability to analyze logical numerical relationships and to discover underlying principles. i. Twenty number series are included. The examinee selects the next number in the series from five alternatives. G. EAS 7-Verbal Reasoning a. Measures the ability to combine separate pieces of information and to form conclusions on the basis of that information. i. A series of facts are presented for the examinee to review. Five conclusions follow each series. The examinee is to indicate whether, based on the factual information given, the conclusion is true, false, or uncertain. 35 H. EAS 8-Word Fluency a. Assesses an individual's ability to generate a number of words quickly without regard to meaning. i. Designed to measure flexibility and fluency with words. The examinee writes as many words as possible beginning with a designated letter. 36 I. EAS 9-Manual Speed and Accuracy a. Measures the ability to make repetitive, fine finger movements rapidly and accurately. i. This test was designed to evaluate the ability to make fine finger movements rapidly and accurately. The examinee is to place pencil marks within as many "O"s as possible in 5 minutes. J. EAS 10-Symbolic Reasoning a. Assesses the ability to apply general rules to specific problems and to come up with logical answers. i. Each of the 30 problems in this test contains a statement and a conclusion. The examinee marks "T" to indicate the conclusion is true, "F" to indicate it is false, or "?" to indicate that it is impossible to determine if the conclusion is true or false based on the information given in the statement. Each statement describes the relationship between three variables: A, B, and C, in terms of arithmetic symbols such as =, <, >, =/, </, and >/. Based on the relationship described, the examinee evaluates the conclusion about the relationship between A and C. Construct Validity Evidence To assess the construct validity evidence of the EAS, test developers factor analyzed EAS scores, excluding Manual Speed and Accuracy using the principal factors method. Test reliabilities were used to estimate communalities. Eight factors were retained and rotated to simple structure using a varimax rotation. The rotated factor loadings are shown in Table 14, using the factor labels provided by PSI. These factor loadings were inspected to identify the subtest characteristics associated with each of the subtests. 
PSI described the eight factors with the labels shown in Table 14 based primarily on the profile of factor loadings of .40 or greater. However, the small sample size (N = 90) greatly limits the meaningfulness of these results. Table 14. EAS factor loadings. Space Visualization Syntactic Evaluation .25 .16 .17 .19 .24 .26 -.01 .09 .26 .17 .57 .17 .03 .26 .79 .09 .03 .11 .06 .33 .27 .57 .63 .12 .30 .66 .41 .20 -.05 .24 .40 .12 .10 .01 .25 .11 .33 .26 .18 .63 .23 .03 .10 .01 .18 Pursuit Mechanical Reasoning Subtest EAS 1-Verbal Comprehension .82 .11 .20 EAS 2-Numerical Ability .47 .32 .25 EAS 3-Visual Pursuit .04 .10 .03 EAS 4-Visual Speed and .72 .21 .35 Accuracy EAS 5-Space Visualization .08 .16 -.11 EAS 6-Numerical Reasoning .23 .33 .10 EAS 7-Verbal Reasoning .46 .14 .25 EAS 8-Word Fluency .77 .15 .22 EAS 10-Symbolic Reasoning .25 .03 .18 *EAS 9 was purposely left out of the factor analysis. Reasoning Word Fluency Number Verbal Comprehension Ability Factor The EAS has been correlated with several other tests and performance measures. This knowledge aids in defining the constructs/abilities that the test purports to measure. Also, it provides a sense of familiarity for users with past experience of similar tests. The correlations indicate what features the EAS and other tests have in common. In this sense, tests of the EAS that are intended to measure the same constructs as other tests should be highly correlated with each other. 37 The EAS was correlated with the Cooperative School and College Ability Tests (SCAT; 1955). Scores were obtained from 400 junior college students. The SCAT is divided into two parts: verbal and quantitative. The verbal measures a student's understanding of works and the quantitative measures an understanding of fundamental numbers operations. Table 15 reports the correlations found. As expected, verbal-related tests from the EAS correlated with the verbal of the SCAT, and also quantitative-related tests correlated with the quantitative tests of the SCAT. Table 15. Correlations between EAS subtests and SCAT. EAS Test SCAT-V EAS 1-Verbal Comprehension .75 EAS 2-Numerical Ability EAS 3-Visual Pursuit EAS 4-Visual Speed and Accuracy EAS 5-Space Visualization EAS 6-Numerical Reasoning EAS 7-Verbal Reasoning EAS 8-Word Fluency EAS 10-Symbolic Reasoning EAS 1-Verbal Comprehension SCAT-Q .44 .10 .31 .18 .34 .02 .07 .17 .38 .33 .59 .51 .53 .17 .19 .01 .06 .31 .41 One study correlated nine of the EAS subtests with five of the PMA subtests (Thurstone & Thurstone, 1947) with a sample of 90 high school students. The PMA was primarily designed to be used with high school students and includes five subtests. Although they expected high correlations between tests of the EAS and PMA that are intended to measure the same ability, other results were evident. They found that although there were moderate to high correlations where expected, there were also moderate to high correlations where unexpected. They found that there were much stronger correlations between EAS 2 and PMA-Verbal, PMA-Reasoning, and PMA-Word Fluency than originally hypothesized. Table 16 reports the correlations between EAS subtests and the PMS subtests observed in this study. Table 16. Correlations between EAS and PMA tests. EAS Tests EAS 1 EAS 2 EAS 3 EAS 4 EAS 5 EAS 6 EAS 7 EAS 8 EAS 10 Verbal .85 .62 .23 .52 .29 .52 .64 .45 .47 Space .19 .39 .53 .30 .58 .40 .34 .11 .46 PMA Tests Reasoning .53 .59 .44 .50 .46 .68 .74 .41 .52 PMA subtest definitions: A. Verbal a. 
A vocabulary test similar in format and content to EAS 1. B. Space 38 Number .28 .51 .14 .64 .17 .46 .35 .28 .20 Word Fluency .54 .50 .23 .45 .20 .44 .56 .64 .40 a. Measure of the ability to perceive spatial relationships by manipulating objects in a three-dimensional space. C. Reasoning a. Assesses the ability to discover and apply principles. D. Number a. Measures one’s ability to add numbers. E. Word fluency a. Similar to EAS 8 which measures flexibility and fluency with words. The EAS subtests, except for EAS 9, were also correlated with the California Test of Mental Maturity (CTMM; Sullivan, Clark, & Tiegs, 1936). The same sample of 90 high schools students were used for this study. Like the EAS, the CTMM encompasses a wide variety of item types such as spatial relations, computation, number series, analogies, similarities, opposites, immediate and delayed recall, inference, number series, and vocabulary. Table 17 below shows the resulting correlations, which were at least moderately high. Table 17. Correlations between the EAS subtests and CTMM. CTMM – Total Mental Factors IQ EAS Subtest Sample 1 Sample 2 Sample 3 EAS 1 .72 .75 .83 EAS 2 .70 EAS 3 .31 EAS 4 .44 EAS 5 .43 EAS 6 .66 .70 EAS 7 .67 EAS 8 .40 EAS 10 .63 * Sample 1- 90 male high school students; Sample 2- 103 management selection examinees in an aircraft manufacturing facility; Sample 3- 148 prisoners. Five of the EAS subtests (EAS 1,2,5,6, and 7) were correlated with the Bennett Mechanical Comprehension Test (BMCT; Bennett & Fry, 1941), as shown in Table 18. The BMCT measures the ability to derive, understand, and apply physical and mechanical principles. The moderate correlations for each of the five EAS subtests suggests that there is a general reasoning ability that contributes to performance on each of the tests. Table 18. Correlations between EAS subtests and BMCT. EAS Test EAS 1-Verbal Comprehension EAS 2-Numerical Ability EAS 5-Space Visualization EAS 6-Numerical Reasoning EAS 7-Verbal Reasoning * Sample was 260 applicants for a wide variety of jobs. BMCT .37 .53 .43 .51 .31 Besides the BMCT, the EAS was also correlated with the Otis Employment Test (Otis, 1943), which measures cognitive abilities. Item types include: vocabulary items, reasoning, syllogisms, arithmetic computation and reasoning, proverbs, analogies, spatial relation items, number series, etc… It was expected that the Otis would correlate modestly with the same five subtests used to correlate with the BMCT. As expected, the EAS subtests correlate moderately with the Otis, and these are shown in Table 19. The EAS 4-Visual Speed and Accuracy subtest was correlated with scores from both the Minnesota Clerical Test (MCT) and the Differential Aptitude Test (DAT) Clerical Speed and Accuracy subtest. Each of these tests was designed to assess the ability to perceive details rapidly and accurately. Originally, the MCT was the basis for the development of the EAS 4. Table 20 shows that there is a 39 high correlation for both. A sample of 89 applicants for a wide variety of jobs was used for the correlation between the EAS and MCT. A sample of 100 inmates was used for the correlations between the EAS-4 and DAT – Clerical Speed and Accuracy. Results are reported in Table 20. Table 19. Correlations between the EAS and the Otis. EAS Test EAS 1-Verbal Comprehension EAS 2-Numerical Ability EAS 5-Space Visualization EAS 6-Numerical Reasoning EAS 7-Verbal Reasoning * Sample was 220 applicants for a wide variety of jobs. Otis .55 .51 .47 .60 .56 Table 20. 
Correlations between EAS 4, MCT, and the DAT. EAS Subtest EAS 4-Visual Speed and Accuracy MCT – Total Score DAT – Clerical Speed and Accuracy .82 .65 The EAS 7-Verbal Reasoning and the Watson-Glaser Critical Thinking Appraisal (W-GCTA; Watson & Glaser, 1942) were also correlated. The W-GCTA assesses five facet of critical thinking as defined by Watson-Glaser, one being the ability to from logical conclusions from various facts. While they expected a moderate correlation, in reality the correlation was moderately low. Table 21 shows the relevant results. Table 21. Correlations between EAS 7 and CTA. EAS Test EAS 7-Verbal Reasoning Sample 1 CTA Sample 2 Sample 3 .45 .26 .59 PSI also computed average intercorrelations between each pair of the EAS subtests. While each of subtests was designed to measure specific abilities, each of the subtests was designed to measure more than one ability. Table 22 presents the average intercorrelations among the subtests. Their findings show that several of the subtests can be combined to predict the same ability. A. EAS 1 and EAS 7 are related because of the emphasis on the ability to understand words and concepts associated with them. B. EAS 2 and EAS 4 are dependent upon the ability to work accurately and with speed with numbers. C. EAS 2 and EAS 6 because they rely on an individual’s ability to interpret numerical materials. D. EAS 3 and EAS 5 share a perceptual component. E. EAS 6, EAS 7, and EAS 10 because are related because they are influenced by the ability to derive and apply rules and principles to solve problems. 40 Table 22. Average intercorrelations of EAS subtests. Average Intercorrelations Test EAS 1 EAS 2 EAS 3 EAS 4 EAS 5 EAS 6 EAS 7 EAS 1 EAS 2 .26 EAS 3 .08 .20 EAS 4 .41 .10 .30 EAS 5 .40 .22 .28 .34 EAS 6 .43 .26 .18 .19 .35 EAS 7 .40 .34 .15 .16 .30 .37 EAS 8 .27 .31 .09 .23 .14 .18 .19 EAS 9 .03 .12 .23 .26 .24 .05 .01 EAS 10 .38 .37 .27 .33 .17 .22 .29 EAS 8 EAS 9 .16 .14 .10 EAS 10 Item and Test Development Item Development Originally, the EAS was designed with 15 short time limit subtests. These subtests were then administered to 273 employees of a medium-sized factory. Results led to decisions to remove some subtests from the battery and change others in regards to format, length, and content of instructions. Each of the subtests of the EAS battery was developed separately by different test developers, or was adapted from previous tests. EAS subtest item content, specifications, and development: A. EAS 1-Verbal Comprehension a. A large pool of items was given to several groups ranging from college students, factory workers, to prisoners. Item difficulty was determined for each of the items, and item score vs. total score phi coefficients were computed. Two alternate forms were created from the larger pool of items which met statistical criteria of comparability; mean, standard deviation, and homogeneity. B. EAS 2-Numerical Ability a. Examinee must add, subtract, multiply, or divide to solve the problem and select a response from the five alternatives provided: four numerical alternatives and an "X" to indicate that the correct answer is not given. Integers, decimal fractions, and common fractions are included in separate tests that are separately timed. Part 1, with a time limit of 2 minutes, measures facility in working with whole numbers; Part 2, which requires 4 minutes of test time, measures facility with decimals; and Part 3, also a 4minute test, measures facility with fractions. 
This test is really a battery of three tests when this is desired. An “X” is used to indicate that the correct answer is not given. Two equivalent forms available. C. EAS 3-Visual Pursuit a. The original prototype of this test was the Pursuit subtest of the MacQuarrie Tests for Mechanical Ability (MacQuarrie, 1925). Modifications include: (1) increased time length, (2) adapted it to machine scoring, (3) created the first few items to be easier, and (4) redesigned the answer scoring procedure. Face validity was taken into consideration and symbols and format of electrical wiring diagrams and electronic schematics were used. Two equivalent forms were created. D. EAS 4-Visual Speed and Accuracy a. The design of this test was based upon the Minnesota Clerical Test (MCT) by Andrew, Paterson and Longstaff (1933). The test length was shortened to five 41 minutes. It was decided that only number series would be used for the items, and these number series would include decimals, letters, and other symbols. The characters in each of the items were selected using a random numbers table. E. EAS 5-Space Visualization a. This test is based on the MacQuarrie Test for Mechanical Ability (MacQuarrie, 1925) which had a long and successful use in the United States. The length of the EAS subtest is twice the length of the MacQuarrie to gain additional reliability. The test developer also changed the directions for clarity purposes, adapted the test to machine scoring, and arranged the items in order of increasing difficulty. F. EAS 6-Numerical Reasoning a. This test is based on Test 6 of the Army Group Examination Alpha of World War I (1918). For the EAS 6, it was decided that multiple choice instead of open-ended items would be used. The items selected for inclusion were derived from a study by Lovell (1944), in which 20 items were to be used in Form A after analyzing them on an industrial population. Form B was created by turning out a large pool of similar items and selecting 20 items of comparable difficulty and homogeneity. G. EAS 7-Verbal Reasoning a. This test is based on the California Test of Mental Maturity (CTMM) by Sullivan, Clark and Tiegs (1936), Subtest 15. Form A of the test represents surviving items of a larger pool. Form B is composed of items of the same logical form, and differ from Form A items only in content. H. EAS 8-Word Fluency a. This test is an adaption from the SRA Primary Mental Abilities Test (PMA) by Thurston and Thrustone (1947). Multiple forms exist for this test. I. EAS 9-Manual Speed and Accuracy a. This test is based on the Dotting subtest of the MacQuarrie Tests for Mechanical Ability (MacQuarrie, 1925). In order to increase reliability and face validity, the length of the test was increased. J. EAS 10-Symbolic Reasoning a. This test is based on work by authors who developed a test to measure the ability to evaluate symbolic relations (Wilson, Guilford, Christianson and Lewis (1954). Another idea for this test was derived from EAS 7 which used the “X” for uncertain response category. Verbal instructions were also removed for this test to reduce the factor loadings on the verbal factor. The items of the two forms have been arranged in order of increasing difficulty. EAS test scores have been correlated with scores from a wide variety of tests—tests designed to assess several distinct abilities. Table 23 summarizes the relationships between the EAS tests and the kinds of abilities measured by these tests. The resulting outcome of the correlations was as expected. 
They found: (a) EAS 1 and EAS 7 correlate very highly with other verbal ability tests, (b) EAS 2 and EAS 6 correlate highly with other numerical ability tests, and (c) EAS 4 has a strong relationship with other clerical measures. All of the EAS tests with available data at least moderately correlate with measures of general mental ability. Table 23 shows the correlations between EAS subtests and the abilities. 42 General Mental Ability Mechanical Verbal Fluency Space Reasoning Clerical Numerical Verbal Reliability* Table 23. Average Correlations with EAS and abilities assessed by other batteries. Type of Ability Subtest M SD EAS 1-Verbal .77 .41 .71 .85 19.2 6.38 .53 .19 .54 .37 Comprehensio (2) (2) (4) n EAS 2.21 .35 .57 .87 45.9 13.70 .59 .39 .50 .53 Numerical (2) (2) (2) Ability EAS 3-Visual .19 .31 .86 19.0 4.68 .44 .53 .23 .31 (2) (2) Pursuit EAS 4-Visual .12 .19 .74 .91 93.1 21.80 .50 .30 .45 .44 Speed and (2) (2) (2) Accuracy EAS 5-Space .19 .34 .46 .89 27.7 10.20 .46 .58 .20 .43 (2) (2) (2) Visualization EAS 6.37 .57 .64 .81 10.8 4.27 .68 .40 .44 .51 Numerical (2) (2) (3) Reasoning EAS 7-Verbal .54 .50 .52 .82 14.3 6.19 .74 .34 .56 .31 (2) (2) (5) Reasoning EAS 8-Word .22 .21 .76 46.2 12.20 .41 .11 .64 .40 (2) (2) Fluency EAS 9-Manual .75 408.0 89.00 .01 .06 Speed and Accuracy EAS 10.34 .37 .82 11.6 6.61 .52 .40 .40 .63 Symbolic (2) (2) Reasoning (The number in parentheses is the number of studies on which the correlation is based. If no number is given, the correlation reported is based on one study.) *For 9 of the 10 tests (EAS 1 through EAS 8 and EAS 10), reliability was determined using an alternate form method of estimation. For EAS 9, a test-retest reliability estimate is reported with 2 to 14 days between administrations. Correlations between EAS subtests are reported in Table 24. These provide support for the optimal value of the EAS as a battery. The developers computed intercorrelations between each of the EAS subtests with a sample of educational and occupational groups since it was to be used in a wide variety of situations. The correlations for each of the groups were averaged using Fisher Z transformed indices and then the average Fisher Z was transformed back to a correlation. 43 Table 24. Average intercorrelations between EAS subtests. Tests EAS 1 EAS 2 EAS 3 EAS 4 EAS 5 EAS 6 EAS 1 EAS 2 .26 EAS 3 .08 .20 EAS 4 .10 .41 .30 EAS 5 .22 .28 .40 .34 EAS 6 .26 .43 .18 .19 .35 EAS 7 .40 .34 .15 .16 .30 .37 EAS 8 .27 .31 .09 .23 .14 .18 EAS 9 .03 .12 .23 .26 .24 .05 EAS 10 .27 .33 .17 .22 .29 .38 EAS 7 EAS 8 EAS 9 .19 .01 .37 .16 .14 .10 EAS 10 The EAS 2-Numerical Ability subtest was a bit unusual as it consists of three separately timed subtests that are specific to integers, decimals, and fractions. Correlations between Part 2 (decimals) and Part 3 (fractions) were found to be stable across three samples of job incumbents, but correlations between either Part 2 or Part 3 and Part 1 (integers) are more variable across job samples. It was found that the latter correlations were the lowest in groups that had not been actively engaged in arithmetic computation. Table 25 shows these subtest correlations for the three job samples. Table 25. Part-score intercorrelations for three parts of the EAS 2. 
Electronic Trainees Graduate Engineers (N = 167) (N = 205) Parts Correlated I (Integers) and II .58 .53 (Decimals) I (Integers) and III .62 .45 (Fractions) II (Decimals) and III .67 .62 (Fractions) Telephone Operators (N = 192) .31 .35 .63 Psychometric Model There is no indication that PSI applied IRT-based item analyses to EAS items during the development process. However, the PSI I-O psychologist indicated that a pool of EAS items has been accumulating as additional transparent forms have been developed over the years since EAS was implemented. The purpose this accumulating “bank” of EAS items with IRT estimates serves is to facilitate the development of new forms of EAS that PSI occasionally introduces in a transparent manner. Multiple Forms There are two forms of each subtest of the EAS, except for EAS 9- Manual Speed and Accuracy. To evaluate the equivalency of alternate forms, the developers administered both forms of each test to 330 junior college students. Half of the students were given Form A first and the other half were given Form B first. Results from the analysis revealed that for each pair of forms there were no statistically significant differences between means and standard deviations. After speaking with the publisher, we have also learned that there is a third form of the EAS. This third form is meant specifically for online unproctored testing using PSI’s ATLAS ™ web-based talent assessment management system. Item Banking Approaches As EAS reported, it maintains an accumulating bank of EAS items. But the only purpose that bank appears to serve is to facilitate the occasional development of new, transparent forms of EAS. 44 Approach to Item / Test Security Test Security For all subtests, except EAS 9, there are two forms. Also, for online unproctored administration, PSI’s ATLAS™ web-based talent assessment management system is available as an enterprise web-based Software-as-a-Service (SaaS) solution for managing selection and assessment processes. This platform can function in a number of capacities (e.g., configuring tests, batteries, score reports, managing test inventory, manage and deploy proctored and unproctored tests, establish candidate workflow with or without applicant tracking systems, etc…). The ability to administer the EAS unproctored is a very desirable feature of the ATLAS™. However, there is no information available about the manner in which PSI handles unproctored test security. The publisher did indicate to us that online unproctored use is only available internationally, not within the U.S. To support that unproctored option outside the US, PSI provides a third form developed specifically for international clients who wish to employ online unproctored testing. However, PSI appears to take no measures to monitor or minimize or respond to indications of cheating or piracy. Criterion Validity Evidence Criterion Validity PSI conducted meta-analyses of the observed validities for each test-criterion-occupation combination that contained five or more criterion validity studies. Forty-nine (49) meta-analyses were conducted, 24 for the predicting job performance and 25 for predicting training success. The average observed and corrected validities were determined for test-criterion-occupation combinations containing fewer than five studies. Table 26 shows the test-criterion-occupation validity coefficient for each occupation category and the corresponding subtest, along with sample size and type of measure used. 
For the meta-analysis, PSI located 160 criterion validity-related studies; studies from the 1963 technical report, unpublished validation studies from employers and external consultants, and unpublished literature. Information used in the meta-analysis included (a) job category, (b) criterion category, job performance or training success, (c) the type of criterion measure, (d) EAS tests used as predictors, (e) sample size, and (f) the value of the observed validity coefficients. The validation studies included a broad range of jobs which were grouped into the eight occupational families described previously. Validity coefficients were differentiated on the basis of the criterion used in the study, either job performance or training success. Four criterion measures were identified; (a) performance including supervisors, rankings of job proficiency, instructor ratings, and course grades; (b) production including production data and scores on work sample tests; (c) job knowledge which refers to scores on job knowledge tests; and (d) personnel actions such as hiring and terminating employees. In total, 49 meta-analyses were computed; 24 for prediction of job performance and 25 for the prediction of training success. The results of the meta-analyses indicate that 48 of the 49 credibility values were above 0, and therefore 98% of the test-criterion-occupation combinations showed generalizability across jobs and organizational settings within major job categories. Results indicated that certain EAS subtests were better predictors of job performance and training success of various occupations families than others. For example, under the occupational grouping of Professional, Managerial, and Supervisory, EAS 2 was a better predictor of job performance than EAS 1, EAS 7 was a better predictor than EAS 6, and EAS 2 was a better predictor of job performance than EAS 7. Table 26 provides the resulting validity coefficients for the EAS subtests by criterion and occupational grouping. 45 Table 26. Meta-Analysis results for PSI criterion validities for EAS subtests within job families. 
Professional, Managerial, & Supervisory Test Criterion Category Type of Criterion Total Sample Size ṙ 1 - Verbal Comprehension Job Performance: Personnel Actions 205 .14 .30 2 - Numerical Ability Job Performance: Personnel Actions 130 .49 .89 3 - Visual Pursuit Job Performance: Performance 150 .06 .13 Performance Production 428 142 .13 .20 .28 .42 6 - Numerical Reasoning Job Performance: Personnel Actions 130 .27 .56 7 - Verbal Reasoning Job Performance: Personnel Actions 130 .43 .81 Performance Personnel Actions 250 107 .14 .01 .31 .02 Performance Personnel Actions 100 128 .31 .26 .63 .54 5 - Space Visualization Job Performance: 8 - Word Fluency Job Performance: 10 - Symbolic Reasoning Job Performance: Clerical Test Criterion Category Type of Criterion Total Sample Size ṙ 1 - Verbal Comprehension Job Performance: Production 95 .21 .45 Training Success: Job Knowledge 33 .26 .46 2 - Numerical Ability Job Performance: Production 95 .22 .46 Training Success: Job Knowledge 33 .35 .60 3 - Visual Pursuit Job Performance: Performance 81 .31 .63 4 - Visual Speed and Accuracy Job Performance: Production 95 .42 .80 Training Success: Job Knowledge 33 .09 017 6 - Numerical Reasoning Job Performance: Production 95 .24 .50 Training Success: Job Knowledge 33 .17 .31 7 - Verbal Reasoning Job Performance: Production 95 .24 .49 Performance Job Knowledge 63 33 .63 .19 .90 .29 Training Success: 46 8 - Word Fluency Job Performance: Performance 108 .35 .68 9 - Manual Speed and Accuracy Job Performance: Production 96 .24 .50 10 - Symbolic Reasoning Job Performance: Performance 108 .21 .44 Production/Mechanical (Skilled & Semi-skilled) Test Criterion Category Total Sample Size ṙ Production Job Knowledge 69 69 .03 .20 .07 .42 2 - Numerical Ability Job Performance: Production 39 .24 .43 Training Success: Job Knowledge 39 .46 .73 Production Job Knowledge 69 296 .15 .41 .32 .79 Performance 78 .16 .29 Production Job Knowledge Personnel Actions 69 69 136 .09 .32 .00 .20 .64 .00 Production Job Knowledge 69 131 .29 .43 .59 .81 Production Job Knowledge 40 40 .26 .36 .46 .61 Performance Production Job Knowledge 114 69 69 .22 .23 .43 .46 .48 .81 Production 69 .15 .32 Job Knowledge Performance 131 104 .42 .28 .80 .49 8 - Word Fluency Training Success: Performance 78 .17 .31 9 - Manual Speed and Accuracy Training Success: Performance 78 -.02 -.05 10 - Symbolic Reasoning Job Performance: Performance 157 .19 .40 Training Success: Performance 78 .25 .44 1 - Verbal Comprehension Job Performance: 3 - Visual Pursuit Job Performance: Training Success: 4 - Visual Speed and Accuracy Job Performance: 5 - Space Visualization Job Performance: Training Success: 6 - Numerical Reasoning Job Performance: 7 - Verbal Reasoning Job Performance: Training Success: Type of Criterion Technical Test Criterion Category 47 Type of Criterion Total Sample Size ṙ 1 - Verbal Comprehension Training Success: Job Knowledge 471 .60 .88 2 - Numerical Ability Training Success: Job Knowledge 471 .58 .85 3 - Visual Pursuit Job Performance: Performance 99 .06 .14 Training Success: Job Knowledge 471 .39 .65 4 - Visual Speed and Accuracy Training Success: Job Knowledge 471 .37 .63 5 - Space Visualization Training Success: Job Knowledge 471 .45 .72 6 - Numerical Reasoning Job Performance: Performance 143 .21 .44 Training Success: Job Knowledge 471 .46 .73 7 - Verbal Reasoning Training Success: Job Knowledge 471 .48 .76 8 - Word Fluency Job Performance: Performance 9 .42 .80 Training Success: Job Knowledge 471 .36 .61 9 - Manual Speed and Accuracy Job Performance: 
Performance 231 .04 .08 Training Success: Job Knowledge 381 .36 .61 10 - Symbolic Reasoning Job Performance: Performance 53 .43 .81 Training Success: Job Knowledge 471 .64 .91 Sales Test Criterion Category Type of Criterion Total Sample Size ṙ 1 - Verbal Comprehension Job Performance: Performance 107 .40 .77 Training Success: Job Knowledge 140 .25 .45 2 - Numerical Ability Job Performance: Performance 107 .41 .79 Training Success: Job Knowledge 140 .50 .78 4 - Visual Speed and Accuracy Job Performance: Performance 107 29 .58 Training Success: Job Knowledge 140 .09 .17 5 - Space Visualization Job Performance: Performance 19 .70 1.11 Performance 107 .34 .68 6 - Numerical Reasoning 48 Job Performance: Job Knowledge 140 .40 .66 7 - Verbal Reasoning Job Performance: Performance 88 .25 .51 Training Success: Job Knowledge 140 .26 .46 8 - Word Fluency Job Performance: Performance 107 .27 .56 Training Success: Job Knowledge 140 .22 .40 Training Success: Unskilled Test Criterion Category Type of Criterion Total Sample Size ṙ 1 - Verbal Comprehension Job Performance: Performance 44 .12 .26 2 - Numerical Ability Job Performance: Performance 190 .05 .11 4 - Visual Speed and Accuracy Job Performance: Performance 44 .21 .44 5 - Space Visualization Job Performance: Performance 186 .10 .22 6 - Numerical Reasoning Job Performance: Performance 44 .16 .34 7 - Verbal Reasoning Job Performance: Performance 44 -.01 -.02 Protective Services Test Criterion Category 1 - Verbal Comprehension Job Performance: Training Success: 2 - Numerical Ability Job Performance: Training Success: 3 - Visual Pursuit Training Success: 4 - Visual Speed and Accuracy Training Success: 5 - Space 49 Total Sample Size ṙ Performance 103 .00 .00 Performance Production Job Knowledge 150 132 137 .35 -.09 .34 .60 -.17 .58 Performance 104 .01 .02 Performance Production Job Knowledge 150 132 137 .26 -.02 .32 .46 -.04 .55 Performance Production Job Knowledge 49 132 137 .34 .19 .16 .58 .35 .29 Performance Production Job Knowledge 49 132 137 .27 -.04 .17 .48 -.08 .31 224 .11 .24 Type of Criterion Visualization Job Performance: Performance 277 259 264 .30 .21 .19 .52 .38 .35 Performance 105 .20 .42 Performance Production Job Knowledge 150 132 137 .31 .09 .35 .54 .17 .60 Performance Production 106 102 -.04 .20 -.09 .42 Performance Production Job Knowledge 235 132 137 .31 -.16 .39 .54 -.29 .65 8 - Word Fluency Training Success: Performance 48 .35 .60 9 - Manual Speed and Accuracy Training Success: Performance 49 .34 .58 10 - Symbolic Reasoning Job Performance: Performance 106 .03 .07 Training Success: Performance 134 .40 .67 Training Success: 6 - Numerical Reasoning Job Performance: Training Success: 7 - Verbal Reasoning Job Performance: Training Success: Performance Production Job Knowledge Health Professional Test Criterion Category 1 - Verbal Comprehension Job Performance: Training Success: 2 - Numerical Ability Training Success: 3 - Visual Pursuit Training Success: 4 - Visual Speed and Accuracy Training Success: 5 - Space Visualization Training Success: 9 - Manual Speed and Accuracy Training Success: 50 Total Sample Size ṙ Production 118 .52 .93 Performance Production 96 30 .15 .37 .28 .62 Performance Production 96 30 .06 .32 .11 .55 Performance Production 96 29 -.01 .32 -.02 .55 Performance Production 96 30 .04 .40 .08 .66 Performance Production 96 30 .19 .22 .35 .40 Performance Production 96 30 .26 .13 .46 .24 Type of Criterion Meta-analytic results were also aggregated with major Occupational Groupings, shown in Table 27. 
Statistical characteristics were only computed for occupational grouping-criterion combinations where sufficient data was available. A test was included in the battery if the test validity was generalizable across jobs and settings. The premise of this study was that test battery validity is likely to be greater than the validity of any single predictor. The rationale is that by adding tests you are either improving the measurement of some cognitive ability or you are improving the ability of the battery to predict by adding measures of new and independent abilities. Table 27 shows the validity generalization model, in which validity did in fact generalize across jobs and organizational settings. The implication for employers is that, in those instances where EAS validity generalizes, it would not be necessary to conduct a local validation study. Table 27. Validity generalization results: Generalized mean true validity for EAS subtests. Occupational Grouping EAS Test Professional, Managerial, and Supervisory Clerical Job Perf Training Success Job Perf EAS 1 .52 .42 EAS 2 .55 .60 Training Success Production/Mechanical (Skilled and Semiskilled) Technical Job Perf Training Success Job Perf Training Success .37 .35 .62 .16 .48 .46 .38 .69 .53 .75 EAS 3 .26 EAS 4 .28 .46 .32 .31 .30 .39 EAS 5 .33 .49 .37 .48 .37 .46 .35 EAS 6 .63 .33 .29 EAS 7 .67 .29 .46 EAS 8 .39 EAS 9 .27 .40 .34 .13 .63 .33 .47 .21 .27 .28 .29 EAS 10 .47 .59 Note: All validity generalization results are based on performance criteria (e.g., supervisory ratings and course grades). PSI also conducted a study to determine which combinations of EAS subtests can be combined to create composites that will be maximally predictive of specific job families. Table 28 shows the validities coefficients for a combination of EAS subtests and related job families. These combinations of subtests have been recommended by PSI to assess an applicant’s abilities relating to specific job families. 51 Table 28. Reported validity coefficients for combinations of EAS subtests and job families. Production/ Mechanical Battery Technical Battery X X X X X X X X X EAS 10- Symbolic Reasoning EAS 7- Verbal Reasoning EAS 6- Numerical Reasoning EAS 5- Space Visualization EAS 4- Visual Speed and Accuracy EAS 3- Visual Pursuit EAS 2- Numerical Ability Job Family Professional, Managerial, & Supervisory Clerical Battery EAS 1- Verbal Comprehension Subtests Included in Battery Empirically Tailored to Job Family Validity*of Tailored Battery X .57 X .42 X X X .35 X X .46 X .41 IT Battery X X X General Mental Ability Battery X X X .31 (g)** *Based upon meta-analysis of over 160 studies; validities are N-weighted mean coefficients adjusted for criterion unreliability. **Includes validity studies from across the occupational spectrum. A single study by Kolz, McFarland and Silverman (1998) provided support for criterion validity evidence of the EAS. They looked at the EAS Ability Composite and a measure of job performance for incumbents with a certain amount of job experience. Overall, their results indicate that the EAS composites predict job performance better as experience increases. Therefore, The EAS predicts job performance better for candidates with more 10+ years of job experience than candidates who have 1-years of experience. Results are displayed in Table 29. Table 29. Correlations between EAS ability composites and job performance at different levels of job experience. 
Job Experience Level in Years EAS Ability Composite 1-3 (N = 33) 4-6 (N = 54) 7-9 (N = 39) 10+ (N = 50) Total (N=176) Mechanical .05 -.07 .36 .27 .18 Arithmetic .07 .18 .35 .42 .25 Logic -.05 .04 .54 .50 .23 1 Each correlation shows the relationship between an EAS Ability Composite and a measure of job performance for incumbents with a certain amount of job experience. For example, the .05 correlation in the Mechanical – 1-3 cell indicates that among the 33 incumbents with 1-3 years of job experience, the EAS Mechanical composite correlated .05 with job performance. In contrast, for 52 example, the EAS Mechanical composite correlated .36 with job performance among the 39 incumbents who had 7-9 years of job experience. Overall, these results indicate that the EAS composites predict job performance better as experience increases. Translations / Adaptions Translations The EAS is available in English, Spanish, French, and German. The publisher indicated that they often use a vendor for test translations. They reported that not all 10 of their tests have been translated into other languages either, and to this date EAS has not been translated into Arabic. They have found that translations on average result in 80% of the items being correct in their content after cultural and language translations have been done. User Support Resources Technical manual Fact sheet Descriptive Report Note: PSI does not publicly display user support resources. Evaluative Reviews Fourteenth Mental Measurements Yearbook Overall, the reviews of the reviewers have generally provided positive feedback (Engdahl & Muchinsky, 2001). For instance, it is said that the EAS compares favorably with other multifactor ability batteries for use in selection and vocational guidance. It is also noted that is has been extensively used as demonstrated by its solid heritage and long record of usefulness. The reviewers also mentioned the usefulness of the materials presented by the publishers. The only issues appear to be around validity and using the EAS for selection purposes with upper level jobs. An older review by Crites (1963) reported that the EAS is based upon sound rationale and consists of subtests with proven validities. The concern at this time was computing correlations with large samples and using job success criteria. Otherwise, a well-developed test and can be used for selection and career guidance. Siegel (1958) also reviewed the EAS and reported that the EAS is a great instrument in which tests can be used singly or in combined with other tests to create composites. The best features are the simplicity and ease of administration, as well as the scope of coverage of the tests. DESCRIPTIVE RESULTS FOR GATB Overall Description and Uses Introduction The General Aptitude Test Battery (GATB) was originally developed in 1947 by the United States Employment Service (USES) for use by state employment offices. The intended use was to match job applicants with potential employers in private and public sector organizations. It came to be used as an instrument that private sector organizations were using to screen large numbers of applicants 53 for open positions. The US Department of Labor even proposed the use of the GATB as a prescreening and referral instrument for nearly all jobs in the United States. 
The GATB originally consisted of 12 separately timed subtests used to measure 9 abilities (i.e., verbal ability, arithmetic reasoning, computation, spatial ability, form perception, clerical perception, motor coordination, manual dexterity, and finger dexterity). Because of concerns raised by a review of the battery by the U.S. National Research Council (NRC) of the National Academy of Sciences in the late 1980s, several changes were made to the GATB. The main concerns related to the overall look of the test and accompanying resources, cultural bias, speededness, scoring, and susceptibility to coaching. These concerns were the basis for developing new forms of the GATB, Forms E and F. Our focus therefore is on GATB Forms E and F, as these were the forms intended for selection purposes rather than only vocational and career guidance, as with earlier versions.

Development of GATB Forms E and F had the objective of reducing test speededness and, in turn, susceptibility to coaching. This was done by reducing the number of items and increasing the time limits for several of the subtests. The developers also incorporated more appropriate scoring procedures, developed better instructions for examinees, developed test items free from bias, assembled parallel test forms, improved the overall look of the test and accompanying resources, and revised the answer sheets and other supporting documents to be consistent with changes to the test format. During revision, the Form Matching subtest was dropped from the battery, leaving 11 subtests, as shown in Table 30.

Forms E and F were later repurposed as the O*Net Ability Profiler, which is used exclusively for vocational counseling, occupational exploration, and career planning. The O*Net Ability Profiler is offered through the U.S. Department of Labor, Employment and Training Administration. Because the Ability Profiler is connected to O*Net, it is linked to more than 800 occupations within the U.S. This particular instrument is not intended for personnel selection.

Forms E and F were initially intended to be used as a cognitive ability battery for personnel selection, vocational counseling, and occupational exploration. As with the previous forms, Forms E and F were created to be used across a wide range of occupations (i.e., for nearly every job/occupation in the U.S.). The subtests were developed using work-neutral item content and consequently are not specific to any one occupational group. Because the battery was to be used across a wide range of occupations in both government and private organizations, it was designed to be easily scored and interpreted by test administrators. According to Hausdorf, LeBlanc, and Chawla (2003), although the GATB does predict future job performance, it has demonstrated differential prediction and adverse impact against African Americans (a 1 SD mean difference) in the U.S. (Hartigan & Wigdor, 1989; Sackett & Wilk, 1994; Wigdor & Sackett, 1993).

Note: Although Form Matching was eventually dropped, the final decision came late in the project during the development of Forms E and F. Consequently, it was included in many of the development steps described in this chapter. Also, Tool Matching was eventually renamed Object Matching (Mellon, Daggett, & MacManus, 1996).

Targeted Populations
GATB Forms E and F were intended for individuals at least 16 years of age and/or working adults in the U.S. who would be working in nearly any occupation or job.
Being intended for such a wide range of occupations required that the GATB be written at a relatively low reading level (grade six). As such, individuals ranging from semi-skilled (perhaps even unskilled) occupations through managerial and even healthcare practitioner occupations could be administered the cognitive test battery.

Targeted Jobs / Occupations and Spread of Uses
The GATB Forms E and F were designed to be used for personnel selection, vocational counseling, and occupational exploration with nearly every job/occupation in the United States. The battery was designed for working adults in the U.S. and has a grade-six reading level.

Administrative Detail
Table 30 shows administrative details for Forms E and F of the GATB.

Table 30. Administrative detail for GATB Forms E & F.

Subtest                        # Items              Time                  Method of Delivery
1. Name Comparison             90                   6 minutes             Paper-pencil; proctored
2. Computation                 40                   6 minutes             Paper-pencil; proctored
3. Three-Dimensional Space     20                   8 minutes             Paper-pencil; proctored
4. Vocabulary                  19                   8 minutes             Paper-pencil; proctored
5. Object Matching             42                   5 minutes             Paper-pencil; proctored
6. Arithmetic Reasoning        18                   20 minutes            Paper-pencil; proctored
7. Mark Making                 130                  60 seconds            Paper-pencil; proctored
8. Place                       48 pegs              15 seconds per peg    Materials/apparatus supplied for test; proctored
9. Turn                        48 pegs              15 seconds per peg    Materials/apparatus supplied for test; proctored
10. Assemble                   1 trial, 50 rivets   90 seconds            Materials/apparatus supplied for test; proctored
11. Disassemble                1 trial, 50 rivets   60 seconds            Materials/apparatus supplied for test; proctored

Scoring rules:
Subtests 1-6: Number-correct raw score; percentile scores and stanines reported for norm groups.
Subtest 7 (Mark Making): The administrator and examinee count and record the number of marks.
Subtests 8-9 (Place, Turn): Record the total number of attempts made by the examinee. The number of attempts is equal to the number of pegs moved, regardless of whether they have been turned in the proper direction or not.
Subtest 10 (Assemble): Record the total number of attempts made by the examinee. The number of attempts is equal to the number of rivets moved, regardless of whether or not they were properly assembled or inserted. The Part 10 score can be determined by counting (a) the number of empty holes in the upper part of the board; (b) the number of rivets inserted in the lower part of the board plus the number of rivets dropped; or (c) the number of rivets remaining in the holes in the upper part of the board subtracted from the total number of rivets (50). Two of these methods should be used in scoring, one as a check on the other.
Subtest 11 (Disassemble): Record the total number of attempts made by the examinee. The number of attempts is equal to the number of rivets moved, regardless of whether or not they were properly disassembled or reinserted. The Part 11 score can be determined by counting (a) the number of empty holes in the lower part of the board; (b) the number of rivets inserted in the upper part of the board plus the number of rivets dropped; or (c) the number of rivets remaining in the holes in the lower part of the board subtracted from the total number of rivets (50). Any two of these methods should be used in scoring, one being used as a check on the other.

* Form Matching was dropped for GATB Forms E & F.

Number and Type of Items
The number of items for each of the 11 subtests is shown in Table 30. The GATB Forms E and F were designed to be equivalent forms, containing the same number and types of items, and the items are composed of work-neutral content. The power tests use a combination of forced-choice and multiple-choice items, whereas the speeded apparatus subtests are psychomotor tasks that assess a participant's manual and finger dexterity.
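As a rough illustration of the norm-referenced reporting noted in Table 30 (raw scores converted to percentile ranks and stanines within a norm group), the sketch below applies the conventional stanine percentile cut points to a hypothetical norm sample; the norm-group scores and the examinee's raw score are invented for illustration and are not GATB data.

```python
# Rough sketch of norm-referenced reporting as noted in Table 30:
# a raw score is converted to a percentile rank within a norm group,
# then to a stanine using the conventional percentile cut points.
# The norm-group scores below are hypothetical.
from bisect import bisect_right

norm_scores = sorted([22, 25, 27, 28, 30, 31, 31, 33, 35, 36,
                      38, 40, 41, 43, 45, 47, 50, 52, 55, 60])

# Conventional cumulative-percentage boundaries for stanines 1-8 (9 is everything above).
STANINE_CUTS = [4, 11, 23, 40, 60, 77, 89, 96]

def percentile_rank(raw: int) -> float:
    """Percent of norm-group scores at or below the raw score."""
    return 100.0 * bisect_right(norm_scores, raw) / len(norm_scores)

def stanine(pr: float) -> int:
    """Map a percentile rank to a stanine (1-9) using the conventional cut points."""
    return 1 + bisect_right(STANINE_CUTS, pr)

raw = 41  # hypothetical examinee raw score
pr = percentile_rank(raw)
print(f"raw = {raw}, percentile rank = {pr:.0f}, stanine = {stanine(pr)}")
```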
The creation of the GATB Forms E and F resulted in fewer items for several of the subtests compared to the earlier Forms A through D. Changes were also made to the order of the power subtests (i.e., subtests 1-6). The major differences between Forms E and F and Forms A through D were that (a) the total number of items across the subtests was reduced from 434 items to 224 items, (b) the total time to take the battery increased from 42 minutes to 51 minutes, and (c) Form Matching was eventually dropped from the battery, leaving 11 subtests instead of 12.

Time Limits
Overall administration time is 2.5 hours. The seven paper-pencil subtests (i.e., subtests 1-7) can be administered in approximately 1.5 to 2 hours; this is an option for individuals applying for, or seeking career/vocational guidance about, jobs that do not require manual or finger dexterity. Similarly, the six non-psychomotor subtests (i.e., subtests 1 through 6) of GATB Forms E and F can be administered together in about 1.5 to 2 hours; if individuals are not applying for or seeking career guidance for occupations requiring psychomotor abilities, subtests 1-6 are a sufficient combination. The total timed testing time for all 11 subtests is 51 minutes, not including administration and instruction time.

Type of Scores
The original GATB Forms A through D used number-correct scoring, in which the final score is calculated by adding the number of questions answered correctly for each subtest. Each of subtests 1 through 6 consists of multiple-choice items. For these forms, there were no penalties for incorrect answers. One issue with this type of scoring was that examinees who guessed, or even responded randomly but rapidly, could increase their total scores on the speeded tests. Steps were taken to reduce the speededness of the power tests in order to reduce the influence of such test-taking strategies.

For the three remaining speeded tests (i.e., Computation, Object Matching, and Name Comparison), a conventional scoring formula was chosen that imposes a penalty for incorrect responses. The premise of the formula is that the penalty for incorrect responses is based on the total number of response alternatives for each subtest item. If there are k alternatives and an examinee responds randomly, there will be, on average, k - 1 incorrect responses for every correct response. The formula therefore imposes a penalty for incorrect responses that cancels out the number of correct responses expected purely by chance from random responding. The basic form of the formula is R - W / (k - 1), where R is the number of correct responses, W is the number of incorrect responses, and k is the number of options for each subtest item. The specific formulas for the subtests are as follows:

A. Computation: R - W / 4 (each incorrect response reduces the score by 1/4 of a point)
B. Object Matching: R - W / 3 (each incorrect response reduces the score by 1/3 of a point)
C. Name Comparison: R - W (each incorrect response reduces the score by 1 point)

Number-correct scoring remained for the power tests of GATB Forms E and F. Alternative scoring methods were considered, but the number-correct approach was selected, in part, because it was thought to be easier for test takers to understand.
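To make the correction-for-guessing rule concrete, the following minimal sketch applies the R - W / (k - 1) formula to hypothetical response counts; the k values follow the subtest formulas listed above, but the counts themselves are illustrative only.

```python
# Minimal sketch of the GATB correction-for-guessing rule, S = R - W / (k - 1).
# Response counts below are hypothetical; k values follow the formulas above
# (Name Comparison items are two-choice, so k - 1 = 1).

def corrected_score(right: int, wrong: int, k: int) -> float:
    """Return the formula score R - W / (k - 1) for a speeded subtest."""
    return right - wrong / (k - 1)

examples = {
    # subtest: (R, W, k) -- counts are made up for illustration
    "Computation":     (25, 8, 5),   # penalty of 1/4 point per error
    "Object Matching": (30, 6, 4),   # penalty of 1/3 point per error
    "Name Comparison": (70, 10, 2),  # penalty of 1 point per error
}

for name, (r, w, k) in examples.items():
    print(f"{name}: raw correct = {r}, formula score = {corrected_score(r, w, k):.2f}")
```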
Method of Delivery
The GATB Forms E and F are administered via a paper-pencil test booklet, as well as manual and finger dexterity boards, proctored by a test administrator. Subtests 1 through 6 are answered by examinees in a booklet, whereas subtests 7 through 11 use materials/apparatuses that are supplied and overseen by the test administrator. The power subtests (i.e., subtests 1 through 6) can be administered either to a single individual or in a group setting, but it was recommended that the power tests be administered prior to the speeded subtests (i.e., subtests 7 through 11: Mark Making, Place, Turn, Assemble, and Disassemble, respectively). There appears to be no online method of test administration for the power subtests.

Cost
Table 31. Cost of GATB test materials and resources.

Item                                            Price
Manuals
  Administration & Scoring                      $56.75
  GATB Application and Interpretation Manual    $94.75
Test Booklets
  Booklet 1 Form A (Parts 1-4)                  $99.00
  Booklet 2 Form A (Parts 5-7)                  $99.00
  Booklet 1 Form B (Parts 1-4)                  $99.00
  Booklet 2 Form B (Parts 5-7)                  $99.00
Answer Sheets (pkg. 35)
  Form A (Parts 1-7)                            $34.75
  Form B (Parts 1-7)                            $34.75
  Mark Making Sheets (Part 8)                   $25.00
Dexterity Boards
  Pegboard (Manual Dexterity)                   $254.75
  Finger Dexterity Board                        $179.50
Scoring Masks
  Form A (Parts 1-7)                            $49.70
  Form B (Parts 1-7)                            $49.70
Recording Sheets (pkg. 35)
  Self-estimate Sheets                          $25.00
  Result Sheets                                 $53.50
Additional Resources
  Examination Kit                               $175.00
  GATB Score Conversion/Reporting Software      $359.50
  Interpretation Aid Charts                     $12.00
*Costs are from Nelson, which publishes GATB Forms A and B in Canada.

Construct - Content Information

Intended Constructs
GATB Forms E and F are comprised of subtests similar to those in many other commercially available cognitive ability tests. The battery includes two verbal, two quantitative (one of which is based upon reasoning), one object matching, one spatial, and three psychomotor subtests (the finger and manual dexterity subtests each comprise two tests using the same apparatus). What sets the GATB Forms E and F battery apart from most other commercially available cognitive batteries is its inclusion of psychomotor ability subtests measuring motor coordination, finger dexterity, and manual dexterity. Like most commercially available cognitive batteries, it measures a range of abilities that are related to a wide range of occupations in the U.S. (nearly every occupation/job in the U.S.).

The developers of GATB Forms E and F set out to revise the previous GATB Forms A through D after reviews concluded that revisions were necessary. The revision objectives were to (a) reduce speededness and susceptibility to coaching, (b) establish more appropriate scoring procedures, (c) develop better instructions for the examinee, (d) develop test items free from bias, (e) assemble parallel test forms, (f) improve the overall "look-and-feel" of the test and accompanying resources, and (g) revise the answer sheets and other supporting documents to be consistent with changes to test format. Also, Forms E and F do not include the Form Matching subtest.

Abilities measured by the GATB subtests:

Verbal Ability (VA) is the ability to understand the meaning of words and use them effectively in communication when listening, speaking, or writing.

Arithmetic Reasoning (AR) is the ability to use math skills and logical thinking to solve problems in everyday situations.
Computation (CM) is the ability to apply arithmetic operations of addition, subtraction, multiplication, and division to solve everyday problems involving numbers.

Spatial Ability (SA) is the ability to form and manipulate mental images of 3-dimensional objects.

Form Perception (FP) is the ability to quickly and accurately see details in objects, pictures, or drawings.

Clerical Perception (CP) is the ability to quickly and accurately see differences in detail in printed material.

Motor Coordination (MC) is the ability to quickly and accurately coordinate hand or finger motion when making precise hand movements.

Manual Dexterity (MD) is the ability to quickly and accurately move the hands easily and skillfully.

Finger Dexterity (FD) is the ability to move the fingers skillfully and easily.

Table 32 shows the linkages between the target abilities and the subtests.

Table 32. Abilities measured by GATB subtests.

Subtest                        Ability measured
1. Name Comparison             Clerical Perception (CP)
2. Computation                 Computation (CM)
3. Three-Dimensional Space     Spatial Ability (SA)
4. Vocabulary                  Verbal Ability (VA)
5. Object Matching             Form Perception (FP)
6. Arithmetic Reasoning        Arithmetic Reasoning (AR)
7. Mark Making                 Motor Coordination (MC)
8. Place                       Manual Dexterity (MD)
9. Turn                        Manual Dexterity (MD)
10. Assemble                   Finger Dexterity (FD)
11. Disassemble                Finger Dexterity (FD)

Item Content
The GATB Forms E and F consist of 11 subtests that are intended to measure nine abilities important for a wide range of occupations. Because Forms E and F were intended to be used with nearly every occupation in the U.S., the item content was developed at a lower level of difficulty than that of many commercially available cognitive batteries, since the battery may be used with semi-skilled (and perhaps unskilled) occupations through professional and healthcare practitioner occupations. The reading level of the GATB Forms E and F subtests was set to a relatively low level of grade six. While the battery is intended for personnel selection and occupational guidance, all item content is work-neutral. Subtests 1 through 6 are power tests composed of multiple-choice questions. Subtests 7 through 11 are speeded tests for which the administrator supplies and oversees the materials or apparatus used to measure the specific abilities.

Description of Subtests and Sample Items
1. Name Comparison: Consists of determining whether two names are the same or different.
2. Computation: Consists of mathematical exercises requiring addition, subtraction, multiplication, or division of whole numbers.
3. Three-Dimensional Space: Consists of determining which one of four three-dimensional figures can be made by bending and/or rolling a flat, two-dimensional form.
4. Vocabulary: Consists of indicating which two words out of four have either the same or opposite meanings.
5. Object Matching: Consists of identifying the one drawing out of four that is the exact duplicate of the figure presented in the question stem.
6. Arithmetic Reasoning: Consists of mathematical word problems requiring addition, subtraction, multiplication, or division of whole numbers, fractions, and percentages.
7. Mark Making: Consists of using the dominant hand to make three lines within a square.
8. Place: Consists of using both hands to move pegs, two at a time, from the upper part of the board to the lower part.
9. Turn: Consists of using the dominant hand to turn pegs over and insert them back into the board.
10. Assemble: Consists of using both hands to put a washer on a rivet and move the assembled piece from one part of the board to another.
11. Disassemble: Consists of using both hands to remove a washer from a rivet and put the disassembled pieces into different places on the board.

Construct Validity Evidence
The developers of the DAT for PCA reported correlations between the DAT for PCA subtests and the GATB. The pattern of correlations in Table 33 below provides support for the GATB: (1) six GATB cognitive factors are highly related to the DAT battery, (2) the GATB's general intelligence factor was highly related to all of the DAT subtests except Clerical Speed and Accuracy, (3) each of the GATB factors has its highest correlation with the appropriate DAT subtests, and (4) the GATB perceptual and motor tests correlated relatively highly with the DAT's Clerical Speed and Accuracy subtest. Table 33 shows the correlations between the GATB and DAT.

Table 33. Correlations between DAT and GATB subtests.
VR NA VR + NA AR CSA .73 NA VR + NA AR CSA MR SR SP LU G - - .63 .41 .70 .44 .71 .46 .35 MR SR SP LU G V .68 .68 .68 .81 .78 .76 .63 .67 .57 .66 .72 .64 .71 .72 .68 .80 .81 .76 .69 .70 .44 .51 .64 .58 .27 .38 .40 .44 .48 .46 .72 .35 .55 .62 .57 .40 .53 .64 .55 .75 .70 .68 .80 .81 .94 N S P Q .52 .53 .19 .40 .62 .58 .23 .42 .61 .59 .22 .44 .43 .63 .24 .39 .48 .42 .36 .61 .29 .58 .19 .21 .33 .68 .27 .29 .64 .40 .24 .50 .58 .45 .19 .39 .66 .70 .37 .57 V N S P .54 .57 .28 .51 .41 .35 .62 .49 .46 .49

Table 34 below presents a confirmatory factor analysis of the correlations among aptitudes measured by the GATB battery (Hunter, 1983). The results of the confirmatory factor analysis show that the aptitudes break into three clusters: (a) cognitive, (b) perceptual, and (c) psychomotor. This is complementary to studies computing correlations between validity coefficients for aptitudes measuring the same abilities.

Table 34. Confirmatory factor analysis between factors and aptitudes.
Factors: Cognitive (VN), Perceptual (PQ), Psychomotor (KFM)
Aptitudes: Intelligence (G) .82 Verbal Aptitude (V) .68 .82 Numerical Aptitude (N) .77 .61 Spatial Aptitude (S) .59 .81 Form Perception (P) .64 .81 Clerical Perception (Q) .78 Motor Coordination (K) .48 .60 Finger Dexterity (F) .25 .46 Manual Dexterity (M) .19 .45 .32 .42 .35 .66 .54 .64 .67 .72

Item and Test Development

Test Development Procedures
Note: Information below regarding item review and revisions is taken directly from Mellon, Daggett, MacManus, and Moritsch (1996), which provides an extensive report of the development of GATB Forms E and F.

Item Writing and Editorial Review and Screening
For the development of GATB Forms E and F, great effort went into developing new items; many more items were originally composed than ended up in the final versions. The developers analyzed items from previous forms of the GATB and sorted them into categories based upon item difficulty. Detailed information on the specifications and item types/content categories for each of subtests 1 through 7 is reported below.

A literature search was first performed to determine the proper procedures for conducting item reviews and selecting participants for the review process. The item review instruments to be incorporated into the process were then identified, along with information regarding three issues: bias guidelines, procedural issues, and rating questions.

Preliminary Review.
Draft versions of item sensitivity review questions, instructions, and an answer form were sent to Assessment and Research Development (ARD) centers for review. Based on the comments, Assessment and Research Development of the Pacific (ARDP) staff revised draft versions of the sensitivity review materials and sent them to Assessment Research and Development Centers (ARDCs) for further review. The only revision was a minor change in the answer form. Pilot Test. A pilot test was conducted in-house with three Cooperative Personnel Services (CPS) staff members, enabling individuals who were not involved in the ARDP test research program to provide input to the review process. The results led to a number of modifications in procedures, instructions, and documents that would be used for the item review. Item Review Materials. Nine documents were used in the item review process: (1) a list of the criteria to select panel members, (2) a confidentiality agreement, (3) a description of the GATB tests and aptitudes, (4) written instructions for panel members, (5) the administrator’s version of the written instructions for panel members, (6) a list of characteristics of unbiased test items, (7) a list of the review questions with explanations, (8) an answer form, and (9) an answer form supplement. Panel Member Characteristics. Seven panel members participated in the review. The panel included two African Americans, three Hispanics, and two whites. Three members were male and four female. Three members were personnel analysts, two were university professors in counselor education, one was a personnel consultant, and one was a postdoctoral fellow in economics. Procedures. At an orientation meeting held at each of the three participating ARDCs, confidentiality agreements were signed, GATB items and instructions were given to panel members, and several items in each test were reviewed and discussed. Panel members reviewed the remaining items at their convenience. After all items were reviewed, a follow-up meeting was held at each center to resolve any problems and to discuss the review process. A. Name Comparison. The 400 Name Comparison items were developed to be parallel to Form A items and representative in terms of gender and ethnicity. The number of items with names that were the same was equal to the number of items with different names. Item sources 65 included directories, dictionaries, and item developer creativity. Analyses were then performed to develop preliminary estimates of item difficulty. Based on these analyses, the number of characters in the left-hand column of the two-column format used for this test was selected as the item difficulty measure. The 200 items for each form were divided into four 50item quarters of approximately equal estimated overall difficulty. The item order was then randomized within each quarter. a. Review and Screening i. Comments focused on racial, ethnic, and gender stereotyping and representation. Specific concerns included the lack of female and minority businesses, and the need for more females in nontraditional professions, jobs, and businesses. b. Content Revision i. The revisions addressed the racial, ethnic, and gender stereotyping and representation criticisms. Guidelines based on the 1990 U.S. Census were used to increase racial/ethnic and gender representation. Stereotyping was addressed by including items with minorities and females in nontraditional occupations and businesses; more professional occupations and businesses were included. 
Fewer items with Germanic names were used. Format changes included separating the items into blocks of five, eliminating horizontal lines, and increasing the horizontal and vertical space within and between items. Finally, the instructions were reworded to increase clarity; bold and italicized types were used for emphasis. B. Computation. The 136 Computation items were developed to be parallel to Forms A-D. The original items were developed and reviewed to evaluate difficulty. The number of digits across numbers within each type of operation was used as the item difficulty measure. The 68 items for each form were divided into four 17-item quarters of equal estimated overall difficulty. Type of arithmetic operation and response options were balanced within each quarter. A lowdifficulty item was assigned to the first position within each quarter with the remaining items ordered randomly. a. Review and Screening i. Comments primarily dealt with item characteristics. Specific concerns included difficult and time-consuming problems that might be skipped by testwise applicants, poor distractors, and unclear instructions. b. Content Revision i. Distractors were revised to make them more plausible based on five error types. Minor format changes included adding commas to numbers with at least four digits and placing the operation sign within the item. Finally, the instructions were reworded slightly to increase clarity, and bold and italicized types were used for emphasis. C. Three-Dimensional Space. The 130 Three-Dimensional Space items were developed to be similar in content to prior forms. The number of folds was used as a measure of item difficulty; it had six levels. Newly developed items were grouped according to the number of folds so that an equal number of items would be developed for each of the six difficulty levels. Items were then drawn on a computer, using the CADD-3 software package. Items were continually reviewed for clarity and correctness, and shading was added. Completed items were transferred to Mylar paper and reduced in size photographically, and then plates were made for printing. Items were reviewed again and revised when necessary. Items were then assigned to forms on the basis of difficulty, and response options were checked and tallied. Option positions were changed as necessary. The items were rephotographed and printed. The ARDP used the CorelDRAW! 4 graphics package (Corel Corporation, 1993) to redraw all of the items to make them consistent in appearance. Camera-ready copies of the reformatted 66 items were prepared and sent to a graphic artist for proofing. Some of the items were later revised to correct the problems identified by the graphic artist. Three difficulty levels were identified based on the number of folds and/or rolls made in each item. These difficulty values were then used to form three 16-item quartiles and one 17-item quartile of approximately equal estimated overall difficulty within each form. Within each quartile a low-difficulty item was assigned to the first position with the order of the remaining items randomized. The correct response option frequencies were balanced within each quartile. a. Review and Screening i. Comments concerned possible gender bias and item characteristics. Comments included the presence of male-oriented items and abstract items that might be unfamiliar to females, difficult and time-consuming items that could be skipped by testwise applicants, gender-biased instructions, and overly complicated items. b. 
Content Revision i. Individual items were revised when needed to increase clarity. Revisions were reviewed by a graphics expert familiar with the test format and the drawing software to ensure that the items were free of errors. Instructions were reworded slightly to increase clarity and eliminate possible gender bias; bold and italicized types were used for emphasis. D. Vocabulary. The 160 Vocabulary items were developed to be parallel to Form B. Item review also focused on word difficulty but used a different approach from previous GATB development efforts. Specifically, The Living Word Vocabulary (Dale & O’Rourke, 1981) provided estimates of item difficulty. This reference assigns a grade level to each word meaning. The assigned grade level is based on the responses of students who completed vocabulary tests during the period of 1954-1979. When multiple word meanings were reported for a given word, the average grade level was used. Higher grade levels indicated greater difficulty. The mean of the reported grade levels for the four words that made up each item was used to estimate item difficulty. Four difficulty level categories were formed. These categories were used to prepare four 20-item quartiles of equal estimated overall difficulty for each form. For each quartile, the two items with the lowest estimated difficulty appeared in the first two positions with the order of the 18 remaining items randomized. The correct response option frequency distributions were balanced within quartiles and forms. a. Review and Screening i. Comments concerned high reading grade level, overly difficult words; words with different meanings for different groups; and inclusion of foreign-language words and technical, biological, and scientific terms. b. Content Revision i. Words were replaced on the basis of the item review panel member comments and on an analysis of word difficulty in Dale and O’Rourke (1981). Items were modified as needed to ensure that each item’s level of word difficulty was appropriate, word forms within items were identical, and the same type of correct response (i.e., synonym or antonym) was maintained within each item. The item format was changed from horizontal to vertical ordering of words. Finally, the instructions were reworked (e.g., bold and italicized types were used to emphasize important points, and a statement was added stressing that all choices should be considered before selecting an answer). E. Object Matching. The 163 original Object Matching items were developed to be parallel to Forms A-D. The ARDP used the number of shaded areas in the four response alternatives for each item to estimate difficulty level. Difficulty level, content considerations, and location of the correct response were used to form four 20-item quartiles of similar overall difficulty for each form. (Three items were deleted.) The item order was randomized within each quartile. 67 A surplus item was then added to each quartile to form three seven-item pages that could be shifted to meet the requirements of the research design. a. Review and Screening i. Comments focused mainly on possible gender bias due to differences in familiarity and the presence of male-oriented items. However, concerns were also expressed that items with electrical and mechanical components might cause problems for minorities due to lack of familiarity and opportunity to learn. Other comments concerned clarity of instructions and positioning of the response letters for the item alternatives. b. Content Revision i. 
Item revisions included eliminating inconsequential differences among item responses, eliminating duplicate responses, and refining responses (e.g., removing extraneous matter, drawing sharper lines, eliminating broken lines). Finally, the instructions were reworded slightly to increase clarity; bold and italicized type was used for emphasis, and the test name was changed from Tool Matching to Object Matching. (Future forms will include more generic items even though the results from item analyses indicated that female scores are slightly higher than male scores on the current items.) F. Arithmetic Reasoning. The 66 Arithmetic Reasoning items were developed to be parallel to Form A. New situations, contemporary monetary values, gender representation, exclusion of extraneous information, and a sixth-grade reading level were additional considerations in item development. The ARDP reviewed and revised the items so they conformed more closely to the guidelines for development. Item difficulty was estimated by the number of operations needed to solve the problem, the type(s) of operations, and the number of digits included in the terms used in the operation(s). One of the two least difficult items was assigned to the first item position in Form E and the other item assigned to Form F. The remaining 64 items were then assigned to four eight-item quartiles for each form on the basis of difficulty, type (s) of operation(s), correct response key, and content. The items in each quartile were ordered from least to most difficult with the item order then randomized within each quartile. a. Review and Screening i. Most comments were directed toward two areas: (1) racial, ethnic, and gender representation, and (2) gender occupational and activity stereotyping. Other comments concerned time-consuming items that might be skipped by testwise applicants, confusing and incomplete instructions, the presence of items that were overly complicated or involved too many steps, and some groups not having the opportunity to learn how to perform the operations needed to answer the complex items. b. Content Revision i. Revisions involved four areas: making minor item format modifications, eliminating gender stereotyping, making the distractors more plausible, and increasing racial, ethnic, and gender representation. The instructions were reworded slightly to increase clarity; bold and italicized types were used for emphasis. G. Form Matching. The 200 Form Matching items were developed to be parallel to Forms A-D items in terms of content and parallel to Form A item size and arrangement. Eight 25-item blocks were developed by modifying each of the eight blocks of items in Forms A-D. The number of response options for each item was reduced from 10 to five (Form Matching was later removed for the final versions of Forms E and F). a. Review and Screening i. Comments included a possible practice effect for the test and unclear instructions because of reading level. Comments that were directed toward 68 specific items included linear illustrations being perceived as “hostile,” minute differences among shapes, and possible confusion due to shape similarity and location. b. Content Revision i. Changes included enlarging figures to increase clarity, repositioning items to equalize space among items in the lower blocks, and revising an item family to make it less similar to another item family. The number of response options was reduced from 10 to 5. 
Finally the instructions were reworded slightly to increase clarity; bold and italicized types were used for emphasis. Item Tryout and Statistical Screening. The item pretest and analysis had two goals: (1) conducting an item analysis to obtain preliminary difficulty and discrimination indices, and (2) obtaining a quantitative estimate of ethnic and gender performance differences for each item. The sample comprised 9,327 applicants from USES local offices in the five geographic regions represented within the ARDP. Data were obtained by administering to the sample members 16 test booklets comprising three speeded tests and one power test. Each sample member completed one test booklet. Classical test theory item analyses were performed for the speeded test items. Item selection criteria included difficulty, discrimination, and content considerations. IRT procedures were used for the power test items. The analyses included dimensionality, position effects, item and test fairness, and test information graphs. Item DIF analyses were also performed with Mantel-Haenszel procedures. IRT procedures were used for test-level DIF analyses. Construction of the Final Version of Forms E and F. After items were screened and calibrated, a final set of items was selected for each Form E and Form F test. Items were selected to yield forms as parallel to each other as possible with respect to content coverage, difficulty, and test information. The forms were also balanced on subgroup difference statistics so that no one form provided any relative disadvantage to females, African Americans, or Hispanics. Insofar as possible, the power tests were also designed to be similar with respect to difficulty and information to Form A, after adjusting for differences in test lengths. Psychometric Model Classical test theory item analyses were performed for the speeded test items. Item selection criteria included difficulty, discrimination, and content considerations. IRT procedures were used for the power test items. IRT approaches involve estimating separate item characteristic curves (ICCs) for the two groups being compared. The analyses included dimensionality, position effects, item and test fairness, and test information graphs. Item DIF analyses were also performed with Mantel-Haenszel procedures. IRT procedures were used for test-level DIF analyses. They evaluated the dimensionality of the power tests and performed computer analyses to estimate Item Response Theory (IRT) parameters, in addition to conducting a preliminary selection of items. They used this information, in conjunction with their own analyses for assessing differential item functioning (DIF), to select the final items for the new forms. Equating Multiple Forms In order to equate the new forms, Forms E and F, a study was implemented to equate them with Form A. They collected a nationwide sample of data for 8,795 individuals that were representative of applicant populations. The technical requirements and foundation for these procedures reported below are presented in Segall and Monzon (1995). 
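As an illustration of the Mantel-Haenszel DIF screening mentioned in the item tryout and psychometric model discussion above, the sketch below pools a common odds ratio across total-score strata and converts it to the ETS delta-difference metric (MH D-DIF); the response records are fabricated and the code is a simplified sketch, not the procedure actually used by the GATB developers.

```python
# Minimal sketch of a Mantel-Haenszel DIF check of the kind mentioned above:
# examinees are stratified on total score, a common odds ratio is pooled across
# strata, and converted to the ETS delta-difference metric (MH D-DIF).
# All response data here are fabricated.
import math
from collections import defaultdict

# (group, total_score_stratum, item_correct) -- fabricated records
records = [
    ("ref", 1, 1), ("ref", 1, 0), ("foc", 1, 0), ("foc", 1, 0),
    ("ref", 2, 1), ("ref", 2, 1), ("foc", 2, 1), ("foc", 2, 0),
    ("ref", 3, 1), ("ref", 3, 1), ("foc", 3, 1), ("foc", 3, 1),
    ("ref", 2, 0), ("foc", 2, 1), ("ref", 3, 0), ("foc", 3, 0),
]

# Tally a 2x2 table (group x correct/incorrect) within each score stratum
tables = defaultdict(lambda: {"A": 0, "B": 0, "C": 0, "D": 0})
for group, stratum, correct in records:
    cell = {("ref", 1): "A", ("ref", 0): "B", ("foc", 1): "C", ("foc", 0): "D"}[(group, correct)]
    tables[stratum][cell] += 1

num = den = 0.0
for t in tables.values():
    n = t["A"] + t["B"] + t["C"] + t["D"]
    num += t["A"] * t["D"] / n   # reference-correct * focal-incorrect
    den += t["B"] * t["C"] / n   # reference-incorrect * focal-correct

alpha_mh = num / den                      # common odds ratio
mh_d_dif = -2.35 * math.log(alpha_mh)     # ETS delta-difference metric
print(f"alpha_MH = {alpha_mh:.2f}, MH D-DIF = {mh_d_dif:.2f}")
```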
Data Collection Design
Three samples of data were collected for the GATB Forms E and F equating study: (a) an independent-groups sample, which was used to equate the new forms with the old GATB forms; (b) a repeated-measures sample, which was used for comparing the reliability and construct validity of the new and old forms, as well as to supplement the equating analysis; and (c) a psychomotor sample, which was used to examine the need for composite equatings and to examine the construct validity of the psychomotor tests.

For the independent-groups sample, 5,892 participants were randomly assigned to one of three forms (i.e., A, E, or F). Because the old form (Form A) and the new forms (Forms E and F) have different test orders, time limits, and instructions, the different versions could not be administered to a single group of participants at the same time. Participants were therefore assigned to test versions as groups at their testing location, and the groups were physically separated during administration of the forms.

Each participant in the repeated-measures sample was administered two forms of the GATB (drawn from Forms A, B, E, and F); this sample was used primarily for examining the reliability and construct validity of the GATB and also to supplement the equating data. Each participant was randomly assigned to one of eight conditions. Table 35 shows the conditions to which participants were assigned.

Table 35. Repeated-measures design and sample sizes. Second Test First Test A B E A 1 (411) B 2 (432) 5 (209) E 6 (215) F 4 (216) 8 (446) F 3 (236) 7 (446)

For the psychomotor subtests, 538 participants were sampled; each received the five psychomotor subtests (i.e., subtests 7 through 11) and also the non-psychomotor subtests (i.e., subtests 1 through 6). Participants were randomly assigned to one of two groups, each receiving (a) Form A (non-psychomotor), (b) Form A (psychomotor), and (c) Form F (non-psychomotor). The order of presentation of test forms was counterbalanced across the two groups. For example, one group received Form A (non-psychomotor) and Form A (psychomotor) in the morning and Form F (non-psychomotor) in the afternoon; the other group received the same battery, but with the order of the non-psychomotor tests of Forms A and F reversed. Table 35 above shows the order and forms of tests administered.

An evaluation of the random equivalence of selected groups within each of the three samples was also conducted, because the random equivalence of these groups is a key assumption made in the equating, reliability, and validity analyses. These tests were conducted by gender, race, age, and education. Each test produced non-significant findings, which are consistent with the expectation based on random assignment of examinees to conditions and support the assumption of equivalent groups made in the equating, reliability, and validity analyses. This also provides assurance that the assignment procedures worked, producing groups that are randomly equivalent with respect to demographic characteristics.

Reliability Analysis
Because the new test forms have fewer items and longer time limits, a reliability analysis was needed to determine the precision with which the new forms, Forms E and F, measure their intended abilities.
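The comparison described in the next paragraph rests on Fisher's r-to-z transformation for testing the difference between two independent correlations. A minimal sketch of that computation is shown below with hypothetical reliability coefficients and sample sizes; the actual GATB values appear in Table 36.

```python
# Minimal sketch of comparing two independent alternate-form reliability
# coefficients with Fisher's r-to-z transformation, as used in the GATB
# Forms E/F analyses. The coefficients and Ns below are hypothetical.
import math

def fisher_z(r: float) -> float:
    """Fisher r-to-z transformation."""
    return 0.5 * math.log((1 + r) / (1 - r))

def z_diff(r1: float, n1: int, r2: float, n2: int) -> float:
    """z statistic for the difference between two independent correlations."""
    return (fisher_z(r1) - fisher_z(r2)) / math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))

# Hypothetical example: new-form reliability .91 (N = 870) vs. old-form .89 (N = 820)
z = z_diff(0.91, 870, 0.89, 820)
print(f"z = {z:.2f}")  # compare against a normal critical value, e.g. 1.645 one-tailed
```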
Four groups were used from the repeated-measures sample. Fisher's z transformation, as described in Cohen and Cohen (1983), was used to test the significance of the difference between the alternate-form correlations of the new and old GATB forms. The alternate-form reliabilities of the new GATB forms, Forms E and F, are generally as high as, or higher than, those of the old GATB Forms A and B. This is notable because the length of three of the power tests was decreased; the increase in testing time, however, may have added to the reliability of these power tests, offsetting the detrimental effects of the shortened test lengths.

Both new forms were administered to a large sample of test takers, as were Forms A and B. Composites of subtest scores were formed for the ability measures shown in Table 36; the alternate-forms reliabilities reported in Table 36 are the reliabilities of these composite scores. Tests of the differences between new-form and old-form reliabilities showed that Forms E and F were generally as reliable as, or more reliable than, Forms A and B.

Table 36. Alternate forms reliability estimates for Forms A and B and for Forms E and F.

Ability Composite             RE,F (N = 870)   RA,B (N = 820)      z        p
General Learning Ability          .908             .886          2.284    .011
Verbal Aptitude                   .850             .858          -.580    .281
Numerical Aptitude                .876             .884          -.704    .241
Spatial Aptitude                  .832             .805          1.648    .050
Form Perception                   .823             .824          -.049    .480
Clerical Perception               .778             .755          1.145    .126
* RE,F = new forms; RA,B = old forms

Overall Form Linking Study Results. Equating of the new GATB forms to the old forms proved successful. Despite the changes made in the new test forms, the evidence suggests that there is sufficient similarity to obviate the need for separate composite equating tables for the non-psychomotor composites. Average subgroup performance levels are similar across the old and new forms, and reliabilities of the new GATB forms are generally as high as or higher than those of the old forms. Construct validity analyses of the old and new forms suggest that the GATB validity data can continue to be used for the new forms.

Item Banking Approaches
While IRT was used in the development of the power tests, no information was reported about how item banking approaches were used in the development or maintenance of GATB Forms E and F. It is likely that new items were stored in an item bank, but the fixed-form mode in which Forms E and F, as well as the earlier Forms A and B, were administered indicates that no feature of the administration of these fixed forms relied on the ongoing use of banked item characteristics.

Criterion Validity

Validity across Job Families
By the early 1980s, over 500 criterion validity studies had been conducted with GATB subtests. To organize comprehensive meta-analyses of these studies (Hartigan & Wigdor, 1989; Hunter, 1983), three composites of GATB subtests were created: (a) cognitive ability, (b) perceptual ability, and (c) psychomotor ability. In addition, broad job families were identified to organize meta-analyses of these studies and to evaluate the generalizability of GATB criterion validity. Included in this work are 515 validation studies that were performed for the U.S.
Employment Service. Of the 515 validation studies, 425 used a criterion of job proficiency while the remaining 90 used a criterion of training success. Tables 37a through 37c show the variation in validities for each of the three ability composites: the distribution of observed validity coefficients, the distribution of validities corrected for sampling error, and the distribution of validity coefficients corrected for the artifacts of range restriction and criterion unreliability. In these tables, the variation of coefficients is across the entire job spectrum. Much of the variation in Table 37a is due to sampling error in the observed validity coefficients. Overall, Table 37c shows that the three GATB composites are valid predictors of job performance across all jobs, but the level of true validity varies across jobs.

Table 37a. Distribution of observed validity coefficients across all jobs.

                              Cognitive Ability   Perceptual Ability   Psychomotor Ability
Mean Observed Correlation           .25                 .25                  .25
Observed SD                         .15                 .15                  .17
Observed 10th Percentile            .05                 .05                  .03
Observed 90th Percentile            .45                 .45                  .47

Table 37b. Distribution of observed validity coefficients less variance due to sampling error.

                                           Cognitive Ability   Perceptual Ability   Psychomotor Ability
Mean Observed Correlation                        .25                 .25                  .25
SD Corrected for Sampling Error Variance         .08                 .07                  .11
Observed 10th Percentile                         .15                 .16                  .11
Observed 90th Percentile                         .35                 .34                  .39

Table 37c. Distribution of true validity (corrected for range restriction and criterion unreliability) across all jobs.

                              Cognitive Ability   Perceptual Ability   Psychomotor Ability
Mean Corrected r's                  .47                 .38                  .35
SD of Corrected r's                 .12                 .09                  .14
Observed 10th Percentile            .31                 .26                  .17
Observed 90th Percentile            .63                 .50                  .53

Validity Generalization Study
Two general conclusions are supported by this accumulation of studies: (a) the validity of reliable cognitive ability tests does not vary much across settings or time, and (b) construct-valid cognitive ability tests have some positive validity for all jobs (Hunter, 1983). Table 38 shows the average true validity for training success and job proficiency for each of the three ability composites. Table 39 shows that validity changes with the complexity of job content. As the complexity of a job decreases (5 being the lowest level), the validity of cognitive ability also decreases, but not to zero; cognitive validity is positively related to job complexity. At the same time, psychomotor ability has its highest validity with respect to proficiency where complexity is lowest. Table 39 also provides the results for training success. Across all levels of job complexity, the validity of cognitive ability for predicting training success is high. Overall, cognitive ability predicts job proficiency across all jobs, but validity does drop off at low levels of job complexity.

Table 38. Average true validity.

Study Type         # of Jobs   Cognitive Ability   Perceptual Ability   Psychomotor Ability   Average
Training Success       90            .54                 .41                  .26               .40
Job Proficiency       425            .45                 .37                  .37               .40
Average               515            .47                 .38                  .35               .40

Table 39. Average true validity as a function of job complexity.
Complexity       Job Proficiency                Training Success
Level         GVN     SPQ     KFM           GVN     SPQ     KFM
1             .56     .52     .30           .65     .53     .09
2             .58     .35     .21           .50     .26     .13
3             .51     .40     .32           .57     .44     .31
4             .40     .35     .43           .54     .53     .40
5             .23     .24     .48
Average       .45     .37     .37           .55     .41     .26
*GVN = Cognitive Ability, SPQ = Perceptual Ability, KFM = Psychomotor Ability

Table 40 shows that differential prediction is very effective when psychomotor ability is included in the composite. With respect to job complexity, psychomotor validity is at its highest where the validity of cognitive ability is at its lowest. Multiple regression equations, with weights that vary from one complexity level to another, yield multiple correlations higher than the validity of any single predictor. Using cognitive ability to predict proficiency at the three higher complexity levels and psychomotor ability to predict proficiency at the two lower levels raises the average validity. Thus, using psychomotor ability in combination with cognitive ability increases validity across levels of job complexity.

Table 40. Validity of ability combinations for job proficiency: Best single predictor and two sets of multiple regression weights with multiple correlations.

Complexity   Best Single          Beta Weights                    Beta Weights
Level        Predictor        GVN     SPQ     KFM     R3        GVN     KFM     R2
1              .56            .40     .19     .07     .59       .52     .12     .57
2              .58            .75    -.26     .08     .60       .58     .01     .58
3              .51            .50    -.08     .18     .53       .45     .16     .53
4              .43            .35    -.10     .36     .51       .28     .33     .50
5              .48            .16    -.13     .49     .49       .07     .46     .49
Average        .48            .42    -.09     .27     .51       .37     .24     .52
*GVN = Cognitive Ability, SPQ = Perceptual Ability, KFM = Psychomotor Ability
**R3 = multiple correlation with all three ability composites; R2 = multiple correlation with GVN and KFM

Table 41. Validity of ability combinations for training success: Best single predictor and two sets of multiple regression weights with multiple correlations.

Complexity   Best Single          Beta Weights                    Beta Weights
Level        Predictor        GVN     SPQ     KFM     R3        GVN     KFM     R2
1              .65            .57     .21    -.21     .68       .70    -.16     .66
2              .50            .72    -.30     .03     .53       .52    -.05     .50
3              .57            .57    -.07     .15     .58       .53     .13     .59
4              .54            .34     .17     .20     .59       .46     .24     .59
5
Average        .55            .59    -.10     .11     .57       .53     .08     .57
*GVN = Cognitive Ability, SPQ = Perceptual Ability, KFM = Psychomotor Ability
**R3 = multiple correlation with all three ability composites; R2 = multiple correlation with GVN and KFM

Overall, the results show that composites of GATB subtests are valid in predicting both job proficiency and training success across all jobs. The level of GATB validity varies as a function of job complexity. Even after jobs are stratified by complexity, there is some variation in validity. Some of this variation is most likely due to artifacts in the studies themselves; other variance may be due to unknown job dimensions beyond the overall job complexity dimension.

Hartigan and Wigdor (1989) also computed validity generalization analyses for the GATB. They had access to the full data set available a few years after Hunter (1983) conducted his studies, by which time a larger number of GATB validation studies had accumulated. Hartigan and Wigdor also found GATB validity to generalize across all jobs, although the validities in the newer studies were smaller than those reported by Hunter (1983). Several explanations have been offered for these differences, such as smaller sample sizes, the use of different criteria, and possibly the jobs surveyed in the newer studies. Overall, the newer studies provided sufficient evidence of validity generalization to support use of the GATB across all jobs in the U.S.
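To illustrate how the multiple correlations (R2, R3) reported in Tables 40 and 41 combine composite validities with the intercorrelations among the composites, the sketch below computes standardized regression weights and a multiple R from a correlation matrix; all of the correlation values used here are hypothetical and are not the GATB estimates.

```python
# Minimal sketch of computing a multiple correlation R from predictor
# intercorrelations and predictor-criterion validities, the quantity reported
# as R2/R3 in Tables 40-41. All correlation values below are hypothetical.
import numpy as np

# Hypothetical intercorrelations among GVN, SPQ, KFM composites
R_xx = np.array([
    [1.00, 0.60, 0.35],
    [0.60, 1.00, 0.45],
    [0.35, 0.45, 1.00],
])
# Hypothetical validities of GVN, SPQ, KFM against job proficiency
r_xy = np.array([0.45, 0.37, 0.37])

beta = np.linalg.solve(R_xx, r_xy)          # standardized regression weights
R_multiple = float(np.sqrt(beta @ r_xy))    # multiple correlation of the composite

print("beta weights:", np.round(beta, 2))
print("multiple R:", round(R_multiple, 2))
```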
Table 42 and Table 43 below show the differences in validity across job families, abilities, and criteria types, along with the differences between the Hunter (1983) study and the Hartigan and Wigdor (1989) validity generalization results. Table 42. Variation of validities across job families in 1983 (N = 515) and 1989 (N = 264) studies. Abilities GVN SPQ KFM Hunter NRC Hunter NRC Hunter NRC Job Family (1983) (1989) (1983) (1989) (1983) (1989) I (set-up/ .34 .16 .35 .14 .19 .08 precision) II (feeding/ .13 .19 .15 .16 .35 .21 offbearing III .30 .27 .21 .21 .13 .12 (synthesizing) IV (analyze/ compile/ .28 .23 .27 .17 .24 .13 compute) V (copy/ .22 .18 .24 .18 .30 .16 compare) *GVN = Cognitive Ability, SPQ = Perceptual, Ability, KFM = Psychomotor Ability 74 Table 43. Validities for the two sets of studies by job family and type of criterion. Performance Training Hunter NRC Hunter NRC (N) (N) (N) (1983) (1989) (1983) (1989) Job Family GVN I (set-up/ .31 1,142 .15 3,900 .41 180 .54 precision) II (feeding/ .14 1,155 .19 200 offbearing III .30 2,424 .25 630 .27 1,800 .30 (synthesizing) IV (analyze/ compile/ .27 12,705 .21 19,206 .34 4,183 .36 compute) V (copy/ .20 13,367 .18 10,862 .36 655 .00 compare) SPQ I (set-up/ .32 .13 .47 .40 precision) II (feeding/ .17 .16 offbearing III .22 .21 .18 .21 (synthesizing) IV (analyze/ compile/ .25 .16 .29 .25 compute) V (copy/ .23 .18 .38 .01 compare) KFM I (set-up/ .20 .07 .11 .16 precision) II (feeding/ .35 .21 offbearing III .17 .17 .11 .02 (synthesizing) IV (analyze/ compile/ .21 .12 .20 .17 compute) V (copy/ .27 .16 .31 .12 compare) (N) 64 347 3,169 106 Approach to Item / Test Security The historical use of GATB for selection managed test security by requiring all administrations to be proctored by trained and qualified test administrators. To our knowledge, no routine processes were in place to monitor item characteristics or to produce alternative form out of concern for item or test security. Translation / Adaption The GATB has been transcribed into 13 languages in 35 countries, including Arabic. Dagenais (1990) conducted a study on the GATB once translated from English to Arabic and used in Saudi Arabia. Samples of participants from the U.S. and Saudi Arabia were used and contrasted. The resulting factor structures were analyzed and found a three factor structure (i.e., cognitive, spatial perception, and psychomotor) for both samples, therefore establishing factor equivalence. The mean scores for both groups were similar in shape and amplitude. Thus, the overall findings supported the conclusion that GATB could be translated into Arabic and administered in Saudi Arabia. Table 44 shows the factor loadings for the GATB in the U.S. and Saudi Arabia. 75 Table 44. Factor loadings for Saudi Arabian (SA) and American (US) GATB scores. Factor 1 Factor 2 Factor 3 Spatial Cognitive Psychomotor Perception Communalities Subtest SA US SA US SA US SA US Computation .80 .84 .08 .06 .10 .20 .65 .95 Name Comparison Arithmetic Reasoning Vocabulary Object Matching Mark Making Man. Dex.: Turn Man. Dex.: Place Fing. Dex.: Disassem. Fing. 
Dex.: Assem Three-D Space Form Matching .74 .87 .16 .23 .11 .02 .58 .80 .69 .76 .00 -.02 .33 .41 .57 .74 .56 .79 .01 -.02 .51 .29 .57 .70 .55 .60 .27 .33 .42 .27 .55 .54 .47 .64 .39 .53 -.16 -.26 .36 .76 .07 .13 .77 .78 .10 -.10 .61 .63 .20 .02 .73 .74 .07 .13 .58 .56 .10 .16 .72 .74 .18 .10 .57 .59 .03 .06 .71 .67 .18 .23 .54 .51 .10 .38 .16 .16 .80 .81 .70 .82 .21 .52 .22 .23 .68 .48 .58 .61 I Eigenvalues % of Total Variance Cumulative % Tot. Var. % of Common Var. II III Sum 4.14 5.17 1.72 1.99 1.00 .85 .35 .43 .14 .17 .08 .07 .35 .43 .49 .60 .57 .67 .38 .48 .36 .34 .26 .18 6.86 8.01 User Support Resources Note: The GATB is out of print in the U.S., but it is still used in Canada (i.e., Forms A and B) for vocational counseling, rehabilitation, and occupational selection settings. Nelson in Canada can supply such resources as: 76 Administration and scoring manuals Application and interpretation manuals Test booklets Answer sheets Dexterity boards Scoring masks Recording sheets Examination kits Score conversion/reporting software Interpretation aid charts Evaluative Reviews Not only is the GATB the most extensively researched multiaptitude cognitive ability battery, the GATB is also the only cognitive ability battery that is linked to 800 plus occupations in the U.S., in the US O*NET system. Overall, GATB studies have demonstrated (a) GATB composites have high levels of reliability, (b) validity generalization studies have shown it to be a valid predictor of job performance for all jobs in the U.S. economy, and (c) Forms E and F were developed to race/ethnic and genderrelated sources of bias (Keesling, 1985). However, two limitations are reported: (a) because of the speeded subtests, individuals with disabilities may be penalized, and (b) the norms for the GATB are dated. INFORMATION ABOUT THE US OFFICE OF PERSONNEL MANAGEMENT’S NEW CIVIL SERVICE TEST SYSTEM, USA HIRE In the course of investigating the seven selected batteries, the new civil service testing system recently developed by the US Office of Personnel Management was identified as a useful comparison. This new system includes at least four newly developed cognitive ability tests. (An additional assessment, the Interaction test is a personality assessment.) Unfortunately, at the time of this Report, OPM was making no information about these tests available except what was already accessible online at https://www.usajobsassess.gov/assess/default/sample/Sample.action. This site provides a sample item for each new test. Screen shots of these sample items are shown here for reference below. Occupational Math Assessment In this assessment, you will be presented with multiple-choice questions that measure your arithmetic and mathematical reasoning skills. You will be asked to solve word problems and perform numerical calculations. You will also be asked to work with percentages, fractions, decimals, proportions, basic algebra, basic geometry, and basic probability. All of the information you need to answer the questions is provided in the question text. Knowledge of Federal rules, regulations, or policies is NOT required to answer the questions. You MAY use a calculator and scratch paper to answer the questions. Read the questions carefully and choose the best answer for each question. Once you have selected your response, click on the RECORD ANSWER button. You will not be able to review/change your answers once you have submitted them. This assessment contains several questions. 
For each question, you will have 5 minutes to select your answer. A sample question is shown below. 77 Occupational Math Example Solve for x. 3x - 3 = 6 1 3 6 9 12 Occupational Judgment Assessment In this assessment, you will be presented with a series of videos that are scenarios Federal employees could encounter on the job. For each scenario, you will be presented with four possible courses of action for responding to the scenario presented in the video. You will be asked to choose the response that is the MOST effective course of action and the response that is the LEAST effective course of action for that particular scenario. When you begin each scenario, the upper left half of the screen contains important information for you to review prior to watching the video. Below this information you will find the question. This question tells you from which perspective you should respond to the scenario. The video appears in the upper right. The bottom half of the screen contains the four courses of action for responding to the scenario. The video can be viewed in closed captioning by clicking on the CLOSED CAPTIONING button located to the left of the VOLUME button. After watching the video and reading through the four courses of action, click on the button under Most Effective to choose the course of action you consider the best for that situation. Do the same for the course of action you consider the worst by clicking on the button under Least Effective. This assessment contains several scenarios. For each scenario, you will have 5 minutes to select both the MOST and LEAST effective courses of action. A sample question is shown below. To view the sample video, click the PLAY button in the lower corner of the video box. To view the video in closed captioning, click on the CLOSED CAPTIONING button located to the left of the VOLUME button. 78 Occupational Judgment Example Progress: Watch the following video. Choose the most AND least effective course of action from the options below. Step 1. Scenario Barbara and Derek are coworkers. Barbara has just been provided with a new assignment. The assignment requires the use of a specific computer program. Derek walks over to Barbara's cubicle to speak to her. If you were in Barbara's position, what would be the most and least effective course of action to take from the choices below? Step 2. Courses of Action Most Effective Least Effective Try to find other coworkers who can explain how to use the new program. Tell your supervisor that you don't know how to use the program and ask him to assign someone who does. Use the program reference materials, tutorial program, and the help menu to learn how to use the new program on your own. Explain the situation to your supervisor and ask him what to do. Occupational Interaction Assessment In this assessment, you will be presented with questions that ask about your interests and work preferences. Read each question carefully, decide which of the five possible responses most accurately describes you, and then click on that response. Once you have clicked on that response, click on the RECORD ANSWER button. The possible responses vary from question to question. The responses will assess: 1) How much you agree with a statement; 2) How often you do, think, or feel things; or 3) How much or how often you do, think, or feel things compared to others. This assessment is not timed. A sample question is shown below. 
Occupational Interaction Example 79 Progress: If I forget to fill out a form with information that others need, I make sure to follow up. | Almost always | Often | Sometimes | Rarely Never Occupational Reasoning Assessment In this assessment, you will be presented with multiple-choice questions that measure your reasoning skills. You will be asked to draw logical conclusions based on the information provided, analyze scenarios, and evaluate arguments. All of the information you need to answer the questions is provided in the question text. Knowledge of Federal rules, regulations, or policies is NOT required to answer the questions. Read the questions carefully and choose the best answer for each question. Once you have selected your response click on the RECORD ANSWER button. This assessment contains several questions. For each question, you will have 5 minutes to select your answer A sample question is shown below. Occupational Reasoning Example All documents that contain sensitive information are considered to be classified. Mary has drafted a report that contains sensitive information. Based on the above statements, is the following conclusion true, false, or is there insufficient information to determine? The report drafted by Mary is considered to be classified. True False Insufficient Information Occupational Reading Assessment In this assessment, you will be presented with multiple-choice questions that measure your reading comprehension skills. You will be asked to read a passage or table and to answer questions based on the information provided in the passage or table. All of the information you need to answer the questions is 80 provided in the question text. Knowledge of Federal rules, regulations, or policies is NOT required to answer the questions. Read the questions carefully and choose the best answer for each question. Once you have selected your response click on the RECORD ANSWER button. This assessment contains several questions. For each question, you will have 5 minutes to select your answer Occupational Reading Example The job grade definitions establish distinct lines of demarcation among the different levels of work within an occupation. The definitions do not try to describe every work assignment of each position level in the occupation. Rather, based on fact finding and study of selected work situations, the definitions identify and describe those key characteristics of occupations which are significant for distinguishing different levels of work. In the context of the passage, which one of the following could best be substituted for distinct lines of demarcation without significantly changing the author's meaning? agreement comparisons connections boundaries 81 DESCRIPTIVE RESULTS FOR PET Overall Description and Uses Introduction and Purpose PSI’s Professional Employment Test (PET) battery consists of four subtests that measure three major cognitive abilities. The PET was developed in 1986 and has been used extensively for selection purposes with occupations requiring a high-level of education. This battery was developed to assess specific abilities that have been found to be important in performing entry-level, professional, administrative, managerial, and supervisory jobs. PET was initially developed for use with state and local government jobs. PSI subsequently expanded its use to this same range of jobs within the private sector. 
These abilities have been well-established in the personnel selection validity literature demonstrating the importance for performance in professional and managerial occupations. The Data Interpretation subtest was designed to measure quantitative problem solving ability and reasoning ability. Reasoning measures verbal comprehension ability and reasoning ability in the form of syllogisms. The Quantitative Problem Solving subtest was designed to measure quantitative problem solving ability. Reading Comprehension was designed to measure verbal comprehension ability. Purpose of Use The Professional Employment Test (PET) was designed for use as a selection instrument appropriate from applicants applying for jobs that usually require a college education. It is suited for selection purposes in higher level occupations, more so than many other tests of GMA, because of its complex, work-related content and high reading level. The PET was also created with the intention of easy administration, scoring, and report generation. Targeted Populations The populations targeted by the PET are candidates who are working adults for entry-level, professional, administrative, and managerial, and supervisory occupations as potential professional employees in both public and private sector jobs. These occupations usually require a college-level education. To support this intended use, all items were written at a reading difficulty level of not greater than the fourth year of college (Grade 16), with item content representative of content commonly encountered in business and government work settings. Target Jobs/Occupations Normative information for the PET is presented in the technical manual for three major occupation groups; (a) Facilitative, (b) Research and Investigative, and (c) Technical and Administrative. A. Facilitative a. Jobs such as Buyer, Eligibility Worker, Human Service Worker, and Rehabilitation Counselor that determine the need for services or supplies and identify and apply the appropriate rules and procedures to see that needs are met. i. An example of a job task is obtaining necessary or required verification/authentication from records and from various sources. 82 B. Research and Investigative a. Jobs such as Research Statistician, Probation and Parole Specialist, Staff Development Specialist, Unemployment Insurance Claims Deputy, and Equal Employment Opportunity Coordinator that provide information to decision-makers through investigation and research methods. i. An example of a job task is procuring required documentation and information for new investigations. C. Technical and Administration a. Jobs such as Computer Programmer/Analyst, Prisoner Classification Officer, and Support Enforcement Officer that organize information and resolve problems through specific processes and methodologies. i. An example of a job task is coding new programs within desired time frames. Spread of Uses PET was designed to for use as a selection instrument. PSI has not supported its use for other uses such as development or career guidance. Administrative Details Table 45 describes the administrative features of the PET subtests. Table 45. Administrative detail for the PET. Administrative Detail Subtest # Items Time Data Interpretation 10 20 minutes Reasoning 10 20 minutes Quantitative Problem Solving 10 20 minutes Reading Comprehension 10 20 minutes Scoring Rule Delivery Methods An overall raw score is generated with a percentile score. Paper-pencil; proctored. Online (ATLAS™); proctored and unproctored. 
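The scoring rule summarized in Table 45 is simple number-correct scoring: 10 items per subtest, rolled up into an overall raw score that is reported along with a percentile. A minimal sketch of that hand-scoring step follows; the answer key and candidate responses are hypothetical placeholders, not actual PET content.

```python
# Sketch of number-correct scoring for the four 10-item PET subtests listed in Table 45.
# The answer key and responses below are hypothetical placeholders.
SUBTESTS = ["Data Interpretation", "Reasoning",
            "Quantitative Problem Solving", "Reading Comprehension"]

def score_pet(key: list[str], responses: list[str]) -> dict:
    """Return per-subtest and overall number-correct raw scores (10 items per subtest)."""
    correct = [k == r for k, r in zip(key, responses)]   # unattempted items simply score 0
    scores = {name: sum(correct[i * 10:(i + 1) * 10]) for i, name in enumerate(SUBTESTS)}
    scores["Overall"] = sum(correct)
    return scores

# Example with a hypothetical 40-item key; "-" marks an omitted item.
key = list("ABCDEABCDE" * 4)
responses = list("ABCDAABCDE" + "ABCDE-BCDE" + "AACDEABCDE" + "ABCDEABCD-")
print(score_pet(key, responses))
```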
Number and Types of Items The PET has two alternate forms, the PET-A and the PET-B, which have the same length. Both forms have four subtests (i.e., data interpretation, reasoning, quantitative problem solving, and reading comprehension), 10 items for each, for a total of 40 items. The short-form PET is half the length as the long-forms, containing only 20 items and therefore only five items for each of the four subtests. Each of the four subtests contains multiple choice questions which measure the three abilities described above. The item types were based on work by French, Ekstrom, and Price (1963) which established various abilities in psychometric literature such as verbal comprehension, quantitative problem solving, and reasoning. Time Limits The paper and pencil version of PET has an 80-minute time limit, not including the time for reviewing instructions. While there are four subtests with time limits, the time limits imposed on the individual subtests are arbitrary as the only real time limit is the overall limit of the four combined subtests; 80 minutes. This is enforced because of the omnibus format. The short-form of the PET contains 20 items with a 40 minute time limit. 83 If using PSI’s proprietary online administration platform, ATLAS™, the testing time is extended by 5 minutes. This is done so that the PET remains a power (non-speeded) test. Type of Score The PET has the ability to be scored three different ways; (a) computer automated scoring (using PSI’s ATLAS™ platform), (b) hand scoring using templates, and (c) on-site optical scanning/scoring. The PET raw score is the number of items answered correctly for each subtest. Each raw score determines a percentile score based on relevant PSI norm groups. Scores can also be banded, pass/fail, or used in ranking of test examinees. Test Scale and Methods of Scaling Raw scores are number correct scores for each subtest and are transformed into percentile scores for reporting and interpretation purposes based on the distributions of PET raw scores in relevant norm groups based on the three occupation groups, Facilitative, Research and Investigative, and Technical and Administrative). Items within the subtests are also ordered in ranking of difficulty such that the first item is easier than the last item. Method of Delivery The PET is available in both paper-pencil administration and online administration. For online use, the PET is administered through PSI’s own web-delivery platform, ATLAS™. The ATLAS™ webbased talent assessment management system is an enterprise web-based Software-as-a-Service (SaaS) solution for managing selection and assessment processes. This platform can function in a number of capacities (e.g., configuring tests, batteries, score reports, managing test inventory, manage and deploy proctored and unproctored tests, establish candidate workflow with or without applicant tracking systems, etc…). One of the important aspects of administering the PET using PSI’s own web-delivery platform, ATLAS™, is that it can be proctored online. Unproctored testing is unavailable for the PET in the U.S., but it is available internationally for unproctored online use. As such, PSI has developed an alternate form specifically for online unproctored use internationally. Costs Test Materials Cost Tests (pkg. 25; scoring sheets included) Long Form $445.00 Short Form $445.00 Manuals PET Technical Manual $350.00 *Other cost information was not available after extensive efforts to attainment. 
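As described under Test Scale and Methods of Scaling above, each raw score is transformed to a percentile based on the distribution of PET raw scores in the relevant occupational norm group. The sketch below illustrates that lookup; the norm scores are hypothetical, and the "percent at or below" convention is an assumption, since PSI's exact percentile definition is not documented in the materials reviewed.

```python
# Sketch of a raw-score-to-percentile transformation against an occupational norm group.
# The norm scores are hypothetical; the "percent at or below" rule is an assumed convention.
import numpy as np

norm_scores = np.array([12, 15, 17, 18, 20, 21, 23, 24, 26, 28,
                        29, 30, 31, 33, 34, 35, 36, 37, 38, 39])  # hypothetical norm group

def percentile_rank(raw: int, norms: np.ndarray) -> float:
    """Percent of the norm group scoring at or below the candidate's raw score."""
    return 100.0 * np.mean(norms <= raw)

print(percentile_rank(31, norm_scores))   # 65.0 for this hypothetical norm group
```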
Construct-Content Information Intended Constructs The types of subtests included in the PET are typical of subtests in other commercially available tests of cognitive ability. The PET is comprised of four subtests designed to measure three abilities. These subtests consist of Quantitative Problem Solving, Reading Comprehension, Reasoning, and Data Interpretation, which assess three major cognitive abilities (i.e., verbal comprehension, quantitative problem solving, and reasoning). The subtests are intended to measure abilities similar to those measured by subtests in other cognitive ability batteries but the items are written at a higher reading level and more complex content than most with the exception of Watson-Glaser and Verify subtests. 84 PSI designed PET to be more appropriate than other batteries for use in professional, administrative, and managerial jobs in state government. PSI conducted intensive interviews and nationwide surveys about testing practices and searched the literature for available tests that may have already been appropriate. The results were inconclusive and spurred PSI to compose their own test that would be specifically relevant to professional / managerial occupations within state government organizations. The types of subtests comprising the PET have been well established in psychometric literature (e.g., see French, 1951; Guilford, 1956; Nunnally, 1978) and were selected for use in this battery in view of substantial empirical research. These abilities have also been found to be important in professional occupations (e.g., see McKillip, Trattner, Corts, & Wing, 1977; Hunter, 1980). The abilities and their respective definitions are listed as follows: Verbal Comprehension is described by PSI as the ability to understand and interpret information that is complex and imparted through language. This also includes the ability to read and understand written materials. Quantitative Problem Solving is described by PSI as the ability to apply reasoning processes for solving mathematical problems. This may often be seen as a special form of general reasoning. Reasoning is described by PSI as the ability to analyze and evaluate information for making correct conclusions. Reasoning also includes inductive and deductive logic. Table 46 shows the abilities assessed by the PET. Table 46. PET subtests and the abilities the battery assesses. Ability Verbal Quantitative Problem Subtest Comprehension Solving Data Interpretation X Reasoning Quantitative Problem Solving Reading Comprehension Item Content X Reasoning X X X X X The PET utilizes four subtests and is intended to measure three abilities that have been deemed important in performing occupations such as professional, administrative, and managerial work. PET item content was developed to include work-like content, meaning that the items in the four subtests are reflective of content that may be similarly observed on the job in such occupations. All subtest item content verbal content, and all items were written at a reading difficulty level of not greater than the fourth year of college (grade 16), as measured by the SMOG index of readability (McLaughlin, 1969). Test items are based on language and situations that are commonly encountered in business and government work settings. Specific Item Content Examples The PET is comprised of four subtests: A. Data Interpretation a. Measure quantitative problem solving ability and reasoning ability. i. 
These items consist of numerical tables, where selected entries from each table have been deleted. The examinee deduces the value of the missing 85 entries using simple arithmetic operations. The examinee then selects the correct answer from five alternatives. ii. Sample item unavailable B. Reasoning a. Used in measuring verbal comprehension ability and reasoning ability, are in the form of syllogisms. i. The examinee is given a set of premises that are accepted as true and a conclusion statement. The examinee determines whether the conclusion is: 1) necessarily true; 2) probably, but not necessarily, true; 3) indeterminable; 4) probably, but not necessarily, false; or 5) necessarily false. The premises of the questions are hypothetical situations related to work in business, industry, and government. C. Quantitative Problem Solving a. Used in measuring quantitative problem solving ability. i. These items consist of word problems. The examinee identifies and applies appropriate mathematical procedures for solving the problem and selects the correct answer from five alternatives. The questions reflect the kinds of mathematical problems that might occur in business, industry, or government. D. Reading Comprehension a. Used in measuring verbal comprehension ability and reasoning ability. i. The examinee reads a paragraph and a multiple-choice question and selects the one of five alternatives that best reflects the concept(s) of the paragraph. The correct answer may be a restatement of the concept(s) or a simple inference. Reading passages were based on publications and materials dealing with common issues in society. 86 Construct Validity Evidence Construct Validity No construct validity evidence was reported in the technical manual or other available resources. There are no studies or other information reported in technical manuals or published sources that compare the subtests of the PET cognitive ability battery with other similar cognitive ability batteries. The publisher confirmed this as well. Item and Test Development Item Development Two hundred items were initially developed by five members of the PSI professional staff. The initial version of the test consisted of 54 inference, 48 quantitative reasoning, 47 reading comprehension, and 49 tabular completion (arithmetic such as addition and subtraction) items. The reading level was no higher than grade 16. A subcommittee of item writers reviewed and approved the items. This review was designed to ensure that items did not contain material that may offensive or unequally familiar to members of any racial, ethnic, or gender group. The four subtest items were based on the work of French, Ekstrom, and Price (1963) which established various abilities in psychometric literature such as verbal comprehension, quantitative problem solving, and reasoning. PSI also relied upon validity from research that evaluated these items types as predictors of professional job performance (O'Leary, 1977; O'Leary & Trattner, 1977; Trattner, Corts, van Rijn, & Outerbridge, 1977; Corts, Muldrow, & Outerbridge, 1977). After development of the items, they were pretested and subjected to statistical analyses to cast test forms to meet psychometric specifications. The test was divided into two booklets, each containing two item types (i.e., Booklet 1: Data Interpretation and Reading Comprehension; Booklet 2: Reasoning and Quantitative Problem Solving). 
The two forms were then administered to state employees in a variety of professional occupations according to standard instructions. An item analysis was then conducted. Examinees who failed to answer the last five items of a test, or two-thirds of the total number of items on a test, were excluded from the item analysis samples. Items that were not attempted were scored as incorrect. Pretest data were used to compute classical test theory (CTT) item statistics (a sketch of these screens appears after Table 48 below):

Item Difficulty
The target item difficulty was set at .60. Items that were very difficult (fewer than 20% of examinees answered correctly) and items that were very easy (more than 90% of examinees answered correctly) were removed from the PET.

Item Discrimination
Item-total point-biserial correlations were used to determine item discrimination. Items with the highest discrimination values were selected.

Distractor Effectiveness
Items with ineffective distractors were avoided (i.e., items for which a distractor had a positive distractor-total correlation or for which one or more distractors were never selected).

Item Bias
The Delta-plot method (Angoff, 1982) was used to evaluate the extent to which items may have been biased against Black test takers. Items that were disproportionately difficult for Black test takers were removed from the test.

Assessing Prediction Bias
PSI conducted a study of prediction bias on the PET using White and Black job performance in the Facilitative job category. No other occupations or groups were studied because of small sample sizes and insufficient statistical power to detect an effect. The Gulliksen and Wilks (1950) procedure was used to evaluate differences between the Black and White regression lines. Although intercepts were found to differ between Whites and Blacks, these differences resulted in over-prediction of Black job performance, a common result with cognitive predictors. Based on this result, PSI concluded that the PET was not biased against Black applicants. Table 47 provides the results from the study, which demonstrated a lack of prediction bias against Black applicants.

Table 47. Descriptive statistics and criterion correlations for PET scores for Blacks and Whites.

                            Blacks (N = 99)                         Whites (N = 115)
                         M      SD    r PET-A   r PET-B          M      SD    r PET-A   r PET-B
PET-A                   16.9    5.6                             23.7    5.6
PET-B                   17.3    4.8                             22.9    6.5
Supervisory Rating       3.9     .6     .22       .37            4.0     .7     .36       .23
Work Sample             15.2    2.5     .42       .23           16.9    2.8     .28       .27
Job Knowledge Test      25.5    4.6     .36       .39           29.5    4.4     .55       .47

Psychometric Model
There is no indication that PSI applied IRT-based item analyses to PET items during the development process. However, the interview with the PSI industrial-organizational psychologist who supports the PET indicated that PSI does estimate IRT parameters for PET items as the sample of PET test takers has increased.

Multiple Forms
To create two equivalent forms, PSI assigned closely matched items to each of the two forms with respect to difficulty and discrimination/reliability. Reliabilities for the two forms were then computed using the Kuder-Richardson (KR-20) formula. Results showed that the two forms, PET-A and PET-B, were equivalent in difficulty, variance, and reliability. The correlation between scores on the two forms was .85 (n = 582). When corrected for attenuation, the correlation is .99, supporting the conclusion that PET-A and PET-B are alternate forms of the same test. Table 48 reports the results of the equivalency study.

Table 48. Reliability of PET-A and PET-B.

Form   Number of Examinees   Mean   Standard Deviation   Reliability   Standard Error of the Estimate
A      582                   23.0   7.3                  .86           2.94
B      582                   23.2   7.2                  .86           2.95
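The item screens and the attenuation correction described in this subsection are straightforward to reproduce. The sketch below is a generic CTT illustration, not PSI's actual procedure; the response matrix and answer key are simulated placeholders, and only the final disattenuation step uses the values reported above (an alternate-form correlation of .85 and form reliabilities of .86).

```python
# Sketch of the CTT item screens (difficulty, point-biserial discrimination, distractor check)
# and the attenuation correction described above. All item data here are simulated placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_examinees, n_items, n_options = 200, 10, 5
key = rng.integers(0, n_options, size=n_items)                      # hypothetical answer key
ability = rng.random((n_examinees, 1))                              # hypothetical examinee ability
correct_draw = rng.random((n_examinees, n_items)) < (0.3 + 0.6 * ability)
guesses = rng.integers(0, n_options, size=(n_examinees, n_items))
responses = np.where(correct_draw, key, guesses)                    # hypothetical responses

scored = (responses == key).astype(float)                           # 1 = correct, 0 = incorrect
total = scored.sum(axis=1)

for i in range(n_items):
    p = scored[:, i].mean()                                         # item difficulty (proportion correct)
    rest = total - scored[:, i]                                     # rest score avoids part-whole inflation
    r_pb = np.corrcoef(scored[:, i], rest)[0, 1]                    # point-biserial discrimination
    flag_p = not (0.20 <= p <= 0.90)                                # PET screen: drop p < .20 or p > .90
    # Distractor check: a distractor chosen more often by higher scorers (positive r) is suspect.
    bad_distractor = any(
        np.corrcoef((responses[:, i] == k).astype(float), total)[0, 1] > 0
        for k in range(n_options) if k != key[i]
    )
    print(f"Item {i + 1}: p = {p:.2f}, r_pb = {r_pb:.2f}, "
          f"difficulty flag = {flag_p}, distractor flag = {bad_distractor}")

# Attenuation correction reported for the PET alternate forms:
r_ab, rel_a, rel_b = 0.85, 0.86, 0.86
print(f"Disattenuated alternate-form r = {r_ab / np.sqrt(rel_a * rel_b):.2f}")   # about .99
```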
Item Banking Approaches
As would be expected for a battery with only two fixed forms of each subtest, PET items do not appear to be managed as an item bank, although the PSI I-O psychologist reported that PSI's overall strategy is to accumulate items in an increasingly large bank as new items are developed for new forms. It is not known whether new forms of the PET have been developed.

Approach to Item / Test Security
Test Security
For all four of the PET subtests there are two forms, Form A and Form B. For online proctored administration in the U.S., PSI's ATLAS™ web-based talent assessment management system is available as an enterprise web-based Software-as-a-Service (SaaS) solution for managing selection and assessment processes. This platform can function in a number of capacities (e.g., configuring tests, batteries, and score reports; managing test inventory; managing and deploying proctored and unproctored tests; establishing candidate workflow with or without applicant tracking systems; etc.). According to the publisher (PSI), the ATLAS™ platform can present test items in a randomized order. PSI has a specific form intended for online unproctored use internationally. However, unproctored testing is forbidden by the publisher within the U.S., and PSI requires an agreement with clients that the PET be proctored and overseen by a test administrator to reduce cheating and the dissemination of test items and materials. Another security step taken by the publisher is supplying password protection for test takers.

Criterion Validity Evidence
Criterion Validity
Criterion validity studies were conducted for the PET in the three target job families. The observed validity coefficients were corrected for criterion unreliability and range restriction. Reliability estimates for the supervisory ratings and work samples were not available, so PSI used the work of Pearlman, Schmidt, and Hunter (1980) to estimate their reliability at .60. The KR-20 reliability of the job knowledge test was estimated to be .69. The validity coefficients in Table 49 provide significant support for the validity of both forms of the test, PET-A and PET-B. All correlations between the PET scores and the criterion measures are significant and positive, with corrected validity coefficients ranging from .25 to .84. Although not reported in the table, the weighted averages of corrected coefficients for PET-A and PET-B were .54 and .48, respectively. Overall, the average was .51, which is congruent with validity evidence for other tests of cognitive ability used in selection (Schmidt & Hunter, 1998). Table 47 above also provided predictive validity evidence for the White and Black norm groups. Because the PET over-predicted Black performance relative to Whites, it does not create a disadvantage for Black applicants.

Table 49.
Validity Coefficients for PET – Corrected (Uncorrected) Job Family Research & Facilitative Investigative Criterion PET-A PET-B PET-A PET-B Supervisory .54 (.31) .53 (.29) .40 (.27) .32 (.22) Ratings (N) (223) (181) Work Sample .61 (.42) .54 (.37) .44 (.30) .34 (.23) (N) Job Knowledge Test (N) Standardized Two-Criterion Composite (N) (224) .75 (.56) (196) Technical & Administrative PET-A PET-B .35 (.24) .25 (.17) (132) .84 (.59) .83 (.58) (137) .74 (.55) (224) (.48) (.44) (.38) (2220) (.31) (178) (.49) (.42) (123) The technical manual for the PET also reports that there may be strong transport validity evidence for the battery across other similar jobs. PSI note that US federal guidelines describe three requirements for transportability of validity results in other jobs: (a) there must be a criterion-related validity study meeting the standards of Section 14B; (b) there must be a test fairness study for relevant race, sex, and ethnic groups, if technically feasible; and (c) the incumbents of the job to which validity is to be transported must perform substantially the same major work behaviors as the incumbents of the job in which the criterion-related study was conducted. The technical manual reports that the first two requirements have already been met by the publisher, but the test user must establish job similarities. In order for the test user to establish the similarity of a job to any of the jobs in the criterion-related studies, the major work behaviors comprising that job must be identified. Resulting from such a comparison, it can then be determined whether the criterion validity evidence can actually be transported to other jobs. Translations / Adaptions No translations exist for the PET which is only available in English. User Support Resources Technical manual Fact sheet Note: PSI does not publicly display user support resources. Evaluative Reviews The Twelfth Mental Measurements Yearbook The reviewers of the PET provide positive feedback in regards to the reliability and validity evidence, as well as the test construction and supporting documents (Cizek & Stake, 1995). A few of the strengths highlighted are the validity evidence, bias prevention procedures, technical documentation, instructions for test administrators, and also its ability to be administered quickly and scored easily. It is a selection instrument that should be useful to government organizations that are selecting for professional, managerial, and administrative positions. However, there are also cautions of the PET. The reviewers point out that no construct validity or test-retest indices are reported. Also, the validation sample encompasses only three occupation 90 groups. The reviewers also raised questions about PET’s potential for bias against certain applicant groups. DESCRIPTIVE RESULTS FOR PPM Overall Description and Uses Introduction Hogrefe’s Power and Performance Measures (PPM) battery consists of nine subtests that may be administered and scored in any combination appropriate to the interests of the organization. Originally developed by J. Barratt in the late 1980’s, Hogrefe is now supports and markets PPM. These subtests are organized into four groups based on two characteristics. The first characteristic distinguishes between verbal and nonverbal content. The second characteristic distinguishes between power and performance content. This classification of subtests is as follows. 
Verbal Nonverbal Power Verbal Reasoning Numerical Reasoning Perceptual Reasoning Applied Power Performance Verbal Comprehension Numerical Computation Spatial Ability Mechanical Understanding Processing Speed Developed in the early 1990’s, the PPM battery was intended to represent a battery of cognitive ability subtests measuring a wide range of specific abilities and aptitudes. In addition to the common verbal v. nonverbal distinction, the unique feature of the PPM subtests is that they are organized around the distinction between power and performance measures. The PPM Technical Manual describes power subtests as designed to measure aptitude to learn, or reasoning, where the measure of the target aptitude has little dependence on previous learning or experience. Performance subtests are designed to measure ability to perform specific types of tasks, where the measure of the target performance ability depends somewhat more on previous learning and experience, although not so much as to be a test of knowledge. It should be noted that PPM’s power measures are very similar to measures of “fluid intelligence” described by the Cattell-Horn-Carroll (CHC) factor model of intelligence and the PPM performance measures are somewhat more similar to measures of “crystallized intelligence”. In addition, it should be noted that the PPM Technical Manual’s usage of the term “ability” to refer “what an individual can do now and may be able to do after experience and training” is not a widely accepted use of the term ability. More often, the terms aptitude and ability are interpreted as synonyms, both of which refer to a general ability to learn and perform. For example, the expression “general mental ability” is commonly used to refer to the general capacity to learn. Finally, it should be noted that the PPM Spatial Ability subtest measures an attribute that is more often considered a facet of “fluid intelligence”, particularly when measured using symbolic, nonverbal items as is the case in PPM’s Spatial Ability subtest. Purpose PPM was introduced to provide a battery of subtests that could be used in various combinations to inform personnel selection decisions and career counseling / skill development planning. PPM was intended to be used by organizations and individuals to inform personnel decisions. Notably, however, no item content in any subtest is specific to work activity. All subtest content is work neutral. PPM was intended to be easy to use and score, thus facilitating its use in organizations. 91 Target Populations The PM subtests are designed to be appropriate for working adults ranging from 18 tom 60 years. Although no target reading level is provided, normative data has been gathered from three samples of British adults, including 1,600 members of the general population, 337 professional managers, and various samples of working populations in a variety of job categories. Target Jobs / Occupations At the time the PPM battery was being developed research evidence had largely confirmed that cognitive ability measures were highly valid predictors of job performance across the full range of occupations in the workforce. (See, Schmidt & Hunter, 1998) The PPM battery was designed to be relevant to virtually all occupations that required a minimum of information processing complexity. The authors anticipated that tailored combinations of the PPM subtests could be assembled based on job information that would optimize the relevance of the PPM scores to performance in that target job. 
As examples, they described likely tailored combinations for the following job groups.

Engineering / Technical: Numerical Computation, Spatial Ability, Mechanical Understanding
Technological / Systems (IT): Applied Power, Numerical Reasoning, Perceptual Reasoning
High Level Reasoning: Applied Power, Verbal Reasoning, Perceptual Reasoning, Numerical Reasoning
General Clerical: Verbal Comprehension, Numerical Computation, Processing Speed
Sales and Marketing: Verbal Reasoning, Verbal Comprehension, Numerical Reasoning

The general applicability of PPM subtests across a wide range of occupations is facilitated by the decision to use only work-neutral content in all PPM subtests.

Spread of Uses
Hogrefe describes two primary applications for PPM within organizations – personnel selection and career counseling. Both applications are supported by a variety of norm tables that provide the PPM subtest score distributions for various groups of individuals, organized around job types or disciplines. The technical manual describes the following 10 disciplines: Science, "Professional" Engineering, Arts/Humanities, Numeracy, Engineering/Technology, Clerical, Business/Accountancy, Literary, Social Science, and Craft.

For each profile, the technical manual provides the average subtest scores achieved by incumbents on three different scales – subtest raw score, percentile score within the British working population, and a so-called "IQ" score within the British working population. The "IQ" score is a transformation of the raw score scale on which the average score is 100 and the standard deviation is 10. A percentile and IQ scale score based on the British working population is available for each subtest raw score and for certain combinations of subtests, such as verbal and nonverbal. Hogrefe does not appear to recommend cut scores for selection but advises user organizations to establish their own cut scores based on Hogrefe-supplied norm tables for common job groups (disciplines).

Hogrefe describes an approach to career counseling that relies on the similarity of an individual's PPM score profile to the score profile of the average incumbent in each of several relevant job groups (disciplines). In general, the more similar one's PPM score profile is to a job group profile, the stronger the recommendation that the individual would be a good fit with jobs within that group.

Administrative Details
Administrative detail is summarized in Table 50 and briefly described below; a sketch of the formula scoring shown in Table 50 follows the delivery notes.

Table 50. Administrative features of the PPM subtests.

Subtest                    # Items                      Time Limit                          Scoring Rule (Raw Score)
Applied Power              25                           12 minutes                          # Correct (both answers must be correct)
Mechanical Understanding   32                           8 minutes                           # Correct – 1/4 # wrong
Numerical Computation      40 (two 20-item sections)    6 minutes (3 minutes per section)   # Correct – 1/3 # wrong
Numerical Reasoning        25                           10 minutes                          # Correct – 1/4 # wrong
Perceptual Reasoning       26                           6 minutes                           # Correct – 1/3 # wrong
Processing Speed           50                           3 minutes                           # Correct – 1/3 # wrong
Spatial Ability            26                           6 minutes                           # Correct – 1/6 # wrong
Verbal Comprehension       60                           6 minutes                           # Correct – 1/3 # wrong
Verbal Reasoning           31                           10 minutes                          # Correct – 1/4 # wrong

Other Score Scales: Percentile and T-scores normed against the general British population, the working British population, and an industry sample.

Method of Delivery: PPM tests are delivered in paper-pencil format and in a computer-based format using the Hogrefe TestSystem. Both modes use fixed item sets within each subtest.
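The scoring rules in Table 50 are classical formula scores: number correct minus a fraction of the number wrong. The sketch below illustrates that computation. Truncating the fractional penalty follows the description of the correction later in this section; flooring negative totals at zero is an assumption, since the manual's exact convention for that case is not documented here.

```python
# Sketch of the formula-scored ("correction for guessing") raw scores shown in Table 50.
# Penalty fractions come from Table 50; truncating the penalty and flooring at zero are
# assumptions noted in the lead-in, not documented manual rules.
import math

PENALTY = {                      # fraction of wrong answers subtracted, per subtest
    "Applied Power": 0.0,        # scored as number correct only (both answers must be correct)
    "Mechanical Understanding": 1 / 4,
    "Numerical Computation": 1 / 3,
    "Numerical Reasoning": 1 / 4,
    "Perceptual Reasoning": 1 / 3,
    "Processing Speed": 1 / 3,
    "Spatial Ability": 1 / 6,
    "Verbal Comprehension": 1 / 3,
    "Verbal Reasoning": 1 / 4,
}

def ppm_raw_score(subtest: str, n_correct: int, n_wrong: int) -> int:
    """Number correct minus a fraction of number wrong; omitted items carry no penalty."""
    penalty = math.floor(PENALTY[subtest] * n_wrong)   # assumed: penalty truncated to an integer
    return max(0, n_correct - penalty)                 # assumed: floored at zero

# Example: 21 correct, 8 wrong, 2 omitted on Verbal Reasoning -> 21 - floor(8/4) = 19
print(ppm_raw_score("Verbal Reasoning", n_correct=21, n_wrong=8))
```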
Time Limits PPM subtests were designed to be relatively short measures for a variety of cognitive abilities. Except for the highly speeded Processing Speed test, which allows only 3 minutes, the time limits for the unspeeded subtests range from 6 minutes to 12 minutes. Among commonly used commercially available cognitive test batteries, these are among the shorter time limits for subtests designed to measure similar constructs Number and Types of Items Table 50 shows the number of items for each subtest, ranging from 25 to 31 items for the more complex power (reasoning) subtests and Spatial Ability to 32 to 60 items for the less complex performance subtests. None of the subtests include work-like content in the items. Four are heavily dependent on verbal content, whereas five are somewhat less dependent on verbal content. Only two, Numerical Computation and Perceptual Reasoning, are nonverbal. While reading level is not specified for any subtest, the level of reading content is described by Hogrefe as appropriate for all job levels. This is even true for the Verbal Reasoning subtest, which Hogrefe describes as deliberately kept “relatively low” although no precise definition is provided for “relatively low”. Type of Score Scores on all subtests are computed as “number correct” scores, with a traditional correction for guessing on all except Applied Power. This guessing correction reduces the number of correct answers by a fraction of the number of incorrect answers, where the fraction is rounded down to the nearest whole integer. This fraction is ¼ for Mechanical Understanding, Numerical Reasoning, and Verbal Reasoning, 1/6 for Spatial Ability, and 1/3 for all other subtests. 94 Test Scale and Methods of Scaling Two norm-based scores are provided in addition to the raw score, number correct adjusted for guessing. Percentile scores and T-scores are provided each based on up to three norm groups, the general British population. the working British population, and a relevant industry norm sample. Method of Delivery The PPM battery may be administered either on computer or in paper-pencil format. Based on interviews with a Hogrefe industrial psychologist, Dr. Nikita Mikhailov, time limits, numbers of items and types of items are the same for both delivery methods. Computer administration, however, reports additional information including the numbers of correct, incorrect and missing responses, the individual item responses as well as response latency information. Computer administration does not require additional items or an expanded item bank beyond the items contained in the fixed paperpencil forms. Further, computer administration is expected to be administered in a proctored environment, based on the responsibility of Level A users to protect the usage of PPM. Cost PPM materials required for administration are sold by the subtest. For each subtest administered in a paper-pencil mode, the following costs apply. Item Unit Cost PPM Manual £58.00 Specimen Kit £160.00 Single Subtest Booklet £8.00 Set of 25 Answer Sheets £76.00 Single Subtest Online £6.60 - £8.00, depending on volume Construct-Content Information Intended Constructs The types of subtests in PPM are typical of subtests in current commercially available cognitive ability batteries. PPM consists of two verbal, two quantitative, two abstract reasoning, two spatial/mechanical, and one speeded processing accuracy subtest. 
None of these subtests is substantially different from comparable subtests in similar cognitive ability batteries. PPM is somewhat distinctive, however, in that it includes a relatively high proportion of subtests that focus on reasoning abilities. (Of course, all three subtests of the Watson-Glaser II battery are considered to measure facets of reasoning.) These four subtests are the “power” subtests within the battery, in which the relevance of learned/experienced content is minimized. The PPM developer’s apparent frame of reference in the late 1980’s when developing PPM was the set of well established, commonly used cognitive batteries available at the time. The developer developed PPM to borrow from and improve on earlier commercially available cognitive batteries including the Differential Aptitude Battery (DAT), Employee Aptitude Survey (EAS) and Wechsler Intelligence Scales (WIS). The developer described the primary objectives for the development of the PPM as: 95 Provide a multidimensional battery of subtests measuring different types of aptitudes and abilities from which groups of subtests could be tailored to specific jobs; Short time limits and ease of administration; Simplicity of scoring and interpretation; Applicable for selection and career guidance uses. Barrett’s approach to the development of the subtests does not appear to be strongly influenced by any particular theoretical foundation. (This, in spite of the apparent similarity of the power (reasoning) subtests to CHC fluid intelligence and the performance tests to crystallized intelligence.) Rather, his approach appears to have depended on the assumption that similarity between specific job requirements, in terms of cognitive abilities, and specific test constructs was closely related to test validity. An important consideration in the development of PPM was that subtests be different from one another in ways that would presumably be differentially relevant to different jobs. The combination of subtests associated with target job families shown above does not appear to have been empirically developed. Rather, it appears to have been developed from assumptions / observations about cognitive ability requirements of different job families. For example, performance in Sales and Marketing jobs is presumed to be better predicted by a combination of Verbal Reasoning, Verbal Comprehension and Numerical Reasoning than any other combination. In contrast, performance in IT jobs (Technological/Systems jobs) is presumed to be better predicted by reasoning subtests, Applied Power, Numerical Reasoning and Perceptual Reasoning. No information could be located about PPM that provides a theoretical or empirical foundation for these presumed differences between job families. It is noteworthy to compare this perspective about job-specific validity to Schmidt’s (2012) recent description of a content-based validation rationale for tests of specific cognitive abilities. That content-based rationale requires that the job tasks and specific cognitive test tasks have sufficient observable similarity to conclude that the cognitive ability measured by the test will predict performance of the similar job tasks. But this content validity rationale is not a rationale for jobspecific validity. In other words, this rationale would not lead one to predict higher empirical predictive validities for cognitive tests with stronger content evidence than for cognitive tests with weaker content evidence. 
That is, stronger content evidence is a rationale for generalizing empirical validity results for cognitive tests to the target job. But weaker content evidence does not imply weaker validity, only that some other rationale, i.e., validity generalization, must be relied upon to generalize empirical results. Nevertheless, the PPM approach of tailoring cognitive composites to job families is very common among other well-regarded commercially available cognitive ability batteries. Because tailoring is not expected to harm validity, other reasons for preferring tailored composites may be appropriate. Applicant acceptance is likely enhanced when they “see” the relevance of a test to the sought job. To the extent the subtests depend significantly on already learned job knowledge, they may be better predictors of early performance than work neutral subtests. (Of course, if already learned job knowledge is an important consideration, job knowledge tests may be an effective supplement to or an alternative to cognitive ability tests.) Item Content in the Subtests Note, in the descriptions below, the sample items were developed by the first author of this Report because no practice or sample items could be located in publisher sources. These sample items were intended to be similar to actual items in structure, format and content without disclosing any actual item content. Applied Power was designed to measure non-verbal, abstract reasoning, focusing on logical and analytical reasoning. The items were designed to be appropriate for all job levels. Although no information is provided about the intended reading levels of instructions, verbal ability is likely to be a small factor in Applied Power because no item includes written verbal content. Every item consists of a sequence of letter-number pairs with the last two pairs missing. Two answers must be chosen to 96 identify the two missing letter-number pairs. The sample item shown below is very typical of all 25 items except that in many items the letter is sometimes in in upper case and sometimes lower case to provide more complex patterns within the sequence. Each item is a patterned sequence of groups of letters and numbers. Identify the two groups that would follow the sequence shown. X1 - Y2 - X1 - Y2 - X1 - Y2 - X1 1. A) X1 B) X2 C) Y1 D) Y2 2. A) X1 B) X2 C) Y1 D) Y2 Mechanical Understanding was designed to measure understanding of basic mechanical principles of dynamics, with a modest dependence on past related experience. Although no information is provided about the intended reading level of items, verbal ability is likely to be a moderately small factor in Mechanical Understanding scores. The stem of every item includes a picture depicting some object(s) in a mechanical context and a written question about that picture. Presumably, past experience with the depicted objects will influence scores on this subtest. The sample item shown below is typical of all items in that it displays a picture of an object (a beam) and a question about that object. However, the pictured objects vary considerably. A B C At which point is the beam most likely to balance? Numerical Computation was designed to measure the ability to quickly solve arithmetic problems. . Although no information is provided about the intended reading levels of instructions, verbal ability is likely to be a small factor in Numerical Computation because no item includes written verbal content. 
Every item stem consists of an arithmetic problem consisting of 2-4 integers or decimal numbers and one or more of four arithmetic operation symbols, +, -, x, and ÷. The answer options are integers or decimal numbers. The sample item shown below is very typical of all 40 items, with some less complex and some more complex. 22 X 5 – 2 A) 66 B) 108 C) 25 D) 37 Numerical Reasoning was designed to measure the ability to reason about the relationships between numbers. It was intended to be an indicator of critical and logical thinking. The items were designed to be appropriate for all job levels. Although no information is provided about the reading levels of the moderately complex instructions, verbal ability is likely to be a small factor in Numerical Reasoning because no item includes written verbal content. Every item consists of three pairs of two number values and a fourth answer alternative, “no odd pair”. Each pair of two numbers is presented as an “is to” logical relationship. For example, the pair, “100 : 50”, is intended to represent the logical relationship, “100 is to 50”. The task is to identify which one of the pairs, if any, represents a different logical relationship than the other two. If all three pairs represent the same logical relationship, then 97 the correct answer is “no odd pair”. The sample item shown below is very typical of the 25 items, except that some items contain fractions in the pairs of numbers. Find the pair that has a relationship different from the other two pairs. Is some cases, no pair has a different relationship (A) 5 :10 (B) 1 : 2 (C) 2 : 10 (D) No different pair Perceptual Reasoning is designed to measure the ability to reason about and find relationships between pictures of abstract shapes. This subtest was developed to provide an indicator of general intelligence and problem solving ability. The items were designed to be appropriate for all job levels. Although no information is provided about the reading levels of the moderately complex instructions, verbal ability is likely to be a small factor in Perceptual Reasoning because items contain very little, non-complex written verbal content. The stem of each item consists of 2 -4 abstract shapes which have some form of relationship to one another. The relationships include sequence, pattern, shape, and logic, among other possibilities. Embedded in the stem of most items is short text that asks a question about the target relationship. The alternatives in most items consist of four similar abstract shapes, among which one represents the correct answer to the question. For those few items in which pattern similarity is the target relationship, the question is “which one is the odd one out”. The sample item shown below represents this style of item and is similar to some of the abstract shapes used. However, a wide variety of abstract shapes are used among the 26 items. is to (A) (B) as is to ? (C) (D) Processing Speed is designed to measure the ability to quickly perform a low-moderate complexity mental task on a set of written content. More specifically, it is designed to measure the capacity to quickly analyze and alphabetize three similar words. The items were designed to be appropriate for all job levels. However, no information is provided about the reading levels of (a) the relatively simple instructions, or (b) the stimulus words themselves, which range from common, simple four-letter words to much less common, more complex words. 
While familiarity with the meaning of the words is not required to alphabetize the words, differences in familiarity and understanding of the meaning of the words potentially affects the speed with which this task may be performed. Each item consists of three words with the instruction to identify the word that comes first in alphabetical order. Two sample items are shown below. The words in the sample items below appear to the Report author to be approximately average in length, complexity and familiarity. For each item, you are shown three words. Identify the word that comes first in alphabetical order. 1. (A) James 2. (A) Wind (B) Jerard (C) Janet (B) Window (C) While Spatial Ability was designed to measure the ability to visualize and mentally manipulate objects in three dimensions. The items were designed to be appropriate for all job levels. Although no 98 information is provided about the reading levels of the complex instructions, verbal ability is likely to be a small factor in Spatial Ability because items contain no written verbal content. However, the instructions are relatively complex and may require more reading comprehension than other PPM subtests. The stem of each item consists of two squares with lines drawn inside each. For each item o the task is the same. Mentally, rotate Square X 90 to the left, turn Square Y upside down so its top side in on bottom, and superimpose rotated X onto upside down Y. The answer requires the test taker to identify the number of spaces displayed in the superimposed square. o Rotate Square X 90 to the left. Turn Square Y upside down so its top side is on the bottom. Then superimpose Square X onto Square Y.. How many spaces are formed by the resulting figure? X. Y. (A) 1 space (B) 2 spaces (C) 3 spaces (D) 4 spaces It should be noted that Spatial Ability is regarded as a Performance measure, which is considered an experience based ability to perform, and not a Power measure, which is considered a reasoning ability. However, the abstract item types that comprise Spatial Ability do not appear to be susceptible to experience and, instead require mental rotation of abstract, although familiar (squares), figural representations. This type of task is more frequently recognize as an assessment of spatial reasoning. Verbal Comprehension was designed to measure the ability to understand English vocabulary. While the level of word difficulty appears to vary considerably, the Report authors have not been able to locate any documentation of the reading / vocabulary levels of the words used in the items. Each item consists of two words. The test taker’s task is to determine whether the two words have similar meaning, opposite meaning, or unrelated meaning. The sample items below are very similar to the range of levels observed in the subtests Each item consists of two words. Choose A if they have similar meaning, B if they have opposite meaning, and C if there is no connection between their meanings. 1. Happy Glad A. Similar 2. Ecstatic B. Opposite C. No Connection B. Opposite C. No Connection Elastic A. Similar Verbal Reasoning was designed to measure comprehension of and reasoning about verbally expressed concepts. This subtest is intended to be an indicator of the ability to think critically and logically about verbal content. This subtest is intended to be appropriate for all job levels, although no information is provided about actual reading levels. 
Nevertheless, the Technical Manual indicates that the vocabulary level was deliberately kept “relatively low” to avoid too much influence of language achievement. As a Power subtest, Verbal Reasoning was intended to be an indicator of the ability to think critically and logically. The two sample items shown below are typical of those used in this subtest in that both are based on the same short reading passage. The 31 items in the subtest consist of 8 reading passages, each with 3-5 items. Overall, the PPM developers clearly sought to develop work neutral items for all subtests, with a mix of verbal and non-verbal content. While no information was presented about reading levels of the subtests, the developers intended to develop item content that would be appropriate for all job levels. It is notable that item tasks for some subtests were novel in the Report author’s judgment (see, in 99 particular, Perceptual Reasoning and Spatial Ability) and may have resulted in relatively complex instructions. James’ 4-yr old daughter Sally has a close friend and classmate, Joyce. uncle, Samuel, who was a good friend of James’ father. Joyce’s father Bill has an 1. Who is most likely to be the oldest person in the passage? (A) James (B) Bill (C) Samuel 2. Who is Sally least likely to know? (A) Bill (B) James (C) Samuel Construct Validity Evidence The publisher reported correlations between selected PPM subtests and subtests included in three other commercially available cognitive ability batteries – Employee Aptitude Survey (EAS), SHL tests, and NIIP tests. These construct validity correlations are shown in Table 51. Although the PPM Technical Manual does not explain why certain correlations were reported and others were not, the Report authors identified the EAS, SHL and NIIP subtests that appear to have similar item content and underlying constructs to corresponding PPM subtests. These correlations are regarded as estimates of convergent validities and are shown in the circles. Their average value is .60, which is not uncommon for subtests measuring similar constructs where the subtest reliabilities are in the upper .70s-.80s. The highest of these convergent validities are associated with pairs of subtests with presumably very high construct and content similarity. 100 Table 51. Correlations between PPM subtests and other subtests regarded as measuring similar constructs PPM Subtest Verbal Reasoning Verbal Comprehensi on. Space Visualization Spatial Ability 140158 Processing Speed Visual Speed Perceptual Reasoning NIIP 152 .44 Numerical Reasoning SHL Numerical Reasoning Numerical Computation EAS 144153 Mechanical Understand. Subtest Applied Power N Study .60 .74 .42 .53 .61 .68 20 Numerical 140 Visual Pursuit .34 140 Symbolic Reasoning .49 131 Word Knowledge 201 Verbal Reasoning 97107 Number Series .29 41182 Numerical Reasoning .39 .60 .26 100182 Verbal Reasoning .23 .29 .44 47, 157* Bennett Mechanical 47 Arithmetic 47 Spatial Test 47 Verbal .76 .57 .46 .49 .54 .55 .35 .75 .61 Subtests pairs which do not appear to the Report authors to have high construct or content similarity are not circled. These probably should not be regarded as divergent validities because many were designed to measure similar constructs but with somewhat different types of item content. Nevertheless, it is reasonable to expect these uncircled correlations to be lower on average than the circled convergent correlations. They have an average value of .36. 
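The comparison just described, an average of .60 for the circled convergent pairs versus .36 for the remaining pairs, is a simple summary that can be reproduced from any correlation table. The sketch below illustrates the computation with generic pair labels and illustrative values rather than the actual Table 51 entries.

```python
# Sketch of the convergent-evidence summary described above: average the correlations for
# subtest pairs judged to measure similar constructs and compare them with the remaining pairs.
# The pair labels and values below are illustrative placeholders, not the Table 51 entries.
correlations = {
    ("Numerical Computation", "Battery X: Arithmetic"): 0.74,
    ("Verbal Comprehension", "Battery X: Word Knowledge"): 0.68,
    ("Spatial Ability", "Battery X: Space Visualization"): 0.61,
    ("Applied Power", "Battery X: Symbolic Reasoning"): 0.44,
    ("Numerical Computation", "Battery X: Word Knowledge"): 0.35,
    ("Verbal Comprehension", "Battery X: Space Visualization"): 0.29,
}
convergent_pairs = {
    ("Numerical Computation", "Battery X: Arithmetic"),
    ("Verbal Comprehension", "Battery X: Word Knowledge"),
    ("Spatial Ability", "Battery X: Space Visualization"),
    ("Applied Power", "Battery X: Symbolic Reasoning"),
}

convergent = [r for pair, r in correlations.items() if pair in convergent_pairs]
other = [r for pair, r in correlations.items() if pair not in convergent_pairs]
print(f"Mean convergent r     = {sum(convergent) / len(convergent):.2f}")
print(f"Mean non-convergent r = {sum(other) / len(other):.2f}")
```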
Overall, these construct correlations appear to provide moderate to strong construct evidence for Numerical Computation, Numerical Reasoning, Processing Speed, Spatial Ability, and Verbal Comprehension. Applied Power, in particular, produced lower correlations than would be expected with other reasoning measures, even though none of those reasoning measures used the same item content as Applied Power. No factor analyses of PPM subtests have been located, and the Hogrefe psychologist is not aware of any such studies of the PPM factor structure.
Item / Exam Development
The technical manual and other sources provided very little technical/psychometric information about the processes used to develop the items and subtests. No information was provided about the item writing processes, including item specifications provided to item writers, review processes that might have been used to screen for culturally salient, irrelevant content, pilot studies, item analyses, psychometric standards for item acceptance/rejection, or the development of alternate, equivalent forms. It appears clear that no IRT-based psychometric methods were applied. The Hogrefe psychologist confirmed that no documentation is available about Barratt's original item development process and that new items have not been developed since then.
During development, each subtest was administered to various samples of incumbent employees and applicants. Although we are not certain, it appears these samples comprise the British working population norm groups used to determine one of the T-score scales. Table 52 shows the characteristics of these development norm group samples. The Age and Sex samples shown in Table 52 provided the group statistics reported in Table 54 below.
Table 52. Demographics and statistics for the item development norm groups.
PPM Subtest (Mean; SD; N): Description of norm group; age information; sex breakdown
Applied Power (11.82; 4.39; 384): Clerical workers in large financial institutions; 17 yrs – 37 yrs, Mean = 22.26 yrs, N = 374; Female = 71, Male = 313
Processing Speed (19.91; 6.63; 555): Clerical workers in large financial institutions; no age information; Female = 425, Male = 99
Mechanical Understanding (15.09; 5.78; 718): Engineers, engineering tradespeople, factory workers, manufacturing applicants, pilot applicants; 17 yrs – 60 yrs, Mean = 26.83 yrs, N = 616; Female = 49, Male = 659
Numerical Computation (13.80; 6.27; 793): Engineers, engineering tradespeople, factory workers, manufacturing applicants, pilot applicants; 17 yrs – 60 yrs, Mean = 27.32 yrs, N = 669; Female = 57, Male = 724
Numerical Reasoning (10.89; 4.98; 368): Engineers, engineering tradespeople (fitters, electricians), managers, pilot and manufacturing applicants, factory workers; 17 yrs – 54 yrs, Mean = 25.38 yrs, N = 255; Female = 90, Male = 262
Perceptual Reasoning (11.32; 3.65; 283): Engineering tradespeople, factory workers, manufacturing applicants; 17 yrs – 54 yrs, Mean = 26.71 yrs, N = 159; Female = 20, Male = 247
Spatial Ability (9.27; 4.15; 590): Engineers, engineering tradespeople (fitters, electricians), pilot applicants, creatives; 16 yrs – 59 yrs, Mean = 25.66 yrs, N = 495; Female = 32, Male = 558
Verbal Comprehension (21.20; 10.79; 162): Managers, engineering tradespeople; 18 yrs – 60 yrs, Mean = 26.25 yrs, N = 134; Female = 13, Male = 137
Verbal Reasoning (13.97; 5.48; 464): Engineering tradespeople, factory workers, manufacturing applicants, managers; 20 yrs – 58 yrs, Mean = 29.33 yrs, N = 332; Female = 77, Male = 371
Table 53 shows results for subtest reliability and correlations among subtests.
It is not clear what sample(s) were used for the reliability analyses. It seems most likely that the norm group samples described above in Table 52 are the samples that produced these reliability estimates. The correlations among all subtests were generated in a separate sample of 337 professional managers. The PPM Technical Manual describes all reliability estimates as alpha estimates of internal consistency reliability. It should be noted that alpha estimates of reliability are positively biased for speeded tests, such as Processing Speed. As a result, the .89 estimated reliability for Processing Speed is very likely an overestimate by some amount.
Two patterns of results in the subtest correlations stand out. First, except for Spatial Ability and Processing Speed, the correlations of other subtests with Mechanical Understanding are lower than would be expected. Often tests of mechanical aptitude/reasoning are closely associated with general mental ability and tend to have somewhat higher correlations with other subtests. Second, in contrast, the subtest correlations with Processing Speed appear to be somewhat higher than expected. Often in batteries of cognitive subtests, the highly speeded subtests correlate least with the other subtests. It is possible that the PPM Processing Speed subtest is more heavily loaded on verbal reasoning because of the requirement in each item that test takers determine the alphabetical order of often complex words.
Table 53. Item development statistics – reliability and correlations among subtests (sample of 337 professional managers).
Subtest: Reliability (Alpha); Mean; SD; correlations with AP, PS, MU, NC, NR, PR, SA, VC
Applied Power (AP): .88; 13.07; 4.69
Processing Speed (PS): .89*; 28.84; 7.43; AP .44
Mechanical Understanding (MU): .79; 18.39; 5.76; AP .36, PS .20
Numerical Computation (NC): .86; 21.16; 6.69; AP .36, PS .46, MU .14
Numerical Reasoning (NR): .83; 12.88; 4.37; AP .47, PS .48, MU .27, NC .59
Perceptual Reasoning (PR): .74; 11.46; 3.95; AP .46, PS .44, MU .32, NC .40, NR .51
Spatial Ability (SA): .79; 12.57; 3.23; AP .49, PS .34, MU .45, NC .30, NR .51, PR .43
Verbal Comprehension (VC): .84; 40.57; 9.09; AP .25, PS .28, MU .33, NC .24, NR .25, PR .15, SA .25
Verbal Reasoning (VR): .83; 17.12; 4.44; AP .46, PS .53, MU .22, NC .37, NR .43, PR .45, SA .36, VC .34
* Note: Alpha estimates of reliability are known to be overestimates with highly speeded tests.
Table 54 shows sex and age group differences on each PPM subtest. Because neither the Technical Manual nor other sources provided any information about item development methods to minimize biasing factors, it is not clear whether the group differences reported in Table 54 reflect bias-free or biased estimates of valid group differences on the subtest constructs. Setting aside this uncertainty, it is notable that female test takers scored higher than male test takers on six of the nine subtests, particularly on Spatial Ability. A very common result is that males score higher than females on spatial/mechanical aptitude tests. Perhaps the highly abstract feature of the PPM Spatial Ability item content produces scores that are much less affected by previous spatial/mechanical experience, which is regarded as an important factor in the more typical result favoring males. Another possible explanation is that the norm groups for all subtests except Applied Power and Processing Speed consisted in some part of engineering, manufacturing, and factory workers. The women in these worker samples may not be representative of the British working population with regard to experience and ability in the spatial/mechanical domains.
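To help interpret differences of the kind shown in Table 54, a standardized mean difference (d) can be computed from the group means, standard deviations and sample sizes. The short Python sketch below is illustrative only and is not an analysis reported by the publisher; it uses the pooled within-group SD, whereas the Verify tables later in this report divide by the overall SD of test scores.

def cohens_d(mean_ref, sd_ref, n_ref, mean_tgt, sd_tgt, n_tgt):
    """Standardized mean difference: (reference mean - target mean) / pooled SD."""
    pooled_var = ((n_ref - 1) * sd_ref ** 2 + (n_tgt - 1) * sd_tgt ** 2) / (n_ref + n_tgt - 2)
    return (mean_ref - mean_tgt) / pooled_var ** 0.5

# Male vs. female Verbal Reasoning values taken from Table 54 below
print(round(cohens_d(13.15, 5.20, 371, 18.45, 4.95, 77), 2))  # about -1.03, i.e., females score higher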
Table 54. Group means, standard deviations and sample sizes.
Subtest: Male (Mean, SD, N); Female (Mean, SD, N); Younger than 22-24 yrs (Mean, SD, N); 22-24 yrs and older (Mean, SD, N)
Applied Power: Male 11.60, 4.29, 313; Female 12.79, 4.72, 71; Younger 12.26, 4.51, 214; Older 11.33, 4.30, 160
Processing Speed: Male 17.38, 7.13, 99; Female 20.54, 6.41, 425; no age-group information
Mechanical Understanding: Male 15.32, 5.74, 659; Female 12.14, 5.82, 49; Younger 16.43, 5.45, 327; Older 14.49, 5.81, 289
Numerical Computation: Male 13.87, 6.36, 724; Female 13.01, 5.52, 57; Younger 14.64, 5.81, 328; Older 13.75, 6.48, 341
Numerical Reasoning: Male 10.76, 5.16, 262; Female 11.83, 4.33, 90; Younger 13.79, 4.48, 122; Older 10.75, 4.83, 133
Perceptual Reasoning: Male 11.28, 3.59, 247; Female 12.80, 3.64, 20; Younger 12.78, 3.41, 82; Older 11.29, 3.71, 77
Spatial Ability: Male 9.21, 4.15, 558; Female 10.19, 4.08, 32; Younger 10.41, 3.72, 270; Older 8.15, 4.40, 225
Verbal Comprehension: Male 20.36, 0.71, 137; Female 22.15, 7.62, 13; Younger 17.39, 6.26, 74; Older 24.08, 12.40, 60
Verbal Reasoning: Male 13.15, 5.20, 371; Female 18.45, 4.95, 77; Younger 17.88, 4.71, 170; Older 11.57, 4.75, 162
Note: Age correlated with score, r = -.16 (N = 225).
No evidence about alternate forms was presented in the PPM Technical Manual or could be located elsewhere. No information was presented about the manner in which alternate forms were developed or about the psychometric properties that demonstrate their equivalence. The Hogrefe psychologist confirmed that IRT-based psychometrics have not been used to evaluate PPM items and exams. The Hogrefe psychologist confirmed that the PPM approach relies on a fixed, non-overlapping item set for each alternate form regardless of method of administration. PPM is not supported by an item bank other than the items that are included in the fixed forms of the current subtests.
Criterion Validity Evidence
No criterion validity evidence has been located for PPM subtests in publisher-provided materials, in any other publicly available sources, or in the one professional review found in Buros. Further, the Hogrefe psychologist is not aware of any criterion validity studies or reports for PPM.
Approach to Item / Test Security
This information was discussed at some length with the Hogrefe psychologist responsible for supporting PPM. Based on information he shared, PPM is a low volume battery used primarily in the UK. While it is administered in both paper-pencil and online modes, it consists only of the two original forms (we have found documentation for only one form). Hogrefe requires proctored administration in both paper-pencil and online modes of delivery. Hogrefe relies entirely on proctoring and administrator training and certification to ensure the security of individual items and whole subtests. Hogrefe has no process for monitoring item psychometric characteristics to evaluate possible drift due to over-exposure, fraudulent disclosure or other patterns of cheating. Hogrefe employs no data forensic methods for reviewing item responses to detect indicators of cheating or attempts to fraudulently obtain item content. In short, Hogrefe falls considerably short of professional practices for the protection of PPM items and test content from cheating.
Translations / Adaptations
PPM is available only in UK English and Hogrefe has had no experience translating PPM items or supporting material into other languages.
User Support Resources (E.g., Guides, Manuals, Reports, Services)
Hogrefe provides only the bare minimum of support resources to enable users to administer and score PPM subtests. Other than the test materials themselves (answer sheets, test booklets and results reports), the only available support resources are the PPM Technical Manual (including administration instructions) and sample reports. Online administration may be arranged at tara.vitapowered.com.
No other support materials could be identified by the authors or by the Hogrefe industrial psychologist who supports the PPM battery.
Evaluations of PPM
Only one professional review of PPM has been located. Geisinger (2005) reviewed PPM in Buros' Sixteenth Mental Measurements Yearbook. His review appears to be based on an earlier technical manual than was available for this project. However, in our judgment virtually all of his evaluations apply equally to the technical manual reviewed for this project, except for the criticism that no group difference data were presented. The manual reviewed for this project provides male-female differences that apparently were not reported in the manual Geisinger reviewed. He criticized PPM for the lack of predictive validity evidence, for reporting (positively biased) alpha coefficients for speeded subtests, and for a lack of data about group differences. Overall, Geisinger was critical of the lack of technical detail relating to item and subtest development and of the lack of criterion validity evidence.
We have located no evaluative empirical analyses of PPM's factor structure or other psychometric properties in published sources or from Hogrefe. Hogrefe's psychologist indicated that he was unaware of any such empirical analyses. Overall, the available documentation about PPM is not adequate to evaluate its criterion validity or the item development processes intended to ensure item quality and lack of measurement bias with respect to culture or other group differences. (It should be noted that the nonverbal / abstract content in the reasoning subtests in particular eliminates most sources of group-related measurement bias.) At the same time, typical reliability levels and adequate convergent correlations with other established subtests in other batteries provide some support for the psychometric and measurement adequacy of PPM. Our evaluation of the PPM battery is that the subtests themselves appear to be well-developed measures with adequate reliability and convergent validities with other tests. Nevertheless, the publisher's support of PPM in terms of validity studies and documentation, as well as security protection of the fixed forms, falls well below professional standards for validity and protection of test content. Two interviews with Hogrefe's industrial psychologist who has recently taken on support of PPM, Dr. Nikita Mikhailov, have confirmed this lack of support documentation. The most likely explanation, in Dr. Mikhailov's opinion, is that Hogrefe's recent acquisition of PPM's previous publisher failed to transition any available support material to Hogrefe.
Descriptive Results for Verify
SHL's Verify Battery
Authors' Note: The information provided in this section is taken from the 2007 Technical Manual for Verify and other available sources. In 2007, Verify consisted of three subtests, Verbal Reasoning, Numerical Reasoning and Inductive Reasoning, and their Verification counterparts. After 2007, SHL added four subtests to the Verify battery, Mechanical Comprehension, Calculation, Checking, and Reading Comprehension, and their Verification counterparts. At the time this Final Report was being finalized, it was learned that two additional subtests have just been added, Spatial Ability and Deductive Reasoning.
However, in the interest of protecting their intellectual property, SHL declined to approve our access to any further publisher-provided information about Verify and its additional subtests beyond the 2007 Technical Manual and publicly available documents. No SHL documents or other published documents could be located that provide samples of item types for the two most recent subtests or validity information beyond what was reported in the 2007 Technical Manual.
Overall Description and Uses
Introduction
SHL developed the Verify battery of subtests in stages, beginning with three subtests of Numerical Reasoning, Verbal Reasoning and Inductive Reasoning, implemented by 2007. Four additional tests were incorporated into Verify before 2013: Mechanical Comprehension, Checking, Calculation and Reading Comprehension. The Verify battery is designed to be administered online in unproctored settings. The validity and security of these unproctored tests are supported by an overall strategy with several key components. Each test taker is given a randomized form of each subtest, where the randomized form is constructed from a large bank of items, each of which has IRT estimates and other characteristics that govern the construction process to ensure all randomized forms are psychometrically equivalent. In addition, SHL recommends that any applicant who reaches a final stage of consideration should complete a somewhat shorter verification battery of subtests, also randomized forms, to verify that the first test result was not fraudulently achieved. The most distinctive feature of the Verify battery is the comprehensiveness of the security measures taken to enable confidence in scores from unproctored, online administration. SHL's Verify clearly represents the best practice in employment testing for the use of unproctored online testing.
Purpose
SHL developed Verify to provide a professionally acceptable unproctored online assessment of cognitive abilities used for personnel selection purposes. Developed in stages, Verify now represents a broad range of subtest types which are appropriate for a wide range of occupations and a wide range of work levels. Verify was among the first widely available, commercial cognitive batteries to include a separate, proctored verification testing capability designed to evaluate whether there is any risk that the first, operational score was achieved fraudulently. Verify was developed, in part, to optimize the convenience of the test taking process for both applicants and employers by enabling it to be administered to applicants anywhere, anytime. A unique aspect of Verify is that the IRT-based construction rules enable SHL to control the "level" (i.e., difficulty) of each randomized set of items. This allows SHL to tailor the level of the Verify battery to any of six different levels associated with different roles.
Target Populations
Verify is appropriate for a wide range of worker populations because of its ability to tailor the difficulty level of randomized items to the particular job/role. As a result, there is no one target reading level for Verify items. The item content of Verify subtests is tailored to nine different levels of jobs: (1) Manager / Professional, (2) Graduate, (3) Skilled Technology, (4) Junior Manager, (5) Senior Customer Contact, (6) Skilled Technical, (7) Junior Customer Contact, (8) Administrator, and (9) Semi-Skilled Technical.
Levels 13 comprise the Managerial/Graduate job group; Levels 4-6 comprise the Supervisory / Skilled Technical job group; and Levels 7-9 comprise the Operational / Semi-Skilled Technical job group. Target Jobs / Occupations SHL describes the range of appropriate job families as “from operational roles to senior management”. The different abilities assessed by Verify subtests and the different levels of item difficulty and content enable Verify to be appropriate for a wide range of jobs. These include jobs in the following industry sectors. Banking, Finance, & Professional Services Retail, Hospitality, & Leisure Engineering, Science & Technology Public Sector/Government General population. In addition to occupation groups, each Verify subtest is linked to one or more of the following levels of work as shown in Table 55. (Listed from highest to lowest level.) Table 55. Use of Verify subtests at different job levels. Verify Subtest* Job Level Verbal Reasoning Numerical Reasoning Inductive Reasoning Director, Senior Manager X X X Manager, Professional, Graduate X X X X Junior Manager, Supervisor X X X X Sales, Customer Service, Call Center Staff X X Information Technology Staff X X Administrative and Clerical Staff X X Technical Staff X X Semi-Skilled Staff Mechanical Comprehension X X X Checking Calculation X X X X X X *Reading Comprehension, Spatial Ability and Deductive Reasoning were not included in Verify at the time SHL published this alignment It should be noted here that the association of a subtest with a specific job level means that SHL has specified a level of item difficulty in terms of IRT theta range for items to be included in randomized forms for the particular job level. This implies that randomized forms are constructed based on several conditions, including the job level for which the applicant is applying. That is, the item content for any subtest differs depending on job level. Spread of Uses 107 Verify appears to be used only for personnel selection. There is no indication in any SHL support materials that Verify is used for other purposes such as career development. Certainly, SHL’s significant investment in supporting the unproctored mode of administration attempts to protect against fraudulent responding, which is likely to be relevant only for high stakes applications such as employment. Administrative Details Table 56 reports several administrative features of the Verify subtests. For each subtest and its verification counterpart, the number of items and administration time limit are reported as well as the scoring rules and mode of delivery. Time Limit The time limits for the Verbal, Numerical and Inductive reasoning subtests are among the longer time limits for similar subtests among the reviewed batteries. Time limits for the remaining tests are typical of similar subtests in other batteries. The time limits vary somewhat for the Verbal and Numerical subtests because higher level versions appropriate for higher level jobs/roles require slightly more time than lower level versions. SHL manipulates “level” of test content by shifting the difficulty levels of items included in the subtests to align with the level of role complexity. Numbers and Types of Items Overall, the numbers of items are typical of similar subtests in other batteries. Item content in the Numerical and Verbal Reasoning subtests is somewhat more work-like than the large majority of other subtests among the reviewed batteries. 
In contrast, the Inductive Reasoning item content is abstract and not at all work-like.
Table 56. Administrative features of Verify subtests.
Subtest: # Items; Time Limit
Verbal Reasoning: 30; 17-19 minutes*
Verbal Reasoning Verification: 18; 11 minutes
Numerical Reasoning: 18; 17-25 minutes*
Numerical Reasoning Verification: 10; 14-15 minutes*
Inductive Reasoning: 24; 25 minutes
Inductive Reasoning Verification: 7; 7 minutes
Mechanical Comprehension: 15; 10 minutes
Mechanical Comprehension Verification: 15; 10 minutes
Checking: 25; 4-5 minutes
Checking Verification: 25; 4-5 minutes
Calculation: 20; 10 minutes
Calculation Verification: 10; 5 minutes
Reading Comprehension: 18; 10 minutes
Reading Comprehension Verification: Not Available; 10 minutes
Spatial Ability: 22; 15 minutes
Deductive Reasoning: 20; 18 minutes
Scoring Rule: For unproctored tests, IRT-based ability estimates are converted to percentiles, standardized sten scores, and T scores based on various norm groups. For Verification tests, no scores are reported; only Verified or Not Verified is reported. The determination of Verified or Not Verified is based on a comparison of the two IRT ability estimates.
Delivery Mode: "Randomized" individual forms; online; unproctored. Verification tests must be administered in a proctored setting, online, with "randomized" individual forms.
Scoring
Verify produces test scores by estimating the ability level (theta) for each test taker based on their item responses. The 2007 Verify Technical Manual provides the following description of this theta-based scoring process.
Each test taker's theta estimate is obtained through an iterative process which essentially operates, for 2-parameter models, as follows:
A set of items for which a and b values are known are administered to the candidate.
The candidate's right and wrong responses to the items are obtained.
An initial estimate of the candidate's θ is chosen (there are various procedures for making this choice).
Based on the initial θ used and knowledge of each item's properties, the expected probability of getting the item correct is calculated.
The difference between the candidate answering an item correctly and the probability expected of the candidate answering the item correctly, given the initial theta value, is calculated.
The sum of these differences across items is standardised, and this standardised difference is added to the initial θ estimate (negative differences reducing the estimated theta and positive differences increasing it).
If the differences are non-trivial, then the new θ estimate obtained from the previous step is used to start the above cycle again. This process is repeated until the difference between the value of θ at the start of a cycle and the value obtained at the end of a cycle is negligible. (Pg. 13)
See Baker (2001) for a more detailed account of theta scoring with worked examples. This approach to scoring is suited to the randomised testing approach where candidates receive different combinations of items. As all the items are calibrated to the same metric, this process also allows scores on different combinations of items to be directly compared and treated as coming from the same underlying distribution of ability scores. For reporting purposes, these theta estimates are converted to percentile scores, sten scores and T-scores based on selected norm groups.
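To make the quoted outline concrete, the short Python sketch below estimates a single test taker's theta under a 2-parameter logistic model. It uses a standard Newton-Raphson maximum-likelihood update, which follows the same cycle of comparing observed and expected correctness and adjusting theta until the change is negligible; it is not SHL's own implementation, and the item parameters and response pattern are invented for illustration.

import math

def p_correct(theta, a, b):
    """2-parameter logistic probability of answering an item correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_theta(responses, a_params, b_params, theta=0.0, tol=1e-4, max_iter=50):
    """Iteratively estimate theta from scored responses (1 = correct, 0 = incorrect)."""
    for _ in range(max_iter):
        probs = [p_correct(theta, a, b) for a, b in zip(a_params, b_params)]
        # Observed-minus-expected residuals, weighted by item discrimination
        residual = sum(a * (u - p) for a, u, p in zip(a_params, responses, probs))
        # Test information at the current theta estimate
        information = sum(a * a * p * (1 - p) for a, p in zip(a_params, probs))
        step = residual / information
        theta += step
        if abs(step) < tol:
            break
    return theta

# Invented item parameters (a = discrimination, b = difficulty) and one response pattern
a_params = [1.2, 0.8, 1.5, 1.0, 0.9]
b_params = [-1.0, -0.3, 0.2, 0.8, 1.5]
responses = [1, 1, 1, 0, 0]
print(round(estimate_theta(responses, a_params, b_params), 2))

In operational use, the resulting theta would then be converted to percentile, sten or T-score values against the selected comparison group.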
SHL’s research based supporting the Verify subtests has established over 70 comparison (norm) groups represented by combinations of industry sectors and job levels and subtests. Delivery Mode: Operational and Verification Subtests The operational and verification subtests are both delivered online using randomized forms. However, operational tests are delivered in unproctored settings while the verification subtests must be delivered in proctored settings with identification confirmation in order to minimize the likelihood of fraudulent verification scores. Also, in most cases, but not all, the verification version of a subtest has fewer items and a shorter time limit than the operational version of the subtest. (Note, SHL’s approach to the use of the operational subtests and their verification counterparts is that the operational test results are considered to be the applicants’ formal test result that informs the hiring decision. Verification test results are used only to identify the likelihood that the operational results were obtained fraudulently. This policy is necessary because only some applicants take the verification tests. If the policy were to combine the verification results with the operational results or to replace the operational results with the verification results, fraudulent test takers would be rewarded, in effect, for their fraudulent performance.) Cost Note, because SHL declined to approve the authors’ registration for access to SHL information, we were unable to gather information about Verify costs. Construct-Content Information Intended Constructs Table 57 provides SHL’s descriptions of the constructs intended to be measured by each Verify subtest. Notably, with the exception of Inductive Reasoning which it describes as a measure of fluid intelligence, SHL has not defined any of these constructs in the language of a particular theory of ability such as the CHC model of intelligence or the French-Harmon Factor Model of Cognitive Ability. Rather, as is typical of all reviewed batteries except Watson-Glaser, the ability constructs, 110 presumably, are taken from the accumulated validity evidence in personnel selection research showing the types of cognitive ability tests that consistently yield substantial positive validity. Table 57. Information about Verify subtest constructs and items content. Subtest Construct Definition Verbal Reasoning Level Distinctions Item Specifications This is the ability to deductively reason about the implications / inferences of verbally presented problems. This is deductive reasoning because it applies to problems that are “bounded and where methods or rules to reach a solution have been previously established.” Verbal subtests constructed for Management and Graduate level jobs should have an effective theta range from -2.0 to +0.5 This is the ability to deductively reason about the implications / inferences of numerically presented problems. This is deductive reasoning because it applies to problems that are “bounded and where methods or rules to reach a solution have been previously established.” Numeric subtests constructed for Management and Graduate level jobs should have an effective theta range from -1.5 to +1.0 Inductive Reasoning This subtest is a measure of fluid intelligence (Cattell, 1971) and is sometimes referred to as “abstract reasoning” due to the nature of conceptual level reasoning as opposed to reasoning in the context of learned content. 
Inductive reasoning subtests constructed for Management and Graduate level jobs should have an effective theta range from -1.8 to 0.0 Inductive reasoning problems requiring the ability to reason about relationships between various concepts independent of acquired knowledge. Item content is abstract and does not include any worklike context. Mechanical Comp. Relevant to many technical roles, the Mechanical Comprehension test is designed to measure a candidate’s understanding of basic mechanical principles and their application to devices such as pulleys, gears and levers. Typically used for production, manufacturing or engineeringrelated recruitment or development, the test is often deployed to assess school leaver suitability for modern apprenticeship schemes, the practical application abilities of science and technology graduates or those with work experience looking to move to a technical role. Not Available Checking Designed to measure a candidate’s ability to compare information quickly and accurately, the Checking test is particularly useful when assessing an individual's potential in any role where perceptual speed and high standards for maintaining quality are required. The test is relevant to entry-level, administrative and clerical roles, as well as apprenticeship schemes. Comparisons of target alphanumeric strings with other alphanumeric strings Calculation Designed to measure a candidate’s ability to add, subtract, divide and manipulate numbers quickly and accurately, the Calculation test is particularly useful when assessing an individual’s potential in any role where calculation and estimation, as The test is relevant to entry-level, administrative and clerical roles, as well as apprenticeship schemes Item content includes arithmetic operations used in work-like calculations and estimation. Numerical Reasoning 111 Verbal subtests constructed for Supervisory and Operational levels of jobs should have an effective theta range from -3.0 to -0,8. Numeric subtests constructed for Supervisory and Operational levels of jobs should have an effective theta range from -2.5 to -0,9. Deductive reasoning problems presented in verbal content with a work-like context. Deductive reasoning problems presented in numeric content with a work-like context. well as auditing and checking the numerical work of others, are required. Reading Comp. Spatial Ability Deductive Reasoning The Verify Reading Comprehension test measures a candidate's ability to read and understand written materials and is useful for assessment at a range of job levels. This ability is very important wherever candidates will be expected to read, understand and follow instructions, or use written materials in the practical completion of their job. An effort has been made to ensure the text passages and resulting questions are relevant to as wide a range of industries and roles as possible Intended to measure the ability to rapidly perceive and manipulate stimuli, and to accurately visualise how an object will look after it has been rotated in space. The test is designed to provide an indication of how an individual will perform when asked to manipulate shapes, interpret information and visualise solutions. The Spatial Ability test is completely non-verbal and features only shapes and figures. This ability is commonly required when an individual is required to work with complex machinery or graphical information. The Verify Spatial Ability Test is an online screening assessment. 
It enables organisations to recruit candidates applying to jobs at all levels that require spatial ability. Intended to measure the ability to: draw logical conclusions based on information provided, identify strengths and weaknesses of arguments, and complete scenarios using incomplete information. The test is designed to provide an indication of how an individual will perform when asked to develop solutions when presented with information and draw sound conclusions from data. This form of reasoning is commonly required to support work and decision-making in many different types of jobs and at many levels Note: No information is provided about the theta ranges for different job levels. Note: No information is provided about the theta ranges for different job levels. Note: No information is provided about the theta ranges for different job levels. The task involves reading a passage of text, and answering a short written question.. The content of the test makes no assumptions about prior knowledge, with applicants directed to use only the information in the passage to derive their answer. Sample tasks for jobs that may require spatial ability include, but are not limited to: • rapidly perceiving and manipulating stimuli to accurately visualise how an object will look after it has been rotated • correctly interpreting graphical information • visualising the interactions of various parts of machines • efficiently communicating visual information (e.g., using charts or graphs) in a presentation Sample tasks for jobs that may require deductive reasoning include, but are not limited to: • evaluate arguments • analyse scenarios • draw logical conclusions * The content of each Verify subtest may be tailored to three or more of nine different job levels. Each subtest has versions for 3-6 different levels. In SHL’s case, they report that the constructs selected for test development were informed by their accumulated validity evidence showing the types of cognitive ability tests that had been predictive of each of seven dimensions of performance across a wide range of job families. SHL refers to these seven performance dimensions as the SHL Universal Competency Framework (UCF). The UCF dimensions are Presenting and Communicating Information Writing and Reporting 112 Applying Expertise and Technology Analyzing Learning and Researching Creating and Innovating Formulating Strategies and Concepts Extensive SHL research, summarized by Bartram (2005), has identified these performance dimensions and investigated the validity of a variety of types of predictors including cognitive abilities. This research provided the foundation for SHL’s decisions to select the Verify cognitive constructs. Unfortunately, no publically available research database shows the matrix of meta-analyzed validities between these UCF dimensions and various cognitive ability tests. Nevertheless, SHL’s strategy for selecting constructs for Verify subtests is clear. They relied on the validities previous cognitive test demonstrated against a set of universal performance dimensions relevant across job families and job levels. They did not rely directly and explicitly on a theoretical framework of broad and narrow abilities. The Level Distinctions column provides SHL’s description of the item difficulty ranges for each of the reasoning subtests, for which this information was provided. 
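The level tailoring described above can be sketched as a simple filtering step: items whose calibrated difficulty (b) falls inside the theta range specified for the target job level are eligible for the randomized form. The Python sketch below is illustrative only; the theta ranges echo the Level Distinctions values quoted above for the Verbal bank, while the mini item bank, form length and function names are invented, and SHL's actual generation rules (equivalent test information, content balancing, exposure control) are not documented at this level of detail.

import random

# Illustrative theta ranges per job-level group (Verbal bank values from Table 57)
LEVEL_THETA_RANGES = {
    "management_graduate": (-2.0, 0.5),
    "supervisory_operational": (-3.0, -0.8),
}

def build_randomized_form(item_bank, job_level, n_items, seed=None):
    """Draw a randomized form from items whose difficulty suits the target job level."""
    low, high = LEVEL_THETA_RANGES[job_level]
    eligible = [item for item in item_bank if low <= item["b"] <= high]
    if len(eligible) < n_items:
        raise ValueError("Item bank too small for the requested form length")
    return [item["id"] for item in random.Random(seed).sample(eligible, n_items)]

# Invented mini-bank of calibrated items
bank = [{"id": f"V{i:03d}", "b": b}
        for i, b in enumerate([-2.6, -1.9, -1.4, -1.1, -0.9, -0.4, 0.1, 0.4, 0.9])]
print(build_randomized_form(bank, "management_graduate", 4, seed=1))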
As shown in Table 57, except for Reading Comprehension, all Verify subtests are available in multiple levels of difficulty ranging from six levels of difficulty for Verbal and Numerical Reasoning to three levels for Checking and Calculation. Verify was initially developed with three subtests primarily to be appropriate for higher level management and service jobs. The later expansion of Verify included subtests, such as Checking and Calculation that are more suited to lower level jobs. Item Content Sample items are shown below for the three reasoning subtests. Sample items were not available for the remaining four subtests. Verbal Reasoning Verbal Reasoning content was developed with a clear work-like context, although this work-like context, itself, is unlikely to change the construct being measured because each statement requires deductive reasoning about presented information rather than judgment about how that information is applied in a work context. However, this item content clearly represents an application of crystalized ability rather than fluid ability and this presumably is due the verbal complexity of the item, rather than its work-like context. “Many organisations find it beneficial to employ students over the summer. Permanent staff often wish to take their own holidays over this period. Furthermore, it is not uncommon for companies to experience peak workloads in the summer and so require extra staff. Summer employment also attracts students who may return as well qualified recruits to an organisation when they have completed their education. Ensuring that the students learn as much as possible about the organisation encourages interest in working on a permanent basis. Organisations pay students on a fixed rate without the usual entitlement to paid holidays or bonus schemes.” Statement 1 - It is possible that permanent staffs who are on holiday can have their work carried out by students. T-F-Cannot say Statement 2 – Students in summer employment are given the same paid holiday benefit as permanent staff. T-F-Cannot say Statement 3 – Students are subject to the organisation’s standard disciplinary and grievance procedures. T-F-Cannot say Statement 4 – Some companies have more work to do in the summer when students are available for vacation work. T-F-Cannot say 113 Numerical Reasoning Very similar to Verbal Reasoning, the Numerical Reasoning content demonstrated in both sample items was developed with a clear work-like context, although this work-like context, itself, is unlikely to change the construct being measured because each statement requires deductive reasoning about presented information rather than judgment about how that information is applied in a work context. However, this item content clearly represents an application of crystalized ability rather than fluid ability and this presumably is due primarily due to number knowledge required and perhaps to a modest level of verbal complexity, rather than its work-like context. Sample 1 For each question below, click the appropriate button to select your answer. You will be told whether your answer is correct or not. Newspaper Readership Daily Newspapers Readership (millions) Percentage of adults reading each paper in Year 3 Year 1 Year 2 Males Females The Daily Chronicle 3.6 2.9 7 6 Daily News 13.8 9.3 24 18 The Tribune 1.1 1.4 4 3 The Herald 8.5 12.7 30 23 Daily Echo 4.8 4.9 10 12 Question 1 - Which newspaper was read by a higher percentage of females than males in Year 3? 
The Tribune The Herald Daily News Daily Echo The Daily Chronicle Question 2 – What was the combined readership of the Daily Chronicle, the Daily Echo and The Tribune in Year 1? 10.6 114 8.4 9.5 12.2 7.8 Sample 2 Amount Spent on Computer Imports Question 3 – In Year 3, how much more than Italy did Germany spend on computer imports? 650 million 700 million 750 million 800 million 850 million Question 4 – If the amount spent on computer imports into the UK in Year 5 was 20% lower than in Year 4, what was spent in Year 5? 1,080 million 1,120 million 1,160 million 1,220 million 1,300 million Inductive Reasoning Because it was intended to measure fluid ability, Inductive Reasoning was developed to avoid requiring acquired knowledge. This eliminated the opportunity to include work-like content 115 In each example given below, you will find a logical sequence of five boxes. Your task is to decide which of the boxes completes this sequence. To give your answer, select one of the boxes marked A to E. You will be told whether your answer is correct or not. Questions Question 1 A B C D E B C D E B C D E B C D E Question 2 A Question 3 A Question 4 A Mechanical Comprehension If all items are similar to this sample item, Mechanical Comprehension is a fluid ability measure and contains work-neutral content. 116 Question: As the wedge moves downwards, in which direction will the slider move” A. B. C. Right Left The slider will not move Checking Checking is a processing speed test in which the score for each item is the speed with which the candidate makes choice. Items are presented on the screen one at a time following the response to the previous item. In this sample item, the candidate is asked to identify an identical string of letters or numbers from the list of options on the right. Each item is individually timed. WNPBFVZKW A. B. C. D. E. WPBNVFZKW NWPBFVZWK NWBPFVKZW WPNFBVZKW WNPBFVZKW Calculation Each item in this subtest requires the candidates to calculate the number that has been replaced by the question mark. Each item is individually timed. The answer is provided on a separate answer screen. Notably, the use of a calculator is allowed for this test. The calculation is ? + 430 = 817 Reading Comprehension Although SHL does not report reading levels for its Reading Comprehension items, this sample item demonstrates a moderate level of complexity. 117 Passage Question: Biotechnology is commonly seen as ethically neutral. However, it is closely related to the conflicting values of society. Genetically modified food has the potential to bear more resilient and nutritious crops. Thus it may help the fight against world hunger. However, it also raises concerns about its long term-effects and ethics. It is this controversy that has led to the rejection of genetically modified food in Europe. Where has genetically modified food been rejected? A. B. C. D. Society Europe Agriculture Nowhere Sample items are not available for Spatial Ability or for Deductive Reasoning. Construct Validity Evidence As shown in Table 58, SHL reported two construct validity studies in which modest samples of college students completed the three Verify reasoning subtests and either Ravens Advanced Progressive Matrices or the GMA Abstract subtest. (Note, in these studies the three Verify reasoning subtest were shorter versions of the operational subtests and appear to be defined as the verification subtests are defined.) Table 58. Observed (uncorrected) construct validity results for Verify subtests. 
Construct Validity Observed Correlations in Samples of College Students Subtest: Ravens Matrices (N=60) GMA Abstract (N=49) Verify Verbal Reasoning (N=109) Verify Numerical Reasoning (N=109) Verbal Reasoning .45 .39 Numerical Reasoning .40 .37 .25 -- Inductive Reasoning .56 .54 .39 .32 At the time of these studies, Mechanical Comprehension, Checking, Calculation, Reading Comprehension, Spatial Ability and Deductive Reasoning were not part of the Verify battery. As expected, all Verify subtests correlated substantially with both Ravens and GMA Abstract, with Inductive Reasoning correlating highest. This result is consistent with the abstract content feature Inductive Reasoning shares with Ravens Matrices and GMA Abstract. Item and Test Development Developing Items and the Item Bank Similar to all other batteries except Watson-Glaser, no available SHL document provides an explicit description of the item writing procedures used to generate Verify items. It is clear, however, that all Verify items were evaluated during the development process using a 2-parameter IRT model as the basis for the decision to retain the items in the Verify item bank. The 2007 Verify Technical Manual provides the following descriptions of the IRT-based process for evaluating and retaining items in the Verify item bank. “The fit of 1, 2 and 3-parameter models to SHL Verify ability items were tested early in the SHL Verify programme with a sample of almost 9,000 candidates. As expected, the fit of a 1 118 parameter model to items was poor, but the expected gain from moving from a 2 to a 3parameter model was not found to be substantial, and for the majority of items evaluated (approaching 90%) no gain was found from moving to a 3-parameter model. Accordingly, a 2parameter model was selected and used for the calibration of verbal and numerical item banks as generated for the SHL Verify Range. The item development programme supporting the SHL Verify Range of Ability Tests extended over 36 months during which items were trialed using a linked item design and with a total of 16,132 participants. Demographic details of the sample used to evaluate and calibrate SHL Verify items are provided in the sections in this manual that describe the SHL Verify comparison groups and the relationships between SHL Verify Ability Test scores and sex, ethnicity and age. Items were screened for acceptance into the item bank using the following procedure: A sensitivity review by an independent group of SHL consultants experienced in equal opportunities was used to identify and screen out items that might be inappropriate or give offence to a minority group. This was conducted prior to item trials. Once trial data was obtained, a-parameters were reviewed with items exhibiting low a-parameters being rejected. Review of b-parameters with items exhibiting extreme values (substantially less than 3 or greater than +3) being rejected. Review of item response times (time to complete the item) with items exhibiting large response times (e.g. 2 minutes) being rejected. Item distractors (alternate and incorrect answer options presented with the item) were also reviewed with items being rejected where distractors correlated positively with item-total scores (i.e. indicators of multiple correct answers to the item) or where the responses across distractors were uneven (the latter analysis being conditional on the difficulty of the item). 
Items surviving the above procedure were subjected to a final review in terms of a- and b-parameters as well as content and context coverage (i.e. that the item bank gave a reasonable coverage across different work settings and job types). This final review also sought to provide a balance across the different response options for different item types. That is, the spread of correct answers for verbal items avoided, say, the answer A dominating over B and C correct answers across items in the SHL Verify item bank, and the spread of correct answers was approximately even for A, B, C, D and E options across numerical and Inductive Reasoning items." (Pg. 12)
Subtest Reliability
Once the Verify item banks were developed, at least to a sufficient extent, the reliability of randomized forms was evaluated. Each randomized form of a subtest is developed by applying a set of rules to the selection of the required number of items from the bank for that subtest. Unfortunately, SHL does not provide even a high level description of the procedures for selecting items into a randomized form. However, at least three considerations are presumed to apply:
Some threshold level for the standard errors of theta or for the test information function associated with a set of items; that is, items in different forms must be selected to produce similarly reliable estimates of theta.
Items must be selected to represent the level of the randomized subtest, where level is defined as a prescribed range of theta values. Higher level jobs are associated with higher ranges of theta.
Items must be selected to be representative of the bank distribution of job-related item characteristics such as work settings and job types.
Estimating the traditional reliability of randomized-form tests, no one of which is likely to be identical to any other, requires that many randomized forms be generated and the average internal consistency measure of reliability (alpha) be computed for the many generated forms. SHL uses procedures for estimating alpha from IRT results provided by du Toit (2003).
Table 59 shows the average alpha reliabilities derived for each of the three reasoning subtests in large development samples. For the Verbal Reasoning and Numerical Reasoning subtests, reliability was estimated separately for two different levels of subtests. This was done because the item banks for different levels are somewhat different, although overlapping. Inductive Reasoning reliability was estimated at only one level even though it is available at more than one level. Typically, Verify subtest reliabilities are in the upper .70s to mid-.80s, which is generally considered adequate for selection tests. The Inductive Reasoning verification test reliability averaged .72, almost certainly because it is much shorter (7 items) than the operational Inductive Reasoning subtest (24 items).
Table 59 also shows the magnitude of group differences for gender, race and age. In the section above, Developing Items and the Item Bank, the brief description of some steps in the item development process indicated that an independent group of SHL consultants screened out items that might be inappropriate or offensive to a minority group. But no mention is made of common empirical-statistical tactics for evaluating evidence of differential item functioning. Because empirical methods were not used, the possibility increases somewhat that the Verify subtests may be subject to group-based bias.
If group mean differences are unusually large for the Verify subtests, there could be an increased concern about possible sources of bias. However, Table 59 shows that most group differences, expressed as standardized mean differences, d values, are very small and only two values, both for Numerical Reasoning are moderately small. Most notably, all White-Black group differences are very small, which is a different result than is typically found in US-based research on group White-Black differences. The magnitude of the group differences reported in Table 59 suggests that the prospect of group bias is unlikely. 120 Table 59. Reliability and group difference estimates for Verify’s three original reasoning subtests. Reliability Male-Female Difference Subtest: White-Non-White Difference <40 - >40 Difference Average Alpha Male N Female N *d White N NonWhite N *d < 40 N >40 N *d Overall .80 4,382 3,885 .06 3,796 3,796 .11 5,155 3,028 .04 Manager & Graduate .81 Supervisor & Operational .78 4,382 3,885 .23 3,796 3,796 .09 5,155 3,028 .22 4,200 3,769 .01 3,291 4,678 -.08 7,228 614 .14 Level Verbal Reasoning Verbal Verification .77 Numerical Reasoning Overall .84 Manager & Graduate .83 Supervisor & Operational .84 Numerical Verification .79 Inductive Reasoning .77 Inductive Reasoning Verification .72 *For each pair of groups, d is defined as the difference between group means divided by the SD of test scores, where the target group mean (i.e., Female, Non-White, >40) is subtracted from the reference group mean (i.e., Male, White, <40) Criterion Validity Evidence SHL presents criterion validity data only for the Verbal Reasoning and Numerical Reasoning subtests. During the development of Verify seven criterion validity studies were conducted. Table 60 describes key characteristics of these seven studies, two of which did not include Verbal Reasoning. 121 Table 60. Characteristics of SHL validity studies for Verbal Reasoning and Numerical Reasoning. Industry Sector Job Level Country Banking Graduate UK Banking Manager Australia Professional Services Graduate UK Financial Supervisor UK Financial Operational UK Retail Operational US Education Operational Ireland Criterion Measure Manager’s ratings of competency Manager’s ratings of competency Accountancy exam result Manager’s ratings of competency Supervisor’s ratings of competency Manager’s ratings of competency Performance on business education exams Total Sample Verbal Reasoning Sample Size Not Used in Study Numerical Reasoning Sample Size 221 220 Not Used in Study 11 45 45 12 121 89 89 72 72 548 760 102 This information shows that two types of performance measures were used. In two studies, local jobrelated job knowledge tests were used as criteria. The remaining five studies all used ratings of jobrelated competencies provided by the supervisor / manager. No further information is provided. However, it is likely that the competency ratings were based on SHL’s Universal Competency Framework, which has a strong psychometric and substantive foundation. The results of a meta-analysis applied to the validities from these seven studies are shown in Table 61. These results show that the weighted average validity, corrected for range restriction and criterion unreliability, was .50 for Verbal Reasoning and .39 for Numerical Reasoning. 
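The meta-analytic computations summarized in Table 61 follow the familiar Hunter-Schmidt logic: sample-size-weighted averaging of the observed validities, subtraction of the variance expected from sampling error, and correction of the mean for range restriction and criterion unreliability. The Python sketch below illustrates that logic only; the study validities, sample sizes, criterion reliability and range-restriction ratio are invented placeholders rather than values reported by SHL.

import math

def bare_bones_meta(validities, ns, criterion_rel=0.60, u_ratio=0.80):
    """Bare-bones meta-analysis with simple corrections (Hunter-Schmidt style)."""
    total_n = sum(ns)
    mean_r = sum(r * n for r, n in zip(validities, ns)) / total_n
    obs_var = sum(n * (r - mean_r) ** 2 for r, n in zip(validities, ns)) / total_n
    # Variance expected from sampling error alone
    sampling_var = ((1 - mean_r ** 2) ** 2) * len(validities) / total_n
    # Correct the mean for direct range restriction, then for criterion unreliability
    r_unrestricted = mean_r / math.sqrt(u_ratio ** 2 + mean_r ** 2 * (1 - u_ratio ** 2))
    operational_validity = r_unrestricted / math.sqrt(criterion_rel)
    return {"mean_r": round(mean_r, 3),
            "residual_var": round(obs_var - sampling_var, 4),
            "operational_validity": round(operational_validity, 2)}

# Invented observed validities and sample sizes for five hypothetical studies
print(bare_bones_meta([0.21, 0.30, 0.43, 0.28, 0.35], [120, 90, 45, 150, 143]))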
The Verbal Reasoning result is virtually identical to the overall average operational validity reported by Schmidt and Hunter (1998) for general mental ability, which in most validity studies is a composite of 2-4 cognitive subtests. The .39 operational validity estimate for Numerical Reasoning does not indicate that it is estimated to be less valid than typical cognitive subtests; it is the composite of subtests representing general mental ability that is estimated to have an operational validity of .50. These results show that all variability in observed validity values is accounted for by sampling error. Overall, these validity studies are in line with the broad, professionally accepted conclusion about the level of predictive validity for measures of cognitive ability.
Table 61. Meta-analyses of validity studies of Verbal Reasoning and Numerical Reasoning.
Meta-analysis of Verify Ability Test Validities: Verbal Reasoning / Numerical Reasoning
Number of Studies (K): 5 / 7
Total Sample Size: 548 / 760
Average Sample Size: 110 / 109
Range of Observed Validities: 0.21 to 0.43 / 0.11 to 0.34
Variance in Observed Validities (A): 0.01 / 0.00
Sampling Error Across Studies (B): 0.01 / 0.01
True Variance in Validities (A-B): 0.00 / -0.01
Weighted Mean Operational Validity: 0.50 / 0.39
Approach to Item and Test Security
Perhaps the most distinctive aspect of the Verify battery is the item and test security approach SHL has implemented, primarily to ensure that scores from this unproctored online assessment are credible and adequately protected against various methods of cheating and piracy. At the same time, SHL has been active professionally in describing its overall security strategy for Verify in presentations and written material. Even though MCS is very unlikely to implement unproctored test administration as part of its civil service system, many aspects of Verify's security strategy are likely to be applicable to MCS's proctored, computer-administered testing process. Study 2 recommendations strongly encourage MCS to launch its testing system with a very strong and prominent security strategy to build credibility in the new process and to discourage attempts to cheat or defraud the system. Several of the components of SHL's Verify security strategy described below may be applicable to MCS's civil service testing system. SHL describes seven major components of its overall security strategy for Verify.
1. Capitalize on technology
2. Develop a large item bank
3. Verify unproctored test results with proctored test results
4. Continually monitor sources of information for indicators of cheating
5. Support flexible testing options
6. Use scientific rigor
7. Clear communications with the test taker
Capitalize on Technology
Unproctored Verify relies heavily on randomized test forms for each individual and on response times as indicators of potentially fraudulent attempts to complete tests. These strategies are enabled by Flash player applets. There are several advantages of randomized forms, even with proctored administration. Randomized forms are versions of whole subtests generated by an IRT-based algorithm that are psychometrically equivalent but contain different items from one another. For each test taker, one computer-generated version is randomly assigned. The advantages of randomized forms include:
Minimize item exposure. Depending on bank size and other factors, each item is expected to appear on only a small percentage of all forms.
Minimize answer key exposure.
The answer key for each randomized form is not the same due to different items and different item sequences.
Discourage attempts to cheat or steal items, where there has been clear communication about their use.
Response time information can be used effectively in data forensics as one source of information about possible cheating (e.g., patterns of very short response times) or possible efforts to steal item information (e.g., patterns of very long response times).
Develop a large item bank
A large item bank is important for a variety of reasons:
Enables the generation of randomized forms.
Reduces the overall impact on the testing program of occasional compromised items.
Encourages a strategy of continual item replacement or temporary retirement.
Supports development of item psychometric properties under the conditions in which the items are used (International Test Commission (ITC), 2006, Guideline 22.3).
Compare unproctored test results to proctored test results
ITC Guideline 45.3 establishes that applicants whose unproctored test scores result in them being regarded as qualified should be required to complete a proctored version of the same test to check the consistency of the unproctored and proctored scores. Verify explicitly implements this strategy with its "Verification" forms of each of the Verify subtests. Applicants who are regarded as qualified to be hired are required to complete a proctored verification subtest for each Verify subtest that contributed to their "qualified" status. SHL has developed a sophisticated algorithm for estimating the likelihood of the two sets of score results occurring by chance and has carried out extensive analyses of the odds of detecting cheaters, as reported in the 2007 Verify Technical Manual. Importantly, SHL's practice is not to combine the initial unproctored test result with the verification test result, nor to replace it. Combining or replacing scores can, in certain ways, reward the cheater to the disadvantage of non-cheaters who achieve similar unproctored scores. SHL discourages the practice of combining or replacing unproctored scores by not reporting the verification score and only reporting whether the verification result confirms or disconfirms the unproctored result. Clearly, verification testing would not be relevant if all MCS civil service exams are proctored. However, even in the case of proctored exams, if MCS undertakes other data forensic analyses to identify possible cheaters, a possible administrative course of action would be to require the applicant to complete a second administration of the test. In such a case, the verification approach used in unproctored testing may be applicable.
Continually monitor sources of information for indicators of cheating and piracy
SHL has established a partnership with Caveon Test Security, an industry leader in evaluating evidence of cheating and/or piracy, to monitor SHL's Verify item database and routinely inspect item-level responses and response times for indicators of cheating or other fraudulent practices. In addition, SHL actively engages in web patrols that continuously search for web sites in which information may be disclosed about cheating or piracy attempts against Verify or, for that matter, other SHL test instruments. These web patrols can identify not only communications about cheating and piracy efforts but can also build an understanding of the reputation of the target test.
While such strategies would have a high priority with unproctored testing, they may also provide significant value with proctored testing, which MCS should assume will also be the target of collaborative efforts to steal and sell information about the civil service exams.

Support Flexible Testing Options

Given the cost and convenience advantages of unproctored online testing, SHL encourages the use of Verify by making it available in a variety of ways, for a variety of jobs and embedded in a variety of delivery systems. This feature of Verify has less to do with security than with marketing the use of Verify.

Use Scientific Rigor

To establish credibility and confidence in the use of Verify, especially given that it is administered in unproctored settings, SHL has, in certain respects, demonstrated a high level of scientific rigor in its development, use and maintenance. The development of large item banks, the development of IRT-based psychometric properties in large samples, the use of randomized forms, the investment in data forensic services and SHL's high visibility in professional conferences all contribute to Verify's reputation for yielding credible scores.

Communicate a Clear "Honesty" Contract with Test Takers

An important component of any large scale testing program in which test takers have opportunities to know and communicate with one another is that the test publisher clearly communicate the security measures used to protect the test and also require that test takers agree to complete the test as they are instructed to complete it. In the case of unproctored testing, ITC Guideline 45.3 states that "Test takers should be informed in advance of these procedures (security measures) and asked to confirm that they will complete the tests according to instructions given (e.g., not seek assistance, not collude with others etc.) This agreement may be represented in the form of an explicit honesty policy which the test-taker is required to accept." (ITC, 2006, p. 164)

Study 2 Authors' Comment Regarding This Description of SHL's Approach to Unproctored Testing

Our view of SHL's approach to the maintenance of item and test security is that it represents a professional "best practice" in the sense that SHL adheres to virtually all professional guidance relating to the use of online testing and, in particular, unproctored online testing. Indeed, SHL principals have been at the forefront of communicating about the inevitability of unproctored online testing and the need for professional standards to guide it. In short, if MCS were to decide that unproctored administration was needed, SHL's Verify is currently the best "role model" for what a comprehensive approach should be like.

But we do not encourage MCS to use unproctored administration for its civil service exams. First, at the same time that ITC has established guidelines for unproctored testing, unproctored testing risks violation of other existing professional guidance (see Pearlman, 2009). Second, we believe that MCS should launch and communicate about its civil service testing system in a manner that is deliberately and explicitly designed to build its credibility and engender the confidence of all applicants. As a government process, its civic reputation will be important. Unproctored administration, especially at the outset, risks the credibility of the system, especially in a cultural setting where the use of tests does not have a long track record in personnel selection.
Translations / Adaptations

SHL has developed multiple language versions for all Verify subtests, shown in Table 62, with the exception of Reading Comprehension, which is available only in International English, and Spatial Ability and Deductive Reasoning, for which language versions are not yet reported.

Table 62. Languages in which subtests are available.

Subtest(s)                                    Language Versions Available
Verbal Reasoning, Inductive Reasoning         Arabic, Brazilian Portuguese, Canadian French, Chinese (Complex), Chinese (Simplified), Danish, Dutch, Finnish, French, German, Hungarian, Indonesian, Italian, Japanese, Korean, Latin American Spanish, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, UK English, US English
Mechanical Comprehension                      Canadian French, Chinese (Simplified), Dutch, French, German, International English, US English
Numerical Reasoning, Checking, Calculation    UK English, US English, French, Dutch, German, Swedish, Norwegian, Danish, Finnish, Italian, Indonesian
Reading Comprehension                         International English
Spatial Ability, Deductive Reasoning          Not Reported

For most subtests, Test Fact sheets are available in UK English and in French.

User Support Resources (Guides, Manuals, Reports, Services, Etc.)

In support of its employment selection tools, SHL provides extensive resources for registered users and somewhat fewer resources for unregistered browsers. Since being acquired by CEB, SHL has begun to position itself as a multiservice "People Intelligence" resource to organizations, providing a range of services including consulting services, business management tools, assessment tools, HR process tools, and training. SHL clearly provides access to more resources for both registered and unregistered individuals than any other publisher of the reviewed batteries. Resources supporting just the employment-related test products include:
• Technical Manuals
• User Guides
• Information Overviews
• Product Description Materials
• Individual Test Fact Sheets
• Sample Reports
• Practice Information (although we are not certain whether this includes practice booklets or just sample items)
• Access to an SHL Library containing many, if not all, professional presentations and publications by SHL professionals
• Business tools for talent management
• Technology platforms for test delivery and applicant management

Evaluative Reviews

Perhaps because Verify is a relatively new employment testing system that emphasizes its mode of delivery rather than the uniqueness of the constructs measured by its subtests, we have found no reviews of Verify in common professional resources such as Buros Mental Measurements Yearbook. Further, we have found no published evaluation of Verify tests other than studies published by SHL principals about research that preceded and informed the development of Verify. Perhaps the best example of such a published study is Bartram (2005), which reported the extensive research effort at SHL to identify common work performance "competencies". These competencies, now represented as SHL's Universal Competency Framework (UCF), have become the primary framework by which SHL describes the type of work for which individual subtests are considered appropriate. For example, the Test Fact sheet for Mechanical Comprehension describes this subtest as relevant to work roles depending on the following Universal Competency and underlying behaviors.
4.2 – Applying Expertise and Technology
• Applies specialist and detailed technical expertise
• Develops job knowledge and expertise through continual professional development
• Shares expertise and knowledge with others
• Uses technology to achieve work objectives

Overall, our evaluation is that the SHL approach to the development, administration, maintenance and use of selection tools, and its support of users/customers, is superior to that of any other publisher of the reviewed batteries. Our recommendations about several facets of MCS's civil service system have been strongly influenced by SHL's example of professional practice.

DESCRIPTIVE RESULTS FOR WATSON-GLASER

Overall Description and Uses

Introduction

In the 1920s, Goodwin Watson and Edward Glaser developed an assessment process for measuring critical thinking. Their precursor tests evolved by 1960 into standardized versions that came to be called the Watson-Glaser Critical Thinking Appraisal (W-GCTA). This test was designed to assess five facets of critical thinking – Recognize Assumptions, Evaluate Arguments, Deduction, Inference, and Interpretation. In addition, W-GCTA was designed to assess critical thinking in neutral contexts and in controversial contexts. Over the next 50 years the early Forms Ym and Zm (100 items) were eventually replaced by somewhat shorter Forms A and B (80 items). A British version, Form C, was also developed with 80 items. Subsequently, a Short Form was developed based on only 40 items while continuing to assess the same five facets.

More recently, in 2010, two new shortened forms, D and E (both 40 items), were introduced to replace the previous long Forms A and B, Form C (UK), which was a British adaptation of Form B, and the Short Form. Several interests motivated the development of Forms D and E: (a) empirical factor analytic studies had concluded that W-GCTA scores were best explained by three factors of critical thinking rather than the five traditional components of W-GCTA; (b) improvements in the business relevance and global applicability of all items and the currency of controversial issues; and (c) an increase in the range of scores while retaining current levels of reliability at the shorter length of 40 items. At the same time, a much larger bank of items was developed to support an online, unproctored version of W-GCTA in which each test taker would receive a different randomized set of items. This online, unproctored administration is currently available only in the UK. The W-GCTA is constructed around testlets of 2-5 items each. Each testlet consists of a text passage that provides the information the test taker analyzes to answer critical thinking questions.

This report will focus primarily on the new Forms D and E and will provide information about the very recent online, unproctored version available in the UK. However, empirical data gathered over time using the longer earlier Forms A and B and the Short Form will be cited where such data provide evidence for form equivalence and for construct and criterion validity.

Purpose

W-GCTA was developed to provide an assessment of critical thinking for use in selection decisions, development/career planning, interview planning, and academic assessment. While the W-GCTA has been applied in a wide variety of settings, its application in personnel selection has been especially focused on higher level job families and roles such as manager / executive / professional, consistent with the relatively high complexity of its item content.
While the reading level of all W-GCTA passages and items is intended to be no higher than the 9th grade level, the content of the passages, especially, is intended to be representative of content that would be encountered in a business context or would be found in newspaper media. As a result, the W-GCTA content complexity level is intended to be comparable to the complexity level of common business scenarios. In spite of its relatively high content complexity, it is relatively easy to use and score, with scoring only requiring that the number of correct answers be totaled for each part of the whole test.

Targeted Populations and Jobs

The 9th grade reading level and complexity of items result in W-GCTA being appropriate for more educated populations such as candidates for managerial / executive / professional occupations. To accommodate the use of W-GCTA across a wide range of occupations and levels, norms have been developed for many occupations and job levels. In 2012, Pearson published Form D and E norms for the following occupations: Accountant, Consultant, Engineer, Human Resource Professional, Information Technology Professional, and Sales Representative. At the same time, Pearson also published Form D and E norms for the following job levels: Executive, Director, Manager, Supervisor, Professional / Individual Contributor, and Manager in Manufacturing / Production. In the 2010 W-GCTA Technical Manual, Pearson reported that the W-GCTA "customer base" consisted primarily of business professionals (approximately 90%) and college students (approximately 10%).

Spread of Uses

W-GCTA has been used across a wide range of settings and purposes, ranging from personnel selection to clinical mental assessment to academic evaluation. Within personnel selection, W-GCTA has a long history of use with managerial / executive / professional decision making. Specific uses have included personnel selection, development feedback, training achievement, and academic readiness and achievement.

Online, Unproctored UK Version

All of the above information applies to the online, unproctored UK version.

Administrative Detail

Table 63 shows administrative details for the current and previous forms of W-GCTA.

Numbers and Types of Items

Current and previous forms include passages and items for each of the original 5 components of critical thinking – Recognize Assumptions, Evaluate Arguments, Deduction, Inference and Interpretation. However, the current versions report scores on only three scales – Recognize Assumptions, Evaluate Arguments and Draw Conclusions, where Draw Conclusions is a composite of the shorter versions of Deduction, Inference and Interpretation. Previous factor analytic studies demonstrated that scores on Deduction, Inference and Interpretation were sufficiently homogeneous to constitute a single factor, now called Draw Conclusions.

Time Limits

W-GCTA places less emphasis on time limits than other standardized cognitive ability batteries. Users have the option of administering W-GCTA with or without time limits. Nevertheless, when time limits are imposed, the same 40-minute time limit is used for the shorter current versions as was used for the longer previous forms. Pearson reported that in the standardization sample used in the development of the current shorter forms, the median time to completion was 22.48 minutes. The UK versions of the current shorter forms have a somewhat shorter time limit of 30 minutes.
Table 63. Administrative detail about previous and current W-GCTA Forms.

Watson-Glaser II (Current)
• Subscales and items: Recognize Assumptions – Forms D, E (12 items); Evaluate Arguments – Forms D, E (12 items); Draw Conclusions – Forms D, E (16 items: 5 Deduction, 5 Inference, 6 Interpretation)
• Time limit: 40 minutes for the whole battery (30 minutes for the UK version) for paper-pencil administration, or untimed (for computer or paper-pencil administration)
• Scoring rule: Raw scores as number of items answered correctly. Percentile rank scores are provided based on relevant industry/occupation/job norms. Local norms may also be used.
• Method of delivery: Fixed tests; paper-pencil or computer administration, proctored. (An unproctored version is available in the UK.)

Watson-Glaser (Previous)
• Subscales and items: Recognize Assumptions – Forms A, B (16 items); Evaluate Arguments – Forms A, B (16 items); Deduction – Forms A, B (16 items); Inference – Forms A, B (16 items); Interpretation – Forms A, B (16 items)
• Time limit: 40 minutes for the whole battery, or untimed
• Scoring rule: Raw scores as number of items answered correctly. Percentile rank scores are provided based on relevant industry/occupation/job norms.
• Method of delivery: Fixed items; paper-pencil administration, proctored.

Types of Scores

Raw scores are computed as the number of items answered correctly. Although all items are multiple choice, no adjustment is made for guessing. Raw scores are converted into percentile rank scores and T-scores (Mean = 50, SD = 10.0) for reporting purposes. Percentile rank scores are provided with respect to several norm populations including the British general population, several occupation groups, and several role/level groups.

Method of Delivery

Two methods of delivery are available in the US with the current Forms D and E: paper-pencil and online. Both methods of delivery are required in the US to be proctored and consist of the same item sets as included in paper-pencil Forms D and E. As in the US, in the UK Form D may be administered in a proctored online mode. However, in the UK a version of W-GCTA is available that may be delivered online without proctoring. This "version" of W-GCTA does not correspond to any particular fixed form of W-GCTA because unproctored administration requires that a different randomized "form" of a 40-item W-GCTA be administered to each test taker. This process of creating randomized forms requires a large number of items in an item bank that may be repeatedly sampled in a prescribed randomized manner such that virtually every test taker is administered a unique, equivalent randomized form. This process of item development, bank construction and test creation will be described in detail below. As of early 2013, this development process had produced an item bank of 376 quality-tested items associated with 100 passages.

Cost

UK prices for the specific resources necessary to administer, score and receive reports are shown below. Pricing was not available for online administration.

Item                    Unit Price
Test Booklet            £22.00
25 Answer Forms         £85.00
Scoring Key             £47.50
10 Practice Tests       £30.50
Individual Reports      £21.00
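As a concrete illustration of the score conversions described above (raw number-correct scores re-expressed as T-scores and percentile ranks against a norm group), the sketch below shows the basic arithmetic. The norm mean, SD, and norm sample are invented for illustration and are not Watson-Glaser norms.

```python
# Minimal sketch of raw-score conversion to T-scores (mean 50, SD 10) and
# empirical percentile ranks. All norm values here are hypothetical.
from typing import List

def to_t_score(raw: int, norm_mean: float, norm_sd: float) -> float:
    z = (raw - norm_mean) / norm_sd          # standardize against the norm group
    return 50 + 10 * z

def to_percentile_rank(raw: int, norm_scores: List[int]) -> float:
    below = sum(s < raw for s in norm_scores)
    ties = sum(s == raw for s in norm_scores)
    return 100 * (below + 0.5 * ties) / len(norm_scores)

norm_sample = [22, 24, 26, 27, 28, 29, 30, 31, 33, 35]        # hypothetical norm data
print(round(to_t_score(31, norm_mean=28.5, norm_sd=3.6), 1))  # ~56.9
print(to_percentile_rank(31, norm_sample))                    # 75.0
```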
Construct – Content Information

Intended Constructs

The W-GCTA was developed to measure critical thinking as Watson and Glaser defined it. They defined critical thinking as the ability to identify and analyze problems as well as seek and evaluate relevant information in order to reach an appropriate conclusion. This definition assumes three key aspects of critical thinking:
1. Attitudes of inquiry that involve an ability to recognize the existence of problems and an acceptance of the general need for evidence in support of what is asserted to be true;
2. Knowledge of the nature of valid inferences, abstractions, and generalizations in which the weight or accuracy of different kinds of evidence are logically determined; and
3. Skills in employing and applying the above attitudes and knowledge.

Current W-GCTA Forms D and E organize the assessment of critical thinking around three primary components: Recognize Assumptions, Evaluate Arguments and Draw Conclusions. Recent factor analyses of W-GCTA scores (Forms A, B, and Short Form) have shown that the previous components Recognize Assumptions and Evaluate Arguments are each single factors and that Draw Conclusions is a relatively homogeneous factor consisting of three previously identified components, Deduction, Inference and Interpretation. Current W-GCTA Forms D and E continue to include passages and items associated with each of the original five components, although in somewhat different proportions, but report scores only on the three factors of Recognize Assumptions, Evaluate Arguments, and Draw Conclusions.

Recognize Assumptions is regarded as the ability to recognize assumptions in presentations, strategies, plans and ideas.

Evaluate Arguments is regarded as the ability to evaluate assertions that are intended to persuade. It includes the ability to overcome a confirmation bias.

Draw Conclusions is regarded as the ability to arrive at conclusions that logically follow from the available evidence. It includes evaluating all relevant information before drawing a conclusion, judging the plausibility of different conclusions, and selecting the most appropriate conclusions, while avoiding overgeneralizing beyond the evidence. Although Draw Conclusions has been shown empirically to be a relatively homogeneous factor, operationally the W-GCTA defines it as the composite of scores on the facets of Deduction, Inference and Interpretation.

Watson and Glaser added a further consideration to the assessment of critical thinking. In their view, critical thinking involved the ability to overcome biases attributable to strong affect or prejudice. As a result, they built into the W-GCTA measurement process several passages and items intended to invoke strong feelings or prejudices. This measurement approach continues into the current forms of W-GCTA.

Item Content

Each item is a question based on a reading passage that supports 2-6 items. These clusters of a passage and 2 or more items are referred to as testlets. (Note: we have not located any reference to investigations of the extent to which this constructed interdependency of items within testlets has influenced W-GCTA measurement properties. As reported above, W-GCTA scores are number-right scores without any adaptation/weighting for this built-in item interdependence. Nevertheless, it can be assumed that this interdependence among items within testlets reduces the effective length of the W-GCTA assessment, as illustrated in the sketch below. This may be a possible explanation for the relatively low reliabilities reported for the Evaluate Arguments scale, in which the distinction between neutral and controversial passages is most pronounced.)
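The following sketch is illustrative only: it applies the Spearman-Brown formula to show why a 40-item test whose items behave like a smaller number of independent testlet units would be expected to have lower reliability than 40 fully independent items. The reliability and effective-length values are hypothetical, not Watson-Glaser estimates.

```python
# Illustrative only: Spearman-Brown prediction of reliability when effective
# test length changes. Values are hypothetical, not Watson-Glaser parameters.
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when test length is multiplied by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

rel_if_40_independent_items = 0.85
# If local dependence makes the 40 items behave more like ~16 independent units:
print(round(spearman_brown(rel_if_40_independent_items, 16 / 40), 2))   # ~0.69
```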
General Item Content Specifications

Certain characteristics are prescribed for all testlets, regardless of the target component. It appears that these characteristics have not changed over the decades of W-GCTA evolution. The passages are required to include problems, statements, arguments and interpretations of data similar to those encountered on a daily basis in work contexts, academic contexts, and in newspaper/magazine content. Certainly, these passages will be laden with local culture and social context and should not be assumed to be equivalent across cultures. As an example, the first British version of W-GCTA, Form C, is a British adaptation of the US-based Form B.

In addition to this prescription about testlet content, testlets must also be written to be either neutral or "controversial". Neutral content does not evoke strong feelings or prejudices. Examples of neutral content (typically) include weather, scientific facts and common business problems. "Controversial" content is intended to invoke affective, emotional responses. Domains from which controversial content may be developed include political, economic and social issues. (Note: we have found no explicit reference to any step in W-GCTA testlet development designed to evaluate the affective valence of draft content.) Watson and Glaser concluded that the Evaluate Arguments component was most susceptible to controversial content. As a result, few controversial testlets are included in Recognize Assumptions and any part of Draw Conclusions. More controversial testlets are included in Evaluate Arguments. We have found no information reporting the specific numbers or proportions of controversial testlets targeted for each test component.

All testlets are developed to have a reading level no higher than 9th grade (US) and are heavily loaded with verbal content. All content has a strong social/cultural context that in some testlets is work-like. But there is no overall prescription that a high percentage of the testlets should contain work-like content. The recent emphasis has been to develop testlet content that is more applicable across international boundaries.

Specific Item Content Examples

This section shows example testlets taken from W-GCTA user manuals. All example testlets were taken from US-oriented sample materials.

Recognize Assumptions

Statement: We need to save time in getting there so we'd better go by plane.
Answer Options: Assumption Made? YES / NO
Proposed assumptions:
1. Going by plane will take less time than going by some other means of transportation. (YES, it is assumed in the statement that the greater speed of a plane over the speeds of other means of transportation will enable the group to reach its destination in less time.)
2. There is a plane service available to us for at least part of the distance to the destination. (YES, this is necessarily assumed in the statement as, in order to save time by plane, it must be possible to go by plane.)
3. Travel by plane is more convenient than travel by train. (NO, this assumption is not made in the statement – the statement has to do with saving time, and says nothing about convenience or about any other specific mode of travel.)

Evaluate Arguments

Statement: Should all young people in the United Kingdom go on to higher education?
Answer Options: Strong / Weak
Proposed Arguments:
1. Yes; college provides an opportunity for them to wear college scarves. (WEAK, this would be a silly reason for spending years in college.)
2. No; a large percentage of young people do not have enough ability or interest to derive any benefit from college training. (STRONG. If it is true, as the directions require us to assume, it is a weighty argument against all young people going to college.)
3.
No; excessive studying permanently warps an individual’s personality. (WEAK, this argument, although of great general importance when accepted as true, is not directly related to the question, because attendance at college does not necessarily require excessive studying.) Draw Conclusions - Inference Statement: Two hundred school students in their early teens voluntarily attended a recent weekend student conference in Leeds. At this conference, the topics of race relations and means of achieving lasting world peace were discussed, since these were problems that the students selected as being most vital in today’s world. 132 Answer Options: True (T), Probably True (PT), Insufficient Data (ID), Probably False (PF), and False (F) Proposed Inferences: 1. As a group, the students who attended this conference showed a keener interest in broad social problems than do most other people in their early teens. (PT, because, as is common knowledge, most people in their early teens do not show so much serious concern with broad social problems. It cannot be considered definitely true from the facts given because these facts do not tell how much concern other young teenagers may have. It is also possible that some of the students volunteered to attend mainly because they wanted a weekend outing.) 2. The majority of the students had not previously discussed the conference topics in the schools. (PF, because the students’ growing awareness of these topics probably stemmed at least in part from discussions with teachers and classmates.) 3. The students came from all parts of the country. (ID, because there is no evidence for this inference.) 4. The students discussed mainly industrial relations problems. (F, because it is given in the statement of facts that the topics of race relations and means of achieving world peace were the problems chosen for discussion.) 5. Some teenage students felt it worthwhile to discuss problems of race relations and ways of achieving world peace. (T, because this inference follows from the given facts; therefore it is true.) Draw Conclusions – Deduction Statement: Some holidays are rainy. All rainy days are boring. Therefore: Answer Options: Conclusion follows? YES NO Proposed Conclusions: 1. No clear days are boring. (NO, the conclusion does not follow. You cannot tell from the statements whether or not clear days are boring. Some may be.) 2. Some holidays are boring. (YES, the conclusion necessarily follows from the statements as, according to them, the rainy holidays must be boring.) 3. Some holidays are not boring. (NO, the conclusion does not follow, even though you may know that some holidays are very pleasant.) Draw Conclusions – Interpretation Statement: A study of vocabulary growth in children from eight months to six years old shows that the size of spoken vocabulary increases from 0 words at age eight months to 2,562 words at age six years. Answer Options: Conclusion follows? YES NO Proposed Conclusions: 133 1. None of the children in this study had learned to talk by the age of six months. (YES, the conclusion follows beyond a reasonable doubt since, according to the statement, the size of the spoken vocabulary at eight months was 0 words.) 2. Vocabulary growth is slowest during the period when children are learning to walk. (NO, the conclusion does not follow as there is no information given that relates growth of vocabulary to walking.) 
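The example testlets above share a common structure: a short passage followed by a small set of keyed statements. A hypothetical representation of that structure is sketched below; the field names and the scoring helper are our own illustration, not Pearson's item-banking schema.

```python
# Hypothetical data structure for W-GCTA-style testlets (passage + 2-6 keyed
# items) and a simple number-right scorer. Field names are illustrative only.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Item:
    item_id: str
    stem: str
    options: List[str]       # e.g., ["YES", "NO"] or the five inference options
    key: str                 # the keyed (correct) option

@dataclass
class Testlet:
    testlet_id: str
    scale: str               # "Recognize Assumptions", "Evaluate Arguments", or "Draw Conclusions"
    controversial: bool      # neutral vs. affect-laden passage content
    passage: str
    items: List[Item] = field(default_factory=list)

def number_right(testlets: List[Testlet], responses: Dict[str, str]) -> int:
    """Raw score: count of responses that match the keyed option."""
    return sum(1 for t in testlets for i in t.items
               if responses.get(i.item_id) == i.key)

plane = Testlet("T1", "Recognize Assumptions", False,
                "We need to save time in getting there so we'd better go by plane.",
                [Item("T1-1", "Going by plane will take less time.", ["YES", "NO"], "YES"),
                 Item("T1-3", "Travel by plane is more convenient than by train.", ["YES", "NO"], "NO")])
print(number_right([plane], {"T1-1": "YES", "T1-3": "YES"}))   # 1
```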
Construct Validity Evidence The ability constructs underlying the W-GCTA have remained unchanged since its inception but do not correspond precisely to the cognitive ability constructs assessed by other commonly used cognitive tests. Nevertheless, it is to be expected that the W-GCTA measure of critical thinking would correlate substantially with other measures of reasoning or intelligence or general mental ability. Table 64 presents empirical construct validity correlations for various forms of W-GCTA gathered since 1994. The results reported in Table 64 are taken from recent W-GCTA technical manuals and do not capture all studies of W-GCTA construct validity. Also, it should be noted that, with the likely exception of the WAIS-IV Processing Speed Index, all other comparison tests are sources of convergent validity evidence. The results indicate that W-GCTA total scores correlate substantially with other convergent measures. Based on these results, a couple of points are worth noting. First, the Short Form shows observed convergent validities similar in magnitude to validities for the then-current longer Forms A and B. Second, in the small 2010 study comparing W-GCTA to WAIS-IV scores, there appears to be noticeable variation in W-GCTA scale validities, even though total score validity is similar to that in other studies with other tests. In particular, Draw Conclusions correlates substantially higher than either Evaluates Arguments or Recognize Assumptions with all WAIS-IV indices. This is likely the result of the increased heterogeneity of the testlets in Drawing Conclusions compared to the other two scales. Also, it is worth noting that none of the W-GCTA scales yielded a strong correlation with the WAIS-IV Processing Speed index. This might be considered a supportive divergent validity result given that the W-GCTA does not intended to assess speededness of critical thinking and is administered either with a generous time limit or no time limit. 134 Table 64. Observed construct validity correlations with forms of Watson-Glaser Critical Thinking Appraisal Study N Comparison Tests 147194 Nursing students (Adams 1999) 203 Job Incumbents (Pearson 2006) Form A/B/ Short Total Ennis-Weir Critical Thinking .43 .39 .37 ACT Composite .53 SAT Verbal Education Majors (Taube, 1995) Watson-Glaser Test Form D / UK Proctored SAT Math 41 Ravens APM .53 Job Incumbents (Pearson 2006) 452 Rust Advanced Numerical Reasoning Appraisal. .68 Job Incumbents (Pearson 2005) 63 Miller Analogies for Prof. Selection .70 Rail dispatchers (W-G 1994) Industrial Reading Test 180 .53 .50 .51 .54 .48 .66 .50 .51 .54 .42 .47 Lower Managers (W-G 1994) Test of Learning Ability 219 Wesman Verbal 217 EAS Verbal Comprehension 217 EAS Verbal Reasoning 208209 EAS Verbal Comprehension Wesman Verbal Mid Managers (WG 1994) EAS Verbal Reasoning Exec Managers (W-G 1994) Teaching applicants (Pearson 2009) 440 Wesman Verbal 437 EAS Verbal Comprehensions 436 EAS Verbal Reasoning 556 SAT Verbal 558 SAT Math 251 SAT Writing 254 ACT Composite WAIS-IV Full Scale IQ WAIS-IV Perceptual Reasoning WAIS (Pearson 2010) WAIS-IV Working Memory 62 WAIS-IV Verbal Comprehension WAIS-IV Processing Speed WAIS-IV Fluid Reasoning Form E Total Recognize Assumptions Evaluate Arguments Draw Conclusions Total Score .31 .20 .24 .34 .09 .32 .21 .25 .13 .10 -.01 .36 .62 .56 .59 .46 .22 .67 .52 .51 .50 .48 .59 p>.05 .60 At this point in time, no construct validity studies have been reported for the UK online, unproctored version of W-GCTA. 
Testlet / Test Development

The testlet and test development procedures described below are those used recently to develop the new, shorter Forms D and E, as well as the larger bank of items required to support the online, unproctored mode of delivery. This development process was organized into six stages of work. For the first three stages, Conceptual Development, Item Writing/Review, and Piloting, the procedures were essentially the same for the process to develop the new fixed Forms D and E as for the process to develop a large item bank for the unproctored online test. The later stages of development used different procedures for fixed form development than for item bank development. These differences will be noted below.

Testlet Development Procedures: Fixed Forms D and E and the Item Bank for Online Unproctored Administration

Conceptual Development Stage

Exploratory factor analyses of Form A, B and Short Form data were undertaken separately for each Form to evaluate the factor structure underlying the 5-component framework for Forms A, B and the Short Form. Initial results identified three factors that were stable across Forms and two additional factors that could not be identified, were associated with psychometrically weaker testlets, and were not stable across Forms. Subsequent factor analyses that excluded the weaker testlets and specified three factors revealed the three factors Recognize Assumptions, Evaluate Arguments and Draw Conclusions, the last of which included the previous Deduction, Inference and Interpretation components.

In addition to the factor analyses of Forms A, B and Short Form data, Pearson consulted with W-GCTA users including HR professionals internal to client organizations, external HR consultants and educational instructors. These two sources of input combined to establish the objectives for the process of developing new testlets to support new fixed Forms and the development of a larger bank of items to support online, unproctored testing. As a result of this conceptual development stage, the following development objectives were established:
• Ensure the new questions are equivalent to the existing items
• Ensure that content is relevant for global markets
• Ensure that content is appropriate for all demographic groups
• Ensure that the item-banked test continues to reliably measure the same critical thinking constructs as other versions of the Watson-Glaser
• Ensure that a sufficient number of new items are developed to support two new 40-item forms (each with approximately 35% new items and 65% existing items) and the item bank required to support online, unproctored delivery

Item Writing and Review Stage

Item writing was conducted by individuals with extensive prior experience writing critical thinking / reasoning items. These included experienced item writers and occupational psychologists or psychometricians with at least 15 years of experience. Detailed guidelines and specifications for writing items were provided to each writer, and each writer submitted items for review prior to receiving approval to write additional items. Writers were instructed to write items at a 9th grade reading level, using the identical testlet format as used in previous forms and measuring the same constructs as measured in the previous forms. Subject matter experts with experience writing and reviewing general mental ability/reasoning items reviewed and provided feedback on how well each new draft item measured the target construct, clarity and conciseness of wording, and difficulty level.
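The report notes that writers were instructed to keep items at a 9th grade reading level. The sketch below shows one common way such a requirement can be screened automatically, using the Flesch-Kincaid grade-level formula with a crude syllable heuristic; it is a generic illustration, not Pearson's review process.

```python
# Generic readability screen using the Flesch-Kincaid grade-level formula.
# The syllable counter is a rough heuristic; this is not Pearson's procedure.
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, with a floor of one syllable."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade_level(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

passage = ("We need to save time in getting there, so we'd better go by plane. "
           "There is a plane service available for at least part of the distance.")
print(round(fk_grade_level(passage), 1))   # flag drafts well above grade 9 for revision
```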
A separate group of subject matter experts also reviewed and provided feedback on how well draft items and items from existing forms could be expected to transport or be adapted to other countries/cultures. This included a consideration of topics or language which would be unequally familiar to different groups. As a final step, experimental scenarios and items intended to be business relevant were reviewed for use of appropriate business language and situations by Pearson's U.S. Director of Talent Assessment.

For the fixed form tests, Forms D and E, 200 new items were produced through this item writing process. These items were based on approximately 40 new testlets, each with approximately 5 items. For the item-banked online, unproctored test, 349 items were produced. (This development activity for the item-banked test took place in 2012. Additional development is currently underway in a 2nd phase.)

Pilot Stage

The pilot stage consisted of the administration of new items to W-GCTA test takers by adding some number of new items to each form of the existing W-GCTA test being administered. These new items were unscored. These pilot administrations were intended to provide empirical data for psychometric evaluations of the new items.

Fixed Forms

Each administration of an existing form (Forms A, B, and Short) included a set of 5-15 of the 200 new items. Each set was administered until at least 100 cases had been collected. When an adequate number of cases had been accumulated for each set, it was replaced by another set of 5-15 new items. This process continued until all new items had been administered to an adequate number of test takers. The average number of cases across all 200 new items was 203. Classical Test Theory (CTT) analyses were used to estimate the psychometric properties of each new item.

Item-Banked Test

Because a larger number of items was needed to generate the item bank for online unproctored testing, up to 40 new items were added to existing operational paper-pencil forms (Forms A, B, Short, and the UK paper-pencil version of Form B) as well as to the UK online, proctored version, which is a fixed, computer-administered form.

Psychometric Analyses of Initial Pilot Data

For both development efforts, CTT psychometric analyses were conducted for each new item. Item difficulty, discrimination and distractor analyses were used to identify poorly performing items, which were then deleted from the development pools.

Calibration Stage

The primary purpose of the Calibration stage was to place retained new items and existing items on common ability scales using IRT analyses. The first question was whether the full pool of items was unidimensional enough to support IRT analyses on the whole pool or whether IRT analyses should be carried out separately for the Recognize Assumptions (R) items, the Evaluate Arguments (E) items, and the Draw Conclusions (D) items. Factor analyses indicated a large first factor, R, and a smaller second factor, a combination of E and D. BILOG-MG was used to estimate IRT item parameters for both the whole pool and for the R pool and E+D pool, using both 2- and 3-parameter IRT models.
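For reference, the sketch below shows the item response function underlying the 3-parameter logistic model referred to here, with the guessing parameter c fixed rather than estimated (e.g., c = 0.5 for a two-option item). The a and b values are illustrative, not calibrated Watson-Glaser parameters.

```python
# Sketch of a 3PL item response function with a fixed guessing parameter c
# (e.g., c = 0.5 for two-option items). Parameter values are illustrative only.
import math

def p_correct(theta: float, a: float, b: float, c: float = 0.5) -> float:
    """Probability of a correct response at ability theta under the 3PL model."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

for theta in (-2, -1, 0, 1, 2):
    print(theta, round(p_correct(theta, a=1.2, b=0.3), 3))
```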
The criteria for selecting between IRT models, in spite of small sample sizes of 350 cases for many items, included:
• Goodness of fit indices for items and the model as a whole;
• Parameter errors of estimate; and
• Test-retest reliability for modeled parallel forms based on different parameterizations.

Using these criteria, a 3-parameter model with fixed guessing showed the best fit. (Note, this result was influenced by the fact that the majority of items have only two options, yielding a 50% chance of guessing correctly.) Also, sample size limitations prevented the estimation of the full 3-parameter model. In addition, test-retest reliability for modeled parallel forms favored estimates based on the whole item pool over estimates based on the R and E+D pools separately. Reliabilities produced from IRT parameter estimates based on all three subtests combined into a single whole pool ranged from .80 to .88. Reliabilities produced from IRT parameter estimates based on R and E+D pools separately were lower, ranging from .71 to .82. The reliability advantage of the whole pool estimates over the separate pool estimates was the primary factor leading Pearson to conclude that the unidimensionality assumption was supported sufficiently. Goodness of fit indices did not provide strong support for the unidimensionality assumption. As a result of these analyses, the final IRT model chosen was the 3-parameter model with a fixed guessing parameter, treating the whole item pool as unidimensional.

Tryout Stage: Form D

The primary purpose of the tryout stage, which was applied only to the new Form D, was to confirm the three factors, Recognize Assumptions, Evaluate Arguments and Draw Conclusions, and to demonstrate subscale reliability. Based on a sample of 306 candidates with at least a Bachelor's degree, factor analysis results supported the three-factor structure and internal consistency estimates provided evidence of adequate reliability, although in the case of Evaluate Arguments the estimated internal consistency was only marginally adequate. Reliability results are reported below.

Standardization Stage: Forms D and E

Standardization analyses addressed the relationships between the new Forms D and E and the previous Forms A, B and Short, as well as other measures of closely related constructs. The latter investigations of construct validity were reported above in the Construct Validity Evidence section in Table 64. Alternate forms reliabilities are shown below in the Reliability section. Overall, alternate forms reliabilities between Form D and the Short Form at the Total score and subscale score levels were strong, all above .80, with the exception of Evaluate Arguments, where the correlation between Form D and the Short Form was only .38, indicating a lack of equivalence. Form E was constructed from new items following the demonstration that Form D showed adequate factor structure and internal consistency. Form E items were selected based on item selection criteria, information from the pilot and calibration stages, and item difficulty levels corresponding with the difficulty of Form D items.

Equivalence of New Forms and Previous Forms

Following the development of Forms D and E, a development study with 636 test takers was conducted to compare the means and standard deviations of these two alternative forms and to assess their correlation. Table 65 shows the results of this equivalency study. Prior to this study, other equivalency studies were conducted for various pairs of forms. These are also reported in Table 65.

Table 65.
Equivalency results for various pairs of Watson-Glaser test forms Form 1 Study Sample Size Test Score Mean Total (40) Form 2 SD Mean 27.1 6.5 29.6 5.7 .85 Recognize (12) 8.2 3.1 5.8 2.2 .88 Evaluate (12) 7.7 2.2 6.9 1.3 .38 4.0 .82 5.6 .82 Form D Form D Standardization Study 636 Forms D and E Development Study 209 Paper v. Computer Administration 226 US 12th Grade Students 288 UK “Sixth Form” Students Draw (16) 11.2 Total (40) 22.3 Short Form 17.9 6.3 22.9 Form E Short Form (Paper) 29.8 5.6 Form A Total (80) 46.8 Total (80) 56.8 Short Form (Computer) 29.7 5.6 .87 9.3 .75 Form B 9.8 46.6 Form C (UK adaptation of Form B) Form B 53 1 3.1 Form D Total (40) SD Observed Alternate Forms Reliability 8.3 57.4 9.5 .71 1 The Draw Conclusions mean reported for the Short Form, 17.9, is almost certainly a typo because the Draw Conclusions scale includes only 16 scored items. However, we have not been able to confirm the correct value. A few points are worth noting about the equivalency data reported in Table 65. First, the Total Score alternate forms reliabilities for all 40-item tests are substantial, all being greater than .82. However, the Total Score alternate forms reliabilities for the 80-item Forms A, B and C are lower, .71 and .75, than those for the shorter newer forms. Given that the construct and testlet content specifications were virtually identical for all forms, this is a remarkable result. One possible explanation is that the equivalency studies involving Forms A, B and C were conducted recently and the item content had become outdated to the extent that students may have responded less reliably to the older testlet content. Second, the computerized version of the Short Form correlated .87 with its paper-pencil counterpart and produced nearly identical means and SDs. This result created great confidence on the publisher’s part that online administration alone would not likely distort scores. Third, scale scores for Evaluate Arguments correlated only .38 between Form D and Short Form. This is a low alternate forms reliability for two scales defined and constructed in virtually identical ways. Other forms of reliability evidence reported below strengthen the overall conclusion that Evaluates Arguments has demonstrated marginal levels of reliability. The 2010 Watson-Glaser Technical Manual comments on this result by noting “Because Evaluates Arguments was the psychometrically weakest subscale in previous forms of the Watson-Glaser, the low correlation was not surprising.” Surprising or not, this result strongly suggests Evaluate Arguments may not contribute to the overall usefulness of Watson-Glaser scores. Construct validity results reported above in Table 64 and criterion validity results reported below in Table 69, support this expectation that Evaluates Arguments contributes little, if anything, to the meaningfulness and usefulness of WGCTA scores. 139 Item Bank Configuration for Online, Unproctored Testlets Once new items developed for the large item bank and existing items were calibrated on the same scale, the process of configuring the item bank was carried out. Including all new testlets and existing testlets that survived the preceding stages of development and review, 100 passages remained, supporting 376 specific items. The primary objective guiding the configuration of the item bank was to ensure that all test takers receive equivalent tests. 
All administered tests must have the same number of scored items and must be a representative sample of the diverse content areas within each subscale, including some but not all business-related passages. In addition, all administered tests should avoid extreme differences in overall test difficulty. However, given that online, unproctored tests are scored based on IRT estimates of overall ability level, and not on number-right scores, overall test difficulty, as measured by the number of incorrect answers, is not a critical consideration.

Unfortunately, Pearson has not provided a precise description of the manner in which items are banked or selected into tests. From the information provided, it is plausible that the item bank has a moderately complex taxonomic structure in which one dimension is type of scale (5) and a second dimension is type of topic (e.g., business, academic, news, etc.). Presumably there are formalized constraints on item sampling that restrict the number of easy and more difficult questions and, most likely, restrict the tendency to oversample the highest quality items. Pearson estimates that, based just on the Phase 1 bank size of approximately 376 items (100 testlets), over 1 trillion tests are possible. Analyses of overlap between pairs of simulated randomized 40-item tests showed that the average amount of overlap was approximately 1 item.

Equivalency of Randomized Test Forms

An important consideration is the equivalency of randomized tests generated from the same item bank. A study was conducted to assess the means, standard deviations and correlations between pairs of randomized tests administered to the same test takers. In this study, each test taker completed two 40-item randomized tests. In one condition, the two tests developed for a test taker were constructed to be fully equivalent, including similar levels of difficulty. In a second condition, the two tests were constructed to be equivalent except with respect to difficulty: one test was constructed to be easier than the test construction algorithm would normally allow and the other more difficult than would be allowed. The reason for this second condition was to evaluate the extent to which IRT calibration would control for variation in item difficulty, as is expected. This study was conducted in four different samples in which two samples received equivalent pairs of tests and two samples received non-equivalent pairs of tests. Table 66 shows the equivalency results in these four samples.

Table 66. Equivalency results for pairs of unproctored randomized online tests, each with 40 items.

Sample                             Sample Size   Test 1 Mean   Test 1 SD   Test 2 Mean   Test 2 SD   Observed Alternate Forms Reliability
Equivalent Pairs - Sample A        335           -.17          .80         -.12          .82         .82
Equivalent Pairs - Sample B        318           -.06          .89         -.07          .92         .88
Non-Equivalent Pairs - Sample C    308           -.17          .89         -.21          .90         .78
Non-Equivalent Pairs - Sample D    282           -.24          .91         -.24          .89         .80

Table 66 shows that equivalent tests had alternate forms reliabilities of .82 and .88 and non-equivalent (in terms of test difficulty) tests had reliabilities of .78 and .80. Also, within each sample the pairs of tests had very similar means and SDs, regardless of the difficulty-related equivalency of the pairs of tests. These results are strong evidence that even large differences in randomized test difficulty are unlikely to distort estimated ability levels to any substantial degree.
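The sketch below illustrates, under simplified assumptions, the kind of constrained random test assembly described above: drawing a fixed number of items per scale from a bank while screening out extreme difficulty, then estimating item overlap between two independently generated tests. The bank, quotas, and constraints are invented; Pearson's actual taxonomy and sampling rules are not public and operate at the testlet (passage) level with additional constraints, so overlap figures from this simulation are not comparable to Pearson's reported estimate.

```python
# Simplified, hypothetical sketch of constrained random test assembly and an
# overlap check between two independently generated tests. The bank and the
# sampling rules here are invented, not Pearson's banking taxonomy.
import random

def build_bank(seed: int = 0):
    rng = random.Random(seed)
    scales = ["Recognize"] * 120 + ["Evaluate"] * 120 + ["Draw"] * 136
    rng.shuffle(scales)
    return [{"id": i, "scale": s, "b": rng.gauss(0, 1)} for i, s in enumerate(scales)]

def assemble(bank, rng, quotas=None, max_abs_b=2.0):
    quotas = quotas or {"Recognize": 12, "Evaluate": 12, "Draw": 16}
    test_ids = set()
    for scale, k in quotas.items():
        pool = [it["id"] for it in bank
                if it["scale"] == scale and abs(it["b"]) <= max_abs_b]
        test_ids.update(rng.sample(pool, k))
    return test_ids

bank = build_bank()
rng = random.Random(1)
overlaps = [len(assemble(bank, rng) & assemble(bank, rng)) for _ in range(500)]
print(sum(overlaps) / len(overlaps))   # average shared items between two simulated tests
```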
This empirical result that ability estimates do not substantially vary as a function of item difficulty provides some support for the IRT invariance assumption. However, Pearson does not report other evaluations of this assumption. Nevertheless, the fact that tests of unequal difficulty produced ability estimates correlating approximately .80 with one another indicates that the invariance assumption is sufficiently supported for selection purposes. Reliability and Inter-Scale Correlations While equivalency studies produced alternate forms reliability estimates, reported above, other studies of W-GCTA reliability for more recent forms have been conducted that estimate test-retest and internal consistency reliability. Table 67 shows the results of these other recent reliability studies. 141 Table 67. Observed test-retest and internal consistency reliability estimates. Study N Form B (80 items) 96 UK Version of Form B (80 items) Test-Retest Estimates Total Recog. Eval Draw 714 .92 .83 .75 .86 UK Version of Form B (80 items) 182 .93 UK Version of Form B (80 items) 355 .84 Short Form 1994 (40 items) 42 .81 57 .89 Short Form 2006 (40 items) Form D (40 items) Total Recog. Eval Internal Consistency Estimates Draw .73 1,011 .83 .80 .57 .70 UK Version of Form D (40 items) 169 .75 .66 .43 .60 UK Version of Form D (40 items) 1,546 .81 Form E (40 items) 1043 .81 .83 .55 .65 UK Unproctored Online (40 items) 2,446 .90 .81 .66 .81 UK Unproctored Online (40 items) 355 .82 UK Unproctored Online (40 items) 318 .84 UK Unproctored Online (40 items) 318 .86 A number of points can be made about the reliability estimates reported in Table 67. First, the original Form B shows the lowest test-retest reliability while the UK adaptation of Form B shows the highest of the internal consistency reliabilities. It appears the adaptation of testlet content to the British context improved reliability. Second, the Evaluates Arguments subscale is always the least internally consistent of the subscales. The distinctive feature of Evaluates Arguments that it includes a much larger proportion of controversial passages invites speculation that this feature harms its psychometric properties. This continues the pattern of this subscale demonstrating the poorest psychometric properties. Third, in the shorter forms, Recognition of Assumptions always shows the highest internal consistency, while having fewer items, 12, than the Draw Conclusions subscale,16. Fourth, the internal consistency of the randomized forms constructed for unproctored online administration demonstrate similarly strong levels of reliability compared to the fixed forms. Some evidence is available about the correlations among the 3 subscales in the more recent shorter forms. Table 68 shows inter-subscale correlations in five separate study samples. Perhaps the most notable pattern of results is that Evaluate Arguments correlates the least with other subscales in the US-oriented Form D and Short Form while, in contrast, it correlates highest with Draw Conclusions among all inter-subscale correlations for subscales that have been developed / adapted for use in the UK. This is unlikely to be due to differences in currency between US and UK oriented Forms since all versions have been developed within the past several years, with the possible exception of the UK adaptation of the older Form B. It is not clear when that development took place. 
Also, the results for the UK Adaptation of Form B are not directly comparable to the other within subscale correlations because the Draw Conclusion subscale in the UK Form B adaptation is a nonoperational composite of three subscales, Deduction, Inference and Interpretation, making it 60% of the total raw score whereas it is only 40% of the total raw score in the shorter tests.. 142 Table 68. Observed correlations among the three subscales of the newer shorter forms. Study N Subscales Subscales Total Recognize Evaluate Draw Total (40) UK Adaptation of Form D 169 Recognize (12) .76 Evaluate (12) .73 .32 Draw (16) .67 .28 .40 Total (80) UK Adaptation of From B 714 Recognize (16) .74 Evaluate (16) .78 .40 Draw (48) .95 .58 .66 Total (40) UK Unproctored Randomized Tests 2,446 Recognize (12) .86 Evaluate (12) .73 .47 Draw (16) .86 .58 .63 Total (40) US Form D Comparison of Form D to Short Form Row subscales are Short Form Column subscales are Form D 636 636 Recognize (12) .79 Evaluate (12) .66 .26 Draw (16) .84 .47 .41 Total (40) .85 .68 .43 .80 Recognize (12) .74 .88 .26 .48 Evaluate (12) .41 .24 .38 .36 Draw (16) .75 .50 .37 .82 The correlations inside the ovals are alternate forms reliabilities reported in TableW3. Criterion Validity Table 69 presents criterion validity results for several recent studies in which some Form of W-GCTA was administered to applicants or employees or students for whom some work or academic performance measure was also available. A number of key points can be noted based on this compilation of observed criterion validities. First, W-GCTA demonstrates substantial positive validity with respect to a wide range of performance measures. Second, little criterion validity evidence has been reported recently for the three subscales. But what evidence has been reported shows that Draw Conclusions has higher criterion validity than both Recognize Assumptions and Evaluate Arguments. This raises the prospect for Evaluates Arguments, especially, that it contributes little to the usefulness of the W-GCTA in personnel selection applications. Along with this evidence of relatively low validity, the evidence of somewhat lower reliability levels reported above in Table 65 for alternate forms reliability and in Table 67 for internal consistency draw attention to the possibility that Evaluates Arguments is only marginally adequate psychometrically. Third, the single criterion validity study of the unproctored online form strengthens the empirical foundation for the psychometric properties of unproctored ability testing. Overall, there is strong evidence that W-GCTA Total score predicts work and academic performance. 143 Table 69. Observed criterion validities for W-GCTA forms. Study Assessment Center Ratings – Retail (Kurdish & Hoffman 2002) N 71 Form A/B Criterion Assessor rating of Analysis .58 Assessor rating of Judgment .43 41 Semester GPA Freshman nursing classes (Behrens 1996) 31 Semester GPA 37 Semester GPA .59 .53 .51 Education majors (Gadzella et al 2002 114 GPA .41 Education majors (Taube 1995) 147194 GPA .30 Educational Psych students (Gadzella et al 2004) Course grades .42 139 GPA .28 Organization level achieved .33 Total AC performance (sum of all 19 dimensions) .28 Job incumbents (Pearson 2006) Assessment Center (Spector et al 2000) Job incumbents (Pearson 2006) 2,303 ? 
Overall Performance (single rating) 142 Supervisor rating of Problem Solving Supervisor rating of Decision Making Supervisor rating of Problem Solving Supervisor rating of Decision Making Government analysts (Ejiogu, et al., 2006) 64 Supervisor rating of Professional/Technical Knowledge & Expertise UK Unproc tored Trial Form E/D Recognize Evaluate Draw Total Score .24 .33 .23 .40 .40 .37 Final exam score .39 .25 .42 .57 123 Final Exam Grade .62 UK Bar Professional Training Course (2011) 988 Final Exam Grade Teaching applicants (Watson-Glaser) 1043 GPA for last 3 years of undergraduate study .18 .16 .24 .26 Supervisor rating of Core Critical Thinking .23 .17 .10 .02 .09 .11 .10 .24 -.03 .14 .33 .31 .21 .26 .37 .30 .26 .24 .12 .27 Total Score Overall Potential Educational Psych students (Williams 2003) 428 UK Bar Professional Training Course (2010) Mid-term exam score .51 Supervisor rating of Creativity Insurance managers (Watson-Glaser) 65 Supervisor rating of Job Knowledge Supervisor rating of Overall Performance Supervisor rating of Overall Potential Approach to Item / Test Security Historically the Watson-Glaser publishers have relied entirely on proctored administration and the appropriate training of those proctors to ensure the security of item and whole tests. There is no indication that testlets / items were frequently, or at all, replaced with fresh items. Until the 2012 Pearson UK Manual for Watson-Glaser, there has no indication I any sources we have seen that banks of items have been developed to support earlier Forms A, B, C, D, E or the Short Form beyond the items that were initially placed on those fixed forms. Beginning in 2010, approximately, Pearson launched the development of an item bank to support the requirements of unproctored online administration of a new “form” of W-GCTA. The primary purpose of this strategic shift to enable online administration, with or without proctoring, has been to accommodate changing expectations, communications capabilities, and client requirements, all of which have increased the demand for online assessment and the expectation that such online capability not be constrained to environments that are suited to proctoring. 144 Pearson’s approach to unproctored online testing is similar in many ways to SHL’s approach with Verify, although there are differences. The major elements of Pearson’s approach are described here: 1. Randomized forms of tests. Each test taker will be administered a test of 40 items and, presumably, a variable number of passages required to support those 40 items. The 40 items selected for each test taker are taken from a finite bank of items (currently approximately 400 items) Pearson has estimated that, given the current bank size, any two randomized tests will overlap by approximately 1 item, on average. 2. Test Construction to Ensure Equivalence. The 40 items are selected by an algorithm from the bank of items to satisfy a number of requirements and constraints to ensure equivalence. These include: a. Each test will contain 12 Recognize items, 12 Evaluates items and 16 Draw items. b. Items will be sampled from the bank taxonomy to ensure that at least some (unknown) number of items are based on business content. c. Items of high or low difficulty will be avoided to better ensure tests will have similar levels of overall difficulty. (However, early studies have shown the IRT ability estimation to be very robust against large differences in test difficulty.) d. 
(Presumably) sampling rules will protect against over-sampling the better quality items. 3. Bank Construction Practices. Banks of items are currently being developed in at least 2 phases. Phase 1 has produced 376 items. The item bank for unproctored online administration includes new items as well as existing items that are included on current fixed forms of W-GCTA tests. Using a 3-parameter model, with a fixed guessing parameter, all items are analyzed in the same unidimensional pool. (IRT estimates are not developed within Recognize, Evaluate or Draw content categories.) Newly developed items must first satisfy CTT psychometric requirements for item quality including difficulty, discrimination, and distractor characteristics before they are available to be trialed for IRT estimation purposes. No information has been available about the nature of the bank taxonomy that governs the process of randomized sampling of items. We have inferred from the 2012 UK Technical Manual that there are at least two dimensions to the item taxonomy. One dimension is certainly the 5 components of Watson-Glaser’s model of critical thinking, Recognize Assumptions, Evaluate Arguments, Inference, Deduction and Interpretation, the last three of which comprise Draw Conclusions. A second dimension may well be a categorization of item content into subcategories such as business content, school content, news/media content, and so on. Presumably sampling rules govern the number of items to be randomly selected from within each subcategory of the item taxonomy. But we have not been able to confirm the details of the item taxonomy and the sampling rules. 4. Proctored Verification Testing. Pearson recommends that clients verify satisfactory unproctored test results by retesting candidates prior to making job offers. The verification test is proctored and its results are compared to the previous unproctored result by Pearson algorithms to evaluate the likelihood of cheating and make recommendations to the client about managing the discrepancy. 5. Bank Maintenance. No information is yet available that describes the technical details of Pearson’s approach to bank maintenance. Given the enhanced prospect for cheating in unproctored environments, it is likely that Pearson uses some form of ongoing item review to inspect item characteristics for evidence of drift attributable to cheating. 145 Translation / Adaptation W-GCTA tests are available in the following 12 languages and contexts. UK English US English Australian English Indian English Chinese Dutch German French Japanese Korean Spanish Swedish The Pearson expert interviewed for this project,. Dr. John Trent, had no information about Pearson’s process for creating different language versions. Given the high dependency on social/cultural content relating to business, school and media contexts and the need to ensure that passages are either neutral or controversial, Pearson’s item development process relies heavily on the review of draft items by professionals with considerable experience with the Watson-Glaser test instrument. Certainly, equal expertise and experience is required for successful translations. Experienced translation organizations such as Language Testing International are likely to be capable resources. User Support Resources (E.g., Guides, Manuals, Reports, Services) Pearson (Talents Lens) provides a wide range of user support resources for the Watson-Glaser battery. (We note here that the assessment options available in the US and UK are not the same. 
The unproctored online version of Watson-Glaser is currently available only in the UK. As reported by Dr. John Trent, Pearson is currently working on making the UK assessment capabilities available in the US.) The Watson-Glaser resources available through the Talent Lens distribution channel include:

• Technical Manuals
• Sample Reports (Interview, Profile, Development)
• Frequently Asked Questions documents
• User Guides
• Norm documentation
• Case Study descriptions
• Validity Study descriptions
• Information for Test Takers
• Sample Items
• White Papers
• Regular Newsletters (registration required)

Overall, Pearson / Talent Lens provide strong and comprehensive support information to users. We also found that it was relatively easy to obtain access to Pearson psychologists for more detailed information about Watson-Glaser.

Evaluative Reviews

Many reviews of previous Watson-Glaser forms have been conducted. We focus here on the four most recent reviews of the 80-item Forms A and B (Berger, 1985; Helmstadter, 1985) and the 40-item Short Form, S (Geisinger, 1998; Ivens, 1998). To our knowledge, no evaluative reviews have been published about the most recent shorter versions, Forms D and E, or their UK counterparts. Certain common themes have emerged from these reviews. Reviewers generally view the Watson-Glaser definition of critical thinking as a strength, although some have criticized the lack of construct validity evidence to further develop and refine the specific meaning of the measure. Also, the large amount of empirical criterion validity evidence reported in earlier manuals, for Forms A and B especially, has been praised. Geisinger, in particular, noted that most of these studies correlate Watson-Glaser scores with academic outcomes rather than organization or work performance. Broadly, reviewers agree that Watson-Glaser predicts important outcomes in both academic and work settings.

The criticisms of Watson-Glaser focus on lower-than-desired reliability and on the value of using controversial topics. Both Berger (1985) and Helmstadter (1985) describe the Form A and B reliability levels as "adequate – but not outstanding". Geisinger (1998), in an especially insightful review, criticizes the original Short Form, S, for insufficient score discrimination in the most popular range of scores. This problem results from a combination of marginally adequate reliability in the Short Form and significant range restriction in observed scores. He notes that 95% of the Short Form scores fall within a 15-point range (26-40) of the full 0-40 range of possible scores. As a result, a test taker with a 60th percentile score of 34 would have a 95% confidence interval from 29.7 to 38.3, which ranges from the 19th percentile to the 99th percentile. Consequently, reliance on percentile scores could lead to unreliable decisions. To Watson-Glaser's credit, the publishers appear to have given great weight to this criticism and established an objective for the development of Forms D and E that the range of observed scores should be increased. Our own evaluation of this issue of moderate reliability is that the scores for the US versions of the new 40-item Forms D and E appear to have similar reliability to the previous forms, but that the UK versions appear to have somewhat higher reliabilities. In either case, the reliability levels for Total scores are adequate for use in selection processes, but the reliability levels in some studies are not adequate to rely on subscores for selection purposes.
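Geisinger's point about score precision can be made concrete with the standard error of measurement. The minimal sketch below (Python) reproduces a confidence interval of roughly the width he describes; the standard deviation and reliability values are illustrative assumptions chosen to be consistent with the reported interval, not figures taken from the Watson-Glaser manuals.

```python
import math

def score_confidence_interval(observed_score, sd, reliability, z=1.96):
    """Return the standard error of measurement (SEM) and a confidence
    interval around an observed score, using the classical test theory
    formula SEM = SD * sqrt(1 - reliability)."""
    sem = sd * math.sqrt(1.0 - reliability)
    return sem, (observed_score - z * sem, observed_score + z * sem)

# Illustrative values only: an observed-score SD of about 5 and a reliability
# of about .81 reproduce an interval close to the 29.7-38.3 band Geisinger
# describes for a Short Form score of 34.
sem, (low, high) = score_confidence_interval(34, sd=5.0, reliability=0.81)
print(f"SEM = {sem:.2f}; 95% CI = {low:.1f} to {high:.1f}")
```

When, as in the Short Form, most observed scores also fall within a narrow band, an interval of this width necessarily spans a very large percentile range, which is the source of the unreliable percentile-based decisions Geisinger describes.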
Further, the pattern of relatively low reliabilities and validities for the Evaluate Arguments subscale, which is heavily loaded with controversial content, raises a question about the meaning and measurement properties of the so-called controversial content. Berger (1985) shares the same question about the meaningfulness of the controversial content. In an interview about Watson-Glaser, Pearson's Dr. John Trent expressed the view that this subscale relies not only on cognitive components but also on personality components. (This may help to explain why some "convergent" validity results reported in Watson-Glaser technical manuals describe correlations with personality assessments.) Loo and Thorpe (1999) reported an independent factor analytic study of the Short Form and reached a very similar conclusion: (a) internal consistency reliabilities were "low" for the five subscales, and (b) principal components analyses did not support the five-subscale structure. It also appears Watson-Glaser sought to address this issue of factor structure by revising the subscale structure from five to three in the development of Forms D and E. We would note, however, that at the item level no changes were made to the intended content. Controversial items were retained, and items capturing the five original facets were retained. The measurement modification in Forms D and E was that the three facets of Interpretation, Inference and Deduction are combined into a single scored scale, Draw Conclusions.

Overall, our evaluative assessment of Watson-Glaser forms is that they are valid predictors of education and work outcomes, but their construct meaning is not yet clear and the inclusion of controversial content may not improve the measurement or prediction properties of the battery.

SECTION 6: INTEGRATION, INTERPRETATION AND EVALUATION OF RESULTS

This section follows the presentation of extensive information about each target battery by integrating that information into comparative assessments of battery qualities, interpreting the potential implications for MCS's civil service test system and evaluating the features of batteries as possible models for MCS civil service tests. These three primary purposes are served by organizing this section into eight subsections, each of which addresses a particular aspect of cognitive batteries themselves, or their use, that is likely to be an important consideration for MCS's planning, development and management of its civil service test system. These eight subsections are:

The question of validity
Test construct considerations
Item content considerations
Approaches to item / test security and methods of delivery
Item and test development
Strategies for item management / banking
Use of test results to make hiring decisions
Considerations of popularity, professional standards and relevance to MCS interests

It is important to note that none of the commercial batteries reviewed in Study 2 is being used for civil service hiring purposes. None of the publishers' sources or other sources describes the manner in which the target batteries would be best applied for purposes of civil service selection. The MCS civil service testing system will have requirements and conditions that none of the target batteries has accommodated. As a result, it is not our intention to identify a "most suitable" model from among the target batteries.
Rather, we will use information about key features of the batteries to describe the ways in which those features may have value (or not have value) for MCS's civil service system.

The Question of Validity

The battery-specific summaries report a wide variation in the amount of available construct and criterion validity information. At the low end, PPM has a modest amount of construct evidence with respect to other reference subtests but no criterion validity evidence. (This remarkable feature of PPM has been confirmed in discussion with Hogrefe's PPM expert.) Our own effort to locate published criterion validity studies for PPM has found none. At the high end, GATB is perhaps the most validated battery of all cognitive tests used for selection, with the possible exception of the US Armed Services Vocational Aptitude Battery (ASVAB). EAS also has an extensive record of criterion validity evidence. Similarly, Watson-Glaser and DAT for PCA have extensive records of construct and criterion validity evidence with respect to both work and academic performance. Both Verify and PET are somewhat more recent batteries for which a modest amount of empirical validity evidence is available.

Two important points can be made about these differences between batteries. First, this difference in the amount of available validity evidence does not suggest differences in their true levels of criterion validity. Although PPM's publishers have not met professional standards for gathering and reporting evidence of criterion validity, we are confident in our expectation that PPM has levels of true criterion validity very similar to GATB or any of the other well-developed batteries we are reporting on. We should note that this was not an "automatic" conclusion. Rather, this conclusion followed from an adequate amount of evidence that PPM is a well-developed cognitive battery reliably measuring the abilities it was intended to measure.

Second, for well-developed cognitive ability batteries, the question of criterion validity has largely been answered by the large volume of validity evidence that has been aggregated and reported in meta-analytic studies, especially over the past 40 years (see, e.g., Schmidt & Hunter, 1998). These results can be summarized in any number of ways. The summary that may be most relevant to this Study 2 purpose of focusing on the relative merits of different subtest types is that, for any well-developed and reliable cognitive ability battery of subtests measuring different abilities, most combinations of 3 or 4 subtests will yield a true operational criterion validity of approximately .50. Adding a 5th or 6th subtest to the composite generally will not increase criterion validity.

There is an important implication of these high-level points once the decision is made (a) to develop a civil service selection test assessing cognitive abilities and (b) that these subtests will be representative of the types of subtests that produced the meta-analytic results. This implication is that the remaining decisions about the development of items and subtests, such as specific target constructs, specific types of item content, and specific ways of tailoring the use of the subtests to "fit" with specific job families, will not depend in an important way on considerations of criterion validity.
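The plateau in composite validity can be illustrated with the standard formula for the validity of a unit-weighted composite, R = k·r̄xy / sqrt(k + k(k−1)·r̄xx), where r̄xy is the average subtest validity and r̄xx the average subtest intercorrelation. The short sketch below uses illustrative assumed values (average subtest validity of .40, average intercorrelation of .50) that are our own assumptions rather than figures drawn from the meta-analytic literature; the point is only that the composite levels off near .50 after three or four subtests.

```python
import math

def composite_validity(k, mean_validity, mean_intercorrelation):
    """Criterion validity of a unit-weighted composite of k equally valid,
    equally intercorrelated subtests (classical composite formula)."""
    return (k * mean_validity) / math.sqrt(k + k * (k - 1) * mean_intercorrelation)

# Illustrative assumed values: average subtest validity .40,
# average subtest intercorrelation .50.
for k in range(2, 7):
    print(k, round(composite_validity(k, 0.40, 0.50), 2))
# 2 -> 0.46, 3 -> 0.49, 4 -> 0.51, 5 -> 0.52, 6 -> 0.52: gains flatten after 3-4 subtests
```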
The one exception to this overall implication is that recent evidence (Postlethwaite, 2012) has shown that subtests assessing crystallized abilities have, on average, higher criterion validity than subtests assessing fluid abilities. This has important implications for the design of item content. As a consequence of this implication, the discussions below about important considerations in the design and development of items and subtests will make little reference to the consideration of criterion validity.

Test Construct Considerations

The first consideration is the nature of the abilities measured by the reviewed batteries. (Note: for comparison's sake we have included the OPM subtests used in the USA Hire civil service employment system.) Table 70 shows a method of classifying the subtests in each battery into one of five categories of work-related cognitive abilities. It is important to note that these five categories do not correspond to theoretical taxonomies of cognitive ability such as the CHC model of cognitive ability. While it may be feasible to assign individual subtests to specific CHC abilities, none of the subtests among these seven batteries was designed specifically to measure any particular broad or narrow ability within the CHC model. Rather, it might be best to interpret these as five categories of work-related cognitive ability. While each battery has its own origins and initial conceptual rationale, these categories have emerged over time as clusters of subtests that have shown evidence of predictive validity and, broadly speaking, can be associated with job ability requirements in the language commonly used in job analysis methods designed to describe ability requirements.

This distinction between categories of work-related cognitive ability and theoretical taxonomies of cognitive ability is not a criticism of the reviewed batteries. It is not likely a source of imperfection or loss of validity. (Note: in fairness to many of the original test developers for these and other cognitive batteries designed for selection purposes, the developers were often aware of and informed by theoretical taxonomies of cognitive ability. But they were likely aware that reliance on the structure of these taxonomies did not lead to tests with more criterion validity. Of course, in some cases such as Watson-Glaser and PPM, the original developers were acting on their own theories of cognitive ability.)

Table 70. Subtests in the target cognitive test batteries.
Type of Subtest Professional Employment Test Differential Aptitude Test--Personnel and Career Assessment Verbal Reasoning Verbal (Crystallized) Reading Language Comprehension Usage Spelling Quantitative (Crystallized) General/ Abstract Reasoning (Fluid) Spatial / Mechanical Aptitude (Mixed) Quantitative Problem Solving Data Interpretation Employee Aptitude Survey Verbal Comprehension Verbal Reasoning Numeric Ability Numerical Reasoning Space Relations Verify OPM USA Hire Verbal Comprehension Verbal Reasoning Verbal Reasoning Reading Reading Comprehension Reasoning Vocabulary Computation Arithmetic Reasoning Symbolic Reasoning Space Visualization Numerical Computation Numerical Reasoning Numerical Reasoning Calculation Applied Power Inductive Reasoning Perceptual Reasoning Reasoning Mechanical Reasoning WatsonGlaser II Word Fluency Numerical Ability Abstract Reasoning General Power and Aptitude Performance Test Battery Measures ThreeDimensional Space Math Deductive Reasoning Mechanical Understanding Mechanical Comprehension Spatial Ability Spatial Ability Processing Speed Checking Name Comparison Processing Speed / Accuracy Visual Speed and Accuracy Clerical Speed and Accuracy Visual Pursuit Manual Speed and Accuracy Mark Making Tool/Object Matching Manual Dexterity Place Turn Assemble Dissemble Recognize Assumptions Other Evaluate Arguments Judgment Draw Conclusions An inspection of Table 70 reveals a number of key points about these employment tests. Setting Watson-Glaser aside for this paragraph, the first point is that all batteries include subtests of Verbal and Quantitative abilities. Some of these subtests are tests of learning achievement, virtually, such as Numerical Computation, Vocabulary, and Spelling. Some of these tests are a composite of reasoning with specific learned knowledge, such as Verbal Reasoning and Numerical / Quantitative / Arithmetic Reasoning. Second, all batteries evaluate reasoning ability either in an abstract form (most) or in the context of verbal / quantitative content. Third, most batteries include some form of spatial and/or mechanical subtest. In most cases, these subtests are intended to assess reasoning in a specific context, either an abstract spatial context or a learning-dependent content of mechanical devices. To our knowledge none of these subtests was designed to measure knowledge of mechanical principles / properties, with the possible exception of PPM’s Mechanical Understanding subtest, which was designed specifically to be a “Performance” test (influenced by learned content) rather than a “Power” test of reasoning. Fourth, the Processing Speed category represents a variety of different approaches to the assessment of cognitive and/or psychomotor processing speed. These are highly speeded tests in that short time limits are allowed to complete as many easy items as possible where each item requires some form of visual inspection or completion of a manual task. In 150 general, these speeded tests were designed to be used either for clerical / administrative jobs requiring rapid processing of less complex information or for manufacturing assembly jobs requiring high volumes of manual construction or placement of pieces. Our overall observation is that these types of subtests are being used less frequently as automation replaces simple tasks and as job complexity, even for entry-level administrative work increases. For example, Pearson recently dropped its Clerical Speed and Accuracy test (and its Spelling test) from its DAT for PCA battery. 
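As a compact way of representing the Table 70 classification, the sketch below shows the five work-related categories (plus the "Other" group) as a simple mapping from category to example subtests. The battery-subtest pairings shown are a small illustrative subset drawn from subtests named in the surrounding discussion; the structure of the mapping, not the particular entries, is the point.

```python
# Illustrative partial mapping of the Table 70 categories (plus the "Other"
# group) to example (battery, subtest) pairs named in the surrounding
# discussion. This is not a complete inventory of Table 70.
WORK_RELATED_CATEGORIES = {
    "Verbal (Crystallized)": [("GATB", "Vocabulary"), ("DAT PCA", "Spelling")],
    "Quantitative (Crystallized)": [("PET", "Quantitative Problem Solving"),
                                    ("GATB", "Computation")],
    "General / Abstract Reasoning (Fluid)": [("DAT PCA", "Abstract Reasoning")],
    "Spatial / Mechanical Aptitude (Mixed)": [("PPM", "Mechanical Understanding")],
    "Processing Speed / Accuracy": [("GATB", "Name Comparison"),
                                    ("DAT PCA", "Clerical Speed and Accuracy")],
    "Other": [("Watson-Glaser II", "Recognize Assumptions"),
              ("OPM USA Hire", "Judgment")],
}

def subtests_for(category):
    """Return the example subtests recorded under a given category."""
    return WORK_RELATED_CATEGORIES.get(category, [])

print(subtests_for("Processing Speed / Accuracy"))
```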
Finally, the Other category contains two distinctively different types of tests, among the reviewed cognitive batteries. First, the Watson-Glaser battery is quite different in many respects from the other batteries. First, it is designed to measure critical thinking skill, which the developers viewed 85 years ago as a learned skill. The current developers continue to adhere to precisely the same construct definition. It is very likely that the most common application of Watson-Glaser has been and is in training / educational programs in which critical thinking skills are being taught. These have included academic programs for training medical professionals (medical diagnosis being viewed as critical thinking), lawyers, philosophers, managers/executives and psychologists among many others. Approximately 2/3 of the predictive validity studies reported above are in educational settings. Second, consistent with the view that critical thinking is a learned skill, the developers continue to systematically include controversial topics in all of the subtests, but especially in Evaluates Arguments. To our knowledge, this is the only non-clinical cognitive battery commonly used for selection that manipulates the affective valence of item content. The presence of controversial content is, in effect, intended to represent a difficult ”critical thinking” problem and is intended to serve the purpose that is often served by difficult items. Our observation from the psychometric and validity information presented above is that it is quite possible the Evaluates Arguments subtest, which deliberately includes the highest percentage of controversial items, may not be a psychometrically adequate subtest especially when Watson-Glaser is used for selection purposes. To our knowledge, no research has been conducted on the incremental value of the controversial topics. Setting aside our concerns about the value of controversial topics, Watson-Glaser might be considered a test of verbal reasoning in which the problem content is highly loaded with business/social/academic content. From that perspective, many of the Watson-Glaser testlets appear to represent the most work-like content of any of the other reviewed batteries, except for the OPM Judgment subtest. The USA Hire website, https://www.usajobsassess.gov/assess/default/sample/Sample.action, shows sample items for all OPM subtests. The sample item for the OPM Judgment test is shown here. 151 Watch the following video. Choose the most AND least effective course of action from the options below. Step 1. Scenario Barbara and Derek are coworkers. Barbara has just been provided with a new assignment. The assignment requires the use of a specific computer program. Derek walks over to Barbara's cubicle to speak to her. If you were in Barbara's position, what would be the most and least effective course of action to take from the choices below? Step 2. Courses of Action Most Effective Least Effective Try to find other coworkers who can explain how to use the new program. Tell your supervisor that you don't know how to use the program and ask him to assign someone who does. Use the program reference materials, tutorial program, and the help menu to learn how to use the new program on your own. Explain the situation to your supervisor and ask him what to do. 
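To make the contrast with rule-scored ability items concrete, the sketch below shows one way a situational judgment item like the sample above could be represented and scored when the keyed answers come from subject matter expert consensus. This is our illustration only; as discussed in the paragraph that follows, we are not certain how OPM actually scores the Judgment subtest, and the scoring rule shown (one point each for matching the keyed most and least effective actions) is simply a common SJT convention. The keyed indices are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SJTItem:
    """A situational judgment item with keyed most/least effective actions.
    The keys are hypothetical, shown only to illustrate SME-consensus scoring."""
    scenario: str
    actions: list
    keyed_most_effective: int   # index agreed on by subject matter experts
    keyed_least_effective: int

    def score(self, chosen_most, chosen_least):
        """One point for matching each keyed answer (a common SJT convention)."""
        return (int(chosen_most == self.keyed_most_effective)
                + int(chosen_least == self.keyed_least_effective))

item = SJTItem(
    scenario="You are given an assignment requiring a program you have not used.",
    actions=[
        "Find other coworkers who can explain how to use the new program.",
        "Ask your supervisor to assign the work to someone who knows the program.",
        "Use the reference materials, tutorial, and help menu to learn it yourself.",
        "Explain the situation to your supervisor and ask him what to do.",
    ],
    keyed_most_effective=2,   # hypothetical SME key
    keyed_least_effective=1,  # hypothetical SME key
)
print(item.score(chosen_most=2, chosen_least=1))  # prints 2
```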
Second, the OPM Judgment subtest is a situational judgment test for which, we believe, several versions have been developed where each version is in the context and content of work problems characteristic of a particular job family. The distinction between the OPM Judgment subtest and the Watson-Glaser work-oriented critical thinking items may be largely in the definition of the correct answer. Watson-Glaser, like most other tests designed to measure cognitive ability defines the correct answer by applying objective rules of inference / deduction such that the correct answer is specified by rule, not by judgment. Even if imprecision is introduced into Watson-Glaser items by careful language or deliberately incomplete or vague information, it is our understanding that the correct answer is found by applying specifiable rules of logic, not by appealing to subject matter expert judgment. In contrast, it is possible – we are not certain – that the correct answers to OPM’s (situational) Judgment subtest are specified by the concurrence of subject matter experts. If this is the case, then the OPM Judgment test is much closer to a job knowledge test than is the WatsonGlaser test, where the judgment required by the OPM test is judgment about the application of specific job information and experience to a specific job-like problem. Overall, we do not recommend that MCS develop situational judgment tests specific to particular job families / roles. The primary reasons are (a) not enough validity is likely to be gained to be worth the substantial additional cost required to maintain such tests and (b) such tests place significant weight on closely related previous experience, which may not be the best use of cognitive testing. 152 Other than our concerns about Watson-Glaser’s distinctive content and OPM’s use of a situational judgment test requiring (apparently) subject matter expert definitions of correct answers, we judge all the subtests in the reviewed batteries as having approximately equally well-founded construct foundations, even if they are not directly derived from theoretical taxonomies of cognitive ability. The diverse batteries have equally well-founded construct foundations for three basic reasons. First, their construct foundations are very similar. All assess verbally oriented abilities, quantitatively oriented abilities, except Watson-Glaser, and reasoning either abstract or learning-loaded or both. Also, most assess processing speed and accuracy and spatial / mechanical abilities. However, both of those construct categories may have little relevance to MCS’s civil service system due to the lack of job families that are typically associated with these specific abilities. Second, the long history of predictive validity evidence has shown that virtually any combinations of 3 or 4 different cognitive abilities tests produce maximum validity levels because the composite measures formed by such groups effectively assess general mental ability. Third, job analytic methods for analyzing ability requirements for at least moderately complex jobs will virtually always identify verbal abilities and reasoning/decision making/problem solving/critical thinking abilities, whether abstract or learningloaded. Our view expressed above that it would not be useful to attempt to reconceptualize the constructs underlying cognitive tests used for selection has an important exception. 
Postlethwaite (2012) conducted an extensive set of meta-analyses of work-related criterion validities for measures of fluid ability (Gf) and crystallized ability (Gc). These analyses demonstrated that crystallized ability shows higher levels of criterion validity than fluid ability. The likely explanation is based on two primary factors. First, job/training performance is most directly affected by job/training knowledge. Cognitive ability predicts performance because it enables learning of job/training knowledge. Crystallized ability captures the ability to perform using learned information. Presumably measures of job-related knowledge are more directly predictive of job/training performance. Second, Gc measures of acquired knowledge also capture variance due to differences in effort, motivation and disposition to learn the content. Such variation is also expected to add incremental validity to the relationship between cognitive ability alone and job/training performance. Because this theory-based distinction relates to criterion validity differences it is an important construct distinction to consider when evaluating the constructs underlying the reviewed batteries. In order to evaluate the extent to which the reviewed batteries include measures of fluid ability and crystallized ability, Figure 1 displays each subtest included in the reviewed batteries for which a determination could be made by the authors of this Report of each subtest’s Gf – Gc characteristic. Subtests designed to measure processing speed are excluded from this Figure. (Note, Figure 1 also displays information about work-related item content which is discussed below.) In Figure 1, subtests shown at the top are judged to be measures of crystallized ability; subtests listed at the bottom are judged to be measures of fluid ability. Within the groupings at the top (crystallized) and bottom (fluid) the order from top to bottom is arbitrary. We do not intend to convey that subtests measure crystallized or fluid ability to some degree. Rather, following Postlethwaite (2012), subtests are classified as measuring either crystalized or fluid ability. 153 Two key points can be made about the distribution of Gc and Gf among these subtests. First, both are commonly represented among the reviewed batteries with three times as many Gc measures as Gf measures. By design, PPM contains a higher percentage than other batteries of Gf subtests. Watson-Glaser, PET and OPM batteries contain no Gf subtests. Regarding this point, it is interesting to note that both PET and Watson-Glaser are designed for and commonly used for higher level jobs and positions such as manager, executive and professional which might be expected to place more weight on more general, broadly applicable reasoning and thinking abilities. Certainly, both PET and Watson-Glaser are designed to assess such thinking skills but they accomplish that purpose with item content that is high complexity verbal content using moderately work-like content, except for PET Quantitative Problem Solving, which contains moderately complex arithmetic computation problems. Second, the PPM distinction between Power tests and Performance tests, which is similar to the Gf Gc distinction, does not align exactly with the Gf – Gc distinction. 
In particular, (a) the Spatial Ability subtest, which PPM identifies as a Performance subtest is a Gf measure (novel 2-d figures requiring mental rotation), and (b) the Verbal Reasoning subtest, which PPM identifies as a Power subtest, is a Gc measure (complex written paragraphs of information from which information must be derived). Our own view of this misalignment is that it is simply a misclassification by the PPM developer. It is clear the developer intended Power tests to be very much the same as Gf tests, describing Power as “…reasoning, where prior knowledge contributes minimally…” and Performance tests to be closely aligned with Gc, describing Performance as “…relates more strongly (than Power) though not 154 inevitably, to the presence of experience…”. On this point, it is worth noting that verbal content alone does not imply a Gc measure. A subtest using verbal content using simple words familiar to the testtaker population can be a Gf measure. Overall, the reviewed batteries, including the briefly referenced OPM subtests, are representative of an effective overall construct strategy that places more emphasis on subtests measuring crystallized ability than on fluid ability subtests. However, it is important to note that most batteries also provide Gf subtests. While none of the publishers have provided an explicit rationale for this common decision about target constructs, it is likely that this decision stems from the intention to develop a battery of a modest number of subtest that are viewed applicable across a very wide range of job families. Subtests of fluid ability may be viewed as supporting this objective of having broadly applicable subtests. Item Content Considerations Separate from decisions about target constructs, subtests also vary with respect to item content. For the purposes of this project evaluating employment tests, the two primary content variables are (a) work relatedness, and (b) level. Developers also often distinguish between verbal and non-verbal content but that distinction sometimes means the same thing as the distinction between learned and novel / abstract content. For example, developers sometimes classify arithmetic subtests as “verbal”. For our purposes in describing the variation in item content, we will treat the distinction between verbal and non-verbal as, effectively, the same as the Gc – Gf distinction. Level of content can have two separate but related meanings (a) reading level as indexed by some standardized reading index and (b) content complexity or difficulty. For example, SHL varies the level of Verify subtest content depending on the level of the target role/job family by changing the range of item theta values included in the subtest. Higher level subtests include items within a higher range of theta values. Work Relatedness of Content To describe the variation in item content, Figure 1’s horizontal axis represents the range of workrelatedness from work-neutral content on the left to work-like content on the right. (Level will be treated separately because level can be difficult to “see” in item content.) A high level of work-like content can vary considerably depending on the nature of the target job family. For example, a test of mechanical understanding could have a high level of work-like content by relying on the machines, tools, and equipment used in the target job family. In contrast, a high level of work-like content for a Reasoning test targeting an upper management role might include complex business problems. 
It should be noted, however, that among these reviewed batteries, no subtest has such a high level of work-like content that items are samples of specific work tasks performed in a target job. Rather, even the most work-like items among these batteries are abstracted from and described in a more general form than would be the case for an actual job task. The clear reason for this is that all of these reviewed batteries, except Verify, are designed to be suited to a range of jobs and positions and these ranges without tailoring the content of subtests to different job families. Verify tailors subtest content to target job level by adjusting the theta range for items that may be sampled from the item bank. Higher theta values are sampled for higher level jobs. (Also, although the OPM subtests were not formally reviewed in this project, it appears that OPM developed different versions of each subtest in which tem content was tailored to specific job families. None of the reviewed batteries tailor subtest content in this manner.) A number of key points can be made about the representation of item content along the horizontal axis. First, by definition, no work-like content is present in any subtest measuring fluid ability. We assume all work-like content would be learned content. Certainly, it’s conceivable that novel content (Gf) for one applicant population may be learned, work-like content (Gc) for another. For example, two-dimensional drawings of shapes in a spatial reasoning test could constitute job knowledge for, say, a drafting job in an architectural firm. But for such a population of test takers, this type of content 155 in a spatial reason test would be a measure of Gc, not Gf. But this would be the rare exception. In almost all cases, spatial reasoning item content consisting of two-dimensional drawing of shapes would constitute a measure of fluid ability. Second, among the subtests measuring crystallized ability there are, roughly, three tiers of item content as it relates to work content. Tier 1: Work-Neutral Content. Item content is unrelated to work contexts and has no appearance of being in a work context. A common example is subtests designed to measure computation skills. Like OPM Math, DAT Numerical Ability, EAS Numerical Ability, PPM Numerical Computation, PET Quantitative Problem Solving, and GATB Computation, these subtests present arithmetic problems in addition, subtraction, multiplication, division usually with no other context. Even if some (many?) jobs require computation, these context free arithmetic problems typically are not designed to sample arithmetic problems in the workplace. Similarly, many tests based on knowledge of word meaning such as vocabulary tests (e.g., GATB Vocabulary, PPM Verbal Comprehension) choose words based on overall frequency of use, rather than frequency of use in work settings. Subtests with work-neutral content are listed on the left. Tier 2: Work-Like Context. Item content is placed in a work-like context but the work context is irrelevant to the application or definition of the skill/ability required to answer the item correctly. An example is the Verify Numerical Ability subtest designed to measure deductive reasoning with quantitative information. SHL places the quantitative information in a work-like context such as an excel file with work-related labels on rows and columns and perhaps asks the deductive reasoning question about selected quantities in the language of a business problem. 
For example, “Which newspaper was read by a higher percentage of females than males in Year 3.” In such items, the context is work-like but the work-like context is unrelated to the skill/ability required to answer correctly. That is, the experience of having learned to perform a job has no distinctive relationship to the skill/ability required to answer the question. Subtests with irrelevant work content are shown in the middle “column” under crystallized ability. Tier 3: Relevant Work Content. Information / knowledge learned to perform work influences the ability to answer correctly. None of the reviewed batteries includes Tier 3 work content. Only OPM Judgment and, probably, OPM Reading appear to be influenced by learned work content. Common examples of such tests are situational judgment tests (SJT) and, in the extreme, job knowledge tests. Our evaluation of the OPM Judgment test is that it is an SJT designed to be relevant to a family of related jobs. Different job families would require different SJTs. The conspicuous disadvantage of SJT’s, like job knowledge tests, is that they require periodic monitoring for currency. Job changes may require SJT changes. Also, SJT’s usually require expert consensus to establish the scoring rules and become a surrogate measure of closely related job experience. Watson-Glaser has been positioned between Tiers 2 and 3 although not with great confidence. It is speculation on our part that work experience in higher level roles and in roles requiring complex decision making, diagnoses, risk evaluations and the like will enable higher levels of performance on Watson-Glaser tests. This may also be true across all types of content ranging from business contexts, school contexts and media contexts. In any case, an important point to make about Watson-Glaser is that the same item content has been in use for decades (updated from time to time for currency and relevance, but with precisely the same construct definition) and has been applied across the full spectrum of management, executive and professional domains of work. There is no evidence to suggest it has been more (or less) effective within the range of roles or job families in which it has been applied. 156 Using the Figure 1 representation of construct – content characteristics of cognitive tests appropriate for employment purposes is that the subtests in the Gc –Tier 2 category are likely to be the most appropriate, efficient and useful prototypes for subtest development where MCS has an interest in user acceptance of test content as reflecting work-related contexts. It should be clear that choosing Tier 2 content over Tier 1 or, for that matter, over Tier 3 content will have little consequence for criterion validity in job domains that are not highly dependent on specialized job knowledge that can be learned only on the job or in training provided by the hiring organization. Also, Tier 2 content can be developed without requiring special task-oriented or worker-oriented job analyses as would be required to develop Tier 3 content. Tier 2 content does not require precise descriptions of important job tasks or required worker abilities. Often available information about major job tasks/responsibilities/duties can provide appropriate information for building Tier 2 content. Also, the involvement of job experts can provide useful ideas about Tier 2 work information. 
This evaluation that Tier 2 content may be optimal for certain subtests does not mean we recommend that all tests be targeted for Tier 2 content. (More detail about this point will be provided in the Recommendations section.) There would be no loss in validity for choosing Tier 1 content, especially for subtests intended to be used across all job families, as might be the case with Gc assessments of reasoning / problem solving.

Level of Content

The distinction among tiers of content described above is a distinction in the manner in which work content affects item characteristics. This section is about level of content, which represents two different aspects of item content. The first is the reading level of item content and the second is the job level. Table 71 describes these two features of content level for the seven reviewed batteries.

Table 71 shows that four of the seven batteries specify a target reading grade level as part of the item specifications. DAT PCA and GATB were developed at a 6th grade reading level, Watson-Glaser at a 9th grade level, and PET at a college graduate level. Notably, the four batteries that are intended to be used across the full range of work are all designed with virtually no work content embedded in the items. As Figure 1 shows, only the DAT PCA Mechanical Reasoning subtest among all DAT PCA, PPM, EAS and GATB subtests contains Tier 2 work content. (OPM is not included in Table 71 because we have no information about reading level.) Clearly, the test developers avoided embedded work content that could affect the level of item content for those batteries intended to have broad applicability across the full range of jobs.

Table 71. Content level information for reviewed batteries.

Battery | Reading Level (US) | Job Content Level
PPM | Not specified | Virtually no job content, but intended to be appropriate for all job families
Watson-Glaser | Up to 9th grade | Manager / Professional
Verify | Not specified | Manager / Professional
EAS | Not specified | No job content, but intended to be appropriate for the full range of work from entry-level wage work to managerial / professional
DAT PCA | 6th grade | No job content, but intended to be appropriate for the full range of work from entry-level wage work to managerial / professional
GATB | 6th grade | No job content; intended to be appropriate for "working adults"
PET | Up to 16th grade (Bachelor's) | Manager / Professional

An important consideration regarding choice of item level is item and test discrimination. In general, selection tests should be designed to have maximum discrimination value near the critical level of ability that distinguishes those selected from those not selected. In IRT terminology, this would mean that the peak of the Test Information Function (TIF) should be in the critical theta range where the distinction between selected and not selected is located. In CTT terms, this means that the conditional SEM should be lowest in that same critical range of ability. The optimal overall item and test development strategy would be to develop items with difficulty levels in the neighborhood of this critical range. This would mean that tests with high selection rates would be most effective if they were relatively easy. Tests with low selection rates should be relatively difficult. When a cut score is used and the test is the only consideration for that step of the hiring process, then this critical narrow range is known precisely.
But when the test is part of a composite and there is no precise cutoff score applied to that composite, as will be the case with the MCS process, this critical test score range can easily become very wide. That is, the critical range of test scores above which every applicant is selected and below which no applicant is selected can easily be more than 50% of the range of all test scores. A practical implication is that it is very likely the TIF curves for MCS subtests should peak near the middle of the range or be relatively flat across the range of ability. One method for influencing the precision of test scores in the desired range is to specify desired levels of item difficulty. Selecting the target level(s) of item content is one mechanism for manipulating item difficulty in the applicant pool. Choice of reading level is one way of influencing item difficulty; level of content is another. This is perhaps the best argument for tailoring level of subtest content to level of job family. It is not about criterion validity as much as it is about selection accuracy. Considerations such as reading level and content level help produce items measuring ability levels in the range of the selection decision.

Although SHL does not address this issue, the question can be raised whether variation in reading level would introduce construct-irrelevant variance, thus reducing the validity of the test. Overall, for the two primary reasons below, we believe this would not threaten the predictive validity of selection tests like those reviewed in the Project and under consideration by MCS.

1. Once reading has been introduced to item content, it influences test scores whether it is varied a little or a lot. Once reading is required for item performance, variation in reading ability among applicants ensures that reading will lead to variation in test scores. Increasing the variation in reading levels is unlikely to increase the extent to which reading adds variance. Indeed, it is conceivable that increased variation in item reading levels could reduce the role of reading in item scores.

2. Cognitive tests designed for personnel selection use are usually not intended to measure theoretically singular ability constructs. The recommendations below do not suggest that MCS develop theoretically singular subtests. Rather, subtests used for selection often capture more than one narrow cognitive construct in an effort to maximize predictive validity. Indeed, the addition of reading to item content is likely to improve subtest validity for jobs in which reading is a significant requirement. (Note: in the US, the primary issue with reading content in cognitive selection tests is whether the additional variance it contributes also increases group differences such as white-black differences. In that case, researchers examine the consequences of eliminating reading content, usually by evaluating whether any loss in predictive validity is small enough to be outweighed by the gains resulting from smaller group differences.)

The question of reading's impact on dimensionality is separate from reading's likely positive (or neutral) impact on predictive validity. This is a complex issue, but the overall result with selection tests like Watson-Glaser and Verify is that reading content generally does not violate the common IRT assumption of unidimensionality enough to invalidate the useful application of IRT. Likely, this is for two primary reasons.
First, professionally developed selection tests generally avoid introducing reading content where it would be markedly inappropriate, such as in subtests designed specifically to assess Perceptual Speed and Accuracy or Abstract Reasoning. Second, where reading content is included, it is likely sufficiently positively correlated with the other targeted abilities that it does not violate the unidimensionality assumption too much.

SHL takes a unique approach to this issue of content level with its Verify battery. Within each subtest, Verify item content is tailored to the job level by selecting items from within higher or lower ranges of theta values. That is, the level of item difficulty shifts up (or down) to accommodate higher (or lower) job levels. This probably benefits criterion validity very little but is presumed to benefit decision accuracy among "close call" applicants.

For MCS's civil service testing system, which will be used mainly by college-level applicants for service-sector administrative and managerial jobs, the 6th grade (US) reading levels in GATB and DAT PCA are too low. Reading levels more similar to those of Watson-Glaser (up to 9th grade, US) and Verify (up to 16th grade, US) will be more appropriate and will better ensure discriminating test scores near the middle of the ability range associated with the level of ability required to perform successfully in the target job family/role.

Approaches to Item / Test Security and Methods of Delivery

A major consideration for MCS will be its approach to minimizing and protecting against threats to item and test security. The key decision will be whether tests shall be administered online in unproctored environments. We are aware that MCS's plan is to provide for the administration of the tests in managed test centers and that the administration of the tests in those centers will be proctored, whether in online or paper-pencil mode. MCS views online, unproctored testing as a very remote possibility and is not requesting evaluations or recommendations relating to unproctored testing at this point.

Current Status among Publishers of Reviewed Batteries

All the publishers of the reviewed batteries provide for online delivery of these batteries. In most cases, there are explicit strategies in place for managing unproctored administration of those online batteries. For example, SHL and Pearson provide for randomized forms to be delivered online, proctored verification testing for competitive applicants, and detection methods for comparing the unproctored and proctored results to evaluate the likelihood of cheating. At the other end of the spectrum, Hogrefe (PPM) provides for online administration but does not authorize unproctored delivery. However, they are aware that unproctored delivery occasionally occurs, and they have no systematic strategy for detecting possible consequences for item characteristics. Overall, Pearson and especially SHL have demonstrated a strong commitment to "best practices" and adherence to professional standards (ITC, 2001) with regard to maximizing security for unproctored testing. (We make this point while acknowledging that there is not a professional consensus on unproctored administration of employment tests. Many professionals hold the position that unproctored testing is not professionally ethical because it knowingly creates an incentive and opportunity for increased cheating.)
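The comparison of unproctored and proctored verification scores mentioned above can be illustrated with a simple statistical screen. The sketch below flags a verification score that falls implausibly far below the original unproctored score, given the standard error of the difference between two scores. It is only an illustration of the general logic; it is not SHL's or Pearson's actual detection algorithm, and the threshold and SEM value used are assumptions.

```python
import math

def flag_discrepancy(unproctored, proctored, sem, z_threshold=1.65):
    """Flag a proctored verification score that is implausibly lower than the
    earlier unproctored score, using the standard error of the difference
    between two scores with equal SEMs: SE_diff = sqrt(2) * SEM."""
    se_diff = math.sqrt(2) * sem
    z = (unproctored - proctored) / se_diff
    return z > z_threshold, round(z, 2)

# Illustrative values: a 40-item test with an assumed SEM of about 2.2.
print(flag_discrepancy(unproctored=34, proctored=27, sem=2.2))  # (True, 2.25)
print(flag_discrepancy(unproctored=34, proctored=31, sem=2.2))  # (False, 0.96)
```

In practice, a flag of this kind would trigger a follow-up process rather than an automatic rejection, consistent with the publishers' recommendation that discrepant results be managed without presuming the test taker cheated.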
SHL's and Pearson's practices will be of interest to Sabbarah because many of them will also be useful for managing the long-term security of MCS's proctored online testing process.

Methods for Managing Item / Test Security

Given that unproctored testing is a very remote possibility, the implications for best practices from our review of these seven batteries are ambiguous because the primary focus of the publishers' attention to cheating is online, unproctored testing. Nevertheless, we will briefly describe the components of the item / test security strategy SHL has actively promoted, because many will be applicable and useful even for proctored test administration. SHL describes a security strategy, fully in effect with Verify, that has two primary components: (1) tactics within the context of the test administration process and (2) tactics external to the administration process. Most of these could be applied even for proctored administration, although the risk of cheating is likely reduced by the proctoring itself.

Tactics Internal to Test Administration

1. Avoid the use of single fixed test forms. This can be accomplished in a number of ways.

a. At a minimum, this may take the form of randomizing the order in which a fixed set of items is presented. This simple step can deter collaborative approaches to cheating in which several conspiring test takers remember items in a predetermined range, for example, items 1-5 for conspirator 1, and so on.

b. Randomize the selection of items into tests administered to individual test takers. This process typically, but not necessarily, relies on (a) IRT calibration of all items in a large pool and, often, (b) content-based random sampling of items from within narrow clusters of items organized in a multilevel taxonomy of item content. This process is most efficient when the IRT requirement of unidimensionality is satisfied at the subtest level at which scores are used to inform selection decisions. Pearson's investigation of this requirement with the new, shorter Watson-Glaser forms found that the assumption of unidimensionality was marginally satisfied in the full pool of all Watson-Glaser items and produced more reliable forms than a model using different IRT parameterization for different subscales. SHL has chosen not to share additional information about their investigation of the same issue for Verify. However, the publicly available information implies that the IRT parameterization was performed within each of the first three item banks, Verbal Ability, Numerical Ability and Inductive Reasoning. This suggests that SHL's randomized forms approach does not require unidimensionality across these three subtests but only within each subtest.

c. Adapt each test to a set of items tailored to the particular test taker through some form of multistage or adaptive testing model. While this approach may produce more reliable scores with fewer items, there is no reason to believe it produces more secure tests than randomized forms.

d. For unproctored administration, develop a system of proctored verification tests following positive unproctored test results, and develop procedures for following up on discrepant results that do not assume guilt and do not inadvertently disadvantage the test taker.

2. Communicate clearly to test takers about their responsibility to take the test honestly and require them to sign a "contract" that they agree to do so.

3. Use technology advances to verify the identity of test takers, for example:
a. Thumb print, retinal eye pattern or keystroke pattern signature.

b. Remote monitoring through CCTV / webcam capability.

Tactics External to Test Administration

1. Actively manage and monitor the item bank.

a. Monitor item parameters for evidence of fraud-related drift in parameter values.

b. Manage item exposure by exposure rules, use of multiple banks, and/or regular development of new items.

c. Use data forensics to audit test taker response patterns for unusual patterns of responses associated with fraudulent attempts to complete the test. Data forensics benefits from the capability to time keyboard responses in order to detect unusually fast and unusually slow responses.

2. Monitor web sites with "web patrols" searching for indications of collaborative efforts to collect and disseminate information about the test.

3. Provide comprehensive training, certification and monitoring of test proctors. Even without unproctored administration, the risk of collaborative pirating of test information increases with the frequency with which tests are administered and the number of administratively diverse testing centers.

Item and Test Development

The review of seven batteries revealed three broad, strategic approaches to item and test development – bank development, item accumulation and fixed form production. At the same time, there appeared to be several characteristics in common across all three strategies.

Bank Development Approach

SHL (for Verify) and Pearson (for the UK unproctored Watson-Glaser) deliberately developed items in order to build large banks with diverse items, even within subtest-specific banks. This approach was dictated by the objective of using randomized forms for each test taker to support unproctored online administration. In this approach, there is a distinctive requirement to develop a large number of items and pre-test those items in large samples to provide stable estimates of IRT parameters. Also, for both Verify and Watson-Glaser there was a requirement to develop diverse items within subtests (Verify) or subscales (Watson-Glaser). For each subtest, SHL required that a wide range of item difficulty/complexity/level be developed so that each subtest item bank could accommodate the SHL approach to tailoring online tests to the level of the job family for which the subtest was being administered. (SHL creates randomized forms, in part, by shifting item difficulty up or down for higher or lower level job families.) Also, SHL specifies diverse work-like item content for Verbal Reasoning, Numerical Reasoning and, possibly, Reading Comprehension so that items bear some similarity to work content in the job families for which these subtests may be used. In contrast, SHL does not require diversity of work-like content for the other subtests, which have either abstract item content or work-neutral content.

Pearson also requires diversity of item content within the five subscales (Recognize Assumptions, Evaluate Arguments, Inference, Deduction and Interpretation, the last three of which form Draw Conclusions). But it is a different type of diversity. First, there does not appear to be any instruction or desire to significantly vary item difficulty/complexity/level within a subscale, because Watson-Glaser does not systematically rely on item difficulty to construct randomized forms. Pearson does not report the instructions / specifications provided to item writers that define the target level(s) for items.
Watson-Glaser does require that, for each subscale, item content be developed for a variety of common contexts including business, school, and newspaper/media. This is an important consideration because Watson-Glaser form construction requires that a minimum percentage of items within a randomized form be business-related. In addition, the Watson-Glaser test requires a portion of passage-items to be controversial and a portion to be neutral. This portion is deliberately different between subscales, with Evaluate Arguments based on the largest portion of controversial passage-items. Unfortunately, Pearson does not report any item development methodology to ensure that the controversial / neutral quality of draft items is actually perceived by test takers.

Both SHL and Pearson rely on empirical pre-testing of new items and make early decisions about retaining new items based on classical test theory psychometrics – item discrimination indices, item difficulty and distractor properties, among other considerations. Pearson relies heavily on experienced expert reviews of draft items to better ensure they satisfy the specific critical thinking facets intended for each subscale. After review and psychometric pre-testing, SHL and Pearson estimate IRT parameters in large samples of test takers. A key consideration is whether the items within a bank satisfy the IRT requirement of unidimensionality. Pearson's analysis showed that the unidimensionality assumption was satisfied sufficiently within the entire bank of items assessing all subscales. SHL, however, has not provided reports of this level of psychometric detail. Presumably, the unidimensionality requirement is applied only within subtest banks, not across subtest banks.

For both Pearson and SHL, the initial bank development for their unproctored strategy was seeded with existing items that had been used in other batteries containing similar subtests. The pace at which these banks become unique sets of items used only for unproctored administration is not clear.

Item Accumulation Approach

The interview with PSI's psychologist, Dr. Brad Schneider, disclosed an item development strategy, especially for EAS if not PET, that was not clear from technical manuals or other documents. We call this the item accumulation approach. The key feature of the item accumulation approach is that as PSI develops and tests new items, they are stored in accumulating banks of items, many with long psychometric histories. These banks of items are then used periodically to replace current fixed forms of subtests with new fixed forms without the process of identifying and introducing each new form. This approach to changing fixed forms is transparent to users/reviewers. It appears that this type of transparent form change occurs more frequently with PSI cognitive tests, including EAS and PET, than with other reviewed batteries such as GATB and PPM.

In spite of the large banks of accumulated items, many or all of which have estimated IRT parameters used to construct new fixed forms, PSI has not implemented any version of randomized forms or computer adaptive testing in support of their online administration platform. Rather, at any point in time EAS is administered with three fixed forms. One is reserved for proctored administration, another for unproctored online administration in the US, and a third for unproctored administration outside the US, whether online or paper-pencil. Clients who choose unproctored administration are made aware that the form they will use is not a secure form and that there is some risk that scores may be distorted.
This incremental approach to item development and bank accumulation allows PSI to capitalize on an item writing strategy in which new items are written to be parallel to specific existing items. The production of item clones enhances confidence in the parallelism of the resulting new forms. (It is not known whether PSI uses currently available software to automatically generate new items using highly specifiable item construction rules. SHL currently uses this approach at least for item types that are highly specifiable, such as arithmetic computation items; a toy illustration of such rule-based generation appears at the end of this discussion of development approaches.) Also, it appears that even though item banks accumulate over time, PSI follows an episodic item development approach by which new items are developed only when there is a recognized need for new forms and the existing bank may not be sufficient to provide all the needed items.

Fixed Form Production Approach

Publishers for the remaining batteries, PPM, GATB, proctored Watson-Glaser, and, possibly, DAT PCA, rely on a traditional fixed form approach to item and test development. The histories of PPM, GATB and proctored Watson-Glaser are similar in that one or two forms have remained in place for 10-20 years, or more, before they are replaced with one or two new forms. DAT PCA clearly has been revised more frequently, especially more recently, and may be managed currently in a way more similar to the Item Accumulation Approach, but this is unclear. With the possible exception of DAT PCA, this approach has led to infrequent episodes of item writing. The extreme case is PPM where, it appears, no new form has been developed since its introduction in the mid-to-late 1980s. GATB and proctored Watson-Glaser both date back to the 1940s for their original two forms, and both have had two or three pairs of new forms developed since then. Occasionally, the introduction of the infrequent new form has been the opportunity for improvement or change. For Watson-Glaser, each new pair of forms was used to shorten the tests and address psychometric limitations of earlier forms. For GATB, the development of Forms E and F was the occasion to eliminate Form Matching, reduce length and eliminate sources of race/ethnic and cultural bias.

Episodic, fixed form production invites an item development strategy that rests heavily on writing new items to be parallel to specific existing items. The development of items for GATB Forms E and F relied heavily on this approach, but the development of new forms for Watson-Glaser did not. Indeed, a primary emphasis in the development of Forms D and E was to update the content of the passages and items to be relevant to current populations globally. However, these targeted improvements did not lead to any change in the target constructs or definition of critical thinking. Precisely the same definition of critical thinking was specified.
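The toy sketch below illustrates rule-based ("automatic") item generation for one highly specifiable item type, arithmetic computation, of the kind referred to above. It is a hypothetical illustration, not SHL's or PSI's actual generator; the operand ranges, operations and distractor rules are assumptions made only to show the idea of producing parallel item "clones" from a construction rule.

import random

def generate_computation_item(rng, max_operand=99):
    # One construction-rule family: two 2-digit operands, one operation, four options.
    a, b = rng.randint(10, max_operand), rng.randint(10, max_operand)
    op = rng.choice(["+", "-", "*"])
    key = {"+": a + b, "-": a - b, "*": a * b}[op]
    # Distractors follow simple, specifiable error rules (off-by-one, off-by-ten).
    distractors = sorted({key + 1, key - 1, key + 10, key - 10})[:3]
    options = distractors + [key]
    rng.shuffle(options)
    return {"stem": f"{a} {op} {b} = ?", "options": options, "key": key}

rng = random.Random(7)
parallel_clones = [generate_computation_item(rng) for _ in range(5)]

Because every clone is produced by the same rule, the resulting items can be expected to be approximately parallel, although empirical pre-testing would still be needed to confirm their psychometric equivalence.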
Common Item Development Characteristics

While the contexts and objectives for item development have varied considerably among the seven reviewed batteries, certain characteristics of item development were common across most batteries.

Test Construct Selection

With the notable exception of Watson-Glaser, the publishers made remarkably similar choices about the target test constructs. As shown in Table 70 above, excluding Watson-Glaser, five or six of the remaining six batteries developed at least one subtest in each of the following construct categories. (No category represents a theoretically single or narrow cognitive ability, as if it were sampled from the CHC framework. Rather, each represents a class of related cognitive abilities with certain features in common that have an empirical history of predicting work performance.) (Note: we acknowledge that this commonality may, in part, be due to the criteria by which these batteries were selected. They may not be representative of the cognitive tests in common use for employment purposes. However, they were not explicitly selected to be similar.)

Verbal Ability. All batteries explicitly developed one or more subtests measuring some verbal ability. Two primary types of verbal ability were selected: (a) verbal knowledge such as vocabulary, spelling, comprehension, and reading, and (b) reasoning about verbal information, which requires some form of logical reasoning applied to information presented in moderately complex verbal content.

Quantitative Ability. All batteries explicitly developed one or more subtests measuring some quantitative ability. As with verbal ability, two primary types of quantitative ability constructs were selected: (a) knowledge of some facet of arithmetic, and (b) reasoning about quantitative information.

Fluid Reasoning Ability. Other than GATB, all batteries included at least one subtest measuring fluid reasoning ability. In most cases this was some form of reasoning about relationships among abstract symbols/figures.

Spatial / Mechanical Ability. Other than PET, all batteries include a subtest measuring some form of spatial / mechanical ability. None of the reviewed subtests in this category (see Figure 1) were developed to be job knowledge tests, but some will be more influenced than others by mechanical knowledge. We acknowledge that there is conceptual overlap between this category and Fluid Reasoning Ability. We distinguish them for purposes of Study 2 because the common rationale for using Spatial / Mechanical Ability subtests is quite different from the rationale for using subtests measuring Fluid Reasoning Ability. The distinction lies in job characteristics. Spatial / Mechanical subtests are ordinarily used only for job families involving the use of equipment, tools, or mechanical devices, where trade skills may be required. In contrast, subtests of Fluid Reasoning Ability are most often used with higher level jobs that require decision making with complex information.

Processing Speed / Accuracy. We include in this category perceptual speed and psychomotor speed / dexterity. Like Spatial / Mechanical Ability subtests, these are commonly used only for specific job families including, mainly, clerical/administrative jobs and manual tool use jobs.

The point of this description is to demonstrate a framework for organizing types of cognitive abilities that is commonly used with cognitive ability employment tests.

Item Content

For job-oriented employment tests, an important decision about items is the extent to which, and the manner in which, work-like content will be introduced into the items. Figure 1, above, and the accompanying discussion describe the large differences among cognitive employment tests with regard to item content. Even though cognitive ability employment tests typically focus on a small set of common constructs, their item content can be quite different.
Certainly commercial interests play an important role. Subtests with work-neutral content may be more acceptably applied across a wide range of job families. Subtests with job-specific content may be more appealing for specific job families. Until recently, when Postlethwaite (2012) demonstrated that crystallized ability tests are more predictive of job performance than fluid ability tests, there was not a clearly understood empirical rationale for preferring more or less job-like content in cognitive ability tests. As it has happened, it appears that the majority of subtests contain learned content, whether work-neutral or work-relevant, and presumably benefit from the advantage of crystallized ability. In spite of the recent criterion validity results relating to crystallized ability, we anticipate that commercial interests will continue to play the most influential role in the decision about the work relevance of item content.

Use of CTT Psychometrics for Empirical Item Screening

Although some batteries do not provide this information, both the Verify and Watson-Glaser batteries, which rely heavily on IRT psychometrics to manage banks and construct randomized forms, report relying on CTT psychometrics for the initial screening of pre-tested items. This is likely to continue to be the common practice for commercial tests because CTT estimates do not require samples as large as IRT estimates do.

Culture / Sensitivity Reviews

A common practice continues to be the use of culture experts to review draft items for culturally/racially/socially sensitive content that could introduce invalid, group-related variance into item responses.

Measurement Expert Reviews

Some publishers use an additional round of item review by item development experts to better ensure item quality prior to empirical pre-testing. This approach appears to be most important for novel, complex items such as those in Watson-Glaser, which relies heavily on expert reviews.

Specification of Item Level/Difficulty/Complexity

No publisher provided information about the manner in which item writers were instructed regarding the targets for item level/difficulty/complexity. Nevertheless, this will always be an important process requirement for the development of items and subtests appropriate to the target job families and role levels. Specified reading level, which some batteries used, may be a part of this consideration. But other considerations, such as the level of important job tasks and applicant pool characteristics such as education level, should also inform the decisions about target levels/difficulty/complexity of item content.

IRT-Based Test Construction

Although few technical manuals provided information about fixed-form test construction, the interviews with publishers' industrial psychologists revealed that most of the publishers rely on IRT psychometrics to construct and describe whole subtest characteristics. This is certainly true for the Verify subtests and the unproctored Watson-Glaser in the UK, but this approach is also used by PSI for EAS and PET subtests. It is also likely that this approach is used for DAT PCA, but we have not been able to confirm that. This frequent practice indicates that IRT methods are used in test construction even where tests/subtests use number-correct scoring or other manual scoring procedures. This approach has a number of advantages: it provides a test information function (TIF) for any combination of items, which allows poorly performing or corrupted items to be easily replaced with other items that maintain the same TIF characteristics; it allows test characteristics to be shifted by replacing items to improve precision near the critical score range that differentiates hired from not-hired applicants; and it enables the rapid production of equivalent new fixed forms as the item bank accumulates more items.
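The brief sketch below illustrates the CTT screening statistics described above (item difficulty as proportion correct and the corrected item-total correlation) and the 3PL test information function just mentioned. The retention thresholds and the parameter values are assumptions for the sketch only, not any publisher's actual rules.

import numpy as np

def ctt_item_screen(scored, min_p=0.15, max_p=0.90, min_rit=0.20):
    # scored: examinees x items matrix of 0/1 responses.
    X = np.asarray(scored, dtype=float)
    p = X.mean(axis=0)                          # item difficulty (proportion correct)
    total = X.sum(axis=1)
    rit = np.array([np.corrcoef(X[:, j], total - X[:, j])[0, 1]   # corrected item-total r
                    for j in range(X.shape[1])])
    keep = (p >= min_p) & (p <= max_p) & (rit >= min_rit)
    return p, rit, keep

def test_information_3pl(theta, a, b, c):
    # TIF(theta) = sum over items of a^2 * ((P - c)^2 / (1 - c)^2) * ((1 - P) / P)
    theta = np.asarray(theta, dtype=float)[:, None]
    a, b, c = (np.asarray(v, dtype=float) for v in (a, b, c))
    P = c + (1 - c) / (1 + np.exp(-a * (theta - b)))
    info = a**2 * ((P - c)**2 / (1 - c)**2) * ((1 - P) / P)
    return info.sum(axis=1)

tif = test_information_3pl(np.linspace(-3, 3, 61),
                           a=[1.2, 0.8, 1.5], b=[-0.5, 0.0, 0.7], c=[0.2, 0.2, 0.2])

Comparing TIF curves for candidate item sets in this way is what allows a developer to replace a corrupted item with another item while holding measurement precision constant in the critical score range.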
Further Considerations Regarding IRT-Based Randomized Forms

Both SHL with Verify and Pearson with Watson-Glaser use IRT-based methods for generating a "randomized form" for each candidate. While it is possible that two candidates could receive the same randomized form, it is extremely unlikely. The primary purpose of randomized forms is to protect item and test security in an unproctored, online test environment. Although the details about their IRT methodology for generating randomized forms were not available to this Project, we can reasonably speculate about certain characteristics of the construction process.

1. The item banking process stores several characteristics of each item, including the keyed response, estimated IRT parameters, some manner of designating the target ability, other information about more specific abilities or narrow clusters of highly specific content groupings (for example, Pearson stores information about the specific subscales each item is linked to as well as the neutral-controversial characteristic) and, possibly, administrative information such as time since last use, frequency of use, and pairing with other items that should be avoided.
2. The process of selecting items into randomized forms is based on complex algorithms that control for a variety of test characteristics. The Pearson (Watson-Glaser) psychologist explained that there were approximately a dozen specific rules that each item selection must satisfy, although he would not share what those rules were.
3. Nothing about randomized forms would prevent administering the items in any particular order or grouping. Certainly, randomized forms could easily be administered by ordering the items by difficulty, grouping the items by content similarity, or balancing content considerations across different subsets of content. Indeed, the last two of these – grouping by content similarity and balancing content – are required by the Watson-Glaser assessment.
4. It is possible that at some late stage of item selection, items are randomly selected from relatively narrow and small clusters or groupings of highly parallel items in the bank taxonomy. This could be the case, for example, if item difficulty and discrimination were each polytomized into several categories and were then used as dimensions of the taxonomy. This would enable the taxonomy to store the information necessary to identify groupings of parallel items highly similar with respect to narrowly defined content distinctions, discrimination, difficulty and possibly other characteristics. It is conceivable that an efficient construction strategy would be to randomly select an item from within such narrow clusters where the construction algorithm is indifferent to the differences between items in such narrow groupings.

Neither SHL nor Pearson provided literature resources describing their IRT-based algorithms for generating randomized forms. As a result, it was not possible to carefully evaluate either algorithm.
However, the equivalency evidence presented by Pearson in the Watson-Glaser (UK) manual is persuasive that randomized forms are correlated highly enough to be treated as equivalent.

Strategies for Item and Bank Management

Approaches to item and bank management appear to differ considerably across the seven reviewed batteries. These differences can be captured using the three categories identified above for item development.

Fixed Form Production Approach

At one end, represented by PPM, GATB, proctored Watson-Glaser and, possibly, DAT PCA, there is no bank management because the only "active" items are those in the fixed forms. In this case, items are also regarded as fixed and are not routinely replaced. If evidence accumulates about problematic items or compromised test forms, new forms are developed and replace existing forms. Historically, such replacement is infrequent, with forms often in use for 10-20 years or longer. For these publishers, we have no information indicating that they routinely update items' psychometric characteristics based on accumulating data.

Item Accumulation Approach

EAS and PET items appear to be managed somewhat more actively with accumulating banks of items. Some items in these banks are active in the sense that they are in fixed forms that are currently in use. Other items are inactive and are not in forms that are in use. PSI describes a process of routine monitoring of items within active forms to detect indicators of over-exposure or significant fraudulent responding. Indeed, PSI distinguishes between forms available only for proctored administration, unproctored US administration and unproctored administration outside the US. This suggests that items are tagged with their current form "status" and that different standards for retention are applied to items depending on the use of their current form. However, even PSI does not replace single items. Instead, they will refresh a whole form by replacing all or most items and regarding the new set of items as a new form, even if this is transparent to the user. Using this approach, PSI relies on IRT parameter estimates for all accumulated items so that new, transparent forms may be introduced and equated to previous forms using IRT psychometrics. This approach also suggests that PSI attends more regularly than publishers of fixed forms with no banks to the accumulating or recent psychometric properties of active items. PSI provided no detail about the manner in which they might be doing this.

Bank Development Approach

SHL for Verify and Pearson for the unproctored Watson-Glaser in the UK describe a process of active bank management to support the production of randomized forms for each individual unproctored test taker. Of the two, SHL appears to take a more active approach specifically to monitoring current item characteristics using data forensics as provided by their partner Caveon, which specializes in forensic analysis of item responses to detect aberrant patterns. Also, SHL routinely refreshes items in their banks, based in part on indications of patterns of cheating or item piracy that may have compromised certain items. Pearson's description of their approach to bank management does not indicate the use of data forensics or routine refreshing of items in the bank. In general, active bank maintenance, both for security protection and for forms production, requires that each item be tagged with several characteristics.
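As a hypothetical illustration of such tagging, the record below shows the kind of per-item metadata an actively managed bank might carry. The specific fields are assumptions drawn from the characteristics discussed in this report, not any publisher's actual schema.

from dataclasses import dataclass, field
from datetime import date

@dataclass
class BankItem:
    item_id: str
    keyed_response: str                  # correct alternative
    a: float                             # IRT discrimination
    b: float                             # IRT difficulty
    c: float = 0.2                       # guessing parameter (e.g., fixed in a 3PL model)
    content_code: str = ""               # subscale or narrow content cluster
    context: str = "work-neutral"        # e.g., business, school, news media
    exposure_count: int = 0
    last_administered: date | None = None
    last_psychometric_update: date | None = None
    enemy_items: list[str] = field(default_factory=list)  # items that should not be paired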
Pearson indicates that each Watson-Glaser passage-item combination is tagged with 10-15 characteristics. Although Pearson did not identify all of these characteristics, they include IRT parameter estimates (Pearson uses a 3-parameter model with a fixed guessing parameter), item type (business, school, news media, etc.), subscale (Recognize Assumptions, Evaluate Arguments, Inference, Deduction, Interpretation), and other item characteristics such as exposure indices, date of most recent psychometric update, keyed correct alternative, and linkage to other items via passages. All of these 10-15 characteristics are associated with specific rules for the production of randomized forms. SHL presumably uses a conceptually similar approach of tagging Verify items with characteristics that are used to create equivalent forms. For example, SHL's approach with Verify of linking subtests to job families likely requires that each item have information tagged to it regarding the job families or Universal Competencies to which it is linked. Overall, the Bank Development Approach requires routine, ongoing monitoring of item characteristics and, at least in SHL's support of Verify, routine development of new items to continually refresh the bank.

Potential Item / Bank Management Considerations

All three of the approaches described above will be influenced by several policies about item and bank characteristics. While none of the publishers provided information about these policies, it may be helpful to identify certain key ones here.

Data Aggregation. Where the psychometric characteristics of items are periodically updated, an important consideration is the manner in which previous item data are aggregated over time. With small-volume batteries, it may be advantageous to aggregate item data as far back in time as the item has remained unchanged. The advantage is in the increased stability of the estimates. However, with high-volume batteries, two other interests may outweigh modest gains in stability from exhaustive aggregation. First, publishers and users may have more interest in current item characteristics than in historical characteristics. That interest would favor updating item characteristics based only on the most recent data, where recent data replace dated data. Second, forensic analysis or larger scale national/social analyses may have an interest in the pattern of change over time. While this interest is likely to be focused on changes in applicant populations rather than changes in item properties, it nevertheless implies that longitudinal data be segmented in some fashion so that item and applicant characteristics are evaluated within each segment.

Exposure Rules. An objective in any approach to the production of a very large number of equivalent forms from a finite bank is that over-sampling of the most useful items should be avoided. With highly specified algorithms, over-sampling may be managed by specifying item exposure limits. With proctored fixed forms, publishers generally act as if no amount of exposure is too risky. However, in a highly visible and "closed" environment, as MCS's civil service testing system may be, exposure may be a significant concern even with a modest number of fixed forms.
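The following minimal sketch shows how an exposure limit of the kind just described could be applied: items whose exposure rate exceeds an assumed limit are simply withheld from the eligible pool used in form assembly. The limit and field names are illustrative assumptions.

def exposure_eligible(items, total_administrations, max_exposure_rate=0.20):
    # Return item_ids whose exposure rate is under the assumed limit; form assembly
    # would draw only from this eligible pool.
    denom = max(total_administrations, 1)
    return [item["item_id"] for item in items
            if item["exposure_count"] / denom <= max_exposure_rate]

bank = [{"item_id": "V001", "exposure_count": 180},
        {"item_id": "V002", "exposure_count": 40},
        {"item_id": "V003", "exposure_count": 1500}]
eligible = exposure_eligible(bank, total_administrations=5000)   # -> ["V001", "V002"]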
Item Retention Thresholds. Publishers/developers may set different retention thresholds for initial item development processes than for longer-term bank maturation objectives. For example, during the initial pre-testing phase of item development, lower item discrimination values (e.g., item-total correlations) may be acceptable in the interest of building a large enough bank and avoiding the deletion of items that may prove valuable once a larger sample is available. During the ongoing process of bank maturation, higher retention thresholds may be established in the interest of migrating to a bank of more uniformly high psychometric quality. Of course, this example of lower thresholds in early item development decisions than in later bank maturation decisions assumes that it is reasonable to expect sufficient bank stability that gradual maturation should take place over time. That may not be the case for all types of item content or all types of administration contexts. For example, highly technical job knowledge content might be expected to have a relatively short life span and require frequent updating with new items. Unproctored administration may require a bank management strategy of continual item replacement such that no bank "matures" over time. (SHL appears to have adopted this view to some extent even though they also emphasize practices intended to minimize the likelihood of and incentives for cheating.)

Use of Data Forensics. SHL, with its partner Caveon, has promoted the strategy of routine data forensic investigations of test takers' responses and response patterns with unproctored administration. Yet there are strong arguments that proctored modes of delivery are also subject to corruption due to cheating and piracy, especially with collaborators (Drasgow et al., 2009). Publishers should not presume that data forensics have value only when tests are administered in an unproctored setting. (A simple illustration of one such check appears at the end of this subsection.)

Bank Health Metrics. Closely related to the consideration of item retention thresholds is the consideration of bank health metrics. (As a practical matter, this consideration is relevant only for the Bank Development Approach to item and bank management.) Such metrics are intended to be bank-oriented metrics, whether they are simply the aggregate of item-level or forms-level characteristics or some other bank-level indicators such as coverage of all required content domains. In effect, such metrics would constitute the "dashboard" of bank characteristics publishers would routinely attend to. Possible examples include (a) the average reliability of whole forms of subtests produced by a bank following all construction rules; (b) the distribution of item use frequencies; and (c) average reliabilities of subsets of items associated with particular subcategories of items, such as content categories or job categories.

Bank Size. Many factors influence the optimal size of any bank. These include the number of item characteristics that must be represented on forms, the number of administrations, item exposure considerations, subtest lengths, and item loss and replacement rates. The task of estimating the optimal bank size can be very complex and can depend on uncertain assumptions. Nevertheless, unless a publisher/developer has an approximate estimate of desired bank size, it will be difficult to budget and staff for the work of item development and bank maintenance.
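As one example of the data forensic checks referred to above, the sketch below flags test takers whose share of improbably fast item responses exceeds an assumed limit. The time threshold and the flagging rule are assumptions for illustration; operational forensic analyses use far more sophisticated indices.

def flag_rapid_responders(response_times, min_seconds=4.0, max_rapid_share=0.25):
    # response_times: dict mapping test_taker_id -> list of per-item response times (seconds).
    flagged = []
    for taker, times in response_times.items():
        if not times:
            continue
        rapid_share = sum(t < min_seconds for t in times) / len(times)
        if rapid_share > max_rapid_share:
            flagged.append((taker, round(rapid_share, 2)))
    return flagged

flags = flag_rapid_responders({"A17": [2.1, 3.0, 2.5, 9.8],
                               "B02": [12.4, 8.9, 10.2, 7.7]})
# -> [("A17", 0.75)]

A flag of this kind would prompt follow-up investigation rather than an automatic judgment, consistent with the earlier recommendation not to assume guilt on the part of the test taker.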
Use of Test Results to Make Hiring Decisions

This section, while important for Sabbarah / MCS's future consideration, depends only in a minor way on the Study 2 review of the seven batteries. The reason is that publishers provide little information about the manner in which their test results might be used by employers. At most, some publishers identify ranges of test scores, usually norm-based, for employers to use as standards for interpreting score levels. But even in these cases, the suggested ranges and interpretations cannot reflect an individual employer's particular set of considerations. Similarly, publishers almost always recommend relying on additional selection considerations beyond test scores but rarely describe methods for doing so. Generally, the publishers of the reviewed tests understand and accept that each employer's particular method for using their test scores should be tailored to that employer's particular needs and interests.

For this reason, this section is not based on publishers' recommendations or practices. Rather, it is based on the Study 2 authors' understanding of Sabbarah's and MCS's objectives and perspectives about the new civil service testing system and our own experience with this set of considerations with other employers. (Where an approach is related to characteristics of the reviewed tests, this will be noted.)

The starting point for this section is that MCS has already made the following important decisions about the manner in which test results will be used in the new civil service testing system:
- Cut scores will not be used to "qualify" applicants on cognitive test results alone.
- Cognitive test results will be combined with other selection factors in a weighted composite, which will provide the basis for hiring decisions.
- MCS will not design the testing system to be used for purposes other than hiring decisions. (That may not prevent users from adopting other purposes for test results, but MCS is not designing the testing system to support other purposes.)

These decisions limit the scope of this section to a discussion of methods by which subtest results may be tailored to job families. (Note: related topics having to do with the management of test results will be addressed in the Recommendations section. These topics will include, for example, policies about the use of test results, such as the manner in which test results obtained while applying for one job family may be applied to other job families, and policies about retaking tests.)

That said, we make one suggestion here about the MCS policy that will govern the manner in which the composite score is constructed and used. We suggest that MCS combine test results only with other factors expected to be predictive of future job performance, such as past experience/accomplishment records, academic degrees, interview evaluations and so on. A primary reason for this suggestion is that the weighting of each component would then be based on the expected job relevance of the individual components. In this case, the task of choosing the weights will be guided entirely by considerations of future job performance. This composite can be meaningfully understood by all parties, including employers and applicants. Also, any future interest in changing the composite can be evaluated in a straightforward manner. We further suggest that factors unrelated to future job performance, such as length of time as an applicant, age, previous civil service, financial needs, etc., be evaluated separately based on a different set of considerations and values. One major benefit of not bundling these two sets of considerations into a single quantitative composite is that the impact of the non-job-related factors can be more easily recognized. It will be clearer to MCS and employers what the real "cost" is of the non-job-related factors.
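A minimal sketch of the suggested composite, restricted to components expected to predict job performance, follows. The component names and weights are illustrative assumptions only, not recommended values; in practice the weights would be set from job-relevance evidence.

def selection_composite(component_scores, weights):
    # component_scores and weights: dicts keyed by component name; scores are assumed
    # to be standardized (e.g., z-scores) and weights to sum to 1.0.
    missing = set(weights) - set(component_scores)
    if missing:
        raise ValueError(f"Missing component scores: {sorted(missing)}")
    return sum(weights[name] * component_scores[name] for name in weights)

composite = selection_composite(
    {"gcat_total": 0.80, "experience_record": 0.20, "interview": -0.10},
    {"gcat_total": 0.50, "experience_record": 0.30, "interview": 0.20})

Because non-job-related factors are excluded from this composite, their effect on hiring outcomes can be examined separately and transparently.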
Methods for Tailoring the Use of Subtests to Job Families

A pervasive theme in all publishers' materials is that it is meaningful to tailor the use of cognitive subtests to different job families. Indeed, some publishers – PSI with PET, Pearson with Watson-Glaser, and SHL with Verify – have already taken the step of building job-specific content into the battery. PET and Watson-Glaser were both developed specifically to be appropriate for Managerial / Professional work and not for other, less complex work. Using its IRT foundation and large item banks, SHL modifies the difficulty of Verify test content to accommodate job level differences. We provide a framework in Table 72 for evaluating possible ways of tailoring the use of subtests to job families.

Option 1, "No Tailoring", refers to the practice of administering the full battery of all subtests to all applicants, regardless of the sought job. Option 2, "Tailored Weights", as described here, assumes the same administrative process of administering the whole battery to all applicants. The distinction between the two is that "Tailored Weights" produces a total score that is a weighted combination of subtest scores, where the weighting presumably reflects the expected job-specific criterion validities of the subtests. An implication is that a single applicant could receive multiple job-specific score results, whether needed or not. The potential advantage of maximized validity is likely to be a small gain, but the incremental cost of estimating job-specific subtest weights may also be small if the necessary empirical evidence is already in the research literature. The applicant experience in these two options would be identical in the testing process but would differ in the presumed feedback process. All reviewed batteries and OPM could be administered in whole-battery form, although the number and diversity of subtests, especially for GATB and PPM, could be a noticeable inconvenience to the test taker because some subtests may appear to have little relevance to the sought job. Only Watson-Glaser would not be meaningful in the Option 2 mode. Pearson and the original developers provided no rationale or argument for distinguishing the measurement of critical thinking between job families.

Option 3 is an operationally significant method of tailoring because it reduces the subtests administered to any applicant to just those judged most relevant to the sought job. Of course, for short batteries like PET and for homogeneous batteries like Watson-Glaser, Option 3 would not be a meaningful alternative; neither type of battery was designed to be subdivided in this manner. On the other hand, batteries with many and diverse subtests would realize significant user benefits, because applicants would take many fewer subtests and avoid those that are conspicuously unlike the job they are seeking. If test centers operate only with paper-pencil fixed forms, Option 3 can add significant operational costs and administrative time because there is now a larger number of discrete testing events that must be scheduled. Online administration largely eliminates this disadvantage relative to paper-pencil administration. This model of tailoring is fundamental to the Verify strategy of routinely administering only the job-specific subtests.
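A brief sketch of Option 3 follows: only the subtest group assigned to the sought job family is administered and scored. The job-family-to-subtest mapping shown is a hypothetical illustration, not a recommendation.

SUBTEST_GROUPS = {
    "clerical_administrative": ["verbal_comprehension", "number_facility", "checking"],
    "professional_managerial": ["verbal_reasoning", "numerical_reasoning", "inductive_reasoning"],
}

def tailored_total(job_family, subtest_scores):
    required = SUBTEST_GROUPS[job_family]
    missing = [s for s in required if s not in subtest_scores]
    if missing:
        raise ValueError(f"Subtests not administered: {missing}")
    return sum(subtest_scores[s] for s in required)   # unit-weighted total of the tailored group

total = tailored_total("professional_managerial",
                       {"verbal_reasoning": 52, "numerical_reasoning": 47,
                        "inductive_reasoning": 55})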
Table 72. Methods of tailoring the use of subtests to job families

1. No Tailoring – one complete battery; all subtests administered; all subtests contribute to the total score; the same total score is used for all jobs.
   Benefits: easy to use; minimizes development cost; fewer testing events. Costs: lower face validity; longer test times; higher cost to deliver; unneeded results.
2. Tailored Weights – one complete battery; all subtests administered; weighted combinations tailored to jobs; job-specific composite scores used per job.
   Benefits: easy to use; minimizes development cost; fewer testing events; maximizes validity. Costs: lower face validity; longer test times; higher cost to deliver; unneeded results.
3. Tailored Combinations – multiple groups of 3-4 subtests; grouping of subtests tailored to jobs; only the group of subtests for the sought job is administered; the total score of the administered subtests is used.
   Benefits: easy to use; shorter testing sessions; minimizes development cost; improved test taker experience. Costs: more testing events.
4. Tailored Content – item/test content tailored to jobs (items sampled from job-specific banks or by job-specific rules; content level/difficulty; content type); tailored subtests plus other generic subtests administered per job; the total score of the administered subtests is used.
   Benefits: increased user acceptance. Costs: higher cost to develop.

Test form construction: fixed forms (Watson-Glaser, PPM, GATB, PET, DAT PCA, EAS) or individualized forms, i.e., MST / randomized forms (unproctored Watson-Glaser, Verify, OPM).

There is a nuanced but possibly significant distinction between Options 2 and 3. Both produce job-specific score results. But Option 2 (and Option 1) can produce all job-specific score results from the same testing session, whereas Option 3 produces only a single job-specific score result from a single testing session. The issue that can be raised is whether Option 2 score results should be considered "official" results for all jobs regardless of the applicant's intent to apply for only one, or a few, jobs. The problem arises when the employer (or MCS) has a retest policy that requires applicants to wait some period of time, say 3 months, before being allowed to retake an earlier test in an attempt to achieve a better result. In that case, an applicant may be applying for only one job and taking the test only to compete for that one job but would be required to wait the retest period before being allowed to take the same test to apply for a different job. In that scenario, the employer's retest policy could inadvertently harm the applicant's ability to apply for a different job that would require the same test event.

Option 4 is qualitatively different from the other options. The form of job tailoring is to modify the subtest content in some fashion. Verify uses this approach by modifying the range of item difficulty in a randomized form to correspond to the level of the sought job. In the Verify system this is done primarily for the psychometric benefits of item difficulty levels being more closely aligned with the ability levels of test takers for higher (lower) level jobs. Verify does not tailor item difficulty to achieve greater test taker acceptance. In all likelihood, modest changes in level of item difficulty may not be noticed by test takers.
Content can be tailored in ways other than difficulty, whether with Tier 2 content, which does not change the construct being measured, or Tier 3 content, which tailors the construct to assess more job-specific knowledge/ability. In a way, both PET and Watson-Glaser chose their test content to be tailored to a specific set of job families in the management / professional domains. While neither is designed to adapt its content to different job families, both include content designed to be appropriate for a broad range of higher level jobs. While the primary benefit of Option 4 is increased user acceptance, it is costly because Option 4 requires a (much?) larger item bank to enable the adaptation of item content to differences between job families. Also, if this form of content tailoring occurs only with Tier 2 changes in content, then there will be no benefit in increased criterion validity. If the content changes are Tier 3 changes, which change what the test is measuring, then improved criterion validity is feasible. Neither of the reviewed batteries known to require large item banks – Watson-Glaser and Verify – is designed to tailor Tier 3 content. It is possible that OPM's banks are designed in that way.

Overall, these alternative approaches to tailoring the use of subtests, in the context of the types of cognitive abilities we reviewed, are unlikely to benefit criterion validity in any significant way. Only if the job content in items were of the Tier 3 type that measures job-specific abilities/knowledge would Option 4 tailoring possibly have a significant effect on criterion validity. None of the reviewed batteries has been designed to depend on Tier 3 job content. Nevertheless, we have located an apparent example of Tier 3 content in OPM's USA HIRE Judgment and Reading subtests.

Considerations of Popularity, Professional Standards and Relevance to MCS

The Study 2 specifications require information about the popularity, adherence to professional standards and relevance to MCS interests of each of the reviewed batteries. We attempt to succinctly address each of these in a single table of information, Table 73, that allows the reader to make an overall evaluation of the usefulness of the information available for each battery. Summary comments are offered here for each consideration.

Popularity

Publishers were unable or unwilling to provide information about volume of usage for any battery. As a result, we are only able to estimate the popularity of each battery from other available information. Given the imprecision of our assessment of popularity, we use only broad categories of popularity to rank-order the batteries. This information should not be taken to reflect actual usage volume. Overall, there is a strong indication that Verify and Watson-Glaser are relatively high-volume batteries. Watson-Glaser is described by Pearson as its most popular selection test, but the likelihood is that it is more heavily used in educational contexts than selection contexts. At the same time, DAT PCA is listed as its ninth most popular selection test, and the Pearson psychologist described DAT PCA as being in the top five of its cognitive ability selection tests. Although no volume data are available for Verify, the development information shows very large samples (e.g., 9,000, 16,000) of SHL test takers who completed trial forms of Verify. The unproctored, online feature, coupled with SHL's aggressive international marketing approach, is virtually certain to have resulted in high-volume usage of Verify.
At the other end of the popularity spectrum, the Hogrefe psychologist described PPM as a "low volume" product used only for certain applications in the UK. Nelson distributes the GATB, but only in Canada. The popularity of the remaining three batteries, EAS, PET and DAT PCA, is more ambiguous. Our professional experience is that EAS is a well-established and commonly used battery and is used as a "marker" battery for construct validity studies of other tests, for example PPM. PET, also supported by PSI, serves a narrower market with a smaller volume than the broad-based EAS. Also, the PSI psychologist indicated that PET has fewer forms than EAS in current use. The implication is that PET has lower usage than EAS. DAT PCA has lower usage than Watson-Glaser, and several subtests were recently removed from DAT PCA due to low usage. (Clients may purchase DAT PCA by the subtest.)

Adherence to Standards

Our evaluation of adherence to professional standards is based on all available information, including interviews with publishers' psychologists. Overall, our evaluation is that publishers range from moderate adherence to low adherence, especially with respect to the NCME, AERA and APA Standards. Generally, publishers deserve high marks for administrative/instructional information (e.g., Standards 3.19, 3.20, 3.21 and 3.22) but low marks for technical detail about item and test development processes such as item/test specifications (see, e.g., Standards 3.3 and 3.7). Similarly, most publishers have described the results of construct and criterion validity studies but have provided little information about the technical details of those studies (see, for example, Standard 1.13). Adherence is more difficult to judge with respect to SIOP's Principles, partly because they are described in less prescriptive language and many of the principles address details relating to specific organizational decisions/procedures, such as cut scores and job analyses, which publishers may not be in a position to address. Overall, however, publishers deserve low marks for their efforts to protect the security of the tests, with the exception, of course, of Verify and the unproctored Watson-Glaser in the UK. We have observed that publishers overestimate the protection provided by proctoring.

We should note that non-compliance with professional standards does not necessarily imply that the battery has no information value for MCS. For example, we judge PPM to be in poor compliance with professional standards. Nevertheless, PPM offers a useful example of a full set of cognitive ability subtests constructed with work-neutral item content. In many instances, the value of a battery for MCS does not depend on the publisher's adherence to professional standards, especially those standards relating to documentation.

Relevance to MCS Interests

All batteries were recommended and selected because, for each one, certain features were initially seen as relevant to MCS's interests. So, all reviewed batteries were expected to have some information value for MCS. As information was gathered, we recognized that Verify became a more valuable source of information while PPM and Watson-Glaser became somewhat less valuable, in our judgment. Also, the contributions of EAS, DAT PCA, GATB and PPM were somewhat more interchangeable than were the somewhat distinctive contributions of Verify, PET and Watson-Glaser. The relevance of each battery's information to MCS will be more evident in the Section 7 recommendations.
Table 73. Information about batteries' popularity, adherence to standards, and relevance to MCS.

Verify
- Judged popularity: High
- Adherence to professional standards: Moderate – technical test production detail missing; item development detail missing; validation results provided; validation procedures missing; security protection detail provided; rationale for randomized forms provided; test specifications not well described; reliability detail provided; psychometric model described
- Relevance to MCS: High – best example of security protection for unproctored administration; best example of bank-oriented construction of multiple forms; some of the best examples of work-like content; all constructs are applicable; good example of a strategy for controlling subtest level

Watson-Glaser
- Judged popularity: High
- Adherence to professional standards: Moderate – technical test development detail missing; item development detail provided; construct detail lacking; reliability detail provided; validity results provided; validation procedures missing; administration/scoring instructions provided; time limit information is ambiguous
- Relevance to MCS: Moderate – good example of bank-oriented construction of multiple forms with testlets; content unlikely to be directly applicable; construct, as is, may not be applicable; strong publisher support of users

EAS
- Judged popularity: Moderate - High
- Adherence to professional standards: Moderate – extensive validation results presented; validation procedures missing; inadequate security protection in unproctored settings; reliability detail presented; item development detail missing; adequate administrative support
- Relevance to MCS: Moderate - High – good example of a diverse battery; good examples of work-neutral content; most constructs are applicable; extensive validation

PET
- Judged popularity: Moderate
- Adherence to professional standards: Moderate – validation results presented; validation procedures missing; inadequate security protection in unproctored settings; reliability detail provided; modest item development detail reported; administrative support provided
- Relevance to MCS: High – very good example of higher level work-like content; all constructs are applicable

DAT PCA
- Judged popularity: Moderate
- Adherence to professional standards: Moderate – most item development detail missing; most test construct detail missing; reliability detail provided; validity results provided; validity procedures missing; marginal security protection (one fixed form); administration/scoring information provided
- Relevance to MCS: Moderate - High – good example of a diverse battery; good examples of work-neutral content; all constructs are applicable; strong publisher support of users

GATB
- Judged popularity: Moderate - Low
- Adherence to professional standards: Moderate - High – extensive item/test development detail provided; extensive reliability and validity detail provided; extensive item review processes; marginal security protection (fixed forms)
- Relevance to MCS: Moderate - High – good examples of work-neutral content; most constructs are applicable; extensive validation

PPM
- Judged popularity: Low
- Adherence to professional standards: Low – criterion validity evidence nonexistent; item development detail missing; reliability detail presented; construct validity detail presented; administrative information provided; inadequate security protection provided for aging forms
- Relevance to MCS: Moderate – good example of work-neutral content; most constructs are applicable; poor example of validation practice; poor example of publisher support

SECTION 7: RECOMMENDATIONS AND SUGGESTIONS

A requirement of Study 2 is that recommendations be made with respect to (a) MCS's GCAT plan and specifications, and (b) important organizational and operational considerations relating to management, staffing, security, item banking, technical support, administration and the design of test guides, manuals and reports. This section provides these suggestions and recommendations organized in the following manner.
1. GCAT Plan and Specifications
   A. Choosing Test Constructs
   B. Specifying Item Content
   C. Methods of Item and Test Development
   D. Validation Strategies
2. Organizational and Operational Considerations
   A. Security Strategy
   B. Item Banking
   C. Operational Staff Requirements
   D. Design of User Materials
   E. Strategies and Policies

These recommendations are based on multiple sources, including the reviews of the seven batteries, our own professional experience with personnel selection testing programs and the professional literature relating to the design and management of personnel selection systems. We do not limit these recommendations to just the methods or approaches used by one or more of the reviewed batteries.

GCAT PLAN AND SPECIFICATIONS

Choosing Test Constructs

Background

Perhaps no decision is more important than the choice of abilities the tests should measure. The recommendations described here about which abilities should be selected are based on many considerations. But there are four assumptions about MCS's civil service testing strategy that are important to these recommendations.

A. MCS desires an efficient testing system that uses no more tests than are necessary to provide maximum predictive validity for all civil service jobs.
B. MCS does not intend to use these same tests for career guidance purposes.
C. MCS does not require that the tests measure job knowledge or skills specific to particular jobs. (We acknowledge that MCS may have an interest in a Phase 2 addition to the recommended tests that includes job-specific tests.)
D. MCS's civil service jobs are predominantly in the service sector of work, ranging in level from clerical / administrative jobs to management / professional jobs. The applicant pools for these civil service jobs will have some level of college education. Some will be new graduates; others will be experienced workers.

In addition to these assumptions, the recommendations are based on certain perspectives about tests used for personnel selection purposes. The first perspective is that the selection purpose or use of the tests requires that test scores predict performance in the target jobs. This is the most important requirement of selection test scores. Second, there is no requirement from MCS or from professional practice/standards that these tests be theoretically singular. It is acceptable, perhaps even desirable, that individual tests measure more than one theoretically distinct ability. For example, Reading Comprehension tests are likely to measure Verbal Comprehension as well as acquired knowledge relating to the substantive content of the reading passage. Generally, such complexity is not disadvantageous and may be beneficial for personnel selection tests. They are not designed to test theories of ability. Rather, the language and meaning of theories about cognitive ability should inform and guide the decisions about which abilities should be measured and the manner in which the tests should measure them. For the purposes of these recommendations about ability test constructs, we will refer to the Kit of Factor-Referenced Tests described in Ekstrom, French, Harman, and Dermen (1976), associated with the 23 cognitive aptitude factors identified in the work by the Educational Testing Service (ETS). For our purposes, the advantage of this framework is that it identifies a small number of marker tests for each of the 23 identified aptitude factors. These marker tests enable a clearer discussion about the types of tests that we recommend to MCS.
The third perspective is that there is a considerable research foundation about the predictive validity of cognitive ability tests with respect to job performance. Many decisions that MCS will make should be informed by that research base, beginning with this decision about what the tests should measure.

Recommendations

Recommendation 1: Develop subtests in four categories of cognitive ability: (a) Verbal Ability, (b) Quantitative Ability, (c) Reasoning Ability, and (d) Processing Speed/Accuracy. These categories are widely used, as observed in six of the seven reviewed batteries, have a research foundation of predictive validity, and are routinely identified by structured job analysis methods as required for service jobs.

Recommendation 2: Gather available job information across the service sector to make an initial evaluation of whether Psychomotor and/or Spatial/Mechanical categories of ability may also be required by jobs within the service sector.

Recommendation 3: Develop two or more subtests within each selected ability category to measure specific abilities within each category. Suggestions for specific abilities are described in the rationale below for this recommendation.

Rationale for Recommendations

Recommendation 1

This recommendation is largely based on the accumulated evidence that cognitive ability tests relating to Verbal Ability, Quantitative Ability and Reasoning Ability predict job performance across a wide range of jobs. Further evidence shows that Processing Speed tests predict performance in jobs requiring the rapid processing of alphanumeric information. For example, in a large-scale meta-analysis among clerical jobs, Pearlman, Schmidt and Hunter (1980) reported average corrected validities with respect to proficiency criteria for Verbal Ability, Quantitative Ability, Reasoning Ability and Perceptual Speed of .39, .47, .39 and .47, respectively. With respect to training criteria, these same ability categories averaged corrected correlations of .71, .64, .70 and .39, respectively. This research foundation is the likely explanation for the fact that commercially available cognitive ability batteries used for selection often contain subtests within each of these categories. For example, of the six reviewed batteries other than Watson-Glaser, five include subtests in all four categories, and all six include subtests in Verbal Ability and Quantitative Ability. This strategy of including subtests in four diverse ability categories better ensures that some combination of 2-4 subtests from across the four categories will produce a maximum level of predictive validity, because the combination of 2-4 different ability tests will constitute a measure of general mental ability.

Recommendation 2

Recommendation 1, to develop subtests in four categories of abilities, is intended, in part, to ensure that performance will be maximally predicted for all jobs within the service sector of civil service jobs. Nevertheless, there may be specific features of some service sector jobs that require one or both of two more specialized ability categories – psychomotor skills/abilities and spatial/mechanical ability. Thorough job analyses would likely not be needed to determine whether spatial/mechanical subtests are warranted. Often, available job descriptive information showing significant use/repair/maintenance of tools, equipment and machinery is sufficient to recognize the potential value of spatial/mechanical ability tests.
For this category, our recommendation is that subtests not be developed unless job information indicates potential value for such subtests. The question of whether psychomotor tests have potential value should be addressed somewhat differently. (We presume that the Study 2 requirement to review only cognitive ability tests is an indication that MCS has already determined that psychomotor testing is not warranted. Nevertheless, we note here that, if that decision has not already been made, the decisions about whether to develop psychomotor tests and which psychomotor skills/abilities should be assessed would require a more specialized and thorough analysis of the physical ability requirements of the jobs in question.)

Recommendation 3

The recommendation to develop two or more subtests within each selected ability category is based on two primary considerations. First, each ability category represents a moderately broad range of related specific abilities. Including two or more distinctly different subtests in each category improves the representation of that category and increases the likelihood that the category is adequately represented within the battery. For example, the Verbal Ability category includes at least four specific verbal ability test types described by Ekstrom et al. (1976) – Verbal Comprehension, Verbal Closure, Verbal Fluency and Associational Fluency. In addition, the commonly used Reading Comprehension test type involving paragraph comprehension fits within this category. Also, many publishers treat Verbal Reasoning tests as part of a group of Verbal Ability tests. This diverse set of possible specific verbal abilities will be better represented by two or more subtests than by a single subtest.

Second, distinctively different specific ability subtests within a category increase the opportunity for one of the subtests to reliably predict performance in any target job family at a higher level than the other subtest(s) within the category. This rationale is the primary reason publishers commonly adopt this recommended approach. Perhaps the best example of this approach among the reviewed batteries is EAS, for which PSI has an extensive record of job-specific validity research. PSI recommends an empirically optimal combination of subtests for each of several job families. For example, PSI recommends the combination of Verbal Comprehension, Numerical Ability, Numerical Reasoning, and Verbal Reasoning for Professional, Managerial and Supervisory jobs. This particular combination produced a meta-analytic average validity of .57, which was higher than that for any other combination.

Table 74 shows suggested specific ability subtests for each of the recommended categories of ability. These suggestions are taken from the Study 2 review of the seven cognitive batteries as well as the Ekstrom et al. (1976) kit of factor-referenced tests. These are not exhaustive suggestions but are intended to provide examples of the typical diversity of subtests within the broader categories. (Note: examples of specific psychomotor subtests are not shown here because the choice of specificity depends to a great extent on the particular psychomotor requirements of the job.) The examples described in Table 74 are considered to be plausible subtest types for MCS's civil service system. Many of the marker tests shown in Ekstrom et al. (1976) are instructive but not plausible as item/test types for MCS's civil service system.
Examples of plausible specific ability subtest types within each broad recommended ability category (specific ability subtest – item type).

Verbal Ability
Verbal Comprehension (Ekstrom, et al.) – Vocabulary meaning
Reading Comprehension (PET) – Paragraph meaning
Verbal Reasoning (Verify) – Reasoning to conclusions from a paragraph of information

Quantitative Ability
Number Facility (Ekstrom, et al.) – Speed test of simple to moderate arithmetic computation problems (no calculator)
Numerical Reasoning (Verify) – Deriving conclusions from numerical information in a moderately complex problem context
Calculation (Verify) – Deriving the missing number in an arithmetic equation (with calculator)
Quantitative Problem Solving (PET) – Computing answers to arithmetic problems taken from business/government contexts

Reasoning Ability
Following Directions (Ekstrom, et al.) – Deriving answers to complex questions about tabled information
Diagramming Relationships (Ekstrom, et al.) – Comparing Venn diagrams to verbally described categories
Inductive Reasoning (Verify) – Reasoning about the relationships among abstract shapes/objects
Reasoning (PET) – Syllogistic reasoning about written information

Spatial / Mechanical Ability
Card Rotations Test (Ekstrom, et al.) – Mental rotations of drawn shapes to recognize similarity (difference)
Mechanical Comprehension (Verify) – Recognize effects on complex objects of described actions
Mechanical Understanding (PPM) – Answer questions about drawn objects, tools, equipment, machines (close to mechanical knowledge)
Space Relations (DAT PCA) – Visualize and rotate a folded 2-D drawing in its 3-D form

Processing Speed / Accuracy
Number Comparison Test (Ekstrom, et al.) – Comparing pairs of long numbers for sameness or difference (speeded)
Checking (Verify) – Finding the same alphanumeric string in a set of several strings (speeded)
Processing Speed (PPM) – Alphabetize three written words (speeded)
Name Comparison (GATB) – Determine whether two written names are the same or different

Specifying Item Content

Background
Once ability constructs are selected for development, decisions must be made about item content. The seven reviewed batteries are especially instructive for this purpose because they demonstrate the primary options available to MCS regarding item content. For purposes of these recommendations and MCS's item-test development work, we describe three key facets of item content:
A. Work relevance
B. Level / Difficulty / Complexity
C. Fluid Ability v. Crystallized Ability

Three levels of work relevance are described: Tier 1 – Work Neutral; Tier 2 – Work-Like Context; and, Tier 3 – Work Relevant. Tier 1 content relies on acquired knowledge, usually reading/language/arithmetic skills, but the content is not derived from job content and is not intended to be similar to job content. Tier 2 content relies on acquired knowledge, usually reading/language/arithmetic skills, but the content is deliberately given a work-like context such as a table of information similar to what might be used in a work context. However, the work-like context is not so specific to any particular job that job experience or acquired job knowledge is required to answer correctly. The intent of Tier 2 content is to convey to the test taker the manner in which the test content may be related to job performance. Tier 2 content is not used to measure acquired job knowledge. Tier 3 content, on the other hand, is introduced for the purpose of measuring some specific acquired job knowledge.
Situational judgment tests often exemplify Tier 3 content. Most reviewed subtests are based on Tier 1 content, a smaller number are based on Tier 2 content, and no reviewed subtest was based on Tier 3 content, although Watson-Glaser falls between Tier 2 and Tier 3. We believe two OPM tests, which were noted in Study 2 but not reviewed, represent Tier 3 content.

The level of item content is an important consideration for two reasons, especially for selection tests. First, level of content usually influences item difficulty, which influences the precision with which the test measures ability levels. Overall, it is important that selection tests have adequate measurement precision in the critical ability range where hired and not hired applicants are discriminated. That critical range is often a function of the job's complexity level or educational level. Items written at a level in the vicinity of this critical ability range are more likely to be psychometrically effective. Second, item content that is seen by test takers or by hiring organizations as dissimilar to the level of job content can be a source of significant dissatisfaction with the test. It is important that the item and test development process develop a level of content appropriate to both job level/complexity as well as applicant level.

Postlethwaite (2012) recently demonstrated that selection tests measuring crystallized ability (Gc) tend to have higher predictive validity than tests measuring fluid ability (Gf). This is new information and should have a significant influence on MCS's decisions about item content. Overall, MCS should prefer Gc item content over Gf content except where there may be a compelling rationale that offsets the overall validity difference. Even though this information is new and was not a well-established view prior to 2010, publishers have shown a strong preference for Gc subtests. Approximately 75% of the non-speeded cognitive subtests in the seven reviewed batteries were Gc measures.

In addition to these three facets of item content, decisions about item content are also influenced by cost and efficiency considerations. Overall, our professional experience has been that the job specificity of item content is related to the cost of item and test development. Tier 3 content is likely to be significantly more expensive to develop because it requires more structured, comprehensive information about the content of job tasks, activities, knowledge, skills, abilities and/or other characteristics (KSAOs). Also, Tier 3 content is usually motivated by an interest in highly job-specific assessments of job KSAOs. This strategy is very likely to lead to a significant increase in the number of required subtests. It is no accident, in our view, that none of the publishers of the reviewed batteries has adopted a Tier 3 content strategy. They have a strong marketing interest in single batteries that are attractive to the widest possible range of users. This marketing interest happens to be closely aligned with the science of cognitive ability selection testing, which has demonstrated that well-developed cognitive ability tests are predictive of job performance across a wide range of jobs without requiring job-specific content.

Recommendations

Recommendation 4: Develop items with Tier 2, crystallized ability content in most, if not all, subtests. The MCS range of job families is restricted to service sector jobs.
This is likely to mean that Tier 2 content could be developed to be generally similar to the types of information processing, problem solving and learning contexts common to jobs across this sector of work. There is unlikely to be a validity, cost or user acceptance argument favoring fluid ability tests with abstract content or crystallized ability tests with work-neutral (Tier 1) or work-relevant (Tier 3) content.
1. Develop reasoning subtests that are framed as tests of "problem solving" or "decision making" ability in work-like problem contexts rather than using abstract content.
2. Note, MCS's plan to develop in Phase 2 job-specific assessments such as performance assessments may require Tier 3 content and the additional job analytic work that such content is likely to require.

Recommendation 5: Specify a modest range of content levels for all subtests associated with the range of college education levels of most applicants.
1. Develop items that are likely to span the range of difficulty for this applicant population, from relatively easy (proportion correct near .90) to relatively difficult (proportion correct near .30). Do not establish a narrow objective of developing items with fairly homogeneous difficulty levels near a middle value such as .60. (A brief illustrative sketch of this kind of difficulty screening appears below.)
2. Within each subtest, develop items with Tier 2 content that samples job-like activities and tasks across different service sector job families (administrative, managerial, technical support, etc.)
3. Do not develop different subtests with lower content levels for lower level administrative / clerical jobs than for higher level professional / managerial jobs. Rather, develop moderately diverse items across a range of job content representative of the range in the service sector. Each subtest should have sufficiently diverse item content levels to be seen as relevant to the full range of service sector work.

Rationales for Item Content Recommendations

Recommendation 4
Recommendation 4 is based primarily on three separate considerations:
1. Tests of crystallized ability have been shown to be more predictive of job proficiency and training success than tests of fluid ability. The advantage of fluid ability tests, namely that they are equally relevant, on their face, to all jobs, does not overcome the validity disadvantage.
2. Tier 2 content is likely to increase user acceptance compared to Tier 1 content or abstract content. Among cognitive ability tests, Tier 3 content is unlikely to improve predictive validity.
3. Unlike the commercially available batteries reviewed for this Study 2, MCS has no business requirement to develop a single battery that is seen as relevant across a very wide range of work. The popularity of Tier 1 content among commercially available batteries is very likely due to their market considerations, which are quite different from MCS's "market" considerations. MCS's focus on service sector jobs allows it to capitalize on the benefits of work-like content without causing any restriction in their usage for Saudi Arabia civil service testing purposes.

Recommendation 5
Recommendation 5 essentially proposes that item content for all subtests vary in level/complexity around an overall level associated with the college level of education of most applicants. This level may involve reading level but should also involve the complexity of item content so that it is similar to the range of complexity in service sector work performed by college educated employees.
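Point 1 of Recommendation 5 also implies a simple empirical check during item piloting. The sketch below is illustrative only: it assumes a pilot data layout (an examinee-by-item matrix of 0/1 scores) and rule-of-thumb thresholds that MCS would need to set for itself, and it is not tied to any of the reviewed batteries. It computes classical item difficulty (proportion correct) and a corrected item-total discrimination index, and flags items that fall outside the broad difficulty band or show weak discrimination.

```python
# Illustrative sketch only; assumed data layout and illustrative thresholds.
import numpy as np

def item_statistics(scores):
    """scores: 2-D array (examinees x items) of 0/1 item responses."""
    scores = np.asarray(scores, dtype=float)
    p_values = scores.mean(axis=0)          # classical difficulty: proportion correct
    total = scores.sum(axis=1)
    stats = []
    for j in range(scores.shape[1]):
        rest = total - scores[:, j]         # item-total score with the item removed
        if scores[:, j].std() == 0 or rest.std() == 0:
            r_pb = float("nan")             # no variance; discrimination undefined
        else:
            r_pb = float(np.corrcoef(scores[:, j], rest)[0, 1])
        stats.append((j, float(p_values[j]), r_pb))
    return stats

def flag_items(stats, p_low=0.30, p_high=0.90, min_rpb=0.20):
    """Flag items outside the broad difficulty band or with weak discrimination."""
    return [j for j, p, r in stats
            if not (p_low <= p <= p_high) or not (r >= min_rpb)]
```

A roughly even spread of proportion-correct values across the .30 to .90 band, rather than a cluster near .60, also supports the bank-management goal, noted below, of an item bank with a relatively flat distribution of item difficulties.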
This recommended approach to content level should result in item content similar in level to the Verify Numerical Reasoning and Verbal Reasoning subtests as well as the PET Reading Comprehension and Reasoning subtests. Also, the OPM Reasoning subtest exemplifies this approach to item content. This objective of developing items of modest diversity centered on a level appropriate to college educated applicants is intended, in part, to support a possible longer term bank management strategy of estimating IRT-based item parameters that may be used to develop test forms tailored to differences in job level, similar to the manner in which SHL adjusts the theta level of Verify items for the level of the target job family. A recommendation is made below to develop large item banks over time to support the automatic production of large numbers of forms. This longer term strategy would be facilitated by an item bank that has a relatively flat distribution of item difficulties.

This recommendation to develop diverse items within a subtest may raise the question of whether such diversity of level and content would threaten the unidimensionality of the items within a subtest. Potentially, this could be problematic for the eventual IRT-based approach recommended for bank management and forms production. We believe it is true that, all else the same, increasing item diversity within a subtest will tend to slightly reduce the unidimensionality of the item pool. This modest effect would not be due to variability in difficulty itself. After all, IRT psychometrics explicitly represent variation in item difficulty and examinee ability levels. The likely source of small increases in item heterogeneity would be the diverse item content associated with differences in job-like tasks used to generate item topics or other differences in test tasks unrelated to job content. For example, the items in a Computation subtest frequently include items capturing all four major arithmetic operations and vary the number of significant digits in number values. Strictly speaking, both of these common sources of item diversity increase item heterogeneity, thus reducing item unidimensionality. But this degree of increase in item heterogeneity has never been problematic for IRT. In other words, the IRT unidimensionality assumption does not require theoretically singular item pools. In the special case of subtests designed to measure the types of moderately broad cognitive abilities recommended here, which are typical of commercially available selection tests, it appears likely that the positive manifold of cognitive subtests satisfies the unidimensionality requirement sufficiently for IRT parameter estimates to be stable, consistent, invariant, and useful. Both SHL (Verify) and Pearson (Watson-Glaser) found this to be true. (As did PSI in their IRT analyses of accumulating banks of used items for EAS and PET.) While we believe SHL estimates Verify IRT parameters based on within-subtest parameterization, Pearson concluded that the full set of Watson-Glaser items spanning all three major subtest domains was sufficiently unidimensional to produce useful, stable and invariant scores. We believe that our recommendation to develop diverse items within the specific domain of each subtest is very unlikely to risk failure of the unidimensionality assumption. This is not because job complexity is, itself, unidimensional.
Rather, it is because diverse measures of narrow, specific abilities within a broader category of cognitive ability tend to positively correlate with one another sufficiently to satisfy IRT's unidimensionality assumption.

Finally, Recommendation 5 suggests that items of diverse levels of psychometric difficulty be developed. MCS is proposing to use the subtest scores by combining them with other selection factors, possibly including interview scores, experience scores, and other factors. A consequence of this composite approach is that the critical range of cognitive test scores becomes much wider. If the lowest score in this range is defined as the lowest score achieved by any hired applicant and the highest score is defined as the highest score achieved by any applicant who is not hired, then this range of scores is the range within which test scores are expected to help discriminate between hired and not hired applicants. (That is, all applicants who score below this range are not hired and all applicants who score above this range are hired.) In composite systems like the one MCS is proposing, this critical range of cognitive test scores can become very wide. A professional standard and principle for personnel selection tests is that scores within the critical range should have adequate discrimination power. In cases where this critical range is wide, as will be the case in the proposed system, the optimal test information function should be relatively flat across this range, rather than peaked in the middle. The recommended process of developing items of diverse difficulty levels will support that psychometric objective.

Methods of Item and Test Development

Background
We based the following recommendations on three key assumptions:
A. In the initial stage of item and test development, large enough samples will not be available to estimate IRT parameters for new items. IRT estimates will be gathered over time as the civil service system becomes operational and provides opportunities for large numbers of applicants to take large numbers of items.
B. Phase 1 should develop only the subtests needed to predict performance across the full range of service sector jobs in Saudi Arabia.
C. A sufficient number of fixed forms should be developed initially to support an adequate security strategy until large numbers of items are developed and banked with IRT estimated characteristics to produce large numbers of equivalent forms.
Our overall expectation is that the MCS items and tests will initially be developed based on classical test theory characteristics and then, gradually over time, will accumulate IRT parameter estimates in order to support a large bank strategy for the production of multiple forms.

Recommendations

Recommendation 6: Establish a long-term objective to develop enough items to support bank-based forms production to administer a large number of different forms to different applicants.
1. The near-term implication of this bank-oriented strategy is that a sufficient number of items should be developed initially to support the development of several (4-6) fixed forms for each subtest, except for the Processing Speed and Accuracy category, assuming that the initial administration processes will be limited to fixed form delivery, either paper-pencil, online or both. These fixed forms may be constructed based on classical test theory item characteristics, to be replaced over time with IRT estimates.
a.
If subtests are needed for the Processing Speed and Accuracy category, it is very likely that 2 forms would be adequate.
2. We do not recommend delaying implementation until a large bank of items is sufficient to support a large number of equivalent forms.
3. We anticipate that it would not be feasible prior to implementation to conduct pilot studies of hundreds of items large enough to establish IRT item parameter estimates for large banks of items. It seems more feasible that such large banks of items will only be developed over time as operational test administration processes provide the opportunity to pilot new items.

Recommendation 7: Place a priority on the initial development of 6-8 subtests, each with several forms, rather than a larger number of subtests with fewer forms. Two subtests from each of the Verbal, Quantitative and Reasoning categories would provide sufficient generality to be predictive across all service sector jobs. We also assume that one or two subtests of Processing Speed and Accuracy will be appropriate for the clerical/administrative jobs.

Recommendation 8: Develop items using standard professional practices for item writing, review, empirical evaluation and subtest construction. The following features are singled out for special attention.
1. Specify construct and content requirements based on Recommendations 1 and 2.
2. Capitalize on sample/practice items drawn from existing batteries, especially Verify and PET sample items using Tier 2 content.
3. Job materials should be used as content resources for item writers.
a. Item writers should have some exposure to key target jobs, such as a tour of job facilities, to observe the manner in which job materials represent performance of key tasks across a range of jobs.
4. Item review procedures
a. Item expert review (include experts in the development of personnel selection tests)
b. Statistical differential item functioning analyses (e.g., Mantel-Haenszel) for culturally salient groups, perhaps male-female (an illustrative sketch of this kind of analysis appears after the Recommendation 6 rationale below)
5. Test construction: Develop several fixed forms (4-6) based on classical test theory equivalency.
a. Develop several practice tests per subtest of similar length and complexity. (Practice tests do not require empirical equivalency. It can be sufficient to develop practice tests based on item experts' judgments of item equivalency based on expected difficulty and construct overlap.)

Rationales for Item and Test Development Recommendations

Recommendation 6
This recommendation is based on the assumption that MCS's resources for item and test development will be inadequate initially for IRT-based item and test development. This should not prevent MCS from developing 4-6 forms for each subtest as the foundation that enables the civil service system to be implemented. Once implemented, applicant testing should provide adequate sample sizes to evolve to an IRT-based bank management approach to item maintenance and forms production. Our recommendation below about the MCS security strategy is that it should be active and comprehensive from the beginning, even though unproctored administration is not a likely option.
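To illustrate the kind of empirical item screening referenced in Recommendation 8.4b, the sketch below shows one common way to compute a Mantel-Haenszel DIF index from pilot data. It is a minimal illustration under assumed inputs (0/1 item scores, total scores used for matching, and group labels such as male/female); the variable names and the flagging threshold are ours, not part of any reviewed battery, and operational work could equally rely on established psychometric software.

```python
# Illustrative sketch only: a minimal Mantel-Haenszel DIF screen for one item.
import math
from collections import defaultdict

def mantel_haenszel_dif(item_correct, total_score, group, focal_label):
    """item_correct: 0/1 scores on the studied item; total_score: matching
    variable (total test score); group: group labels; focal_label: focal group.
    Returns the MH common odds ratio and the ETS delta-scale value (MH D-DIF)."""
    strata = defaultdict(lambda: [0, 0, 0, 0])   # [A, B, C, D] per score stratum
    for u, s, g in zip(item_correct, total_score, group):
        cell = strata[s]
        if g != focal_label:                     # reference group
            cell[0 if u == 1 else 1] += 1        # A = correct, B = incorrect
        else:                                    # focal group
            cell[2 if u == 1 else 3] += 1        # C = correct, D = incorrect
    num = den = 0.0
    for a, b, c, d in strata.values():
        t = a + b + c + d
        if t == 0 or (a + b) == 0 or (c + d) == 0:
            continue                             # skip strata missing a group
        num += a * d / t
        den += b * c / t
    if num == 0 or den == 0:
        return None, None
    alpha = num / den                            # common odds ratio
    delta = -2.35 * math.log(alpha)              # ETS delta metric (MH D-DIF)
    return alpha, delta
```

Under one commonly cited ETS convention, items with an absolute D-DIF value of roughly 1.5 or more (and statistically significant) would be routed to the expert review step in Recommendation 8.4a rather than retained automatically.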
With respect to security, having 4-6 equivalent forms for each subtest will enable MCS to (a) administer several different forms within each paper-pencil test session or cluster of online sessions, (b) announce that different applicants are receiving different forms of the same tests, and (c) create the likelihood that applicants who might later talk among themselves will recognize that many of them were given different forms of the subtests.

Recommendation 7
For purposes of MCS's Phase 1, 6-8 subtests assessing Verbal Ability, Quantitative Ability, Reasoning Ability and, as needed, Processing Speed and Accuracy Ability will be sufficient to predict success in the full range of civil service jobs within the service sector. Validity generalization studies of cognitive ability tests have shown that 3 or 4 subtests capturing diverse specific abilities will provide virtually optimal prediction.

Recommendation 8
This recommendation is intended primarily to assure MCS that standard professional practice regarding the development of items and tests will be sufficient to produce effective subtests. In addition, Recommendation 8 suggests three steps to better ensure that the development of item content is well-informed by available information about job content. This is not for the purpose of creating job knowledge tests or even tests that are benefited by acquired job knowledge. Nor is this intended to develop subtests designed to measure a specific ability to perform specific job tasks. Rather, this is intended to ensure that the test taker is satisfied that the civil service test appears to be relevant to the hiring decision. Note that Tier 2 content is not adequate to provide content validity evidence of an item-job linkage because Tier 2 content is not intended to provide a measure of the KSAO required to perform a specific job task. Tier 2 content only provides a work-like context for the test taking experience.

We also call attention to item review procedures, especially empirical procedures such as Mantel-Haenszel, that may be used to identify items in which group factors appear to add irrelevant variance to item scores. In MCS's case, this would be a mechanism primarily for detecting construct-invalid variance rather than a mechanism for addressing social/legal considerations regarding fairness. At the same time, legal considerations notwithstanding, professional standards for item and test development call for an investigation of group-related sources of bias where there is some rationale for suspecting that there is a risk of such bias.

Finally, this recommendation encourages the development of several practice tests, consistent with the recommendation below regarding support policies that MCS provide support for test takers' interest in practice opportunities. The perspective provided below is that the incremental cost of making practice tests available is small compared to the benefits of such practice opportunities. Certainly, there is a goodwill benefit to supporting test takers' interest in preparing for the tests. But also, practice is known to provide a modest benefit in expected score, especially for novel item types. A useful way to minimize differences between applicants in level of practice is to provide easy access to practice opportunities.

Validation Strategies

Background
The recommendations about validation represent a broad view of the purposes, meaning and value of validation for MCS's civil service testing system. In this broad sense, validation could be thought of as program evaluation.
Validation evidence would be evidence that the program is achieving its objectives. These objectives include the test-oriented objectives of measuring the intended abilities accurately and reliably and predicting the eventual job performance of new hires, as well as organizational interests in optimizing the multiple interests that will influence the design of the system, including test quality, user acceptance, cost, timeliness, work performance and consistency with government objectives for the program. Our overall perspective is that validation will consist of a sequence of steps, eventually evolving into a steady-state process of routine data gathering and analysis to monitor the ongoing effectiveness of the system. (This overall approach is manifest especially in PSI's and SHL's support of their cognitive batteries. They have routine mechanisms for gathering job performance data from clients based on recent hires for whom test results are available.) We recommend which of these steps should be completed prior to implementation and which may be undertaken after implementation.

Within this broad view, we also acknowledge the importance of gathering validity evidence for all components of the selection system itself, including interviews, experience assessment and other possible factors. This also includes components that may be included even though there is no expectation that they are job relevant. An example might include points added to applicants' composite scores based on the length of time they have been waiting as fully qualified. The value of data about these types of components is that it allows an evaluation of their consequences for the selection decisions.

Recommendations

Recommendation 9: Develop a long-term validation (program evaluation) strategy that evaluates all selection factors with respect to their intended use and evaluates key program features such as cost and timeliness with respect to the needs of the hiring organizations, the applicants, and the Saudi Arabian government. This long-term validation strategy should address a number of considerations including:
1. The purposes of the key elements of the system relating to
a. Intended use of job relevant selection factors;
b. Intended use of non-job relevant selection factors;
c. Requirements of hiring organizations including requirements for adequacy of job performance, turnover rates, speed of position filling, and costs;
d. Applicant interests;
e. Government interests
2. The sequence in which related information may be gathered. With respect to validation evidence relevant to the cognitive tests themselves, the sequence should address the following types of evidence:
a. Ability – Test linkage as described in Study 1;
b. Relationships of newly developed tests to "marker" tests of target abilities;
c. Documentation of the validity generalization rationale for the new cognitive ability tests;
d. The definition and collection of outcome measures for recent hires, including job proficiency measures, training success measures, and turnover status;
e. Applicant score distributions for all selection factors for norming purposes and decisions about optimizing the manner in which selection factors are combined to inform selection decisions;
f. Job analysis data relating to the purposes of Phase 2 development of job-specific assessments such as performance assessments
3. Operational metrics, including
a. Cost per hire and per applicant
b. Speed to fill vacancy
c.
Hiring manager satisfaction with new hires (Note: Distinguish system metrics from professional validation measures)
d. Security processes such as data forensics and web patrols

Recommendation 10: For the new cognitive tests, identify the validity evidence that should be gathered prior to implementation. We suggest that these include:
1. Content evidence relating to the linkage between the target ability domain and the tests as proposed in Study 1
2. Item development evidence showing lack of measurement bias at the item level
3. Subtest score distribution characteristics in applicant populations
4. Correlational studies of relationships between new subtests and established marker tests
5. Evidence of subtest reliability and inter-subtest relationships
6. Documentation of the validity generalization rationale for subtests that addresses type of subtest, job categories and criterion types

Recommendation 11: For all selection factors including the cognitive subtests, the following information should be defined, gathered and analyzed following implementation to support a post-implementation validation strategy.
1. Use job analysis methods to define important dimensions of job behavior (See Recommendation 12.1 for additional detail.)
2. Establish routine methods for gathering job "performance" data for recent hires
3. Develop IRT estimates of item characteristics; introduce bank management; establish item retention thresholds and establish a bank-based forms production strategy
4. Analyze / Monitor relationships between all selection factors and important criterion measures (proficiency, training success, progression, turnover, etc.)
a. Note: this recommendation is intended to include all selection factors, including those that are not expected to be job relevant, so that the effects of the non-job relevant factors on selection utility may be assessed.
5. Identify potential sources of incremental value for potential Phase 2 job-specific assessments
6. Evaluate the costs and benefits of establishing minimum standards for hiring on the most job relevant selection components, including the cognitive subtests.
a. Given MCS's intention to use the cognitive test scores by combining them with other selection factors, we recommend that MCS evaluate the consequences of this approach. To the extent that the cognitive tests are one of many selection factors, their usefulness may be significantly reduced. A possible approach to maintaining a minimum level of usefulness for cognitive tests would be to establish a (presumably low) score threshold that applicants must achieve, independent of other selection factors, to be considered further.

Recommendation 12: Develop an approach to criterion measurement that is applicable in a uniform manner across all civil service jobs.
1. Analyze jobs to identify important facets of valued work behavior. Do not limit this effort to just the proficiency and training success factors that are most commonly associated with cognitive prediction. Design the job analysis process to identify the full range of valued job behaviors, including contextual behaviors such as helping behavior, counterproductive behavior, and citizenship behaviors such as organization loyalty.
a. Identify broad criteria that may be described at a level of generality to apply across all civil service job families within the service sector. Potential examples include "Completes all work accurately" and "Completes all work on time".
b.
Identify job-family specific criteria that are meaningful only for specific jobs or narrower groups of jobs. Potential examples include "Handles customer interactions effectively" and "Supervises direct reports effectively".
2. Design and implement an electronic system (email? Internet?) for routinely gathering supervisors' ratings of recent hires' work behavior.
a. Note: distinguish between this type of standardized data gathered for the purpose of validation research and process management feedback data such as managers' "customer satisfaction" ratings for new hires.
Note: the rationales for Recommendations 9-12 are embedded in the description of the recommendations themselves.

Comments about Criterion Measurement and Predictive Validity Estimation

Recommendation 9 suggests a broad view of MCS's validation effort. This validation effort would be both ongoing and broad in scope. The ultimate purpose of this view is to ensure that Saudi Arabia's civil service selection system is as effective as possible, not merely to establish that the cognitive tests predict job performance. For the purposes of this set of comments, we assume that Saudi Arabia's civil service system has responsibility for identifying and measuring all selection factors that will be combined to inform government organizations' selection decisions. These selection factors might include the cognitive tests addressed in this Report as well as interview ratings, academic records, records of previous work accomplishments, work skill measures such as typing skills, personality assessments, and the like. (We do not include other selection factors unrelated to job behavior such as the amount of elapsed time an applicant has been "waiting" for a job offer.)

The starting point for this broad validation effort is to identify and define the job behaviors and results / outcomes that the hiring organizations intend to influence with their selection system. Once identified and defined, these will guide the development of the criterion measures against which the predictive validity of the selection factors will be evaluated. Two primary sources are typically used to generate this list of intended behaviors and results/outcomes – organization leadership/management and/or some form of job analysis. We encourage MCS to rely on both sources of information about the important job behaviors and/or results / outcomes. MCS should consider two broad types of criterion measurement methods: (1) existing administrative records, and (2) ad hoc assessments designed specifically for the purposes of the validation study. The latter usually take the form of ratings by supervisors/managers.

Existing Administrative Records

In most cases, organizations create and maintain records of employees' work behavior and results for a variety of purposes such as performance management, productivity metrics, appraisal ratings, development planning, and compensation or promotion decisions. In the case of Saudi Arabia's civil service jobs, which are primarily in the service sector of work, these might include existing measures such as appraisal ratings, merit-based compensation decisions, promotion readiness assessments, training mastery and various work behavior measures, especially in customer service work, such as the number of customers handled per day and work accuracy.
While the examples above are focused on work behavior and results, we also encourage MCS to consider other existing records that capture employee outcomes not specific to work tasks but that have value for the organization. These include outcome data such as safety records, injury instances/costs, work-affecting health/sickness records, turnover, absence/tardiness records, and records of counterproductive behavior such as employee theft, as well as records of employee citizenship such as awards or honors. We also draw special attention to the possibility of training mastery records, given that the MCS civil service exam system is intended to focus on cognitive ability. It is widely accepted that the major psychological mechanism by which cognitive ability predicts work behavior is that cognitive ability enables the effective learning of job knowledge, which is a direct determinant of effective work behavior. MCS should give special consideration to available records of training mastery that might be created during employee training events that are intended to enable effective work behavior. If such records are available and sufficiently free of confounds or irrelevant artifacts, they often represent the criterion measures that most directly capture the intended benefit of cognitive ability-based selection.

Of course, the mere fact that an existing administrative record is used by the organization and is available does not mean that it should be used as a criterion measure for purposes of predictive validation. Existing records were not developed to serve as criteria in validity studies. As a result, they may have significant flaws and may not provide a meaningful evaluation of the predictive validity of a selection procedure. These considerations are addressed below under the headings of Strengths and Limitations.

Strengths. In general, existing administrative records that are seen as relevant to the work behaviors associated with the selection procedures being validated have three characteristics that are typically important for validity study criteria: (1) they are, by definition, important to the organization; (2) they are well-understood and in the language of managers and supervisors in the organizations; and, (3) they are available at relatively little cost.

A general guiding principle of good professional practice is that selection procedures be designed to predict work behaviors/outcomes that are important to the organization. Validation study procedures often address this issue of importance by gathering ratings of importance for tasks/activities/behaviors/knowledges documented in structured job analyses. This need to ensure the importance of validity criteria may also be addressed by the use of existing administrative records. Certainly, administrative records such as appraisal ratings, promotion-readiness ratings, performance metrics and training mastery assessments provide a clear indication that the organization attaches importance to the information they provide for the broader purpose of managing worker performance and behavior. For instance, if the performance of customer service representatives is managed by providing feedback and coaching to maximize the number of customers handled per hour, then that work outcome is, by definition, important to the organization.
(Note, this point does not mean that a particular organizational tactic such as managing the number of customers handled per hour is necessarily an effective method for optimizing overall organizational results. Sometimes organizations manage to the wrong metrics. A classic example in customer service work is the occasional practice of managing to low average customer talk time. Often this performance metric leads to customer dissatisfaction, which can lead to a loss of customers. However, validity studies should define criterion importance as the organization defines it, assuming every effort has been made to ensure the organization understands the meaning of importance as it is relevant to the question of predictive validity of selection procedures.)

In our view, the purpose of validation is to inform the organization about the effectiveness of its selection system, not merely to inform researchers about the success of their selection design and implementation effort. This purpose requires that organization managers and leaders understand and attach importance to validation results. Our advice is that MCS treat the leaders of the hiring civil service organizations as the primary beneficiaries of its validation research efforts. Certainly, validation efforts will also be critical to the developers and implementers of the testing system who have operational responsibility for its optimization. But, ultimately, validation efforts will be meaningfully linked to the intended purposes of the selection procedures to the extent organization leaders understand the evidence and recognize its importance. The use of existing administrative records – assuming they satisfy other evidentiary requirements – usually helps ensure that the validation effort reflects actual organizational values and that managers and leaders of the hiring organizations understand and attach importance to the results.

Validation can be an expensive undertaking both in terms of money and time. Our advice is that an initial criterion validation effort be conducted that is both affordable and timely. This could possibly be a concurrent study prior to implementation, although this would not be necessary. This would be followed by a more programmatic post-implementation criterion validation strategy. Certainly, the availability of existing administrative records that are meaningful and unconfounded would greatly facilitate such an early criterion validity study.

Limitations. While existing administrative records can have important strengths, they are often subject to problems that render them unusable as validation criterion measures. These limitations typically arise from the fact that they were not developed to serve as validity criterion measures. They were developed to serve other operational purposes. Three main types of limitations often plague administrative records: (1) they often have unknown and suspicious psychometric properties; (2) they can be confounded by other influences; and (3) they may be conceptually irrelevant to the intended uses of the tests being validated.

A significant limitation with existing administrative records is that, usually, their psychometric properties are unknown and, in many cases, there is reason to be suspicious of their psychometric properties. The two primary considerations are their reliability (usually in the sense of stability over time or agreement between raters) and their validity.
Unfortunately, in many cases there is no opportunity to empirically evaluate either the reliability or validity of existing records. An exception occurs where the same measure, for example a performance appraisal or performance metrics, is gathered on multiple occasions over time for the same sample of incumbents. In that case, it may be possible to empirically assess the consistency of such measures over time with a sample of incumbents. Inconsistency over time can occur for at least two reasons. First, the measure itself may be unreliable in a psychometric measurement sense. This might be the case, for example, with untrained supervisors' ratings of promotion readiness. Second, even if the measurement process itself is highly reliable, the work behavior may not be consistent over time or between employees. This is common with many types of sales metrics. For example, a common sales management practice is to give the best sellers the most important and largest sales opportunities. This can have dramatic consequences for sales results. Also, seasonal sales trends can impact differences between sales people's results. As a result, even if the measurement of sales results is highly reliable, the results themselves can be inconsistent over time and across sellers. In this case, it is important to understand that a highly reliable measure of inconsistent behavior/results makes a poor validity criterion. In most cases, empirical assessments of the psychometric properties of existing records will not be possible. In those cases, expert judgment must be relied on to evaluate whether the existing records are likely to have the necessary psychometric properties.

Perhaps the most common limitation of existing records is that they are confounded by other influences that add irrelevant / invalid variance to the measures; as a result, they become unusable. Common examples include appraisal ratings that may be artificially constrained by forced distribution requirements or by supervisors' strategies of evening appraisals out over time. Another example is the use of locally developed knowledge tests as measures of training mastery. Often, such tests are not designed to discriminate among trainees but, rather, to provide trainees the opportunity to demonstrate their mastery of the training content. This can lead to ceiling effects in which a high percentage of trainees receive very high scores, thus greatly restricting the range of scores. Similarly, training mastery tests are sometimes used as a mechanism for feedback about acquired knowledge. In that case, the process of completing the tests might include feedback about correct and incorrect answers, which can lead, eventually, to high scores such that a very high percentage of trainees "successfully" complete training. In our experience, there is rarely any opportunity to empirically evaluate the extent to which existing records are confounded. Expert judgment is almost always necessary to determine whether the degree of confounding invalidates the measure as a possible validity criterion.

Finally, an important consideration is whether an available record of work behavior is relevant to the intended benefit of the selection procedure in question. This can be a significant consideration with regard to the validation of cognitive ability tests used for selection purposes.
Given that cognitive ability is understood to influence job behavior by facilitating the learning of job knowledge, records of work behavior that are unlikely to be influenced by job knowledge should be regarded as irrelevant to the validation of cognitive selection tests. For example, turnover is a frequent work behavior of great importance to organizations. But job knowledge and, therefore, cognitive ability may or may not be relevant to turnover. In contexts in which turnover is largely driven by the attractiveness of available alternatives in the employment market, for example better paying jobs doing similar work in a more convenient location, job knowledge and cognitive ability may not predict turnover. In contrast, in contexts in which job performance is a large determinant of turnover, job knowledge and cognitive ability may be significant predictors of turnover. Another example often occurs in customer service jobs in which adherence to customer interface protocols is required in a high volume of repetitive customer interactions. Successful performance of this type of task may depend far more on sociability and emotional resilience than on easily learned job knowledge. Cognitive ability tests would not be expected to predict such performance because the specific role of cognitive ability in performance would not be relevant to this particular form of job performance.

This point distinguishes between invalidity and irrelevance. Criterion measures that are properly understood to be irrelevant to the construct/meaning of a cognitive ability test should not be used as criteria in the validation of that cognitive ability test. Validity studies should be designed to provide evidence of the extent to which the selection procedures are achieving their intended purpose. That is, irrelevant work measures do not provide evidence of validity. (It is important to note here that this point about the relevance of criterion measures is not inconsistent with Recommendation 9 above to view the overall validation effort in the broad context of program evaluation. We are assuming that MCS will have responsibility for more than just the cognitive ability tests used within the civil service employment system. We assume MCS will also have some degree of responsibility for the other factors in the selection system. This broader view in Recommendation 9 means, among other things, that the design of the validation strategy should take into account all selection factors, not just the validity of the cognitive ability tests.)

All of the above issues considered, we encourage MCS to give careful consideration to existing administrative records as a source of criterion measures. In our experience, certain types of administrative records are often too confounded / unreliable to be useful. These include: (a) performance appraisal ratings by untrained supervisors where those ratings influence compensation decisions such as raises and bonuses (the linkage to compensation often introduces significant irrelevant considerations); (b) training mastery tests that are so "easy" the large majority of trainees achieve very high scores; (c) sales records where there are significant and frequently changing differences between sellers in their opportunity to sell; and (d) peer ratings as might be used for 360 feedback purposes, in which peers have a potential conflict of interest due to the use of 360 ratings to help make promotion decisions.
In contrast, other existing administrative measures often have enough relevance and (judged) reliability to be adequate criterion measures. These include: (a) performance metrics in customer service jobs, especially call centers where the performance metrics are automatically gathered and most employees are working under common conditions; (b) records of completed sales or "qualified" callers in an online sales environment where the sales roles and responsibilities are relatively uniform across all sales representatives; and (c) attendance records of tardiness and absence, including days missed due to work-related illness or injury.

Ad Hoc Assessments

In most cases where existing administrative measures are usable as criteria, supplemental ad hoc criterion measures can add important value. Often, the benefit of ad hoc measures is to complement existing measures to avoid criterion deficiency and to provide a method for assessing some form of overall performance or value to the organization. Examples of common ad hoc assessments include: ratings of overall performance and/or contribution to the organization; ratings of specific facets of overall performance such as customer interface effectiveness, work accuracy, work efficiency and team participation; ratings of work-related skills and knowledge; ratings of potential or readiness for progression and promotion; ratings of citizenship behaviors such as helping behaviors and loyalty behaviors; ratings of counterproductive behaviors such as non-collaborative behaviors, behavior antagonistic to the goals of the job or organization, theft, malingering, etc.; scores on ad hoc job knowledge tests; and scores on ad hoc high fidelity work sample tasks.

Ad hoc assessments are virtually always specific to the particular job. A common source of job information used to specify the ad hoc assessments is some form of structured job analysis that seeks to identify the most important skills, outputs, and tasks/activities based on incumbent and supervisors' ratings of many specific job tasks/activities/abilities. An alternate, often less costly approach is to use existing job descriptive material that identifies major duties, responsibilities, and tasks/activities as a source of broad information about the job, followed by focus group discussions with job experts that specify the particular job outputs, tasks, and behaviors associated with the documented job information. Virtually always, the source of job information used to identify and specify ad hoc criteria is job expert judgment elicited in some structured fashion, whether in the form of job analysis surveys, focus group outputs or individual expert interviews. In effect, this expert source is the most common method of ensuring the validity or relevance of the ad hoc criterion.

The most common measurement method used to assess ad hoc criteria is ratings by supervisors and managers. Woehr and Roch (2012) provide an excellent summary of recent research and discussion about supervisory ratings of performance. Also, Viswesvaran (2001) provides an excellent summary of the broader issues of performance measurement. Both acknowledge the usefulness and professional acceptability of supervisory ratings while at the same time providing information about their limitations. Assuming that job expertise was appropriately used to specify relevant content for such rating scales, the primary concern about supervisory ratings is their relatively low reliability.
Although there are different aspects of reliability (internal consistency, stability over time, etc.) and different methods for estimating reliability (parallel forms, test-retest, split-half, internal consistency, inter-rater reliability, etc.), a commonly used estimate of supervisory rating reliability is .62. This estimated value is frequently used in meta-analyses of validities based on supervisors' ratings where no reliability evidence is presented in the particular studies. While this level of reliability is marginal and would not be considered adequate for measures to be used as the sole basis for making employment selection decisions, it is commonly accepted professionally given the purposes supervisory ratings typically support. Typically, supervisory ratings are one source of validity information used to provide confidence about the appropriateness of a particular selection procedure.

Given this marginal level of reliability, it is important to develop supervisory rating scales following professionally accepted methods that will maximize the level of reliability. Three major elements of the development process are important to ensure maximum reliability (and validity): (1) adequate job expertise; (2) careful specification of rating scales; and (3) adequate rater training.

Adequate Job Expertise. This requirement manifests itself in two ways. First, the process of identifying and specifying ad hoc criteria almost always relies on the judgments of job experts about the facets of performance and work behavior that are important enough to include as validation criteria. Second, the raters who assess the performance and behavior of the participating employees should be job experts and sufficiently familiar with the target employee to make accurate ratings. A common threshold for familiarity is 3-6 months supervising the employee. Also, the supervision should include adequate exposure to the employee's work behavior and performance. While this can be difficult to judge in some cases, the rater should be willing to endorse his or her own familiarity with the target employee.

Careful Specification of the Rating Scales. Generally, well-constructed rating scales will include behavioral anchors (descriptions) for at least the high and low ends of the rating scale. These anchors are descriptions of the types of behavior and/or performance levels that exemplify the meaning of the rating level associated with the anchors. These anchors are typically developed following structured processes involving job experts who are able to generate descriptions of high and low levels of behavior/performance. Also, a common practice is to use two stages of development of anchors by having guided experts develop draft anchors in one stage and then having a separate group of equally expert judges resort the draft anchors into their original scales and levels. Anchors that Stage 2 experts associate with different scales and/or different levels would be removed.

Adequate Rater Training. There is a clear professional consensus that rater training is important to the validity and reliability of supervisory ratings. At a minimum, this training describes the meaning of the rating scales and their anchors and provides a clear description of common rater errors such as halo effects, leniency, and central tendency.
Better training would include opportunities for multiple raters to observe the same performance / behavior samples, make ratings and then compare their ratings, with feedback and discussion about the relationship between the observed behaviors / performance examples and the meaning of the rating scales.

The Psychometrics of Criterion Measures and the Estimation of Operational Validity

Like other personnel selection professionals who design and conduct validation studies using supervisory ratings, Sabbarah / MCS may wonder what standards are applied to determine whether specific rating scales are good enough to include in a study. While professional standards and practice do not provide precise guidance about this, a professionally reasonable guideline would be based on the three development elements described above. To the extent these three elements were meaningfully applied in the development of a rating scale, Sabbarah / MCS could be confident that the rating scale is relevant (i.e., valid) and psychometrically adequate (i.e., reliable) enough to include as a criterion measure in a validity study. To the extent that any one of these elements is missing from the development process, the risk increases that the rating will not add useful information. An additional tactic for protecting against inadequate individual scales is to factor analyze the multiple scales used to gather ad hoc assessments, identify the small number of meaningful factors and create factor-based composite scales by combining the individual scales loading on the same factors into composite measures. This tactic requires that several rating scales be developed to sample or represent the criterion "space" captured in the expert-based analysis of job performance and behavior. Of course, a non-empirical approach to the same tactic is to rationally identify the major criterion space factors prior to the validation study, using well-instructed job experts. This would enable the criterion composites to be formed prior to data gathering. The advantage of the factor analysis approach is that it provides some empirical evaluation of the measurement model underlying the rating scales.

In general, criterion unreliability is well-understood to be an artifact that negatively biases validity estimates. That is, all else the same, less reliable criteria produce lower observed validities than more reliable criteria. Further, any unreliability in a criterion measure results in a validity estimate that is lower than the level of validity representing the relationship between the selection procedure as used operationally and the actual criterion as experienced and valued by the organization. That is, the organization benefits from actual employee behavior and performance, not imperfectly measured behavior and performance. For this reason, the validity estimate that best represents the real relationship between the selection procedure as used and the criterion as it impacts the organization is an observed validity that has been corrected for all of the unreliability in the criterion measure. As a standard practice, Sabbarah / MCS should correct observed validities for the unreliability of the criterion measures to produce a more accurate index of the operational validity of a selection procedure. (This is the well-known correction for attenuation described in many psychometrics texts. The corrected validity is defined as the observed validity divided by the square root of the observed criterion reliability index.)
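In the notation used in most psychometrics texts, with r_xy the observed validity and r_yy the reliability of the criterion measure, the correction just described can be written as follows; the worked figure in the comment is a hypothetical illustration that reuses only the .62 supervisory rating reliability cited above.

```latex
% Correction for attenuation due to criterion unreliability
\[
  \hat{\rho}_{\text{operational}} = \frac{r_{xy}}{\sqrt{r_{yy}}}
\]
% Hypothetical illustration: an observed validity of .30 against supervisory
% ratings with reliability .62 implies an operational validity of about
% .30 / sqrt(.62) = .38, before any correction for range restriction.
```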
By this method of correction, the psychometric bias introduced by any imperfectly reliable criterion, including marginally reliable supervisory ratings, can be eliminated from the estimation of the predictive validity. However, one other artifact in a common type of predictive validity study introduces a separate negative bias into observed validities. Range restriction in the predictor (and criterion) occurs when the sample of employees for whom criterion data is available has a narrower range of predictor scores than the applicant population within which selection decisions are made based on the selection test. This restriction occurs in validity studies where the test being validated is used to make the selection decisions that determine which applicants become employees whose work performance measures are eventually used as the criteria. (Validity studies that administer the test to applicants but do not use the test scores to make selection decisions may avoid all or most of this range restriction. Unfortunately, such studies are unusual because they are significantly more costly: they require the employer to hire low-scoring applicants who are likely to be lower performing employees.) Like the negative bias due to criterion unreliability, the negative bias due to range restriction results in the observed validity coefficient underestimating the actual, or operational, validity of the test. Because the employer benefits from avoiding the low performance levels of rejected applicants, the validity coefficient that most accurately represents the effective validity of a test is one that has been corrected for range restriction, as well as criterion unreliability. This corrected, more accurate validity is known as the operational validity of a selection procedure and is the validity of interest to an organization. This is why Sabbarah / MCS should correct all observed predictive validities for both criterion unreliability and for range restriction. These corrections and the underlying psychometrics justifying them are described in many sources including Hunter and Schmidt (1990).

Overall, the typically marginal levels of reliability associated with supervisory ratings should be taken into account in two separate ways.

1. The development of such ad hoc measures should include three key elements relating to the use of job expertise, scale construction, and rater training to minimize the risk of inadequate reliability.
2. The empirical estimation of operational validity, which is the validity index most important to the hiring organizations, should correct for the psychometric effects of unreliability in any criterion measure, including those with marginal reliability.

(Note, we have addressed the problem of marginally reliable supervisory ratings in the context of ad hoc criteria developed for the purposes of the validity study. However, exactly the same problem exists, and likely more severely in many cases, with existing administrative measures such as performance appraisal ratings that also rely on supervisory ratings.)
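A minimal sketch of the two corrections applied in sequence is shown below, using assumed values for the observed validity, the criterion reliability, and the degree of range restriction. The ordering of the corrections and the reliability estimate to be used can vary across procedures (see Hunter & Schmidt, 1990); this is one common sequence offered only as an illustration, not a prescription.

```python
# Sketch: correcting an observed validity for criterion unreliability and for
# direct (Thorndike Case II) range restriction, using illustrative values.
from math import sqrt

def correct_attenuation(r: float, ryy: float) -> float:
    """Correct an observed validity for unreliability in the criterion measure."""
    return r / sqrt(ryy)

def correct_range_restriction(r: float, u: float) -> float:
    """Thorndike Case II correction; u = unrestricted SD / restricted SD of the predictor."""
    return (r * u) / sqrt(1 + r**2 * (u**2 - 1))

r_observed = 0.25  # validity observed in the restricted (incumbent) sample -- assumed
r_yy = 0.62        # assumed reliability of the supervisory-rating criterion
u = 1.50           # assumed ratio of applicant-pool SD to incumbent SD on the test

r_corrected = correct_range_restriction(correct_attenuation(r_observed, r_yy), u)
print(f"Estimated operational validity: {r_corrected:.2f}")
```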
Recommendations about Content Validity

Background

Given Sabbarah's strong interest in the role of content validation evidence, these recommendations were developed to help clarify two primary points relating to content-oriented evidence for cognitive ability selection tests.

A. Professional standards and principles about content validity for selection tests focus on the nature of content evidence linking test content to job content. This linkage is necessary to infer from content evidence that test scores will predict job performance.

B. Even the recent arguments that job-oriented content evidence may be available for cognitive ability tests (e.g., Stelly & Goldstein, 2007; Schmidt, 2012) acknowledge that content evidence is not relevant to tests of broad cognitive abilities that are not specific to identifiable job tasks and/or KSAOs. Our view is that the Phase 1 cognitive subtests will be like this. Content evidence will not be available for the test-job linkage.

This set of recommendations and the rationale provided are informed by four primary considerations:

A. The Study 1 proposal regarding the role of content validity for item and test development;
B. The Study 2 findings regarding the absence of content validity evidence for any of the seven reviewed cognitive ability batteries;
C. Professional guidance regarding content validity evidence for employment tests; and,
D. MCS's proposal to stage the development of the civil service exam system such that the focus of Phase 1 will be on a set of cognitive abilities that are applicable across the full range of civil service jobs and a later Phase 2 will focus on more job-specific testing such as performance assessments.

Recommendation 13: Acknowledge the difference between job-oriented content validity evidence and ability-oriented content validity evidence. These are different types of evidence and support different inferences about test scores.

Recommendation 14: Where content evidence is intended to support the inference that test scores predict job performance, apply the definition and procedures of job-oriented content validity that have been established for personnel selection tests in the relevant professional standards (i.e., the NCME, AERA, APA Standards for Educational and Psychological Testing (Standards) and the SIOP Principles for the Validation and Use of Personnel Selection Procedures (Principles)). Phase 1 tests are unlikely to provide job-oriented content evidence, whereas job-oriented content evidence for Phase 2 tests may be important and efficient to obtain.

Study 1 Proposal

Study 1 proposes a different type of content validation evidence than is described in the Standards and Principles for employment tests. Perhaps the most critical distinction between the content validity evidence described in Study 1 and the content validity evidence described in professional standards relating to employment selection is the domain of interest. Both approaches to content validity begin with the requirement that the domain of interest be clearly specified. In both cases, the domain of interest refers to the domain of abilities/activities/behaviors to which test scores are linked by content-oriented evidence. In the case of cognitive employment tests developed for MCS's civil service tests, Study 1 proposes that the domain of interest is the CHC theoretical framework that provides a taxonomy of cognitive abilities. "The first layer (of the recommended ECD process) is domain analysis. Here, the test developer specifies lists of concepts according to the Cattell-Horn-Carroll (CHC) intelligence theory, which Keith & Reynolds (2010) believe serves as "a 'Rosetta Stone'…for understanding and classifying cognitive abilities."" (pg. 5)
In clear contrast, the Principles for the Validation and Use of Personnel Selection Procedures (Society for Industrial and Organizational Psychology, 2003) describes this first step of a content validation process as, "The characterization of the work domain should be based on accurate and thorough information about the work including analysis of work behaviors and activities, responsibilities of the job incumbents, and/or the KSAOs prerequisite to effective performance on the job." (pg. 22)

Essentially, the Study 1 content validity methodology, which comes from educational assessment, describes the domain of interest as the abilities the items and tests are designed to measure, whereas professional principles for the content validity of employment tests describe the domain of interest as the work activities/behaviors the items and tests are designed to predict. The reason for the difference follows directly from the difference in the purposes of the two types of ability assessment. The typical purpose of educational assessments is to assess achievement or capability or some other characteristic of the individual's level of the target skill/ability/knowledge. This requires that item and test scores are interpretable as accurate measures of the target skill/ability/knowledge. In contrast, the universal purpose of employment tests is to inform personnel decisions, which requires that item and test scores are interpretable as accurate predictors of work performance. These are two forms of content evidence that inform different inferences about item and test scores necessary to support different purposes of the tests. ECD-based content validity informs measurement inferences because it is ability oriented; employment test content validity informs prediction inferences because it is job oriented. The development of employment tests will benefit from both types of content evidence because employment tests should not only predict work behavior but should also be valid measures of the abilities they are designed to assess.

Findings from the Review of Seven Batteries

None of the seven reviewed batteries is supported with content validity evidence. Two primary reasons for this, presumably, are (a) US regulations governing matters of employment discrimination view content validity evidence as irrelevant to cognitive ability tests, and (b) it is widely agreed that content-oriented evidence is irrelevant to certain types of commonly used general, broad cognitive ability tests because their content does not lend itself to a comparison to job content and they are often not designed explicitly to represent important elements in the job domain. For example, abstract reasoning tests do not usually lend themselves to content-oriented validity evidence because they are not developed to sample any particular work behavior and their content is not comparable to typical descriptions of job content. This lack of content evidence is not a criticism of the likely validity of the seven batteries or of the developers' professional practices. It is simply an acknowledgement of the widely held view that certain types of commonly used cognitive ability tests were not developed to sample job content and do not have the specificity of content to enable content evidence to be gathered. In our professional experience this finding about the lack of reliance on content validity is typical of commercially available cognitive ability employment tests.
This point will have important implications for MCS's overall validation strategy as captured in Recommendation C. Further, this lack of content validity evidence should not be interpreted to mean that developers or the personnel selection profession in general have little interest in sources of validity evidence other than predictive validity evidence. Almost all of the reviewed batteries are supported by some form of construct validity evidence, most often in the form of correlations with separate tests of similar abilities. In addition, although we were unable to gather detailed information about item and test development practices in many cases, professional guidance for employment tests calls for a variety of development procedures intended to improve the measurement validity of items and tests, including psychometric analyses of differential item functioning to reduce group-based sources of invalid variance (bias) and reviews of newly developed items by cultural experts to identify potentially biasing content. In addition, a common practice is the use of pilot studies to evaluate the psychometric characteristics of new items to screen out underperforming items, including items that appear to be measuring something different from other items designed to measure the same ability.

Professional Guidance about Job-Oriented Content Validity for Employment Tests

Professional guidance about employment tests describes five major processes necessary to establish content evidence of validity.

1. An appropriate job analysis foundation that provides credible information about the job
2. A clearly defined job content domain describing the important work behaviors, activities, and/or worker KSAOs necessary for effective job performance
3. Appropriate procedures for establishing key linkages (a) between the job analysis and the job content domain, (b) between the job content domain and the test development specifications, and (c) between the test development specifications and the item content
4. Appropriate qualifications of the subject matter experts who will make judgments, such as importance ratings, about the key linkages
5. Adequate documentation of all methods and procedures

The Study 1 effort, which apparently intends to incorporate certain aspects of job-oriented content evidence, is not adequate to satisfy these professional standards. The most significant gaps are (a) the lack of a well-specified job content domain, (b) the absence of job content domain considerations in the development of the test plan specifications and item content, and (c) an inadequate linkage between item content and the job content domain. Study 1 proposes that each item be rated by job experts for its importance "for use in civil service". But this reference to "civil service" as the domain of interest for importance ratings is not sufficient to link items to the "most important work behaviors, activities, and / or worker KSAO's necessary for performance on the job…" as prescribed by SIOP's Principles.

MCS's Staged Approach to Test Development

MCS has established a plan to develop and use a set of general cognitive ability tests during the first phase of its civil service system that will be used to inform hiring decisions for all civil service jobs. Our recommendation (see below) is that these tests not have job-specific content and not be designed to sample important job content.
MCS's plan is to introduce job-specific tests, such as performance tests and, possibly, tests of job-specific skills/abilities/knowledge, in a second phase of the civil service system. This distinction between Phase 1 and Phase 2 tests is critical to the relevance of job-oriented content validity evidence. Our recommendation (see below) is that Phase 1 tests be similar in many respects to the most common types of tests represented in the reviewed batteries, capturing cognitive abilities in the verbal, quantitative and problem solving/reasoning domains. We also recommend that Tier 2 content be built into the items in these tests but that none of these tests be developed explicitly to sample important job content. If MCS proceeds in this recommended manner, job-oriented content validity evidence will not be relevant because item/test content was not developed to sample job content.

Rationale for Recommendations

Recommendation 13: Acknowledge the difference between job-oriented content validity evidence and ability-oriented content validity evidence. These are different types of evidence and support different inferences about test scores.

Both types of evidence can be valuable. The ECD-based content evidence described in Study 1 is evidence about the linkage between items, tests and the abilities they are intended to measure. The value of this evidence is that it allows scores to be interpreted as measures of the intended abilities. In contrast, the content evidence described in professional standards for employment tests is about the linkage between items, tests and the job content they are intended to sample. The value of this evidence is that it allows scores to be interpreted as predictive of job performance. Recommendation 13 asserts that these are separate types of evidence and both have value. However, ability-oriented content evidence does not, itself, imply that scores are predictive of job performance. By being oriented toward a theoretical model of abilities that have been identified independent of their relationship to job performance, this ability-oriented evidence cannot directly establish a link between scores and job performance. Further, item importance ratings that refer to importance for "use in civil service" are not sufficient to establish this score-job linkage because "use in civil service" does not specify the behaviors/activities/KSAOs important for effective performance. This recommendation supports the general concept of ability-oriented content evidence within the ECD process model but cautions that it does not substitute for job-oriented content evidence.

As an aside to this recommendation, we note here that reliance on the ability-oriented ECD process as described in Study 1 is likely to lead to the identification of many more ability constructs and the development of a potentially much larger number of subtests than selection research has shown are needed to achieve maximum predictive validity. Our broad suggestion is that the initial domain specification in the ECD approach should include considerations of job content in the service sector and previous relevant predictive validity research results, as well as a theoretical model of cognitive abilities. Such an approach to domain specification that is more job oriented is likely to yield a much more efficient, lower cost item and test development effort.
Recommendation 14: Where content evidence is intended to support the inference that test scores predict job performance, apply the definition and procedures of job-oriented content validity that have been established for personnel selection tests in the relevant professional standards (i.e., the NCME, AERA, APA Standards for Educational and Psychological Testing (Standards) and the SIOP Principles for the Validation and Use of Personnel Selection Procedures (Principles)). Phase 1 tests are unlikely to provide job-oriented content evidence, whereas job-oriented content evidence for Phase 2 tests may be important and efficient to obtain.

We anticipate that the tests developed for use in Phase 1 are unlikely to be designed to be samples of important job content. Rather, they are likely to be somewhat broad, general ability tests relying on Tier 1 (work-neutral) and/or Tier 2 (irrelevant work context) content, which does not enable job-oriented content evidence to be gathered. In contrast, it appears that MCS's approach to test development in Phase 2 will emphasize job-specific tests. While these tests might take several different forms (e.g., performance tests, situational judgment tests, job knowledge tests, and cognitive ability tests requiring job-specific learning (Tier 3)), these types of tests designed to sample job-specific content are likely to enable the collection of job-oriented content validity evidence. For these types of cognitive tests, content validity is likely to be an important and efficient form of evidence. In this case, the professional standards relating to content validity for employment tests should be applied to the validation process.

The Role of Content Validity in the Development of GATB Forms E and F

In an effort to more completely address the possible role of content validity evidence in the development of GCAT cognitive ability tests, we more closely reviewed the processes used to develop GATB Forms E and F to describe and evaluate the role of content validity processes and evidence in that development effort. We reviewed Mellon, Daggett, MacManus and Moritsch (1996), Development of General Aptitude Test Battery (GATB) Forms E and F, which is a detailed, 62-page summary of the presumably longer technical report by the same authors of the same GATB Forms E and F development effort. The longer technical report could not be located. However, the summary we reviewed provided sufficient detail about item content review processes that we are able to confidently report about the nature of that process and its possible relationship to content validity. In particular, pgs. 11-15 addressed "Development and Review of New Items" and provided detailed information about the nature of the expert review of the newly drafted items. Our focus was on the item-level review process used to evaluate the fit or consistency between item content and the meaning of the abilities targeted by the newly developed items. Several key features of this process are described here.

1. All new items were reviewed by experts, which led to revisions to the content of specific items in all seven subtests.
2. The focus of these reviews, however, was not primarily on the fit or linkage between item content and the target aptitudes, but instead was on possible sources of item bias relating to race, ethnicity and gender. However, this "item sensitivity" review was closely related to an item validity review because sources of race, ethnic, and gender bias were regarded as "bias", that is, sources of invalidity.
3. A highly structured set of procedures was developed including bias guidelines, procedural details, a list of characteristics of unbiased items, and rating questions.
4. A key feature of the review process was a set of standardized "review questions" that reviewers were to answer for each item using a prescribed answer form.
5. Although this review was focused primarily on item content that was plausibly sensitive to race, ethnic and gender differences, reviewers also provided feedback about other item quality characteristics such as distractor quality, clarity of test instructions, susceptibility to testwise test takers, unnecessary complexity, appropriateness of reading level, excessive difficulty, and sources of unnecessary confusion.
6. The panel of seven reviewers consisted of two African-Americans, three Hispanics, two whites, three males and four females. All had relevant professional experience including three personnel analysts, two university faculty members in counselor education, one personnel consultant, and a post-doctoral fellow in economics.
7. After training, the reviews appear to have been completed individually by the reviewers.
8. The output of the review panel was used to revise items, instructions and formats.
9. It does not appear that revised items were re-reviewed, nor does it appear that the reviewers' output was aggregated into an overall summary of the extent of bias in the items within subtests.

The developers/authors do not describe this process as a content validation process although it appears to have important similarities to the concept of item content review described in Study 1 in that the bias factors appear to have been described as sources of construct-irrelevant variance. However, this review was considerably narrower in scope than the process envisioned by Study 1. A second point worth noting is that this process is not unlike similar processes frequently used by developers of personnel selection tests to minimize the possibility of group-related sources of bias and invalidity. However, in the personnel selection domain and in related standards and principles, this type of review is not described as a content validation process. Rather it is described by various terms including bias review and sensitivity review to indicate that its primary focus is on group-related bias reduction. Semantics notwithstanding, this exercise is directly relevant to the matter of construct validity. Certainly, we encourage Sabbarah / MCS to undertake this type of bias or sensitivity review, although the focus may be less on race/ethnic group differences than on other possible group differences (male – female?) that, in Saudi Arabia, could be indications of construct-irrelevant variance. Sabbarah / MCS may want to consider broadening the scope of the review to include any possible sources of construct-irrelevant variance. In that case, this review would presumably be very similar to the type of review envisioned by Study 1 and labeled as content validity. If the scope of the review were broadened, Sabbarah / MCS would need to broaden the range of expertise represented among the reviewers. In addition to having reviewers with expertise relating to race, ethnic and gender group differences in test performance, other reviewers would be needed whose expertise was in the meaning and measurement of cognitive ability.

One final observation concerns the comparability of the GATB Forms E and F development process and the GCAT development process.
The GATB Forms E and F development process was described to item developers as a process of developing new items that should be parallel to existing items in the previous forms. This may have even included item-by-item parallelism where a new item was developed to be parallel to a specific existing item. In any case, the item specifications and writing instructions guiding item development for Forms E and F clearly focused on the requirements of item-level parallelism. This presumably meant that there was less focus on precise definitions of the target aptitudes. In contrast, the item specifications and writing instructions for GCAT item development will focus on the clear description of the meaning of the target abilities and the work-like tasks (Tier 2 content) that are presumed to require them. These same GCAT item specifications and writing instructions would presumably provide the descriptions of the target abilities necessary to enable reviewers of item content to make judgments about construct-irrelevant variance. One implication of this feature of the GCAT item development is that the item specifications themselves and the processes used to (a) adhere to those specifications during item writing, and (b) confirm that the specifications were met in drafted items would, once carefully documented, themselves constitute content-oriented validity (or construct validity) evidence.

ORGANIZATIONAL AND OPERATIONAL CONSIDERATIONS

Security Strategy

Background

Our perspective about the important question of MCS's security strategy is grounded in three observations. First, even though MCS is unlikely to implement an unproctored administration option, the protection against cheating and piracy and the effort to detect violations of test security remain critical requirements for the maintenance of a successful civil service testing program (Drasgow, Nye, Guo & Tay, 2009). We take this perspective while acknowledging that we have no experience with the environment in Saudi Arabia with respect to this issue. Second, we believe a model test security program has been established by SHL in support of Verify, which is administered unproctored, online. While certain features of SHL's strategy are not relevant to MCS's proposed system of proctored test administration, many could be. Finally, MCS's civil service testing system is a resource to government agencies and citizens of Saudi Arabia who will rely on it to be effective and fair. MCS has a responsibility as a government agency to ensure the effectiveness and fairness of this system, which is a somewhat different responsibility from the commercial interests of the publishers of the reviewed batteries.

Recommendations

Recommendation 15: Implement a comprehensive test security strategy modeled after SHL's strategy for supporting its Verify battery. This approach should be adapted to MCS's proctored administration strategy and include the following components.

1. In the early stages of the program, prior to the online capability to construct nearly unique online forms for each test taker, use several equivalent forms of the subtests in both paper-pencil and online modes of administration. Do not rely on two fixed forms.
2. Refresh item banks routinely over time with the early objective of building large item banks to eventually enable nearly unique online forms for every test taker.
3. Monitor item characteristics as applicant data accumulates over time.
4. Use data forensic algorithms for auditing applicants' answers for indications of cheating and/or piracy.
At the beginning of the program, evaluate the option of entering a partnership with Caveon to support this form of response checking.
a. To support this approach, measure item-level response time with online administration.
b. Establish a clear policy about steps to be taken in the case of suspected cheating.
5. Establish a technology-based practice of conducting routine web patrols of Internet sites for indications of collaboration, cheating or piracy with respect to the civil service tests.

Recommendation 16: Communicate clearly and directly to test takers about their responsibility to complete the tests as instructed. Require that they sign an agreement to do so. (Note, we acknowledge that the manner in which this is done, and perhaps even the wisdom of doing this, must be appropriate to the Saudi Arabian cultural context.)

Rationale for Recommendations

Recommendation 15

The proactive and vigilant approach recommended here assumes that cheating and collaborative piracy will be ongoing threats. The recommended tactics are intended to respond quickly to indicators of security violations and to prevent potential or imminent threats. The two purposes of this approach are to protect the usefulness of the item banks and to promote the public's confidence in the civil service system.

Recommendation 16

The International Test Commission (2006) has established guidelines for unproctored online testing that call for an explicit agreement with the test taker that he accepts the honesty policy implied by the test instructions. This recommendation suggests that MCS benefit from this professional practice even though the risk of cheating may be somewhat less with proctored administration. Such an agreement allows MCS to establish clearer actions to be taken in the case of suspected cheating and it also serves to communicate to all test takers MCS's proactive approach to the protection of test security. In the long run, this type of communication will help to develop a reputation for vigilance that will encourage public confidence in the system.

Item Banking

Background

The Study 2 review of batteries provided relatively little technical information about item banks, even for Pearson's Watson-Glaser and SHL's Verify, both of which rely on an IRT-based item bank approach to the production of randomized forms for unproctored administration. No publisher that relies on a small number of fixed forms appears to use item banking in any sense other than, presumably, for item storage. Even among these fixed form publishers, only PSI has described a deliberate strategy of accumulating large numbers of items over time to support the periodic transparent replacement of forms. As a result, this review provides no recommendations relating to the technical characteristics of bank management and forms production. Rather, we strongly encourage an IRT bank management approach for somewhat less technical purposes. (We understand that Study 1 provides a description of technical methods for bank management.)

Recommendations

Recommendation 17: Develop an IRT-based bank management approach to item maintenance and test form construction. Even if MCS does not implement unproctored administration, a bank management strategy will support

1. the development of large numbers of forms to support test security strategies,
2. a structured approach to planning for new item development,
3. job-specific adaptation of form characteristics, as appropriate over time, in the manner SHL adapts Verify forms to job level, and
4. an ongoing item refreshment/replacement strategy that includes new items and, potentially, deactivates compromised items for a period of time.

The bank management approach would be based on IRT estimates of item parameters that replace classical test theory statistics over time, would enable decisions about and analyses of item retention thresholds to periodically delete poor performing items from the bank, and would require that bank health metrics be developed.

Recommendation 18: Seed the bank with enough items initially to support approximately 4-6 fixed forms for both paper-pencil and online administration.

1. We do not recommend an episodic approach of using one or two fixed forms for a period of time, then replacing them with two new fixed forms. From the beginning, MCS should administer different forms to different applicants to the extent possible. Note, these initial forms will be based on CTT item and test statistics and will not provide an opportunity to apply IRT-based rules for the production of equivalent forms.

Recommendation 19: Update and review item characteristics as adequate data allows. Such updates/reviews may be triggered by time or volume since last update but may also be triggered by security incidents.

1. Reviews based on security incidents might only involve the investigation of recent response patterns, which may not be sufficient to update item statistics. Decisions to deactivate compromised items should be conservative in favor of item security. Compromised items would be deactivated and moved out of the active bank so they do not influence bank-wide analyses that might be done to develop plans for new item development or to provide metrics of overall bank health.
2. Updated item statistics should place more weight on recent data unless specific attributes of recent data require that they be given less weight.
a. Decisions to remove items from the bank based on recent data should be conservative in favor of retaining the item until there is a high probability that the item's quality harms the test.

Rationale for Recommendations

Recommendation 17

In the early stages of item and test development, bank management will have little, if any, practical value since, presumably, the majority of newly developed items will be used in the initial set of 4-6 fixed forms. (We presume an objective of initial item development is to write a sufficient number of items to support 4-6 fixed forms.) However, bank management will become a fundamental component of MCS's support of its civil service tests, and MCS processes should be developed from the beginning to be consistent with a bank management approach.

Recommendation 18

The proactive approach to test security is the primary consideration in recommending that 4-6 fixed forms, rather than the more typical two, be developed and used initially. To the extent that applicants will communicate with one another following the initial testing session, the early reputation of the testing system will be bolstered by applicants' experience that many of them completed different forms.

Recommendation 19

The overall purpose of Recommendation 19 is that MCS should become engaged in monitoring and tracking item characteristics as soon as adequate data is available.
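As an illustration of the kind of routine monitoring described in Recommendation 19, the sketch below recomputes a classical item difficulty statistic from accumulating response data and flags items that drift from a stored baseline. The item identifiers, data structures, and threshold are hypothetical illustrations, not prescriptions.

```python
# Sketch: flag items whose observed difficulty (proportion correct) has drifted
# from the p-value recorded at calibration. Values here are illustrative only.
from statistics import mean

# responses[item_id] holds 0/1 scores from recent administrations (hypothetical data).
responses = {
    "VRB-014": [1, 1, 0, 1, 1, 1, 0, 1, 1, 1],
    "QNT-203": [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
}
baseline_p = {"VRB-014": 0.78, "QNT-203": 0.62}  # stored calibration p-values

DRIFT_THRESHOLD = 0.15  # flag items whose p-value shifts by more than this amount

def flag_drifting_items(responses, baseline_p, threshold=DRIFT_THRESHOLD):
    flagged = []
    for item_id, scores in responses.items():
        current_p = mean(scores)
        if abs(current_p - baseline_p[item_id]) > threshold:
            flagged.append((item_id, baseline_p[item_id], round(current_p, 2)))
    return flagged

for item_id, old_p, new_p in flag_drifting_items(responses, baseline_p):
    print(f"Review {item_id}: calibration p = {old_p}, recent p = {new_p}")
```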
Such early analyses should not only examine item characteristics and perhaps begin to develop IRT estimates but should also examine test score distributions in different groups of applicants including applicants for different job families, applicants with different levels of education, and applicants with different amounts/types of work experience. In addition, early analyses should also examine subtest characteristics and relationships and should carefully examine the characteristics of hired applicants compared to applicants who were not hired. An important but unknown (at least to the authors) feature of the selection system is the level of selection rates for different jobs and locations. Early analyses of the effects of selection rates on the benefits of the selection processes will be important.

Staff Support

Background

The Study 2 requirements specified that we provide recommendations about important staffing considerations. We are happy to provide the recommendations below, but these are based on our own professional experience with more and less successful personnel selection programs. These recommendations were not informed by our review of the seven batteries. Kehoe, Mol, and Anderson (2010), Kehoe, Brown and Hoffman (2012) and Tippins (2012) summarize requirements for effectively managed selection programs. While none of these chapters specifically addresses civil service selection programs, they are likely to provide useful descriptions of many organizational considerations important for MCS.

Our suggestions about staff requirements to manage a successful civil service selection program will be described as a single recommendation with several specific suggestions about key staff roles and functions. The rationale for each suggested role/function will be embedded in the description of that role/function. For the purposes of this set of recommended roles/functions, we are assuming the operational roles required to manage the Test Centers are part of the whole civil service program organization. Finally, we are making no assumptions about the current organizational structure or staffing in MCS or in Sabbarah that may be directly involved in the early development of the subtests and other program features. Our perspective is that it is unlikely either organization is currently staffed and organized in a manner that is optimal for the management of a civil service selection program.

Recommendations

Recommendation 20: Staff key roles/functions with the skills and experience required to support a successful personnel selection program.

While this organization has diverse responsibilities including research and development, policy and strategy, government leadership, and delivery operations, its core responsibility is in the domain of personnel selection program management, including both the technical and management requirements of this core function. Specific roles/functions are recommended in each of the following subsections. For convenience we will refer to the organization supporting MCS's civil service selection program as the Saudi Arabian Selection Program (SASP).

1. Chief Operations Officer. SASP's chief operations officer, who is responsible for the day-to-day operation, should have significant personnel selection expertise with experience in strategic, technical and operational leadership roles with large scale personnel selection programs. In our view, this is the single most critical staffing decision for SASP.
This role would support both operations and technical/research functions. Presumably, this role would report to the SASP chief executive officer, who we assume to be a senior manager/officer in the Saudi Arabian government. We recognize that this may be a very challenging recommendation for MCS. If it is not feasible to identify a person with these qualifications at the beginning of MCS's development work, we strongly encourage MCS to develop a strong consulting relationship with a senior industrial psychologist in the region to provide oversight and consultation. Absent this particular professional experience and influence, decisions may be made about the many recommendations from Study 1 and Study 2, among other things, that are more expensive than necessary or less effective than is possible.

2. Research and Development Function. SASP should support a strong research and "product development" function. We anticipate that SASP will be a small organization and, at least initially, research-related work will be indistinguishable from product development work since the only technical work underway will have a primary focus on early product development. While this single research and product development function would ideally be led by an individual with a strong technical background in personnel selection, the early focus on item and test development and on systems development (administration systems, scoring systems, applicant data management systems, scheduling and job posting systems, etc.) could be effectively led by individuals with technical experience and expertise in psychological test development and by individuals with experience and expertise in employment systems. This research function is central to the progressive development of the civil service testing system. This function would be responsible for overseeing validation studies, norming studies, IRT estimation analyses, overall bank management, utility studies about the impact of each selection factor on selection decisions, job analysis work, forensic data analyses in support of the security strategy, and any analyses/investigations relating to the specification and plans for Phase 2 work on job specific assessments.

3. Security Strategy Leader. Given the strong emphasis we have placed on a proactive and well-communicated security strategy, we recommend that SASP establish the role of Security Strategy Leader at the beginning of the early product development work. The MCS security strategy will have a significant impact on early decisions about questions such as numbers of forms, numbers of items, systems capabilities for handling multiple forms, and systems capabilities for forensic data analyses. The security strategy should be well established and well represented early in the product development and systems planning work.

4. Interdisciplinary Team Approach to Work. SASP should implement a team-based development strategy by which technical/psychometric/testing experts responsible for item and test development have frequent opportunities to review program plans with IT / Systems / Programming experts and with operations support managers. While the chief operating officer should have overall management responsibility for all parts of this team approach, these three major roles should interact at an organizational level below the chief operating officer. We further recommend that this interdisciplinary approach to work become part of the SASP method of work.

5. Test Center Operations.
All operational roles responsible for the management of testing processes themselves – test administration, scoring, materials storage, etc. – should require that staff are trained and certified as competent for the performance of their roles and responsibilities.

6. Account Manager. We recommend that SASP establish what might be called Account Manager roles responsible for SASP's relationships with the users of its testing services. The three primary user groups appear to be applicants, hiring agencies, and government officials who fund and support SASP. It is also possible that a fourth group, schools and training organizations who prepare students for entry into the civil service workforce, should be included as well. An Account Manager would be responsible for representing the interests of the target group to SASP in its strategic planning, research and development and operations. For example, an organization's interest in reducing turnover or reducing absenteeism or improving time to fill vacancies would be represented to SASP initially by the Account Manager. Similarly, the Account Manager would be responsible for representing SASP's interests to the group he represents. For example, the Research leader's interest in studying the relationship between subtest scores and subsequent job performance would be supported by an Account Manager's relationship with the organizations in which such research might be conducted.

7. User Support Desk. We recommend that SASP establish a "User Support Desk" to support users' operational needs, questions, and requirements. This would be an information resource that applicants and hiring managers/HR managers, primarily, could contact for questions about the manner in which they may use the civil service system. Presumably, this type of information would also be available on SASP's website to handle as many of these inquiries as possible.

We did not address roles/functions relating to other selection factors such as interviews, experience assessments, and academic success indicators (grade point average, etc.). Although we have assumed the data produced by those assessment processes is available to SASP's research and development function, we have not assumed that SASP has any responsibility for designing those processes or for implementing them. We have presumed that the hiring organizations will be responsible for designing and gathering all other assessment tools and processes other than the SASP cognitive tests.

User Support Materials

Background

In this section, we describe the types of support materials MCS should eventually develop to provide to each of several user groups. MCS's role as a government agency providing a service to the public, to applicants and to the government's hiring organizations calls for a somewhat different set of support materials than commercial publishers provide for prospective and actual clients. The recommendations about user support materials are presented in Table 75, where we list the suggested support materials and provide a brief summary of the purpose of each named document. We organize this list by the type of user group shown here:

User groups
HR / Selection Professionals
Test Centers
Hiring Organizations
Applicants
The Public

HR / Selection Professionals may have any of several roles. They may represent hiring organizations, they may be researchers with an interest in MCS's selection research, and they may be members of the public requesting information on their own behalf or on behalf of other public groups.
In any of these cases, HR / Selection Professionals are regarded as having an interest in technical information. The Test Centers are part of the civil service testing organization and have an interest in having the necessary instructions and other information needed to carry out their operational roles and responsibilities. Hiring Organizations are the government organizations whose hiring decisions are informed by MCS's testing program. Applicants are those who complete the MCS tests in some process of applying for jobs. The Public are citizens or organizations such as schools or training centers in Saudi Arabia who may have some interest or stake in the civil service testing system.

Table 75. Recommended support materials for each user group (document name: purpose of support material).

HR / Selection Professionals
Technical Manual: Detailed technical summary of development, reliability, validity and usage information

Testing Centers
Administrative Instructions: Instructions to be read to test takers; procedural instructions for managing test administration processes; scoring instructions
Training Materials: Handbook of training content
FAQs: Answers to questions test takers frequently ask

Hiring Organizations
Guidelines for Hiring Decisions: Guidelines for using SASP test results and other selection factors to make selection decisions
Validity / Utility Information: Business-oriented summary information about the usefulness of SASP (and other) selection tests; includes focus on different criteria impacted by selection tests
Research Participation Instructions: Instructions for participation in research activities
Results Reports: Reports describing the test performance (and other selection factors?) of one or more applicants for an organization's vacancies
How to Use SASP Resources: Information about the ways a government organization could use SASP testing services (selection only?); information about specific procedures to be used as part of SASP testing processes (e.g., submitting requests to fill vacancies, obtaining applicant test results, etc.)
Process Management Reports: Reports describing cost, time, quality of new hires, etc. resulting from one or more episodes of hiring for the manager's organization

Applicants
Information Brochures: General information about SASP and selection tests
FAQs: Answers to many applicant questions
Practice Tests and Guidance: Instructions about the use and scoring of practice tests; practice test materials
How to Use SASP: Information about the services SASP provides to applicants including published resources
Feedback Report: Report describing test performance and the manner in which results will be used

The Public / Schools / Training Centers / Government Officials
Information Brochures: General information about SASP and selection tests
Practice Tests and Guidance: Information about ways people can prepare to take SASP tests with an emphasis on practice tests; information about the relationship between SASP selection tests and academic/training
Summary Reports: Report summarizing information of public interest about SASP testing, e.g., volume of test takers, number of vacancies filled, selection ratios, descriptions of factors considered, etc.

Guiding Policies

Background

Although the Study 2 requirements did not specifically request recommendations about guiding policies, we are including a set of recommended policy topics that, in our experience with selection programs, are often needed to address a wide range of common issues.
These recommended policy topics are not derived from the review of the seven batteries because test publishers do not have authority over hiring organizations' management policies. Test publishers may require that professional standards be met for the use of their tests, but they do not impose policy requirements on organizations. These recommended policy topics are offered even while we acknowledge that the relationship between the owner of the civil service testing system (SASP in our hypothetical model) and hiring organizations is unclear. While we presume the hiring organization will make the hiring decisions and will have some authority over the manner in which test information is used, it is not clear to what extent the governing policies will be "owned" by SASP or by the hiring organizations, or both. Also, it is not clear who has authority over all the selection data entering into a civil service hiring decision, or who gathers it. For example, are the test results produced by SASP Test Centers downloaded to hiring organizations which then have authority over the use of the test data? Does SASP have any role gathering and recording other selection information such as experience records, interviews, resume information, and the like? Who assembles the composite scores that combine SASP test results with scores from other sources? While these unknowns affect who the policy owners would be – some SASP, some the hiring organizations – it is very likely that all of the recommended policy topics should be addressed.

We recommend policy topics without recommending the specific policies themselves. For example, the first policy topic listed in Table 76 below is the question of eligibility to take a SASP cognitive ability test. Our recommendation is that a policy should be developed to answer this question. That eventual policy may take several forms ranging from very few constraints on who may take a test to other more restrictive eligibility requirements, for example, that a person must be a basically qualified applicant for a job with a current vacancy the hiring organization is actively attempting to fill. Chapters in recent personnel selection handbooks address many of these policy issues and other specific selection program practices (Kehoe, Brown and Hoffman, 2012; Kehoe, Mol and Anderson, 2010; Tippins, 2012).

Recommendations

Table 76 describes all the recommended policy topics organized by the "audience" for the various policies. By "audience" we mean the user group whose actions and choices are directly affected by the policy. For example, a policy about eligibility for testing is a policy about applicants' access to the civil service testing system, even though Test Centers and Hiring Organizations would presumably have some role in managing and enforcing this policy. Admittedly, the organization of Table 76 into three non-overlapping user groups is somewhat arbitrary and imperfect, but it does serve a useful purpose of distinguishing between types of policies.

Table 76. Recommended topics for which SASP should develop guiding policies (policy topic: what the policy should address).

Audience: Applicants

Eligibility for testing, application: What conditions must an applicant satisfy to be eligible to take a test?
o Basically qualified for the vacancy?
o Recency since last test result?
o Result of previous test events?
o Must a vacancy be open that requires the test?

Retesting: What conditions must an applicant satisfy to be eligible to retake a test?
o Does previous result(s) on same test matter?
o Does length of time since last result on same test matter?
o Does number of attempts at same test matter?

Duration of Test Result: Once a test result has been obtained, how long is that result usable by hiring organizations?
o 6 months, 1 year, 5 years, indefinitely?

Grandfathering: If an applicant has previously worked successfully in a job for which the test is now required, is that applicant required to take the test?

Exemptions / Waivers: What conditions may exempt an applicant from the requirement to take a test?
o Highly desired experience?
o Hiring manager preference?
Who has the authority to exempt an applicant from a test requirement?

Feedback: What information are applicants given about their own test result?
o Raw score? Percentile score? Score range?
o Interpretation of score (e.g., High, Average, Low)?

Test Preparation: What advice are applicants given about test preparation options?
o Commercial sources for test prep courses?
o Information about effectiveness of test preparation, such as practice?
o Interpretation of own practice test results?

Access to practice material: What conditions must an individual satisfy to be given access to practice tests?
o None?
o Must be an applicant for an open vacancy?
o Must not have taken the test before?

Audience: Hiring Organizations

Selection standards: Must all locations of the same job use the test in the same way? Must a test be used in the same way for all applicants for the same job? Who has the authority to decide whether a specific test is a required selection factor? May the role / weight of a test result be temporarily changed based on employment market conditions?

Roles in Hiring Decisions: Who may make a hiring decision that is based, at least in part, on a test result? What information other than the test result may be used? Who may decide? Must those who make test-based hiring decisions be trained and certified to do so?

Access to Test Results: Who may have access to an applicant's test result(s)? For what purpose?

Audience: Testing Centers

Applicant management and identification: What procedures are required to confirm applicant ID? How are late-arriving applicants handled? What steps should be taken if cheating is observed or suspected? What steps should be taken if an applicant leaves a test session before completing the test?

Use of Test Instructions: May instructions be repeated / reread? May test administrators offer interpretations of test administration instructions?

Scoring: How are ambiguous answer marks to be handled? Must manually calculated scores be independently checked?

Data Management: If test result data is manually entered, when does that happen? By whom?

Record Keeping: How are test materials handled following completion of a test?

Feedback: May test administrators provide individual feedback to test takers?

Material Security: What are the rules for test materials security? What are the consequences for violations of security rules?
SECTION 8: REFERENCES

Andrew, D. M., Paterson, D. G., & Longstaff, H. P. (1933). Minnesota Clerical Test. New York: Psychological Corporation.
Baker, F. (2001). The basics of item response theory. http://edres.org/irt/baker/
Bartram, D. (2005). The Great Eight Competencies: A criterion-centric approach to validation. Journal of Applied Psychology, 90, 1185-1203.
Berger, A. (1985). [Review of the Watson-Glaser Critical Thinking Appraisal]. In J. V. Mitchell, Jr. (Ed.), The ninth mental measurements yearbook (pp. 1692-1693). Lincoln, NE: Buros Institute of Mental Measurements.
Bolton, B. (1994). A counselor's guide to career assessment instruments (3rd ed.). J. T. Kapes, M. M. Mastie, and E. A. Whitfield (Eds.). The National Career Development Association: Alexandria, VA.
Cizek, G. J., & Stake, J. E. (1995). Review of the Professional Employment Test. In J. C. Conoley & J. C. Impara (Eds.), The twelfth mental measurements yearbook. Lincoln, NE: Buros Institute of Mental Measurements.
Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences (2nd ed.). Hillsdale, New Jersey: Erlbaum.
Corel Corporation (1993). CorelDRAW! user's manual – version 4.0. Ontario, Canada: Author.
Crites, J. O. (1963). Test reviews. Journal of Counseling Psychology, 10, 407-408.
Dagenais, F. (1990). General Aptitude Test Battery factor structure for Saudi Arabian and American samples: A comparison. Psychology and Developing Societies, 2, 217-240.
Dale, E., & O'Rourke, J. (1981). The living word vocabulary. Chicago: Worldbook-Childcraft International, Inc.
Differential aptitude tests for personnel and career assessment: Technical manual (1991). The Psychological Corporation: Harcourt Brace & Company: San Antonio, TX.
Drasgow, F., Nye, C. D., Guo, J., & Tay, L. (2009). Cheating on proctored tests: The other side of the unproctored debate. Industrial and Organizational Psychology: Perspectives on Science and Practice, 2, 46-48.
du Toit, M. (2003). IRT from SSI: BILOG-MG, MULTILOG, PARSCALE, TESTFACT. Lincolnwood, IL: Scientific Software International.
Engdahl, B., & Muchinsky, P. M. (2001). Review of the Employee Aptitude Survey. In B. S. Plake & J. C. Impara (Eds.), The fourteenth mental measurements yearbook. Lincoln, NE: Buros Institute of Mental Measurements.
Ekstrom, R. B., French, J. W., Harman, H. H., & Dermen, D. (1976). Manual for Kit of Factor-Referenced Cognitive Tests. Princeton, NJ: Educational Testing Service.
French, J. W. (1951). The description of aptitude and achievement tests in terms of rotated factors. Psychometric Monographs, No. 5.
French, J. W., Ekstrom, R. B., & Price, L. A. (1963). Manual for kit of reference tests for cognitive factors. Princeton, NJ: Educational Testing Service.
Geisinger, K. F. (1998). [Review of the Watson-Glaser Critical Thinking Appraisal, Form S]. In J. V. Mitchell, Jr. (Ed.), The thirteenth mental measurements yearbook. Lincoln, NE: Buros Institute of Mental Measurements.
Guilford, J. P. (1956). The structure of intellect. Psychological Bulletin, 53(4), 267-293.
Gulliksen, H., & Wilks, S. S. (1950). Regression test for several samples. Psychometrika, 15, 91-114.
Hartigan, J. A., & Wigdor, A. K. (1989). Fairness in employment testing: Validity generalization, minority issues and the General Aptitude Test Battery. Washington, DC: National Academy Press.
Hausdorf, P. A., LeBlanc, M. M., & Chawla, A. (2003). Cognitive ability testing and employment selection: Does test content relate to adverse impact? Applied H.R.M. Research, 7(2), 41-48.
Helmstadter, G. C. (1985). [Review of the Watson-Glaser Critical Thinking Appraisal]. In J. V. Mitchell, Jr. (Ed.), The ninth mental measurements yearbook (pp. 1693-1694). Lincoln, NE: Buros Institute of Mental Measurements.
Hunter, J. E. (1983a). USES test research report no. 44: The dimensionality of the General Aptitude Test Battery (GATB) and the dominance of general factors over specific factors in the prediction of job performance for the U.S. Employment Service.
Washington, DC: Division of Counseling and Test Development, Employment and Training Administration, U.S. Department of Labor.
Hunter, J. E. (1980). Validity generalization for 12,000 jobs: An application of synthetic validity and validity generalization to the General Aptitude Test Battery (GATB). Washington, DC: U.S. Employment Service, U.S. Department of Labor.
Hunter, J. E. (1983b). USES test research report no. 45: Test validation for 12,000 jobs: An application of job classification and validity generalization analysis to the General Aptitude Test Battery. Washington, DC: Division of Counseling and Test Development, Employment and Training Administration, U.S. Department of Labor.
Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis: Correcting error and bias in research findings. Thousand Oaks, CA: Sage Publications.
International Test Commission (2006). International guidelines on computer-based and Internet delivered testing. International Journal of Testing, 6, 143-172.
Ivens, S. H. (1998). [Review of the Watson-Glaser Critical Thinking Appraisal, Form S]. In J. V. Mitchell, Jr. (Ed.), The thirteenth mental measurements yearbook. Lincoln, NE: Buros Institute of Mental Measurements.
Kapes, J. T., Mastie, M. M., & Whitfield, E. A. (1994). A counselor’s guide to career assessment instruments (3rd ed.). Alexandria, VA: The National Career Development Association, A Division of The American Counseling Association.
Keesling, J. W. (1985). Review of USES General Aptitude Test Battery. In J. V. Mitchell, Jr. (Ed.), The ninth mental measurements yearbook. Lincoln, NE: Buros Institute of Mental Measurements.
Kehoe, J. F., Brown, S., & Hoffman, C. C. (2012). The life cycle of successful selection programs. In N. Schmitt (Ed.), The Oxford Handbook of Personnel Selection and Assessment (pp. 903-936). New York, NY: Oxford University Press.
Kehoe, J. F., Mol, S. T., & Anderson, N. R. (2010). Managing sustainable selection programs. In J. L. Farr & N. T. Tippins (Eds.), Handbook of Employee Selection (pp. 213-234). New York, NY: Routledge.
Kolz, A. R., McFarland, L. A., & Silverman, S. B. (1998). Cognitive ability and job experience as predictors of work performance. The Journal of Psychology, 132, 539-548.
Loo, R., & Thorpe, K. (1999). A psychometric investigation of scores on the Watson-Glaser Critical Thinking Appraisal new Form S. Educational and Psychological Measurement, 59, 995-1003.
Lovell, C. (1944). The effect of special construction of test items on their factor composition. Psychological Monographs, 56 (259, Serial No. 6).
MacQuarrie, T. W. (1925). MacQuarrie Test for Mechanical Ability. Los Angeles: California Test Bureau.
McKillip, R. H., Trattner, M. H., Corts, D. B., & Wing, H. (1977). The professional and administrative career examination: Research and development (PRR-77-1). Washington, DC: U.S. Civil Service Commission.
McLaughlin, G. H. (1969). SMOG grading: A new readability formula. Journal of Reading, 12, 639-646.
Mellon, S. J., Daggett, M., & MacManus, V. (1996). Development of General Aptitude Test Battery (GATB) Forms E and F. Washington, DC: Division of Skills Assessment and Analysis, Office of Policy and Research, Employment and Training Administration, U.S. Department of Labor.
O’Leary, B. (1977). Research base for the written test portion of the Professional and Administrative Career Examination (PACE): Prediction of training success for social insurance claim examiners (TS-77-5). Washington, DC: Personnel Research and Development Center, U.S.
Civil Service Commission.
O’Leary, B., & Trattner, M. H. (1977). Research base for the written test portion of the Professional and Administrative Career Examination (PACE): Prediction of job performance for internal revenue officers (TS-77-6). Washington, DC: Personnel Research and Development Center, U.S. Civil Service Commission.
Otis, A. S. (1943). Otis Employment Test. Tarrytown, NY: Harcourt, Brace, and World, Inc.
Pearlman, K. (2009). Unproctored Internet testing: Practical, legal and ethical concerns. Industrial and Organizational Psychology: Perspectives on Science and Practice, 2, 14-19.
Pearlman, K., Schmidt, F. L., & Hunter, J. E. (1980). Validity generalization results for tests used to predict job proficiency and training success in clerical occupations. Journal of Applied Psychology, 65, 373-406.
Postlethwaite, B. E. (2012). Fluid ability, crystallized ability, and performance across multiple domains: A meta-analysis. Dissertation Abstracts International Section A: Humanities and Social Sciences, 72(12-A), 4648.
Ruch, W. W., Stang, S. W., McKillip, R. H., & Dye, D. A. (1994). Employee aptitude survey technical manual (2nd ed.). Glendale, CA: Psychological Services, Inc.
Ruch, W. W., Buckly, R., & McKillip, R. H. (2004). Professional employment test technical manual (2nd ed.). Glendale, CA: Psychological Services, Inc.
Sackett, P. R., & Wilk, S. L. (1994). Within-group norming and other forms of score adjustment in preemployment testing. American Psychologist, 49(11), 929-954.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124(2), 262-274. doi:10.1037/0033-2909.124.2.262
Schmidt, F. L. (2012). Cognitive tests used in selection can have content validity as well as criterion validity: A broader research review and implications for practice. International Journal of Selection and Assessment, 20, 1-13.
Segall, D. O., & Monzon, R. I. (1995). Draft report: Equating Forms E and F of the P&P-GATB. San Diego, CA: Navy Personnel Research and Development Center.
Siegel, L. (1958). Test reviews. Journal of Counseling Psychology, 5, 232-233.
Stelly, D. J., & Goldstein, H. W. (2007). Applications of content validation methods to broader constructs. In S. M. McPhail (Ed.), Alternative validation strategies: Developing new and leveraging existing validity evidence (pp. 252-316). San Francisco, CA: Jossey-Bass.
Sullivan, E. T., Clark, W. W., & Tiegs, E. W. (1936). California Test of Mental Maturity. Los Angeles: California Test Bureau.
Thurstone, L. L., & Thurstone, T. G. (1947). SRA Primary Mental Abilities. Chicago: Science Research Associates.
Tippins, N. T. (2012). Implementation issues in employee selection testing. In N. Schmitt (Ed.), The Oxford Handbook of Personnel Selection and Assessment (pp. 881-902). New York, NY: Oxford University Press.
Trattner, M. H., Corts, D. B., van Rijn, P., & Outerbridge, A. M. (1977). Research base for the written test portion of the Professional and Administrative Career Examination (PACE): Prediction of job performance for claims authorizers in the social insurance claims examining occupation (TS-77-3). Washington, DC: Personnel Research and Development Center, U.S. Civil Service Commission.
Viswesvaran, C. (2001). Assessment of individual job performance: A review of the past century and a look ahead. In N. Anderson, D. Ones, H. K. Sinangil, & C. Viswesvaran (Eds.),
Handbook of Industrial, Work & Organizational Psychology (pp. 110-126). Thousand Oaks, CA: Sage Publications.
Wang, L. (1993). The differential aptitude test: A review and critique. Paper presented at the annual meeting of the Southwest Educational Research Association, Austin, TX, January 28-30.
Watson, G., & Glaser, E. M. (1942). Watson-Glaser Critical Thinking Appraisal. Tarrytown, NY: Harcourt, Brace, and World, Inc.
Wigdor, A. K., & Sackett, P. R. (1993). Employment testing and public policy: The case of the General Aptitude Test Battery. In H. Schuler, J. L. Farr, & M. Smith (Eds.), Personnel Selection and Assessment: Individual and Organizational Perspectives. Hillsdale, NJ: Lawrence Erlbaum Associates.
Willson, V. L., & Wing, H. (1995). Review of the Differential Aptitude Test for Personnel and Career Assessment. In J. C. Conoley & J. C. Impara (Eds.), The twelfth mental measurements yearbook. Lincoln, NE: Buros Institute of Mental Measurements.
Wilson, R. C., Guilford, J. P., Christianson, P. R., & Lewis, P. J. (1954). A factor-analytic study of creative-thinking abilities. Psychometrika, 19, 297-311.
Woehr, D. J., & Roch, S. (2012). Supervisory performance ratings. In N. Schmitt (Ed.), The Oxford Handbook of Personnel Selection and Assessment (pp. 517-531). New York, NY: Oxford University Press.