DIFFERENTIAL ITEM FUNCTIONING AND ADVERSE IMPACT: A COMPARISON OF MANTEL-HAENZSEL AND LOGISTIC REGRESSION Heather JoAn Whiteman B.A., University of California, Davis 2006 THESIS Submitted in partial satisfaction of the requirements for the degree of MASTER OF ARTS in PSYCHOLOGY (Industrial/Organizational Psychology) at CALIFORNIA STATE UNIVERSITY, SACRAMENTO SPRING 2011 DIFFERENTIAL ITEM FUNCTIONING AND ADVERSE IMPACT: A COMPARISON OF MANTEL-HAENZSEL AND LOGISTIC REGRESSION A THESIS by Heather JoAn Whiteman Approved by: ____________________________, Committee Chair Lawrence S. Meyers, PhD. ____________________________, Second Reader Lee Berrigan, PhD. ____________________________, Third Reader James E. Kuthy, PhD. Date: _______________________ ii Student: Heather JoAn Whiteman I certify that this student has met the requirements for format contained in the University format manual, and that this thesis is suitable for shelving in the Library and credit is to be awarded for the thesis. ___________________________________________ Jianjian Qin, PhD., Graduate Coordinator Department of Psychology iii ________________ Date Abstract of DIFFERENTIAL ITEM FUNCTIONING AND ADVERSE IMPACT: A COMPARISON OF MANTEL-HAENZSEL AND LOGISTIC REGRESSION by Heather JoAn Whiteman This study serves as a comparative analysis of two measures for detecting differential item functioning (DIF) in item responses of 29,171 applicants on a 49 item selection test. The methods compared in this study were two of the more commonly used DIF detection procedures in the testing arena: the Mantel-Haenszel chi-square and the logistic regression procedure. The study focused on the overall effect each method had on adverse impact when used for the removal of items from a test. The study found that the presence of adverse impact findings were decreased by the removal of items displaying DIF, and that the effect on adverse impact differed by method of DIF detection. The study does not however, provide enough evidence to support the use of one DIF detection method over the other in applied settings where considerations such as cost and test reliability are of concern. ____________________________, Committee Chair Lawrence S. Meyers, PhD. iv ACKNOWLEDGMENTS I would first like to thank Biddle Consulting Group, Inc. the equal employment opportunity, affirmative action, and employee selection firm in the western United States that allowed me to use their data. I would like to thank all of the professors in the Industrial/Organizational Psychology program who have influenced me. I would particularly like to thank Dr. Meyers, Dr. Kuthy, and Dr. Berrigan who served on my committee and who assisted in guiding me through the thesis process. I would like to especially appreciate and thank Dr. Meyers for the time he took in giving me advice on the thesis and the enthusiasm and exceptional teaching he offered in his courses. I attribute the bulk of my learning in Industrial/Organizational Psychology to Dr. Meyers and the courses that he taught. I would like to thank Dr. Kuthy for serving as a professional role model and instilling in me a knowledge and respect for the applied Industrial/Organizational field. I would like to thank Dr. Berrigan for his dedication to the students and his willingness to contribute to my thesis. I would also like to thank my parents, Randy and Carrie, who have always been supportive of my goals. I would also like to thank my friends and colleagues who have provided support, input and camaraderie. v TABLE OF CONTENTS Page Acknowledgments ……………………………………………………………………….. v List of Tables ………………………………………………………………………….... ix List of Figures …………………………………………………………………………... xi Chapter 1. INTRODUCTION .…………………………………………………………...………. 1 Early History of Selection Testing ………………………………………...…….. 1 Setting Legal Precedent for Fairness in Selection Testing …………………...…. 4 Setting the Standards for Fairness in Selection Procedures …………………...… 7 Adverse Impact ………………………………………………………...………... 9 Validity in Selection Procedures ………………………………...……………... 13 Evidence based on test content ……………………………………….... 14 Evidence based on relation of a test to other variables ……………….... 16 Evidence based on response processes ……………………………….... 19 Evidence based on internal structure of a test ………………………….. 20 Evidence based on consequences of testing ……………………………. 22 Validity and Reliability ……………………………………...…………………. 22 Differential Item Functioning ………………………………………...………... 26 Differential Item Functioning and Item Bias ……..……………………. 26 Measuring Differential Item Functioning ……………………………… 29 Ability Measures in DIF Analyses …………………………………....... 30 vi Factors Affecting DIF Detection …………………………………….… 33 Uniform/Non-Uniform DIF ……………………………………………. 34 DIF Detection Methods ………………………………………………………… 36 Mantel-Haenszel ……………………………………………………….. 36 Logistic Regression …………………………………………………….. 41 DIF Detection Method Comparisons ……………………….………...... 44 Purpose of the Study …………………………………………………..…………. 46 2. METHOD .…………………………………………………………………...…….... 48 Sample Description ………………………………...………………………...… 48 Instrument ……………………………………………………...………………. 48 Procedure ………………………………………………………………...…….. 50 DIF Analysis for Item Removal ……………………………………...… 50 Adverse Impact Analyses …………………………………………….... 54 3. RESULTS ……………………………………………………………………...……. 56 DIF and Item Removal …………………………………………………...…….. 56 Mantel-Haenszel Analyses ……………...……………………………… 56 Logistic Regression Analyses ……………..…………………………… 61 Comparison of the MH and LR Methods for DIF Detection and Item Removal ……………………………………………………….….. 64 Adverse Impact Analyses ……………………………………………...…….… 70 Original Test 80% Rule Adverse Impact Analyses ……………………. 72 MH Test 80% Rule Adverse Impact Analyses ………………………… 74 vii LR Test 80% Rule Adverse Impact Analyses ………………………..… 77 Comparison of 80% Rule Adverse Impact Analyses …………………... 78 4. DISCUSSION ………………………………………………………………...…...… 86 Findings & Conclusions ……….……………………………………………..… 86 Limitations …………………………………………………………………...… 89 Implications for Future Studies ………………………………………………… 91 Appendices …………………………………………………………………................... 93 Appendix A. Item Means and Standard Deviations ……………………………..…...... 94 Appendix B. MH DIF Values and Classification Level by Item ……………..…..……. 96 Appendix C. Nagelkerke R2 Values and DIF Classification Category by Item …....… 106 Appendix D. Number of Applicants Passing at Cut-off Score Level by Test and Comparison Group …………………………………….…………...… 120 Appendix E. Fisher’s Exact Statistical Significance Results of Adverse Impact by Test & Comparison Group ...…………………………………………... 125 References ……………………………………………………………………..……… 130 viii LIST OF TABLES Page 1. Table 1 Demographic Characteristics of Examinees ………..……………….… 49 2. Table 2 Descriptive Statistics of Examinee Test Scores .………………………. 50 3. Table 3 MH Method DIF Classifications by Reference Group ……………...… 57 4. Table 4 MH Method DIF Classification by Item Number ………..………….… 59 5. Table 5 Item Numbers Displaying Small or No DIF with the MH Method …… 60 6. Table 6 Descriptive Statistics of the MH Test Scores ….……………………… 60 7. Table 7 LR Method DIF Classification by Reference Group ………..………… 62 8. Table 8 LR Method DIF Classification by Item Number ………..…………..… 62 9. Table 9 Item Numbers Displaying Small or No DIF with the LR Method ….… 63 10. Table 10 Descriptive Statistics of LR Test Scores ……………..……….……… 64 11. Table 11 MH & LR DIF Classifications by Item Number ………..…………… 69 12. Table 12 Descriptive Statistics of the Original, MH and LR Test Scores …...… 70 13. Table 13 Number of 80% Rule Violations and Practically Significant 80% Rule Violations in the Original Test by Comparison Groups ………………..… 73 14. Table 14 Number of 80% Rule Violations and Practically Significant 80% Rule Violations in the Original Test by Cut-off Score Levels ...……………..… 75 15. Table 15 Number of 80% Rule Violations and Practically Significant 80% Rule Violations in the MH Test by Comparison Groups …………………….… 76 16. Table 16 Number of 80% Rule Violations and Practically Significant 80% Rule Violations in the MH Test by Cut-off Score Levels ..………………….… 77 ix 17. Table 17 Number of 80% Rule Violations and Practically Significant 80% Rule Violations in the LR Test by Comparison Groups ..…………………...… 79 18. Table 18 Number of 80% Rule Violations and Practically Significant 80% Rule Violations in the LR Test by Cut-off Score Levels ..…………………….. 80 x LIST OF FIGURES Page 1. Figure 1 Uniform DIF ………………………………………………………….. 35 2. Figure 2 Non-uniform DIF ……………………………………………………... 36 3. Figure 3 MH Contingency Table Example …………………………………….. 37 4. Figure 4 DIF Assessment Classifications …………………………………….... 67 5. Figure 5 80% Rule Violations by Test ………………………………………… 81 6. Figure 6 80% Rule Violations by Comparison Group ………………………… 82 7. Figure 7 Practically Significant 80% Rule Violations by Comparison Group … 83 8. Figure 8 80% Rule Violations by Percent Cut-off Score Level ……………..…. 84 9. Figure 9 Practically Significant 80% Rule Violations by Percent Cut-off Score Level ..…………………………………………………………………… 85 xi 1 Chapter 1 INTRODUCTION Early History of Selection Testing The origins of testing for filling occupational roles began in 210 BC with the Chinese Civil Service System (Goodwin, 1997). When a new dynasty began in China, it was often a result of a militaristic takeover. A new emperor would come into power and require a large body of new civil service workers to run the empire. For this reason, a system was developed to appoint individuals into civil service positions based on merit rather than family or personal connections. The Civil Service Examination became one way to select a set of men to fill necessary bureaucratic positions and was first instituted around the sixth century. While these exams were intensely trained for and taken by many, very few individuals actually passed; in fact, there is evidence to suggest that the failure of these examinations created many hardships for the individual and sometimes even resulted in suicide. Those that did pass the exams rose in both the financial and social ranks, as did all the members of their family. So, while these tests may have been a good first attempt at a meritocracy, they too had shortfalls in their almost unattainable status (a passing rate supposedly around 2%) and in the way whole families and clans would be elevated by the achievement of one relative (“Confucianism,” n.d., para. 5). Early America also required reform in its appointment systems. A change of President typically resulted in a spoils system, which rewarded individuals who supported the new political party by appointing them to public offices. Unfortunately, due to the 1 2 nature of the spoils appointment system many people who were in political offices did not have the competence to fulfill their job duties. As the government grew even larger there was a clear need for reform in the political system (Heineman, et al., 1995). When Ulysses S. Grant was president (1869-1877), there was such a need for a civil service system that in 1871 Congress allowed for the setting of regulations for admission to public service and for the appointment of a Civil Service Commission. Soon after its creation, the Civil Service Commission ceased to be funded by Congress and was dismantled in 1875. Despite the Civil Service Commission’s short existence, it proved itself to be a functional tool for appointing individuals to public offices based on merit. Republican Rutherford B. Hayes was a major proponent of reform in this area and when he became president (1877-1881) he used competitive examinations for all public office appointments. Hayes worked with Democrat George H. Pendleton to recreate the Civil Service Commission in hopes that a value-based process would become part of both political parties. While the Civil Service Commission did not get far during Hayes’ presidency, there was an outcry for civil service reform after President James A. Garfield was assassinated in the first year of his term, 1881. His life was taken by a disgruntled officeseeker who was denied a political office but felt he was entitled to it based on the spoils system (Manley, 1986). The death of President Garfield, along with the need for more specialized skills and knowledge in government jobs, helped spur the passage of the Pendleton Act (Milkovich & Wigdor, 1991). Introduced by George H. Pendleton, the act rendered it unlawful to use the spoils system. Helped by President Chester Arthur (18812 3 1885), the bill became the Civil Service Act of 1883 and re-established the Civil Service Commission. Under the act, the government is required to appoint people into positions based on their merit as judged by official Civil Service Examinations. The Act ensured that certain aspects of proper testing should be enforced, such as validity and fairness in accessibility. It also stipulated that Civil Service Examinations need to be practical and related to matters that fairly test the capacity and fitness of a person for a position and that the exams must be given in a location that would be “reasonably convenient and inexpensive for applicants to attend” (Civil Service Act, 1883). It specified that no person in a current office could be promoted to another without first having passed the Civil Service Examination. It even addressed issues of substance abuse by stating that “no person habitually using intoxicating beverages to excess shall be appointed to, or retained in, any office…” (Section 8). The Pendleton Act further hindered any form of patronage by stating that no more than two family members could be appointed to positions covered under its constitution. It eliminated the potential for Congress members to influence the appointment of individuals to federal offices by negating their recommendation through examination. The Act also prohibited the soliciting of campaign donations on Federal government property (Civil Service Act, 1883). While the Pendleton Act may have been the placemark for the end of the spoils system in America, the law only applied to federal jobs and it was not mandated to be used for appointment into state and local offices. One result of the Civil Service Act of 1883 was that public offices were being held by individuals with more expertise and less political clout. The Pendleton Act served to push America toward a meritocracy in its 3 4 selection of employees (Milkovich, 1991); however, it fell short of creating a true nondiscriminatory standard for selection processes in the workforce. For example, women were not allowed to sit for these early examinations and, as a result, were not able to gain a public office. The tests were still only plausible for those individuals who had the privileges of an education and the means to afford the trip or days away from work to take such a test. Setting Legal Precedent for Fairness in Selection Testing The issues of validity and test fairness for the selection of employees went unaddressed until the Civil Rights Act of 1964 and its emphasis on equal employment opportunity in Title VII. The Civil Rights Act of 1964 was written under the Presidency of John F. Kennedy (1961-1963) in 1963 but was not passed until shortly after his assassination. President Kennedy had been a leader in the civil rights movement and his successor, President Lyndon B. Johnson (1963-1969), continued with the legacy. In his first address to the nation he said, “the ideas and the ideals which [Kennedy] so nobly represented must and will be translated into effective action.” On July 2, 1964 President Johnson signed the Civil Rights Act into law among political guests including Martin Luther King, Jr. While the Civil Rights Act of 1964 is most popularly known for the racial desegregation of schools and public places, it also specifically applied to fair employment practices. Title VII of the Civil Rights Act of 1964 prohibits the discrimination of an individual from employment or other personnel transactions based on race, color, religion, sex, or national origin (Civil Rights Act of 1964, Sec. 703). When the Civil Rights Act was in its early draft stage, the category of sex was not included. 4 5 However, a powerful democrat, Howard W. Smith, who was against the Civil Rights Act, added it in an attempt to lessen the desirability and ultimately the passing of the bill. Despite this, the Civil Rights Act passed in congress and Title VII became applicable on the basis of sex discrimination as well (Freeman, 1991). Title VII also prohibits the discrimination of individuals who associate with persons of a particular race, color, religion, sex, or national origin, and ensures that no employee can be fired for making a claim of discrimination (Civil Rights Act of 1964, Sec. 703). Title VII coined the term protected class, which is used to define groups of people who are protected from discrimination in employment situations based on personal characteristics. The protected classes labeled within the 1964 act included only the personal characteristics of race, color, religion, sex, and national origin. Age was added as a protected class in the Age Discrimination in Employment Act (ADEA) of 1967 to stipulate that employment discrimination of individuals based on age is unlawful (Age Discrimination in Employment Act of 1967, Sec. 623). The categories of age are divided at forty years of age; individuals at or above the age of forty represent one class, while those below the age of forty represent another. The Equal Employment Opportunity Commission (EEOC) was created to implement the laws set forth by Title VII of the Civil Rights Act of 1964. The United States Government Manual, published and updated annually by the Federal Government since 1935, defines the EEOC’s role as one that enforces laws prohibiting discrimination based on race, color, religion, sex, national origin, disability, or age in hiring, promoting, firing, setting wages, testing, training, apprenticeship and all other terms and conditions 5 6 of employment (U.S. Government Manual, 2005-2006). In 1966 this agency published the EEOC Guidelines which provided a basic description of what constituted a valid selection procedure. Validity, as it will be discussed here and more thoroughly described below, is concerned with the evaluation of a test or measurement. In fact, where testing is concerned, validity is the most fundamental consideration when developing and evaluating a test. Without it you cannot support the interpretation of test scores nor use the test fairly for legitimate decision making purposes. Despite Title VII of the Civil Rights Act and the creation of the EEOC, there was not much attention placed on the assessment of selection procedure fairness until the court case of Griggs v. Duke Power Co., 401 U.S. 424 (1971). In this case there was a unanimous agreement of eight Supreme Court justices that the selection procedures employed by the Duke Power Company were invalid because they did not directly comply with a key aspect listed in the EEOC Guidelines of 1966: the selection procedures did not demonstrate job relatedness. The court determined that the practice of requiring a high school diploma for supervisory positions in the Duke Power Company was a deliberate attempt on the part of the Duke management to prevent the promotion of African American employees. Since the knowledge, skills, and abilities (KSAs) required for the completion of high school did not directly relate to those required to perform supervisory duties at Duke, the power company was forced to change its selection practices and pay restitution to employees who had suffered as a results of these practices (Guion, 1998). The Griggs v. Duke Power Co. case was a strong impetus for explicating what constitutes a violation of Title VII of the Civil Rights Act and for defining the 6 7 various aspects of the field revolving around selection tests and potential discrimination as a result of their use. Setting the Standards for Fairness in Selection Procedures About a decade following the Griggs v. Duke Power Co. case, a joint effort was attempted by the EEOC, the Civil Service Commission (CSC), the Department of Labor (DOL), and the Department of Justice (DOJ) to publish a single set of principles to address issues pertaining to the use of tests, selection procedures and other employment decisions. This set of principles was published in 1978 as the Uniform Guidelines on Employee Selection Procedures (Guidelines). The Guidelines were designed to assist employers, licensing and certification boards, labor organizations, and employment agencies in complying with the requirements of the federal law which prohibits employment practices that discriminate on grounds of race, color, religion, sex, and national origin (Guidelines, 1978). Federal agencies have adopted the Guidelines to provide a uniform set of principles governing use of employee selection procedures which are consistent with applicable legal standards and validation standards that are generally accepted by the psychological profession and which the Government will apply in the discharge of its responsibilities (Guidelines, 1978). The Guidelines continue to serve as a respected source of guidance for employers in regard to compliance with Title VII in employment processes; however, there have been more recent publications developed to serve as more exhaustive sources of reference for establishing selection procedure fairness. One of these publications is the Standards for Educational and Psychological Testing (Standards). The Standards were developed 7 8 by the American Educational Research Association, American Psychological Association and the National Council on Measurement Education in 1985 to explicate the standards which should be followed in the development, fairness and use of all tests. The Standards were revised in 1999 to provide more information for employers’ use in adhering to the laws and regulations set forth in Title VII of the Civil Rights Act of 1964 (Standards, 1999). A third set of guidelines currently in use are the Principles for the Validation and Use of Personnel Selection Procedures (2003) (Principles). These principles were originally published by Division 14 of the American Psychological Association (the Society for Industrial and Organizational Psychology, or SIOP) in 1975 and most recently updated in 2003. The Principles are not an alternative set of guidelines to the Standards, but instead are intended to be complementary to them with more precise application to employment practices. “The Principles are not meant to be at variance with the Standards for Educational and Psychological Tests (APA, 1974). However, the Standards were written for measurement problems in general while the Principles are addressed to the specific problems of decision making in the areas of employee selection, placement, promotions, etc.” (Principles, 2003, p. 2). The major distinction among these three sets of guidelines is their purpose of application. The Guidelines were developed primarily for evaluating testing practices in light of Title VII [of the Civil Rights Act of 1964] (Biddle & Noreen, 2006). The Principles even go so far as to directly state that their guidance is not necessarily 8 9 intended to parallel legal standards: “Federal, state, and local statutes, regulations, and case law regarding employment decisions exist. The Principles is not intended to interpret these statutes, regulations, and case law, but can inform decision making related to them” (Principles, p. 1). The Standards and the Principles serve as a much needed set of resources that are more explicit in the practices governing fair and valid selection procedures. Adverse Impact Neither the Standards nor the Principles discuss the technical determination of disparate impact because it is a legal term (Biddle & Noreen, 2006). The Guidelines, however, are intended to provide guidance around employee selection processes in light of the Title VII legal requirements; as a result they are predominately focused on addressing the issue of adverse impact and disparate treatment in employment settings. The Guidelines define adverse impact as a “substantially different rate of selection in hiring, promotion, or other employment decision which works to the disadvantage of members of a race, sex, or ethnic group” (Guidelines, Sec. 16B). In order to determine whether there is a substantially different rate of selection, the Guidelines provide the 80% rule (Guidelines, Sec. 4D), a heuristic for determining whether adverse impact may exist. In order to assess whether a violation of this rule has occurred, the passing rate of the focal group is divided by the passing rate of the reference group. If the passing rate of the focal group is not at least 80% of the passing rate of the reference group, then the 80% rule has been violated and there is evidence of adverse impact (Guion, 1998). The group of interest, the focal group, is the group with the lower passing rate and is generally 9 10 compromised of individuals of a minority or protected class (e.g., females or minority ethnicities). The reference group is the group with the higher passing rate and is generally comprised of the non-minority or unprotected class (e.g., males or Caucasians). An example of the 80% rule is generally helpful in understanding its use in selection procedures. Example: A selection procedure has 200 applicants who must pass a test in order to be considered for employment, 100 of the applicants are focal group members and 100 are reference group members. If 55 focal group individuals pass the test (a passing rate of 55%) and 80 reference group individuals pass the test (a passing rate of 80%) the ratio of passing rates would be 55:80. This ratio is equivalent to a percentage difference of .69; thus indicating that the passing rate of the focal group is 69% of the passing rate of the reference group. This value indicates a violation of the 80% rule and is evidence of adverse impact against the focal group. A problem with the 80% rule is that it is easily affected by small numbers; when the group sizes are small, the ratio can change drastically just by the passing/failing of one individual. For example, if there were two groups, each with only 4 individuals, and the passing rate of the reference group was 100% there would be evidence of adverse impact against the focal group if even just one person did not pass. This would create an erroneous message of adverse impact if the 80% rule was used as the sole determiner of adverse impact. There are also times when the 80% rule is not violated but there is still a presence of adverse impact in the test. The Guidelines explicitly address just such an occurrence and also name alternative methods for assessing adverse impact: 10 11 …smaller differences in selection rates may nevertheless constitute adverse impact, where they are significant in both statistical and practical terms or where a user’s actions have discouraged applicants disproportionately on grounds of race, sex, or ethnic group (Guidelines, 1978, Sec. 4D). The processes mentioned for detecting adverse impact, a statistical test of differences between the groups and an assessment of practical significance can be used in lieu of, or in addition to, the use of the 80% rule. The process of assessing statistical significance between two groups is generally done with a statistical software program and utilizes a “Fisher Exact” procedure to determine when the difference is as large as or larger than one that could be likely due to chance (Guion, 1998). The measure of whether or not a difference is due to chance is computed as a p value and can also be called a probability. A p value below .05 has less than a 5% probability of having occurred by chance (Meyers & Hunley, 2008). A value below 05 is generally accepted by the courts, Guidelines, and professional standards as the threshold for indicating adverse impact (Biddle & Noreen, 2006). While .05 has become the standard level for determining statistical significance, the value comes from a somewhat arbitrary determination when Irving Fisher published tables (Fisher, 1926, 1955) that labeled values below .05 as statistically significant. A comment in a textbook by Keppel & Wickens (2004) beg the questions: Why do researchers evaluate… an unvarying significance level of 5 percent? Certainly some results are interesting and important that miss this arbitrary criterion. Should not a researcher 11 12 reserve judgment on a result for which p = .06, rather than turning his or her back on it? (Keppel & Wickens, 2004, p. 45). The courts have also contended that significance criteria "must not be interpreted or applied so rigidly as to cease functioning as a guide and become an absolute mandate or proscription" (Albemarle Paper Company v. Moody, 1975). According to Biddle (2006), values below .05 are “statistically significant,” and values between .05 and .10 can be considered “close” to significance. The second process that was addressed by the Guidelines for the assessment of adverse impact, practical significance, is an assessment that can be used in conjunction with the 80% rule and the statistical significance tests. Practical significance is evaluated by hypothetically changing the passing status of a couple individuals in the adversely affected group to a passing status from failing status. Practical significance is useful for counteracting the potential problems associated with tests that are easily affected by sample sizes, such as the 80% rule and statistical tests of significance. If a finding of adverse impact is no longer statistically significant or there is no longer a violation of the 80% rule after hypothetically changing the passing status, then the results of the statistical significance and/or 80% rule tests are likely instable and should not be seen as concrete evidence of adverse impact. According to Biddle & Noreen (2006), some courts assessing legal cases of adverse impact give consideration to whether the adverse impact seen in a test is also practically significant before determining the presence of adverse impact. Practical significance is only a necessary assessment when there is a finding of adverse impact. A test can possess statistically and practically 12 13 significant adverse impact, statistically significant but not practically significant adverse impact, or neither statistically or practically significant adverse impact. Validity in Selection Procedures When adverse impact is found, test developers and users should evaluate if the test is unsuitable for selection purposes; however, the Guidelines state that adverse impact does not render a test unsuitable when there is sufficient validity evidence to support the test or procedure and there are no other reasonable alternative measures of selection available. This is because the overall validity of the test should be the most important aspect when considering whether it is appropriate for use in a selection process. Adverse impact should be avoided if at all possible, however, where a test is a valid predictor for successful performance in a job, it should be used in order to ensure that the candidate with the best abilities in the tasks, knowledge or skills needed for the position is selected. There are not clear legal expectations around the interaction of adverse impact and test validity, in fact, the Guidelines do not require a test user to conduct validity studies of selection procedures where no adverse impact occurs. However, all test users are encouraged to use selection procedures which are valid, especially users operating under merit principles (Guion, 1998). Guion (1998) defines validity as the effectiveness of test use in predicting a criterion measuring something else and valued in its own right, meaning that a valid test is one that can effectively predict the desired performance, skill or attribute of interest. Perfect assessment of an individual is not possible, therefore, we use tests as an approximation for a persons’ true ability in one area and then infer something about the 13 14 individual from their score. The Standards state that validity references the degree to which evidence and theory support the interpretations of test scores entailed by the proposed uses of a test. Validity therefore is a measure which indicates the accuracy of a test. A test that is valid would be capable of fully assessing all of the traits it was designed to measure without also capturing extraneous traits that were not intended for inclusion in the assessment. A test is only considered valid for its specific purpose and context. Because validity refers to an inference which is drawn from a test score, the Standards intimate that tests capable of drawing successful inferences are to be considered valid. Conversely, a test with low validity would not be able to successfully draw inferences based on test results. The Standards outline five sources of validity evidence used in evaluating the proposed interpretations of test scores for specific purposes. These sources of validity evidence include: 1) evidence based on test content, 2) evidence based on the relation of a test to other variables, 3) evidence based on response processes, 4) evidence based on internal structure, and 5) evidence based on the consequences of testing. While not all of these five sources are necessary to validate a test, the most important types of evidence to determine validity will depend on the individual test and the legal context that may or may not be present when adverse impact is concerned. Evidence Based on Test Content The first source listed by the Standards, evidence based on test content, assesses the relationship between the content of the test and the domain that it is intended to measure. In a test designed to measure an applicant’s knowledge in the field for which 14 15 they are applying, one would expect some basic things: 1) that the content of the test is related to the topic area of the position, 2) that the test covers a broad range of topics relevant to the position, neither focusing too heavily on, or omitting any topic areas, and 3) that the test includes only items directly related to the position. Validity evidence based on content is unique because it can be determined through pure logical determination (Standards, 1999); there is no required statistical procedure or mathematical computation undertaken to determine content validity (though metrics are often used to quantify and simplify the process). A job analysis is a commonly performed method to determine validity evidence based on content. A job analysis serves to identify the tasks that are performed on a job and to link them to the skills, abilities and knowledge that are required for successful performance of the job. This analysis is done with the use of subject matter experts who are familiar with the job and can determine the relevance of the skills, abilities, or knowledge to the actual tasks performed on the job. In order to show content validity evidence, items on a test must be clearly linked to those tasks identified in the job analysis as critical or important to the successful performance of an individual on the job. Content validity evidence should also address the design of the test. A question that is unnecessarily difficult, intended to “trick” people, or poorly worded would not be valid for inclusion on a selection test even if it is was appropriately related to the job content. In the case of selection tests, the difficulty of the test should be related to the difficulty of the job. For example, if a test is designed for an entry level position but requires the use of expert skill levels, the test would not be a valid measure for assessing 15 16 a job of entry level skills. If the domain being assessed is comprised of many traits, a valid test would seek to measure all of those traits and none which do not exist in the given domain. Evidence Based on Relation of a Test to Other Variables The second source for validity evidence according to the Standards can be obtained by assessing the relationship of test scores to external variables. This form of validity evidence encompasses two traditionally held views of validity, criterion and construct validity. Criterion validity for a selection test concerns the ability of the test to predict a test taker’s performance in the job based off of the test results. Criterion validity is established through a mathematical study in which statistically significant results ‘prove’ that the test predicts job performance. There are two forms of criterion validity, predictive and concurrent. In concurrent studies, data on the criterion variable are collected at about the same time as the predictors, whereas predictive studies consist of criterion data collected some time after the predictor variables have been measured (Standards, 1999). Predictive criterion validity can be assessed by administering a test to individuals and then comparing their performance on the test to later obtained measures of performance. For example, colleges and universities in the U.S. demonstrate the validity evidence of their use of standardized aptitude test scores in admissions processes because the tests have been shown to predict the first year of an undergraduate’s success in terms of grade point average. Concurrent criterion validity can be determined by comparing the outcome of an individual’s score to a current measure of ability. For 16 17 example, a company may administer a test they believe will assess an individual’s ability to perform well in a particular position to current employees in those positions. If individuals who perform well in the position currently also perform well on the test (and individuals who are poor performers on the job are also poor performers on the test), then this would demonstrate validity evidence based on the relation of the test to a concurrent measure of job performance. Construct validity differs from criterion related validity by its relation to theoretical constructs which are thought to underlie behaviors such as performance, knowledge, or ability. Obtaining an adequate amount of empirical evidence underlying a particular construct is usually a difficult task to undertake, especially if the test is specific to a unique construct. In some cases there are already measures in place to assess the job related aspects of interest for a test. For example, a test assessing an individual’s degree of depression should significantly correlate to a scale such as the Beck Depression Inventory. When using external variables as a source for construct validity evidence one can look at both convergent and divergent evidence. Convergent evidence refers to the relationship between test scores and other measures assessing related constructs, such as described in the previous example. Divergent evidence refers to a lack of relationship between test scores and measures of constructs which should be unrelated to the test construct. For example, a test assessing an individual’s degree of social desirability should not significantly correlate to the Ray Directiveness Scale since research provides evidence for little or no relationship between these two constructs (Ray, 1979). These 17 18 convergent and divergent relationships between the test and other variables can be established by use of correlations or other statistical procedures (Standards, 1999). The uses of criterion and construct validity are only two of the ways that evidence for validity of a test can be found based on its relationship to other variables. Another process through which the Standards advocate gaining validity evidence based on relation of a test to other variables is validity generalization. Validity generalization concerns the use of a test in contexts other than the one for which it was first designed. A tests’ applicability in a new context can be ascertained through different procedures; however, the most commonly used process for obtaining this form of validity evidence is through a meta-analytic study. A meta-analytic study is performed by studying literature and research results concerning the tests’ use in other contexts. A good example is the literature available concerning the wide applicability of the SAT to many different academic contexts. There is a large amount of data available concerning the ability of the SAT, a standardized test of reading, writing, and math skills, to predict a college freshman’s GPA in their first year. A meta-analysis of these studies would indicate that this prediction of GPA by SAT scores is stable across many different universities. As such, a university wanting to use the SAT for admissions would have demonstrable evidence of validity generalization for the use of this test for admissions purposes. Synthetic validity is another way in which a test can have validity evidence based on a relationship to other variables. A test that is comprised of parts of other tests is said to be a synthesized test. If an employer wanted to hire recruiters for their company, they may wish to assess a prospective recruiter in many different ways. Suppose the company 18 19 already had a valid test for predicting performance in this position but they have come to realize that extroversion is also a good predictor for success in this position. Rather than design a whole new test, the company could synthesize together their previous test and part of a psychological inventory that assesses extraversion for a more accurate predictor of a successful employee in the recruiter job position. Under the concept of synthetic validity, if they use an already established and validated procedure to measure extroversion, the new selection process would not need to be assessed again in terms of validity. The caveats related to the use of validity evidence obtained from a tests relation to other variables should be taken into consideration, such as the large degree of error variance that external criterion variables often contain. For example, if using performance appraisal scores of employees as a measure of concurrent criterion validity evidence, there may be errors in the rating system used, or the results may be skewed since the population of test takers were already proven to perform at least adequately in the position given their current employment in the position. It is important that the other variables being used to gather validity evidence are valid and appropriate measure or constructs. Evidence Based on Response Processes Validity evidence based on response processes focuses on the removal of confounding variables in a test. If a test is given to measure extraversion, such as in the previous example for a recruiter job position, care should be taken so that the test takers’ responses are not based on their knowledge that the employer is looking for an 19 20 extraverted person. Typically this form of evidence comes from the analysis of test takers responses to questions concerning about the process they used while taking the test (Standards, 1999). It is important to note that evidence based on response processes may lead to confounded responses of test takers during the data collection procedure if it is undertaken by the evaluator or administrator of the test. Thus, evidence of the rater’s ability to appropriately collect and evaluate responses to an item on a test in accordance with the intended construct of the test should be ensured (Standards, 1999). Evidence Based on Internal Structure of a Test The fourth type of validity evidence, evidence based on internal structure of a test, assesses the relationships among test items and the criterion they are attempting to measure. The Standards identify two sources of evidence based on internal structure; factor structure and internal consistency. The factor structure of a test is generally determined by running a factor analysis procedure in a statistical program. This procedure identifies patterns of correlation between individual items and the constructs represented in a test. It serves as a method for finding the underlying themes or patterns within items of a test. Factor analysis provides the degree of relationship that exists between test items and test constructs. This allows for the creation of test subscales based on item relationships. For a simplistic example, a factor analysis of an elementary school proficiency exam might show that addition and subtraction items are highly correlated with each other and that grammar and spelling questions are highly correlated with each other. This would indicate the presence of two subscales, a math subscale and a verbal subscale, which could be scored 20 21 and analyzed. By evaluating results based on subscales, a greater inference of an individuals’ true ability can be made. Consider two elementary students who have an average overall score on this example proficiency exam. One student performed at an average level on both the math items and the verbal items. The other student performed exceptionally well on the math items but very poorly on the verbal items. To consider these two example students as equal in terms of proficiency with both math and verbal would be an invalid use of the test results, but by assessing the factor structure it can be identified and corrected. The purpose of the test should be considered when determining whether subscales are necessary or an averaged total test result may be adequate. The second type of validity evidence based on internal structure of a test, internal consistency can be found by assessing aspects of the individual items on a test. Inter-item correlations are one way to determine internal structure based on internal consistency; if two items have a high inter-item correlation they are essentially measuring the same construct, not different a unique aspect of the domain being assessed. Another measure useful to assess internal consistency is item-total correlation. This describes how related an item is to the total test score. If an item-total correlation is high it indicates that individuals who tend to get the item correct also tend to score higher on the test, and visaversa. If an item-total correlation is negative, however, it would indicate that individuals who score high on the test tend to score poorly on that particular item. A negative itemtotal correlation is usually an indicator of a poor item or possibly an improperly keyed item. 21 22 Evidence Based on Consequences of Testing The final type of validity evidence stated in the Standards concerns evidence based on consequences of testing and is inherent in many of the other forms of validity evidence. Zumbo (1999) considers the consequences of test use to be an essential part of validation. This form of evidence is somewhat controversial due to its link with adverse impact. The Standards state that one may have to accept group differences on relevant attributes as real if a test measures those attributes consistently with the proposed interpretation of the test. One of the benefits of validity evidence based on the consequences of testing is that it allows one to be sure in ruling out confounds in a test which might be correlated with group membership (Meyers, 2007). In addition, a distinction must be made between evidence that is directly related to the validity of a test and evidence that may inform decisions about social policy, but falls outside the realm of validity (Standards, 1999). Validity and Reliability Validity is often confused with the concept of reliability. Reliability indicates the precision of a test; it is the extent to which a set of measurements is free from randomerror variance (Guion, 1998). A test on which a person consistently scores similarly would indicate a test that has very little random error and would be considered to have high reliability. A test in which a person first scores very highly and then very poorly would be said to have low reliability because the scores that a person receives seem random. Reliability is important because of its ability to determine whether a valid inference can be made based on a test score. Any test that is being assessed for validity 22 23 should first be found to be reliable. According to Guion (1998), “if a test is not reliable, it cannot have any other merit.” A valid test must be reliable but a reliable test is not always valid. Consistent scoring of individuals on a test does not mean that the test is measuring what it is intended to measure. For example, a bathroom scale may report that a person is 10 pounds lighter than they actually are every time someone stands it. That scale would be reliable; however it is reliably inaccurate as to a persons’ true weight. Reliability has its basis in consistency. The scale is only said to be reliable because it had been used repeatedly and produced similar results. The same is true for any test, a test would have to be administered many times and produce similar scores each of those times. Within classical test theory, reliability is thought of in terms of how close a test can approximate an individual’s “true score.” A true score is the score that an individual would have obtained if measurement was perfect, that is, being able to measure without error. Theoretically, every individual has a true score on any test, but one’s observed score, the score actually received on the test, is usually different from this true score as a result of any number of issues. A person’s true score is affected by problems inherent in the test itself, the test takers’ mood, the test taking environment and any number of social or personal influences. Classical test theory states that the relationship between an individual’s true score and their observed score can be expressed mathematically: X=T+E Where X is the observed score of the individual, T is the true score, and E represents the error in a test (Guion, 1998). Classical test theory also expresses the deviation from a true 23 24 score in reference to the variance inherent in the scores themselves and the error variance. This is expressed mathematically as follows: VO = VT + VE Where VO is the variance of the obtained score, VT is the variance of the true scores, and VE is the error variance. If a test were perfectly reliable, VE = 0, the variance of the observed score would be identical to that of the variance of the true score. A different mathematical representation of the notion of reliability is expressed by the function: Reliability = VT VO This equation presents a ratio of reliability in terms of the variance of the true score that can be found in an observed score from a test. This shows how much of the true score variance is shared with the observed score variance. It also shows how much unique variance the observed score has apart from the true score that would indicate error. The importance of reliability in assessing the validity of a test comes from the effect it has on the relation of the observed score to an individual’s true score. The less reliable a test is the less appropriate assumptions are if based on the test results. A separate notion of reliability concerns the concept of internal consistency. This addresses the likelihood that an individual will respond similarly to all items on a test. It is typically assessed by internal consistency coefficients of items taken from a specific content domain. Analyses are performed on these test items that determine how well they perform as a part of the test. The analysis of items tells how difficult an item is and whether or not the item is capable of discriminating between test takers that have 24 25 different ability levels. This can be determined using item difficulty measures such as Cohen’s d, corrected item-total correlations (the correlation of performance on one item to the performance on the test as a whole) and with item characteristic curves (indicators of the area on which an item has the most power in differentiating between individuals of varying ability). Also, if a test happens to be a multiple choice test, these methods can be used to assess item distractors and determine their role in the test. Internal consistency can be assessed through procedures like split half reliability and the Kuder-Richardson equation. Split half reliability consists of dividing a test into two sets of items, scoring each set of items separately and then looking at the correlation between the two sets of items. If a high correlation was found, then people tended to score similarly across the test and there is likely “functional unity” across the test as a whole (Guion, 1998). It is important when doing split half reliability that certain things are understood. For example, it would be improper to split the test into halves in which the first half of questions were compared to the last half, due to the fact that test takers may be experiencing fatigue and therefore may perform more poorly on test items occurring later in a test. The Kuder-Richardson procedures are internal consistency measures that were developed off of two mathematical equation, the 20th and 21st equation. First developed by Kuder and Richardson in 1937, the equations are useful in determining the reliability of a test which has dichotomously scored items (e.g., true/false, correct/incorrect). The Kuder-Richardson formulas may be considered averages of all the split-half coefficients that would be obtained using all possible ways of dividing the test (Guion, 1998). The problem with both the split half reliability and the 25 26 Kuder-Richardson is that they are only useful for assessing the reliability of a dichotomously scored test. This dilemma was attended to by Cronbach, who developed the more commonly used coefficient alpha in 1951.The coefficient alpha uses a general application of the Kuder-Richardson 20th equation (Guion, 1998) and can be used on any quantitative scoring system whether dichotomous or not. Another classical test theory aspect of reliability concerns the precision of the test itself. Precision can be thought of as the absence of error in a test, and is measured by the standard error of measurement. The standard error of measurement (SEM) is the amount of deviation in the test scores that would be found if an individual took the same test numerous times in the same conditions (and if their memory could be wiped clean so that they would not be affected by previous exposure to the test). Once an SEM is obtained, a confidence interval can be made around the score. A confidence interval is a score range within which we can say with a certain degree of confidence that an individual’s score would fall if they took the test repeatedly. A 95% confidence interval means that if an individual were to take the test 100 times they would score within the range specified 95 out of those 100 times (Meyers, 2007). The range of the interval is also an indicator of precision in a test. For example, on a test with 100 points possible a range between 80 and 85 points is more precise than a range between 75 and 90 points. Differential Item Functioning Differential Item Functioning and Item Bias While there are many different forms of validity evidence of concern when evaluating a test, the focus of this work is to address validity as it concerns the 26 27 consequences of testing through bias at the individual item level. Individual items in a test can be assessed in many ways. One way is to assess the ratio of each group answering the item correctly; this is similar to the way adverse impact is assessed on the test as a whole. If items are scored dichotomously, an average of the number of individuals who answered the item correctly could be compared between groups. This simple comparison of means may seem adequate; however, it does little more than indicate which group tended to answer the item correctly more often. Simple differences between groups can often lead to an assumption of bias; but, there are often very real reasons why two groups may score differently. For example, a test item on art history would likely have more correct responses given by a group of art students than a group of psychology students. This would not be an indicator of bias; it would merely represent a greater ability of the art students to answer questions related to art history. To assume that this item was biased against psychology students and to remove it from the test would be inappropriate since the item was designed to assess an individuals’ knowledge of art history, not to avoid any possible group differences. An item is considered biased when equally able (or proficient) individuals from different groups do not have equal probabilities of answering an item correctly. The key point in distinguishing biased items from simple differences in item response is the inclusion of ability as a consideration. Differential Item Functioning (DIF) is a statistical technique designed to help detect and measure item bias. DIF assesses individual items on a test and indicates if one group of individuals has a different probability of answering correctly, but only after differences in ability between groups have been controlled for. 27 28 DIF is a statistical property of an item while item bias is more general and lies in the interpretation (Wiberg, 2007). DIF is necessary, but not sufficient, for an item to be considered biased; that is, if we do not find DIF then we do not have item bias, but if we do find DIF then it may be item bias or it may be item impact. Item impact refers to the occurrence of test takers from different groups having different probabilities of responding correctly to an item due to true differences in ability measured by the item (Dorans & Holland, 1993). Item impact can be measured through the proportion of test takers passing an item regardless of their total score (Wiberg, 2007). A difference in item responses would be expected when the examinee groups differed in knowledge, experience or interest in the items content. It may not always be clear whether differences in groups that are a result of history or preference should be considered bias or whether they can be legitimately useful in selecting the most appropriate individuals based on test results. In fact it might be impossible to remove all DIF because groups do not have the same life experience (Wiberg, 2007). An item displaying DIF may not necessarily indicate bias; in fact, an item may be fair for one purpose and unfair for a different purpose (Zieky, 2003). The distinction between item bias and simple differences between groups is that item bias is a kind of invalidity that harms one group more than another (Holland & Wainer, 1993). If an item displays DIF, it would only be considered biased if the results of the test put individuals at a disadvantage. For example, an item with DIF on a test used to assess group differences would not be biased, but an item with DIF on a test used for employment 28 29 selection would be biased because it harms the employment opportunities of one group of individuals. Measuring Differential Item Functioning DIF analyses extend beyond a simple comparison of average scores between groups by matching individuals first on ability before comparing rates of correct responses. Because of this, some courts have specifically approved using DIF for the review and refinement of personnel tests (Biddle, 2006). The Standards (1999) also support the use of DIF analyses for assessing the fairness of a test and encourage further assessment of items displaying DIF: When credible research reports that differential item functioning exists across age, gender, racial/ethnic, cultural, disability, and/or linguistic groups in the population of test takers in the content domain measured by the test, test developers should conduct appropriate studies when feasible. Such research should seek to detect and eliminate aspects of test design, content, and format that might bias test scores for particular groups (Standards, 1999, Sec. 7.3). Grouping Individuals in DIF Analyses In the excerpt from the Standards (1999) above there is a listing of some groupings of individuals, i.e., age, gender, racial/ethnic, etc., highlighting the different possible groupings that can be compared and assessed with DIF analyses. In assessing DIF, groups of individuals can be created based on any group characteristic of concern. Typically, DIF studies concern one “focal” group of individuals that the study is 29 30 concerned with and a separate group meant to serve as the “reference” group. In many employment selection environments the groupings tend to center around protected groupings of individuals such as those stated in the Standards excerpt above. Ability Measures in DIF Analyses In order to assess desired groupings of individuals on an item a measure of ability is necessary. Because true ability is unobservable (Scheuneman & Slaughter, 1991) a proxy for ability must be established. Many of the methods used for DIF detection utilize the total test score as a measure of ability in order to assess an individual item. While it may seem circular to base an ability measure on a test that may contain DIF items, it is one of the most appropriate criterion measures available. This is because there is not usually another indicator of ability available to use for differentiating between individuals in two groups. Also, the test itself will inevitably be the most relevant to the item because each item of the test was designed to specifically assess the dimension of interest. While many methods for detecting DIF utilize total test score as a proxy for ability level, it should be noted that there is also another measure of ability that is used for detecting DIF, called a theta value. Theta is an estimated true score based on item response theory (IRT) methods. The theta value used in IRT methods is quite different from the use of ability measures that reflect only the number of items answered correctly on a test since it is based on a systematic relationship, which can be assumed through mathematical functions, between levels of the trait being assessed and the likelihood that the individual will respond in a particular way (Guion, 1998). Despite the differences in 30 31 methodology, IRT true scores are still estimated from performance on the test and hence are not independent estimates of “true” ability” (Scheuneman, 1991). The use of total test score as an appropriate matching criterion is made more stable by the validation process of the test itself. The more validity evidence indicating that a test can provide appropriate inferences based on test scores, the greater its strength as a matching variable. It should be noted that the reliability of a test is also an important factor in regard to the appropriateness of the total test score as a matching variable. This is because, within each group, a reliable and valid test would properly discriminate between those of high ability and those of low ability and do a reasonably satisfactory job of rank ordering individuals on that ability dimension (Scheuneman, 1991). Test length also affects the accuracy of total score as a measure of ability level; the longer the test, the more reliable the total scores (Clauser & Mazor, 1998; Rogers & Swaminathan, 1993). In some instances a further step is taken to ensure that the total test score used for matching is free of questions that may be unfair by removing items with elevated values of DIF before matching individuals (Camilli & Shephard, 1994; Wiberg, 2007; Zieky, 2003). When this sort of approach is used it is important to always include the item being assessed for DIF in the overall test score used for matching (Donoghue, et al., 1993; Dorans & Holland, 1993; Holland & Thayer, 1988; Lewis, 1993; Zwick, 1990). One reason for excluding items with elevated DIF levels before matching individuals is that the percentage of DIF items in a test can reduce the validity of the total test score as the matching variable. According to Jodoin & Gierl (2001), the greater the percentage of 31 32 items with elevated DIF the more likely there will be errors made when identifying items that display DIF. Still, Mazor et. al. (1995) state that a high percentage of DIF items in a test may be indicative of the dimensional complexity of the test rather than bias per se. This is due to the fact that apparent DIF may sometimes be the result of multidimensionality in a test that measures complex, multidimensional skill areas. According to Mazor et. al. (1995) several studies have found high percentages of items exhibiting DIF in well constructed tests. Multidimensionality of a test can trigger the detection of DIF items because when a test assesses many dimensions, the total score, used as a matching variable, is actually a composite score comprised of the abilities of individuals in the many different dimensions. The total score will be affected by the number of items representing each dimension in the test. Consider a verbal aptitude test for example; if there are many items focusing on sentence completion and only a few assessing other dimensions of verbal ability, the total score would reflect an individuals’ ability more appropriately matched on sentence completion since the majority of the total score was based on those items. Consequently, an item assessing comprehension of written text may display DIF because individuals were not well matched on this form of verbal ability. Mazor et al., (1995) found that use of multiple ability measures can be used to correct for the identification of non-DIF items. For example, comprehension items originally identified as DIF may not be identified if individuals are matched on both their sentence completion ability and comprehension ability. When DIF analyses utilize only one ability measure (i.e, total test 32 33 score) the test should be relatively uni-dimensional in order to best assess and detect DIF in individual items. Factors Affecting DIF Detection Just as some items can erroneously display DIF when they are not in fact bias it is also possible for a test with biased items to avoid DIF detection. If the test is measuring an extraneous trait across the test as a whole it will be incidentally measuring an additional domain, knowledge, or skill beyond that which it was intended to measure, but because it assesses it equally on all items no detection of that extraneous trait could be made (Donoghue & Allen, 1993). Another characteristic that affects the detection of DIF in items concerns the sample size. Items are more reliably flagged as displaying DIF when based on large sample sizes (e.g., more than 500 individuals) than on small sample sizes (e.g., less than 100 or so); (Biddle, 2006). The presence or absence of biased items in a test does not constitute evidence of bias or lack of bias in test scores (Scheuneman & Slaughter, 1991). In fact, even when there is a finding of bias it might be contradictory to the Guidelines to remove it if it shows strong validity evidence, unless that validity evidence is stronger in the majority group than in the protected group(s) and/or equal ability individuals from the different groups have different success rates on the item. This is because an item which displays strong validity is likely a good measure of the skill, knowledge, or ability of interest. Removing the item would decrease the overall validity and usefulness of the selection tool. As with valid tests which display adverse impact, there is no clear determining point 33 34 as to when items should be removed from a test based on bias and retained based on validity. An item displaying DIF needs to be evaluated to determine whether it is truly biased based on group membership. While some items can clearly be identified as valid or not, the majority call for judgments to be made on the basis of vague or unspecified criteria (Scheuneman & Slaughter, 1991). To date, a body of research has yet to emerge that all observers can agree demonstrates that the scores of minority examinees are or are not biased. This lack of certainty leaves people free to accept or reject the various findings according to which of these agree with their individual “biases” concerning what they believe to be true. Given that the true ability or skill we are trying to measure is unobservable and that the stakes of testing are so high, this situation is likely to remain unchanged for some time to come (Scheuneman & Slaughter, 1991, p. 13). Uniform/Non-uniform DIF There are two types of DIF, uniform and non-uniform differential item functioning. Uniform differential item functioning occurs when an item on a test affects one group of individuals differently than another evenly across ability levels, e.g., one group of individuals always scores higher on an item than another group. Figure 1 presents an illustration of this concept. 34 35 Performance on an Item Uniform DIF 1.8 1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 0 High Low Group A Group B Ability Level Figure 1. In uniform DIF individuals of different groups score at an equally different level across the range of ability. Non-uniform differential item functioning measures the presence of an interaction between ability level and group membership. For example, test takers low in ability may score higher if they are in group A and lower in group B, while those with moderate ability score comparably on the item and those with a high level of ability may have a greater score on an item if they are a member of group B. In this instance we could say that while group A does not always score better than group B or vice versa, the item functions differently for the two groups. Figure 2 presents a simplified illustration of this concept. 35 36 Non-Uniform DIF (Interaction) Performance on an Item 0.9 0.8 0.7 0.6 0.5 Group A 0.4 Group B 0.3 0.2 0.1 0 Low High Ability Level Figure 2. In non-uniform DIF individuals of different groups do not score at an equally different level across the range of ability. DIF Detection Methods There are many different methodologies for detecting differential item functioning in a test item. This study focuses on two of the more popularly used methods for detecting DIF in test items, the Mantel-Haenszel (MH) method and the logistic regression (LR) method. These methodologies were selected for this study because of their relative ease in use and accessibility in an applied setting. For a detailed review of other methodologies used for assessing DIF, such as the IRT method referenced earlier, consult Holland & Wainer (1993). Mantel-Haenszel The Mantel-Haenszel procedure is widely considered to be one of the most popular and commonly used procedures for detecting DIF (Clauser & Mazor, 1998; 36 37 Dorans & Holland, 1993; Hidalgo & López-Pina, 2004; Mazor, et al., 1995; Wiberg, 2007; Zwick, 1990). Rogers & Swaminathan (1993) indicate that its popularity is likely due to “computational simplicity, ease of implementation, and associated test of statistical significance.” The procedure was first developed by Mantel & Haenszel (1959) in order to control extraneous variables when studying illnesses in populations. It was later proposed as a method for detecting DIF by Holland & Thayer in 1988. The MH procedure is a variation of the chi-square test that assesses the associations between two variables, both of which are dichotomous, for example, correct/incorrect or focal group/reference group. MH is an easy method for detecting possible bias in an item on a test and takes into account ability level by matching groups first on this basis before analyzing them for differences in correct/incorrect item response probabilities. For this reason it falls into the classification of a contingency table method. When detecting DIF with the MH method, multiple 2x2 contingency tables are created, one for each level of ability. Within the tables, the probabilities of individuals from one group correctly/incorrectly responding to an item are compared to the probabilities of individuals from the other group (see Figure 3). Correct Incorrect Probability that members of Group Probability that members of Group A correctly responded to the item A incorrectly responded to the item Probability that members of Group Probability that members of Group B correctly responded to the item B incorrectly responded to the item Group A Group B Figure 3. Example structure of an MH contingency table at one ability level. 37 38 A comparison of probabilities is performed by obtaining the ratio of focal group probabilities to reference group probabilities. These ratios are then averaged across all of the ability levels, with greater weight given to those ratios gathered from larger sample sizes; this is in response to the larger error found in smaller sample sizes. MH is a DIF detection method that typically matches ability by total test score. As such, there can be as many 2x2 tables as there are items in the test. This is called thin matching and it allows for the assessment of every ability level possible based on total test scores. Thin matching is the strategy typically used for MH DIF studies and yields the best results for long tests (40+ items) with adequate sample sizes (1600+ individuals) (Donoghue & Allen, 1993). If, however, one of the table cells lacks individuals from each group (e.g., there are no focal group individuals who scored at one of the ability levels) then it is not analyzed, leading to a loss of data. As a result, thin matching can yield poor results for short tests (5 or 10 items) and/or those with smaller sample sizes (Donoghue & Allen, 1993). To avoid this, grouped ability levels can be created by placing individuals into groups comprised of more than one observed test score, this method is called thick matching. Thick matching has the greatest advantage when the test has a small sample size or when there is little variation in the scores of individuals. This is because less of the data is discarded as a result of an empty cell in one of the ability contingency tables. Thin and thick matching techniques are generally the most commonly used forms of ability matching in MH, however, there are many other methods for matching; see Donoghue & Allen (1993) for a description of other strategies used to match individuals on total test score. 38 39 MH is useful in detecting DIF in dichotomous items and can be extended to polytomously scored items. The MH method can both detect and measure the size of DIF in an item. It cannot, however, be used to identify non-uniform DIF (Wiberg, 2007). It has been suggested that MH can be modified for use in detecting non-uniform DIF (Mazor, et al., 1994), but the appropriateness of such an extension is still questioned (Wiberg, 2007). Because MH is capable of both detecting and measuring the amount of DIF, items can be identified as possessing DIF and can also be classified into levels of DIF based on the amount displayed. The MH produces a statistic based on the odds ratio of the reference and focal groups summed across the many ability levels. The odds ratio is created by summing the numbers of reference group individuals who answered the item correct along with the number of focal group individuals who answered the item incorrectly, this summation is then divided by the total number of individuals at the given score level. This value is then divided by the summation of the number of reference group members who answered the item incorrectly and the number of focal group members who answered the item correctly divided by the total number of individuals at the given score level. This value is then generally used to determine an effect size estimate of the MH procedure; created by performing a linear transformation to create a “delta” metric for use in categorization. The calculation is performed by taking the natural log of the odds ratio multiplied by -2.35. This delta metric, MH D-DIF (Curley & Schmitt, 1993; Wiberg, 2007; Zwick, 1990), is used to indicate item difficulty in the test development process. 39 40 The MH D-DIF measure can range from negative infinity to infinity. A negative MH D-DIF value would indicate that the item displays DIF against the focal group, while a positive value would indicate DIF against the reference group. A MH D-DIF value of zero would indicate lack of DIF in the item. The MH D-DIF metric and a measure of statistical significance can then be used to classify DIF into various levels. The most commonly used categorization method for DIF while using the MH method is one currently in use by the Education Testing Service (ETS). This commonly used classification system labels DIF items as having either Type A, B, or C DIF. Type A DIF is classified as items whose MH D-DIF absolute values are less than 1, this level of DIF is often considered to show negligible differences between the two groups. Type B DIF is classified as those items which have an absolute value between 1 and 1.5 and that are also statistically significant, this level of DIF is considered to show an intermediate or moderate difference between the two groups. Type C DIF is classified as items having an absolute MH D-DIF value above 1.5 and that are also statistically significant, this level of DIF is considered to show a large difference between the two groups (Hidalgo & López-Pina, 2004). Logistic Regression The second method addressed for the detection of DIF in this study is the logistic regression (LR) method. It is a regression model that was first proposed for the detection of DIF by Swaminathan & Rogers (1990). Regression models predict a dependent variable in terms of independent variables; logistic regression is designed to predict a dependent variable that is dichotomous (i.e., correct/incorrect item responses). It can also 40 41 be extended to work with factors that have more than two categories. The LR method differs from MH in many ways; it is a parametric test of DIF and is designed to assess both uniform and non-uniform DIF. The LR method also allows for the use of continuous ability measures, thus it allows for all ability levels (scores on the total test) to be evaluated regardless of whether there are members from both the focal and reference group with that same total test score. With the MH procedure there may be a necessity to group individuals into ability levels, as in thick matching, to prevent data loss when there are no individuals in one of the groups at a particular core level. For this reason some consider the LR method to be more efficient than the matching process undertaken for MH DIF detection (Mazor, et al., 1995). LR assesses DIF in an item by entering the independent variables of the model (total test score and group membership) in a particular order to predict the likelihood of an individual answering an item correctly. The ability level (total test score) is entered first into the regression model. This allows the model to assess how much of the predicted likelihood of answering the item correctly is due to an individuals’ ability on the test as a whole. The second variable (group membership) is then entered into the model in order to assess how much of the individuals’ likelihood of answering an item correctly is related to their group membership above and beyond what would be expected by their ability level. If an individuals’ likelihood of answering the item correctly is significantly affected by their group membership after accounting for differences in ability, then the item is exhibiting DIF. LR is capable of also assessing non-uniform DIF simultaneously with the assessment of uniform DIF. This is done by lastly including into 41 42 the model the interaction of ability and group membership as an independent variable. If there is a significant interaction effect, indicating that groups perform differently at different levels of ability, then that item is displaying non-uniform DIF. The LR method is used for the study of DIF in preference to other regression methods because test items are typically scored on a binary scale (correct/incorrect). Logistic regression is generally thought to be superior to other forms of regression for the detection of DIF in dichotomously scored items. For example, unlike linear regression, logistic regression will not give you results of less than zero or greater than one, a common problem in linear regression despite the possible score item outcomes of only 0 (incorrect) or 1 (correct). Other forms of regression can be applied when dealing with items scored on a scale other than binary; ordinal logistic regression can be used for rating scale or Likert-type items and ordinary least-squares regression can be used for continuous or Likert-type items that have many scale points (e.g., more than 6 scale points) (Slocum, et al., 2003). At each of the steps in the LR process an effect size measure is created which can be used to assess the level of DIF in an item; this is computed by most statistical software programs as a Nagelkerke R-squared value. The Nagelkerke R-square is an approximation of the R-square used in other regression models however, some caution that it should not be treated as if it were a measure of the proportion of variance accounted for, such as in the use of R-squared values in other regression models (Cohen, et al., 2003). While it may not be appropriate to interpret and use these values in the same way as R-squared values obtained in other analyses, this study will refer to these values 42 43 as simply R-squared for the purposes of this study. The R-squared value from each of the steps of the LR method can be used to determine the magnitude of DIF by subtracting the R-squared value of the step prior (Zumbo, 1999). For example, to determine the magnitude of uniform DIF the R-squared value of the first step (ability level only) is subtracted from the R-squared value of the second step (group membership added after ability level). This allows for a measurable difference of the effect that group membership has on the likelihood of answering an item correctly after accounting for ability level. This same process can be used when assessing the magnitude of nonuniform DIF, in this case the R-squared value of the first step would be subtracted from the R-squared value of the third step (the interaction of ability and group membership is added after the two individual variables have been added) (Slocum, et al., 2003). Uniform DIF is only considered present when steps 1 and 2 differ significantly and there is no significant difference between steps 1 and 3 (Swaminathan & Rogers, 1990). Once the differences in R-squared values have been determined they can be evaluated to assess and categorize levels of DIF. One of the more commonly used criteria for categorizing DIF levels into categories of A, B and C, like those used in the MH procedure is the effect size criteria of Jodoin & Gierl (2001). If the difference between Rsquared values is below 0.035 then there is said to be negligible DIF (Category A), differences in R-squared values between 0.035 and 0.070 are considered moderate DIF (Category B), and differences in R-squared values above 0.070 are considered large DIF (Category C). It should also be noted that there is a second commonly used categorization system by Hidalgo & Lopez-Pina (2004), however the categorization system by Jodoin & 43 44 Gierl (2001) was proposed as being more sensitive to detecting DIF (Wiberg, 2007) and was therefore used in preference to the Hidalgo & Lopez-Pina (2004) system for this study. The R-squared differences are used to determine the effect size of DIF in an item; however, they do not identify which group is performing at a lower rate on the item after controlling for ability. The odds ratio is used to determine if one group has higher odds of responding correctly to an item than another group after ability has been accounted for (Slocum, et al., 2003) and this will tell you the direction of the DIF. DIF Detection Method Comparisons Previous studies have found differences in performance by the MH and LR methods that would sometimes suggest the appropriateness of using one method over the other; those differences, which are applicable to this study, will be highlighted here. It is important to note that some of the studies which identify differences between the two methods have yet to be replicated and that emerging research continues to accumulate on differences between these two methods for DIF detection. The LR method differs from the MH method greatly in its reliance on certain assumptions. For example, unlike the MH method, LR assumes that there is a linear relationship between the probability of individuals answering an item correct and their total test score (Wiberg, 2007). There is also an assumption in LR that the scores of individuals are distributed evenly; this is because of the process by which the LR method attempts to fit the scores to the regression curve. When there are not enough observed scores at either the extremely low or high end of the possible score range, the LR may not 44 45 be able to as accurately predict an individuals’ likelihood of responding to an item correctly. Data that are not evenly spread across all of the possible score ranges are not quite as problematic for the MH procedure, since contingency tables without information in one of the cells are skipped in the process of detecting DIF (Rogers & Swaminathan, 1993). Studies have also found that the LR method for detecting DIF is affected by the percentage of items in the test containing DIF while the MH method is not (Rogers & Swaminathan, 1993). According to Rogers and Swaminathan (1993) there are pros and cons to the use of each method in terms of uniform/non-uniform DIF; because the LR procedure contains parameters representing both uniform and non-uniform DIF, it may be less powerful than the MH procedure in detecting strictly uniform DIF. Conversely, the MH procedure has been found to have slightly higher detection rates for uniform DIF (Rogers & Swaminathan, 1993) but since it is designed for the detection of DIF that is constant across the range of ability levels, it is generally not able to detect non-uniform DIF and when it is modified to do so, it may not be effective in detecting non-uniform DIF. An advantage that the LR method has over the MH method is the ability to be extended for use with multiple ability measures. Mazor et al. (1995) found a greater ease of use with the LR method than with MH given the difficult nature of building contingency tables off of multiple ability estimates in the MH method. While the current study is only concerned with DIF detection methods that utilize one ability measure (total test score), it should be noted that some tests may require the use of multiple ability measures to be assessed appropriately and that the differences between the two methods 45 46 could differ based on the degree of multidimensionality in the ability measure used. According to Clauser, et al., (1991) the choice of using either total test score or subtest scores has a substantial influence on the classification of items showing DIF. Studies in which the LR and MH methods have been compared have found MH to be superior in regards to Type I errors; a greater rate of false positives has been found when using the LR method for DIF detection (Mazor et al., 1995). Additionally, a study by Ibrahim (1992) suggests that the level of false positives detected by both methods may increase with larger sample sizes. In addition to affecting false positive rates, the sample size, or number of test takers involved in an analysis of DIF, can affect the ability of detection models to accurately identify and classify DIF in an item. The MH method may be more appropriate when sample sizes are low (Schumacker, 2005) since the LR method requires a larger sample to function appropriately, though there is still likely to be loss of accuracy in the MH method given the necessity of thick matching with small sample sizes. Rogers and Swaminathan (1993) indicate that the sample size has a strong effect on both the MH and LR method and Mazor et al., (1994) found that the percentage of DIF items correctly identified decreased with smaller sample sizes. Purpose of the Study Mazor et al. (1995), Scheuneman & Slaughter (1991) and others have claimed that regardless of which criterion the comparison is based on, the MH and LR procedures result in similar numbers of items (and similar items) being identified. It is precisely this statement that the current study seeks to either support or provide evidence against. Also important to the purpose of this study is the practical usefulness of one method over the 46 47 other. For example, if for long tests the MH is a simpler and less expensive method than others, as Clauser and Hambleton (1994) suggest, then a negligible difference between the two methods in terms of number of items or degree of DIF identified could support a preference for the use of the MH method over the LR method in applied settings where time and cost are of concern. Conversely, if LR is found to outperform the MH method to such a degree that the overall adverse impact seen in a test would be altered then there would be justification for the practical use of the LR method regardless of the impacts on time or cost. The purpose of this study is to serve as a comparative analysis of two measures for detecting differential item functioning (DIF) in data concerning individual test items. The methods compared in this study are two of the more commonly used procedures in the testing arena; the Mantel-Haenszel chi-square and the logistic regression procedure. The study focuses on the overall effect each method has on adverse impact when used for the removal of items from a test. It is this author’s hypotheses that 1) adverse impact will be decreased by the removal of items that display DIF, and 2a) that the overall adverse impact of the test will be different depending on the method used for detecting differential item functioning, but 2b) that there will be no practical significance in the differences found. 47 48 Chapter 2 METHOD Sample Description The data used in this study were provided by Biddle Consulting Group, Inc. and included test item responses of 29,799 job applicants for entry level security officer positions in over 180 locations throughout the United States and Guam during 2007. For the sake of test security and client confidentiality, the specific names of the test and administering company have been excluded. The sample used for this study included only the applicant data for which self identified demographic data were available (N = 29,171). As shown in Table 1, self identified gender of the applicants was 69% male and 31% female. Self identified ethnicities of the applicants was 50.7% Caucasian and 49.3% minority; the minority status is comprised of American Indian/Alaskan Native (0.8%), Asian (1.9%), Hispanic (8.8%), African American (35.0%), Native Hawaiian/Pacific Islander (0.6%), and those self identified as belonging to two or more races (2.3%). Instrument All applicants for an entry-level security officer position with the company for which this test was developed were required to take a multiple-choice test for consideration in the hiring process. The test was designed to measure the basic knowledge, skills, abilities, and personal characteristics that were found through a content validated job analysis to be linked to critical duties of the position and that are necessary on the first day of the job. The test included 49 multiple choice items. These items were 48 49 scored 1 for correct responses and 0 for incorrect responses, for a total possible score of 49. The mean test score of all 29,171 applicants was 40.20 with a standard deviation of 5.30 and an internal consistency reliability coefficient (Cronbach’s alpha) of .774. The individual item means and standard deviations can be found in Appendix A. The mean test scores and standard deviations by demographic group are presented in Table 2. Table 1 Demographic Characteristics of Examinees (N = 29,171). ___________________________________________________________________________ Characteristics N % ___________________________________________________________________________ Gender Male 20,136 69.0 Female 9,035 31.0 Caucasian 14,780 50.7 Total Minority 14,391 49.3 American Indian/Alaskan Native 244 0.8 Asian 563 1.9 Hispanic 2,554 8.8 African American 10,197 35.0 Native Hawaiian/Pacific Islander 162 0.6 Two or More Races 671 2.3 Ethnicity ___________________________________________________________________________ 49 50 Table 2 Descriptive Statistics of Examinee Test Scores (N = 29,171). ___________________________________________________________________________ Characteristics M SD ___________________________________________________________________________ Gender Male 40.43 5.32 Female 39.66 5.17 Caucasian 41.96 4.26 Total Minority 38.28 5.64 American Indian/Alaskan Native 40.64 4.58 Asian 37.80 6.00 Hispanic 38.98 5.72 African American 38.08 5.59 Native Hawaiian/Pacific Islander 38.61 5.08 Two or More Races 40.26 4.72 Ethnicity ___________________________________________________________________________ Procedure DIF Analysis for Item Removal Two DIF analyses were performed on each item of the selection test. One analysis was performed using the MH method and another was performed using the LR method. Though some researchers find that it may seem inconsistent to focus on only one type of DIF because it can be detected with little effort or expense (Rogers & Swaminathan, 1993), it is the purpose of this study to analyze the most typical way that DIF detection 50 51 methods are used. Therefore, in this study the MH procedure was used only to identify uniform DIF (even though it is possible to extend it for use in detecting non-uniform DIF) and the LR method was used to analyze both uniform and non-uniform DIF simultaneously. Given the large sample size and number of items, DIF detection analyses with the MH method were performed using thin matching; every item was analyzed across all possible test score values (0-49). The absolute MH D-DIF value, a measure of the effect size of DIF, is calculated by analyzing the odds ratios of the groups assessed across the many ability levels and transforming them into a “delta” metric by taking the natural log of the odds ratio across the score levels and multiplying it by -2.35. The MH D-DIF value, which can range from negative infinity to positive infinity, was then used to categorize the items by the classification rules developed by ETS as laid out in a study by Hidalgo & López-Pina (2004). Items were classified as displaying large DIF (ETS classification category C) when the absolute MH D-DIF values were greater than 1.5 and statistically significant. Items were classified as displaying intermediate DIF (ETS classification category B) when the absolute MH D-DIF values were between 1.0 and 1.5 and statistically significant. All items with an absolute MH D-DIF value that were not significant or that were below 1.0 were classified as displaying small or no DIF (ETS classification category A). In order to detect DIF with the LR method, analyses were performed using a 3 stage logistic regression procedure which assessed the dependent variable of applicant score responses (0 incorrect, 1 correct) for each item. At stage 1 total test score was 51 52 included in the model; total test score was included to serve as a proxy for ability and was entered first in the process so that further stages of the analyses could examine other attributes while the ability level of applicants was controlled for. At stage 2 the variable for group membership was included in the model; this stage of the process was performed to assess response differences as a result of group membership when applicant ability was controlled for. This comparison of group membership with applicant ability controlled for assesses the presence of uniform DIF. At stage 3 the interaction of total test score and group membership was included in the model; this stage of the process was performed to assess response differences as a result of the interaction of ability and group membership when applicant ability was controlled for. This comparison of an ability and group membership interaction with applicant ability controlled for assesses the presence of nonuniform DIF. The logistic regression procedure computes the amount of variance in the applicants score responses that can be accounted for by the variables entered into the model at that stage and the previous stages. In the first stage the variance accounted for by the total test score is computed. The second stage of the model computes the amount of variance accounted for by both total test score and group membership. The third stage of the model computes the amount of variance accounted for by both the variable in stage 2 and the interaction of total test score and group membership. The amount of variance accounted for at each stage is represented by a Nagelkerke R2 value. The Nagelkerke R2 value can then be compared across the different stages to detect DIF. To determine nonuniform DIF, the Nagelkerke R2 values at stage 3 and stage 1 are compared. If the 52 53 difference between these two values was not significant, a second comparison was made to determine the presence of uniform DIF by comparing the Nagelkerke R2 values at stage 2 and stage 1. The classification of small, intermediate and large levels of DIF was modeled after the classification criteria suggested by Jodoin and Gierl (2001). The classification of large DIF (equivalent to an ETS classification category C) was applied if a statistically significant change in Nagelkerke R2 values was greater than .070. If there was a change between Nagelkerke R2 values between .035 and .070 and this change was statistically significant, the classification of intermediate DIF (equivalent to an ETS classification category B) was applied. A categorization of small or no DIF (equivalent to an ETS classification category A) was applied to items that did not have a statistically significant difference in Nagelkerke R2 values or a difference of less than .035. The categorization of DIF levels was applied to items displaying large or intermediate DIF regardless of whether or not the negatively impacted group is a legally protected group or reference group. The categorizations were then used to create two alternate test scores for each applicant, one in which all items categorized as displaying large or intermediate DIF using the MH method were removed, and one in which all items categorized as displaying large or intermediate DIF using the LR method were removed. This created 3 total test scores available for adverse impact analyses; the original 49 item total test score (Original Test), the test score based only on items which did not have large or intermediate DIF detected by the MH method (MH Test) and the test score based only on items which did not have large or intermediate DIF detected by the LR method (LR Test). 53 54 Adverse Impact Analyses The demographic groups available in the data set were used to create comparison groups for adverse impact analyses. One comparison was made with respect to gender, Male v. Female. Seven comparisons were made with respect to ethnicity (1) Caucasian v. all other ethnic groups, labeled Total Minority, (2) Caucasian v. American Indian/Alaskan Native, (3) Caucasian v. Asian, (4) Caucasian v. Hispanic, (5) Caucasian v. African American, (6) Caucasian v. Native Hawaiian/Pacific Islander, and (7) Caucasian v. two or more races. Since it was not the purpose of this study to determine appropriate cut-off scores for selection tests, the three test scores were analyzed with respect to adverse impact at all possible cut-off scores. The Original Test, MH Test and LR Test scores were analyzed at all possible cut-off scores for adverse impact using the 80% rule and the Fisher Exact procedure. A test of practical significance was also assessed for each result that indicated adverse impact in the test. If the passing rate of one group within a comparison was not at least 80% of the passing rate of the other group at a particular cut-off score, it was marked as a violation of the 80% rule. If the p value of the Fisher Exact test at a particular cut-off score was below .05, it was marked as displaying statistically significant adverse impact. If the p value of a Fisher Exact test at a particular cut-off score was between .05 and .10, it was marked as approaching statistically significant adverse impact. Practical significance was assessed for all violations of the 80% rule and all statistically significant findings of adverse impact. Because the intent of this study was to 54 55 utilize the most commonly used methods for DIF detection and adverse impact analyses, the 80% rule and Fisher’s Exact test were re-run after changing the status of two individuals in the lower passing rate group to a pass instead of a fail. This method of assessing practical significance was considered to be the most commonly used method because it was used in two court cases in which the courts found that if two or fewer persons from the group with the lowest pass rate were hypothetically changed from “failing” to “passing” status, and this resulted in eliminating the statistical significance finding, the results were not to be considered practically significant (Biddle, 2006). The court cases which changed the status of two individuals were U.S. v. Commonwealth of Virginia (569 F2d 1300, CA-4 1978, 454 F. Supp. 1077) and Waisome v. Port Authority (948 F.2d 1370, 1376, 2d Cir.,1991). A third court case Contreras v. City of Los Angeles (656 F.2d 1267, 9th Cir., 1981) involved the hypothetical status change of three individuals. These analyses were run for each of the eight demographic group pairs on all three tests so that each possible cut-off score of the three tests had an assessment of the 80% rule, statistical significance and test(s) of practical significance if applicable. The number of occurrences of 80% rule violations, statistical significance findings of adverse impact and the occurrence of practically significant findings of adverse impact were noted and compared to assess the overall effect each method had on the adverse impact of this test when used for the removal of items. 55 56 Chapter 3 RESULTS The main research issues that this study sought to address were: 1. The potential for decreasing overall adverse impact of a test by the removal of items displaying DIF. 2. The differences in overall adverse impact of a test when different methods are used for detecting and removing items which display DIF. 3. The practical significance in the differences found between the use of the two methods for item removal as it relates to adverse impact findings. To address these issues as they apply to the test and procedures used in this study, analyses of differential item functioning were performed using two methods, the MH method and the LR method. New test scores were created based on the removal of items displaying moderate or large levels of DIF, and the results of adverse impact analyses performed on these new tests, as well as the original test, were evaluated. DIF and Item Removal Mantel-Haenszel Analyses Each of the 49 items available in the test was assessed for DIF using the MantelHaenszel procedure among the eight comparison groups; this resulted in 392 assessments of DIF using the MH method. Of the 392 assessments, 3.6% (14) were classified as displaying a large amount of DIF, 9.4% (37) of the assessments were classified as 56 57 displaying intermediate DIF and 87% (341) of the assessments were classified as displaying small or no DIF. Table 3 displays the DIF classifications by comparison groups. The MH DIF values and DIF classification level of each item is presented in Appendix B. Table 3 MH Method DIF Classifications by Reference Group. _____________________________________________________________________________ Comparison Group ETS Classification Category A B C Male/Female 45 2 2 Total Minority/Caucasian 40 7 2 American Indian/Caucasian 46 3 0 Asian/Caucasian/Caucasian 41 6 2 Hispanic/Caucasian 45 3 1 African American/Caucasian 40 7 2 Hawaiian/Caucasian 37 9 3 Two or More/Caucasian 47 0 2 Grand Total 341 37 14 Note. The ETS classification category of A corresponds to small or no DIF, the category of B corresponds to moderate or intermediate levels of DIF, and the category of C corresponds to large levels of DIF. Twenty items on the test were found to display either moderate or large levels of DIF on one or more group comparisons when assessed with the MH method. Only 1 item, item #10, displayed a moderate or large level of DIF on all eight comparison groups. The 20 test item numbers and a count of the group comparisons displaying either 57 58 moderate DIF (ETS Classification Category B), or large DIF (ETS Classification Category C) are shown in Table 4. The remaining 29 items which displayed small or no DIF on all 8 group comparisons were retained to create the new test score based only on items which did not have large or intermediate DIF detected by the MH method. The term MH Test will be used to describe further results of this study as they pertain to the combined set of these 29 original test items; the complete list of item numbers on the MH Test can be found in Table 5. The mean test score of all 29,171 applicants on the 29 item MH Test was 23.32 with a standard deviation of 3.44 and a reliability coefficient (Cronbach’s alpha) of .660. The individual item means and standard deviations can be found in Appendix A. The mean test scores and standard deviations by demographic group are presented in Table 6. 58 59 Table 4 MH Method DIF Classifications by Item Number. ETS Classification ETS Classification Category B Category C 3 1 1 4 1 0 7 2 0 10 4 4 11 1 0 13 1 0 16 3 2 18 1 0 19 3 0 20 2 1 26 2 0 31 1 0 34 3 1 35 3 0 36 4 0 38 1 0 43 0 1 45 1 0 46 1 0 47 2 4 Grand Total 37 14 Item # Note. The ETS classification category of A corresponds to small or no DIF, the category of B corresponds to moderate or intermediate levels of DIF, and the category of C corresponds to large levels of DIF. 59 60 Table 5 Item Numbers Displaying Small or No DIF with the MH Method. Original Test Item Numbers of the MH Test 1, 2, 5, 6, 8, 9, 12, 14, 15, 17, 21, 22, 23, 24, 25, 27, 28, 29, 30, 32, 33, 37, 39, 40, 41, 42, 44, 48, 49. Table 6 Descriptive Statistics of MH Test Scores (N = 29,171). ___________________________________________________________________________ Characteristics M SD ___________________________________________________________________________ Gender Male 23.43 3.46 Female 23.06 3.38 Caucasian 24.34 2.88 Total Minority 22.27 3.65 American Indian/Alaskan Native 23.59 2.93 Asian 21.76 3.90 Hispanic 22.62 3.69 African American 22.11 3.65 Native Hawaiian/Pacific Islander 22.30 3.25 Two or More Races 23.27 3.13 Ethnicity ___________________________________________________________________________ 60 61 Logistic Regression Analyses Each of the 49 items available in the test was assessed for DIF using the logistic regression procedure among the eight comparison groups; this resulted in 392 assessments of both uniform and non-uniform DIF using the LR method. No DIF assessments were found to display non-uniform DIF. Therefore, all remaining discussion of DIF results are applicable only to uniform DIF findings. No DIF assessments were classified as displaying a large amount of DIF, 1.3% of assessments (5) were classified as displaying intermediate DIF and 98.7% of assessments (387) were classified as displaying small or no DIF. Table 7 displays the DIF classifications by comparison groups. The Nagelkerke R2 values and DIF classification level of each item is presented in Appendix C. Three items on the test were found to display either moderate or large levels of DIF on one or more group comparisons when assessed with the LR method. No items displayed a moderate or large level of DIF on all eight comparison groups. The 3 test item numbers and a count of the group comparisons displaying either moderate DIF (ETS Classification Category B), or large DIF (ETS Classification Category C) are shown in Table 8. 61 62 Table 7 LR Method DIF Classifications by Reference Group. _____________________________________________________________________________ Comparison Group ETS Classification Category Male/Female A 49 B 0 C 0 Total Minority/Caucasian 47 2 0 American Indian/Caucasian 49 0 0 Asian/Caucasian 48 1 0 Hispanic/Caucasian 49 0 0 African American/Caucasian 47 2 0 Hawaiian/Caucasian 49 0 0 Two or More/Caucasian 49 0 0 Grand Total 387 5 0 Note. The ETS classification category of A corresponds to small or no DIF, the category of B corresponds to moderate or intermediate levels of DIF, and the category of C corresponds to large levels of DIF. Table 8 LR Method DIF Classifications by Item Number. ETS Classification ETS Classification Category B Category C 10 2 0 16 1 0 47 2 0 Grand Total 5 0 Item # Note. The ETS classification category of A corresponds to small or no DIF, the category of B corresponds to moderate or intermediate levels of DIF, and the category of C corresponds to large levels of DIF. 62 63 The remaining 46 items which displayed small or no DIF on all 8 group comparisons were retained to create the new test score based only on items which did not have large or intermediate DIF detected by the LR method. The term LR Test will be used to describe the results of this study as they pertain to the combined set of these 46 original test items. The list of items on the LR Test can be found in Table 9. Table 9 Item Numbers Displaying Small or No DIF with the LR Method. Original Test Item Numbers of the LR Test 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 48, 49. The mean test score of all 29,171 applicants on the 46 item LR Test was 40.19 with a standard deviation of 5.28 and a reliability coefficient (Cronbach’s alpha) of .746. The individual item means and standard deviations can be found in Appendix A. The mean test scores and standard deviations by demographic group are presented in Table 10. 63 64 Table 10 Descriptive Statistics of LR Test Scores (N = 29,171). ___________________________________________________________________________ Characteristics M SD ___________________________________________________________________________ Gender Male 40.43 5.32 Female 39.66 5.17 Caucasian 41.96 4.26 Total Minority 38.38 5.61 American Indian/Alaskan Native 40.64 4.58 Asian 37.80 6.00 Hispanic 38.97 5.73 African American 38.08 5.59 Native Hawaiian/Pacific Islander 38.61 5.08 Two or More Races 40.26 4.72 Ethnicity ___________________________________________________________________________ Comparison of the MH and LR Methods for DIF Detection and Item Removal In order to compare DIF results between two differing detection methods, a single set of classification categories was applied to all of the DIF analysis results. All DIF assessments on items in this study were classified as displaying either large, intermediate, or small/no DIF. In the MH analyses, DIF classifications were determined using the absolute MH D-DIF value, a measure of the effect size of DIF, which is calculated by analyzing the 64 65 odds ratios of the groups assessed across the many ability levels and transforming them into a “delta” metric by taking the natural log of the odds ratio across the score levels and multiplying it by -2.35. The MH D-DIF value, which can range from negative infinity to positive infinity, was then used to categorize the items by the classification rules developed by ETS as laid out in a study by Hidalgo & López-Pina (2004). These classification rules indicated large DIF (ETS classification category C) when the absolute MH D-DIF values were greater than 1.5 and statistically significant, intermediate DIF (ETS classification category B) when the absolute MH D-DIF values were between 1.0 and 1.5 and statistically significant, and small or non-existent DIF (ETS classification category A) when the absolute MH D-DIF values were not significant or below 1.0. Although there is a process for creating a measure similar to the MH D-DIF value, LR D-DIF, where the odds ratios are assessed and transformed to a delta metric by multiplying the result by -2.35 (Monohan, et. al, 2007), the purpose of this study was to assess the two methods in their most commonly used manner. So, the more widely referenced classification criteria suggested by Jodoin and Gierl (2001) was used to assess large or intermediate DIF in the LR analyses. The classification of large DIF (equivalent to an ETS classification category C) was applied if a statistically significant change in Nagelkerke R2 values was greater than .070. If there was a change between Nagelkerke R2 values of between .035 and .070 and this change was statistically significant, the classification of intermediate DIF (equivalent to an ETS classification category B) was applied. A categorization of small or no DIF (equivalent to an ETS classification category 65 66 A) was applied to items that did not have a statistically significant difference in Nagelkerke R2 values or a difference of less than .035. Each of the 49 items on the original selection test was analyzed for DIF among the eight comparison groups. Thus, each item resulted in 8 DIF assessments performed by each detection method. The outcomes of these assessments were classified into one of the ETS classification categories. As a result, there were a total of 392 DIF assessments performed using each DIF detection method (784 DIF assessments overall). The term “assessments” will be used to refer to these 8 comparison group DIF assessments and the subsequent classifications performed with each method on all of the 49 items in the original test. The MH DIF detection method identified more assessments displaying DIF than did the LR method on the selection test used in this study. Of the 392 assessments of DIF performed using each method, the MH method classified 3.6% (14) as displaying a large amount of DIF, while the LR method classified no item as displaying a large amount of DIF. The MH method classified 9.4% (37) of the assessments as displaying intermediate DIF, while the LR method classified only 1.3% (5) assessments as displaying intermediate DIF (see Figure 4). 66 67 40 # of DIF Classifications 35 30 25 20 MH LR 15 10 5 0 Large DIF Intermediate DIF Figure 4. Display of DIF Assessments Classifications by DIF Detection Method. Of the 49 items assessed for DIF, the MH method identified 20 items as displaying either moderate or large levels of DIF on one or more assessments. The LR method identified only 3 items displaying moderate DIF on one or more assessments. Although the number of items identified by each method differed, there appeared to be a similar pattern in regard to the number of assessments identified as displaying DIF on particular items. The 3 items identified by the LR method, item numbers 10, 47 and 16, coincide with the test items identified by the MH method as displaying the largest number of moderate or large DIF assessments. Item 10 was identified as displaying the largest number of DIF assessments by the MH method (8) and 2 assessments by the LR method. Item 47 was identified as displaying the second largest number of DIF assessments by the MH method (6) and also 2 assessments by the LR method. Item 16 was identified as displaying the third largest number of DIF assessments by the MH method (5) and one assessment displaying DIF by the LR method. This may indicate that, 67 68 although the MH method of classification used in this study identifies more assessments as displaying DIF, the items identified would be in alignment with those also identified by the LR method if a more lenient classification system were used (see Table 11). When items were removed based on the two DIF detection methods, two new versions of the test were created, the MH Test and LR Test. Only items which displayed small or no DIF on all 8 group comparison assessments were retained to create the new test scores. This resulted in an MH Test comprised of 29 of the original 49 items and an LR Test comprised of 46 of the original 49 items. The test statistics of these two new tests were assessed and compared to the original test to determine if any significant changes occurred on the overall performance of the tests and applicant test scores as a result of the item removal. To ensure that the act of item removal when creating the MH and LR test did not greatly affect the average test score of applicants, the percent of mean test scores was calculated by dividing the mean test score of applicants by the total number of items on each test. This allowed for an assessment of applicant performance (total number of items correct divided by total number of items on the test) across the three tests. The percent of mean test scores on the MH Test (80.4%) and LR Test (82.2%) did not differ greatly from the average percentage score of the Original Test (82%). The reliability of the tests, however, did slightly decrease as the number of items on the test decreased (see Table 12). 68 69 Table 11 MH & LR DIF Classifications by Item Number. MH Assessments Displaying Large LR Assessments Displaying or Moderate DIF Large or Moderate DIF 10 8 2 47 6 2 16 5 1 34 4 36 4 35 3 19 3 20 3 3 2 7 2 26 2 11 1 13 1 4 1 18 1 31 1 38 1 43 1 45 1 46 1 Item # Grand Total 37 5 69 70 Table 12 Descriptive Statistics of the Original, MH and LR Test Scores (N = 29,171). ___________________________________________________________________________ Test # Test Items M SD r ___________________________________________________________________________ Original Test 49 40.19 5.28 .774 MH Test 29 23.31 3.45 .662 LR Test 46 37.81 4.84 .748 Note. Reliability (r) is reported by the Cronbach’s alpha measure. Adverse Impact Analyses Because this study was concerned with the effect of item removal based on DIF results on the incidence of adverse impact in a test, but did not seek to address the creation of cut-off points for selection tests, adverse impact analyses were performed for all possible cut-off points on all 3 tests. The counts of individuals who passed/failed in each comparison group at each possible cut-off score were used to perform adverse impact assessments; the number of passing individuals at each cut-off score level by test and comparison group can be found in Appendix D. The eight comparison groups assessed for adverse impact were the same comparison groups used for DIF detection analyses. These groups included one comparison made with respect to gender, Male v. Female, and seven comparisons made with respect to ethnicity: (1) Caucasian v. all other ethnic groups, labeled Total Minority, (2) Caucasian v. American Indian/Alaskan Native, (3) Caucasian v. Asian, (4) Caucasian v. Hispanic, (5) Caucasian v. African American, (6) Caucasian v. Native Hawaiian/Pacific Islander, and (7) Caucasian v. two or more races. Adverse impact 70 71 analyses included an assessment of whether or not there was an 80% rule violation on any of the eight comparison groups, whether or not there was a statistically significant difference on any of the eight comparison groups, and whether or not there was practical significance of any 80% rule violation or statistical significance adverse impact finding. Given the large number of applicants used in the study data (29,171), the results of the statistical significance tests were believed to contain too much alpha inflation for the results to be conclusive. Meyers & Hunley (2008) consider the situation of alpha inflation to be analogous to throwing darts at a dartboard: “We may be completely unskilled in this endeavor, but given enough darts and enough time we will eventually hit the center of the board. But based on that particular throw, we would be sadly mistaken to claim that the outcome was anything but sheer luck. In performing statistical analyses, the more related tests we perform the more likely it is that we will find “something significant,” but that particular outcome might not represent a reliably occurring effect." (Meyers & Hunley, 2008, p. 15). Of the 824 statistical significance tests performed on the three tests at all possible cut-off scores for the eight comparison groups, 523 (63.5%) were found to display statistically significant adverse impact. These findings were likely to be the result of alpha inflation, not true indicators of adverse impact. Appendix E displays the alpha level of each Fisher’s Exact Test performed on the three tests at all possible cut-off scores for the eight comparison groups. A comparison between the MH Test and LR Test in terms of statistically significant adverse impact assessments, whether practically significant or 71 72 not, was deemed inappropriate for the purposes of this study. Thus, only the results of the 80% rule assessments of adverse impact will be discussed in detail and utilized for comparison purposes. The term assessments will be used to explain adverse impact results performed at all possible cut off scores of a test for each group comparison. Original Test 80% Rule Adverse Impact Analyses In order to determine if the removal of items displaying DIF changed the amount of adverse impact displayed by a test, the applicant scores on the original 49 item test (Original Test) were analyzed to determine the level of adverse impact present in the unaltered test. The observed scores on the Original Test ranged from 10-49; thus adverse impact was assessed on the eight comparison groups at all 39 possible cut-off scores (312 assessments). To assess if violations to the 80% rule had occurred, thus indicating adverse impact, the passing rates of each group in a comparison were assessed to determine if the passing rate of one group was at least 80% the passing rate of the other group. If the passing rate of one group was not at least 80% the passing rate of the other group, a violation of the 80% rule was said to have occurred and was considered to be evidence of adverse impact. Among the 312 assessments, 91 were identified as having violated the 80% rule. The practical significance of any 80% rule violation was then assessed by the method of hypothetically changing the failing/passing status of two individuals, which was found to be the more commonly performed method for assessing practical significance of adverse impact in a legal setting. To perform the test of practical significance, the status of two individuals in the lower passing rate group was changed 72 73 from a failing status to a passing status and the 80% rule assessment was re-assessed. If there was still an 80% rule violation after the manipulation of two individuals occurred, then the 80% rule violation was considered to be practically significant. Eighty-nine of the 91 80% rule violations were found to be practically significant. Table 13 displays the number of 80% rule violations and practically significant 80% rule violations by comparison groups. Table 13 Number of 80% Rule Violations and Practically Significant 80% Rule Violations in the Original Test by Comparison Groups. _____________________________________________________________________________ Comparison Groups # of 80% Rule Violations # of Practically Significant 80% Rule Violations Male/Female 7 7 Total Minority/Caucasian 13 13 American Indian/Caucasian 8 7 Asian/Caucasian 15 15 Hispanic/Caucasian 12 12 African American/Caucasian 14 14 Hawaiian/Caucasian 13 12 Two or More/Caucasian 9 9 Grand Total 91 89 73 74 There were no findings of 80% rule violations in the lowest 24 cut-off score levels; cut-off scores between 11 and 34. There was at least one 80% rule violation and one practically significant 80% rule violation at every cut-off score at or above 35, a passing score of 71% or better on the 49 items. The number of group comparisons showing 80% rule violations at each cut-off score level increased as the cut-off score increased, such that the 7 highest cut-off scores on the test 43-48, which are equivalent to a passing score of 87.8% or higher, were shown to have 80% rule violations on all assessments performed (see Table 14). MH Test 80% Rule Adverse Impact Analyses The observed scores on the MH Test, which was comprised of the 29 items from the Original test which displayed small or no DIF on all 8 group comparisons when using the MH DIF detection method, ranged from 3-29; thus adverse impact was assessed on the 8 comparisons groups at all 26 possible cut-off scores. The passing rates of each group in a comparison were assessed to determine if any violations to the 80% rule had occurred. A violation was said to have occurred when the passing rate of one group was not at least 80% the passing rate of the other group. Among the 208 assessments (26 possible cut-off scores assessed on 8 comparison groupings), there were a total of 58 80% rule violations. To asses practical significance of an 80% rule violation the number of passing individuals in the lower passing rate group was increased by two individuals and the 80% rule assessment was re-assessed. If there was still an 80% rule violation after that manipulation of two individuals occurred, then the 80% rule violation was considered to be practically significant. Fifty-six of the 58 80% rule violations were also 74 75 found to be practically significant. Table 15 displays the number of 80% rule violations and practically significant 80% rule violations by comparison groups. Table 14 Number of 80% Rule violations and Practically Significant 80% Rule Violations in the Original Test by Cut-off Score Levels. Original Test # of Practically Significant Cut-off Score # of 80% Rule Violations 80% Rule Violations 35 1 1 36 2 2 37 4 4 38 5 5 39 5 5 40 5 5 41 6 6 42 7 7 43 8 8 44 8 8 45 8 8 46 8 8 47 8 8 48 8 8 49 8 6 91 89 Grand Total 75 76 Table 15 Number of 80% Rule violations and Statistically Significant 80% Rule Violations in the MH Test by Comparison Groups. _____________________________________________________________________________ Comparison Groups # of 80% Rule Violations # of Practically Significant 80% Rule Violations Male/Female 3 3 Total Minority/Caucasian 8 8 American Indian/Caucasian 5 3 Asian/Caucasian 10 10 Hispanic/Caucasian 8 8 African American/Caucasian 9 9 Hawaiian/Caucasian 9 9 Two or More/Caucasian 6 6 Grand Total 58 56 There were no findings of 80% rule violations in the lowest 16 cut-off score levels; cut-off scores between 4 and 19. There was at least one 80% rule violation and one practically significant 80% rule violation at every cut-off score at or above 20, a passing score of 69% or better on the 29 items. The number of group comparisons showing 80% rule violations at each cut-off score level increased as the cut-off score increased, such that the 3 highest cut-off scores on the test 27-29, which are equivalent to a passing score of 93.1% or higher, were shown to have 80% rule violations on all assessments performed (see Table 16). 76 77 Table 16 Number of 80% Rule violations and Statistically Significant 80% Rule Violations in the MH Test by Cut-off Score Levels. MH Test # of Practically Significant Cut-off Score # of 80% Rule Violations 80% Rule Violations 20 1 1 21 3 3 22 5 5 23 5 5 24 6 6 25 7 7 26 7 7 27 8 8 28 8 7 29 8 7 58 56 Grand Total LR Test 80% Rule Adverse Impact Analyses The observed scores on the LR Test, which was comprised of the 46 items from the Original test which displayed small or no DIF on all 8 group comparisons when using the LR DIF detection method, ranged from 8-46; thus adverse impact was assessed on the 8 comparisons groups at all 38 possible cut-off scores. The passing rates of each group in a comparison were assessed to determine if any violations to the 80% rule had occurred. A violation was said to have occurred when the passing rate of one group was not at least 77 78 80% the passing rate of the other group. Among the 304 assessments (38 possible cut-off scores assessed on 8 comparison groupings), there were a total of 81 80% rule violations. To assess practical significance of an 80% rule violation the number of passing individuals in the lower passing rate group was increased by two individuals and the 80% rule assessment was re-assessed. If there was still an 80% rule violation after that manipulation of two individuals occurred, then the 80% rule violation was considered to be practically significant. Seventy-eight of the 81 80% rule violations were also found to be practically significant. Table 17 displays the number of 80% rule violations and practically significant 80% rule violations by comparison groups. There were no findings of 80% rule violations in the lowest 25 cut-off score levels, cut-off scores between 10 and 33. There was at least one 80% rule violation and one practically significant 80% rule violation at every cut-off score at or above 34, a passing score of 73.9% or better on the 46 items. The number of group comparisons showing 80% rule violations at each cut-off score level increased as the cut-off score increased, such that the 5 highest cut-off scores on the test 42-46, which are equivalent to a passing score of 91.3% or higher, were shown to have 80% rule violations on all assessments performed (see Table 18). Comparison of 80% Rule Adverse Impact Analyses Because each of the three tests differed in the number of items on the test and therefore the number of cut-off scores at which adverse impact analyses could be performed, a simple comparison of the overall number of 80% rule violations and practically significant 80% rule violations was not appropriate. A longer test would 78 79 contain a larger number of cut-off scores and would therefore have a greater number of assessments performed that could potentially contain adverse impact than would a shorter test. To account for this, the total number of 80% rule violations and practically significant 80% rule violations were divided by the number of adverse impact analyses performed on each test; this created a percentage that could be used for making comparisons between all three tests. The following comparisons made in regards to 80% rule violations reference a percentage which is calculated by dividing the number of violations and practically significant violations by the total number of assessments performed on the test being described. Table 17 Number of 80% Rule violations and Statistically Significant 80% Rule Violations in the LR Test by Comparison Groups. _____________________________________________________________________________ Comparison Groups # of 80% Rule Violations # of Practically Significant 80% Rule Violations Male/Female 5 5 Total Minority/Caucasian 13 12 American Indian/Caucasian 8 6 Asian/Caucasian 14 13 Hispanic/Caucasian 12 11 African American/Caucasian 12 12 Hawaiian/Caucasian 13 11 Two or More/Caucasian 10 8 Grand Total 81 78 79 80 Table 18 Number of 80% Rule violations and Statistically Significant 80% Rule Violations in the LR Test by Cut-off Score Levels. LR Test Cut- # of Practically Significant off Score # of 80% Rule Violations 80% Rule Violations 34 1 1 35 4 4 36 5 5 37 5 5 38 6 5 39 6 6 40 7 7 41 7 7 42 8 8 43 8 8 44 8 8 45 8 8 46 8 6 81 78 Grand Total The Original Test contained the highest percentage of assessments with 80% rule violations (29.2%) and practically significant 80% rule violations (28.5%). The MH Test contained the next highest percentage of assessments with 80% rule violations (27.9%) and practically significant 80% rule violations (26.9%). The LR Test contained the lowest 80 81 percentage of assessments with 80% rule violations (26.6%) and practically significant 80% rule violations (25.7%). The percent of assessments found to violate the 80% rule and the percent of assessments found to be practically significant violations of the 80% rule are displayed graphically in Figure 5. Percent of Assessments 30 29 28 80% Rule Violations 27 26 Pract. Sig. 80% Rule Violations 25 24 23 Original Test MH Test LR Test Figure 5. Percentage of Adverse Impact Analyses Displaying 80% Rule Violations and Practically Significant 80% Rule Violations. The percentage of assessments with 80% rule violations were also analyzed by comparison groups to determine if the three versions of the test differed in the amount of adverse impact seen by comparison groups. However, all tests showed a similar pattern in terms of the percentage of assessments that were identified as 80% rule violations by comparison groups (see Figure 6). The lowest percentage of 80% rule violations by group occurred in the comparison between male and female applicants for all tests (1.4%2.2%). The highest percentage of 80% rule violations by group occurred in the 81 82 comparison between applicants self-identified as Asian or Pacific Islander and Caucasian 6 5 4 3 2 1 0 Original Test MH Test Two or More/Caucasian Hawaiian/Caucasian African American/Caucasian Hispanic/Caucasian Asian/Caucasian American Indian/Caucasian Total Minority/Caucasian LR Test Male/Female Percent of Assessments for all tests (4.3%-4.8%). Figure 6. Percentage of Adverse Impact Analyses Displaying 80% Rule Violations by Comparison Group and Test. All tests also showed a similar pattern in terms of the percentage of analyses that were identified as practically significant 80% rule violations by comparison groups (see Figure 7). The lowest percentage of practically significant 80% rule violations by group occurred in the comparison between male and female applicants for all tests (1.4%2.2%). The highest percentage of practically significant 80% rule violations by group occurred in the comparison between applicants self-identified as Asian or Pacific Islander and Caucasian for all tests (4.3%-4.9%). 82 6 5 4 Original Test 3 2 MH Test 1 LR Test Two or More/Caucasian Hawaiian/Caucasian African American/Caucasian Hispanic/Caucasian Asian/Caucasian American Indian/Caucasian Total Minority/Caucasian 0 Male/Female Percent of Assessments 83 Figure 7. Percentage of Adverse Impact Analyses Displaying Practically Significant 80% Rule Violations by Comparison Group and Test. For the sake of comparing cut-off score levels between the three tests which differ in the number of total items, each cut-off score level was divided by the total number of items on the test to create a percentage score of cut-off levels. By comparing percentage scores of cut-off levels the three tests can be assessed for the appearance of adverse impact at different difficulty levels. The term difficulty level in this instance refers only to the number of items an applicant must answer correctly to pass at that particular cutoff score divided by the total number of items on the test. The lowest percentage score cut-off level at which an 80% rule violation occurred was in the MH Test, which had a practically significant 80% rule violation at the percentage score cut-off level of 69%. The next lowest percentage score cut-off level at which an 80% rule violation occurred was in the Original Test, which had a practically significant 80% rule violation at the percentage score cut-off level of 71.4%. The first 83 84 occurrence of an 80% rule violation in the LR Test was at the percentage score cut-off level of 73.9%. The lowest percentage score cut-off level at which an 80% rule violation occurred on all group comparisons was in the Original Test, which had 8 practically significant 80% rule violations at the percentage score cut-off level of 87.8%. The next lowest percentage score cut-off level at which all group comparisons showed an 80% rule violation occurred in the LR Test, which had 8 practically significant 80% rule violations at the percentage score cut-off level of 91.3%. The lowest percentage cut of score level at which the MH Test violated the 80% rule on all group comparisons was at the percentage score cut-off level of 93.1%. All tests showed a similar pattern in terms of an increased number of 80% rule violations as percentage score cut-off levels increased (see Figure 8). 7 6 5 Original Test 4 MH Test 3 LR Test 2 1 100% 98-99% 96-97% 94-95% 92-93% 90-91% 88-89% 86-87% 84-85% 82-83% 80-81% 78-79% 76-77% 74-75% 72-73% 70-71% 0 68-69% # of 80% Rule Violations 8 % Score of Cut Off Level Figure 8. Number of 80% Rule Violations by % Cut-Off Score Level and Test. 84 85 All tests also showed a similar pattern in terms of an increased number of 8 7 6 5 Original Test 4 3 2 MH Test LR Test 100% 98-99% 96-97% 94-95% 92-93% 90-91% 88-89% 86-87% 84-85% 82-83% 80-81% 78-79% 76-77% 74-75% 72-73% 70-71% 1 0 68-69% # of Practially Significant 80% Rule Violations practically significant 80% rule violations as cut-off score levels increased (see Figure 9). % Score of Cut Off Level Figure 9. Number of Practically Significant 80% Rule Violations by % Cut-Off Score Level and Test. 85 86 Chapter 4 DISCUSSION Findings & Conclusions The results of this study indicate that the Mantel-Haenszel procedure of DIF detection identifies more instances of moderate or large DIF when using the classification rules developed by ETS as laid out in a study by Hidalgo & López-Pina (2004) than the logistic regression procedure of DIF detection when the classification of small, intermediate, and large levels of DIF was modeled after classification criteria suggested by Jodoin and Gierl (2001). This finding is consistent with Rogers and Swaminathan (1993) who also found the MH method to have higher detection rates for uniform DIF. Because the MH method detected more items displaying DIF, the use of this method for item removal resulted in fewer items that could be retained for a test comprised solely of items displaying only small or no DIF. The MH test created from the items displaying small or no DIF with the MH DIF detection method contained only 29 of the original 49 items, while the LR method included most (46) of the original 49 items. As would be expected, the reliability of the test was affected when the number of items on the test decreased with the removal of items. The reliability of the original test using the Cronbach’s alpha measure of reliability was .774; the reliability of the 46 items displaying small or no DIF by the LR method was .748 and the reliability of the 29 items displaying small or no DIF by the MH method was .662. This indicates that the use of the MH method for item removal may also lead to a decrease in test reliability. This may 86 87 indicate a potential weakness of the MH method, as it is used in this study, when the reliability of a test is of concern. Although the LR method is capable of assessing both uniform and non-uniform DIF simultaneously, there were no findings of non-uniform DIF detected by the LR method in this study. This suggests that in applied test assessment settings there may not always be a benefit to using the LR method for the sake of its ability to detect nonuniform DIF. The adverse impact results of this study show that a larger percentage of potential cut-off scores containing instances of 80% rule violations indicative of adverse impact were found in the original test (29.2%) and that this number is decreased with the removal of items displaying DIF through either the MH method (27.9%) or the LR method (26.6%). The results of the adverse impact analyses show a similar pattern in adverse impact findings by comparison groups; this suggests that the MH and LR method do not differ in their effect on particular comparison groups and that both methods reduce adverse impact findings in a similar manner across the eight comparison groups used in this study. The results also indicate that the occurrence of 80% rule violations indicative of adverse impact are more common as the difficulties of the cut-off score level increases. Difficulty in this instance refers to the number of items an applicant must answer correctly to pass at the particular cut-off. The difficulty of the cut-off score levels are determined by dividing the number of items an applicant must answer correctly to pass at the cut-off score by the total number of items on that test; higher percentage cut-off score 87 88 levels are considered to be more difficult. When the LR test was assessed, the appearance of adverse impact findings did not occur at any percentage cut-off score below 74%; this is much higher than the percentage cut-off score at which the original test (71%) or the MH test (69%) first displayed adverse impact. This may suggest that the use of the LR method for item removal would allow for more difficult cut-off score levels to be used without adverse impact than would the Original test. The MH method does not appear to allow for any more difficult cut-off score levels to be used without adverse impact than the Original test. To summarize, the results of this study indicate that the MH method, as it is applied in this study, identifies more items displaying potential DIF, but that the LR method, when used for item removal, will serve to better reduce possible adverse impact at slightly higher cut-off score levels and retain a higher test reliability level than will the MH method. Though these findings seem to indicate that the LR method should be used over the MH method, when performing DIF analyses in an applied setting there may be other factors involved. For example, in an applied setting the availability of resources may influence the decision to use one method over another. According to Rogers & Swaminathan (1993) the LR procedure appears to be three to four times more expensive than the MH procedure. This study also found that to be true when time is considered the “expense” of the procedure; the amount of time taken to apply the LR method in this study was approximately three times the amount of taken to apply the MH method for DIF detection analyses. Also, while many people may be at least conceptually comfortable with chi square statistics and the basic premises underlying the MH method, 88 89 the analytical skills required to understand the assumptions and interpret the output of the LR method may be a resource not readily available in many applied settings. Given that the reduction of adverse impact when the LR method is used only occurs in a couple out of many possible cut-off score levels, this study does not conclusively support the necessity of using the LR method for DIF detection and item removal over the use of the MH method in an applied setting. Limitations The MH method has been found to both over and under-estimate DIF according to several factors such as the matching variable (Holland & Thayer, 1988), guessing (Camilli & Penfield, 1997), and a lack of sufficient statistics for matching (Zwick, 1990). However, the selection test used in this study required a certain degree of confidentiality to ensure test security and anonymity. As such, the necessary information to address any of these factors was unavailable in the current study. Because the items could not be assessed for multi-dimensionality an assumption was made that the total test score was being appropriately used and assessed as a uni-dimensional measure for ability. The nature of this study did not allow for decisions to be made about the appropriate cut-off score for this particular selection test as it was intended for use with the employment organization. As a result the findings of this study do not revolve around a single cut-off score that can be assessed to determine simply whether or not adverse impact ceased to exist with the removal of items displaying DIF. While this provides for the assessment of many possible cut off scores, it is a potential limitation given that the assessment of adverse impact in a test is generally undertaken after considerable thought 89 90 has been given to selecting a single cut-off score that is appropriate for the purposes that the test has been designed to meet. Biddle (2006) warns that there is no firm set of rules for removing items based on DIF analyses, and that the practice should be approached with caution. A limitation of this study is that the method for removing items was based solely on the removal of all items which displayed even moderate DIF. There may be any number of reasons why an item displaying DIF may be essential to the purposes of a test and therefore necessary to retain. In an applied setting item removal should only occur after a thorough consideration of all item aspects as they relate to the purposes of the test has occurred. “If an item in a test displays DIF, one should try to find the source of the DIF, because it is not necessarily a bad item. An item might display DIF if it has a different item format than the rest of the items in the test (Longford et al., 1993). Another possibility is that the item measures an ability different from the one measured in the test or reflects that two groups have learned something with different pedagogical methods, hence making an item easier for one of the groups (Camilli, 2006). If it really is an item that favors one group, conditional on the ability, there are some strategies that one can apply. The most common ones are a) rewrite the item b) remove the item c) control for the underlying differences using an IRT model for scoring respondents. If however the item is kept in the test the test constructor should have a reason for that decision.” (Wiberg, 2007, p. 32). 90 91 Implications for Future Studies Rogers & Swaminathan (1993) found that detection rates for DIF differed depending on whether the item had a low, moderate, or high level of difficulty. Item difficulty was not addressed in this study; future studies comparing DIF detection methods may benefit by item difficulty analyses to determine if any differences found between the MH and LR methods may be a result of item difficulty levels within the test. Recently some discussion has begun concerning the use of practical significance assessments of adverse impact findings. Biddle (2010) suggests that employers should “tread carefully” in regards to practical significance of adverse impact findings, and that “hard-and-fast” practical significance rules should not be applied when analyzing adverse impact. Future research may be needed on the appropriateness of practical significance tests of adverse impact before making strong assumptions about the ability of particular DIF detection methods for item removal to reduce practically significant adverse impact findings. While the MH method used in this study did identify more assessments displaying DIF than the LR method, the items identified by both methods appeared to be in alignment (e.g., all items identified as displaying DIF with the LR method were also identified by the MH method). This indicates that the application of either a more lenient classification system with the LR method or a more stringent classification system with the MH method may have resulted in similar DIF identification results. Future research should be undertaken to compare various classification systems of DIF with these methods; there may be reason to believe that the results of this study which indicate that 91 92 the MH method identifies more assessments displaying DIF may be attributed to the classification system used rather than the methodology of the processes themselves. 92 93 APPENDICES 93 94 APPENDIX A Item Means and Standard Deviations Table A1 ____________________________________________________ Item # Mean SD N ____________________________________________________ Item 1 Item 2 Item 3 Item 4 Item 5 Item 6 Item 7 Item 8 Item 9 Item 10 Item 11 Item 12 Item 13 Item 14 Item 15 Item 16 Item 17 Item 18 Item 19 Item 20 Item 21 Item 22 Item 23 Item 24 Item 25 Item 26 Item 27 Item 28 Item 29 Item 30 Item 31 Item 32 Item 33 Item 34 0.8853 0.8272 0.9317 0.8605 0.7383 0.8545 0.9384 0.6981 0.5928 0.7942 0.6042 0.7183 0.8638 0.9802 0.6232 0.9530 0.7453 0.8570 0.7669 0.7629 0.7611 0.8804 0.8255 0.8307 0.7282 0.7687 0.9351 0.8899 0.9037 0.8611 0.7866 0.8625 0.6370 0.8581 0.3186 0.3781 0.2522 0.3464 0.4395 0.3526 0.2405 0.4591 0.4913 0.4043 0.4890 0.4498 0.3430 0.1394 0.4846 0.2116 0.4357 0.3500 0.4228 0.4253 0.4264 0.3245 0.3796 0.3750 0.4449 0.4217 0.2464 0.3130 0.2950 0.3459 0.4097 0.3444 0.4809 0.3490 29171 29169 29171 29170 29171 29170 29170 29170 29171 29169 29171 29170 29171 29170 29171 29170 29170 29169 29168 29154 29167 29167 29167 29167 29166 29167 29166 29167 29167 29165 29167 29166 29164 29164 94 95 ____________________________________________________ Item # Mean SD N ____________________________________________________ Item 35 Item 36 Item 37 Item 38 Item 39 Item 40 Item 41 Item 42 Item 43 Item 44 Item 45 Item 46 Item 47 Item 48 Item 49 0.8648 0.9184 0.8345 0.9101 0.8821 0.9260 0.6981 0.7538 0.9335 0.7665 0.9183 0.9503 0.6396 0.9298 0.7498 0.3420 0.2738 0.3716 0.2861 0.3225 0.2618 0.4591 0.4308 0.2491 0.4231 0.2739 0.2174 0.4801 0.2555 0.4331 29166 29167 29166 29163 29166 29147 29158 29158 29159 29158 29171 29157 29159 29159 29158 _____________________________________________________ 95 96 APPENDIX B MH DIF Values and Classification Level by Item Table A2 Item # 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 Comparison Group Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female MH CHI 7.6329 0.5673 2.8002 0.1426 6.5140 1.2694 4.4911 5.7022 0.6360 1.4704 1.9063 0.0000 17.4479 0.0564 6.5363 0.0312 19.7283 2.0049 9.3948 1.3957 0.3799 0.0014 190.3507 7.6473 2.5397 1.0546 0.0002 1.4825 0.9996 4.0262 70.8068 0.8472 10.8096 2.1012 87.2016 0.8305 1.5649 0.0009 0.6779 MH LOR LOR SE LOR Z 0.3348 0.1201 2.7877 -0.1161 0.1414 -0.8211 0.0728 0.0431 1.6891 -0.1177 0.2387 -0.4931 -0.1953 0.0760 -2.5697 0.2760 0.2237 1.2338 -0.0898 0.0421 -2.1330 0.1103 0.0460 2.3978 0.0940 0.1094 0.8592 0.1332 0.1045 1.2746 -0.0484 0.0345 -1.4029 -0.0167 0.1783 -0.0937 -0.2658 0.0630 -4.2190 0.0674 0.1992 0.3384 -0.0886 0.0345 -2.5681 -0.0073 0.0375 -0.1947 0.6076 0.1365 4.4513 0.2326 0.1559 1.4920 0.1631 0.0531 3.0716 -0.4603 0.3486 -1.3204 0.0589 0.0892 0.6603 -0.0378 0.3187 -0.1186 -0.8136 0.0601 -13.5374 0.1583 0.0570 2.7772 0.1900 0.1148 1.6551 0.1207 0.1112 1.0854 -0.0011 0.0366 -0.0301 0.2259 0.1744 1.2953 0.0638 0.0621 1.0274 -0.5756 0.2725 -2.1123 -0.3232 0.0385 -8.3948 -0.0380 0.0405 -0.9383 0.3417 0.1021 3.3467 0.1468 0.0987 1.4873 0.2933 0.0314 9.3408 0.1603 0.1625 0.9865 0.0712 0.0555 1.2829 -0.0237 0.1897 -0.1249 0.0265 0.0315 0.8413 96 BD 1.653 0.008 5.269 0.000 0.621 0.002 3.046 6.108 1.067 0.850 4.916 0.060 3.545 0.084 4.160 3.096 1.033 0.074 2.516 0.013 0.056 0.293 3.719 3.349 0.856 0.193 0.003 0.343 0.732 0.345 0.087 0.000 0.382 0.519 3.331 0.069 1.451 0.143 4.840 ETS CDR Classification Flag A OK A Flag A OK A Flag A OK A OK A Flag A OK A OK A OK A OK A Flag A OK A Flag A OK A Flag B OK A Flag A OK A OK A OK A Flag C Flag A OK A OK A OK A OK A OK A OK B Flag A OK A Flag A OK A Flag A OK A OK A OK A OK A 97 Item # 5 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 9 9 9 9 9 9 9 9 10 10 10 10 10 10 10 10 11 11 11 11 Comparison Group African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian MH CHI 110.3216 1.2442 0.1187 0.0538 2.2497 2.7840 0.0128 9.8320 0.8664 1.2969 1.2174 73.2346 0.5304 18.6170 2.1867 33.4078 72.5511 3.8260 3.3330 80.3454 6.0626 2.0086 0.2106 13.0895 94.0084 13.1920 0.9109 3.4921 0.5859 6.1157 0.8164 24.2685 12.7124 20.3974 43.1554 520.3883 8.6821 97.8521 10.3165 402.6554 550.1581 33.6978 0.8928 39.5689 0.8207 MH LOR LOR SE LOR Z 0.3541 0.0338 10.4763 0.1366 0.1163 1.1745 -0.0494 0.1222 -0.4043 -0.0097 0.0386 -0.2513 -0.3631 0.2273 -1.5974 0.1068 0.0631 1.6926 0.0496 0.2191 0.2264 -0.1204 0.0383 -3.1436 -0.0398 0.0419 -0.9499 -0.1905 0.1583 -1.2034 -0.2015 0.1702 -1.1839 -0.4744 0.0557 -8.5171 -0.2480 0.2878 -0.8617 -0.4151 0.0956 -4.3421 -0.5456 0.3402 -1.6038 -0.3324 0.0572 -5.8112 -0.5188 0.0611 -8.4910 -0.1940 0.0965 -2.0104 -0.1701 0.0907 -1.8754 -0.2567 0.0286 -8.9755 -0.3947 0.1564 -2.5237 -0.0710 0.0491 -1.4460 0.0917 0.1690 0.5426 0.1036 0.0285 3.6351 -0.3068 0.0317 -9.6782 0.3279 0.0891 3.6801 -0.0818 0.0825 -0.9915 0.0490 0.0260 1.8846 0.1106 0.1329 0.8322 -0.1159 0.0463 -2.5032 -0.1658 0.1670 -0.9928 0.1295 0.0262 4.9427 0.1033 0.0288 3.5868 0.5226 0.1136 4.6004 0.6822 0.1059 6.4419 0.8192 0.0365 22.4438 0.5623 0.1844 3.0493 0.5755 0.0593 9.7049 0.6468 0.1982 3.2634 0.6711 0.0339 19.7965 0.8905 0.0386 23.0699 -0.5444 0.0944 -5.7669 -0.0805 0.0820 -0.9817 -0.1641 0.0261 -6.2874 -0.1312 0.1350 -0.9719 97 BD 9.624 0.083 0.003 0.030 0.022 0.946 0.204 0.231 0.201 0.088 0.168 7.774 0.066 3.228 0.010 6.063 7.896 1.906 0.263 3.586 0.003 3.031 0.003 10.736 3.319 1.476 0.051 27.419 0.149 6.336 0.882 4.251 26.972 1.277 2.550 8.776 0.150 25.338 0.010 27.728 9.240 0.460 0.235 0.746 0.004 ETS CDR Classification Flag A OK A OK A OK A OK A OK A OK A Flag A OK A OK A OK A Flag B OK A Flag A OK A Flag A Flag B OK A OK A Flag A Flag A OK A OK A Flag A Flag A Flag A OK A Flag A OK A Flag A OK A Flag A Flag A Flag B Flag C Flag C Flag B Flag B Flag B Flag C Flag C Flag B OK A Flag A OK A 98 Item # 11 11 11 11 12 12 12 12 12 12 12 12 13 13 13 13 13 13 13 13 14 14 14 14 14 14 14 14 15 15 15 15 15 15 15 15 16 16 16 16 16 16 16 16 17 Comparison Group Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian MH CHI 1.7030 4.8206 2.4494 40.1383 13.2069 0.5121 20.9173 2.9070 5.6643 2.3182 70.2999 15.0035 6.7945 5.5269 82.1253 0.2960 5.6447 5.7626 41.2868 90.5703 0.8288 0.0156 4.6217 0.2686 0.2232 0.2630 11.4512 10.6201 0.0356 0.0001 16.5709 4.4212 23.7308 0.1685 26.6199 6.2316 75.6019 0.2711 49.7746 0.0164 24.9067 15.1942 5.9341 32.6918 1.8446 MH LOR LOR SE LOR Z -0.0602 0.0454 -1.3260 -0.3855 0.1693 -2.2770 -0.0416 0.0264 -1.5758 -0.1837 0.0290 -6.3345 -0.3741 0.1020 -3.6676 0.0696 0.0907 0.7674 -0.1340 0.0292 -4.5890 -0.2794 0.1577 -1.7717 -0.1218 0.0507 -2.4024 -0.2980 0.1847 -1.6134 -0.2511 0.0299 -8.3980 -0.1246 0.0321 -3.8816 -0.3315 0.1256 -2.6393 -0.3100 0.1280 -2.4219 -0.3549 0.0391 -9.0767 -0.1290 0.2002 -0.6444 -0.1550 0.0647 -2.3957 -0.6328 0.2574 -2.4584 -0.2530 0.0394 -6.4213 -0.4141 0.0434 -9.5415 0.2460 0.2346 1.0486 0.0829 0.2986 0.2776 -0.2238 0.1025 -2.1834 -0.5561 0.7137 -0.7792 0.0817 0.1487 0.5494 0.3332 0.4389 0.7592 -0.3457 0.1011 -3.4194 -0.3631 0.1116 -3.2536 -0.0221 0.0944 -0.2341 0.0032 0.0875 0.0366 0.1127 0.0276 4.0833 0.3018 0.1402 2.1526 0.2304 0.0474 4.8608 0.0856 0.1722 0.4971 -0.1465 0.0283 -5.1767 0.0767 0.0305 2.5148 1.2681 0.1595 7.9505 0.1588 0.2456 0.6466 0.5602 0.0797 7.0289 -0.1859 0.5060 -0.3674 0.5695 0.1150 4.9522 1.1208 0.2985 3.7548 -0.1663 0.0673 -2.4710 0.4797 0.0837 5.7312 -0.1400 0.1005 -1.3930 98 BD 0.234 0.246 0.090 0.009 0.002 0.652 2.973 0.004 0.121 0.090 2.383 6.036 0.010 0.347 2.046 0.020 0.285 0.004 1.993 3.766 0.223 0.056 0.635 0.025 0.780 0.000 0.033 0.551 1.139 0.468 1.789 0.144 13.767 0.006 1.228 0.012 26.031 0.154 0.064 0.147 0.365 1.226 0.020 0.002 2.283 ETS CDR Classification OK A OK A OK A Flag A Flag A OK A Flag A OK A Flag A OK A Flag A Flag A Flag A Flag A Flag A OK A Flag A Flag B Flag A Flag A OK A OK A OK A OK A OK A OK A Flag A Flag A OK A OK A Flag A OK A Flag A OK A Flag A Flag A Flag C OK A Flag B OK A Flag B Flag C Flag A Flag B OK A 99 Item # 17 17 17 17 17 17 17 18 18 18 18 18 18 18 18 19 19 19 19 19 19 19 19 20 20 20 20 20 20 20 20 21 21 21 21 21 21 21 21 22 22 22 22 22 22 Comparison Group Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian MH CHI 0.3673 0.3728 1.3236 4.2266 0.1160 5.9819 0.0022 0.6763 2.4090 1.1207 0.0577 0.9921 4.5873 4.5800 1.8145 0.2904 13.2235 203.0315 0.0058 26.1497 7.4117 107.9905 240.7583 11.0258 51.0234 317.0217 1.2289 42.2156 0.1043 34.2901 315.3005 8.7910 5.7641 35.1764 1.6152 0.0522 0.0003 33.6385 54.3611 5.0090 1.9655 21.5248 1.6389 1.1242 0.9495 MH LOR LOR SE LOR Z 0.0610 0.0935 0.6524 -0.0187 0.0300 -0.6233 -0.2002 0.1633 -1.2260 -0.1093 0.0527 -2.0740 0.0778 0.1796 0.4332 -0.0745 0.0303 -2.4587 -0.0021 0.0327 -0.0642 0.1012 0.1156 0.8754 0.1847 0.1141 1.6188 -0.0414 0.0384 -1.0781 0.0655 0.1931 0.3392 -0.0673 0.0655 -1.0275 -0.5613 0.2569 -2.1849 0.0809 0.0374 2.1631 -0.0573 0.0418 -1.3708 0.0639 0.1072 0.5961 0.3522 0.0958 3.6764 0.4470 0.0314 14.2357 0.0273 0.1694 0.1612 0.2768 0.0538 5.1450 0.5021 0.1788 2.8082 0.3181 0.0306 10.3954 0.5238 0.0338 15.4970 -0.3447 0.1026 -3.3596 -0.7623 0.1087 -7.0129 -0.5439 0.0307 -17.7166 -0.1800 0.1529 -1.1772 -0.3438 0.0528 -6.5114 -0.0746 0.1795 -0.4156 -0.1810 0.0308 -5.8766 -0.6080 0.0344 -17.6744 -0.3355 0.1124 -2.9849 0.2248 0.0919 2.4461 0.1781 0.0300 5.9367 0.2040 0.1510 1.3510 -0.0135 0.0533 -0.2533 -0.0149 0.1899 -0.0785 0.1730 0.0298 5.8054 0.2419 0.0327 7.3976 -0.3111 0.1351 -2.3027 -0.1884 0.1305 -1.4437 -0.1900 0.0410 -4.6341 -0.3096 0.2282 -1.3567 -0.0737 0.0679 -1.0854 -0.2596 0.2424 -1.0710 99 BD 0.182 11.608 0.000 2.295 0.017 0.191 14.437 1.351 0.228 0.000 0.046 0.260 0.062 0.466 0.062 0.821 0.038 1.563 0.107 0.246 0.070 2.313 0.721 0.282 0.107 1.036 0.059 0.319 0.227 18.626 1.493 0.220 0.124 1.357 0.001 8.291 0.002 0.129 0.275 0.025 1.036 1.452 0.137 2.137 0.138 ETS CDR Classification OK A Flag A OK A OK A OK A Flag A Flag A OK A OK A OK A OK A OK A OK B OK A OK A OK A Flag A Flag B OK A Flag A Flag B Flag A Flag B Flag A Flag C Flag B OK A Flag A OK A Flag A Flag B Flag A Flag A Flag A OK A Flag A OK A Flag A Flag A OK A OK A Flag A OK A OK A OK A 100 Item # 22 22 23 23 23 23 23 23 23 23 24 24 24 24 24 24 24 24 25 25 25 25 25 25 25 25 26 26 26 26 26 26 26 26 27 27 27 27 27 27 27 27 28 28 28 Comparison Group Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian MH CHI 5.2473 26.9158 4.0924 8.8836 59.4804 0.1056 0.0368 0.3809 4.0119 83.5781 2.5749 2.8039 36.9827 0.1845 0.0299 0.3525 1.3996 49.4233 11.2522 0.4047 1.6605 1.7818 0.5255 3.5609 27.5269 0.0883 1.7256 0.5134 131.8035 3.1769 6.4987 7.6384 176.5991 172.9213 2.4068 0.0290 29.6966 1.5732 4.1932 0.0083 1.0636 31.0061 9.3194 0.0722 43.6632 MH LOR LOR SE LOR Z -0.0946 0.0410 -2.3073 -0.2331 0.0450 -5.1800 0.2297 0.1109 2.0712 0.3196 0.1049 3.0467 0.2696 0.0348 7.7471 0.0780 0.1852 0.4212 0.0138 0.0618 0.2233 0.1523 0.2077 0.7333 -0.0700 0.0346 -2.0231 0.3428 0.0375 9.1413 -0.1945 0.1159 -1.6782 -0.1969 0.1137 -1.7318 -0.2154 0.0354 -6.0847 -0.0949 0.1832 -0.5180 -0.0119 0.0586 -0.2031 0.1308 0.1914 0.6834 -0.0423 0.0351 -1.2051 -0.2745 0.0390 -7.0385 0.3163 0.0932 3.3938 0.0607 0.0895 0.6782 0.0377 0.0288 1.3090 0.2015 0.1430 1.4091 0.0374 0.0499 0.7495 0.3284 0.1676 1.9594 -0.1545 0.0294 -5.2551 0.0101 0.0320 0.3156 0.1497 0.1084 1.3810 0.0831 0.1072 0.7752 0.3792 0.0331 11.4562 0.3338 0.1735 1.9239 0.1482 0.0575 2.5774 0.5081 0.1823 2.7872 0.4252 0.0322 13.2050 0.4610 0.0353 13.0595 0.2595 0.1598 1.6239 0.0486 0.1842 0.2638 0.3119 0.0574 5.4338 0.3847 0.2728 1.4102 0.1945 0.0934 2.0824 -0.0229 0.3222 -0.0711 -0.0569 0.0539 -1.0557 0.3386 0.0609 5.5599 0.3790 0.1257 3.0151 0.0487 0.1438 0.3387 0.3011 0.0454 6.6322 100 BD 0.279 0.137 0.787 1.124 0.071 0.054 0.011 0.348 2.130 0.637 0.718 0.071 2.553 0.003 1.022 0.418 6.145 2.816 0.027 1.030 2.983 0.003 0.621 0.039 3.998 5.220 0.912 0.392 3.381 1.982 0.081 0.530 9.821 14.259 0.985 0.199 1.436 0.007 0.346 0.322 1.796 1.145 13.444 0.347 0.645 ETS CDR Classification Flag A Flag A OK A Flag A Flag A OK A OK A OK A OK A Flag A OK A OK A Flag A OK A OK A OK A Flag A Flag A Flag A OK A OK A OK A OK A OK A Flag A Flag A OK A OK A Flag A OK A Flag A Flag B Flag A Flag B OK A OK A Flag A OK A OK A OK A OK A Flag A Flag A OK A Flag A 101 Item # 28 28 28 28 28 29 29 29 29 29 29 29 29 30 30 30 30 30 30 30 30 31 31 31 31 31 31 31 31 32 32 32 32 32 32 32 32 33 33 33 33 33 33 33 33 Comparison Group American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian MH CHI 2.2645 17.0406 0.1172 4.8695 38.6236 0.0032 0.1313 0.3482 0.0379 0.3263 0.2454 1.0554 0.2839 0.5293 1.2819 12.9435 0.0033 0.0912 0.0318 42.8949 15.1075 6.5747 9.0356 7.5341 0.2862 9.2041 4.4873 85.4549 33.9164 2.3701 0.2031 28.0935 0.0026 3.0174 0.0106 6.3864 34.1811 10.8524 2.6715 65.9624 0.1460 1.7003 0.0072 0.4773 82.6203 MH LOR LOR SE LOR Z 0.3518 0.2186 1.6093 0.3047 0.0734 4.1512 0.1164 0.2495 0.4665 0.0954 0.0428 2.2290 0.3037 0.0487 6.2361 0.0173 0.1376 0.1257 -0.0644 0.1471 -0.4378 -0.0280 0.0458 -0.6114 -0.0770 0.2449 -0.3144 -0.0466 0.0769 -0.6060 -0.1705 0.2720 -0.6268 -0.0472 0.0449 -1.0512 -0.0275 0.0495 -0.5556 -0.0972 0.1235 -0.7870 -0.1519 0.1266 -1.1998 -0.1422 0.0393 -3.6183 -0.0324 0.2035 -0.1592 -0.0221 0.0657 -0.3364 -0.0644 0.2239 -0.2876 0.2478 0.0378 6.5556 -0.1675 0.0429 -3.9044 -0.3019 0.1159 -2.6048 -0.3524 0.1163 -3.0301 0.0932 0.0337 2.7656 0.1108 0.1753 0.6321 -0.1858 0.0606 -3.0660 -0.4522 0.2116 -2.1371 0.3031 0.0327 9.2691 0.2106 0.0360 5.8500 0.1930 0.1205 1.6017 0.0621 0.1211 0.5128 0.2021 0.0379 5.3325 0.0102 0.2022 0.0504 0.1155 0.0652 1.7715 0.0034 0.2315 0.0147 -0.0963 0.0378 -2.5476 0.2415 0.0410 5.8902 -0.3133 0.0940 -3.3330 -0.1445 0.0858 -1.6841 -0.2202 0.0271 -8.1255 0.0602 0.1353 0.4449 -0.0620 0.0467 -1.3276 -0.0282 0.1674 -0.1685 0.0191 0.0271 0.7048 -0.2739 0.0301 -9.0997 101 BD 0.029 0.772 0.389 1.085 0.152 0.075 0.530 0.003 0.022 1.465 0.000 2.234 0.180 0.030 0.236 0.896 0.019 1.461 0.038 0.315 0.800 1.187 0.040 1.808 0.170 0.835 0.020 1.752 0.647 0.022 0.122 1.260 0.010 0.005 0.433 0.708 1.240 0.241 0.088 0.603 0.583 0.142 0.039 0.069 2.387 ETS CDR Classification OK A Flag A OK A OK A Flag A OK A OK A OK A OK A OK A OK A OK A OK A OK A OK A Flag A OK A OK A OK A Flag A Flag A Flag A Flag A Flag A OK A Flag A OK B Flag A Flag A OK A OK A Flag A OK A OK A OK A Flag A Flag A Flag A OK A Flag A OK A OK A OK A OK A Flag A 102 Item # 34 34 34 34 34 34 34 34 35 35 35 35 35 35 35 35 36 36 36 36 36 36 36 36 37 37 37 37 37 37 37 37 38 38 38 38 38 38 38 38 39 39 39 39 39 Comparison Group Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian MH CHI 10.9106 0.8379 151.5159 6.7812 14.3402 12.9722 3.5027 171.4637 2.4891 8.9142 180.2551 1.8022 21.4304 0.0531 120.0345 204.9177 2.3624 2.6072 68.1963 4.9203 45.6909 3.9345 6.0288 50.3235 0.8955 0.0184 62.8395 1.7043 12.9854 0.2553 15.7077 68.1385 6.9443 0.1396 13.8043 0.8858 7.4637 1.5254 8.2626 12.1489 0.8666 2.2702 1.0193 3.5026 0.0290 MH LOR LOR SE LOR Z -0.4115 0.1245 -3.3052 -0.1089 0.1126 -0.9671 -0.4607 0.0377 -12.2202 -0.5744 0.2185 -2.6288 -0.2430 0.0636 -3.8208 -0.9980 0.2834 -3.5215 -0.0707 0.0375 -1.8853 -0.5471 0.0421 -12.9952 -0.1964 0.1202 -1.6339 -0.3827 0.1269 -3.0158 -0.5232 0.0393 -13.3130 -0.2985 0.2075 -1.4386 -0.3059 0.0659 -4.6419 -0.0720 0.2134 -0.3374 -0.4482 0.0410 -10.9317 -0.6245 0.0440 -14.1932 0.2408 0.1505 1.6000 0.2689 0.1605 1.6754 0.4346 0.0531 8.1846 0.5574 0.2430 2.2938 0.5312 0.0805 6.5988 0.5554 0.2594 2.1411 0.1185 0.0480 2.4688 0.3953 0.0564 7.0089 -0.1131 0.1124 -1.0062 -0.0201 0.1068 -0.1882 -0.2811 0.0354 -7.9407 -0.2619 0.1896 -1.3813 -0.2230 0.0612 -3.6438 -0.1222 0.2036 -0.6002 -0.1412 0.0355 -3.9775 -0.3221 0.0390 -8.2590 -0.4354 0.1629 -2.6728 -0.0625 0.1409 -0.4436 -0.1669 0.0449 -3.7171 0.2224 0.2101 1.0585 -0.2182 0.0794 -2.7481 -0.4125 0.3063 -1.3467 -0.1335 0.0460 -2.9022 -0.1727 0.0494 -3.4960 0.1240 0.1240 1.0000 0.1918 0.1232 1.5568 -0.0435 0.0424 -1.0259 -0.5150 0.2600 -1.9808 0.0145 0.0704 0.2060 102 BD 1.105 0.106 0.073 0.050 0.460 0.021 0.000 0.865 0.577 0.008 2.204 0.012 0.263 0.476 1.094 3.119 1.069 0.594 31.930 0.227 25.894 0.064 13.908 36.408 0.310 0.120 0.472 0.007 0.438 0.036 0.205 0.669 0.448 0.016 1.657 0.156 1.931 0.380 0.105 0.465 0.678 0.558 0.195 0.001 2.019 ETS CDR Classification Flag A OK A Flag B Flag B Flag A Flag C OK A Flag B OK A Flag A Flag B OK A Flag A OK A Flag B Flag B OK A OK A Flag B OK B Flag B OK B Flag A Flag A OK A OK A Flag A OK A Flag A OK A Flag A Flag A Flag B OK A Flag A OK A Flag A OK A Flag A Flag A OK A OK A OK A OK A OK A 103 Item # 39 39 39 40 40 40 40 40 40 40 40 41 41 41 41 41 41 41 41 42 42 42 42 42 42 42 42 43 43 43 43 43 43 43 43 44 44 44 44 44 44 44 44 45 45 Comparison Group Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian MH CHI 1.2014 2.5388 2.5582 3.6565 4.9895 47.5060 0.0153 5.2155 0.3737 11.0548 48.5086 7.8076 0.2801 45.3010 0.0248 0.9660 0.0306 35.0741 62.1709 0.1099 2.1091 109.5007 1.9700 10.6218 0.1329 5.0608 143.5730 4.0190 0.1200 15.1501 3.2843 19.4826 17.0158 0.0000 6.3793 1.2638 2.3017 0.7990 0.0048 2.5731 1.0851 14.5052 0.4279 11.1270 0.0081 MH LOR LOR SE LOR Z 0.2545 0.2138 1.1904 -0.0667 0.0414 -1.6111 -0.0738 0.0457 -1.6149 -0.3135 0.1586 -1.9767 -0.3968 0.1734 -2.2884 -0.3550 0.0516 -6.8798 -0.0004 0.2545 -0.0016 -0.1996 0.0853 -2.3400 -0.2218 0.2922 -0.7591 -0.1719 0.0514 -3.3444 -0.3916 0.0564 -6.9433 -0.2802 0.0983 -2.8505 0.0516 0.0895 0.5765 -0.1941 0.0289 -6.7163 0.0351 0.1494 0.2349 -0.0496 0.0494 -1.0040 -0.0435 0.1701 -0.2557 -0.1739 0.0293 -5.9352 -0.2514 0.0320 -7.8562 0.0373 0.0979 0.3810 0.1369 0.0919 1.4897 -0.3271 0.0313 -10.4505 -0.2424 0.1636 -1.4817 -0.1765 0.0534 -3.3052 -0.0844 0.1838 -0.4592 0.0693 0.0307 2.2573 -0.4168 0.0348 -11.9770 0.3123 0.1522 2.0519 0.0760 0.1761 0.4316 0.2198 0.0563 3.9041 0.4958 0.2562 1.9352 0.3842 0.0867 4.4314 0.9480 0.2335 4.0600 0.0017 0.0531 0.0320 0.1537 0.0605 2.5405 -0.1272 0.1070 -1.1888 -0.1677 0.1054 -1.5911 -0.0291 0.0319 -0.9122 0.0023 0.1662 0.0138 -0.0916 0.0558 -1.6416 -0.2278 0.1990 -1.1447 0.1207 0.0315 3.8317 0.0232 0.0345 0.6725 -0.5937 0.1755 -3.3829 0.0253 0.1517 0.1668 103 BD 0.284 0.666 0.129 0.009 0.037 1.446 0.147 3.702 0.089 0.246 0.687 0.005 0.003 1.219 0.206 0.440 0.221 3.358 4.385 0.640 0.222 22.404 0.002 6.178 0.060 3.623 28.597 1.336 0.073 4.388 0.007 0.539 0.188 0.060 6.336 2.543 1.510 5.089 0.068 4.114 0.198 1.305 1.143 0.425 0.306 ETS CDR Classification OK A OK A OK A OK A OK A Flag A OK A Flag A OK A Flag A Flag A Flag A OK A Flag A OK A OK A OK A Flag A Flag A OK A OK A Flag A OK A Flag A OK A Flag A Flag A OK A OK A Flag A OK A Flag A Flag C OK A Flag A OK A OK A Flag A OK A OK A OK A Flag A OK A Flag B OK A 104 Item # 45 45 45 45 45 45 46 46 46 46 46 46 46 46 47 47 47 47 47 47 47 47 48 48 48 48 48 48 48 48 49 49 49 49 49 49 49 49 Comparison Group Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian MH CHI 10.2223 0.6418 2.1424 0.0464 7.3001 8.6026 7.7537 0.2268 10.0576 0.8335 2.6467 0.3565 2.8326 9.4407 52.5346 17.4978 1,182.3783 1.1434 283.6459 21.0493 275.4469 1,219.1791 0.2834 0.0131 14.1285 0.0108 0.4386 0.1463 49.6821 16.5381 8.8995 1.4765 46.8025 0.0214 0.0001 2.1503 24.2932 86.6831 MH LOR LOR SE LOR Z -0.1618 0.0502 -3.2231 0.2231 0.2383 0.9362 -0.1283 0.0846 -1.5165 -0.0980 0.2782 -0.3523 -0.1349 0.0495 -2.7253 -0.1605 0.0541 -2.9667 -0.5994 0.2140 -2.8009 0.1052 0.1828 0.5755 -0.1998 0.0626 -3.1917 0.3170 0.2887 1.0980 -0.1749 0.1044 -1.6753 -0.2834 0.3683 -0.7695 -0.1059 0.0617 -1.7164 -0.2089 0.0676 -3.0902 0.6824 0.0981 6.9562 0.3835 0.0929 4.1281 0.9929 0.0293 33.8874 0.1929 0.1643 1.1741 0.8319 0.0506 16.4407 0.7691 0.1752 4.3898 0.4852 0.0294 16.5034 1.0835 0.0317 34.1798 0.0992 0.1607 0.6173 0.0346 0.1734 0.1995 0.2077 0.0554 3.7491 0.0736 0.2913 0.2527 0.0652 0.0922 0.7072 0.1522 0.2871 0.5301 0.3538 0.0506 6.9921 0.2358 0.0585 4.0308 0.2881 0.0958 3.0073 -0.1211 0.0964 -1.2562 -0.2107 0.0308 -6.8409 0.0331 0.1512 0.2189 0.0009 0.0519 0.0173 -0.2950 0.1900 -1.5526 -0.1538 0.0311 -4.9453 -0.3212 0.0345 -9.3101 BD 3.395 0.084 3.493 0.047 0.297 2.718 0.022 0.301 0.001 0.315 0.054 0.018 1.347 0.007 41.559 8.415 56.713 0.624 32.145 6.208 6.551 116.070 0.001 0.647 13.809 0.000 0.466 0.030 31.868 22.164 1.440 0.205 0.043 1.477 0.016 0.000 0.007 0.973 ETS CDR Classification Flag A OK A OK A OK A Flag A Flag A Flag B OK A Flag A OK A OK A OK A OK A Flag A Flag C Flag A Flag C OK A Flag C Flag B Flag B Flag C OK A OK A Flag A OK A OK A OK A Flag A Flag A Flag A OK A Flag A OK A OK A OK A Flag A Flag A Table A2 Heading Descriptions Mantel-Haenszel Chi-Square (MH CHI) – The Mantel-Haenszel chi-square statistic (Holland & Thayer, 1988; Mantel & Haenszel, 1959) is distributed as chi-square with one degree of freedom. Critical values of this statistic are 3.84 for a Type I error rate of 0.05 and 6.63 for a Type I error rate of 0.01. 104 105 Mantel-Haenszel Common Log-Odds Ratio (MH LOR) – The Mantel-Haenszel common log-odds ratio (Camilli & Shepard, 1994; Mantel & Haenszel, 1959) is asymptotically normally distributed. Positive values indicate DIF in favor of the reference group, and negative values indicate DIF in favor of the focal groups. Standard Error of the Mantel-Haenszel Common Log-Odds Ratio (LOR SE) – The standard error of the Mantel-Haenszel common log-odds ratio. The standard error computed here is the non-symmetric estimator presented by Robins, Breslow and Greenland (1986). Standardized Mantel-Haenszel Log-Odds Ratio (LOR Z) – This is the MantelHaenszel log-odds ratio divided by the estimated standard error. A value greater than 2.0 or less than –2.0 may be considered evidence of the presence of DIF. Breslow-Day Chi-Square (BD) – The Breslow-Day chi-square test of trend in odds ratio heterogeneity (Breslow & Day, 1980; Penfield, 2003) is distributed as chisquare with one degree of freedom. Critical values of this statistic are 3.84 for a Type I error rate of 0.05 and 6.63 for a Type I error rate of 0.01. This statistic has been shown to be effective at detecting non-uniform DIF. Combined Decision Rule (CDR) – The combined decision rule (CDR) flags any item for which either the Mantel-Haenszel chi-square or the Breslow-Day chi-square statistic is significant at a Type I error rate of 0.025 (Penfield, 2003). The message OK is printed if neither statistic is significant, and the message FLAG is printed if either statistic is significant. The ETS Categorization Scheme (ETS Classification) – The ETS categorization scheme (Hidalgo & López-Pina, 2004; Zieky, 1993) categorizes items as having small (A), moderate (B), and large (C) levels of DIF. 105 106 APPENDIX C Nagelkerke R2 Values and DIF Classification Category by Item and Group Comparison Table A2 Item # 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 Comparison Group Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female Block 1 Nagelkerke R² 0.1780 0.1700 0.1870 0.1670 0.1720 0.1710 0.1870 0.1870 0.0790 0.0770 0.0910 0.0740 0.0780 0.0740 0.0910 0.0900 0.1070 0.0980 0.1170 0.0990 0.1100 0.1000 0.1170 Block 2 Nagelkerke R² 0.1800 0.1700 0.1880 0.1670 0.1720 0.1720 0.1870 0.1890 0.0790 0.0780 0.0910 0.0740 0.0790 0.0740 0.0910 0.0910 0.1100 0.0990 0.1190 0.0990 0.1100 0.1000 0.1330 Block 3 Nagelkerke R² 0.1810 0.1700 0.1910 0.1670 0.1740 0.1720 0.1870 0.1920 0.0790 0.0780 0.0920 0.0740 0.0800 0.0740 0.0920 0.0910 0.1110 0.0990 0.1200 0.0990 0.1110 0.1000 0.1330 106 Block 2 Significance Level 0.002** 0.753 0.000*** 0.719 0.043* 0.093 0.302 0.000*** 0.631 0.139 0.733 0.875 0.000*** 0.823 0.103 0.109 0.000*** 0.069 0.000*** 0.300 0.307 0.987 0.000*** Block 3 Significance Level 0.011* 0.973 0.000*** 0.513 0.000*** 0.889 0.034* 0.000*** 0.330 0.415 0.000*** 0.589 0.159 0.093 0.003** 0.000*** 0.044* 0.563 0.000*** 0.319 0.101 0.206 0.275 ETS Classification A A A A A A A A A A A A A A A A A A A A A A A 107 Item # 3 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 7 7 7 7 7 Comparison Group African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Block 1 Nagelkerke R² 0.1140 0.0370 0.0350 0.0310 0.0340 0.0340 0.0350 0.0320 0.0320 0.1960 0.1840 0.2330 0.1830 0.2090 0.1830 0.2310 0.2260 0.1560 0.1470 0.1820 0.1470 0.1580 0.1500 0.1820 0.1820 0.1250 0.1260 0.1480 0.1250 0.1340 Block 2 Nagelkerke R² 0.1160 0.0370 0.0360 0.0310 0.0340 0.0340 0.0350 0.0360 0.0320 0.1970 0.1850 0.2380 0.1830 0.2090 0.1830 0.2310 0.2340 0.1560 0.1470 0.1820 0.1480 0.1580 0.1500 0.1820 0.1820 0.1250 0.1260 0.1530 0.1250 0.1370 Block 3 Nagelkerke R² 0.1170 0.0370 0.0360 0.0330 0.0340 0.0350 0.0350 0.0360 0.0330 0.1970 0.1850 0.2400 0.1830 0.2090 0.1830 0.2310 0.2370 0.1560 0.1480 0.1840 0.1480 0.1600 0.1500 0.1820 0.1830 0.1260 0.1260 0.1530 0.1250 0.1370 107 Block 2 Significance Level 0.000*** 0.254 0.235 0.327 0.151 0.360 0.026* 0.000*** 0.849 0.001 0.076 0.000*** 0.233 0.116 0.982 0.058 0.000*** 0.294 0.885 0.028* 0.166 0.033* 0.553 0.033* 0.219 0.248 0.428 0.000*** 0.606 0.000*** Block 3 Significance Level 0.000*** 0.275 0.853 0.000*** 0.185 0.001 0.148 0.235 0.000*** 0.374 0.269 0.000*** 0.731 0.297 0.154 0.035* 0.000*** 0.144 0.497 0.000*** 0.458 0.000*** 0.338 0.242 0.000*** 0.035* 0.621 0.248 0.468 0.643 ETS Classification A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A 108 Item # 7 7 7 8 8 8 8 8 8 8 8 9 9 9 9 9 9 9 9 10 10 10 10 10 10 10 10 11 11 11 Comparison Group Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian Block 1 Nagelkerke R² 0.1240 0.1480 0.1440 0.1000 0.1010 0.0870 0.1000 0.1010 0.1010 0.0870 0.0850 0.0280 0.0240 0.0480 0.0240 0.0290 0.0250 0.0480 0.0480 0.2230 0.2060 0.2700 0.2050 0.2250 0.2070 0.2750 0.2800 0.0310 0.0350 0.0290 Block 2 Nagelkerke R² 0.1240 0.1500 0.1490 0.1010 0.1010 0.0880 0.1010 0.1010 0.1010 0.0880 0.0870 0.0290 0.0240 0.0490 0.0240 0.0290 0.0250 0.0490 0.0480 0.2250 0.2110 0.3060 0.2070 0.2350 0.2080 0.2950 0.3160 0.0350 0.0350 0.0310 Block 3 Nagelkerke R² 0.1240 0.1510 0.1490 0.1020 0.1010 0.0920 0.1010 0.1030 0.1010 0.0890 0.0920 0.0290 0.0240 0.0490 0.0240 0.0300 0.0250 0.0490 0.0490 0.2250 0.2110 0.3070 0.2070 0.2380 0.2080 0.2950 0.3180 0.0360 0.0350 0.0320 108 Block 2 Significance Level 0.165 0.000*** 0.000*** 0.003** 0.092 0.000*** 0.012* 0.130 0.799 0.000*** 0.000*** 0.000*** 0.278 0.015* 0.455 0.009** 0.225 0.000*** 0.000*** 0.000*** 0.000*** 0.000*** 0.001 0.000*** 0.001 0.000*** 0.000*** 0.000*** 0.335 0.000*** Block 3 Significance Level 0.930 0.051 0.365 0.006** 0.567 0.000*** 0.393 0.000*** 0.094 0.006** 0.000*** 0.567 0.338 0.010 0.195 0.146 0.060 0.013* 0.015* 0.683 0.223 0.000*** 0.225 0.000*** 0.853 0.075 0.000*** 0.001 0.130 0.000*** ETS Classification A A A A A A A A A A A A A A A A A A A A A B A A A A B A A A 109 Item # 11 11 11 11 11 12 12 12 12 12 12 12 12 13 13 13 13 13 13 13 13 14 14 14 14 14 14 14 14 15 Comparison Group American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Block 1 Nagelkerke R² 0.0360 0.0350 0.0360 0.0290 0.0300 0.0980 0.1050 0.1030 0.1030 0.1000 0.1030 0.1030 0.1070 0.1210 0.1220 0.1390 0.1210 0.1250 0.1190 0.1390 0.1390 0.2500 0.2280 0.2620 0.2300 0.2620 0.2320 0.2600 0.2480 0.1570 Block 2 Nagelkerke R² 0.0360 0.0350 0.0360 0.0290 0.0320 0.1000 0.1050 0.1030 0.1030 0.1010 0.1030 0.1050 0.1070 0.1220 0.1230 0.1430 0.1210 0.1250 0.1200 0.1410 0.1440 0.2530 0.2280 0.2620 0.2300 0.2630 0.2330 0.2620 0.2490 0.1570 Block 3 Nagelkerke R² 0.0360 0.0360 0.0360 0.0290 0.0330 0.1010 0.1050 0.1060 0.1030 0.1030 0.1030 0.1050 0.1100 0.1220 0.1230 0.1430 0.1210 0.1260 0.1200 0.1420 0.1440 0.2530 0.2280 0.2620 0.2300 0.2630 0.2330 0.2620 0.2490 0.1580 109 Block 2 Significance Level 0.380 0.116 0.009** 0.204 0.000*** 0.000*** 0.288 0.093 0.107 0.018* 0.058 0.000*** 0.18 0.003** 0.020* 0.000*** 0.552 0.016* 0.008** 0.000*** 0.000*** 0.030* 0.430 0.411 0.673 0.106 0.110 0.001 0.044* 0.451 Block 3 Significance Level 0.923 0.001 0.747 0.472 0.000*** 0.000*** 0.946 0.000*** 0.562 0.000*** 0.594 0.971 0.000*** 0.053 0.399 0.002** 0.986 0.001 0.526 0.021* 0.044* 0.574 0.974 0.442 0.692 0.763 0.313 0.590 0.340 0.000*** ETS Classification A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A 110 Item # 15 15 15 15 15 15 15 16 16 16 16 16 16 16 16 17 17 17 17 17 17 17 17 18 18 18 18 18 18 18 Comparison Group Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female Block 1 Nagelkerke R² 0.1550 0.1750 0.1550 0.1600 0.1560 0.1740 0.1810 0.3040 0.2980 0.3400 0.2950 0.3330 0.2940 0.3400 0.3380 0.1070 0.1090 0.1040 0.1080 0.1080 0.1100 0.1040 0.1040 0.1240 0.1230 0.1480 0.1200 0.1280 0.1170 0.1470 Block 2 Nagelkerke R² 0.1550 0.1760 0.1560 0.1610 0.1560 0.1740 0.1810 0.3390 0.2990 0.3510 0.2950 0.3430 0.3030 0.3400 0.3480 0.1080 0.1090 0.1040 0.1090 0.1090 0.1100 0.1040 0.1050 0.1250 0.1230 0.1480 0.1200 0.1280 0.1180 0.1470 Block 3 Nagelkerke R² 0.1550 0.1790 0.1560 0.1650 0.1570 0.1740 0.1840 0.3400 0.3000 0.3510 0.2960 0.3430 0.3030 0.3400 0.3480 0.1090 0.1090 0.1080 0.1090 0.1100 0.1100 0.1040 0.1090 0.1250 0.1230 0.1490 0.1200 0.1290 0.1180 0.1470 110 Block 2 Significance Level 0.942 0.000*** 0.028* 0.000*** 0.702 0.000*** 0.000*** 0.000*** 0.145 0.000*** 0.855 0.000*** 0.000*** 0.090 0.000*** 0.071 0.495 0.428 0.221 0.028* 0.757 0.129 0.235 0.373 0.065 0.619 0.597 0.394 0.016* 0.004** Block 3 Significance Level 0.035* 0.000*** 0.227 0.000*** 0.376 0.990 0.000*** 0.073 0.176 0.128 0.26 0.575 0.989 0.833 0.136 0.000*** 0.184 0.000*** 0.804 0.000*** 0.642 0.643 0.000*** 0.018* 0.741 0.000*** 0.672 0.005** 0.097 0.574 ETS Classification A A A A A A A B A A A A A A A A A A A A A A A A A A A A A A 111 Item # 18 19 19 19 19 19 19 19 19 20 20 20 20 20 20 20 20 21 21 21 21 21 21 21 21 22 22 22 22 22 Comparison Group African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Block 1 Nagelkerke R² 0.1470 0.0860 0.0830 0.1230 0.0800 0.0950 0.0830 0.1220 0.1260 0.0590 0.0580 0.0440 0.0610 0.0570 0.0620 0.0440 0.0420 0.0320 0.0340 0.0410 0.0340 0.0310 0.0340 0.0410 0.0460 0.1340 0.1290 0.1450 0.1310 0.1380 Block 2 Nagelkerke R² 0.1470 0.0860 0.0850 0.1350 0.0800 0.0970 0.0840 0.1280 0.1420 0.0610 0.0630 0.0580 0.0610 0.0610 0.0620 0.0450 0.0590 0.0340 0.0340 0.0430 0.0340 0.0310 0.0340 0.0420 0.0490 0.1350 0.1290 0.1460 0.1310 0.1380 Block 3 Nagelkerke R² 0.1480 0.0860 0.0850 0.1360 0.0800 0.0980 0.0840 0.1280 0.1430 0.0610 0.0630 0.0590 0.0610 0.0620 0.0620 0.0460 0.0610 0.0340 0.0350 0.0440 0.0340 0.0330 0.0340 0.0420 0.0500 0.1350 0.1300 0.1470 0.1310 0.1390 111 Block 2 Significance Level 0.869 0.884 0.000*** 0.000*** 0.917 0.000*** 0.007** 0.000*** 0.000*** 0.000*** 0.000*** 0.000*** 0.294 0.000*** 0.403 0.000*** 0.000*** 0.000*** 0.037* 0.000*** 0.277 0.343 0.624 0.000*** 0.000*** 0.006** 0.215 0.000*** 0.195 0.341 Block 3 Significance Level 0.001 0.362 0.752 0.000*** 0.162 0.014* 0.981 0.095 0.000*** 0.015* 0.175 0.000*** 0.392 0.002** 0.926 0.002** 0.000*** 0.153 0.118 0.000*** 0.403 0.000*** 0.501 0.558 0.000*** 0.142 0.026* 0.000*** 0.205 0.000*** ETS Classification A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A 112 Item # 22 22 22 23 23 23 23 23 23 23 23 24 24 24 24 24 24 24 24 25 25 25 25 25 25 25 25 26 26 26 Comparison Group Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian Block 1 Nagelkerke R² 0.1310 0.1430 0.1440 0.0830 0.0820 0.1060 0.0790 0.0870 0.0800 0.1070 0.1070 0.1130 0.1110 0.1230 0.1110 0.1210 0.1100 0.1230 0.1180 0.0330 0.0280 0.0450 0.0290 0.0320 0.0300 0.0440 0.0460 0.2070 0.2000 0.2450 Block 2 Nagelkerke R² 0.1310 0.1430 0.1460 0.0840 0.0830 0.1110 0.0790 0.0870 0.0800 0.1070 0.1130 0.1140 0.1110 0.1240 0.1110 0.1210 0.1110 0.1230 0.1200 0.0340 0.0280 0.0450 0.0290 0.0320 0.0300 0.0450 0.0460 0.2070 0.2000 0.2540 Block 3 Nagelkerke R² 0.1310 0.1430 0.1470 0.0840 0.0830 0.1120 0.0790 0.0870 0.0800 0.1070 0.1150 0.1140 0.1110 0.1250 0.1110 0.1220 0.1110 0.1230 0.1210 0.0340 0.0290 0.0450 0.0290 0.0320 0.0300 0.0450 0.0460 0.2070 0.2000 0.2560 112 Block 2 Significance Level 0.214 0.074 0.000*** 0.099 0.004** 0.000*** 0.784 0.976 0.629 0.219 0.000*** 0.048* 0.161 0.000*** 0.783 0.915 0.393 0.972 0.000*** 0.002** 0.573 0.218 0.179 0.633 0.065 0.000*** 0.732 0.389 0.329 0.000*** Block 3 Significance Level 0.072 0.944 0.000*** 0.082 0.502 0.000*** 0.975 0.063 0.613 0.429 0.000*** 0.122 0.942 0.000*** 0.788 0.009** 0.048* 0.057 0.000*** 0.452 0.038* 0.655 0.616 0.076 0.543 0.117 0.741 0.148 0.877 0.000*** ETS Classification A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A 113 Item # 26 26 26 26 26 27 27 27 27 27 27 27 27 28 28 28 28 28 28 28 28 29 29 29 29 29 29 29 29 30 Comparison Group American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Block 1 Nagelkerke R² 0.2000 0.2180 0.1980 0.2450 0.2420 0.1480 0.1370 0.1810 0.1400 0.1620 0.1380 0.1790 0.1800 0.1710 0.1630 0.2180 0.1640 0.1890 0.1630 0.2150 0.2190 0.1260 0.1230 0.1550 0.1200 0.1280 0.1200 0.1550 0.1560 0.1470 Block 2 Nagelkerke R² 0.2010 0.2190 0.1990 0.2450 0.2540 0.1480 0.1370 0.1850 0.1400 0.1630 0.1380 0.1790 0.1840 0.1720 0.1630 0.2220 0.1640 0.1910 0.1630 0.2160 0.2230 0.1260 0.1230 0.1550 0.1200 0.1280 0.1200 0.1550 0.1560 0.1470 Block 3 Nagelkerke R² 0.2010 0.2200 0.1990 0.2540 0.2570 0.1490 0.1380 0.1860 0.1400 0.1630 0.1380 0.1790 0.1850 0.1730 0.1630 0.2230 0.1640 0.1920 0.1630 0.2160 0.2240 0.1260 0.1240 0.1550 0.1200 0.1290 0.1200 0.1550 0.1560 0.1470 113 Block 2 Significance Level 0.040* 0.010 0.006** 0.000*** 0.000*** 0.095 0.689 0.000*** 0.166 0.024* 0.921 0.553 0.000*** 0.006** 0.662 0.000*** 0.105 0.000*** 0.677 0.003** 0.000*** 0.985 0.774 0.518 0.810 0.620 0.579 0.591 0.588 0.174 Block 3 Significance Level 0.007** 0.004** 0.170 0.180 0.000*** 0.144 0.312 0.007** 0.875 0.196 0.136 0.102 0.007** 0.001 0.225 0.001 0.742 0.035* 0.115 0.319 0.004** 0.170 0.261 0.009** 0.870 0.005** 0.861 0.265 0.041* 0.128 ETS Classification A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A 114 Item # 30 30 30 30 30 30 30 31 31 31 31 31 31 31 31 32 32 32 32 32 32 32 32 33 33 33 33 33 33 33 Comparison Group Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female Block 1 Nagelkerke R² 0.1430 0.1760 0.1420 0.1600 0.1420 0.1760 0.1710 0.1650 0.1570 0.2160 0.1590 0.1740 0.1570 0.2140 0.2150 0.0480 0.0460 0.0670 0.0450 0.0520 0.0440 0.0660 0.0670 0.0760 0.0780 0.0740 0.0760 0.0760 0.0780 0.0740 Block 2 Nagelkerke R² 0.1430 0.1770 0.1420 0.1600 0.1420 0.1790 0.1710 0.1650 0.1580 0.2170 0.1590 0.1750 0.1570 0.2180 0.2180 0.0490 0.0460 0.0700 0.0450 0.0520 0.0440 0.0670 0.0700 0.0770 0.0780 0.0760 0.0760 0.0760 0.0780 0.0740 Block 3 Nagelkerke R² 0.1430 0.1770 0.1420 0.1610 0.1420 0.1790 0.1720 0.1650 0.1580 0.2170 0.1590 0.1750 0.1580 0.2180 0.2190 0.0490 0.0460 0.0700 0.0450 0.0520 0.0440 0.0670 0.0710 0.0780 0.0780 0.0780 0.0770 0.0780 0.0780 0.0740 114 Block 2 Significance Level 0.216 0.020* 0.913 0.590 0.563 0.000*** 0.005** 0.005** 0.002** 0.000*** 0.613 0.003** 0.041* 0.000*** 0.000*** 0.230 0.610 0.000*** 0.830 0.129 0.833 0.035* 0.000*** 0.000*** 0.126 0.000*** 0.522 0.166 0.902 0.125 Block 3 Significance Level 0.978 0.001 0.809 0.180 0.258 0.449 0.001 0.867 0.536 0.002** 0.895 0.180 0.109 0.104 0.001 0.496 0.644 0.038* 0.940 0.178 0.021* 0.815 0.036* 0.030* 0.745 0.000*** 0.033* 0.000*** 0.886 0.351 ETS Classification A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A 115 Item # 33 34 34 34 34 34 34 34 34 35 35 35 35 35 35 35 35 36 36 36 36 36 36 36 36 37 37 37 37 37 Comparison Group African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Block 1 Nagelkerke R² 0.0750 0.1020 0.1060 0.0990 0.1060 0.1100 0.1050 0.0980 0.0970 0.1340 0.1290 0.1360 0.1300 0.1340 0.1320 0.1360 0.1290 0.2210 0.2090 0.2170 0.2110 0.2290 0.2160 0.2160 0.2160 0.1100 0.1080 0.1060 0.1070 0.1090 Block 2 Nagelkerke R² 0.0780 0.1040 0.1060 0.1070 0.1060 0.1120 0.1070 0.0980 0.1070 0.1350 0.1300 0.1460 0.1300 0.1370 0.1320 0.1430 0.1430 0.2220 0.2090 0.2250 0.2120 0.2370 0.2170 0.2170 0.2240 0.1110 0.1080 0.1080 0.1070 0.1100 Block 3 Nagelkerke R² 0.0800 0.1050 0.1060 0.1090 0.1070 0.1120 0.1070 0.0980 0.1090 0.1350 0.1300 0.1470 0.1300 0.1370 0.1320 0.1430 0.1430 0.2230 0.2100 0.2300 0.2120 0.2390 0.2170 0.2180 0.2290 0.1110 0.1080 0.1100 0.1070 0.1110 115 Block 2 Significance Level 0.000*** 0.000*** 0.371 0.000*** 0.006** 0.000*** 0.000*** 0.235 0.000*** 0.026* 0.002** 0.000*** 0.136 0.000*** 0.582 0.000*** 0.000*** 0.122 0.088 0.000*** 0.016* 0.000*** 0.023* 0.001 0.000*** 0.135 0.870 0.000*** 0.189 0.000*** Block 3 Significance Level 0.000*** 0.000*** 0.114 0.000*** 0.217 0.007** 0.234 0.433 0.000*** 0.347 0.505 0.000*** 0.859 0.014* 0.455 0.838 0.002** 0.026* 0.076 0.000*** 0.396 0.000*** 0.784 0.000*** 0.000*** 0.051 0.217 0.000*** 0.343 0.000*** ETS Classification A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A 116 Item # 37 37 37 38 38 38 38 38 38 38 38 39 39 39 39 39 39 39 39 40 40 40 40 40 40 40 40 41 41 41 Comparison Group Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian Block 1 Nagelkerke R² 0.1080 0.1060 0.1040 0.0460 0.0470 0.0540 0.0480 0.0470 0.0450 0.0540 0.0550 0.1730 0.1630 0.1900 0.1630 0.1820 0.1650 0.1880 0.1840 0.1290 0.1240 0.1550 0.1270 0.1480 0.1270 0.1540 0.1440 0.1280 0.1330 0.1270 Block 2 Nagelkerke R² 0.1080 0.1060 0.1060 0.0470 0.0470 0.0560 0.0480 0.0480 0.0460 0.0540 0.0570 0.1730 0.1640 0.1900 0.1640 0.1820 0.1650 0.1880 0.1840 0.1300 0.1250 0.1580 0.1270 0.1490 0.1270 0.1550 0.1480 0.1290 0.1330 0.1280 Block 3 Nagelkerke R² 0.1080 0.1060 0.1090 0.0470 0.0470 0.0560 0.0480 0.0480 0.0460 0.0540 0.0570 0.1730 0.1640 0.1920 0.1640 0.1830 0.1650 0.1880 0.1860 0.1300 0.1250 0.1580 0.1280 0.1490 0.1280 0.1550 0.1480 0.1310 0.1330 0.1320 116 Block 2 Significance Level 0.522 0.005** 0.000*** 0.002** 0.527 0.000*** 0.395 0.002** 0.086 0.002** 0.000*** 0.396 0.043* 0.151 0.084 0.528 0.127 0.741 0.306 0.051 0.029* 0.000*** 0.905 0.041* 0.565 0.002** 0.000*** 0.000*** 0.391 0.000*** Block 3 Significance Level 0.201 0.660 0.000*** 0.423 0.937 0.150 0.494 0.107 0.027* 0.974 0.322 0.080 0.072 0.000*** 0.876 0.018* 0.109 0.150 0.000*** 0.347 0.434 0.160 0.305 0.522 0.399 0.390 0.071 0.000*** 0.215 0.000*** ETS Classification A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A 117 Item # 41 41 41 41 41 42 42 42 42 42 42 42 42 43 43 43 43 43 43 43 43 44 44 44 44 44 44 44 44 45 Comparison Group American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Block 1 Nagelkerke R² 0.1330 0.1330 0.1310 0.1270 0.1280 0.1390 0.1370 0.1430 0.1360 0.1410 0.1370 0.1430 0.1410 0.1680 0.1590 0.1830 0.1600 0.1840 0.1630 0.1820 0.1790 0.1370 0.1340 0.1720 0.1310 0.1480 0.1310 0.1730 0.1660 0.1440 Block 2 Nagelkerke R² 0.1330 0.1330 0.1310 0.1280 0.1300 0.1390 0.1370 0.1470 0.1360 0.1420 0.1370 0.1430 0.1460 0.1690 0.1600 0.1850 0.1610 0.1880 0.1670 0.1820 0.1800 0.1370 0.1340 0.1720 0.1310 0.1480 0.1310 0.1740 0.1660 0.1470 Block 3 Nagelkerke R² 0.1330 0.1350 0.1320 0.1280 0.1330 0.1400 0.1370 0.1480 0.1360 0.1430 0.1370 0.1430 0.1470 0.1700 0.1600 0.1870 0.1610 0.1890 0.1670 0.1820 0.1820 0.1370 0.1340 0.1720 0.1310 0.1480 0.1310 0.1740 0.1670 0.1470 117 Block 2 Significance Level 0.714 0.260 0.620 0.000*** 0.000*** 0.664 0.084 0.000*** 0.202 0.001 0.433 0.000*** 0.000*** 0.047* 0.542 0.000*** 0.051 0.000*** 0.000*** 0.448 0.000*** 0.105 0.114 0.269 0.953 0.120 0.133 0.000*** 0.038* 0.000*** Block 3 Significance Level 0.535 0.000*** 0.008** 0.706 0.000*** 0.001 0.050 0.000*** 0.452 0.000*** 0.375 0.024* 0.000*** 0.050 0.525 0.000*** 0.858 0.029* 0.623 0.947 0.000*** 0.894 0.070 0.002** 0.487 0.672 0.718 0.377 0.000*** 0.672 ETS Classification A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A 118 Item # 45 45 45 45 45 45 45 46 46 46 46 46 46 46 46 47 47 47 47 47 47 47 47 48 48 48 48 48 48 48 Comparison Group Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female Block 1 Nagelkerke R² 0.1470 0.1890 0.1450 0.1670 0.1450 0.1900 0.1830 0.1410 0.1460 0.1710 0.1450 0.1570 0.1430 0.1710 0.1690 0.2420 0.2340 0.2790 0.2360 0.2650 0.2350 0.2770 0.2810 0.2040 0.1860 0.2110 0.1890 0.2120 0.1910 0.2090 Block 2 Nagelkerke R² 0.1470 0.1900 0.1450 0.1670 0.1450 0.1900 0.1830 0.1430 0.1460 0.1720 0.1450 0.1570 0.1440 0.1720 0.1690 0.2460 0.2350 0.3280 0.2360 0.2840 0.2370 0.2880 0.3360 0.2040 0.1860 0.2130 0.1890 0.2130 0.1910 0.2130 Block 3 Nagelkerke R² 0.1470 0.1900 0.1450 0.1670 0.1450 0.1900 0.1840 0.1430 0.1460 0.1720 0.1460 0.1580 0.1440 0.1720 0.1700 0.2490 0.2360 0.3330 0.2360 0.2850 0.2380 0.2880 0.3420 0.2040 0.1860 0.2150 0.1890 0.2130 0.1910 0.2150 118 Block 2 Significance Level 0.602 0.038* 0.245 0.200 0.949 0.027* 0.090 0.002** 0.454 0.011* 0.261 0.099 0.566 0.186 0.016* 0.000*** 0.000*** 0.000*** 0.220 0.000*** 0.000*** 0.000*** 0.000*** 0.351 0.790 0.000*** 0.771 0.292 0.407 0.000*** Block 3 Significance Level 0.280 0.268 0.248 0.867 0.499 0.760 0.196 0.212 0.469 0.062 0.216 0.270 0.640 0.295 0.094 0.000*** 0.005** 0.000*** 0.134 0.000*** 0.013* 0.517 0.000*** 0.790 0.070 0.000*** 0.886 0.086 0.830 0.000*** ETS Classification A A A A A A A A A A A A A A A A A B A A A A B A A A A A A A 119 Item # 48 49 49 49 49 49 49 49 49 Comparison Group African American/Caucasian Asian/Caucasian Two or More/Caucasian Total Minority/Caucasian American Indian/Caucasian Hispanic/Caucasian Hawaiian/Caucasian Male/Female African American/Caucasian Block 1 Nagelkerke R² 0.2030 0.1400 0.1320 0.1490 0.1310 0.1450 0.1330 0.1480 0.1440 Block 2 Nagelkerke R² 0.2060 0.1400 0.1320 0.1510 0.1310 0.1450 0.1330 0.1490 0.1480 Block 3 Nagelkerke R² 0.2090 0.1410 0.1320 0.1520 0.1310 0.1460 0.1330 0.1490 0.1490 Block 2 Significance Level 0.000*** 0.011* 0.175 0.000*** 0.855 0.845 0.069 0.000*** 0.000*** Block 3 Significance Level 0.000*** 0.007** 0.062 0.000*** 0.002** 0.009** 0.359 0.222 0.000*** ETS Classification A A A A A A A A A Table A3 Heading Descriptions Block 1 Nagelkerke R² – The amount of variance accounted for in the first stage of the logistic regression procedure. Displays the amount of variance accounted for by applicants’ total test score. Block 2 Nagelkerke R² – The amount of variance accounted for in the second stage of the logistic regression procedure. Displays the amount of variance accounted for by both total test score and group membership. Block 3 Nagelkerke R² – The amount of variance accounted for in the third stage of the logistic regression procedure. Displays the amount of variance accounted for by total test score, group membership and the interaction of total test score and group membership. Block 2 Significance Level – The significance level of the Block 2 Nagelkerke R² value. * denotes that the value is significant at the <.05 level. ** denotes that the value is significant at the <.001 level. amount of variance accounted for in the second stage of the logistic regression procedure. Displays the amount of variance accounted for by both total test score and group membership. Block 3 Significance Level – The significance level of the Block 3 Nagelkerke R² value. * denotes that the value is significant at the <.05 level. ** denotes that the value is significant at the <.001 level. amount of variance accounted for in the second stage of the logistic regression procedure. Displays the amount of variance accounted for by both total test score and group membership. The ETS Categorization Scheme (ETS Classification) – The ETS categorization scheme (Jodoin and Gierl, 2001) categorizes items as having small (A), moderate (B), and large (C) levels of DIF. 119 120 APPENDIX D Number of Applicants Passing at Cut-off Score Level by Test and Comparison Group Table A4-1. Original Test Applicants Passing at Cut-off Score Level by Comparison Group Original Test Cutoff Score 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 Male 20133 20128 20118 20107 20086 20074 20058 20042 20018 19986 19961 19929 19895 19865 19813 19754 19675 19586 19468 19314 19118 18876 18596 Female 9031 9028 9024 9021 9018 9012 9002 8997 8984 8976 8969 8962 8946 8930 8907 8883 8847 8803 8742 8665 8548 8410 8242 Caucasian 14776 14774 14769 14765 14757 14756 14753 14748 14740 14732 14724 14715 14705 14703 14688 14673 14647 14629 14596 14556 14499 14448 14368 Total Minority 14388 14382 14373 14363 14347 14330 14307 14291 14262 14230 14206 14176 14136 14092 14032 13964 13875 13760 13614 13423 13167 12838 12470 American Indian/ Alaskan Native 244 244 244 244 244 243 243 243 243 243 243 243 243 243 243 243 242 242 241 240 239 237 231 Asian 563 563 562 562 561 561 560 559 558 557 555 552 550 548 543 540 535 526 519 508 498 484 472 120 Hispanic 2553 2552 2551 2548 2541 2538 2536 2533 2526 2520 2516 2509 2505 2497 2492 2483 2471 2455 2427 2393 2351 2303 2256 African American 10195 10190 10183 10176 10168 10156 10136 10124 10103 10079 10061 10042 10009 9975 9927 9873 9804 9722 9614 9475 9283 9030 8744 Native Hawaiian/ Pacific Islander 162 162 162 162 162 162 162 162 162 162 162 162 162 162 160 160 160 158 157 153 151 145 139 Two or More Races 671 671 671 671 671 670 670 670 670 669 669 668 667 667 667 665 663 657 656 654 645 639 628 121 Original Test Cutoff Score 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 Male 18187 17731 17167 16412 15620 14565 13323 11880 10077 8244 6269 4330 2603 1263 426 67 Female 8025 7784 7473 7049 6535 5963 5305 4511 3694 2923 2080 1388 792 374 124 20 Caucasian 14232 14037 13802 13423 12988 12384 11601 10577 9213 7760 6014 4295 2657 1323 447 78 Total Minority 11980 11478 10838 10038 9167 8144 7027 5814 4558 3407 2335 1423 738 314 103 9 American Indian/ Alaskan Native 227 219 207 199 192 176 169 146 119 92 71 49 28 14 2 0 Asian 441 422 395 360 333 295 265 220 184 132 92 46 18 8 2 0 121 Hispanic 2182 2098 2009 1881 1740 1580 1395 1197 976 745 529 337 172 71 30 3 African American 8385 8013 7531 6944 6289 5532 4687 3816 2928 2154 1442 852 444 189 56 6 Native Hawaiian/ Pacific Islander 130 128 123 113 102 89 79 63 52 45 28 20 9 2 1 0 Two or More Races 615 598 573 541 511 472 432 372 299 239 173 119 67 30 12 0 122 Table A4-2. MH Test Applicants Passing at Cut-off Score Level by Comparison Group. MH Test Cut-off Score 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 Male 20136 20135 20133 20126 20111 20078 20050 20019 19970 19908 19819 19699 19521 19258 18892 18396 17677 16652 15328 13649 11497 8939 6121 3488 1457 303 Female 9034 9030 9030 9028 9022 9017 9008 8993 8969 8942 8909 8849 8762 8626 8439 8182 7788 7291 6604 5687 4634 3437 2209 1183 464 96 Caucasian 14780 14779 14778 14773 14767 14759 14751 14742 14729 14710 14680 14655 14607 14541 14430 14261 13937 13406 12683 11595 10054 8082 5734 3364 1441 318 Total Minority 14390 14386 14385 14381 14366 14336 14307 14270 14210 14140 14048 13893 13676 13343 12901 12317 11528 10537 9249 7741 6077 4294 2596 1307 480 81 American Indian/ Alaskan Native 244 244 244 244 244 244 243 243 243 243 243 243 242 241 236 231 222 207 196 170 136 101 66 32 19 4 Asian 563 563 563 563 563 561 559 558 553 549 542 536 525 506 480 452 412 373 324 276 223 165 85 41 14 1 122 Hispanic 2554 2553 2553 2552 2546 2539 2533 2530 2521 2512 2500 2473 2433 2388 2315 2236 2105 1941 1744 1485 1211 893 568 300 110 16 African American 10196 10193 10192 10189 10180 10159 10141 10108 10063 10006 9933 9819 9655 9399 9082 8645 8077 7343 6381 5293 4093 2826 1684 824 292 53 Native Hawaiian/ Pacific Islander 162 162 162 162 162 162 162 162 162 162 162 161 160 155 146 138 124 115 101 83 64 51 30 14 2 0 Two or More Races 671 671 671 671 671 671 669 669 668 668 668 661 661 654 642 615 588 558 503 434 350 258 163 96 43 7 123 Table A4-3. LR Test Applicants Passing at Cut-off Score Level by Comparison Group. LR Test Cut-off Score 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 Male 20136 20135 20130 20122 20109 20094 20075 20061 20042 20023 19999 19963 19940 19900 19857 19812 19745 19662 19560 19424 19240 18984 18691 18266 17778 17127 16325 Female 9033 9033 9031 9028 9024 9021 9013 9006 8999 8988 8978 8972 8965 8951 8938 8909 8879 8849 8795 8736 8633 8504 8338 8131 7852 7554 7108 Caucasian 14779 14778 14775 14774 14767 14759 14756 14754 14749 14740 14731 14725 14716 14707 14698 14683 14663 14644 14615 14582 14532 14463 14373 14221 14011 13717 13289 Total Minority 14390 14389 14385 14375 14365 14355 14331 14312 14291 14270 14245 14209 14188 14143 14096 14037 13960 13866 13739 13577 13340 13024 12655 12175 11618 10963 10143 American Indian/ Alaskan Native 244 244 244 244 244 244 244 243 243 243 243 243 243 243 243 243 243 242 241 241 240 237 234 227 219 211 201 Asian 563 563 563 562 561 561 561 559 559 559 558 556 553 549 547 544 539 534 527 520 509 492 479 457 435 401 368 123 Hispanic 2554 2554 2553 2551 2549 2546 2538 2538 2531 2525 2522 2517 2514 2506 2499 2491 2481 2472 2453 2415 2375 2329 2273 2202 2114 2018 1887 African American 10196 10196 10193 10186 10179 10172 10156 10141 10127 10112 10091 10064 10050 10017 9979 9932 9872 9797 9704 9591 9415 9176 8897 8539 8123 7641 7036 Native Hawaiian/ Pacific Islander 162 162 162 162 162 162 162 162 162 162 162 162 162 162 162 161 160 160 158 157 154 148 141 133 128 123 111 Two or More Races 671 671 671 671 671 671 671 670 670 670 670 668 667 667 667 667 666 662 657 654 648 643 632 618 600 570 541 124 LR Test Cut-off Score 36 37 38 39 40 41 42 43 44 45 46 Male 15321 14140 12674 10898 8903 6790 4724 2846 1383 469 71 Female 6542 5906 5118 4258 3378 2472 1640 943 433 153 23 Caucasian 12704 11979 10984 9691 8153 6367 4562 2835 1415 488 83 Total Minority 9158 8066 6807 5464 4127 2894 1801 953 400 133 10 American Indian/ Alaskan Native 188 175 156 130 98 76 54 28 16 2 0 Asian 326 290 248 207 157 112 64 24 12 3 0 124 Hispanic 1723 1546 1345 1117 876 632 406 217 87 34 3 African American 6313 5511 4586 3614 2680 1837 1114 594 249 79 8 Native Hawaiian/ Pacific Islander 103 89 76 60 50 38 26 11 2 1 0 Two or More Races 506 456 397 337 267 200 138 80 35 15 0 125 Appendix E Fisher’s Exact Statistical Significance Results of Adverse Impact by Test & Comparison Group Table A5-1. Original Test Fisher’s Exact Statistical Significance Results of Adverse Impact by Comparison Group Original Test Cutoff Score 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 Male/Female 0.214 0.261 0.425 0.869 0.357 0.482 0.837 0.638 0.868 0.410 0.235 0.079 0.120 0.217 0.238 0.219 0.265 0.433 0.750 0.974 0.241 0.035*† 0.001***† 0.000***† Total Minority/ Caucasian 1.000 0.449 0.195 0.046*† 0.010***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† American Indian/ Caucasian 1.000 1.000 1.000 1.000 1.000 0.336 0.368 1.000 1.000 0.817 0.938 0.944 0.831 0.811 0.993 0.846 0.895 0.755 0.983 0.875 0.867 0.666 0.029*† 0.013*† Asian/ Caucasian 1.000 1.000 0.362 1.000 1.000 0.247 0.096 0.042*† 0.024*† 0.011*† 0.001***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 125 Hispanic/ Caucasian 1.000 0.614 0.708 0.112 0.001***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† African American/ Caucasian 1.000 0.402 0.154 0.041*† 0.034*† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† Hawaiian/ Caucasian 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.977 0.890 0.806 0.726 0.712 0.631 0.768 0.976 0.156 0.083 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† Two or More/ Caucasian 1.000 1.000 1.000 0.650 0.624 1.000 1.000 1.000 0.726 0.905 0.990 0.978 0.969 0.792 0.932 0.784 0.568 0.015*† 0.040*† 0.055 0.001***† 0.000***† 0.000***† 0.000***† 126 Original Test Cutoff Score 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 Male/Female 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.134 Total Minority/ Caucasian 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† American Indian/ Caucasian 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.001***† 0.000***† 0.000***† 0.000***† 0.000***† 0.003**† 0.011*† 0.102 0.069 0.491 Asian/ Caucasian 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.154 Hispanic/ Caucasian 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.008**† African American/ Caucasian 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† Note. * significance level <.05. ** significance level <.01. *** significance level <.001. 126 Two or Hawaiian/ More/ Caucasian Caucasian 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.001***† 0.000***† 0.120 0.084 0.705 0.108 † Practical significance. 127 Table A5-2. MH Test Statistical Significance Results of Adverse Impact by Comparison Group MH Test Cut-off Score 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 Male/Female 0.310 0.013*† 0.066 0.431 0.726 0.175 0.126 0.229 0.433 0.467 0.255 0.569 0.883 0.537 0.176 0.028*† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.003**† Total Minority/ Caucasian 0.493 0.120 0.000***† 0.475 0.051 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† American Indian/ Caucasian 1.000 1.000 1.000 1.000 1.000 1.000 0.388 1.000 0.864 0.885 0.912 0.700 0.837 0.825 0.476 0.178 0.039*† 0.003**† 0.020*† 0.001***† 0.000***† 0.000***† 0.000***† 0.000***† 0.359 0.745 Asian/ Caucasian 1.000 1.000 1.000 1.000 1.000 0.206 0.032*† 0.020*† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.002**† Hispanic/ Caucasian 1.000 0.273 0.380 0.629 0.007**† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.004**† African American/ Caucasian 0.408 0.166 0.130 0.432 0.094 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† Note. * significance level <.05. ** significance level <.01. *** significance level <.001. 127 Two or Hawaiian/ More/ Caucasian Caucasian 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.668 1.000 0.624 1.000 0.644 1.000 1.000 0.943 0.918 0.765 0.922 0.571 0.637 0.752 0.123 0.940 0.571 0.017*† 0.096 0.000***† 0.002**† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.005**† 0.107 0.069 † Practical significance. 128 Table A5-3. LR Test Statistical Significance Results of Adverse Impact by Comparison Group LR Test Cut-off Score 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 Male/Female 0.096 0.228 0.734 1.000 0.862 0.387 0.408 0.526 0.448 0.670 0.642 0.158 0.110 0.068 0.033*† 0.182 0.225 0.125 0.337 0.333 1.000 0.606 0.055 0.055 0.000***† 0.002**† 0.000***† 0.000***† Total Minority/ Caucasian 1.000 1.000 0.772 0.033*† 0.036*† 0.046*† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† American Indian/ Caucasian 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.358 1.000 1.000 1.000 0.924 0.956 0.852 0.762 0.942 0.761 0.870 0.868 0.882 0.963 0.582 0.284 0.016*† 0.001***† 0.000***† 0.000***† 0.000***† Asian/ Caucasian 1.000 1.000 1.000 0.230 0.103 0.206 0.247 0.023*† 0.038*† 0.077 0.047*† 0.004**† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 128 Hispanic/ Caucasian 1.000 1.000 1.000 0.135 0.170 0.063 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† African American/ Caucasian 1.000 1.000 1.000 0.051 0.066 0.071 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† Hawaiian/ Caucasian 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.684 0.900 0.815 0.741 0.677 0.951 0.852 0.998 0.213 0.117 0.004**† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† Two or More/ Caucasian 1.000 1.000 1.000 1.000 0.668 0.624 0.428 1.000 1.000 0.726 0.526 0.756 0.744 0.930 0.888 0.850 0.894 0.367 0.035*† 0.016*† 0.001***† 0.001***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 129 LR Test Cut-off Score 37 38 39 40 41 42 43 44 45 46 Male/Female 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.001***† 0.210 Total Minority/ Caucasian 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† American Indian/ Caucasian 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.004**† 0.003**† 0.138 0.047*† 0.460 Asian/ Caucasian 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.136 Hispanic/ Caucasian 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.005**† African American/ Caucasian 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† Note. * significance level <.05. ** significance level <.01. *** significance level <.001. 129 Two or Hawaiian/ More/ Caucasian Caucasian 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.000***† 0.091 0.158 0.671 0.094 † Practical significance. REFERENCES Age Discrimination in Employment Act of 1967. (n.d.) In New World Encyclopedia. Retrieved May 2, 2009, from http://www.eeoc.gov/policy/adea.html. Albermarle Paper Company v. Moody, 422 US 405 (1975). American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing.Washington, D. C.: American Educational Research Association. American Psychological Association, Division of Industrial-Organizational Psychology, (1980). Principles for the validation and use of personnel selection procedures (2nd ed.). Berkeley, CA: American Psychological Association. Biddle, D. A. (2006). Adverse Impact and Test Validation; A practitioner’s Guide to Valid and Defensible Employment Testing (2nd ed). Burlington, VT; Gower. Biddle, D. A., & Nooren, P. M. (2006). Validity generalization vs. Title VII: Can employers successfully defend tests without conducting local validation studies? Labor Law Journal, 57, 216-237. Camilli, G. (2006). Test fairness. In R. L. Brennan (Ed.), Educational Measurement (4th ed., Vol. 4, pp. 221-256). Westport: American Council on Education & Praeger Publishers. cxxxi Camilli, G., & Penfield, D. A. (1994). Variance estimation for differential test functioning based on Mantel-Haenszel Statistics. Journal of Educational Measurement, 34(2), 123-139. Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased items. Thousand Oaks, CA: Sage Publications. Chinese Imperial Examination System, Confucianism and the Chinese Scholastic System. California State Poly, Pomona. Retrieved August 24, 2007, from http://www.csupomona.edu/~plin/ls201/confucian3.html. Civil Rights Act of 1964. (n.d.) In New World Encyclopedia. Retrieved May 2, 2009, from http:/www.newworldencyclopedia.org. Civil Service Act of 1883. (n.d.) In Biography of an Ideal. Retrieved May 2, 2009, from http://www.opm.gov/biographyofanideal/PU_CSact.htm. Clauser, B. E., & Hambleton, R. K. (1994). Differential Item Functioning by P. W. Holland & H. Wainer. [Review of the book Differential Item Functioning]. Journal of Educational Measurement, 37(1), 88-92. Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differential item functioning test items. Educational Measurement: Issues and Practice, 17, 31-44. Clauser, B. E., & Mazor, K. M., & Hambleton, R. K. (1998). Influence of the criterion variable on the identification of differential item functioning test item using the Mantel-Haenszel statistic. Applied Psychological Measurement, 15, 353-359. cxxxi cxxxii Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. Lawrence Erlbaum Associates: New Jersey. Contreras v. City of Los Angeles. 656 F.2d 1267, 9th Cir. (1981). Curley, W. E. & Schmitt, A. P. (1993). Revising SAT-Verbal items to eliminate differential item functioning. (ETS Research Report RR-93-61). Princeton, NJ: Educational Testing Service. Donoghue, J. R., & Allen, N. L. (1993). Thin versus thick matching in the MantelHaenszel procedure for detecting DIF. Journal of Educational Statistics, 18(2), 131-154. Donoghue, J. R., Holland, P. W., & Thayer, D. T. (1993). A Monte Carlo study of factors that affect the Mantel-Haenszel and standardization measures of DIF. In P. W. Holland & H. Wainer (Eds.), Differential item functioning: Theory and practice, 137-166. Hillsdale, NJ: Erlbaum Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning, 35-66. Hillsdale, NJ: Erlbaum. Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, & Department of Justice. (1978). Uniform guidelines on employee selection procedures. Federal Register, 43(166), 38290-38309. cxxxii cxxxiii Fisher, R. A. (1926). The arrangement of field experiments, Journal of the Ministry of Agriculture of Great Britain, 33, 503-513. Fisher, R. A. (1956). Statistical Methods and Scientific Inference. New York: Hafner. Freeman, J. (1991). How ‘sex’ got into Title VII: Persistent opportunities as a maker of public policy. Law and Inequality: A Journal of Theory and Practice, 9(2), 163184. Goodwin, A. L. (1997). Assessment for equity and inclusion: embracing all our children. New York: Routledge. Griggs v. Duke Power Co., 401 U.S. 424 (1971). Guion, R. M. (1998). Assessment, measurement, and prediction for personnel decisions. Mahwah, NJ; Lawrence Erlbaum Associates. Heineman, R. A., Peterson, S. A., & Rasmussen, T. H. (1995). American Government (2nd ed.). New York: McGraw-Hill. Hidalgo, M. D., & López-Pina, J. A., (2004). Differential item functioning detection and effect size: A comparison between logistic regression and Mantel-Haenszel procedures. Educational and Psychological Measurement, 64(6), 903-915. Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the MantelHaenszel procedure. In H. Wainer & H. Braun (Eds.), Test Validity ( pp.129-145). Hillsdale, NJ: Erlbaum. Holland, P. W. & Wainer, H. (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers. cxxxiii cxxxiv Ibrahim, A. K. (1992) Distribution and power of selected item bias indices: A Monte Carlo study. Unpublished Ph.D. thesis. University of Ottowa, Ottawa, Ontario. Jodoin, M. G., & Gierl, M. J. (2001). Evaluating type 1 error and power rates using an effect size measure with logistic regression procedures for DIF detections. Applied Measurement in Education, 14, 329-349. Keppel, G. & Wickens, T.D. (2004). Design and analysis: A researchers handbook (4th ed.). New Jersey: Prentice Hall. Lewis, C. (1993). A note on the value of including the studied item in the test score when analyzing test items for DIF. In P. W. Holland & H. Wainer (Eds.), Differential item functioning, 321-335. Hillsdale, NJ: Lawrence Erlbaum Associates. Longford, N. T., Holland, P. W., & Thayer, D. T. (1993). Stability of the MH D-DIF statistics across populations. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 171-196). Hillsdale, NJ: Lawrence Erlbaum Associates. Manley, C. H. (1986). Federal Employee Job Rights: The Pendleton Act of 1881 to the Civil Service Reform Act of 1978. Howard Law Journal, 29(spring). Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22, 719748. cxxxiv cxxxv Mazor, K. M., Clauser, B. E., & Hambleton, R. K. (1994). Identification of nonuniform differential item functioning using a variation of the Mantel-Haenszel procedure. Educational and Psychological Measurement, 54, 284-291. Mazor, K. M., Kanjee, A., & Clauser, B. E. (1995). Using logistic regression and the mantel-haenszel with multiple ability estimates to detect differential item functioning. Journal of Educational Measurement, 32(2), 131-144. Meyers, L. S. (2007). Sources of validity evidence. Unpublished Manuscript. Meyers, L. S. (2007). Reliability, error, and attenuation. Unpublished Manuscript. Meyers, L. S. & Hunley, K. (2008). CSA differential Item Functioning. Unpublished Manuscript, California Department of Corrections and Rehabilitation. Milkovich, G. T., & Wigdor, A. K. (1991). Pay for performance: Evaluating appraisal and merit pay. National Research Council, USA:. Committee on Performance Appraisal. Monahan, P. O., McHorney, C. A., Stump, T. E., & Perkins, A. J. (2007). Odds ratio, delta, ETS Classification, and Standardization Measures of DIF Magnitude for Binary Logistic Regression. Journal of Educational and Behavioral Statistics, 32(1), 92-109. Ray, J.J. (1979). The authoritarian as measured by a personality scale: Solid citizen or misfit? Journal of Clinical Psychology, 35, 744-747. cxxxv cxxxvi Rogers, H. J., & Swaminathan, H. (1993). A comparison of logistic regression and Mantel-Haenszel Procedures for Detecting Differential Item Functioning. Applied Psychological Measurement, 17, 105-116. Scheuneman, J. D. & Slaughter, C. (1991). Issues of test bias, item bias, and group differences and what to do while waiting for the answers. Unpublished manuscript, Educational Testing Service. Schumacker, R. E. (2005). Test bias and differential item functioning. Unpublished manuscript, Applied Measurement Associates. Slocum, S. L., Gelin, M. N., & Zumbo, B. D. (in press). Statistical and graphical modeling to investigate differential item functioning for rating scale and Likert item formats. In B. D. Zumbo (Ed.), Developments in the theories and applications of measurement, evaluation, and research methodology across the disciplines: Vol. 1. Vancouver: Edgeworth Laboratory, University of British Columbia. Swaminathan, H., & Rogers, J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361-370. United States Government Manual (2005-2006). Washington: Government Printing Office, 2005. U.S. v. Commonwealth of Virginia. 569 F2d 1300, (CA-4 1978), 454 F. Supp. 1077. Waisome v. Port Authority. 948 F.2d 1370, 1376, 2d Cir. (1991). cxxxvi cxxxvii Wiberg, M. (2007). Measuring and detecting differential item functioning in criterionreferenced licensing test; a theoretic comparison of methods. Educational Measurement, 60. Umeå: Department of educational measurement, Umeå universitet. Zieky, M. (2003). A DIF Primer. Princeton, NJ: Educational Testing Service. Zumbo, B. D. (1999). A Handbook on the Theory and Methods of Differential Item Functioning (DIF): Logistic Regression Modeling as a Unitary Framework for Binary and Likert-type (Ordinal) Item Scores. Ottawa, ON: Directorate of Human Resources Research and Evaluation, Department of National Defense. Zwick, R. (1990). When do item response function and Mantel-Haenszel Definitions of Differential Item Functioning Coincide? Journal of Educational Statistics, 15(3), 185-197. cxxxvii