TEST OF ENGLISH AS A FOREIGN LANGUAGE
Research Reports
REPORT 65
JUNE 2000

Monitoring Sources of Variability Within the Test of Spoken English Assessment System

Carol M. Myford
Edward W. Wolfe

Educational Testing Service
Princeton, New Jersey

RR-00-6

Copyright © 2000 by Educational Testing Service. All rights reserved.

Abstract

The purposes of this study were to examine four sources of variability within the Test of Spoken English (TSE®) assessment system, to quantify ranges of variability for each source, to determine the extent to which these sources affect examinee performance, and to highlight aspects of the assessment system that might suggest a need for change. Data obtained from the February and April 1997 TSE scoring sessions were analyzed using Facets (Linacre, 1999a).

The analysis showed that, for each of the two TSE administrations, the test usefully separated examinees into eight statistically distinct proficiency levels. The examinee proficiency measures were found to be trustworthy in terms of their precision and stability. It is important to note, though, that the standard error of measurement varies across the score distribution, particularly in the tails of the distribution.

The items on the TSE appear to work together; ratings on one item correspond well to ratings on the other items. Yet none of the items seem to function in a redundant fashion. Ratings on individual items within the test can be meaningfully combined; there is little evidence of psychometric multidimensionality in the two data sets. Consequently, it is appropriate to generate a single summary measure to capture the essence of examinee performance across the 12 items. However, the items differ little in terms of difficulty, thus limiting the instrument's ability to discriminate among levels of proficiency.

The TSE rating scale functions as a five-point scale, and the scale categories are clearly distinguishable. The scale maintains a similar though not identical category structure across all 12 items.

Raters differ somewhat in the levels of severity they exercise when they rate examinee performances, though the vast majority used the scale in a consistent fashion. If examinees' scores were adjusted for differences in rater severity, the scores of two-thirds of the examinees in these administrations would have differed from their raw score averages by 0.5 to 3.6 raw score points. Such differences can have important consequences for examinees whose scores lie in critical decision-making regions of the score distribution.
Key words: oral assessment, second language performance assessment, Item Response Theory (IRT), rater performance, Rasch Measurement, Facets

The Test of English as a Foreign Language (TOEFL®) was developed in 1963 by the National Council on the Testing of English as a Foreign Language. The Council was formed through the cooperative effort of more than 30 public and private organizations concerned with testing the English proficiency of nonnative speakers of the language applying for admission to institutions in the United States. In 1965, Educational Testing Service (ETS®) and the College Board assumed joint responsibility for the program. In 1973, a cooperative arrangement for the operation of the program was entered into by ETS, the College Board, and the Graduate Record Examinations (GRE®) Board. The membership of the College Board is composed of schools, colleges, school systems, and educational associations; GRE Board members are associated with graduate education.

ETS administers the TOEFL program under the general direction of a Policy Council that was established by, and is affiliated with, the sponsoring organizations. Members of the Policy Council represent the College Board, the GRE Board, and such institutions and agencies as graduate schools of business, junior and community colleges, nonprofit educational exchange agencies, and agencies of the United States government.

✥ ✥ ✥

A continuing program of research related to the TOEFL test is carried out under the direction of the TOEFL Committee of Examiners. Its 11 members include representatives of the Policy Council, and distinguished English as a second language specialists from the academic community. The Committee meets twice yearly to review and approve proposals for test-related research and to set guidelines for the entire scope of the TOEFL research program. Members of the Committee of Examiners serve three-year terms at the invitation of the Policy Council; the chair of the committee serves on the Policy Council.

Because the studies are specific to the TOEFL test and the testing program, most of the actual research is conducted by ETS staff rather than by outside researchers. Many projects require the cooperation of other institutions, however, particularly those with programs in the teaching of English as a foreign or second language and applied linguistics. Representatives of such programs who are interested in participating in or conducting TOEFL-related research are invited to contact the TOEFL program office. All TOEFL research projects must undergo appropriate ETS review to ascertain that data confidentiality will be protected.

Current (1999-2000) members of the TOEFL Committee of Examiners are:

Diane Belcher, The Ohio State University
Richard Berwick, Ritsumeikan Asia Pacific University
Micheline Chalhoub-Deville, University of Iowa
JoAnn Crandall (Chair), University of Maryland, Baltimore County
Fred Davidson, University of Illinois at Urbana-Champaign
Glenn Fulcher, University of Surrey
Antony J. Kunnan (Ex-Officio), California State University, LA
Ayatollah Labadi, Institut Superieur des Langues de Tunis
Reynaldo F. Macías, University of California, Los Angeles
Merrill Swain, The University of Toronto
Carolyn E. Turner, McGill University

To obtain more information about TOEFL programs and services, use one of the following:
E-mail: toefl@ets.org
Web site: http://www.toefl.org

Acknowledgments

This work was supported by the Test of English as a Foreign Language (TOEFL) Research Program at Educational Testing Service.
We are grateful to Daniel Eignor, Carol Taylor, Gwyneth Boodoo, Evelyne Aguirre Patterson, Larry Stricker, and the TOEFL Research Committee for helpful comments on an earlier draft of the paper. We especially thank the readers of the Test of Spoken English and the program's administrative personnel—Tony Ostrander, Evelyne Aguirre Patterson, and Pam Esbrandt—without whose cooperation this project could never have succeeded.

Table of Contents

Introduction
Rationale for the Study
Review of the Literature
Method
    Examinees
    Instrument
    Raters and the Rating Process
    Procedure
Results
    Examinees
    Items
    TSE Rating Scale
    Raters
Conclusions
    Examinees
    Items
    TSE Rating Scale
    Raters
    Next Steps
References
Appendix

List of Tables

Table 1. Distribution of TSE Examinees Across Geographic Locations
Table 2. TSE Rating Scale
Table 3. Misfitting and Overfitting Examinees from the February and April 1997 TSE Administrations
Table 4. Rating Patterns and Fit Indices for Selected Examinees
Table 5. Examinees from the February 1997 TSE Administration Identified as Having Suspect Rating Patterns
Table 6. Examinees from the April 1997 TSE Administration Identified as Having Suspect Rating Patterns
Table 7. Item Measurement Report for the February 1997 TSE Administration
Table 8. Item Measurement Report for the April 1997 TSE Administration
Table 9. Rating Scale Category Calibrations for the February 1997 TSE Items
Table 10. Rating Scale Category Calibrations for the April 1997 TSE Items
Table 11. Average Examinee Proficiency Measures and Outfit Mean-Square Indices for the February 1997 TSE Items
Table 12. Average Examinee Proficiency Measures and Outfit Mean-Square Indices for the April 1997 TSE Items
Table 13. Frequency and Percentage of Examinee Ratings in Each Category for the February 1997 TSE Items
Table 14. Frequency and Percentage of Examinee Ratings in Each Category for the April 1997 TSE Items
Table 15. Rating Patterns and Fit Indices for Selected Examinees
Table 16. Summary Table for Selected TSE Raters
Table 17. Effects of Adjusting for Rater Severity on Examinee Raw Score Averages, February 1997 TSE Administration
Table 18. Effects of Adjusting for Rater Severity on Examinee Raw Score Averages, April 1997 TSE Administration
Table 19. Frequencies and Percentages of Rater Mean-Square Fit Indices for the February 1997 TSE Data
Table 20. Frequencies and Percentages of Rater Mean-Square Fit Indices for the April 1997 TSE Data
Table 21. Frequencies of Inconsistent Ratings for February 1997 TSE Raters
Table 22. Frequencies of Inconsistent Ratings for April 1997 TSE Raters
Table 23. Rater Effect Criteria
Table 24. Rater Effects for February 1997 TSE Data
Table 25. Rater Effects for April 1997 TSE Data
Appendix. TSE Band Descriptor Chart

List of Figures

Figure 1. Map from the Facets Analysis of the Data from the February 1997 TSE Administration
Figure 2. Map from the Facets Analysis of the Data from the April 1997 TSE Administration
Figure 3. Category Probability Curves for Items 4 and 8 (February Test)
Figure 4. Category Probability Curves for Items 4 and 7 (April Test)

Introduction

Those in charge of monitoring quality control for complex assessment systems need information that will help them determine whether all aspects of the system are working as intended. If there are problems, they must pinpoint those particular aspects of the system that are out of synch so that they can take meaningful, informed steps to improve the system. They need answers to critical questions such as: Do some raters rate more severely than other raters? Do any raters use the rating scales inconsistently? Are there examinees who exhibit unusual profiles of ratings across items? Are the rating scales functioning appropriately?

Answering these kinds of questions requires going beyond interrater reliability coefficients and analysis of variance main effects to understand the impact of the assessment system on individual raters, examinees, assessment items, and rating scales. Data analysis approaches that provide only group-level statistics are of limited help when one's goal is to refine a complex assessment system. What is needed for this purpose is information at the level of the individual rater, examinee, item, and rating scale. The present study illustrates an approach to the analysis of rating data that provides this type of information. We analyzed data from two 1997 administrations of the revised Test of Spoken English (TSE®) using Facets (Linacre, 1999a), a Rasch-based computer software program, to provide information to TSE program personnel for quality control monitoring.

At the beginning of this report, we provide a rationale for the study and lay out the questions that focused our investigation. Next, we present a review of the literature, discussing previous studies that have used many-facet Rasch measurement to investigate complex rating systems for evaluating speaking and writing. In the method section of the paper we describe the examinees who took part in this study, the Test of Spoken English, the TSE raters, and the rating procedure they employ. We then discuss the statistical analyses we performed, presenting the many-facet Rasch model and its capabilities. The results section is divided into several subsections. First, we examine the map that is produced as part of Facets output. The map is perhaps the most informative piece of output from the analysis, because it allows us to view all the facets of our analysis—examinees, TSE items, raters, and the TSE rating scale—within a single frame of reference. The remainder of the results section is organized around the specific quality control questions we explored with the Facets output. We first answer a set of questions about the performance of examinees. We then look at how the TSE items performed. Next, we turn our attention to questions about the TSE rating scale. And lastly, we look at a set of questions that relate to raters and how they performed. We then draw conclusions from our study, suggesting topics for future research.
Rationale for the Study

At its February 1996 meeting, the TSE Committee set forth a validation research agenda to guide a long-term program for the collection of evidence to substantiate interpretations made about scores on the revised TSE (TSE Committee, 1996). The committee laid out as first priority a set of interrelated studies that focus on the generalizability of test scores. Of particular importance were studies to determine the extent to which various factors (such as item/task difficulty and rater severity) affect examinee performance on the TSE. The committee suggested that Facets analyses and generalizability studies be carried out to monitor these sources of variability as they operate in the TSE setting.

In response to the committee's request, we conducted Facets analyses of data obtained from two administrations of the revised TSE—February and April 1997. The purpose of the study was to monitor four sources of variability within the TSE assessment system: (1) examinees, (2) TSE items, (3) the TSE rating scale, and (4) raters. We sought to quantify expected ranges of variability for each source, to determine the extent to which these sources affect examinee performance, and to highlight aspects of the TSE assessment system that might suggest a need for change. Our study was designed to answer the following questions about the sources of variability:

Examinees

• How much variability is there across examinees in their levels of proficiency? Who differs more—examinees in their levels of proficiency, or raters in their levels of severity?
• To what extent has the test succeeded in separating examinees into distinct strata of proficiency? How many statistically different levels of proficiency are identified by the test?
• Are the differences between examinee proficiency measures mainly due to measurement error or to differences in actual proficiency?
• How accurately are examinees measured? How much confidence can we have in the precision and stability of the measures of examinee proficiency?
• Do some examinees exhibit unusual profiles of ratings across the 12 TSE items? Does the current procedure for identifying and resolving discrepancies successfully identify all cases in which rater agreement is "out of statistical control" (Deming, 1975)?

Items

• Is it harder for examinees to get high ratings on some TSE items than others? To what extent do the 12 TSE items differ in difficulty?
• Can we calibrate ratings from all 12 TSE items, or do ratings on certain items frequently fail to correspond to ratings on other items (that is, are there certain items that do not "fit" with the others)? Can a single summary measure capture the essence of examinee performance across the items, or is there evidence of possible psychometric multidimensionality (Henning, 1992; McNamara, 1991; McNamara, 1996) in the data, and thus perhaps a need to report a profile of scores rather than a single summary score if systematic profile differences appear among examinees who have the same overall summary score?
• Are all 12 TSE items equally discriminating? Is item discrimination a constant for all 12 items, or does it vary across items?

TSE Rating Scale

• Are the five scale categories on the TSE rating scale appropriately ordered? Is the rating scale functioning properly as a five-point scale? Are the scale categories clearly distinguishable?

Raters

• Do TSE raters differ in the severity with which they rate examinees?
• If raters differ in severity, how do those differences affect examinee scores?
• How interchangeable are the raters?
• Do TSE raters use the TSE rating scale consistently?
• Are there raters who rate examinee performances inconsistently?
• Are there any overly consistent raters whose ratings tend to cluster around the midpoint of the rating scale and who are reluctant to use the endpoints of the scale? Are there raters who tend to give an examinee ratings that differ less than would be expected across the 12 items? Are there raters who cannot effectively differentiate between examinees in terms of their levels of proficiency?

Review of the Literature

Over the last several years, a number of performance assessment programs interested in examining and understanding sources of variability in their assessment systems have been experimenting with Linacre's (1999a) Facets computer program as a monitoring tool (see, for example, Heller, Sheingold, & Myford, 1998; Linacre, Engelhard, Tatum, & Myford, 1994; Lunz & Stahl, 1990; Myford & Mislevy, 1994; Paulukonis, Myford, & Heller, in press). In this study, we build on the pioneering efforts of researchers who are employing many-facet Rasch measurement to answer questions about complex rating systems for evaluating speaking and writing. These researchers have raised some critical issues that they are investigating with Facets. For example:

• Can rater training enable raters to function interchangeably (Weigle, 1994)?
• Can rater training eliminate differences between raters in the degree of severity they exercise (Lumley & McNamara, 1995; Marr, 1994; McNamara & Adams, 1991; Tyndall & Kenyon, 1995; Wigglesworth, 1993)?
• Are rater characteristics stable over time (Lumley & McNamara, 1995; Marr, 1994; Myford, Marr, & Linacre, 1996)?
• What background characteristics influence the ratings raters give (Brown, 1995; Myford, Marr, & Linacre, 1996)?
• Do raters differ systematically in their use of the points on a rating scale (McNamara & Adams, 1991)?
• Do raters and tasks interact to affect examinee scores (Bachman, Lynch, & Mason, 1995; Lynch & McNamara, 1994)?

Several researchers examined the rating behavior of individual raters of the old Test of Spoken English and reported differences between raters in the degree of severity they exercised when rating examinee performances (Bejar, 1985; Marr, 1994), but no studies have as yet compared raters of the revised TSE. Bejar (1985) compared the mean ratings of individual TSE raters and found that some raters tended to give lower ratings than others; in fact, the raters Bejar studied did this consistently across all four scales of the old TSE (pronunciation, grammar, fluency, and comprehension). More recently, Marr (1994) used Facets to analyze data from two administrations of the old TSE, in 1992 and 1993, and found that there was significant variation in rater severity within each of the two administrations. She reported that more than two thirds of the examinee scores in the first administration would have been changed if adjustments had been made for rater severity, while more than half of the examinee scores would have been altered in the second administration. In her study, Marr also looked at the stability of the rater severity measures across administrations for the 33 raters who took part in both scoring sessions. She found that the correlation between the two sets of rater severity measures was only 0.47.
She noted that the rater severity estimates were based on each rater having rated an average of only 30 examinees, and each rater was paired with fewer than half of the other raters in the sample. This suggests, Marr hypothesized, that much of what appeared to be systematic variance associated with differences in rater severity may instead have been random error. She concluded that the operational use of Facets to adjust for rater effects would require some important changes in the existing TSE rating procedures: A means would need to be found to create greater overlap among raters so that all raters could be connected in the rating design (the ratings of eight raters had to be deleted from her analysis because they were insufficiently connected).1 If this were accomplished, one might then have greater confidence in the stability of the estimates of rater severity both within and across TSE administrations.

1 Disconnection occurs when a judging plan for data collection is instituted that, because of its deficient structure, makes it impossible to place all raters, examinees, and items in one frame of reference so that appropriate comparisons can be drawn (Linacre, 1994). The allocation of raters to items and examinees must result in a network of links that is complete enough to connect all the raters through common items and common examinees (Lunz, Wright, & Linacre, 1990). Otherwise, ambiguity in interpretation results. If there are insufficient patterns of non-extreme high ratings and non-extreme low ratings to be able to connect two elements (e.g., two raters, two examinees, two items), then the two elements will appear in separate subsets of Facets output as "disconnected." Only examinees that are in the same subset can be directly compared. Similarly, only raters (or items) that are in the same subset can be directly compared. Attempts to compare examinees (or raters, or items) that appear in two or more different subsets can be misleading.

In the present study of the revised TSE, we worked with somewhat larger sample sizes than Marr used. Marr's November 1992 sample had 74 raters and 1,158 examinees, and her May 1993 sample had 54 raters and 785 examinees. Our February 1997 sample had 66 raters and 1,469 examinees, and our April 1997 sample had 74 raters and 1,446 examinees. Also, while both of Marr's data sets had disconnected subsets of raters and examinees in them, there was no disconnection in our two data sets.

Method

Examinees

Examinees from the February and April 1997 TSE administrations (N = 1,469 and 1,446, respectively) were generally between 20 and 39 years of age (83%). Fewer examinees were under 20 (about 7%) or over 40 (about 10%). These percentages were consistent across the two test dates. About half of the examinees for both administration dates were female (53% in February and 47% in April). Over half of the examinees (55% in February and 66% in April) took the TSE for professional purposes (for example, for selection and certification in health professions, such as medicine, nursing, pharmacy, and veterinary medicine), with the remaining examinees taking the TSE for academic purposes (primarily for selection for international teaching assistantships). Table 1 shows the percentage of examinees taking the examination in various locations around the world. This table reveals that, for both examination dates, a majority of examinees were from eastern Asia, and most of the remaining examinees were from Europe.
Table 1
Distribution of TSE Examinees Across Geographic Locations

                     February 1997        April 1997
Location               N       %            N       %
Eastern Asia          806     56%          870     63%
Europe                244     17%          197     14%
Africa                113      8%           52      4%
Middle East           102      7%          128      9%
South America          74      5%           47      3%
North America          57      4%           47      3%
Western Asia           35      2%           37      3%

Instrument

The purpose of the revised TSE is the same as that of the original TSE. It is a test of general speaking ability designed to evaluate the oral language proficiency of nonnative speakers of English who are at or beyond the postsecondary level of education (TSE Program Office, 1995). The underlying construct for the revised test is communicative language ability, which is defined to include strategic competence and four language competencies: linguistic competence, discourse competence, functional competence, and sociolinguistic competence (see Appendix).

The TSE is a semi-direct speaking test that is administered via audio-recording equipment using recorded prompts and printed test booklets. Each of the 12 items that appears on the test consists of a single task that is designed to elicit one of 10 language functions in a particular context or situation. The test lasts about 20 minutes and is individually administered. Examinees are given a test booklet and asked to listen to and read general instructions. A tape-recorded narrator describes the test materials and asks the examinee to perform several tasks in response to these materials. For example, a task may require the examinee to interpret graphical material, tell a short story, or provide directions to someone. After hearing the description of the task, the examinee is encouraged to construct as complete a response as possible in the time allotted. The examinee's oral responses are recorded, and each examinee's test score is based on an evaluation of the resulting speech sample.

Raters and the Rating Process

All TSE raters are experienced teachers and specialists in the field of English or English as a second language who teach at the high school or college level. Teachers interested in becoming TSE raters undergo a thorough training program designed to qualify them to score for the TSE program. The training program involves becoming familiar with the TSE rating scale (see Table 2), the TSE communication competencies, and the corresponding band descriptors (see Appendix). The TSE trainer introduces and discusses a set of written general guidelines that raters are to follow in scoring the test. For example, these include guidelines for arriving at a holistic score for each item, guidelines describing what materials the raters should refer to while scoring, and guidelines explaining the process to be used in listening to a tape. Additionally, the trainees are introduced to a written set of item-level guidelines to be used in scoring. These describe in some detail how to handle a number of recurring scoring challenges TSE raters face. For example, they describe how raters should handle tapes that suffer from mechanical problems, performances that fluctuate between two bands on the rating scale across all competencies, incomplete responses to a task, and off-topic responses.

After the trainees have been introduced to all of the guidelines for scoring, they then practice using the rating scale to score audiotaped performances. Prior to the training, those in charge of training select benchmark tapes drawn from a previous test administration that show performance at the various band levels.
They prepare written rationales that explain why each tape exemplifies performance at that particular level. Rater trainees listen to the benchmark tapes and practice scoring them, checking their scores against the benchmark scores and reading the scoring rationales to gain a better understanding of how the TSE rating scale functions. At the end of this qualifying session, each trainee independently rates six TSE tapes. They then present the scores they assign each tape to the TSE program for evaluation. To qualify as a TSE rater, a trainee can have only one discrepant score—where the discrepancy is a difference of more than one bandwidth (that is, 10 points)—among the six rated tapes. If the scores the trainee assigns meet this requirement, then the trainee is deemed "certified" to score the TSE. The rater can then be invited to participate in subsequent operational TSE scoring sessions.

At the beginning of each operational TSE scoring session, the raters who have been invited to participate undergo an initial recalibration training session to refamiliarize them with the TSE rating scale and to calibrate to the particular test form they will be scoring. The recalibration training session serves as a means of establishing on-going quality control for the program.

During a TSE scoring session, examinee audiotapes are identified by number only and are randomly assigned to raters. Two raters listen to each tape and independently rate it (neither rater knows the scores the other rater assigned). The raters evaluate an examinee's performance on each item using the TSE holistic five-point rating scale; they use the same scale to rate all 12 items appearing on the test. Each point on the scale is defined by a band descriptor that corresponds to the four language competencies that the test is designed to measure (functional competence, sociolinguistic competence, discourse competence, and linguistic competence), and strategic competence. Raters assign a holistic score from one of the five bands for each of the 12 items. As they score, the raters consider all relevant competencies, but they do not assess each competency separately. Rather, they evaluate the combined impact of all five competencies when they assign a holistic score for any given item.

To arrive at a final score for an examinee, the 24 scores that the two raters gave are compared. If the averages of the two raters differ by 10 points or more overall, then a third rater (usually a very experienced TSE rater) rates the audiotape, unaware of the previously assigned scores. The final score is derived by resolving the differences among the three sets of scores: the three sets of scores are compared, and the closest pair is averaged to calculate the final reported score. The overall score is reported on a scale that ranges from 20 to 60, in increments of five (20, 25, 30, 35, 40, 45, 50, 55, 60).

Procedure

For this study, we used rating data obtained from the two operational TSE scoring sessions described earlier. To analyze the data, we employed Facets (Linacre, 1999a), a Rasch-based rating scale analysis computer program.
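To make the scoring and resolution rules described under "Raters and the Rating Process" concrete, the following minimal sketch works through them in code. The function names and the optional third-rater callback are hypothetical illustrations rather than the operational TSE scoring software: each rater's 12 item ratings are averaged, a third rating is requested when the two averages differ by 10 or more points, the closest pair of averages is then averaged, and the result is rounded to the nearest 5 on the 20-60 reporting scale.

    from itertools import combinations

    def rater_average(item_ratings):
        """Average of one rater's 12 holistic item ratings (each 20, 30, 40, 50, or 60)."""
        return sum(item_ratings) / len(item_ratings)

    def round_to_reported_scale(score):
        """Round to the nearest 5 and keep the result within the 20-60 reporting range."""
        return min(60, max(20, 5 * round(score / 5)))

    def final_reported_score(rater1_ratings, rater2_ratings, get_third_ratings=None):
        """Two-rater rule with third-rater adjudication when the averages differ by
        10 points or more; get_third_ratings stands in for obtaining the third rating."""
        averages = [rater_average(rater1_ratings), rater_average(rater2_ratings)]
        if abs(averages[0] - averages[1]) >= 10 and get_third_ratings is not None:
            averages.append(rater_average(get_third_ratings()))
            # Resolution: average the closest pair among the three rater averages.
            closest = min(combinations(averages, 2), key=lambda pair: abs(pair[0] - pair[1]))
            return round_to_reported_scale(sum(closest) / 2)
        return round_to_reported_scale(sum(averages) / 2)

For example, the rater averages of 35.83 and 57.50 reported later for Examinee #1060 differ by more than 10 points, so a tape like that one would be routed to a third rater before a final score was computed.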
The Statistical Analyses

Facets is a generalization of the Rasch (1980) family of measurement models that makes possible the analysis of examinations that have multiple potential sources of measurement error (such as items, raters, and rating scales).2 Because our goal was to gain an understanding of the complex rating procedure employed in the TSE setting, we needed to consider more measurement facets than the traditional two—items and examinees—taken into account by most measurement models. By employing Facets, we were able to establish a statistical framework for analyzing TSE rating data. That framework enabled us to summarize overall rating patterns in terms of main effects for the rater, examinee, and item facets. Additionally, we were able to quantify the weight of evidence associated with each of these facets and highlight individual rating patterns and rater-item combinations that were unusual in light of expected patterns.

2 See McNamara (1996, pp. 283-287) for a user-friendly description of the various models in this family and the types of situations in which each model could be used.

Table 2
TSE Rating Scale

Score
60  Communication almost always effective: task performed very competently
    Functions performed clearly and effectively
    Appropriate response to audience/situation
    Coherent, with effective use of cohesive devices
    Use of linguistic features almost always effective; communication not affected by minor errors

50  Communication generally effective: task performed competently
    Functions generally performed clearly and effectively
    Generally appropriate response to audience/situation
    Coherent, with some effective use of cohesive devices
    Use of linguistic features generally effective; communication generally not affected by errors

40  Communication somewhat effective: task performed somewhat competently
    Functions performed somewhat clearly and effectively
    Somewhat appropriate response to audience/situation
    Somewhat coherent, with some use of cohesive devices
    Use of linguistic features somewhat effective; communication sometimes affected by errors

30  Communication generally not effective: task generally performed poorly
    Functions generally performed unclearly and ineffectively
    Generally inappropriate response to audience/situation
    Generally incoherent, with little use of cohesive devices
    Use of linguistic features generally poor; communication often impeded by major errors

20  No effective communication: no evidence of ability to perform task
    No evidence that functions were performed
    No evidence of ability to respond appropriately to audience/situation
    Incoherent, with no use of cohesive devices
    Use of linguistic features poor; communication ineffective due to major errors

Copyright © 1996 by Educational Testing Service, Princeton, NJ. All rights reserved. No reproduction in whole or in part is permitted without express written permission of the copyright owner.

In the many-facet Rasch model (Linacre, 1989), each element of each facet of the testing situation (that is, each examinee, rater, item, rating scale category, etc.) is represented by one parameter that represents proficiency (for examinees), severity (for raters), difficulty (for items), or challenge (for rating scale categories).
The Partial Credit form of the many-facet Rasch model that we used for this study was:

    \log\left(\frac{P_{njik}}{P_{njik-1}}\right) = B_n - C_j - D_i - F_{ik}          (1)

where

    P_{njik}   = the probability of examinee n being awarded a rating of k when rated by rater j on item i
    P_{njik-1} = the probability of examinee n being awarded a rating of k-1 when rated by rater j on item i
    B_n  = the proficiency of examinee n
    C_j  = the severity of rater j
    D_i  = the difficulty of item i
    F_{ik} = the difficulty of achieving a score within a particular score category (k), averaged across all raters, for each item separately

When we conducted our analyses, we separated out the contribution of each facet we included and examined it independently of other facets so that we could better understand how the various facets operate in this complex rating procedure. For each element of each facet in this analysis, the computer program provides a measure (a logit estimate of the calibration), a standard error (information about the precision of that logit estimate), and fit statistics (information about how well the data fit the expectations of the measurement model).

Results

We have structured our discussion of research findings around the specific questions we explored with the Facets output. But before we turn to the individual questions, we provide a brief introduction to the process of interpreting Facets output. In particular, we focus on the map that is perhaps the single most important and informative piece of output from the computer program, because it enables us to view all the facets of our analysis at one time. The maps shown as Figures 1 and 2 display all facets of the analysis in one figure for each TSE administration and summarize key information about each facet. The maps highlight results from more detailed sections of the Facets output for examinees, TSE items, raters, and the TSE rating scale. (For the remainder of this discussion, we will refer only to Figure 1. Figure 2 tells much the same story. The interested reader can apply the same principles described below when interpreting Figure 2.)

The Facets program calibrates the raters, examinees, TSE items, and rating scales so that all facets are positioned on the same scale, creating a single frame of reference for interpreting the results from the analysis. That scale is in log-odds units, or "logits," which, under the model, constitute an equal-interval scale with respect to appropriately transformed probabilities of responding in particular rating scale categories. The first column in the map displays the logit scale. Having a single frame of reference for all the facets of the rating process facilitates comparisons within and between the facets.

The second column displays the scale that the TSE program uses to report scores to examinees. The TSE program averages the 24 ratings that the two raters assign to each examinee, and a single score of 20 to 60, rounded to the nearest 5 (thus, possible scores include 20, 25, 30, 35, 40, 45, 50, 55, and 60), is reported.

The third column displays estimates of examinee proficiency on the TSE assessment—single-number summaries on the logit scale of each examinee's tendency to receive low or high ratings across raters and items. We refer to these as "examinee proficiency measures." Higher scoring examinees appear at the top of the column, while lower scoring examinees appear at the bottom of the column. Each star represents 12 examinees, and a dot represents fewer than 12 examinees.
These measures appear as a fairly symmetrical platykurtic distribution, resembling a bell-shaped normal curve—although this result was in no way preordained by the model or the estimation procedure. Skewed and multi-modal distributions have appeared in other model applications.

The fourth column compares the TSE raters in terms of the level of severity or leniency each exercised when rating oral responses to the 12 TSE items. Because more than one rater rated each examinee's responses, raters' tendencies to rate responses higher or lower on average could be estimated. We refer to these as "rater severity measures." In this column, each star represents 2 raters. More severe raters appear higher in the column, while more lenient raters appear lower. When we examine Figure 1, we see that the harshest rater had a severity measure of about 1.5 logits, while the most lenient rater had a severity measure of about -2.0 logits.

The fifth column compares the 12 items that appeared on the February 1997 TSE in terms of their relative difficulties. Items appearing higher in the column were more difficult for examinees to receive high ratings on than items appearing lower in the column. Items 7, 10, and 11 were the most difficult for examinees, while items 4 and 12 proved easiest.

Columns 6 through 17 display the five-point TSE rating scale as raters used it to score examinee responses to each of the 12 items. The horizontal lines across each column indicate the point at which the likelihood of getting the next higher rating begins to exceed the likelihood of getting the next lower rating for a given item. For example, when we examine Figure 1, we see that examinees with proficiency measures from about -5.5 logits up through about -3.5 logits are more likely to receive a rating of 30 than any other rating on item 1; examinees with proficiency measures between about -3.5 logits and about 2.0 logits are most likely to receive a rating of 40 on item 1; and so on.

The bottom rows of Figure 1 provide the mean and standard deviation of the distribution of estimates for examinees, raters, and items. When conducting a Facets analysis involving these three facets, it is customary to center the rater and item facets, but not the examinee facet. By centering facets, one establishes the origin of the scale. As Linacre (1994) cautions, "in most analyses, if more than one facet is non-centered in an analysis, then the frame of reference is not sufficiently constrained, and ambiguity results" (p. 27).

Examinees

How much variability is there across examinees in their levels of proficiency? Who differs more—examinees in their levels of proficiency or raters in their levels of severity?

Looking at Figures 1 and 2, we see that the distribution of rater severity measures is much narrower than the distribution of examinee proficiency measures. In Figure 1, examinee proficiency measures show an 18.34-logit spread, while rater severity measures show only a 3.55-logit spread. The range of examinee proficiency measures is about 5.2 times as wide as the range of rater severity measures. Similarly, in Figure 2 the rater severity measures range from -1.89 logits to 1.30 logits, a 3.19-logit spread, while the examinee proficiency measures range from -5.43 logits to 11.69 logits, a 17.12-logit spread. Here, the range of examinee proficiency measures is about 5.4 times as wide as the range of rater severity measures.
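As a quick arithmetic check of those ratios, using the logit ranges just cited:

    \frac{18.34}{3.55} \approx 5.2, \qquad \frac{11.69 - (-5.43)}{1.30 - (-1.89)} = \frac{17.12}{3.19} \approx 5.4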
A more typical finding of studies of rater behavior is that the range of examinee proficiency is about twice as wide as the range of rater severity (J. M. Linacre, personal communication, March 13, 1995). The finding that the range of TSE examinee proficiency measures is about five times as wide as the range of TSE rater severity is an important one, because it suggests that the impact of individual differences in rater severity on examinee scores is likely to be relatively small. By contrast, suppose that the range of examinee proficiency measures had been twice as wide as the range of rater severity. In this instance, the impact of individual differences in rater severity on examinee scores would be much greater. The particular raters who rated individual examinees would matter more, and a more compelling case could be made for the need to adjust examinee scores for individual differences in rater severity in order to minimize these biasing effects.

To what extent has the test succeeded in separating examinees into distinct strata of proficiency? How many statistically different levels of proficiency are identified by the test?

Facets reports an examinee separation ratio (G) which is a ratio scale index comparing the "true" spread of examinee proficiency measures to their measurement error (Fisher, 1992). To be useful, a test must be able to separate examinees by their performance (Stone & Wright, 1988). One can determine the number of statistically distinct proficiency strata into which the test has succeeded in separating examinees (in other words, how well the test separates the examinees in a particular sample) by using the formula (4G + 1)/3. When we apply this formula, we see that the samples of examinees that took the TSE in either February 1997 or April 1997 could each be separated into eight statistically distinct levels of proficiency.

[Figure 1. Map from the Facets analysis of the data from the February 1997 TSE administration. The map places all facets on the common logit scale; its columns show the logit scale, the TSE reported-score scale, the examinee proficiency measures (high scores at the top, low scores at the bottom), the rater severity measures (severe at the top, lenient at the bottom), the item difficulties (difficult at the top, easy at the bottom), and the rating scale category structure for each of the 12 items. Summary statistics: examinee mean 2.48 (S.D. 2.88); rater mean .00 (S.D. .72); item mean .00 (S.D. .30).]

[Figure 2. Map from the Facets analysis of the data from the April 1997 TSE administration, laid out in the same way as Figure 1. Summary statistics: examinee mean 2.24 (S.D. 2.91); rater mean .00 (S.D. .66); item mean .00 (S.D. .30).]

Are the differences between examinee proficiency measures mainly due to measurement error or to differences in actual proficiency?

Facets also reports the reliability with which the test separates the sample of examinees—that is, the proportion of observed sample variance which is attributable to individual differences between examinees (Wright & Masters, 1982). The examinee separation reliability coefficient represents the ratio of variance attributable to the construct being measured (true score variance) to the observed variance (true score variance plus the error variance). Unlike interrater reliability, which is a measure of how similar rater measures are, the separation reliability is a measure of how different the examinee proficiency measures are (Linacre, 1994). For the February and April TSE data, the examinee separation reliability coefficients were both 0.98, indicating that the true variance far exceeded the error variance in the examinee proficiency measures.3

3 According to Fisher (1992), a separation reliability less than 0.5 would indicate that the differences between examinee proficiency measures were mainly due to measurement error and not to differences in actual proficiency.

How accurately are examinees measured? How much confidence can we have in the precision and stability of the measures of examinee proficiency?

Facets reports an overall measure of the precision and stability of the examinee proficiency measures that is analogous to the standard error of measurement in classical test theory. The standard error of measurement depicts the extent to which we might expect an examinee's proficiency estimate to change if different raters or items were used to estimate that examinee's proficiency. The average standard error of measurement for the examinees that took the February 1997 TSE was 0.44; the average standard error for examinees that took the April 1997 TSE was 0.45. Unlike the standard error of measurement in classical test theory, which estimates a single measure of precision and stability for all examinees, Facets provides a separate, unique estimate for each examinee.
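The separation quantities reported above can be illustrated with a short sketch. It follows the usual Rasch separation definitions (true variance taken as observed variance minus error variance, G as the ratio of the true spread to the root-mean-square error, strata computed as (4G + 1)/3, and reliability as true variance over observed variance). The input values in the example call are hypothetical and are meant only to show the relationships among these quantities, not to reproduce the Facets output for these data.

    import math

    def separation_statistics(observed_sd, rmse):
        """Separation ratio G, proficiency strata, and separation reliability, given the
        observed spread of the examinee measures and the root-mean-square standard error."""
        true_variance = max(observed_sd ** 2 - rmse ** 2, 0.0)  # remove error variance
        G = math.sqrt(true_variance) / rmse                     # "true" spread vs. measurement error
        strata = (4 * G + 1) / 3                                # statistically distinct proficiency levels
        reliability = true_variance / observed_sd ** 2          # true variance / observed variance
        return G, strata, reliability

    # Hypothetical inputs in the general range of the examinee facet in this study;
    # the operational figures come from Facets itself, not from this sketch.
    G, strata, reliability = separation_statistics(observed_sd=2.9, rmse=0.45)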
For illustrative purposes, we focused on the precision and stability of the "average" examinee. That is, we determined 95% confidence intervals for examinees with proficiency measures near the mean of the examinee proficiency distribution for the February and April data. For the February data, the mean examinee proficiency measure was 2.47 logits, and the standard error for that measure was 0.41. Therefore, we would expect the average examinee's true proficiency measure to lie between raw scores of 49.06 and 52.83 on the TSE scale 95% of the time. For the April data, the mean examinee proficiency measure was 2.24 logits, and the standard error of that measure was 0.40. Therefore, we would expect the average examinee's true proficiency measure to lie between raw scores of 48.63 and 52.07 on the TSE scale 95% of the time. To summarize, we would expect an average examinee's true proficiency to lie within about two raw score points of his or her reported score most of the time.

It is important to note, however, that the size of the standard error of measurement varies across the proficiency distribution, particularly at the tails of the distribution. In this study, examinees at the upper end of the proficiency distribution tended to have larger standard errors on average than examinees in the center of the distribution. For example, examinees taking the TSE in February who had proficiency measures in the range of 5.43 logits to 10.30 logits (that is, they would have received reported scores in the range of 55 to 60 on the TSE scale) had standard errors for their measures that ranged from 0.40 to 1.03. By contrast, examinees at the lower end of the proficiency distribution tended to have smaller standard errors on average than examinees in the center of the distribution. For example, examinees taking the TSE in February who had proficiency measures in the range of -6.68 logits to -3.48 logits (that is, they would have received reported scores in the range of 20 to 30 on the TSE scale) had standard errors for their measures that ranged from 0.36 to 0.38. Thus, for institutions setting their own cutscores on the TSE, it would be important to take into consideration the standard errors for individual examinee proficiency measures, particularly for those examinees whose scores lie in critical decision-making regions of the score distribution, and not to assume that the standard error of measurement is constant across that distribution.

Do some examinees exhibit unusual profiles of ratings across the 12 TSE items? Does the current procedure for identifying and resolving discrepancies successfully identify all cases in which rater agreement is "out of statistical control" (Deming, 1975)?

As explained earlier, when the averages of two raters' scores for a single examinee differ by more than 10 points, usually a very experienced TSE rater rates the audiotape, unaware of the scores previously assigned. The three sets of scores are compared, and the closest pair is used to calculate the final reported score (TSE Program Office, 1995). We used Facets to determine whether this third-rating adjudication procedure is successful in identifying problematic ratings. Facets produces two indices of the consistency of agreement across raters for each examinee.
The indices are reported as fit statistics—weighted and unweighted, standardized and unstandardized. In this report, we make several uses of these indices. First, we discuss the unstandardized, information-weighted mean-square index, or infit, and explain how one can use that index to identify examinees who exhibit unusual profiles of ratings across the 12 TSE items. We then examine some examples of score patterns that exhibit misfit to show how one can diagnose the nature of misfit. Finally, we compare decisions that would be made about the validity of examinee scores based on the standardized infit index to the decisions that would be made about the validity of examinee scores based on the current TSE procedure for identifying discrepantly rated examinees. First, however, we briefly describe how the unstandardized infit mean-square index is interpreted. The expectation for this index is 1; the range is 0 to infinity. The higher the infit mean-square index, the more variability we can expect in the examinee’s rating pattern, even when rater severity is taken into account. When raters are fairly similar in the degree of severity they exercise, an infit mean-square index less than 1 indicates little variation in the examinee’s pattern of ratings (a "flat-line" profile consisting of very similar or identical ratings across the 12 TSE items from the two raters), while an infit mean-square index greater than 1 indicates more than typical variation in the ratings (that is, a set of ratings with one or more unexpected or surprising ratings, aberrant ratings that don’t seem to "fit" with the others). Generally, infit mean-square indices greater than 1 are more problematic than infit indices less than 1. There are no hard-and-fast rules for setting upper- and lower-control limits for the examinee infit meansquare index. Some testing programs use an upper-control limit of 2 or 3 and a lower-control limit of .5; more stringent limits might be instituted if the goal were to strive to reduce 15 significantly variability within the system. The more extreme the infit mean-square index, the greater potential gains for improving the system—either locally, by rectifying an aberrant rating, or globally, by gaining insights to improve training, rating, or logistic procedures. For this study, we adopted an upper-control limit for the examinee infit mean-square index of 3.0, a liberal control limit to accommodate some variability in each examinee’s rating pattern, and a lower-control limit of 0.5. We wished to allow for a certain amount of variation in raters’ perspectives, yet still catch cases in which rater disagreement was problematic. An infit mean-square index beyond the upper-control limit signals an examinee performance that might need another listening before the final score report is issued, particularly if the examinee’s score is near a critical decision-making point in the score distribution. Table 3 summarizes examinee infit information from the February and April 1997 TSE test administrations. As Table 3 shows, of the 1,469 examinees tested in February, 165 (about 11%) had infit mean-square indices less than 0.5. Similarly, of the 1,446 examinees tested in April, 163 (again, about 11%) had infit mean-square indices less than 0.5. These findings suggest that about 1 in 10 examinees in these TSE administrations may have received very similar or identical ratings across all 12 TSE items. 
Table 3 summarizes examinee infit information from the February and April 1997 TSE test administrations.

Table 3
Misfitting and Overfitting Examinees from the February and April 1997 TSE Administrations

                              February 1997              April 1997
Infit Mean-Square Index    Number     Percent         Number     Percent
< 0.5                        165       11.2%            163       11.3%
3.0 to 4.0                    11        0.7%             11        0.7%
4.1 to 5.0                     2        0.1%              2        0.1%
5.1 to 6.0                     2        0.1%              0        0.0%
6.1 to 7.0                     0        0.0%              1        0.1%
7.1 to 8.0                     0        0.0%              1        0.1%

As Table 3 shows, of the 1,469 examinees tested in February, 165 (about 11%) had infit mean-square indices less than 0.5. Similarly, of the 1,446 examinees tested in April, 163 (again, about 11%) had infit mean-square indices less than 0.5. These findings suggest that about 1 in 10 examinees in these TSE administrations may have received very similar or identical ratings across all 12 TSE items. Further, in the February administration, 15 examinees (about 1%) had infit mean-square indices equal to or greater than 3.0. Similarly, in the April administration, 15 (1%) had infit mean-square indices equal to or greater than 3.0. Why did these particular examinees misfit? In Table 4 we examine the rating patterns associated with some representative misfitting cases.

Table 4
Rating Patterns and Fit Indices for Selected Examinees

Ratings Received by Examinee #110
(Infit Mean-Square Index = 0.1; Proficiency Measure = .13, Standard Error = .53)
                               Item:   1   2   3   4   5   6   7   8   9  10  11  12
Rater #74 (Severity = -.39)           40  40  40  40  40  40  40  40  40  40  40  40
Rater #69 (Severity = .95)            40  40  40  40  40  40  40  40  40  40  40  40

Ratings Received by Examinee #865
(Infit Mean-Square Index = 1.0; Proficiency Measure = -.39, Standard Error = .47)
                               Item:   1   2   3   4   5   6   7   8   9  10  11  12
Rater #30 (Severity = -.53)           40  40  40  40  40  40  40  50  40  40  40  50
Rater #59 (Severity = 1.19)           40  30  30  40  30  40  40  40  40  30  40  40

Ratings Received by Examinee #803
(Infit Mean-Square Index = 3.1; Proficiency Measure = 2.55, Standard Error = .42)
                               Item:   1   2   3   4   5   6   7   8   9  10  11  12
Rater #31 (Severity = -.79)           50  50  50  60  50  60  30  60  60  50  60  50
Rater #42 (Severity = -.61)           40  40  50  50  40  50  30  40  50  40  50  40

Ratings Received by Examinee #1060
(Infit Mean-Square Index = 6.6; Proficiency Measure = 2.73, Standard Error = .41)
                               Item:   1   2   3   4   5   6   7   8   9  10  11  12
Rater #18 (Severity = -.53)           40  40  30  40  30  40  40  30  30  40  40  30
Rater #36 (Severity = .29)            50  60  60  60  60  60  60  60  60  60  50  50

As Table 4 shows, the ratings of Examinee #110 exhibit a flat-line pattern: straight 40s from both raters. Facets flags such rating patterns for further examination because they display so little variation. This deterministic pattern does not fit the expectations of the model; the model expects that for each examinee there will be at least some variation in the ratings across items. Examinees who receive very similar or nearly identical ratings across all 12 items will show fit indices less than 0.5; upon closer inspection, their rating patterns will frequently reveal this flat-line nature. In some cases, one might question whether the raters who scored such an examinee's responses rated each item independently or, perhaps, whether a halo effect may have been operating.

The ratings of Examinee #865 (infit mean-square index = 1.0) shown in Table 4 are fairly typical. There is some variation in the ratings: mostly 40s with an occasional 30 from Rater #59, who tends to be one of the more severe raters (rater severity measure = 1.19), and mostly 40s with an occasional 50 from Rater #30, who tends to be one of the more lenient raters (rater severity measure = -.53).

Table 4 shows that, for Examinee #803, the two ratings for Item 7 were misfitting. Both raters gave this examinee unexpectedly low ratings of 30 on this item, while the examinee's ratings on all other items are higher, ranging from 40 to 60. When we examine the Facets table of misfitting ratings, we find that the model expected these raters to give 40s on Item 7, not 30s.
These two raters tended to rate leniently overall (rater severity measures = -.61 and -.79), so their unexpectedly low ratings of 30 are somewhat surprising in light of the other ratings they gave the examinee.

The rating pattern for Examinee #1060, shown in Table 4, displays a higher level of misfit (infit mean-square index = 6.6). In this case, the examinee received 30s and 40s from a somewhat lenient rater (rater severity measure = -.53) and 50s and 60s from a somewhat harsh rater (rater severity measure = .29). The five 30 ratings are quite unexpected, especially from a rater who has a tendency to give higher ratings on average. These ratings are all the more unexpected in light of the high ratings (50s and 60s) given this examinee by a rater who tends to rate on average more severely. In isolated cases like this one, TSE program personnel might want to review the examinee's performance before issuing a final score report. The scores for Examinee #1060 were flagged by the TSE discrepancy criteria as being suspect because the averages of the two raters' sets of scores were 35.83 and 57.50—a difference of more than 10 points; third ratings were used to resolve the differences between the scores.

We also analyzed examinee fit by comparing the standardized infit indices to the discrepancy resolution criteria used by the TSE program. The standardized infit index is a transformation of the unstandardized infit index, scaled like a z score (that is, with a mean of 0 and a standard deviation of 1). Under the assumption of normality, the distribution of standardized infit indices indicates the probability of observing a specific pattern of ratings when all ratings are indeed nondiscrepant. That is, standardized infit indices can be used to identify which of the observed rating patterns are most likely to include surprising or unexpected ratings. For our analyses, we adopted an upper-control limit of 3.0 for the examinee standardized infit; under the normal approximation, a nondiscrepant pattern of ratings would produce a standardized infit index this large only about 0.13% of the time.

Thus, we identified examinees with one or more aberrant ratings using the standardized infit index, which takes into account the level of severity each rater exercised when rating examinees. We also identified examinees having discrepant ratings according to the TSE resolution criteria, which take into account only the magnitude of the differences between the scores assigned by the two raters and do not consider the level of severity exercised by those raters. Tables 5 and 6 summarize this information for the February and April test administrations by providing the number and percentage of examinees whose rating patterns were identified as suspect under both sets of criteria.

Table 5
Examinees from the February 1997 TSE Administration Identified as Having Suspect Rating Patterns

                                        TSE Criteria
Infit Criteria       Nondiscrepant      Discrepant       Total
Nondiscrepant        1,380 (94%)        46 (3%)          1,426 (97%)
Discrepant           29 (2%)            14 (1%)          43 (3%)
Total                1,409 (96%)        60 (4%)          1,469 (100%)

Table 6
Examinees from the April 1997 TSE Administration Identified as Having Suspect Rating Patterns

                                        TSE Criteria
Infit Criteria       Nondiscrepant      Discrepant       Total
Nondiscrepant        1,359 (94%)        47 (3%)          1,406 (97%)
Discrepant           26 (2%)            14 (1%)          40 (3%)
Total                1,385 (96%)        61 (4%)          1,446 (100%)

Based on the TSE discrepancy criteria, about 4% of the examinees were identified as having aberrant ratings in each administration. Based on the standardized infit index, about 3% of the examinees' ratings were aberrant. What is more interesting, however, is the fact that the two methods identified only a small number of examinees in common. Only 1% of the examinees were identified as having suspect rating patterns according to both criteria, as shown in the lower right cell of each table. About 3% of the examinees in each data set were identified as having suspect rating patterns by the TSE discrepancy criteria but were not identified as suspect by the infit index (as shown in the upper right cell of each table). These cases are most likely ones in which an examinee was rated by one severe rater and one lenient rater. Such ratings would not be unusual in light of each rater's overall level of severity; each rater would have been using the rating scale in a manner that was consistent with his or her use of the scale when rating other examinees of similar proficiency. The apparent discrepancies between raters could have been resolved by providing a model-based score that takes into account the levels of severity of the two raters when calculating an examinee's final score. If Facets were used to calculate examinee scores, then there would be no need to bring these types of cases to the attention of the more experienced TSE raters for adjudication. Facets would have automatically adjusted these scores for differences in rater severity, and thus would not have identified these examinees as misfitting and in need of a third rater's time and energy.

The lower left cells in Tables 5 and 6 indicate that about 2% of the examinees were identified as having suspect rating patterns according to the infit criteria but were not identified based on the TSE discrepancy criteria. These cases are most likely situations in which raters made seemingly random rating errors, or examinees performed differentially across items in a way that differs from how other examinees performed on these same items. These cases would seem to be the ones most in need of reexamination and reevaluation by the experienced TSE raters, rather than the cases identified in the upper right cell of each table.
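The sketch below illustrates the comparison just described: each (hypothetical) examinee is flagged by the TSE discrepancy criterion when the two raters' 12-item averages differ by more than 10 points, and by the model-based criterion when the standardized infit reaches 3.0, and the two decisions are then cross-tabulated in the format of Tables 5 and 6. The cube-root standardization shown is the conversion commonly used in Rasch measurement to turn a mean-square index into an approximate z score; the quantity q, the model standard deviation of the mean square, is treated here as a given input rather than estimated.

# A simplified sketch, not the operational TSE procedure; examinee records and q are hypothetical.

import math

def standardized_infit(mean_square, q):
    """Cube-root conversion of a mean-square fit index to an approximate z score."""
    return (mean_square ** (1.0 / 3.0) - 1.0) * (3.0 / q) + q / 3.0

# Each record: (rater A 12-item average, rater B 12-item average, infit mean square, q)
examinees = [
    (48.3, 45.0, 0.9, 0.5),   # ordinary ratings
    (55.0, 42.5, 1.1, 0.5),   # severe/lenient pairing: TSE rule flags it, infit does not
    (46.7, 44.2, 4.2, 0.5),   # aberrant ratings: infit flags it, TSE rule does not
    (57.5, 35.8, 6.6, 0.5),   # both rules flag it
]

counts = {}
for avg_a, avg_b, infit, q in examinees:
    tse_flag = abs(avg_a - avg_b) > 10
    infit_flag = standardized_infit(infit, q) >= 3.0
    counts[(infit_flag, tse_flag)] = counts.get((infit_flag, tse_flag), 0) + 1

print("                      TSE nondiscrepant  TSE discrepant")
print(f"Infit nondiscrepant   {counts.get((False, False), 0):>9}       {counts.get((False, True), 0):>9}")
print(f"Infit discrepant      {counts.get((True, False), 0):>9}       {counts.get((True, True), 0):>9}")

# For reference, the probability that a standard normal deviate exceeds 3.0:
print(f"{0.5 * math.erfc(3.0 / math.sqrt(2.0)) * 100:.2f}%")   # about 0.13%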
Items

Is it harder for examinees to get high ratings on some TSE items than others? To what extent do the 12 TSE items differ in difficulty?

To answer these questions, we can examine the item difficulty measures shown in Table 7 and Table 8. These tables order the 12 TSE items from each test according to their relative difficulties. More difficult items (that is, those that were harder for examinees to get high ratings on) appear at the top of each table, while easier items appear at the bottom.
Table 7
Item Measurement Report for the February 1997 TSE Administration

Item       Difficulty Measure (in logits)    Standard Error    Infit Mean-Square Index
Item 7           .46                              .04                 1.0
Item 11          .43                              .04                 1.0
Item 10          .25                              .04                 1.0
Item 3           .16                              .04                 1.0
Item 9           .11                              .04                 0.9
Item 8           .10                              .04                 0.9
Item 2          -.01                              .04                 1.1
Item 5          -.11                              .04                 1.0
Item 1          -.16                              .04                 1.2
Item 6          -.23                              .04                 0.9
Item 4          -.48                              .04                 0.9
Item 12         -.52                              .04                 0.9

Table 8
Item Measurement Report for the April 1997 TSE Administration

Item       Difficulty Measure (in logits)    Standard Error    Infit Mean-Square Index
Item 11          .45                              .04                 0.9
Item 7           .42                              .04                 1.0
Item 6           .25                              .04                 0.9
Item 2           .16                              .04                 1.1
Item 3           .06                              .04                 1.0
Item 9          -.01                              .04                 0.9
Item 5          -.03                              .04                 1.0
Item 10         -.09                              .04                 1.0
Item 1          -.10                              .04                 1.3
Item 8          -.10                              .04                 0.9
Item 12         -.38                              .04                 0.8
Item 4          -.64                              .04                 0.9

For the February administration, the item difficulty measures range from -.52 logits for Item 12 to .46 logits for Item 7, about a 1-logit spread. The range of difficulty measures for the items that appeared on the April test is nearly the same: items range in difficulty from -.64 logits for Item 4 to .45 logits for Item 11, about a 1-logit spread. The spread of the item difficulty measures is very narrow compared to the spread of examinee proficiency measures. If the items are not sufficiently spread out along a continuum, then that suggests that those designing the assessment have not succeeded in defining distinct levels along the variable they are intending to measure (Wright & Masters, 1982). As shown earlier in Figures 1 and 2, the TSE items tend to cluster in a very narrow band. For both the February and April administrations, Items 4 and 12 were slightly easier for examinees to get high ratings on, while Items 7 and 11 were slightly more difficult. However, in general, the items differ relatively little in difficulty and thus do not convincingly define a line of increasing intensity. To improve the measurement capabilities of the TSE, test developers may want to consider introducing into the instrument some items that are easier than those that currently appear, as well as some items that are substantially more difficult than the present items. If such items could be designed, then they would help to define a more recognizable and meaningful variable and would allow for placement of examinees along the variable defined by the test items.

Can we calibrate ratings from all 12 TSE items, or do ratings on certain items frequently fail to correspond to ratings on other items (that is, are there certain items that do not "fit" with the others)? Can a single summary measure capture the essence of examinee performance across the items, or is there evidence of possible psychometric multidimensionality (Henning, 1992; McNamara, 1991; McNamara, 1996) in the data and, perhaps, a need to report a profile of scores rather than a single summary score for each examinee if systematic profile differences appear among examinees who have the same overall summary score?

We used Facets to compute indices of fit for each of the 12 items included in the February and April tests, and these are also shown in Tables 7 and 8. The infit mean-square indices range from 0.9 to 1.2 for the February test, and from 0.8 to 1.3 for the April test. All the indices are within even tight quality control limits of 0.7 to 1.3. The fact that all the infit mean-square indices are greater than 0.7 suggests that none of the items function in a redundant fashion.
Because none of the infit mean-square indices is greater than 1.3, there is little evidence of psychometric multidimensionality in either data set. The items on each test appear to work together; ratings on one item correspond well to ratings on other items. That is, a single pattern of proficiency emerges for these examinees across all items on the TSE. Therefore, ratings on the individual items can be meaningfully combined; a single summary measure can appropriately capture the essence of examinee performance across the 12 items.

Are all 12 TSE items equally discriminating? Is item discrimination a constant for all 12 items, or does it vary across items?

Several versions of the many-facet Rasch model can be specified within Facets for the TSE data, and each model portrays a specific set of assumptions regarding the nature of the TSE ratings. The version of the model we used (the Partial Credit Model shown earlier in Equation 1) assumes that a rating scale with a unique category structure is employed in the rating of each item. That is, a score of 20 on Item 1 may be more or less difficult to obtain relative to a score of 30 than is a score of 20 on Item 2. The F_ik term in Equation 1 denotes that the rating scale for each item is to be modeled to have its own category structure. A simpler model, the Rating Scale Model, shown in Equation 2 below, assumes that the rating scales for the items share a common category structure; in other words, a score of 20 has uniform difficulty relative to a score of 30, regardless of the item. The F_k term in Equation 2 denotes that all 12 rating scales are to be modeled assuming a common category structure operates across items:

log(P_nijk / P_nij(k-1)) = B_n - C_j - D_i - F_k        (2)

One of the differences between the Partial Credit Model and the Rating Scale Model relates to item discrimination (the extent to which an item differentiates between high and low proficiency examinees). Items that have good discrimination differentiate very well between high and low proficiency examinees. As a result, each examinee who responds to an item has a high probability of being assigned to a single rating scale category and a very low probability of being assigned to any of the remaining categories. That is, there is one rating category to which each examinee is clearly most likely to be assigned. Items that are poor discriminators, on the other hand, do not differentiate between high and low proficiency examinees. With poorly discriminating items, the probabilities associated with each rating scale category are nearly equal, so that high and low proficiency examinees are likely to be assigned the same rating. Differences in item discrimination result in higher and lower variances in the category probability curves for items with low and high levels of discrimination, respectively.

To investigate whether the 12 TSE items are equally discriminating, we compared the rating scale category structures estimated under the Partial Credit Model, because that analysis allowed the rating scale category structure to vary from one item to another. We examined the rating scale category calibrations for each item. A rating scale category calibration is the point on the examinee proficiency scale at which the probability curves for adjacent categories intersect. It marks the point in the proficiency distribution where the probability of an examinee's getting a rating in the next higher scale category begins to exceed the probability of the examinee's getting a rating in the next lower scale category.4

4 Different researchers in the Rasch community use different terms for this critical notion of a transition point between adjacent categories. Linacre, Masters, and Wright speak of "step" calibrations, a concept that would seem to imply movement, of passing through or moving up from a lower category to arrive at the next higher category on a scale. Others have taken issue with this notion of implied movement, arguing that when we analyze a data set, we are modeling persons at a specific point in time, not how each person arrived at their particular location on the proficiency continuum (Andrich, 1998). Andrich prefers the term "threshold," which he defines as the transition point at which the probability is 50% of an examinee being rated in one of two adjacent categories, given that the examinee is in one of those two categories. Linacre (1998) has also used the term "step thresholds" to denote these categorical boundary points. We have chosen to use the term "category calibration" because we are not comfortable with the connotation of movement implied in the use of the term "step calibration." However, the reader should be aware that the values we report in Tables 9 and 10 are what Facets reports as "step calibrations."
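The sketch below illustrates this idea with the category probabilities implied by a partial credit formulation: given an examinee proficiency, an item difficulty, a rater severity, and a set of category calibrations, it computes the probability of each rating category and shows that the most probable category changes at the calibration points. All parameter values are illustrative (the calibrations are roughly the February means reported in Table 9), and the function is a simplified stand-in for the model Facets estimates, not the program's own code.

# A minimal sketch of category probability curves under a partial credit formulation.

import numpy as np

CATEGORIES = [20, 30, 40, 50, 60]

def category_probabilities(b, d, c, thresholds):
    """P(each rating category) for examinee proficiency b, item difficulty d,
    rater severity c, and the four category calibrations (one per step above
    the lowest category), all in logits."""
    # Cumulative sums of (b - d - c - F_h); the lowest category has sum 0.
    sums = np.concatenate(([0.0], np.cumsum(b - d - c - np.asarray(thresholds))))
    expsums = np.exp(sums - sums.max())          # stabilize before normalizing
    return expsums / expsums.sum()

thresholds = [-5.7, -2.8, 2.1, 6.4]              # illustrative calibrations
d, c = 0.0, 0.0                                  # illustrative item difficulty, rater severity

for b in (-6.0, -3.0, 0.0, 3.0, 7.0):            # a range of proficiencies
    probs = category_probabilities(b, d, c, thresholds)
    best = CATEGORIES[int(np.argmax(probs))]
    print(f"b = {b:+.1f} logits -> most probable rating {best}", np.round(probs, 2))

# Near b = -2.8 (the second calibration), the probabilities of a 30 and a 40
# are equal; that intersection is the category calibration for the 40 category.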
Tables 9 and 10 show the rating scale category calibrations for each of the 12 TSE items from the February and April tests, as well as the means and standard deviations associated with each rating scale category calibration.

Table 9
Rating Scale Category Calibrations for the February 1997 TSE Items

Category    Item 1  Item 2  Item 3  Item 4  Item 5  Item 6  Item 7  Item 8  Item 9  Item 10  Item 11  Item 12    Mean    SD
30           -4.96   -5.37   -5.42   -6.65   -6.22   -6.27   -5.23   -5.22   -5.59    -6.08    -5.65    -6.13   -5.73  0.52
40           -3.80   -3.18   -3.03   -2.88   -2.52   -2.33   -2.62   -2.41   -2.54    -2.65    -2.59    -2.78   -2.78  0.41
50            1.90    2.13    1.97    2.74    2.36    2.32    2.00    1.83    1.99     2.20     2.01     2.37    2.15  0.26
60            6.87    6.43    6.48    6.79    6.38    6.28    5.85    5.80    6.15     6.52     6.23     6.54    6.36  0.33

Table 10
Rating Scale Category Calibrations for the April 1997 TSE Items

Category    Item 1  Item 2  Item 3  Item 4  Item 5  Item 6  Item 7  Item 8  Item 9  Item 10  Item 11  Item 12    Mean    SD
30           -5.65   -5.16   -6.20   -7.58   -5.85   -5.01   -4.99   -5.24   -6.09    -6.23    -5.73    -6.41   -5.85  0.73
40           -3.31   -3.40   -3.13   -2.68   -3.07   -3.25   -2.99   -2.97   -2.42    -2.88    -2.71    -2.93   -2.98  0.28
50            1.98    2.00    2.51    2.97    2.33    2.04    1.88    2.03    2.26     2.34     2.16     2.48    2.25  0.31
60            6.97    6.56    6.82    7.29    6.59    6.22    6.10    6.18    6.25     6.77     6.28     6.86    6.57  0.38

For example, Table 9 shows that for Item 1, examinees with proficiency measures in the range of -4.96 logits to -3.79 logits have a higher probability of receiving a 30 on item 1 than any other rating, while examinees with proficiency measures in the range of -3.80 logits to 1.89 logits have a higher probability of receiving a 40 on the item than any other rating. Tables 9 and 10 reveal that the rating scale category calibrations are fairly consistent across items—the differences between the mean category calibrations are considerably larger than their standard deviations. However, the rating scale category calibrations are not equal across the 12 TSE items, particularly for the two lowest categories.

One way to examine the differences in rating scale category calibrations is to compare the category probability curves of two or more items. These plots portray the probability that an examinee with proficiency B will receive a rating in each of the rating scale categories for that item. Figure 3, for example, shows the expected distribution of probabilities for Item 4 (the solid lines) and Item 8 (the dashed lines) from the February test—two items with the most dissimilar rating scale category calibrations for this test.

Figure 3
Category Probability Curves for Items 4 and 8 (February Test)
[Figure: the probability P(x) of each rating category plotted against examinee proficiency B from -8 to 8 logits, with adjacent-category transitions marked at 20-30, 30-40, 40-50, and 50-60.]

This plot reinforces what is shown in Table 9—the transitions from the 40 to the 50 categories and from the 50 to the 60 categories are fairly similar between the two items, while the transitions from the 20 to the 30 categories and from the 30 to the 40 categories are less similar.
These findings suggest that more proficient TSE examinees (those who would typically receive scores of 50 or 60 on each item) tend to perform equally well across the 12 items on the February test, while less proficient TSE examinees (those who would typically receive scores of 20 or 30 on each item) tend to perform somewhat better on some items than on others.

Similarly, Figure 4 shows the expected distribution of probabilities for Item 4 (solid lines) and Item 7 (dashed lines) from the April test—again, two items with the most dissimilar rating scale category calibrations. In this case, the similarities are not as apparent. As was true for the February test, the transitions from the 20 to the 30 categories are dissimilar for the two items from the April test. However, the transitions from the 30 to the 40 categories are nearly identical, while the transitions from the 40 to the 50 and from the 50 to the 60 categories are somewhat dissimilar for these two items.

Figure 4
Category Probability Curves for Items 4 and 7 (April Test)
[Figure: the probability P(x) of each rating category plotted against examinee proficiency B from -8 to 8 logits, with adjacent-category transitions marked at 20-30, 30-40, 40-50, and 50-60.]

TSE Rating Scale

Are the five scale categories on the TSE rating scale appropriately ordered? Is the rating scale functioning properly as a five-point scale? Are the scale categories clearly distinguishable?

To answer these questions, we examined the average examinee proficiency measure by rating scale category for each of the 12 TSE items from the February and April data. We also examined the outfit mean-square index for each rating category for each TSE item. To compute the average examinee proficiency measure for a rating category, the examinee proficiency measures (in logits) for all examinees receiving a rating in that category on that item are averaged. If the rating scale for the item is functioning as intended, then the average examinee proficiency measures will increase in magnitude as the rating scale categories increase.
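The sketch below illustrates this check for a single item: the (hypothetical) logit measures of the examinees who received each rating are grouped by category, averaged, and tested for a monotonic increase across categories.

# A minimal sketch of the average-measure-by-category diagnostic; the
# (measure, rating) pairs below are hypothetical.

from collections import defaultdict
from statistics import mean

ratings_on_item = [(-4.2, 20), (-2.9, 30), (-3.4, 30), (-0.8, 40), (1.1, 40),
                   (0.4, 40), (3.2, 50), (4.0, 50), (6.8, 60), (7.3, 60)]

by_category = defaultdict(list)
for measure, rating in ratings_on_item:
    by_category[rating].append(measure)

averages = {cat: mean(vals) for cat, vals in sorted(by_category.items())}
print(averages)

ordered = list(averages.values())
print("Average measures increase with category:",
      all(a < b for a, b in zip(ordered, ordered[1:])))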
When we see this pattern borne out in the data, the results suggest that examinees with higher ratings on the item are indeed exhibiting "more" of the variable being measured (i.e., communicative language ability) than examinees with lower ratings on that item, and therefore the intentions of those who designed the rating scale are being fulfilled, Linacre (1999b) asserts.5

Tables 11 and 12 contain the average examinee proficiency measures by rating scale category for each of the 12 TSE items from the February and April TSE administrations. For nearly all the items, the average examinee proficiency measures increase as the rating scale categories increase. The exceptions are items 1 and 4 from the February administration and item 7 from the April administration. In these three cases, the average examinee proficiency measure for category 30 is lower than the average examinee proficiency measure for category 20—an unexpected result. These findings suggest that for these three items the resulting measures for at least some of the examinees receiving 20s on those particular items may be of doubtful utility (Linacre, 1999b). However, as Linacre notes, if raters seldom assign ratings in a particular category, then the calibration for that category may be imprecisely estimated and unstable. This can directly affect the calculation of the average examinee proficiency measure for the category, because the category calibrations are used in the estimation process.6 As is shown in Tables 13 and 14, only 1% or fewer of the total ratings assigned for each item were 20s. For a number of items, fewer than 10 ratings of 20 were assigned (i.e., for February, items 4, 5, 6, and 12; for April, items 3, 4, 10, and 12). When fewer than 10 ratings are assigned in a category, then including or excluding a single rating can often substantially alter the estimated scale structure. Is it problematic that the average examinee proficiency measure for category 30 for three items is lower than the average examinee proficiency measure for category 20, or not?
Because raters infrequently use the 20 category on the TSE scale (and that is consistent with the intentions of those who developed the scale), it may be that these results are a statistical aberration, perhaps a function of imprecise statistical estimation procedures. To investigate further the trustworthiness of ratings of 20, we examined a second indicator of rating scale functionality— the outfit mean-square index Facets reports for each rating scale category. For each rating scale category for an item, Facets computes the average examinee proficiency measure (i.e., the “observed” measure) and an “expected” examinee proficiency measure (i.e., the examinee proficiency measure the model would predict for that rating category if the data were to fit the model). When the observed and expected examinee proficiency measures are close, then the outfit mean-square index for the rating category will be near the expected value of 1.0. The greater the discrepancy between the observed and expected measures, the larger the mean-square index will be. For a given rating category, an outfit mean-square index greater than 2.0 suggests that a rating in that category for one or more examinees may not be contributing to meaningful measurement of the variable (Linacre, 1999b). Outfit mean-square indices are more sensitive to the occasional outlying rating than infit mean-square indices; therefore, rating categories at the ends of a scale are more likely to exhibit high outfit mean-square indices than rating categories in the middle of a scale. 5 It should be noted that there is some disagreement in the Rasch community about the meaningfulness of this indicator of rating scale functionality. Andrich (1998) has challenged the notion that if the average measures increase monotonically, then that is an indication that the measuring instrument is performing satisfactorily. The average measures are not distribution free, Andrich points out. By contrast, the threshold estimates produced by the Rasch Unidimensional Measurement Models (RUMM) computer program are independent of the examinee distribution, and, in Andrich’s view, provide a more stable indicator of whether a measuring instrument is functioning as intended. 6 To compute the calibration for a given rating category, F , the log-ratio of the frequencies of the categories k adjacent to that particular category are used. 27 Table 11 Average Examinee Proficiency Measures and Outfit Mean-Square Indices for the February 1997 TSE Items Item Number Category 20 30 40 50 60 1 2 3 4 5 6 7 8 9 10 11 12 -1.90 -3.26 -4.33 -1.10 -3.51 -4.91 -2.64 -3.80 -4.57 -3.45 -3.36 -3.80 (4.1) (2.2) (1.6) (3.2) (2.1) (.7) (3.0) (1.5) (.8) (2.1) (2.4) (1.5) -2.23 -2.36 -2.51 -2.45 -2.28 -2.26 -2.49 -2.31 -2.34 -2.52 -2.36 -2.24 (1.1) (1.0) (.9) (.8) (.9) (.8) (1.0) (.8) (.8) (.8) (1.0) (.9) .42 .48 .35 .90 .68 .70 .28 .40 .46 .47 .29 .78 (1.2) (1.0) (1.0) (.8) (.9) (.8) (.8) (.9) (.9) (1.0) (.9) (.9) 3.71 3.67 3.61 4.26 3.87 3.89 3.41 3.40 3.57 3.72 3.57 3.98 (1.2) (1.0) (1.0) (.8) (1.0) (.8) (.9) (.9) (.9) (.9) (.9) (.8) 6.97 7.04 6.98 7.64 7.13 7.21 6.60 6.74 6.83 7.05 6.80 7.47 (1.4) (1.1) (1.0) (.9) (1.0) (.9) (.9) (.9) (1.0) (1.0) (1.0) (.9) Note. The average examinee proficiency measure for each rating category appears at the top of each cell in the table. The outfit mean-square index associated with that rating category appears in parentheses at the bottom of each cell. 
28 Table 12 Average Examinee Proficiency Measures and Outfit Mean-Square Indices for the April 1997 TSE Items Item Number Category 20 30 40 50 60 1 2 3 4 5 6 7 8 9 10 11 12 -3.62 -3.74 -4.29 -3.24 -2.76 -4.25 -1.47 -2.98 -2.31 -3.35 -3.55 -3.83 (1.0) (1.2) (.7) (.8) (2.0) (.7) (3.5) (1.7) (1.6) (1.1) (1.2) (.8) -1.87 -2.15 -2.23 -2.00 -2.40 -2.47 -2.61 -2.26 -2.11 -2.17 -2.43 -2.33 (1.2) (1.1) (1.0) (.8) (.9) (.9) (.8) (.9) (.9) (1.0) (.9) (.8) .54 .27 .54 1.06 .50 .20 .13 .40 .54 .53 .30 .68 (1.3) (1.0) (1.0) (.9) (.9) (.9) (.9) (.9) (.9) (.9) (.9) (.8) 3.50 3.47 3.82 4.41 3.73 3.48 3.28 3.50 3.62 3.84 3.47 4.01 (1.4) (1.1) (1.0) (.9) (1.0) (.8) (.9) (.9) (.8) (1.0) (.9) (.7) 6.84 6.61 6.91 7.56 6.85 6.58 6.45 6.75 6.79 6.93 6.58 7.30 (1.2) (1.1) (1.0) (.9) (1.0) (.9) (.9) (.8) (.8) (.9) (.8) (.8) Note. The average examinee proficiency measure for each rating category appears at the top of each cell in the table. The outfit mean-square index associated with that rating category appears in parentheses at the bottom of each cell. 29 Table 13 Frequency and Percentage of Examinee Ratings in Each Category for the February 1997 TSE Items Item Number Category 20 30 40 50 60 1 2 3 4 5 6 7 8 9 10 11 12 15 16 19 4 9 8 32 24 18 14 23 6 (1%) (1%) (1%) (<1%) (<1%) (<1%) (1%) (1%) (1%) (<1%) (1%) (<1%) 75 121 150 104 168 177 223 210 204 202 201 118 (3%) (4%) (5%) (4%) (6%) (6%) (8%) (7%) (7%) (7%) (7%) (4%) 1,125 1,200 1,163 1,264 1,190 1,146 1,191 1,058 1,105 1,234 1,221 1,111 (39%) (42%) (40%) (44%) (41%) (40%) (41%) (37%) (38%) (43%) (42%) (38%) 1,356 1,192 1,228 1,133 1,136 1,143 1,045 1,149 1,165 1,135 1,115 1,212 (47%) (41%) (43%) (39%) (39%) (40%) (36%) (40%) (40%) (39%) (39%) (42%) 315 357 326 381 383 412 391 444 394 301 326 439 (11%) (12%) (11%) (13%) (13%) (14%) (14%) (15%) (14%) (10%) (11%) (15%) Note. The number of ratings given in a category for an item is shown at the top of each cell in the table. The percentage corresponding to that frequency is shown in parentheses. 30 Table 14 Frequency and Percentage of Examinee Ratings in Each Category for the April 1997 TSE Items Item Number Category 20 30 40 50 60 1 2 3 4 5 6 7 8 9 10 11 12 10 20 8 1 10 26 34 16 10 7 21 4 (<1%) (1%) (<1%) (<1%) (<1%) (1%) (1%) (1%) (<1%) (<1%) (1%) (<1%) 126 137 162 141 157 157 201 154 235 172 252 139 (4%) (5%) (6%) (5%) (6%) (6%) (7%) (5%) (8%) (6%) (9%) (5%) 1,196 1,266 1,382 1,332 1,300 1,280 1,233 1,176 1,205 1,271 1,296 1,252 (42%) (44%) (49%) (47%) (46%) (45%) (43%) (41%) (42%) (45%) (46%) (44%) 1,249 1,142 1,037 1,082 1,077 1,068 1,070 1,123 1,048 1,108 1,001 1,130 (44%) (40%) (36%) (38%) (38%) (38%) (38%) (39%) (37%) (39%) (35%) (40%) 265 283 259 292 304 317 310 379 350 288 278 317 (9%) (10%) (9%) (10%) (11%) (11%) (11%) (13%) (12%) (10%) (10%) (11%) Note. The number of ratings given in a category for an item is shown at the top of each cell in the table. The percentage corresponding to that frequency is shown in parentheses. 31 When we reviewed the outfit mean-square indices presented in Table 11 (February data), we noted that several of the items have outfit mean-square indices greater than 2.0 for rating category 20 (i.e., items 1, 2, 4, 5, 7, 10, and 11). Similarly, Table 12 shows that several of the items from the April 1997 TSE administration have outfit mean-square indices greater than 2.0 for rating category 20 (i.e., items 5 and 7). 
Given this information, we then looked at the table of individual misfitting ratings that Facets produces to identify the specific misfitting examinees who received ratings of 20 on these particular items. There were 19 examinees in the February data set and 11 examinees in the April data set who received ratings of 20 on the item in question when the model would have predicted that the examinee should have received a higher rating, given the ratings the examinee received on the other items. Interestingly, in 27 of these cases, both raters who evaluated the examinee gave ratings of 20 on the item in question. To some degree, the fact that the two raters agreed on a rating of 20 on the item provides support for the view that those ratings are credible for these 27 examinees and should not be changed, even though the ratings of 20 are surprising and don’t seem to “fit” with the higher ratings each examinee received on the other items. However, there were three examinees that received a rating of 20 on the item in question from one rater who evaluated their performance but not from the other rater. Those examinees’ rating patterns are shown in Table 15. For Examinee #1004, the ratings of 20 on items 1 and 2 from Rater #56 are quite unexpected because this rater gave 40s on all other items. The second rater, Rater #44, gave the examinee nearly all 40s and did not concur with the first rater on the ratings of 20 for items 1 and 2. In the second example the rating of 20 on item 5 for Examinee #353 from Rater #12 is unexpected as well, because the rater gave the examinee 40s and 50s on all other items (as did the second rater, Rater #16, a very harsh rater). Finally, the rating of 20 on item 5 for Examinee #314 from Rater #17 is surprising because the rater gave 40s on all other items. Again, the second rater (Rater #56) gave the examinee 30s and 40s on the items, no 20s. These examples vividly illustrate Facets capacity to identify a rater’s occasionally suspect use of a rating category. To summarize, the rating scales for the twelve TSE items on each test form seem to be functioning as intended, for the most part. The fact that for a number of items the raters infrequently use the 20 category on the scale makes it appear that the 20 category is problematic for some of these items. However, it may be the distribution-sensitive nature of the statistical estimation procedures used to produce the category calibrations that makes the category appear problematic. At any rate, in the future it will be important to monitor closely raters’ use of the 20 category to make certain that those ratings continue to be trustworthy. In this study, a close inspection of the rating patterns for 27 of the 30 misfitting examinees who received surprising ratings of 20 on one item indicates that the raters appeared to use the 20 rating category appropriately, even though the ratings of 20 did not seem consistent with the higher ratings each examinee received on the other items. However, in three cases, the ratings of 20 on one or two items seem somewhat more suspect, because the second rater did not concur with those judgments. In instances like these, TSE program personnel might want to review the examinee’s performance before issuing a final score report to verify that the ratings of 20 should stand. In all three cases, these examinees would not have been identified as in need of third-rater adjudication because the averages of the two raters’ 12 ratings of each examinee were less than 10 points apart. 
32 Table 15 Rating Patterns and Fit Indices for Selected Examinees Ratings Received by Examinee #1004 (Examinee Proficiency Measure = -1.05, Standard Error = .45) (Examinee Infit Mean-Square Index = 1.8; Examinee Outfit Mean-Square Index = 2.4) (Outfit Mean-Square Index for Category 20 of Item 1 = 4.1) Rater #56 (Severity = .57) Rater #44 (Severity = .37) Item Number 6 7 1 2 3 4 5 20 20 40 40 40 40 40 40 40 40 40 40 8 9 10 11 12 40 40 40 40 40 40 40 30 40 40 40 40 Ratings Received by Examinee #353 (Examinee Proficiency Measure = 1.50, Standard Error = .46) (Examinee Infit Mean-Square Index = 2.3; Examinee Outfit Mean-Square Index = 2.2) (Outfit Mean-Square Index for Category 20 of Item 5 = 2.1) Rater #16 (Severity = 1.26) Rater #12 (Severity = .25) Item Number 6 7 1 2 3 4 5 50 40 40 40 40 50 40 40 40 40 20 50 8 9 10 11 12 40 50 40 40 50 50 40 50 40 40 40 40 Ratings Received by Examinee #314 (Examinee Proficiency Measure = -3.34, Standard Error = .39) (Examinee Infit Mean-Square Index = 1.3; Examinee Outfit Mean-Square Index = 1.3) (Outfit Mean-Square Index for Category 20 of Item 5 = 2.0) Rater #56 (Severity = -.91) Rater #17 (Severity = -.80) Item Number 6 7 1 2 3 4 5 40 30 30 30 30 40 40 40 40 40 20 40 33 8 9 10 11 12 40 30 30 40 30 30 40 40 40 40 40 40 Raters Do TSE raters differ in the severity with which they rate examinees? We used Facets to produce a measure of the degree of severity each rater exercised when rating examinees from the February and April tests. Table 16 shows a portion of the output from the Facets analysis summarizing the information the computer program provided about the February TSE raters. The raters are ordered in the table from most severe to most lenient. The higher the rater severity measure, the more severe the rater. To the right of each rater severity measure is the standard error of the estimate, indicating the precision with which it has been measured. Other things being equal, the greater the number of ratings an estimate is based on, the smaller its standard error. The rater severity measures for the February TSE raters ranged from -2.01 logits to 1.54 logits, a 3.55 logit spread, while the rater severity measures for the April TSE raters ranged from -1.89 logits to 1.30 logits, a 3.19 logit spread. A more substantive interpretation of rater severity could be obtained by examining each rater’s mean rating. However, because all TSE raters do not rate all TSE examinees, it is difficult to determine how much each rater’s mean rating is influenced by the sample of examinee performances that he or she rates. For example, suppose the mean ratings of two raters (A and B), each of whom rates a different set of examinees, are compared. Rater A’s mean rating is 40.0, and Rater B’s mean rating is 50.0. In this case, we know nothing about the relative severity of Raters A and B because we know nothing about the comparability of the examinee performances that the two raters rated. One way of determining a fair difference between the means of two such raters is to examine the mean rating for each rater once it has been corrected for the deviation of the examinees in each rater’s sample from the overall examinee mean across all raters. This fair average allows one to determine the extent to which raters differ in the mean scores that they assign to examinees. For example, the most severe rater involved in the February scoring session had a fair average of 37.85, while the most lenient rater had a fair average of 44.68. 
That means that, on average, these raters assigned scores that were about 7 raw score points apart. Similarly, the most severe rater involved in the April scoring session had a fair average of 38.65, while the most lenient rater had a fair average of 44.15—an average difference of 5.5 raw score points per response scored. In addition to the above analysis of rater severity, Facets provides a chi-square test of the hypothesis that all the raters in the February scoring session exercised the same degree of severity when rating examinees. The resulting chi-square value was 3838.7 with 65 degrees of freedom. From this analysis, we concluded that there is .001 probability, after allowing for measurement error, that the raters who participated in the February scoring session can be considered equally severe. The results from the Facets chi-square test on the April data were similar. The chi-square value for that analysis was 2768.2 with 73 degrees of freedom. As with the February scoring, there is .001 probability that the raters who participated in the April scoring can be thought of as equally severe. 34 Table 16 Summary Table for Selected TSE Raters Most severe Most lenient a,b Rater ID Number of Ratings Rater Severity Measure (in logits) Standard Error Infit MeanSquare Index 54 16 47 48 ..... 36 7 33 6 ..... 32 41 61 60 624 948 264 696 ..... 540 960 407 300 ..... 780 300 12 480 1.54 1.31 1.22 1.20 ..... 0.02 0.02 0.01 -0.01 ..... -1.39 -1.46 -1.83 -2.01 .08 .07 .13 .08 ..... .09 .07 .11 .13 ..... .08 .13 .63 .10 0.9 0.7 1.2 1.1 ..... 1.0 0.9 1.4 1.2 ..... 0.9 1.3 0.4 0.9 Meana SDb 581.8 203.6 .00 .72 .10 .07 1.0 0.2 The mean and standard deviation are for all raters who participated in the February TSE scoring session, not just the selected raters shown in this table. If raters differ in severity, how do those differences affect examinee scores? After Facets calibrates all the raters, the computer program produces a fair average for each examinee. This fair average adjusts the examinee’s original raw score average for differences in rater severity, showing what score the examinee would have received had the examinee been rated by two raters of average severity. We then compared examinees’ fair averages to their original raw score averages (that is, their final reported scores) to determine how differences in rater severity affect examinee scores. Tables 17 and 18 show the effects of adjusting for rater severity on examinee raw score averages from the February and April administrations, respectively. Table 17 shows that adjusting for differences in rater severity would have altered the raw score averages of about one third (36.4%) of the February examinees by less than one-half point. 35 However, for about two-thirds (63.6%) of the examinees, adjusting for rater severity would have altered raw score averages by more than one-half point (but less than 4 points). Further, the following statements can be made about the latter group of examinees: • In about 25% of these cases, the examinee fair averages were 0.5 to 1.4 points higher than their raw score averages, while for 22% of the examinees, their fair averages were 0.5 to 1.4 points lower than their raw score averages. • About 8% of the examinees had fair averages that were 1.5 to 2.4 points higher than their raw score averages, while 6.5% had fair averages that were 1.5 to 2.4 points lower than their raw score averages. 
• About 1% had fair averages that were 2.5 to 3.2 points higher than their raw score averages, while another 1% had fair averages that were 2.5 to 3.4 points lower than their raw score averages.
• Finally, in no cases were fair averages more than 3.2 points higher than raw score averages, and in only three cases (0.2% of the examinees) were fair averages 3.5 to 3.6 points lower than raw score averages.

Table 18 tells much the same story. The percentages of differences between the examinees' fair averages and their raw score averages in the April data set closely mirror those found in the February data set. If we were to adjust for differences in rater severity, the fair averages for two thirds (63.5%) of the examinees tested in April would have differed from their raw score averages by more than one-half point (but less than 4 points). It is important to note that in both sets of data, there are instances along the entire score distribution of differences of 2 to 4 points between examinees' raw score averages and their fair averages. This phenomenon does not occur only at certain score intervals; it occurs across the distribution.

Educational Testing Service does not set cutscores for passing or failing the TSE. Those institutions and agencies that use the TSE determine what the cutscores will be for their own institutions. Therefore, we are not able to show what the impact of adjusting for rater severity would be on passing rates. However, it is important to note that an adjustment as small as 0.1 could have important consequences for an examinee who was near an institution's cutoff point. Suppose that an institution used a cutoff point of 50. When computing examinees' reported scores, raw score averages from 47.5 to 52.4 translate into a reported score of 50. If an examinee's raw score average were 47.4, that examinee would receive a reported score of 45 and therefore would be judged not to have reached the institution's cutoff point. However, if a 0.1 adjustment for rater severity were added, the examinee's score would then be 47.5, and the examinee would receive a score of 50, which the institution would consider to be a passing score.

Table 17
Effects of Adjusting for Rater Severity on Examinee Raw Score Averages: February 1997 TSE Administration

Difference between Examinees' Raw Score Averages and Their Fair Averages    Number of Examinees    Percentage of Examinees
2.5 to 3.2 points           13        0.9%
1.5 to 2.4 points          120        8.2%
0.5 to 1.4 points          365       24.8%
-0.4 to 0.4 points         535       36.4%
-1.4 to -0.5 points        317       21.6%
-2.4 to -1.5 points         96        6.5%
-3.4 to -2.5 points         20        1.4%
-3.6 to -3.5 points          3        0.2%
TOTAL                    1,469        100%

Table 18
Effects of Adjusting for Rater Severity on Examinee Raw Score Averages: April 1997 TSE Administration

Difference between Examinees' Raw Score Averages and Their Fair Averages    Number of Examinees    Percentage of Examinees
2.5 to 3.0 points            8        0.5%
1.5 to 2.4 points           87        6.0%
0.5 to 1.4 points          409       28.3%
-0.4 to 0.4 points         528       36.5%
-1.4 to -0.5 points        324       22.4%
-2.4 to -1.5 points         83        5.7%
-3.4 to -2.5 points          6        0.4%
-3.5 points                  1        0.1%
TOTAL                    1,446        100%

How interchangeable are the raters?

Our analysis suggests that the TSE raters do not function interchangeably. Raters differ somewhat in the levels of severity they exercise when rating examinees. The largest difference found here between examinees' raw score averages and their fair averages was 3.6 raw score points.
While this may not seem like a large difference, it is important to remember that even an adjustment for rater severity as small as 0.1 raw score point could have important consequences for examinees, particularly those whose scores lie in critical decision-making regions of the score distribution. Do TSE raters use the TSE rating scale consistently? To answer this question, we examined the mean-square fit indices for each rater and the proportion of unexpected ratings with which each rater is associated. In general, raters with infit mean-square indices less than 1 show less variation than expected in their ratings, even after the particular performances on which those ratings are based have been taken into account. Their ratings tend to be "muted," sometimes as a consequence of overusing the inner rating scale categories (i.e., the 30, 40, and 50 rating categories). By contrast, raters with infit mean-square indices greater than 1 show more variation than expected in their ratings. Their ratings tend to be "noisy," sometimes as a consequence of overusing the outer rating scale categories—that is, the 20 and 60 rating categories (Linacre, 1994). Outfit indices for raters take the same form as infit indices but are more sensitive to the occasional unexpected extreme rating from an otherwise consistent rater. (The term “outfit” is shorthand for “outlier-sensitive fit statistic.”) Most often the infit and outfit mean-square indices for a rater will be identical. However, occasionally a rater may have an infit mean-square index that lies within established upper- and lower-control limits, but the rater may have an outfit mean-square index that lies outside the upper-control limit. In instances such as these, the rater will often have given a small number of highly unexpected extreme ratings. For the most part, the rater is internally consistent and uses the rating scale appropriately, but occasionally the rater will give a rating that seems out of character with the rater’s other ratings. Different testing programs adopt different upper- and lower-control limits for rater fit, depending in part upon the nature of the program and the level of resources available for investigating instances of misfit. Programs involved in making high-stakes decisions about examinees may use stringent upper- and lower-control limits (for instance, an upper-control limit of 1.3 and a lower-control limit of 0.7), while low-stakes programs may use more relaxed limits (for example, an upper-control limit of 1.5 and lower-control limit of 0.4). Generally, misfit greater than 1 is more problematic than is misfit less than 1. For this study, we adopted an upper-control limit of 1.3 and a lower-control limit of 0.7. The rater mean-square fit indices for the February data ranged from 0.4 to 1.4, while for April they ranged from 0.5 to 1.5. Tables 19 and 20 show the number and percentage of raters who exhibit three levels of fit in the February and April TSE data, respectively. These tables show that in each scoring session over 90% of the raters showed the expected amount of variation in their ratings. About 1% to 3% of the raters showed somewhat less variation than expected, while about 5% to 7% showed somewhat more variation than expected. 
Table 19
Frequencies and Percentages of Rater Mean-Square Fit Indices for the February 1997 TSE Data

                        Infit                                Outfit
Fit Range               Number of Raters    Percentage       Number of Raters    Percentage
fit < 0.7                      2                3%                  2                3%
0.7 ≤ fit ≤ 1.3               61               92%                 60               91%
fit > 1.3                      3                5%                  4                6%

Table 20
Frequencies and Percentages of Rater Mean-Square Fit Indices for the April 1997 TSE Data

                        Infit                                Outfit
Fit Range               Number of Raters    Percentage       Number of Raters    Percentage
fit < 0.7                      1                1%                  1                1%
0.7 ≤ fit ≤ 1.3               69               93%                 68               92%
fit > 1.3                      4                5%                  5                7%

Are there raters who rate examinee performances inconsistently?

For any combination of rater severity, examinee proficiency, and item difficulty, an expected rating can be computed based on the Partial Credit Model. Fit indices (infit and outfit) indicate the cumulative agreement between observed and expected ratings for a particular rater across all items and examinees encountered by that rater. Individual ratings that differ greatly from modeled expectations indicate ratings that may be problematic, and it is informative to examine the largest of these residuals to determine whether the validity of these ratings is suspect. In this study, we examined the proportion of times individual raters were associated with these large residuals.

Our test is based on the assumption that, if raters are performing interchangeably, then we can expect large residuals to be uniformly distributed across all the raters, relative to the proportion of the total ratings assigned by each rater. That is, the null proportion of large residuals for each rater, p, is N_u/N_t, where N_u is the total number of large residuals and N_t is the total number of ratings. For each rater, the observed proportion of large residuals, p_r, is simply N_ur/N_tr, where N_ur designates the number of unexpectedly large residuals associated with rater r and N_tr indicates the number of ratings assigned by rater r. Inconsistently scoring raters would be identified as those for whom the observed proportion exceeds the null proportion beyond what could be considered chance variation. Specifically, given sufficient sample size, we can determine the likelihood that the observed proportion could have been produced by the expected proportion using Equation 3 (Marascuilo & Serlin, 1988):

z_p = (p_r - p) / sqrt[(p - p^2) / N_tr]        (3)

If an observed z_p is greater than 2.00, we would conclude that the rater is rating inconsistently. The sum of the squared z_p values across raters should approximate a chi-square distribution with R - 1 degrees of freedom (where R equals the number of raters) under the null hypothesis that raters are rating consistently.
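The sketch below illustrates Equation 3 and the chi-square aggregation just described. The per-rater counts of ratings and of unexpectedly large residuals are hypothetical, not values from the TSE data.

# A minimal sketch of the z_p index and its chi-square aggregation.

import math
from scipy.stats import chi2

raters = {                      # rater id: (number of ratings, number of large residuals)
    "A": (624, 10),
    "B": (948, 14),
    "C": (264, 18),             # disproportionately many large residuals
    "D": (696, 9),
}

N_t = sum(n for n, _ in raters.values())      # total ratings
N_u = sum(u for _, u in raters.values())      # total large residuals
p = N_u / N_t                                 # null proportion

z_values = {}
for rater, (N_tr, N_ur) in raters.items():
    p_r = N_ur / N_tr
    z_values[rater] = (p_r - p) / math.sqrt((p - p ** 2) / N_tr)

for rater, z in z_values.items():
    flag = "inconsistent" if z > 2.00 else "within normal bounds"
    print(f"Rater {rater}: z_p = {z:+.2f} ({flag})")

chi_sq = sum(z ** 2 for z in z_values.values())
df = len(raters) - 1
print(f"chi-square({df}) = {chi_sq:.2f}, p = {chi2.sf(chi_sq, df):.4f}")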
Tables 21 and 22 show the number and percentage of raters who were associated with increasing levels of unexpected ratings for the February and April data. Raters whose z_p index is less than 2.00 can be considered to operate within normal bounds because the number of large residuals associated with their ratings can be attributed to random variation. Raters with increasingly larger z_p indices assign more unexpected ratings than would be predicted for them. These tables show that the frequency of unexpected ratings in the February scoring session was fairly low. Although the difference is statistically significant, the effect size is very small, χ²(66) = 120.39, p < .001, φ² = .003. Although there were considerably more inconsistent raters in the April session, the phi coefficient did not reveal a much stronger trend, χ²(73) = 163.70, p < .001, φ² = .005.

Table 21
Frequencies of Inconsistent Ratings for February 1997 TSE Raters

z_p                      Number of Raters    Percentage of Raters
z_p < 2.00                     61                   91%
2.00 ≤ z_p < 4.00               5                    8%
z_p ≥ 4.00                      1                    2%

Table 22
Frequencies of Inconsistent Ratings for April 1997 TSE Raters

z_p                      Number of Raters    Percentage of Raters
z_p < 2.00                     45                   61%
2.00 ≤ z_p < 4.00              13                   18%
z_p ≥ 4.00                     16                   22%

Are there any overly consistent raters whose ratings tend to cluster around the midpoint of the rating scale and who are reluctant to use the endpoints of the scale? Are there raters who tend to give an examinee ratings that differ less than would be expected across the 12 items? Are there raters who cannot effectively differentiate between examinees in terms of their levels of proficiency?

By combining the information provided by the fit indices and the proportion of unexpected ratings for each rater, we were able to identify specific ways that raters may be contributing to measurement error. Wolfe, Chiu, and Myford (1999) found that unique patterns of these indicators are associated with specific types of rater errors. Accurate ratings result in rater infit and outfit mean-square indices near their expectations (that is, about 1.0) and low proportions of unexpected ratings. Random rater errors, on the other hand, result in large rater infit and outfit mean-square indices as well as large proportions of unexpected ratings. Raters who restrict the range of the scores that they assign to examinees, by exhibiting halo effects (assigning similar scores to an examinee across all items) or centrality effects (assigning a large proportion of scores in the middle categories of a rating scale), have infit and outfit mean-square indices that are smaller than expected while having high proportions of unexpected ratings. Also, extreme scoring (a tendency to assign a large proportion of scores in the highest- and/or lowest-scoring categories) is indicated by a rater infit mean-square index that is close to its expected value, an outfit mean-square index that is large, and a high proportion of unexpected ratings. Table 23 summarizes the criteria used in this study for identifying raters exhibiting each of these types of rater errors. As shown in Tables 24 and 25, no raters in the February or April data exhibit centrality or halo effects, and only one rater exhibits extreme scoring. A small percentage of the raters in each scoring session (3% in February and 5% in April) show some evidence of randomness in their scoring. However, the majority of raters in both scoring sessions fit the accurate profile (70% in February and 49% in April). Some raters (27% in February and 45% in April) exhibited patterns of indicators that were not consistent with any of the rater errors we studied (i.e., other).
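The sketch below illustrates how the decision rules summarized in Table 23 (which follows) combine the infit, outfit, and z_p indicators to classify raters as accurate, random, halo/central, extreme, or other. The rater statistics used here are hypothetical, and the function is only a simplified rendering of those rules.

# A minimal sketch of rater classification using the Table 23 thresholds.

def classify_rater(infit, outfit, z_p):
    if 0.7 <= infit <= 1.3 and 0.7 <= outfit <= 1.3 and z_p <= 2.00:
        return "accurate"
    if infit > 1.3 and outfit > 1.3 and z_p > 2.00:
        return "random"
    if infit < 0.7 and outfit < 0.7 and z_p > 2.00:
        return "halo/central"
    if 0.7 <= infit <= 1.3 and outfit > 1.3 and z_p > 2.00:
        return "extreme"
    return "other"

sample_raters = {
    "A": (1.0, 1.0, 0.4),     # fits the accurate profile
    "B": (1.5, 1.6, 3.1),     # noisy ratings: random-error profile
    "C": (0.5, 0.6, 2.8),     # muted ratings: halo or centrality profile
    "D": (1.1, 1.7, 2.5),     # occasional extreme, unexpected ratings
    "E": (1.4, 1.1, 0.9),     # matches none of the profiles
}

for rater, stats in sample_raters.items():
    print(rater, classify_rater(*stats))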
Table 23
Rater Effect Criteria

Rater Effect      Infit                    Outfit                   z_p
Accurate          0.7 ≤ infit ≤ 1.3        0.7 ≤ outfit ≤ 1.3       z_p ≤ 2.00
Random            infit > 1.3              outfit > 1.3             z_p > 2.00
Halo/Central      infit < 0.7              outfit < 0.7             z_p > 2.00
Extreme           0.7 ≤ infit ≤ 1.3        outfit > 1.3             z_p > 2.00

Table 24
Rater Effects for February 1997 TSE Raters

Rater Effect      Number of Raters    Percentage of Raters
Accurate                46                   70%
Random                   2                    3%
Halo/Central             0                    0%
Extreme                  0                    0%
Other                   18                   27%

Note: Raters whose indices do not fit these patterns are labeled "other."

Table 25
Rater Effects for April 1997 TSE Raters

Rater Effect      Number of Raters    Percentage of Raters
Accurate                36                   49%
Random                   4                    5%
Halo/Central             0                    0%
Extreme                  1                    1%
Other                   33                   45%

Note: Raters whose indices do not fit these patterns are labeled "other."

Conclusions

An important goal in designing and administering a performance assessment is to produce scores from the assessment that will lead to valid generalizations about an examinee's achievement in the domain of interest (Linn, Baker, & Dunbar, 1991). We do not want the inferences we make based on the examinee's performance to be bound by the particulars of the assessment situation itself. More specifically, to have meaning, the inferences we make from an examinee's score cannot be tied to the particular raters who scored the examinee's performance, to the particular set of items the examinee took, to the particular rating scale used, or to the particular examinees who were assessed in a given administration. If we are to accomplish this goal, we need tools that will enable us to monitor sources of variability within the performance assessment system we have created so that we can determine the extent to which these sources affect examinee performance. Beyond this, these tools must provide useful diagnostic information about how the various sources are functioning. In this study, we examined four sources of variability in the TSE assessment system to gain a better understanding of how the complex system operates and where it might need to change.

Variability is present in any performance assessment system; within a statistical framework, typical ranges can be modeled. In a system that is "under statistical control" (Deming, 1975), the major sources of variability have been identified, and observations have been found to follow typical, expected patterns. Quantifying these patterns is useful because it makes evident the uncertainty for final inferences (in this case, final scores) associated with aspects of the system and allows effects on the system to be monitored when changes are introduced. Here, we determined to what extent the TSE assessment system is functioning as intended by examining four sources of variability within the system: the examinees, the items, the rating scale, and the raters.

Examinees

When we focused our lens on the examinees that took the February and April 1997 administrations of the TSE, we found that the test usefully separated examinees into eight statistically distinct levels of proficiency. The examinee proficiency measures were determined to be trustworthy in terms of their precision and stability. We would expect an average examinee's true proficiency to lie within about two raw score points of his or her reported score 95% of the time. It is important to emphasize, however, that the size of the standard error of measurement varies across the proficiency distribution, particularly at the tails of the distribution.
Therefore, one should not assume that the standard error of measurement is constant across that distribution. This finding has important implications for institutions setting their own cutscores on the TSE. The proficiency measures for individual examinees have measurement properties that allow them to be used for making decisions about individuals. Few examinees from either administration of the test exhibited unusual profiles of ratings across the 12 TSE items. Indeed, only 1% of the examinees in each of the administrations showed significant misfit. However, even in isolated cases like these, TSE program personnel might want to review examinee performances on the misfitting items before issuing score reports, particularly if an examinee's measure is near a critical decision-making point in the score distribution.

When we examined the current procedure for resolving discrepant ratings, we found that the Facets analysis identified some patterns of ratings as suspect that would not have been so identified using the TSE discrepancy criteria. The current TSE procedure for identifying discrepant ratings focuses on absolute differences in the ratings assigned by two raters. Such a procedure nominates a large number of cases for third-rater resolution. Cases that result from the pairing of severe and lenient raters could be handled more easily by adjusting an examinee's score for the severity or leniency of the two raters rather than by having a third rater adjudicate. If Facets were used during a scoring session to analyze TSE data, then there would be no need to bring these cases to the attention of more experienced TSE raters. Facets would not identify these cases as discrepant but rather would automatically adjust these scores for differences in rater severity, freeing the more experienced TSE raters to concentrate their efforts on cases deserving more of their time and attention. Indeed, the aberrant patterns that Facets does identify are cases that would be difficult to resolve by adjusting scores for differences in rater severity. These cases are most likely situations in which raters make seemingly random rating errors or in which examinees perform across items in a way that differs from how other examinees perform on the same items. Future TSE research that focuses on how best to identify and resolve discrepant ratings may result in both lower scoring costs (by requiring fewer resolutions) and greater score accuracy (by identifying cases that cannot readily be resolved through adjustments for differences in rater severity). Studies might also be undertaken to determine whether there are common features of examinee responses, or of rater behavior, that result in misfit. Facets analyses can identify misfitting examinees or misfitting raters, but other research methodologies (such as analysis of rater protocols, discourse analysis, and task analysis) could be used in conjunction with Facets to gain an understanding not only of where misfit occurs but also of why it occurs. The results of such research could usefully inform rater training procedures.

Items

From our analysis of the TSE items, we conclude that the items on each test work together; ratings on one item correspond well to ratings on the other items. Yet, none of the items appear to function in a redundant fashion (all items had infit mean-square indices greater than 0.7).
Ratings on individual items within the test can be meaningfully combined; there is little evidence of psychometric multidimensionality in either of the two data sets (all items had infit mean-square indices smaller than 1.3). Consequently, it is appropriate to generate a single summary measure to capture the essence of examinee performance across the 12 items. This should come as good news to the TSE program, because these findings suggest that there is little overlap among the 12 items on the TSE. Each item appears to make an independent contribution to the measurement of communicative language ability. However, the items differ little in terms of their difficulty; examinees do not have a harder time earning high ratings on some items than on others. If test developers were to add to the assessment some items that are easier than those that currently appear on the TSE, as well as some items that are more difficult, the instrument might better discriminate among levels of examinee proficiency.

It should be noted that, because of the current design of the TSE, in which two common items link one test form to all other test forms, our research did not allow us to examine item comparability across TSE forms. Future research that looks at issues of item comparability, and at the feasibility of using Facets as a tool for equating TSE administrations through common link items or common raters, would also seem to be of potential benefit to the TSE program.

TSE Rating Scale

The TSE rating scale functions as a five-point scale, and, for the most part, the scale categories are clearly distinguishable. The scale maintains a similar, though not identical, category structure across the items. Because raters use the 20 category infrequently on a number of items, that category appears problematic for some of those items. However, it may be the distribution-sensitive nature of the statistical estimation procedures used to produce the category calibrations that makes the category appear problematic. In any case, it will be important in the future to monitor raters' use of the 20 category closely to make certain that those ratings continue to be trustworthy.

Raters

When we examined the raters of the February and April 1997 TSE, we found that they differed somewhat in the severity with which they rated examinees, confirming Marr's (1994) findings. If we were to adjust for differences in rater severity, the fair averages for two-thirds of the examinees in both TSE administrations would have differed from their raw score averages by more than one-half point. The largest rater effect would have been 3.6 raw score points, which means that the most an examinee's score would have changed is about 4 points on the 20-to-60 scale. Though these differences may seem small, they can have important consequences, especially for examinees whose scores lie in critical decision-making regions of the score distribution. For these examinees, whether or not they meet an institution's cutscore may be determined by such adjustments. While the raters differed in severity, the vast majority used the TSE rating scale in a consistent fashion. The raters appear to be internally consistent but are not interchangeable, confirming the findings of Weigle (1994).
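To make the size of such an adjustment concrete, the Python sketch below computes an expected rating under a Rasch rating scale model twice: once with a severe rater, and once with rater severity set to the scale average, which is the logic behind a fair average. The threshold, difficulty, and severity values are invented for illustration and are not TSE calibrations, and the calculation is a simplified stand-in for the adjustment that Facets performs, not the program's actual algorithm.

import math

def category_probs(theta, delta, alpha, taus):
    """Rating scale model probabilities for categories 0..len(taus).

    theta -- examinee proficiency (logits)
    delta -- item difficulty (logits)
    alpha -- rater severity (logits)
    taus  -- category thresholds tau_1..tau_K (logits)
    """
    # Cumulative sums of (theta - delta - alpha - tau_k) give the
    # unnormalized log-probability of each category.
    logits = [0.0]
    for tau in taus:
        logits.append(logits[-1] + (theta - delta - alpha - tau))
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_category(theta, delta, alpha, taus):
    """Expected category (0..K) under the model."""
    probs = category_probs(theta, delta, alpha, taus)
    return sum(k * p for k, p in enumerate(probs))

# Invented values for illustration only (not TSE calibrations).
taus = [-2.0, -0.7, 0.9, 1.8]        # four thresholds -> five categories
theta, delta = 1.2, 0.0              # examinee measure, item difficulty
severe, average = 0.8, 0.0           # rater severity in logits

observed = 20 + 10 * expected_category(theta, delta, severe, taus)
fair = 20 + 10 * expected_category(theta, delta, average, taus)
print(f"expected score from a severe rater: {observed:.1f}")
print(f"severity-adjusted (fair) score:     {fair:.1f}")

With these invented values, the expected score from the severe rater falls several points below the severity-adjusted score; this is the kind of gap that adjusting for rater severity is meant to remove.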
Our analyses of individual rater behavior revealed that the majority of raters in both of the TSE scoring sessions we studied (70% of the raters in February and 49% in April) fit the profile of accurate raters. No raters exhibited centrality or halo effects, only one rater (1% of the April raters) exhibited extreme scoring tendencies, and a small percentage of raters in each scoring session (3% of the raters in February and 5% in April) showed some evidence of randomness in their scoring. This latter category of raters, as well as those whose patterns of indicators were not consistent with any of the rater errors we studied, should be the subject of further research, because it seems likely that these raters are responsible for the types of aberrant rating patterns that are most difficult to correct. The results of such research would benefit the TSE program, and ultimately improve the accuracy of TSE scores, because they could provide information to guide future rater training and evaluation efforts.

Next Steps

The scoring of an administration of the TSE involves many raters, each of whom evaluates a relatively small number of performances in any given scoring session. Establishing sufficient connectivity in such a rating design is an arduous task. Even if the rating design is such that all raters can be linked, in a number of cases the connections between raters are likely to be weak and tentative, because the number of examinees any given pair of raters scores in common is very limited. If there is insufficient connectivity in the data, then it is not possible to calibrate all raters on the same scale. Consequently, raters cannot be directly compared in terms of the degree of severity they exercise when scoring, because no single frame of reference has been established for making such comparisons (Linacre, 1994). In the absence of a single frame of reference, examinees (and items, too) cannot be compared on the same scale. As Marr (1994) noted, insufficient overlap among raters has been a problem for researchers using Facets to analyze TSE data. When analyzing data from the 1992-93 TSE administrations, for instance, she found that the ratings of eight raters had to be deleted because those raters were insufficiently connected to the others. In the present study, all raters were connected, but the connections between a number of the raters were weak. Myford, Marr, and Linacre (1996) found the same connectivity problem in their analysis of data from the scoring of the Test of Written English (TWE®). To strengthen the network for linking raters, they suggested instituting a rating design that calls for all raters involved in a scoring session to rate a small common set of examinee performances (in addition to their normal workload). In effect, a block of ratings from a fully crossed rating design would be embedded into an otherwise lean data structure, thereby ensuring that all raters are linked through this common set (a simple programmatic check of this kind of connectivity is sketched below). We have completed a study that enabled us to evaluate the effectiveness of this rater linking strategy (Myford & Wolfe, in press). All raters who took part in the February and April 1997 TSE scoring sessions rated a common set of six audiotaped examinee performances selected from the tapes to be scored during that session. Prior to each scoring session, a small group of very experienced TSE raters who supervised that session met to select the six common tapes that all raters would be required to score.
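As noted above, the value of embedding a common set of performances is that it guarantees connectivity among the raters. The Python sketch below shows one simple way to verify that a rating design is connected, treating the design as a flat list of (rater, examinee) pairs and grouping raters that are linked through shared examinees. The data layout and function name are illustrative assumptions, and the depth-first search is a generic connectivity check, not the procedure described by Linacre (1994) or implemented in Facets.

from collections import defaultdict

def rater_components(ratings):
    """Group raters into connected subsets: two raters end up in the same
    subset whenever a chain of shared examinees links them.

    ratings -- an iterable of (rater_id, examinee_id) pairs, a hypothetical
    flat representation of the rating design (not the Facets input format).
    """
    # Which raters scored each examinee?
    by_examinee = defaultdict(set)
    for rater, examinee in ratings:
        by_examinee[examinee].add(rater)

    # Rater adjacency through shared examinees.
    neighbors = defaultdict(set)
    for raters in by_examinee.values():
        for r in raters:
            neighbors[r] |= raters - {r}

    # Depth-first search over raters to collect connected components.
    seen, components = set(), []
    for start in list(neighbors):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            r = stack.pop()
            if r in component:
                continue
            component.add(r)
            stack.extend(neighbors[r] - component)
        seen |= component
        components.append(component)
    return components

# R1 and R2 share examinee E1, but R3 shares no examinees with anyone.
design = [("R1", "E1"), ("R2", "E1"), ("R2", "E2"), ("R3", "E3")]
print(rater_components(design))    # two components: {R1, R2} and {R3}

If more than one component is returned, some raters cannot be placed on the same scale as the others without additional linking ratings, which is exactly the gap that a common set of tapes is intended to close.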
To select the set of six, the experienced raters listened to a number of tapes. For each scoring session, they chose a set that displayed a range of examinee proficiency: a few low-scoring examinees, a few high-scoring examinees, and some examinees whose scores would likely fall in the middle categories of the TSE rating scale. They also included in this set a few tapes that would be hard for raters to score (for example, examinees whose level of performance varied from item to item) as well as tapes that they judged to be solid examples of performance at a specific point on the scale (for example, examinees who would likely receive the same or nearly the same rating on each item). Once the set of tapes was identified, additional copies of each tape were made. These tapes were seeded into the scoring session so that raters would not know that they were any different from the other tapes being scored.

The specific questions that focused this study of the linking strategy include the following:

1. How does embedding blocks of ratings from various smaller subsets of the six examinee tapes in the operational data affect:
   • the stability of examinee proficiency measures?
   • the stability of rater severity measures?
   • the fit of the raters?
   • the spread of rater severity measures?
   • the spread of examinee proficiency measures?

2. How many tapes do raters need to score in common in order to establish the minimal requisite connectivity in the rating design, thereby ensuring that all raters and examinees can be placed on a single scale? What (if anything) is to be gained by having all raters score more than one or two tapes in common?

3. What are the characteristics of tapes that produce the highest quality linking? Are tapes that exhibit certain characteristics more effective as linking tools than tapes that exhibit other characteristics?

Now that this study is complete, we are in a better position to advise the TSE program regarding the adequacy of its current rating design and to suggest any changes that might be instituted to improve the quality of examinee scores.

References

Andrich, D. (1985). An elaboration of Guttman scaling with Rasch models for measurement. In N. Brandon-Tuma (Ed.), Sociological methodology (pp. 33-80). San Francisco: Jossey-Bass.

Andrich, D. (1998). Thresholds, steps, and rating scale conceptualization. Rasch Measurement: Transactions of the Rasch Measurement SIG, 12 (3), 648.

Andrich, D., Sheridan, B., & Luo, G. (1997). RUMM (Version 2.7): A Windows-based item analysis program employing Rasch unidimensional measurement models. Perth, Western Australia: School of Education, Murdoch University.

Bachman, L. F., Lynch, B. K., & Mason, M. (1995). Investigating variability in tasks and rater judgments in a performance test of foreign language speaking. Language Testing, 12 (2), 238-257.

Bejar, I. I. (1985). A preliminary study of raters for the Test of Spoken English (TOEFL Research Report No. 18). Princeton, NJ: Educational Testing Service.

Brown, A. (1995). The effect of rater variables in the development of an occupation-specific language performance test. Language Testing, 12 (1), 1-15.

Deming, W. E. (1975). On statistical aids toward economic production. Interfaces, 5, 1-15.

Engelhard, G., Jr. (1994). Examining rater errors in the assessment of written composition with a many-faceted Rasch model. Journal of Educational Measurement, 31 (2), 93-112.
Fisher, W. P., Jr. (1992). Reliability statistics. Rasch Measurement: Transactions of the Rasch Measurement SIG, 6 (3), 238.

Heller, J. I., Sheingold, K., & Myford, C. M. (1998). Reasoning about evidence in portfolios: Cognitive foundations for valid and reliable assessment. Educational Assessment, 5 (1), 5-40.

Henning, G. (1992). Dimensionality and construct validity of language tests. Language Testing, 9 (1), 1-11.

Linacre, J. M. (1989). Many-faceted Rasch measurement. Chicago: MESA Press.

Linacre, J. M. (1994). A user's guide to Facets: Rasch measurement computer program [Computer program manual]. Chicago: MESA Press.

Linacre, J. M. (1998). Thurstone thresholds and the Rasch model. Rasch Measurement: Transactions of the Rasch Measurement SIG, 12 (2), 634-635.

Linacre, J. M. (1999a). Facets, Version 3.17 [Computer program]. Chicago: MESA Press.

Linacre, J. M. (1999b). Investigating rating scale category utility. Journal of Outcome Measurement, 3 (2), 103-122.

Linacre, J. M., Engelhard, G., Tatum, D. S., & Myford, C. M. (1994). Measurement with judges: Many-faceted conjoint measurement. International Journal of Educational Research, 21 (6), 569-577.

Linn, R. L., Baker, E., & Dunbar, S. B. (1991). Complex performance-based assessment: Expectations and validation criteria. Educational Researcher, 20 (8), 15-21.

Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12 (1), 54-71.

Lunz, M. E., & Stahl, J. A. (1990). Judge consistency and severity across grading periods. Evaluation and the Health Professions, 13 (14), 425-444.

Lynch, B. K., & McNamara, T. F. (1994, March). Using G-theory and multi-faceted Rasch measurement in the development of performance assessments of the ESL speaking skills of immigrants. Paper presented at the 16th annual Language Testing Research Colloquium, Washington, DC.

Marascuilo, L. A., & Serlin, R. C. (1988). Statistical methods for the social and behavioral sciences. New York: W. H. Freeman.

Marr, D. B. (1994). A comparison of equating and calibration methods for the Test of Spoken English. Unpublished report.

McNamara, T. F. (1991). Test dimensionality: IRT analysis of an ESP listening test. Language Testing, 8 (2), 45-65.

McNamara, T. F. (1996). Measuring second language performance. Essex, England: Addison Wesley Longman.

McNamara, T. F., & Adams, R. J. (1991, March). Exploring rater behaviour with Rasch techniques. Paper presented at the 13th annual Language Testing Research Colloquium, Princeton, NJ.

Myford, C. M., Marr, D. B., & Linacre, J. M. (1996). Reader calibration and its potential role in equating for the TWE (TOEFL Research Report No. 95-40). Princeton, NJ: Educational Testing Service.

Myford, C. M., & Mislevy, R. J. (1994). Monitoring and improving a portfolio assessment system (MS #94-05). Princeton, NJ: Educational Testing Service, Center for Performance Assessment.

Myford, C. M., & Wolfe, E. W. (in press). Strengthening the ties that bind: Improving the linking network in sparsely connected rating designs (TOEFL Technical Report No. 15). Princeton, NJ: Educational Testing Service.

Paulukonis, S. T., Myford, C. M., & Heller, J. I. (in press). Formative evaluation of a performance assessment scoring system. In G. Engelhard, Jr. & M. Wilson (Eds.), Objective measurement: Theory into practice: Vol. 5. Norwood, NJ: Ablex Publishing.

Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press.
Sheridan, B. (1993, April). Threshold location and Likert-style questionnaires. Paper presented at the Seventh International Objective Measurement Workshop, American Educational Research Association annual meeting, Atlanta, GA.

Sheridan, B., & Puhl, L. (1996). Evaluating an indirect measure of student literacy competencies in higher education using Rasch measurement. In G. Engelhard, Jr. & M. Wilson (Eds.), Objective measurement: Theory into practice: Vol. 3 (pp. 19-44). Norwood, NJ: Ablex Publishing.

Stone, M., & Wright, B. D. (1988). Separation statistics in Rasch measurement (Research Memorandum No. 51). Chicago: MESA Press.

TSE Committee. (1996). The St. Petersburg protocol: An agenda for a TSE validity mosaic. Unpublished manuscript.

TSE Program Office. (1995). TSE score user's manual. Princeton, NJ: Educational Testing Service.

Tyndall, B., & Kenyon, D. M. (1995). Validation of a new holistic rating scale using Rasch multifaceted analysis. In A. Cumming & R. Berwick (Eds.), Validation in language testing. Clevedon, England: Multilingual Matters.

Weigle, S. C. (1994, March). Using FACETS to model rater training effects. Paper presented at the 16th annual Language Testing Research Colloquium, Washington, DC.

Wigglesworth, G. (1993). Exploring bias analysis as a tool for improving rater consistency in assessing oral interaction. Language Testing, 10 (3), 305-335.

Wolfe, E. W., Chiu, C. W. T., & Myford, C. M. (1999). The manifestation of common rater errors in multi-faceted Rasch analyses (MS #97-02). Princeton, NJ: Educational Testing Service, Center for Performance Assessment.

Wright, B. D., & Masters, G. N. (1982). Rating scale analysis: Rasch measurement. Chicago: MESA Press.

Appendix
TSE Band Descriptor Chart (Draft 4/5/96)

Overall features to consider:

Functional competence is the speaker's ability to select functions to reasonably address the task and to select the language needed to carry out the function.

Sociolinguistic competence is the speaker's ability to demonstrate an awareness of audience and situation by selecting language, register (level of formality), and tone that are appropriate.

Discourse competence is the speaker's ability to develop and organize information in a coherent manner and to make effective use of cohesive devices to help the listener follow the organization of the response.

Linguistic competence is the effective selection of vocabulary, control of grammatical structures, and accurate pronunciation, along with smooth delivery, in order to produce intelligible speech.

TSE BAND DESCRIPTOR CHART

Band 60: Communication almost always effective: task performed very competently.
Speaker volunteers information freely, with little or no effort, and may go beyond the task by using additional appropriate functions.
• Native-like repair strategies
• Sophisticated expressions
• Very strong content
• Almost no listener effort required
Functions performed clearly and effectively. Speaker is highly skillful in selecting language to carry out intended functions that reasonably address the task.
Appropriate response to audience/situation. Speaker almost always considers register and demonstrates audience awareness.
• Understanding of context, and strength in discourse and linguistic competence, demonstrate sophistication
Coherent, with effective use of cohesive devices. Response is coherent, with logical organization and clear development.
• Contains enough details to almost always be effective
• Sophisticated cohesive devices result in smooth connection of ideas
Use of linguistic features almost always effective; communication not affected by minor errors.
• Errors not noticeable
• Accent not distracting
• Range in grammatical structures and vocabulary
• Delivery often has native-like smoothness

Band 50: Communication generally effective: task performed competently.
Speaker volunteers information, sometimes with effort; usually does not run out of time.
• Linguistic weaknesses may necessitate some repair strategies that may be slightly distracting
• Expressions sometimes awkward
• Generally strong content
• Little listener effort required
Functions generally performed clearly and effectively. Speaker is able to select language to carry out functions that reasonably address the task.
Generally appropriate response to audience/situation. Speaker generally considers register and demonstrates sense of audience awareness.
• Occasionally lacks extensive range, variety, and sophistication; response may be slightly unpolished
Coherent, with some effective use of cohesive devices. Response is generally coherent, with generally clear, logical organization and adequate development.
• Contains enough details to be generally effective
• Some lack of sophistication in use of cohesive devices may detract from smooth connection of ideas
Use of linguistic features generally effective; communication generally not affected by errors.
• Errors not unusual, but rarely major
• Accent may be slightly distracting
• Some range in vocabulary and grammatical structures, which may be slightly awkward or inaccurate
• Delivery generally smooth, with some hesitancy and pauses

Band 40: Communication somewhat effective: task performed somewhat competently.
Speaker responds with effort; sometimes provides limited speech sample and sometimes runs out of time.
• Sometimes excessive, distracting, and ineffective repair strategies used to compensate for linguistic weaknesses (e.g., vocabulary and/or grammar)
• Adequate content
• Some listener effort required
Functions performed somewhat clearly and effectively. Speaker may lack skill in selecting language to carry out functions that reasonably address the task.
Somewhat appropriate response to audience/situation. Speaker demonstrates some audience awareness, but register is not always considered.
• Lack of linguistic skills that would demonstrate sociolinguistic sophistication
Somewhat coherent, with some use of cohesive devices. Coherence of the response is sometimes affected by lack of development and/or somewhat illogical or unclear organization, sometimes leaving the listener confused.
• May lack details
• Mostly simple cohesive devices are used
• Somewhat abrupt openings and closures
Use of linguistic features somewhat effective; communication sometimes affected by errors.
• Minor and major errors present
• Accent usually distracting
• Simple structures sometimes accurate, but errors in more complex structures common
• Limited range in vocabulary; some inaccurate word choices
• Delivery often slow or choppy; hesitancy and pauses common

Band 30: Communication generally not effective: task generally performed poorly.
Speaker responds with much effort; provides limited speech sample and often runs out of time.
• Repair strategies excessive, very distracting, and ineffective
• Much listener effort required
• Difficult to tell if task is fully performed because of linguistic weaknesses, but function can be identified
Functions generally performed unclearly and ineffectively. Speaker often lacks skill in selecting language to carry out functions that reasonably address the task.
Generally inappropriate response to audience/situation. Speaker usually does not demonstrate audience awareness since register is often not considered.
• Lack of linguistic skills generally masks sociolinguistic skills
Generally incoherent, with little use of cohesive devices. Response is often incoherent; loosely organized, inadequately developed, or disjointed discourse often leaves the listener confused.
• Often lacks details
• Simple conjunctions used as cohesive devices, if at all
• Abrupt openings and closures
Use of linguistic features generally poor; communication often impeded by major errors.
• Limited linguistic control; major errors present
• Accent very distracting
• Speech contains numerous sentence fragments and errors in simple structures
• Frequent inaccurate word choices; general lack of vocabulary for task completion
• Delivery almost always plodding, choppy, and repetitive; hesitancy and pauses very common

Band 20: No effective communication; no evidence of ability to perform task.
Extreme speaker effort is evident; speaker may repeat prompt, give up on task, or be silent.
• Attempts to perform task end in failure
• Only isolated words or phrases intelligible, even with much listener effort
• Function cannot be identified
No evidence that functions were performed. Speaker is unable to select language to carry out the functions.
No evidence of ability to respond appropriately to audience/situation. Speaker is unable to demonstrate sociolinguistic skills and fails to acknowledge audience or consider register.
Incoherent, with no use of cohesive devices. Response is incoherent.
• Lack of linguistic competence interferes with listener's ability to assess discourse competence
Use of linguistic features poor; communication ineffective due to major errors.
• Lack of linguistic control
• Accent so distracting that few words are intelligible
• Speech contains mostly sentence fragments, repetition of vocabulary, and simple phrases
• Delivery so plodding that only few words are produced