TEST OF ENGLISH AS A FOREIGN LANGUAGE
Research Reports
Report 65
June 2000

Monitoring Sources of Variability Within the Test of Spoken English Assessment System

Carol M. Myford
Edward W. Wolfe

Educational Testing Service
Princeton, New Jersey

RR-00-6
Educational Testing Service is an Equal Opportunity/Affirmative Action Employer.
Copyright © 2000 by Educational Testing Service. All rights reserved.
No part of this report may be reproduced or transmitted in any form or by any means,
electronic or mechanical, including photocopy, recording, or any information storage
and retrieval system, without permission in writing from the publisher. Violators will be
prosecuted in accordance with both U.S. and international copyright laws.
EDUCATIONAL TESTING SERVICE, ETS, the ETS logo, GRE, TOEFL, the TOEFL
logo, TSE, and TWE are registered trademarks of Educational Testing Service. The
modernized ETS logo is a trademark of Educational Testing Service.
FACETS Software is copyrighted by MESA Press, University of Chicago.
Abstract
The purposes of this study were to examine four sources of variability within the
Test of Spoken English (TSE®) assessment system, to quantify ranges of variability for
each source, to determine the extent to which these sources affect examinee performance,
and to highlight aspects of the assessment system that might suggest a need for change.
Data obtained from the February and April 1997 TSE scoring sessions were analyzed
using Facets (Linacre, 1999a).
The analysis showed that, for each of the two TSE administrations, the test
usefully separated examinees into eight statistically distinct proficiency levels. The
examinee proficiency measures were found to be trustworthy in terms of their precision
and stability. It is important to note, though, that the standard error of measurement varies
across the score distribution, particularly in the tails of the distribution.
The items on the TSE appear to work together; ratings on one item correspond
well to ratings on the other items. Yet, none of the items seem to function in a redundant
fashion. Ratings on individual items within the test can be meaningfully combined; there
is little evidence of psychometric multidimensionality in the two data sets. Consequently,
it is appropriate to generate a single summary measure to capture the essence of examinee
performance across the 12 items. However, the items differ little in terms of difficulty,
thus limiting the instrument’s ability to discriminate among levels of proficiency.
The TSE rating scale functions as a five-point scale, and the scale categories are
clearly distinguishable. The scale maintains a similar though not identical category
structure across all 12 items. Raters differ somewhat in the levels of severity they
exercise when they rate examinee performances. The vast majority used the scale in a
consistent fashion, though. If examinees’ scores were adjusted for differences in rater
severity, the scores of two-thirds of the examinees in these administrations would have
differed from their raw score averages by 0.5 to 3.6 raw score points. Such differences
can have important consequences for examinees whose scores lie in critical decision-making regions of the score distribution.
Key words: oral assessment, second language performance assessment, Item Response Theory (IRT), rater performance, Rasch Measurement, Facets
The Test of English as a Foreign Language (TOEFL®) was developed in 1963 by the National Council
on the Testing of English as a Foreign Language. The Council was formed through the cooperative
effort of more than 30 public and private organizations concerned with testing the English proficiency
of nonnative speakers of the language applying for admission to institutions in the United States. In
1965, Educational Testing Service (ETS®) and the College Board assumed joint responsibility for the
program. In 1973, a cooperative arrangement for the operation of the program was entered into by ETS,
the College Board, and the Graduate Record Examinations (GRE®) Board. The membership of the
College Board is composed of schools, colleges, school systems, and educational associations; GRE
Board members are associated with graduate education.
ETS administers the TOEFL program under the general direction of a Policy Council that was
established by, and is affiliated with, the sponsoring organizations. Members of the Policy Council
represent the College Board, the GRE Board, and such institutions and agencies as graduate schools
of business, junior and community colleges, nonprofit educational exchange agencies, and agencies
of the United States government.
✥   ✥   ✥
A continuing program of research related to the TOEFL test is carried out under the direction of the
TOEFL Committee of Examiners. Its 11 members include representatives of the Policy Council, and
distinguished English as a second language specialists from the academic community. The Committee
meets twice yearly to review and approve proposals for test-related research and to set guidelines for
the entire scope of the TOEFL research program. Members of the Committee of Examiners serve
three-year terms at the invitation of the Policy Council; the chair of the committee serves on the Policy
Council.
Because the studies are specific to the TOEFL test and the testing program, most of the actual research
is conducted by ETS staff rather than by outside researchers. Many projects require the cooperation
of other institutions, however, particularly those with programs in the teaching of English as a foreign
or second language and applied linguistics. Representatives of such programs who are interested in
participating in or conducting TOEFL-related research are invited to contact the TOEFL program
office. All TOEFL research projects must undergo appropriate ETS review to ascertain that data
confidentiality will be protected.
Current (1999-2000) members of the TOEFL Committee of Examiners are:
Diane Belcher, The Ohio State University
Richard Berwick, Ritsumeikan Asia Pacific University
Micheline Chalhoub-Deville, University of Iowa
JoAnn Crandall (Chair), University of Maryland, Baltimore County
Fred Davidson, University of Illinois at Urbana-Champaign
Glenn Fulcher, University of Surrey
Antony J. Kunnan (Ex-Officio), California State University, LA
Ayatollah Labadi, Institut Superieur des Langues de Tunis
Reynaldo F. Macías, University of California, Los Angeles
Merrill Swain, The University of Toronto
Carolyn E. Turner, McGill University
To obtain more information about TOEFL programs and services, use one of the following:
E-mail: toefl@ets.org
Web site: http://www.toefl.org
Acknowledgments
This work was supported by the Test of English as a Foreign Language (TOEFL)
Research Program at Educational Testing Service.
We are grateful to Daniel Eignor, Carol Taylor, Gwyneth Boodoo, Evelyne Aguirre
Patterson, Larry Stricker, and the TOEFL Research Committee for helpful comments on
an earlier draft of the paper. We especially thank the readers of the Test of Spoken
English and the program’s administrative personnel—Tony Ostrander, Evelyne Aguirre
Patterson, and Pam Esbrandt—without whose cooperation this project could never have
succeeded.
Table of Contents
Introduction......................................................................................................................1
Rationale for the Study ....................................................................................................2
Review of the Literature ..................................................................................................3
Method .............................................................................................................................5
Examinees ............................................................................................................5
Instrument ............................................................................................................6
Raters and the Rating Process..............................................................................6
Procedure .............................................................................................................7
Results..............................................................................................................................9
Examinees ............................................................................................................11
Items.....................................................................................................................20
TSE Rating Scale .................................................................................................24
Raters ...................................................................................................................34
Conclusions......................................................................................................................43
Examinees ............................................................................................................43
Items.....................................................................................................................44
TSE Rating Scale .................................................................................................45
Raters ...................................................................................................................45
Next Steps ............................................................................................................46
References........................................................................................................................48
Appendix..........................................................................................................................51
List of Tables
Page
Table 1. Distribution of TSE Examinees Across Geographic Locations........................5
Table 2. TSE Rating Scale ..............................................................................................8
Table 3. Misfitting and Overfitting Examinees from the February and April 1997
TSE Administrations........................................................................................................16
Table 4. Rating Patterns and Fit Indices for Selected Examinees...................................17
Table 5. Examinees from the February 1997 TSE Administration Identified as
Having Suspect Rating Patterns.......................................................................................19
Table 6. Examinees from the April 1997 TSE Administration Identified as
Having Suspect Rating Patterns.......................................................................................20
Table 7. Item Measurement Report for the February 1997 TSE Administration ...........21
Table 8. Item Measurement Report for the April 1997 TSE Administration .................21
Table 9. Rating Scale Category Calibrations for the February 1997 TSE Items ............25
Table 10. Rating Scale Category Calibrations for the April 1997 TSE Items ................25
Table 11. Average Examinee Proficiency Measures and Outfit Mean-Square
Indices for the February 1997 TSE Items ........................................................................28
Table 12. Average Examinee Proficiency Measures and Outfit Mean-Square
Indices for the April 1997 TSE Items ............................................................................. 29
Table 13. Frequency and Percentage of Examinee Ratings in Each Category for
the February 1997 TSE Items ..........................................................................................30
Table 14. Frequency and Percentage of Examinee Ratings in Each Category for
the April 1997 TSE Items ................................................................................................31
Table 15. Rating Patterns and Fit Indices for Selected Examinees.................................33
Table 16. Summary Table for Selected TSE Raters ........................................................35
Table 17. Effects of Adjusting for Rater Severity on Examinee Raw Score
Averages, February 1997 TSE Administration ................................................................37
Table 18. Effects of Adjusting for Rater Severity on Examinee Raw Score
Averages, April 1997 TSE Administration......................................................................37
Table 19. Frequencies and Percentages of Rater Mean-Square Fit Indices for the
February 1997 TSE Data..................................................................................................39
Table 20. Frequencies and Percentages of Rater Mean-Square Fit Indices for the
April 1997 TSE Data .......................................................................................................39
Table 21. Frequencies of Inconsistent Ratings for February 1997 TSE Raters ..............40
Table 22. Frequencies of Inconsistent Ratings for April 1997 TSE Raters ...................41
Table 23. Rater Effect Criteria...................................................………………………..42
Table 24. Rater Effects for February 1997 TSE Data......................................................42
Table 25. Rater Effects for April 1997 TSE Data............................................................42
Appendix. TSE Band Descriptor Chart............................................................................51
List of Figures
Page
Figure 1. Map from the Facets Analysis of the Data from the February 1997 TSE
Administration .................................................................................................................12
Figure 2. Map from the Facets Analysis of the Data from the April 1997 TSE
Administration .................................................................................................................13
Figure 3. Category Probability Curves for Items 4 and 8 (February Test)......................26
Figure 4. Category Probability Curves for Items 4 and 7 (April Test)............................26
Introduction
Those in charge of monitoring quality control for complex assessment systems need
information that will help them determine whether all aspects of the system are working as
intended. If there are problems, they must pinpoint those particular aspects of the system that are
out of synch so that they can take meaningful, informed steps to improve the system. They need
answers to critical questions such as: Do some raters rate more severely than other raters? Do
any raters use the rating scales inconsistently? Are there examinees that exhibit unusual profiles
of ratings across items? Are the rating scales functioning appropriately? Answering these kinds
of questions requires going beyond interrater reliability coefficients and analysis of variance
main effects to understand the impact of the assessment system on individual raters, examinees,
assessment items, and rating scales. Data analysis approaches that provide only group-level
statistics are of limited help when one’s goal is to refine a complex assessment system. What is
needed for this purpose is information at the level of the individual rater, examinee, item, and
rating scale. The present study illustrates an approach to the analysis of rating data that provides
this type of information. We analyzed data from two 1997 administrations of the revised Test of
Spoken English (TSE®) using Facets (Linacre, 1999a), a Rasch-based computer software
program, to provide information to TSE program personnel for quality control monitoring.
In the beginning of this report, we provide a rationale for the study and lay out the
questions that focused our investigation. Next, we present a review of literature, discussing
previous studies that have used many-facet Rasch measurement to investigate complex rating
systems for evaluating speaking and writing. In the method section of the paper we describe the
examinees that took part in this study, the Test of Spoken English, the TSE raters, and the rating
procedure they employ. We then discuss the statistical analyses we performed, presenting the
many-facet Rasch model and its capabilities.
The results section is divided into several subsections. First, we examine the map that is
produced as part of Facets output. The map is perhaps the most informative piece of output from
the analysis, because it allows us to view all the facets of our analysis—examinees, TSE items,
raters, and the TSE rating scale—within a single frame of reference. The remainder of the results
section is organized around the specific quality control questions we explored with the Facets
output. We first answer a set of questions about the performance of examinees. We then look at
how the TSE items performed. Next, we turn our attention to questions about the TSE rating
scale. And lastly, we look at a set of questions that relate to raters and how they performed. We
then draw conclusions from our study, suggesting topics for future research.
Rationale for the Study
At its February 1996 meeting, the TSE Committee set forth a validation research agenda
to guide a long-term program for the collection of evidence to substantiate interpretations made
about scores on the revised TSE (TSE Committee, 1996). The committee laid out as first priority
a set of interrelated studies that focus on the generalizability of test scores. Of particular
importance were studies to determine the extent to which various factors (such as item/task
difficulty and rater severity) affect examinee performance on the TSE. The committee suggested
that Facets analyses and generalizability studies be carried out to monitor these sources of
variability as they operate in the TSE setting.
In response to the committee’s request, we conducted Facets analyses of data obtained
from two administrations of the revised TSE—February and April 1997. The purpose of the
study was to monitor four sources of variability within the TSE assessment system: (1)
examinees, (2) TSE items, (3) the TSE rating scale, and (4) raters. We sought to quantify
expected ranges of variability for each source, to determine the extent to which these sources
affect examinee performance, and to highlight aspects of the TSE assessment system that might
suggest a need for change.
Our study was designed to answer the following questions about the sources of
variability:
Examinees
• How much variability is there across examinees in their levels of proficiency? Who differs more—examinees in their levels of proficiency, or raters in their levels of severity?
• To what extent has the test succeeded in separating examinees into distinct strata of proficiency? How many statistically different levels of proficiency are identified by the test?
• Are the differences between examinee proficiency measures mainly due to measurement error or to differences in actual proficiency?
• How accurately are examinees measured? How much confidence can we have in the precision and stability of the measures of examinee proficiency?
• Do some examinees exhibit unusual profiles of ratings across the 12 TSE items? Does the current procedure for identifying and resolving discrepancies successfully identify all cases in which rater agreement is "out of statistical control" (Deming, 1975)?
Items
• Is it harder for examinees to get high ratings on some TSE items than others? To what extent do the 12 TSE items differ in difficulty?
• Can we calibrate ratings from all 12 TSE items, or do ratings on certain items frequently fail to correspond to ratings on other items (that is, are there certain items that do not "fit" with the others)? Can a single summary measure capture the essence of examinee performance across the items, or is there evidence of possible psychometric multidimensionality (Henning, 1992; McNamara, 1991; McNamara, 1996) in the data and, perhaps, a need to report a profile of scores rather than a single score summary for each examinee if it appears that systematic kinds of profile differences are appearing among examinees who have the same overall summary score?
• Are all 12 TSE items equally discriminating? Is item discrimination a constant for all 12 items, or does it vary across items?
TSE Rating Scale
• Are the five scale categories on the TSE rating scale appropriately ordered? Is the rating scale functioning properly as a five-point scale? Are the scale categories clearly distinguishable?
Raters
• Do TSE raters differ in the severity with which they rate examinees?
• If raters differ in severity, how do those differences affect examinee scores?
• How interchangeable are the raters?
• Do TSE raters use the TSE rating scale consistently?
• Are there raters who rate examinee performances inconsistently?
• Are there any overly consistent raters whose ratings tend to cluster around the midpoint of the rating scale and who are reluctant to use the endpoints of the scale? Are there raters who tend to give an examinee ratings that differ less than would be expected across the 12 items? Are there raters who cannot effectively differentiate between examinees in terms of their levels of proficiency?
Review of the Literature
Over the last several years, a number of performance assessment programs interested in
examining and understanding sources of variability in their assessment systems have been
experimenting with Linacre’s (1999a) Facets computer program as a monitoring tool (see, for
example, Heller, Sheingold, & Myford, 1998; Linacre, Engelhard, Tatum, & Myford, 1994; Lunz
& Stahl, 1990; Myford & Mislevy, 1994; Paulukonis, Myford, & Heller, in press). In this study,
we build on the pioneering efforts of researchers who are employing many-facet Rasch
measurement to answer questions about complex rating systems for evaluating speaking and
writing. These researchers have raised some critical issues that they are investigating with
Facets. For example,
• Can rater training enable raters to function interchangeably (Weigle, 1994)?
• Can rater training eliminate differences between raters in the degree of severity they exercise (Lumley & McNamara, 1995; Marr, 1994; McNamara & Adams, 1991; Tyndall & Kenyon, 1995; Wigglesworth, 1993)?
• Are rater characteristics stable over time (Lumley & McNamara, 1995; Marr, 1994; Myford, Marr, & Linacre, 1996)?
• What background characteristics influence the ratings raters give (Brown, 1995; Myford, Marr, & Linacre, 1996)?
• Do raters differ systematically in their use of the points on a rating scale (McNamara & Adams, 1991)?
• Do raters and tasks interact to affect examinee scores (Bachman, Lynch, & Mason, 1995; Lynch & McNamara, 1994)?
Several researchers examined the rating behavior of individual raters of the old Test of
Spoken English and reported differences between raters in the degree of severity they exercised
when rating examinee performances (Bejar, 1985; Marr, 1994), but no studies have as yet
compared raters of the revised TSE. Bejar (1985) compared the mean rating of individual TSE
raters and found that some raters tended to give lower ratings than others; in fact, the raters Bejar
studied did this consistently across all four scales of the old TSE (pronunciation, grammar,
fluency, and comprehension). More recently, Marr (1994) used Facets to analyze data from two
administrations of the old TSE (one in 1992 and one in 1993) and found that there was significant variation in
rater severity within each of the two administrations. She reported that more than two thirds of
the examinee scores in the first administration would have been changed if adjustments had been
made for rater severity, while more than half of the examinee scores would have been altered in
the second administration. In her study, Marr also looked at the stability of the rater severity
measures across administrations for the 33 raters who took part in both scoring sessions. She
found that the correlation between the two sets of rater severity measures was only 0.47. She
noted that the rater severity estimates were based on each rater having rated an average of only
30 examinees, and each rater was paired with fewer than half of the other raters in the sample.
This suggests, Marr hypothesized, that much of what appeared to be systematic variance
associated with differences in rater severity may instead have been random error. She concluded
that the operational use of Facets to adjust for rater effects would require some important
changes in the existing TSE rating procedures: A means would need to be found to create greater
overlap among raters so that all raters could be connected in the rating design (the ratings of
eight raters had to be deleted from her analysis because they were insufficiently connected).1
1 Disconnection occurs when a judging plan for data collection is instituted that, because of its deficient structure,
makes it impossible to place all raters, examinees, and items in one frame of reference so that appropriate
comparisons can be drawn (Linacre, 1994). The allocation of raters to items and examinees must result in a network
of links that is complete enough to connect all the raters through common items and common examinees (Lunz,
Wright, & Linacre, 1990). Otherwise, ambiguity in interpretation results. If there are insufficient patterns of non-extreme high ratings and non-extreme low ratings to be able to connect two elements (e.g., two raters, two
examinees, two items), then the two elements will appear in separate subsets of Facets output as “disconnected.”
Only examinees that are in the same subset can be directly compared. Similarly, only raters (or items) that are in the
same subset can be directly compared. Attempts to compare examinees (or raters, or items) that appear in two or
more different subsets can be misleading.
If this were accomplished, one might then have greater confidence in the stability of the estimates
of rater severity both within and across TSE administrations.
In the present study of the revised TSE, we worked with somewhat larger sample sizes
than Marr used. Marr’s November 1992 sample had 74 raters and 1,158 examinees, and her May
1993 sample had 54 raters and 785 examinees. Our February 1997 sample had 66 raters and
1,469 examinees, and our April 1997 sample had 74 raters and 1,446 examinees. Also, while
both of Marr’s data sets had disconnected subsets of raters and examinees in them, there was no
disconnection in our two data sets.
Method
Examinees
Examinees from the February and April 1997 TSE administrations (N = 1,469 and 1,446,
respectively) were generally between 20 and 39 years of age (83%). Fewer
examinees were under 20 (about 7%) or over 40 (about 10%). These percentages were consistent
across the two test dates. About half of the examinees for both administration dates were female
(53% in February and 47% in April). Over half of the examinees (55% in February and 66% in
April) took the TSE for professional purposes (for example, for selection and certification in
health professions, such as medicine, nursing, pharmacy, and veterinary medicine) with the
remaining examinees taking the TSE for academic purposes (primarily for selection for
international teaching assistantships). Table 1 shows the percentage of examinees taking the
examination in various locations around the world. This table reveals that, for both examination
dates, a majority of examinees were from eastern Asia, and most of the remaining examinees
were from Europe.
Table 1
Distribution of TSE Examinees Across Geographic Locations

                       February 1997           April 1997
Location                 N        %              N        %
Eastern Asia            806      56%            870      63%
Europe                  244      17%            197      14%
Africa                  113       8%             52       4%
Middle East             102       7%            128       9%
South America            74       5%             47       3%
North America             57       4%             47       3%
Western Asia              35       2%             37       3%
Instrument
The purpose of the revised TSE is the same as that of the original TSE. It is a test of
general speaking ability designed to evaluate the oral language proficiency of nonnative speakers
of English who are at or beyond the postsecondary level of education (TSE Program Office,
1995). The underlying construct for the revised test is communicative language ability, which is
defined to include strategic competence and four language competencies: linguistic competence,
discourse competence, functional competence, and sociolinguistic competence (see Appendix).
The TSE is a semi-direct speaking test that is administered via audio-recording equipment using
recorded prompts and printed test booklets. Each of the 12 items that appear on the test consists
of a single task that is designed to elicit one of 10 language functions in a particular context or
situation. The test lasts about 20 minutes and is individually administered. Examinees are given
a test booklet and asked to listen to and read general instructions. A tape-recorded narrator
describes the test materials and asks the examinee to perform several tasks in response to these
materials. For example, a task may require the examinee to interpret graphical material, tell a
short story, or provide directions to someone. After hearing the description of the task, the
examinee is encouraged to construct as complete a response as possible in the time allotted. The
examinee’s oral responses are recorded, and each examinee’s test score is based on an evaluation
of the resulting speech sample.
Raters and the Rating Process
All TSE raters are experienced teachers and specialists in the field of English or English
as a second language who teach at the high school or college level. Teachers interested in
becoming TSE raters undergo a thorough training program designed to qualify them to score for
the TSE program. The training program involves becoming familiar with the TSE rating scale
(see Table 2), the TSE communication competencies, and the corresponding band descriptors
(see Appendix). The TSE trainer introduces and discusses a set of written general guidelines that
raters are to follow in scoring the test. For example, these include guidelines for arriving at a
holistic score for each item, guidelines describing what materials the raters should refer to while
scoring, and guidelines explaining the process to be used in listening to a tape. Additionally, the
trainees are introduced to a written set of item-level guidelines to be used in scoring. These
describe in some detail how to handle a number of recurring scoring challenges TSE raters face.
For example, they describe how raters should handle tapes that suffer from mechanical problems,
performances that fluctuate between two bands on the rating scale across all competencies,
incomplete responses to a task, and off-topic responses.
After the trainees have been introduced to all of the guidelines for scoring, they then
practice using the rating scale to score audiotaped performances. Prior to the training, those in
charge of training select benchmark tapes drawn from a previous test administration that show
performance at the various band levels. They prepare written rationales that explain why each
tape exemplifies performance at that particular level. Rater trainees listen to the benchmark tapes
and practice scoring them, checking their scores against the benchmark scores and reading the
scoring rationales to gain a better understanding of how the TSE rating scale functions. At the
end of this qualifying session, each trainee independently rates six TSE tapes. They then present
the scores they assign each tape to the TSE program for evaluation. To qualify as a TSE rater, a
trainee can have only one discrepant score—where the discrepancy is a difference of more than
one bandwidth (that is, 10 points)—among the six rated tapes. If the scores the trainee assigns
meet this requirement, then the trainee is deemed "certified" to score the TSE. The rater can then
be invited to participate in subsequent operational TSE scoring sessions.
At the beginning of each operational TSE scoring session, the raters who have been
invited to participate undergo an initial recalibration training session to refamiliarize them with
the TSE rating scale and to calibrate to the particular test form they will be scoring. The
recalibration training session serves as a means of establishing on-going quality control for the
program.
During a TSE scoring session, examinee audiotapes are identified by number only and are
randomly assigned to raters. Two raters listen to each tape and independently rate it (neither
rater knows the scores the other rater assigned). The raters evaluate an examinee’s performance
on each item using the TSE holistic five-point rating scale; they use the same scale to rate all 12
items appearing on the test. Each point on the scale is defined by a band descriptor that
corresponds to the four language competencies that the test is designed to measure (functional
competence, sociolinguistic competence, discourse competence, and linguistic competence), and
strategic competence. Raters assign a holistic score from one of the five bands for each of the 12
items. As they score, the raters consider all relevant competencies, but they do not assess each
competency separately. Rather, they evaluate the combined impact of all five competencies
when they assign a holistic score for any given item.
To arrive at a final score for an examinee, the 24 scores that the two raters gave are
compared. If the averages of the two raters differ by 10 points or more overall, then a third rater
(usually a very experienced TSE rater) rates the audiotape, unaware of the previously assigned
scores. The final score is derived by resolving the differences among the three sets of scores.
The three sets of scores are compared, and the closest pair is averaged to calculate the final
reported score. The overall score is reported on a scale that ranges from 20 to 60, in increments
of five (20, 25, 30, 35, 40, 45, 50, 55, 60).
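To make the score-resolution logic concrete, the following sketch (in Python) implements the procedure as described above. The function name and data layout are ours, not the TSE program's, and the sketch simply assumes that each rater's 12 item ratings arrive as a list of scores on the 20-60 band scale.

from statistics import mean

def resolve_tse_score(rater1, rater2, third_rater=None):
    """Sketch of the score-resolution procedure described above.

    rater1, rater2: the 12 item ratings (20-60 band scale) from the two
    independent raters; third_rater: the 12 ratings from an adjudicator,
    supplied only when one is needed.
    """
    avg1, avg2 = mean(rater1), mean(rater2)

    # When the two raters' averages differ by 10 points (one bandwidth) or
    # more, a third rating is obtained and the closest pair of averages is
    # used to derive the final score.
    if abs(avg1 - avg2) >= 10:
        if third_rater is None:
            raise ValueError("a third rating is required to resolve the discrepancy")
        avg3 = mean(third_rater)
        pairs = [(avg1, avg2), (avg1, avg3), (avg2, avg3)]
        a, b = min(pairs, key=lambda p: abs(p[0] - p[1]))
        final = (a + b) / 2
    else:
        final = (avg1 + avg2) / 2

    # Reported scores run from 20 to 60 in increments of five.
    return min(60, max(20, 5 * int(final / 5 + 0.5)))

# Two raters in reasonable agreement: no third rating is needed.
print(resolve_tse_score([40] * 12, [40, 40, 50, 40, 40, 40, 50, 40, 40, 40, 40, 40]))  # 40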
Procedure
For this study, we used rating data obtained from the two operational TSE scoring
sessions described earlier. To analyze the data, we employed Facets (Linacre, 1999a), a Rasch-based rating scale analysis computer program.
The Statistical Analyses. Facets is a generalization of the Rasch (1980) family of
measurement models that makes possible the analysis of examinations that have multiple
potential sources of measurement error (such as items, raters, and rating scales).2 Because our
goal was to gain an understanding of the complex rating procedure employed in the TSE setting,
we needed to consider more measurement facets than the traditional two—items and
examinees—taken into account by most measurement models. By employing Facets, we were
able to establish a statistical framework for analyzing TSE rating data. That framework enabled
us to summarize overall rating patterns in terms of main effects for the rater, examinee, and item facets.
2 See McNamara (1996, pp. 283-287) for a user-friendly description of the various models in this family and the types of situations in which each model could be used.
Table 2
TSE Rating Scale
Score
60
Communication almost always effective: task performed very competently
Functions performed clearly and effectively
Appropriate response to audience/situation
Coherent, with effective use of cohesive devices
Use of linguistic features almost always effective; communication not affected by
minor errors
50
Communication generally effective: task performed competently
Functions generally performed clearly and effectively
Generally appropriate response to audience/situation
Coherent, with some effective use of cohesive devices
Use of linguistic features generally effective; communication generally not affected
by errors
40
Communication somewhat effective: task performed somewhat competently
Functions performed somewhat clearly and effectively
Somewhat appropriate response to audience/situation
Somewhat coherent, with some use of cohesive devices
Use of linguistic features somewhat effective; communication sometimes affected
by errors
30
Communication generally not effective: task generally performed poorly
Functions generally performed unclearly and ineffectively
Generally inappropriate response to audience/situation
Generally incoherent, with little use of cohesive devices
Use of linguistic features generally poor; communication often impeded by major
errors
20
No effective communication: no evidence of ability to perform task
No evidence that functions were performed
No evidence of ability to respond appropriately to audience/situation
Incoherent, with no use of cohesive devices
Use of linguistic features poor; communication ineffective due to major errors
Copyright © 1996 by Educational Testing Service, Princeton, NJ. All rights reserved. No reproduction in whole or
in part is permitted without express written permission of the copyright owner.
Additionally, we were able to quantify the weight of evidence associated with each of
these facets and highlight individual rating patterns and rater-item combinations that were
unusual in light of expected patterns.
In the many-facet Rasch model (Linacre, 1989), each element of each facet of the testing
situation (that is, each examinee, rater, item, rating scale category, etc.) is represented by one
parameter that represents proficiency (for examinees), severity (for raters), difficulty (for items),
or challenge (for rating scale categories). The Partial Credit form of the many-facet Rasch model
that we used for this study was:
log ( Pnjik / Pnjik-1 ) = Bn - Cj - Di - Fik                                        (1)

where
Pnjik   = the probability of examinee n being awarded a rating of k when rated by rater j on item i
Pnjik-1 = the probability of examinee n being awarded a rating of k-1 when rated by rater j on item i
Bn      = the proficiency of examinee n
Cj      = the severity of rater j
Di      = the difficulty of item i
Fik     = the difficulty of achieving a score within a particular score category (k), averaged across all raters for each item separately
When we conducted our analyses, we separated out the contribution of each facet we
included and examined it independently of other facets so that we could better understand how
the various facets operate in this complex rating procedure. For each element of each facet in
this analysis, the computer program provides a measure (a logit estimate of the calibration), a
standard error (information about the precision of that logit estimate), and fit statistics
(information about how well the data fit the expectations of the measurement model).
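As an illustration of how equation (1) translates into rating probabilities, the short sketch below computes the category probabilities for a single examinee-rater-item combination under the Partial Credit form of the model. The function and all parameter values are ours and purely illustrative; the operational estimates come from the Facets program itself.

import math

def category_probabilities(B_n, C_j, D_i, F_ik):
    """Rating-category probabilities implied by equation (1).

    B_n: examinee proficiency; C_j: rater severity; D_i: item difficulty;
    F_ik: step difficulties for categories k = 1..K (all in logits).
    The lowest category (the 20 band here) serves as the reference.
    """
    # Accumulating the log-odds steps gives each category's (unnormalized)
    # log-probability relative to the lowest category.
    log_num = [0.0]
    for F_k in F_ik:
        log_num.append(log_num[-1] + (B_n - C_j - D_i - F_k))
    total = sum(math.exp(v) for v in log_num)
    return [math.exp(v) / total for v in log_num]

# Illustrative (made-up) parameters: an able examinee rated by an average
# rater on an average item.
probs = category_probabilities(B_n=2.5, C_j=0.0, D_i=0.0,
                               F_ik=[-5.0, -2.5, 1.0, 6.0])
for band, p in zip([20, 30, 40, 50, 60], probs):
    print(f"P(rating = {band}) = {p:.3f}")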
Results
We have structured our discussion of research findings around the specific questions we
explored with the Facets output. But before we turn to the individual questions, we provide a
brief introduction to the process of interpreting Facets output. In particular, we focus on the map
that is perhaps the single most important and informative piece of output from the computer
program, because it enables us to view all the facets of our analysis at one time.
The maps shown as Figures 1 and 2 display all facets of the analysis in one figure for
each TSE administration and summarize key information about each facet. The maps highlight
results from more detailed sections of the Facets output for examinees, TSE items, raters, and the
TSE rating scale. (For the remainder of this discussion, we will refer only to Figure 1. Figure 2
tells much the same story. The interested reader can apply the same principles described below
when interpreting Figure 2.)
The Facets program calibrates the raters, examinees, TSE items, and rating scales so that
all facets are positioned on the same scale, creating a single frame of reference for interpreting
the results from the analysis. That scale is in log-odds units, or “logits,” which, under the model,
constitute an equal-interval scale with respect to appropriately transformed probabilities of
responding in particular rating scale categories. The first column in the map displays the logit
scale. Having a single frame of reference for all the facets of the rating process facilitates
comparisons within and between the facets.
The second column displays the scale that the TSE program uses to report scores to
examinees. The TSE program averages the 24 ratings that the two raters assign to each
examinee, and a single score of 20 to 60, rounded to the nearest 5 (thus, possible scores include
20, 25, 30, 35, 40, 45, 50, 55, and 60), is reported.
The third column displays estimates of examinee proficiency on the TSE assessment—
single-number summaries on the logit scale of each examinee’s tendency to receive low or high
ratings across raters and items. We refer to these as "examinee proficiency measures." Higher
scoring examinees appear at the top of the column, while lower scoring examinees appear at the
bottom of the column. Each star represents 12 examinees, and a dot represents fewer than 12
examinees. These measures appear as a fairly symmetrical platykurtic distribution, resembling a
bell-shaped normal curve—although this result was in no way preordained by the model or the
estimation procedure. Skewed and multi-modal distributions have appeared in other model
applications.
The fourth column compares the TSE raters in terms of the level of severity or leniency
each exercised when rating oral responses to the 12 TSE items. Because more than one rater
rated each examinee’s responses, raters’ tendencies to rate responses higher or lower on average
could be estimated. We refer to these as “rater severity measures.” In this column, each star
represents 2 raters. More severe raters appear higher in the column, while more lenient raters
appear lower. When we examine Figure 1, we see that the harshest rater had a severity measure
of about 1.5 logits, while the most lenient rater had a severity measure of about -2.0 logits.
The fifth column compares the 12 items that appeared on the February 1997 TSE in
terms of their relative difficulties. Items appearing higher in the column were more difficult for
examinees to receive high ratings on than items appearing lower in the column. Items 7, 10, and
11 were the most difficult for examinees, while items 4 and 12 proved easiest.
Columns 6 through 17 display the five-point TSE rating scale as raters used it to score
examinee responses to each of the 12 items. The horizontal lines across each column indicate the
point at which the likelihood of getting the next higher rating begins to exceed the likelihood of
getting the next lower rating for a given item. For example, when we examine Figure 1, we see
that examinees with proficiency measures from about -5.5 logits up through about -3.5 logits are
more likely to receive a rating of 30 than any other rating on item 1; examinees with proficiency
measures between about -3.5 logits and about 2.0 logits are most likely to receive a rating of 40
on item 1; and so on.
The bottom rows of Figure 1 provide the mean and standard deviation of the distribution
of estimates for examinees, raters, and items. When conducting a Facets analysis involving
these three facets, it is customary to center the rater and item facets, but not the examinee facet.
By centering facets, one establishes the origin of the scale. As Linacre (1994) cautions, "in most
analyses, if more than one facet is non-centered in an analysis, then the frame of reference is not
sufficiently constrained, and ambiguity results" (p. 27).
Examinees
How much variability is there across examinees in their levels of proficiency? Who differs
more—examinees in their levels of proficiency or raters in their levels of severity?
Looking at Figures 1 and 2, we see that the distribution of rater severity measures is much
narrower than the distribution of examinee proficiency measures. In Figure 1, examinee
proficiency measures show an 18.34-logit spread, while rater severity measures show only a
3.55-logit spread. The range of examinee proficiency measures is about 5.2 times as wide as the
range of rater severity measures. Similarly, in Figure 2 the rater severity measures range from
-1.89 logits to 1.30 logits, a 3.19-logit spread, while the examinee proficiency measures range
from -5.43 logits to 11.69 logits, a 17.12-logit spread. Here, the range of examinee proficiency
measures is about 5.4 times as wide as the range of rater severity measures. A more typical
finding of studies of rater behavior is that the range of examinee proficiency is about twice as
wide as the range of rater severity (J. M. Linacre, personal communication, March 13, 1995).
The finding that the range of TSE examinee proficiency measures is about five times as
wide as the range of TSE rater severity is an important one, because it suggests that the impact of
individual differences in rater severity on examinee scores is likely to be relatively small. By
contrast, suppose that the range of examinee proficiency measures had been twice as wide as the
range of rater severity. In this instance, the impact of individual differences in rater severity on
examinee scores would be much greater. The particular raters who rated individual examinees
would matter more, and a more compelling case could be made for the need to adjust examinee
scores for individual differences in rater severity in order to minimize these biasing effects.
To what extent has the test succeeded in separating examinees into distinct strata of
proficiency? How many statistically different levels of proficiency are identified by the test?
Facets reports an examinee separation ratio (G) which is a ratio scale index comparing
the "true" spread of examinee proficiency measures to their measurement error (Fisher, 1992).
To be useful, a test must be able to separate examinees by their performance (Stone & Wright,
1988). One can determine the number of statistically distinct proficiency strata into which the
test has succeeded in separating examinees (in other words, how well the test separates the
examinees in a particular sample) by using the formula (4G + 1)/3. When we apply this formula,
we see that the samples of examinees that took the TSE in either February 1997 or April 1997
could each be separated into eight statistically distinct levels of proficiency.
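A small sketch of this computation follows; the helper names are ours, and the numbers are illustrative rather than the actual Facets output. The separation ratio G itself is read from that output, where it is formed as the error-corrected spread of the examinee measures divided by their root mean-square error.

def separation_ratio(true_sd, rmse):
    """Separation ratio G: the 'true' (error-corrected) standard deviation of
    the examinee proficiency measures divided by their root mean-square error."""
    return true_sd / rmse

def separation_strata(G):
    """Number of statistically distinct proficiency strata, (4G + 1) / 3."""
    return (4 * G + 1) / 3

# Illustrative values only (the exact figures come from the Facets output):
# a separation ratio of about 6 corresponds to roughly eight distinct strata.
G = separation_ratio(true_sd=2.7, rmse=0.45)   # G = 6.0
print(round(separation_strata(G), 1))          # 8.3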
Figure 1
Map from the Facets Analysis of the Data from the February 1997 TSE Administration

[The map places all facets on the common logit scale. Columns: the logit scale; the TSE reported-score scale (20-60); the examinee proficiency distribution (each star represents 12 examinees); the rater severity distribution (each star represents 2 raters); the item difficulty column (items 7, 10, and 11 highest; items 1, 2, 3, 5, 6, 8, and 9 in the middle; items 4 and 12 lowest); and, in the remaining columns, the most probable rating-scale category for each of the 12 items at each proficiency level. Facet means (S.D.): examinees 2.48 (2.88), raters .00 (.72), items .00 (.30).]
Figure 2
Map from the Facets Analysis of the Data from the April 1997 TSE Administration

[Same layout as Figure 1, for the April 1997 data. Item difficulty column: items 7 and 11 highest; items 1, 2, 3, 5, 6, 8, 9, and 10 in the middle; items 4 and 12 lowest. Facet means (S.D.): examinees 2.24 (2.91), raters .00 (.66), items .00 (.30).]
Are the differences between examinee proficiency measures mainly due to measurement error or
to differences in actual proficiency?
Facets also reports the reliability with which the test separates the sample of examinees
—that is, the proportion of observed sample variance which is attributable to individual
differences between examinees (Wright & Masters, 1982). The examinee separation reliability
coefficient represents the ratio of variance attributable to the construct being measured (true
score variance) to the observed variance (true score variance plus the error variance). Unlike
interrater reliability, which is a measure of how similar rater measures are, the separation
reliability is a measure of how different the examinee proficiency measures are (Linacre, 1994).
For the February and April TSE data, the examinee separation reliability coefficients were both
0.98, indicating that the true variance far exceeded the error variance in the examinee proficiency
measures.3
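In symbols (the notation here is ours, following Wright & Masters, 1982), the separation reliability R and the separation ratio G used above are linked by

R = (true variance) / (true variance + error variance) = G^2 / (1 + G^2),   so that   G = sqrt( R / (1 - R) ).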
How accurately are examinees measured? How much confidence can we have in the precision
and stability of the measures of examinee proficiency?
Facets reports an overall measure of the precision and stability of the examinee
proficiency measures that is analogous to the standard error of measurement in classical test
theory. The standard error of measurement depicts the extent to which we might expect an
examinee’s proficiency estimate to change if different raters or items were used to estimate that
examinee’s proficiency. The average standard error of measurement for the examinees that took
the February 1997 TSE was 0.44; the average standard error for examinees that took the April
1997 TSE was 0.45.
Unlike the standard error of measurement in classical test theory, which estimates a single
measure of precision and stability for all examinees, Facets provides a separate, unique estimate
for each examinee. For illustrative purposes, we focused on the precision and stability of the
"average" examinee. That is, we determined 95% confidence intervals for examinees with
proficiency measures near the mean of the examinee proficiency distribution for the February
and April data. For the February data, the mean examinee proficiency measure was 2.47 logits,
and the standard error for that measure was 0.41. Therefore, we would expect the average
examinee’s true proficiency measure to lie between raw scores of 49.06 and 52.83 on the TSE
scale 95% of the time. For the April data, the mean examinee proficiency measure was 2.24
logits, and the standard error of that measure was 0.40. Therefore, we would expect the average
examinee’s true proficiency measure to lie between raw scores of 48.63 and 52.07 on the TSE
scale 95% of the time. To summarize, we would expect an average examinee’s true proficiency
to lie within about two raw score points of his or her reported score most of the time.
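A minimal sketch of this interval computation follows (the logit arithmetic only; the function name is ours). Mapping a logit interval onto the 20-60 reporting scale requires the score-to-measure conversion from the Facets output, which is not reproduced here; applying that conversion to these bounds is what yields the 49.06 to 52.83 raw-score interval quoted above.

def confidence_interval(measure, se, z=1.96):
    """Approximate 95% confidence interval for a proficiency measure, in logits."""
    return measure - z * se, measure + z * se

# February 1997 average examinee: measure 2.47 logits, standard error 0.41.
low, high = confidence_interval(2.47, 0.41)
print(f"{low:.2f} to {high:.2f} logits")   # about 1.67 to 3.27 logits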
It is important to note, however, that the size of the standard error of measurement varies
across the proficiency distribution, particularly at the tails of the distribution. In this study,
examinees at the upper end of the proficiency distribution tended to have larger standard errors
on average than examinees in the center of the distribution.
3 According to Fisher (1992), a separation reliability less than 0.5 would indicate that the differences between examinee proficiency measures were mainly due to measurement error and not to differences in actual proficiency.
For example, examinees taking the TSE in February who had proficiency measures in the range of 5.43 logits to 10.30 logits (that is,
they would have received reported scores in the range of 55 to 60 on the TSE scale) had standard
errors for their measures that ranged from 0.40 to 1.03. By contrast, examinees at the lower end
of the proficiency distribution tended to have smaller standard errors on average than examinees
in the center of the distribution. For example, examinees taking the TSE in February who had
proficiency measures in the range of -3.48 logits to -6.68 logits (that is, they would have received
reported scores in the range of 20 to 30 on the TSE scale) had standard errors for their measures
that ranged from 0.36 to 0.38. Thus, for institutions setting their own cutscores on the TSE, it
would be important to take into consideration the standard errors for individual examinee
proficiency measures, particularly for those examinees whose scores lie in critical decision-making regions of the score distribution, and not to assume that the standard error of
measurement is constant across that distribution.
Do some examinees exhibit unusual profiles of ratings across the 12 TSE items? Does the
current procedure for identifying and resolving discrepancies successfully identify all cases in
which rater agreement is "out of statistical control" (Deming, 1975)?
As explained earlier, when the averages of two raters’ scores for a single examinee differ
by more than 10 points, a third rater (usually a very experienced TSE rater) rates the audiotape, unaware of the
scores previously assigned. The three sets of scores are compared, and the closest pair is used to
calculate the final reported score (TSE Program Office, 1995). We used Facets to determine
whether this third-rating adjudication procedure is successful in identifying problematic ratings.
Facets produces two indices of the consistency of agreement across raters for each
examinee. The indices are reported as fit statistics—weighted and unweighted, standardized and
unstandardized. In this report, we make several uses of these indices. First, we discuss the
unstandardized, information-weighted mean-square index, or infit, and explain how one can use
that index to identify examinees who exhibit unusual profiles of ratings across the 12 TSE items.
We then examine some examples of score patterns that exhibit misfit to show how one can
diagnose the nature of misfit. Finally, we compare decisions that would be made about the
validity of examinee scores based on the standardized infit index to the decisions that would be
made about the validity of examinee scores based on the current TSE procedure for identifying
discrepantly rated examinees.
First, however, we briefly describe how the unstandardized infit mean-square index is
interpreted. The expectation for this index is 1; the range is 0 to infinity. The higher the infit
mean-square index, the more variability we can expect in the examinee’s rating pattern, even
when rater severity is taken into account. When raters are fairly similar in the degree of severity
they exercise, an infit mean-square index less than 1 indicates little variation in the examinee’s
pattern of ratings (a "flat-line" profile consisting of very similar or identical ratings across the 12
TSE items from the two raters), while an infit mean-square index greater than 1 indicates more
than typical variation in the ratings (that is, a set of ratings with one or more unexpected or
surprising ratings, aberrant ratings that don’t seem to "fit" with the others). Generally, infit
mean-square indices greater than 1 are more problematic than infit indices less than 1. There are
no hard-and-fast rules for setting upper- and lower-control limits for the examinee infit mean-square index. Some testing programs use an upper-control limit of 2 or 3 and a lower-control limit of .5; more stringent limits might be instituted if the goal were to significantly reduce variability within the system. The more extreme the infit mean-square index, the
greater potential gains for improving the system—either locally, by rectifying an aberrant rating,
or globally, by gaining insights to improve training, rating, or logistic procedures.
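To make the index concrete, the sketch below shows a minimal computation of an information-weighted (infit) mean-square for a single examinee. It is an illustration only, not the Facets implementation: the observed ratings, the model-expected ratings, and the model variances are hypothetical and would in practice come from the fitted many-facet Rasch model.

    import numpy as np

    # Hypothetical values for one examinee across 12 items (one rater; a second
    # rater's ratings would simply add 12 more entries to each array).
    observed = np.array([40, 40, 30, 40, 50, 40, 40, 30, 40, 40, 50, 40], dtype=float)
    expected = np.array([41, 39, 38, 40, 44, 40, 42, 39, 40, 41, 43, 40], dtype=float)
    variance = np.full(12, 45.0)   # model variance of each rating (hypothetical)

    residual = observed - expected
    # Infit: squared residuals weighted by their model variance (information weighting).
    infit_mean_square = np.sum(residual ** 2) / np.sum(variance)
    print(round(float(infit_mean_square), 2))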
For this study, we adopted an upper-control limit for the examinee infit mean-square
index of 3.0, a liberal control limit to accommodate some variability in each examinee’s rating
pattern, and a lower-control limit of 0.5. We wished to allow for a certain amount of variation in
raters’ perspectives, yet still catch cases in which rater disagreement was problematic. An infit
mean-square index beyond the upper-control limit signals an examinee performance that might
need another listening before the final score report is issued, particularly if the examinee’s score
is near a critical decision-making point in the score distribution. Table 3 summarizes examinee
infit information from the February and April 1997 TSE test administrations.
As Table 3 shows, of the 1,469 examinees tested in February, 165 (about 11%) had infit
mean-square indices less than 0.5. Similarly, of the 1,446 examinees tested in April, 163 (again,
about 11%) had infit mean-square indices less than 0.5. These findings suggest that about 1 in
10 examinees in these TSE administrations may have received very similar or identical ratings
across all 12 TSE items.
Further, in the February administration, 15 examinees (about 1%) had infit mean-square
indices equal to or greater than 3.0. Similarly, in the April administration, 15 (1%) had infit
mean-square indices equal to or greater than 3.0. Why did these particular examinees misfit? In
Table 4 we examine the rating patterns associated with some representative misfitting cases.
Table 3
Misfitting and Overfitting Examinees from the February and April 1997 TSE Administrations

Infit Mean-Square Index    February 1997: Number (Percent)    April 1997: Number (Percent)
< 0.5                      165 (11.2%)                        163 (11.3%)
3.0 to 4.0                 11 (0.7%)                          11 (0.7%)
4.1 to 5.0                 2 (0.1%)                           2 (0.1%)
5.1 to 6.0                 2 (0.1%)                           0 (0.0%)
6.1 to 7.0                 0 (0.0%)                           1 (0.1%)
7.1 to 8.0                 0 (0.0%)                           1 (0.1%)
Table 4
Rating Patterns and Fit Indices for Selected Examinees

Ratings Received by Examinee #110
(Infit Mean-Square Index = 0.1; Proficiency Measure = .13, Standard Error = .53)
                              Item:   1   2   3   4   5   6   7   8   9  10  11  12
Rater #74 (Severity = -.39)          40  40  40  40  40  40  40  40  40  40  40  40
Rater #69 (Severity = .95)           40  40  40  40  40  40  40  40  40  40  40  40

Ratings Received by Examinee #865
(Infit Mean-Square Index = 1.0; Proficiency Measure = -.39, Standard Error = .47)
                              Item:   1   2   3   4   5   6   7   8   9  10  11  12
Rater #30 (Severity = -.53)          40  40  40  40  40  40  40  50  40  40  40  50
Rater #59 (Severity = 1.19)          40  30  30  40  30  40  40  40  40  30  40  40

Ratings Received by Examinee #803
(Infit Mean-Square Index = 3.1; Proficiency Measure = 2.55, Standard Error = .42)
                              Item:   1   2   3   4   5   6   7   8   9  10  11  12
Rater #31 (Severity = -.79)          50  50  50  60  50  60  30  60  60  50  60  50
Rater #42 (Severity = -.61)          40  40  50  50  40  50  30  40  50  40  50  40

Ratings Received by Examinee #1060
(Infit Mean-Square Index = 6.6; Proficiency Measure = 2.73, Standard Error = .41)
                              Item:   1   2   3   4   5   6   7   8   9  10  11  12
Rater #18 (Severity = -.53)          40  40  30  40  30  40  40  30  30  40  40  30
Rater #36 (Severity = .29)           50  60  60  60  60  60  60  60  60  60  50  50
As Table 4 shows, the ratings of Examinee #110 exhibit a flat-line pattern: straight 40s
from both raters. Facets flags such rating patterns for further examination because they display
so little variation. This deterministic pattern does not fit the expectations of the model; the
model expects that for each examinee there will be at least some variation in the ratings across
items. Examinees who receive very similar or nearly identical ratings across all 12 items will
show fit indices less than 0.5; upon closer inspection, their rating patterns will frequently reveal
this flat-line nature. In some cases, one might question whether the raters who score such an
examinee’s responses rated each item independently or, perhaps, whether a halo effect may have
been operating.
The ratings of Examinee #865 (infit mean-square index = 1.0) shown in Table 4 are
fairly typical. There is some variation in the ratings: mostly 40s with an occasional 30 from
Rater #59, who tends to be one of the more severe raters (rater severity measure = 1.19), and
mostly 40s with an occasional 50 from Rater #30, who tends to be one of the more lenient raters
(rater severity measure = -.53).
Table 4 shows that, for Examinee #803, the two ratings for Item 7 were misfitting. Both
raters gave this examinee unexpectedly low ratings of 30 on this item, while the examinee’s
ratings on all other items are higher, ranging from 40 to 60. When we examine the Facets table
of misfitting ratings, we find that the model expected these raters to give 40s on Item 7, not the 30s they actually gave.
These two raters tended to rate leniently overall (rater severity measures = -.61 and -.79), so their
unexpectedly low ratings of 30 are somewhat surprising in light of the other ratings they gave the
examinee.
The rating pattern for Examinee #1060, shown in Table 4, displays a higher level of
misfit (infit mean-square index = 6.6). In this case, the examinee received 30s and 40s from a
somewhat lenient rater (rater severity measure = -.53) and 50s and 60s from a somewhat harsh
rater (rater severity measure = .29). The five 30 ratings are quite unexpected, especially from a
rater who has a tendency to give higher ratings on average. These ratings are all the more
unexpected in light of the high ratings (50s and 60s) given this examinee by a rater who tends to
rate on average more severely. In isolated cases like this one, TSE program personnel might
want to review the examinee’s performance before issuing a final score report. The scores for
Examinee #1060 were flagged by the TSE discrepancy criteria as being suspect because the
averages of the two raters’ sets of scores were 35.83 and 57.50—a difference of more than 10
points; third ratings were used to resolve the differences between the scores.
We also analyzed examinee fit by comparing the standardized infit indices to the
discrepancy resolution criteria used by the TSE program. The standardized infit index is a
transformation of the unstandardized infit index, scaled as a z score (that is, a mean of 0 and a
standard deviation of 1). Under the assumption of normality, the distribution of standardized
infit indices can be viewed as indicating the probability of observing a specific pattern of
discrepant ratings when all ratings are indeed nondiscrepant. That is, standardized infit indices
can be used to indicate which of the observed rating patterns are most likely to include surprising
or unexpected ratings. For our analyses, we adopted an upper-control limit for the examinee
standardized infit of 3.0, which implies that a rating pattern exceeding this value would be expected to occur only about 0.13% of the time if its ratings were in fact nonaberrant.
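The 0.13% figure is simply the upper-tail area of the standard normal distribution beyond z = 3.0, which can be checked directly:

    import math

    # P(Z > 3.0) for a standard normal deviate, via the complementary error function.
    tail = 0.5 * math.erfc(3.0 / math.sqrt(2.0))
    print(tail)   # approximately 0.00135, i.e., about 0.13%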
Thus, we identified examinees with one or more aberrant ratings using the standardized
infit index, which takes into account the level of severity each rater exercised when rating
examinees. We also identified examinees having discrepant ratings according to TSE resolution
criteria, which take into account only the magnitude of the differences between the scores assigned by two raters and do not consider the levels of severity exercised by those raters.
Tables 5 and 6 summarize this information for the February and April test administrations by
providing the number and percentage of examinees whose rating patterns were identified as
suspect under both sets of criteria.
Based on the TSE discrepancy criteria, about 4% of the examinees were identified as
having aberrant ratings in each administration. Based on the standardized infit index, about 3%
of the examinees’ ratings were aberrant. What is more interesting, however, is the fact that the
two methods identified only a small number of examinees in common. Only 1% of the
examinees were identified as having suspect rating patterns according to both criteria, as shown
in the lower right cell of each table.
About 3% of the examinees in each data set were identified as having suspect rating
patterns by the TSE discrepancy criteria but were not identified as being suspect by the infit
index (as shown in the shaded upper right cell of each table). These cases are most likely ones in
which an examinee was rated by one severe rater and one lenient rater. Such ratings would not
be unusual in light of each rater’s overall level of severity; each rater would have been using the
rating scale in a manner that was consistent with his or her use of the scale when rating other
examinees of similar proficiency. The apparent discrepancies between raters could have been resolved by providing a model-based score that takes into account the levels of severity of the two raters when calculating an examinee's final score. If Facets were used to calculate examinee scores, then there would be no need to bring these types of cases to the attention of the more experienced TSE raters for adjudication. Facets would have automatically adjusted these scores for differences in rater severity, and thus would not have identified these examinees as misfitting and in need of a third rater's time and energy.

The shaded lower left cells in Tables 5 and 6 indicate that about 2% of the examinees were identified as having suspect rating patterns according to the infit criteria, but were not identified based on the TSE discrepancy criteria. These cases are most likely situations in which raters made seemingly random rating errors, or examinees performed differentially across items in a way that differs from how other examinees performed on these same items. These cases would seem to be the ones most in need of reexamination and reevaluation by the experienced TSE raters, rather than the cases identified in the shaded upper right cell of each table.

Table 5
Examinees from the February 1997 TSE Administration Identified as Having Suspect Rating Patterns

                                 TSE Criteria
Infit Criteria       Nondiscrepant      Discrepant      Total
Nondiscrepant        1,380 (94%)        46 (3%)         1,426 (97%)
Discrepant           29 (2%)            14 (1%)         43 (3%)
Total                1,409 (96%)        60 (4%)         1,469 (100%)

Table 6
Examinees from the April 1997 TSE Administration Identified as Having Suspect Rating Patterns

                                 TSE Criteria
Infit Criteria       Nondiscrepant      Discrepant      Total
Nondiscrepant        1,359 (94%)        47 (3%)         1,406 (97%)
Discrepant           26 (2%)            14 (1%)         40 (3%)
Total                1,385 (96%)        61 (4%)         1,446 (100%)
Items
Is it harder for examinees to get high ratings on some TSE items than others? To what extent do
the 12 TSE items differ in difficulty?
To answer these questions, we can examine the item difficulty measures shown in Table
7 and Table 8. These tables order the 12 TSE items from each test according to their relative
difficulties. More difficult items (that is, those that were harder for examinees to get high ratings
on) appear at the top of each table, while easier items appear at the bottom. For the February
administration, the item difficulty measures range from -.52 logits for Item 12 to .46 logits for Item 7, about a 1-logit spread. The range of difficulty measures for the items that appeared on the April test is nearly the same: items range in difficulty from -.64 logits for Item 4 to .45 logits for Item 11, about a 1-logit spread.

Table 7
Item Measurement Report for the February 1997 TSE Administration

Item       Difficulty Measure (in logits)    Standard Error    Infit Mean-Square Index
Item 7      .46                              .04               1.0
Item 11     .43                              .04               1.0
Item 10     .25                              .04               1.0
Item 3      .16                              .04               1.0
Item 9      .11                              .04               0.9
Item 8      .10                              .04               0.9
Item 2     -.01                              .04               1.1
Item 5     -.11                              .04               1.0
Item 1     -.16                              .04               1.2
Item 6     -.23                              .04               0.9
Item 4     -.48                              .04               0.9
Item 12    -.52                              .04               0.9

Table 8
Item Measurement Report for the April 1997 TSE Administration

Item       Difficulty Measure (in logits)    Standard Error    Infit Mean-Square Index
Item 11     .45                              .04               0.9
Item 7      .42                              .04               1.0
Item 6      .25                              .04               0.9
Item 2      .16                              .04               1.1
Item 3      .06                              .04               1.0
Item 9     -.01                              .04               0.9
Item 5     -.03                              .04               1.0
Item 10    -.09                              .04               1.0
Item 1     -.10                              .04               1.3
Item 8     -.10                              .04               0.9
Item 12    -.38                              .04               0.8
Item 4     -.64                              .04               0.9
The spread of the item difficulty measures is very narrow compared to the spread of
examinee proficiency measures. If the items are not sufficiently spread out along a continuum,
then that suggests that those designing the assessment have not succeeded in defining distinct
levels along the variable they are intending to measure (Wright & Masters, 1982). As shown
earlier in Figures 1 and 2, the TSE items tend to cluster in a very narrow band. For both the
February and April administrations, Items 4 and 12 were slightly easier for examinees to get high
ratings on, while Items 7 and 11 were slightly more difficult. However, in general, the items
differ relatively little in difficulty and thus do not convincingly define a line of increasing
intensity. To improve the measurement capabilities of the TSE, test developers may want to
consider introducing into the instrument some items that are easier than those that currently
appear, as well as some items that are substantially more difficult than the present items. If such
items could be designed, then they would help to define a more recognizable and meaningful
variable and would allow for placement of examinees along the variable defined by the test
items.
Can we calibrate ratings from all 12 TSE items, or do ratings on certain items frequently fail to
correspond to ratings on other items (that is, are there certain items that do not "fit" with the
others)? Can a single summary measure capture the essence of examinee performance across
the items, or is there evidence of possible psychometric multidimensionality (Henning, 1992;
McNamara, 1991; McNamara, 1996) in the data and, perhaps, a need to report a profile of
scores rather than a single score summary for each examinee if it appears that systematic kinds
of profile differences are appearing among examinees who have the same overall summary
score?
We used Facets to compute indices of fit for each of the 12 items included in the
February and April tests, and these are also shown in Tables 7 and 8. The infit mean-square
indices range from 0.9 to 1.2 for the February test, and from 0.8 to 1.3 for the April test. All the
indices are within even tight quality control limits of 0.7 to 1.3. The fact that all the infit mean-square indices are greater than 0.7 suggests that none of the items function in a redundant
fashion. Because none of the infit mean-square indices is greater than 1.3, there is little evidence
of psychometric multidimensionality in either data set. The items on each test appear to work
together; ratings on one item correspond well to ratings on other items. That is, a single pattern
of proficiency emerges for these examinees across all items on the TSE. Therefore, ratings on
the individual items can be meaningfully combined; a single summary measure can appropriately
capture the essence of examinee performance across the 12 items.
Are all 12 TSE items equally discriminating? Is item discrimination a constant for all 12 items,
or does it vary across items?
Several versions of the many-facet Rasch model can be specified within Facets for the
TSE data, and each model portrays a specific set of assumptions regarding the nature of the TSE
ratings. The version of the model we used (the Partial Credit Model shown earlier in Equation 1)
assumes that a rating scale with a unique category structure is employed in the rating of each
item. That is, a score of 20 on Item 1 may be more or less difficult to obtain relative to a score of
30 than is a score of 20 on Item 2. The Fik term in Equation 1 denotes that the rating scale for
each item is to be modeled to have its own category structure. A simpler model, the Rating Scale
Model, shown in Equation 2 below, assumes that the rating scales for the items share a common
category structure; in other words, a score of 20 has uniform difficulty relative to a score of 30,
regardless of the item. The Fk term in Equation 2 denotes that all 12 rating scales are to be
modeled assuming a common category structure operates across items:
log(Pnijk / Pnij(k-1)) = Bn - Cj - Di - Fk    (2)
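To make the contrast concrete, the sketch below computes the category probabilities implied by the Rating Scale Model of Equation 2 for one examinee-rater-item combination; replacing the shared thresholds Fk with item-specific thresholds Fik would yield the Partial Credit Model of Equation 1. This is an illustration only, not the Facets estimation routine, and every parameter value shown is hypothetical.

    import numpy as np

    def category_probabilities(b, c, d, thresholds):
        """Category probabilities under the Rating Scale Model (Equation 2).

        b: examinee proficiency, c: rater severity, d: item difficulty, and
        thresholds: Fk for the transitions into categories 30, 40, 50, and 60.
        """
        # Cumulative sums of the log-odds terms give unnormalized log-probabilities;
        # the lowest category (20) serves as the reference with log-probability 0.
        log_p = np.concatenate(([0.0], np.cumsum(b - c - d - np.asarray(thresholds))))
        p = np.exp(log_p - log_p.max())
        return p / p.sum()

    # Hypothetical parameters (in logits); the thresholds are loosely in the spirit
    # of the category calibrations reported in Tables 9 and 10.
    probs = category_probabilities(b=0.5, c=0.0, d=0.1, thresholds=[-5.7, -2.8, 2.2, 6.4])
    print(dict(zip([20, 30, 40, 50, 60], np.round(probs, 3).tolist())))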
One of the differences between the Partial Credit Model and the Rating Scale Model
relates to item discrimination (the extent to which an item differentiates between high and low
proficiency examinees). Items that have good discrimination differentiate very well between
high and low proficiency examinees. As a result, each examinee who responds to an item has a
high probability of being assigned to a single rating scale category and a very low probability of
being assigned to any of the remaining categories. That is, there is one rating category to which
each examinee is clearly most likely to be assigned. Items that are poor discriminators, on the
other hand, do not differentiate between high and low proficiency examinees. With poorly
discriminating items, the probabilities associated with each rating scale category are nearly equal,
so that high and low proficiency examinees are likely to be assigned the same rating. Differences
in item discrimination result in higher and lower variances in the category probability curves for
items with low and high levels of discrimination, respectively.
To investigate whether the 12 TSE items are equally discriminating, we compared the
rating scale category structures from the Partial Credit Model, because that analysis allowed the
rating scale category structure to vary from one item to another. We examined the rating scale
category calibrations for each item. A rating scale category calibration is the point on the
examinee proficiency scale at which the probability curves for adjacent categories intersect. It
marks the point in the proficiency distribution where the probability of an examinee’s getting a
rating in the next higher scale category begins to exceed the probability of the examinee’s getting
a rating in the next lower scale category.4 Tables 9 and 10 show the rating scale category
calibrations for each of the 12 TSE items from the February and April tests, as well as the means
and standard deviations associated with each rating scale category calibration. For example,
Table 9 shows that for Item 1, examinees with proficiency measures in the range of -4.96 logits to -3.79 logits have a higher probability of receiving a 30 on Item 1 than any other rating, while examinees with proficiency measures in the range of -3.80 logits to 1.89 logits have a higher probability of receiving a 40 on the item than any other rating.

4 Different researchers in the Rasch community use different terms for this critical notion of a transition point between adjacent categories. Linacre, Masters, and Wright speak of "step" calibrations, a concept that would seem to imply movement, of passing through or moving up from a lower category to arrive at the next higher category on a scale. Others have taken issue with this notion of implied movement, arguing that when we analyze a data set, we are modeling persons at a specific point in time, not how each person arrived at their particular location on the proficiency continuum (Andrich, 1998). Andrich prefers the term "threshold," which he defines as the transition point at which the probability is 50% of an examinee being rated in one of two adjacent categories, given that the examinee is in one of those two categories. Linacre (1998) has also used the term "step thresholds" to denote these categorical boundary points. We have chosen to use the term "category calibration," because we are not comfortable with the connotation of movement implied in the use of the term "step calibration." However, the reader should be aware that the values we report in Tables 9 and 10 are what Facets reports as "step calibrations."
Tables 9 and 10 reveal that the rating scale category calibrations are fairly consistent
across items—the differences between the mean category calibrations are considerably larger
than their standard deviations. However, the rating scale category calibrations are not equal
across the 12 TSE items, particularly for the two lowest categories.
One way to examine the differences in rating scale category calibrations is to compare the
category probability curves of two or more items. These plots portray the probability that an
examinee with proficiency B will receive a rating in each of the rating scale categories for that
item. Figure 3, for example, shows the expected distribution of probabilities for Item 4 (the solid
lines) and Item 8 (the dashed lines) from the February test—two items with the most dissimilar
rating scale category calibrations for this test. This plot reinforces what is shown in Table 9—the
transitions from the 40 to the 50 categories and from the 50 to the 60 categories are fairly similar
between the two items, while the transitions from the 20 to the 30 categories and from the 30 to
the 40 categories are less similar. These findings suggest that more proficient TSE examinees
(those who would typically receive scores of 50 or 60 on each item) tend to perform equally well
across the 12 items on the February test, while less proficient TSE examinees (those who would
typically receive scores of 20 or 30 on each item) tend to perform somewhat better on some items
than on others. Similarly, Figure 4 shows the expected distribution of probabilities for Item 4
(solid line) and Item 7 (dashed lines) from the April test—again, two items with the most
dissimilar rating scale category calibrations. In this case, the similarities are not as apparent. As
was true for the February test, the transitions from the 20 to the 30 categories are dissimilar for
the two items from the April test. However, the transitions from the 30 to the 40 categories are
nearly identical, while the transitions from the 40 to the 50 and from the 50 to the 60 categories
are somewhat dissimilar for these two items.
TSE Rating Scale
Are the five scale categories on the TSE rating scale appropriately ordered? Is the rating scale
functioning properly as a five-point scale? Are the scale categories clearly distinguishable?
To answer these questions, we examined the average examinee proficiency measure by
rating scale category for each of the 12 TSE items from the February and April data. We also
examined the outfit mean-square index for each rating category for each TSE item.
To compute the average examinee proficiency measure for a rating category, the
examinee proficiency measures (in logits) for all examinees receiving a rating in that category on
that item are averaged. If the rating scale for the item is functioning as intended, then the average
examinee proficiency measures will increase in magnitude as the rating scale categories increase.
When we see this pattern borne out in the data, the results suggest that examinees with higher
ratings on the item are indeed exhibiting "more" of the variable being measured (i.e., communicative language ability) than examinees with lower ratings on that item, and therefore the intentions of those who designed the rating scale are being fulfilled, Linacre (1999b) asserts.5

Table 9
Rating Scale Category Calibrations for the February 1997 TSE Items

Item     Category 30    Category 40    Category 50    Category 60
1        -4.96          -3.80           1.90           6.87
2        -5.37          -3.18           2.13           6.43
3        -5.42          -3.03           1.97           6.48
4        -6.65          -2.88           2.74           6.79
5        -6.22          -2.52           2.36           6.38
6        -6.27          -2.33           2.32           6.28
7        -5.23          -2.62           2.00           5.85
8        -5.22          -2.41           1.83           5.80
9        -5.59          -2.54           1.99           6.15
10       -6.08          -2.65           2.20           6.52
11       -5.65          -2.59           2.01           6.23
12       -6.13          -2.78           2.37           6.54
Mean     -5.73          -2.78           2.15           6.36
SD        0.52           0.41           0.26           0.33

Table 10
Rating Scale Category Calibrations for the April 1997 TSE Items

Item     Category 30    Category 40    Category 50    Category 60
1        -5.65          -3.31           1.98           6.97
2        -5.16          -3.40           2.00           6.56
3        -6.20          -3.13           2.51           6.82
4        -7.58          -2.68           2.97           7.29
5        -5.85          -3.07           2.33           6.59
6        -5.01          -3.25           2.04           6.22
7        -4.99          -2.99           1.88           6.10
8        -5.24          -2.97           2.03           6.18
9        -6.09          -2.42           2.26           6.25
10       -6.23          -2.88           2.34           6.77
11       -5.73          -2.71           2.16           6.28
12       -6.41          -2.93           2.48           6.86
Mean     -5.85          -2.98           2.25           6.57
SD        0.73           0.28           0.31           0.38

Figure 3
Category Probability Curves for Items 4 and 8 (February Test)
[Figure: category probability curves P(x) plotted against examinee proficiency B from -8 to 8 logits, showing the 20-30, 30-40, 40-50, and 50-60 category transitions for the two items.]

Figure 4
Category Probability Curves for Items 4 and 7 (April Test)
[Figure: category probability curves P(x) plotted against examinee proficiency B from -8 to 8 logits, showing the 20-30, 30-40, 40-50, and 50-60 category transitions for the two items.]
Tables 11 and 12 contain the average examinee proficiency measures by rating scale
category for each of the 12 TSE items from the February and April TSE administrations. For
nearly all the items, the average examinee proficiency measures increase as the rating scale
categories increase. The exceptions are items 1 and 4 from the February administration and item
7 from the April administration. In these three cases, the average examinee proficiency measure
for category 30 is lower than the average examinee proficiency measure for category 20—an
unexpected result. These findings suggest that for these three items the resulting measures for at
least some of the examinees receiving 20s on those particular items may be of doubtful utility
(Linacre, 1999b). However, as Linacre notes, if raters seldom assign ratings in a particular
category, then the calibration for that category may be imprecisely estimated and unstable. This
can directly affect the calculation of the average examinee proficiency measure for the category,
because the category calibrations are used in the estimation process.6 As is shown in Tables 13
and 14, only 1% or fewer of the total ratings assigned for each item were 20s. For a number of
items, fewer than 10 ratings of 20 were assigned (i.e., for February, items 4, 5, 6, and 12; for
April, items 3, 4, 10, and 12). When fewer than 10 ratings are assigned in a category, then
including or excluding a single rating can often substantially alter the estimated scale structure.
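That instability is easy to see with a small sketch. As footnote 6 notes, the category calibrations depend on the log-ratio of the frequencies of adjacent categories, so when a category is used only a handful of times, a single additional rating shifts that log-ratio noticeably. The counts below are hypothetical, and the calculation is only a rough stand-in for the full estimation procedure.

    import math

    # Hypothetical counts of ratings in categories 20 and 30 for one item.
    n20, n30 = 4, 110

    # Rough indicator of the 20-30 boundary: the log-ratio of adjacent category
    # frequencies (a simplification of the estimation described in footnote 6).
    before = math.log(n20 / n30)
    after = math.log((n20 + 1) / n30)     # effect of one additional rating of 20
    print(round(after - before, 2))       # shift of roughly 0.22 logits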
Is it problematic that the average examinee proficiency measure for category 30 for three
items is lower than the average examinee proficiency measure for category 20, or not? Because
raters infrequently use the 20 category on the TSE scale (and that is consistent with the intentions
of those who developed the scale), it may be that these results are a statistical aberration, perhaps
a function of imprecise statistical estimation procedures. To investigate further the
trustworthiness of ratings of 20, we examined a second indicator of rating scale functionality—
the outfit mean-square index Facets reports for each rating scale category. For each rating scale
category for an item, Facets computes the average examinee proficiency measure (i.e., the
“observed” measure) and an “expected” examinee proficiency measure (i.e., the examinee
proficiency measure the model would predict for that rating category if the data were to fit the
model). When the observed and expected examinee proficiency measures are close, then the
outfit mean-square index for the rating category will be near the expected value of 1.0. The
greater the discrepancy between the observed and expected measures, the larger the mean-square
index will be. For a given rating category, an outfit mean-square index greater than 2.0 suggests
that a rating in that category for one or more examinees may not be contributing to meaningful
measurement of the variable (Linacre, 1999b). Outfit mean-square indices are more sensitive to
the occasional outlying rating than infit mean-square indices; therefore, rating categories at the
ends of a scale are more likely to exhibit high outfit mean-square indices than rating categories in
the middle of a scale.
5 It should be noted that there is some disagreement in the Rasch community about the meaningfulness of this indicator of rating scale functionality. Andrich (1998) has challenged the notion that if the average measures increase monotonically, then that is an indication that the measuring instrument is performing satisfactorily. The average measures are not distribution free, Andrich points out. By contrast, the threshold estimates produced by the Rasch Unidimensional Measurement Models (RUMM) computer program are independent of the examinee distribution, and, in Andrich's view, provide a more stable indicator of whether a measuring instrument is functioning as intended.
6 To compute the calibration for a given rating category, Fk, the log-ratio of the frequencies of the categories adjacent to that particular category is used.
Table 11
Average Examinee Proficiency Measures and Outfit Mean-Square Indices for the February 1997 TSE Items

Item    Category 20     Category 30     Category 40    Category 50    Category 60
1       -1.90 (4.1)     -2.23 (1.1)      .42 (1.2)     3.71 (1.2)     6.97 (1.4)
2       -3.26 (2.2)     -2.36 (1.0)      .48 (1.0)     3.67 (1.0)     7.04 (1.1)
3       -4.33 (1.6)     -2.51 (.9)       .35 (1.0)     3.61 (1.0)     6.98 (1.0)
4       -1.10 (3.2)     -2.45 (.8)       .90 (.8)      4.26 (.8)      7.64 (.9)
5       -3.51 (2.1)     -2.28 (.9)       .68 (.9)      3.87 (1.0)     7.13 (1.0)
6       -4.91 (.7)      -2.26 (.8)       .70 (.8)      3.89 (.8)      7.21 (.9)
7       -2.64 (3.0)     -2.49 (1.0)      .28 (.8)      3.41 (.9)      6.60 (.9)
8       -3.80 (1.5)     -2.31 (.8)       .40 (.9)      3.40 (.9)      6.74 (.9)
9       -4.57 (.8)      -2.34 (.8)       .46 (.9)      3.57 (.9)      6.83 (1.0)
10      -3.45 (2.1)     -2.52 (.8)       .47 (1.0)     3.72 (.9)      7.05 (1.0)
11      -3.36 (2.4)     -2.36 (1.0)      .29 (.9)      3.57 (.9)      6.80 (1.0)
12      -3.80 (1.5)     -2.24 (.9)       .78 (.9)      3.98 (.8)      7.47 (.9)

Note. Each cell shows the average examinee proficiency measure (in logits) for examinees who received a rating in that category on that item; the outfit mean-square index associated with that rating category appears in parentheses.
Table 12
Average Examinee Proficiency Measures and Outfit Mean-Square Indices for the April 1997 TSE Items

Item    Category 20     Category 30     Category 40    Category 50    Category 60
1       -3.62 (1.0)     -1.87 (1.2)      .54 (1.3)     3.50 (1.4)     6.84 (1.2)
2       -3.74 (1.2)     -2.15 (1.1)      .27 (1.0)     3.47 (1.1)     6.61 (1.1)
3       -4.29 (.7)      -2.23 (1.0)      .54 (1.0)     3.82 (1.0)     6.91 (1.0)
4       -3.24 (.8)      -2.00 (.8)      1.06 (.9)      4.41 (.9)      7.56 (.9)
5       -2.76 (2.0)     -2.40 (.9)       .50 (.9)      3.73 (1.0)     6.85 (1.0)
6       -4.25 (.7)      -2.47 (.9)       .20 (.9)      3.48 (.8)      6.58 (.9)
7       -1.47 (3.5)     -2.61 (.8)       .13 (.9)      3.28 (.9)      6.45 (.9)
8       -2.98 (1.7)     -2.26 (.9)       .40 (.9)      3.50 (.9)      6.75 (.8)
9       -2.31 (1.6)     -2.11 (.9)       .54 (.9)      3.62 (.8)      6.79 (.8)
10      -3.35 (1.1)     -2.17 (1.0)      .53 (.9)      3.84 (1.0)     6.93 (.9)
11      -3.55 (1.2)     -2.43 (.9)       .30 (.9)      3.47 (.9)      6.58 (.8)
12      -3.83 (.8)      -2.33 (.8)       .68 (.8)      4.01 (.7)      7.30 (.8)

Note. Each cell shows the average examinee proficiency measure (in logits) for examinees who received a rating in that category on that item; the outfit mean-square index associated with that rating category appears in parentheses.
Table 13
Frequency and Percentage of Examinee Ratings in Each Category for the February 1997 TSE Items

Item    Category 20    Category 30    Category 40      Category 50      Category 60
1       15 (1%)        75 (3%)        1,125 (39%)      1,356 (47%)      315 (11%)
2       16 (1%)        121 (4%)       1,200 (42%)      1,192 (41%)      357 (12%)
3       19 (1%)        150 (5%)       1,163 (40%)      1,228 (43%)      326 (11%)
4       4 (<1%)        104 (4%)       1,264 (44%)      1,133 (39%)      381 (13%)
5       9 (<1%)        168 (6%)       1,190 (41%)      1,136 (39%)      383 (13%)
6       8 (<1%)        177 (6%)       1,146 (40%)      1,143 (40%)      412 (14%)
7       32 (1%)        223 (8%)       1,191 (41%)      1,045 (36%)      391 (14%)
8       24 (1%)        210 (7%)       1,058 (37%)      1,149 (40%)      444 (15%)
9       18 (1%)        204 (7%)       1,105 (38%)      1,165 (40%)      394 (14%)
10      14 (<1%)       202 (7%)       1,234 (43%)      1,135 (39%)      301 (10%)
11      23 (1%)        201 (7%)       1,221 (42%)      1,115 (39%)      326 (11%)
12      6 (<1%)        118 (4%)       1,111 (38%)      1,212 (42%)      439 (15%)

Note. Each cell shows the number of ratings given in that category for the item, with the corresponding percentage in parentheses.
Table 14
Frequency and Percentage of Examinee Ratings in Each Category for the April 1997 TSE Items

Item    Category 20    Category 30    Category 40      Category 50      Category 60
1       10 (<1%)       126 (4%)       1,196 (42%)      1,249 (44%)      265 (9%)
2       20 (1%)        137 (5%)       1,266 (44%)      1,142 (40%)      283 (10%)
3       8 (<1%)        162 (6%)       1,382 (49%)      1,037 (36%)      259 (9%)
4       1 (<1%)        141 (5%)       1,332 (47%)      1,082 (38%)      292 (10%)
5       10 (<1%)       157 (6%)       1,300 (46%)      1,077 (38%)      304 (11%)
6       26 (1%)        157 (6%)       1,280 (45%)      1,068 (38%)      317 (11%)
7       34 (1%)        201 (7%)       1,233 (43%)      1,070 (38%)      310 (11%)
8       16 (1%)        154 (5%)       1,176 (41%)      1,123 (39%)      379 (13%)
9       10 (<1%)       235 (8%)       1,205 (42%)      1,048 (37%)      350 (12%)
10      7 (<1%)        172 (6%)       1,271 (45%)      1,108 (39%)      288 (10%)
11      21 (1%)        252 (9%)       1,296 (46%)      1,001 (35%)      278 (10%)
12      4 (<1%)        139 (5%)       1,252 (44%)      1,130 (40%)      317 (11%)

Note. Each cell shows the number of ratings given in that category for the item, with the corresponding percentage in parentheses.
When we reviewed the outfit mean-square indices presented in Table 11 (February data),
we noted that several of the items have outfit mean-square indices greater than 2.0 for rating
category 20 (i.e., items 1, 2, 4, 5, 7, 10, and 11). Similarly, Table 12 shows that several of the
items from the April 1997 TSE administration have outfit mean-square indices greater than 2.0
for rating category 20 (i.e., items 5 and 7). Given this information, we then looked at the table of
individual misfitting ratings that Facets produces to identify the specific misfitting examinees
who received ratings of 20 on these particular items. There were 19 examinees in the February
data set and 11 examinees in the April data set who received ratings of 20 on the item in question
when the model would have predicted that the examinee should have received a higher rating,
given the ratings the examinee received on the other items. Interestingly, in 27 of these cases,
both raters who evaluated the examinee gave ratings of 20 on the item in question. To some
degree, the fact that the two raters agreed on a rating of 20 on the item provides support for the
view that those ratings are credible for these 27 examinees and should not be changed, even
though the ratings of 20 are surprising and don’t seem to “fit” with the higher ratings each
examinee received on the other items.
However, there were three examinees who received a rating of 20 on the item in question
from one rater who evaluated their performance but not from the other rater. Those examinees’
rating patterns are shown in Table 15. For Examinee #1004, the ratings of 20 on items 1 and 2
from Rater #56 are quite unexpected because this rater gave 40s on all other items. The second
rater, Rater #44, gave the examinee nearly all 40s and did not concur with the first rater on the
ratings of 20 for items 1 and 2. In the second example the rating of 20 on item 5 for Examinee
#353 from Rater #12 is unexpected as well, because the rater gave the examinee 40s and 50s on
all other items (as did the second rater, Rater #16, a very harsh rater). Finally, the rating of 20 on
item 5 for Examinee #314 from Rater #17 is surprising because the rater gave 40s on all other
items. Again, the second rater (Rater #56) gave the examinee 30s and 40s on the items, no 20s.
These examples vividly illustrate Facets' capacity to identify a rater's occasionally suspect use of
a rating category.
To summarize, the rating scales for the twelve TSE items on each test form seem to be
functioning as intended, for the most part. The fact that for a number of items the raters
infrequently use the 20 category on the scale makes it appear that the 20 category is problematic
for some of these items. However, it may be the distribution-sensitive nature of the statistical
estimation procedures used to produce the category calibrations that makes the category appear
problematic. At any rate, in the future it will be important to monitor closely raters’ use of the 20
category to make certain that those ratings continue to be trustworthy. In this study, a close
inspection of the rating patterns for 27 of the 30 misfitting examinees who received surprising
ratings of 20 on one item indicates that the raters appeared to use the 20 rating category
appropriately, even though the ratings of 20 did not seem consistent with the higher ratings each
examinee received on the other items. However, in three cases, the ratings of 20 on one or two
items seem somewhat more suspect, because the second rater did not concur with those
judgments. In instances like these, TSE program personnel might want to review the examinee’s
performance before issuing a final score report to verify that the ratings of 20 should stand. In all
three cases, these examinees would not have been identified as in need of third-rater adjudication
because the averages of the two raters’ 12 ratings of each examinee were less than 10 points
apart.
Table 15
Rating Patterns and Fit Indices for Selected Examinees

Ratings Received by Examinee #1004
(Examinee Proficiency Measure = -1.05, Standard Error = .45)
(Examinee Infit Mean-Square Index = 1.8; Examinee Outfit Mean-Square Index = 2.4)
(Outfit Mean-Square Index for Category 20 of Item 1 = 4.1)
                              Item:   1   2   3   4   5   6   7   8   9  10  11  12
Rater #56 (Severity = .57)           20  20  40  40  40  40  40  40  40  40  40  40
Rater #44 (Severity = .37)           40  40  40  40  40  40  40  30  40  40  40  40

Ratings Received by Examinee #353
(Examinee Proficiency Measure = 1.50, Standard Error = .46)
(Examinee Infit Mean-Square Index = 2.3; Examinee Outfit Mean-Square Index = 2.2)
(Outfit Mean-Square Index for Category 20 of Item 5 = 2.1)
                              Item:   1   2   3   4   5   6   7   8   9  10  11  12
Rater #16 (Severity = 1.26)          50  40  40  40  40  50  40  50  40  40  50  50
Rater #12 (Severity = .25)           40  40  40  40  20  50  40  50  40  40  40  40

Ratings Received by Examinee #314
(Examinee Proficiency Measure = -3.34, Standard Error = .39)
(Examinee Infit Mean-Square Index = 1.3; Examinee Outfit Mean-Square Index = 1.3)
(Outfit Mean-Square Index for Category 20 of Item 5 = 2.0)
                              Item:   1   2   3   4   5   6   7   8   9  10  11  12
Rater #56 (Severity = -.91)          40  30  30  30  30  40  40  30  30  40  30  30
Rater #17 (Severity = -.80)          40  40  40  40  20  40  40  40  40  40  40  40
Raters
Do TSE raters differ in the severity with which they rate examinees?
We used Facets to produce a measure of the degree of severity each rater exercised when
rating examinees from the February and April tests. Table 16 shows a portion of the output from
the Facets analysis summarizing the information the computer program provided about the
February TSE raters. The raters are ordered in the table from most severe to most lenient. The
higher the rater severity measure, the more severe the rater. To the right of each rater severity
measure is the standard error of the estimate, indicating the precision with which it has been
measured. Other things being equal, the greater the number of ratings an estimate is based on,
the smaller its standard error. The rater severity measures for the February TSE raters ranged
from -2.01 logits to 1.54 logits, a 3.55 logit spread, while the rater severity measures for the
April TSE raters ranged from -1.89 logits to 1.30 logits, a 3.19 logit spread.
A more substantive interpretation of rater severity could be obtained by examining each
rater’s mean rating. However, because all TSE raters do not rate all TSE examinees, it is
difficult to determine how much each rater’s mean rating is influenced by the sample of
examinee performances that he or she rates. For example, suppose the mean ratings of two raters
(A and B), each of whom rates a different set of examinees, are compared. Rater A’s mean rating
is 40.0, and Rater B’s mean rating is 50.0. In this case, we know nothing about the relative
severity of Raters A and B because we know nothing about the comparability of the examinee
performances that the two raters rated. One way of determining a fair difference between the
means of two such raters is to examine the mean rating for each rater once it has been corrected
for the deviation of the examinees in each rater’s sample from the overall examinee mean across
all raters. This fair average allows one to determine the extent to which raters differ in the mean
scores that they assign to examinees.
For example, the most severe rater involved in the February scoring session had a fair
average of 37.85, while the most lenient rater had a fair average of 44.68. That means that, on
average, these raters assigned scores that were about 7 raw score points apart. Similarly, the
most severe rater involved in the April scoring session had a fair average of 38.65, while the
most lenient rater had a fair average of 44.15—an average difference of 5.5 raw score points per
response scored.
In addition to the above analysis of rater severity, Facets provides a chi-square test of the
hypothesis that all the raters in the February scoring session exercised the same degree of
severity when rating examinees. The resulting chi-square value was 3838.7 with 65 degrees of
freedom. From this analysis, we concluded that the probability, after allowing for measurement error, that the raters who participated in the February scoring session are equally severe is less than .001; that is, the hypothesis of equal severity must be rejected. The results from the Facets chi-square test on the April data were similar. The chi-square value for that analysis was 2768.2 with 73 degrees of freedom. As with the February scoring, the probability that the raters who participated in the April scoring can be thought of as equally severe is less than .001.
Table 16
Summary Table for Selected TSE Raters

                   Rater    Number of    Rater Severity Measure    Standard    Infit Mean-Square
                   ID       Ratings      (in logits)               Error       Index
Most severe        54       624           1.54                     .08         0.9
                   16       948           1.31                     .07         0.7
                   47       264           1.22                     .13         1.2
                   48       696           1.20                     .08         1.1
                   .....
                   36       540           0.02                     .09         1.0
                   7        960           0.02                     .07         0.9
                   33       407           0.01                     .11         1.4
                   6        300          -0.01                     .13         1.2
                   .....
                   32       780          -1.39                     .08         0.9
                   41       300          -1.46                     .13         1.3
                   61       12           -1.83                     .63         0.4
Most lenient       60       480          -2.01                     .10         0.9

Mean                        581.8         .00                      .10         1.0
SD                          203.6         .72                      .07         0.2

Note. The mean and standard deviation are for all raters who participated in the February TSE scoring session, not just the selected raters shown in this table.
If raters differ in severity, how do those differences affect examinee scores?
After Facets calibrates all the raters, the computer program produces a fair average for
each examinee. This fair average adjusts the examinee’s original raw score average for
differences in rater severity, showing what score the examinee would have received had the
examinee been rated by two raters of average severity. We then compared examinees’ fair
averages to their original raw score averages (that is, their final reported scores) to determine
how differences in rater severity affect examinee scores. Tables 17 and 18 show the effects of
adjusting for rater severity on examinee raw score averages from the February and April
administrations, respectively.
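A minimal sketch of what such an adjustment involves is given below. It is not the Facets algorithm: it simply computes, for hypothetical parameter values, the rating that a rater of exactly average severity (severity = 0) would be expected to give the examinee on each item under the rating scale form of the model shown in Equation 2, and then averages those expectations to obtain a fair average on the raw-score metric. The examinee measure, item difficulties, rater severities, and thresholds are all hypothetical.

    import numpy as np

    CATEGORIES = np.array([20.0, 30.0, 40.0, 50.0, 60.0])

    def expected_rating(b, c, d, thresholds):
        # Expected rating for examinee b, rater c, and item d (all in logits).
        log_p = np.concatenate(([0.0], np.cumsum(b - c - d - np.asarray(thresholds))))
        p = np.exp(log_p - log_p.max())
        p /= p.sum()
        return float(np.dot(CATEGORIES, p))

    # Hypothetical examinee measure, item difficulties, and shared thresholds.
    b = -0.4
    item_difficulties = np.linspace(-0.5, 0.45, 12)
    thresholds = [-5.7, -2.8, 2.2, 6.4]

    # Fair average: the mean expected rating from a rater of average severity (c = 0).
    fair_average = np.mean([expected_rating(b, 0.0, d, thresholds) for d in item_difficulties])
    # For comparison, the mean expected rating from a relatively severe rater (c = 1.2).
    severe_average = np.mean([expected_rating(b, 1.2, d, thresholds) for d in item_difficulties])
    print(round(float(fair_average), 1), round(float(severe_average), 1))

With these hypothetical values, the severe rater's expected ratings average roughly two raw score points below the fair average, which is the kind of difference summarized in Tables 17 and 18.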
Table 17 shows that adjusting for differences in rater severity would have altered the raw
score averages of about one third (36.4%) of the February examinees by less than one-half point.
However, for about two-thirds (63.6%) of the examinees, adjusting for rater severity would have
altered raw score averages by more than one-half point (but less than 4 points). Further, the
following statements can be made about the latter group of examinees:
•  In about 25% of these cases, the examinee fair averages were 0.5 to 1.4 points higher than their raw score averages, while for 22% of the examinees, their fair averages were 0.5 to 1.4 points lower than their raw score averages.
•  About 8% of the examinees had fair averages that were 1.5 to 2.4 points higher than their raw score averages, while 6.5% had fair averages that were 1.5 to 2.4 points lower than their raw score averages.
•  About 1% had fair averages that were 2.5 to 3.2 points higher than their raw score averages, while another 1% had fair averages that were 2.5 to 3.4 points lower than their raw score averages.
•  Finally, in no cases were fair averages more than 3.2 points higher than raw score averages, and in only three cases (0.2% of the examinees) were fair averages 3.5 to 3.6 points lower than raw score averages.
Table 18 tells much the same story. The percentages of differences between the
examinees’ fair averages and their raw score averages in the April data set closely mirror those
found in the February data set. If we were to adjust for differences in rater severity, the fair
averages for two thirds (63.5%) of the examinees tested in April would have differed from their
raw score averages by more than one-half point (but less than 4 points). It is important to note
that in both sets of data, there are instances along the entire score distribution of differences of 2 to 4 points between examinees' raw score averages and their fair averages. This phenomenon does
not occur only at certain score intervals; it occurs across the distribution.
Educational Testing Service does not set cutscores for passing or failing the TSE. Those
institutions and agencies that use the TSE determine what the cutscores will be for their own
institutions. Therefore, we are not able to show what the impact of adjusting for rater severity
would be on passing rates. However, it is important to note that an adjustment as small as 0.1
could have important consequences for an examinee whose score was near an institution's cutoff point.
Suppose that an institution used a cutoff point of 50. When computing examinees’ reported
scores, raw score averages from 47.5 to 52.4 translate into a reported score of 50. If an
examinee’s raw score average were 47.4, that examinee would receive a reported score of 45 and
therefore would be judged not to have reached the institution’s cutoff point. However, if a 0.1
adjustment for rater severity were added, the examinee’s score would then be 47.5, and the
examinee would receive a score of 50, which the institution would consider to be a passing score.
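The mapping implied here (raw score averages from 47.5 to 52.4 translate into a reported 50) amounts to rounding the raw score average to the nearest multiple of five, with halves rounding upward. A small sketch of that mapping, using the hypothetical 0.1-point severity adjustment described above, is shown below.

    import math

    def reported_score(raw_average):
        # Round the raw score average to the nearest multiple of 5, halves rounding up,
        # consistent with the 47.5-to-52.4 -> 50 mapping described in the text.
        return int(math.floor(raw_average / 5 + 0.5) * 5)

    raw_average = 47.4
    adjustment = 0.1   # hypothetical adjustment for rater severity
    print(reported_score(raw_average), reported_score(raw_average + adjustment))   # 45, then 50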
Table 17
Effects of Adjusting for Rater Severity on Examinee Raw Score Averages
February 1997 TSE Administration

Difference between Examinees' Raw Score     Number of     Percentage of
Averages and Their Fair Averages            Examinees     Examinees
2.5 to 3.2 points                           13            0.9%
1.5 to 2.4 points                           120           8.2%
0.5 to 1.4 points                           365           24.8%
-0.4 to 0.4 points                          535           36.4%
-1.4 to -0.5 points                         317           21.6%
-2.4 to -1.5 points                         96            6.5%
-3.4 to -2.5 points                         20            1.4%
-3.6 to -3.5 points                         3             0.2%
TOTAL                                       1,469         100%
Table 18
Effects of Adjusting for Rater Severity on Examinee Raw Score Averages
April 1997 TSE Administration

Difference between Examinees' Raw Score     Number of     Percentage of
Averages and Their Fair Averages            Examinees     Examinees
2.5 to 3.0 points                           8             0.5%
1.5 to 2.4 points                           87            6.0%
0.5 to 1.4 points                           409           28.3%
-0.4 to 0.4 points                          528           36.5%
-1.4 to -0.5 points                         324           22.4%
-2.4 to -1.5 points                         83            5.7%
-3.4 to -2.5 points                         6             0.4%
-3.5 points                                 1             0.1%
TOTAL                                       1,446         100%
How interchangeable are the raters?
Our analysis suggests that the TSE raters do not function interchangeably. Raters differ
somewhat in the levels of severity they exercise when rating examinees. The largest difference
found here between examinees’ raw score averages and their fair averages was 3.6 raw score
points. While this may not seem like a large difference, it is important to remember that even an
adjustment for rater severity as small as 0.1 raw score point could have important consequences
for examinees, particularly those whose scores lie in critical decision-making regions of the score
distribution.
Do TSE raters use the TSE rating scale consistently?
To answer this question, we examined the mean-square fit indices for each rater and the
proportion of unexpected ratings with which each rater is associated. In general, raters with infit
mean-square indices less than 1 show less variation than expected in their ratings, even after the
particular performances on which those ratings are based have been taken into account. Their
ratings tend to be "muted," sometimes as a consequence of overusing the inner rating scale
categories (i.e., the 30, 40, and 50 rating categories). By contrast, raters with infit mean-square
indices greater than 1 show more variation than expected in their ratings. Their ratings tend to be
"noisy," sometimes as a consequence of overusing the outer rating scale categories—that is, the
20 and 60 rating categories (Linacre, 1994).
Outfit indices for raters take the same form as infit indices but are more sensitive to the
occasional unexpected extreme rating from an otherwise consistent rater. (The term “outfit” is
shorthand for “outlier-sensitive fit statistic.”) Most often the infit and outfit mean-square indices
for a rater will be identical. However, occasionally a rater may have an infit mean-square index
that lies within established upper- and lower-control limits, but the rater may have an outfit
mean-square index that lies outside the upper-control limit. In instances such as these, the rater
will often have given a small number of highly unexpected extreme ratings. For the most part,
the rater is internally consistent and uses the rating scale appropriately, but occasionally the rater
will give a rating that seems out of character with the rater’s other ratings.
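The difference in sensitivity can be illustrated numerically. In the sketch below (ours; the residuals and variances are hypothetical), a rater gives 99 unremarkable ratings and one highly unexpected extreme rating; because that extreme rating falls on an off-target observation carrying little statistical information, it is downweighted in the infit calculation but not in the outfit calculation.

    import numpy as np

    # Hypothetical standardized residuals (z) and model variances (w) for 100 ratings.
    z = np.full(100, 1.0)
    z[99] = 6.0             # one highly unexpected extreme rating
    w = np.full(100, 0.5)
    w[99] = 0.05            # the extreme rating carries little information

    outfit = np.mean(z ** 2)                   # unweighted mean square: about 1.35
    infit = np.sum(w * z ** 2) / np.sum(w)     # information-weighted mean square: about 1.04
    print(round(float(infit), 2), round(float(outfit), 2))

With an upper-control limit of 1.3, this hypothetical rater would be flagged by the outfit index but not by the infit index.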
Different testing programs adopt different upper- and lower-control limits for rater fit,
depending in part upon the nature of the program and the level of resources available for
investigating instances of misfit. Programs involved in making high-stakes decisions about
examinees may use stringent upper- and lower-control limits (for instance, an upper-control limit
of 1.3 and a lower-control limit of 0.7), while low-stakes programs may use more relaxed limits
(for example, an upper-control limit of 1.5 and lower-control limit of 0.4). Generally, misfit
greater than 1 is more problematic than is misfit less than 1.
For this study, we adopted an upper-control limit of 1.3 and a lower-control limit of 0.7.
The rater mean-square fit indices for the February data ranged from 0.4 to 1.4, while for April
they ranged from 0.5 to 1.5. Tables 19 and 20 show the number and percentage of raters who
exhibit three levels of fit in the February and April TSE data, respectively. These tables show
that in each scoring session over 90% of the raters showed the expected amount of variation in
their ratings. About 1% to 3% of the raters showed somewhat less variation than expected, while
about 5% to 7% showed somewhat more variation than expected.
Table 19
Frequencies and Percentages of Rater Mean-Square Fit Indices for the February 1997 TSE Data

Fit Range           Infit: Number of Raters (Percentage)    Outfit: Number of Raters (Percentage)
fit < 0.7           2 (3%)                                   2 (3%)
0.7 < fit < 1.3     61 (92%)                                 60 (91%)
fit > 1.3           3 (5%)                                   4 (6%)
Table 20
Frequencies and Percentages of Rater Mean-Square Fit Indices for the April 1997 TSE Data

Fit Range           Infit: Number of Raters (Percentage)    Outfit: Number of Raters (Percentage)
fit < 0.7           1 (1%)                                   1 (1%)
0.7 < fit < 1.3     69 (93%)                                 68 (92%)
fit > 1.3           4 (5%)                                   5 (7%)
Are there raters who rate examinee performances inconsistently?
For any combination of rater severity, examinee proficiency, and item difficulty, an
expected rating can be computed based on the Partial Credit Model. Fit indices (infit and outfit)
indicate the cumulative agreement between observed and expected ratings for a particular rater
across all items and examinees encountered by that rater. Individual ratings that differ greatly
from modeled expectations indicate ratings that may be problematic, and it is informative to
examine the largest of these residuals to determine whether the validity of these ratings is
suspect. In this study, we examined the proportion of times individual raters were associated with
these large residuals.
Our test is based on the assumption that, if raters are performing interchangeably, then we
can expect large residuals to be uniformly distributed across all the raters, relative to the
proportion of the total ratings assigned by the rater. That is, the null proportion of large residuals
for each rater (p) is Nu/Nt, where Nu is the total number of large residuals and Nt is the total
number of ratings. For each rater, the observed proportion of large residuals (pr) is simply Nur/Ntr,
where Nur designates the number of unexpectedly large residuals associated with rater r and Ntr
indicates the number of ratings assigned by rater r. Inconsistently scoring raters would be
identified as those for whom the observed proportion exceeds the null proportion beyond what
could be considered chance variation. Specifically, given sufficient sample size, we can
determine the likelihood that the observed proportion could have been produced by the expected
proportion using equation 3 (Marascuilo & Serlin, 1988). If an observed zp is greater than 2.00,
we would conclude that the rater is rating inconsistently. The sum of the squared zp values across
raters should approximate a chi-square distribution with R-1 degrees of freedom (where R equals
the number of raters) under the null hypothesis that raters are rating consistently.
zp = (pr - p) / sqrt[(p - p^2) / Ntr]    (3)
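A direct transcription of Equation 3 is given below; the counts are hypothetical and are meant only to show the calculation.

    import math

    def z_p(n_unexpected_rater, n_ratings_rater, n_unexpected_total, n_ratings_total):
        """The z statistic of Equation 3 for one rater's proportion of large residuals."""
        p = n_unexpected_total / n_ratings_total       # null proportion across all raters
        p_r = n_unexpected_rater / n_ratings_rater     # observed proportion for rater r
        return (p_r - p) / math.sqrt((p - p ** 2) / n_ratings_rater)

    # Hypothetical counts: 30 large residuals among 600 ratings for this rater,
    # against 1,000 large residuals among 35,000 ratings overall.
    print(round(z_p(30, 600, 1000, 35000), 2))   # about 3.15, beyond the 2.00 criterion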
Tables 21 and 22 show the number and percentage of raters who were associated with
increasing levels of unexpected ratings for the February and April data. Raters whose frequency
of unexpected ratings is less than 2.00 can be considered to operate within normal bounds
because the number of large residuals associated with these ratings can be attributed to random
variation. Raters with increasingly larger zp indices assign more unexpected ratings than would
be predicted for them. These tables show that the frequency of unexpected ratings in the
February scoring sessions was fairly low. Although the difference is statistically significant, the effect size is very small, χ²(66) = 120.39, p < .001, φ² = .003. Although there were considerably more inconsistent raters in the April session, the phi coefficient did not reveal a much stronger trend, χ²(73) = 163.70, p < .001, φ² = .005.
Table 21
Frequencies of Inconsistent Ratings for February 1997 TSE Raters

zp                    Number of Raters    Percentage of Raters
zp < 2.00             61                  91%
2.00 ≤ zp < 4.00      5                   8%
zp ≥ 4.00             1                   2%
Table 22
Frequencies of Inconsistent Ratings for April 1997 TSE Raters

zp                    Number of Raters    Percentage of Raters
zp < 2.00             45                  61%
2.00 ≤ zp < 4.00      13                  18%
zp ≥ 4.00             16                  22%
Are there any overly consistent raters whose ratings tend to cluster around the midpoint of the
rating scale and who are reluctant to use the endpoints of the scale? Are there raters who tend
to give an examinee ratings that differ less than would be expected across the 12 items? Are
there raters who cannot effectively differentiate between examinees in terms of their levels of
proficiency?
By combining the information provided by the fit indices and the proportion of
unexpected ratings for each rater, we were able to identify specific ways that raters may be
contributing to measurement error. Wolfe, Chiu, and Myford (1999) found that unique patterns
of these indicators are associated with specific types of rater errors. Accurate ratings result in
rater infit and outfit mean-square indices near their expectations (that is, about 1.0) and low
proportions of unexpected ratings. Random rater errors, on the other hand, result in large rater
infit and outfit mean-square indices as well as large proportions of unexpected ratings. Raters
who restrict the range of the scores that they assign to examinees, by exhibiting halo effects
(assigning similar scores to an examinee across all items) or centrality effects (assigning a large
proportion of scores in the middle categories of a rating scale), have infit and outfit mean-square
indices that are smaller than expected while having high proportions of unexpected ratings.
Also, extreme scoring (a tendency to assign a large proportion of scores in the highest- and/or
lowest-scoring categories) is indicated by a rater infit mean-square index that is close to its
expected value, an outfit mean-square index that is large, and a high proportion of unexpected
ratings. Table 23 summarizes the criteria for identifying raters exhibiting each of these types of
rater errors included in this study.
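The classification logic, transcribed from the criteria summarized in Table 23, can be sketched as follows (ours; Facets itself does not perform this classification):

    def rater_effect(infit, outfit, zp):
        """Classify a rater according to the criteria summarized in Table 23."""
        if 0.7 <= infit <= 1.3 and 0.7 <= outfit <= 1.3 and zp <= 2.00:
            return "Accurate"
        if infit > 1.3 and outfit > 1.3 and zp > 2.00:
            return "Random"
        if infit < 0.7 and outfit < 0.7 and zp > 2.00:
            return "Halo/Central"
        if 0.7 <= infit <= 1.3 and outfit > 1.3 and zp > 2.00:
            return "Extreme"
        return "Other"

    # Hypothetical index values for three raters.
    print(rater_effect(1.0, 1.0, 0.8), rater_effect(1.4, 1.5, 3.2), rater_effect(1.1, 1.4, 2.5))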
As shown in Tables 24 and 25, no raters in the February or April data exhibit centrality or
halo effects, and only one rater exhibits extreme scoring. A small percentage of the raters in each
scoring session (3% in February and 5% in April) show some evidence of randomness in their
scoring. However, the majority of raters in both scoring sessions fit the accurate profile (70% in
February and 49% in April). Some raters (27% in February and 45% in April) exhibited patterns
of indicators that were not consistent with any of the rater errors we studied (i.e., other).
Table 23
Rater Effect Criteria

Rater Effect     Infit                   Outfit                  zp
Accurate         0.7 ≤ infit ≤ 1.3       0.7 ≤ outfit ≤ 1.3      zp ≤ 2.00
Random           infit > 1.3             outfit > 1.3            zp > 2.00
Halo/Central     infit < 0.7             outfit < 0.7            zp > 2.00
Extreme          0.7 ≤ infit ≤ 1.3       outfit > 1.3            zp > 2.00
Table 24
Rater Effects for February 1997 TSE Raters

Rater Effect     Number of Raters    Percentage of Raters
Accurate         46                  70%
Random           2                   3%
Halo/Central     0                   0%
Extreme          0                   0%
Other            18                  27%

Note: Raters whose indices do not fit these patterns are labeled "other."
Table 25
Rater Effects for April 1997 TSE Raters

Rater Effect    Number of Raters    Percentage of Raters
Accurate               36                  49%
Random                  4                   5%
Halo/Central            0                   0%
Extreme                 1                   1%
Other                  33                  45%

Note: Raters whose indices do not fit these patterns are labeled "other."
Conclusions
An important goal in designing and administering a performance assessment is to
produce scores from the assessment that will lead to valid generalizations about an examinee’s
achievement in the domain of interest (Linn, Baker, & Dunbar, 1991). We do not want the
inferences we make based on the examinee’s performance to be bound by the particulars of the
assessment situation itself. More specifically, to have meaning, the inferences we make from an
examinee’s score cannot be tied to the particular raters who scored the examinee’s performance,
to the particular set of items the examinee took, to the particular rating scale used, or to the
particular examinees who were assessed in a given administration. If we are to accomplish this
goal, we need tools that will enable us to monitor sources of variability within the performance
assessment system we have created so that we can determine the extent to which these sources
affect examinee performance. Beyond this, these tools must provide useful diagnostic
information about how the various sources are functioning.
In this study, we examined four sources of variability in the TSE assessment system to
gain a better understanding of how the complex system operates and where it might need to
change. Variability is present in any performance assessment system; within a statistical
framework, typical ranges can be modeled. In a system that is "under statistical control"
(Deming, 1975), the major sources of variability have been identified, and observations have
been found to follow typical, expected patterns. Quantifying these patterns is useful because it makes evident the uncertainty in final inferences (in this case, final scores) that is associated with each aspect of the system, and it allows the effects of any changes that are introduced to be monitored. Here, we determined the extent to which the TSE assessment system is functioning as intended by examining four sources of variability within the system: the examinees, the items, the rating scale, and the raters.
Examinees
When we focused our lens on the examinees who took the February and April 1997
administrations of the TSE, we found that the test usefully separated examinees into eight
statistically distinct levels of proficiency. The examinee proficiency measures were determined
to be trustworthy in terms of their precision and stability. We would expect an average
examinee’s true proficiency to lie within about two raw score points of his or her reported score
95% of the time. It is important to emphasize, however, that the size of the standard error of
measurement varies across the proficiency distribution, particularly at the tails of the distribution.
Therefore, one should not assume that the standard error of measurement is constant across that
distribution. This finding has important implications for institutions setting their own cutscores
on the TSE.
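As a simple illustration of the kind of uncertainty statement made above, the sketch below computes a 95% confidence band as the reported score plus or minus 1.96 standard errors. The SEM value of one raw-score point is an assumption for an average examinee (implied by the roughly two-point band described above); actual SEMs vary across the score distribution and are larger in the tails.

    def confidence_band(reported_score, sem, z=1.96):
        """Return the lower and upper bounds of an approximate 95% confidence band."""
        return reported_score - z * sem, reported_score + z * sem

    low, high = confidence_band(reported_score=45.0, sem=1.0)
    print(f"95% band: {low:.1f} to {high:.1f}")  # roughly 43.0 to 47.0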
The proficiency measures for individual examinees have measurement properties that
allow them to be used for making decisions about individuals. Few examinees from either
administration of the test exhibited unusual profiles of ratings across the 12 TSE items. Indeed,
only 1% of the examinees in each of the administrations showed significant misfit. However,
even in isolated cases like these, TSE program personnel might want to review examinee
performances on the misfitting items before issuing score reports, particularly if an examinee’s
measure is near a critical decision-making point in the score distribution.
When we examined the current procedure for resolving discrepant ratings, we found that
the Facets analysis identified some patterns of ratings as suspect that would not have been so
identified using the TSE discrepancy criteria. The current TSE procedure for identifying
discrepant ratings focuses on absolute differences in the ratings assigned by two raters. Such a
procedure nominates a large number of cases for third-rater resolution. Cases that result from the
pairing of severe and lenient raters could be handled more easily by adjusting an examinee’s
score for the severity or leniency of the two raters rather than having a third rater adjudicate. If
Facets were used during a scoring session to analyze TSE data, then there would be no need to
bring these cases to the attention of more experienced TSE raters. Facets would not identify
these cases as discrepant but rather would automatically adjust these scores for differences in
rater severity, freeing the more experienced TSE raters to concentrate their efforts on cases
deserving more of their time and attention. Indeed, the aberrant patterns that Facets does identify
are cases that would be difficult to resolve by adjusting scores for differences in rater severity.
These cases are most likely situations in which raters make seemingly random rating errors or
examinees perform differentially across items in a way that differs from how other examinees
perform on the same items.
Future TSE research that focuses on the issue of how best to identify and resolve
discrepant ratings may result in both lower scoring costs (by requiring fewer resolutions) and an
increase in score accuracy (by identifying cases that cannot readily be resolved through
adjustments for differences in rater severity). Studies might also be undertaken to determine
whether there are common features of examinee responses—or raters’ behavior—that result in
misfit. Facets analyses can identify misfitting examinees or misfitting raters, but other research methodologies (such as analysis of rater protocols, discourse analyses, and task analyses) could be used in conjunction with Facets to gain an understanding of not only where misfit occurs but
also why. The results of such research could usefully inform rater training procedures.
Items
From our analysis of the TSE items, we conclude that the items on each test work
together; ratings on one item correspond well to ratings on the other items. Yet, none of the
items appear to function in a redundant fashion (all items had infit mean-square indices greater
than 0.7). Ratings on individual items within the test can be meaningfully combined; there is
little evidence of psychometric multidimensionality in either of the two data sets (all items had
infit mean-square indices smaller than 1.3). Consequently, it is appropriate to generate a single
summary measure to capture the essence of examinee performance across the 12 items. This
should come as good news to the TSE program, because these findings would seem to suggest
that there is little overlap among the 12 items on the TSE. Each item appears to make an
independent contribution to the measurement of communicative language ability. However, the items differ little in difficulty; examinees do not find it appreciably harder to earn high ratings on some items than on others. If test developers were to add items that are easier, as well as items that are more difficult, than those that currently appear on the TSE, the instrument might better discriminate among levels of examinee proficiency.
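The item-level screen described in this section can be expressed as a simple rule on the infit mean-square index, using the 0.7 and 1.3 thresholds cited above. The sketch below is illustrative only; the item labels and infit values are invented.

    def screen_items(infit_by_item, low=0.7, high=1.3):
        """Flag items whose infit mean-square falls outside the productive range."""
        flags = {}
        for item, infit in infit_by_item.items():
            if infit < low:
                flags[item] = "possibly redundant"
            elif infit > high:
                flags[item] = "possible multidimensionality"
            else:
                flags[item] = "productive fit"
        return flags

    print(screen_items({"Item 1": 0.95, "Item 2": 1.12, "Item 3": 0.84}))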
It should be noted that the current design of the TSE, in which two common items link one test form to all other test forms, did not allow us to examine item comparability across TSE forms. Future research that looks at issues of item comparability, and at the feasibility of using Facets as a tool for equating TSE administrations through common link items or common raters, would also be of potential benefit to the TSE program.
TSE Rating Scale
The TSE rating scale functions as a five-point scale, and, for the most part, the scale
categories are clearly distinguishable. The scale maintains a similar, though not identical,
category structure across the items. For a number of items, raters use the 20 category infrequently, which makes that category appear problematic for those items. However, the apparent problem may instead reflect the distribution-sensitive nature of the statistical estimation procedures used to produce the category calibrations. In any case, it will be important to monitor raters' use of the 20 category closely to make certain that those ratings continue to be trustworthy.
Raters
When we examined the raters of the February and April 1997 TSE, we found that they
differed somewhat in the severity with which they rated examinees, confirming Marr’s (1994)
findings. If we were to adjust for differences in rater severity, the fair averages for two-thirds of the examinees in both TSE administrations would have differed from their raw score averages by more than one-half point. The largest rater effect would have been 3.6 raw score points, meaning that the largest change in an examinee's score would have been about 4 points on the 20-to-60 scale. Though these differences may seem small, they can have
important consequences, especially for examinees whose scores lie in critical decision-making
regions of the score distribution. For these examinees, whether or not they meet an institution’s
cutscore may be determined by such adjustments.
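To show how a severity adjustment of this kind can work in principle, the sketch below computes the expected rating for one examinee-item-rater combination under a many-facet rating scale model, first with the rater's own severity and then with severity set to zero (the "fair" condition). This is not the Facets implementation, and every parameter value (examinee measure, item difficulty, rater severity, category thresholds) is an invented illustration.

    import math

    def expected_rating(theta, item_difficulty, rater_severity, thresholds,
                        categories=(20, 30, 40, 50, 60)):
        """Expected category score for one examinee-item-rater combination."""
        # Cumulative log-numerators for each category (the empty sum is 0 for the lowest category)
        log_num = [0.0]
        for tau in thresholds:
            log_num.append(log_num[-1] + (theta - item_difficulty - rater_severity - tau))
        denom = sum(math.exp(v) for v in log_num)
        probs = [math.exp(v) / denom for v in log_num]
        return sum(c * p for c, p in zip(categories, probs))

    thresholds = [-2.0, -0.7, 0.7, 2.0]  # illustrative category thresholds (logits)
    observed = expected_rating(theta=1.0, item_difficulty=0.0, rater_severity=0.8,
                               thresholds=thresholds)  # a relatively severe rater
    fair = expected_rating(theta=1.0, item_difficulty=0.0, rater_severity=0.0,
                           thresholds=thresholds)      # severity set to the rater average
    print(f"expected rating with severe rater: {observed:.1f}, fair average: {fair:.1f}")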
While the raters differed in severity, the vast majority used the TSE rating scale in a
consistent fashion. The raters appear to be internally consistent but are not interchangeable,
confirming the findings of Weigle (1994). Our analyses of individual rater behavior revealed that the accurate rater profile was the most common in both of the TSE scoring sessions we studied, fitting 70% of the raters in February and 49% of the raters in April. No raters exhibited centrality or halo effects, only one rater exhibited extreme scoring tendencies, and a small percentage of raters in each scoring session (3% of the raters in February and 5% in April) showed some evidence of randomness in their scoring. This latter category of raters, as well as those whose
patterns of indicators were not consistent with any of the rater errors we studied, should be the
subject of further research, because it seems likely that these raters are responsible for the types
of aberrant rating patterns that are most difficult to correct. The results of such research would
be beneficial to the TSE program, and ultimately improve the accuracy of TSE scores, because
they could provide information that could guide future rater training and evaluation efforts.
Next Steps
The scoring of an administration of the TSE involves working with many raters who each
evaluate a relatively small number of performances in any given scoring session. Establishing
sufficient connectivity in such a rating design is an arduous task. Even if the rating design is
such that all raters can be linked, in a number of cases the connections between raters are likely
to be weak and tentative, because the number of examinees any given pair of raters scores in
common is very limited. If there is insufficient connectivity in the data, then it is not possible to
calibrate all raters on the same scale. Consequently, raters cannot be directly compared in terms
of the degree of severity they exercise when scoring, because there is no single frame of reference
established for making such comparisons (Linacre, 1994). In the absence of a single frame of
reference, examinees (and items, too) cannot be compared on the same scale. As Marr (1994)
noted, insufficient overlap among raters has been a problem for researchers using Facets to
analyze TSE data. When analyzing data from 1992-93 TSE administrations, for instance, she
found that the ratings of eight raters had to be deleted because they were insufficiently connected.
In the present study, all raters were connected, but the connections between a number of the
raters were weak. Myford, Marr, and Linacre (1996) found the same connectivity problem to
exist in their analysis of data from the scoring of the Test of Written English (TWE®). To
strengthen the network for linking raters, they suggested instituting a rating design that calls for
all raters involved in a scoring session to rate a common small set of examinee performances (in
addition to their normal workload). In effect, then, a block of ratings from a fully crossed rating
design would be embedded into an otherwise lean data structure, thereby ensuring that all raters
would be linked through this common set.
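Connectivity of a rating design can be checked directly before calibration. The sketch below, which is our own illustration rather than anything Facets provides, treats raters and examinees as nodes in a graph with an edge for every rating and counts the connected components; a single component means every rater and examinee can, in principle, be placed on one scale.

    def count_components(ratings):
        """Count connected components in the rater-examinee graph; ratings = (rater, examinee) pairs."""
        parent = {}

        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        def union(a, b):
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[ra] = rb

        for rater, examinee in ratings:
            union(("rater", rater), ("examinee", examinee))
        return len({find(node) for node in parent})

    ratings = [("R1", "E1"), ("R2", "E1"), ("R2", "E2"), ("R3", "E3")]
    print(count_components(ratings))  # 2: rater R3 is not linked to the other raters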
We have completed a study that enabled us to evaluate the effectiveness of this rater
linking strategy (Myford & Wolfe, in press). All raters who took part in the February and April
1997 TSE scoring sessions rated a common set of six audiotaped examinee performances
selected from the tapes to be scored during that session. Prior to each of the two scoring
sessions, a small group of very experienced TSE raters who supervised each scoring session met
to select the six common tapes that all raters in the upcoming scoring session would be required
to score. These experienced raters listened to a number of tapes in order to select the set of six.
For each scoring session, they chose a set of tapes that displayed a range of examinee proficiency
—a few low-scoring examinees, a few high-scoring examinees, and some examinees whose
scores would likely fall in the middle categories of the TSE rating scale. Also, they included in
this set a few tapes that would be hard for raters to score (such as examinees who showed variability in their level of performance from item to item), as well as tapes that they judged to be solid examples of performance at a specific point on the scale (such as examinees who would likely receive the same or nearly the same rating on each item). Once the set of tapes was
identified, additional copies of each tape were made. These tapes were seeded into the scoring
session so that raters would not know that they were any different from the other tapes they were
scoring.
The specific questions that guided this study of the linking strategy include the following:
1. How does embedding blocks of ratings from various smaller subsets of the six examinee tapes in the operational data affect:
• the stability of examinee proficiency measures?
• the stability of rater severity measures?
• the fit of the raters?
• the spread of rater severity measures?
• the spread of examinee proficiency measures?
2. How many tapes do raters need to score in common in order to establish the minimal
requisite connectivity in the rating design, thereby ensuring that all raters and examinees can
be placed on a single scale? What (if anything) is to be gained by having all raters score
more than one or two tapes in common?
3. What are the characteristics of tapes that produce the highest quality linking? Are tapes that
exhibit certain characteristics more effective as linking tools than tapes that exhibit other
characteristics?
The results of this study will put us in a better position to advise the TSE program regarding the adequacy of its current rating design and to suggest any changes that might be instituted to improve the quality of examinee scores.
References
Andrich, D. (1985). An elaboration of Guttman scaling with Rasch models for measurement. In
N. Brandon-Tuma (Ed.), Sociological methodology (pp. 33-80). San Francisco: Jossey-Bass.
Andrich, D. (1998). Thresholds, steps, and rating scale conceptualization. Rasch Measurement:
Transactions of the Rasch Measurement SIG, 12 (3), 648.
Andrich, D., Sheridan, B., & Luo, G. (1997). RUMM (Version 2.7): A Windows-based item
analysis program employing Rasch unidimensional measurement models. Perth, Western
Australia: School of Education, Murdoch University.
Bachman, L. F., Lynch, B. K., & Mason, M. (1995). Investigating variability in tasks and rater
judgments in a performance test of foreign language speaking. Language Testing, 12 (2),
238-257.
Bejar, I. I. (1985). A preliminary study of raters for the Test of Spoken English (TOEFL
Research Report No. 18). Princeton, NJ: Educational Testing Service.
Brown, A. (1995). The effect of rater variables in the development of an occupation-specific
language performance test. Language Testing, 12 (1), 1-15.
Deming, W. E. (1975). On statistical aids toward economic production. Interfaces, 5, 1-15.
Engelhard, G., Jr. (1994). Examining rater errors in the assessment of written composition with
a many-faceted Rasch model. Journal of Educational Measurement, 31 (2), 93-112.
Fisher, W. P., Jr. (1992). Reliability statistics. Rasch Measurement: Transactions of the Rasch
Measurement SIG, 6 (3), 238.
Heller, J. I., Sheingold, K., & Myford, C. M. (1998). Reasoning about evidence in portfolios:
Cognitive foundations for valid and reliable assessment. Educational Assessment, 5 (1),
5-40.
Henning, G. (1992). Dimensionality and construct validity of language tests. Language
Testing, 9 (1), 1-11.
Linacre, J. M. (1989). Many-faceted Rasch measurement. Chicago: MESA Press.
Linacre, J. M. (1994). A user’s guide to Facets: Rasch measurement computer program
[Computer program manual]. Chicago: MESA Press.
Linacre, J. M. (1998). Thurstone thresholds and the Rasch model. Rasch Measurement:
Transactions of the Rasch Measurement SIG, 12 (2), 634-635.
Linacre, J. M. (1999a). Facets, Version 3.17 [Computer program]. Chicago: MESA Press.
Linacre, J. M. (1999b). Investigating rating scale category utility. Journal of Outcome
Measurement, 3 (2), 103-122.
Linacre, J. M., Engelhard, G., Tatum, D. S., & Myford, C. M. (1994). Measurement with judges:
Many-faceted conjoint measurement. International Journal of Educational Research, 21
(6), 569-577.
Linn, R. L., Baker, E., & Dunbar, S. B. (1991). Complex performance-based assessment:
Expectations and validation criteria. Educational Researcher, 20 (8), 15-21.
Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for
training. Language Testing, 12 (1), 54-71.
Lunz, M. E., & Stahl, J. A. (1990). Judge consistency and severity across grading periods.
Evaluation and the Health Professions, 13 (14), 425-444.
Lynch, B. K., & McNamara, T. F. (1994, March). Using G-theory and multi-faceted Rasch
measurement in the development of performance assessments of the ESL speaking skills
of immigrants. Paper presented at the 16th annual Language Testing Research
Colloquium, Washington, DC.
Marascuilo, L. A., & Serlin, R. C. (1988). Statistical methods for the social and behavioral
sciences. New York: W. H. Freeman.
Marr, D. B. (1994). A comparison of equating and calibration methods for the Test of Spoken
English. Unpublished report.
McNamara, T. F. (1991). Test dimensionality: IRT analysis of an ESP listening test. Language
Testing, 8 (2), 45-65.
McNamara, T. F. (1996). Measuring second language performance. Essex, England: Addison
Wesley Longman.
McNamara, T. F., & Adams, R. J. (1991, March). Exploring rater behaviour with Rasch
techniques. Paper presented at the 13th annual Language Testing Research Colloquium,
Princeton, NJ.
Myford, C. M., Marr, D. B., & Linacre, J. M. (1996). Reader calibration and its potential role
in equating for the TWE (TOEFL Research Report No. 95-40). Princeton, NJ:
Educational Testing Service.
Myford, C. M., & Mislevy, R. J. (1994). Monitoring and improving a portfolio assessment
system (MS #94-05). Princeton, NJ: Educational Testing Service, Center for
Performance Assessment.
Myford, C. M., & Wolfe, E. W. (in press). Strengthening the ties that bind: Improving the
linking network in sparsely connected rating designs (TOEFL Technical Report No. 15).
Princeton, NJ: Educational Testing Service.
Paulukonis, S. T., Myford, C. M., & Heller, J. I. (in press). Formative evaluation of a
performance assessment scoring system. In G. Engelhard, Jr. & M. Wilson (Eds.),
Objective measurement: Theory into practice: Vol. 5. Norwood, NJ: Ablex Publishing.
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago:
University of Chicago Press.
Sheridan, B. (1993, April). Threshold location and Likert-style questionnaires. Paper presented
at the Seventh International Objective Measurement Workshop, American Educational
Research Association annual meeting, Atlanta, GA.
Sheridan, B., & Puhl, L. (1996). Evaluating an indirect measure of student literacy
competencies in higher education using Rasch measurement. In G. Engelhard, Jr. & M.
Wilson (Eds.), Objective measurement: Theory into practice: Vol. 3 (pp. 19-44).
Norwood, NJ: Ablex Publishing.
Stone, M., & Wright, B. D. (1988). Separation statistics in Rasch measurement (Research
Memorandum No. 51). Chicago: MESA Press.
TSE Committee. (1996). The St. Petersburg protocol: An agenda for a TSE validity mosaic.
Unpublished manuscript.
TSE Program Office. (1995). TSE score user’s manual. Princeton, NJ: Educational Testing
Service.
Tyndall, B., & Kenyon, D. M. (1995). Validation of a new holistic rating scale using Rasch
multifaceted analysis. In A. Cumming & R. Berwick (Eds.), Validation in language
testing. Clevedon, England: Multilingual Matters.
Weigle, S. C. (1994, March). Using FACETS to model rater training effects. Paper presented at
the 16th annual Language Testing Research Colloquium, Washington, DC.
Wigglesworth, G. (1993). Exploring bias analysis as a tool for improving rater consistency in
assessing oral interaction. Language Testing, 10 (3), 305-335.
Wolfe, E. W., Chiu, C. W. T., & Myford, C. M. (1999). The manifestation of common rater
errors in multi-faceted Rasch analyses (MS #97-02). Princeton, NJ: Educational Testing
Service, Center for Performance Assessment.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis: Rasch measurement. Chicago:
MESA Press.
Appendix
TSE Band Descriptor Chart
Draft 4/5/96
Overall features to consider:
Functional competence is the
speaker’s ability to select functions
to reasonably address the task and to
select the language needed to carry
out the function.
Sociolinguistic competence is the
speaker’s ability to demonstrate an
awareness of audience and situation
by selecting language, register
(level of formality), and tone that
are appropriate.
Discourse competence is the
speaker’s ability to develop and
organize information in a coherent
manner and to make effective use of
cohesive devices to help the listener
follow the organization of the
response.
Linguistic competence is the
effective selection of vocabulary,
control of grammatical structures,
and accurate pronunciation along
with smooth delivery in order to
produce intelligible speech.
TSE BAND DESCRIPTOR CHART

Score band 60
Communication almost always effective: task performed very competently. Speaker volunteers information freely, with little or no effort, and may go beyond the task by using additional appropriate functions.
• Native-like repair strategies
• Sophisticated expressions
• Very strong content
• Almost no listener effort required
Functions performed clearly and effectively. Speaker is highly skillful in selecting language to carry out intended functions that reasonably address the task.
Appropriate response to audience/situation. Speaker almost always considers register and demonstrates audience awareness.
• Understanding of context, and strength in discourse and linguistic competence, demonstrate sophistication
Coherent, with effective use of cohesive devices. Response is coherent, with logical organization and clear development.
• Contains enough details to almost always be effective
• Sophisticated cohesive devices result in smooth connection of ideas
Use of linguistic features almost always effective; communication not affected by minor errors.
• Errors not noticeable
• Accent not distracting
• Range in grammatical structures and vocabulary
• Delivery often has native-like smoothness

Score band 50
Communication generally effective: task performed competently. Speaker volunteers information, sometimes with effort; usually does not run out of time.
• Linguistic weaknesses may necessitate some repair strategies that may be slightly distracting
• Expressions sometimes awkward
• Generally strong content
• Little listener effort required
Functions generally performed clearly and effectively. Speaker is able to select language to carry out functions that reasonably address the task.
Generally appropriate response to audience/situation. Speaker generally considers register and demonstrates sense of audience awareness.
• Occasionally lacks extensive range, variety, and sophistication; response may be slightly unpolished
Coherent, with some effective use of cohesive devices. Response is generally coherent, with generally clear, logical organization, and adequate development.
• Contains enough details to be generally effective
• Some lack of sophistication in use of cohesive devices may detract from smooth connection of ideas
Use of linguistic features generally effective; communication generally not affected by errors.
• Errors not unusual, but rarely major
• Accent may be slightly distracting
• Some range in vocabulary and grammatical structures, which may be slightly awkward or inaccurate
• Delivery generally smooth with some hesitancy and pauses

Score band 40
Communication somewhat effective: task performed somewhat competently. Speaker responds with effort; sometimes provides limited speech sample and sometimes runs out of time.
• Sometimes excessive, distracting, and ineffective repair strategies used to compensate for linguistic weaknesses (e.g., vocabulary and/or grammar)
• Adequate content
• Some listener effort required
Functions performed somewhat clearly and effectively. Speaker may lack skill in selecting language to carry out functions that reasonably address the task.
Somewhat appropriate task response to audience/situation. Speaker demonstrates some audience awareness, but register is not always considered.
• Lack of linguistic skills that would demonstrate sociolinguistic sophistication
Somewhat coherent, with some use of cohesive devices. Coherence of the response is sometimes affected by lack of development and/or somewhat illogical or unclear organization, sometimes leaving listener confused.
• May lack details
• Mostly simple cohesive devices are used
• Somewhat abrupt openings and closures
Use of linguistic features somewhat effective; communication sometimes affected by errors.
• Minor and major errors present
• Accent usually distracting
• Simple structures sometimes accurate, but errors in more complex structures common
• Limited range in vocabulary; some inaccurate word choices
• Delivery often slow or choppy; hesitancy and pauses common

Score band 30
Communication generally not effective: task generally performed poorly. Speaker responds with much effort; provides limited speech sample and often runs out of time.
• Repair strategies excessive, very distracting, and ineffective
• Much listener effort required
• Difficult to tell if task is fully performed because of linguistic weaknesses, but function can be identified
Functions generally performed unclearly and ineffectively. Speaker often lacks skill in selecting language to carry out functions that reasonably address the task.
Generally inappropriate response to audience/situation. Speaker usually does not demonstrate audience awareness since register is often not considered.
• Lack of linguistic skills generally masks sociolinguistic skills
Generally incoherent, with little use of cohesive devices. Response is often incoherent; loosely organized, inadequately developed, or disjointed discourse often leaves the listener confused.
• Often lacks details
• Simple conjunctions used as cohesive devices, if at all
• Abrupt openings and closures
Use of linguistic features generally poor; communication often impeded by major errors.
• Limited linguistic control; major errors present
• Accent very distracting
• Speech contains numerous sentence fragments and errors in simple structures
• Frequent inaccurate word choices; general lack of vocabulary for task completion
• Delivery almost always plodding, choppy, and repetitive; hesitancy and pauses very common

Score band 20
No effective communication; no evidence of ability to perform task. Extreme speaker effort is evident; speaker may repeat prompt, give up on task, or be silent.
• Attempts to perform task end in failure
• Only isolated words or phrases intelligible, even with much listener effort
• Function cannot be identified
No evidence that functions were performed. Speaker is unable to select language to carry out the functions.
No evidence of ability to respond appropriately to audience/situation. Speaker is unable to demonstrate sociolinguistic skills and fails to acknowledge audience or consider register.
Incoherent, with no use of cohesive devices. Response is incoherent.
• Lack of linguistic competence interferes with listener's ability to assess discourse competence
Use of linguistic features poor; communication ineffective due to major errors.
• Lack of linguistic control
• Accent so distracting that few words are intelligible
• Speech contains mostly sentence fragments, repetition of vocabulary, and simple phrases
• Delivery so plodding that only few words are produced
Test of English as a Foreign Language
P.O. Box 6155
Princeton, NJ 08541-6155
USA
To obtain more information about TOEFL programs
and services, use one of the following:
Phone: 609-771-7100
E-mail: toefl@ets.org
Web site: http://www.toefl.org