
A Hierarchical Rater Model for Constructed Responses, with a Signal Detection Rater Model
DeCarlo, L. T.
Kim, Y.
Johnson, M. S.
In the hierarchical rater model (HRM), the rater scores are given a hierarchical structure in which the observed rater scores (Y) are indicators of the latent categories (η) of an item, and the latent categories are in turn indicators of an examinee's proficiency (θ). The HRM overcomes a basic problem with fitting a multifacet IRT model to rater scores, namely that the facets model implies that arbitrarily precise measurement of a person's proficiency can be obtained simply by using more raters. However, the HRM can only detect a rater's overall severity or leniency, via the parameter φ.
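As a sketch of the two levels (assuming the normal-kernel rater model and IRT item model of the original HRM of Patz et al.; the exact forms here are illustrative):

    P(Y_jl = k | η_l = m) ∝ exp{ -[k - (m + φ_j)]^2 / (2ψ_j^2) }    (rater level)
    P(η_l = m | θ) given by an IRT model, e.g., a partial credit model    (item level)

where φ_j captures rater j's overall severity or leniency and ψ_j the rater's variability.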
In real-world data, other rater effects may be of interest, such as central tendency and restriction of range. By combining a signal detection model with the HRM, each rater's detection precision (d_jl) and response criteria (c_jkl) can be estimated, so that more information about rater effects is obtained.
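A sketch of the signal detection rater model for the ordered rating categories (the cumulative-link form, with a logistic or probit F, is assumed here):

    P(Y_jl ≤ k | η_l = m) = F(c_jkl - d_jl m),    k = 1, ..., K - 1,

so a larger d_jl means rater j discriminates the latent categories of item l more sharply, while the spacing of the criteria c_jkl can reveal effects such as central tendency or restriction of range.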
Comments & Questions:
1. It was argued that the HRM differs from the usual IRT approach to rater scaling in that it separates rater and item parameters, so that rater effects and item effects are no longer confounded. However, within the framework of multifacet models, rater effects can also be detected successfully, for example by investigating rater + item + rater*item + rater*step effects. One can even consider the nested structure in which raters rate only part of the items (this has been referred to as a 'multilevel multifacet IRT model'). It would be interesting to compare the performance of the multilevel multifacet IRT model and the HRM-SDT model; a sketch of the multifacet formulation is given below.
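As a sketch of the multifacet model in question (a many-facet Rasch formulation is assumed; the notation is illustrative):

    log[ P(X_nij = k) / P(X_nij = k-1) ] = θ_n - β_i - λ_j - τ_k,

where θ_n is examinee proficiency, β_i item difficulty, λ_j the severity of rater j, and τ_k the step parameter; interaction terms such as rater*item or rater*step can be added to capture the more specific rater effects mentioned above.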
2. In Figure 2, the latent categories are ordered. Is it possible for them to be disordered?
3. In the HRM-SDT model, two important assumptions are made (p. 342); they are sketched below. These assumptions may be stringent, since in real data another type of rater effect, the halo effect, often occurs. A halo effect may arise when (1) the observed ratings depend on ability θ even given the latent category, or (2) the ratings are dependent conditional on the latent variables η.
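As a sketch of the two assumptions at issue (the exact statement on p. 342 may differ):

    (1) P(Y_jl | η_l, θ) = P(Y_jl | η_l), i.e., the ratings are independent of ability θ given the latent category;
    (2) P(Y_1l, ..., Y_Jl | η_l) = Π_j P(Y_jl | η_l), i.e., the ratings are conditionally independent across raters given the latent category.

A halo effect would violate these by inducing dependence among a rater's scores beyond what η_l and θ explain.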
4. When deriving the log-likelihood function for the HRM-SDT, a third assumption is actually made, namely that the latent variables η are independent. This means that the raters use a different η for each item and that the ηs do not affect one another. However, in reality, if a rater rates many items, it is very likely that the ηs influence one another. Moreover, assuming the ηs are independent implies a large number of parameters to be estimated; a sketch of the resulting likelihood follows.
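As a sketch of the marginal likelihood implied by these independence assumptions (the integration over θ and the summation range are illustrative):

    P(Y) = ∫ Π_l [ Σ_m P(η_l = m | θ) Π_j P(Y_jl = y_jl | η_l = m) ] g(θ) dθ,

where g(θ) is the proficiency distribution; the product over items l embodies the assumption that the η_l are independent given θ.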
5. I am confused about the connectivity problem in the results when the items were rated by separate raters (Figure 6). If there is no common rater between the two items, can the parameters for the two items be compared?
6. I am also confused about what the 'true' criteria for the two items are. Since the data are real, there seems to be no way to know the true criteria. Also, why are the criteria of the two items the same in the study? If the 'true' criteria for item 2 were different, the result that 'rater detection was better for the first item' would change.
7. It is hard for me to understand why, for rater 11, the fact that 'the rater's first criterion is below the first horizontal line and the fourth criterion is above the last line' suggests a central tendency. Similar inferences are made for some other raters. In fact, I think showing the percentage with which each rater uses each category is a useful and effective way to support these arguments, as in the sketch below.
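As a minimal sketch of that check (the ratings array and the 1-5 category labels are hypothetical), in Python:

    import numpy as np

    # Hypothetical data: rows are responses, columns are raters;
    # each entry is the category (1-5) that the rater assigned.
    ratings = np.array([
        [3, 3, 2],
        [4, 3, 3],
        [2, 3, 2],
        [5, 4, 3],
        [1, 2, 3],
    ])

    categories = [1, 2, 3, 4, 5]
    for j in range(ratings.shape[1]):
        scores = ratings[:, j]
        # Percentage of responses that rater j placed in each category.
        pct = [100.0 * np.mean(scores == k) for k in categories]
        print(f"rater {j + 1}: " +
              ", ".join(f"cat {k}: {p:.0f}%" for k, p in zip(categories, pct)))

A rater who concentrates almost all ratings in the middle categories would show the central tendency discussed above.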
8. It would be of interest to investigate rater drift, where rater performance changes over time. For a future study, a time facet might be added to the raters' detection precision (d_jl) and response criteria (c_jkl), for example as sketched below.
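As a sketch of one way to add such a time facet (the linear drift form is an assumption here, not taken from the paper):

    d_jlt = d_jl + δ_j t,    c_jklt = c_jkl + γ_jk t,

where t indexes the rating occasion and δ_j and γ_jk capture drift in rater j's detection precision and response criteria over time.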