A Hierarchical Rater Model for Constructed Responses, with a Signal Detection Rater Model
DeCarlo, L. T., Kim, Y., & Johnson, M. S.

In the hierarchical rater model (HRM), the rater scores are treated within a hierarchical structure: the rater scores (Y) are indicators of the latent categories (ξ) of the item, and the latent categories are in turn indicators of an examinee's proficiency (θ). The HRM overcomes a basic problem in fitting a multifacet IRT model to rater scores, namely the implication that arbitrarily precise measurement of person proficiency can be obtained simply by using more raters. However, the HRM can only detect a rater's overall severity or leniency, through the severity parameter (φ). In real-world data, other rater effects may also be of interest, such as central tendency and restriction of range. By combining a signal detection theory (SDT) model with the HRM, each rater's detection precision (d_jl) and response criteria (c_jkl) can be estimated, so that more information about rater effects is obtained. (A numerical sketch of the SDT rating step is given after the comments below.)

Comments & Questions:
1. It was argued that the HRM differs from the usual IRT approach to rater scaling in that it separates rater and item parameters, so that rater effects and item effects are no longer confounded. However, within the framework of multifacet models, rater effects can also be detected successfully, for example by modeling rater + item + rater*item + rater*step. One can even consider the nested structure in which each rater rates only some of the items (referred to as a multilevel multifacet IRT model). It would be interesting to compare the performance of the multilevel multifacet IRT model and the HRM-SDT model.
2. In Figure 2, the latent categories are ordered. Is it possible that they are disordered?
3. In the HRM-SDT model, two important assumptions are made (p. 342). These assumptions may be stringent, since in real data another type of rater effect, the halo effect, often occurs. A halo effect may arise when (1) the observed ratings depend on ability θ, or (2) the ratings are dependent conditional on the latent variables ξ.
4. When deriving the log-likelihood function for the HRM-SDT, a third assumption is actually made: the latent variables are independent across items. This implies that a rater uses a different set of criteria c_jkl for each item and that the criteria do not affect one another. In reality, however, if a rater rates many items, the criteria are very likely to influence one another. Moreover, allowing an independent set of criteria for every item means a large number of parameters must be estimated.
5. I am confused about the connectivity problem in the results when the items were rated by separate raters (Figure 6). If there is no common rater between the two items, can the parameters for the two items be compared?
6. I am also confused about what the 'true' criteria are for the two items. Since the data are real, there seems to be no way to know the true criteria. Also, why are the criteria for the two items assumed to be the same in the study? If the 'true' criteria for item 2 were in fact different, the conclusion that 'rater detection was better for the first item' would change.
7. It is hard for me to follow the inference that, for rater 11, the first criterion being below the first horizontal line and the fourth criterion being above the last line suggests central tendency. Similar inferences are offered for some other raters. I think that showing the percentage of use of each category by each rater would be a useful and effective way to support these arguments (a sketch of this check is given below).
8. It is of interest to investigate rater drift, where rater performance changes over time. In a future study, a time facet could be added to the raters' detection precision (d_jl) and response criteria (c_jkl); a possible indexing is sketched below.
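To make the roles of d_jl and c_jkl concrete, here is a minimal numerical sketch of the SDT rating step, assuming the logistic form P(Y ≤ k | ξ) = F(c_k − d·ξ) that DeCarlo uses in his latent class SDT work; the parameter values below are hypothetical, chosen only to contrast an accurate rater with a central-tendency rater.

```python
import numpy as np

def rating_probs(d, c, xi):
    """Category probabilities P(Y = k | xi) for one rater on one item,
    assuming a logistic SDT rating model: P(Y <= k | xi) = F(c_k - d * xi).
    d  : the rater's detection (precision) parameter
    c  : increasing criteria c_1 < ... < c_{K-1} for K score categories
    xi : the item's latent category, coded 1, 2, ..., K
    """
    cdf = 1.0 / (1.0 + np.exp(-(np.asarray(c, float) - d * xi)))
    cdf = np.concatenate(([0.0], cdf, [1.0]))  # pad with c_0 = -inf, c_K = +inf
    return np.diff(cdf)                        # P(Y = k) = F(c_k) - F(c_{k-1})

# Hypothetical 4-category example with d = 2, so the signals d*xi sit at 2, 4, 6, 8.
even    = [3.0, 5.0, 7.0]  # criteria at the midpoints: categories used accurately
central = [1.0, 5.0, 9.0]  # outer criteria pushed outward: middle categories overused
for xi in (1, 4):
    print(f"xi={xi}  even: {rating_probs(2.0, even, xi).round(2)}"
          f"  central: {rating_probs(2.0, central, xi).round(2)}")
```

With the 'central' criteria, even an extreme latent category (ξ = 1 or ξ = 4) puts most of its probability on the middle scores, which is exactly the pattern discussed in comment 7: extreme criteria lying outside the boundaries implied by d·ξ signal central tendency.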
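As a concrete version of the check suggested in comment 7, the following sketch computes the percentage with which each rater uses each score category; the ratings here are simulated placeholders, not data from the paper.

```python
import numpy as np

def category_usage(scores, K):
    """Percentage of responses assigned to each score category 1..K."""
    scores = np.asarray(scores)
    return np.array([(scores == k).mean() * 100 for k in range(1, K + 1)])

# Simulated placeholder ratings on a 1-4 scale for two hypothetical raters.
rng = np.random.default_rng(0)
balanced = rng.integers(1, 5, size=200)                           # all categories used
central  = rng.choice([1, 2, 3, 4], size=200, p=[.05, .45, .45, .05])
print("balanced rater (%):", category_usage(balanced, 4))         # roughly 25 each
print("central-tendency rater (%):", category_usage(central, 4))  # mass on 2 and 3
```

A rater whose usage piles up on the middle categories, relative to the distribution of the latent categories, would corroborate the central-tendency reading of the criterion plots.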
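Finally, one way to formalize the time facet proposed in comment 8 is to index the SDT parameters by occasion; the notation below is my own hedged sketch, not taken from the paper.

```latex
% Hypothetical time-indexed extension: rater j, category k, item l, occasion t.
P\left(Y_{ijlt} \le k \mid \xi_{il}\right)
  = F\left(c_{jklt} - d_{jlt}\,\xi_{il}\right),
\qquad
\text{no drift} \iff c_{jklt} = c_{jkl},\ d_{jlt} = d_{jl}\ \text{for all } t.
```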