The Institutional Dimension of Document Quality Judgments

Bing Bai, Rutgers University, New Brunswick, NJ 08903. Email: bbai@paul.rutgers.edu
Kwong Bor Ng, Queens College, CUNY, Kissena Boulevard, Flushing, NY 11367. Email: kbng@qc.edu
Ying Sun, Rutgers University, New Brunswick, NJ 08903. Email: ysun@scils.rutgers.edu
Paul Kantor, Rutgers University, New Brunswick, NJ 08903. Email: kantor@scils.rutgers.edu
Tomek Strzalkowski, SUNY Albany, Western Avenue, Albany, NY 12222. Email: tomek@csc.albany.edu

Abstract

In addition to relevance, there are other factors that contribute to the utility of a document. For example, content properties such as depth of analysis and multiplicity of viewpoints, and presentational properties such as readability and verbosity, all affect the usefulness of a document. These relevance-independent properties are difficult to determine, as their estimation is more likely to be affected by personal aspects of one's knowledge structure. The reliability of judgments of these properties may decrease as one moves from the personal level to the global level. In this paper, we report several experiments on document qualities. We explore the correlations among judgments of nine different document qualities, to see whether a smaller number of dimensions underlies the judgments. We also investigate the reliability of the judgments and discuss its theoretical implications. We find that, between the global level of agreement and the inter-personal level of agreement, there is an intermediate level of agreement, which can be characterized as an inter-institutional level.

Introduction

The major concern in information retrieval is topical relevance. With the increase in the size and diversity of corpora, it is now much easier than before to retrieve relevant documents for a topic or query. As one commonly experiences in Internet searching, sometimes there are simply too many relevant documents retrieved. Therefore, criteria other than topical relevance, such as document qualities, become more and more important in providing additional constraints to filter the retrieved documents.

In addition to the topical relevance of a document, there are other factors that contribute to its utility. In studies that elicit users' criteria in real-life information seeking, many non-topical document properties have been found crucial to users' selection of documents. Schamber (1991), after interviewing 30 users, identified ten groups of criteria: accuracy, currency, specificity, geographic proximity, reliability, accessibility, verifiability, clarity, dynamism, and presentation qualities. Barry (1994) described experiments showing that users consider many factors, including some that are non-topical, such as depth, clarity, accuracy, and recency. Bruce (1994) identified a number of "information attributes" (accuracy, completeness, timeliness, etc.) that information users might apply in their relevance judgments. Other examples include novelty of ideas (Harman, 2002; Cronin & Hert, 1995), ease of use and currency of information (Malone & Videon, 1997), and timeliness and completeness of data (Strong, Lee, & Wang, 1997). All of these properties can have an impact on the overall usefulness of a document.

Among the various document properties, we are particularly interested in two types: properties related to the content quality of a document (e.g., depth of analysis, multiplicity of viewpoints) and properties related to the style or presentational quality of a document (e.g., readability, verbosity).

Overview

In our previous studies (Ng et al., 2003; Tang et al., 2003), we identified nine document qualities deemed important by information professionals. They are shown in Table 1 (see Appendix A for the definitions).

Table 1: Nine Document Qualities
  Variable  Description
  ACC       Accuracy
  SOU       Source reliability
  OBJ       Objectivity
  DEP       Depth
  AUT       Author/Producer credibility
  REA       Readability
  VER       Verbose conciseness
  GRA       Grammatical correctness
  ONE       One-sided multi-views

Compared to determining the topical relevance of a document, it can be argued that these nine document qualities are more difficult to determine consistently, as their estimation is more likely to be affected by one's personal cognitive style and the personal aspects of one's knowledge structure. We have the following speculation about the consistency of judgments of these qualities: two factors affect document quality judgments, (1) commonly-agreed-upon knowledge, which is relatively persistent and stable across different people, and (2) idiosyncratic, personal knowledge, which has a relatively higher variance across different people. Consistency of judgment is achieved when the contribution of commonly-agreed-upon knowledge is the dominant factor in the decision-making process; inconsistency is due to a dominant contribution of idiosyncratic, personal knowledge.

Based on this speculation, we expected the degree of inter-judge agreement on document qualities to decrease as one moves from the personal level to the global level. However, in our experiments we find an intermediate level of agreement between the global level and the inter-personal level. This intermediate level can be characterized as an inter-institutional level. In the following, we report several analyses of the reliability of document quality judgments. We explore the correlations among judgments of the nine document properties, to see whether a smaller number of dimensions underlies the judgments, and we investigate the reliability of the judgments and discuss its theoretical implications.

Unlike relevance judgment, for which several document/query collections are available for researchers to explore, each with hundreds of topics and tens of thousands of corresponding relevance judgments (e.g., TREC; see Voorhees, 2001), there are no comparable collections with document quality judgments. To investigate the nature and characteristics of the various document qualities, we built several small document collections in which each document was judged according to the nine document qualities. The collection sizes are not comparable to the TREC collections, but they are adequate for preliminary experiments. The first collection (denoted C1 from now on) was built in the summer of 2002 (Tang et al., 2002). It contains 1000 medium-sized (100 to 2500 words) news articles extracted from the TREC collection, originally from the LA Times, the Wall Street Journal, the Financial Times, and the Associated Press.
The second collection (denoted C2 from now on) was built in the summer of 2003. It contains three types of documents: (1) 1100 documents from the Center for Nonproliferation Studies; (2) 600 articles from the Associated Press Worldstream, the New York Times, and Xinhua News (English); and (3) 500 documents from the Web. All documents in the collections were judged with respect to the nine document qualities on a Likert scale of 1 to 10. Each document was judged once at each of two sites (SUNY Albany and Rutgers). We recruited two kinds of participants to perform the judgments: experts and students. Expert sessions were completed first, and documents judged by experts were used to train student judges to perform quality assessment at the same level as the expert judges. The expert participants were experienced news professionals and researchers in journalism. This work produced 1000 pairs of quality vectors for C1 and 2200 pairs of quality vectors for C2 (a quality vector consists of the nine quality variables, each with a value equal to the quality score assigned by the judge from one institution, SUNY Albany or Rutgers).

In work described elsewhere (Ng et al., 2003), we used the fact that two judges had assessed every document to produce a combined quality score for each document, together with a measure of confidence in that score based on the agreement between the judges. This formed the basis for a detailed analysis of the possibility of computational estimation of the quality scores from linguistic features of the texts. That work led to the general conclusion that it is feasible to model, with algorithms, the scores assigned by any one judge; we were more pessimistic about the possibility of modeling the average score. In the work reported here, we look further into possible explanations of that difficulty. In particular, we move from a set of nine variables, representing the average scores, to a full set of eighteen variables, representing the nine scores assigned to each document by each of the two judges.

Correlations Among Document Qualities

We noticed that the results from Rutgers and SUNY Albany are quite different: the quality scores assigned by the two judges from the two institutions tend to diverge. In particular, the correlations between judgments assigned at the two schools are very low, while the correlations between the scores for different qualities at the same school are relatively high, as shown in Figure 1. Figure 1 is based on collection C2; collection C1 generates graphs with very similar curves, demonstrating the same kind of relationship. Another unexpected fact is that the correlations between the different quality judgments within each school are very similar. Figure 2 shows the comparison between the qualities at the two schools.

Figure 1. Correlations between Rutgers judgments and SUNY judgments (collection C2). Quality names with the suffix "1" are Rutgers scores; the others are SUNY scores. The x axes list all the qualities from both schools. The lines in the upper graph show the correlations of the Rutgers quality scores with all document qualities; the lines in the lower graph show the correlations of the SUNY scores with all document qualities.
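As an illustration, the following sketch shows one way the correlations behind Figure 1 could be computed. It assumes the paired judgments are available as two arrays, one per institution, with one row per document and one column per quality; the variable names and the random placeholder data are hypothetical and do not represent the project's actual data pipeline.

import numpy as np

QUALITIES = ["ACC", "SOU", "OBJ", "DEP", "AUT", "REA", "VER", "GRA", "ONE"]

def correlation_blocks(rutgers, suny):
    """rutgers, suny: arrays of shape (n_documents, 9) holding the 1-10 scores.
    Returns the within-Rutgers, within-SUNY, and cross-institution blocks of the
    18-by-18 correlation matrix."""
    scores = np.hstack([rutgers, suny])           # one (n_documents, 18) matrix
    corr = np.corrcoef(scores, rowvar=False)      # 18 x 18 correlation matrix
    within_ru = corr[:9, :9]                      # Rutgers qualities vs. Rutgers qualities
    within_suny = corr[9:, 9:]                    # SUNY qualities vs. SUNY qualities
    cross = corr[:9, 9:]                          # Rutgers qualities vs. SUNY qualities
    return within_ru, within_suny, cross

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ru = rng.integers(1, 11, size=(2200, 9)).astype(float)   # placeholder judgments
    al = rng.integers(1, 11, size=(2200, 9)).astype(float)   # placeholder judgments
    w_ru, w_al, x = correlation_blocks(ru, al)
    print("mean within-Rutgers correlation:", w_ru[np.triu_indices(9, 1)].mean())
    print("mean cross-institution correlation:", x.mean())

The pattern described above corresponds to large values in the two within-school blocks and small values in the cross-institution block.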
Figure 2. Patterns of the quality correlations. Nine panels, one per quality (ACC, SOU, OBJ, DEP, AUT, REA, VER, GRA, ONE); each panel shows the correlations between that quality and all nine qualities, plotted separately for Rutgers and SUNY Albany.

As we can see from the nine graphs in Figure 2, each document quality has a very specific pattern of relationships with the other eight qualities. These patterns are independent of the school: the two schools show the same pattern in all nine graphs, even though the correlations of the quality scores between the two schools are very low, as shown in Figure 1. To understand this phenomenon, we use the Kappa statistic (Cohen, 1968) to analyze the level of agreement between the two institutions, and we then use factor analysis to explore the hidden, underlying dimensions. Since our design provided for one judge at each institution, this separation provides an opportunity to look for an unexpected effect of the institution on the results of the judging process.

It is not uncommon, in research on information retrieval, to treat each of several aspects as a separate variable and to model it on its own. However, in a great deal of other work in communication, and in the social sciences generally, it is more usual to search for a small set of underlying factors that might explain all of the measured variables together. This approach was used in our discussion of the nine averaged variables in our previous papers, and it is applied here to the larger data set.

Kappa Test

The Kappa statistic is a way to evaluate the extent of agreement between raters. In a typical two-rater, n-response-category experiment, the data can be represented in an n-by-n contingency table. The Kappa statistic can then be estimated as the ratio of the observed agreement minus chance agreement to the maximum possible agreement minus chance agreement.
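In symbols (using the notation of Appendix C, where x_{ij} is the count in cell (i,j) of the contingency table, m_i and n_j are the row and column totals, r is the number of categories, and n is the total number of pairs), this verbal definition corresponds to the standard unweighted Kappa:

\kappa = \frac{P_o - P_e}{1 - P_e}, \qquad
P_o = \frac{1}{n}\sum_{i=1}^{r} x_{ii}, \qquad
P_e = \frac{1}{n^2}\sum_{i=1}^{r} m_i\, n_i .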
In our case, the two raters are the judges from SUNY Albany and Rutgers, and n = 10 (the ten possible scores a judge can assign to a document). However, if we use the above ratio to estimate Kappa, the statistic depends entirely on the diagonal elements of the observed contingency table and not at all on the off-diagonal elements. In other words, a score difference as large as 9 between the two judges (one judge assigned 10 and the other assigned 1) is treated exactly the same as a score difference as small as 1 (e.g., one judge assigned 5 and the other assigned 6). To compensate for this effect, we use the weighted Kappa statistic as implemented in SAS. The results are summarized in Table 2.

Table 2. Weighted Kappa between SUNY Albany and Rutgers for all comparable quality pairs, as reported by SAS 8.0 with the Cicchetti-Allison weight type (Cicchetti, 1972); see Appendix C for details.
  Collection                 C1               C2
  Sample size                8527             18669
  Weighted Kappa statistic   0.1480           0.1592
  ASE                        0.0072           0.0049
  95% confidence limits      0.1339 / 0.1620  0.1497 / 0.1688

In Table 2, "Sample size" is the number of all quality pairs, excluding those with a score of 0. "ASE" is the asymptotic standard error, a large-sample approximation of the standard error of the Kappa statistic. The 95% confidence interval provides lower and upper bounds that contain the true value of the Kappa statistic with 95% probability. With the weighted Kappa statistic below 0.2 for both collections, the inter-institutional agreement is very low, even though we can reject, at the 95% level, the null hypothesis that there is no agreement between the two raters. This is consistent with the pattern presented earlier.

Factor Analysis

Factor analysis is the combination of a rigorous mathematical concept with a set of rules and procedures that have grown up over years of practice. The rigorous mathematical concept is that the full set of variables can be manipulated to produce an eighteen-by-eighteen matrix, which is regarded as a linear operator in an abstract vector space. This matrix can then be reduced to a canonical form in which it is diagonal and its eigenvalues are in decreasing order. Selecting a subset of these eigenvalues amounts to selecting a particular subspace of the space defined by the eighteen variables. The selection is made in descending order of the size of the eigenvalues because the eigenvalues also measure the amount of variance in the observed variables that is explained by the factors. Once a subspace has been selected, the investigator is free to choose a basis set in that subspace; for purposes of our discussion, it is easiest to use the orthogonal basis defined by the eigenvalue decomposition itself.
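As a concrete sketch of this procedure (not the exact SPSS run used for the results below; the function and variable names are illustrative), the eigendecomposition of the correlation matrix and the associated variance and loading quantities can be computed as follows.

import numpy as np

def principal_components(scores):
    """scores: (n_documents, 18) array of quality judgments (9 Rutgers + 9 SUNY columns).
    Returns the eigenvalues in decreasing order, the proportion of variance each explains,
    and the component loadings."""
    corr = np.corrcoef(scores, rowvar=False)      # 18 x 18 correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)       # eigendecomposition of a symmetric matrix
    order = np.argsort(eigvals)[::-1]             # sort eigenvalues in decreasing order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()           # eigenvalues measure explained variance
    loadings = eigvecs * np.sqrt(eigvals)         # loadings of each variable on each factor
    return eigvals, explained, loadings

Keeping only the first few columns of the loadings corresponds to selecting the subspace spanned by the largest eigenvalues; the eigenvalues, variance percentages, and loadings computed this way are the analogues of the quantities reported in Tables 3 and 4.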
The first analysis was an 18-variable factor analysis; the 18 variables are the nine quality scores from Rutgers and the nine from SUNY Albany. The method was principal component analysis (PCA) on the correlation matrix. Tables 3 and 4 show the first six factors and their contributions to the 18 variables.

Table 3. First six factors extracted by factor analysis of the 18 variables (PCA on the correlation matrix, SPSS 11.5).
  Factor  Eigenvalue  % of variance  Cumulative %
  1       4.283       23.794         23.794
  2       2.890       16.083         39.877
  3       1.476        8.203         48.080
  4       1.080        6.000         54.079
  5       0.980        5.442         59.521
  6       0.925        5.139         64.661

From Table 3, we see that the first three factors account for about 48% of the total variance, and the first six for about 65%. Factor 1 can be considered the "general goodness" of the documents: it is positive for all the qualities, whether judged at SUNY Albany or at Rutgers. Factor 2 can be considered the "institution separator": all the Rutgers scores have positive weight on it, while all the SUNY scores have negative weight. Factor 3 can be considered the "quality separator": Rutgers and SUNY have the same sign for the same quality, and some qualities, such as Depth, Objectivity, and Author Credibility, have not only the same sign but also similar amplitude. Figure 3 shows the distribution of the qualities on the graphs of these three components.

Table 4. The component matrix of the first three factors. Quality variables with the suffix "_R" are the Rutgers scores; those with the suffix "_A" are the SUNY Albany scores.
  Quality  Component 1  Component 2  Component 3
  ACC_R    0.536         0.524       -0.0133
  SOU_R    0.539         0.481       -0.0464
  OBJ_R    0.543         0.401       -0.0942
  DEP_R    0.410         0.266       -0.252
  AUT_R    0.507         0.435       -0.0744
  REA_R    0.478         0.406        0.308
  VER_R    0.368         0.379        0.421
  GRA_R    0.322         0.415        0.212
  ONE_R    0.524         0.304       -0.275
  ACC_A    0.572        -0.504       -0.0973
  SOU_A    0.617        -0.375       -0.213
  OBJ_A    0.549        -0.510       -0.0914
  DEP_A    0.427        -0.433       -0.285
  AUT_A    0.554        -0.309       -0.0988
  REA_A    0.512        -0.371        0.490
  VER_A    0.337        -0.373        0.577
  GRA_A    0.362        -0.356        0.387
  ONE_A    0.488        -0.234       -0.355

Figure 3. Regression scores of the 18 variables on the first three factors. The method is PCA on the correlation matrix. C1, C2, and C3 here denote the first three components (not the document collections). Panels (a), (b), and (c) show the same configuration from different perspectives. The graphs were produced with SPSS 11.5, with a format change to make them easier to view.

In Figures 3(a) and 3(b), we see that the Rutgers scores lie above the plane C2 = 0, while the SUNY scores lie below that plane. In Figure 3(c), looking along the direction of negative C2, we can see that the same qualities from the two schools are quite close: the Rutgers scores are approximately obtained by moving the SUNY scores in the C2 direction, with a small amount of noise added. In other words, the C1 and C3 components match very well for the same quality between the two schools, while the C2 components are quite different. This confirms the pattern shown in Figure 1: the more independent the scores from the two schools, the larger the distance between them in the direction of C2; the more similar the correlation patterns within each school, the closer their positions in the directions other than C2. See Appendix B for a closer look at this effect.

The reason for such a large gap between the two schools is not yet clear. One possible reason is that the instructions given to the human judges were interpreted differently at the two schools. Another question arises: since the data from Rutgers and SUNY are quite different, if we are going to use the data for machine learning, to construct automatic classifiers for document qualities, what should we use as the training data? A straightforward idea is to use the average of the two. By averaging the scores from Rutgers and SUNY, we obtain nine averaged quality scores, and we also performed a factor analysis on these nine variables. Figure 4 shows a comparison between the 18-variable factors and the 9-average-variable factors. We took the first six factors from each analysis; for the 18-variable analysis they account for a cumulative variance of 64.7%, while the six factors of the 9-average-variable analysis account for 85.9%.

Figure 4. Comparison between the 18-variable factors and the 9-average-variable factors. The x and y axes of each cell are the factor scores from the two factor analyses, respectively.

From Figure 4, we see that the following pairs of factors have very strong correlations: 18-F1/9-F1, 18-F3/9-F2, 18-F5/9-F3, and 18-F6/9-F4. Factors 2 and 4 of the 18-variable analysis do not have strong correlations with any of the 9-average-variable factors. We speculate that these two factors represent the components on which the Rutgers and SUNY scores are independent of each other, and that they are cancelled by the averaging operation in the 9-average-variable analysis.
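A minimal sketch of one way to make the Figure 4 comparison, assuming the paired score arrays from the earlier sketches and using simple component projections in place of SPSS's regression-method factor scores (a simplification), is the following.

import numpy as np

def factor_scores(scores, n_factors):
    """Project the standardized variables onto the leading principal components."""
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    corr = np.corrcoef(scores, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1][:n_factors]     # leading components
    return z @ eigvecs[:, order]                      # (n_documents, n_factors)

def compare_factor_sets(rutgers, suny, n_factors=6):
    """Correlate the factor scores of the 18-variable analysis with those of the
    9-average-variable analysis; each cell corresponds to one panel of Figure 4."""
    f18 = factor_scores(np.hstack([rutgers, suny]), n_factors)
    f9 = factor_scores((rutgers + suny) / 2.0, n_factors)   # average the two institutions
    return np.array([[np.corrcoef(f18[:, i], f9[:, j])[0, 1]
                      for j in range(n_factors)]
                     for i in range(n_factors)])

Strongly related pairs such as 18-F1/9-F1 would show up as cells with values near 1 (up to sign), while institution-specific factors would correlate weakly with every averaged factor.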
Conclusion

As demonstrated in the above analyses, there is a dimension in the data that separates the document quality judgments made at SUNY Albany from those made at Rutgers. This dimension is perpendicular to the other dimensions, so its existence does not affect the configuration of the data points in those other dimensions. That explains why the two institutions show the same patterns of internal relationships among the nine document qualities (as shown in Figure 2) even though their judgments do not correlate highly with each other. The dimension of institutional difference still puzzles us. If there is no difference in the instrument or in the data collection procedure, it may indicate some kind of structural influence of cultural background on document quality judgments. We expect to do more research in this direction to understand this phenomenon.

ACKNOWLEDGMENTS

This research was supported in part by the Advanced Research and Development Activity of the Intelligence Community, under Contract #2002-H790400-000 to SUNY Albany, with Rutgers as a subcontractor.

Appendix A: Definitions of the Nine Document Qualities

Accuracy: The extent to which information is precise and free from known errors.
Source Reliability: The extent to which you believe that the sources indicated in the text (e.g., interviewees, eyewitnesses) provide a truthful account of the story.
Objectivity: The extent to which the document includes facts without distortion by personal or organizational biases.
Depth: The extent to which the coverage and analysis of information is detailed.
Author/Producer Credibility: The extent to which you believe that the author of the writing is trustworthy.
Readability: The extent to which information is presented with clarity and is easily understood.
Verbose Conciseness: The extent to which information is well structured and compactly presented.
Grammatical Correctness: The extent to which the text is free from syntactic problems.
One-sided Multi-views: The extent to which the information reported contains a variety of data sources and viewpoints.

Appendix B: The Component Plot of the Factor Analysis of Two Random Gaussians

The more independent the scores from the two schools, the larger the distance between them in the direction of C2; the more similar the correlation patterns within each school, the closer their positions in the directions other than C2. To see this, we made Figure 5, which shows the component scatter plots of two variables with different degrees of correlation between them. As illustrated, the more independent the variables, the closer their positions to the diagonal; the more dependent, the closer they are to the horizontal axis.

Figure 5. Scatter plots of the factor analysis of two random variables (sample size 1000). The x axis is factor 1, the y axis is factor 2. Let N(x, y) denote the Gaussian distribution with mean x and variance y. The two random variables in panels (a), (b), and (c) are, respectively: (a) r1 = N(0,9), r2 = N(0,9), with r1 and r2 independent; (b) r1 = N(0,9), r2 = r1*0.5 + N(0,9)*0.5; (c) r1 = N(0,9), r2 = r1*0.8 + N(0,9)*0.2. The variables r1 and r2 are increasingly correlated from (a) to (c).

Appendix C: The Weight Type of the Weighted Kappa Method

The idea of the weighted Kappa test is as follows. First, build a frequency table for the two raters, whose two dimensions are the possible values of the judgments by the two raters; for example, cell (3,4) contains the number of times that rater 1 gave score 3 while rater 2 gave score 4. Cells closer to the diagonal represent better agreement. The weighted Kappa formula is

\kappa_w = \frac{n \sum_{i=1}^{r} \sum_{j=1}^{r} w_{ij} x_{ij} \;-\; \sum_{i=1}^{r} \sum_{j=1}^{r} w_{ij} m_i n_j}{n^2 \;-\; \sum_{i=1}^{r} \sum_{j=1}^{r} w_{ij} m_i n_j},

where x_{ij} is the count in cell (i,j), m_i is the sum of the i-th row, n_j is the sum of the j-th column, r is the number of possible score values, n is the total number of samples, and w_{ij} is the weight of the count in cell (i,j). There are two ways to assign the weights in SAS 8.0. The first, which is the default in SAS, is the Cicchetti-Allison kappa coefficient, defined as

w_{ij} = 1 - \frac{|i - j|}{\max - \min},

where max and min are the maximum and minimum possible scores. The second is the Fleiss-Cohen kappa coefficient, defined as

w_{ij} = 1 - \frac{(i - j)^2}{(\max - \min)^2}.

As mentioned in the main text, the weight function we used was the default Cicchetti-Allison coefficient.
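A minimal sketch of this computation (not the SAS implementation; it assumes a 10-by-10 table of counts whose rows and columns correspond to the scores 1 through 10) might look like the following.

import numpy as np

def weighted_kappa(table, weight_type="cicchetti_allison"):
    """table: r x r array of counts; table[i, j] is the number of documents that
    rater 1 scored i+1 and rater 2 scored j+1. Implements the Appendix C formula."""
    table = np.asarray(table, dtype=float)
    r = table.shape[0]
    n = table.sum()                              # total number of pairs
    m = table.sum(axis=1)                        # row sums m_i
    c = table.sum(axis=0)                        # column sums n_j
    i, j = np.indices((r, r))
    span = r - 1                                 # max - min for scores 1..r
    if weight_type == "cicchetti_allison":
        w = 1.0 - np.abs(i - j) / span           # Cicchetti-Allison weights
    else:
        w = 1.0 - (i - j) ** 2 / span ** 2       # Fleiss-Cohen weights
    observed = (w * table).sum()                 # sum of w_ij * x_ij
    expected = (w * np.outer(m, c)).sum()        # sum of w_ij * m_i * n_j
    return (n * observed - expected) / (n ** 2 - expected)

Applied to the cross-tabulation of the SUNY Albany and Rutgers scores (pooled over the comparable quality pairs, as in Table 2), this returns the weighted Kappa value.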
REFERENCES

Barry, C. L. (1994). User-defined relevance criteria: An exploratory study. Journal of the American Society for Information Science, 45(3), 149-159.

Bauin, S., & Rothman, H. (1992). "Impact" of journals as proxies for citation counts. In P. Weingart, R. Sehringer, & M. Winterhager (Eds.), Representations of science and technology (pp. 225-239). Leiden: DSWO Press.

Borgman, C. L. (Ed.). (1990). Scholarly communication and bibliometrics. London: Sage.

Bruce, H. W. (1994). A cognitive view of the situational dynamism of user-centered relevance estimation. Journal of the American Society for Information Science, 45(3), 142-148.

Buckland, M., & Gey, F. (1994). The relationship between recall and precision. Journal of the American Society for Information Science, 45, 12-19.

Cicchetti, D. V. (1972). A new measure of agreement between rank ordered variables. Proceedings of the American Psychological Association, 7, 17-18.

Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213-220.

Cronin, B., & Hert, C. A. (1995). Scholarly foraging and network discovery tools. Journal of Documentation, 51, 388-403.

Harman, D. (2002). Overview of the TREC 2002 Novelty Track. In TREC 2002 (TREC 11) Proceedings. NIST.

Hoppe, K., Ammersbach, K., Lutes-Schaab, B., & Zinssmeister, G. (1990). EXPRESS: An experimental interface for factual information retrieval. In J.-L. Vidick (Ed.), Proceedings of the 13th International Conference on Research and Development in Information Retrieval (ACM SIGIR '91) (pp. 63-81). Brussels: ACM.

Kling, R., & Elliott, M. (1994). Digital library design for usability. Retrieved December 7, 2001, from http://www.csdl.tamu.edu/DL94/paper/kling.html

Malone, D., & Videon, C. (1997). Assessing undergraduate use of electronic resources: A quantitative analysis of works cited (a study of ten undergraduate institutions in the greater Philadelphia area). Research Strategies, 15, 151-158.

Ng, K. B., Kantor, P., Tang, R., Rittman, R., Small, S., Song, P., Strzalkowski, T., Sun, Y., & Wacholder, N. (2003). Identification of effective predictive variables for document qualities. In Proceedings of the 2003 Annual Meeting of the American Society for Information Science and Technology. Medford, NJ: Information Today.

Schamber, L. (1991). Users' criteria for evaluation in a multimedia environment. In Proceedings of the American Society for Information Science (pp. 126-133). Washington, DC.
Strong, D. M., Lee, Y. W., & Wang, R. Y. (1997). Data quality in context. Communications of the ACM, 40(5), 103-110.

Tang, R., Ng, K. B., Strzalkowski, T., & Kantor, P. (2003). Toward machine understanding of information quality. In Proceedings of the 2003 Annual Meeting of the American Society for Information Science and Technology. Medford, NJ: Information Today.

Voorhees, E. (2001). Overview of TREC 2001. In E. Voorhees (Ed.), NIST Special Publication 500-250: The Tenth Text REtrieval Conference (pp. 1-15). Washington, DC.