Proceedings of the 38th Hawaii International Conference on System Sciences - 2005

User-Oriented Relevance Judgment: A Conceptual Model

Zhiwei Chen
School of Computing
National University of Singapore
chenzhiw@comp.nus.edu.sg

Yunjie Xu
School of Computing
National University of Singapore
xuyj@comp.nus.edu.sg

Abstract

The concept of relevance has been heatedly debated in the last decade. Dissatisfied with the narrow, technical definition of system relevance, researchers have turned to the subjective and situational aspects of this concept. How does a user come to perceive a document as relevant? The literature on relevance has identified numerous factors affecting such judgment. Taking a cognitive approach, this study focuses on the criteria users employ in making relevance judgments. Based on Grice's theory of communication, this paper proposes a five-factor model of relevance: topicality, novelty, reliability, understandability, and scope. Data were collected in a semi-controlled survey study and analyzed following a psychometric procedure. The result supports topicality and novelty as the key relevance criteria. Theoretical and practical implications of this study are discussed.

Keywords: relevance, relevance criteria, Grice's theory, psychometric analysis

1. Introduction

Searching for relevant information is a hard and frustrating task for many users [65]. Information retrieval (IR) systems nowadays typically return a large number of textual documents, most of which are found irrelevant. This has triggered a resurgence of interest in the concept of relevance, which is the "fundamental and central concept" in both information retrieval and information science [50, 54]. As a result of the inadequacy of a system- or algorithm-oriented perspective on relevance, recent studies have adopted a user-oriented, subjective perspective. For example, Saracevic [49, p.120] argues that "only the user himself may judge the relevance of the document to him and his uses." Consequently, subjective relevance concepts such as psychological relevance and situational relevance have been proposed as replacements for, or extensions of, the objective, system-determined notion of relevance.

If relevance is subjective, what makes a user judge a document as relevant? Many different document attributes have been found to affect relevance judgment, including recency, reliability, and topicality, among others. Such lists of document attributes can easily contain more than twenty criteria [e.g. 7]. However, the extant research suffers from a few important limitations. First, when the number of factors is so large, it obscures the key factors. Second, although Barry and Schamber [7] suggest that there is a core set of user criteria across different situations, no consensus has been reached regarding the composition and definition of the key factors in that set. Although topicality seems to be unanimously accepted, factors beyond topicality are not agreed upon. Finally, methodologically, past studies are almost exclusively exploratory and data-driven. Exploratory studies are very useful for uncovering an unknown phenomenon, but they cannot confirm whether a factor so identified is statistically significant in the domain of interest. In comparison, a confirmatory study adopts a hypothesis-testing procedure, which helps to further test the validity of the identified factors and weed out unimportant ones. With a focus on users' relevance judgment, the purpose of this study is to 1) identify a set of core relevance criteria using a theory-driven approach, paying particular attention to factors beyond topicality, and 2) test the validity of these factors with a rigorous psychometric approach.

2. Literature review

2.1. Subjective relevance

What is relevance?
For more than fifty years, information scientists have attempted to conceptualize this concept and have defined it in different ways [50, 53]. A general trend in information science is that relevance is increasingly regarded as a subjective concept as opposed to an algorithm-determined one [10, 42, 50, 53]. The term subjective relevance is used as an umbrella covering the concepts of subjective topicality [e.g. 32, 42, 54] and situational relevance [e.g. 32, 42, 44, 50]. Subjective topicality extends the system-determined query-document match known as system relevance. While system relevance is judged by mechanical criteria such as the cosine similarity in the vector space model, topical relevance is a subjective user judgment. However, "relevance is not necessarily the same as topicality," as indicated by Bookstein [9]. Boyce [11] argues that merely hitting on the topic area is insufficient; users are looking for informativeness beyond topicality. Hersh's [28] study in the medical field also calls for the recognition of situational factors in defining what is relevant. In the 1990's, more researchers turned to the situational aspects of this concept [e.g. 6, 27, 44].

0-7695-2268-8/05/$20.00 (C) 2005 IEEE

Situational relevance takes a pragmatic perspective and defines relevance as the utility of a document to the user's task or problem at hand. In this view, if a document contributes to problem-solving, it is relevant; otherwise it is irrelevant. Wilson [64, p.458] first introduces the concept of situational relevance and defines it as "the actual uses and actual effects of information: how people do use information, how their views actually change or fail to change consequent on the receipt of information." Saracevic [50, 51] regards the utility perspective on relevance as a cost-benefit trade-off. Saracevic [50, p.334] indicates that for IR systems, "the true role is to provide information that has utility – information that helps to directly resolve given problem, that directly bears on given actions, and/or that directly fit into given concern and interests." Borlund [10, p.922] conceptualizes situational relevance as a user-centered, empirically based, realistic, and potentially dynamic concept. Between topicality and situational relevance, topicality is viewed as a basic requirement while situational relevance is viewed as a "higher" requirement, as it relates directly to a situation [10]. In this sense, situational relevance demands topicality. In this study, we adopt a situational definition of relevance and define it as the perceived usefulness, satisfaction, and helpfulness of a document to the user's problem or information need at hand. The term relevance refers to situational relevance hereafter.

2.2. Relevance Criteria

While relevance is conceptualized as the perceived strength of relationship between a document and an information need [50], a question that follows naturally is what criteria users employ in such a judgment. Schamber et al. [54, p.771] highlight the importance of relevance criteria studies and suggest that "an understanding of relevance criteria, or the reasons underlying relevance judgment, as observed from the user's perspective, may contribute to a more complete and useful understanding of the dimensions of relevance." As early as the 1960's, researchers attempted to identify the criteria for relevance judgment. For example, Rees and Schulz [47] suggest 40 variables and indicate that the more information is given to the user, the more stringent a relevance judgment will be. Cuadra and Katter [15] find 38 factors. Since 1990, more empirical studies have been carried out to discover such criteria or factors in different problem domains. Table 1 summarizes some of these studies.

Table 1. Relevance criteria (source; context; subject and sample size; number of criteria; criteria)

[52] Weather information; working people (30); 10 criteria: Presentation quality, Currency, Reliability, Verifiability, Geographic proximity, Specificity, Dynamism, Accessibility, Accuracy, Clarity.

[57] Assigned essay; students (40); 5 criteria: Completeness, Precision, Relevance, Expectancy, Coverage.

[6] Online free search for information; students (18); 23 criteria. Information content of document: Depth and scope, Objective accuracy/validity, Clarity, Recency, Tangibility, Effectiveness. Source of document: Source quality, Source reputation/visibility. Document as a physical entity: Obtainability/availability, Cost. Other information and sources: Consensus within the field, External verification, Availability within environment, Personal availability. User's situation: Time constraints, Relationship with author. User's beliefs and preferences: Subjective accuracy/validity, Affectiveness, Background/experience. User's background: Ability to understand, Content novelty, Source novelty, Document novelty.

[44] Academic problem and need; graduate students (24); 12 criteria: Applicable, Good, Helpful, Important, Interesting, Need, New, Related, Relevant, Similar, Studied, Useful.

[61, 62] Research project; graduate students (25); 11 criteria: Topicality, Orientation/level, Discipline, Novelty, Expected quality, Recency, Reading time, Availability, Special requisites, Authority, Relation/origin.

[58] Term paper; graduate students (1); 10 criteria: Topical relatedness, Type of article, Similar topical focus, Duplicates, Recency, Length, Depth/breadth, Language, Geographic focus, Version of article (repetitiveness).

[30] Research paper on any sports; primary students (10); 11 criteria. Textual material: Authority, Convenience/accessibility, Interesting, Language, Novelty, Peer interest, Quality, Recency/temporal issues, Topicality. Graphic material: Authority, Clarity/completeness, Interesting, Peer interest, Expediency.

[19] Academic task; undergraduates (10); 32 criteria. Relevance-related reasoning: Interest, Specific idea, Useful or helpful, Specific use, Banned idea, Divergent, Specificity, Background, More is better, Essential, Serendipity, Prior knowledge. Evaluation-related reasoning: Good, Context, Methodology, Perspective, Insufficient, Author, Currency, Wrong methodology, Obvious, Strange, Disagree, Authority. Affect-related reasoning: Funny, Like or dislike, Disturbing, Want, Sad, Annoy, Happy, Fun.

[38] Academic need for research paper or thesis; graduate students (12); 29 criteria. Abstract: Citability, Information. Author: Author novelty, Discipline, Institutional affiliation, Perceived status. Content: Accuracy-validity, Background, Content novelty, Contrast, Depth-scope, Domain, Citations, Links to other information, Relevant to other interests, Rarity, Subject matter, Thought catalyst. Full-text document: Audience, Document novelty, Type, Possible content, Utility, Recency. Journal or publisher: Journal novelty, Main focus, Perceived quality. Participant: Competition, Time requirements.

[13] Images in American history; students (38); 9 criteria: Topicality, Accuracy, Time frame, Suggestiveness, Novelty, Completeness, Accessibility, Appeal of information, Technical attributes of images.

These studies have explored a large number of criteria/factors in different situations and tasks, and the relevance criteria they provide are quite comprehensive. However, there are a few important limitations. First, the number of factors is very large. If a predictive model is eventually to be built into an IR system, asking users to comment on all these factors, or measuring all of them, is surely impractical. Second, the terminology is confusing. The same criterion (according to its definition in the papers) is named differently by different authors and users (e.g.
accuracy and reliability, utility and usefulness), which calls for them to be combined [33]. Third, factors overlap with each other in meaning (e.g. novelty, new, recency). Fourth, the judgment of an IR system and the judgment of document content need to be distinguished. For example, accessibility is more a property of an IR system (whether it carries a certain document or not) than of a document; relevance should be based on document content rather than on physical properties such as availability. Fifth, variables such as utility, usefulness, pertinence, informativeness, and helpfulness should be treated as surrogates of the relevance judgment, i.e., as dependent variables, rather than as independent variables or criteria. Finally, as mentioned above, these studies are methodologically exploratory rather than confirmatory. Confirmatory studies are called for to integrate and verify their results.

Some of the above-mentioned problems have been identified by prior research as well. For example, Barry and Schamber [7] compare the results of their two studies in totally different situations, academic and weather information search, and find a considerable overlap of relevance criteria. Bateman [8] carries out a longitudinal study and finds that the important criteria remain fairly stable throughout the whole process, although the whole set of criteria might change [35, 60]. Summarizing the past literature, it seems that there is a set of core relevance judgment criteria that most users follow, although the actual set and the importance of a particular criterion might change depending on the context [7]. The question remains: What is the set of core relevance criteria, and how should we conceptualize them? This study attempts to address this question.

3. Theory and Research Model

3.1. Theory

Departing from the extant research, which adopts an inductive and exploratory methodology, we adopt a theory-driven approach.
To identify the relevance criteria, we propose that Grice's [24, 25] maxims on human communication can serve as a theoretical foundation for relevance judgment. Not only does Grice's framework of maxims address human communication in general (of which IR can be regarded as an indirect form), it is also consistent with many empirical studies in the IR area. Grice's work established the foundation of the inferential model of human communication, which is more general than Shannon's code model of communication [55]. Grice [25] posits that the essential feature of human communication, both verbal and non-verbal, is the expression and recognition of intention. A communication is successful when both parties are cooperative in making their meanings and intentions clear (i.e., the principle of cooperation). What kind of communication is cooperative? Grice further describes the hearer's expectations of the speaker's communication in terms of the following conversational maxims: quantity, quality, relation, and manner.

The maxim of quantity has two sub-maxims. In Grice's words, contributing an appropriate amount of information to the communication is to "make your contribution to the conversation as informative as is required," and "do not make your contribution to the conversation more informative than is required." While Grice focuses on conversational communication, a more appropriate term for written communication via documents would be "scope". We identify "scope" as one relevance criterion. Although this maxim focuses on the amount of information, it nevertheless suggests that new information should be supplied, so that the conversation is "informative." Wang and Soergel [61] suggest that novelty and the resulting epistemic value are implied in any functional value of a document. We therefore identify "novelty" as a criterion. The maxim of quality also has two sub-maxims: "do not say what you believe to be false," and "do not say that for which you lack adequate evidence." We use the term "reliability" because "quality" implies more than what Grice means in IR. The maxim of relation is defined as "be relevant." However, the term "relevant" here carries its everyday sense: whether a response is on topic or the other party abruptly starts to talk about something else. In that sense, it is "topicality" in IR. Finally, the maxim of manner is to "avoid obscurity of expression," "avoid ambiguity," "be brief," and "be orderly." The purpose of this maxim is that conversation should be perspicuous and hence reduce the cognitive load on the hearer. We term it "understandability" in the context of written documents. In summary, based on Grice's maxims, we identify five relevance criteria: scope, novelty, reliability, topicality, and understandability.

Grice's theory plays a significant role in human communication and pragmatics studies [e.g. 5]. The communication maxims have been widely applied in other fields, such as optimality theory [4, 48], cooperative answering systems [22], and spoken dialogue systems [17]. Most noticeably, in communication studies, Sperber and Wilson [55] extend Grice's work and develop the theory of relevance, in which all of Grice's maxims are reduced to the "principle of relevance," i.e., to conform to the maxims is to be relevant. Unfortunately, Sperber and Wilson [55] focus on how a hearer adjusts cognitive context to make sense of a message rather than on the attributes of a message that make it relevant. In comparison, Grice's maxims directly address this issue.

Applying Grice's theory and maxims to IR is appropriate [32]. First, analogically, IR can be regarded as an asymmetric written communication between an author and a reader. The IR system can be seen as an intermediary that "speaks" for the authors, and the iterative process of query and document matching is the process of "conversation". Users expect the system to be cooperative and the retrieved documents to obey the maxims. Second, the five criteria identified based on Grice's maxims correspond very well to the empirical findings in relevance research. Table 2 summarizes a representative list of such studies. As shown in Table 2, many factors identified in the prior literature tap directly on the five criteria, and the five criteria are comprehensive enough to cover most criteria identified in prior user studies, which in return testifies to the generalizability of the theory. We shall further justify each criterion in the next section. Figure 1 summarizes our proposed research model.

Table 2.
The five main factors in the literature ("---" indicates no corresponding criterion reported)

Source | Topicality | Novelty | Reliability | Scope | Understandability
[52] | Geographic proximity | Currency | Accuracy, Reliability | Specificity | Clarity
[14] | On the topic | Age | Precision | Specificity | Understandability
[6] | Assumed | Content novelty | Accuracy / validity | Depth / Scope | Ability to understand
[44] | Related | New | --- | --- | ---
[8] | About my topic | Novelty | Accurate, Credible | Suitable general or specific; Specific to my query, on target, but too technical / narrow | Understandable
[56] | It includes my search terms | Identifies a different, but related concept | It was an authoritative source | --- | Wrong language; Don't understand context
[58] | Topically related | Duplicates | --- | Depth / Breadth | Language
[61, 62] | Topicality | Novelty | Authority | Discipline | Special requisites
[30] | Topicality | Novelty | Authority | --- | Language
[19] | --- | Divergent | Disagree, Authority | Specificity | ---
[38] | Subject matter | Novelty | Accuracy-validity | Depth-scope | ---
[13] | Topicality | Novelty | Accuracy | Completeness | ---

3.2. Research Model and Hypotheses

Figure 1. Research model: Topicality (H1), Reliability (H2), Understandability (H3), Novelty (H4), and Scope (H5) are each hypothesized to influence Relevance.

Topicality. Topicality is the essence of Grice's maxim of relation. If a conversation is to be successful, violation of this maxim is rare, if not impossible [25]. The importance of topicality is widely recognized in the relevance literature. Maron [39] suggests that aboutness is the heart of indexing. Boyce [11] indicates that users first judge the topicality of a document and then consider other factors in their relevance judgment. Howard [33] also acknowledges topicality as the first or basic condition of relevance. Harter [27] treats topicality as a weak kind of relevance. Froehlich [21] summarizes early studies and concludes that topicality plays the nuclear role in relevance.
We adopt a subjective view and define topicality as the extent to which the retrieved document is related to the user's current topic of interest, as perceived by the user. It unifies concepts such as aboutness [39], topic relatedness [58], and topical relevance [23, 51] proposed in prior studies. Because of its fundamental role in situational relevance, and consistent with almost all prior exploratory studies, we hypothesize:

H1: Topicality is positively associated with relevance.

Reliability. Intuitively, people accept information that is perceived to be accurate. Grice [25] observes that quality is the prerequisite for the other maxims to operate. Ultimately, if a document is to be relevant by reducing uncertainty in the mind of the user, it must first be reliable in itself. Many different disciplines testify to the importance of reliability. In data quality management, accuracy is acknowledged as a key (if not the only key) dimension of data quality [63]. When evaluating the output of a database, users will immediately dismiss its usefulness if it lacks accuracy. In the persuasion literature of psychology, Petty and Cacioppo [45] indicate that a message receiver first judges the reliability of information, and then decides whether to adopt it. In accounting research, Johnson et al. [34] also show that reliability is the key criterion in evaluating the quality of data for acceptance.

How does a user judge the reliability of a document? Petty et al. [46, p.103] note that "source status, by influencing perceptions of source credibility, competence, or trustworthiness, can provide message recipients with a simple rule as to whether or not to agree with the message." Information from an expert is perceived as more reliable than information from a source without credentials [45]. Therefore, the credibility of the source can be regarded as an external cue of document reliability [e.g. 6, 30, 56].
We define reliability as the degree to which the content of a retrieved document is perceived to be true, accurate, or believable. Similar concepts in the literature are accuracy [52], validity [6], and agree/disagree [19]. We hypothesize:

H2: Reliability is positively associated with relevance.

Understandability. Understandability corresponds to Grice's maxim that a message should be perspicuous. Research in communication and education shows that the use of jargon or technical language may reduce the clarity of a message and lead to significantly lower evaluations than a jargon-free message [16]. Both experts and non-experts are sensitive to the use of jargon [59]. In accounting research, understandability is also a measure of the effectiveness of accounting reports for decision makers [2]. In a client-professional exchange, the use of sophisticated language may affect the acceptance of the professional's advice [18]. We define understandability as the extent to which the content of a retrieved document is easy to read and understand, as perceived by the user. It unifies similar concepts such as clarity [52], language use [30, 58], and special requisites [61]. We hypothesize:

H3: Understandability is positively associated with relevance.

Novelty. Psychological researchers define novelty as a stimulus that has not been previously presented or observed and is thus unfamiliar to the subject. In the psychological literature, novelty-seeking behavior is regarded as an internal drive or motivational force of human beings [1]. Seeking new and potentially discrepant information may help people "create a 'bank' of potentially useful knowledge" and further "improve people's problem-solving skills" [29, p.284]. Lancaster [36] first introduces the concept of novelty into IR research, and defines it as the retrieval of citations previously unknown to the requester.
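Lancaster's binary notion of "previously unknown" lends itself to a simple computational sketch. The graded word-overlap scoring below is our own illustration, not a method proposed in the paper or the relevance literature; the example strings echo the study's search topic.

```python
# A minimal sketch of Lancaster-style novelty: a retrieved item is novel when it
# is not already known to the requester. The graded variant (Jaccard word overlap
# against already-known documents) is our illustrative assumption, not the paper's method.
def jaccard(a, b):
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

def novelty_score(document, known_documents):
    """1.0 = entirely new to the requester; 0.0 = identical to something already known."""
    if not known_documents:
        return 1.0
    return 1.0 - max(jaccard(document, known) for known in known_documents)

known = ["mobile phone radiation and health risks",
         "safety guidelines for mobile phone use"]
print(novelty_score("mobile phone radiation and health risks", known))  # identical -> 0.0
print(round(novelty_score("driving while using a mobile phone", known), 2))
```

Treating the score as continuous rather than binary anticipates the observation, discussed next, that novelty is a matter of degree.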
Harter [27, p.608] notes that normally "a citation corresponding to an article already known to the requester could not be psychologically relevant" because it will not produce cognitive change in the subject. However, such a citation may serve as a reminder; therefore, novelty should be regarded as a matter of degree. Recent exploratory studies acknowledge novelty as an important factor affecting relevance [e.g. 6, 38]. We define novelty as the extent to which the content of a retrieved document is new to the user or different from what the user has known before. It unifies similar concepts such as content novelty [6], new content [44], and divergent or strange content [19]. We hypothesize:

H4: Novelty is positively associated with relevance.

Scope. Grice's [25] maxim of quantity posits that an adequate amount of information is what the hearer prefers. The concept of scope can be described in terms of two components: breadth and depth [e.g. 40]. Levitin and Redman [37] suggest scope and level of detail as two important dimensions of data quality. They argue that a user needs data broad enough to satisfy all the intended uses while not including unnecessary information. For the level of detail, they further show that detailed information may serve as a quality safeguard, while overly detailed information is an annoyance. We define scope as the extent to which the topic or content covered in a retrieved document is appropriate to the user's need, i.e., both the breadth and the depth of the document are suitable. This definition subsumes similar concepts such as specificity [19, 52], depth/scope [6], and depth/breadth [58]. We hypothesize:

H5: Scope is positively associated with relevance.

4. Methodology

In order to test the proposed model, a survey method was used, followed by rigorous psychometric analysis.
Structural equation modeling, a quantitative psychometric analysis method, is a well-established and dominant data analysis method in psychology, sociology, marketing, information systems research, and many other disciplines. It is particularly suitable for studying relationships among psychological perceptions, which are not directly measurable. Since no prior relevance research follows this methodology, we briefly introduce it and point to key references where appropriate.

4.1. Instrument Development

In psychometric analysis, a human perception of an object is called a construct (e.g. topicality, relevance). A construct is assumed not to be directly measurable, but to manifest itself in different ways. Therefore, for each construct, multiple questions (a.k.a. items) that reflect different aspects of the construct are asked in a survey. For example, to measure novelty, survey participants were asked about the amount of new information in a document, the amount of unique information, and its similarity to prior knowledge, instead of a single novelty question. The latent meaning underlying all these items can be extracted using factor analysis, which is more accurate than the score of a single question [43]. All questions were self-developed based on the definitions of the constructs. Items (i.e. questions) were constructed as 7-point Likert scales. For example, one question measuring relevance is "This document is helpful to solve my problem at hand" (1 = strongly disagree, 7 = strongly agree).

To ensure that items do reflect the intended construct, content validity was checked first. Content validity is the degree to which the questions for a construct representatively cover the manifestations of the intended construct. The questions we used were to a large degree rephrasings of similar concepts proposed in the literature. This provides the basis for content validity [43, chapter 3].
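The logic of multi-item measurement can be illustrated with a small internal-consistency computation. The sketch below computes Cronbach's alpha from scratch; the 7-point ratings are fabricated for illustration and are not the study's data.

```python
# Cronbach's alpha for a multi-item Likert scale (illustrative, fabricated ratings).
# alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
def cronbach_alpha(items):
    """items: one list of ratings per question; every list covers the same respondents."""
    k = len(items)                          # number of questions for the construct
    n = len(items[0])                       # number of respondents
    def var(xs):                            # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    item_variance = sum(var(q) for q in items)
    totals = [sum(q[i] for q in items) for i in range(n)]  # per-respondent total score
    return (k / (k - 1)) * (1 - item_variance / var(totals))

# Four hypothetical 7-point items for one construct, answered by six respondents:
items = [
    [6, 5, 2, 7, 3, 4],
    [6, 4, 2, 6, 3, 5],
    [7, 5, 1, 6, 2, 4],
    [5, 5, 2, 7, 3, 4],
]
print(round(cronbach_alpha(items), 2))
```

Because the four fabricated item columns move together across respondents, alpha comes out high; values above 0.7 are conventionally taken as acceptable internal consistency, the same threshold the paper applies to its instrument.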
Questions designed to measure one construct should not in fact measure another. Item sorting is a method to ensure the pertinence of each question to its own construct (see [41] for methodological details). Item sorting has two phases. In the first phase, four judges sorted all the questions into as many groups as they deemed appropriate; the number of constructs, the construct definitions, and the construct-question relationships were not known to the judges. In the second phase, another four judges were asked to match each question to a construct definition, which was now known to them. Inter-judge agreement was measured with the Kappa score. The Kappa scores of the final sorting were all above 0.7, the suggested threshold. We therefore concluded that our questions had content validity and were suitable for a larger-scale survey. The questions used in this study are listed in Appendix 1.

4.2. Data Collection

The survey was carried out in two steps: a pilot study and a main study. The purpose of the pilot study is to quantitatively test the questionnaire quality and construct validity on small-scale data. Both the pilot study and the main study were carried out in a computer lab. Subjects were undergraduate and graduate students at a major university in Southeast Asia. Subjects were asked to search for documents on an assigned topic, "The health and safety of using mobile phone." They were asked to provide their demographics and their prior knowledge of the topic, and then to search the Internet and list at least five documents that, after reading, they found at least marginally related. They then evaluated two documents randomly assigned by the researchers. Subjects generally took 30 to 60 minutes to finish the whole process, and a token fee was given as a reward. Both the pilot and the main study were done in this fashion.

5. Data Analysis and Results

5.1. Pilot Study

In the pilot study, 76 valid questionnaires were collected from a sample of 38 students. Exploratory factor analysis (EFA; see [43]) was conducted to test the convergent and discriminant validity of the questions. Convergent validity means that all questions intended to measure a construct do reflect that construct. Discriminant validity means that a question does not reflect an unintended construct and that the constructs are statistically distinct. For the pilot study, exploratory factor analysis with principal component analysis was used for such validation. In this procedure, the major principal components were extracted as constructs; minor principal components with eigenvalues less than 1 were ignored, as is the convention.

Table 3. Factor loading table (columns C1-C6 are the six rotated components)

Item    C1     C2     C3     C4     C5     C6
TOP1   .615   .347   .027   .226   .278   .372
TOP2   .726   .194   .075   .124   .132   .378
TOP3   .852   .150   .241  -.048   .069   .235
TOP4   .768   .193   .102   .002   .015   .259
RELI1  .333   .683   .112  -.108   .057   .377
RELI2  .181   .884   .152  -.082   .007   .265
RELI3  .155   .890   .288  -.033  -.027   .175
RELI4  .152   .863   .302   .050  -.018   .204
UND1   .153   .120   .911  -.002  -.072  -.022
UND2  -.036   .160   .858  -.052  -.033   .271
UND3   .151   .313   .874   .015  -.086   .109
UND4   .326   .353   .543  -.027  -.075   .252
NOV1   .156  -.004  -.001   .740   .140   .298
NOV2   .066   .018  -.062   .872  -.005   .199
NOV3   .058  -.094  -.023   .824  -.172   .017
NOV4  -.153  -.020   .040   .782  -.206  -.171
SCO1   .027   .002  -.121  -.010   .898   .020
SCO2   .172   .003  -.064  -.097   .892  -.052
SCO3   .026  -.017  -.003  -.159   .752   .342
RELE1  .195   .220   .170   .195   .097   .851
RELE2  .277   .140   .076   .107   .093   .879
RELE3  .167   .233   .134  -.015   .063   .862
RELE4  .352   .263   .045   .008  -.054   .761
RELE5  .251   .222   .203   .147   .137   .784

Table 3 reports the principal component analysis result (with Varimax rotation). According to [26], the correlation between an item and its intended construct (a.k.a.
“factor loading”) should be greater than 0.5 to satisfy convergent validity, and the correlation between an item and unintended constructs should be less than 0.4 for discriminant validity. Six factors were extracted, corresponding to the six constructs. Items Scope4 and Scope5 were dropped because they did not satisfy the convergent and discriminant criteria. The remaining items showed appropriate validity and were kept for the main study.

5.2. Main Study

In the main study, 162 valid questionnaires were collected from 81 students, of whom 62% were male and 38% female. The average age was 24. Most of them (94%) used Google as their main search engine.

Measurement model. Following the methodological suggestion of Anderson and Gerbing [3], the first step of structural equation modelling, before hypothesis testing, is measurement modelling, which further ensures questionnaire quality. Unlike in EFA, in the measurement model we pre-specified the construct-question correspondence but left the correlation coefficients (factor loadings) free to vary. Questions are expected to be highly correlated with their intended constructs only. The measurement model was analyzed with confirmatory factor analysis (CFA) using the statistical package LISREL v8.51. In CFA, convergent validity is verified by the factor loadings, the average variance extracted (AVE) of each item by the intended construct, the composite factor reliability (CFR), and Cronbach's alpha (α) [26]. The latter two measure how consistently the questions of a construct correlate with each other. Table 4 reports the results of our measurement model. According to Fornell and Larcker [20], an AVE score above 0.5 indicates an acceptable level of convergent validity. Chin [12] recommends that alpha and CFR be above 0.7. These criteria are all satisfied; thus, convergent validity is ensured.

Table 4. Measurement model

Item   Std. Loading  T-value   AVE    CFR    α
TOP1   0.89          14.10     0.69   0.90   0.90
TOP2   0.92          15.11
TOP3   0.80          11.91
TOP4   0.69           9.72
RELI1  0.73          10.50     0.71   0.91   0.91
RELI2  0.86          13.51
RELI3  0.90          14.50
RELI4  0.89          14.06
UND1   0.89          14.43     0.77   0.93   0.93
UND2   0.92          15.12
UND3   0.94          15.56
UND4   0.76          11.20
NOV1   0.77          11.07     0.52   0.81   0.81
NOV2   0.90          13.74
NOV3   0.58           7.57
NOV4   0.61           8.20
SCO1   0.60           7.49     0.53   0.77   0.77
SCO2   0.70           8.92
SCO3   0.86          11.22
RELE1  0.92          15.04     0.74   0.93   0.93
RELE2  0.86          13.61
RELE3  0.85          13.40
RELE4  0.87          13.92
RELE5  0.82          12.49

Table 5. Construct correlation table (diagonal entries are the square roots of AVE)

       TOP    RELI   UND    NOV    SCO    RELE
TOP    0.83
RELI   0.37   0.84
UND    0.23   0.09   0.88
NOV    0.59   0.28  -0.16   0.72
SCO    0.50   0.08   0.27   0.27   0.73
RELE   0.78   0.32   0.21   0.57   0.47   0.86

One way to check discriminant validity is that the inter-construct correlations should be less than the square root of the AVE [20]. The correlations among constructs are reported in Table 5. By this criterion, discriminant validity was confirmed.

Structural model. Since the measurement model was acceptable, we proceeded to hypothesis testing. Hypothesis testing was done by creating a structural equation model in LISREL, which specifies both the item-construct correspondence and the construct-construct causal relationships. The coefficients were then solved with maximum likelihood estimation. Before drawing conclusions on the hypotheses, the model fit should be checked. Figure 2 reports the hypothesis testing result. It indicated low yet acceptable model fit: GFI, RFI, and NFI, though low, should be considered acceptable for a newly developed instrument [43]; the remaining indices were all better than the recommended levels [43]. Because the model fit was acceptable, we could interpret the result and conclude that H1 (topicality) and H4 (novelty) were supported, while the other three hypotheses were not.
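The convergent-validity statistics in Table 4 can be reproduced from the standardized loadings. A common formulation (assumed here; the paper cites [26] and [20] for the criteria) takes AVE as the mean of the squared standardized loadings and CFR as (Σλ)² / ((Σλ)² + Σ(1 − λ²)). A sketch using the Topicality loadings from Table 4:

```python
import numpy as np

def ave(loadings):
    """Average variance extracted: mean of squared standardized loadings."""
    lam = np.asarray(loadings)
    return float(np.mean(lam ** 2))

def composite_reliability(loadings):
    """Composite factor reliability, assuming error variance 1 - lambda^2
    for each standardized item."""
    lam = np.asarray(loadings)
    num = lam.sum() ** 2
    return float(num / (num + np.sum(1 - lam ** 2)))

# Standardized loadings of the Topicality items (Table 4).
top = [0.89, 0.92, 0.80, 0.69]
print(round(ave(top), 2))                    # prints 0.69 (Table 4: AVE = 0.69)
print(round(composite_reliability(top), 2))  # prints 0.9 (Table 4: CFR = 0.90)

# Fornell-Larcker check: inter-construct correlations should stay below
# sqrt(AVE); sqrt(0.69) is about 0.83, the Topicality diagonal in Table 5.
print(round(ave(top) ** 0.5, 2))             # prints 0.83
```

The same two functions applied to the other constructs' loadings recover the remaining AVE and CFR columns of Table 4 to within rounding.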
Because the correlation between topicality and relevance was high, there was a risk that the other hypotheses might be rejected because of multicollinearity. The multicollinearity of all constructs was checked with variance inflation factors (VIF). The highest VIF was 1.78, much lower than the threshold of 10. Therefore, the non-significance was unlikely to be a statistical artefact.

Figure 2. Standardized LISREL Solutions
(Path diagram: standardized coefficients with t-values in parentheses. Topicality → Relevance: 0.57** (5.82); Novelty → Relevance: 0.21* (2.54); Scope → Relevance: 0.10 (1.34); the Reliability and Understandability paths, 0.03 (0.53) and 0.09 (1.36), were not significant. R² of Relevance = 0.64. Fit: χ2 = 419.62, df = 237, p = 0.0000, RMSEA = 0.069, NFI = 0.86, NNFI = 0.92, CFI = 0.93, IFI = 0.93, RFI = 0.84, GFI = 0.82. * p < 0.05, ** p < 0.01.)

6. Discussion and Implications

Summary of data analysis. The objective of this study is to identify and confirm a set of key relevance judgement criteria. Five such criteria were identified based on Grice's maxims and the prior literature. Based on the EFA of the pilot data and the measurement model of the main study data, we show that these constructs do have discriminant validity, i.e., they are distinct concepts. For each construct, different phrasings with minor differences in meaning tap the same construct. Both exploratory and confirmatory factor analysis offer ways to reduce the vast number of criteria identified in the prior literature.

Even though five criteria were proposed, not all of them were supported by the data. The results showed that topicality and novelty were statistically significant for relevance judgment, while the others were not. All criteria together explained 64% of the variance in relevance. The standardized coefficient (0.57) showed that topicality was the major factor affecting relevance, while novelty was the second most important (0.21). The results also showed that reliability, understandability, and scope were not supported by the data. It would be too hasty to conclude that these factors are unimportant in general: the non-significance might be due to the design of the survey.
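The multicollinearity check reported above can be reproduced from a predictor correlation matrix: for standardized predictors, the VIFs are the diagonal of the inverse correlation matrix. A sketch using the symmetrized Table 5 correlations as a stand-in for the raw inter-construct correlations the authors would have used:

```python
import numpy as np

# Inter-construct correlations among the five predictors (Table 5,
# symmetrized): TOP, RELI, UND, NOV, SCO.
R = np.array([
    [1.00, 0.37, 0.23, 0.59, 0.50],
    [0.37, 1.00, 0.09, 0.28, 0.08],
    [0.23, 0.09, 1.00, -0.16, 0.27],
    [0.59, 0.28, -0.16, 1.00, 0.27],
    [0.50, 0.08, 0.27, 0.27, 1.00],
])

# VIF_i = 1 / (1 - R^2_i), which equals the i-th diagonal entry of R^-1.
vifs = np.diag(np.linalg.inv(R))
print(dict(zip(["TOP", "RELI", "UND", "NOV", "SCO"], vifs.round(2))))
# All values sit far below the conventional threshold of 10. The paper
# reports a maximum of 1.78, computed from its own raw scores, which may
# differ slightly from these rounded CFA-based construct correlations.
```
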
The survey asked participants to list documents that they perceived as at least marginally related, and then to evaluate two of them. This means that topicality was present in the evaluated documents to a certain degree. Such a design is reasonable because we are particularly interested in the factors beyond topicality. It is possible that reliability and understandability had, in effect, already been evaluated before the relevance judgment: their average scores were high, at 5.5 and 5.8 respectively, and because of that they became insignificant in the later-stage judgement of relevance. The construct Scope had a lower average (about 4.0), so its non-significance was less likely a result of the survey design. It seems that readers considered scope an optional premium in relevance judgment.

As the first confirmatory study in this area to follow a psychometric procedure, we shall point out the key limitations before drawing any implications. First, the conclusions and implications drawn from this study are applicable only to documents that bear minimum topicality in the first place. Second, the use of structural equation modelling assumes an additive model, i.e., that the contribution of each criterion to relevance is additive. Such an assumption might be viable when minimum topicality is assumed, in which case the other criteria are considered an extra premium on top of the basic topicality requirement. If such an assumption is not met, a multiplicative or stepwise model should be considered. Finally, the model fit is not good enough, suggesting that the questionnaire quality can be improved. These limitations serve as directions for future study.

Theoretical implications. The implications of this study are multi-fold. First, it proposes a theory-based model which helps identify five potentially important relevance criteria.
It attempts to unify various conceptualizations in the literature and give them a theoretical foundation. It also makes the first attempt to empirically confirm the importance of relevance criteria. In this pursuit, it confirms the early observation that topicality is the core of relevance in IR [e.g. 21]. In addition, it suggests novelty as the next most important criterion in relevance judgement. The prominent roles of topicality and novelty provide insight into the concept of relevance. Based on them, relevance can be depicted with four quadrants delineated by topicality and novelty (Figure 3).

Figure 3. Relevance quadrants

                Low topicality         High topicality
Low novelty     Irrelevant             Tool
High novelty    Potentially relevant   Informative

In the low topicality – low novelty quadrant, a document is neither on topic nor new to the user. It is thus most likely to be dismissed as irrelevant. In the high topicality – low novelty quadrant, a document is on topic but already known to the user. Imagine that we were to write another paper addressing the limitations of this study: reference [50] is a classic paper on the topic of subjective relevance and has topicality, but the authors are already familiar with its content. We may still treat it as relevant because we need to cite it, check some concepts it defines, or quote some sentences. Such a document is useful and relevant to our research, yet it is used as a tool. The low topicality – high novelty quadrant covers documents that are unclear in topicality yet provide new information that attracts the user's attention. As Harter [27] points out, there is no absolutely fixed information need in a search process; information needs can be multiple and vague. The interaction of new information in a document with the user's current cognitive state helps to clarify the information need and to create future search topics.
Consequently, a document might be regarded as potentially relevant because the user anticipates its future value rather than its current value. Finally, the high topicality – high novelty quadrant contains the ideal documents. They might help the user clarify an information need, or offer a new problem solution or a new method for evaluating different problem solutions. In each case, they are informative.

Practical implications. Decades of research effort have been spent on better capturing topicality. This study suggests that the next powerhouse of IR system design might be the quantification of novelty. How can a user's cognitive state be captured before document evaluation? How can the novelty of a document be measured against that cognitive state? How can novelty and topicality be combined into an overall relevance score? While this study does not offer answers to these questions, we do suggest that effort in this direction will be rewarding.

7. References

[1] Acker, M. and McReynolds, P., "The Need for Novelty: A Comparison of Six Instruments", Psychological Record, 17, 1967, pp. 177-182.
[2] Adelberg, A. H., "A Methodology for Measuring the Understandability of Financial Report Messages", Journal of Accounting Research, 17(2), 1979, pp. 565-592.
[3] Anderson, J. C. and Gerbing, D. W., "Structural Equation Modeling in Practice: A Review and Recommended Two-step Approach", Psychological Bulletin, 103(3), 1988, pp. 411-423.
[4] Atlas, J. and Levinson, S., "It-Clefts, Informativeness and Logical Form", Radical Pragmatics, New York, Academic Press, 1981.
[5] Bach, K. and Harnish, R., Linguistic Communication and Speech Acts, MIT Press, 1979.
[6] Barry, C. L., "User-defined Relevance Criteria: An Exploratory Study", Journal of the American Society for Information Science, 45(3), 1994, pp. 149-159.
[7] Barry, C. and Schamber, L., "Users' Criteria for Relevance Evaluation: A Cross-situational Comparison", Information Processing & Management, 34, 1998, pp. 219-236.
[8] Bateman, J., "Changes in Relevance Criteria: A Longitudinal Study", Proceedings of the 61st Annual Meeting of the American Society for Information Science, 35, 1998, pp. 23-32.
[9] Bookstein, A., "Relevance", Journal of the American Society for Information Science, 30(5), 1979, pp. 269-273.
[10] Borlund, P., "The Concept of Relevance in IR", Journal of the American Society for Information Science and Technology, 54(10), 2003, pp. 913-925.
[11] Boyce, B., "Beyond Topicality: A Two Stage View of Relevance and the Retrieval Process", Information Processing and Management, 18(3), 1982, pp. 105-109.
[12] Chin, W. W., "The Partial Least Squares Approach to Structural Equation Modeling", in Modern Methods for Business Research, G. A. Marcoulides (ed.), Mahwah, NJ: Lawrence Erlbaum Associates, 1998, pp. 295-336.
[13] Choi, Y. and Rasmussen, E. M., "Users' Relevance Criteria in Image Retrieval in American History", Information Processing and Management, 38, 2002, pp. 695-726.
[14] Cool, C., Belkin, N. J., and Kantor, P. B., "Characteristics of Texts Affecting Relevance Judgments", Proceedings of the 14th National Online Meeting, 1993, pp. 77-84.
[15] Cuadra, C. A. and Katter, R. V., "Opening the Black Box of 'Relevance'", Journal of Documentation, 23(4), 1967, pp. 291-303.
[16] Dwyer, J., Communication in Business: Strategies and Skills, Prentice Hall, Sydney, 1999.
[17] Dybkjaer, L., Bernsen, N., and Dybkjaer, H., "A Methodology for Diagnostic Evaluation of Spoken Human-Machine Dialogue", International Journal of Human-Computer Studies, 48(5), 1998, pp. 605-626.
[18] Elsbach, K. D. and Elofson, G., "How the Packaging of Decision Explanations Affects Perceptions of Trustworthiness", Academy of Management Journal, 43(1), 2000, pp. 83-89.
[19] Fitzgerald, M. A.
and Galloway, C., "Relevance Judging, Evaluation, and Decision Making in Virtual Libraries: A Descriptive Study", Journal of the American Society for Information Science and Technology, 52(12), 2001, pp. 989-1010.
[20] Fornell, C. and Larcker, D. F., "Evaluating Structural Equation Models with Unobservable Variables and Measurement Error", Journal of Marketing Research, 18(1), 1981, pp. 39-50.
[21] Froehlich, T. J., "Relevance Reconsidered: Towards an Agenda for the 21st Century: Introduction to Special Topic Issue on Relevance Research", Journal of the American Society for Information Science, 45, 1994, pp. 124-134.
[22] Gaasterland, T., Godfrey, P., and Minker, J., "An Overview of Cooperative Answering", Journal of Intelligent Information Systems, 1(2), 1992, pp. 123-157.
[23] Green, R., "Topical Relevance Relationships: I. Why Topic Matching Fails", Journal of the American Society for Information Science, 46, 1995, pp. 646-653.
[24] Grice, H. P., "Logic and Conversation", Syntax and Semantics, 3, Academic Press, 1975, pp. 41-58.
[25] Grice, H. P., Studies in the Way of Words, Harvard University Press, 1989.
[26] Hair, J. F., Anderson, R. E., Tatham, R. L., and Black, W. C., Multivariate Data Analysis with Readings (4th ed.), Englewood Cliffs, NJ: Prentice Hall, 1995.
[27] Harter, S. P., "Psychological Relevance and Information Science", Journal of the American Society for Information Science, 43(9), 1992, pp. 602-615.
[28] Hersh, W., "Relevance and Retrieval Evaluation: Perspectives from Medicine", Journal of the American Society for Information Science, 45(3), 1994, pp. 201-206.
[29] Hirschman, E.
C., "Innovativeness, Novelty Seeking, and Consumer Creativity", Journal of Consumer Research, 7, 1980, pp. 283-295.
[30] Hirsh, S. G., "Children's Relevance Criteria and Information Seeking on Electronic Resources", Journal of the American Society for Information Science, 50(14), 1999, pp. 1265-1283.
[31] Hjørland, B., Information Seeking and Subject Representation: An Activity-theoretical Approach to Information Science, Westport, CT: Greenwood Press, 1997.
[32] Hjørland, B. and Christensen, F. S., "Work Tasks and Socio-Cognitive Relevance: A Specific Example", Journal of the American Society for Information Science and Technology, 53(11), 2002, pp. 960-965.
[33] Howard, G., "Relevance Thresholds: A Multi-stage Predictive Model of How Users Evaluate Information", Information Processing & Management, 39, 2003, pp. 403-423.
[34] Johnson, J. R., Leitch, R. A., and Neter, J., "Characteristics of Errors in Accounts Receivable and Inventory Audits", Accounting Review, 1(2), 1981, pp. 270-293.
[35] Kuhlthau, C. C., Seeking Meaning: A Process Approach to Library and Information Science, Norwood, NJ: Ablex Publishing, 1993.
[36] Lancaster, F. W., Information Retrieval Systems: Characteristics, Testing, and Evaluation, New York: Wiley, 1968.
[37] Levitin, A. and Redman, T., "Quality Dimensions of a Conceptual View", Information Processing & Management, 31(1), 1995, pp. 81-88.
[38] Maglaughlin, K. L. and Sonnenwald, D. H., "User Perspectives on Relevance Criteria: A Comparison among Relevant, Partially Relevant, and Not-Relevant Judgments", Journal of the American Society for Information Science and Technology, 53(5), 2002, pp. 327-342.
[39] Maron, M. E., "On Indexing, Retrieval and the Meaning of About", Journal of the American Society for Information Science, 28(1), 1977, pp. 38-43.
[40] Miranda, S. and Saunders, C. S., "The Social Construction of Meaning: An Alternative Perspective on Information Sharing", Information Systems Research, 14(1), 2003, pp. 87-108.
[41] Moore, G. C.
and Benbasat, I., "Development of an Instrument to Measure the Perceptions of Adopting an Information Technology Innovation", Information Systems Research, 2(3), 1991, pp. 192-222.
[42] Mizzaro, S., "Relevance: The Whole History", Journal of the American Society for Information Science, 48(9), 1997, pp. 810-832.
[43] Nunnally, J. C. and Bernstein, I. H., Psychometric Theory, McGraw-Hill, New York, 1994.
[44] Park, H., "Relevance of Science Information: Origins and Dimensions of Relevance and Their Implications to Information Retrieval", Information Processing & Management, 33(3), 1997, pp. 339-352.
[45] Petty, R. E. and Cacioppo, J. T., "The Elaboration Likelihood Model of Persuasion", Advances in Experimental Social Psychology, 19, 1986, pp. 123-205, New York: Academic Press.
[46] Petty, R., Priester, J., and Wegener, D., "Cognitive Processes in Attitude Change", Handbook of Social Cognition, 1994, pp. 69-142, Hillsdale, NJ: Erlbaum.
[47] Rees, A. M. and Schultz, D. G., A Field Experimental Approach to the Study of Relevance Assessments in Relation to Document Searching, I: Final Report (NSF Contract No. C-423), Cleveland: Case Western Reserve University, 1967.
[48] Rooy, R. V., "Utility, Informativity, and Protocols", Proceedings of LOFT 5: Logic and the Foundations of the Theory of Games and Decisions, Torino, 2002.
[49] Saracevic, T., "The Concept of 'Relevance' in Information Science: A Historical Review", in Introduction to Information Science, 1970, pp. 111-151, New York: R. R. Bowker.
[50] Saracevic, T., "Relevance: A Review of and a Framework for the Thinking on the Notion in Information Science", Journal of the American Society for Information Science, 26(6), 1975, pp. 321-343.
[51] Saracevic, T., "Relevance Reconsidered '96", in P. Ingwersen and N. O. Pors (Eds.), CoLIS 2: Second International Conference on Conceptions of Library and Information Science, 1996, pp. 201-218, Copenhagen, Denmark: Royal School of Librarianship.
[52] Schamber, L., "Users' Criteria for Evaluation in a Multimedia Environment", Proceedings of the 54th Annual Meeting of the American Society for Information Science, 28, 1991, pp. 126-133.
[53] Schamber, L., "Relevance and Information Behavior", Annual Review of Information Science and Technology, 29, 1994, pp. 33-48.
[54] Schamber, L., Eisenberg, M. B., and Nilan, M. S., "A Re-examination of Relevance: Toward a Dynamic, Situational Definition", Information Processing and Management, 26(6), 1990, pp. 755-776.
[55] Sperber, D. and Wilson, D., Relevance: Communication and Cognition, Cambridge, MA: Harvard University Press, 1986.
[56] Spink, A., Greisdorf, H., and Bateman, J., "From Highly Relevant to Not Relevant: Examining Different Regions of Relevance", Information Processing & Management, 34, 1998, pp. 599-621.
[57] Su, L. T., "Is Relevance an Adequate Criterion for Retrieval System Evaluation: An Empirical Study into the User's Evaluation", Proceedings of the 56th Annual Meeting of the American Society for Information Science, 30, 1993, pp. 93-103.
[58] Tang, R. and Solomon, P., "Towards an Understanding of the Dynamics of Relevance Judgments: An Analysis of One Person's Search Behavior", Information Processing & Management, 34, 1998, pp. 237-256.
[59] Thompson, P. A., Brown, R. D., and Furgason, J., "Jargon and Data do Make a Difference", Evaluation Review, 5(2), 1981, pp. 269-279.
[60] Vakkari, P. and Hakala, N., "Changes in Relevance Criteria and Problem Stages in Task Performance", Journal of Documentation, 56, 2000, pp. 540-562.
[61] Wang, P. and Soergel, D., "A Cognitive Model of Document Use during a Research Project. Study I. Document Selection", Journal of the American Society for Information Science, 49(2), 1998, pp. 115-133.
[62] Wang, P. and White, M. D., "A Cognitive Model of Document Use during a Research Project. Study II.
Decisions at the Reading and Citing Stages", Journal of the American Society for Information Science, 50(2), 1999, pp. 98-144.
[63] Wang, R. Y. and Strong, D. M., "Beyond Accuracy: What Data Quality Means to Data Consumers", Journal of Management Information Systems, 12(4), 1996, pp. 5-34.
[64] Wilson, P., "Situational Relevance", Information Storage and Retrieval, 9(8), 1973, pp. 457-471.
[65] Wurman, R., Information Anxiety, Doubleday, New York, NY, 1989.

Appendix 1. Questionnaire

Construct / Item: Description (Mean, S.D.)

Topicality
TOP1: This document has a substantial amount of information about my current topic of interest. (4.86, 1.56)
TOP2: The content of this document is substantially about my current topic of interest. (5.07, 1.41)
TOP3: The topic of this document is substantially related to my current topic of interest. (5.44, 1.34)
TOP4: The topic of this document is within the domain of my current topic of interest. (5.68, 1.17)

Reliability
RELI1: I think the content of this document would be accurate. (5.44, 1.18)
RELI2: I think the content of this document would be consistent with the facts. (5.47, 1.07)
RELI3: I think the content of this document would be true. (5.49, 1.08)
RELI4: I think the content of this document would be reliable. (5.56, 1.11)

Understandability
UND1: Readers of my type should find this document very easy to read. (5.68, 1.34)
UND2: I am able to follow the content of this document with little effort. (5.86, 1.12)
UND3: The content of this document is easy to understand. (5.82, 1.15)
UND4: After reading it, I am very clear about the main content of this document. (5.94, 1.05)

Novelty
NOV1: This document has a substantial amount of new information to me. (4.80, 1.45)
NOV2: This document has a substantial amount of unique information that I come across for the first time. (4.46, 1.63)
NOV3: The content of this document is different from what I have read before. (3.98, 1.53)
NOV4: I have not read content similar to this document before. (3.81, 1.45)

Scope
SCO1: The content of this document is either too general or too specific for me. (3.91, 1.33)
SCO2: The coverage of this document is either too broad or too narrow for me. (4.08, 1.42)
SCO3: This document gives either too many or too few details compared with what I expected. (3.78, 1.58)

Relevance
RELE1: This document has great value in meeting my need. (4.59, 1.40)
RELE2: This document is satisfactory in meeting my need. (4.73, 1.28)
RELE3: This document is very pertinent to my need. (4.65, 1.30)
RELE4: This document is helpful for solving my problem at hand. (4.76, 1.51)
RELE5: I would make use of this document. (5.04, 1.56)