Measuring the Effectiveness of Cluster Representation Labels for Browsing Document Collections

Michael J. Cole
School of Communication, Information and Library Studies
Rutgers University
12 May 2005

Abstract

Effective browsing in clustered document collections requires interface cluster representations that are meaningfully labelled. This pilot study attempts to measure the effectiveness of interface labels in supporting user browsing activity over a collection of clustered news stories, using recall-based performance measures. Experiments with 18 subjects were unable to show either that automatically generated labels can be distinguished by their effectiveness from baseline labels provided by humans, or that label sets generated by different algorithms can be distinguished from one another. Exploration of the results suggests that user and task effects may have dominated any label effects that exist. Designs for future experiments to test these potential explanations for the results are discussed.

1. Introduction

Conceptual browsing of large collections of documents and other information objects is a desirable alternative to specified search. The development of effective browsing systems depends on the creation of sensible subcollections using automated indexing and clustering systems. It also requires an effective user interface to support navigation and selection of clusters.

An important problem for the cluster browsing interface is the quality of the cluster representation labels (Chen, Houston, Sewell, & Schatz, 1998; Lagus & Kaski, 1999; Chen, 2000; Rauber, 2000; Popescul & Ungar, 2000; Schweighofer, Rauber, & Dittenbach, 2001; Muresan, 2002; Azcarraga et al., 2004; Muresan & Harper, 2004). The effectiveness of a label can be defined in terms of the success of a user in selecting clusters containing documents relevant to a task. One labeling algorithm is more effective than another if it consistently produces labels that allow a user to perform better in selecting the clusters that hold relevant documents. Of course, the real-world performance of a user depends both on the effectiveness of the labels for the selection process and on the ability of the clustering system to generate meaningful clusters. This project concerns the problem of identifying which algorithms provide high quality labels for cluster representations. It reports on experiments using an instrument designed to measure the effect that automatically generated label sets have on user performance in selecting relevant clusters. More effective labeling algorithms will consistently produce label sets that allow users to find the largest number of documents relevant to their interests.

2. Related work

Relatively little research exists on the problem of producing high quality document cluster labels. In many cases the algorithms simply select words that determined the document clustering (Chen, Houston, Sewell, & Schatz, 1998; Lagus & Kaski, 1999; Chen, 2000; Rauber, 2000; Schweighofer, Rauber, & Dittenbach, 2001; Azcarraga et al., 2004). Popescul and Ungar (2000) develop two labeling techniques based on statistical properties of cluster word distributions.
Multidocument summarization (Goldstein et al., 2000; Hatzivassiloglou et al., 2001; Radev, Fan, & Zhang, 2001; Lawrie & Croft, 2001, 2003; Glover et al., 2002) has the goal of producing human-readable labels to represent document groups, but the descriptions produced are too lengthy to conform to interface constraints for labeling large numbers of clusters. Several researchers have conducted limited user studies on label preferences (Lin, Chen, & Nunamaker, 1999; Popescul & Ungar, 2000; Schweighofer, Rauber, & Dittenbach, 2001). These studies suffer from small sample sizes and other design problems. In some cases the user evaluation is anecdotal or consists of self-reported user preferences. In other cases the evaluation of label effectiveness took place only after the subject had experienced the cluster content, so these experiments tested the effectiveness of the interface and clustering system as a whole. Further, user reports of label preference after interaction with the cluster content cannot be related to label quality for making the initial selection. With the exception of Popescul and Ungar, the study designs were tightly bound to a clustering system and cannot compare the effectiveness of alternative labeling systems or ones that are independent of the clustering system.

The present work proposes an instrument capable of measuring just the label effect in a user's naive selection of one cluster representation over another in a browsing interface. Such an instrument can allow one to test many different labeling algorithms to see which of them produce high quality labels. It can also assist exploration of the impact of user situation, task, and collection on effective labeling.

3 Theory of label selection

In contrast to the collection-centered perspective on generating 'good' labels, this study observes that a user's initial basis for selecting a cluster in a browsing interface cannot involve knowledge of the cluster contents, though it must include a belief about the contents. A user can only select a labelled cluster representation from their own understanding of the meaning of the labels. A congenial theoretical foundation for this user-centered perspective is to assume a user actively constructs some meaning for the words in the labels using their existing knowledge and an understanding of their situation, i.e. browsing for information that is relevant to some interest.

Work on the process by which users learn menus and other affordances in software interfaces is relevantly similar to the problem of understanding label selection in a browsing interface. Experimental support exists for a theory that users select labels because there is a semantic similarity between the label and the user task description (Polson & Lewis, 1990). If one has the task of printing a document to a file, then one is likely to select a menu item labelled 'print'. Selection of cluster labels is similar to application label selection when a user first confronts the interface. First, the user must rely on existing knowledge and previous experience to evaluate the label. Second, the precise result of selecting the label is unknown. Finally, selection of the label and subsequent experience with the object or process represented by the label alters the meaning of the label for the user. There are also significant differences between application interface labels and document cluster labels, and those differences turn on basic issues for IR.
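To make the semantic similarity account of label selection concrete, the following minimal sketch scores candidate labels by simple word overlap with a task description. It is purely illustrative: the tokenizer, the overlap score, and the example task and labels are assumptions introduced here for exposition, not part of the experimental instrument, and word overlap is only a crude stand-in for the similarity judgment a user actually makes.

# Illustrative sketch only: a crude word-overlap stand-in for the semantic
# similarity judgment described above. Nothing here reproduces the
# experimental instrument used in this study.

def tokenize(text: str) -> set[str]:
    """Lowercase a text and return its set of word tokens."""
    words = (w.strip('.,;:!?"\'()') for w in text.lower().split())
    return {w for w in words if w}

def label_similarity(task_description: str, label: str) -> float:
    """Fraction of label words that also occur in the task description."""
    task_words = tokenize(task_description)
    label_words = tokenize(label)
    if not label_words:
        return 0.0
    return len(label_words & task_words) / len(label_words)

# Hypothetical task and menu labels, echoing the 'print' example above.
task = "print the current document to a file"
labels = ["print document", "page setup", "save file as", "insert table"]
ranked = sorted(labels, key=lambda lab: label_similarity(task, lab), reverse=True)
print(ranked)  # a label sharing words with the task description ranks first

On this account, a cluster label would be favoured to the extent that its words overlap with, or are semantically close to, the words the user associates with the browsing task.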
For clusters, the labels are presumed to stand for the content of the clusters, and it is this correspondence that underwrites the user's belief that in selecting the labelled representation they are selecting relevant content in the cluster. In contrast, an application label is a navigation point that becomes associated with a function. It does not in some useful way stand for the content of the process activated when the label is selected. The production of the user expectation and the process by which the label terms are given meaning can be highly user- and task-dependent. Still, it seems reasonable to see the situation of confronting novel cluster representations as similar to working with a new software interface. In both cases, the user must construct meaning for the labels to decide which object to select. The experimental design for the study, and the justification for the measurement and its analysis, depend on this idea that the user recognizes some semantic similarity between the label and a task description. Labels that include words known by the user to be germane to the immediate interest and task are more likely to be selected than those that contain fewer such words, words with less semantic similarity, or no such words.

4 The Problem of Task

This study uses tasks (Appendix 2) that appear to be similar, differing only in the general topic area they address, e.g. interest in crime stories vs. business stories. Creation of appropriately similar tasks is a significant challenge, especially as the browsing interest area is made more specific, because familiarity differences will have greater impact on the word pools used to recognize label semantic similarity. General topic area tasks can be expected to suffer less from these effects since different users can be expected to have at least some familiarity with broad topic areas.

The clustered collections themselves can be sensitive to user familiarity. If one does not know where Azerbaijan is located, or that it is a country, it will be difficult to know if such a label is a hint that the stories in the cluster are relevant to an interest in Asian affairs. Likewise, user understanding of the collection domain will affect label meaning production because users will recognize certain collections as having characteristic corpora and will attempt to modify their understanding of the task description to match the likely description in the collection domain. So, for users familiar with the medical domain, an interest in “heart” will be biased towards an interest in “cardio-” when one is aware the clusters represent documents from PubMed.

Even apart from domain recognition effects on the user, the generality of the task description may be an important factor in label selection effectiveness. If tasks are suitably broad, it seems that users with high and low familiarity may do nearly as well in label selection tasks because each may have decent familiarity with several parts of the domain, and the criterion of what is potentially interesting to the user includes many documents of differing levels of specific usefulness. So an introductory article about the structure of the Japanese government may be of low interest to one familiar with politics in Japan, but of high interest to someone less familiar. This tendency for the distribution of relevant, interesting documents to be correlated with users of differing familiarity in broad domain areas should tend to cancel out performance differences between the two groups.
So a lack of control on user familiarity and user-task interactions can be expected to be mitigated for suitably broad browsing interests. However, this lack of control on user familiarity can be expected to become a greater difficulty, threatening the validity of label effect measurement, when more specific tasks are considered. Some users with specific familiarity can be expected to have much deeper pools of task-description words to draw upon when making semantic similarity judgments. Extending the results of this work to investigation of user domain familiarity is important for understanding the limits of the general applicability of a labeling algorithm. It is also of great interest for looking at the impact of user familiarity with specific domains when document clusters from several domains are presented in the interface. Since the goal of the project is to distinguish automated labeling systems with good performance across various user classes, there is no attempt to control for this variability in the design and evaluation of the current experiments. The specific effects of user familiarity and of collection and clustering algorithm characteristics on label effectiveness are a rich direction for future research.

5 Measurement of Label Effect on User Performance

A user browsing for information in a large document collection wishes to encounter documents that are relevant to the task and situation. Systems that support the user in coming into contact with a greater number of relevant documents will likely be judged as better-performing systems. It is also the case that a user typically browses until the task is complete, some other activity supersedes the search, or little or no success is experienced in finding relevant documents. So, a system designed to support browsing must also acknowledge the importance of efficiency in real-world browsing situations. These considerations make recall-based measures of user performance appropriate, and support the view that label effectiveness can be measured by the total number of relevant documents a user can access by selecting some limited number of clusters. This study uses recall-based measures of label effectiveness, and it also acknowledges the role of efficiency. While a user may be content to explore freely, a system that provides access to many relevant documents early in the browsing process is likely to lead to more positive user assessments of the system. So efficiency can be correlated with the amount of effort, in the form of the number of clusters selected and explored, required to find interesting documents. Efficiency measurements to gauge label effect on user selection of clusters are not explicitly addressed in this pilot study; however, the experimental design does acknowledge the importance of efficiency by restricting the number of clusters to be selected.

In practical systems, the clusters are assumed to have been generated without knowledge of specific users or their particular interests. From the perspective of a user-centered browsing system, the issue is how well the clusters match the general and specific interests of the user. Some research suggests that when people cluster information objects the results are quite different from those provided by automated indexing systems (Macskassy et al., 1998).

Two specific measures of user performance in selecting clusters are used in the pilot study: the sum of the number of relevant stories in the selected clusters, and the number of selected clusters that are relevant to the task.
In both cases, the measures are determined strictly by the clusters selected by the user. In the first measure, the number of task-relevant stories contained by each selected cluster is summed over the clusters selected by the user. The second measure is a simple count of the selected clusters that are relevant to the task. The second measure is useful only in the context of the experiments to measure label effect, where the contents of the clusters are known to be all relevant to some topic or concept. In the current study the clusters have been created by human assessors and all stories are relevant to the cluster topic. This will not apply when clusters are generated automatically because more or less imperfect partitions of the collection are inevitable. The first measure, a count of the relevant stories in the selected clusters, is appropriate for testing labelling algorithm effectiveness in systems using real-world clustering applications.

6 Research Questions and Hypotheses

There are many interesting research questions regarding the properties of effective cluster labels, for example: How does user performance in selecting relevant clusters change as a function of the number of words in a label? Is it better to have words or phrases? Do nouns work better than a mixture of word types? For this work, the research question is whether the experimental instrument can measure the effectiveness of different label sets in supporting user selection of relevant clusters. The specific hypotheses are:

H1a. Automated labeling algorithms can be distinguished from one another by the effectiveness of the generated label sets in supporting user selection of relevant clusters.

H1b. Automated labeling algorithms can be placed in a rank order of effectiveness.

H2. Labels produced by automated algorithms are less effective than high quality labels produced by humans in supporting user selection of relevant clusters.

7 The Evaluation Model

A general linear model is adopted for this study. The performance of the user in selecting relevant clusters, which are then counted independently of further user activity, depends on scanning the interface and selecting the labels that are meaningful for the task. The selection of one cluster is independent of another.

R = S_effect + U_effect + T_effect + r0

where U = user, T = task, S = interface system (label set used), r0 = constant, and R = the performance score. The evaluation will consider only main effects.

8 Instruments

Figure 1: Screen shot of interface with baseline labels.

The primary instrument is a simple interface (Figure 1) to present the cluster representations with the experimental label sets. The subject is able to select or deselect clusters by simply clicking anywhere in the circle. A translucent blue halo shows the mouse has entered the cluster representation area and an opaque red ring is fixed when the cluster is selected. The tasks and the label sets applied to the cluster representations change in the experiment, and the cluster positions are randomized when a new label set is presented.

Selection process logger

The interface automatically records the subjects' gestures with the mouse during the selection process. The mouse log records the cursor position in 50 msec slices as well as all events, including click events. The log can be played back to control the cursor and see the process of user selection of clusters. This includes the order of cluster selection, any cases where the subject changed their mind, and incidents of hesitation.
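As an illustration of how such a log might be analyzed, the sketch below assumes a simple record layout of (timestamp in milliseconds, cursor x, cursor y, event type); the actual log format used by the instrument is not specified here, so the layout, field names, and example values are assumptions made only for exposition.

# Minimal sketch, assuming a hypothetical log record layout of
# (timestamp_ms, x, y, event); the real logger's format may differ.
from dataclasses import dataclass

@dataclass
class LogRecord:
    timestamp_ms: int   # time since logging started, sampled every 50 ms
    x: int              # cursor position
    y: int
    event: str          # e.g. "move", "click", "start", "stop"

def elapsed_time_ms(records: list[LogRecord]) -> int:
    """Total elapsed time from the first to the last record."""
    return records[-1].timestamp_ms - records[0].timestamp_ms if records else 0

def click_sequence(records: list[LogRecord]) -> list[tuple[int, int]]:
    """Positions of click events, in order, e.g. to recover selection order."""
    return [(r.x, r.y) for r in records if r.event == "click"]

# Example: a short log with a 50 ms sampling interval and two clicks.
log = [
    LogRecord(0, 10, 10, "start"),
    LogRecord(50, 120, 80, "move"),
    LogRecord(100, 122, 82, "click"),
    LogRecord(150, 300, 200, "move"),
    LogRecord(200, 305, 204, "click"),
    LogRecord(250, 305, 204, "stop"),
]
print(elapsed_time_ms(log), click_sequence(log))  # 250 [(122, 82), (305, 204)]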
The total elapsed time can be calculated, so a keystroke-level modeling (Card, Moran, & Newell, 1983) analysis of ease of use can be applied to the user's process of label selection. These logs can also be used to look at characteristics of label selection by users with more or less general familiarity with news events.

9 Corpus

The clusters are from the National Institute of Standards and Technology (NIST) Topic Detection and Tracking (TDT3) corpus (Cieri, Graff, Liberman, Martey, & Strassel, 2000). It consists of over 60,000 news stories from wire services, and television and radio transcriptions. Both English and Mandarin news stories on a variety of topics, including economics, world affairs, and sports, are represented in the collection. Each story has been hand scored for its relevance to a set of 113 topics. Every member of a cluster includes references that are directly related to the topic. A story appears in at most one cluster. An example of a topic description used to cluster the stories is given in Appendix 1. The baseline labels for this study are the topic names taken from the descriptions used to assign the topicality of each story.

10 Automated labeling algorithms

Within the basic interface, three label sets are applied. One label set is the baseline, the 'gold standard' human-generated labels taken from the TDT3 corpus topic descriptions. The other two label sets can be drawn from any labeling algorithm. For the pilot study the label sets were generated from a chi-square test of word commonality and from the product of word frequency and predictiveness (Popescul & Ungar, 2000). These algorithms were selected because Popescul and Ungar claim they perform better than simple alternatives, such as taking the most frequent terms, and because they found users had a clear preference for the labels produced by the word frequency-predictiveness technique.

11 Tasks

Six topic-driven browsing tasks (Appendix 2) were constructed that addressed topic categories within which the TDT topic clusters would fit. So all of the stories in a topic cluster relevant to a task were deemed relevant to the task. An example of a task description is:

Imagine you are interested in Chinese affairs. Please select three clusters that seem interesting. When you have finished click on the button at the top to advance to the next page.

Out of the 113 scored TDT topics, 35 were selected at random for each experiment. For the pilot study, the topics used in each experiment were selected so that there was no overlap between the cluster sets. All of the stories in a given topic were deemed relevant to a task if the topic was relevant to the task. The total recall for task 1 in the selected topic clusters is 211 stories, but since the subject selects three clusters the maximum recall possible is 170 (the sum of the number of relevant stories in the top three clusters). A subject selecting the top three clusters would be assigned a performance value of 1. One selecting clusters holding 127 relevant documents would be assigned a performance value of about 0.75 (127/170). In the results, the performance of the label set is compared against this maximum recall possibility to mitigate the effect of uneven story distribution in the clusters. Uneven relevant story distribution within topics, that is, the fact that only a subset of stories will be relevant to a topic, is characteristic of information environments, but no effort has been made to correct for this variability in the experimental design.
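As a concrete illustration of this recall-based score, the following minimal sketch normalizes the number of relevant stories in the selected clusters by the best total obtainable from any three clusters, and also counts relevant clusters for the second measure. The per-cluster counts below are invented for the example; only the totals 170 and 127 echo the figures quoted above.

# Minimal sketch of the recall-based performance score described above.
# The per-cluster counts are invented; only the totals 170 and 127
# correspond to the worked example in the text.

def performance_score(selected_counts: list[int], all_counts: list[int], k: int = 3) -> float:
    """Relevant stories in the selected clusters, normalized by the
    maximum obtainable from any k clusters."""
    best_possible = sum(sorted(all_counts, reverse=True)[:k])
    return sum(selected_counts) / best_possible if best_possible else 0.0

def relevant_cluster_count(selected_relevance: list[bool]) -> int:
    """Second measure: how many of the selected clusters are task-relevant."""
    return sum(selected_relevance)

# Hypothetical relevant-story counts for the clusters shown for one task.
relevant_per_cluster = [80, 50, 40, 25, 12, 7, 4]                 # top three sum to 170
perfect = performance_score([80, 50, 40], relevant_per_cluster)   # 170/170 = 1.0
partial = performance_score([80, 40, 7], relevant_per_cluster)    # 127/170, about 0.75
print(round(perfect, 2), round(partial, 2), relevant_cluster_count([True, True, False]))
# -> 1.0 0.75 2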
12 Participants

A convenience sample of 18 undergraduate (N=9) and graduate (N=9) students from a large Central Atlantic university was recruited. No participants were excluded. It was observed that four of the graduate students and one undergraduate were not native English speakers.

13 Treatment plan

One challenge to the experimental design, the selection of tasks, has been explored. Another challenge is the learning effect due to repeated exposure of the subject to the same clusters with different labels. Ideally, one would like to have the same user interact with each interface for a given task. Since the interfaces vary only in the labels used to label the clusters, a significant label effect could then be detected directly. Unfortunately, it is likely the user remembers some of the labels from the previous interface when working in the next one. This learning effect increases the number of labels available to the user for semantic similarity evaluations and is a threat to detecting the specific effectiveness of a label set. To mitigate both learning and the task effects noted earlier, a Latin Square Design (LSD) (Tague-Sutcliffe, 1997) is adopted to assign treatments. In adopting this design, user-task interactions have been assumed to be unimportant. All subjects perform each task exactly once and every subject experiences each label system. The basic block for the experimental assignment is given in Table 1.

Table 1: A sample treatment block

            Baseline         Label set 1      Label set 2
Subject 1   task 2, task 1   task 3, task 4   task 6, task 5
Subject 2   task 6, task 5   task 1, task 2   task 4, task 3
Subject 3   task 3, task 4   task 5, task 6   task 1, task 2

For the current experiment this block is repeated six times, using different subjects for each block but with the same label sets and tasks. The experiment is repeated using the same subjects with new clusters and label sets created by the respective labeling algorithms. When each of the treatments was presented, the order of the tasks was randomized, as was the order of seeing the label sets when the user was exposed to the second cluster set. Between subjects the order of the label sets was randomized; that is, Subject 1 would not necessarily see the label sets in the same order as Subject 2.

14 Experimental Procedure

Subjects answered a pretest questionnaire (Appendix 3) about their sources of news and news consumption intensity. The purpose was to gain some understanding of the subject's familiarity with world news events without introducing specific words or phrases that could prime the subject and so interfere with the label selection process. Subjects who consume high volumes of news and commentary are likely to have greater familiarity with a randomly selected topic. Greater familiarity is assumed to be associated with a broader pool of words in the subject's understanding of the browsing interest task description, improving the probability of recognizing semantic similarities with label words.

The subjects were then given a short tutorial on the interface and encouraged to work with it until they were comfortable. The subjects were aware that the purpose of the experiment was to improve access to archived news stories and that mouse events were being logged. All subject data was made anonymous. Each subject was presented with a sequence of tasks in labeled interfaces. For each task, the subject would start the event logger, select three clusters, and stop the event logger.
They were told each task would take at most 3-5 minutes to complete, but were also told they could take as long as needed to complete the tasks and associated questionnaires. After completing the two assigned tasks in one of the system interfaces, a short questionnaire (Appendix 4) was administered asking about the subject's confidence in the selections made and their opinion of the ease of use of the interface. This information can support event log analysis and can also be used to further probe the impact of user familiarity on label effect and user satisfaction with the labels. This entire process is repeated three times for the cluster set. The process of treatment with three interfaces and six tests was repeated with a different random selection of clusters and with label sets from the TDT topic descriptions and the chi-square and frequency-predictiveness algorithms. Completion of each cluster selection process usually takes less than two minutes, so even though the process is repeated 12 times the experimental design should be resistant to reduced user engagement due to fatigue or boredom. Subjects usually complete the interface questionnaire in less than 15 seconds.

15 Pilot Study Results

The performance of the user in selecting clusters with the greatest number of relevant documents is presented graphically in Figure 2. The box plots show no clear differences between the label sets. There were wide variations in performance, with mean values around 50%; in no case were the subjects, on average, able to select three clusters that would access more than 50% of the relevant documents available. An ANOVA analysis (Table 2) of the model shows significant main effects for the task, the user, and the cluster set used. Task (F=5.355, p<.05) and user (F=1.784, p<.05) were most significant. There was no significant main effect observed for the label sets. The model has an adjusted effect size (adjusted R squared) of just 0.151, so a great deal of the observed variability is unexplained by the model.

Table 2: Model for relevant documents user performance

Tests of Between-Subjects Effects. Dependent variable: performance

Source            Type III SS   df    Mean Square   F         Sig.   Partial Eta Sq.   Noncent.   Observed Power(a)
Corrected Model   6.723         25    0.269         2.531     .000   .250              63.278     .999
Intercept         44.757        1     44.757        421.270   .000   .689              421.270    1.000
SUBJECT           3.222         17    0.190         1.784     .032   .138              30.328     .943
TASK              2.845         5     0.569         5.355     .000   .124              26.776     .988
CSET              0.518         1     0.518         4.872     .028   .025              4.872      .593
LSET              0.138         2     0.069         0.651     .523   .007              1.302      .158
Error             20.186        190   0.106
Total             71.666        216
Corrected Total   26.909        215
a. Computed using alpha = .05
b. R Squared = .250 (Adjusted R Squared = .151)

A similar result is observed using the second performance measure (Table 3). Again, subject (F=2.96, p<.05), task (F=6.24, p<.05), and cluster set (F=6.30, p<.05) are all significant contributors to the model, while the label sets are not significant. The adjusted effect size of the model is somewhat better (0.348).

Table 3: Analysis of variance model for user performance in selecting relevant clusters

Tests of Between-Subjects Effects. Dependent variable: CRELCORR

Source            Type III SS   df    Mean Square   F         Sig.
Corrected Model   11.415        29    0.394         4.960     .000
Intercept         23.675        1     23.675        298.323   .000
SUBJECT           3.996         17    0.235         2.962     .000
TASK              2.478         5     0.496         6.244     .000
CSET              0.500         1     0.500         6.300     .013
LSET              0.375         2     0.188         2.364     .097
CRELAVAI          3.359         4     0.840         10.583    .000
Error             14.761        186   0.079
Total             96.926        216
Corrected Total   26.176        215
a. R Squared = .436 (Adjusted R Squared = .348)

Examination of boxplots of user performance for both selection of relevant stories and selection of relevant clusters provides similar results (Figures 2 and 3), as expected. The two performance measures are closely linked, but it is possible that a subject could select three relevant clusters that contained few documents and so score poorly on the relevant-documents recall measure but perfectly on the relevant-clusters measure. The differences in the performance measures in the boxplots show that this is not a significant issue, and one can see that the same relationships between the observed means for each label set are preserved.

Figure 2: User performance measured by number of relevant documents selected

Figure 3: User performance by number of relevant clusters selected

An examination of the user and task differences (Appendix 5) shows no distinguishable differences between users or tasks.

16 Discussion

The results do not support rejection of the null hypotheses for the instrument validation hypotheses.

Figure 4: User performance in selecting relevant stories by task

Some level of label effectiveness must exist; after all, random words used as labels would cause user performance to be no better than chance. In view of the theoretical background for label selection, it seems that the influence of user familiarity and task may be stronger than anticipated. So the calculated model, dominated by task and user main effects, might be taken at face value. Figure 4 presents user performance in selecting relevant stories for the various tasks by labelling algorithm.
Rather consistently high user performance for tasks 4 (Chinese affairs) and 5 (European and Russian affairs), as compared to tasks 3 (revolts, terrorism, etc.) and 6 (violent crimes and personal tragedies), suggests there may be significant differences in user recognition of labels that cut across labelling algorithms. A look at the labels used shows a predominance of geographic place names for the clusters relevant to tasks 4 and 5. The labels for clusters relevant to tasks 3 and 6 contain many nouns, but these nouns do not obviously link to the task as cleanly as they do for tasks 4 and 5.

User familiarity with a domain may also explain why no label effect could be observed. The goal of the pretest questionnaire is to identify subjects with broad familiarity with news domains without priming them with specific words that might affect the experiment. Tests of this questionnaire instrument included a group of editors (N=8) at an on-line newspaper owned by one of the major broadcast networks. This group is omnivorous, consuming news from many different types of sources, while the other respondents, on average, made intense use only of newspapers and broadcast news. However, the other respondents were highly variable in their news consumption habits, and a K-means clustering analysis revealed a group (4 out of 29) with a profile very similar to that of the news editors. An analysis of the pretest questionnaires shows that no fewer than four of the subjects in the experiment had a similar news consumption profile and therefore can be expected to have high familiarity with news events and so greater domain knowledge to draw on in selecting cluster labels. Figure 7 compares task performance by labelling algorithm, partitioned by intensity of news consumption (4 or more sources used within the last week); it shows that the chi-square label scores had much less variability than any other task. This exploratory work provides some evidence of user familiarity impact on user browsing performance.

There are significant limitations in the study due to the corpus used. News stories have distinctive characteristics as compared to other texts. The age of the news stories in the corpus may also be a problem because many subjects may not remember news events from five years ago, and so valuable hints, especially in proper names, that appear in the labels will be less helpful than they might be in clusters of contemporary stories. A significant proportion of the subjects spoke English as a second language, and their experience with news in their native countries during the 1999-2000 period could be very different from that of those living in the US at the time. While the corpus included many stories from the Chinese news agencies, there was a bias towards reports, and so subject matter, from US sources.

Figure 5: News consumption by source - non-editors

Figure 6: Frequency of news consumption by news source - editors

Figure 7: Comparison of task performance between intense news consumers and those less intense

17 Conclusion

An instrument has been developed to detect and measure the effect of labels on user selection of document clusters. Experiments have failed to show significant label effects on user cluster selection performance in support of a task or browsing interest. Exploratory analysis suggests task and user familiarity effects may have swamped any label effect in the experiments. Future experiments with this instrument will incorporate these lessons and take account of task by using different tasks that suggest nouns vs. mixtures of various parts of speech. Chen (2000) has claimed that noun phrases provide more effective labels.
Regarding user familiarity, experiments will be formulated that assign subjects on the basis of the intensity of their news consumption. If strong user and task effects on label selection are confirmed, the importance of finding effective labelling algorithms that are independent of clustering algorithms is increased. Improved browsing performance may depend on the ability to dynamically relabel existing clusters to take account of the user's domain knowledge and task. Identification of labelling algorithms that have low computational demands is especially important if the task and/or user familiarity effects are confirmed.

References

Azcarraga, A. P., and Yap, T. N. (2001). Extracting meaningful labels for WEBSOM text archives. In Proceedings of the Tenth International Conference on Information and Knowledge Management (CIKM '01) (Atlanta, GA). ACM Press, New York, NY, 40-48.

Card, S. K., Moran, T. P., and Newell, A. (1983). The Psychology of Human-Computer Interaction. Lawrence Erlbaum Associates, Hillsdale, NJ.

Chen, H. (2000). High-performance Digital Library Classification Systems: From Information Retrieval to Knowledge Management. Retrieved November 7, 2003, from http://www.dli2.nsf.gov/projects/chen.pdf

Chen, H., Houston, A., Sewell, R., and Schatz, B. (1998). Internet browsing and searching: User evaluations of category map and concept space techniques. Journal of the American Society for Information Science, 49(7), 582-603.

Cieri, C., Graff, D., Liberman, M., Martey, N., and Strassel, S. (2000). Large multilingual broadcast news corpora for cooperative research in topic detection and tracking: The TDT2 and TDT3 corpus efforts. In Proceedings of the Second International Language Resources and Evaluation Conference, Athens, Greece, May 2000. Retrieved March 17, 2005, from http://papers.ldc.upenn.edu/

Lagus, K., and Kaski, S. (1999). Keyword selection method for characterizing text document maps. In Proceedings of the Ninth International Conference on Artificial Neural Networks (ICANN99), pp. 371-376. IEE, London.

Lawrie, D., Croft, W. B., and Rosenberg, A. (2001). Finding topic words for hierarchical summarization. SIGIR 2001 (New Orleans, LA), pp. 349-357.

Lin, C., Chen, H., and Nunamaker, J. (1999). Verifying the proximity and size hypothesis for self-organizing maps. Journal of Management Information Systems, Winter 1999/2000.

Macskassy, S. A., Banerjee, A., Davison, B. D., and Hirsh, H. (1998). Human Performance on Clustering Web Pages. Technical Report DCS-TR-355, Department of Computer Science, Rutgers University. http://citeseer.ist.psu.edu/article/macskassy98human.html

Muresan, G. (2002). Using Document Clustering and Language Modelling in Mediated Information Retrieval. Ph.D. thesis, School of Computing, The Robert Gordon University, Aberdeen, Scotland, January 2002.

Muresan, G., and Harper, D. (2004). Topic modeling for mediated access to very large document collections. Journal of the American Society for Information Science, 55(10), 892-910.

Polson, P. G., and Lewis, C. H. (1990). Theory-based design for easily learned interfaces. Human-Computer Interaction, 5(2-3), 191-220.

Popescul, A., and Ungar, L. (2000). Automatic Labeling of Document Clusters. Unpublished manuscript. Retrieved September 23, 2004, from http://www.citeseer.com/

Radev, D. R., Fan, W., and Zhang, Z. (2001). WebInEssence: A personalized web-based multi-document summarization and recommendation system. In NAACL 2001 Workshop on Automatic Summarization (Pittsburgh, PA), 79-88.
Rauber, A., Schweighofer, E., and Merkl, D. (2000). Text Classification and Labelling of Document Clusters with Self-Organizing Maps. Journal of the Austrian Society for Artificial Intelligence (ÖGAI), 13(3), 17-23.

Rauber, A. (2000). LabelSOM: On the labeling of self-organizing maps. In Proceedings of the SIGCHI conference on Human factors in computing systems (CHI ’00) (The Hague, The Netherlands, April 1-6, 2000). ACM Press, New York, NY, 526-531.

Schweighofer, E., Rauber, A., and Dittenbach, M. (2001). Improving the quality of labels for self-organizing maps using fine-tuning. In Proceedings of the DEXA Workshop on Legal Information Systems and Applications (LISA '01), Vienna, Austria.

Soto, R. (1999). Learning and performing by exploration: Label quality measured by latent semantic analysis. In Proceedings of the SIGCHI conference on Human factors in computing systems: The CHI is the limit (Pittsburgh, PA, May 15-20, 1999).

Tague-Sutcliffe, J. (1997). The pragmatics of information retrieval experimentation, revisited. In K. Sparck Jones and P. Willett (Eds.), Readings in Information Retrieval, pp. 205-216. San Francisco, CA: Morgan Kaufmann. [Originally published in Information Processing & Management, 28 (1992), 467-490.]

Appendix 1: TDT Topic Description

30003. Pinochet Trial

Seminal Event
WHAT: Pinochet, who ruled Chile from 1973-1990, is arrested on charges of genocide and torture during his reign.
WHO: Former Chilean dictator General Augusto Pinochet; Judge Baltasar Garzon ("Superjudge")
WHERE: Pinochet is arrested and held in London, then later extradited to Spain.
WHEN: The arrest occurs on 10/16/98; court negotiations last the rest of the year.

Topic Explication
Pinochet was arrested in a London hospital on a warrant issued by Spanish Judge Baltasar Garzon. Pinochet appealed his arrest and a London court agreed, but the decision was overturned by Britain's highest court. After much legal wrangling over the site of the trial, the British Courts ruled that Spain should proceed with the extradition request; Pinochet continues to fight it.

On topic: stories covering any angle of the legal process surrounding this trial (including Pinochet's initial arrest in October, his appeals, British Court rulings, reactions of world leaders and Chilean citizens to the trial, etc.). Stories about Pinochet's reign or legacy are not on topic unless they explicitly discuss this trial.

Rule of Interpretation
Rule 3: Legal/Criminal Cases

Appendix 2: Task Descriptions

TASK 1: Imagine you are interested in business and the economy (both domestic and international). Please select three clusters that seem interesting. When you have finished click on the button at the top to advance to the next page.

TASK 2: Imagine you are interested in domestic politics. Please select three clusters that seem interesting. When you have finished click on the button at the top to advance to the next page.

TASK 3: Imagine you are interested in protests, terrorism, civil wars and revolts around the world. Please select three clusters that seem interesting. When you have finished click on the button at the top to advance to the next page.

TASK 4: Imagine you are interested in Chinese affairs. Please select three clusters that seem interesting. When you have finished click on the button at the top to advance to the next page.
TASK 5: Imagine you are interested in European and Russian affairs. Please select three clusters that seem interesting. When you have finished click on the button at the top to advance to the next page.

TASK 6: Imagine you are interested in stories about violent crimes and personal tragedies, especially those that result in injury or death. Please select three clusters that seem interesting. When you have finished click on the button at the top to advance to the next page.

Appendix 3: Pre-experiment questionnaire

Please make a mark on the line closest to your answer.

1. When did you last read a newspaper for national or international news?
2. When did you last watch an evening news program?
3. When did you last read a contemporary issues magazine, such as Time, Newsweek, The Economist, The National Review, The Nation, etc.?
4. When did you last read an on-line newspaper, such as The New York Times, CNN, AP wire, CBS, MSNBC, etc. for national or international news?
5. When did you last read a blog or on-line discussion group that concerned some national or international news story?
6. When did you last listen to a talk radio show that was concerned with national or international news and events?
7. When did you last read a non-fiction book that referenced news events or concerned contemporary issues?

Thank you!

Appendix 4: Post-task questionnaire

Task Questionnaire: Please make a mark in the circle that best describes your feeling about the statement.

I had enough time to complete the task.
∘ ∘ ∘ ∘ ∘ ∘ ∘
agree strongly          neither agree nor disagree          disagree strongly

I am very confident about the selections made for this task.
∘ ∘ ∘ ∘ ∘ ∘ ∘
agree strongly          neither agree nor disagree          disagree strongly

This interface is extremely easy to use.
∘ ∘ ∘ ∘ ∘ ∘ ∘
agree strongly          neither agree nor disagree          disagree strongly

Thank you!

Appendix 5: User, Task and System Parameter Effects

Parameter estimates for user performance in selecting relevant documents (dependent variable: performance):

Parameter        B        Std. Error   t        Sig.    95% CI Lower   95% CI Upper   Partial Eta Sq.   Noncent.   Observed Power(a)
Intercept        0.227    0.113        2.007    .046    0.004          0.450          .021              2.007      .515
[SUBJECT=1]      0.195    0.133        1.462    .145    -0.068         0.457          .011              1.462      .307
[SUBJECT=2]      -0.012   0.133        -0.088   .930    -0.274         0.251          .000              0.088      .051
[SUBJECT=3]      0.051    0.133        0.384    .702    -0.211         0.314          .001              0.384      .067
[SUBJECT=4]      0.235    0.133        1.769    .078    -0.027         0.498          .016              1.769      .421
[SUBJECT=5]      0.104    0.133        0.785    .433    -0.158         0.367          .003              0.785      .122
[SUBJECT=6]      0.027    0.133        0.203    .839    -0.235         0.289          .000              0.203      .055
[SUBJECT=7]      -0.100   0.133        -0.752   .453    -0.363         0.162          .003              0.752      .116
[SUBJECT=8]      -0.013   0.133        -0.095   .925    -0.275         0.250          .000              0.095      .051
[SUBJECT=9]      0.003    0.133        0.024    .981    -0.259         0.266          .000              0.024      .050
[SUBJECT=10]     -0.063   0.133        -0.473   .637    -0.325         0.200          .001              0.473      .076
[SUBJECT=11]     -0.060   0.133        -0.447   .655    -0.322         0.203          .001              0.447      .073
[SUBJECT=12]     -0.143   0.133        -1.072   .285    -0.405         0.120          .006              1.072      .187
[SUBJECT=13]     -0.251   0.133        -1.888   .061    -0.514         0.011          .018              1.888      .467
[SUBJECT=14]     0.063    0.133        0.473    .637    -0.200         0.325          .001              0.473      .076
[SUBJECT=15]     0.156    0.133        1.170    .244    -0.107         0.418          .007              1.170      .214
[SUBJECT=16]     -0.048   0.133        -0.361   .719    -0.310         0.214          .001              0.361      .065
[SUBJECT=17]     -0.171   0.133        -1.285   .200    -0.433         0.091          .009              1.285      .248
[SUBJECT=18]     0(b)
[TASK=1]         0.142    0.077        1.848    .066    -0.010         0.293          .018              1.848      .452
[TASK=2]         0.152    0.077        1.984    .049    0.001          0.304          .020              1.984      .506
[TASK=3]         0.017    0.077        0.224    .823    -0.134         0.169          .000              0.224      .056
[TASK=4]         0.282    0.077        3.664    .000    0.130          0.433          .066              3.664      .954
[TASK=5]         0.297    0.077        3.862    .000    0.145          0.448          .073              3.862      .970
[TASK=6]         0(b)
[CSET=1]         0.098    0.044        2.207    .028    0.010          0.185          .025              2.207      .593
[CSET=2]         0(b)
[LSET=1.00]      0.035    0.054        0.650    .517    -0.072         0.142          .002              0.650      .099
[LSET=2.00]      0.062    0.054        1.137    .257    -0.045         0.169          .007              1.137      .205
[LSET=3.00]      0(b)
a. Computed using alpha = .05
b. This parameter is set to zero because it is redundant.

Appendix 7: News source survey (previous test version used for the news editor/other survey)

Please make a mark on the line closest to your answer.

1. When did you last read a newspaper for national or international news?
2. When did you last watch an evening news program?
3. When did you last read a contemporary issues magazine, such as Time, Newsweek, The Economist, The National Review, The Nation, etc.?
4. When did you last read an on-line newspaper, such as The New York Times, CNN, AP wire, CBS, MSNBC, etc. for national or international news?
5. When did you last read a blog or on-line discussion group that concerned some national or international news story?
6. When did you last listen to a talk radio show that was concerned with national or international news and events?
7. When did you last read a non-fiction book that referenced news events or concerned contemporary issues?

Thank you!