Measuring the Effectiveness of Cluster Representation Labels for Browsing Document Collections
Michael J. Cole
School of Communication, Information and Library Studies
Rutgers University
12 May 2005
Abstract
Effective browsing in clustered document collections requires interface cluster representations that are
meaningfully-labelled. This pilot study attempts to measure the effectiveness of interface labels in
supporting user browsing activity over a collection of clustered news stories with recall-based
performance measures. Experiments with 18 subjects were unable to show that automatically-generated
labels can be distinguished by their effectiveness from baseline labels provided by humans, or that label
sets generated by different algorithms can be distinguished from one another. Exploration of the results
suggests that user and task effects may have dominated any label effects that exist. Design for future
experiments to test these potential explanations for the results is discussed.
1. Introduction
Conceptual browsing of large collections of documents and other information objects is a desirable
alternative to specified search. The development of effective browsing systems depends on creation of sensible
subcollections using automated indexing and clustering systems. It also requires an effective user interface to
support navigation and selection of clusters.
An important problem for the cluster browsing interface is the quality of the cluster
representation labels (Chen, Houston, Sewell, & Schatz, 1998; Lagus & Kaski, 1999; Chen, 2000;
Rauber, 2000; Popescul & Ungar, 2000; Schweighofer, Rauber, & Dittenbach, 2001; Muresan, 2002;
Azcarraga, et al., 2004; Muresan & Harper, 2004). The effectiveness of a label can be defined in terms of
the success of a user in selecting clusters containing documents relevant to a task. One labeling
algorithm is more effective than another labeling algorithm if it consistently produces labels that allow a
user to perform better in the selection of clusters that hold relevant documents. Of course, the real-world
performance of a user depends on both the effectiveness of the labels for the selection process and the
ability of the clustering system to generate meaningful clusters.
This project concerns the problem of identifying which algorithms provide high quality labels for
cluster representations. It reports on experiments using an instrument designed to measure the effect the
automatically-generated label sets have on user performance in selecting relevant clusters. More
effective labeling algorithms will consistently produce label sets that allow users to find the largest
number of documents relevant to their interests.
2. Related work
Relatively little research exists on the problem of producing high quality document cluster labels. In
many cases the algorithms simply select words that determined the document clustering (Chen, Houston,
Sewell, & Schatz, 1998; Lagus & Kaski, 1999; Chen, 2000; Rauber, 2000; Schweighofer, Rauber, &
Dittenbach, 2001; Azcarraga, et al., 2004). Popescul and Ungar (2000) develop two labeling techniques based
on statistical properties of cluster word distributions. Multidocument summarization (Goldstein, et al. 2000;
Hatzivassiloglou, et al., 2001; Radev, Fan, & Zhang, 2001; Lawrie and Croft, 2001, 2003; Glover et al., 2002)
has a goal of producing human-readable labels to represent document groups, but the descriptions produced are
too lengthy to conform to interface constraints for labeling large numbers of clusters.
Several researchers have conducted limited user studies on label preferences (Lin, Chen, &
Nunamaker, 1999; Popescul & Ungar, 2000; Schweighofer, Rauber, & Dittenbach, 2001). These
studies suffer from small sample sizes and other design problems. In some cases the user evaluation is
anecdotal, or self-reported user preferences. In other cases the evaluation of label effectiveness took
place only after the subject had experienced the cluster content, so these experiments tested the
effectiveness of the interface and clustering system. Further, user reports of label preference after
interaction with the cluster content cannot be related to label quality for making the initial selection.
With the exception of Popescul and Ungar, the study designs were tightly bound to a clustering system
and cannot compare the effectiveness of alternative labeling systems or ones that are independent of the
clustering system.
The present work proposes an instrument capable of measuring just the label effect in a user's naive
selection of one cluster representation over another in a browsing interface. Such an instrument can allow one to
test many different labeling algorithms to see which of them produce high quality labels. It can also assist
exploration of user situation, task, and collection impact on effective labeling.
3 Theory of label selection
In contrast to the collection-centered perspective on generating 'good' labels, this study observes that a
user's initial basis for selecting a cluster in a browsing interface cannot involve knowledge of the cluster contents,
though it must include a belief about the contents. A user can only select a labelled cluster representation from
their own understanding of the meaning of the labels. A congenial theoretical foundation for this user-centered
perspective is to assume a user actively constructs some meaning for the words in the labels using their existing
knowledge and an understanding of their situation, i.e. browsing for information that is relevant to some interest.
Work on the process by which users learn menus and other affordances in software interfaces is
relevantly similar to the problem of understanding label selection in a browsing interface. Experimental support
exists for a theory that users select labels because there is a semantic similarity between the label and the user
task description (Polson & Lewis, 1990). If one has the task of printing a document to a file, then one is likely to
select a menu item labelled 'print'. Selection of cluster labels is similar to application label selection when a user
first confronts the interface. First, the user must rely on existing knowledge and previous experience to evaluate
the label. Second, the precise result of selecting the label is unknown. Finally, selection of the label and
subsequent experience with the object or process represented by the label alters the meaning of the label for the
user.
There are also significant differences between application interface labels and document cluster labels,
and those differences turn on basic issues for IR. For clusters, the labels are presumed to stand for the content of
the clusters, and it is this correspondence that underwrites the user's belief that in selecting the labelled
representation they are selecting relevant content in the cluster. In contrast, an application label is a navigation
point that becomes associated with a function. It does not in some useful way stand for the content of the process
activated when the label is selected. The production of the user expectation and the process by which the label
terms are given meaning can be highly user- and task-dependent. Still, it seems reasonable to see the situation of
confronting novel cluster representations as similar to working with a new software interface. In both cases, the
user must construct meaning for the labels to decide which object to select.
The experimental design for the study, and justification for the measurement and its analysis depends on
this idea that the user recognizes some semantic similarity between the label and a task description. Labels that
include words known by the user to be germane to the immediate interest and task are more likely to be selected
than those that contain fewer such words, words with less semantic similarity or no such words.
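To make the selection idea concrete, here is a toy sketch (my own illustration, not anything from the study or from Polson and Lewis) of scoring a label by the overlap between its words and the user's pool of task-related words; the word pool and label are invented examples.

```python
# Toy illustration (not from the study): a label is assumed more likely to be chosen
# the more its words overlap the user's pool of task-related words.
def label_affinity(label_words, task_word_pool):
    label_words = {w.lower() for w in label_words}
    return len(label_words & task_word_pool) / max(len(label_words), 1)

# Hypothetical task pool for "interested in Chinese affairs" and a candidate label.
task_pool = {"china", "chinese", "beijing", "taiwan", "government"}
print(label_affinity(["Beijing", "trade", "talks"], task_pool))  # 1 of 3 words match
```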
4 The Problem of Task
This study uses tasks (Appendix 2) that appear to be similar, differing only in the general topic area they
address, e.g. interest in crime stories vs. business stories. Creation of appropriately similar tasks is a significant
challenge, especially as the browsing interest area is made more specific, because familiarity differences will
have greater impact on the word pools used to recognize label semantic similarity. General topic area tasks can be
expected to suffer less from these effects, since different users are likely to share at least a basic familiarity with broad topic areas.
The clustered collections themselves can be sensitive to user familiarity. If one does not know where
Azerbaijan is located, or that it is a country, it will be difficult to know if such a label is a hint that the stories in
the cluster are relevant to an interest in Asian affairs. Likewise, user understanding of the collection domain will
affect label meaning production because users will recognize certain collections as having characteristic corpora
and will attempt to modify their understanding of the task description to match the likely description in the
collection domain. So, for users familiar with the medical domain, an interest in “heart” will be biased towards an
interest in “cardio-” when one is aware the clusters represent documents from PubMed.
Even apart from domain recognition effects on the user, the generality of the task description may be an
important factor in label selection effectiveness. If tasks are suitably broad, it seems that users with high and low
familiarity may do nearly as well in label selection tasks, because each may have decent familiarity with several
parts of the domain, and the criterion of what is potentially interesting to the user includes many documents that
may be of different levels of specific usefulness. So an introductory article about the structure of the Japanese
government may be of low interest to one familiar with politics in Japan, but of high interest to someone less
familiar. This tendency for the distribution of relevant interesting documents to be correlated with users of
differing familiarity in broad domain areas should tend to cancel out performance differences between the two
groups. So a lack of control on user familiarity and user-task interactions can be expected to be mitigated for
suitably broad browsing interests.
However, this lack of control on user familiarity can be expected to become a greater difficulty,
threatening the validity of label effect measurement, when more specific tasks are considered. Some users with
specific familiarity can be expected to have much deeper pools of words for task description to draw upon when
making semantic similarity judgments. Extending the results of this work to investigation of user domain
familiarity is important to understand the limits of general applicability of a labeling algorithm. It is also of great
interest in looking at the impact of user familiarity with specific domains when document clusters from several
domains are presented in the interface.
Since the goal of the project is to distinguish automated labeling systems with good performance across
various user classes, there is no attempt to control for this variability in the design and evaluation of the current
experiments. The specific effects of user familiarity and collection and clustering algorithm characteristics on
label effectiveness is a rich direction for future research.
5 Measurement of Label Effect on User Performance
A user browsing for information in a large document collection wishes to encounter documents
that are relevant to the task and situation. Systems that support the user in coming into contact with a
greater number of relevant documents will likely be judged as better-performing systems. It is also the
case that a user typically browses until the task is complete, some other activity supersedes the search, or
little or no success is experienced in finding relevant documents. So, a system designed to support
browsing must also acknowledge the importance of efficiency in real world browsing situations. These
considerations make appropriate recall-based measures of user performance, and support the view that
label effectiveness can be measured by the total number of relevant documents a user can access by
selecting some limited number of clusters.
This study uses recall-based measures of label effectiveness performance and also considers efficiency. While a user may be
content to explore freely, a system that provides access to many relevant documents early in the browsing process
is likely to lead to more positive user assessments of the system. So efficiency can be correlated with the amount
of effort, in the form of number of clusters selected and explored, required to find interesting documents.
Efficiency measurements to gauge label effect on user selection of clusters are not explicitly addressed in this
pilot study; however, the experimental design does acknowledge the importance of efficiency by restricting the
number of clusters to be selected.
In practical systems, the clusters are assumed to have been generated without knowledge of
specific users or their particular interests. From the perspective of a user-centered browsing system, the
issue is how well the clusters match the general and specific interests of the user. Some research
suggests that when people cluster information objects the results are quite different than those provided
by automated indexing systems (Macskassy, et al., 1998).
Two specific measures of user performance in selecting clusters are used in the pilot study: the
sum of the number of relevant stories in the selected clusters, and the number of selected clusters that
are relevant to the task. In both cases, the measures are determined strictly by the clusters selected by the
user. In the first measure, the number of task-relevant stories contained by each selected cluster is
summed over the clusters selected by the user. The second measure is a simple count of the selected
clusters that are relevant to the task.
The second measure is useful only in the context of the experiments to measure label effect
where the contents of the clusters are known to be all relevant to some topic or concept. In the current
study the clusters have been created by human assessors and all stories are relevant to the cluster topic.
This will not apply when clusters are generated automatically because more or less imperfect partitions
of the collection are inevitable. The first measure, a count of the relevant stories in the selected clusters,
is appropriate for testing labelling algorithm effectiveness in systems using real-world clustering
applications.
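A minimal sketch of the two measures as they might be computed, assuming a simple dictionary of per-cluster relevance judgments (the data layout and function names are mine, not the study's):

```python
# Sketch (assumed data layout) of the two pilot-study measures.
def relevant_story_count(selected, stories_per_cluster):
    # stories_per_cluster: {cluster_id: number of task-relevant stories it contains}
    return sum(stories_per_cluster.get(c, 0) for c in selected)

def relevant_cluster_count(selected, relevant_clusters):
    # relevant_clusters: set of cluster ids whose topic is relevant to the task
    return sum(1 for c in selected if c in relevant_clusters)
```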
6 Research Questions and Hypotheses
There are many interesting research questions regarding the properties of effective cluster labels, for
example, how does user performance in selecting relevant clusters change as a function of the number of words
in a label? Is it better to have words or phrases? Do nouns work better than a mixture of word types? For this
work, the research question is whether the experimental instrument can measure the effectiveness of different
label sets in supporting user selection of relevant clusters. There are two specific hypotheses:
H1a. Automated labeling algorithms can be distinguished from one another by effectiveness of the
generated label sets in supporting user selection of relevant clusters.
H1b. Automated labeling algorithms can be placed in a rank order of effectiveness.
H2. Labels produced by automated algorithms are less effective than high quality labels produced by
humans in supporting user selection of relevant clusters.
7 The Evaluation Model
A general linear model is adopted for this study. The performance of the user in selecting relevant
clusters, which are then counted independently of further user activity, depends on scanning the interface
and selecting the labels that are meaningful for the task. The selection of one cluster is independent of
another.
R = S_effect + U_effect + T_effect + r0

where U = User, T = Task, S = Interface System (label set used), r0 = constant, and R = the performance score.
The evaluation will consider only main effects.
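For illustration, a main-effects model of this form could be fit with standard ANOVA tooling; the sketch below assumes a pandas/statsmodels environment and hypothetical column names, and is not the analysis script used in the study.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# One row per trial: the performance score R plus the categorical factors.
df = pd.read_csv("trials.csv")  # assumed columns: performance, subject, task, label_set

# R = S_effect + U_effect + T_effect + r0, main effects only.
model = ols("performance ~ C(subject) + C(task) + C(label_set)", data=df).fit()
print(sm.stats.anova_lm(model, typ=3))  # Type III sums of squares, as in Tables 2 and 3
```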
8 Instruments
Figure 1: Screen shot of interface with baseline labels.
The primary instrument is a simple interface (Figure 1) to present the cluster representations with the
experimental label sets. The subject is able to select or deselect clusters by simply clicking anywhere in the
circle. A translucent blue halo shows the mouse has entered the cluster representation area and an opaque red ring
is fixed if the cluster is selected. The tasks and the label sets applied to the cluster representations change in the
experiment and the cluster positions are randomized when a new label set is presented.
Selection process logger
The interface automatically records the subjects' gestures with the mouse during the selection process.
The mouse log records the cursor position in 50 msec slices as well as all events, including click events. The log
can be played back to control the cursor and see the process of user selection of clusters. This includes the order
of cluster selection, any cases where the subject changed their mind and incidents of hesitation. The total elapsed
time can be calculated, so a keystroke level modeling (Card, Moran, & Newell, 1983) analysis of ease of use can
be applied to the user's process of label selection. These logs can also be used to look at characteristics of
label selection by users with more or less general familiarity with news events.
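As a rough illustration of the kind of logging described above (a hypothetical sketch using only the Python standard library, not the study's actual logger), one can sample the cursor every 50 msec and timestamp click events:

```python
import time
import tkinter as tk

log = []  # entries of the form (timestamp, event, x, y)

def sample(root):
    # record the cursor position, then schedule the next 50 msec sample
    log.append((time.time(), "pos", root.winfo_pointerx(), root.winfo_pointery()))
    root.after(50, sample, root)

def on_click(event):
    log.append((time.time(), "click", event.x_root, event.y_root))

root = tk.Tk()
root.bind("<Button-1>", on_click)
sample(root)
root.mainloop()
```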
9 Corpus
The clusters are from the National Institute of Standards and Technology (NIST) Topic Detection and
Tracking (TDT3) corpus (Cieri, Graff, Liberman, Martey, & Strassel, 2000). It consists of over 60,000 news
stories from wire services and television and radio transcriptions. Both English and Mandarin news stories covering
a variety of topics, including economics, world affairs, and sports, are represented in the collection.
Each story has been hand scored for its relevance to a set of 113 topics. Every member of a cluster
includes references that are directly related to the topic. A story appears in at most one cluster. An example of a
topic description used to cluster the stories is given in Appendix 1. The baseline labels for this study are the topic
names taken from the descriptions used to assign the topicality of each story.
10 Automated labeling algorithms
Within the basic interface, three label sets are applied. One label set is the baseline, the 'gold standard'
human-generated labels taken from the TDT3 corpus topic descriptions. The other two label sets can be drawn
from any labeling algorithm. For the pilot study the label sets were generated from a chi square test of word
commonality and the product of word frequency and predictiveness (Popescul & Ungar, 2000). These algorithms
were selected because Popescul and Ungar claim these algorithms perform better than simple alternatives such as
the most frequent terms and because they found users had a clear preference for the labels produced by the word
frequency-predictiveness technique.
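The following sketch shows one plausible reading of the frequency-predictiveness idea (score each cluster word by its within-cluster frequency times how strongly the word predicts the cluster); it is my approximation for illustration, not Popescul and Ungar's code.

```python
from collections import Counter

def label_cluster(cluster_docs, all_docs, top_k=5):
    # cluster_docs, all_docs: lists of tokenized documents (lists of words)
    in_cluster = Counter(w for doc in cluster_docs for w in doc)
    overall = Counter(w for doc in all_docs for w in doc)
    n_cluster = sum(in_cluster.values())
    scores = {}
    for w, f in in_cluster.items():
        frequency = f / n_cluster        # roughly P(word | cluster)
        predictiveness = f / overall[w]  # roughly P(cluster | word)
        scores[w] = frequency * predictiveness
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]
```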
11 Tasks
Six topic-driven browsing tasks (Appendix 2) were constructed that addressed topic categories within
which the TDT topic clusters would fit. So all of the stories in a topic cluster relevant to a task were deemed
relevant to the task. An example of a task description is:
Imagine you are interested in Chinese affairs. Please select three clusters that seem
interesting. When you have finished click on the button at the top to advance to the next
page.
Out of the 113 scored TDT topics, 35 were selected at random for each experiment. For the pilot study,
the topics used in each experiment were selected so that there was no overlap between the cluster sets. All of the
stories in a given topic were deemed relevant to a task if the topic was relevant to the task.
The total recall for task 1 in the selected topic clusters is 211 stories, but since the subject selects three
clusters the maximum recall possible is 170 (the sum of the number of relevant stories in the top three clusters).
A subject selecting the top three clusters would be assigned a performance value of 1. One selecting clusters with
127 relevant documents would be assigned a performance value of 0.75. In the results, the performance of the
label set is compared against this maximum recall possibility to mitigate the effect of uneven story distribution in
the clusters. Uneven relevant story distribution within topics, that is, the fact that a subset of stories will be
relevant to a topic, is characteristic of information environments, but no effort has been made to correct for this
variability in the experimental design.
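A small sketch of the normalization just described, using the task 1 figures from the text (the function and data layout are illustrative assumptions):

```python
def performance(selected, relevant_counts):
    # relevant_counts: {cluster_id: number of task-relevant stories it contains}
    best_three = sum(sorted(relevant_counts.values(), reverse=True)[:3])
    found = sum(relevant_counts[c] for c in selected)
    return found / best_three

# From the text: the best three clusters hold 170 relevant stories in total, so a
# selection containing 127 relevant stories scores 127 / 170, roughly 0.75.
```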
12 Participants
A convenience sample of 18 undergraduate (N=9) and graduate (N=9) students from a large Central
Atlantic university was recruited. No participants were excluded. It was observed that four of the graduate
students and one undergraduate were not native English speakers.
13 Treatment plan
One challenge to the experimental design, the selection of tasks, has been explored. Another challenge is
learning effects due to repeated exposure of the same clusters to the subject with different labels. Ideally, one
would like to have the same user interact with each interface for a given task. Since the interfaces vary only in the
labels used to label the clusters, significant label effect could be detected directly. Unfortunately, it is likely the
user remembers some of the previous interface labels when working in the next interface. This learning effect
increases the number of labels available to the user for semantic similarity evaluations and is a threat to detecting
the specific effectiveness of a label set.
To mitigate both learning and the task effects noted earlier, a Latin Square Design (LSD) (Tague-Sutcliffe, 1997) is adopted to assign treatments. In adopting this design, user-task interactions have been assumed
to be unimportant. All subjects perform each task exactly once and every subject experiences each label system.
The basic block for the experimental assignment is given in table 1.
Table 1: A sample treatment block

            Baseline          Label set 1       Label set 2
Subject 1   task 2, task 1    task 3, task 4    task 6, task 5
Subject 2   task 6, task 5    task 1, task 2    task 4, task 3
Subject 3   task 3, task 4    task 5, task 6    task 1, task 2
For the current experiment this block is repeated six times, using different subjects for each block but with the
same label sets and tasks. The experiment is repeated using the same subjects with new clusters and label sets
created by the respective labeling algorithms. When each of the treatments was presented, the order of the tasks
was randomized, as was the order of seeing the label sets when the user was exposed to the second cluster set.
Between subjects the order of the label sets was randomized, that is Subject 1 would not necessarily see the label
sets in the same order as Subject 2.
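For illustration, a block like Table 1 can be generated by rotating task pairs across the three label systems; the sketch below is an assumed reconstruction of the assignment logic, with the pairing of tasks taken from Table 1.

```python
import random

systems = ["Baseline", "Label set 1", "Label set 2"]
task_pairs = [(2, 1), (3, 4), (6, 5)]  # task pairings as they appear in Table 1

def latin_square_block(subjects=("Subject 1", "Subject 2", "Subject 3")):
    plan = {}
    for i, subject in enumerate(subjects):
        # rotate the pairs so each pair meets each system exactly once in the block
        rotated = task_pairs[-i:] + task_pairs[:-i]
        plan[subject] = {system: list(pair) for system, pair in zip(systems, rotated)}
        for tasks in plan[subject].values():
            random.shuffle(tasks)  # the order of the two tasks is randomized
    return plan
```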
14 Experimental Procedure
Subjects answered a pretest questionnaire (Appendix 3) about their sources of news and news
consumption intensity. The purpose was to gain some understanding of the subject's familiarity with world news
events without introducing specific words or phrases that could prime the subject and so interfere with the label
selection process. Subjects who consume high volumes of news and commentary are likely to have greater
familiarity with a randomly selected topic. Greater familiarity is assumed to be associated with a broader pool of
words in the subject's understanding of the browsing interest task description, improving the probability of
recognizing semantic similarities with label words.
The subjects were then given a short tutorial on the interface and encouraged to work with it until they
were comfortable. The subjects were aware that the purpose of the experiment was to improve access to archived
news stories and that mouse events were being logged. All subject data was made anonymous.
Each subject was presented with a sequence of tasks in labeled interfaces. For each task, the subject
would start the event logger, select three clusters and stop the event logger. They were told each task would take
at most 3-5 minutes to complete, but were also told they could take as long as needed to complete the tasks and
associated questionnaires. After completing the two assigned tasks in one of the system interfaces, a short
questionnaire (Appendix 4) was administered asking about their confidence in the selections they made and
opinion about the ease of use of the interface. This information can support event log analysis and can also be used to further
probe the impact of user familiarity on label effect and user satisfaction with the labels. This entire process is
repeated three times for the cluster set.
The process of treatment with three interfaces and six tests was repeated with a different random
selection of clusters and label sets from the TDT topic descriptions, and the chi square and predictiveness
algorithms. Completion of each cluster selection process usually takes less than two minutes so even though the
process is repeated 12 times the experimental design should be resistant to reduced user engagement due to
fatigue or boredom. Subjects usually complete the interface questionnaire in less than 15 seconds.
15 Pilot Study Results
The performance of the user in selecting clusters with the greatest number of relevant documents is
presented graphically in figure 2. The box plot shows no clear differences between the label sets. There were
wide variations in performance, with mean values around 50% (in no case were the subjects, on average, able to
select three clusters that would access more than 50% of the relevant documents available). An ANOVA analysis
(Table 2) of the model shows significant main effects for the task, the user, and the cluster set used. Task (F=5.355,
p<.05) and user (F=1.784, p<.05) were most significant. There was no significant main effect observed for the
label sets. The model has an adjusted R-squared of just 0.151, so a great deal of the observed variability is
unexplained by the model.
Table 2: Model for relevant documents user performance

Tests of Between-Subjects Effects
Dependent Variable: performance

Source            Type III Sum of Squares   df    Mean Square   F         Sig.    Partial Eta Squared   Noncent. Parameter   Observed Power(a)
Corrected Model   6.723(b)                  25    0.269         2.531     0.000   0.250                 63.278               0.999
Intercept         44.757                    1     44.757        421.270   0.000   0.689                 421.270              1.000
SUBJECT           3.222                     17    0.190         1.784     0.032   0.138                 30.328               0.943
TASK              2.845                     5     0.569         5.355     0.000   0.124                 26.776               0.988
CSET              0.518                     1     0.518         4.872     0.028   0.025                 4.872                0.593
LSET              0.138                     2     0.069         0.651     0.523   0.007                 1.302                0.158
Error             20.186                    190   0.106
Total             71.666                    216
Corrected Total   26.909                    215

a. Computed using alpha = .05
b. R Squared = .250 (Adjusted R Squared = .151)
Table 3: Analysis of variance model for user performance in selecting relevant documents

Tests of Between-Subjects Effects
Dependent Variable: CRELCORR

Source            Type III Sum of Squares   df    Mean Square   F         Sig.
Corrected Model   11.415(a)                 29    0.394         4.960     0.000
Intercept         23.675                    1     23.675        298.323   0.000
SUBJECT           3.996                     17    0.235         2.962     0.000
TASK              2.478                     5     0.496         6.244     0.000
CSET              0.500                     1     0.500         6.300     0.013
LSET              0.375                     2     0.188         2.364     0.097
CRELAVAI          3.359                     4     0.840         10.583    0.000
Error             14.761                    186   0.079
Total             96.926                    216
Corrected Total   26.176                    215

a. R Squared = .436 (Adjusted R Squared = .348)
A similar result is observed using the second performance measure (Table 3). Again, subject (F=2.96,
p<.05), task (F=6.24, p<.05), and cluster sets (F=6.30, p<.05) are all significant contributors to the model, while
the label sets are not significant. The adjusted R-squared of the model is somewhat better (0.348).
Examination of boxplots of user performance for both selection of relevant stories and selection of
relevant clusters provides similar results (figures 2 and 3), as expected. The two performance measures are
closely linked, but it is possible that a subject could select three relevant clusters that contained few
documents and so score poorly by the relevant documents recall measure but perfectly by the relevant clusters
measure. The differences in the performance measures in the boxplots show that this problem is not a significant
issue, and one can see that the same relationships between the observed means for each label set are preserved.
Figure 2: User performance measured by number of
relevant documents selected
Figure 3: User performance by number of relevant
clusters selected
An examination of the user and task differences (Appendix 5) shows no distinguishable differences
between users or tasks.
16 Discussion
The results do not support rejection of the null hypotheses for the instrument validation hypotheses.
Figure 4: User performance in selecting relevant stories by task
Some level of label effectiveness must exist; after all, random words used as labels would cause user performance to
be no better than chance. In view of the theoretical background for label selection it seems that the influence of
user familiarity and task may be stronger than anticipated. So the calculated model, dominated by task and user
main effects, might be taken at face value.
Figure 4 presents user performance in selecting relevant stories for the various tasks by labelling
algorithm. Rather consistently high user performance for tasks 4 (Chinese affairs) and 5 (European and Russian
affairs) as compared to tasks 3 (revolts, terrorism, etc.) and 6 (violent crimes and personal tragedies) suggests
there may be significant differences in user recognition of labels that cut across labelling algorithms. A look at
the labels used shows a predominance of geographic place names for the relevant clusters for tasks 4 and 5. The
labels for clusters relevant to tasks 3 and 6 contain many nouns, but these nouns do not obviously link to the task
as cleanly as they do for tasks 4 and 5.
User familiarity with a domain may also explain why no label effect could be observed. The goal of the
pretest questionnaire is to identify subjects with broad familiarity of news domains without priming them with
specific words that might affect the experiment. Tests of this questionnaire instrument included a group of editors
(N=8) at an on-line newspaper owned by one of the major broadcast networks. This group is omnivorous,
consuming news from many different types of sources, while the other respondents, on average, made intense use
only of newspapers and broadcast news.
However, this group was highly variable in their news consumption habits, and a K-means clustering
analysis revealed a group (4 out of 29) with a profile very similar to that of the news editors. An analysis of the
pretest questionnaires shows that no fewer than four of the subjects in the experiment had a similar news
consumption profile and therefore can be expected to have high familiarity with news events, and so greater domain
knowledge to draw on in selecting cluster labels. Figure 7 compares task performance by labelling algorithm,
partitioned by intensity of news consumption (4 or more sources used within the last week), and shows that the
chi square label scores had much less variability than any other task. This exploratory work
provides some evidence of user familiarity impact on user browsing performance.
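The K-means grouping mentioned above could be reproduced along these lines (a hypothetical sketch assuming the questionnaire answers are coded numerically in a file named pretest_responses.csv; scikit-learn is assumed, and this is not the analysis actually run):

```python
import numpy as np
from sklearn.cluster import KMeans

# rows = respondents, columns = the seven news-source questions, numerically coded
responses = np.loadtxt("pretest_responses.csv", delimiter=",")

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(responses)
for cluster_id in range(km.n_clusters):
    members = np.where(km.labels_ == cluster_id)[0]
    print(cluster_id, members)  # inspect which respondents share the editor-like profile
```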
There are significant limitations in the study due to the corpus used. News stories have distinctive
characteristics as compared to other texts. The age of the news stories in the corpus may also be a problem
because many subjects may not remember news events from five years ago and so valuable hints, especially in
proper names, that appear in the labels will be less helpful than they might be in clusters of contemporary stories.
A significant sample of the subjects spoke English as a second language and their experience with news in their
native countries during the 1999-2000 period could be very different from those living in the US at the time.
While the corpus included many stories from the Chinese news agencies, there was a bias towards reports, and so
subject matter, from US sources.
Figure 7: Comparison of task performance between intense news consumers and those less intense
Figure 5: News consumption by source - non-editors
Figure 6: Frequency of news consumption by news
source - editors
17 Conclusion
An instrument has been developed to detect and measure the effect of labels on user selection of
document clusters. Experiments have failed to show significant label effects on user cluster selection
performance in support of a task or browsing interest. Exploratory analysis suggests task and user
familiarity effects may have swamped any label effect in the experiments.
Future experiments with this instrument will incorporate these lessons and take account of task effects
by using tasks that suggest nouns versus mixtures of various parts of speech. Chen (2000) has
claimed that noun phrases provide more effective labels. Regarding user familiarity, experiments will be
formulated that assign subjects on the basis of the intensity of their news consumption.
If strong user and task effects on label selection are confirmed, the importance of finding effective
labelling algorithms that are independent of clustering algorithms is increased. Improved browsing performance
may depend on the ability to dynamically relabel existing clusters to take account of the user's domain knowledge
and task. Identification of labelling algorithms that have low computational demands is especially important if
the task and/or user familiarity effects are confirmed.
References
Azcarraga, A. P., and Yap, T. N. (2001). Extracting meaningful labels for WEBSOM text archives. In
Proceedings of the ACM Conference on Information and Knowledge Management (CIKM '01) (Atlanta,
GA). ACM Press, New York, NY, 40-48.
Card, S. K., Moran, T. P., and Newell, A. (1983). The Psychology of Human-Computer Interaction.
Lawrence Erlbaum Associates, Hillsdale, NJ.
Chen, H. (2000). High-performance Digital Library Classification Systems: From Information Retrieval to Knowledge
Management. Retrieved from http://www.dli2.nsf.gov/projects/chen.pdf on November 7, 2003
Chen, H., Houston, A., Sewell, R., and Schatz, B. (1998). Internet browsing and searching: User
evaluations of category map and concept space techniques. Journal of the American Society for
Information Science, 49(7), 582-603.
Cieri, C., Graff, D., Liberman, M., Martey, N. and Strassel, S. (2000). Large multilingual broadcast
news corpora for cooperative research in topic detection and tracking: The TDT2 and TDT3 corpus
efforts. In Proceedings of the Second International Language Resources and Evaluation Conference,
Athens, Greece, May 2000. Retrieved from: http://papers.ldc.upenn.edu/ March 17, 2005.
Lagus, K., and Kaski, S. (1999). Keyword selection method for characterizing text document maps. In
Proceedings of the Ninth International Conference on Artificial Neural Networks (ICANN99), pp. 371-376. IEE, London.
Lawrie, D., Croft, W.B., and Rosenberg, A. (2001). Finding topic words for hierarchical summarization.
SIGIR 2001 (New Orleans, LA) pp 349-357
Lin, C., Chen, H., and Nunamaker, J. (1999). Verifying the proximity and size hypothesis for self-organizing maps. Journal of Management Information Systems; Armonk; Winter 1999/2000.
Macskassy, S. A., Banerjee, A. Davison, B. D., and Hirsh, H. (1998). Human Performance on Clustering
Web Pages. Technical Report DCS-TR-355, Department of Computer Science, Rutgers University.
http://citeseer.ist.psu.edu/article/macskassy98human.html
Muresan, G. (2002) Using Document Clustering and Language Modelling in Mediated Information
Retrieval, Ph.D. thesis, School of Computing, The Robert Gordon University, Aberdeen, Scotland,
January 2002
Muresan, G., and Harper, D. (2004). Topic modeling for mediated access to very large document
collections. Journal of the American Society for Information Science, 55(10), 892-910.
Polson, P. G., and Lewis, C. H. (1990). Theory-based design for easily learned interfaces. Human-Computer Interaction, 5(2-3), 191-220.
Popescul, A., and Ungar, L. (2000). Automatic Labeling of Document Clusters. Unpublished manuscript,
retrieved September 23, 2004 from http://www.citeseer.com/
Radev, D.R., Fan, W., and Zhang, Z. (2001). WebInEssence: A personalized web-based multi-document
summarization and recommendation system. In NAACL 2001 Workshop on Automatic Summarization
(Pittsburgh, PA), 79-88.
Rauber, A., Schweighofer, E., and Merkl, D. (2000). Text Classification and Labelling of Document
Clusters with Self-Organizing Maps. Journal of the Austrian Society for Artificial Intelligence
(ÖGAI), 13(3), 17-23.
Rauber, A. (2000). LabelSOM: On the labeling of self-organizing maps. In Proceedings of the SIGCHI
conference on Human factors in computing systems (CHI ’00) (The Hague, The Netherlands, April 1-6,
2000). ACM Press, New York, NY, 2000, 526-531.
Schweighofer, E., Rauber, A., and Dittenbach, M. (2001). Improving the quality of labels for self-organizing maps using fine-tuning. In Proceedings of the DEXA Workshop on Legal Information Systems
and Applications (LISA01), Vienna, Austria.
Soto,R. (1999). Learning and performing by exploration: label quality measured by latent semantic
analysis, Proceedings of the SIGCHI conference on Human factors in computing systems: the CHI is the
limit, May 15-20, 1999, Pittsburgh, Pennsylvania, United States
Tague-Sutcliffe, Jean. (1997). "The Pragmatics of Information Retrieval Experimentation, Revisited." In
Readings in Information Retrieval, ed. Karen Sparck Jones and Peter Willett, 205-216. San Francisco,
CA: Morgan Kaufmann, 1997. [Originally published in: Information Processing & Management 28
(1992): 467-490.]

Appendix 1: TDT Topic Description
30003. Pinochet Trial
Seminal Event
WHAT: Pinochet, who ruled Chile from 1973-1990, is arrested on charges of genocide and torture during his reign.
WHO: Former Chilean dictator General Augusto Pinochet; Judge Baltasar Garzon ("Superjudge")
WHERE: Pinochet is arrested and held in London, then later extradited to Spain.
WHEN: The arrest occurs on 10/16/98; court negotiations last the rest of the year.
Topic Explication
Pinochet was arrested in a London hospital on a warrant issued by Spanish Judge Baltasar
Garzon. Pinochet appealed his arrest and a London court agreed, but the decision was
overturned by Britain's highest court. After much legal wrangling over the site of the trial, the
British Courts ruled that Spain should proceed with the extradition request; Pinochet
continues to fight it. On topic: stories covering any angle of the legal process surrounding
this trial (including Pinochet's initial arrest in October, his appeals, British Court rulings,
reactions of world leaders and Chilean citizens to the trial, etc.). Stories about Pinochet's
reign or legacy are not on topic unless they explicitly discuss this trial.
Rule of Interpretation Rule 3: Legal/Criminal Cases
Appendix 2: Task Descriptions
Measuring Label Effect: Task statements
rev5 5.1.05
TASK 1: Imagine you are interested in business and the economy (both domestic and international).
Please select three clusters that seem interesting. When you have finished click on the button at the top
to advance to the next page.
TASK 2: Imagine you are interested in domestic politics. Please select three clusters that seem
interesting. When you have finished click on the button at the top to advance to the next page.
TASK 3: Imagine you are interested in protests, terrorism, civil wars and revolts around the world.
Please select three clusters that seem interesting. When you have finished click on the button at the top
to advance to the next page.
TASK 4: Imagine you are interested in Chinese affairs. Please select three clusters that seem interesting.
When you have finished click on the button at the top to advance to the next page.
TASK 5: Imagine you are interested in European and Russian affairs. Please select three clusters that
seem interesting. When you have finished click on the button at the top to advance to the next page.
TASK 6: Imagine you are interested in stories about violent crimes and personal tragedies, especially
those that result in injury or death. Please select three clusters that seem interesting. When you have
finished click on the button at the top to advance to the next page.

Appendix 3: Pre-experiment questionnaire
Please make a mark on the line closest to your answer.
1. When did you last read a newspaper for national or international news?
2. When did you last watch an evening news program?
3. When did you last read a contemporary issues magazine, such as Time, Newsweek, The Economist,
The National Review, The Nation, etc. ?
4. When did you last read an on-line newspaper, such as The New York Times, CNN, AP wire, CBS,
MSNBC, etc. for national or international news?
5. When did you last read a blog or on-line discussion group that concerned some national or
international news story?
6. When did you last listen to a talk radio show that was concerned with national or international news
and events?
7. When did you last read a non-fiction book that referenced news events or concerned contemporary
issues?
Thank you!

Appendix 4: Post-task questionnaire
Task Questionnaire:
Please make a mark in the circle that best describes your feeling about the statement.
I had enough time to complete the task.
∘ agree strongly   ∘   ∘   ∘ neither agree or disagree   ∘   ∘   ∘ disagree strongly

I am very confident about the selections made for this task.
∘ agree strongly   ∘   ∘   ∘ neither agree or disagree   ∘   ∘   ∘ disagree strongly

This interface is extremely easy to use.
∘ agree strongly   ∘   ∘   ∘ neither agree or disagree   ∘   ∘   ∘ disagree strongly

Thank you!

Appendix 5
User Task and System Parameter Effects
For user performance in selecting relevant documents
Parameter Estimates
Dependent Variable: performance
Columns: B, Std. Error, t, Sig., 95% Confidence Interval (Lower Bound, Upper Bound), Partial Eta Squared, Noncent. Parameter, Observed Power(a)
Parameters: Intercept; [SUBJECT=1] through [SUBJECT=18]; [TASK=1] through [TASK=6]; [CSET=1], [CSET=2]; [LSET=1.00], [LSET=2.00], [LSET=3.00]
a. Computed using alpha = .05
b. This parameter is set to zero because it is redundant.

Appendix 7
News source survey (previous test version used for the news editor / other respondent survey).
Please make a mark on the line closest to your answer.
1. When did you last read a newspaper for national or international news?
2. When did you last watch an evening news program?
3. When did you last read a contemporary issues magazine, such as Time, Newsweek, The Economist,
The National Review, The Nation, etc. ?
4. When did you last read an on-line newspaper, such as The New York Times, CNN, AP wire, CBS,
MSNBC, etc. for national or international news?
5. When did you last read a blog or on-line discussion group that concerned some national or
international news story?
6. When did you last listen to a talk radio show that was concerned with national or international news
and events?
7. When did you last read a non-fiction book that referenced news events or concerned contemporary
issues?
Thank you!