Chaprasit-minor-thesis - University of South Australia

Ability of Co-selection to Group
Words Semantically
by
Satit Chaprasit
Supervisors
Associate Professor Helen Ashman
Associate Supervisor
Gavin Smith
A thesis submitted for the degree of
Master of Computer and Information Science
School of Computer and Information Science
Division of Information Technology, Engineering and the Environment
University of South Australia
October 2010
Table of Contents
List of Figures ......................................................................................................................................... iii
List of Tables .......................................................................................................................................... iv
Declaration .............................................................................................................................................. v
Acknowledgements................................................................................................................................ vi
Abstract ................................................................................................................................................. vii
Chapter 1 Introduction ........................................................................................................................... 1
1.1 Motivation..................................................................................................................................... 1
1.2 Thesis background ........................................................................................................................ 2
1.3 The thesis fields............................................................................................................................. 3
1.4 Research Questions ...................................................................................................................... 3
1.5 Explanation of research questions ............................................................................... 3
1.6 Contributions ................................................................................................................................ 4
Chapter 2 Literature Survey .................................................................................................................... 5
2.1 Background ................................................................................................................................... 5
2.1.1 Fundamental Knowledge on Click-Through Data as Implicit Feedback ................................. 5
2.1.2 Clustering with Co-Selection Method .................................................................................... 6
2.2 Related Works ............................................................................................................................... 7
2.2.1 Word Sense Disambiguation .................................................................................................. 7
2.2.2 Query Clustering .................................................................................................................... 8
Chapter 3 Methodology ........................................................................................................................ 10
3.1 Method Outline........................................................................................................................... 10
3.2 Identifying ambiguous and unambiguous terms ........................................................................ 10
3.3 Data Preprocessing ..................................................................................................................... 11
3.4 Data Selection for Word Sense Discrimination Experiment ....................................................... 12
3.5 Data Selection for Query Clustering Experiment. ....................................................................... 15
3.6 Expected Outcomes .................................................................................................................... 16
Chapter 4 Results .................................................................................................................................. 17
4.1 Results of Word Sense Discrimination by Co-selection Method ................................................ 17
4.2 The results from the experiment of query clustering on ambiguous dataset. ........................... 19
Chapter 5 Discussion ............................................................................................................................. 22
5.1 Discussion on the experiment on word sense discrimination .................................................... 22
5.2 Discussion on outcome from query clustering experiment ........................................................ 23
5.3 Scope and Limitation................................................................................................................... 25
5.4 Future work ................................................................................................................................. 25
Chapter 6 Conclusion ............................................................................................................................ 27
Appendix A - Complete Explanation of Query Clustering Methodology ................................................. 30
A.1 – Section outline ......................................................................................................................... 30
A.2 – Generating identification of connected component ............................................................... 30
A.3 – The selection of 10 truly ambiguous query terms ................................................................... 31
A.4 – The extraction of related queries ............................................................................................ 33
A.5 – Generating query pairs ............................................................................................................ 33
A.6 – Word sense evaluation ............................................................................................................ 35
A.7 – The approaches of working out the result .............................................................................. 36
References ............................................................................................................................................ 38
List of Figures
Figure 3-2 stage of selecting truly ambiguous and unambiguous queries ............................................ 12
Figure 3-3 cluster generated by the query graph ................................................................................... 14
Figure 4-1 numbers of clusters generated for each of 20 ambiguous and unambiguous queries.......... 17
Figure 4-2 overall proportion of semantically similar queries of method0........................................... 20
Figure 4-3 overall proportion of semantically similar queries of method1........................................... 20
Figure 4-4 comparison individual proportion of semantically similar queries ..................................... 20
List of Tables
Table 3-1 typing errors of ambiguous indicator found during data preprocessing ............................... 11
Table 4-1 numbers of clusters generated for ambiguous and unambiguous query terms ..................... 18
Table 4-2 basic statistic for the WSD experiment ................................................................................ 18
Table 4-3 level of agreement between participants............................................................................... 19
Table 4-4 the proportion of semantically similar pairs as rated by 11 participants .............................. 19
Declaration
I declare that this thesis does not incorporate without acknowledgment any material
previously submitted for a degree or diploma in any university; and that to the best of my
knowledge it does not contain any materials previously published or written by another
person except where due reference is made in the text.
Satit Chaprasit
20th October 2010
Acknowledgements
I would like to thank my supervisors Helen Ashman, Gavin Smith, and Mark Truran for their
time, valuable suggestions, and support during this year. I would also like to thank the
members of the security lab for their help and support, as well as my friends and the
participants who helped me to complete my thesis. Finally, I would like to thank my family
in Thailand and my cousins here for their encouragement and support.
Abstract
The meanings of ambiguous words vary with context. Using machines to identify
the meaning of ambiguous words is called Word Sense Disambiguation (WSD). A major
problem for word sense disambiguation is determining word sense automatically, because
meanings vary across contexts and new meanings can emerge at any time. Another problem is
automatically constructing sets of synonyms for use in word sense disambiguation.
The co-selection method, which exploits users' agreement on clickthrough data as a
similarity measure, could address these WSD problems. However, previous studies of the
co-selection method have failed to demonstrate its performance because of unsuitable
datasets. This study aims to redress that situation by carefully selecting datasets to
evaluate the ability of co-selection to discriminate word senses and to construct sets of
synonyms. The dataset was established using Wikipedia article titles as a source for a
list of ambiguous and unambiguous query terms, and the query terms selected for the
experiments were checked to confirm that they were truly ambiguous or unambiguous.
In the word sense discrimination experiment, the numbers of clusters that the
co-selection method generated for the selected ambiguous and unambiguous queries were
compared to test for a correlation between them. In the synonym-set construction
experiment, human judgement was used to evaluate query pairs generated by query clustering
on single-click data and by query clustering on co-selected data.
The experimental outcomes indicate that it is difficult to use the co-selection
method on web search to discriminate word senses effectively, but query clustering
on co-selected data can help to create unambiguous clusters, which could be used to
construct sets of synonyms automatically.
Chapter 1 Introduction
People increasingly use web search to satisfy their information needs. A great
number of users therefore interact with search engines by submitting queries and then
selecting the returned results that match the information need they have in mind. This
study views such selections in two forms:

- Clickthrough data: a collection of results selected after a query is submitted. Each
selection indicates that the user judges the selected (clicked) result relevant to the
search term.

- Co-selection data: when a user selects more than one returned result for a submitted
query, the selected results can be regarded as mutually relevant to each other. The
assumption is that the user has only one sense in mind and will select only the results
matching that purpose. Under this assumption, co-selected data can help to deal with
ambiguous returned results, because users will select multiple results that share the
same meaning.
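As a minimal sketch of how co-selected data might be extracted from a click log (the record layout and names here are illustrative assumptions, not the study's actual format):

```python
from collections import defaultdict
from itertools import combinations

def co_selected_pairs(click_log):
    """click_log: iterable of (session_id, query, url) click records.
    Returns, per query, how many sessions co-selected each unordered
    pair of URLs. The co-selection assumption is that results clicked
    in the same session for the same query share the sense the user
    had in mind."""
    clicked = defaultdict(set)
    for session_id, query, url in click_log:
        clicked[(session_id, query)].add(url)
    pair_counts = defaultdict(int)
    for (_, query), urls in clicked.items():
        # sessions with a single click contribute no pairs
        for u, v in combinations(sorted(urls), 2):
            pair_counts[(query, u, v)] += 1
    return dict(pair_counts)
```

Pairs co-selected by many sessions would then serve as the agreement signal that the co-selection method relies on.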
This study will reveal the ability of co-selection to group words semantically.
1.1 Motivation
There are two unsolved problems in Word Sense Disambiguation (WSD). The first is
identifying the sense of an ambiguous word: its meaning can change with the context in
which it occurs, and new meanings can emerge at any time. The second is how to build sets
of synonyms automatically, since manually updating synonym sets requires substantial
human effort.
A relatively new technique, called co-selection, could address these problems: it
could be used to discriminate word senses and to create unambiguous clusters from
ambiguous query terms automatically. However, previous studies of co-selection have
failed to reach a conclusion about its performance because the datasets used in their
experiments were unsuitable. This study will therefore redress the situation with
appropriate datasets for new experiments.
1.2 Thesis background
Ambiguity is widespread in human language: many words can be interpreted with
different meanings depending on the context in which they occur. Word sense
disambiguation is the discovery of the meaning of words in context. Dealing with the
different meanings of an ambiguous word is, however, complicated for machines [15]. For
example, search systems face the problem of lexical ambiguity: for a given query, it is
difficult for a system to satisfy the user's exact information need because queries are
often of low quality and ambiguous [6].
The method proposed in this study, the co-selection method, aims to deal with such
problems by exploiting users' judgements from click-through data as a similarity function
to determine the senses of an ambiguous query automatically. The method depends neither on
potentially outdated external knowledge sources nor on extensive preprocessing of datasets,
since it requires only a simple signal: user consensus [25]. Its assumption is that when
submitting a query, the user has only one sense (meaning) of the query in mind. Different
users will therefore select different search results matching the sense in their mind.
When a user selects more than one result, those results are called "co-selected data".
When a number of users co-select the same results for the same query, this reinforces that
the co-selected results are mutually relevant, so a group of similar results indicates a
distinct sense of an ambiguous word. In short, the method exploits users' agreement as a
similarity function to discriminate word senses.
Even though the result of the first study of the co-selection method [25] indicated
that the method could discriminate the senses of ambiguous queries, the study focused
only on image search. Additionally, it used an artificial dataset: only a small-scale
experiment in which volunteers participated under artificial controls. It is therefore
difficult to conclude from that experiment alone that the co-selection method can
discriminate word senses.
Another study based on the co-selection method, from [22] in 2009, aimed to
construct unambiguous clusters, not to discriminate word senses, by applying query
clustering to co-selected data. That study used a real-world dataset, based on web
document search rather than image search. However, its result was not useful for
assessing co-selection, because randomly extracting data from the entire dataset produced
an unambiguous dataset. Containing no ambiguous terms, the dataset was inappropriate for
evaluating the ability of the co-selection method to create unambiguous clusters.
Since the datasets of previous studies of the co-selection method were unsuitable
for assessing its ability to group words semantically, this study will perform both a
word sense discrimination experiment and the query clustering experiment on co-selected
data [22], but with an appropriate dataset that will be carefully established. We will
then discuss the results, potential factors, and potential future work in order to assess
the ability of the co-selection method.
1.3 The thesis fields
Word Sense Disambiguation and Information Retrieval
1.4 Research Questions

- Is there a correlation between the numbers of clusters generated by the co-selection
method from ambiguous and from unambiguous queries?

- Can query clustering on co-selected data help to create unambiguous clusters?
1.5 Explanation of research questions
The capability of the co-selection method has not yet been conclusively evaluated,
owing to inappropriate datasets. This study therefore aims to assess its performance by
carefully selecting the dataset for new experiments. There are two questions in this
study. The first concerns word sense discrimination: it asks whether there is a
correlation between the numbers of clusters generated for ambiguous and for unambiguous
queries; intuitively, the more clusters, the more meanings the query has. The second
concerns the ability of co-selection to construct sets of synonyms: it asks whether query
clustering on co-selected data can help to create unambiguous clusters.
For the first question, according to the co-selection assumption introduced earlier
in this chapter, users click mostly on results matching their information needs, so an
ambiguous query should produce a greater number of clusters than an unambiguous query.
If the outcome follows this pattern, it will indicate a correlation between the numbers
of clusters generated from ambiguous and unambiguous queries.
For the second question, ambiguous queries reduce the performance of query
clustering, because the multiple senses of such queries prevent the clustering from
grouping related queries effectively. The co-selection method could make this task more
effective, because query clustering on co-selected data also considers the information
need of each individual user when grouping related queries. We therefore examine whether
query clustering on co-selected data can help to create unambiguous clusters.
1.6 Contributions
To conclude, the following are the potential contributions of this study:

- It could strengthen the credibility of the co-selection method for grouping query
terms semantically, by carefully selecting the dataset for new experiments.

- It will identify problems in using the co-selection method on web search.

- It will investigate whether click-through data from web search is reliable enough for
the co-selection method to discriminate word senses.
Chapter 2 Literature Survey
This chapter has two parts: background and related work. The background first
provides essential information on click-through data as implicit feedback, and then
reviews previous experiments on the co-selection method. The second part surveys related
work in word sense disambiguation and query clustering.
2.1 Background
2.1.1 Fundamental Knowledge on Click-Through Data as Implicit Feedback
With explicit feedback, users are asked to assess which results are relevant to a
search query so that the retrieval function can be optimized with their assessments.
However, users are usually unwilling to provide such assessments [10]. In addition,
according to [11], manually developing a retrieval function is not only time-consuming
but sometimes impractical.
In contrast, implicit feedback in the form of click-through data can be accumulated
at very low cost, in great quantities, and without requiring any additional user
activity [11]. Click-through data can be extracted from a search engine log, which records
which search result URLs users clicked after submitting a query. Although the queries and
the selected URLs are not guaranteed to be relevant to each other, they at least indicate
a relationship, through users' judgements, between queries and the clicked document
results [28].
Since [14] first proposed this form of implicit feedback for a web search system,
it has been applied in a number of ways, such as developing information retrieval
functions [10], optimizing web search [30], investigating user search behaviour [2, 5, 18],
and clustering web documents and queries [3].
Even though click-through data seem a promising form of implicit feedback for web
search, many studies have raised questions about them. Three major difficulties confront
studies that mine click-through data: noise and incompleteness, sparseness, and the
continual emergence of new queries and documents [30]. Biases in the documents users
select have also been found in several studies. According to [1], users assume the top
result is relevant even when the results are ordered randomly. In addition, [11] point
out that the high ranking of a search result (trust bias) can influence users to click
on it even when it is less relevant to their information need than other results. The
overall quality of the retrieval system (quality bias) also influences how users select
documents: if the system can provide only results of low relevance for a query, users
will select those less relevant results anyway. Furthermore, ambiguous queries are a
major challenge for information retrieval systems that exploit click-through data as
implicit relevance feedback, since ambiguity is difficult to deal with. Users usually
submit short search terms, which makes the keywords ambiguous, and IS-A relationships
can also lead to ambiguity [20].
Although the accuracy of relevant documents identified from click-through data is
quite low, approximately 52% [18], users' judgements of the relevance of result abstracts
(snippets taken from the document and containing the search terms) reach 82.6% [12]. The
co-selection method is based on how people judge the abstracts of documents rather than
the documents themselves. For this reason, co-selection on web search could be reliable
enough for these experiments.
2.1.2 Clustering with Co-Selection Method
The co-selection method was first proposed by [25]. That study introduced the use of
users' consensus as a similarity measure on click-through data, aiming to test the
co-selection assumption on an image search system, SENSAI, by determining whether images
co-selected for the same query indicated mutual relevance. For each term, the clustering
method collected the clicked images by initially placing each URL at a random position on
a single axis and then moving URLs closer together on the axis whenever they were
co-selected. The experimental result showed that exploiting users' consensus could
automatically separate the senses of ambiguous search terms without the labour of
gathering and maintaining knowledge resources. However, the study was limited by its
artificial data, since the dataset was collected by asking users to perform search tasks
under restricted controls.
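The axis-based clustering just described might be sketched as follows. This is a hypothetical reconstruction from the description, not the SENSAI implementation; the step size, iteration count, and function names are assumptions:

```python
import random

def axis_positions(urls, co_selections, step=0.5, iterations=100, seed=0):
    """Place each URL at a random position on a single axis, then
    repeatedly pull co-selected URL pairs towards their midpoint.
    Co-selected URLs drift into tight groups; gaps left between
    groups on the axis suggest distinct senses of the term."""
    rng = random.Random(seed)
    pos = {u: rng.random() for u in urls}
    for _ in range(iterations):
        for u, v in co_selections:
            mid = (pos[u] + pos[v]) / 2
            pos[u] += step * (mid - pos[u])  # move u part-way to the midpoint
            pos[v] += step * (mid - pos[v])  # move v part-way to the midpoint
    return pos
```

URLs that are never co-selected keep their (random) separation, which is what allows the groups to be read off the axis.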
There was another previous co-selection study, from [22]. It aimed to evaluate the
performance of the co-selection method in creating unambiguous clusters by query
clustering, and it used a real-world dataset. However, the result was unexpected: the
alternative method, which had no disambiguation capability, gave the best semantic
coherence instead of query clustering on co-selected data. The researchers concluded that
randomly extracting queries from the real-world dataset had been a poor choice, because
[17, 29] point out that only 4% of the words in a general search engine are ambiguous.
Randomly selecting data from the clickthrough data therefore produced an unambiguous
dataset, which is not appropriate for evaluating the performance of the co-selection
method.
Since an appropriate dataset is needed to verify the capability of the co-selection
method, this thesis carefully extracts data from a real-world dataset so that it contains
a proportion of known ambiguous terms. The credibility of co-selection for grouping words
semantically will therefore be revealed in this study.
2.2 Related Works
2.2.1 Word Sense Disambiguation
Since dealing with ambiguity is complicated for machines, a process is required
that transforms unstructured textual information into analysed data structures in order
to identify the underlying meaning. This process, the computational discovery of the
meaning of words in context, is called Word Sense Disambiguation (WSD) [15].
A number of studies have approached this task with different techniques, which
usually rely on external resources [25]. For example, [9] performed text categorization
using natural language processing patterns from knowledge-based rules. Another example is
the WSD study [8] that proposed a method for finding words related to a given word using
the Longman Dictionary of Contemporary English (LDOCE). The problem with external
resources is that human effort and specialist involvement are required to maintain them
manually.
However, [19] illustrates that word sense disambiguation can be divided into two
subtasks: discrimination and labelling. Word sense discrimination partitions the senses
of an ambiguous word into different groups, on the assumption that each group represents
only one sense; most word sense discrimination approaches are based on clustering. After
this process, the clusters of senses can not only be sent to lexicographers for the
labelling task, but can also be used in the field of information access without requiring
definitions of terms. As [19] illustrate, a system can show users examples from each
cluster so that they can decide which sense they want; for example, regardless of word
definitions, query suggestion can present examples of the senses of a given query for
users to choose from.
Word sense discovery is a significantly difficult task, because the meanings of
words vary across contexts and a new meaning can be introduced at any time [13]. To deal
with this problem, the studies [19], [26], [24] attempted to discriminate word senses
with complex approaches. The co-selection method, in contrast, relies only on a simple
signal, users' consensus, to discriminate word senses.
2.2.2 Query Clustering
[3] first used query clustering based on click-through data to collect similar
queries together. In this method, a bipartite graph represents the relationships between
distinct query nodes and distinct URL nodes. The algorithm, an agglomerative hierarchical
algorithm, relies only on content-ignorant clustering: it discovers similar groups by
iteratively merging queries and documents, one merge at a time. Because it performs
clustering iteration by iteration, a noticeable disadvantage of the method is that it is
quite slow [3]. The study used half a million records of click-through log from the Lycos
search engine to evaluate query suggestion. However, according to [4], this algorithm is
vulnerable to noisy clicks, since users sometimes select results by mistake or through
poor interpretation of result captions. [4] therefore further developed the method by
adapting the similarity function to detect and eliminate noisy clicks.
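As a rough illustration of the content-ignorant approach, the following sketch merges query clusters by the Jaccard overlap of their clicked-URL sets. This is not the authors' actual algorithm; the function name and the similarity threshold are assumptions for the example:

```python
def cluster_queries(query_urls, threshold=0.5):
    """query_urls: {query: set of clicked URLs} from a bipartite click
    graph. Repeatedly merge the two clusters whose clicked-URL sets
    overlap most (Jaccard similarity), one merge per iteration,
    stopping when no pair reaches the threshold."""
    clusters = [({q}, set(urls)) for q, urls in query_urls.items()]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i][1], clusters[j][1]
                sim = len(a & b) / len(a | b)  # Jaccard overlap of URL sets
                if sim >= threshold and (best is None or sim > best[0]):
                    best = (sim, i, j)
        if best is None:
            return [queries for queries, _ in clusters]
        _, i, j = best
        merged = (clusters[i][0] | clusters[j][0], clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
```

The all-pairs scan per merge also makes the slowness the text mentions easy to see: each iteration is quadratic in the number of clusters.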
Another query clustering method is session-based query clustering, proposed by
[28], which was based on click-through data and used a combination of query content
similarity and query session similarity. A query session was defined as the sequence of
activities a user performed after submitting a query. The clustering algorithm used was
DBSCAN [7], because it can handle a large dataset efficiently and integrate new queries
incrementally; DBSCAN also does not require manually setting the number of clusters or
their maximum size. In the reported experiment, however, only 20,000 queries were
randomly extracted from the entire dataset, because the full dataset was too large,
approximately 22 GB. Also, since the dataset came from the Encarta website rather than a
search engine, [22] point out that users might not interact with that system in the same
way as with search engine systems.
In addition, according to [22], the agglomerative hierarchical clustering method
proposed by [3] differs from the session-based method in that it has no capability for
discriminating word senses; it can only collect related information into the same
cluster. Although query clustering has potential for use in word sense disambiguation,
namely constructing sets of synonyms automatically, these studies did not evaluate the
ability to create unambiguous clusters. It is therefore rational for this project to
evaluate that ability using query clustering on co-selected data.
Chapter 3 Methodology
This study carried out an exploratory methodology to evaluate the potential of the
co-selection method to discriminate word senses and to cluster unambiguous queries. The
first experiment evaluated whether the number of clusters generated differs between
ambiguous and unambiguous queries; the second evaluated whether query clustering on
co-selected data can create unambiguous clusters. The following sections describe the
stages of the methodology used to achieve the research goal.
3.1 Method Outline

- Identifying ambiguous and unambiguous terms (3.2)

- Co-selected data preprocessing (3.3)

- Data selection for the word sense discrimination experiment (3.4)

- Data selection for the query clustering experiment (3.5)

- Expected outcomes (3.6)
3.2 Identifying ambiguous and unambiguous terms
This study comprised two experiments: a comparison of the clusters generated from
the co-selected data of ambiguous versus unambiguous query terms, and query clustering to
create unambiguous clusters. A known-ambiguous dataset and a known-unambiguous dataset
were therefore required. According to [17], Wikipedia can be used as a source for
identifying ambiguous and unambiguous terms, since it allows article titles to be
separated into ambiguous and unambiguous titles. For this reason, Wikipedia was used to
generate the known ambiguous and unambiguous terms.
Number  Ambiguous indicator        Number  Ambiguous indicator
1       _(Disambiguation)          9       (disambiguation_page
2       (disambiguation)           10      _(disambigation)
3       (Disambiguation)           11      _(disambigaiton)
4       (disambiguation            12      _(disambigutaion)
5       _(disambig)                13      _(disambiguatuion)
6       _disambiguation            14      _(disambiguaton)
7       _(disambiguation_page)     15      _(disambiguatiuon)
8       (disambiguation_page)      16      _(disambigauation)
Table 3-1 typing errors of the ambiguous indicator found during data preprocessing
As mentioned above, all article titles from Wikipedia, downloaded from
http://download.wikimedia.org/enwiki/20100622/ (the file described as 'List of page titles' –
all-titles-in-ns0.gz, approximately 41 Megabytes), were used to generate a group of
ambiguous titles and a group of unambiguous titles. According to the Wikipedia convention,
all titles containing “_(disambiguation)” are ambiguous titles, so ambiguous titles could be
extracted by retrieving them from the all-titles text file. However, because Wikipedia is
contributed to by a large number of people, the downloaded text file also contained some
typing errors in the indicator marking a title as ambiguous. For example, some titles
contained “_(disambiguatoin)” instead of “_(disambiguation)” (see Table 3-1 for the typing
errors found during data preprocessing). We noticed that the errors occurred in the last part
of the word “disambiguation” rather than in the “disambig” prefix. For this reason, to extract
the known ambiguous titles properly, “disambig” was used as the filter to select ambiguous
titles out of all Wikipedia titles. The titles remaining after the extraction were considered
known unambiguous titles. As a result, the lists of ambiguous and unambiguous titles were
created, and they were then imported into the database as the “ALL-AMBIGUOUS” and
“ALL-UNAMBIGUOUS” tables.
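The title-splitting step can be sketched as follows. This is a minimal sketch: the function name and the in-memory lists are illustrative, whereas in the study the titles were read from the all-titles-in-ns0 file and written into database tables.

```python
def split_titles(titles):
    """Split Wikipedia article titles into ambiguous and unambiguous lists.

    A title is treated as ambiguous when it contains "disambig", which also
    catches the misspelled indicators of Table 3-1 (e.g. "_(disambigation)",
    "_(disambigaiton)") as well as capitalized variants.
    """
    ambiguous, unambiguous = [], []
    for title in titles:
        if "disambig" in title.lower():
            ambiguous.append(title)
        else:
            unambiguous.append(title)
    return ambiguous, unambiguous
```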
3.3 Data Preprocessing
Clickthrough data needs to be preprocessed before experiments based on the co-selection method can be performed. The preprocessing steps are described below:
1. Queries were normalized to lower case so that each query accumulated a
greater number of clicks, because too few clicks for a query would be
inadequate for the experiments.
2. Session identification was required to represent a person submitting a
query, which is essential for the co-selection method; that is, a unique
session represents one submission of a query. Therefore, “SessionIDs” were
generated by sorting on query, time, and URL respectively. The same query
occurring in different 30-minute periods was assigned different “SessionIDs”.
3. SessionIDs with fewer than 2 records (2 clicks) were filtered out, because
1 record (1 click) per session cannot constitute co-selected data.
4. Any co-selected data generated by only one SessionID was considered
unusual co-selected data and was also filtered out.
After performing these steps, we had clickthrough data in which each SessionID has
more than 1 click, called “CT_SSG1”. This means that we now had the “ALL-AMBIGUOUS”,
“ALL-UNAMBIGUOUS”, and “CT_SSG1” tables.
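Steps 1 to 3 above can be sketched as follows. This is an illustrative sketch only: it assumes clickthrough records as (query, timestamp-in-seconds, URL) tuples, interprets the 30-minute rule as a gap between consecutive clicks on the same query, and omits step 4's filter.

```python
THIRTY_MIN = 30 * 60  # session window in seconds

def preprocess(records):
    """Sketch of preprocessing steps 1-3.

    Returns {session_id: [(query, timestamp, url), ...]}, keeping only
    sessions with at least 2 clicks.
    """
    # Step 1: normalize queries to lower case.
    rows = [(q.lower(), t, u) for q, t, u in records]
    # Step 2: sort by query, time, URL and assign SessionIDs; a new session
    # starts when the query changes or more than 30 minutes have passed.
    rows.sort()
    sessions, sid, prev = {}, -1, None
    for q, t, u in rows:
        if prev is None or q != prev[0] or t - prev[1] > THIRTY_MIN:
            sid += 1
        sessions.setdefault(sid, []).append((q, t, u))
        prev = (q, t)
    # Step 3: drop sessions with fewer than 2 clicks (no co-selection).
    return {s: clicks for s, clicks in sessions.items() if len(clicks) >= 2}
```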
3.4 Data Selection for Word Sense Discrimination Experiment
For word sense discrimination using the co-selection method, we aimed to evaluate
whether there was a difference between the numbers of clusters generated from unambiguous
and ambiguous query terms. Given how the co-selection method works, the number of
clusters generated from ambiguous query terms should be greater than from unambiguous
terms, as the method would be expected to partition clicks for the same search term into
distinct clusters in which the different meanings of the search term are manifested. For this
reason, this section describes the steps for selecting 20 truly ambiguous and 20 truly
unambiguous query terms for the word sense discrimination experiment.
[Figure 3-1 omitted: flow diagram in which CT_SSG1 is joined with the ALL-AMBIGUOUS
and ALL-UNAMBIGUOUS tables to create the AMBIGUOUS-SELECTED and
UNAMBIGUOUS-SELECTED tables; click counts are then compared, and 20 truly ambiguous
and 20 truly unambiguous queries are selected into the AMBIGUOUS-SELECTED-20 and
UNAMBIGUOUS-SELECTED-20 tables.]
Figure 3-1 stages of selecting truly ambiguous and unambiguous queries
The “CT_SSG1” table was joined with the “ALL-AMBIGUOUS” and “ALL-UNAMBIGUOUS”
tables to create a table named “AMBIGALL_SELECTED”, containing only ambiguous queries,
and another table named “UNAMBIGALL_SELECTED”, containing only unambiguous queries.
These 2 tables were then sorted from the highest to the lowest click count of the query terms.
However, the imbalance of click counts between ambiguous and unambiguous query
terms could bias the query selection for the experiment. That is, once we had the tables
listing queries with their total clicks, the highest total clicks in the table of unambiguous
queries (“UNAMBIGALL_SELECTED”) was significantly higher than the highest total clicks
in the ambiguous one (“AMBIGALL_SELECTED”). This means that if we simply chose the
top 20 queries from each table, the selection could be biased. For this reason, we decided to
choose 20 ambiguous query terms first, and then use those terms to select unambiguous
query terms with similar click counts (this is explained in more detail below).
In addition, both the ambiguous and the unambiguous queries needed to be verified
as truly ambiguous or truly unambiguous. In the case of truly ambiguous queries, we dealt
with this issue by checking them manually as follows:
- whether or not one sense of the ambiguous term was dominant, using the Wikipedia
website as the original source;
- whether each sense of the ambiguous term accounted for at least 20 percent of the
top 10 search results at Google, Bing, and Yahoo. Had we failed to do this, it is
possible that there would be no co-selection data for the minority senses of the term
present in the data, and thus the distinct clusters would not be manifested.
We then chose the top 20 verified ambiguous queries.
In the case of truly unambiguous queries, we only checked manually against search
results (they did not need to be checked against Wikipedia, as they were unambiguous
according to Wikipedia in the first place). The selection criteria were as follows:
- The unambiguous term must not occur in the ambiguous table (it is possible for a
term, such as “google” or “aol”, to occur in both the ambiguous and unambiguous
tables).
- The unambiguous term must have only one sense in the first page of search results.
This indicates that the search engine had not identified sufficiently representative
ambiguity in the term.
- We then chose the 20 queries by comparing them one by one with the total clicks of
the 20 selected ambiguous queries (not more or less than 10%). For example, if the
1st selected ambiguous query has 1,000 clicks, the 1st selected unambiguous query
must have between 900 and 1,100 total clicks.
According to these criteria, 20 truly ambiguous and 20 truly unambiguous queries were
selected into tables named “AMBIGALL_SELECTED_20” and “UNAMBIGALL_SELECTED_20”
respectively.
“SenseIDs” for query terms were also required in order to indicate how many
clusters of co-selected data occurred for a query term. This is essential because it would be
used to compare the numbers of clusters between ambiguous and unambiguous query terms.
The “query graph” was chosen to achieve this task, as used in [23]. The principle of the
query graph method is to cluster URLs carrying the same sense of an ambiguous query into
the same group. A bipartite graph was used in the query graph to visualize the relevance
between queries and the documents clicked by users. A node represents a clicked document,
with a count showing how many times users selected the same document; an edge, also
carrying a count, is generated between nodes when a user selects multiple results (see
Figure 3-2). Different groups of nodes and edges represent distinct clusters, which in turn
represent potentially distinct meanings. For this reason, the query graph was used to
generate the identification of sense-distinct clusters (SenseID) for each record in the
experiment.
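A minimal sketch of how such sense-distinct clusters can be counted for one query term is given below. It assumes each session's clicked URLs are available as a list (the thesis's actual tables and the node/edge counts are not modeled here); connected components of the co-selection graph then correspond to SenseIDs.

```python
def count_sense_clusters(sessions):
    """Count the clusters a query graph would produce for one query.

    `sessions` is a list of per-session click lists (URLs co-selected by one
    user for the same query). URLs are nodes, co-selection within a session
    adds edges, and each connected component is one sense-distinct cluster.
    Uses a small union-find structure over the URLs.
    """
    parent = {}

    def find(x):
        root = x
        while parent.setdefault(root, root) != root:
            root = parent[root]
        while parent[x] != root:  # path compression
            parent[x], x = root, parent[x]
        return root

    for clicks in sessions:
        for a, b in zip(clicks, clicks[1:]):
            parent[find(a)] = find(b)  # link co-selected URLs
        if clicks:
            find(clicks[0])            # register nodes even without edges
    return len({find(u) for u in list(parent)})
```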
[Figure 3-2 omitted: example query graph with document nodes A to F, node counts, and
edge counts, forming separate groups of connected nodes]
Figure 3-2 cluster generated by the query graph
The following methods were used to work out the results:
- A paired t-test was used to evaluate whether the numbers of clusters generated from
truly ambiguous and truly unambiguous query terms were statistically different
(paired t-test from http://faculty.vassar.edu/lowry/VassarStats.htm).
- The mean (X̄) of the numbers of clusters generated for the 20 truly ambiguous and
the 20 truly unambiguous terms was calculated to compare the average number of
clusters.
- The standard deviation of the numbers of clusters for both groups was also
calculated to support the outcome.
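The t statistic for the paired test can be computed as below. This is a sketch with illustrative names; the thesis used the VassarStats online calculator, which also supplies the two-tailed p-value from the t distribution with n - 1 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(xs, ys):
    """t statistic for a paired (dependent-samples) t-test.

    Here `xs` and `ys` would be the cluster counts for the 20 unambiguous
    and 20 ambiguous queries, paired row by row as in Table 4-1.
    """
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    # t = mean of differences over its standard error.
    return mean(diffs) / (stdev(diffs) / sqrt(n))
```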
3.5 Data Selection for Query Clustering Experiment
For the second, query clustering, experiment, we needed to evaluate whether query
clustering on co-selected data could help to create unambiguous clusters. To achieve this
evaluation, we only needed query terms that have more than 1 sense. Therefore, the list of
ambiguous query terms alone was used for this experiment. We decided to compare query
clustering on normal clickthrough data (single click), called method0 in this study, with
query clustering on co-selected data, called method1. We then used human judgement to
evaluate the semantic relationships of query pairs randomly generated by both methods,
and Fleiss kappa and basic statistics were used to work out the results.
Since a complete explanation of the methodology for this experiment would be too
long for this section, it is provided in Appendix A. The following are brief steps of the
methodology for this experiment.
- ConnectedIDs were generated for queries in both method0 and method1 by
the simple query clustering of [3], which does not require setting any
parameters. The input data differed between the two methods: for method0
the input was (query, URL); for method1 it was ((query, SessionID), URL).
For this reason, method0 has only 1 cluster per query, while method1 can be
expected to have multiple clusters per query.
- 10 truly ambiguous queries were selected, and queries having the same
ConnectedID as those 10 were extracted for generating query pairs. To select
the 10 truly ambiguous queries, all ambiguous queries were sorted by 2
additional fields: number of clusters (method1) and size of clusters
(method0), respectively.
- Unique query pairs for both method0 and method1 were randomly generated
for the evaluation.
- Human judgement was used to evaluate the semantic relationship within
each query pair.
- Fleiss free-marginal kappa was used to work out the level of agreement
between participants, and basic statistics were used to compare the
performance of method0 and method1.
3.6 Expected Outcomes
For the first experiment (word sense discrimination), we expected to see a distinction
between the clusters generated from co-selected data for unambiguous queries and for
ambiguous queries. The number of clusters generated for unambiguous queries was expected
to be fewer than for ambiguous queries. Such an outcome would indicate that the
co-selection method on web search can help to discriminate word senses.
For the second experiment (query clustering), we expected to see that query
clustering on co-selected data could help to create unambiguous clusters. This means that,
when comparing query clustering on co-selected data (method1) with query clustering on
single clicks (method0), the number of semantically similar pairs for method1 would be
significantly higher than for method0, and the level of agreement between raters would not
be merely by chance. Such an outcome would indicate that query clustering on co-selected
data performs better than the basic clustering algorithm on ambiguous queries, i.e. that
method1 can distinguish between senses of an ambiguous term, while method0 cannot.
Chapter 4 Results
This chapter presents the results of both experiments; discussion of the results
follows in the next chapter. Since there were 2 experiments in this study, the chapter is
divided into 2 parts: the first covers the results of word sense discrimination by the
co-selection method, and the second covers query clustering on the ambiguous dataset.
4.1 Results of Word Sense Discrimination by Co-selection Method
The numbers of clusters for the 20 unambiguous queries and the 20 ambiguous
queries were generated to compare whether there was a difference between these 2 groups
(see Table 4-1). The numbers of clusters were then submitted to a paired t-test to validate
whether the 2 groups were statistically different, and basic statistics were also used to
support the outcome.
Figure 4-1 numbers of clusters generated for each of 20 ambiguous and unambiguous queries
#    Unambiguous query term   Clusters generated   Ambiguous query term   Clusters generated
1    gmail                    1                    pogo                   2
2    clip art                 2                    ups                    2
3    american airlines        4                    amazon                 5
4    youtube                  1                    aim                    1
5    wedding cakes            4                    juno                   2
6    tori spelling            1                    chase                  2
7    google earth             3                    monster                5
8    delta airlines           2                    southwest              1
9    howard stern             8                    delta                  6
10   google maps              1                    people                 7
11   Cymbalta                 1                    aaa                    1
12   Ipod                     8                    gap                    2
13   Itunes                   5                    whirlpool              6
14   Screensavers             5                    time                   1
15   Swimsuits                3                    hallmark               1
16   birthday cards           5                    continental            2
17   jennifer aniston         8                    Fox                    6
18   paintball guns           1                    nwa                    3
19   Fonts                    5                    e3                     6
20   wedding vows             1                    mls                    2
Table 4-1 numbers of clusters generated for ambiguous and unambiguous query terms
Statistic                    Unambiguous group   Ambiguous group
Means                        3.45                3.15
Standard deviation           2.50                2.13
Paired t-test (two-tailed)   P-value: 0.68
Table 4-2 basic statistics for the WSD experiment
Based on the data, there was no distinction between unambiguous and ambiguous
queries. Firstly, the means and standard deviations of the 2 groups were not significantly
different. Secondly, the p-value from the paired t-test (calculated at
http://faculty.vassar.edu/lowry/VassarStats.htm) was 0.68, which is greater than 0.05,
meaning that the numbers of clusters generated for these 2 groups were not statistically
different.
4.2 Results of Query Clustering on the Ambiguous Dataset
For the query clustering experiment, 11 participants completed the evaluation.
Their ratings were then used to indicate whether query clustering on co-selected data
(method1) can cluster ambiguous queries better than query clustering on single clicks
(method0). The agreement between participants was measured by Fleiss free-marginal
kappa [16], calculated at http://justusrandolph.net/kappa, which was produced by the author
of [16], and basic statistics were also used to compare the performance of method1 and
method0.
Type of result from online kappa calculator   Value
Percent of overall agreement                  0.71
Fixed-marginal kappa                          0.39
Free-marginal kappa                           0.43
Table 4-3 level of agreement between participants
As mentioned above, Fleiss free-marginal kappa was used in this study to indicate
the level of agreement between raters, because the raters could rate freely: they were not
limited in how many items they could assign to each category. The result is relatively
positive at 0.43, which indicates moderate inter-rater agreement (positive agreement for
kappa starts at 0), according to [27].
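Randolph's free-marginal kappa can be computed as sketched below (names are illustrative; `ratings` holds one row per rated query pair, each row containing the category each participant assigned).

```python
def free_marginal_kappa(ratings, categories=2):
    """Randolph's free-marginal multirater kappa, a minimal sketch.

    Observed agreement is the usual Fleiss P-bar; expected agreement is
    1/categories because raters are not constrained in how often they may
    use each category.
    """
    n_raters = len(ratings[0])
    p_obs = 0.0
    for row in ratings:
        counts = [row.count(c) for c in set(row)]
        # Fraction of agreeing rater pairs for this item.
        p_obs += sum(c * (c - 1) for c in counts) / (n_raters * (n_raters - 1))
    p_obs /= len(ratings)
    p_exp = 1.0 / categories
    return (p_obs - p_exp) / (1.0 - p_exp)
```

With 2 categories (semantically similar or not) and an overall agreement of about 0.71 as in Table 4-3, this gives roughly (0.71 - 0.5)/0.5 ≈ 0.42 to 0.43, matching the reported free-marginal kappa up to rounding.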
The following table shows the proportion of semantically similar query pairs as
rated by the participants for both methods.
Participant          Method0    Method1
1                    0.45       0.675
2                    0.508333   0.758333
3                    0.458333   0.675
4                    0.616667   0.9
5                    0.358333   0.791667
6                    0.458333   0.733333
7                    0.383333   0.633333
8                    0.533333   0.925
9                    0.316667   0.558333
10                   0.533333   0.833333
11                   0.641667   0.875
Standard Deviation   0.101944   0.117604
Overall              0.47803    0.759848
Table 4-4 the proportion of semantically similar pairs as rated by 11 participants
Figure 4-2 overall proportion of semantically similar queries of method0
Figure 4-3 overall proportion of semantically similar queries of method1
Figure 4-4 comparison of individual proportions of semantically similar queries
Based on our data, method1 can cluster ambiguous queries better than method0,
because the overall value for method1 is significantly higher than for method0, while the
standard deviations differ only slightly.
As mentioned at the start of this chapter, further discussion of the results is
provided in the next chapter.
Chapter 5 Discussion
As presented in the previous chapter, the outcome of the word sense discrimination
experiment was contrary to our expectation, whereas the outcome of the query clustering
experiment went in the direction we expected. On analysis, there are several potential
factors involved in these outcomes, and this chapter discusses those factors for both
experiments.
5.1 Discussion on the word sense discrimination experiment
After obtaining the outcomes of this experiment, we considered why they behaved in
this unexpected way. Looking into the data and the selected dataset, there are several
potential factors impacting the ability of the co-selection method, especially in web
document search: change in the ranking position of search results, the potential difference
between our dataset and a present-day dataset, the scope of ambiguous query terms, and
noise filtering. These factors are discussed in this section.
Firstly, changes in search result ranking during the period over which the
clickthrough data was collected (approximately a month) could be one cause of the many
clusters generated for unambiguous query terms. That is, people could co-select different
search results when ranking positions changed. For example, at the time data collection
started, the co-selected results for a query term might be in the top 5 of the first page, but
their ranks could change at any time after that, depending on factors such as contemporary
trends and competition over Search Engine Optimization (SEO). Other research by
colleagues in the Security Lab is currently finding that some classes of queries are highly
volatile, and that the overlap between search results for the same search term from one
five-day period to the next can drop significantly. For this reason, different users could
select different results even though they have the same information need. Hence, ranking
changes could be one factor making the numbers of clusters similar.
Secondly, the entire dataset used in this study was several years old. This means
that, although we carefully selected ambiguous and unambiguous queries, it was still
difficult to judge whether such queries were truly ambiguous or unambiguous, considering
how trends have changed between then and the present. As a result, the selected queries
might not have been exactly appropriate. In short, comparing the senses of queries from a
dataset several years old against present-day search results may be another factor leading
to the unexpected outcome.
Thirdly, the scope of an ambiguous term is vague: it is unclear what the information
need of the user is. For example, “ipod” is an unambiguous term at the meaning level, but
users could use “ipod” for different information needs, such as looking for ipod news, ipod
reviews, or different versions of the ipod; such a term would be considered 'weakly
ambiguous' [22]. As a result, users with different purposes for the “ipod” term would click
on different search results, which could be one reason why a number of clusters were
generated for some of the unambiguous terms. In short, it was possible for users to use a
seemingly unambiguous term to pursue different information needs because of the weak
ambiguity of such terms.
Finally, the fundamental noise filtering for co-selected data might also be a factor.
The filtering only considers a set of co-selected data to be unusual when there is a single
person whose co-selected results differ from the others'. This means that if 2 or 3 people
accidentally co-selected similar results unrelated to their information needs, a new cluster
would still be generated by such accidental co-selection, so this could be a factor generating
extra clusters. However, this factor would generate extra clusters for both unambiguous
and ambiguous queries alike, so we are relatively sure that it was not a significant
contributor to the unexpected outcome.
To summarize, changes in search results, an out-of-date dataset, the scope of
ambiguous terms, and the noise filtering for co-selected data are the potential key factors
behind the unexpected outcome of this experiment. Although they can explain why there
was no difference between the clusters generated from ambiguous and unambiguous
queries, it is difficult to overcome all of these factors when discriminating word senses from
clickthrough data on web search. For this reason, it is rational to focus future work on the
co-selection method for discriminating word senses in image search rather than web
document search.
5.2 Discussion on the outcome of the query clustering experiment
The outcome of this experiment matched our expectation: query clustering on
co-selected data clusters ambiguous query terms better than the single-click query
clustering of [3]. On analysis of the outcome, there are a few areas worth discussing.
Firstly, based on the basic statistics, method1 outperformed method0. As mentioned
in the previous chapter, the overall proportion of semantically similar queries for method1,
0.76, was significantly higher than that for method0, 0.47, and the standard deviations of
the two methods were not significantly different. Looking further at the proportions at the
rater level, it is also noticeable that every participant rated method1 better than method0.
This could be because method1, being based on co-selected data, presumably separated sets
of queries containing potentially different senses, so that pairs selected from within clusters
were more likely to reflect only one sense of the search term, while method0 did not
distinguish potentially different senses before creating the query pairs for the evaluation,
leaving a greater chance of selecting pairs of clicked data from a single cluster spanning
distinct senses. Furthermore, the selected ambiguous queries were sorted from a greater to
a smaller number of clusters as the priority; in other words, the queries containing the most
potentially different senses were used to compare the performance of both methods.
Although this selection criterion could be seen as a significant advantage for method1, it
was suitable for answering whether method0 or method1 performs query clustering well on
ambiguous queries. Therefore, it was rational to use this approach to select the ambiguous
queries for generating pairs. In short, on an ambiguous dataset, method1 clusters
ambiguous queries better than method0.
Secondly, the level of agreement between participants was relatively good, but there
were only 11 participants. As shown in the results chapter, the level of agreement indicated
that the agreement was not purely by chance, because the Fleiss free-marginal kappa was
0.43 (positive agreement starts at 0). However, the participants did not agree completely.
This might be because some query pairs were difficult to judge when their meaning was
unclear, or because the participants genuinely disagreed on the relationship of some pairs.
Additionally, only a small number of participants, 11, performed the evaluation, so it was
difficult to analyze potential factors from the level of agreement.
To summarize, the results indicate that method1 is better than method0 at both the
overall and the individual proportion level. Although the steps for selecting ambiguous
queries could advantage method1, they were suitable for the purpose of the experiment,
which assessed the performance of query clustering on ambiguous queries. The level of
agreement between participants was relatively high, even though there were few
participants, and the overall results show that co-selection is a promising avenue for
sense-sensitive clustering.
5.3 Scope and Limitations
Several areas were not covered in this study. The following areas are out of scope:
- This study did not identify the meanings of the distinct clusters generated from
co-selected data, because the meanings were not required for word sense
discrimination - the system can give examples of each cluster for users to decide
which sense they want (Schütze, 1998) - nor were they needed for constructing
sets of synonyms.
- This study did not include an experiment on image search, but focused only on
web document search.
- Since no contemporary dataset has been published, we instead used a real-world
dataset from several years ago.
- The query clustering from [4] was not used to generate ConnectedIDs because it
requires a parameter, an artificial factor that depends on the kind of task.
Although it can reduce noise more effectively than [3], we were not sure whether
noise filtering while generating ConnectedIDs would affect the performance of
the co-selection method.
- We only used basic statistics to work out the results of the experiments. More
advanced statistics might provide more reliable outcomes, but with limited time
this was also out of scope.
5.4 Future work
For word sense discrimination, we suggest working on image search instead of web
document search, because image search is a more reliable source for analyzing how users
click on results. Although the number of people participating in image search is
significantly smaller than in web search, it should still be able to support the task of
automatically discriminating word senses.
For query clustering on co-selected data, a contemporary dataset is required for
future experiments. This is because present-day search results appear to be becoming more
diversified, which can affect the performance of the co-selection method, resulting in either
higher or lower performance.
Another potential piece of future work for query clustering with the co-selection
method is a comparison between method1 and the query clustering of [4], which was not
performed in this study. The performance of [4] is higher than that of [3] because its query
clustering reduces noise effectively. However, it is unclear whether that query clustering
can outperform query clustering on co-selected data (method1). Additionally, if the
performance of both methods is similar, our method would be the more attractive because
it does not require setting an artificial parameter.
Chapter 6 Conclusion
Co-selection is a relatively new method for clustering search terms semantically by
exploiting users' judgement as the similarity function. There is potential to use this
similarity function to discriminate word senses and to construct sets of synonyms by query
clustering on co-selected data. Previous studies on the co-selection method have failed to
conclusively demonstrate its performance because of unsuitable datasets. This study
therefore selected its dataset carefully to perform new experiments.
A literature survey was conducted to develop background knowledge and establish
the current state of research on the co-selection method. It showed that previous studies
have not used the co-selection method on web search to determine word senses, and that
previous studies on query clustering have not evaluated the ability to create unambiguous
clusters, except for [21], which attempted such an evaluation but with an inappropriate
dataset: randomly extracting data from the entire dataset resulted in an unambiguous
dataset.
There were 2 objectives for this study:
- to evaluate whether there was a difference between the number of clusters
generated by co-selected data for unambiguous queries and for ambiguous queries;
- to evaluate whether query clustering on co-selected data could help to create
unambiguous clusters.
The first objective was to find out whether the number of clusters generated by
co-selected data for ambiguous queries would be statistically higher than for unambiguous
queries. To answer this question, the experiment needed to identify both ambiguous and
unambiguous query terms and to generate clusters for the evaluation.
The second objective was to find out whether query clustering on co-selected data
could cluster ambiguous queries better than basic query clustering [3] on an ambiguous
dataset. For this reason, ambiguous query terms alone were required for this experiment.
To achieve the objectives of the study, the methodology comprised the following
stages:
- Identifying ambiguous and unambiguous query terms, using Wikipedia article
titles as the source.
- Preprocessing the data for the co-selection experiments: queries were normalized
so that each accumulated more clicks, SessionIDs were generated to represent a
unique user clicking on search results, and sessions with fewer than 2 clicks were
filtered out.
- For word sense discrimination, 20 truly ambiguous and 20 truly unambiguous
queries were identified. SenseIDs were then generated to indicate how many
clusters each query has. After that, the numbers of clusters generated for
ambiguous and unambiguous queries were compared to test whether there was a
difference between the 2 groups, using a paired t-test and basic statistics.
- For query clustering, ConnectedIDs for method0 (with (query, URL) as the input
data) and for method1 (with ((query, SessionID), URL) as the input data) were
generated by the simple query clustering of [3], because it does not require any
parameters. 10 truly ambiguous queries were identified for extracting queries for
both methods. Query pairs for both method0 and method1 were then randomly
generated for the evaluation, and human judgement was used to evaluate the
relationship within each query pair. To calculate the results, Fleiss free-marginal
kappa was used to find the level of agreement between participants, and basic
statistics were used to work out the proportions of semantically similar queries
for the 2 methods.
Unfortunately, the outcome of the word sense discrimination experiment indicates
that there is no difference between the numbers of clusters generated from the selected
ambiguous and unambiguous queries. This could be because changes in search results, the
age of the dataset, the scope of ambiguous terms, and the noise filtering for co-selected
data affected the ability of the co-selection method to discriminate word senses in this
study.
On the other hand, the outcome of the query clustering experiment matched our
expectation. Method1 performed query clustering on the ambiguous dataset better than
method0 in both the overall and the individual proportions of semantically similar pairs.
Additionally, the level of agreement between participants was relatively high. Therefore,
based on these data, query clustering with the co-selection method is able to create
unambiguous clusters.
Finally, for word sense discrimination by the co-selection method, future work
should focus on image search instead of web search. For query clustering by the
co-selection method, future work should use a contemporary dataset, or could consider
comparing query clustering on the co-selection method with a single-click query clustering
that performs well in noise filtering, such as [4].
Appendix A - Complete Explanation of the Query Clustering Methodology
Since the explanation of the query clustering experiment is long, it would be too
much detail for the methodology chapter and is therefore given here (Appendix A). The
following are the complete stages of the methodology for the query clustering experiment.
A.1 – Section outline
- Generating identification of connected components (A.2)
- The selection of 10 truly ambiguous query terms (A.3)
- The extraction of related queries (A.4)
- Generating query pairs (A.5)
- Word sense evaluation (A.6)
- The approaches for working out the results (A.7)
A.2 – Generating identification of connected components
Firstly, an identification of connected components (ConnectedID) was required to
indicate which queries cluster together. The fundamental query clustering of [3] was
chosen to generate ConnectedIDs because it does not require setting a parameter.
Additionally, although the query clustering from [4] can perform better noise filtering than
[3], it requires setting a threshold for the noise filtering, and this threshold is an artificial
factor: it is difficult to know which threshold is best for our data. Furthermore, it was
unclear whether noise filtering during the generation of ConnectedIDs could affect the
performance of the co-selection method; for example, it might filter out queries that are not
useful for query clustering on single clicks but are useful for query clustering on co-selected
data. Thus, the query clustering of [3] is rational for this experiment because it does not
require any parameter, and it does not interfere with the ConnectedIDs assigned for query
clustering on the co-selection method.
In this study, there were two methods for generating unique ConnectedIDs, in order to compare single-click data with co-selected data: method0 was for single-click data, whereas method1 was for co-selected data. The table used for generating ConnectedIDs was CT-SSG1, which was introduced in Section 3.3 (data preprocessing). The difference between method0 and method1 was the input data used to generate the ConnectedIDs: for method0 the input pairs were (Query, URL), while for method1 they were ((Query, SessionID), URL). For this reason, the ConnectedIDs generated for method0 and method1 were different. Another difference is that method0 had only one cluster per query, while method1 could have more than one cluster, depending on the co-selected data. After the ConnectedIDs were assigned, the resulting table for method0 was "METHOD0" and the table for method1 was "METHOD1". In short, the ConnectedIDs for method0 and method1 were generated from different input data in order to compare query clustering on single-click data with query clustering on co-selected data.
A.3 – The selection of 10 truly ambiguous query terms
After the ConnectedIDs were generated, a list of ambiguous terms needed to be created for use in selecting truly ambiguous query terms for the experiment. The list was generated by mapping both tables (METHOD0 and METHOD1) to the "ALL-AMBIGUOUS" table, which was generated in Section 3.1. In fact, the queries occurring in METHOD0 and METHOD1 were the same (only the assigned ConnectedIDs differed), so it was not strictly necessary to map both tables to the "ALL-AMBIGUOUS" table. However, mapping them was a simple task, and we also wanted to confirm that the number of ambiguous queries from method0 and method1 was the same. We therefore first mapped "METHOD0" to "ALL-AMBIGUOUS" and recorded the number of ambiguous queries after mapping, which was 1,090. We then mapped both "METHOD0" and "METHOD1" to "ALL-AMBIGUOUS" and checked the number of ambiguous queries again. As expected, the number was the same, 1,090. Thus, we created a new table, called "AMBIGUOUS_COMMON", from the mapped tables just mentioned.
However, for this experiment we needed to select only 10 truly ambiguous queries, because of two concerns: the time participants would spend on the word sense evaluation, and generating adequate query pairs from each ambiguous query. The time concern arose because if we selected too many query pairs for evaluation, it would take participants too long to complete, and their evaluation performance would degrade the longer they spent on the task. On the other hand, the concern about adequate query pairs arose because if we chose too few truly ambiguous queries to generate query pairs, it might be difficult to interpret the results properly. For this reason, we decided to choose 10 truly ambiguous queries and to randomly select 24 query pairs (12 for method0 and 12 for method1) for each of them, giving a total of 240 query pairs. We assumed participants would spend 10 seconds per pair, so completing 240 pairs would take 40 minutes. This might still be a little too long for participants, but, as mentioned above, we were also concerned about having too few pairs for interpreting results. Therefore, we allowed participants to pause the evaluation whenever they wanted and to return later as existing participants. Choosing 10 truly ambiguous queries was thus a suitable balance between the issues of inadequate query pairs and the time spent completing the evaluation.
To find 10 rational ambiguous queries for this experiment, we added two additional fields to the "AMBIGUOUS_COMMON" table: the size of the cluster from method0 and the number of clusters from method1. These two fields were used for sorting the ambiguous queries. Although there were 1,090 common ambiguous queries that we could use for the experiment, as mentioned earlier in this section, some of them might not be rational because the method0 cluster size was too small and/or method1 produced only one cluster, the same as method0, so they could not be compared with method0. For this reason, we first sorted the queries by number of clusters, since the difference between the methods would become more significant when the number of method1 clusters was higher. We then sorted the queries by the size of the method0 cluster, both to make sure that each ambiguous query could generate 12 pairs (the cluster had to contain at least 6 unique queries) and because the greater the cluster size, the more potentially ambiguous queries would occur. After sorting by number of clusters and cluster size respectively, the candidate queries were evaluated by checking whether each sense of the ambiguous term accounted for at least 20 percent of the search results on the first page of Google, Bing, and Yahoo. When one of the top sorted queries was not clearly ambiguous, we did not choose it and moved on to the next one, iterating until 10 truly ambiguous queries had been selected. For this reason, the 10 selected queries would be rational enough for the experiment. In short, sorting by the number of clusters (of method1) and the size of the cluster (of method0), then evaluating ambiguity against search results, were the criteria for finding 10 rational ambiguous queries.
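The two-key sort described above can be expressed compactly. This is an illustrative sketch only; the field names n_clusters_m1 and cluster_size_m0 are invented for the example and do not appear in the thesis tables:

```python
def rank_candidates(rows):
    """Sort candidate ambiguous queries: primarily by the number of method1
    clusters (descending), then by the method0 cluster size (descending)."""
    return sorted(rows,
                  key=lambda r: (r["n_clusters_m1"], r["cluster_size_m0"]),
                  reverse=True)

rows = [
    {"query": "bank",   "n_clusters_m1": 1, "cluster_size_m0": 90},
    {"query": "jaguar", "n_clusters_m1": 3, "cluster_size_m0": 40},
    {"query": "apple",  "n_clusters_m1": 3, "cluster_size_m0": 55},
]
print([r["query"] for r in rank_candidates(rows)])  # -> ['apple', 'jaguar', 'bank']
```

Queries such as "bank", with a single method1 cluster, sink to the bottom because they cannot be compared against method0, which matches the selection rationale above.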
A.4 – The extraction of related queries
The ConnectedIDs of the 10 selected ambiguous queries were used to extract the queries related to them. These ConnectedIDs occurred in the "METHOD0" and "METHOD1" tables, and were used to extract the related queries, i.e. those sharing a ConnectedID with one of the 10 selected ambiguous queries, into the "METHOD0_SELECTED" and "METHOD1_SELECTED" tables. For method0, there was only one ConnectedID for each of the 10 selected ambiguous queries, because method0, based on single-click data, considered all queries sharing the same URLs as related. On the other hand, for method1 there could be multiple clusters, i.e. more than one ConnectedID, for each selected ambiguous query, because method1, based on co-selected data, incorporated each user's judgment when clustering related queries. For example, if there were two clusters for the query term "apple", there would be two ConnectedIDs, each representing a different set of related queries. Therefore, the sets of queries were extracted into the tables "METHOD0_SELECTED" and "METHOD1_SELECTED" based on the ConnectedIDs of the selected ambiguous queries occurring in "METHOD0" and "METHOD1".
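The extraction step amounts to filtering each method table by the selected ConnectedIDs. A minimal sketch, with an invented in-memory row format standing in for the database tables:

```python
def extract_related(table, selected_ids):
    """Keep only the rows whose ConnectedID belongs to one of the selected
    ambiguous queries' components."""
    return [row for row in table if row["connected_id"] in selected_ids]

# Two method1 components for "apple" (IDs 7 and 9), plus an unrelated one.
method1_rows = [
    {"query": "macbook pro", "connected_id": 7},   # "apple" (computer) cluster
    {"query": "fruit salad", "connected_id": 9},   # "apple" (fruit) cluster
    {"query": "tennis",      "connected_id": 3},   # unrelated component
]
selected = extract_related(method1_rows, {7, 9})
print([r["query"] for r in selected])  # -> ['macbook pro', 'fruit salad']
```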
A.5 – Generating query pairs
Query pairs for the evaluation then needed to be randomly selected from "METHOD0_SELECTED" and "METHOD1_SELECTED", because we needed only 12 query pairs for each of the 10 selected ambiguous query terms. This means we needed 120 query pairs for method0 and another 120 for method1, for a total of 240 as discussed earlier in this chapter. As mentioned above, the difference between the related queries clustered in method0 (one cluster) and method1 (multiple clusters) was the number of clusters per selected ambiguous query. For method0, we randomly selected 12 query pairs as follows:
1. We sorted the related queries by query term.
2. We created a list of the potential unique query pairs for a selected ambiguous query, from which to randomly select; for example, if there were four members for a term, A, B, C and D, the list of potential unique query pairs would be as shown in Table A-1. (Note that our data always had more than 4 members, due to the solution to inadequate query pairs explained above, but we use 4 members here to keep the example easy to understand.)
Index  Potential Pairs    Index  Potential Pairs
1      A-B                4      B-C
2      A-C                5      B-D
3      A-D                6      C-D
Table A-1: Example of a list of potential unique query pairs
3. We randomly generated 12 unique indexes to select 12 unique pairs for the ambiguous term.
4. We iteratively performed steps 1 to 3 with a different selected ambiguous query until all 10 selected ambiguous queries were complete.
From these steps, we finally had 120 unique query pairs for the evaluation of method0.
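Steps 1-4 above can be sketched as follows (an illustrative version, not the original implementation): enumerate every unordered pair of related queries, then draw 12 distinct pairs by index. With the minimum cluster size of 6 unique queries there are C(6,2) = 15 candidate pairs, so 12 unique pairs always exist:

```python
import itertools
import random

def select_pairs(queries, n_pairs=12, seed=0):
    """Enumerate all unique unordered pairs of related queries (steps 1-2),
    then draw n_pairs of them without replacement (step 3)."""
    pool = list(itertools.combinations(sorted(queries), 2))
    return random.Random(seed).sample(pool, n_pairs)

# With 4 members A-D there are C(4,2) = 6 candidate pairs, as in Table A-1.
pairs = select_pairs(["A", "B", "C", "D"], n_pairs=6)
print(len(pairs), len(set(pairs)))  # -> 6 6
```

Sampling without replacement guarantees the 12 pairs are unique, so no rejection loop is needed for method0.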
For method1, the process was slightly different from method0 due to the multiple clusters per ambiguous query term. The following are the steps for randomly selecting query pairs for the evaluation:
1. We randomly selected one of the multiple clusters of a selected ambiguous query.
2. We sorted the related queries by query term.
3. We created a list of the potential unique query pairs for the randomly selected cluster, as in the method0 example (see Table A-1).
4. We randomly generated one unique index to select one of the 12 unique pairs for the ambiguous term.
5. If the pair duplicated a previously selected pair, it was not recorded and the process restarted from the first step. If it did not, it was recorded, and the process iterated from step 1 until 12 unique pairs had been selected for the ambiguous term.
6. We iteratively selected 12 unique pairs for each of the other selected ambiguous queries.
From these steps, we finally had 120 unique query pairs for the evaluation of method1 as well. This means that all 240 query pairs for the experiment had been randomly selected for the evaluation.
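The method1 steps differ from method0 only in the rejection-sampling loop. A sketch, under the assumption that the clusters jointly contain at least n_pairs distinct pairs (otherwise the loop could never terminate):

```python
import itertools
import random

def select_pairs_multicluster(clusters, n_pairs=12, seed=0):
    """Steps 1-5: repeatedly pick a random cluster, draw one random pair
    from it, and record the pair only if it was not selected before, until
    n_pairs unique pairs are collected."""
    rng = random.Random(seed)
    selected = set()
    while len(selected) < n_pairs:
        cluster = rng.choice(clusters)                           # step 1
        pool = list(itertools.combinations(sorted(cluster), 2))  # steps 2-3
        selected.add(rng.choice(pool))  # step 5: duplicates vanish in the set
    return sorted(selected)

# Two clusters of one ambiguous term, e.g. two senses discovered by method1.
clusters = [["A", "B", "C", "D"], ["E", "F", "G", "H"]]
pairs = select_pairs_multicluster(clusters, n_pairs=8)
print(len(pairs))  # -> 8
```

Because each draw picks a cluster first, pairs never straddle two clusters, which keeps every evaluated pair inside a single candidate sense.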
A.6 – Word sense evaluation
Figure A-1: Example of the evaluation of a query pair
To use human judgment to evaluate the relationships between query pairs, a simple user interface was built as a small web application. Each feature of this application served a purpose, as follows:
1. There was a login page for both new participants and existing participants (in case a participant preferred not to finish all pairs at once). This helped us to verify which participants completed the evaluation.
2. The evaluation page presented one randomly chosen query pair at a time for the participant to judge, iterating until no pair was left. This means that participants did not know whether a query pair came from method0 or method1 (see Figure A-1). The page had the following features:
a. There were 8 choices to help participants decide the relationship between a query pair:
i. [query 1] is the same concept as [query 2]
ii. [query 1] is a sibling concept of [query 2]
iii. [query 1] is the opposite of [query 2]
iv. [query 1] is not related to [query 2]
v. [query 1] is an example of [query 2]
vi. [query 1] is more generic than [query 2]
vii. [query 1] is a part of [query 2]
viii. [query 1] contains the part [query 2]
b. In case participants had no idea about the meaning of a query, they could click on it to see whether its search results related to the other query.
c. A history of evaluated query pairs was provided so that participants could change an evaluation, either because they changed their mind or because they accidentally chose the wrong option.
d. If a session timed out because a participant did not interact with the page for a period of time, the application forced the participant to log in again; otherwise, results recorded after the timeout would not contain the participant's name, making the results harder to calculate.
e. Although there were 240 pairs for the evaluation (120 unique pairs from method0 and 120 from method1), some query pairs were duplicated across method0 and method1. In fact, this did not affect the result; however, we did not want participants to evaluate the same pair twice. Therefore, when a participant evaluated a query pair duplicated across methods, the result was assigned to the other pair as well.
These were all the features of the application for the evaluation task. In short, participants evaluated randomly selected query pairs one by one until all query pairs were complete.
A.7 – The approaches to working out the result
The participants' ratings of the evaluated pairs, on the 8 choices (see Table A-2), were transformed to 1 (related meaning) or 0 (not related meaning).
#  Relationship between a query pair             Transformed value
1  [query 1] is the same concept as [query 2]    1 (Related)
2  [query 1] is a sibling concept of [query 2]   0 (Not related)
3  [query 1] is the opposite of [query 2]        0 (Not related)
4  [query 1] is not related to [query 2]         0 (Not related)
5  [query 1] is an example of [query 2]          1 (Related)
6  [query 1] is more generic than [query 2]      1 (Related)
7  [query 1] is a part of [query 2]              1 (Related)
8  [query 1] contains the part [query 2]         1 (Related)
Table A-2: Transformation of ratings to binary values
These values were then used to calculate the level of agreement between raters and the proportion of semantically similar pairs from method0 and method1.
The level of agreement between participants was evaluated because the results would not be reliable if the agreement occurred by chance. Fleiss' free-marginal kappa, not fixed-marginal kappa, was used for this task, since free-marginal kappa is suitable when raters are not restricted in how many items they may assign to each category [16]. An online tool for calculating free-marginal kappa, produced by [16], is available at http://justusrandolph.net/kappa/, and was used to calculate this result in this study. In short, free-marginal kappa was used to measure the level of agreement between participants.
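For reference, Randolph's free-marginal multirater kappa can also be computed directly from the per-pair category counts. This is a sketch of the formula from [16], with the observed agreement taken over all rater pairs and the chance agreement fixed at 1/k; it is not the tool used in the study:

```python
def free_marginal_kappa(counts, k=2):
    """Randolph's free-marginal multirater kappa.
    counts: one row per evaluated query pair, giving how many raters chose
    each of the k categories (here k = 2: related / not related); every row
    must sum to the same number of raters n."""
    n = sum(counts[0])   # raters per item
    N = len(counts)      # number of items (query pairs)
    # Observed agreement: mean proportion of agreeing rater pairs per item.
    p_o = sum(sum(c * c for c in row) - n for row in counts) / (N * n * (n - 1))
    p_e = 1.0 / k        # free-marginal chance agreement
    return (p_o - p_e) / (1.0 - p_e)

# Three raters agreeing perfectly on every pair yields kappa = 1.0.
print(free_marginal_kappa([[3, 0], [0, 3], [3, 0]]))  # -> 1.0
```

Unlike Fleiss' fixed-marginal kappa, the chance term here does not depend on the observed category marginals, which is why it suits raters who are free to use the "related" and "not related" categories as often as they wish.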
The proportion of semantically similar pairs was measured to indicate which method is better. This was done using basic statistics at both the overall level and the individual level. The resulting measurements should therefore be able to indicate whether method1 is better than method0.
References
[1] A. K. Agrahri, D. A. T. Manickam, and J. Riedl, "Can people collaborate to improve the relevance of search results?," in Proceedings of the 2008 ACM Conference on Recommender Systems, Lausanne, Switzerland: ACM, 2008, pp. 283-286.
[2] R. Baeza-Yates, C. Hurtado, M. Mendoza, and G. Dupret, "Modeling User Search Behavior," in Proceedings of the Third Latin American Web Congress, IEEE Computer Society, 2005, p. 242.
[3] D. Beeferman and A. Berger, "Agglomerative clustering of a search engine query log," in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, Massachusetts, United States: ACM, 2000, pp. 407-416.
[4] W. S. Chan, W. T. Leung, and D. L. Lee, "Clustering Search Engine Query Log Containing Noisy Clickthroughs," in Proceedings of the 2004 International Symposium on Applications and the Internet (SAINT'04), 2004, p. 4.
[5] C. L. A. Clarke, E. Agichtein, S. Dumais, and R. W. White, "The influence of caption features on clickthrough patterns in web search," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands: ACM, 2007, pp. 135-142.
[6] B. Croft, R. Cook, and D. Wilder, "Providing Government Information on the Internet: Experiences with THOMAS," University of Massachusetts, 1995.
[7] M. Ester, H. P. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, 1996, pp. 226-231.
[8] J. A. Guthrie, L. Guthrie, Y. Wilks, and H. Aidinejad, "Subject-dependent co-occurrence and word sense disambiguation," in Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, California: Association for Computational Linguistics, 1991, pp. 146-152.
[9] P. J. Hayes, L. E. Knecht, and M. J. Cellio, "A news story categorization system," in Proceedings of the Second Conference on Applied Natural Language Processing, Austin, Texas: Association for Computational Linguistics, 1988.
[10] T. Joachims, "Optimizing search engines using clickthrough data," in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada: ACM, 2002, pp. 133-142.
[11] T. Joachims, L. Granka, B. Pan, H. Hembrooke, and G. Gay, "Accurately interpreting clickthrough data as implicit feedback," in Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil: ACM, 2005, pp. 154-161.
[12] T. Joachims, L. Granka, B. Pan, H. Hembrooke, F. Radlinski, and G. Gay, "Evaluating the accuracy of implicit feedback from clicks and query reformulations in Web search," ACM Transactions on Information Systems, vol. 25, p. 7, 2007.
[13] A. Kilgarriff, "Word Senses," in Word Sense Disambiguation, 2006, pp. 29-46.
[14] H. Lieberman, "Letizia: an agent that assists web browsing," in Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 1, Montreal, Quebec, Canada: Morgan Kaufmann Publishers Inc., 1995, pp. 924-929.
[15] R. Navigli, "Word sense disambiguation: A survey," ACM Computing Surveys, vol. 41, pp. 1-69, 2009.
[16] J. Randolph, "Free-Marginal Multirater Kappa (multirater K[free]): An Alternative to Fleiss' Fixed-Marginal Multirater Kappa," in Joensuu University Learning and Instruction Symposium, Joensuu, Finland, 2005.
[17] M. Sanderson, "Ambiguous queries: test collections need more sense," in Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Singapore: ACM, 2008, pp. 499-506.
[18] F. Scholer, M. Shokouhi, B. Billerbeck, and A. Turpin, "Using Clicks as Implicit Judgments: Expectations Versus Observations," in Advances in Information Retrieval, 2008, pp. 28-39.
[19] H. Schütze, "Automatic word sense discrimination," Computational Linguistics, vol. 24, pp. 97-123, 1998.
[20] D. Shen, M. Qin, W. Chen, Q. Yang, and Z. Chen, "Mining web query hierarchies from clickthrough data," in Proceedings of the 22nd National Conference on Artificial Intelligence - Volume 1, Vancouver, British Columbia, Canada: AAAI Press, 2007, pp. 341-346.
[21] G. Smith and H. Ashman, "Evaluating implicit judgements from Image search interactions," in Proceedings of WebSci '09: Society On-Line, 2009.
[22] G. Smith, T. Brailsford, C. Donner, D. Hooijmaijers, M. Truran, J. Goulding, and H. Ashman, "Generating unambiguous URL clusters from web search," in Proceedings of the 2009 Workshop on Web Search Click Data, Barcelona, Spain: ACM, 2009, pp. 28-34.
[23] K. Spärck Jones, S. E. Robertson, and M. Sanderson, "Ambiguous requests: implications for retrieval tests, systems and theories," SIGIR Forum, vol. 41, pp. 8-17, 2007.
[24] N. Tomasz, "Word Sense Discovery for Web Information Retrieval," 2008, pp. 267-274.
[25] M. Truran, J. Goulding, and H. Ashman, "Co-active intelligence for image retrieval," in Proceedings of the 13th Annual ACM International Conference on Multimedia, Hilton, Singapore: ACM, 2005, pp. 547-550.
[26] J. Véronis, "HyperLex: lexical cartography for information retrieval," Computer Speech & Language, vol. 18, pp. 223-252, 2004.
[27] A. J. Viera and J. M. Garrett, "Understanding interobserver agreement: the kappa statistic," Family Medicine, vol. 37, pp. 360-363, 2005.
[28] J.-R. Wen and H.-J. Zhang, Query Clustering in the Web Context, Kluwer Academic Publishers, 2002.
[29] J. R. Wen, J. Y. Nie, and H. J. Zhang, "Query clustering using user logs," ACM Transactions on Information Systems, vol. 20, pp. 59-81, 2002.
[30] G.-R. Xue, H.-J. Zeng, Z. Chen, Y. Yu, W.-Y. Ma, W. Xi, and W. Fan, "Optimizing web search using web click-through data," in Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, Washington, D.C., USA: ACM, 2004, pp. 118-126.