Ability of Co-selection to Group Words Semantically

by Satit Chaprasit

Supervisor: Associate Professor Helen Ashman
Associate Supervisor: Gavin Smith

A thesis submitted for the degree of Master of Computer and Information Science
School of Computer and Information Science
Division of Information Technology, Engineering and the Environment
University of South Australia
October 2010

Table of Contents

List of Figures
List of Tables
Declaration
Acknowledgements
Abstract
Chapter 1 Introduction
1.1 Motivation
1.2 Thesis background
1.3 The thesis fields
1.4 Research Questions
1.5 Explanation of the research questions
1.6 Contributions
Chapter 2 Literature Survey
2.1 Background
2.1.1 Fundamental Knowledge on Click-Through Data as Implicit Feedback
2.1.2 Clustering with the Co-Selection Method
2.2 Related Work
2.2.1 Word Sense Disambiguation
2.2.2 Query Clustering
Chapter 3 Methodology
3.1 Method Outline
3.2 Identifying ambiguous and unambiguous terms
3.3 Data Preprocessing
3.4 Data Selection for the Word Sense Discrimination Experiment
3.5 Data Selection for the Query Clustering Experiment
3.6 Expected Outcomes
Chapter 4 Results
4.1 Results of Word Sense Discrimination by the Co-selection Method
4.2 Results of the query clustering experiment on the ambiguous dataset
Chapter 5 Discussion
5.1 Discussion of the word sense discrimination experiment
5.2 Discussion of the query clustering experiment
5.3 Scope and Limitations
5.4 Future work
Chapter 6 Conclusion
Appendix A - Complete Explanation of the Query Clustering Methodology
A.1 Section outline
A.2 Generating identification of connected components
A.3 The selection of 10 truly ambiguous query terms
A.4 The extraction of related queries
A.5 Generating query pairs
A.6 Word sense evaluation
A.7 The approaches to working out the result
References

List of Figures

Figure 3-2 Stage of selecting truly ambiguous and unambiguous queries
Figure 3-3 Cluster generated by the query graph
Figure 4-1 Numbers of clusters generated for each of 20 ambiguous and unambiguous queries
Figure 4-2 Overall proportion of semantically similar queries of method0
Figure 4-3 Overall proportion of semantically similar queries of method1
Figure 4-4 Comparison of individual proportions of semantically similar queries

List of Tables

Table 3-1 Typing errors of the ambiguity indicator found during data preprocessing
Table 4-1 Numbers of clusters generated for ambiguous and unambiguous query terms
Table 4-2 Basic statistics for the WSD experiment
Table 4-3 Level of agreement between participants
Table 4-4 The proportion of semantically similar pairs as rated by 11 participants

Declaration

I declare that this thesis does not incorporate, without acknowledgment, any material previously submitted for a degree or diploma in any university; and that, to the best of my knowledge, it does not contain any material previously published or written by another person except where due reference is made in the text.

Satit Chaprasit
20th October 2010

Acknowledgements

I would like to thank my supervisors, Helen Ashman, Gavin Smith, and Mark Truran, for their time, valuable suggestions, and support during this year. I would also like to thank the members of the security lab for their help and support. I also thank my friends and the participants who helped me to complete my thesis. Finally, I would like to thank my family in Thailand and my cousin here for their encouragement and support.

Abstract

The meanings of ambiguous words vary across contexts. Using machines to identify the meaning of ambiguous words is called Word Sense Disambiguation (WSD). However, there are open problems in this area. A major problem is automatically determining word sense, because meanings vary with context and new meanings can appear at any time. Another problem is automatically constructing the sets of synonyms used in word sense disambiguation. The co-selection method, which exploits users' agreement on clickthrough data as a similarity measure, could be used to address both WSD problems. However, previous studies of the co-selection method have so far failed to demonstrate its performance because of unsuitable datasets. This study aims to redress that situation by carefully selecting datasets to evaluate the ability of co-selection to discriminate word senses and to construct sets of synonyms.
The dataset was established by using Wikipedia article titles as a source to create a list of ambiguous and unambiguous query terms. The query terms selected for the experiments were also checked as to whether they were truly ambiguous or unambiguous. For the word sense discrimination experiment, the numbers of clusters generated by the co-selection method for the selected ambiguous and unambiguous queries were used to evaluate whether there is a correlation between them. For the experiment on the ability to construct sets of synonyms, human judgments were used to evaluate query pairs generated by query clustering on single-click data and by query clustering on co-selected data. The outcomes of the experiments indicate that it is difficult to use the co-selection method on web search to discriminate word senses effectively, but that query clustering on co-selected data can help to create unambiguous clusters, which could be used to automatically construct sets of synonyms.

Chapter 1 Introduction

People increasingly use web search to satisfy their information needs. This means that a great number of users interact with search engines by submitting queries and then selecting the returned results that match the information need they have in mind. In this study, such selections are viewed in two forms:

Clickthrough data: the collection of results selected after submitting a query. A selection indicates that the user judges the selected (clicked) result to be relevant to the search term.

Co-selection data: when a user selects more than one returned result for a submitted query, the selected results can be regarded as mutually relevant. The assumption is that the user has only one sense in mind and will only select results that match that purpose.
According to this assumption, co-selected data can help to deal with ambiguous returned results, because users will select multiple results having the same meaning. This study will reveal the ability of co-selection to group words semantically.

1.1 Motivation

There are two unsolved problems in Word Sense Disambiguation (WSD). The first is identifying the senses of ambiguous words: a word's meaning can change when it occurs in a different context, and new meanings can emerge at any time. The second is how to build sets of synonyms automatically, since manually updating synonym sets requires considerable human effort. However, a relatively new technique, called co-selection, could address these problems. It could be used to discriminate word senses and to create unambiguous clusters from ambiguous query terms automatically. Previous studies of co-selection have failed to reach a conclusion about the method's performance because their datasets were unsuitable. This study therefore redresses the situation with appropriate datasets for new experiments.

1.2 Thesis background

Ambiguity is widespread in human language. Many words can be interpreted with different meanings depending on the context in which they occur. Word sense disambiguation is the discovery of the meanings of words in different contexts. However, dealing with the different meanings of an ambiguous word is complicated for machines [15]. For example, search systems face the problem of lexical ambiguity: for a given query, it is difficult for a system to provide exactly the information the user needs, owing to low query quality and ambiguity [6]. The method proposed in this study, the co-selection method, aims to deal with these problems by exploiting users' judgements from click-through data as a similarity function to automatically determine the senses of an ambiguous query.
This method does not depend on potentially outdated external knowledge sources or on extensive preprocessing of datasets, since it requires only a simple signal: user consensus [25]. The method's assumption is that when submitting a query, a user has only one sense (meaning) of the query in mind. Different users will therefore select different search results, each matching the sense in their mind. When a user selects more than one result, those results are called "co-selected data". When a number of users co-select the same results for the same query, this reinforces the view that the co-selected results are mutually relevant. For this reason, a group of similar results indicates a distinct sense of an ambiguous word. In short, the method exploits users' agreement as a similarity function to discriminate word senses.

Although the result of the first study of the co-selection method [25] indicated that the method was feasible for discriminating the senses of ambiguous queries, that study focused only on image search. Additionally, it used an artificial dataset: it was a small-scale experiment performed by volunteers under artificial controls. It is therefore difficult to conclude that the co-selection method can discriminate word senses based only on that experiment. Another study based on the co-selection method [22], from 2009, aimed to construct unambiguous clusters, rather than to discriminate word senses, by using query clustering on co-selected data. That study performed an experiment on a real-world dataset, based on web document search rather than image search. However, its result was not useful for assessing co-selection, because randomly extracting data from the entire dataset produced an unambiguous dataset. That dataset, containing no ambiguous terms, was therefore inappropriate for evaluating the ability of the co-selection method to create unambiguous clusters.
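The user-consensus similarity described above can be sketched in a few lines: for each query, count how many distinct sessions co-selected each pair of URLs, and treat that count as the strength of mutual relevance. This is a minimal illustration under stated assumptions, not the thesis implementation; the log layout, field names, and example URLs are invented.

```python
from collections import defaultdict
from itertools import combinations

def co_selection_counts(click_log):
    """Count, per query, how many distinct sessions co-selected each URL pair.

    click_log: iterable of (session_id, query, url) tuples. The field layout
    is illustrative, not the thesis dataset's actual schema.
    """
    # Group clicked URLs by (query, session): one session = one information need.
    session_clicks = defaultdict(set)
    for session_id, query, url in click_log:
        session_clicks[(query, session_id)].add(url)

    # URLs co-selected in the same session are assumed mutually relevant;
    # the pair count across sessions is the user-consensus similarity.
    pair_counts = defaultdict(lambda: defaultdict(int))
    for (query, _sid), urls in session_clicks.items():
        for a, b in combinations(sorted(urls), 2):
            pair_counts[query][(a, b)] += 1
    return pair_counts

# Invented example: two "car" sessions agree, one "animal" session stands alone.
log = [
    (1, "jaguar", "cars.example/jaguar"),
    (1, "jaguar", "dealer.example/jaguar"),
    (2, "jaguar", "cars.example/jaguar"),
    (2, "jaguar", "dealer.example/jaguar"),
    (3, "jaguar", "zoo.example/jaguar"),
    (3, "jaguar", "wildlife.example/jaguar"),
]
counts = co_selection_counts(log)
```

Pairs with high counts would then be grouped into clusters, each cluster ideally corresponding to one sense of the ambiguous query.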
Since the datasets of previous studies of the co-selection method were unsuitable for drawing conclusions about its ability to group words semantically, this study performs both a word sense discrimination experiment and the query clustering experiment on co-selected data [22], but with an appropriate, carefully established dataset. We then discuss the results, potential factors, and potential future work in order to draw conclusions about the ability of the co-selection method.

1.3 The thesis fields

Word Sense Disambiguation and Information Retrieval

1.4 Research Questions

Is there a correlation between the numbers of clusters generated by the co-selection method for ambiguous and unambiguous queries?

Can query clustering on co-selected data help to create unambiguous clusters?

1.5 Explanation of the research questions

The capability of the co-selection method has not yet been conclusively evaluated, owing to inappropriate datasets. This study therefore aims to assess its performance by carefully selecting the dataset for new experiments. There are two questions in this study. The first concerns word sense discrimination: it asks whether there is a correlation between the numbers of clusters generated for ambiguous and unambiguous queries; intuitively, the more clusters, the more meanings the query has. The second concerns the ability of co-selection to construct sets of synonyms: it asks whether query clustering on co-selected data can help to create unambiguous clusters.

For the first question, according to the assumption of the co-selection method introduced earlier in this chapter, users click mostly on results matching their information needs, so an ambiguous query should generate a greater number of clusters than an unambiguous query. If the outcome occurs in this way, it will indicate a correlation between the numbers of clusters generated for ambiguous and unambiguous queries.
For the second question: with ambiguous queries, the performance of query clustering is reduced, because the multiple senses of a query prevent the clustering from grouping related queries effectively. The co-selection method could help to perform this task better, because query clustering on co-selected data also considers the information need of each individual user when grouping related queries. We therefore examine whether query clustering on co-selected data can help to create unambiguous clusters.

1.6 Contributions

In summary, the potential contributions of this study are the following:

This study could strengthen the credibility of the co-selection method for grouping query terms semantically by carefully selecting the dataset for new experiments.

This study will identify problems in using the co-selection method on web search.

This study will investigate whether click-through data on web search is reliable enough for the co-selection method to discriminate word senses.

Chapter 2 Literature Survey

This chapter has two parts: background and related work. The background first provides essential information about click-through data as implicit feedback, and then covers previous experiments on the co-selection method. The second part surveys related work in the areas of word sense disambiguation and query clustering.

2.1 Background

2.1.1 Fundamental Knowledge on Click-Through Data as Implicit Feedback

With explicit feedback, users are asked to assess which results are relevant to a search query so that the retrieval function can be optimized using this feedback. However, users are usually unwilling to give such assessments [10]. In addition, according to [11], manually developing a retrieval function is not only time-consuming but sometimes also impractical.
In contrast, implicit feedback in the form of click-through data can be accumulated at very low cost, in great quantities, and without requiring any additional user activity [11]. Click-through data can be extracted from a search engine log, which indicates which search result URLs were clicked by users after submitting a search query. Although clicked URLs are not exact relevance judgements, they at least indicate a relationship between users' judgements and the documents clicked for the queries [28]. Since [14] first proposed this form of implicit feedback for a web search system, it has been used in a number of ways, such as developing information retrieval functions [10], optimizing web search [30], investigating user search behaviour [2, 5, 18], and clustering web documents and queries [3].

Even though click-through data seems a promising form of implicit feedback on web search, many studies have questioned it. There are three major difficulties in mining click-through data: noise and incompleteness, sparseness, and newly emerging queries and documents [30]. Biases in the documents users select have also been found in several studies. According to [1], users assume the top result is relevant even when the results are randomly ordered. In addition, [11] point out that the high ranking of a search result (trust bias) can influence users to click on it even when it is less relevant to their information need than other results. The overall quality of the retrieval system (quality bias) also influences how users select documents: if the system can provide only barely relevant results for a query, users will select those less relevant results anyway.
Furthermore, ambiguous queries are a major challenge for information retrieval systems that exploit click-through data as implicit relevance feedback, since ambiguity is difficult to deal with. Users usually submit short search terms, which makes the keywords ambiguous; IS-A relationships can also lead to ambiguity [20]. Although the accuracy of relevance inferred from click-through data at the document level is quite low, approximately 52% [18], the accuracy of users' judgements of the result abstracts (snippets taken from the document and containing the search terms) is 82.6% [12]. The co-selection method is based on how people judge the abstracts of documents rather than the documents themselves. For this reason, the co-selection method on web search could be reliable enough for these experiments.

2.1.2 Clustering with the Co-Selection Method

The co-selection method was first proposed by [25]. That study introduced how to benefit from users' consensus as a similarity measure on click-through data. It aimed to evaluate the assumption of the co-selection method on an image search system, SENSAI, testing whether images co-selected for the same query indicated mutual relevance. For each term, the clustering method collected the clicked images by initially placing each URL at a random position on a single axis, and then moving URLs closer together on the axis whenever they were co-selected. The experimental result showed that exploiting users' consensus could automatically separate the senses of ambiguous search terms without relying on laboriously maintained and gathered knowledge resources. However, the limitation of that study was its artificial data, since the dataset was collected by asking users to perform search tasks under controlled conditions. There was another co-selection study, by [22], which aimed to evaluate the performance of the co-selection method for creating unambiguous clusters by query clustering.
That study used a real-world dataset. However, the result was unexpected: the alternative method, which had no disambiguation capability, gave the best semantic coherence, rather than query clustering on co-selected data. The researchers concluded that randomly extracting queries from the real-world dataset was not a good idea, because [17, 29] point out that only about 4% of the words submitted to a general search engine are ambiguous. Randomly selecting data from the click-through data therefore resulted in an unambiguous dataset, which is not appropriate for evaluating the performance of the co-selection method.

Since an appropriate dataset is needed to verify the capability of the co-selection method, this thesis carefully extracts data from a real-world dataset so that it contains a proportion of known ambiguous terms. The credibility of co-selection for grouping words semantically will therefore be revealed in this study.

2.2 Related Work

2.2.1 Word Sense Disambiguation

Since dealing with ambiguity is complicated for machines, it requires a process of transforming unstructured textual information into analysed data structures in order to identify the underlying meaning. This process, the computational discovery of the meanings of words in context, is called Word Sense Disambiguation (WSD) [15]. A number of studies have approached it with different techniques, which usually rely on external resources [25]. For example, the study in [9] performed text categorization using natural language processing patterns derived from knowledge-based rules. Another example is the WSD study [8] proposing a method to find words related to a given word by using the Longman Dictionary of Contemporary English (LDOCE). The problem with external resources is that human effort and specialist involvement are required to maintain them manually.
However, [19] shows that word sense disambiguation can be divided into two subtasks: discrimination and labelling. Word sense discrimination partitions the senses of an ambiguous word into different groups, under the assumption that each group represents only one sense. Most word sense discrimination approaches are based on clustering. After this process, the clusters of senses can not only be sent to lexicographers for the labelling task, but can also be used in the field of information access without requiring definitions of terms. As [19] illustrates, a system can give examples from each cluster for users to decide which sense they want. For example, regardless of word definitions, query suggestion can present examples of the senses of a given query so that users can indicate which sense they intend. Word sense discovery is a significantly difficult task because the meanings of words vary across contexts, and a new meaning can be introduced at any time [13]. To deal with this problem, the studies in [19], [26], and [24] discriminated word senses with complex approaches. The co-selection method, in contrast, relies only on a simple signal, users' consensus, to discriminate word senses.

2.2.2 Query Clustering

[3] first used query clustering based on click-through data to group similar queries together. In this method, a bipartite graph indicates the relationships between distinct query nodes and distinct URL nodes. The algorithm, an agglomerative hierarchical algorithm, relies only on content-ignorant clustering: it discovers similar groups by iteratively merging queries and documents, one merge at a time. However, because it performs clustering per iteration, a noticeable disadvantage of this method is that it is quite slow [3].
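Content-ignorant clustering over such a query-URL bipartite graph can be sketched roughly as follows. This is a simplification of the cited algorithm, assuming a single Jaccard threshold on clicked-URL sets rather than the original alternating query/document merges; the function name, threshold, and demo data are all illustrative.

```python
def cluster_queries(query_urls, threshold=0.5):
    """Greedy agglomerative clustering of queries by clicked-URL overlap.

    query_urls: dict mapping query -> set of clicked URLs. Similarity is
    the content-ignorant Jaccard overlap of the clusters' URL sets.
    """
    clusters = [({q}, set(urls)) for q, urls in query_urls.items()]
    merged = True
    while merged:
        merged = False
        best = None
        # Find the most similar pair of clusters above the threshold.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i][1], clusters[j][1]
                sim = len(a & b) / len(a | b)
                if sim >= threshold and (best is None or sim > best[0]):
                    best = (sim, i, j)
        if best:
            _, i, j = best
            qs = clusters[i][0] | clusters[j][0]
            us = clusters[i][1] | clusters[j][1]
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
            clusters.append((qs, us))  # merge the best pair, then repeat
            merged = True
    return [qs for qs, _ in clusters]

# Invented demo: two Java queries share clicks, the third query does not.
demo = {
    "java tutorial": {"u1", "u2"},
    "learn java": {"u1", "u2", "u3"},
    "coffee beans": {"u9"},
}
groups = cluster_queries(demo)
```

The quadratic pair scan per merge also illustrates why this style of clustering is slow on large logs, as noted above.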
That study used half a million records of click-through logs from the Lycos search engine to evaluate query suggestions. However, according to [4], the algorithm is vulnerable to noisy clicks, since users sometimes select results by mistake or through poor interpretation of result captions. [4] therefore developed the method further by adapting the similarity function to detect noisy clicks and eliminate them.

Another query clustering method is session-based query clustering, proposed by [28], which was based on click-through data and utilized a combination of query content similarity and query session similarity. A query session was defined as the sequence of activities performed by a user after submitting a query. The clustering algorithm used in that work is DBSCAN [7], because it can deal with a large dataset efficiently and can integrate new queries incrementally. In addition, DBSCAN does not require manually setting the number of clusters or the maximum cluster size. However, in the reported experiment, only 20,000 queries were randomly extracted from the entire dataset, because the whole dataset was too large, approximately 22 GB. Also, since the dataset came from the Encarta website rather than a search engine, [22] point out that users might not interact with that system in the same way as with search engine systems.

In addition, according to [22], the agglomerative hierarchical clustering method proposed by [3] is unlike the session-based method in that it has no capability for discriminating word senses; it can only collect related information into the same cluster. Although query clustering has potential for use in word sense disambiguation, namely constructing sets of synonyms automatically, these studies have not evaluated its ability to create unambiguous clusters.
For this reason, it is rational for this project to evaluate that ability by using query clustering on co-selected data.

Chapter 3 Methodology

This study completed an exploratory methodology to evaluate the potential capability of the co-selection method to discriminate word senses and to cluster unambiguous queries. The first experiment evaluated whether there is a difference between the numbers of clusters generated for ambiguous versus unambiguous queries; the second evaluated whether query clustering on co-selected data can create unambiguous clusters. The following stages of the methodology were used to achieve the research goal.

3.1 Method Outline

Identifying ambiguous and unambiguous terms (3.2)
Co-selected data preprocessing (3.3)
Data selection for the word sense discrimination experiment (3.4)
Data selection for the query clustering experiment (3.5)
Expected outcomes (3.6)

3.2 Identifying ambiguous and unambiguous terms

There were two experiments in this study: comparing the clusters generated from co-selected data for ambiguous and unambiguous query terms, and query clustering to create unambiguous clusters. A known-ambiguous dataset and a known-unambiguous dataset were therefore required for the experiments. According to [17], Wikipedia can be used as a source to identify ambiguous and unambiguous terms, because it allows article titles to be separated into ambiguous and unambiguous titles. For this reason, Wikipedia was used to generate the known ambiguous and unambiguous terms.
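The title separation can be sketched as a simple substring filter; the "disambig" filter applied here is discussed later in this section, and the helper name, base-term stripping, and example titles are illustrative rather than the actual pipeline.

```python
def split_titles(all_titles):
    """Split Wikipedia article titles into known-ambiguous and unambiguous.

    Any title containing 'disambig' (which also catches the misspelled
    indicator variants in Table 3-1) is treated as an ambiguous title; the
    rest are treated as unambiguous. The base term is recovered by stripping
    the parenthesised indicator, an illustrative simplification.
    """
    ambiguous, unambiguous = [], []
    for title in all_titles:
        if "disambig" in title.lower():
            # Strip a trailing indicator such as "_(disambiguation)".
            base = title.split("_(")[0].split("(")[0].rstrip("_")
            ambiguous.append(base)
        else:
            unambiguous.append(title)
    return ambiguous, unambiguous

# Invented sample titles, including one misspelled indicator from Table 3-1.
titles = ["Jaguar_(disambiguation)", "Mercury_(disambigutaion)", "Python_programming"]
amb, unamb = split_titles(titles)
```

On the real dump, the two resulting lists would correspond to the "ALL-AMBIGUOUS" and "ALL-UNAMBIGUOUS" tables described below.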
Table 3-1 Typing errors of the ambiguity indicator found during data preprocessing

1. _(Disambiguation)
2. (disambiguation)
3. (Disambiguation)
4. (disambiguation
5. _(disambig)
6. _disambiguation
7. _(disambiguation_page)
8. (disambiguation_page)
9. (disambiguation_page
10. _(disambigation)
11. _(disambigaiton)
12. _(disambigutaion)
13. _(disambiguatuion)
14. _(disambiguaton)
15. _(disambiguatiuon)
16. _(disambigauation)

As mentioned above, all article titles from Wikipedia, downloaded from http://download.wikimedia.org/enwiki/20100622/ (the file described as 'List of page titles', all-titles-in-ns0.gz, approximately 41 megabytes), were used to generate groups of ambiguous and unambiguous titles. According to the Wikipedia convention, all titles containing "_(disambiguation)" are ambiguous titles, so ambiguous titles could be extracted by retrieving them from the all-titles text file. However, because Wikipedia is contributed to by many people, the downloaded file also contained some misspelled variants of the ambiguity indicator. For example, some titles contained "_(disambiguatoin)" instead of "_(disambiguation)" (see Table 3-1 for the variants found during data preprocessing). We noticed that the errors occurred in the last part of the word "disambiguation" rather than in the prefix "disambig". For this reason, to extract the known ambiguous titles properly, "disambig" was used as the filter to select ambiguous titles from the full set of Wikipedia titles. The titles remaining after the extraction were considered known unambiguous titles. As a result, lists of ambiguous and unambiguous titles were created and then imported into the database as the "ALL-AMBIGUOUS" and "ALL-UNAMBIGUOUS" tables.

3.3 Data Preprocessing

Click-through data needed to be preprocessed before experiments based on the co-selection method could be performed.
Such data preprocessing steps are described below:

1. Queries were normalized to lower case so that each query accumulated a greater number of clicks, because too few clicks for a query would be inadequate for good experiments.
2. Session identification was required to represent a person submitting a query, which is essential for the co-selection method; that is, a unique session represents one submission of a query. "SessionIDs" were therefore generated by sorting on query, time, and URL respectively. The same query occurring in different periods of time (more than 30 minutes apart) was assigned different SessionIDs.
3. SessionIDs having fewer than 2 records (2 clicks) were filtered out, because 1 record (1 click) per session cannot constitute co-selected data.
4. Any co-selected data generated by only one SessionID was considered unusual co-selected data and was also filtered out.

After performing these steps, we had clickthrough data in which every SessionID had more than 1 click, called "CT_SSG1". This means that we now had the "ALL-AMBIGUOUS", "ALL-UNAMBIGUOUS", and "CT_SSG1" tables.

3.4 Data Selection for Word Sense Discrimination Experiment

For word sense discrimination using the co-selection method, we aimed to evaluate whether there was a difference between the numbers of clusters generated from unambiguous and ambiguous query terms. Given how the co-selection method works, the number of clusters generated from ambiguous query terms should be greater than from unambiguous terms, as the method would be expected to partition clicks for the same search term into distinct clusters in which the different meanings of the search term are manifested. This section therefore describes the steps used to select 20 truly ambiguous and 20 truly unambiguous query terms for the word sense discrimination experiment.
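The preprocessing of Section 3.3 can be sketched as below. This is a sketch under our own assumptions: the record layout (query, timestamp, URL) and the function name are illustrative, not the thesis's actual implementation, and step 4 (the single-SessionID co-selection filter) is omitted for brevity.

```python
from datetime import datetime, timedelta

def preprocess(clicks, gap=timedelta(minutes=30)):
    """Sketch of preprocessing steps 1-3; clicks are (query, time, url)."""
    # Step 1: lower-case queries so clicks for 'Gmail' and 'gmail' pool together.
    clicks = [(q.lower(), t, u) for q, t, u in clicks]
    # Step 2: sort by query then time; open a new SessionID whenever the
    # query changes or more than 30 minutes elapse since the previous click.
    clicks.sort(key=lambda c: (c[0], c[1]))
    sessions, sid = {}, -1
    prev_q = prev_t = None
    for q, t, u in clicks:
        if q != prev_q or (t - prev_t) > gap:
            sid += 1
        sessions.setdefault(sid, []).append((q, t, u))
        prev_q, prev_t = q, t
    # Step 3: keep only sessions with at least 2 clicks (co-selection needs 2+).
    return {s: recs for s, recs in sessions.items() if len(recs) >= 2}

base = datetime(2006, 5, 1, 12, 0)
kept = preprocess([
    ("Gmail", base, "inbox"),
    ("gmail", base + timedelta(minutes=5), "help"),
    ("gmail", base + timedelta(hours=2), "signup"),  # new session, 1 click: dropped
])
print(len(kept))  # 1
```

The 30-minute window is the only threshold the text specifies; everything else about the grouping is an implementation choice.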
Figure 3-1 Stages of selecting truly ambiguous and unambiguous queries (joining CT_SSG1 with the ALL-AMBIGUOUS and ALL-UNAMBIGUOUS tables, comparing click counts, and selecting 20 truly ambiguous and 20 truly unambiguous queries)

The "CT_SSG1" table was joined to the "ALL-AMBIGUOUS" and "ALL-UNAMBIGUOUS" tables in order to create a table named "AMBIGALL_SELECTED", containing only ambiguous queries, and another named "UNAMBIGALL_SELECTED", containing only unambiguous queries. These 2 tables were then sorted from the highest to the lowest click count of query terms. However, the imbalance in click counts between ambiguous and unambiguous query terms could bias the query selection for the experiment. That is, once we had the tables listing queries with their total clicks, the highest total click count in the unambiguous table ("UNAMBIGALL_SELECTED") was significantly higher than the highest total click count in the ambiguous one ("AMBIGALL_SELECTED"). This means that if we simply chose the top 20 queries from each table, the selection would be biased. For this reason, we decided to choose 20 ambiguous query terms first, and then use those terms to select unambiguous query terms with similar click counts (explained in more detail below). In addition, both the ambiguous and unambiguous queries needed to be verified as truly ambiguous or unambiguous. In the case of truly ambiguous queries, we checked them manually as follows:

- whether any one sense of the ambiguous term was dominant, using the Wikipedia website as it is the original source;
- whether each sense of the ambiguous term accounted for at least 20 percent of the top 10 search results at Google, Bing, and Yahoo.
If we had failed to do this, it is possible that there would have been no co-selection data for the minority senses of the term, and thus the distinct clusters would not have been manifested. We then chose the top 20 verified ambiguous queries. In the case of truly unambiguous queries, we only checked manually against search results (they did not need to be checked against Wikipedia, since they were already unambiguous according to Wikipedia). The selection criteria were as follows:

- The unambiguous term must not occur in the ambiguous table (it is possible for a term, such as "google" or "aol", to occur in both the ambiguous and unambiguous tables).
- The unambiguous term must have only one sense in the first page of search results; this indicated that the search engines had not identified sufficiently representative ambiguity in the term.

Then, we chose 20 unambiguous queries by comparing them one by one with the total clicks of the 20 selected ambiguous queries (within 10% more or less). For example, if the 1st selected ambiguous query had 1,000 clicks, the 1st selected unambiguous query had to have between 900 and 1,100 total clicks. According to these criteria, 20 truly ambiguous and 20 truly unambiguous queries were selected into tables named "AMBIGALL_SELECTED_20" and "UNAMBIGALL_SELECTED_20" respectively.

"SenseIDs" for query terms were also required, to indicate how many clusters of co-selected data occurred for a query term. This is essential because it was used to compare the numbers of clusters between ambiguous and unambiguous query terms. The "query graph" was chosen to achieve this task, as used in [23]. The principle of the query graph method is to cluster URLs carrying the same sense of an ambiguous query into the same group. A bipartite graph was used in the query graph to visualize the relevance between queries and the documents clicked by users.
A node represented a clicked document, with a count showing how many times users selected that document; an edge, also weighted by a count, was generated between nodes when a user selected multiple results (see Figure 3-2). Different groups of nodes and edges represented distinct clusters, which in turn represented potentially distinct meanings. For this reason, the query graph was used to generate the identification of sense-distinct clusters (SenseID) for each record in the experiment.

Figure 3-2 Clusters generated by the query graph (weighted document nodes connected by co-selection edges)

The following methods were used to work out the results:

- A paired t-test was used to evaluate whether the numbers of clusters generated from truly ambiguous and truly unambiguous query terms were statistically different (paired t-test from http://faculty.vassar.edu/lowry/VassarStats.htm).
- The mean number of clusters generated from the 20 truly ambiguous and the 20 truly unambiguous terms was calculated to compare the average number of clusters.
- The standard deviation of the numbers of clusters generated from both groups was also calculated to support the outcome.

3.5 Data Selection for Query Clustering Experiment

In the second, query clustering experiment, we needed to evaluate whether query clustering on co-selected data could help to create unambiguous clusters. To achieve this evaluation, we only needed query terms that have more than 1 sense, so the list of ambiguous query terms alone was used for this experiment. We decided to compare query clustering on normal clickthrough data (single clicks) – called Method0 in this study – with query clustering on co-selected data – called Method1. We then used human judgement to evaluate the semantic relationships of query pairs randomly generated by both methods. After that, we used Fleiss' kappa and standard statistics to work out the results.
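As a sketch of the query-graph clustering of Section 3.4, the sense-distinct clusters for one query term can be computed as connected components over its co-selected URLs: URLs clicked in the same session are linked, and each component is one cluster. The union-find implementation below is our own choice, since the thesis specifies only the graph model.

```python
def sense_clusters(sessions):
    """sessions: {session_id: [clicked URLs]} for a single query term.
    Returns the sense-distinct clusters as a list of URL sets."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Co-selection within a session links all of that session's URLs.
    for urls in sessions.values():
        for other in urls[1:]:
            union(urls[0], other)
    groups = {}
    for url in parent:
        groups.setdefault(find(url), set()).add(url)
    return list(groups.values())

# Sessions 1 and 2 share URL B, so their URLs merge into one cluster;
# session 3 forms a second, sense-distinct cluster.
clusters = sense_clusters({1: ["A", "B"], 2: ["B", "C"], 3: ["E", "F"]})
print(len(clusters))  # 2
```

The number of components returned for a query term is exactly the cluster count compared between the ambiguous and unambiguous groups in Chapter 4.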
Since a complete explanation of the methodology for this experiment is too long for this section, it is provided in Appendix A. The following are therefore only the brief steps of the methodology for this experiment:

- ConnectedIDs were generated for queries in both method0 and method1 by the simple query clustering of [3], which does not require setting any parameter. The input data differed between the methods: for method0 the input data was (query, URL), while for method1 it was ((query, SessionID), URL). For this reason, method0 would have only 1 cluster per query, while method1 would be expected to have multiple clusters per query.
- 10 truly ambiguous queries were selected, and queries having the same ConnectedID as these 10 queries were extracted for generating query pairs. To select the 10 truly ambiguous queries, all ambiguous queries were sorted on 2 additional fields: number of clusters (method1) and size of clusters (method0), respectively.
- Unique query pairs for both method0 and method1 were randomly generated for the evaluation.
- Human judgement was used to evaluate the semantic relationship within each query pair.
- Fleiss' free-marginal kappa was used to work out the level of agreement between participants, and basic statistics were used to compare the performance of method0 and method1.

3.6 Expected Outcomes

For the first experiment (word sense discrimination), we expected to see a distinction between the clusters generated from co-selected data for unambiguous queries and for ambiguous queries. The number of clusters generated for unambiguous queries was expected to be fewer than for ambiguous queries. If the outcome turned out this way, it would indicate that the co-selection method on web search can help to discriminate word senses.
For the second experiment (query clustering), we expected to see that query clustering on co-selected data could help to create unambiguous clusters. This means that when comparing query clustering on co-selected data (method1) with query clustering on single clicks (method0), the number of semantically similar pairs for method1 would be significantly greater than for method0, and the level of agreement between raters would not be merely by chance. If the outcome turned out this way, it would indicate that query clustering on co-selected data performs better than the basic clustering algorithm on ambiguous queries, i.e. that method1 can distinguish between senses of an ambiguous term, while method0 cannot.

Chapter 4 Results

This chapter presents the results of both experiments; discussion of the results follows in the next chapter. Since there were 2 experiments in this study, this chapter is divided into 2 parts: the first covers the results of word sense discrimination by the co-selection method, and the other covers query clustering on the ambiguous dataset.

4.1 Results of Word Sense Discrimination by Co-selection Method

The numbers of clusters for the 20 unambiguous queries and 20 ambiguous queries were generated to compare whether there was a difference between these 2 groups (see Table 4-1). The numbers of clusters were then submitted to a paired t-test to validate whether the groups were statistically different from each other, and basic statistics were also used to support the outcome.
Figure 4-1 Numbers of clusters generated for each of the 20 ambiguous and 20 unambiguous queries

#   Unambiguous query term   Clusters   Ambiguous query term   Clusters
1   gmail                    1          pogo                   2
2   clip art                 2          ups                    2
3   american airlines        4          amazon                 5
4   youtube                  1          aim                    1
5   wedding cakes            4          juno                   2
6   tori spelling            1          chase                  2
7   google earth             3          monster                5
8   delta airlines           2          southwest              1
9   howard stern             8          delta                  6
10  google maps              1          people                 7
11  Cymbalta                 1          aaa                    1
12  Ipod                     8          gap                    2
13  Itunes                   5          whirlpool              6
14  Screensavers             5          time                   1
15  Swimsuits                3          hallmark               1
16  birthday cards           5          continental            2
17  jennifer aniston         8          Fox                    6
18  paintball guns           1          nwa                    3
19  Fonts                    5          e3                     6
20  wedding vows             1          mls                    2

Table 4-1 Numbers of clusters generated for ambiguous and unambiguous query terms

Statistic                    Unambiguous group   Ambiguous group
Mean                         3.45                3.15
Standard deviation           2.50                2.13
Paired t-test (two-tailed)   p-value: 0.68

Table 4-2 Basic statistics for the WSD experiment

Based on these data, there was no distinction between unambiguous and ambiguous queries. Firstly, the means and standard deviations of the 2 groups were not significantly different. Secondly, the p-value from the paired t-test (calculated at http://faculty.vassar.edu/lowry/VassarStats.htm) was 0.68, which is greater than 0.05, meaning that the numbers of clusters generated for the 2 groups were not statistically different.

4.2 Results of Query Clustering on the Ambiguous Dataset

In the query clustering experiment, 11 participants completed the evaluation. Their ratings were used to indicate whether query clustering on co-selected data (method1) can cluster ambiguous queries better than query clustering on single clicks (method0).
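The paired t-test of Table 4-2 can be checked directly from the cluster counts in Table 4-1. The plain-Python computation below is our own check; the thesis used the VassarStats online calculator, so small rounding differences in the reported p-value are possible.

```python
from math import sqrt

# Cluster counts from Table 4-1 (unambiguous and ambiguous groups).
unambig = [1, 2, 4, 1, 4, 1, 3, 2, 8, 1, 1, 8, 5, 5, 3, 5, 8, 1, 5, 1]
ambig   = [2, 2, 5, 1, 2, 2, 5, 1, 6, 7, 1, 2, 6, 1, 1, 2, 6, 3, 6, 2]

n = len(unambig)
diffs = [u - a for u, a in zip(unambig, ambig)]
mean_d = sum(diffs) / n
sd_d = sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))
t_stat = mean_d / (sd_d / sqrt(n))

print(sum(unambig) / n, sum(ambig) / n)  # group means: 3.45 3.15
# |t| is far below the two-tailed critical value of 2.093 for df = 19,
# consistent with the reported non-significant result (p = 0.68 > 0.05).
print(round(t_stat, 2))
```

The recomputed group means match Table 4-2 exactly, which confirms the table was transcribed consistently with the per-query counts.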
The agreement of ratings between participants was measured by Fleiss' free-marginal kappa [16], calculated at http://justusrandolph.net/kappa, which was produced by the author of [16]; basic statistics were also used to compare the performance of method1 and method0.

Type of result from the online kappa calculator   Value
Percent of overall agreement                      0.71
Fixed-marginal kappa                              0.39
Free-marginal kappa                               0.43

Table 4-3 Level of agreement between participants

As mentioned above, Fleiss' free-marginal kappa was used in this study to indicate the level of agreement between raters, because the raters could rate freely: they were not limited in how many items they could assign to each category. The result is relatively positive, 0.43, which indicates moderate inter-rater agreement (positive agreement starts at a kappa of 0), according to [27]. The following table shows the proportion of query pairs rated semantically similar by the participants for both methods.

Participant          Method0    Method1
1                    0.450000   0.675000
2                    0.508333   0.758333
3                    0.458333   0.675000
4                    0.616667   0.900000
5                    0.358333   0.791667
6                    0.458333   0.733333
7                    0.383333   0.633333
8                    0.533333   0.925000
9                    0.316667   0.558333
10                   0.533333   0.833333
11                   0.641667   0.875000
Standard deviation   0.101944   0.117604
Overall              0.478030   0.759848

Table 4-4 Proportion of semantically similar pairs as rated by the 11 participants

Figure 4-2 Overall proportion of semantically similar queries for method0
Figure 4-3 Overall proportion of semantically similar queries for method1
Figure 4-4 Comparison of individual proportions of semantically similar queries

Based on our data, method1 clusters ambiguous queries better than method0, because the overall proportion for method1 is significantly higher than for method0, while the standard deviations differ only slightly. As mentioned at the start of this chapter, further discussion of the results is provided in the next chapter.
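The free-marginal kappa of Table 4-3 follows a simple closed form, which the sketch below reimplements (the function name is ours; the thesis used the online calculator). With two rating categories, chance agreement is 1/2, so an overall agreement of about 0.71 yields a kappa near the reported 0.43, the exact value depending on how the overall agreement was rounded.

```python
def free_marginal_kappa(p_observed, n_categories):
    """Randolph's free-marginal kappa: chance agreement is 1/k when raters
    may freely assign any number of items to each of the k categories."""
    p_chance = 1 / n_categories
    return (p_observed - p_chance) / (1 - p_chance)

# With two categories (related / not related) and the study's overall
# agreement of about 0.71, kappa lands near the reported 0.43; the small
# gap comes from rounding the overall agreement before publication.
print(round(free_marginal_kappa(0.71, 2), 2))
```

The fixed-marginal variant (0.39 in Table 4-3) instead estimates chance agreement from the raters' observed category proportions, which is why the calculator reports both values.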
Chapter 5 Discussion

As presented in the previous chapter, the outcome of the word sense discrimination experiment is opposed to our expectation, whereas the outcome of the query clustering experiment is in the direction we expected. On analysis, there are several potential factors involved in these outcomes, and this chapter discusses them for both experiments.

5.1 Discussion of the word sense discrimination experiment

After the outcomes of this experiment emerged, we considered why they behaved in this unexpected way. Looking into the data and the selected dataset, there are several potential factors affecting the ability of the co-selection method, especially in web document search: changes in the ranking positions of search results, the potential difference between our dataset and present-day data, the scope of ambiguous query terms, and noise filtering. These factors are discussed in this section.

Firstly, changes in search result rankings during the period over which the clickthrough data was collected (approximately a month) could be one cause of the many clusters generated for unambiguous query terms. That is, people could co-select different search results when ranking positions changed. For example, at the start of data collection the co-selected results for a query term might be in the top 5 of the first page, but their rank could change at any time after that, depending on factors such as contemporary trends and competition in Search Engine Optimization (SEO). Other research by colleagues in the Security Lab is currently finding that some classes of queries are highly volatile, and that the overlap between search results for the same search term from one five-day period to another can drop significantly. For this reason, different users could select different results even though they had the same information need.
Hence, ranking changes could be one factor that made the numbers of clusters similar between the two groups. Secondly, the entire dataset used in this study is from several years ago. This means that although we carefully selected ambiguous and unambiguous queries, it was still difficult to judge whether such queries were truly ambiguous or unambiguous, given the change in trends between then and the present. As a result, the selected queries might not have been entirely appropriate. In short, comparing the senses of queries from a dataset several years old against present-day search results may be another factor leading to the unexpected outcome. Thirdly, the scope of an ambiguous term is vague: it is unclear what the user's information need is. For example, "ipod" is unambiguous at the meaning level, but users could use "ipod" for different information needs, such as looking for ipod news, ipod reviews, or different versions of the ipod; such a term would be considered 'weakly ambiguous' [22]. As a result, users with different purposes for the "ipod" term would click on different search results, and that could be one reason why a number of clusters were generated for some unambiguous terms. In short, it was possible for users to employ a term that appeared unambiguous yet express different information needs, due to the weak ambiguity of such terms. Finally, the fundamental noise filtering for co-selected data might also be a factor. The filtering only treats a set of co-selected data as unusual when there is exactly one person whose co-selected results differ from everyone else's. This means that if 2 or 3 people accidentally co-selected similar results unrelated to their information needs, a new cluster would still be generated by that accidental co-selection. This could therefore be a factor generating additional clusters.
However, this factor would generate extra clusters for both unambiguous and ambiguous queries alike, so we are relatively sure that it was not a significant contributor to the unexpected outcome.

To summarize, changes in search results, an out-of-date dataset, the scope of ambiguous terms, and the noise filtering of co-selected data are the potential key factors in the unexpected outcome of this experiment. Although they can explain why there was no difference between the clusters generated from ambiguous and unambiguous queries, it is difficult to control all of these factors when attempting to discriminate word senses from clickthrough data on web search. For this reason, it is rational to focus future work on using the co-selection method to discriminate word senses in image search rather than in web document search.

5.2 Discussion of the query clustering experiment

The outcome of this experiment matched our expectation: query clustering on co-selected data clusters ambiguous query terms better than the single-click query clustering from [3]. On analysing the outcome, there are a few areas worth discussing. Firstly, based on the basic statistics, method1 outperformed method0. As mentioned in the previous chapter, the overall proportion of semantically similar queries for method1, 0.76, was significantly higher than the overall proportion for method0, 0.47, and the standard deviations of the two methods were not significantly different. Looking further at the proportions at the rater level, it is also noticeable that every participant rated method1 higher than method0.
This could be because method1, being based on co-selected data, presumably separated queries into clusters of potentially different senses, so that pairs selected from within a cluster were more likely to carry only one sense of the search term; method0, by contrast, did not distinguish potentially different senses before the query pairs were created, so there was a greater chance of selecting pairs from a single cluster that mixed distinct senses. Furthermore, the selected ambiguous queries were sorted from the greatest number of clusters to the least as the priority; in other words, the queries containing the most potentially different senses were used to compare the performance of the two methods. Although this selection criterion could be seen as a significant advantage for method1, it was suitable for answering whether method0 or method1 performs query clustering well on ambiguous queries, so it was rational to use this approach to select ambiguous queries for generating pairs. In short, on an ambiguous dataset, method1 clusters ambiguous queries better than method0.

Secondly, the level of agreement between participants was relatively good, but there were only 11 participants. As shown in the results chapter, the level of agreement indicated that the agreement was not purely by chance, because the Fleiss free-marginal kappa was 0.43 (positive agreement starts at 0). However, the participants did not agree completely. This might be because some query pairs were difficult to judge when their meaning was unclear, or because the participants genuinely disagreed about the relationship of some pairs. Additionally, with only 11 participants performing the evaluation, it was difficult to analyse potential factors behind the level of agreement.
To summarize, the results indicate that method1 is better than method0 at both the overall proportion level and the individual proportion level. Although the steps for selecting ambiguous queries could advantage method1, they were suitable for the purpose of the experiment, which assessed the performance of query clustering on ambiguous queries. The level of agreement between participants was relatively high despite the small number of participants, and the overall results show that co-selection is a promising avenue for sense-sensitive clustering.

5.3 Scope and Limitations

Several areas were not covered in this study. The following were out of scope:

- This study did not identify the meanings of the distinct clusters generated from co-selected data, because the meanings were not required for word sense discrimination – the system can give examples from each cluster for users to decide which sense they want (Schütze, 1998) – nor were they needed for constructing sets of synonyms.
- This study did not include an experiment on image search; it focused only on web document search.
- Since no contemporary dataset has been published, we instead used a real-world dataset from several years ago.
- The query clustering from [4] was not used to generate ConnectedIDs because it requires a parameter, which is an artificial factor that depends on the kind of task. Although it could reduce noise more effectively than [3], we were not sure whether noise filtering while generating ConnectedIDs would affect the performance of the co-selection method.
- We used only basic statistics to work out the results of the experiments. More advanced statistics might provide more reliable outcomes, but given the time limit, this was also out of scope.
5.4 Future work

For word sense discrimination, we suggest working on image search instead of web document search, because image search is a more reliable source for analysing how users click on results. Although the number of people participating in image search is significantly smaller than in web search, it can still support the task of automatically discriminating word senses. For query clustering on co-selected data, a contemporary dataset is required for future experiments. This is because present-day search results appear to be becoming more diversified, which can affect the performance of the co-selection method, resulting in either higher or lower performance. Another potential piece of future work for query clustering on the co-selection method is a comparison between method1 and the query clustering from [4], which was not performed in this study. The performance of [4] is higher than that of [3] because its query clustering reduces noise effectively; however, it is unclear whether it performs better than query clustering on the co-selection method (method1). Additionally, if the performance of both methods proved similar, our method would be the more attractive because it does not require setting an artificial parameter.

Chapter 6 Conclusion

Co-selection is a relatively new method for clustering search terms semantically by exploiting users' judgement as the similarity function. There is potential to use this similarity function to discriminate word senses and to construct sets of synonyms by query clustering on co-selected data. Previous studies of the co-selection method failed to conclusively demonstrate its performance because of unsuitable datasets; this study therefore selected its datasets carefully before performing new experiments.
The literature survey was carried out to develop background knowledge and to establish the current state of research on the co-selection method. It showed that previous studies have not used the co-selection method on web search to determine word senses, and that previous studies of query clustering have not evaluated the ability to create unambiguous clusters, except for [21], which attempted such an evaluation but with an inappropriate dataset: randomly extracting data from the entire dataset yielded an essentially unambiguous dataset.

There were 2 objectives for this study:

- to evaluate whether there was a difference between the numbers of clusters generated from co-selected data for unambiguous queries and for ambiguous queries;
- to evaluate whether query clustering on co-selected data could help to create unambiguous clusters.

The first objective was to find out whether the number of clusters generated from co-selected data for ambiguous queries was statistically higher than for unambiguous queries. To answer this question, the experiment needed to identify both ambiguous and unambiguous query terms and generate clusters for the evaluation. The second objective was to find out whether query clustering on co-selected data could cluster ambiguous queries better than basic query clustering [3] on an ambiguous dataset; for this experiment, only ambiguous query terms were required. To achieve these objectives, the methodology comprised the following stages:

- Identifying ambiguous and unambiguous query terms using Wikipedia article titles as the source.
- Preprocessing the data for the co-selection experiments: queries were normalized so that each accumulated more clicks, SessionIDs were generated to represent a unique user clicking on search results, and sessions with fewer than 2 clicks were filtered out.
- For word sense discrimination, 20 truly ambiguous and 20 truly unambiguous queries were identified. SenseIDs were then generated to indicate how many clusters each query had. After that, the numbers of clusters generated for ambiguous and unambiguous queries were compared to determine whether there was a difference between the 2 groups, using a paired t-test and basic statistics.
- For query clustering, ConnectedIDs for method0 (input data: query and URL) and for method1 (input data: (query, SessionID) and URL) were generated by the simple query clustering of [3], because it does not require any parameter. 10 truly ambiguous queries were identified and used to extract queries for both methods. Query pairs for both method0 and method1 were then randomly generated for the evaluation, and human judgement was used to evaluate the relationship within each pair. To calculate the results, Fleiss' free-marginal kappa was used to find the level of agreement between participants, and basic statistics were used to work out the proportion of semantically similar queries for the 2 methods.

Unfortunately, the outcome of the word sense discrimination experiment indicates that there is no difference between the numbers of clusters generated from the selected ambiguous and unambiguous queries. This could be because changes in search results, a dataset several years old, the scope of ambiguous terms, and the noise filtering of co-selected data affected the ability of the co-selection method to discriminate word senses in this study. On the other hand, the outcome of the query clustering experiment matched our expectation: method1 performed query clustering on the ambiguous dataset better than method0 in both the overall and the individual proportions of semantically similar pairs, and the level of agreement between participants was relatively high.
Therefore, based on these data, query clustering on the co-selection method is able to create unambiguous clusters. Finally, for word sense discrimination by the co-selection method, future work should focus on image search instead of web search. For query clustering by the co-selection method, future work should use a contemporary dataset, or could compare query clustering on the co-selection method with single-click query clustering that performs noise filtering well, such as [4].

Appendix A – Complete Explanation of the Query Clustering Methodology

Since the methodology for the query clustering experiment is too long to explain in detail in the methodology chapter, it is explained here. The following are the complete stages of the methodology for the query clustering experiment.

A.1 – Section outline

Generating identification of connected components (A.2)
The selection of 10 truly ambiguous query terms (A.3)
The extraction of related queries (A.4)
Generating query pairs (A.5)
Word sense evaluation (A.6)
The approaches to working out the results (A.7)

A.2 – Generating identification of connected components

Firstly, an identification of connected components (ConnectedID) was required to indicate which queries cluster together. The fundamental query clustering of [3] was chosen to generate ConnectedIDs because it does not require setting a parameter. Although the query clustering from [4] can perform better noise filtering than [3], it requires setting a threshold for that filtering, and the threshold is an artificial factor: it is difficult to know which threshold best suits our data. Furthermore, it was unclear whether noise filtering during the generation of ConnectedIDs could affect the performance of the co-selection method.
For example, it might filter out queries that are not useful for query clustering on single clicks but are useful for query clustering on co-selected data. Thus the query clustering from [3] is rational for this experiment, because it requires no parameter and would not affect the ConnectedIDs assigned for query clustering on the co-selection method. In this study, there were 2 methods of generating unique ConnectedIDs, so that single-click data and co-selected data could be compared: method0 for single-click data and method1 for co-selected data. The table used for generating ConnectedIDs was CT_SSG1, introduced in Section 3.3 (data preprocessing). The difference between method0 and method1 was the input data used to generate ConnectedIDs: for method0 the input pair was (Query, URL), while for method1 it was ((Query, SessionID), URL). For this reason, the ConnectedIDs generated for method0 and method1 differed; a further difference is that method0 had only one cluster per query, while method1 could have more than 1 cluster, depending on the co-selected data. After the ConnectedIDs were assigned, the method0 table was named "METHOD0" and the method1 table "METHOD1". In short, ConnectedIDs for method0 and method1 were generated from different input data in order to compare query clustering on single-click data with query clustering on co-selected data.

A.3 – The selection of 10 truly ambiguous query terms

After the ConnectedIDs were generated, a list of ambiguous terms needed to be created for use in selecting truly ambiguous query terms for the experiment. The list was generated by mapping both tables (METHOD0 and METHOD1) to the "ALL-AMBIGUOUS" table generated in Section 3.2.
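The two ConnectedID assignments of Section A.2 can be sketched as the same connected-components computation run over two differently keyed bipartite graphs. Union-find and the record layout (query, SessionID, URL) are our own implementation assumptions; the thesis specifies only the two input-pair choices.

```python
def connected_ids(records, per_session):
    """records: (query, session_id, url) rows from CT_SSG1.
    per_session=False builds method0's graph over (query, URL) pairs;
    per_session=True builds method1's graph over ((query, SessionID), URL)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for query, session, url in records:
        node = (query, session) if per_session else (query,)
        parent[find(("q", node))] = find(("u", url))  # link query node to URL node
    ids = {}
    for query, session, url in records:
        node = (query, session) if per_session else (query,)
        ids[node] = find(("q", node))  # component root serves as the ConnectedID
    return ids

records = [("jaguar", 1, "cars.com"),
           ("jaguar", 2, "zoo.org"),
           ("car",    3, "cars.com")]
m0 = connected_ids(records, per_session=False)
m1 = connected_ids(records, per_session=True)
```

Under method0, "jaguar" and "car" fall into one component through the shared URL, giving one cluster per query; under method1, the two jaguar sessions stay in separate components, so the query's distinct senses surface as distinct ConnectedIDs.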
In fact, the queries occurring in method0 and method1 were the same (only the assigned ConnectedIDs differed), so it was not strictly necessary to map both tables to the "ALL-AMBIGUOUS" table. However, mapping both was a simple task, and we also wanted to confirm that the number of ambiguous queries from method0 and method1 was the same. We therefore first mapped "METHOD0" to "ALL-AMBIGUOUS" and recorded the number of ambiguous queries after mapping, which was 1,090. We then mapped both "METHOD0" and "METHOD1" to "ALL-AMBIGUOUS" and checked the count again. As expected, nothing was wrong: the number of ambiguous queries was unchanged at 1,090. We then created a new table, "AMBIGUOUS_COMMON", from the mapped tables just mentioned. However, for this experiment we needed to select only 10 truly ambiguous queries, because of two concerns: the time participants would spend on the word sense evaluation, and generating adequate query pairs from each ambiguous query. The time concern arose because if we selected too many query pairs for evaluation, completing the evaluation would take participants too long, and their performance in evaluating the pairs would also decline the longer the evaluation took. On the other hand, the concern about adequate query pairs arose because choosing too few truly ambiguous queries to generate pairs might make the results difficult to interpret properly. For this reason, we decided to choose 10 truly ambiguous queries and to randomly select 24 query pairs (12 for method0 and 12 for method1) for each of them, giving a total of 240 query pairs. Assuming participants would spend 10 seconds per pair, completing 240 pairs would take 40 minutes.
This might still be a little too long for participants to complete in one sitting, but, as mentioned above, we were also concerned about having too few pairs for interpreting results. Therefore, we allowed participants to pause the evaluation whenever they wanted and return later as existing participants. Ten truly ambiguous queries were thus a suitable compromise between inadequate query pairs and excessive evaluation time. To find 10 rational ambiguous queries for this experiment, we added two additional fields, the size of the cluster from method0 and the number of clusters from method1, to the "AMBIGUOUS_COMMON" table. These two fields were used to sort the ambiguous queries. Although there were 1,090 common ambiguous queries available, as mentioned earlier in this section, some of them might not be rational choices, because the cluster size (from method0) was too small and/or the number of clusters (from method1) was only one, which is the same as method0 and therefore offers no comparison. For this reason, we first sorted the queries by number of clusters, since the difference between the methods becomes more pronounced as method1's number of clusters grows. We then sorted by method0's cluster size to ensure that each ambiguous query could generate 12 pairs (the cluster had to contain at least 6 unique queries); moreover, the larger the cluster, the more potential ambiguous queries would occur. After sorting by number of clusters and then cluster size, the candidates were evaluated by checking whether each sense of the ambiguous term accounted for at least 20 percent of the search results on the first page of Google, Bing, and Yahoo.
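The two-key sort and the minimum-size constraint just described can be sketched as follows; the tuple layout (query, number of method1 clusters, method0 cluster size) and the toy values are assumptions for illustration:

```python
# Each candidate: (query, n_clusters_method1, cluster_size_method0)
candidates = [('jaguar', 2, 15), ('apple', 4, 40),
              ('bass', 4, 12), ('mouse', 1, 80)]

# Sort by number of method1 clusters, then by method0 cluster size,
# both descending, so the most comparable queries rise to the top.
ranked = sorted(candidates, key=lambda c: (c[1], c[2]), reverse=True)

# Keep only queries whose method0 cluster can yield 12 unique pairs
# (C(n, 2) >= 12 requires n >= 6) and that split into more than one
# method1 cluster, so the two methods can actually be compared.
shortlist = [c for c in ranked if c[2] >= 6 and c[1] > 1]
```

The single-cluster query ("mouse") drops out of the shortlist even though its cluster is large, since it offers no contrast between the two methods.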
When one of the top-sorted queries was not clearly truly ambiguous, we skipped it and moved on to the next, iterating until 10 truly ambiguous queries were selected. The 10 queries selected this way were rational enough for the experiment. In short, sorting by number of clusters (method1) and cluster size (method0), and then evaluating true ambiguity from search results, were the criteria for finding 10 rational ambiguous queries.

A.4 – The extraction of related queries
The ConnectedIDs of the 10 selected ambiguous queries were used to extract queries related to them. These ConnectedIDs occurred in the "METHOD0" and "METHOD1" tables, and were used to extract the related queries, i.e. those having the same ConnectedIDs as the 10 selected ambiguous queries, into the "METHOD0_SELECTED" and "METHOD1_SELECTED" tables. For method0 there was only one ConnectedID for each of the 10 selected ambiguous queries, because method0, based on single click data, considers all queries sharing the same URLs to be related. For method1, by contrast, there could be multiple clusters, and therefore more than one ConnectedID, for each selected ambiguous query, because method1, based on co-selected data, incorporates each user's judgment when clustering related queries. For example, if there were two clusters for the query term "apple", there would be two ConnectedIDs, each representing a different set of related queries. Therefore, sets of queries were extracted into the tables "METHOD0_SELECTED" and "METHOD1_SELECTED" based on the ConnectedIDs of the selected ambiguous queries occurring in "METHOD0" and "METHOD1".

A.5 – Generating query pairs
Query pairs for the evaluation then needed to be randomly selected from "METHOD0_SELECTED" and "METHOD1_SELECTED", because we needed only 12 query pairs for each of the 10 selected ambiguous query terms.
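A minimal sketch of this random pair selection, assuming a cluster's member queries are already at hand (the helper name and toy members are mine, not from the thesis); for method1 the same sampling would be applied per randomly chosen cluster:

```python
import itertools
import random

def sample_pairs(members, n_pairs=12, seed=None):
    """Draw n_pairs distinct unordered query pairs from one cluster.
    Requires C(len(members), 2) >= n_pairs."""
    pool = list(itertools.combinations(sorted(members), 2))
    if len(pool) < n_pairs:
        raise ValueError('cluster too small for %d unique pairs' % n_pairs)
    # sample() draws without replacement, so every pair is unique
    return random.Random(seed).sample(pool, n_pairs)

# With the four members A..D there are exactly C(4, 2) = 6 potential
# pairs, so asking for all 6 returns each pair exactly once.
pairs = sample_pairs(['A', 'B', 'C', 'D'], n_pairs=6, seed=0)
```

The minimum cluster size of 6 unique queries noted in A.3 is exactly what guarantees the pool holds at least C(6, 2) = 15 >= 12 candidate pairs.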
This means we needed 120 query pairs for method0 and another 120 for method1, for a total of 240 as discussed earlier. As mentioned above, the difference between the related queries clustered in method0 (one cluster) and method1 (multiple clusters) was the number of clusters per selected ambiguous query. For method0, we randomly selected 12 query pairs as follows:
1. We sorted the related queries by query term.
2. We created a list of the potential unique query pairs for the selected ambiguous query, from which to sample randomly. For example, if a term had four members A, B, C and D, the list of potential unique query pairs would be as in Table A-1. (Our data always had more than 4 members, given the minimum cluster size explained above, but 4 members keep the example easy to follow.)

Index  Potential pair    Index  Potential pair
1      A-B               4      B-C
2      A-C               5      B-D
3      A-D               6      C-D
Table A-1: Example list of potential unique query pairs

3. We randomly generated 12 unique indexes to select 12 unique pairs for the ambiguous term.
4. We repeated steps 1 to 3 with a different selected ambiguous query until all 10 selected ambiguous queries were completed.
From these steps, we finally had 120 unique query pairs for the evaluation of method0. For method1 the procedure differed slightly, because an ambiguous query term could have multiple clusters. The steps for randomly selecting its query pairs were:
1. We randomly selected one of the clusters of a selected ambiguous query.
2. We sorted the related queries by query term.
3. We created a list of the potential unique query pairs for the randomly selected cluster, as in the method0 example (see Table A-1).
4. We randomly generated one unique index to select one candidate pair for the ambiguous term.
5. If the selected pair duplicated a previously selected pair, it was not recorded and the process restarted from step 1. If it did not, it was recorded, and the process repeated from step 1 until 12 unique pairs had been selected for the ambiguous term.
6. We repeated the above to select 12 unique pairs for each of the other selected ambiguous queries.
From these steps, we finally had 120 unique query pairs for the evaluation of method1 as well. This means that all 240 query pairs for the experiment had been randomly selected for the evaluation.

A.6 – Word sense evaluation
Figure A-1: Example evaluation of a query pair
To use human judgment to evaluate the relationships between query pairs, a simple user interface was built as a small web application. Every feature of this application served a purpose, as follows:
1. There was a login page for both new participants and existing participants (for those who preferred not to finish the evaluation in one sitting). This let us verify which participants completed the evaluation.
2. The evaluation page presented randomly chosen query pairs, one at a time, for the participant to judge until no pairs were left; participants therefore did not know whether a pair came from method0 or method1 (see Figure A-1). This page had the following features:
a. There were 8 choices to help participants decide the relationship between a query pair:
i. [query 1] is the same concept as [query 2]
ii. [query 1] is a sibling concept of [query 2]
iii. [query 1] is the opposite of [query 2]
iv. [query 1] is not related to [query 2]
v. [query 1] is an example of [query 2]
vi. [query 1] is more generic than [query 2]
vii. [query 1] is a part of [query 2]
viii. [query 1] contains the part [query 2]
b.
If participants had no idea about the meaning of a query, they could click on it to see whether its search results were related to the other query.
c. A history of evaluated query pairs was provided so participants could revise a result, whether because they changed their mind or because they had accidentally chosen the wrong option.
d. If the session timed out because a participant had not interacted with the page for some time, the application forced the participant to log in again; otherwise, results recorded after the timeout would not carry the participant's name, making the results harder to calculate.
e. Although there were 240 pairs for the evaluation (120 unique pairs from method0 and 120 from method1), some query pairs were duplicated across method0 and method1. This did not affect the result, but we did not want participants to evaluate the same pair twice. Therefore, when a participant evaluated a query pair duplicated across methods, its result was assigned to the other pair as well.
These were all the features of the application for the evaluation task. In short, participants evaluated randomly selected query pairs one by one until all query pairs were completed.

A.7 – The approaches to working out the result
The participants' evaluations of each pair, one of the 8 choices (see Table A-2), were transformed to 1 (related) or 0 (not related).
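A minimal sketch of this transformation and of the free-marginal kappa that the binary judgements feed into (the choice numbering follows Table A-2; the function names and toy ratings are mine, and the thesis itself used the online tool rather than its own implementation):

```python
RELATED_CHOICES = {1, 5, 6, 7, 8}   # choices mapped to 1 in Table A-2

def to_binary(choice):
    """Map one of the 8 relationship choices to 1 (related) / 0 (not)."""
    return 1 if choice in RELATED_CHOICES else 0

def free_marginal_kappa(ratings, k=2):
    """Randolph's free-marginal multirater kappa [16].
    ratings: one inner list per query pair holding every rater's
    category; all pairs must have the same number of raters (>= 2)."""
    n = len(ratings[0])                       # raters per pair
    agree = 0.0
    for item in ratings:
        counts = [item.count(c) for c in range(k)]
        # fraction of agreeing rater pairs for this query pair
        agree += (sum(c * c for c in counts) - n) / (n * (n - 1))
    p_o = agree / len(ratings)                # observed agreement
    p_e = 1.0 / k                             # chance agreement, free marginals
    return (p_o - p_e) / (1 - p_e)

# Three raters, four toy query pairs (raw choices from the 8-option UI)
choices = [[1, 1, 5], [4, 4, 4], [2, 3, 4], [7, 1, 6]]
binary = [[to_binary(c) for c in row] for row in choices]
kappa = free_marginal_kappa(binary)
```

The unanimous toy judgements give kappa = 1.0; disagreement among raters pulls the value toward, and below, zero.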
#  Relationship between a query pair             Transformed value
1  [query 1] is the same concept as [query 2]    1 (Related)
2  [query 1] is a sibling concept of [query 2]   0 (Not related)
3  [query 1] is the opposite of [query 2]        0 (Not related)
4  [query 1] is not related to [query 2]         0 (Not related)
5  [query 1] is an example of [query 2]          1 (Related)
6  [query 1] is more generic than [query 2]      1 (Related)
7  [query 1] is a part of [query 2]              1 (Related)
8  [query 1] contains the part [query 2]         1 (Related)
Table A-2: Transformation of the evaluation choices

These values were then used to calculate the level of agreement between raters and the proportion of semantically similar pairs from method0 and method1. The level of agreement between participants was evaluated because the results would not be reliable if the agreement occurred by chance. Fleiss' free-marginal kappa, rather than fixed-marginal kappa, was used for this task, since free-marginal kappa is suitable when raters are not constrained to assign a fixed number of items to each category [16]. An online tool by [16] at http://justusrandolph.net/kappa/ calculates free-marginal kappa, and this tool was used in this study. In short, Fleiss' free-marginal kappa was used to measure the level of agreement between participants. The proportion of semantically similar pairs was measured to indicate which method performed better. This was calculated with basic statistics at both the overall level and the individual level, so the measurement should indicate whether method1 is better than method0.

References
[1] A. K. Agrahri, D. A. T. Manickam, and J. Riedl, "Can people collaborate to improve the relevance of search results?," in Proceedings of the 2008 ACM Conference on Recommender Systems, Lausanne, Switzerland: ACM, 2008, pp. 283-286.
[2] R. Baeza-Yates, C. Hurtado, M. Mendoza, and G. Dupret, "Modeling user search behavior," in Proceedings of the Third Latin American Web Congress, IEEE Computer Society, 2005, p. 242.
[3] D. Beeferman and A. Berger, "Agglomerative clustering of a search engine query log," in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, Massachusetts, United States: ACM, 2000, pp. 407-416.
[4] W. S. Chan, W. T. Leung, and D. L. Lee, "Clustering search engine query log containing noisy clickthroughs," in Proceedings of the 2004 International Symposium on Applications and the Internet (SAINT'04), 2004, p. 4.
[5] C. L. A. Clarke, E. Agichtein, S. Dumais, and R. W. White, "The influence of caption features on clickthrough patterns in web search," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands: ACM, 2007, pp. 135-142.
[6] B. Croft, R. Cook, and D. Wilder, "Providing government information on the Internet: Experiences with THOMAS," University of Massachusetts, 1995.
[7] M. Ester, H. P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, 1996, pp. 226-231.
[8] J. A. Guthrie, L. Guthrie, Y. Wilks, and H. Aidinejad, "Subject-dependent co-occurrence and word sense disambiguation," in Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, California: Association for Computational Linguistics, 1991, pp. 146-152.
[9] P. J. Hayes, L. E. Knecht, and M. J. Cellio, "A news story categorization system," in Proceedings of the Second Conference on Applied Natural Language Processing, Austin, Texas: Association for Computational Linguistics, 1988.
[10] T. Joachims, "Optimizing search engines using clickthrough data," in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada: ACM, 2002, pp. 133-142.
[11] T. Joachims, L. Granka, B. Pan, H. Hembrooke, and G. Gay, "Accurately interpreting clickthrough data as implicit feedback," in Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil: ACM, 2005, pp. 154-161.
[12] T. Joachims, L. Granka, B. Pan, H. Hembrooke, F. Radlinski, and G. Gay, "Evaluating the accuracy of implicit feedback from clicks and query reformulations in Web search," ACM Transactions on Information Systems, vol. 25, p. 7, 2007.
[13] A. Kilgarriff, "Word senses," in Word Sense Disambiguation, 2006, pp. 29-46.
[14] H. Lieberman, "Letizia: an agent that assists web browsing," in Proceedings of the 14th International Joint Conference on Artificial Intelligence, Volume 1, Montreal, Quebec, Canada: Morgan Kaufmann Publishers Inc., 1995, pp. 924-929.
[15] R. Navigli, "Word sense disambiguation: A survey," ACM Computing Surveys, vol. 41, pp. 1-69, 2009.
[16] J. Randolph, "Free-marginal multirater kappa (multirater K[free]): An alternative to Fleiss' fixed-marginal multirater kappa," in Joensuu University Learning and Instruction Symposium, Joensuu, Finland, 2005.
[17] M. Sanderson, "Ambiguous queries: test collections need more sense," in Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Singapore: ACM, 2008, pp. 499-506.
[18] F. Scholer, M. Shokouhi, B. Billerbeck, and A. Turpin, "Using clicks as implicit judgments: Expectations versus observations," in Advances in Information Retrieval, 2008, pp. 28-39.
[19] H. Schütze, "Automatic word sense discrimination," Computational Linguistics, vol. 24, pp. 97-123, 1998.
[20] D. Shen, M. Qin, W. Chen, Q. Yang, and Z. Chen, "Mining web query hierarchies from clickthrough data," in Proceedings of the 22nd National Conference on Artificial Intelligence, Volume 1, Vancouver, British Columbia, Canada: AAAI Press, 2007, pp. 341-346.
[21] G. Smith and H. Ashman, "Evaluating implicit judgements from image search interactions," in Proceedings of WebSci'09: Society On-Line, 2009.
[22] G. Smith, T. Brailsford, C. Donner, D. Hooijmaijers, M. Truran, J. Goulding, and H. Ashman, "Generating unambiguous URL clusters from web search," in Proceedings of the 2009 Workshop on Web Search Click Data, Barcelona, Spain: ACM, 2009, pp. 28-34.
[23] K. Spärck Jones, S. E. Robertson, and M. Sanderson, "Ambiguous requests: implications for retrieval tests, systems and theories," SIGIR Forum, vol. 41, pp. 8-17, 2007.
[24] N. Tomasz, "Word sense discovery for web information retrieval," 2008, pp. 267-274.
[25] M. Truran, J. Goulding, and H. Ashman, "Co-active intelligence for image retrieval," in Proceedings of the 13th Annual ACM International Conference on Multimedia, Singapore: ACM, 2005, pp. 547-550.
[26] J. Véronis, "HyperLex: lexical cartography for information retrieval," Computer Speech & Language, vol. 18, pp. 223-252, 2004.
[27] A. J. Viera and J. M. Garrett, "Understanding interobserver agreement: the kappa statistic," Family Medicine, vol. 37, pp. 360-363, 2005.
[28] J.-R. Wen and H.-J. Zhang, Query Clustering in the Web Context. Kluwer Academic Publishers, 2002.
[29] J.-R. Wen, J.-Y. Nie, and H.-J. Zhang, "Query clustering using user logs," ACM Transactions on Information Systems, vol. 20, pp. 59-81, 2002.
[30] G.-R. Xue, H.-J. Zeng, Z. Chen, Y. Yu, W.-Y. Ma, W. Xi, and W. Fan, "Optimizing web search using web click-through data," in Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, Washington, D.C., USA: ACM, 2004, pp. 118-126.