C-Link: Concept Linkage in Knowledge Repositories

Peter Cowling1*, Stephen Remde1†, Peter Hartley2, Will Stewart2, Joe Stock-Brooks,3 Tom Woolley

1 Artificial Intelligence Research Group, SCIM, University of Bradford, UK.
2 Teaching Quality Enhancement Group, University of Bradford, UK.
3 National Media Museum, Bradford, UK.
*p.i.cowling@bradford.ac.uk, †s.m.remde@bradford.ac.uk

Abstract

When searching a knowledge repository such as Wikipedia or the Internet, the user doesn’t always know what they are looking for. Indeed, it is often the case that a user wishes to find information about a concept that was completely unknown to them prior to the search. In this paper we describe C-Link, which provides the user with a method for searching for unknown concepts which lie between two known concepts. C-Link does this by modeling the knowledge repository as a weighted, directed graph where nodes are concepts and arc weights give the degree of “relatedness” between concepts. An experimental study was undertaken with 59 participants to investigate the performance of C-Link compared to standard search approaches. Statistical analysis of the results shows great potential for C-Link as a search tool.

Introduction

Knowledge repositories proliferate at an accelerating rate. While these offer excellent support for specific information searches, there is limited support for unstructured browsing or semi-structured information gathering, when a user does not know what there is to know (but wants to find information connecting known concepts). Students making the transition from school to university often feel swamped by information and need to develop skills in information literacy. The need to develop students’ skills in information literacy has been highlighted as a key educational objective in the UK and worldwide (SCONUL, 2007). This has also been seen as an essential component of broader citizenship (Obama, 2009). Tools for understanding the structure of information in these large repositories and for conducting semi-structured queries are needed by university students and by the general public.

C-Link is a tool for semi-structured searching of knowledge repositories based on finding previously unknown concepts that lie between other known concepts. Consider a user who wants to know about optimization of crystal structures. A search for concepts that lie between, and hence connect, “optimization” and “crystal structure” may turn up previously unknown concepts such as “genetic algorithms” or “space groups”, which would be very difficult to find via conventional approaches to search (which assume that the user has a good understanding of what terms to search for, or that unknown concepts lie close to one of the concepts already known). This paper first introduces C-Link as a search tool and then describes an experimental trial where 59 students compared the C-Link tool to the standard search facility of Wikipedia in a controlled experiment. The aim of this trial was to test the efficiency and effectiveness of the tool against traditional search engines and to provide a basis for our planned future work on the long-term educational impact of this type of tool. Results of the trial are analyzed, showing good promise for C-Link, and conclusions are drawn.

Related Work

Measures of Semantic Relatedness (MSR) allow machines to calculate the relatedness of two concepts or phrases. The WordSimilarity-353 collection (Finkelstein, 2002) defines a set of pairs of terms, and their relatedness as perceived by a human. This collection is often used to compare relatedness measures, where a higher correlation with subjective human relatedness indicates a more successful method. Measures can be classified into two types: i) those using manually created thesauri and ii) those using corpus-based approaches.
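The comparison protocol described above, in which a relatedness measure is judged by how well its scores correlate with human ratings, can be sketched as follows. This is only an illustration: the ratings and scores are made-up placeholders rather than entries from WordSimilarity-353, and Spearman rank correlation is assumed as the statistic, since the text does not say which correlation is used.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values that starts at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; assumes neither list is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical data: one human rating and one machine score per term pair.
human_ratings = [9.0, 7.5, 6.0, 2.0, 0.5]
measure_scores = [0.92, 0.40, 0.71, 0.15, 0.10]
rho = spearman(human_ratings, measure_scores)
print(round(rho, 2))
```

A measure whose ranking of the term pairs matches the human ranking exactly would give a value of 1; swapping the order of two neighbouring pairs, as in the placeholder data, lowers it.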
Using manually created thesauri increases information accuracy, but is time consuming and this limits the range of vocabulary.

Copyright © 2009, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

For our experimental studies we have used Wikipedia as our knowledge repository, since it offers several advantages over other sources. Firstly, it is very accessible: Wikipedia database dumps, which provide a snapshot of the site, can be downloaded and manipulated locally, avoiding the need for slow SOAP/REST requests. Secondly, tools exist for parsing these dumps and extracting meta-information and the links between pages. We use Wikipedia Miner (Milne, 2009) for this, as it provides an excellent framework for working with the Wikipedia dumps and also has a method for calculating the relatedness of two articles. Thirdly, the data is relatively clean and concise. Wikipedia is prone to vandalism, but this “noise” is much less than that found more generally on the World Wide Web. Thus the articles in Wikipedia are the concepts that we find paths between, and the links on the pages define the connections between concepts. We use the Wikipedia Link-based Measure (WLM) (Milne, 2008) to determine the relatedness of two concepts.

Our search algorithm is based on A* search (Russell, 2003) in the directed graph of concepts. Hence we aim to find concepts that lie on short paths between the two target concepts. The A* algorithm will find an optimal path if h (the function that estimates the distance from a node to the destination) never overestimates. Finding an “optimal” path is not particularly important for this application, due to the subjectivity of the data and the inaccuracy of the distance function. Generally we want a diverse selection of concepts that lie on fairly short paths.
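The A* search described above can be sketched on a toy concept graph. This is a minimal illustration under stated assumptions, not the C-Link implementation: arc weights and the heuristic are both taken to be 1 minus relatedness (an assumed mapping), the relatedness values are invented rather than computed by WLM, and the names `a_star`, `toy_relatedness` and `TOY_SCORES` are hypothetical.

```python
import heapq


def a_star(graph, relatedness, start, goal):
    """Return (cost, path) for the first path A* finds, or None.

    graph: dict mapping each concept to the concepts its page links to.
    relatedness: function (a, b) -> value in [0, 1], with 1 for a == b.
    """
    def distance(a, b):
        # Assumed weight: closely related concepts are closer together.
        return 1.0 - relatedness(a, b)

    # Heap entries are (f = g + h, g, node, path so far).
    frontier = [(distance(start, goal), 0.0, start, [start])]
    best_g = {start: 0.0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if g > best_g.get(node, float("inf")):
            continue  # a cheaper route to this node was already expanded
        for neighbour in graph.get(node, ()):
            new_g = g + distance(node, neighbour)
            if new_g < best_g.get(neighbour, float("inf")):
                best_g[neighbour] = new_g
                f = new_g + distance(neighbour, goal)  # h = distance to goal
                heapq.heappush(frontier, (f, new_g, neighbour, path + [neighbour]))
    return None


# Invented relatedness values for illustration only.
TOY_SCORES = {
    frozenset(["Operations Research", "Assignment Problem"]): 0.8,
    frozenset(["Assignment Problem", "Combinatorial Optimization"]): 0.8,
    frozenset(["Operations Research", "Linear Programming"]): 0.7,
    frozenset(["Linear Programming", "Combinatorial Optimization"]): 0.7,
    frozenset(["Operations Research", "Combinatorial Optimization"]): 0.6,
}


def toy_relatedness(a, b):
    if a == b:
        return 1.0
    return TOY_SCORES.get(frozenset([a, b]), 0.0)


toy_graph = {
    "Operations Research": ["Assignment Problem", "Linear Programming"],
    "Assignment Problem": ["Combinatorial Optimization"],
    "Linear Programming": ["Combinatorial Optimization"],
    "Combinatorial Optimization": [],
}

cost, path = a_star(toy_graph, toy_relatedness,
                    "Operations Research", "Combinatorial Optimization")
print(path)
```

Because the heuristic here is not guaranteed to be an underestimate, the first path returned need not be the cheapest in general, which matches the remark that optimality is not the main concern. The sketch stops at the first path; collecting a diverse set of intermediate concepts would require continuing to expand the frontier.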
WordNet and Roget (McHale, 1998) use manually created thesauri and achieve correlations of approximately 0.33-0.35 and 0.55 respectively. (Leacock, 1998) apply a similar technique to Wikipedia, which has a tree-like structure of categories. A measure of the distance between two articles in this tree is used to define their relatedness. This achieved correlations of between 0.19 and 0.49. The advantage of using Wikipedia in this case is that it is constantly being updated by many people, distributing the effort of keeping the thesaurus up to date.

Corpus-based approaches perform statistical analysis on large numbers of text documents and hence do not need explicitly defined vocabularies. Latent Semantic Analysis (LSA) (Landauer, 1998) is a successful example of a corpus-based approach, having a correlation of approximately 0.56 with human measures; however, its accuracy relies on an extremely large corpus. (Milne, 2008) uses a combination of two measures to compare the relatedness of two articles. The first is inspired by the Normalized Google Distance (NGD) (Cilibrasi, 2006), a corpus-based technique that uses counts of the web pages on which terms appear and is named for its use of the Google search engine to obtain those counts. The second is an information-retrieval-based technique that calculates the alignment of two TF×IDF (term frequency × inverse document frequency) vectors (Salton, 1988). Individually, the measures have correlations of 0.72 and 0.66 respectively, but when combined this increases to approximately 0.74. It is worth noting that ensemble approaches generally offer an interesting avenue of investigation, having achieved very high levels of success in other domains (Polikar, 2006).

Recently, search engines have been looking at new ways of presenting search results. Google’s latest addition is the Wonder Wheel (footnote 1), which presents the user with related search queries in a “wheel”.
Similarly, Quintura (footnote 2) provides an interactive way of refining a search by displaying a cloud of related words. While these tools do not help find links between concepts, they may help the user to visualise relationships between search concepts, which may help in finding previously unknown concepts.

Footnote 1: http://www.google.com/search?tbs=ww:1&q=search
Footnote 2: http://quintura.com/

C-Link

C-Link is a new tool, developed at the University of Bradford, to search for (previously unknown) concepts that lie on short paths between two known concepts. We represent our data as a directed graph where each concept is represented by a node, and the links between closely connected concepts are represented as arcs with a weight inversely proportional to the “relatedness” of the two concepts. This means that closely related linked concepts have lower weight (and can be viewed as closer together).

Fig 1. An example C-Link search from “Operations Research” to “Combinatorial Optimization”. Here multiple paths have been found. “Operations Research” → “Assignment Problem” → “Combinatorial Optimization” is the shortest.

Fig. 1 shows an example sub-graph of the search from “Operations Research” to “Combinatorial Optimization”. The shortest path is “Operations Research” → “Assignment Problem” → “Combinatorial Optimization”; however, there are other nodes which are very interesting and related to both of the search terms. The C-Link user can choose to stop searching when the first (shortest) path is found, or continue the search to find a broader range of intermediate concepts.

Linear Programming
In mathematics, linear programming (LP) is a technique for optimization of a linear objective function, subject to linear equality and linear inequality constraints. Informally, linear programming determines the way to achieve the best outcome (such as maximum profit or lowest cost) in a given mathematical model and given some list of requirements represented as linear equations.
Distance to ‘Operations Research’: 0.3
Optimistic Distance to ‘Combinatorial Optimization’: 0.3
Similarity to ‘Operations Research’: 72.1%
Similarity to ‘Combinatorial Optimization’: 72.2%

Fig 2. An example C-Link search tooltip for “Linear Programming” showing the name of the node, the first paragraph from Wikipedia’s entry for the node, the length of the path to this node, an optimistic estimate of the distance to the destination node, and the node’s relatedness to the start and destination nodes.

Fig 3. A section of a larger graph. This graph shows “Terrorism” → “Earth Liberation Foundation” → “Green Scare” → “Global Warming”. (Although the text here is unreadable, the graph has been shown to illustrate the structure of a longer search graph.)

Hovering over a node shows some information about the node, allowing the user to decide how relevant each concept is. Fig. 2 shows an example of this. For more complicated graphs, such as the example shown in Fig. 3, this can be laborious. To get an overview of all the concepts visited during the search, the user has the option to plot all the nodes on a Cartesian graph, with the x- and y-axes showing the relation to the start and end concepts respectively. Fig. 4 shows an example of this: nodes toward the top right of the graph are most related to both of the concepts, whereas nodes toward the top left or bottom right are more related to a single concept. From this view the user gets a good overview of the search and can more easily see the most relevant nodes. Hovering over nodes in this view again shows the information from Wikipedia on the concept and also highlights nodes that link to and from the hovered node.

Fig 4. Plot of all nodes found during a search from “Crystal Structure” to “Optimization” showing their relatedness to each of the search concepts. (Although the text here is unreadable, the graph has been shown to illustrate the structure of a longer search graph.)

Experimental Trial Setup

In order to investigate the performance of the C-Link software in practice, we conducted an experimental trial consisting of 10 tasks, half of which were to be completed using C-Link and the other half using the standard built-in search in Wikipedia. For each of the tasks, the participants had five minutes to find five concepts that provided interesting connections between pairs of given concepts. The pairs of concepts are listed in Table 1. They were also asked to rate how difficult each task was and how confident they were in their results.

The trial was conducted with 59 participants, who generally had high levels of computer literacy (they were first year computer science undergraduates at the University of Bradford). 21 of the participants used C-Link for the first 5 tests, and the standard search approach of Wikipedia for the next 5. The remaining 38 used the standard search approach of Wikipedia for the first 5 tests, and C-Link for the final 5. A script was used to ensure both groups had the same information. The script instructed the participants about the format of the trial and gave an introduction to each tool (including a single 5-minute test using the pair “Sport” and “Medicine” before commencing the tests for each search approach).

Task  Concept a                  Concept b
1     Global Warming             Terrorism
2     Monet                      Cumulus
3     Virus                      Youtube
4     Alan Turing                ELIZA
5     Crystal Structure          Optimisation
6     Deep Inelastic Scattering  Bohr Model
7     Albacore                   Art
8     Operations Research        Combinatorial Optimisation
9     Banana                     Counter-Intelligence
10    Monty Python               Universe

Table 1. Concept pairs used for the 10 tasks.

thing about each tool and provide us with a detailed comparison of the two methods. The participants were given lunch and a chance to win three £50 prizes. Prizes were awarded for the best feedback, for the person who scored the most points and to one participant chosen at random.
Analysis

The average responses for all 10 tests are shown in Tables 2 and 4, and the averages of the users’ overall ratings are shown in Tables 3 and 5. Results with a grey background are the class containing the median result and bold results are the modal results. The mean results in Tables 2-5 are calculated by associating each response with the value to its left (thus negative values represent Unconfident/Hard and positive values represent Confident/Easy). Tables 2 and 3 show similar results, although the difference in means is a lot greater in the overall results. The overall results give a better indication of the participants’ feelings, as they had experienced both tools when answering these questions. For both these tables, the mean and median of the average and the mean, median and mode of the overall are on opposite sides of the confidence scale, showing strong evidence that participants were more confident in their answers when using C-Link. Looking at individual questions shows further evidence of this, as the mean average confidence was better for C-Link on all questions, the median was better for C-Link on 5/10 and the same on 5/10, and the modal confidence was better for C-Link on 4/10 and the same on 6/10.

Each of the five answers given by each participant for each task was marked using a score based on Pointwise Mutual Information (PMI) (Turney, 2001), to provide a different measure to that used in the C-Link tool and so remove a source of bias in the results. We scored each of the five answers by taking the square root of the product of the answer’s normalized PMI values with the two search terms. If the PMI of an answer is 1, we give it a score of 0, as students were specifically asked to avoid duplicating the known concepts or providing direct synonyms. So we have

$$\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}$$

where p(x,y) is the probability a document contains both x and y, and p(z) is the probability a document contains z.
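The scoring rule described above can be sketched in code. This is an illustration under stated assumptions, not the scoring script used in the trial: `pmi` follows the standard definition with a base-2 logarithm (the base is an assumption), while `answer_score` and `task_score` are hypothetical helper names that take normalized PMI values in [0, 1] as given, because the text does not spell out how the normalization is done. Treating a normalized PMI of exactly 1 with either search term as a duplicate or synonym is also an interpretation of the rule.

```python
from math import log2, sqrt


def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information from document probabilities.

    Zero when x and y co-occur exactly as often as independence predicts,
    positive when they co-occur more often. Base 2 is an assumption.
    """
    return log2(p_xy / (p_x * p_y))


def answer_score(npmi_a, npmi_b):
    """Score one answer from its normalized PMI with each search term."""
    if npmi_a == 1.0 or npmi_b == 1.0:
        # Duplicates or direct synonyms of a known concept earn nothing.
        return 0.0
    return sqrt(npmi_a * npmi_b)


def task_score(answers):
    """Sum the scores of at most five (npmi_a, npmi_b) pairs for one task."""
    return sum(answer_score(a, b) for a, b in answers[:5])


# Made-up normalized PMI values for five answers to one task.
example_answers = [(0.25, 0.64), (0.5, 0.5), (1.0, 0.7), (0.81, 0.49), (0.0, 0.9)]
total = task_score(example_answers)
print(round(total, 2))
```

The geometric mean rewards answers that are related to both search terms: an answer strongly related to one term but unrelated to the other (the last pair above) scores zero.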
Our intuition then is that the PMI measure captures the excess of the joint probability of x and y over what it would be if they were independent. When PMI is normalized between 0 and 1, we have, for an answer a to a task with search concepts c_1 and c_2,

$$\mathrm{score}(a) = \sqrt{\mathrm{PMI}_{\mathrm{norm}}(a, c_1)\,\mathrm{PMI}_{\mathrm{norm}}(a, c_2)}$$

and the score for a task is the sum of the scores of its five answers. Hence the score for a single task is between 0 and 5, with 5 being the best possible score (although it would be impossible to achieve 5 in practice). PMI values were obtained using the Rensselaer Polytechnic Institute Cognitive Science Department's Measures of Semantic Relatedness Server (Veksler, 2007).

At the end the participants were asked to rate overall how confident they were and how easy it was using each tool. They were also asked to say one positive and negative

Value  Average Confidence     C-Link  Search
-5     Extremely Unconfident  9.3%    16.2%
-3     Very Unconfident       5.2%    13.5%
-1     Unconfident            17.7%   27.4%
+1     Confident              43.5%   33.2%
+3     Very Confident         19.4%   8.7%
+5     Extremely Confident    4.8%    1.0%
       Mean                   +0.46   -0.84

Table 2. Average of the confidence results for the 10 tasks.

Value  Overall Confidence     C-Link  Search
-5     Extremely Unconfident  3.4%    8.5%
-3     Very Unconfident       0.0%    16.9%
-1     Unconfident            8.5%    44.1%
+1     Confident              54.2%   27.1%
+3     Very Confident         28.8%   3.4%
+5     Extremely Confident    5.1%    0.0%
       Mean                   +1.40   -1.00

Table 3. Responses to the overall confidence question.

Tables 4 and 5 show results for the perceived difficulty of the task using each search method. Again, the difference in means is a lot greater in the overall results. These tables clearly show that the participants found it easier to do the tasks using C-Link than using the standard search built into Wikipedia. Looking at individual questions shows further evidence of this, as the mean average ease was better for C-Link on all questions, the median was better for C-Link on 8/10 and the same on 2/10, and the modal ease was better for C-Link on 7/10 and the same on 3/10.

Value  Average Ease    C-Link  Search
-5     Extremely Hard  6.0%    15.5%
-3     Very Hard       7.1%    20.0%
-1     Hard            20.1%   33.8%
+1     Easy            36.0%   24.6%
+3     Very Easy       22.3%   4.4%
+5     Extremely Easy  8.4%    1.7%
       Mean            +0.74   -1.24

Table 4. Average of the ease results for the 10 tasks.

Value  Overall Ease    C-Link  Search
-5     Extremely Hard  1.7%    3.4%
-3     Very Hard       1.7%    27.1%
-1     Hard            8.5%    22.0%
+1     Easy            35.6%   37.3%
+3     Very Easy       37.3%   5.1%
+5     Extremely Easy  15.3%   5.1%
       Mean            +2.02   -0.42

Table 5. Responses to the overall ease question.

Fig. 5 shows the distribution of average scores for C-Link and the standard Wikipedia search averaged over all questions and participants (using the PMI measure). Given the 59 participants and 10 questions (and the fact that we are evaluating these results using a measure independent of the C-Link measure), we have strong evidence that C-Link significantly outperformed the standard Wikipedia search in this experiment.

Fig 5. Plot showing the distribution of scores averaged over all questions. (Horizontal axis: Average Score, 0 to 5; vertical axis: % of population, 0% to 100%; series: C-Link and Search.)

The mean average of the results was better for C-Link on all questions except question number 8. Here, the two concepts (Combinatorial Optimization and Operations Research) are closely related, but probably unknown to the participants. The results of question 8 show that C-Link users were more confident in their answers and found the task easier than the Search users, but the very short distance between the two “known” concepts and the unknown intermediate concepts meant that the standard Wikipedia search was able to find good quality answers, particularly since the Wikipedia pages for these two concepts contained links to concepts which are highly related to both. There were no significant differences in the trial according to whether a participant used C-Link for the first or the second half. Although the participants were given 5 minutes per question, anecdotal evidence from observations indicated that the participants completed tasks more quickly using C-Link, usually finishing in under 4 minutes, whereas the full 5 minutes were generally used with the standard search facility of Wikipedia. The participants seemed to particularly enjoy using the C-Link search tool, to the extent that it was difficult to stop the students who used C-Link second from using the tool at the end of the trial.

Discussion

When searching a knowledge repository it is often the previously unknown items which have the highest value; however, we must find a way to reach these unknown concepts from known concepts. C-Link provides one such way (which we regard as being somewhat natural). If C-Link were to be used widely, this would require a significant rethink of the methods for searching knowledge repositories, but in certain domains, such as legal case analysis, the search for linked concepts is already natural. In this trial participants were given pairs of concepts that we regard as typical of the pairs of concepts that might be searched (because they are related in an obscure way). If a C-Link user were searching for an unusual and previously unknown facet of a known concept (e.g. to find the political leanings of a public figure), then C-Link also provides a natural way to do this. By finding paths to a number of “standard” secondary concepts, we can also use the C-Link methodology to actively search for many different facets of any single known concept.

Conclusions and Further Work

Often, when searching, a user is looking for concepts which are not known prior to the search. We have presented C-Link, an approach where the user enters a pair of concepts, and a search is made for concepts that lie between the chosen concepts when we consider the concepts as lying in a weighted digraph where each weight measures relatedness. An experimental trial was undertaken involving 59 participants testing C-Link against the standard search tools of Wikipedia. Ten tasks were performed, half using each method, and the group was split so that some would use C-Link first and some would use standard Search first, so as not to bias the results. Every effort was made to keep the test fair. The results show that participants felt that C-Link made it easier to find previously unknown concepts which lay between known (but possibly unrelated) search terms, and that they were more confident in their answers. This was reflected in their answers, with participants scoring more points using C-Link than normal search.

Detailed observation of the results provided by C-Link shows that there is definite scope for improvement. Unusual or erroneous linkages between concepts in Wikipedia lead to C-Link finding spurious links. It would be highly useful for the user of C-Link to be able to identify promising concepts for further expansion, and highlight red herrings. In future work, new search algorithms and different similarity measures will be tried, which make best use of available data and allow the user to guide the search. C-Link has the potential to be used in a range of knowledge repositories where linkage between concepts, and finding previously unknown concepts, is key. We aim to apply this approach to legal, citation, historical, intelligence and other repositories, as well as adapting the approach to work in the unstructured environment of the Internet.

Acknowledgements

This research was made possible through a JISC Rapid Innovations grant.

References

Cilibrasi, R. and Vitanyi, P. 2006. Similarity of objects and the meaning of words. In J.-Y. Cai, S. B. Cooper, and A. Li (Eds.), Proc. 3rd Conf. Theory and Applications of Models of Computation (TAMC), Lecture Notes in Computer Science, vol 3959, Springer-Verlag, Berlin, pp. 21-45.

Finkelstein, L., Gabrilovich, Y.M., Rivlin, E., Solan, Z., Wolfman, G. and Ruppin, E. 2002. Placing search in context: The concept revisited. ACM Transactions on Information Systems 20(1), pp. 116-131.

Landauer, T.K., Foltz, P.W. and Laham, D. 1998. An introduction to latent semantic analysis. Discourse Processes 25(2-3), pp. 259-284.

Leacock, C. and Chodorow, M. 1998. Combining local context and WordNet similarity for word sense identification. In C. Fellbaum (Ed.), WordNet. An Electronic Lexical Database, Cambridge, Mass.: MIT Press, pp. 265-283.

McHale, M. 1998. A Comparison of WordNet and Roget's Taxonomy for Measuring Semantic Similarity. In Proceedings of COLING/ACL Workshop on Usage of WordNet in Natural Language Processing Systems, Montréal, Canada, pp. 115-120.

Milne, D. and Witten, I.H. 2008. An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proceedings of the first AAAI Workshop on Wikipedia and Artificial Intelligence (WIKIAI'08), Chicago, IL.

Milne, D. and Witten, I.H. 2009. An Open-Source Toolkit for Mining Wikipedia. Online (Accessed October 15, 2009): http://www.cs.waikato.ac.nz/~dnk2/publications/AnOpenSourceToolkitForMiningWikipedia.pdf

Obama, B. 2009. National Information Literacy Awareness Month, 2009. Online (Accessed October 15, 2009): http://www.whitehouse.gov/the_press_office/Presidential-Proclamation-National-Information-Literacy-Awareness-Month/

Polikar, R. 2006. Ensemble based systems in decision making. IEEE Circuits and Systems Magazine 6(3), pp. 21-45.

Russell, S.J. and Norvig, P. 2003. Artificial Intelligence: A Modern Approach (2nd ed.), Upper Saddle River, NJ: Prentice Hall, ISBN 0-13-790395-2.

Salton, G. and Buckley, C. 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management 24(5), pp. 513-523.

SCONUL 2007. Information Skills in Higher Education: A SCONUL Position Paper. Online (Accessed October 15, 2009): http://www.sconul.ac.uk/groups/information_literacy/papers/Seven_pillars.html

Turney, P. 2001. Mining the Web for Synonyms: PMI versus LSA on TOEFL. In L. De Raedt and P. Flach (Eds.), Proceedings of the Twelfth European Conference on Machine Learning (ECML-2001), Freiburg, Germany, pp. 491-502.

Veksler, V.D., Grintsvayg, A., Lindsey, R. and Gray, W.D. 2007. A proxy for all your semantic needs. 29th Annual Meeting of the Cognitive Science Society, CogSci2007, Nashville, TN.