C-Link: Concept Linkage in Knowledge Repositories Peter Cowling , Stephen Remde

C-Link: Concept Linkage in Knowledge Repositories
Peter Cowling1*, Stephen Remde1†, Peter Hartley2, Will Stewart2, Joe Stock-Brooks,3 Tom Woolley
1 Artificial Intelligence Research Group, SCIM, University of Bradford, UK.
2 Teaching Quality Enhancement Group, University of Bradford, UK.
3 National Media Museum, Bradford, UK
*p.i.cowling@bradford.ac.uk, †s.m.remde@bradford.ac.uk
Abstract
When searching a knowledge repository such as Wikipedia
or the Internet, the user doesn’t always know what they are
looking for. Indeed, it is often the case that a user wishes to
find information about a concept that was completely
unknown to them prior to the search. In this paper we
describe C-Link, which provides the user with a method for
searching for unknown concepts which lie between two
known concepts. C-Link does this by modeling the
knowledge repository as a weighted, directed graph where
nodes are concepts and arc weights give the degree of
“relatedness” between concepts. An experimental study was
undertaken with 59 participants to investigate the
performance of C-Link compared to standard search
approaches. Statistical analysis of the results shows great
potential for C-Link as a search tool.
Introduction
Knowledge repositories proliferate at an accelerating rate.
While these offer excellent support for specific information
searches, there is limited support for unstructured browsing
or semi-structured information gathering, when a user does
not know what there is to know (but wants to find
information connecting known concepts). Students making
the transition from school to university often feel
swamped by information and need to develop skills in
information literacy. The need to develop students’ skills
in information literacy has been highlighted as a key
educational objective in the UK and worldwide (SCONUL,
2007). This has also been seen as an essential component
of broader citizenship (Obama, 2009). Tools for
understanding the structure of information in these large
repositories and for conducting semi-structured queries are
needed by University students and by the general public.
C-Link is a tool for semi-structured searching of
knowledge repositories based on finding previously
unknown concepts that lie between other known concepts.
Consider a user who wants to know about optimization of
crystal structures. A search which looks for concepts which
lie between, and hence connect, “optimization” and “crystal
structure” may turn up previously unknown concepts such
as “genetic algorithms” or “space groups”. These would
be very difficult to find via conventional approaches to
search, which assume either that the user has a good
understanding of what terms to search for or that
unknown concepts lie close to one of the concepts already
known.
This paper first introduces C-Link as a search tool and
then describes an experimental trial where 59 students
compared the C-Link tool to the standard search facility of
Wikipedia in a controlled experiment. The aim of this trial
was to test the efficiency and effectiveness of the tool
against traditional search engines and to provide a basis for
our planned future work on the long-term educational
impact of this type of tool. Results of the trial are analyzed,
showing good promise for C-Link, and conclusions are
drawn.
Related Work
Measures of Semantic Relatedness (MSR) allow machines
to calculate the relatedness of two concepts or phrases. The
WordSimilarity-353 collection (Finkelstein, 2002) defines
a set of pairs of terms, and their relatedness as perceived by
a human. This collection is often used to compare
relatedness measures, where a higher correlation with
subjective human relatedness indicates a more successful
method.
Measures can be classified into two types: i) those using
manually created thesauri and ii) those using Corpus-based
approaches. Using manually created thesauri increases
information accuracy, but is time consuming and this limits
Copyright © 2009, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
For our experimental studies we have used Wikipedia as
our knowledge repository, since it offers
several advantages over other sources. Firstly, it is very
accessible. Wikipedia database dumps can be downloaded
which provide a snapshot of the site and can easily be
manipulated locally, avoiding the need for slow
SOAP/REST requests. Secondly, tools exist for parsing
these dumps and extracting meta-information and the links
between pages. We use Wikipedia Miner (Milne, 2009) for
this as it provides an excellent framework for working with
the Wikipedia dumps and also has a method for calculating
the relatedness of two articles. Thirdly, the data is
relatively clean and concise. Wikipedia is prone to
vandalism, but there is far less of this “noise” than is found
more generally on the World Wide Web. Thus the articles
in Wikipedia are the concepts that we will be finding paths
between and the links on the pages will define the
connections between concepts. We use the Wikipedia
Link-based Measurement (WLM) (Milne, 2008) to
determine the relatedness of two concepts.
Our search algorithm is based on A* search (Russell,
2003) in the directed graph of concepts. Hence we aim to
find concepts that lie on short paths between the two target
concepts. The A* algorithm will find an optimal path if h
(the function that estimates the distance to the
destination) never overestimates. Finding an
“optimal” path is not particularly important for this
application, due to the subjectivity of the data and
inaccuracy of the distance function. Generally we want a
diverse selection of concepts that lie on fairly short paths.
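The search described above can be illustrated with a minimal A* sketch over a small weighted digraph. This is not the C-Link implementation: the function name `a_star`, the dictionary-based graph format, the arc weights and the default zero heuristic are all assumptions made for illustration.

```python
import heapq

def a_star(graph, start, goal, h=lambda node: 0.0):
    """Return (cost, path) for a cheapest start->goal path, or None.

    graph maps each node to a dict {neighbour: arc_weight}; h(node) is an
    estimate of the remaining distance to goal (h = 0 gives Dijkstra's search).
    """
    frontier = [(h(start), 0.0, start, [start])]  # entries are (g + h, g, node, path)
    best_g = {start: 0.0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if g > best_g.get(node, float("inf")):
            continue  # stale entry: a cheaper route to this node was found later
        for neighbour, weight in graph.get(node, {}).items():
            new_g = g + weight
            if new_g < best_g.get(neighbour, float("inf")):
                best_g[neighbour] = new_g
                heapq.heappush(
                    frontier,
                    (new_g + h(neighbour), new_g, neighbour, path + [neighbour]),
                )
    return None

# Hypothetical arc weights (lower means more closely related).
graph = {
    "Operations Research": {"Assignment Problem": 0.3, "Linear Programming": 0.3},
    "Assignment Problem": {"Combinatorial Optimization": 0.2},
    "Linear Programming": {"Integer Programming": 0.2},
    "Integer Programming": {"Combinatorial Optimization": 0.2},
}
cost, path = a_star(graph, "Operations Research", "Combinatorial Optimization")
print(path)
```

With a zero heuristic the estimate never overestimates, so the first path returned is a shortest one. A tool that wants a broader range of intermediate concepts could keep expanding the frontier after the first path is found instead of returning immediately.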
the range of vocabulary. WordNet and Roget (McHale,
1998) use manually created thesauri and achieve
correlations of approximately 0.33-0.35 and 0.55
respectively. (Leacock, 1998) apply a similar technique to
Wikipedia, which has a tree-like structure of categories. A
measure of the distance between two articles in this tree is used
to define their relatedness. This achieved correlations of
between 0.19 and 0.49. The advantage of using Wikipedia
in this case is that Wikipedia is constantly being updated
by many people, distributing the effort of keeping the thesaurus up to date.
Corpus-based approaches perform statistical analysis on
large numbers of text documents and hence do not need
explicitly defined vocabularies. Latent Semantic Analysis
(LSA) (Landauer, 1998) is a successful example of a
corpus-based approach, having approximately a 0.56
correlation with human measures; however, its accuracy
relies on an extremely large corpus.
(Milne, 2008) uses a combination of two measures to
compare the relatedness of two articles. The first is
inspired by the Normalized Google Distance (NGD)
(Cilibrasi, 2006). This corpus-based technique uses counts
of the terms appearing on web pages and is so named
because it takes those counts from the Google search engine. The second
is an information-retrieval-based technique that calculates
the alignment of two TF×IDF (term frequency – inverse
document frequency) vectors (Salton, 1988). Individually, the
measures have correlations of 0.72 and 0.66 respectively,
but when combined this increases to approximately 0.74.
More generally, ensemble approaches offer
an interesting avenue of investigation and have achieved
very high levels of success in other domains (Polikar,
2006).
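As a rough illustration of an NGD-inspired measure, the sketch below applies the Normalized Google Distance formula to the sets of pages that link to two articles, in place of web page counts. The function name, the toy in-link sets and the conversion of distance to relatedness by `1 - distance` are assumptions for illustration, not details taken from (Milne, 2008).

```python
import math

def link_relatedness(inlinks_a, inlinks_b, total_articles):
    """NGD-style relatedness of two articles from their sets of incoming links."""
    common = inlinks_a & inlinks_b
    if not common:
        return 0.0  # no shared in-links: treat the articles as unrelated
    larger = max(len(inlinks_a), len(inlinks_b))
    smaller = min(len(inlinks_a), len(inlinks_b))
    denominator = math.log(total_articles) - math.log(smaller)
    if denominator == 0:
        return 1.0  # degenerate case: every article links to both
    # Normalized Google Distance, with in-link counts instead of hit counts.
    distance = (math.log(larger) - math.log(len(common))) / denominator
    # Assumption: turn the distance into a relatedness score in [0, 1].
    return max(0.0, 1.0 - distance)

# Toy data: ids of pages linking to each article (hypothetical).
a = {1, 2, 3, 4, 5, 6}
b = {4, 5, 6, 7}
c = {90, 91}
print(link_relatedness(a, b, total_articles=1000))
print(link_relatedness(a, c, total_articles=1000))
```

Articles that share most of their in-links score close to 1, while articles with no in-links in common score 0.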
Recently, search engines have been looking at new ways
of presenting search results. Google’s latest addition is its
Wonder Wheel¹, which presents the user with related search
queries in a “wheel”. Similarly, Quintura² provides an
interactive way of refining a search by displaying a cloud
of related words. While these tools do not help find links
between concepts, they may help the user to visualise
relationships between search concepts, which may help in
finding previously unknown concepts.
C-Link
C-Link is a new tool, developed at the University of
Bradford, to search for (previously unknown) concepts that
lie on short paths between two known concepts.
We represent our data as a directed graph where each
concept is represented by a node, and the links between
closely connected concepts are represented as arcs on the
graph with a weight inversely proportional to the
“relatedness” of the two concepts. This means that closely
related linked concepts have lower weight (and can be
viewed as closer together).
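The graph representation can be sketched as a dictionary of dictionaries built from page links and a relatedness function. The function name `build_concept_graph`, the toy relatedness scores and the default mapping `weight = 1 - relatedness` are assumptions for illustration; any decreasing mapping (for example `lambda r: 1 / r`) could be passed as `to_weight` instead.

```python
def build_concept_graph(links, relatedness, to_weight=lambda r: 1.0 - r):
    """Build {source: {target: weight}} from directed (source, target) links.

    relatedness(a, b) returns a value in [0, 1]; to_weight turns it into an
    arc weight so that closely related concepts end up close together.
    """
    graph = {}
    for source, target in links:
        weight = to_weight(relatedness(source, target))
        graph.setdefault(source, {})[target] = weight
        graph.setdefault(target, {})  # every concept gets a node, even with no out-links
    return graph

# Hypothetical relatedness scores in [0, 1] for three page links.
scores = {
    ("Operations Research", "Linear Programming"): 0.72,
    ("Linear Programming", "Combinatorial Optimization"): 0.72,
    ("Operations Research", "Banana"): 0.05,
}
graph = build_concept_graph(scores.keys(), lambda a, b: scores[(a, b)])
print(graph["Operations Research"])
```

Arcs are stored in one direction only, so a link from one page to another does not imply an arc back.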
Fig 1. An example C-Link search from “Operations Research” to
“Combinatorial Optimization”. Here multiple paths have been found.
“Operations Research” → “Assignment Problem” → “Combinatorial
Optimization” is the shortest.
¹ http://www.google.com/search?tbs=ww:1&q=search
² http://quintura.com/
axis showing the relation to start and end concepts
respectively. Fig 4 shows an example of this and you can
see that nodes toward the top right of the graph are most
related to both of the concepts whereas nodes toward the
top left or bottom right are more related to a single
concept. From this view the user gets a good overview of
the search and can more easily see the most relevant nodes.
Hovering over nodes in this view will again show the
information from Wikipedia on the concept and also
highlight nodes that link to and from the hovered node.
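The overview plot can be thought of as mapping each visited node to a point whose coordinates are its relatedness to the start and end concepts. The sketch below computes those points and ranks nodes by how close they are to the top right corner. The function names, the ranking rule (the smaller of the two coordinates) and all scores except the “Linear Programming” values from Fig 2 are assumptions for illustration.

```python
def plot_points(nodes, relatedness, start, end):
    """Map each node to (x, y) = (relatedness to start, relatedness to end)."""
    return {n: (relatedness(n, start), relatedness(n, end)) for n in nodes}

def most_relevant(points, k=3):
    """Nodes nearest the top right corner, ranked by their smaller coordinate."""
    return sorted(points, key=lambda n: min(points[n]), reverse=True)[:k]

START, END = "Operations Research", "Combinatorial Optimization"
# Relatedness scores: the first pair follows Fig 2, the rest are hypothetical.
scores = {
    ("Linear Programming", START): 0.721, ("Linear Programming", END): 0.722,
    ("Management Science", START): 0.80, ("Management Science", END): 0.20,
    ("Graph Theory", START): 0.25, ("Graph Theory", END): 0.75,
}
nodes = ["Linear Programming", "Management Science", "Graph Theory"]
points = plot_points(nodes, lambda a, b: scores[(a, b)], START, END)
print(most_relevant(points, k=1))
```

Nodes that are strongly related to only one of the two concepts fall toward the top left or bottom right and rank lower under this rule.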
Fig 1 shows an example sub-graph of the search from
“Operations Research” to “Combinatorial Optimization”.
The shortest path is “Operations Research” →
“Assignment Problem” → “Combinatorial Optimization”;
however, there are other nodes which are very interesting
and related to both of the search terms. The C-Link user
can choose to stop searching when the first (shortest) path
is found, or continue the search to find a broader range of
intermediate concepts.
Linear Programming
In mathematics, linear programming (LP) is a technique for
optimization of a linear objective function, subject to linear equality
and linear inequality constraints. Informally, linear programming
determines the way to achieve the best outcome (such as maximum
profit or lowest cost) in a given mathematical model and given some
list of requirements represented as linear equations.
Distance to ‘Operations Research’: 0.3
Optimistic Distance to ‘Combinatorial Optimization’: 0.3
Similarity to ‘Operations Research’: 72.1%
Similarity to ‘Combinatorial Optimization’: 72.2%
Fig 2. An example C-Link search tooltip for “Linear Programming”
showing the name of the node, the first paragraph from Wikipedia’s
entry for the node, the length of the path to this node, an optimistic
estimate of the distance to the destination node, and the node’s
relatedness to the start and destination nodes.
Fig 4. Plot of all nodes found during a search from “Crystal
Structure” to “Optimization” showing their relatedness to each of the
search concepts. (Although the text here is unreadable the graph has
been shown to illustrate the structure of a longer search graph)
Experimental Trial Setup
In order to investigate the performance of the C-Link
software in practice, we conducted an experimental trial,
consisting of 10 tasks, half of which were to be completed
using C-Link and the other half using the standard built-in
search in Wikipedia. For each of the tasks, the participants
had five minutes to find five concepts that provided
interesting connections between pairs of given concepts.
The pairs of concepts are listed in Table 1. They were also
asked to rate how difficult each task was and how
confident they were in their results.
The trial was conducted with 59 participants, who
generally had high levels of computer literacy (they were
first year computer science undergraduates at the
University of Bradford). 21 of the participants used C-Link
for the first 5 tests, and the standard search approach of
Wikipedia for the next 5. The remaining 38 used the
standard search approach of Wikipedia for the first 5 tests,
and C-Link for the final 5. A script was used to ensure both
groups had the same information. The script instructed the
participants about the format of the trial and gave an
introduction to each tool (including a single 5-minute test
Fig 3. A section of a larger graph. This graph shows “Terrorism” →
“Earth Liberation Foundation” → “Green Scare” → “Global
Warming”. (Although the text here is unreadable the graph has been
shown to illustrate the structure of a longer search graph)
Hovering over a node shows some information about the
node allowing the user to decide how relevant each
concept is. Fig 2 shows an example of this. For more
complicated graphs such as the example shown in Fig 3,
this can be laborious. To get an overview of all the
concepts visited during the search the user has the option to
plot all the nodes on a Cartesian graph with the x- and y-
thing about each tool and provide us with a detailed
comparison of the two methods.
The participants were given lunch and a chance to win
three £50 prizes. Prizes were awarded for the best
feedback, for the person who scored the most points and to
one participant chosen at random.
using the pair “Sport” and “Medicine” before commencing
the tests for each search approach).
Task  Concept a                  Concept b
1     Global Warming             Terrorism
2     Monet                      Cumulus
3     Virus                      Youtube
4     Alan Turing                ELIZA
5     Crystal Structure          Optimisation
6     Deep Inelastic Scattering  Bohr Model
7     Albacore                   Art
8     Operations Research        Combinatorial Optimisation
9     Banana                     Counter-Intelligence
10    Monty Python               Universe
Table 1. Concept pairs used for the 10 tasks.
Analysis
The average responses for all 10 tests are shown in Tables 2
and 4 and users’ overall ratings are shown in
Tables 3 and 5. Results with a grey background are the
class containing the median result and bold results are the
modal results. The mean results in Tables 2-5 are calculated
by associating each response with the value to its left (thus
negative values represent Unconfident/Hard and positive
values represent Confident/Easy).
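The mean calculation can be checked with a short sketch: each response category is given its scale value and the values are averaged, weighted by the share of responses in each category. The function and variable names are illustrative; the percentages are the C-Link column of Table 2.

```python
SCALE = [-5, -3, -1, +1, +3, +5]  # values attached to the six response categories

def likert_mean(percentages, scale=SCALE):
    """Weighted mean of the scale values, given the share of each response."""
    total = sum(percentages)  # close to 100, up to rounding in the table
    return sum(value * share for value, share in zip(scale, percentages)) / total

# C-Link column of Table 2, from "Extremely Unconfident" to "Extremely Confident".
c_link_average_confidence = [9.3, 5.2, 17.7, 43.5, 19.4, 4.8]
print(round(likert_mean(c_link_average_confidence), 2))  # 0.46
```

The result agrees with the mean of +0.46 reported for C-Link in Table 2. Because the published percentages are rounded, means recomputed this way can differ from the reported ones in the last digit.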
Tables 2 and 3 show similar results, although the
difference in means is much greater in the overall results.
The overall results give a better indication of the
participants’ feelings, as they had experienced both tools
when answering these questions. For both tables, the C-Link
and Search values of the mean and median of the average results,
and of the mean, median and mode of the overall results, lie on
opposite sides of the confidence
scale, showing strong evidence that participants were more
confident in their answers when using C-Link. Looking
at individual questions shows further evidence of this, as
the mean confidence was better for C-Link on all
questions, the median was better for C-Link on 5/10 and
the same on 5/10, and the modal confidence was better for
C-Link on 4/10 and the same on 6/10.
Each of the five answers given by each participant for
each trial was marked using a score based on Pointwise
Mutual Information (PMI) (Turney, 2001), to provide a
different measure to that used in the C-Link tool and so
remove a source of bias in the results. We scored each of
the five answers by taking the square root of the product of the
normalized PMI values of the answer with each of the two search terms.
If the normalized PMI of an answer is 1, we give
it 0, as students were specifically asked to avoid
duplicating the known concepts or providing direct
synonyms. So we have,
$$\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}$$
where p(x,y) is the probability a document contains both x
and y, p(z) is the probability a document contains z. Our
intuition then is that, in some way, the PMI measure
measures the excess of the joint probability of x and y over
what it would be if they were independent. When PMI is
normalized between 0 and 1, we have
Value  Average Confidence      C-Link   Search
-5     Extremely Unconfident     9.3%    16.2%
-3     Very Unconfident          5.2%    13.5%
-1     Unconfident              17.7%    27.4%
+1     Confident                43.5%    33.2%
+3     Very Confident           19.4%     8.7%
+5     Extremely Confident       4.8%     1.0%
       Mean                     +0.46    -0.84
Table 2. Average of the confidence results for the 10 tasks.
Value  Overall Confidence      C-Link   Search
-5     Extremely Unconfident     3.4%     8.5%
-3     Very Unconfident          0.0%    16.9%
-1     Unconfident               8.5%    44.1%
+1     Confident                54.2%    27.1%
+3     Very Confident           28.8%     3.4%
+5     Extremely Confident       5.1%     0.0%
       Mean                     +1.40    -1.00
Table 3. Responses to the overall confidence question.
Hence the score for a single task is between 0 and 5, with 5
being the best possible score (although it would be
impossible to achieve 5 in practice). PMI values were
obtained using Rensselaer Polytechnic Institute Cognitive
Science Department's Measures of Semantic Relatedness
Server (Veksler, 2007).
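The scoring scheme can be sketched as follows. This is a reading of the prose description rather than the authors' code: the standard PMI definition with a base-2 logarithm is assumed, the normalization to the range 0 to 1 is taken as given (it is not reproduced here), the zero rule is assumed to apply when either normalized value equals 1, and all function names and input values are hypothetical.

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information from document probabilities (base 2 assumed)."""
    return math.log2(p_xy / (p_x * p_y))

def answer_score(npmi_to_a, npmi_to_b):
    """Score one answer from its normalized PMI (0..1) with each search term."""
    if npmi_to_a == 1 or npmi_to_b == 1:
        return 0.0  # treated as a duplicate or direct synonym of a known concept
    return math.sqrt(npmi_to_a * npmi_to_b)

def task_score(answers):
    """Sum the scores of the answers (at most five) given for one task."""
    return sum(answer_score(a, b) for a, b in answers)

# Hypothetical normalized PMI pairs for the five answers to one task.
answers = [(0.25, 0.25), (0.5, 0.5), (1, 0.5), (0.0, 0.9), (0.64, 0.36)]
print(task_score(answers))
```

Taking the square root of the product rewards answers that are related to both search terms: an answer strongly related to one term but unrelated to the other scores 0. Since each answer scores strictly less than 1, a task score of 5 cannot be reached.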
At the end the participants were asked to rate overall
how confident they were and how easy it was using each
tool. They were also asked to say one positive and negative
Tables 4 and 5 show results for the perceived difficulty of
the task using each search method. Again, the difference in
means is much greater in the overall results. These tables
clearly show that the participants found it easier to do the
tasks using C-Link than using the standard search built into
Wikipedia. Looking at individual questions shows further
distance between the two “known” concepts and the
unknown intermediate concepts meant that the standard
Wikipedia search was able to find good quality answers,
particularly since the Wikipedia pages for these two
concepts contained links to concepts which are highly related
to both.
There were no significant differences in the trial
according to whether a participant used C-Link for the first
or the second half.
Although the participants were given 5 minutes per
question, anecdotal evidence from observations indicated
that the participants completed tasks more quickly using C-Link,
usually finishing in under 4 minutes whereas the full 5
minutes were generally used when using the standard
search facility of Wikipedia. The participants seemed to
particularly enjoy using the C-Link search tool, to the
extent that it was difficult to stop students who used
C-Link second from using the tool at the end of the trial.
evidence of this, as the mean ease was better for C-Link on all questions, the median was better for C-Link on
8/10 and the same on 2/10, and the modal ease was
better for C-Link on 7/10 and the same on 3/10.
Value  Average Ease      C-Link   Search
-5     Extremely Hard      6.0%    15.5%
-3     Very Hard           7.1%    20.0%
-1     Hard               20.1%    33.8%
+1     Easy               36.0%    24.6%
+3     Very Easy          22.3%     4.4%
+5     Extremely Easy      8.4%     1.7%
       Mean               +0.74    -1.24
Table 4. Average of the ease results for the 10 tasks.
Value  Overall Ease      C-Link   Search
-5     Extremely Hard      1.7%     3.4%
-3     Very Hard           1.7%    27.1%
-1     Hard                8.5%    22.0%
+1     Easy               35.6%    37.3%
+3     Very Easy          37.3%     5.1%
+5     Extremely Easy     15.3%     5.1%
       Mean               +2.02    -0.42
Table 5. Responses to the overall ease question.
Discussion
When searching a knowledge repository it is often the
previously unknown items which have the highest value;
however, we must find a way to reach these unknown
concepts from known concepts. C-Link provides one such
way (which we regard as being somewhat natural). If C-Link
were to be used widely, this would require a
significant rethink of the methods for searching knowledge
repositories, but in certain domains, such as legal case
analysis, the search for linked concepts is already natural.
In this trial participants were given pairs of concepts that
we regard as typical of the pairs of concepts that might be
searched (because they are related in an obscure way). If a
C-Link user were searching for an unusual and previously
unknown facet of a known concept (e.g. to find the
political leanings of a public figure), then C-Link also
provides a natural way to do this. By finding paths to a
number of “standard” secondary concepts, we can also use
the C-Link methodology to actively search for many
different facets of any single known concept.
Fig 5 shows the distribution of average scores for C-Link
and the standard Wikipedia search averaged over all
questions and participants (using the PMI measure). Given
the 59 participants and 10 questions (and the fact that we
are evaluating these results using a measure independent
of the C-Link measure), we have strong evidence for the
fact that C-Link significantly outperformed the standard
Wikipedia search in this experiment.
Conclusions and Further Work
Often, when searching, a user is looking for concepts
which are not known prior to the search. We have
presented C-Link, an approach where the user enters a pair
of concepts, and a search is made for concepts that lie
between the chosen concepts when we consider the
concepts as lying in a weighted digraph where each weight
measures relatedness.
An experimental trial was undertaken involving 59
participants testing C-Link against the standard search
tools of Wikipedia. Ten tasks were performed using both
methods and the group was split so some would use C-Link
first and some would use standard Search first so as
Fig 5. Plot showing the distribution of scores averaged over
all questions (x-axis: average score, 0 to 5; y-axis: % of
population; series: C-Link and Search).
The mean average of the results was better for C-Link on
all questions except question number 8. Here, the two
concepts (Combinatorial Optimization and Operations
Research) are closely related, but probably unknown to the
participants. The results of question 8 show that C-Link
users were more confident in their answers and found
the task easier than the Search users, but the very short
not to bias the results. Every effort was made to keep the
test fair.
The results show that participants felt that C-Link made
it easier to find previously unknown concepts which lay
between known (but possibly unrelated) search terms, and
that they were more confident in their answers. This was
reflected in their answers with participants scoring more
points using C-Link than normal search.
Detailed observation of the results provided by C-Link
shows that there is definite scope for improvement. Unusual
or erroneous linkages between concepts in Wikipedia lead
to C-Link finding spurious links. It would be highly useful
for the user of C-Link to be able to identify promising
concepts for further expansion, and highlight red herrings.
In future work, new search algorithms and different
similarity measures will be tried, which make best use of
available data, and allow for the user to guide the search.
C-Link has the potential to be used in a range of
knowledge repositories where linkage between concepts,
and finding previously unknown concepts, is key. We aim
to apply this approach to legal, citation, historical,
intelligence and other repositories, as well as adapting the
approach to work in the unstructured environment of the
Internet.
Acknowledgements
This research was made possible through a JISC Rapid
Innovations grant.
References
Cilibrasi, R. & Vitanyi, P. 2006. Similarity of objects and
the meaning of words. In J.-Y. Cai, S. B. Cooper, and A. Li
(Eds.) Proc. 3rd Conf. Theory and Applications of Models
of Computation (TAMC), Lecture Notes in Computer
Science, vol 3959, Springer-Verlag, Berlin, pp 21-45.
Finkelstein, L., Gabrilovich, Y.M., Rivlin, E., Solan, Z.,
Wolfman, G. and Ruppin, E. 2002. Placing search in
context: The concept revisited. ACM Transactions on
Information Systems 20(1), pp 116-131.
Landauer, T.K., Foltz, P.W. and Laham, D. 1998. An
introduction to latent semantic analysis. Discourse
Processes 25(2-3), pp 259-284.
Leacock, C. & Chodorow, M. 1998. Combining local
context and WordNet similarity for word sense
identification. In C. Fellbaum (Ed.), WordNet. An
Electronic Lexical Database, Cambridge, Mass.: MIT
Press, pp 265-283.
McHale, M. 1998. A Comparison of WordNet and Roget's
Taxonomy for Measuring Semantic Similarity. In
Proceedings of COLING/ACL Workshop on Usage of
WordNet in Natural Language Processing Systems,
Montréal, Canada, pp 115-120.
Milne, D. and Witten, I.H. 2008. An effective, low-cost
measure of semantic relatedness obtained from Wikipedia
links. In Proceedings of the first AAAI Workshop on
Wikipedia and Artificial Intelligence (WIKIAI'08),
Chicago, IL.
Milne, D. and Witten, I.H. 2009. An Open-Source Toolkit
for Mining Wikipedia. Online (Accessed October 15, 2009):
http://www.cs.waikato.ac.nz/~dnk2/publications/AnOpenSourceToolkitForMiningWikipedia.pdf
Obama, B. 2009. National Information Literacy
Awareness Month, 2009. Online (Accessed October 15,
2009):
http://www.whitehouse.gov/the_press_office/Presidential-Proclamation-National-Information-Literacy-Awareness-Month/
Polikar, R. 2006. Ensemble based systems in decision
making. IEEE Circuits and Systems Magazine 6(3), pp 21-45.
Russell, S.J. and Norvig, P. 2003. Artificial Intelligence: A
Modern Approach (2nd ed.), Upper Saddle River, NJ:
Prentice Hall, ISBN 0-13-790395-2.
Salton, G. and Buckley, C. 1988. Term-weighting
approaches in automatic text retrieval. Information
Processing & Management 24(5), pp 513-523.
SCONUL 2007. Information Skills in Higher Education: A
SCONUL Position Paper. Online (Accessed October 15,
2009):
http://www.sconul.ac.uk/groups/information_literacy/papers/Seven_pillars.html
Turney, P. 2001. Mining the Web for Synonyms: PMI
versus LSA on TOEFL. In L. De Raedt & P. Flach (Eds.),
Proceedings of the Twelfth European Conference on
Machine Learning (ECML-2001), Freiburg, Germany,
pp 491-502.
Veksler, V. D., Grintsvayg, A., Lindsey, R., & Gray, W. D.
2007. A proxy for all your semantic needs. 29th Annual
Meeting of the Cognitive Science Society, CogSci2007,
Nashville, TN.