Discovering Attribute and Entity Synonyms

Discovering Attribute and Entity Synonyms for
Knowledge Integration and Semantic Web Search
Hamid Mousavi, Shi Gao, Carlo Zaniolo
Technical Report #130013
Computer Science Department, UCLA
Los Angeles, USA
hmousavi@cs.ucla.edu, gaoshi@cs.ucla.edu, zaniolo@cs.ucla.edu
Abstract— There is a growing interest in supporting semantic
search on knowledge bases such as DBpedia, YaGo, FreeBase, and
other similar systems, which play a key role in many semantic
web applications. Although the standard RDF format ⟨subject, attribute, value⟩ is often used by these systems, the sharing of
their knowledge is hampered by the fact that various synonyms
are frequently used to denote the same entity or attribute—
actually, even an individual system may use alternative synonyms
in different contexts, and polynyms also represent a frequent
problem. Recognizing such synonyms and polynyms is critical
for improving the precision and recall of semantic search. Most
previous efforts in this area have focused on entity synonym
recognition, whereas attribute synonyms were neglected, and so
was the use of context to select the appropriate synonym. For
instance, the attribute ‘birthdate’ can be a synonym for ‘born’
when it is used with a value of type ‘date’; but if ‘born’ comes
with values which indicate places, then ‘birthplace’ should be
considered as its synonym. Thus, the context is critical to find
more specific and accurate synonyms.
In this paper, we propose new techniques to generate context-aware synonyms for the entities and attributes that we are using
to reconcile knowledge extracted from various sources. To this
end, we propose the Context-aware Synonym Suggestion System
(CS 3 ) which learns synonyms from text by using our NLP-based
text mining framework, called SemScape, and also from existing
evidence in the current knowledge bases. Using CS 3 and our
previously proposed knowledge extraction system IBminer, we
integrate some of the publicly available knowledge bases into
one of superior quality and coverage, called IKBstore.
I. INTRODUCTION
The importance of knowledge bases in semantic-web applications has motivated the endeavors of several important
projects that have created the public-domain knowledge bases
shown in Table I. The project described in this paper seeks
to integrate and extend these knowledge bases into a more
complete and consistent repository named Integrated Knowledge Base Store (IKBstore). IKBstore will provide much better
support for advanced web applications, and in particular for
user-friendly search systems that support Faceted Search [5]
and By-Example Structured Queries [6]. Our approach in
achieving this ambitious goal involves the four main tasks of:
A Integrating existing knowledge bases by converting them
into a common internal representation and storing them in
IKBstore.
B Completing the integrated knowledge base by extracting
more facts from free text.
C Generating a large corpus of context-aware synonyms
that can be used to resolve inconsistencies in IKBstore
and to improve the robustness of query answering systems,
D Resolving incompleteness in IKBstore by using the
synonyms generated in Task C.
At the time of this writing, tasks A and B are completed,
and the other tasks are well on their way. The tools developed
to perform tasks B, C and D, and the initial results they
produced will be demonstrated at the VLDB conference in
Riva del Garda [20]. The rest of this section and the next
section introduce the various intertwined aspects of these tasks,
while the remaining sections provide an in-depth coverage of
the techniques used for synonym generation and the very
promising experimental results they produced.
Task A was greatly simplified by the fact that many
projects, including DBpedia [7] and YaGo [18], represent
the information derived from the structured summaries of
Wikipedia (a.k.a. InfoBoxes) by RDF triples of the form
⟨subject, attribute, value⟩, which specifies the value for
an attribute (property) of a subject. This common representation facilitates the use of these knowledge bases by
a roster of semantic-web applications, including queries expressed in SPARQL, and user-friendly search interfaces [5],
[6]. However, the coverage and consistency provided by each
individual system remain limited. To overcome these problems, this project is merging, completing, and integrating these knowledge bases at the semantic level.

Name              Size (MB)   Entities (10^6)   Triples (10^6)
ConceptNet [27]   3075        0.30              1.6
DBpedia [7]       43895       3.77              400
FreeBase [8]      85035       ≈25               585
Geonames [2]      2270        8.3               90
MusicBrainz [3]   17665       18.3              ≈131
NELL [10]         1369        4.34              50
OpenCyc [4]       240         0.24              2.1
YaGo2 [18]        19859       2.64              124

TABLE I: Some of the publicly available knowledge bases.
Task B is mainly to complete the initial knowledge base
using our knowledge extraction system called IBminer [22].
IBminer employs an NLP-based text mining framework, called
SemScape, to extract initial triples from the text. Then using
a large body of categorical information and learning from
matches between the initial triples and existing InfoBox items
in the current knowledge base, IBminer translates the initial
triples into more standard InfoBox triples.
The integrated knowledge base so obtained will represent
a big step forward, since it will (i) improve coverage, quality
and consistency of the knowledge available to semantic web
applications and (ii) provide a common ground for different contributors to improve the knowledge bases in a more
standard and effective way. However, a serious obstacle in
achieving such a desirable goal is that different systems do not
adhere to a standard terminology to represent their knowledge,
and instead use a plethora of synonyms and polynyms.
Thus, we need to resolve synonyms and polynyms for the
entity names as well as the attribute names. For example,
by knowing that ‘Johann Sebastian Bach’ and ‘J.S. Bach’ are
synonyms, the knowledge base can merge their triples and
associate them with one single name. As for the polynyms,
the problem is even more complex. Most of the time, the context
(or popularity) must be used to decide the correct resolution
of an ambiguous term such as ‘JSB’, which may refer to
‘Johann Sebastian Bach’, ‘Japanese School of Beijing’, etc.
Several efforts to find entity synonyms have been reported in
recent years [11], [12], [13], [15], [25]. However, the synonym
problem for attribute names has received much less attention,
although they can play a critical role in query answering. For
instance, the attribute ‘birthdate’ can be represented with terms
such as ‘date of birth’, ‘wasbornindate’, ‘born’, and ‘DoB’
in different knowledge bases, or even in the same one when
used in different contexts. Unless these synonyms are known,
a search for musicians born, say, in 1685 is likely to produce
a dismal recall.
To address the aforementioned issues, we propose our
Context-aware Synonym Suggestion System (CS 3 for short).
CS 3 performs mainly tasks C and D by first extracting
context-aware attribute and entity synonyms, and then using
them to improve the consistency of IKBstore. CS 3 learns
attribute synonyms by matching morphological information
in free text to the existing structured information. Similar
to IBminer, CS 3 takes advantage of a large body of categorical information available in Wikipedia, which serves as
the contextual information. Then, CS 3 improves the attribute
synonyms so discovered, by using triples with matching subjects and values but different attribute names. After unifying
the attribute names in different knowledge bases, CS 3 finds
subjects with similar attributes and values as well as similar
categorical information to suggest more entity synonyms.
Through this process, CS 3 uses several heuristics and takes
advantage of currently existing interlinks such as DBpedia’s
alias, redirect, externalLink, or sameAs links as well as the
interlinks provided by other knowledge bases.
In this paper, we describe the following contributions:
• The Context-aware Synonym Suggestion System (CS 3 )
which generates synonyms for both entities and attributes
in existing knowledge bases. CS 3 intuitively uses free
text and existing structured data to learn patterns for suggesting attribute synonyms. It also uses several heuristics
to improve existing entity synonyms.
• Novel techniques are introduced to integrate several public knowledge bases and convert them into a general
knowledge base. To this end, we initially collect the
knowledge bases and integrate them by exploiting the
subject interlinks they provide. Then, IBminer is used
to find more structured information from free text to
extend the coverage of the initial knowledge base. At
this point, we use CS 3 to resolve attribute synonyms
on the integrated knowledge base, and to suggest more
entity synonyms based on their context similarity and
other evidence. This improves performance of semantic
search over our knowledge base, since more standard and
specific terms are used for both entities and attributes.
• We implemented our system and performed preliminary
experiments on public knowledge bases, namely DBpedia
and YaGo, and text from Wikipedia pages. The initial
results so obtained are very promising and show that
CS 3 improves the quality and coverage of the existing
knowledge bases by applying synonyms in knowledge
integration. The evaluation results also indicate that IKBstore can reach up to 97% accuracy.
The rest of the paper is organized as follows: In the next
section, we explain the high-level tasks in IKBstore. Then,
in Section III, we briefly discuss the techniques to generate
structured information from text in IBminer. In Section IV,
we propose CS 3 to learn context-aware synonyms. In Section
V, we discuss how these subsystems are used to integrate our
knowledge bases. The preliminary results of our approach are
presented in Section VI. We discuss some related work in
Section VII and conclude our paper in Section VIII.
II. THE BIG PICTURE
As already mentioned, the goal of IKBstore is to integrate
the public knowledge bases and create a more consistent and
complete knowledge base. IKBstore performs four tasks to
achieve this goal. Here we elaborate these four tasks in more
detail:
Task A: Collecting publicly available knowledge bases, unifying knowledge representation format, and integrating knowledge bases using existing interlinks and structured information.
Creating the initial knowledge base is actually a straightforward task (Subsection V-B), since many of the existing
knowledge bases represent their knowledge in RDF
format. Moreover, they usually provide information to interlink
a considerable portion of their subjects to those in DBpedia.
Thus, we use such information to create the initial integrated
knowledge base. For ease of discussion, we refer to the initial
integrated knowledge base as the initial knowledge base. Although
this naive integration may improve the coverage of the initial
knowledge base, it still needs considerable improvement in
consistency and quality.
a common ground for different contributors to improve the
knowledge bases in a more standard and effective way. Using
multiple knowledge bases in IKBstore can also be a good
means of verifying the correctness of the current structured
summaries as well as those generated from the text.
Task B: Completing the initial knowledge base using accompanying text. In order to do so, we employ the IBminer
system [22] to generate structured data from the free text
available at Wikipedia or similar resources. IBminer first
generates semantic links between entity names in the text
using our recently proposed text mining framework, SemScape. Then, IBminer
learns common patterns called Potential Matches (PMs) by
matching the current triples in the initial knowledge base to
the semantic links derived from free text. It then employs the
PMs to extract more InfoBox triples from text. These newly
found triples are then merged with the initial knowledge base
to improve its coverage. Section III provides more information
about this process.
III. FROM TEXT TO STRUCTURED DATA
To perform the nontrivial task of generating structured
data from text, we use our IBminer system [22]. Although
IBminer’s process is quite complex, we can divide it into
three high-level steps which are elaborated in this section.
The first step is to parse the sentences in text and convert
them to a more machine-friendly structure called TextGraphs
which contain grammatical and semantic links between entities
mentioned in the text. As discussed in subsection III-A, this
step is performed by the NLP-based text mining framework
SemScape [21], [22]. The second step is to learn a structure
called Potential Match (PM). As explained in Subsection III-B,
PM contains context-aware potential matches between semantic links in the TextGraphs and existing InfoBox items. As the
third step, PM is used to suggest the final structured summaries
(InfoBoxes) from the semantic links in the TextGraphs. This
phase is described in Subsection III-C.
Task C: Generating a large corpus of context-aware synonyms.
Since IBminer learns by matching structured data to the
morphological structure in the text, it may find more than
one acceptable matching attribute name for a given link name
from the text. This in fact implies possible attribute synonyms,
and it is the main intuition that CS 3 uses to learn attribute
synonyms. Based on PM, CS 3 creates the Potential Attribute
Synonyms (PAS), which is similar in nature to PM. However,
instead of mapping link names into attribute names, PAS
provides mapping between different attribute names based
on the categorical information of the subject and the value.
Similar to the case of IBminer, the categorical information
serves as the contextual information, and improves the final
results of the generated attribute synonyms. As described
in Section IV-A, CS 3 improves PAS by learning from the
triples with matching subjects and values but different attribute
names in the current knowledge base. CS 3 also recognizes
context-aware entity synonyms by considering the categorical
information and InfoBoxes of the entities (Subsection IV-B).
A. From Text to TextGraphs
To generate TextGraphs from text, we employ the SemScape
system which uses morphological information in the text to
capture the categorical, semantic, and grammatical relations
between words and terms in the text. To understand the general
idea, consider the following sentence:
Motivating Sentence: “Johann Sebastian Bach (31 March
1685 - 28 July 1750) was a German composer, organist,
harpsichordist, violist, and violinist of the Baroque Period.”
There are several entity names in this sentence (e.g. ‘Johann
Sebastian Bach’, ‘31 March 1685’, and ‘German composer’).
The first step in SemScape is to recognize these entity names.
Thus, SemScape parses the sentence with Stanford parser
[19], which is a probabilistic parser. Using around 150 tree-based patterns (rules), SemScape finds such entity names and
annotates nodes in the parse trees with possible entity names
they contain. These annotations are called MainParts (MPs),
and the annotated parser tree is referred to as an MP Tree.
With the nodes annotated with their MainParts, other rules do
not need to know the underlying structure of the parse trees
at each node. As a result, one can provide simpler, fewer, and
more general rules to mine the annotated parse trees.
Next, SemScape uses another set of tree-based patterns to
find grammatical connections between words and entity names
in the parse trees and combine them into the TextGraph. One
such TextGraph for our motivating sentence is shown in Figure
1. Currently SemScape contains more than 290 manually
created rules to perform this step. Each link in the TextGraph
is also assigned a confidence value indicating SemScape’s
confidence on the correctness of the link. We should point out
Task D: Realigning attribute and entity names to construct
the final IKBstore. This is indeed the most important step in
preparing the knowledge bases for structured queries. Here,
we first use PAS to resolve attribute synonyms in the current
knowledge base. Then, we use the entity synonyms suggested
by CS 3 to integrate entity synonyms and their InfoBoxes. This
step is covered in Section V.
Applications: IKBstore can benefit a wide variety of applications, since it covers a large number of structured summaries represented with a standard terminology. Knowledge
extraction and population systems such as IBminer [22] and
OntoMiner [23], knowledge browsing tools such as DBpedia
Live [1] and InfoBox Knowledge-Base Browser (IBKB)[20],
and semantic web search such as Faceted Search [5] and By-Example Structured Queries [6] are three prominent examples
of such applications. In particular for semantic web search,
IKBstore improves the coverage and accuracy of structured
queries due to superior quality and coverage with respect to
existing knowledge bases. Moreover, IKBstore can serve as
Fig. 2. Part a) shows the graph pattern for Rule 1, and part b) depicts one of the possible matches for this pattern.
semantic links (initial triples) from TextGraphs and add them
to the same TextGraph. Some of the semantic links generated
for our motivating sentence are shown in Figure 3. For the
sake of simplicity we do not depict grammatical links in this
graph. We should restate that all the patterns discussed in this
subsection are manually created and the readers should not
confuse them with automatically generated patterns that we
discuss in the next subsections.
Fig. 1. Part of the TextGraph for our motivating sentence.
that TextGraphs support nodes consisting of multiple nodes
(and links) through their hyper-links, which mainly differentiates
TextGraphs from similar structures such as dependency trees.
For more details on the TextGraph generation phase, readers
are referred to [22].
Although useful for some applications, the grammatical
connections at this stage of the TextGraphs are not enough
for IBminer to generate structured data. The reason is that,
IBminer needs connections between entity names, which we
refer to as semantic links. Semantic links simply specify
any relation between two entities¹.
of the grammatical links in the TextGraphs are not semantic
links. To generate more semantic links, IBminer uses a set
of manually created graph-based patterns (rules) over the
TextGraphs. One such rule is provided below:
B. Generating Potential Matches (PM)
To generate the final structured data (new InfoBox Triples),
IBminer learns an intermediate data structure called Potential
Matches (PM) from the mapping between initial triples (semantic links in TextGraphs) and current InfoBox triples. For
instance, consider the initial triple ⟨Bach, was, composer⟩
in Figure 3. Also assume that the current InfoBoxes include
⟨Bach, Occupation, composer⟩. This implies a simple pattern
saying that link name ‘was’ may be translated to attribute
name ‘Occupation’ depending on the context of the subject
(‘Bach’ in this case) and the value (‘composer’). We refer to
these matches as potential matches and the goal here is to
automatically generate and aggregate all potential matches.
To understand why we need to consider the context of the
subject and the value, this time consider the two initial triples
⟨Bach, was, composer⟩ and ⟨Bach, was, German⟩ in the
TextGraph in Figure 3. Obviously the link name ‘was’ should
be interpreted differently in these two cases, since the former
one is connecting a ‘person’ to an ‘occupation’, while the
latter is between a ‘person’ and a ‘nationality’. Now, consider
two existing InfoBoxes ⟨Bach, occupation, composer⟩ and
⟨Bach, nationality, German⟩ which respectively match the
formerly mentioned initial triples. These two matches imply
a simple pattern saying link name ‘was’ connecting ‘Bach’
and ‘composer’ should be interpreted as ‘occupation’, while
it should be interpreted as ‘nationality’ if it is used between
‘Bach’ and ‘German’. To generalize these implications, instead
of using the subject and value names, we use the categories
they belong to in our patterns. For instance, knowing that
‘Bach’ is in category ‘Cat:Person’ and ‘composer’ and ‘German’ are respectively in categories ‘Cat:Occupation in Music’
and ‘Cat:European Nationality’, we learn the following two
patterns:
• ⟨Cat:Person, was, Cat:Occupation in Music⟩: occupation
• ⟨Cat:Person, was, Cat:European Nationality⟩: nationality
Here the pattern ⟨c1, l, c2⟩:α indicates that the link named
l, connecting a subject in category c1 to an entity or value in
——————————– Rule 1. ——————————–
SELECT ( ?1 ?3 ?2 )
WHERE
{
   ?1 "subj of" ?3.
   ?2 "obj of" ?3.
   NOT("not" "prop of" ?3).
   NOT("no" "det of" ?1).
   NOT("no" "det of" ?2).
}
————————————————————————–
As depicted in part a) of Figure 2, the pattern graph
(WHERE clause of Rule 1) specifies two nodes with (variable)
names ?1 and ?2 which are connected to a third node (?3)
respectively with subj of and obj of links. One possible match
for this pattern in the TextGraph of our running example is
depicted in part b) of Figure 2. It is worth mentioning that
due to the structure of the TextGraphs, matching multi-word
entities to the variable names in the patterns is an easy task
for IBminer. This is actually a challenging issue in works such
as [30] which are based on dependency parse trees. Using the
SELECT clause of Rule 1, the rule returns several triples for our
running example such as:
• ⟨‘johann sebastian bach’, ‘was’, ‘composer’⟩,
• ⟨‘sebastian bach’, ‘was’, ‘composer’⟩,
• ⟨‘bach’, ‘was’, ‘composer’⟩, etc.
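To make the rule's behavior more concrete, the following Python sketch shows one way such a graph pattern could be evaluated over a TextGraph stored as labeled edges. It is our own illustration under simplifying assumptions (the edge labels and node names follow the example above); it is not IBminer's actual implementation.

def match_rule1(edges):
    """Return (subject, link, object) triples matching Rule 1's pattern:
    ?1 --subj of--> ?3 and ?2 --obj of--> ?3, subject to the NOT constraints."""
    edge_set = set(edges)
    results = []
    for (n1, lbl1, n3) in edges:
        if lbl1 != "subj of":
            continue
        for (n2, lbl2, n3b) in edges:
            if lbl2 != "obj of" or n3b != n3:
                continue
            # NOT constraints: no negation on the verb node, no 'no' determiner
            if ("not", "prop of", n3) in edge_set:
                continue
            if ("no", "det of", n1) in edge_set or ("no", "det of", n2) in edge_set:
                continue
            results.append((n1, n3, n2))   # SELECT (?1 ?3 ?2)
    return results

# Hypothetical edges for the motivating sentence (node names are assumptions)
edges = [
    ("johann sebastian bach", "subj of", "was"),
    ("bach", "subj of", "was"),
    ("composer", "obj of", "was"),
]
print(match_rule1(edges))
# e.g. [('johann sebastian bach', 'was', 'composer'), ('bach', 'was', 'composer')]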
The above triples are referred to as the initial triples. In
IBminer, we have created 98 graph-based rules to capture
¹ SemScape treats values as entities.
Fig. 3. Some semantic links for our motivating sentence.
general or inaccurate (e.g. ‘People by Historical Ethnicity’ and
‘Centuries in Germany’). Considering the same issue for the
value part, hundreds of thousands of potential matches may be
generated for a single subject. This issue not only wastes our
resources, but also impacts the accuracy of the final results.
To address this issue, we use a flow-driven technique to rank
all the categories to which subject s belongs, and then select
the best NC categories. The main intuition is to propagate flows or
weights through different paths from s to each category; the
categories receiving more weight are considered to be more
related to s. Now, with L being the number of allowed ancestor
levels, we create the categorical structure for s up to L levels.
Starting with node s as the root of this structure and assigning
weight 1.0 to it, we iteratively select the closest node to s,
which has not been processed yet, propagate its weight to
its parent categories, and mark it as processed. To propagate
the weight of node ci with k ancestors, we increase the current
weight of each of its k ancestors by wi/k, where wi is the current
weight of node ci . Although wi may change even after ci is
processed, we will not re-process ci after any further updates
on its weight. After propagating the weight to all the nodes, we
select the top NC categories for generating potential matches
and attribute synonyms.
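The following Python sketch illustrates this flow-driven ranking under our own simplifying assumptions: the taxonomy is given as a map from each node to its direct parent categories, and nodes at equal distance from s are processed in an arbitrary order. It is meant only to clarify the procedure described above, not to reproduce IBminer's implementation.

import heapq

def rank_categories(subject, parents_of, L=4, NC=40):
    """Flow-driven category ranking sketch.
    parents_of: dict mapping a node to its direct parent categories.
    Weights start at 1.0 on the subject, are split equally among parents,
    and nodes are processed closest-first and never re-processed."""
    weight = {subject: 1.0}
    depth = {subject: 0}
    processed = set()
    heap = [(0, subject)]                      # (distance from subject, node)
    while heap:
        d, node = heapq.heappop(heap)
        if node in processed or d >= L:
            continue
        processed.add(node)
        parents = parents_of.get(node, [])
        if not parents:
            continue
        share = weight[node] / len(parents)
        for p in parents:
            weight[p] = weight.get(p, 0.0) + share
            if p not in depth:                 # push each ancestor only once
                depth[p] = d + 1
                heapq.heappush(heap, (d + 1, p))
    ranked = sorted(((w, c) for c, w in weight.items() if c != subject), reverse=True)
    return [c for _, c in ranked[:NC]]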
category c2, may be interpreted as the attribute name α. Note
that for each triple with a matching InfoBox triple, we create
several patterns, since the subject and the value usually belong
to more than one (direct or indirect) category.
More formally, let ⟨s, l, v⟩ be an initial triple in the
TextGraph, which matches InfoBox triple ⟨s, α, v⟩ in the
initial knowledge base. Also, let s and v respectively belong
to category sets Cs = {cs1, cs2, ...} and Cv = {cv1, cv2, ...}
according to the categorical information in Wikipedia. Later
in this subsection, we discuss a simple yet effective technique
for selecting a small set of related categories for a given subject
or value. For each cs ∈ Cs and cv ∈ Cv, IBminer creates the
following tuple and adds it to PM:
⟨cs, l, cv⟩ : α
C. Generating New Structured Summaries
To extract new structured summaries (InfoBox triples),
IBminer uses PMs to translate the link names of the initial
triples into the attribute names of InfoBoxes. Let t = ⟨s, l,
v⟩ be the initial triple whose link (l) needs to be translated,
and let s and v be listed in category sets Cs = {cs1, cs2, ...} and
Cv = {cv1, cv2, ...}, respectively, which are generated based on
our category selection algorithm. The key idea to translate l
is to take a consensus among all pairs of categories in Cs and
Cv and all possible attributes to decide which attribute name
is a possible match.
To this end, for each cs ∈ Cs and cv ∈ Cv, IBminer finds all
potential matches such as ⟨cs, l, cv⟩: αi. The resulting set
of potential matches is then grouped by the attribute names,
αi ’s, and for each group we compute the average confidence
value and the aggregate evidence frequency of the matches.
IBminer uses two thresholds at this point to discard low-confidence (named τc) or low-frequency (named τe) potential
matches, which are discussed in Section VI. Next, IBminer
filters the remaining results by a very effective type-checking
technique². At this point, if there exist one or more matches
in this list, we pick the one with the largest evidence value, say
pmmax, as the only attribute map and report the new InfoBox
triple ⟨s, pmmax.α, v⟩ with confidence t.c × pmmax.c and
evidence t.e. Secondary possible matches are considered in
the next section as attribute synonyms.
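The following sketch illustrates this consensus step. The layout of the PM lookup table, the type-checking hook, and the exact confidence bookkeeping are our assumptions; the thresholds correspond to the τc and τe named in the text.

from collections import defaultdict

def translate_link(t, Cs, Cv, PM, tau_c=0.5, tau_e=5, type_ok=lambda alpha, v: True):
    """t = (s, l, v, conf): an initial triple with its confidence.
    PM[(cs, l, cv)] maps candidate attribute names to [confidence, evidence].
    Returns the best-match InfoBox triple and the secondary candidates."""
    s, l, v, t_conf = t
    groups = defaultdict(list)                 # alpha -> [(conf, evidence), ...]
    for cs in Cs:
        for cv in Cv:
            for alpha, (conf, ev) in PM.get((cs, l, cv), {}).items():
                groups[alpha].append((conf, ev))
    candidates = []
    for alpha, hits in groups.items():
        avg_conf = sum(c for c, _ in hits) / len(hits)
        total_ev = sum(e for _, e in hits)
        # discard low-confidence / low-frequency candidates, then type-check
        if avg_conf >= tau_c and total_ev >= tau_e and type_ok(alpha, v):
            candidates.append((total_ev, avg_conf, alpha))
    if not candidates:
        return None, []
    best_ev, best_conf, best_alpha = max(candidates)
    best_triple = (s, best_alpha, v, t_conf * best_conf, best_ev)
    secondary = [alpha for _, _, alpha in candidates if alpha != best_alpha]
    return best_triple, secondary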
Each tuple in PM is also associated with a confidence value
c (initialized by the confidence of the TextGraph’s semantic
link) and an evidence frequency e (initialized by 1). More
matches for the same categories and the same link will increase
the confidence and evidence count of the above potential
match. The above potential match basically means that for
an initial triple ⟨s′, l, v′⟩ in which s′ belongs to category cs
and v′ belongs to cv, α may be a match for l with confidence
c and evidence e.
Later in Section IV, we show how CS 3 uses the PM
structure to build up a Potential Attribute Synonym structure
to generate high-quality synonyms.
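As an illustration, PM entries could be accumulated from matched pairs of initial triples and InfoBox triples along the following lines; the dictionary layout and the confidence-update rule are simplifying assumptions of ours.

from collections import defaultdict

# PM[(cs, l, cv)][alpha] = [confidence, evidence]
PM = defaultdict(dict)

def add_potential_match(initial_triple, infobox_triple, Cs, Cv, PM):
    """initial_triple = (s, l, v, conf) from the TextGraph;
    infobox_triple = (s, alpha, v) from the current knowledge base.
    For every selected category pair, record that l may map to alpha."""
    _, l, _, conf = initial_triple
    _, alpha, _ = infobox_triple
    for cs in Cs:
        for cv in Cv:
            entry = PM[(cs, l, cv)].setdefault(alpha, [0.0, 0])
            # one plausible rule that lets repeated evidence raise confidence
            entry[0] = 1.0 - (1.0 - entry[0]) * (1.0 - conf)
            entry[1] += 1                      # evidence frequency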
Selecting Best Categories: A very important issue in
generating potential matches is the quality and quantity of the
categories for the subjects and values. The direct categories
provided for most of the subjects are too specific and there are
only a few subjects in each of them. As shown in Section VI,
generating the potential matches over direct categories is not
very helpful to generalize the matches for new subjects. On the
other hand, exhaustively adding all the indirect (or ancestor)
categories will result in too many inaccurate potential matches.
For instance considering only four levels of categories in
Wikipedia’s taxonomy, the subject ‘Johann Sebastian Bach’
belongs to 422 categories. In this list, there are some useful
indirect categories such as ‘German Musicians’ and ‘German
Entertainers’, as well as several categories which are either too
IV. CONTEXT-AWARE SYNONYMS
Synonyms are terms describing the same concept, which
can be used interchangeably. According to this definition, no
² See [22] for details on the type-checking mechanism used by IBminer.
synonym for αj (αi) can be computed as Ni,j/Nj (Ni,j/Ni).
Obviously this is not always a symmetric relationship (e.g. the
‘born’ attribute is always a synonym for ‘birthdate’, but not the
other way around, since ‘born’ may also refer to ‘birthplace’
or ‘birthname’ as well). In other words, having Ni and Ni,j
computed, we can resolve both synonyms and polynyms for
any given context (Cs and Cv).
With the above intuition in mind, the goal in PAS is to
compute Ni and Ni,j. Next we explain how CS 3 constructs
PAS in a one-pass algorithm, which is essential for scaling up
our system. For each two records in PM such as ⟨cs, l, cv⟩:
αi and ⟨cs, l, cv⟩: αj, respectively with evidence frequencies
ei and ej (ei ≤ ej), we add the following two records to PAS:
matter what context is used, the synonym for a term is fixed
(e.g. ‘birthdate’ and ‘date of birth’ are always synonyms).
However, the meaning or semantic of a term usually depends
on the context in which the term is used. The synonym
also varies as the context changes. For instance, in an article
describing IT companies, the synonym of the attribute name
‘wasCreatedOnDate’ most probably is ‘founded date’. In this
case, knowing that the attribute is used for the name of a
company is a contextual information that helps us find an
appropriate synonym for ‘wasCreatedOnDate’. However, if
this attribute is used for something else, such as an invention,
one can not use the same synonym for it.
Being aware of the context is even more useful for resolving
polynymous phrases, which are in fact much more prevalent
than exact synonyms in the knowledge bases. For example,
consider the entity/subject name ‘Johann Sebastian Bach’.
Due to its popularity, a general understanding is that the entity
is describing the famous German classical musician. However,
what if we know that, for this specific entity, the birthplace is
‘Berlin’? This simple contextual information will lead us to the
conclusion that the entity is referring to the painter who was
actually the grandson of the famous musician Johann Sebastian
Bach. A very similar issue exists for the attribute synonyms.
For instance considering attribute ‘born’, ‘birthdate’ can be
a synonym for ‘born’ when it is used with a value of type
‘date’; but if ‘born’ is used with values which indicate places,
then ‘birthplace’ should be considered as its synonym.
CS 3 constructs a structure called Potential Attribute Synonyms (PAS) to extract attribute synonyms. In the generation
of PAS, CS 3 essentially counts the number of times each pair
of attributes are used between the same subject and value and
with the same corresponding semantic link in the TextGraphs.
The context in this case is considered to be the categorical
information for the subject and the value. These numbers
are then used to compute the probability that any given two
attributes are synonyms. Next subsection describes the process
of generating PAS. Later in Subsection IV-B, we will discuss
our approach to suggest entity synonyms and improve existing
ones.
⟨cs, αi, cv⟩: αj
⟨cs, αj, cv⟩: αi
Both records are inserted with the same evidence frequency
ei . Note that, if the records are already in the current PAS,
we increase their evidence frequency by ei . At the very same
time we also count the number of times each attribute is used
between a pair of categories. This is necessary for estimating
Ni and computing the final weights for the attribute synonyms.
That is for the case above, we add the following two PAS
records as well:
⟨cs, αi, cv⟩: ‘’ (with evidence ei)
⟨cs, αj, cv⟩: ‘’ (with evidence ej)
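The one-pass construction described above can be sketched as follows; the in-memory layout of PM and PAS is our assumption, but the update rules mirror the records listed above.

from collections import defaultdict

def build_pas(PM):
    """PM[(cs, l, cv)] maps each attribute alpha to [confidence, evidence].
    Returns PAS[(cs, alpha_i, cv)][alpha_j] = evidence, where the '' key
    counts how often alpha_i itself is used between (cs, cv)."""
    PAS = defaultdict(lambda: defaultdict(int))
    for (cs, l, cv), alphas in PM.items():
        items = list(alphas.items())
        # pairwise records: a shared link l between (cs, cv) suggests synonymy
        for i in range(len(items)):
            a_i, (_, e_i) = items[i]
            for j in range(i + 1, len(items)):
                a_j, (_, e_j) = items[j]
                e = min(e_i, e_j)              # the smaller evidence frequency
                PAS[(cs, a_i, cv)][a_j] += e
                PAS[(cs, a_j, cv)][a_i] += e
        # usage counts needed later to estimate N_i
        for a, (_, e) in items:
            PAS[(cs, a, cv)][''] += e
    return PAS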
Improving PAS with Matching InfoBox Items: Potential attribute synonyms can also be derived from different knowledge bases which contain the same piece
of knowledge, but with different attribute names. For instance, let ⟨J.S.Bach, birthdate, 1685⟩ and ⟨J.S.Bach,
wasBornOnDate, 1685⟩ be two InfoBox triples indicating Bach's birthdate. Since the subject and value parts
of the two triples match, one may say birthdate and
wasBornOnDate are synonyms. To add these types of
synonyms to the PAS structure, we follow the exact same idea
explained earlier in this section. That is, consider two triples
such as ⟨s, αi, v⟩ and ⟨s, αj, v⟩ in which αi and αj may be
synonyms. Also, let s and v respectively belong to category
sets Cs = {cs1, cs2, ...} and Cv = {cv1, cv2, ...}. Thus, for all
cs ∈ Cs and cv ∈ Cv we add the following triples to PAS:
A. Generating Attribute Synonyms
Intuitively, if two attributes (say ‘birthdate’ and ‘dateOfBirth’) are synonyms in a specific context, they should be
represented with the same (or very similar) semantic links in
the TextGraphs (e.g. with semantic links such as ‘was born
on’, ‘born on’, or ‘birthdate is’). In simpler words, we use
text as the witness for our attribute synonyms. Moreover, the
context, which is defined as the categories for the subjects
(and for the values), should be very similar for synonymous
attributes.
More formally, let attributes αi and αj be two matches for
link l in initial triple ⟨s, l, v⟩. Let Ni,j (= Nj,i) be the total
number of times both αi and αj are the interpretation of the
same link (in the initial triples) between category sets Cs and
Cv. Also, let Nx be the total number of times αx is used
between Cs and Cv. Thus the probability that αi (αj) is a
⟨cs, αi, cv⟩: αj (with evidence 1)
⟨cs, αj, cv⟩: αi (with evidence 1)
This intuitively means that from the context (category) of
cs to cv, attributes αi and αj may be synonyms. Again,
more examples for these categories and attributes increase
the evidence, which in turn improves the quality of the final
attribute synonyms. Much in the same way as learning from
initial triples, we count the number of times that an attribute
is used between any possible pair of categories (cs and cv ) to
estimate Ni .
Generating Final Attribute Synonyms: Once the PAS structure is built, it is easy to compute attribute synonyms as
described earlier. Assume we want to find the best synonyms for
attribute αi in InfoBox triple t = ⟨s, αi, v⟩. Using PAS, for
all possible αj, all cs ∈ Cs, and all cv ∈ Cv, we aggregate
the evidence frequency (e) of records such as ⟨cs, αi, cv⟩:
αj in PAS to compute Ni,j. Similarly, we compute Ni by
aggregating the evidence frequency (e) of all records of the form
⟨cs, αi, cv⟩: ‘’. Finally, we only accept attribute αj as
a synonym of αi if Ni,j/Ni and Ni,j are respectively above
predefined thresholds τsc and τse. We study the effect of such
thresholds in Section VI.
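The corresponding lookup can then be sketched as below; the aggregation details and the default threshold values are our assumptions.

from collections import defaultdict

def attribute_synonyms(alpha_i, Cs, Cv, PAS, tau_sc=0.3, tau_se=10):
    """Suggest synonyms for attribute alpha_i of a triple whose subject and
    value belong to category sets Cs and Cv, using the PAS structure."""
    N_ij = defaultdict(int)        # co-occurrence counts per candidate alpha_j
    N_i = 0                        # total usage of alpha_i in this context
    for cs in Cs:
        for cv in Cv:
            for alpha_j, e in PAS.get((cs, alpha_i, cv), {}).items():
                if alpha_j == '':
                    N_i += e
                else:
                    N_ij[alpha_j] += e
    synonyms = []
    for alpha_j, n in N_ij.items():
        # accept only if both the ratio and the raw evidence pass the thresholds
        if N_i > 0 and n / N_i >= tau_sc and n >= tau_se:
            synonyms.append((alpha_j, n / N_i, n))
    return sorted(synonyms, key=lambda item: item[2], reverse=True)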
V. COMBINING KNOWLEDGE BASES
The IBminer and CS 3 systems allow us to integrate the
existing knowledge bases into one of superior quality and
coverage. In this section, we elaborate the steps of IKBstore
as introduced previously in Section II.
A. Data Gathering
We are currently in the process of integrating knowledge
bases listed in Table I, which include some domain-specific knowledge bases (e.g. MusicBrainz [3], Geonames [2],
etc.), and some domain-independent ones (e.g. DBpedia [7],
YaGo2 [18], etc.). Although most knowledge bases such as
DBPedia, YaGo, and FreeBase already provide their knowledge in RDF, some of them may use other representations.
Thus, for all knowledge bases, we convert their knowledge
into ⟨Subject, Attribute, Value⟩ triples and store them in
IKBstore3 . IKBstore is currently implemented over Apache
Cassandra, which is designed for handling very large amounts
of data. IKBstore recognizes three main types of information:
• InfoBox triples: These triples provide information on
a known subject in the ⟨subject, attribute, value⟩ format, e.g., ⟨J.S. Bach, PlaceofBirth, Eisenach⟩, which indicates that the birthplace of the subject
J.S. Bach is Eisenach. We refer to these triples as InfoBox
triples.
• Subject/Category triples: They provide the categories that
a subject belongs to in the form of subject/link/category,
where link represents a taxonomical relation, e.g.,
⟨J.S. Bach, is in, Cat:German Composers⟩, which indicates that the subject J.S. Bach belongs to the category
Cat:German Composers.
• Category/Category triples: They represent taxonomical links between categories, e.g., ⟨Cat:German Composers, is in, Cat:German Musicians⟩, which indicates that the
category Cat:German Composers is a sub-category of
Cat:German Musicians.
Currently, we have converted all the knowledge bases listed
in Table I into the above triple formats.
IKBstore also preserves the provenance of each piece of
knowledge. In other words, for every fact in the integrated
knowledge base, we can track its data source. For the facts
derived from text, we record the article ID as the provenance. Provenance has several important applications such
as restricted search on specific sources, tracking erroneous
knowledge pieces to the emanating source, and better ranking
techniques based on reliability of the knowledge in each
source. In fact, we also annotate each fact with accuracy confidence and frequency values, based on the provenance of the
fact. To do so, each knowledge base is assigned a confidence
value. To compute the frequency and confidence of individual
facts, we respectively count the number of knowledge bases
B. Generating Entity Synonyms
There are several techniques to find entity synonyms.
Approaches based on the string similarity matching [24],
manually created synonym dictionaries [28], automatically
generated synonyms from click log [14], [15], and synonyms
generated by other data/text mining approaches [29], [23] are
only a few examples of such techniques. Although performing
very well on suggesting context-independent synonyms, they
do not explicitly consider the contextual information for suggesting more appropriate synonyms and resolving polynyms.
Very similar to context-aware attribute synonyms in which
the context of the subject and value used with an attribute
plays a crucial role on the synonyms for that attribute, we can
define context-aware entity synonyms. For each entity name,
CS 3 uses the categorical information of the entity as well as all
the InfoBox triples of the entity as the contextual information
for that entity. Thus, to complement the existing entity synonym
suggestion techniques, for any suggested pair of synonymous
entities, we compute entities context similarity to verify the
correctness of the suggested synonym.
It is important to understand that this approach should be
used as a complementary technique over the existing ones for
two main reasons. First, context similarity of two entities does
not always imply that they are synonyms, especially when many
pieces of knowledge are missing for most entities in the
current knowledge bases. Second, it is not feasible to compute
the context similarity of all possible pairs of entities due to
the large number of existing entities. In this work, we use the
OntoMiner system [23] in addition to simple string matching
techniques (e.g. Exact string matching, having common words,
and edit distance) to suggest initial possible synonyms.
Let ‘Johann Sebastian Bach’ and ‘J.S. Bach’ be two synonyms that two different knowledge bases are using to denote
the famous musician. A simple string matching would suggest
these two entities as being synonyms. Then we compare their
contextual information and realize that they have many common attributes with similar values for them (e.g. same values
for attributes occupation, birthdate, birthplace, etc.). Also they
both belong to many common categories (e.g. Cat:German
musician, Cat:Composer, Cat:people, etc.). Thus we suggest
them as entity synonyms with high confidence. However,
consider ‘Johann Sebastian Bach (painter)’ and ‘J.S. Bach’ entities. Although the initial synonym suggestion technique may
suggest them as synonyms, since their contextual information
is quite different (i.e. they have different values for the common
attributes occupation, birthplace, birthdate, deathplace, etc.)
our system does not accept them as being synonyms.
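A minimal sketch of such a context check is shown below; the equal weighting of attribute agreement and category overlap, as well as the kb accessor methods, are assumptions of ours rather than CS 3's actual scoring.

def context_similarity(infobox1, cats1, infobox2, cats2):
    """infobox: dict attribute -> set of values; cats: set of categories.
    Returns a score in [0, 1]; higher means the two entity names are more
    likely to denote the same entity."""
    common_attrs = set(infobox1) & set(infobox2)
    # fraction of shared attributes whose values agree
    agree = sum(1 for a in common_attrs if infobox1[a] & infobox2[a])
    attr_score = agree / len(common_attrs) if common_attrs else 0.0
    # Jaccard similarity of the category sets
    cat_union = cats1 | cats2
    cat_score = len(cats1 & cats2) / len(cat_union) if cat_union else 0.0
    return 0.5 * attr_score + 0.5 * cat_score

def verify_synonym(candidate_pair, kb, threshold=0.5):
    """Accept a suggested synonym pair only if their contexts are similar enough.
    kb.infobox and kb.categories are hypothetical accessors."""
    e1, e2 = candidate_pair
    return context_similarity(kb.infobox(e1), kb.categories(e1),
                              kb.infobox(e2), kb.categories(e2)) >= threshold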
³ For instance, MusicBrainz uses a relational representation, for which we
consider every column name, say α, as an attribute connecting the main entity
(s) of a row in the table to the value (v) of that row for column α, and create the
triple ⟨s, α, v⟩.
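The conversion described in this footnote could, for instance, be sketched as follows; the column and entity names below are hypothetical.

def row_to_triples(table_rows, entity_column):
    """Convert relational rows into (subject, attribute, value) triples:
    each non-key column name becomes an attribute of the row's main entity."""
    triples = []
    for row in table_rows:                  # row is a dict: column -> value
        s = row[entity_column]
        for col, val in row.items():
            if col != entity_column and val is not None:
                triples.append((s, col, val))
    return triples

# Hypothetical example row from a music database
rows = [{"artist": "Johann Sebastian Bach", "birthdate": "1685-03-31",
         "birthplace": "Eisenach"}]
print(row_to_triples(rows, "artist"))
# [('Johann Sebastian Bach', 'birthdate', '1685-03-31'),
#  ('Johann Sebastian Bach', 'birthplace', 'Eisenach')]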
including the fact and combine their confidence values as explained in [22].
B. Initial Knowledge Integration
The aim of this phase is to find the initial interlinks between
subjects, attributes, and categories from the various knowledge
sources to eliminate duplication, align attributes, and reduce
inconsistency using only the existing interlinks. At the end
of this phase, we have an initial knowledge base which is
not quite ready for structured queries, but provides a better
starting point for IBminer to generate more structured data
and for CS 3 to resolve attribute and entity synonyms.
• Interlinking Subjects: Fortunately, many subjects in different knowledge bases have the same name. Moreover,
DBPedia is interlinked with many existing knowledge
bases, such as YaGo2 and FreeBase, which can serve as
a source of subject interlinks. For the knowledge bases
which do not provide such interlinks (e.g. NELL), in
addition to exact matching, we parse the structured part of
knowledge base to derive candidate interlinks for existing
entities, such as redirect and sameAs links in Wikipedia.
coverage of existing knowledge bases. These new triples
are then added to IKBstore. For each generated triple,
we also update the confidence and evidence frequency
in IKBstore. That is if the triple is already in IKBstore,
we only increase and update its confidence and evidence
frequency.
• Realigning attribute names: Next, we employ CS 3 to
learn PAS and generate synonyms for attribute names and
expand the initial knowledge base with more common and
standard attribute names.
• Matching entity synonyms: This step merges the entities
based on the entity synonyms suggested by CS 3 . For the
suggested synonym entities such as s1 , s2 , we aggregate
their triples and use one common entity name, say s1 . The
other subject (s2 ) is considered as a possible alias for s1 ,
which can be represented by the RDF triple ⟨s1, alias, s2⟩.
• Integrating categorical information: Since we have
merged subjects based on entity synonyms, the similarity
score of the categories may change and thus we need
to rerun the category integration described in Subsection
V-B.
Currently the knowledge base integration is partially completed for DBpedia and YaGo2 as described in Section VI.
• Interlinking Attributes: As we mentioned previously,
attribute interlinks are completely ignored in the current
studies. In this phase, we only use exact matching for
interlinking attributes.
• Interlinking Categories: In addition to exact matching,
we compute the similarity of the categories in different knowledge bases based on their common instances.
Consider two categories c1 and c2, and let S(c) be the
set of subjects in category c. The similarity function for
category interlinking is defined as Sim(c1, c2) = |S(c1) ∩
S(c2)| / |S(c1) ∪ S(c2)|. If Sim(c1, c2) is greater than
a certain threshold, we consider c1 and c2 as aliases
of each other, which simply means that if the instances
of two categories are highly overlapping, they might be
representing the same category.
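A direct implementation of this similarity test is straightforward; the threshold value below is an assumption, since the paper leaves it unspecified.

def categories_are_aliases(subjects1, subjects2, threshold=0.8):
    """subjects1, subjects2: the sets of subjects S(c1) and S(c2) of two
    categories. They are considered aliases when the Jaccard similarity
    |S(c1) ∩ S(c2)| / |S(c1) ∪ S(c2)| exceeds the threshold."""
    union = subjects1 | subjects2
    if not union:
        return False
    sim = len(subjects1 & subjects2) / len(union)
    return sim > threshold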
After retrieving these interlinks, we merge similar entities,
categories, and triples based on the retrieved interlinks. The
provenance information is generated and stored along with the
triples.
VI. EXPERIMENTAL RESULTS
In this section, we test and evaluate different steps of
creating IKBstore in terms of precision and recall. To this end,
we create an initial knowledge base using subjects listed in
Wikipedia for three specific domains (Musicians, Actors, and
Institutes). For these subjects, we add their related structured
data from DBpedia and YaGo2 to our initial knowledge
bases. Then, to learn PM and PAS structures, we use the
entire Wikipedia’s long abstracts provided by DBpedia for the
mentioned subjects. We should state that IBminer only uses
the free text and thus can take advantage of any other source
of textual information.
All the experiments are performed on a single machine
with 16 cores of 2.27GHz and 16GB of main memory. This
machine is running Ubuntu12. On average, SemScape spends
3.07 seconds on generating initial semantic links for each
sentence on a single CPU. That is, using 16 cores, we were
able to generate initial semantic links for 5.2 sentences per
second.
C. Final Knowledge Integration
Once the initial knowledge base is ready, we first employ
IBminer to extract more structured data from accompanying
text and then utilize CS 3 to resolve synonymous information
and create the final knowledge base. More specifically, we
perform the following steps in order to complete and integrate
the final knowledge base:
• Improving knowledge base coverage: The web documents contain numerous facts which are ignored by
existing knowledge bases. Thus, we first enrich the
knowledge base by employing our knowledge extraction
system IBminer. As described in Section III, IBminer
learns PM from free text and the initial knowledge base
to derive more triples which will greatly improve the
A. Initial Knowledge Base
To evaluate our system, we have created a simple knowledge
base which consists of three data sets for domains Musicians,
Actors, and Institutes (e.g. universities and colleges) from the
subjects listed in Wikipedia. These data sets do not share any
subject, and in total they cover around 7.9% of Wikipedia
subjects. To build each data set, we start from a general
category (e.g. Category:Musicians for Musicians data set) and
find all the Wikipedia subjects in this category or any of its
descendant categories up to four levels. Table II provides more
details on each data set.
Data set     Subjects   InfoBox Triples   Subjects with Abstract   Subjects with InfoBox   Sentences per Abstract
Musicians    65835      687184            65476                    52339                   8.4
Actors       52710      670296            52594                    50212                   6.2
Institutes   86163      952283            84690                    54861                   5.9
text (long abstract for our case). To have a better estimation
for recall, we only used those InfoBox triples which match
at least one of our initial triples. In this way, we estimate
recall based on InfoBox triples which are most likely mentioned in
the text. Nevertheless, this automatic technique provides an
acceptable recall value which is enough for comparison and
tuning purposes.
TABLE II: Description of the data sets used in our experiments.
Best Matches: To ease our experiments, we combine our
two thresholds (τe and τc ) by multiplying them together and
create a single threshold called τ (= τe × τc). Part a) of Figure 4
depicts the precision/recall diagram for different thresholds on
τ in Musicians data set. As can be seen, for the first 15% of
coverage, IBminer is generating only correct information. For
these cases τ is very high. According to this diagram, to reach
97% precision which is higher than DBpedia’s precision, one
should set τ to 6,300 (≈ 0.12|Tm|). For this case, as shown in
Part c) of the figure, IBminer generates around 96.7K triples
with 33.6% recall.
In our initial knowledge base, we currently use InfoBox
triples from DBpedia and YaGo2 for the mentioned subjects.
Using multiple knowledge bases improves the coverage of
our initial knowledge base. However in many cases they use
different terminologies for representing the attribute and entity
names. Although our initial knowledge base contains three
data sets and we have performed all four tasks on these data
sets, we only report the evaluation results for the Musicians
data set here. This is mainly due to the time-consuming
process of manually grading the results.
To evaluate the quality of existing knowledge base, we
randomly selected 50 subjects (which have both abstract and
InfoBox) from the Musicians data set. For each subject, we
compared the information provided in its abstract and in its
InfoBox. For these 50 subjects, 1155 unique InfoBox triples
(and 92 duplicate triples) are listed in DBpedia. Interestingly,
only 305 of these triples are inferable from the abstract text
(only 26.4% of the InfoBoxes are covered by the accompanying text). More importantly, we have found 47 wrong
triples (4.0%) and 146 (12.6%) triples which are not useful in
terms of main knowledge (e.g. triples specifying upload file
name, image name, format, file size, or alignment). Thus an
optimistic estimation is that at most 76% of facts provided
in DBpedia are useful for semantic-related applications. We
may refer to this portion of the knowledge base as useful in this
section. Moreover, the accuracy of DBpedia in this case is less
than 96%.
Secondary Matches: For each best match found in the previous step, say t = ⟨s, αi, v⟩, we generate attribute synonyms
for αi using PAS. If any of the reported attribute synonyms are
not in the list of possible matches for t (see Subsection III-C), we ignore the synonym. Considering τe = 0.12|Tm|, precision
and recall of the remaining synonyms are computed similar to
the best match case and depicted in part b) of Figure 4 (while
the potential attribute synonym evidence count (τse ) decreases
from right to left). As the figure indicates, for instance, to have
97% accuracy one needs to set τse to 12,000 (≈ 0.23|Tm|). We
should add that although synonym results for this case indicate
only 3.6% of coverage, the number of correct new triples that
they generate is 53.6K, which is quite acceptable.
Part c) of Figure 4 summarizes the number of best match
triples, the total number of generated items (both Best and
Secondary Matches), and the total number of correct items
for some possible values of τ. For instance, for τ = 0.4|Tm|, we reach up
to 97% accuracy while the total number of generated results
is more than 150K (96.7K + 53.6K). This indicates a 28.7%
improvement in the coverage of the useful part of our Initial
knowledge base while the accuracy is above 97%. This is
actually very impressive considering the fact that we only use
the abstract of the page.
B. Completing Knowledge by IBminer
Using the Musicians data set and considering 40 categories
(NC = 40) and 4 levels (L = 4), we trained the Potential
Match (PM) structure using the IBminer system. The total number of
initial triples for this chunk of the data set is |Tn| = 4.72M,
while 52.9K of them match with existing InfoBoxes (|Tm| =
52.9K). Using PM and the initial triples, we generate the final
InfoBox triples without setting any confidence and evidence
frequency thresholds (i.e. τe = 0 and τc = 0.0). Later in this
section, we also analyze the effect of NC and L selection on
the results.
To estimate the accuracy of the final triples, we randomly
select 20K of the generated triples and carefully grade them
by matching against their abstracts. Many similar systems such
as [9], [17], [31] have also used manual grading due to the
lack of a good benchmark. As for the recall, we investigate
existing InfoBoxes and compute what portion of them is also
generated by IBminer. This gives only a pessimistic estimation
of the recall ability, since we do not know what portion of
the InfoBoxes in Wikipedia are covered or mentioned in the
C. The Impact of Category Selection
To understand the impact of category selection technique,
we repeat the previously mentioned experiment on the Musicians data set for different levels (L) and category numbers
(NC ). Here, we only compare the results based on the recall
estimation. First, we fix the NC at 40 and change L from 1
to 4. Part a) of Figure 5 depicts the estimated recall (only
for the best matches) while PM’s frequency threshold (τe )
increases. Not surprisingly, as L increases, the recall improves
significantly, since more general categories are considered in
PM construction. On the other hand, using more categories
also improves the recall as depicted in Part b) of the figure.
Then we fix L at 4 and change NC from 10 to 50. The results
Fig. 4. Results for Musicians data set: a) Precision/Recall diagram for best matches, b) Precision/Recall diagram for attribute synonyms, and c) the size of generated results for the test data set.
indicate that even with 10 categories, we may reach good recall
with respect to other cases. This mainly proves the efficiency
of our category selection technique. However, to have high
recall for lower NC s, one might want to use very low values
for τe which will result in lower accuracy.
In Part c) of Figure 5, we compare the time performance of
all the above cases. This diagram shows the average time spent
on generating final InfoBoxes for each abstract. As we increase
the levels, the delay increases exponentially since IBminer
performs more database accesses to generate the category
structure. However, the delay increases only slightly and linearly with the number of categories. This is mainly due
to the fact that the increase in the number of categories does
not require any extra database access. The delay for generating
PM also follows the exact same trend.
VII. RELATED WORK
There are several attempts to generate large-scale knowledge
bases [4], [10], [27], [8], [7], [2], [3]. However, none of
them provides an automatic integration technique among
existing knowledge bases. There are only a few recent efforts
on integrating knowledge bases in both domain-specific and
general topics. GeoWordNet [16] is an integrated knowledge
base of GeoNames [2], WordNet [28], and MultiWordNet [26].
However, the integration is based on the manually created
concept mapping which is time-consuming and error-prone
for large scale knowledge base integration. In [18], Yago2
is integrated with GeoNames and structured information in
Wikipedia. The category (class) mapping in Yago2 is performed by simple string matching which is not reliable for
large taxonomies such as the one in Wikipedia. Recently, Wu
et al. proposed Probase [31] which aims to generate a general
taxonomy from web documents. Probase merged the concepts
and instances into a large general taxonomy. In taxonomy
integration, the concepts with similar context properties (e.g.
derived from the same sentence or with similar child nodes)
are merged.
Another line of related studies is the synonym suggestion
systems, which mostly focus on entity synonyms and ignore
attribute synonyms. One of the early works in this area is
WordNet [28] which provides manually defined synonyms for
general words. Words and very general terms in WordNet are
grouped into sets of synonyms called Synsets. Although WordNet
D. Completing Knowledge by Attribute Synonyms
In order to compute the precision and recall for the attribute
synonyms generated by CS 3 , we use the Musicians data set
and construct the PAS structure as described in Section IV.
Using PAS, we generated possible synonyms for 10,000
InfoBox items in our initial knowledge base. Note that these
synonyms are for already existing InfoBoxes, which differentiates them from the secondary matches evaluated in Subsection
VI-B. This has generated more than 14,900 synonym items.
Among the 10,000 InfoBox items, 1263 attribute synonyms
were listed and our technique generated 994 of them. We used
these matches to estimate the recall value of our technique for
different frequency thresholds (τse) as shown in Figure 6. As
for the precision estimation, we manually graded the generated
synonyms.
As can be seen in Figure 6, CS 3 is able to find more than
74% of the possible synonyms with more than 92% accuracy.
In fact, this is a very big step in improving structured query
results, since it increases the coverage of IKBstore by at
least 88.3%. This in some sense improves the consistency of
the knowledge base by providing more synonymous InfoBox
triples. Combined with the improvement we achieved by
IBminer, we can state that our IKBstore doubles the size of
the current knowledge bases while preserving their precision
(if not improving) and improving their consistency.
Fig. 6. The Precision/Recall diagram for the attribute synonyms generated for existing InfoBoxes in the Musicians data set.
Fig. 5. a) The impact of increasing level number on recall, b) The impact of increasing number of categories on recall, and c) InfoBox generation delay per abstract.
contains 117,000 high-quality Synsets, it is still limited to
only general words, which miss most multi-word terms,
and as a result it is not effective for use in large-scale
knowledge base integration systems. Another approach to
extract synonyms is based on approximate string matching
(fuzzy string searching) [24]. This technique is effective in
detecting certain kinds of format-related synonyms, such as
normalization and misspelling. However, this technique cannot
discover semantic synonyms.
To overcome the shortcomings of the aforementioned approaches, more automated techniques for generating entity synonyms from web-scale documents have been proposed in recent years. In [12], [13], Chaudhuri et al. proposed an approach that generates synonyms which are substrings of the given entities, exploiting the correlation between these substrings and the entities in web documents to identify candidate synonyms. In [14], [15], Cheng et al. used query click logs from a web search engine to derive synonyms: according to the click logs, every entity is first associated with a set of relevant web pages, and the queries that lead to those pages are then considered possible synonyms for the associated entities; the synonyms are further refined by evaluating the click patterns of the queries. Later, in [11], Chakrabarti et al. refined this approach by defining similarity functions that mitigate the click-log sparsity problem and verify that a candidate synonym string belongs to the same class as the entity. In [29], Turney proposed a machine learning algorithm, called PMI-IR, to mine synonyms from web documents: if two sets of documents are strongly correlated, the words associated with the two sets are taken as candidate synonyms.
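To convey the intuition behind such co-occurrence-based mining, here is a minimal pointwise-mutual-information sketch over a tiny made-up document collection; it is not Turney's exact PMI-IR formulation, which estimates the probabilities from web-scale search hit counts rather than a local corpus.

import math

# Tiny illustrative corpus; PMI-IR would use web-scale hit counts instead.
docs = [
    "the singer and the vocalist performed together",
    "a singer or vocalist fronts the band",
    "the senator proposed a new bill",
    "the senator met the governor",
]

def doc_freq(term):
    # Number of documents containing the term.
    return sum(1 for d in docs if term in d.split())

def co_freq(t1, t2):
    # Number of documents containing both terms.
    return sum(1 for d in docs if t1 in d.split() and t2 in d.split())

def pmi(t1, t2):
    # Pointwise mutual information: log p(t1, t2) / (p(t1) * p(t2)).
    n = len(docs)
    p1, p2, p12 = doc_freq(t1) / n, doc_freq(t2) / n, co_freq(t1, t2) / n
    return math.log(p12 / (p1 * p2)) if p12 else float("-inf")

print(pmi("singer", "vocalist"))   # positive: frequent co-occurrence, synonym candidate
print(pmi("singer", "senator"))    # -inf in this toy corpus: the terms never co-occur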
VIII. CONCLUSION
In this paper, we propose the IKBstore approach to integrate currently existing knowledge bases into a more complete and consistent knowledge base. IKBstore first merges existing knowledge bases using the interlinks they provide. Then, it employs the text-based knowledge discovery system IBminer to improve the coverage of the initially integrated knowledge base by extracting knowledge from free text. Finally, IKBstore uses CS3 to extract context-aware attribute and entity synonyms and employs them to reconcile the different terminologies used by different knowledge bases. In this way, IKBstore creates an integrated knowledge base that outperforms existing knowledge bases in terms of coverage, accuracy, and consistency. Our preliminary evaluation supports this claim, showing that IKBstore doubles the coverage of the existing knowledge bases while slightly improving accuracy and resolving many inconsistencies; the accuracy of the generated knowledge at this point is above 97%.
As for future work, we are currently expanding IKBstore with the goal of covering all the subjects in Wikipedia. We are also working on improving the integration of the categorical and taxonomical information available in different knowledge bases.
REFERENCES
[1] DBpedia Live. http://live.dbpedia.org/.
[2] GeoNames. http://www.geonames.org.
[3] MusicBrainz. http://musicbrainz.org.
[4] OpenCyc. http://www.cyc.com/platform/opencyc.
[5] W. Abramowicz and R. Tolksdorf, editors. Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010. Proceedings, volume 47 of Lecture Notes in Business Information Processing. Springer, 2010.
[6] M. Atzori and C. Zaniolo. Swipe: searching wikipedia by example. In WWW (Companion Volume), pages 309–312, 2012.
[7] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - a crystallization point for the web of data. J. Web Sem., 7(3):154–165, 2009.
[8] K. D. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD, 2008.
[9] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr., and T. Mitchell. Toward an architecture for never-ending language learning. In AAAI, pages 1306–1313, 2010.
[10] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr., and T. M. Mitchell. Toward an architecture for never-ending language learning. In AAAI, 2010.
[11] K. Chakrabarti, S. Chaudhuri, T. Cheng, and D. Xin. A framework for robust discovery of entity synonyms. In KDD, pages 1384–1392, 2012.
[12] S. Chaudhuri, V. Ganti, and D. Xin. Exploiting web search to generate synonyms for entities. In WWW, pages 151–160, 2009.
[13] S. Chaudhuri, V. Ganti, and D. Xin. Mining document collections to facilitate accurate approximate entity matching. Proc. VLDB Endow., 2(1):395–406, 2009.
[14] T. Cheng, H. W. Lauw, and S. Paparizos. Fuzzy matching of web queries to structured data. In ICDE, pages 713–716, 2010.
[15] T. Cheng, H. W. Lauw, and S. Paparizos. Entity synonyms
for structured web search. IEEE Trans. Knowl. Data Eng.,
24(10):1862–1875, 2012.
[16] F. Giunchiglia, V. Maltese, F. Farazi, and B. Dutta. Geowordnet:
A resource for geo-spatial applications. In ESWC (1), pages
121–136, 2010.
[17] J. Hoffart, F. M. Suchanek, K. Berberich, E. Lewis-Kelham,
G. de Melo, and G. Weikum. Yago2: exploring and querying
world knowledge in time, space, context, and many languages.
In WWW, pages 229–232, 2011.
[18] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum.
Yago2: A spatially and temporally enhanced knowledge base
from wikipedia. Artif. Intell., 194:28–61, 2013.
[19] D. Klein and C. D. Manning. Fast exact inference with a factored model for natural language parsing. In Advances in Neural Information Processing Systems 15 (NIPS), pages 3–10. MIT Press, 2003.
[20] H. Mousavi, S. Gao, and C. Zaniolo. Ibminer: A text mining
tool for constructing and populating infobox databases and
knowledge bases. VLDB (demo track), 2013.
[21] H. Mousavi, D. Kerr, and M. Iseli. A new framework for textual information mining over parse trees. CRESST Report 805, University of California, Los Angeles, 2011.
[22] H. Mousavi, D. Kerr, M. Iseli, and C. Zaniolo. Deducing infoboxes from unstructured text in wikipedia pages. CSD Technical Report #130001, UCLA, 2013.
[23] H. Mousavi, D. Kerr, M. Iseli, and C. Zaniolo. OntoHarvester: An unsupervised ontology generator from free text. CSD Technical Report #130003, UCLA, 2013.
[24] G. Navarro. A guided tour to approximate string matching.
ACM Comput. Surv., 33(1):31–88, Mar. 2001.
[25] P. Pantel, E. Crestan, A. Borkovsky, A.-M. Popescu, and V. Vyas. Web-scale distributional similarity and entity set expansion. In EMNLP, pages 938–947, 2009.
[26] E. Pianta, L. Bentivogli, and C. Girardi. Multiwordnet: developing an aligned multilingual database. In Proceedings of the
First International Conference on Global WordNet, 2002.
[27] P. Singh, T. Lin, E. T. Mueller, G. Lim, T. Perkins, and W. L.
Zhu. Open mind common sense: Knowledge acquisition from
the general public. In Confederated International Conferences
DOA, CoopIS and ODBASE, London, UK, 2002.
[28] M. M. Stark and R. F. Riesenfeld. Wordnet: An electronic
lexical database. In Proceedings of 11th Eurographics Workshop
on Rendering. MIT Press, 1998.
[29] P. D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In ECML, pages 491–502, 2001.
[30] F. Wu and D. S. Weld. Open information extraction using
wikipedia. In ACL, pages 118–127, 2010.
[31] W. Wu, H. Li, H. Wang, and K. Q. Zhu. Probase: a probabilistic
taxonomy for text understanding. In SIGMOD Conference,
pages 481–492, 2012.