JCIS-E0340

Journal of Computational Information Systems 3:2(2007) 203-213
Available at http://www.JofCI.org
Name Disambiguation Using Many-to-One Features
Yuan LIU†, Jing HE, Conglei YAO
Computer Networks and Distributed System Laboratory,
School of Electronic Engineering and Computer Science, Peking University
Abstract
As the Web grows drastically, more and more entity information appears on it, including profile information and web logs containing the entities' ideas, activities, speeches and so on. However, many entities share the same name, including persons, locations and so on. This paper presents an approach to estimating the number of entities sharing a name by employing many-to-one features. The basic idea is that entities are unlikely to share all their other features even if they have the same name. We list strategies for selecting key features, present an approach to extracting these features from the Web, and combine them to estimate the number of entities. In addition, we give a method to identify and filter the fake information that would otherwise confuse the estimate.
Keywords: Name disambiguation; Many-to-one relation; Information Extraction
1. Introduction
The referents of a name appearing on the Web are difficult to distinguish due to the lack of features that can be used to identify them. Originally, a name was a key feature for telling one entity apart from others: <Name, Entity> should be a many-to-one relation to prevent misunderstanding. However, as more and more entities appear on the Web, this relation has broken down and become many-to-many. One name may potentially refer to tens or hundreds of entities, so a name alone cannot identify an entity on the Web, because homonyms are so common.
Our purpose is to estimate a lower bound on the number of referents of each name in a name list, so we ignore the name recognition step of the general name disambiguation process. We present a method based on choosing special features that stand in a many-to-one relation to entities, including entity profiles and relations between entities. For a person name, the person's birthday and parents' names may be important features; using these features, combined with the person's name, we can distinguish this person from other homonyms. How to choose and extract features must be considered first. The data on the Web are redundant and full of feature patterns, so we use iterative pattern-relation extraction to make our method scalable. Many web pages contain only the entity name and not the features we want, so we also use query expansion to avoid data sparseness.
The rest of this paper is organized as follows. In the next section we review previous work on name disambiguation. Section 3 presents how to select and extract features. Section 4 presents how to identify entities and filter fake information. Section 5 reports experiments on 1000 person names, 20 of which were manually labeled. Section 6 concludes.
†
Corresponding author.
Email address: liuyuan@net.pku.edu.cn (Yuan Liu)
1553-9105/ Copyright © 2007 Binary Information Press
June, 2007
2. Related Work
Name disambiguation in large data sets originally appeared in plain text [1]; with the development of the Web, new methods arose for the Web environment [3, 4]. Much of this work concerns person name disambiguation, because of the high duplication rate and high dimensionality of features, and some other work concerns geographic name disambiguation [2].
Three methods are mentioned in most of these papers: finding the biographical features of a name, analyzing name networks, and clustering [1, 3, 4]. We do not need to fully understand the whole text of a web page, only the messages (profile or relation) we need [8].
3. Feature Selection and Extraction
A named entity has a collection of features. Some features, such as a person's birthday and birth place, are informative for distinguishing names. In previous work, a set of features useful for distinguishing person names is called biographical features [3], for example "George H. W. Bush was the 41st President of the USA" and "Mozart was born in 1756". We observe that these features have an internal attribute that can be formulated mathematically as a many-to-one relation.
3.1. Many-to-One relation features
A many-to-one relation feature is a common concept in software analysis and design, meaning that one object can have only ONE value of the feature, while multiple objects can share the same value. The advantage of such a feature is that it defines a lower bound on the number of entities. Take a person's birthday as an example: it is obviously a many-to-one feature, because one person has only one birthday but multiple people can share the same birthday.
[4] employs occupation as a feature to distinguish persons, but occupation is not a many-to-one relation feature, because one person can easily move from one occupation to another. For instance, Noam Chomsky, considered a great linguist, is also a political activist; an algorithm based on occupation might mistakenly consider this one Chomsky to be two distinct people. In this paper we treat a one-to-one relation feature as a special case of a many-to-one relation feature, so features such as identity card ID also belong to many-to-one relation features.
There are two classes of many-to-one relation features: informative features and relative features. An informative feature is a value (an integer, a float, or a string), while a relative feature points to another entity. Although two or more persons may share both name and birthday and publish that information on the Web, this does not affect the lower bound on the number of referents, and the conflict probability is low. Table 1 and Table 2 list some important many-to-one features for person and location entities, respectively.
Table 1. Many-to-one features for person entity
Person: Sex | Birthday | Birth Place | Identity Card ID | Father's Name | Mother's Name
Table 2. Many-to-one features for location entity
Location: Established Time | Area | Higher-up Location | Longitude | Latitude
3.2. Pattern based Feature Extraction
Feature extraction algorithms are generally studied in the fields of Information Extraction, Question Answering and so on. Both manually written patterns and automatically learned patterns can be used in the feature extraction procedure in limited domains. We use both methods to extract features and find the latter more scalable, with better results.
Once the many-to-one features are selected, we can manually write patterns to extract them from the Web. We retrieve about 1000 distinct URLs for each name in the name list using search engines (Google and Yahoo!) and fetch all the web pages. We then use the manually written patterns to match and extract the features. This procedure is described in Figure 1.
1. For each feature, manually write a set of patterns.
2. For each name in the name list:
3.     Retrieve the related URLs from a search engine and fetch all the web pages.
4.     For each web page:
5.         Open a window around each occurrence of the name.
6.     For each window:
7.         Use the patterns to match and extract features.
Fig.1 Feature Extraction Using Manually Written Patterns
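The window-and-match step (lines 5-7 of Fig. 1) can be sketched as follows. This is a minimal illustration: the regular expressions and the `extract_birthday` helper are our assumptions for English text, not the paper's actual (largely Chinese-language) patterns.

```python
import re

# Hand-written patterns for one feature (birthday); illustrative only.
BIRTHDAY_PATTERNS = [
    re.compile(r"was born (?:in|on) (\d{4})"),
    re.compile(r"\(b\.\s*(\d{4})\)"),
]

def extract_birthday(window_text):
    """Apply each hand-written pattern to a text window around a name
    occurrence and return the first birth year matched, or None."""
    for pattern in BIRTHDAY_PATTERNS:
        match = pattern.search(window_text)
        if match:
            return match.group(1)
    return None

print(extract_birthday("Mozart was born in 1756 in Salzburg."))  # -> 1756
```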
To collect the target features listed in Tables 1 and 2, the automatic pattern-based algorithm learns a group of patterns for each feature; the patterns are then employed to expand the query and extract feature values for a specific entity name.
First, we manually select a set of <Name, Feature> pairs. Taking person names and birthdays as an example, the initial set could be "<Mozart, 1756>, <Beethoven, 1770> …". If a name list exists, the names should be selected randomly to avoid bias.
Then each pair's elements are combined into a query and sent to a search engine. Unlike the method above, we use only the abstract of each URL on the result page to extract patterns. The patterns are then employed to expand the query and extract more <Name, Feature> pairs. In the learning process, the two steps are executed alternately until the pattern set is stable. The procedure is described in Figure 2.
1. Manually select a set of <Name, Feature> pairs.
2. For each pair, retrieve the results from a search engine and collect all the abstracts.
3. Extract patterns from the abstracts.
4. For each name in the name list:
5.     Combine the name with terms from the patterns to generate queries and fetch the result pages.
6. Repeat 3 to 5 until the pattern set is stable and no more patterns can be found.
Fig.2 Feature Pattern Learning and Extraction
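The bootstrap loop of Fig. 2 can be sketched as follows, under simplifying assumptions: a "pattern" is reduced to the raw context string between a known name and feature value, and `search_abstracts(query)` is a stand-in for the search-engine call (both are our assumptions; the full method also uses the learned patterns to harvest new pairs, which is omitted here).

```python
def learn_patterns(seed_pairs, search_abstracts, max_rounds=5):
    """Simplified sketch of the Fig. 2 loop: repeatedly extract context
    patterns from search-result abstracts containing a known
    <Name, Feature> pair, stopping when the pattern set is stable."""
    pairs = set(seed_pairs)
    patterns = set()
    for _ in range(max_rounds):
        new_patterns = set()
        for name, feature in pairs:
            for abstract in search_abstracts(name + " " + feature):
                if name in abstract and feature in abstract:
                    # Keep the text between name and feature as a crude pattern.
                    start = abstract.index(name) + len(name)
                    end = abstract.index(feature)
                    if start < end:
                        new_patterns.add(abstract[start:end].strip())
        if new_patterns <= patterns:  # pattern set is stable -> stop
            break
        patterns |= new_patterns
    return patterns
```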
4. Entity Identification and Fake Information Filtering
Given the extracted features, how can we identify the number of referents a name has? There are two problems: first, the features of one entity may be spread across several independent web pages; second, some features are fake information.
4.1. Entity Identification
After learning the patterns of each feature, the feature values can be extracted from the pages retrieved for queries that combine the entity name with pattern keywords.
Define the feature set as F = {f_1, f_2, ..., f_m}. For an entity name e_i, the retrieved document set is D = {d_1, d_2, ..., d_n}, and the feature vector extracted from d_i is V_i = {v_i^1, v_i^2, ..., v_i^m}. V_i may be so sparse that most of its values are NULL, because one page may contain only a few feature values of an entity.
We can then combine the set of vectors into a smaller set, each element of which represents one entity in the real world. For a pair of feature vectors V_i and V_j, we define a consistence relation between them, expressed in first-order logic.
Definition 1: Consistence Feature.
Consistence(V_i, V_j) = true ⟺ ∀x: (v_i^x = v_j^x) ∨ (v_i^x = NULL) ∨ (v_j^x = NULL)
This is clearly a consistence relation, because it is symmetric.
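Definition 1 can be stated directly as a predicate over feature vectors. A minimal sketch, assuming each vector is represented as a Python dict mapping feature names to values, with NULL as `None`:

```python
def consistent(v_i, v_j):
    """Definition 1: two feature vectors are consistent iff every
    feature either agrees or is NULL (None) on at least one side."""
    for x in set(v_i) | set(v_j):
        a, b = v_i.get(x), v_j.get(x)
        if a is not None and b is not None and a != b:
            return False
    return True
```

Symmetry holds by construction: `consistent(a, b) == consistent(b, a)` for any pair, since the check treats both vectors identically.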
The description of an entity such as a person is assembled from several documents, so it is a group of feature vectors. The main idea is that the feature vectors for one entity should be mutually consistent: because the features we select are many-to-one relations, no conflict should occur within one entity. Therefore, the problem of discovering the entities behind a shared name can be converted into finding maximum consistent subsets in the set of feature vectors.
To solve the problem, we first select a seed from the feature vector set. The problem then divides into two sub-problems: discovering the maximum subset containing the seed and the maximum subset not containing it. Each feature vector in the subset containing the seed must be consistent with the seed, and each feature vector in the subset not containing the seed must be inconsistent with it. The problem can be divided iteratively until no seed remains in the candidate set. The algorithm is described in Figure 3.
Input: feature vector set S
Output: candidate feature vector set cand
1. cand = NULL
2. while S <> NULL:
3.     seed = choose an element from S
4.     S = S - {seed}
5.     for f in S:
6.         if Consistence(seed, f) == true:
7.             seed = merge(seed, f)
8.             S = S - {f}
9.     cand = cand + {seed}
Fig.3 Algorithm for entity identification
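The algorithm in Fig. 3 can be sketched as follows, assuming (as above) that feature vectors are dicts mapping feature names to values with NULL as `None`, and that `merge` simply overlays non-NULL values onto the seed; both representations are our assumptions.

```python
def identify_entities(vectors):
    """Greedy sketch of the Fig. 3 algorithm: repeatedly pick a seed,
    merge every remaining vector consistent with it, and emit the
    merged vector as one candidate entity."""
    def consistent(a, b):
        # Definition 1: no shared feature may hold conflicting non-NULL values.
        return all(a.get(x) is None or b.get(x) is None or a[x] == b[x]
                   for x in set(a) | set(b))

    def merge(a, b):
        # Overlay b's non-NULL values onto a copy of a.
        out = dict(a)
        for x, v in b.items():
            if v is not None:
                out[x] = v
        return out

    pool = list(vectors)
    candidates = []
    while pool:
        seed = pool.pop(0)
        rest = []
        for f in pool:
            if consistent(seed, f):
                seed = merge(seed, f)  # absorb consistent vectors into the seed
            else:
                rest.append(f)         # inconsistent vectors form the next round
        pool = rest
        candidates.append(seed)
    return candidates
```

The number of candidates returned is the estimated lower bound on the number of referents of the name.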
4.2. Fake Information Filtering
However, unlike a high-quality curated dataset [10], the Web contains a lot of fake information, so even a many-to-one relation feature of one entity may appear with multiple values. This situation is so common on the Web that we must filter out some fake information. The framework constructed in Section 4.1 is flexible enough to accommodate this filtering.
We modify the definition of consistence made in Section 4.1 so as to consider the probability of consistence; the conflict probability of each feature can be derived from its value domain.
First, we assign a different confidence level to each feature in the feature vector, according to the feature's value domain. For example, the feature Sex of a person can take only two values, male or female, while Birthday, if we assume most persons on the Web were born within the last hundred years, can be any date in that range. The latter therefore has a lower conflict probability.
Second, we introduce positive and negative confidence when computing the consistence of feature vectors. In feature vectors V_i and V_j, v_i^x and v_j^x relate to the same feature f_x. If v_i^x = v_j^x and neither is NULL, a positive value is contributed to the final confidence result; if both are non-NULL but the values conflict, a negative value is added.
Finally, we check the final confidence value. If it is within a threshold chosen according to the applied field, the two feature vectors are considered consistent.
Definition 2: Fuzzy Consistence. Feature vectors V_i and V_j are consistent if their confidence value is within the threshold.
If two feature vectors share the same value on several features that have little probability of coinciding by chance, the vectors are consistent with high confidence, even if some other feature values conflict. The construction of the graph and the entity identification are similar to Section 4.1. Fake information filtering is done after the consistent subsets are built, based on a voting algorithm: we select the value with majority support as the correct one and filter out the others.
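A minimal sketch of the fuzzy consistence check and the voting filter follows. The per-feature confidence weights and the zero threshold are our assumptions (the paper says only that confidence levels depend on the feature's value domain and that the threshold is field-dependent).

```python
from collections import Counter

def fuzzy_consistent(v_i, v_j, weights, threshold=0.0):
    """Sketch of Definition 2. `weights[x]` is an assumed confidence
    level for feature x (higher for large value domains like birthday,
    lower for small ones like sex). Agreement on a shared non-NULL
    value adds +weights[x]; a conflict subtracts weights[x]."""
    score = 0.0
    for x in set(v_i) | set(v_j):
        a, b = v_i.get(x), v_j.get(x)
        if a is None or b is None:
            continue  # NULL on either side contributes nothing
        score += weights.get(x, 1.0) if a == b else -weights.get(x, 1.0)
    return score >= threshold

def filter_fake(values):
    """Voting-based fake-information filter: keep the value with
    majority support among the extracted candidates."""
    counts = Counter(v for v in values if v is not None)
    return counts.most_common(1)[0][0] if counts else None
```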
5. Experiment and Evaluation
In this section, we describe our experiments and evaluate the results.
5.1. Feature extraction
We use both manually written patterns and automatically extracted patterns to extract features. For the former, we use a name list containing 1000 person names, each with at least 1000 web pages. We extract person features such as sex, birthday, and birth place using regular expressions.
When we started the extraction process, we faced a data sparseness problem: only a few of the retrieved web pages contained the features we wanted to extract. In our experiment, few people had more than 100 feature-bearing pages among the first 1000 pages returned by the search engine. The distribution is shown in Figure 4: the X axis is the number of web pages containing feature information, the Y axis is the number of person names, and both axes are logarithmic, so a point (x, y) in Figure 4 means that 2^y person names have 2^x feature-bearing pages among their 1000 related web pages.
We use query expansion to handle the data sparseness problem. The expansion words are chosen from patterns in which both the person name and the feature information can be found: for each candidate word, the person name and the word are combined into a query, and we verify whether this query brings up more pages containing the feature. In Chinese, words such as "出生" (born) and "简介" (profile) perform well for extracting the birthday feature.
We use the edit distance of the raw texts and Markov Graph Clustering [9] to extract the final patterns. Edit distance measures how similar two strings are, so we use it to compute the similarity of two raw texts. Once each pair of raw texts has a similarity value, a graph is established, and Markov Graph Clustering produces clusters; the raw texts in the same cluster are transformed into a pattern. Clusters with very few elements are treated as noise and discarded.
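The similarity computation feeding the clustering step can be sketched as follows. Normalizing edit distance into [0, 1] by the longer string's length is our assumption, and the Markov Graph Clustering step itself [9] is omitted.

```python
def edit_distance(s, t):
    """Standard Levenshtein edit distance via dynamic programming."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # cost of deleting i characters
    for j in range(n + 1):
        d[0][j] = j  # cost of inserting j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def similarity(s, t):
    """Normalize edit distance into [0, 1]; raw texts are linked in the
    graph when similarity exceeds an assumed threshold."""
    longest = max(len(s), len(t)) or 1
    return 1.0 - edit_distance(s, t) / longest
```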
5.2. Entity Identification experiment
We manually labeled 20 person names, covering 240 real people, with their information including birthday, birth place, and sex. We then used the algorithm in Figure 3 to identify the number of persons behind each name. The result is shown in Figure 5: the X axis is the 20 person names and the Y axis is the number of referents; red bars are the true number for each name and yellow bars are the number we identified.
Fake information tends to appear for entities with many related web pages, so more information is available and more feature patterns are needed. The fake information filtering method is useful for preventing a famous entity from being split into two or more entities. For example, the Chinese actor "傅彪" has two birth places on the Web (Beijing and Shandong), and "刘亦菲" has two birthdays (1987/8/25 and 1986/8/25). Both feature values are supported by many web pages and cannot be dismissed casually.
Fig.4 Data sparse In Feature Extraction
Fig.5 Identification Person Name Using Many-To-One features
6. Conclusion
Name disambiguation is important in knowledge discovery, data mining and other fields. We use just one special kind of feature of a named entity, the many-to-one feature, to find a lower bound on the number of referents. Further methods, such as network analysis and text clustering, can build on this result. In the future, we plan to investigate other kinds of features, such as many-to-many features, and how they influence the name disambiguation procedure.
Acknowledgment
Supported by:
- NSFC Grant 60435020, the NSFC key program "Theory and methods of question-and-answer information retrieval";
- NSFC Grant 60573166, "Research on the correlation model and experimental computing methodology between web structure and social information";
- NSFC Grant 60603056, "Research and Application on Sampling the Web" (Web 抽样研究理论与应用).
References
[1] N. Wacholder, Y. Ravin, and M. Choi. Disambiguation of proper names in text. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 202-208, March 1997.
[2] David A. Smith and Gregory Crane. Disambiguating geographic names in a historical digital library. In Proceedings of ECDL, pages 127-136, Darmstadt, 4-9 September 2001.
[3] Bradley Malin, Edoardo Airoldi, and Kathleen M. Carley. A network analysis model for disambiguation of names in lists. Computational & Mathematical Organization Theory, 11:119-139, 2005.
[4] Gideon S. Mann and David Yarowsky. Unsupervised personal name disambiguation. In CoNLL, Edmonton, Alberta, 2003.
[5] Deepak Ravichandran and Eduard Hovy. Learning surface text patterns for a question answering system. 2002.
[6] R. Bekkerman and A. McCallum. Disambiguating web appearances of people in a social network. In WWW, 2005.
[7] Bradley Malin, Edoardo Airoldi, and Kathleen M. Carley. A network analysis model for disambiguation of names in lists.
[8] Sergey Brin. Extracting patterns and relations from the world wide web. In WebDB Workshop at EDBT '98, 1998.
[9] Ulrik Brandes, Marco Gaertler, and Dorothea Wagner. Experiments on graph clustering algorithms.
[10] Takaaki Hasegawa, Satoshi Sekine, and Ralph Grishman. Discovering relations among named entities from large corpora.