
Journal of Computational Information Systems 3:2 (2007) 203-213
Available at http://www.JofCI.org
Efficient Entity Relation Discovery on Web
Jing He†, Yuan Liu, Qichen Tu, Conglei Yao, Nan Di
Computer Networks and Distributed System Laboratory
School of Computer Science and Technology, Peking University (100871), Beijing, China
Abstract
With the popularization of the Web, there are now billions of Web pages, which contain abundant information
about real-world entities and their relations. Much research therefore focuses on named entity extraction and
entity relation discovery in order to construct social networks that reflect real society. However, some
earlier entity relation discovery approaches, which extract a small group of entities from a limited community
or intranet, do not scale well and may fail when applied to a large group of entities on the Web. In this
paper, we employ co-occurrence to identify the relations between entities. The contributions of the paper are:
1. an empirical evaluation of several frequently used co-occurrence measures, which finds that Cosine
outperforms the others; 2. two novel efficient algorithms for discovering relations between entities, together
with a comparison of them.
Keywords: Entity Relation Extraction; Web Mining; Algorithm Complexity; Graph Clustering; Probabilistic Model
1. Introduction
Social networks are a traditional field of study in sociology [1, 2]. With the popularization of computers and
the Internet, some research has focused on discovering social networks from information systems [3, 4, 5]. The
Web is the most noteworthy information system, and it is meaningful to mine the entities and their relations
on the Web. In this paper, assuming that entities have already been extracted [6, 7], we focus on discovering
the relations among them efficiently. Section 2 discusses related work. Section 3 empirically evaluates five
frequently used co-occurrence measures. Section 4 presents two novel relation discovery approaches, and
Section 5 evaluates both approaches experimentally and discusses them further. Section 6 presents conclusions
and future work.
2. Related Work
Entities and their relations compose a social network, which has been explored in earlier research.
Milgram discovered the small-world phenomenon in 1967 [1]. After that, social networks became an attractive
research field in sociology. Watts and Strogatz [8] formulated small-world networks mathematically and
derived some of their properties. Barabasi and Albert explored scale-free networks [9].
† Corresponding author.
Email address: hj@net.pku.edu.cn (Jing He)
1553-9105/ Copyright © 2007 Binary Information Press
June, 2007
Jing He et al. /Journal of Computational Information Systems 1:2 (2005) 203-213
Work on entity relation discovery on the Web started soon after the Web's birth. In 1997, Kautz et al.
discovered relations between persons on the Web with Referral Web [10]. Mika developed the Flink system [11],
which uses Web pages, emails, and personal profiles to mine relations. A more recent system [6] extracts
social networks from the Web: it builds a social network for a conference and classifies the relations into
several classes. Data clustering can help entity relation discovery, because communities are common in social
networks, so good clustering algorithms such as [12] are also useful for the relation discovery problem.
3. Selection of Measures for Co-occurrence
3.1. Candidate Measures
There are multiple measures of co-occurrence, such as the matching coefficient, mutual information, the Dice
coefficient, the Jaccard coefficient, the overlap coefficient and the Cosine coefficient; Table 1 gives the
formulations of the candidates. All are widely applied in Information Retrieval, Information Extraction,
entity relation identification and so on [6, 10]. However, they have not been compared in the application of
entity relation identification.
Table 1. Candidate Measures for Co-occurrence
Measure Name                   Measure Formulation
Mutual Information (variant)   $\log\left(C_{x \cap y} / (C_x C_y)\right)$
Dice Coefficient               $2 C_{x \cap y} / (C_x + C_y)$
Overlap Coefficient            $C_{x \cap y} / \min(C_x, C_y)$
Jaccard Coefficient            $C_{x \cap y} / C_{x \cup y}$
Cosine Coefficient             $C_{x \cap y} / \sqrt{C_x C_y}$
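The formulations in Table 1 can be sketched in a few lines of Python. This is only an illustration: the paper does not define the counts explicitly, so the sketch assumes that C_x and C_y are the numbers of pages mentioning x and y, that the joint count is the number of pages mentioning both, and that the union count follows by inclusion-exclusion. The function name `cooccurrence_measures` and the sample counts are hypothetical.

```python
import math

def cooccurrence_measures(c_x, c_y, c_xy):
    """Return the five Table 1 measures from page counts.

    Assumed meaning of the arguments (not stated in the paper):
    c_x, c_y : number of pages mentioning x (resp. y)
    c_xy     : number of pages mentioning both x and y
    """
    if c_x <= 0 or c_y <= 0 or c_xy <= 0:
        # No joint evidence: treat every measure as "no relation".
        return {"mutual_information": float("-inf"), "dice": 0.0,
                "overlap": 0.0, "jaccard": 0.0, "cosine": 0.0}
    c_union = c_x + c_y - c_xy  # pages mentioning x or y (inclusion-exclusion)
    return {
        "mutual_information": math.log(c_xy / (c_x * c_y)),
        "dice": 2 * c_xy / (c_x + c_y),
        "overlap": c_xy / min(c_x, c_y),
        "jaccard": c_xy / c_union,
        "cosine": c_xy / math.sqrt(c_x * c_y),
    }

# Hypothetical counts: x on 400 pages, y on 100 pages, both on 40 pages.
scores = cooccurrence_measures(c_x=400, c_y=100, c_xy=40)
```

With these counts the Cosine coefficient is 40 / sqrt(400 * 100) = 0.2, while the overlap coefficient is 0.4 because it normalizes only by the rarer entity.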
3.2. Approaches for Evaluating the Measures
A suitable measure should be selected for the task. A measure can be evaluated from two perspectives:
Accuracy: In our application, a person plays the role of a query in IR, and a sorted list of candidate
relevant persons plays the role of a list of relevant documents. The evaluation metrics of IR are therefore
meaningful here as well. Mean average precision (MAP) summarizes the precision of ranked lists, so it is
useful for evaluating the accuracy of a measure.
Stability: Unlike accuracy, which considers each person separately and only cares about ranking order,
stability considers the measure values globally, because the purpose of the measure is to classify the
relation between a pair of persons into two classes: related and unrelated. A threshold may be required, so
not only the sorting order but also the measure value itself is essential. To evaluate stability, we combine
all results together and plot precision at the 11 standard recall levels to compare the measures.
3.3. Empirical Comparison
To compare the measures, we build a gold standard. Two datasets of 50 persons are selected randomly from the
top 1000 celebrities of the Chinese Web. For each person in a dataset, the top 10 neighbors under each measure
are taken as candidates and merged into a pool. Three experts then vote on whether two persons are relevant
or not. With the gold standard built, both MAP and precision at the 11 standard recall levels can be
computed. Table 2 shows the MAP results for the two datasets. The results show that Cosine
Coefficient outperforms the others. The stability results are presented in Figure 1. Because the purpose of
the work is to find general relations between persons, recall is more important than precision, so Mutual
Information and Cosine Coefficient are good measures for stability. Therefore, Cosine Coefficient is
employed as the standard measure in the rest of the paper.
Table 2. MAP results for five measures of co-occurrence
Measure               Dataset 1   Dataset 2
Mutual Information    0.459       0.421
Dice Coefficient      0.506       0.487
Overlap Coefficient   0.530       0.461
Jaccard Coefficient   0.508       0.473
Cosine Coefficient    0.558       0.510
Fig. 1. Precision at 11 standard recall levels for five measures of co-occurrence (panels (a) and (b))
4. Efficient Approaches for Entity Relation Discovery
In the entity relation discovery task, some works scan the whole corpus [13], while others [6, 11] search for
every single entity and for the co-occurrence of every pair. The computational complexities of the two kinds
of algorithm are O(N) and O(m^2) respectively, where N is the number of documents and m is the number of
entities. However, both are inefficient when the task is to capture the relations over a large group of
entities. Research on social networks points out that the relation graph is very sparse, so it is wasteful to
count every pair. Another approach [10] uses only local information about the entities; we argue that it
loses relations by doing so. We design two algorithms to overcome the high complexity. Both rely on the high
density of the local distribution, that is, on the fact that relation A-B and relation A-C lead to a high
probability of relation B-C. A measure called the clustering coefficient describes this property. The
clustering coefficient for a vertex u in a graph (V, E) is defined as:
$$C_u = \frac{|\{(v, w) \mid v \in V \wedge w \in V \wedge (w, v) \in E \wedge (u, v) \in E \wedge (u, w) \in E\}|}{|\{(v, w) \mid v \in V \wedge w \in V \wedge (u, v) \in E \wedge (u, w) \in E\}|} \qquad (1)$$
The clustering coefficient of a graph is the average of the clustering coefficients of all its vertices.
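As a small illustration of Equation (1), the sketch below computes the clustering coefficient of a vertex and of a whole graph. It rests on two assumptions that the paper does not state: only pairs of distinct neighbors are counted (unordered pairs give the same ratio as ordered ones), and a vertex with fewer than two neighbors is given coefficient 0. The adjacency dictionary `adj` is a hypothetical toy graph.

```python
from itertools import combinations

def clustering_coefficient(adj, u):
    """Fraction of pairs of u's neighbors that are themselves connected."""
    neighbors = adj.get(u, set())
    pairs = list(combinations(sorted(neighbors), 2))
    if not pairs:
        return 0.0  # assumed convention for vertices with fewer than 2 neighbors
    linked = sum(1 for v, w in pairs if w in adj.get(v, set()))
    return linked / len(pairs)

def graph_clustering_coefficient(adj):
    """Average of the per-vertex clustering coefficients."""
    return sum(clustering_coefficient(adj, u) for u in adj) / len(adj)

# Toy undirected graph: triangle A-B-C plus a pendant vertex D attached to A.
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
c_a = clustering_coefficient(adj, "A")      # 1 linked pair out of 3
c_graph = graph_clustering_coefficient(adj)
```

Here A has three neighbor pairs of which only B-C is connected, so its coefficient is 1/3, while B and C each have a single, connected neighbor pair.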
4.1. Validation of Clustering Coefficient
We validate the clustering coefficient property of entities on the Web. Three probabilities are examined:
$P(R(u,v) \mid R(w,u) \wedge R(w,v))$ (clustering coefficient), $P(R(u,v))$ (relation probability) and
$P(R(u,v) \mid R(w,u) \wedge \neg R(w,v))$ (anti-clustering coefficient), where $R(a,b)$ denotes that
entities a and b are related. Five samples are selected randomly from the top 1000 celebrities on the Web [7],
and the three probabilities are computed for each sample. The results are presented in Figure 2, in which the
X axis is the number of persons selected and the Y axis is the probability. The clustering coefficient is
clearly much higher than the other two probabilities.
Fig.2 Relation Probability, Clustering Coefficient and Anti-clustering Coefficient of entities on Web
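The three probabilities can be estimated from a known relation graph by simple counting. The sketch below is an assumption-laden illustration rather than the authors' procedure: it reads the clustering coefficient as P(R(u,v) | R(w,u) and R(w,v)) and the anti-clustering coefficient as P(R(u,v) | R(w,u) and not R(w,v)), and it counts over all ordered triples of distinct entities. The function name and the toy graph are hypothetical.

```python
from itertools import combinations, permutations

def relation_statistics(vertices, edges):
    """Estimate (relation probability, clustering, anti-clustering) by counting."""
    related = {frozenset(e) for e in edges}

    def r(a, b):
        return frozenset((a, b)) in related

    pairs = list(combinations(vertices, 2))
    relation_prob = sum(r(u, v) for u, v in pairs) / len(pairs)

    both = both_hit = one = one_hit = 0
    for w in vertices:
        others = [x for x in vertices if x != w]
        for u, v in permutations(others, 2):
            if r(w, u) and r(w, v):        # u and v share the neighbor w
                both += 1
                both_hit += r(u, v)
            elif r(w, u) and not r(w, v):  # w is related to u but not to v
                one += 1
                one_hit += r(u, v)
    clustering = both_hit / both if both else 0.0
    anti = one_hit / one if one else 0.0
    return relation_prob, clustering, anti

# Toy graph: triangle A-B-C, an extra edge C-D, and an isolated entity E.
vertices = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
p_rel, p_cluster, p_anti = relation_statistics(vertices, edges)
```

On this toy graph the estimates are 0.4, 0.6 and 4/14 respectively, which shows the same ordering as the one reported for Figure 2, although a five-vertex example proves nothing about Web data.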
4.2. Graph Clustering Approach for Entity Discovery
It is natural to consider clustering as an initial step for entity relation discovery. We use graph clustering
in this application because a social network is naturally represented as a graph. Some earlier research
extracted only the entities in local pages, which loses many relations; however, it is still useful because
it supplies an initial candidate set. Given a list of m entities, we build an initial graph (V, E), where V is
the set of vertices, representing entities, and E is the set of weighted edges, whose weights represent how
relevant two entities are. The graph is stored as a weight matrix M, all of whose entries are initially 0. We
take an entity as a seed and retrieve the documents related to it. The other entities in the context around
the seed are then extracted. Each time a pair of relevant entities is discovered, the corresponding entry of
M is increased by 1.
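The construction of the initial matrix can be sketched as follows. This is a minimal illustration under stated assumptions: retrieval is replaced by a hypothetical dictionary `contexts` that maps each seed to the text snippets found around it, entity extraction is replaced by plain substring matching, and M is kept symmetric. None of these choices comes from the paper.

```python
def build_initial_matrix(entities, contexts):
    """Build the m x m co-occurrence count matrix M from seed contexts.

    entities : list of entity names (strings)
    contexts : dict mapping a seed entity to the snippets retrieved for it
               (a stand-in for the retrieval and extraction steps)
    """
    index = {e: i for i, e in enumerate(entities)}
    m = len(entities)
    M = [[0] * m for _ in range(m)]  # initially all entries are 0
    for seed, snippets in contexts.items():
        for text in snippets:
            for other in entities:
                # Substring matching is a simplification of entity extraction.
                if other != seed and other in text:
                    i, j = index[seed], index[other]
                    M[i][j] += 1
                    M[j][i] += 1  # assumed: relations are symmetric
    return M

entities = ["Ann", "Bob", "Cid"]
contexts = {
    "Ann": ["Ann met Bob", "Ann and Bob and Cid"],
    "Bob": ["Bob thanked Ann"],
}
M = build_initial_matrix(entities, contexts)
```

Only pairs that involve the seed are counted, which is why Bob and Cid stay unconnected here even though they appear together in one of Ann's snippets.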
The graph clustering algorithm is similar to Markov Clustering [12]. The process iterates until the graph is
stable: the loop terminates when the distance between the matrices of two adjacent iterations is less than a
threshold. In each iteration, the matrix is first normalized, so that M[i][j] is the probability of walking
randomly from entity i to entity j. Then M is replaced by the matrix power $M^e$, each entry of which is the
probability of walking from one entity to another in e steps. This matrix is denser than the original one.
The next step replaces each entry v by $v^{\lambda}$, where $\lambda > 1$. It is called inflation because it
emphasizes the heterogeneity within a row. After Markov clustering, the entities are clustered into groups,
and the relations between entities within a group can then be validated with the Cosine measure and a
threshold. The cost of this validation is much lower than that of complete pair validation because it does
not search all pairs.
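The iteration described above (normalize, expand, inflate, repeat until stable) can be sketched in pure Python. This is a simplified illustration, not the authors' implementation: the self-loops added before normalization, the maximum-difference stopping test, the default parameter values and the rule that assigns each entity to its strongest attractor are all assumptions.

```python
def normalize_rows(M):
    """Scale each row to sum to 1, so M[i][j] is a transition probability."""
    out = []
    for row in M:
        s = sum(row)
        out.append([x / s for x in row] if s else list(row))
    return out

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def markov_clustering(weights, e=2, lam=2.0, tol=1e-6, max_iter=100):
    """Cluster the entities of a weight matrix; returns lists of entity indices."""
    n = len(weights)
    # Assumption: add a self-loop to every vertex, a common MCL convention.
    M = normalize_rows([[weights[i][j] + (1 if i == j else 0) for j in range(n)]
                        for i in range(n)])
    for _ in range(max_iter):
        expanded = M
        for _ in range(e - 1):              # expansion: M^e
            expanded = matmul(expanded, M)
        inflated = normalize_rows([[x ** lam for x in row] for row in expanded])
        delta = max(abs(inflated[i][j] - M[i][j])
                    for i in range(n) for j in range(n))
        M = inflated
        if delta < tol:                     # matrices of adjacent iterations agree
            break
    clusters = {}
    for i in range(n):
        attractor = max(range(n), key=lambda j: M[i][j])
        clusters.setdefault(attractor, []).append(i)
    return sorted(clusters.values())

# Two triangles (0-1-2 and 3-4-5) joined by a single bridge edge 2-3.
bridged = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
groups = markov_clustering(bridged)
```

Once the groups are known, only pairs inside a group need to be validated with the Cosine measure. If the six entities split into two groups of three, that is 6 pair checks instead of the 15 required by complete search.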
4.3. Probabilistic Model for Entity Discovery
The graph clustering approach still loses some relations because it does not cover entities that have no
related entities inside the clustered graph. One way to solve the problem is to use the clustering
coefficient explicitly. The initial graph is empty, and entities are added sequentially, one by one. When the
first entity is added, nothing needs to be done. When there are k vertices in the graph, the (k+1)-th entity,
called e, is added as follows. Before any test, the initial probability of a relation between e and any
entity in the graph is the relation probability defined in Section 4.1. First, one random entity e0 is
selected and tested for relevance to e. If they are relevant, an edge is generated. The relation
probabilities of the entities that are relevant to e0 are then updated by multiplying them by
$P(R(u,v) \mid R(w,u) \wedge R(w,v)) / P(R(u,v))$ (if e and e0 are relevant) or by
$P(R(u,v) \mid R(w,u) \wedge \neg R(w,v)) / P(R(u,v))$ (if they are not). Next, the entity with the maximum
relation probability is selected for testing, and the probabilities of its neighboring entities are updated
in the same way. The process continues until all entities have been tested or the relation probabilities of
the remaining entities are all below a threshold. The algorithm described above does not use any local
information, so the first entity to test is chosen at random. With local information such as that used in
Section 4.2, it should be more efficient.
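The sequential procedure can be sketched as below. It is a hedged reading of the description, not the authors' code: the relevance test is abstracted as a hypothetical callback `is_related` (in the paper this would be a co-occurrence query checked against a Cosine threshold), the three probabilities are passed in as plain numbers, updated values are capped at 1.0, and the anti-clustering update is applied when the tested pair turns out to be unrelated. All names and numbers are illustrative.

```python
import random

def discover_relations(entities, is_related, p_rel, p_cluster, p_anti,
                       threshold, rng=None):
    """Add entities one by one; return (relation graph, number of tests run)."""
    rng = rng or random.Random(0)
    graph = {}   # entity -> set of related entities found so far
    tests = 0
    for e in entities:
        prob = {x: p_rel for x in graph}   # prior for every existing vertex
        untested = set(graph)
        # The first entity to test is chosen at random, as in the description.
        current = rng.choice(sorted(untested)) if untested else None
        graph[e] = set()
        while current is not None:
            untested.discard(current)
            tests += 1
            related = is_related(e, current)
            if related:
                graph[e].add(current)
                graph[current].add(e)
            # Update untested neighbors of the entity that was just tested.
            factor = (p_cluster if related else p_anti) / p_rel
            for u in graph[current]:
                if u in untested:
                    prob[u] = min(1.0, prob[u] * factor)  # cap is an assumption
            best = max(sorted(untested), key=lambda x: prob[x], default=None)
            current = best if best is not None and prob[best] >= threshold else None
    return graph, tests

# Hypothetical ground truth used as the relevance oracle.
truth = {frozenset(p) for p in [("A", "B"), ("A", "C"), ("B", "C"), ("D", "E")]}

def oracle(a, b):
    return frozenset((a, b)) in truth

names = ["A", "B", "C", "D", "E", "F"]
graph, n_tests = discover_relations(names, oracle, p_rel=0.1, p_cluster=0.5,
                                    p_anti=0.05, threshold=0.08)
```

With a threshold of 0 every pair is tested, which reproduces complete search (15 tests for 6 entities); raising the threshold trades recall for fewer tests, which is the trade-off discussed in Section 5.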
5. Experiment and Discussion
The dataset for the experiments contains 200 persons randomly selected from the top 1000 celebrities on the
Web. The gold standard for the experiments is the set of pairs whose Cosine value exceeds a threshold. The
threshold is 0.08, because this value corresponds to an inflection point of the Cosine plot in Figure 1. MAP
is used for evaluation. Table 3 presents the results for various parameter settings.
Table 3. Parameter Setting for Markov Graph Clustering
The precision at the 11 standard recall levels is plotted in Figure 3.a. Recall is the percentage of relations
discovered, and precision is how accurate the discovery is. The numbers in the diagram are the actual numbers
of searches required. The Markov clustering approach can extract at most about 70% of all relations, so some
applications require the probabilistic model to extract more. The original probabilistic model requires only
a threshold parameter; the parameter settings and the corresponding results are presented in Figure 3.b. The
three points in the diagram correspond to thresholds of 0.05, 0.1 and 0.2 respectively. The recall is very
high (nearly 98%), but the cost also increases quickly, although it remains lower than that of complete
search. The extended model uses the output of the Markov clustering approach as its initial relation graph,
thereby obtaining relation information in advance. The diagram shows that the recall can be further
increased, but the improvement is not significant.
Fig. 3. Precision at 11 standard recall levels of (a) the Markov graph clustering algorithm and (b) the probabilistic model approach
Therefore, Markov graph clustering and the probabilistic model are both more efficient than complete search.
The Markov graph clustering approach is more flexible, but it walks randomly from an entity to others nearby,
so it cannot cover all relations. The probabilistic model, on the other hand, is a global approach, and its
computational cost is much higher. The two approaches should therefore be used in different scenarios. For
tasks in which the recall requirement is not very high, the Markov approach is a good choice. However, for
tasks such as building a complete social network graph, the probabilistic model is more suitable.
6. Conclusion and Future Work
In this paper, we employ co-occurrence information for entity relation discovery on the Web. First, several
measures are compared empirically, and the Cosine measure is selected for its higher accuracy and stability.
We then present two approaches: the Markov graph clustering algorithm is very efficient but comparatively
lower in recall, while the probabilistic model can extract most relations at a much lower cost than complete
search. In future work, we plan to add heuristic strategies to the probabilistic model so that the order in
which entities are selected for testing is more efficient. The type of each relation should also be
specified.
Acknowledgement
This work is supported by NSFC Grant 60435020 (the NSFC key program "Theory and methods of
question-and-answer information retrieval"), NSFC Grant 60573166 (Research on the correlation model and
experimental computing methodology between web structure and social information) and NSFC Grant 60603056
(Research and Application on Sampling the Web).
References
[1] Milgram, S., The small world problem. Psychology Today, 1967: p. 60~67.
[2] Granovetter, M., The strength of weak ties. American Journal of Sociology, 1973. 78(6): p. 1360~1380.
[3] S. Staab, P. Domingos, P. Mika, J. Golbeck, L. Ding, T. Finin, A. Joshi, Social networks applied. IEEE
Intelligent Systems, 2005: p. 80~93.
[4] J. Tyler, D. Wilkinson, and B. Huberman, Email as spectroscopy: automated discovery of community structure
within organizations. The Information Society, 2003: p. 81~96.
[5] L.A. Adamic and E. Adar, Friends and neighbors on the web. Social Networks, 2003. 25(3): p. 211~230.
[6] Y. Matsuo, J. Mori, M. Hamasaki. POLYPHONET: An Advanced Social Network Extraction System from
the Web. in International World Wide Web Conference. 2006.
[7] Conglei Yao. http://webdigest.grids.cn. 2006.
[8] D.J. Watts and S.H. Strogatz, Collective dynamics of 'small-world' networks. Nature, 1998. 393(6684): p. 440~442.
[9] A.L. Barabasi and R. Albert, Emergence of scaling in random networks. Science, 1999. 286(5439): p. 509~512.
[10] H. Kautz, B. Selman, and M. Shah, The hidden Web. AI Magazine, 1997. 18(2): p. 27~35.
[11] Mika, P., Flink: Semantic web technology for the extraction and analysis of social networks. Journal of Web
Semantics, 2005. 3(2).
[12] Dongen, S.V., Graph Clustering by Flow Simulation. 2000, University of Utrecht.
[13] Xin Li, Bing Liu. Mining Community Structure of Named Entities from Free Text. in Conference on Information
and Knowledge Management. 2005.