Journal of Computational Information Systems 3:2 (2007) 203-213
Available at http://www.JofCI.org

Efficient Entity Relation Discovery on Web

Jing He†, Yuan Liu, Qichen Tu, Conglei Yao, Nan Di
Computer Networks and Distributed System Laboratory
School of Computer Science and Technology, Peking University (100871), Beijing, China

Abstract

With the popularization of the Web, there are billions of pages online, which contain abundant information about real-world entities and their relations. Therefore, much research focuses on named entity extraction and entity relation discovery for constructing social networks that reflect the real society. However, some former entity relation discovery approaches, which extract a small group of entities in a limited community or intranet, are not scalable; when applied to a large group of entities on the Web, they may fail. In this paper, we employ co-occurrence to identify the relations between entities. The contributions of the paper are: 1. empirically evaluating various frequently used measures for co-occurrence and finding that the Cosine coefficient outperforms the others; 2. presenting and comparing two novel efficient algorithms for discovering relations between entities.

Keywords: Entity Relation Extraction; Web Mining; Algorithm Complexity; Graph Clustering; Probabilistic Model

† Corresponding author. Email address: hj@net.pku.edu.cn (Jing He)
1553-9105 / Copyright © 2007 Binary Information Press. June, 2007

1. Introduction

Social networks are a traditional field in sociology [1, 2]. With the popularization of computers and the Internet, some research focuses on social network discovery from information systems [3, 4, 5]. The Web is the most noteworthy information system, and it is meaningful to mine the entities and their relations on the Web. In this paper, assuming entities have already been extracted [6, 7], we focus on discovering the relations among them efficiently.

In Section 2, related work is discussed. In Section 3, five frequently used measures for co-occurrence are empirically evaluated. In Section 4, two novel relation discovery approaches are presented, and in Section 5 both approaches are evaluated experimentally and further discussed. Conclusion and future work are presented in Section 6.

2. Related Work

The entities and their relations compose a social network, which has been explored by former research. Milgram discovered the small world phenomenon in 1967 [1]. After that, social networks became an attractive research field in sociology. Watts and Strogatz [8] formulated small world networks mathematically and derived some of their properties. Barabasi and Albert explored scale-free networks [9].

The work of entity relation discovery on the Web started soon after the Web's birth. In 1997, Kautz discovered relations between persons on the Web with Referral Web [10]. Mika developed the system Flink [11], employing Web pages, emails, and personal profiles to mine relations. [6] is recent work on extracting social networks from the Web: it builds a social network for a conference and classifies each relation into one of several classes. Data clustering can help relation discovery, because communities are common in social networks, so good clustering algorithms such as [12] are also useful for the relation discovery problem.

3. Selecting Measures for Co-occurrence

3.1. Candidate Measures

There are multiple measures for co-occurrence, such as the matching coefficient, mutual information, Dice coefficient, Jaccard coefficient, overlap coefficient and Cosine coefficient, formulated in Table 1, where $C_x$ denotes the number of documents mentioning entity $x$ and $C_{xy}$ the number of documents mentioning both $x$ and $y$. All are applied widely in Information Retrieval, Information Extraction, entity relation identification and so on [6, 10]. However, no comparison has been made between them for the application of entity relation identification.

Table 1. Candidate measures for co-occurrence

    Measure Name                    Formulation
    Mutual Information (variant)    $\log(C_{xy} / (C_x C_y))$
    Dice Coefficient                $2C_{xy} / (C_x + C_y)$
    Overlap Coefficient             $C_{xy} / \min(C_x, C_y)$
    Jaccard Coefficient             $C_{xy} / (C_x + C_y - C_{xy})$
    Cosine Coefficient              $C_{xy} / \sqrt{C_x C_y}$
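The formulas can be made concrete with a short Python sketch. This is our illustration, not part of the paper; the function name is hypothetical, and we assume the raw page counts $C_x$, $C_y$, $C_{xy}$ (all positive) have already been obtained, e.g. from a search index over the crawled pages.

```python
import math

def cooccurrence_measures(c_x, c_y, c_xy):
    """Compute the five candidate co-occurrence measures of Table 1.

    c_x, c_y -- number of pages mentioning entity x (resp. y), > 0
    c_xy     -- number of pages mentioning both x and y
    """
    return {
        # Variant of mutual information used in the paper.
        "mutual_information": math.log(c_xy / (c_x * c_y)) if c_xy else float("-inf"),
        "dice": 2 * c_xy / (c_x + c_y),
        "overlap": c_xy / min(c_x, c_y),
        "jaccard": c_xy / (c_x + c_y - c_xy),
        "cosine": c_xy / math.sqrt(c_x * c_y),
    }

# Example: two entities appearing on 1200 and 800 pages, 150 of them shared.
print(cooccurrence_measures(1200, 800, 150))
```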
3.2. Approaches for Evaluating the Measures

A suitable measure should be selected for the task. There are two perspectives from which to evaluate a measure:

Accuracy: In our application, a person is like a query in IR, and a sorted list of candidate relevant persons is like a list of relevant documents. Therefore, the evaluation metrics of IR are also meaningful here. MAP captures the precision of a ranked list, so it is useful for evaluating the accuracy of a measure.

Stability: Unlike accuracy, which considers each person separately and only cares about order, stability should consider the measure value from a global view, because the purpose of the measure is to classify the relation between a pair of persons into two classes: relative and irrelative. A threshold may be required, so not only the sorting order but also the measure value is essential. To evaluate stability, we combine all results together and plot precision at the 11 standard recall levels to compare the measures.

3.3. Empirical Comparison

To compare the measures, we build a gold standard. Two datasets of 50 persons each are selected randomly from the top 1000 celebrities of the Chinese Web. For each person in a dataset, his top 10 neighbors under each measure are considered to be relevant and merged together into a pool. Three experts are employed to vote on whether two persons are relevant or not. With the gold standard built, both MAP and precision at the 11 standard recall levels can be computed. Table 2 shows the MAP results for the two datasets; the Cosine coefficient outperforms the others. The stability results are presented in Figure 1. Because the purpose of this work is to find general relations between persons, recall is more important than precision, so mutual information and the Cosine coefficient are good measures for stability. Therefore, the Cosine coefficient is employed as the standard measure in the rest of the paper.

Table 2. MAP results for five measures of co-occurrence

    Measure                Dataset 1    Dataset 2
    Mutual Information     0.459        0.421
    Dice Coefficient       0.506        0.487
    Overlap Coefficient    0.530        0.461
    Jaccard Coefficient    0.508        0.473
    Cosine Coefficient     0.558        0.510

[Figure 1: Precision at 11 standard recall levels for the five measures of co-occurrence; panels (a) and (b) correspond to the two datasets.]

4. Efficient Approaches for Entity Relation Discovery

In the entity relation discovery task, some works scan over whole corpora [13], and others [6, 11] search all single entities and the co-occurrences of all pairs. The computational complexities of the two approaches are O(N) and O(m^2) respectively, where N is the number of documents and m is the number of entities. When the job is to capture the relations over a large group of entities, both are inefficient. Research on social networks points out that the relation graph is very sparse, so it is wasteful to count every pair. Another approach [10] uses only the local information of entities; we argue that it loses relations by doing so.

We design two algorithms to overcome the high complexity. Both are based on the high density of the local distribution, i.e., the fact that the relations A-B and A-C make a relation B-C highly probable. A measure called the clustering coefficient describes this property. The clustering coefficient of a vertex u in a graph (V, E) is defined as:

$$C_u = \frac{|\{\{v,w\} \mid v \in V,\ w \in V,\ (v,w) \in E,\ (u,v) \in E,\ (u,w) \in E\}|}{|\{\{v,w\} \mid v \in V,\ w \in V,\ (u,v) \in E,\ (u,w) \in E\}|} \qquad (1)$$

The clustering coefficient of a graph is the average of the clustering coefficients of all its vertices.

4.1. Validation of the Clustering Coefficient

We validate the clustering coefficient property of entities on the Web. Three probabilities are considered, where $R(\cdot,\cdot)$ denotes a relation between two entities: $P(R(v,w) \mid R(u,v) \wedge R(u,w))$ (the clustering coefficient, i.e., the probability of a relation between two entities sharing a common neighbor u), $P(R(v,w))$ (the relation probability) and $P(R(v,w) \mid \neg(R(u,v) \wedge R(u,w)))$ (the anti-clustering coefficient, for two entities that do not share the common neighbor u). Selecting five samples randomly from the top 1000 celebrities on the Web [7], the three probabilities are computed for each sample. The results are presented in Figure 2; the clustering coefficient is clearly much higher than the other two probabilities.

[Figure 2: Relation probability, clustering coefficient and anti-clustering coefficient of entities on the Web; the X axis indexes the sampled persons and the Y axis is the probability.]
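Equation (1) is straightforward to compute on an adjacency-set representation of the relation graph. The following sketch is our illustration (names hypothetical) of both the per-vertex coefficient and the graph-level average:

```python
from itertools import combinations

def local_clustering(adj, u):
    """Clustering coefficient of vertex u, following Eq. (1): the
    fraction of pairs of u's neighbors that are themselves connected.
    adj maps each vertex to the set of its neighbors."""
    neighbors = adj[u]
    if len(neighbors) < 2:
        return 0.0  # no neighbor pairs to close
    pairs = list(combinations(neighbors, 2))
    closed = sum(1 for v, w in pairs if w in adj[v])
    return closed / len(pairs)

def graph_clustering(adj):
    """Graph-level coefficient: the average over all vertices."""
    return sum(local_clustering(adj, u) for u in adj) / len(adj)

# Toy relation graph: A-B, A-C, B-C, C-D.
adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
print(graph_clustering(adj))  # 0.583...: neighbors tend to know each other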
4.2. Graph Clustering Approach for Entity Discovery

It is natural to consider clustering as an initial step for entity relation discovery, and we use graph clustering in this application because a social network is naturally represented as a graph. Some former research extracted only the entities in local pages, which loses many relations; it is still useful, however, because it supplies an initial candidate set.

Given a list of m entities, we build an initial graph (V, E), where V is the set of vertices representing entities and E is the set of weighted edges, whose weights represent how relevant two entities are. Initially, every value in the adjacency matrix M is 0. We take an entity as a seed and retrieve its related documents; the other entities in the context around the seed are then extracted. Each time a pair of relevant entities is discovered, the value at the corresponding position in M is increased by 1.

The graph clustering algorithm is similar to [12] and is called Markov clustering. The process iterates until the graph is stable; the loop terminates when the distance between adjacent matrices in the iteration is less than a threshold. In each loop, the matrix is normalized first, so that M[i][j] is the probability of walking randomly from entity i to entity j. Then M is replaced by M^e, each entry of which is the probability of walking from one entity to another in e steps; this matrix is denser than the original one. The next step is to raise each entry v to v^λ, where λ > 1; this is called inflation because it emphasizes the heterogeneity within a row. After Markov graph clustering, the entities are clustered into groups, and the relations of entities within a group are validated by the Cosine measure and a threshold. The cost of validation is much lower than complete pairwise validation because not all pairs are searched.
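A minimal sketch of this loop is given below. It is our reading of the description above and of van Dongen's algorithm [12], not the paper's code; parameter names are ours, and we add self-loops so that every row of the count matrix can be normalized, a common practice in Markov clustering.

```python
import numpy as np

def markov_cluster(M, e=2, lam=2.0, eps=1e-4, max_iter=100):
    """Markov clustering loop of Section 4.2 (after [12]). M is the
    co-occurrence count matrix built from the seed contexts, e the
    expansion power, lam (> 1) the inflation exponent."""
    M = M.astype(float) + np.eye(len(M))  # self-loops keep rows non-zero
    for _ in range(max_iter):
        # Normalize rows: M[i][j] becomes the probability of one random
        # walk step from entity i to entity j.
        M = M / M.sum(axis=1, keepdims=True)
        # Expansion: M^e holds e-step walk probabilities, denser than M.
        expanded = np.linalg.matrix_power(M, e)
        # Inflation: raising entries to lam emphasizes the heterogeneity
        # within a row, favoring strong edges; then renormalize.
        inflated = expanded ** lam
        inflated /= inflated.sum(axis=1, keepdims=True)
        # Stop when adjacent iterates are closer than the threshold.
        if np.abs(inflated - M).max() < eps:
            return inflated
        M = inflated
    return M
```

After convergence the rows concentrate their mass on a few attractor columns; entities sharing attractors form a group, and only the pairs inside a group are then validated with the Cosine measure.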
4.3. Probabilistic Model for Entity Discovery

The graph clustering approach still loses some relations, because it does not cover entities that have no relations inside any cluster. One way to solve the problem is to use the clustering coefficient explicitly. The initial graph is empty, and entities are added sequentially, one by one. When the first entity is added, nothing is done. When there are k vertices in the graph, the (k+1)-th entity, called e, is added as follows. Before any test, the initial probability of a relation between e and any entity in the graph is the relation probability defined in Section 4.1. First, a random entity e0 is selected and tested for relevance with e; if they are relevant, an edge is generated. The relation probabilities of the entities adjacent to e0 are then updated, multiplying by $P(R(v,w) \mid R(u,v) \wedge R(u,w)) / P(R(v,w))$ after a positive test, or by $P(R(v,w) \mid \neg(R(u,v) \wedge R(u,w))) / P(R(v,w))$ after a negative one. Next, the entity with the maximum relation probability is selected for testing, and the probabilities of its surrounding entities are updated in the same way. The process continues until all entities have been tested or the relation probabilities of the remaining entities fall below a threshold.

The algorithm described above does not use local information, so the choice of the first entities to test is essentially random. With local information such as that used in Section 4.2, it should be more efficient.
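One insertion step can be summarized in the following sketch. All names are ours, and the update rule encodes our reading of the description above: a positive test multiplies the neighbors' probabilities by the clustering coefficient ratio (raising them, since the clustering coefficient far exceeds the relation probability), a negative test by the anti-clustering ratio (lowering them).

```python
def add_entity(adj, e, is_relevant, p_rel, p_cc, p_anti, threshold=0.05):
    """Insert entity e into the relation graph (Section 4.3 sketch).
    adj         -- maps each existing entity to its set of neighbors
    is_relevant -- the expensive co-occurrence test, e.g. Cosine vs. a threshold
    p_rel, p_cc, p_anti -- relation probability, clustering coefficient
                           and anti-clustering coefficient (Section 4.1)
    """
    adj.setdefault(e, set())
    prob = {v: p_rel for v in adj if v != e}  # prior for every entity
    while prob:
        v = max(prob, key=prob.get)           # most promising untested entity
        if prob[v] < threshold:
            break                             # the rest are too unlikely
        relevant = is_relevant(e, v)
        if relevant:
            adj[v].add(e)
            adj[e].add(v)
        del prob[v]
        # Update v's neighbors: they now (do not) share the neighbor v with e.
        factor = (p_cc if relevant else p_anti) / p_rel
        for w in adj[v]:
            if w in prob:
                prob[w] *= factor
    return adj
```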
5. Experiment and Discussion

The dataset for the experiments contains 200 persons randomly selected from the top 1000 celebrities on the Web. The gold standard for the experiments is the set of Cosine values with a threshold; the threshold is 0.08 because that value corresponds to an inflexion in the Cosine plot of Figure 1. MAP is used for evaluation. Table 3 presents the results for varying parameter settings.

[Table 3: Parameter settings for Markov graph clustering.]

The diagram of precision at the 11 standard recall levels is presented in Figure 3(a). Here recall means the percentage of relations discovered, and precision means how accurate the discovery is; the numbers in the diagram give the actual number of searches required. The Markov clustering approach extracts at most about 70% of all relations, so applications that need more must resort to the probabilistic model. The original probabilistic model requires only a threshold parameter; the parameter settings and corresponding results are presented in Figure 3(b), where the three points correspond to thresholds of 0.05, 0.1 and 0.2 respectively. Its recall is very high (nearly 98%), but the cost also increases fast, though it remains lower than complete search. The extended model uses the Markov clustering result as the initial relation graph, obtaining relation information; the diagram shows that recall can be increased further, but the improvement is not so obvious.

[Figure 3: Precision at 11 standard recall levels for the Markov graph clustering algorithm (a) and the probabilistic model approach (b).]

Therefore, Markov graph clustering and the probabilistic model are both more efficient than complete search, and they suit different scenarios. The Markov graph clustering approach is cheaper: it walks randomly from an entity to nearby entities, so it cannot cover all relations. The probabilistic model, on the other hand, is a global approach with much higher computational cost. For tasks whose recall requirement is not high, the Markov approach is a good choice; for tasks such as building a complete social network graph, the probabilistic model is more suitable.

6. Conclusion and Future Work

In this paper, we employ co-occurrence information for entity relation discovery on the Web. First, an empirical comparison is made between measures, and the Cosine measure is selected for its higher accuracy and stability. We then present two approaches: the Markov graph clustering algorithm is very efficient but comparatively low in recall, while the probabilistic model can extract most relations at a much lower cost than complete search. In the future, we plan to add heuristic strategies to the probabilistic model so that the order in which entities are selected for testing is more efficient, and to classify the discovered relations into specific types.

Acknowledgement

This work is supported by NSFC Grant 60435020, the NSFC key program "Theory and methods of question-and-answer information retrieval"; NSFC Grant 60573166, "Research on the correlation model and experimental computing methodology between web structure and social information"; and NSFC Grant 60603056, "Research and Application on Sampling the Web".

References

[1] S. Milgram. The small world problem. Psychology Today, 1967: 60-67.
[2] M. Granovetter. The strength of weak ties. American Journal of Sociology, 1973, 78(6): 1360-1380.
[3] S. Staab, P. Domingos, P. Mika, J. Golbeck, L. Ding, T. Finin, A. Joshi. Social networks applied. IEEE Intelligent Systems, 2005: 80-93.
[4] J. Tyler, D. Wilkinson, B. Huberman. Email as spectroscopy: automated discovery of community structure within organizations. The Information Society, 2003: 81-96.
[5] L. A. Adamic, E. Adar. Friends and neighbors on the web. Social Networks, 2003, 25(3): 211-230.
[6] Y. Matsuo, J. Mori, M. Hamasaki. POLYPHONET: an advanced social network extraction system from the Web. In International World Wide Web Conference, 2006.
[7] Conglei Yao. http://webdigest.grids.cn, 2006.
[8] D. J. Watts, S. H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 1998, 393(6684): 440-442.
[9] A. L. Barabasi, R. Albert. Emergence of scaling in random networks. Science, 1999, 286(5439): 509-512.
[10] H. Kautz, B. Selman, M. Shah. The hidden Web. AI Magazine, 1997, 18(2): 27-35.
[11] P. Mika. Flink: semantic web technology for the extraction and analysis of social networks. Journal of Web Semantics, 2005, 3(2).
[12] S. van Dongen. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, 2000.
[13] X. Li, B. Liu. Mining community structure of named entities from free text. In Conference on Information and Knowledge Management, 2005.