Mining Academic Community
Jan-Ming Ho, hoho@iis.sinica.edu.tw
Computer System and Communication Lab, Institute of Information Science, Academia Sinica

What is Community?
• In graph theory: densely connected groups of vertices, with sparser connections between groups
• In social network analysis: groups of entities that share similar properties or connect to each other via certain relations
• A social network is a structure made up of nodes, representing entities from different conceptual groups, that are linked by different types of relations

Why is Community Important?
• Interesting data with community structure: researcher collaboration, friendship networks, the WWW, massively multiplayer online gaming, electronic communications
• Groups of web pages that link to more pages inside the community than outside it correspond to web pages on related topics
• Groups in social networks correspond to social communities, which can be used to understand organizational structure, academic collaboration, shared interests and affinities, etc.

Motivation
• Understand the research network between authors, conferences and topics (rank entities by relevance for given entities)
• Find and justifiably recommend research collaborators for given authors
• Explore the academic social network
• Find the most important papers, researchers and venues for a given topic

Related Systems
• Many digital library systems exist: ACM Digital Library, IEEE Xplore, DBLP, CiteSeer, Libra, DBconnect
• Problems
  • The coverage of the datasets is not large enough
  • The name ambiguity problem exists in both web pages and citation records

Libra Academic Search (http://libra.msra.cn)
• Free computer science bibliography search engine
• A test-bed for object-level vertical search research
• Currently the following types of paper-related objects can be searched: papers, authors, conferences, journals, research communities

DBconnect (screenshots): Conference, Topic, Author

ZoomInfo
(1) People directory
(2) Developer tools
(3) Social network, profile statistics, employment history
(4) Ability to identify ambiguous names? E.g., a search returns 21 different people called “Bing Liu”

ArnetMiner (screenshot)

Our goal
• Develop an automatic system to
  • Explore the academic social network
  • Find the most important papers, researchers and venues for a given topic
• Provide solutions for existing problems
  • Collecting larger citation datasets
    • Retrieving data from web pages: publication list finder, extracting citation strings from web pages, citation parser
    • Multilingual data sources: Chinese and English corpora
  • Name disambiguation mechanism for web pages and citation records

Our contributions
• Kai-Hsiang Yang, Kun-Yan Chiou, Hahn-Ming Lee, and Jan-Ming Ho, "Web Appearance Disambiguation of Personal Names Based on Network Motif," in the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006), Hong Kong, Dec. 18-22, 2006.
• Kai-Hsiang Yang, Jen-Ming Chung, and Jan-Ming Ho, "PLF: A Publication List Web Page Finder for Researchers," in Proceedings of the 2007 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2007), Silicon Valley, USA, Nov. 25, 2007.
• Kai-Hsiang Yang, Wei-Da Chen, Hahn-Ming Lee, and Jan-Ming Ho, "Mining Translations of Chinese Name from Web Corpora by Using Query Expansion Technique and Support Vector Machine," in Proceedings of the 2007 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2007), Silicon Valley, USA, Nov. 25, 2007.
• Chia-Ching Chou, Kai-Hsiang Yang, and Hahn-Ming Lee, "AEFS: Authoritative Expert Finding System Based on a Language Model and Social Network Analysis," in Proceedings of the 12th Conference on Artificial Intelligence and Applications (TAAI2007), Nov. 16-17, 2007.
• Chien-Chih Chen, Kai-Hsiang Yang, and Jan-Ming Ho, "BibPro: A Citation Parser Based on Sequence Alignment Techniques," to appear in Proceedings of the IEEE 22nd International Conference on Advanced Information Networking and Applications (AINA-08).

PLF: A Publication List Web Page Finder for Researchers

Agenda
• Introduction
• Publication List Web Page Finder (PLF)
• Performance Evaluation
• Conclusion, Future Work

Overview of a Publication List Web Page
• Helps readers keep abreast of state-of-the-art research
• Contains citations not found elsewhere
• May provide reference materials, such as slides and talks
• Challenges
  • How to find publication list web pages given only a name
  • Various versions or multiple copies: an author may have many affiliations
  • Name ambiguity problem: e.g., for Dr. Bing Liu, a query to ZoomInfo (a people search engine) showed that 26 people share the same name

Problem: what is a “publication list web page”?

Definition of Publication List
• Affiliated Personal Publication List Web Page (APPL): a web page that belongs to the affiliated web site of a specific person with the given name.
(Figure labels: [Affiliation] Institute of Information Science, Academia Sinica; citation string)

Process Flow
• Citation String Search: collects the citation strings from digital libraries
• Web Page Crawler: collects the hyperlinks of web pages from search engines by using the collected citation strings as queries
• Rank Function: analyses the statistics of all the collected hyperlinks of web pages
(Diagram labels: given names, digital libraries, citations, parsing, query, search engines, hyperlinks, statistics, analyse, interaction, web page)

Basic Concept
• A publication list web page may contain many citation strings
(Diagram: the paper titles PT1 ... PTm on a publication list web page, e.g. that of Jan-Ming Ho, are each issued as a query QPT1, QPT2, ... to a search engine, and each query returns web pages WP1 ... WPn)

Dataset
• Scenario: seminar members have usually published major research works
• We randomly collected 200 names from the WWW ’06 Conference Committee website

APPL type      #APPL   #people   %population
others         0       22        11%
single-group   1       120       60%
multi-group    2       35        17.5%
multi-group    3       16        8%
multi-group    4       7         3.5%

Experiment Evaluation
• Evaluation metric: we consider the top-5 results derived by each link and focus on the top-5 recall, which is calculated by
  $\text{recall} = R / R_a$
  • $R_a$: the number of publication list web pages belonging to researchers listed in the dataset
  • $R$: the number of publication list web pages contained in the top-5 results

Parameter Analysis for Single-Group
(Figure: top-5 recall for (a) fixed n with varying m and (b) fixed m with varying n)
• Figure (a): when m increases, the recall rate also increases.
• Figure (b): system performance may be constrained by m.
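
The rank function and the top-5 recall metric described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the PLF implementation: `rank_pages` counts how many citation-string queries returned each URL, which is one plausible reading of "statistics of the collected hyperlinks", and the function names and input formats are hypothetical.

```python
from collections import Counter

def rank_pages(results_per_query, top_k=5):
    """Rank URLs by how many citation-string queries returned them.

    results_per_query: one list of URLs per query (hypothetical format;
    the real system obtains these lists from search engines).
    """
    counts = Counter()
    for urls in results_per_query:
        counts.update(set(urls))  # count a URL at most once per query
    # Descending count, then alphabetical order so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [url for url, _ in ranked[:top_k]]

def top5_recall(ranked_by_person, truth_by_person):
    """recall = R / R_a, following the definitions on the evaluation slide.

    truth_by_person maps each researcher to the set of their true
    publication list pages; ranked_by_person maps them to ranked URLs.
    """
    r_a = sum(len(pages) for pages in truth_by_person.values())
    r = sum(
        len(set(ranked_by_person.get(person, [])[:5]) & pages)
        for person, pages in truth_by_person.items()
    )
    return r / r_a if r_a else 0.0
```

A page that appears in the results of many different citation queries rises to the top, which matches the observation that a publication list page contains many citation strings.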

Parameter Analysis for Multi-Group
(Figure: top-5 recall for (a) fixed n with varying m and (b) fixed m with varying n)
• Figure (a): the performance with m = 40 is always better than with the other settings.
• Figure (b): the best performance (top-5 recall of 70%) occurs when n = 75.

Performance Evaluations (given name + keyword)
(Figure: performance of the approaches on (a) the single-group and (b) the multi-group dataset)
1. The parameter m has a strong influence on the system’s performance; for example, an oversized m may degrade the performance.
2. The parameter n has little influence on the system’s performance.
3. The PLF system outperforms the other two approaches on both the single-group and the multi-group datasets.

Conclusion
• We have defined the problem of finding the publication list web pages of a researcher and proposed the “PLF” system
• Ongoing work
  • Name ambiguity problem
  • How to merge the multiple publication list web pages of a specific person into a single page

Discussion – Name Ambiguity Problem
• Scenario: we take the name “Bing Liu” as an example and analyze the results manually
• Observations: citation count, name translation problem, partial matching problem

Extracting Citation Strings from Web Pages

Extract Citation Records
(Diagram: Web Page → Extract → Structured Data)

Challenges
• The formats of publication list web pages vary
• There are no fixed syntactic rules for parsing citation records
• Hence, we cannot apply simple rules to extract citation records automatically

Challenges: Complex Layouts of Publication List Pages (screenshot)

Ideas
• The semantic structure of web pages is organized by visual arrangement.
• We can utilize the semi-structured (visual) information of web pages to help the extraction task.
• With its hierarchical structure and geometric information, the DOM tree is not only a good structure for representing web pages but also very helpful for visual pattern analysis.
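
To make the DOM-based idea concrete, the sketch below groups the text nodes of a page by their tag path and keeps the paths that repeat, on the assumption that citation strings in a publication list tend to share one path (for example `html/body/ul/li`). It uses only Python's standard `html.parser`; the class and function names, the `min_count` threshold, and the tag-path notion of "style" are illustrative assumptions, not the system's actual pattern definition.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

class TagPathCollector(HTMLParser):
    """Collects text nodes keyed by the path of open tags above them."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.groups = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.groups["/".join(self.stack)].append(text)

def common_style_groups(html, min_count=2):
    """Return tag paths shared by at least min_count text nodes."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return {
        path: texts
        for path, texts in collector.groups.items()
        if len(texts) >= min_count
    }
```

Text inside nested inline tags (such as an `em` within an `li`) lands under a longer path in this sketch, so a fuller version would need to merge such fragments back into one record.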

DOM Tree Presentation of Web page
(Figure: a publication list page divided into a banner, a navigator bar, and a publication list region containing several citation strings)

Architecture of Citation Extraction System
(Architecture diagram. Components: Publication List Web Page Finder; publication list pages; parsing; DOM tree with nodes such as div, p, li, a, em and text nodes T1 to T6; Common Style Finder (mining common style patterns); Citation Extractor (extracting candidate records, estimating citation length, ranking records); normal citation model; CiteSeer; candidate citation records; citation records)

Modules of Citation Extraction System
• Common Style Finder: finds all common style patterns for each level of granularity in web pages
• Citation Extractor
  • explores data regions with common style patterns
  • distills extraction rules from those data regions
  • ranks extraction patterns based on a normal word count distribution probability

BibPro: A Citation Parser based on Sequence Alignment Techniques

System Goal
• Input citation: Chomsky, Noam. 1956. Three models for the description of language. IRE Transactions on Information Theory. 2(3) 113--124.
• Output metadata:
  Author: Chomsky, Noam
  Title: Three models for the description of language
  Journal: IRE Transactions on Information Theory
  Volume: 2
  Issue: 3
  Page: 113-124
  Month: (empty)
  Year: 1956

Basic Idea (1/2)
• Encode a citation as a protein sequence
• Keep only the citation style information: order of fields, field separators
(Figure: Author → A, Title → D T, Journal → D L, year, page, ... → D Y R P H S)

Basic Idea (2/2)
• Determine the citation style from the order of punctuation marks and reserved words
(Diagram: System Preprocess stores entries of the form (citation, feature index, style); Online Parsing turns a citation string into a feature index and uses the search tool BLAST to look it up among the stored entries)

How to encode a citation as a protein sequence?
• Keep the citation style information
• Which fields should be included?
  (only 23 symbols can be used)
• Which punctuation marks are used to separate fields?
• By observing different citation styles, we define an encode table that translates each token of a citation to an amino acid symbol

Encode Table
A: Author
T: Title
L: Journal
F: Volume value
W: Issue value
H: Page value
M: Month
Y: Year
X: noise (unrecognized token)
S: Issue key, e.g. “no”, “No”
P: Page key, e.g. “pp”, “page”
V: Volume key, e.g. “Vol”, “vo”
N: numeral
Q: @ # $ % ^ & * + = \ | ~ _ / ! ? 。
I: ( [ { < 「
K: ) ] } > 」
D: .
G: " “ ”
R: ,
C: - :
E: ' `
Z: ;
B: blank

How to use the protein sequence to extract metadata?
• Transform the extraction problem into a sequence alignment problem
• Form translation
  • Unknown answer: BASE FORM, ALIGN FORM, INDEX FORM
  • Known answer: RESULT FORM, STYLE FORM, INDEX FORM

RESULT FORM (Known Answer) (figure)

BASE FORM (Unknown Answer) (figure)

System Structure
• System PreProcess (Template Generating System): Citation Crawler, Template Builder
• Online Parsing (Parsing System): Template Matching, Metadata Extraction
(Diagram: (1) System PreProcess takes resources on the Internet and fills the template database; (2) Online Parsing takes a query citation and outputs metadata)

Citation Crawler
(Diagram: the Citation Crawler queries the Google, IEEE, ACM and CiteSeer engines and collects citations together with their BibTeX entries; a BibTeX parser turns each entry into metadata)

BLAST-powered Template Matching
(Diagram: the query citation is encoded with the encode table and translated to its INDEX FORM; BLAST matches it against the (INDEX FORM, STYLE FORM) pairs in the template database and returns candidate STYLE FORMs)

Evaluation for CiteSeer DataSet
• Consider the inconsistency between the citation string and the BibTeX file (metadata)
• Old measurement:
  $\text{Field Precision}_{old} = \frac{\#[\text{Token}_{\text{parsed field}} \cap \text{Token}_{\text{BibTeX field}}]}{\#[\text{Token}_{\text{parsed field}} \cap \text{Token}_{\text{BibTeX}}]}$
• New measurement:
  $\text{Field Precision}_{new} = \frac{\#[\text{Token}_{\text{parsed field}} \cap \text{Token}_{\text{BibTeX field}}]}{\#[\text{Token}_{\text{query citation}} \cap \text{Token}_{\text{BibTeX}}]}$

Definition
• $\text{Token}_{\text{parsed field}}$: the tokens that appear in the parsed subfield
• $\text{Token}_{\text{query citation}}$: the tokens that appear in the query citation string
• $\text{Token}_{\text{BibTeX field}}$: the tokens that appear in the specific subfield in the BibTeX file
• $\text{Token}_{\text{BibTeX}}$: all tokens that appear in the BibTeX file
• These tokens do not include punctuation

Compare with ParaCite
• DataSet: collected from CiteSeer; training set: 2416; testing set: 4131
• ParaCite: uses its default template database (adding templates to its database is not easy); tested on the testing set
• Our system: uses the template database built from the training set; tested on the testing set

Experimental Results
System    Evaluation  Author   Title    Journal  Volume   Page     Issue    Month    Year     Score
ParaCite  new         32.90%   73.35%   29.83%   -        4.58%    25.05%   -        77.04%   50.22%
ParaCite  old         99.08%   62.72%   30.46%   -        100.00%  93.96%   -        99.70%   78.81%
Ours      new         93.73%   73.32%   51.34%   83.52%   94.62%   85.11%   89.18%   96.49%   84.80%
Ours      old         90.58%   89.51%   67.66%   93.58%   96.69%   91.79%   99.49%   99.50%   91.45%

Analysis
• ParaCite can extract only one author name
• The old evaluation has a problem: a system that extracts less information is likely to obtain high accuracy

Evaluation for clean DataSet
• The citation string is fully composed of the corresponding metadata
• Accuracy = (number of correctly extracted fields) / (total number of fields)

Compare with INFOMAP
• DataSet: includes 160000 records
• Training dataset: 10000 × 6 (JMIS, ACM, IEEE, APA, MISQ, and ISR)
• Testing dataset: 10000 × 6 (JMIS, ACM, IEEE, APA, MISQ, and ISR)

Result
Style     Author   Title    Journal  Volume   Page     Issue    Year     Overall average
APA       99.67%   96.38%   97.06%   98.99%   98.71%   98.12%   99.42%   98.33%
IEEE      98.72%   98.12%   99.12%   99.30%   98.40%   98.39%   99.40%   98.78%
ACM       97.14%   95.01%   93.93%   97.19%   97.92%   97.03%   98.88%   96.73%
ISR       99.48%   96.17%   96.96%   99.15%   98.55%   98.39%   99.35%   98.29%
MISQ      98.59%   97.99%   98.98%   99.41%   98.83%   98.61%   99.54%   98.85%
JMIS      91.95%   87.90%   90.46%   99.23%   98.76%   98.03%   99.46%   95.11%
Average   97.59%   95.26%   96.09%   98.88%   98.53%   98.09%   99.34%   97.68%

Evaluation for Cora DataSet
• 500 records
• Used as a benchmark in many papers (HMM, SVM, CRF)

Evaluation
• Words are divided into four kinds: TP, FP, TN, FN
• Four metrics:
  • Word accuracy: (TP+TN)/(TP+FP+FN+TN)
  • Precision: TP/(TP+FP)
  • Recall: TP/(TP+FN)
  • F1-measure: (2*Precision*Recall)/(Precision+Recall)

Our System
Field     Acc.     F1
Author    97.17%   93.98%
Title     94.17%   90.13%
Journal   93.58%   83.27%
Volume    99.21%   84.62%
Page      99.21%   92.09%
Date      99.92%   98.96%

Mining Translations of Chinese Names from Web Corpora by Using a Query Expansion Technique and Support Vector Machine

Agenda
• Introduction
• Proposed Approach
• Experiments
• Conclusions and Future Work

Background
• Most academic information can be found on the Web: Google Scholar, DBLP, etc.

Problems in Searching Chinese Name
• Only Chinese corpus

Challenges in Chinese Name Translation
• Many pronunciation rules in different areas: 陳 → Chen (Taiwan), Tsun (Hong Kong), Tan (Fukien)
• Some additional words exist, e.g. 黃光明 (Kwang-Ming Frank Hwang), 張韻詩 (Jane Win-Shih Liu)

Common Chinese Name Translation Format
Type    Name format                                                                     Examples
Type-1  (Chinese given name) (Surname) or (Surname), (Chinese given name)               劉豐哲 (Fon-Che Liu), 黃田漢 (Ng Tian Hann), 林牛 (Ngau Lam)
Type-2  (Merged Chinese given name) (Surname)                                           吳德琪 (Derchyi Wu)
Type-3  (Western first name) (Surname)                                                  趙蓮菊 (Anne Chao)
Type-4  (Chinese given name) (Western first name) (Surname)                             黃光明 (Kwang-Ming Frank Hwang)
Type-5  (Abbreviated Chinese given name) (Surname)                                      張秀瑜 (S.-Y. Chang)
Type-6  (Western first name) (Abbreviated Chinese given name) (Surname)                 李昭勝 (Jack-C. Lee)
Type-7  (Chinese given name) (Abbreviated Chinese given name) (Surname)                 蔡桂紅 (Gwei-Hung H. Tsai)
Type-8  (Chinese given name) (Unpredictable Surname)                                    張韻詩 (Jane Win-Shih Liu)

Goal
• Design an automatic mechanism to translate a given Chinese name into its related English name

Concepts of Proposed Approach
(Figure label: No corresponding translations)

Three Major Techniques
• Query expansion technique (resource: translation of the surname)
  • Obtains the Web page snippets related to the Chinese name translation.
  • Solves the problem of unrelated terms existing in the name translation.
• Knowledge-based method (resources: Chinese surname database, a common dictionary, Western first name database)
  • Obtains all the name-like terms from the returned Web page snippets.
• SVM (resources: Chinese pronunciation database, the phonetic feature and the distant feature, selected training samples)
  • Selects the appropriate Chinese name translations from the candidates.

System Architecture
(Diagram: Chinese names → Query expander (Chinese surname database) → returned Web page snippets → Candidate extractor (Western first name database, on-line dictionary) → name candidates → SVM-based name selector (Chinese pronunciation database) → translated English names)

Query Expander
• Goal: to retrieve Web page snippets that contain both a person’s Chinese name and the translation of the person’s surname.
• Name splitter (uses the Chinese surname database)
  • Determines whether the input Chinese name contains a compound surname.
  • Divides the input Chinese name into a “surname” part and a “given name” part.
• Surname translator (uses the Chinese surname database)
  • Selects appropriate surname translations.
  • The strength of the relationship between each surname translation and the person is determined by the “distance from the person’s Chinese name to the surname’s translation”.
• Web page retriever
  • Makes the concept of the query word clearer.
  • Retrieves the related Web pages.
  • The new query word will be “(Chinese name) + (Surname’s translation)”.
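
The name splitter and the expanded query described above can be illustrated with a small sketch. It assumes a toy surname table in place of the Chinese surname database (the entries for 陳 and 歐陽 are taken from examples in these slides), joins the two query parts with a space, and omits the distance-based selection of surname translations; all names in the code are hypothetical.

```python
# Toy stand-in for the Chinese surname database (assumption for illustration).
SURNAME_TRANSLATIONS = {
    "陳": ["Chen", "Tsun", "Tan"],
    "歐陽": ["Ouhyang"],  # a compound (two-character) surname
}

def split_name(chinese_name, surnames=SURNAME_TRANSLATIONS):
    """Split a Chinese name into (surname, given name).

    Compound surnames are tried first so that 歐陽明 is split as 歐陽 + 明.
    """
    for length in (2, 1):
        surname = chinese_name[:length]
        if surname in surnames and len(chinese_name) > length:
            return surname, chinese_name[length:]
    raise ValueError(f"unknown surname in {chinese_name!r}")

def expand_queries(chinese_name, surnames=SURNAME_TRANSLATIONS):
    """Build one query per surname translation:
    "(Chinese name) + (Surname's translation)".
    """
    surname, _ = split_name(chinese_name, surnames)
    return [f"{chinese_name} {translation}" for translation in surnames[surname]]
```

For example, `expand_queries("陳威達")` yields one query for each of the three translations of 陳, and each query would then be sent to the search engine to collect snippets.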

Distance between Two Terms
• Calculation of the “distance from two terms”: $D = N$, where $D$ is the distance and $N$ is the number of non-words between the two terms.
• Example: 陳威達 (Wei-Da Chen). The distance from the person’s Chinese name (陳威達) to the surname’s translation (Chen) is 3.

Candidate Extractor
• Goal: to extract possible candidates from the retrieved Web page snippets.
• Steps:
  1. Remove all HTML tags.
  2. Identify all the positions of the Chinese surnames in the snippets (using the Chinese surname database).
  3. Extract any English term near each surname in the snippets if the term has one of the following properties:
     – The term cannot be found in a common dictionary.
     – The term is a Western first name.
     – The length of the term is 1.
  ※ At most three English terms in the neighborhood of each surname are extracted.
• The extracted terms are the name translation candidates and are sent to the SVM-based name selector for processing.

SVM-based Name Selector
• Goal: to extract each candidate’s features and use them to determine whether the candidate is the correct translation of the input Chinese name.
• Features:
  1. The phonetic feature
     – Phonetic similarity (Soundex algorithm)
  2. The distant feature
     – Smallest distance (between the Chinese name and the translation candidate)
     – Number of appearances in the neighborhood

Distant Features
• The “neighborhood”: the close area around each occurrence of the Chinese name. The close area is defined by a given threshold on the distance, measured in number of words.
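
The two distant features can be computed along the following lines. This is a hedged sketch rather than the system's code: it assumes that distance is the difference in token positions, where a token is a run of Latin letters or a run of Chinese characters, that candidates are single words, and that the default `threshold` of 5 is arbitrary. Under this tokenization the example 陳威達( Wei-Da Chen) gives a distance of 3 between 陳威達 and Chen.

```python
import re

# A token is a run of Latin letters or a run of CJK characters (assumption).
TOKEN_RE = re.compile(r"[A-Za-z]+|[\u4e00-\u9fff]+")

def distant_features(snippet, chinese_name, candidate, threshold=5):
    """Return (smallest distance, number of appearances in the neighborhood).

    Limitation of this sketch: the Chinese name is only found when it forms
    a whole CJK run, i.e. it is not glued to other Chinese characters.
    """
    tokens = TOKEN_RE.findall(snippet)
    name_pos = [i for i, t in enumerate(tokens) if t == chinese_name]
    cand_pos = [i for i, t in enumerate(tokens) if t.lower() == candidate.lower()]

    distances = [abs(c - n) for n in name_pos for c in cand_pos]
    smallest = min(distances) if distances else None

    # A candidate occurrence is "in the neighborhood" if it lies within
    # `threshold` tokens of at least one occurrence of the Chinese name.
    in_neighborhood = sum(
        1 for c in cand_pos if any(abs(c - n) <= threshold for n in name_pos)
    )
    return smallest, in_neighborhood
```

Both values would then be combined with the phonetic feature to form the feature vector given to the SVM.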
(Example: smallest distance: 2; number of appearances in the neighborhood of the candidate “win-shih”: 2)

Summary
• Query expansion technique: retrieving related Web pages.
• Knowledge-based method: extracting appropriate name translation candidates from the retrieved Web pages.
• SVM: learning the verification rule and selecting appropriate name translations from the extracted candidates.

Testing Environment and Dataset 1/3
• The following tools are used: Cambridge on-line dictionary, Google search engine, LIBSVM
• Two datasets are used:
  • Dataset I (training & testing): collected from the directory of scholars of the Institute of Mathematics. Contains 78 records.
  • Dataset II (testing): collected by our program from the website of the Directory of the Division of Computer Science of the National Science Council. Contains 1,157 records; the name translations of 40 of them do not exist in Google.

Testing Environment and Dataset 2/3
Type    Name format                                                         Examples                                                                       Dataset I (#, %)   Dataset II (#, %)
Type-1  (Chinese given name) (Surname) or (Surname), (Chinese given name)   丁建文 (Jen-Wen Ding), 丁德榮 (Der-Rong Din), 歐陽明 (Ming Ouhyang)                19, 24.3%          1000, 89.5%
Type-2  (Merged Chinese given name) (Surname)                               蔡丕裕 (Piyu Tsai)                                                              10, 12.8%          42, 3.8%
Type-3  (Western first name) (Surname)                                      賴友仁 (Eugene Lai)                                                             9, 11.5%           9, 0.8%
Type-4  (Chinese given name) (Western first name) (Surname)                 劉立頌 (Alan Li-Sung Liu), 陳嘉懿 (Jia-Yih Joy Chen), 楊豐瑞 (Fongray Frank Young)  14, 17.9%          50, 4.5%
Type-5  (Abbreviated Chinese given name) (Surname)                          洪英超 (I.-C. Hung)                                                             3, 3.8%            0, 0%
Type-6  (Western first name) (Abbreviated Chinese given name) (Surname)     曾秋蓉 (Judy C. R. Tseng)                                                       8, 10.3%           9, 0.8%
Type-7  (Chinese given name) (Abbreviated Chinese given name) (Surname)     黃哲志 (Tetz C. Huang)                                                          3, 3.8%            3, 0.4%
Type-8  (Chinese given name) (Unpredictable Surname)                        張肇健 (Trieu-Kien Truong)                                                      12, 15.4%          4, 0.4%

Testing Environment and Dataset 3/3
• The alignment accuracy: proposed by Huang (2005).
  The probability of selecting the correct answers when the searched snippets contain the correct answers:
  $A_i = N_{cc} / N_d$
  where
  • $A_i$: the alignment accuracy of candidate i
  • $N_d$: the number of testing data
  • $N_{cc}$: the number of correct translations
• Performance measurement: top-1 to top-5 alignment accuracy.

Results and Analysis 1/3 - Overall performance on Dataset I
• 70.5% top-1 accuracy
• 91% top-5 accuracy

Results and Analysis 2/3 - Overall performance on Dataset II
• 57.9% top-1 accuracy
• 86.2% top-5 accuracy

Results and Analysis 3/3 - Performance of each name type
(Chart of per-type performance; example names per type: Type-1 丁建文 (Jen-Wen Ding), 丁德榮 (Der-Rong Din), 歐陽明 (Ming Ouhyang); Type-2 蔡丕裕 (Piyu Tsai); Type-3 賴友仁 (Eugene Lai); Type-4 劉立頌 (Alan Li-Sung Liu), 陳嘉懿 (Jia-Yih Joy Chen); Type-5 洪英超 (I.-C. Hung); Type-6 曾秋蓉 (Judy C. R. Tseng); Type-7 黃哲志 (Tetz C. Huang); Type-8 張肇健 (Trieu-Kien Truong))
• Our system performs better on Type-1, Type-2, Type-4 and Type-6.

Discussions
• Major reason for the low performance on Type-3, Type-5, Type-7 and Type-8: the lack of Web information.
• Usually more than one correct name translation is found for an input Chinese name.
• The name ambiguity problem.

Limitations
• Uncommon surnames
• Reliance on Web resources
• Search engine selection
• No name disambiguation

Conclusions
• Mining information from Web corpora is effective for dealing with the person name translation problem
• The name ambiguity problem arises frequently

Thank You
Jan-Ming Ho, hoho@iis.sinica.edu.tw
Institute of Information Science, Academia Sinica