Some studies on Vietnamese multi-document summarization and semantic relation extraction
Laboratory of Data Mining & Knowledge Science
3/22/2016

Content
I. Vietnamese multi-document summarization
  1. Vietnamese VNSEN search engine
  2. Clustering
  3. Semantic similarity
  4. Multi-document summarization
II. Semantic relation extraction
  1. Vietnamese medical ontology
  2. Object relation extraction
  3. Cause-and-effect relations
  4. Vietnamese entity search engine

Vietnamese multi-document summarization

• Vietnamese VNSEN search engine
  – Based on NUTCH
  – Integrated Vietnamese word segmentation tool
    • JvnSegmenter
  – Indexed 500,000 pages from vi.wikipedia.org

• Clustering
  – Integrated clustering into the VNSEN search engine
    • Using snippet results from the VNSEN search engine
    • Hierarchical Agglomerative Clustering (HAC) algorithm
  – Evaluation against clustering on the Vivisimo search engine
    • Cluster labeling
    • Compactness of clusters
    • Isolation of clusters

• Implementation of semantic similarity measures
  – Semantic similarity between words based on a semantic network
    • Path length (PL)
    • Information content (IC)
  – Semantic similarity between sentences based on topic analysis
  – Word order similarity between sentences

• Building a Vietnamese semantic corpus
  – Hidden topic corpus
    • Using the Latent Dirichlet Allocation (LDA) model
    • Using the JGibbsLDA tool to analyze topics
  – Vietnamese Wikipedia corpus
    • Using a category graph model
• Result
  – Hidden topic corpora with 120/150/200 topics based on the Vnexpress/Wikipedia data sets
  – Category graph with 14,000 category nodes and 200,000 articles

• Multi-document summarization
  – Maximal Marginal Relevance (MMR) method
    • Improved with semantic similarity measures based on hidden topic analysis
  [Figure: summarization pipeline. Recoverable labels: Documents, Pre-processing, Hidden topic, Cluster, Label, Cosine measure, Weights, List of documents (D1 … Dk), Sentences, List of sentences (S1 … Sk), Summary document]

• Multi-document summarization for a simple Vietnamese medical Q&A system
  – Semantic similarity measures based on the Vietnamese Wikipedia corpus
  – Medical ontology
  – Hidden topic analysis
  – Clustering

• Table-of-Contents generation
  – Using text segmentation and title generation techniques to generate a table of contents automatically

• Some of our Vietnamese language processing utilities
  – Nguyen Cam Tu, Phan Xuan Hieu. JvnSegmenter: A Java-based Vietnamese Word Segmentation
  – Nguyen Cam Tu. JVnTextpro: A Java-based Vietnamese Text Processing Toolkit
  – Nguyen Cam Tu. JGibbsLDA: A Java and Gibbs Sampling based Implementation of Latent Dirichlet Allocation (LDA)
  – http://203.113.130.205:8080/sise: VNSEN Search Engine (Implementers: Nguyen Thu Trang, Nguyen Cam Tu, Nguyen Viet Cuong, Tran Mai Vu, Nguyen Minh Tuan etc.)
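
The Maximal Marginal Relevance (MMR) step named in the multi-document summarization slide can be illustrated with a minimal sketch. It is not the laboratory's implementation: the function names `cosine` and `mmr_select`, the parameters `k` and `lam`, and the English sample sentences are all hypothetical, and a plain bag-of-words cosine stands in for the hidden-topic-based semantic similarity described in the slides. The `sim` argument is where such a measure could be swapped in.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mmr_select(sentences, query, k=2, lam=0.5, sim=cosine):
    """Greedy MMR selection (illustrative sketch, hypothetical names).

    At each step, pick the candidate that maximizes
        lam * sim(sentence, query)
        - (1 - lam) * max similarity to any already selected sentence.
    """
    # Assumption: whitespace tokenization; a Vietnamese pipeline would
    # use a word segmenter here instead.
    vectors = [Counter(s.lower().split()) for s in sentences]
    query_vec = Counter(query.lower().split())

    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = sim(vectors[i], query_vec)
            redundancy = max(
                (sim(vectors[i], vectors[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]


if __name__ == "__main__":
    sample = [
        "dengue fever causes high fever",
        "dengue fever causes very high fever",
        "mosquito bites spread dengue",
    ]
    for line in mmr_select(sample, "dengue fever", k=2, lam=0.5):
        print(line)
```

With `lam=0.5` the near-duplicate second sentence is penalized for redundancy, so the sketch selects the first and third sentences; with `lam=1.0` the redundancy term vanishes and selection falls back to pure relevance ranking.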

Semantic Relation Extraction

• Vietnamese Medical Ontology
  – 23 entity classes
  – 14 relations
  – 200 entities
• Techniques to improve the ontology
  – Named Entity Recognition
  – Relation extraction
  – …

• Object relation extraction
  – Product domain
  – Medical domain
• Technique
  – Using the wrapper technique for structured data (HTML/XML/Table)
  – NLP for unstructured data (Text)
    • HMM Model
    • CRF Model
    • …

• Cause-and-effect relations
  – Using the research results of Corina Roxana Girju to investigate cause-and-effect relations such as:
    • Adverbial causal link
    • Preposition causal link
    • Subordination causal link
    • Clause integrated link
  [Rox08] Corina Roxana Girju (2008). Semantic Relation Extraction and its Applications, Invited tutorial at the European Summer School in Logic, Language and Information (ESSLLI 2008), Hamburg, Germany, August 2008.

• Vietnamese entity search engine in the field of medical health care
  – Using Medical Ontology, Object relation extraction, Cause-and-effect relation extraction…
  – In association with the UIUC-DB&IS Lab (University of Illinois at Urbana-Champaign)
    • Object Search
    • Query Log Mining
    • Object Extraction
  [Cha08] Kevin C. Chang (2008). Data-Aware Search on the Web, Act. 2: Entity Search, Technical Report, University of Illinois at Urbana-Champaign (a talk at College of Technology, Vietnam National University, Hanoi, July 08, 2008).

Some articles in 2008

[LNH08] Dieu-Thu Le, Cam-Tu Nguyen, Quang-Thuy Ha, Xuan-Hieu Phan, and Susumu Horiguchi (2008). Matching and Ranking with Hidden Topics towards Online Contextual Advertising, The 2008 IEEE/WIC/ACM International Conference on Web Intelligence (WI-08), University of Technology, Sydney, Australia, December 9-12, 2008 (accepted).
[PNL08] Xuan-Hieu Phan, Cam-Tu Nguyen, Dieu-Thu Le, Le-Minh Nguyen, Susumu Horiguchi, and Quang-Thuy Ha (2008). Classification and Contextual Match on the Web with Hidden Topics from Large Data Collections, IEEE Transactions on Knowledge and Data Engineering (submitted).
[VUH08] Tran Mai Vu, Pham Thi Thu Uyen, Hoang Minh Hien, Ha Quang Thuy (2008). Semantic Similarity of sentences and application for multi-document summarization to evaluate on clustering component of Vietnamese search engine, Workshop on Information Communication Technology (ICTFIT08), College of Science, Vietnam National University, Ho Chi Minh City, November 14, 2008 (in Vietnamese, accepted).

THANK YOU