Vietnamese VNSEN search engine

Some studies on Vietnamese
multi-document summarization and
semantic relation extraction
Laboratory of Data Mining &
Knowledge Science
3/22/2016
Content
I. Vietnamese multi-document summarization
1. Vietnamese VNSEN search engine
2. Clustering
3. Semantic similarity
4. Multi-document summarization
II. Semantic relation extraction
1. Vietnamese medical ontology
2. Object relation extraction
3. Cause-and-effect relations
4. Vietnamese entity search engine
Vietnamese multi-document
summarization
• Vietnamese VNSEN search engine
– Based on NUTCH
– Integrated Vietnamese word segmentation tool
• JvnSegmenter
– Indexed 500,000 pages from vi.wikipedia.org
Vietnamese multi-document
summarization
• Clustering
– Integrated clustering into the VNSEN search engine
• Using snippet results from the VNSEN search engine
• Hierarchical Agglomerative Clustering (HAC) algorithm
– Evaluation against clustering on the Vivisimo search engine
• Cluster labeling
• Compactness of clusters
• Isolation of clusters
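The snippet clustering step can be illustrated with a minimal sketch of average-link HAC. This is not the VNSEN implementation: the bag-of-words cosine similarity, the `threshold` stopping rule, and the English sample snippets are all assumptions made for illustration. Vietnamese snippets would first need word segmentation, since whitespace splitting yields syllables rather than words.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def hac(snippets, threshold=0.2):
    """Average-link agglomerative clustering.

    Repeatedly merges the most similar pair of clusters and stops when no
    pair has average similarity >= threshold (an assumed stopping rule).
    Returns clusters as lists of snippet indices.
    """
    vectors = [Counter(s.lower().split()) for s in snippets]
    clusters = [[i] for i in range(len(snippets))]

    def avg_sim(c1, c2):
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = avg_sim(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Hypothetical snippets, for illustration only.
snippets = [
    "apple fruit juice",
    "apple fruit pie",
    "python programming language",
    "python language tutorial",
]
clusters = hac(snippets)
```

Compactness and isolation could then be estimated from the same similarity function, for example as the mean similarity within a cluster and between clusters respectively.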
Vietnamese multi-document
summarization
• Implementation of semantic similarity
measures
– Semantic similarity between words based on
Semantic Network
• Path length (PL)
• Information content (IC)
– Semantic similarity between sentences based on
topic analysis
– Word order similarity between sentences
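Two of the measures listed above can be sketched as follows. The slides do not give formulas, so both are assumptions: path-length similarity is taken as 1 / (1 + shortest path length) on a toy semantic network, and word order similarity uses a simplified normalized-difference form over word positions. The network contents are hypothetical.

```python
import math
from collections import deque

# Toy undirected semantic network (hypothetical data, for illustration only).
EDGES = {
    "animal": ["dog", "cat"],
    "dog": ["animal", "puppy"],
    "cat": ["animal"],
    "puppy": ["dog"],
}

def path_length(a, b):
    """Number of edges on the shortest path between a and b, or None."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in EDGES.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def pl_similarity(a, b):
    """Map path length to [0, 1]: identical words score 1, unreachable 0."""
    d = path_length(a, b)
    return 0.0 if d is None else 1.0 / (1.0 + d)

def word_order_similarity(s1, s2):
    """1 - ||r1 - r2|| / ||r1 + r2||.

    r1 and r2 hold the 1-based position of each joint-vocabulary word in
    its sentence, or 0 when the word is absent (a simplifying assumption).
    """
    t1, t2 = s1.split(), s2.split()
    vocab = list(dict.fromkeys(t1 + t2))  # joint vocabulary, order preserved
    r1 = [t1.index(w) + 1 if w in t1 else 0 for w in vocab]
    r2 = [t2.index(w) + 1 if w in t2 else 0 for w in vocab]
    diff = math.sqrt(sum((x - y) ** 2 for x, y in zip(r1, r2)))
    total = math.sqrt(sum((x + y) ** 2 for x, y in zip(r1, r2)))
    return 1.0 - diff / total if total else 0.0
```

An information-content measure would additionally need corpus frequencies for each node of the network, which this sketch leaves out.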
Vietnamese multi-document
summarization
• Building Vietnamese semantic corpus
– Hidden topic corpus
• Using Latent Dirichlet Allocation (LDA) model
• Using the JGibbsLDA tool for topic analysis
– Vietnamese Wikipedia corpus
• Using category graph model
• Result
– Hidden-topic corpora with 120, 150, and 200 topics, built from the
Vnexpress and Wikipedia data sets
– Category graph with 14,000 category nodes and 200,000
articles
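How hidden topics are estimated with LDA can be sketched with a small collapsed Gibbs sampler. This is a generic textbook-style sketch and does not reproduce the interface of the tool named above. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the iteration count are assumptions chosen for illustration.

```python
import random

def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs is a list of token lists. Returns theta, the per-document topic
    distributions, as a list of lists that each sum to 1.
    """
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]                       # n(d, k)
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]  # n(k, w)
    topic_total = [0] * n_topics                                     # n(k)

    # Random initial topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + V * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + n_topics * alpha
        theta.append([(doc_topic[d][t] + alpha) / denom for t in range(n_topics)])
    return theta
```

The resulting topic distributions can serve as dense document or sentence vectors, for instance as input to a cosine similarity.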
Vietnamese multi-document
summarization
• Multi-document summarization
– Maximal Marginal Relevance (MMR) method
• Improving with Semantic Similarity Measures based on
Hidden topic analysis
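The MMR selection step can be sketched as a greedy loop that trades relevance to a query against redundancy with sentences already selected. The default bag-of-words cosine similarity is an assumption. The `sim` parameter is a hypothetical hook where a similarity based on hidden topics could be plugged in, and the sample sentences are invented.

```python
import math
from collections import Counter

def bow_cosine(s1, s2):
    """Bag-of-words cosine similarity between two whitespace-tokenized texts."""
    a, b = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def mmr_select(sentences, query, k=2, lam=0.5, sim=bow_cosine):
    """Greedy Maximal Marginal Relevance.

    At each step picks the sentence maximizing
    lam * sim(s, query) - (1 - lam) * max(sim(s, t) for t already selected).
    """
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = sim(sentences[i], query)
            redundancy = max((sim(sentences[i], sentences[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]

# Hypothetical sentences, for illustration only.
sentences = [
    "flu causes fever",
    "flu causes fever and cough",
    "rest helps recovery from flu",
]
diverse = mmr_select(sentences, "flu fever", k=2, lam=0.5)
relevant_only = mmr_select(sentences, "flu fever", k=2, lam=1.0)
```

With `lam=1.0` the selection reduces to ranking by relevance alone. Lower values penalize sentences that repeat what the summary already contains.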
[Figure: summarization pipeline diagram. Labels: List of documents (D1 … Dk), Pre-processing, Hidden topic, Cosine measure, Documents weights, Cluster, Label, List of sentences (S1 … Sk), Sentences weights, Summary document]
Vietnamese multi-document
summarization
• Multi-document summarization for simple
Vietnamese Medical Q&A system
– Semantic Similarity Measures based on
Vietnamese Wikipedia corpus
– Medical Ontology
– Hidden topic analysis
– Clustering
Vietnamese multi-document
summarization
• Table-of-Contents generation
– Using some solutions of Text Segmentation and
Title Generation for automatically generating a
Table-of-Contents.
Vietnamese multi-document
summarization
• Some of our Vietnamese language processing
utilities
– Nguyen Cam Tu, Phan Xuan Hieu. JvnSegmenter: A Java-based Vietnamese Word Segmentation
– Nguyen Cam Tu. JVnTextpro: A Java-based Vietnamese Text
Processing Toolkit
– Nguyen Cam Tu. JGibbsLDA: A Java and Gibbs Sampling
based Implementation of Latent Dirichlet Allocation (LDA)
– http://203.113.130.205:8080/sise: VNSEN Search Engine
(Implementers: Nguyen Thu Trang, Nguyen Cam Tu,
Nguyen Viet Cuong, Tran Mai Vu, Nguyen Minh Tuan etc.)
Semantic Relation Extraction
• Vietnamese Medical Ontology
– 23 entity classes
– 14 relations
– 200 entities
• Technique to improve ontology
– Named Entity Recognition
– Relation extraction
–…
Semantic Relation Extraction
• Object relation extraction
– Product domain
– Medical domain
• Technique
– Using Wrapper technique for structured data
(HTML/XML/Table)
– NLP for unstructured data (Text)
• HMM Model
• CRF Model
• …
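The wrapper idea for structured data can be sketched with the standard-library HTML parser: a small class that walks table rows and returns attribute/value pairs. The two-column row layout and the sample product page are assumptions made for illustration. Real pages would need a wrapper tuned to, or induced from, their actual markup.

```python
from html.parser import HTMLParser

class TableWrapper(HTMLParser):
    """Collects the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

def extract_attributes(html):
    """Return {attribute: value} from the two-column rows of an HTML table."""
    parser = TableWrapper()
    parser.feed(html)
    return {row[0]: row[1] for row in parser.rows if len(row) == 2}

# Hypothetical product page fragment, for illustration only.
page = """
<table>
  <tr><th>Name</th><td>Phone X</td></tr>
  <tr><th>Price</th><td>199 USD</td></tr>
</table>
"""
record = extract_attributes(page)
```

For unstructured text, a sequence model such as an HMM or CRF would take the place of this rule-based wrapper.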
Semantic Relation Extraction
• Cause-and-effect relations
Using the research results of Corina Roxana Girju [Rox08]
to investigate cause-and-effect relations
such as:
• Adverbial causal link
• Preposition causal link
• Subordination causal link
• Clause-integrated link
[Rox08] Corina Roxana Girju (2008). Semantic Relation Extraction and its Applications,
Invited tutorial at the European Summer School in Logic, Language and
Information (ESSLLI 2008), Hamburg, Germany, August 2008.
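The four link types listed on this slide can be sketched as a cue-phrase matcher. The English cue phrases below are illustrative assumptions, not taken from the slides, and a Vietnamese system would need its own cue lexicon. The function `find_causal_link` is hypothetical. It handles only cues that occur between the two clauses, so a sentence-initial "Because …" is not split correctly.

```python
import re

# (link type, cue phrase, True if the cause follows the cue).
# Longer phrases come first so "because of" is tried before "because".
CUES = [
    ("preposition", "because of", True),
    ("preposition", "due to", True),
    ("clause-integrated", "that is why", False),
    ("adverbial", "as a result", False),
    ("adverbial", "therefore", False),
    ("subordination", "because", True),
    ("subordination", "since", True),
]

def find_causal_link(sentence):
    """Return (link_type, cause, effect) for the first matching cue, else None."""
    for link_type, cue, cause_follows in CUES:
        m = re.search(r"\b" + re.escape(cue) + r"\b", sentence, flags=re.IGNORECASE)
        if m:
            before = sentence[:m.start()].strip(" ,.;")
            after = sentence[m.end():].strip(" ,.;")
            cause, effect = (after, before) if cause_follows else (before, after)
            return link_type, cause, effect
    return None

link = find_causal_link("The flight was delayed because of heavy fog.")
```

Cue matching of this kind tends to over-generate, because words such as "since" also have non-causal readings, so its output is better treated as a set of candidates for later filtering.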
Semantic Relation Extraction
• Vietnamese entity search engine in the field
of medical health care
– Using the medical ontology, object relation
extraction, cause-and-effect relation extraction…
– In association with the UIUC-DB&IS Lab (University of Illinois
at Urbana-Champaign)
• Object Search
• Query Log Mining
• Object Extraction
[Cha08] Kevin C. Chang (2008). Data-Aware Search on the Web, Act. 2: Entity Search,
Technical Report, University of Illinois at Urbana-Champaign (a talk at College
of Technology, Vietnam National University, Hanoi, July 08, 2008).
Some articles in 2008
[LNH08] Dieu-Thu Le, Cam-Tu Nguyen, Quang-Thuy Ha, Xuan-Hieu Phan, and Susumu
Horiguchi (2008). Matching and Ranking with Hidden Topics towards Online
Contextual Advertising, The 2008 IEEE/WIC/ACM International Conference on Web
Intelligence (WI-08), University of Technology, Sydney, Australia, December 9 - 12,
2008 (accepted)
[PNL08] Xuan-Hieu Phan, Cam-Tu Nguyen, Dieu-Thu Le, Le-Minh Nguyen, Susumu
Horiguchi, and Quang-Thuy Ha (2008). Classification and Contextual Match on the
Web with Hidden Topics from Large Data Collections, IEEE TRANSACTIONS ON
KNOWLEDGE AND DATA ENGINEERING (Submitted)
[VUH08] Tran Mai Vu, Pham Thi Thu Uyen, Hoang Minh Hien, Ha Quang Thuy (2008).
Semantic Similarity of sentences and application for multi-document
summarization to evaluate on clustering component of Vietnamese search engine,
Workshop on Information Communication Technology (ICTFIT08), College of
Science, Vietnam National University, Ho Chi Minh City, November 14, 2008 (in
Vietnamese, accepted).
THANK YOU