LOD2 KOREA: Towards Publishing Korean Linked Data on the Web
Key-Sun Choi
Joint work with Martin Rezk, Jungyeul Park, Yoon Yongun, Kyungtae Lim, YoungGyun Hahm

Key-Sun Choi - Personal History
• NEC C&C Lab. - PIVOT Japanese-Korean Machine Translation
• Korean Part-of-Speech Tagset, Corpus, Dictionary
• CoreNet (Korean-Chinese-Japanese) Semantic Wordnet (2004)
• KORTERM: Korea Terminology Research Center for Language and Knowledge Engineering (1998-2007), Research Center of Ministry of Culture
• KAIST Research Grand Award (1998)
• ISO/TC37/SC4 Founding member (Language Resource Management Standards)
• ISWC 2007 PC Co-Chair (International Semantic Web Conference)
• AFNLP President (2009-2010)
• DBpedia Korea http://ko.dbpedia.org/
• http://lod2.eu/ partner (EU FP7)

NLP2RDF
• Triple in Natural Language
  • Subject
  • Object
  • Predicate
• Extract from Sentences
  • 野生種의 장미는 主로 北半球의 溫帶와 寒帶 地方에 分布한다.
  • Wild roses are distributed mainly in the temperate and frigid zones of the northern hemisphere.
1. Subject: 장미 (rose)
2. Object: 북반구의 온대 지방, 한대 지방 (temperate and frigid zones of the northern hemisphere)
3. Predicate: 分布 (isDistributedAt)
Key-Sun Choi - LOD2 Korea

[Figure: example ontology of embedded systems. Relations: consists_of, reside_on. Concepts (translated from Korean): manufacturer, platform, maker, system software, development environment, operating system, application program, middleware, embedded system, embedded software, browser, embedded operating system, media player, real-time embedded operating system (RTOS), communication middleware, home appliances, non-real-time embedded operating system, digital camera, DVD player, MP3 player, set-top box. Named products and vendors: VRTX, VxWorks, pSOS, WinCE, Microsoft, Wind River.]

NLP2RDF
<Conceptual Layer> <DBpedia> (based on DBpedia Ontology)
  Barack Obama, URI = dbpedia12415 (conceptually unique)
  <Career> President
  <Nationality> United States
  <Party> Democrats
  ...
LOD algorithm
  Barack Obama is the President of the United States
  Barack Obama, URI = sen1word1 (unique within the document)
  <POStag> NNG </POStag>
  ...
"KNIF" Wrapper
The Output of NLP tools
Sentence: 'Barack Obama is the President of the United States'

For this work
• Triples and URI
• Ontology
1. For RDF Mapping
  • String Ontology
  • Structured Sentence Ontology
  • NIF and Korean language
2. For LOD Mapping
  • URI for DBpedia entity
  • Mapping Word in Text to DBpedia

Parse tree to Summary
• 물체의 낙하 거리는 시간의 제곱에 비례한다 (The distance an object falls is proportional to the square of the time.)
<Triple>
1. Subject
  • 물체의 낙하거리 (the distance an object falls)
2. Predicate
  • 비례한다 (is proportional to)
3. Contents
  • 시간의 제곱 (the square of the time)

Why NLP? Why Syntax and Semantics?
• Advanced technology on the higher-level layers

[Figure: NLP Layer Cake]

Semantic Web vs. NLP layer cake
Example sentence: 철수가 2층에 있는 세미나실을 예약한다.
Gloss: John-SUBJ 2-floor-LOC room-OBJ reserve-FIN
Layers, from top to bottom:
• Discourse: John: X1, room: L2
• Syntactic structure: subject, object, predicate
• Phrase: Room in 2nd floor
• Semantic tagging: [John: Human], [2-FL: Loc], [seminar-room: Room]
• Morph. Analysis: +가//2+층+에
• POS tagging: NPP/JOSA//Numeral/
• Tokenization: 철수가//2층에//
• String, URI, Encoding

How to develop a parser and a semantic classifier creatively?
• How to adapt Korean tools to the already developed tools
• Open Source NLP tools
  • Rich English, Japanese open tools/resources
  • A few Korean tools
• Already developed Korean language resources
  • KAIST tools/resources
  • KAIST open source on SourceForge and the web
  • Cambridge University Press: NLP Textbook (in progress)
• Linked Data: http://lod2.eu/ partner

Background
• The idea of linking data from different sources is not new:
  • Network Database Model: 1970s
  • Linked Data: today
• The goal is to facilitate sharing and re-using information.
• Linked Data aims to extend the Web with a data commons by creating typed links between data from different sources.

Background
• These links are usually modeled using the Resource Description Framework (RDF).
• Each piece of data is identified with a URI.
• The first task towards linking data is to identify which resources and which properties we want to describe.

Introduction
• NLP2RDF is a LOD2 Community project that is developing the NLP Interchange Format (NIF).
• NIF aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations.
• The output of NLP tools can be converted into RDF and used in the LOD2 Stack.
• http://nlp2rdf.org

NIF…
• Is based on RDF/OWL
• Enables users to annotate text in several languages in a uniform way
• Enables users to query text documents with SPARQL (example: http://semanticweb.kaist.ac.kr/nlp2rdf/)
• Sentence: 다크나이트는 미국의 영화이다.
• The Dark Knight is an American film.

NIF Wrapping
• The NLP Interchange Format (NIF) is an RDF/OWL-based format that makes it possible to combine and chain several Natural Language Processing (NLP) tools in a flexible, lightweight way.
Source: Sebastian Hellmann, AKSW, Universität Leipzig, NLP Interchange Format (NIF)

Structure of NLP2RDF
• NLP Layer
• Interchange Layer
• Data Layer

Example of NLP Layer
[Figure: Input Sentence -> English NLP (Tokenization, CFG Parser, Dependency Parser)]

How to create RDF from NLP output
Process: Raw Texts -> NLP Tools -> output -> NIF Wrapper -> RDF
Example: raw text "My dog also likes eating sausage."; NIF wrapper: StanfordWrapper.Java

Example of NLP2RDF in ENG
• http://nlp2rdf.lod2.eu/demo.php
• Sentence: Obama is the president of USA.

    <http://prefix.given.by/theClient#offset_0_5>
        sso:oliaLink <http://purl.org/olia/penn.owl#NNP> ;
        sso:posTag "NNP" ;
        sso:lemma "Obama" ;
        str:referenceContext <http://prefix.given.by/theClient#offset_0_30> ;
        str:anchorOf "Obama" ;
        rdf:type sso:Word , str:String .

Korean NLP2RDF
• Resources: morphemes, words (eojeols) and sentences in Korean
• Properties: POS, grammatical roles, etc.
• Problems to solve:
  • Linguistic Modeling (OLiA)
  • Processing Korean Text (NLP)
  • How to Produce and Query RDF

Linguistic Modeling (1)
• The Sejong tagset is the de facto standard tagset for Korean.
• We use OLiA (Ontologies of Linguistic Annotation) to link the Sejong tagset with language-independent reference concepts.
• OLiA consists of three different ontologies:
  • the OLiA reference model (language-independent),
  • the OLiA annotation model (depends on the tagset),
  • the OLiA linking model (depends on the tagset).
• We developed a fragment of the last two ontologies for Korean, that is, for the Sejong tagset.

Linguistic Modeling (2)
• We use NIF (NLP Interchange Format) to
  • standardize the input/output of the different tools to ease the connection among them, and to
  • uniquely identify (parts of) text, entities and relationships.
• NIF provides two URI schemes to identify resources:
  • Offset-based
  • Hash-based
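The two URI schemes can be illustrated with a minimal Python sketch. The offset-based form follows the pattern visible in the English demo output (offset_0_5 for "Obama", offset_0_30 for the whole sentence). The hash-based form is a simplified assumption (an MD5 digest over the anchored string plus a short context window) and is not the exact recipe defined by NIF. The function names, the context length of 10 characters and the 20-character readable suffix are hypothetical choices made for illustration.

```python
import hashlib
from urllib.parse import quote

# Prefix copied from the English demo output shown above.
PREFIX = "http://prefix.given.by/theClient#"


def offset_uri(prefix: str, begin: int, end: int) -> str:
    """Offset-based URI: identifies a substring by its character positions."""
    return f"{prefix}offset_{begin}_{end}"


def hash_uri(prefix: str, text: str, begin: int, end: int,
             context_len: int = 10, head_len: int = 20) -> str:
    """Simplified hash-based URI (an assumption, not the official NIF recipe).

    The digest covers the anchored string and a few characters of context
    on each side, so edits outside that window leave the URI unchanged.
    """
    left = text[max(0, begin - context_len):begin]
    right = text[end:end + context_len]
    anchor = text[begin:end]
    digest = hashlib.md5(f"{left}({anchor}){right}".encode("utf-8")).hexdigest()
    readable = quote(anchor[:head_len])  # percent-encode non-ASCII characters
    return f"{prefix}hash_{context_len}_{len(anchor)}_{digest}_{readable}"


sentence = "Obama is the president of USA."
word_uri = offset_uri(PREFIX, 0, 5)                   # the word "Obama"
context_uri = offset_uri(PREFIX, 0, len(sentence))    # the whole sentence
word_hash_uri = hash_uri(PREFIX, sentence, 0, 5)
```

In this sketch the trade-off is that an offset-based URI is short and easy to read but changes whenever the character positions shift, while the hash-based URI stays the same as long as the anchored string and its immediate context are unchanged.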
• In our application we opt for the hash-based scheme.

Korean NLP2RDF Platform
Pipeline: RAW Text -> Morpheme Analyzer -> Parser -> Wrapper -> NIF output
HanNanum (Morpheme Analyzer)
• Korean Open Source Morpheme Analyzer
• Developed by SWRC, KAIST
Korean Berkeley Parser (Parser)
• Training set: Modified Sejong Treebank (DongHyun Choi, Jungyeul Park, Key-Sun Choi, Korean Treebank Transformation for Parser Training, ACL - SPMRL 2012)
• F1-score: 82.12%
Produce triples (Wrapper)
• Use OLiA (Ontologies of Linguistic Annotation) to link the Korean tagsets with language-independent reference concepts
• The OLiA annotation model and the OLiA linking model produce triples using the Sejong tagset

[Figure: system architecture. Korean NLP: Input Korean Sentence, Korean Language information, Morph. Analyzer, CFG Parser, Dependency Parser, Korean Grammar Framework, Parsed result (URI, Tag). OnTop Framework: DataBase, Mappings, Ontologies, RDF generator, SPARQL Query Handler. Labels on the arrows: SPARQL Query, RDF triples.]

NIF Output
• Each piece of data is identified with a URI (hash-based)
• Resources: Morphemes, Words (eojeols), Sentences in Korean
• Properties: POS-tag, Grammatical roles, etc.
[Figure: parsing results and some produced triples]
DEMO site: http://semanticweb.kaist.ac.kr/nlp2rdf

NIF Output
• 이탈리아에서 공부하고 온 마틴은 한국을 사랑합니다.
• Martin, who came after studying in Italy, loves Korea.

Specific Issues of Korean
Parser Output:
1. String
2. Word, Sentence, Phrase, ...
3. Tag
4. ...
Ontology:
1. String Ontology
2. Structured Sentence Ontology (SSO)
3. OLiA Penn
4. Sejong Tag Set
NLP2RDF: Produce Triples
1. Korean Tagset
2. Linking with OLiA
RDF output

Sejong tags and their superclasses. Sejong classes are under LinguisticAnnotation/Tag/; OLiA classes are under LinguisticConcept/MorphosyntacticCategory/ unless a full path is given.

Tag | Sejong superclass | OLiA superclass
MA | Adverb | Adverb
MAJ | Adverb/ConjunctiveAdverb | Adverb and Conjunction/CoordinatingConjunction
MAG | Adverb/GeneralAdverb | Adverb
SN, XN | CardinalNumber | Quantifier/Numeral
MM | Determiner | PronounOrDeterminer/Determiner
SH, SL | ForeignWord | Residual/Foreign
IC | Interjection | Interjection
NA, NF, NN | Noun | Noun
XR | Noun/BaseMorpheme | Noun/CommonNoun
NNB, NNG | Noun/CommonNoun | Noun/CommonNoun
NNP | Noun/ProperNoun | Noun/ProperNoun
NP | Pronoun | PronounOrDeterminer/Pronoun
SE, SF, SO, SP, SS | Symbol | Punctuation
NV, V | Verb | Verb
VA | Verb/Adjective | Verb and Adjective/PredicativeAdjective
VX | Verb/AuxiliaryPredicate | Verb/AuxiliaryVerb
VC, VCN, VCP | Verb/Copula | Verb/FiniteVerb
VV | Verb/VerbalPredicate | Verb
E, JK, XP, XS | Particle | MorphologicalCategory/morpheme/
JC, JX | Particle/AuxiliaryPostposition | MorphologicalCategory/morpheme/MorphologicalParticle
JKB, JKC, JKG, JKO, JKQ, JKS, JKV | Particle/CaseMarker | MorphologicalCategory/morpheme/MorphologicalParticle
XPN | Particle/Prefix | MorphologicalCategory/morpheme/prefix
XSA, XSN, XSV | Particle/Suffix | MorphologicalCategory/morpheme/suffix
EC, EF, EP, ETM, ETN | Particle/VerbalEnding | MorphologicalCategory/morpheme/suffix

Conclusions:
• We presented a framework that allows
  • processing Korean text,
  • efficiently producing RDF triples, and
  • querying the output of the NLP tools.
• The RDF output of our framework is compliant with NIF (NLP Interchange Format) and the OLiA ontologies, which facilitates its combination with other NLP tools.
• Future:
  • complete the development of the language-dependent part of the OLiA ontologies,
  • include the missing features required by NIF,
  • allow richer SPARQL queries, and
  • disambiguate the different entities in the text and link them with Wikipedia articles.

Issues
• Josa (postposition case marker)
  • Korean-specific grammatical feature
• DBpedia
  • How to link the produced triples with DBpedia triples
Sentence: 다크나이트는 미국의 영화이다.
Sentence: The Dark Knight is an American movie.

Source
• Demo site for Korean
  • http://semanticweb.kaist.ac.kr/nlp2rdf
• Demo site for English
  • http://nlp2rdf.lod2.eu/demo.php
• OnTop
  • https://babbage.inf.unibz.it/trac/obdapublic/wiki/ObdalibPluginIntro
• NLP2RDF
  • http://nlp2rdf.org

Key-Sun Choi, Mun-Yong Yi, In-Young Koh, Younghee Lee (CS/WebST, Knowledge Service Eng., CS/WebST, CS)
Tony Veale (Invited Professor, Computational Creativity)
Yoon, Yong-Un (research professor, NLP+DB)
Martin Rezk (postdoctoral researcher, Logic)
Park, Jung-Yeol (researcher, parser)
Lee, Jae-Sung (Professor, morphology and word)
Graduate Students: Soon-Gil Hong, Young-Gyun Hahm, Kyungtae Lim, Se-Mi Jang, Youngho Jeong, …
http://ko.dbpedia.org/
http://semanticweb.kaist.ac.kr
kschoi@kaist.ac.kr
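Supplementary sketch: the linking step summarized in the Sejong/OLiA table, and the idea of querying the produced triples, can be illustrated in a few lines of Python. This is a sketch under stated assumptions, not the framework's implementation. The morpheme analysis of the example sentence is written by hand rather than produced by HanNanum, the `http://example.org/doc#` prefix, the `sejong:` prefix and the `ex:referenceConcept` property are hypothetical, and a plain list of tuples with a pattern-matching helper stands in for an RDF store and SPARQL. Only `sso:posTag`, `sso:oliaLink` and `str:anchorOf` are taken from the English demo output.

```python
# A few rows copied from the Sejong/OLiA table: tag -> (Sejong class, OLiA class).
SEJONG_TO_OLIA = {
    "NNP": ("Noun/ProperNoun", "Noun/ProperNoun"),
    "NNG": ("Noun/CommonNoun", "Noun/CommonNoun"),
    "JX": ("Particle/AuxiliaryPostposition",
           "MorphologicalCategory/morpheme/MorphologicalParticle"),
    "JKG": ("Particle/CaseMarker",
            "MorphologicalCategory/morpheme/MorphologicalParticle"),
    "VCP": ("Verb/Copula", "Verb/FiniteVerb"),
    "EF": ("Particle/VerbalEnding", "MorphologicalCategory/morpheme/suffix"),
}

BASE = "http://example.org/doc#"  # hypothetical document prefix


def morpheme_triples(index, surface, tag):
    """Produce (subject, predicate, object) tuples for one tagged morpheme."""
    subject = f"{BASE}morpheme_{index}"
    _sejong_class, olia_class = SEJONG_TO_OLIA[tag]
    return [
        (subject, "str:anchorOf", surface),
        (subject, "sso:posTag", tag),
        (subject, "sso:oliaLink", f"sejong:{tag}"),      # hypothetical prefix
        (subject, "ex:referenceConcept", olia_class),    # hypothetical property
    ]


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


# Hand-written analysis of: 다크나이트는 미국의 영화이다.
morphemes = [("다크나이트", "NNP"), ("는", "JX"), ("미국", "NNP"), ("의", "JKG"),
             ("영화", "NNG"), ("이", "VCP"), ("다", "EF")]

triples = []
for i, (surface, tag) in enumerate(morphemes):
    triples.extend(morpheme_triples(i, surface, tag))

# "Which morphemes are proper nouns?" expressed as a triple pattern.
surface_of = {s: o for s, p, o in triples if p == "str:anchorOf"}
proper_nouns = [surface_of[s] for s, _, _ in
                match(triples, p="ex:referenceConcept", o="Noun/ProperNoun")]
print(proper_nouns)  # ['다크나이트', '미국']
```

Because the query is phrased against the language-independent concept rather than the Sejong tag itself, the same pattern would also apply to triples produced from a different tagset that is linked to the same concept.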