KKAP: KAIST Korean Analysis Platform
Morphological Analyzer, POS Tagger, Parser
Sangwon Park
January 20, 2011
Research Goal
• The goal of this research is to develop KKAP (KAIST Korean
Analysis Platform), an infrastructure for Korean natural
language analysis.
• KKAP will be flexible and easy to use so that it can be
applied widely across different areas. The platform will include
a morphological analyzer, a POS tagger, a parser, and other components.
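Workflow for Korean Analysis
[Figure: KKAP workflow. A Korean document passes through four phases and comes out as an analyzed Korean document. Plugins are drawn from a plugin pool; the figure labels Phase 1 with a supplement plugin and Phases 2 to 4 with a major plugin plus a supplement plugin each.]
• Phase 1. Text Preprocessing
• Phase 2. Morphological Analysis
• Phase 3. POS Tagging
• Phase 4. Parsing
• Plugin Pool (plugins for Phases 1 to 4): Sentence Segmentation, Auto Spacing, Input Filter, Chart-based Morph Analyzer, Unknown Term Processing, Tag Mapper, HMM-based POS Tagging, Noun Extraction, Noun Phrase Extractor, Verb Phrase Extractor, Chart Parser
• Input: Korean Document
7일 저녁 발표예정인 노벨문학상의 유력 수상자로 고은 시인이 거론되고 있다. AP통신은 스웨덴의 노벨상 관측통들 사이에 한국의 고은 시인이 시리아의 시인 아도니스와 함께 올해 노벨상 수상 가능성이 큰 후보로 가장 많이 거론됐다고 전했다. …
(English: The poet Ko Un is being mentioned as a leading candidate for the Nobel Prize in Literature, which is to be announced on the evening of the 7th. AP reported that, among Nobel Prize watchers in Sweden, the Korean poet Ko Un was, along with the Syrian poet Adonis, the most frequently mentioned candidate with a strong chance of winning this year's Nobel Prize. …)
• Output: Analyzed Korean Document
7/nnc+일/nbu
저녁/ncn
발표예정/ncpa+이/jp+ㄴ/etm
노벨문학상/nq+의/jcm
유력/ncps
수상자/ncn+로/jca
고은/nq
시인/ncn+이/jcc
거론/ncpa+되/xsv+고/ecc
있/paa+다/ef
./sf
통신은 통/ncn+신/ncn+은/jxc
스웨덴/nq+의/jcm
노벨상/ncn
관측통/ncn+들/xsn
사이/ncn+에/jca
….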
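The phase and plugin arrangement in the workflow can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the KKAP implementation: the names `Phase`, `run_workflow`, `segment_sentences`, and `split_eojeols` are hypothetical, the two plugins are toy stand-ins, and the ordering (major plugin first, then supplement plugins) is assumed rather than taken from the slides.

```python
import re
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional

# Hypothetical plugin type: a function from one intermediate result to the next.
Plugin = Callable[[Any], Any]


@dataclass
class Phase:
    """One workflow phase: an optional major plugin plus supplement plugins."""
    name: str
    major: Optional[Plugin] = None
    supplements: List[Plugin] = field(default_factory=list)

    def run(self, data: Any) -> Any:
        # Assumption: the major plugin runs first, then the supplements in order.
        if self.major is not None:
            data = self.major(data)
        for plugin in self.supplements:
            data = plugin(data)
        return data


def segment_sentences(text: str) -> List[str]:
    """Toy sentence segmentation: split on whitespace that follows . ! or ?"""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_eojeols(sentences: List[str]) -> List[List[str]]:
    """Toy stand-in for a major plugin: split each sentence on whitespace."""
    return [sentence.split() for sentence in sentences]


def run_workflow(phases: List[Phase], document: str) -> Any:
    """Feed the document through every phase in order."""
    data: Any = document
    for phase in phases:
        data = phase.run(data)
    return data


phases = [
    Phase("Text Preprocessing", supplements=[segment_sentences]),
    Phase("Morphological Analysis", major=split_eojeols),
]
result = run_workflow(phases, "A b. C d e.")
```

Because every plugin shares one call shape, swapping a plugin from the pool into a phase only means changing the list passed to `Phase`.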
Target Users
• The Korean parser can support other research that requires
Korean analysis.
• The major goal is to make the parser useful for the following
research projects.
• I plan to work on a dependency parser so that I can build on and
improve both our laboratory's previous research and the existing
parser.
[Figure: the HanNanum Parser and the research it supports]
– Smart Calendar Project
– Korean E-mail Analysis
– Multi-lingual Knowledge Sync. on Wikipedia
– Korean Wikipedia Analysis
Korean Syntactic Tagged Corpus
• KAIST Syntactic Tagged Corpus
– http://bora.or.kr
– Corpus 5. Manual sentence analysis corpus
– 31,091 sentences from 97 different sources
– Length: 1 ~ 33 Eojeols (average 11.35 Eojeols)
• Related documents
– Kong joo Lee, Byung Gyu Chang, Gil Chang Kim, “Bracketing Guidelines for
Korean Syntactic Tree Tagged Corpus Version 1”, KAIST CS Department
Technical Report, CS/TR-97-112, 1997 (In Korean)
– Byung Gyu Chang, Kong joo Lee, Gil Chang Kim, “Design and Implementation
of Tree Tagging Workbench To Build a Large Tree Tagged Corpus of Korean”,
Proceedings of the Conference on Hangul and Korean Language Information
Processing, pp.421~429, 1997 (In Korean)
Korean Syntactic Tagged Corpus
• KAIST Syntactic Tagged Corpus
[4226] ; 물론 꼭 필요할 땐 어디서든지 부르짖어야지요.
(English: Of course, when it is really necessary, one must cry out, wherever that may be.)
((((((물론/mag
)0Mag((((((꼭/mag
)0Mag((필요/ncps
)0Ncps+
하/xsm
)1Paa
)MgPaa+
ㄹ/etm
)emPaa(때/nbn
)0Nbn
)EmNbn+
ㄴ/jxt
)jtNbn((((어디/npd
)0Npd+
에서/jca
)jaNpd+
든지/jxc
)jxNpd(부르짖/pvg
)0Pvg
)JxPvg
)JtPvg
)MgPvg+
어야지/ef
)efPvg+
요/jxf
)jfPvg+
(./sf
)0Sf
)sfPvg
)S
0 : 0Mag -> mag
1 : 0Mag -> mag
2 : 0Ncps -> ncps
3 : 1Paa -> 0Ncps+xsm
4 : MgPaa -> 0Mag 1Paa
5 : emPaa -> MgPaa+etm
6 : 0Nbn -> nbn
7 : EmNbn -> emPaa 0Nbn
8 : jtNbn -> EmNbn+jxt
9 : 0Npd -> npd
10 : jaNpd -> 0Npd+jca
11 : jxNpd -> jaNpd+jxc
12 : 0Pvg -> pvg
13 : JxPvg -> jxNpd 0Pvg
14 : JtPvg -> jtNbn JxPvg
15 : MgPvg -> 0Mag JtPvg
16 : efPvg -> MgPvg+ef
17 : jfPvg -> efPvg+jxf
18 : 0Sf -> sf
19 : sfPvg -> jfPvg+0Sf
20 : S -> sfPvg
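The numbered rules above can be read mechanically. The sketch below parses one rule line into its index, left-hand side, and right-hand-side symbols. It is a minimal illustration, not the tooling used to build the corpus: the names `Rule` and `parse_rule` are hypothetical, and treating `+` as "attached within the same Eojeol" and a space as "separate Eojeols" is an assumption drawn from this one example.

```python
import re
from typing import List, NamedTuple


class Rule(NamedTuple):
    index: int          # rule number printed before the colon
    lhs: str            # symbol on the left of "->"
    rhs: List[str]      # symbols on the right of "->"
    attached: bool      # True when the right-hand side is joined with "+"


# Matches lines such as "3 : 1Paa -> 0Ncps+xsm".
RULE_RE = re.compile(r"^\s*(\d+)\s*:\s*(\S+)\s*->\s*(.+?)\s*$")


def parse_rule(line: str) -> Rule:
    """Parse one 'index : lhs -> rhs' line from the rule listing."""
    match = RULE_RE.match(line)
    if match is None:
        raise ValueError(f"not a rule line: {line!r}")
    index, lhs, rhs = match.groups()
    # Assumption: "+" and whitespace are the only separators on the right-hand side.
    symbols = [s for s in re.split(r"[+\s]+", rhs) if s]
    return Rule(int(index), lhs, symbols, "+" in rhs)


rules = [
    parse_rule("3 : 1Paa -> 0Ncps+xsm"),
    parse_rule("4 : MgPaa -> 0Mag 1Paa"),
    parse_rule("20 : S -> sfPvg"),
]
```

Reading all 21 lines this way gives a quick consistency check against the bracketed tree, which contains one closing bracket per rule.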
Korean Syntactic Tagged Corpus
• Sejong Syntactic Tagged Corpus
– I got the latest release from the National Institute of the Korean
Language this week.
– Released in December 2010
• 15 Documents
• 433,839 Eojeols / 43,828 Sentences
; 프랑스의 세계적인 의상 디자이너 엠마누엘 웅가로가 실내 장식용 직물 디자이너로 나섰다.
(English: The world-renowned French fashion designer Emanuel Ungaro has stepped forward as a designer of interior decoration fabrics.)
(S
  (NP_SBJ
    (NP
      (NP_MOD 프랑스/NNP + 의/JKG)
      (NP
        (VNP_MOD 세계/NNG + 적/XSN + 이/VCP + ᆫ/ETM)
        (NP
          (NP 의상/NNG)
          (NP 디자이너/NNG))))
    (NP_SBJ
      (NP 엠마누엘/NNP)
      (NP_SBJ 웅가로/NNP + 가/JKS)))
  (VP
    (NP_AJT
      (NP
        (NP
          (NP 실내/NNG)
          (NP 장식/NNG + 용/XSN))
        (NP 직물/NNG))
      (NP_AJT 디자이너/NNG + 로/JKB))
    (VP 나서/VV + 었/EP + 다/EF + ./SF)))
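The bracketed Sejong example above can be loaded with a small recursive parser. The sketch below is an illustration only, assuming the format is exactly what this one example shows: a label after each opening bracket, then either nested constituents or `form/TAG` morphemes joined by ` + `. The names `parse_tree` and `morphemes` are hypothetical, and morphemes that are themselves brackets are not handled.

```python
import re
from typing import List, Tuple, Union

# A node is (label, children) for a phrase or (label, "form/TAG + ...") for a leaf.
Tree = Tuple[str, Union[str, list]]


def parse_tree(text: str) -> Tree:
    """Parse one bracketed constituent into nested (label, body) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node() -> Tree:
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        label = tokens[pos + 1]
        pos += 2
        children, words = [], []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                words.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return (label, children if children else " ".join(words))

    return node()


def morphemes(tree: Tree) -> List[Tuple[str, ...]]:
    """Collect (form, tag) pairs from the leaves, left to right."""
    label, body = tree
    if isinstance(body, str):
        # rsplit keeps a form that itself contains "/" intact.
        return [tuple(m.rsplit("/", 1)) for m in body.split(" + ")]
    pairs = []
    for child in body:
        pairs.extend(morphemes(child))
    return pairs


subject = parse_tree("(NP_SBJ (NP 엠마누엘/NNP) (NP_SBJ 웅가로/NNP + 가/JKS))")
pairs = morphemes(subject)
```

Here `pairs` holds the three morphemes of the subject phrase with their POS tags, which is the kind of flat view a tagger evaluation would need from the treebank.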
Kookmin Univ. KLT version 2.2.0
POSTECH: KoPA
SNU KKMA
Current Version
Questions & Comments