A Text Processing Tool for the Romanian Language

Oana Frunza and Diana Inkpen
School of Information Technology and Engineering, University of Ottawa
{ofrunza,diana}@site.uottawa.ca

David Nadeau
Institute for Information Technology, National Research Council of Canada
David.Nadeau@nrc-cnrc.gc.ca

Outline
- BALIE System
- RO-BALIE
  - Capabilities
  - Improvements
  - Evaluation & Results
  - Future Work

BALIE: BaseLine Information Extraction
- Multilingual information extraction system
  - Language identification
  - Tokenization
  - Sentence boundary detection
  - Part-of-speech tagging for English, French, German, Spanish [1]
- Java trainable open source system
  - Uses WEKA [2], a machine learning tool
  - Uses QTag [3], a language-independent probabilistic part-of-speech tagger

Input Example

1.Introduction
Information Extraction (IE) is the name given to any process which selectively structures and combines data which is found, explicitly stated or implied, in one or more texts.

Output

<?xml version="1.0" ?>
<balie>
  <tokenList>
    <s>
      <token type="2" pos="number" canon="1">1</token>
      <token type="1" pos="period" canon=".">.</token>
      <token type="2" pos="noun" canon="introduction">Introduction</token>
    </s>
    <s>
      <token type="2" pos="noun" canon="information">Information</token>
      …
    </s>
  </tokenList>
</balie>

RO-BALIE

Improvements
- Easier manipulation of the input and output texts
- A new tag set that maps the numerical tag set internally used by BALIE
- More information in the output provided by the system

Available at: http://www.site.uottawa.ca/~ofrunza/ROBalie/RO-Balie.html

Language Identification
- 2-grams (sequences of 2 characters)
- Naïve Bayes classifier
- Overall accuracy: 99.25%

Language   Training files   Test files   Correctly classified   Accuracy
English    50               27           27                     100%
French     50               26           25                     96%
Spanish    50               25           25                     100%
German     50               27           27                     100%
Romanian   50               32           32                     100%
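The language identification step (character 2-grams fed to a Naïve Bayes classifier) can be illustrated with a small sketch. This is not BALIE's implementation, which is written in Java and relies on WEKA; it is a minimal multinomial Naïve Bayes with add-one smoothing, written from scratch in Python. The class name `BigramNaiveBayes`, the helper `char_bigrams`, and the three toy training sentences are assumptions made only for illustration.

```python
import math
from collections import Counter, defaultdict


def char_bigrams(text):
    """Return the overlapping 2-character sequences of a lowercased text."""
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]


class BigramNaiveBayes:
    """Multinomial Naive Bayes over character bigrams (hypothetical helper,
    not part of BALIE), using add-one smoothing."""

    def __init__(self):
        self.bigram_counts = defaultdict(Counter)  # language -> bigram -> count
        self.doc_counts = Counter()                # language -> number of training texts
        self.vocabulary = set()                    # all bigrams seen in training

    def train(self, text, language):
        grams = char_bigrams(text)
        self.bigram_counts[language].update(grams)
        self.doc_counts[language] += 1
        self.vocabulary.update(grams)

    def classify(self, text):
        grams = char_bigrams(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_language, best_score = None, -math.inf
        for language, counts in self.bigram_counts.items():
            total = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods of each bigram.
            score = math.log(self.doc_counts[language] / total_docs)
            for gram in grams:
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_language, best_score = language, score
        return best_language


# Toy training data, invented for this sketch (the slides report 50 training
# files per language).
TRAINING_SAMPLES = [
    ("English", "the quick brown fox jumps over the lazy dog and then runs through the field"),
    ("French", "le renard brun saute par-dessus le chien paresseux et court dans les champs"),
    ("Romanian", "vulpea maro sare peste câinele leneș și apoi aleargă prin câmpul verde"),
]

model = BigramNaiveBayes()
for language, text in TRAINING_SAMPLES:
    model.train(text, language)

print(model.classify("the dog runs over the field"))
```

With realistic amounts of training text per language, the same scoring rule applies unchanged; only the counts grow.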
Tokenization
- Split each compound word based on "-" and "/"
- Examples: iat-o, socio-economic
- Tokenization results:

Tokens   Precision   Recall
904      99.5%       98.7%

Sentence Boundary Detection
- Training: 106 hand-tagged English sentences
- Decision tree classifier
- Features
  - Beginning of the sentence (first token)
  - Previous token
  - Current token
  - Next token
- Feature values
  - Period, Open Quote, Close Quote, New Line, Capital Word, Digit, Abbreviation, etc.
- A list of Romanian abbreviations (510)
- Evaluation on Orwell's novel 1984

Text       Accuracy   Precision   Recall
Romanian   97%        92%         71%
English    97.5%      96.5%       82%

Part-of-Speech Tagging
- QTag tagger
  - Used a corpus of 40 million words of newspaper articles
    - Romanian newspapers, 3-year period
  - The training corpus is 98% accurate
  - Our system has a tagset of 14 tags for POS and 30 tags for punctuation

Training corpus     Test corpus    Accuracy
2.5 million words   13,425 words   95.3%

Output for "Apel tirziu si inutil NISTORESCU."

<?xml version="1.0" ?>
<balie>
  <Language ID="Romanian">
    <tokenList>
      <Tokens Count="896">
        <s id="1">
          <token type="2" pos="NN" canon="apel">Apel</token>
          <token type="2" pos="ADV" canon="tirziu">tirziu</token>
          <token type="2" pos="CJ" canon="si">si</token>
          <token type="2" pos="NN" canon="inutil">inutil</token>
          <token type="2" pos="PN" canon="nistorescu">NISTORESCU</token>
          <token type="1" pos="PER" canon=".">.</token>
        </s>
      </Tokens>
    </tokenList>
  </Language>
</balie>

Future Work
- Use machine learning for the tokenization task
- Add new services: morphological analysis, named entity recognition, etc.
- Add more specific information for each supported language

References
1. http://balie.sourceforge.net/index.html
2. http://www.cs.waikato.ac.nz/~ml/weka/
3. http://www.english.bham.ac.uk/staff/omason/software/qtag.html

http://www.site.uottawa.ca/~ofrunza/RO-Balie/ROBalie.html

THANK YOU!
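For readers who want to consume the RO-BALIE output shown in the example above, the following sketch reads it with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`Language`, `s`, `token`, `pos`, `canon`) are taken from that sample; the function name `read_sentences` is hypothetical, and real output files may contain elements not shown on the slides.

```python
import xml.etree.ElementTree as ET

# Sample copied from the RO-BALIE output slide. The XML declaration must be
# the very first characters of the string for the parser to accept it.
SAMPLE_OUTPUT = """<?xml version="1.0" ?>
<balie>
  <Language ID="Romanian">
    <tokenList>
      <Tokens Count="896">
        <s id="1">
          <token type="2" pos="NN" canon="apel">Apel</token>
          <token type="2" pos="ADV" canon="tirziu">tirziu</token>
          <token type="2" pos="CJ" canon="si">si</token>
          <token type="2" pos="NN" canon="inutil">inutil</token>
          <token type="2" pos="PN" canon="nistorescu">NISTORESCU</token>
          <token type="1" pos="PER" canon=".">.</token>
        </s>
      </Tokens>
    </tokenList>
  </Language>
</balie>"""


def read_sentences(xml_text):
    """Return the detected language and, for each sentence, a list of
    (surface form, POS tag, canonical form) tuples.

    Hypothetical helper for illustration; it is not part of RO-BALIE.
    """
    root = ET.fromstring(xml_text)
    language = root.find("Language").get("ID")
    sentences = []
    for sentence in root.iter("s"):
        tokens = [(t.text, t.get("pos"), t.get("canon"))
                  for t in sentence.iter("token")]
        sentences.append(tokens)
    return language, sentences


language, sentences = read_sentences(SAMPLE_OUTPUT)
print(language)
for surface, pos, canon in sentences[0]:
    print(f"{surface}\t{pos}\t{canon}")
```

Note that the `Count` attribute in the sample refers to the whole processed document (896 tokens), while the slide shows only its first sentence.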
