A Text Processing Tool for the Romanian Language

Oana Frunza and Diana Inkpen
School of Information Technology and Engineering, University of Ottawa
{ofrunza,diana}@site.uottawa.ca

David Nadeau
Institute for Information Technology, National Research Council of Canada
David.Nadeau@nrc-cnrc.gc.ca

Outline
BALIE System
RO-BALIE
Capabilities
Improvements
Evaluation & Results
Future Work

BALIE: BaseLine Information Extraction
Multilingual information extraction system
  Language identification
  Tokenization
  Sentence boundary detection
  Part-of-speech tagging
  for English, French, German, Spanish [1]
Java trainable open source system
Uses WEKA [2], a machine learning tool
Uses QTag [3], a language-independent probabilistic part-of-speech tagger

Input Example
1. Introduction
Information Extraction (IE) is the name given to any process which selectively structures and combines data which is found, explicitly stated or implied, in one or more texts.

Output
<?xml version="1.0" ?>
<balie>
  <tokenList>
    <s>
      <token type="2" pos="number" canon="1">1</token>
      <token type="1" pos="period" canon=".">.</token>
      <token type="2" pos="noun" canon="introduction">Introduction</token>
    </s>
    <s>
      <token type="2" pos="noun" canon="information">Information</token>
      …
    </s>
  </tokenList>
</balie>

RO-BALIE Improvements
Easier manipulation of the input and output texts
A new tag set that maps the numerical tag set internally used by BALIE
More information in the output provided by the system
Available at: http://www.site.uottawa.ca/~ofrunza/ROBalie/RO-Balie.html

RO-BALIE Language Identification
2-grams (sequence of 2 characters)
Naïve Bayes classifier
Overall accuracy is: 99.25%.

Language   Training files   Test files   Correctly classified   Accuracy
English    50               27           27                     100%
French     50               26           25                     96%
Spanish    50               25           25                     100%
German     50               27           27                     100%
Romanian   50               32           32                     100%
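As a rough illustration of the language identification approach above (character 2-grams fed to a Naive Bayes classifier), here is a minimal Python sketch. It is not BALIE's implementation: the class name `BigramNaiveBayes`, the helper `char_bigrams`, the add-one smoothing, and the two toy training sentences are all assumptions made for the example, standing in for the 50 training files per language reported in the table.

```python
import math
from collections import Counter, defaultdict


def char_bigrams(text):
    """Return the overlapping 2-character sequences of the lowercased text."""
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]


class BigramNaiveBayes:
    """Hypothetical multinomial Naive Bayes over character bigrams (add-one smoothing)."""

    def __init__(self):
        self.bigram_counts = defaultdict(Counter)  # language -> bigram -> count
        self.doc_counts = Counter()                # language -> number of training texts
        self.vocabulary = set()                    # all bigrams seen in training

    def train(self, language, text):
        grams = char_bigrams(text)
        self.bigram_counts[language].update(grams)
        self.doc_counts[language] += 1
        self.vocabulary.update(grams)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        grams = char_bigrams(text)
        best_language, best_score = None, -math.inf
        for language, counts in self.bigram_counts.items():
            total = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods of each bigram.
            score = math.log(self.doc_counts[language] / total_docs)
            for gram in grams:
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_language, best_score = language, score
        return best_language


# Toy training data, for illustration only.
model = BigramNaiveBayes()
model.train("English", "the quick brown fox jumps over the lazy dog and then the cat sleeps")
model.train("Romanian", "limba romana este o limba frumoasa si noi invatam sa o scriem bine")

print(model.classify("the dog sleeps"))        # English
print(model.classify("limba este frumoasa"))   # Romanian
```

With realistic amounts of training text per language, the same scoring scheme separates languages by their characteristic letter pairs; the smoothing keeps unseen bigrams from driving a score to negative infinity.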
Tokenization
Split each compound word based on “-” and “/”
Examples: iat-o, socio-economic
Tokenization results:

Tokens   Precision   Recall
904      99.5%       98.7%

Sentence Boundary Detection
Training: 106 hand-tagged English sentences
Decision Tree Classifier
Features
  Beginning of the sentence (first token)
  Previous token
  Current token
  Next token
Feature values
  Period, Open Quote, Close Quote, New Line, Capital Word, Digit, Abbreviation, etc.
A list of 510 Romanian abbreviations
Evaluation on Orwell’s novel 1984

Text       Accuracy   Precision   Recall
Romanian   97%        92%         71%
English    97.5%      96.5%       82%

Part-of-speech tagging: QTag tagger
Used a corpus of 40 million words of newspaper articles
  Romanian newspapers
  3-year period
The training corpus is 98% accurate
Our system has a tagset of 14 tags for POS and 30 tags for punctuation

Training corpus   Test corpus    Accuracy
2.5 mil words     13,425 words   95.3%

Output for "Apel tirziu si inutil NISTORESCU."
<?xml version="1.0" ?>
<balie>
  <Language ID="Romanian">
    <tokenList>
      <Tokens Count="896">
        <s id="1">
          <token type="2" pos="NN" canon="apel">Apel</token>
          <token type="2" pos="ADV" canon="tirziu">tirziu</token>
          <token type="2" pos="CJ" canon="si">si</token>
          <token type="2" pos="NN" canon="inutil">inutil</token>
          <token type="2" pos="PN" canon="nistorescu">NISTORESCU</token>
          <token type="1" pos="PER" canon=".">.</token>
        </s>
      </Tokens>
    </tokenList>
  </Language>
</balie>

Future Work
Use machine learning for the tokenization task
Add new services: morphological analysis, named entity recognition, etc.
Add more specific information for each supported language

References
1. http://balie.sourceforge.net/index.html
2. http://www.cs.waikato.ac.nz/~ml/weka/
3. http://www.english.bham.ac.uk/staff/omason/software/qtag.html
http://www.site.uottawa.ca/~ofrunza/RO-Balie/ROBalie.html

THANK YOU!