A Text Processing Tool for
the Romanian Language
Oana Frunza and Diana Inkpen
School of Information Technology and
Engineering, University of Ottawa
{ofrunza,diana}@site.uottawa.ca
David Nadeau
Institute for Information Technology
National Research Council of Canada
David.Nadeau@nrc-cnrc.gc.ca
Outline
- BALIE System
- RO-BALIE
  - Capabilities
  - Improvements
  - Evaluation & Results
  - Future Work
BALIE - BaseLine Information Extraction


Multilingual information extraction system for English, French, German, and Spanish [1]
- Language identification
- Tokenization
- Sentence boundary detection
- Part-of-speech tagging
Trainable, open-source system written in Java
- Uses WEKA [2], a machine learning toolkit
- Uses QTag [3], a language-independent probabilistic part-of-speech tagger
BALIE - BaseLine Information Extraction (cont.)

Input Example
1.Introduction
Information Extraction (IE) is the name given to
any process which selectively structures and combines
data which is found, explicitly stated or implied, in one
or more texts.
BALIE - BaseLine Information Extraction (cont.)

Output
<?xml version="1.0" ?>
<balie>
<tokenList>
<s>
<token type="2" pos="number" canon="1">1</token>
<token type="1" pos="period" canon=".">.</token>
<token type="2" pos="noun" canon="introduction">Introduction</token>
</s>
<s>
<token type="2" pos="noun" canon="information">Information</token>
…
</s>
</tokenList>
</balie>
RO-BALIE

Improvements



- Easier manipulation of the input and output texts
- A new tag set that maps the numerical tag set used internally by BALIE
- More information in the output provided by the system
Available at:
http://www.site.uottawa.ca/~ofrunza/ROBalie/RO-Balie.html
RO-BALIE

Language Identification
- Character 2-grams (sequences of 2 characters)
- Naïve Bayes classifier
- Overall accuracy: 99.25%

Language   Training files   Test files   Correctly classified   Accuracy
English    50               27           27                     100%
French     50               26           25                     96%
Spanish    50               25           25                     100%
German     50               27           27                     100%
Romanian   50               32           32                     100%
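To make the idea concrete, here is a minimal sketch of language identification with character 2-grams and a Naïve Bayes classifier. It is not the RO-BALIE implementation (which is built on Java and WEKA): the function names, the add-one smoothing, the uniform prior, and the toy training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def bigrams(text):
    """Return the overlapping character 2-grams of a lowercased text."""
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]


def train(samples):
    """Count 2-grams per language; samples maps a language to a list of texts."""
    counts = {lang: Counter(g for t in texts for g in bigrams(t))
              for lang, texts in samples.items()}
    vocab = set().union(*counts.values())
    return counts, vocab


def classify(text, counts, vocab):
    """Pick the language with the highest add-one-smoothed log-likelihood."""
    best_lang, best_score = None, -math.inf
    for lang, c in counts.items():
        total = sum(c.values())
        # Uniform prior over languages (an assumption), so only the
        # likelihood term is compared. The extra +1 reserves mass for
        # 2-grams never seen in any training text.
        score = sum(math.log((c[g] + 1) / (total + len(vocab) + 1))
                    for g in bigrams(text))
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


# Toy training data, invented for illustration only.
SAMPLES = {
    "English": ["the quick brown fox jumps over the lazy dog",
                "information extraction structures data found in texts"],
    "Romanian": ["aceasta este o propozitie scrisa in limba romana",
                 "un instrument pentru prelucrarea textelor in limba romana"],
}

counts, vocab = train(SAMPLES)
```

With this toy model, `classify("the dog jumps over the fox", counts, vocab)` selects English and `classify("o propozitie in limba romana", counts, vocab)` selects Romanian. A real system would train on full files per language, as in the table above.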
RO-BALIE (cont.)

Tokenization


- Split each compound word on "-" and "/"
- Examples: iat-o, socio-economic
- Tokenization results:

Tokens   Precision   Recall
904      99.5%       98.7%
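The compound-splitting rule can be sketched as follows. The function name and the `keep_delimiters` flag are hypothetical: the slides do not say whether RO-BALIE emits "-" and "/" as separate tokens or drops them, so both behaviours are shown.

```python
import re


def split_compounds(text, keep_delimiters=False):
    """Split on whitespace, then break each chunk on "-" and "/".

    keep_delimiters is an illustrative option, not a documented
    RO-BALIE setting: when True, the delimiters are returned as tokens.
    """
    # A capturing group makes re.split return the delimiters too.
    pattern = r"([-/])" if keep_delimiters else r"[-/]"
    tokens = []
    for chunk in text.split():
        tokens.extend(part for part in re.split(pattern, chunk) if part)
    return tokens
```

For example, `split_compounds("iat-o")` gives `["iat", "o"]`, and `split_compounds("socio-economic", keep_delimiters=True)` gives `["socio", "-", "economic"]`.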
RO-BALIE (cont.)

Sentence Boundary Detection



- Training: 106 hand-tagged English sentences
- Decision tree classifier
- Features
  - Beginning of the sentence (first token)
  - Previous token
  - Current token
  - Next token
RO-BALIE (cont.)

Sentence Boundary Detection (cont.)



- Feature values: Period, Open Quote, Close Quote, New Line, Capital Word, Digit, Abbreviation, etc.
- A list of 510 Romanian abbreviations
- Evaluation on Orwell's novel 1984

Text       Accuracy   Precision   Recall
Romanian   97%        92%         71%
English    97.5%      96.5%       82%
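The four features and their values could be computed along the following lines before being handed to a decision tree learner. This is only a sketch: the class labels mirror the feature values listed on the slide, but the function names, the quote character sets, the three-entry abbreviation set (a stand-in for the 510-entry list), and the order in which the checks are applied are assumptions.

```python
# Tiny stand-in for the 510-entry Romanian abbreviation list (illustrative).
ABBREVIATIONS = {"dl", "nr", "etc"}
OPEN_QUOTES = {"„", "«", "“"}
CLOSE_QUOTES = {"”", "»"}


def token_class(token):
    """Map a token to one of the feature values named on the slide."""
    if token is None:
        return "None"  # no neighbour at the start or end of the text
    if token == ".":
        return "Period"
    if token == "\n":
        return "NewLine"
    if token in OPEN_QUOTES:
        return "OpenQuote"
    if token in CLOSE_QUOTES:
        return "CloseQuote"
    if token.lower() in ABBREVIATIONS:
        return "Abbreviation"
    if token.isdigit():
        return "Digit"
    if token[:1].isupper():
        return "CapitalWord"
    return "Other"


def boundary_features(tokens, i, sentence_start):
    """Features for deciding whether tokens[i] ends the current sentence."""
    prev_tok = tokens[i - 1] if i > 0 else None
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else None
    return {
        "first": token_class(tokens[sentence_start]),
        "previous": token_class(prev_tok),
        "current": token_class(tokens[i]),
        "next": token_class(next_tok),
    }
```

On the token list `["Dl", ".", "Popescu", "a", "plecat", "."]`, the period at index 1 gets `previous = "Abbreviation"`, which is the kind of evidence a trained tree can use to avoid splitting after an abbreviation.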
RO-BALIE (cont.)

Part-of-speech tagging – QTag tagger
- Used a corpus of 40 million words of newspaper articles
- Romanian newspapers over a 3-year period
- The training corpus is 98% accurate
- Our system has a tag set of 14 tags for parts of speech and 30 tags for punctuation

Training corpus     Test corpus    Accuracy
2.5 million words   13,425 words   95.3%
RO-BALIE (cont.)

Output for "Apel tirziu si inutil NISTORESCU."
<?xml version="1.0" ?>
<balie>
<Language ID="Romanian">
<tokenList>
<Tokens Count="896">
<s id="1">
<token type="2" pos="NN" canon="apel">Apel</token>
<token type="2" pos="ADV" canon="tirziu">tirziu</token>
<token type="2" pos="CJ" canon="si">si</token>
<token type="2" pos="NN" canon="inutil">inutil</token>
<token type="2" pos="PN" canon="nistorescu">NISTORESCU</token>
<token type="1" pos="PER" canon=".">.</token>
</s>
</Tokens>
</tokenList>
</Language>
</balie>
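Output in this shape can be read back with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the element and attribute names are copied from the sample above, while the function name and the returned tuple layout are illustrative choices. The embedded sample is the excerpt from the slide, so its `Count="896"` refers to the full document rather than to the six tokens shown.

```python
import xml.etree.ElementTree as ET

# Excerpt of RO-BALIE output, copied from the slide above.
SAMPLE = """<?xml version="1.0" ?>
<balie>
<Language ID="Romanian">
<tokenList>
<Tokens Count="896">
<s id="1">
<token type="2" pos="NN" canon="apel">Apel</token>
<token type="2" pos="ADV" canon="tirziu">tirziu</token>
<token type="2" pos="CJ" canon="si">si</token>
<token type="2" pos="NN" canon="inutil">inutil</token>
<token type="2" pos="PN" canon="nistorescu">NISTORESCU</token>
<token type="1" pos="PER" canon=".">.</token>
</s>
</Tokens>
</tokenList>
</Language>
</balie>"""


def read_sentences(xml_text):
    """Return the detected language and, per sentence, (text, pos, canon) tuples."""
    root = ET.fromstring(xml_text)
    language = root.find("Language").get("ID")
    sentences = [
        [(tok.text, tok.get("pos"), tok.get("canon")) for tok in s.iter("token")]
        for s in root.iter("s")
    ]
    return language, sentences


language, sentences = read_sentences(SAMPLE)
```

Here `language` is `"Romanian"` and `sentences[0][0]` is `("Apel", "NN", "apel")`.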
RO-BALIE (cont.)

Future Work



- Use machine learning for the tokenization task
- Add new services: morphological analysis, named entity recognition, etc.
- Add more specific information for each supported language
RO-BALIE (cont.)

References
1. http://balie.sourceforge.net/index.html
2. http://www.cs.waikato.ac.nz/~ml/weka/
3. http://www.english.bham.ac.uk/staff/omason/software/qtag.html
http://www.site.uottawa.ca/~ofrunza/RO-Balie/ROBalie.html
THANK YOU!