eWika: Towards the Digitalization of
Philippine Languages
Translate
Isalin
Charibeth K. Cheng (koc@dlsu.edu.ph)
DLSU, College of Computer Studies
Natural Language Processing Research Lab
• started in 1993 at UP-Los Baños
• Dr. Rachel Roxas and Allan Borra
– grammar-based
• started at DLSU in 2004
– hybrid approach
• 3-year project
• started 2005
• funded by DOST-PCASTRD
• composition:
– 6 faculty members of College of
Computer Studies
– 15 computer science majors
– assisted by the Filipino Dept and
Dept of English & Applied
Linguistics of DLSU-M
Architectural Design of the Program
Source Text User Interface Target Text
MT: Example-based
MT: Rule-based
Translator Engine
Language Resources:
• Lexicon (electronic dictionary),
• Morphological Analyzer & Generator
• Part-of-Speech tagger
• Grammar,
• Corpus (Tagged)
Output Modeller
The boy ate apples.
Apply translation rules
Where do we get the translation rules?
Kumain ng mga mansanas ang batang lalaki.
• Learn the rules from examples
The boy ate apples .
A B C D
Kumain ng mga mansanas ang batang lalaki .
C D A B
Rule Learned:
A B C D -> C ng D A B
The mother cooked fish .
A B C D
Nagluto ng isda ang nanay .
C D A B
Rule: A B C D -> C ng D A B
The mother went home .
A B C D
Umuwi ng bahay ang nanay .
C D A B
Rule: A B C D -> C ng D A B
The boy ate the fish .
A B C D
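The learning step above can be sketched in a few lines of Python. This is only an illustration of the idea on the slides, not the project's actual engine: the function names `learn_rule` and `apply_rule` are hypothetical, and the chunk alignment and the small lexicon are supplied by hand here, whereas a real system would get them from the corpus and the dictionary.

```python
# Toy sketch: learn a reordering rule from one aligned example pair,
# then reuse it on a new sentence. All names are hypothetical.

LABELS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def learn_rule(source_chunks, target_chunks, alignment):
    """Turn one aligned example into (source pattern, target pattern).

    alignment maps each source chunk to its target chunk (hand-made here).
    Target chunks with no aligned source chunk (e.g. "ng") stay literal.
    """
    label_of = {src: LABELS[i] for i, src in enumerate(source_chunks)}
    source_pattern = [label_of[src] for src in source_chunks]
    label_of_target = {tgt: label_of[src] for src, tgt in alignment.items()}
    target_pattern = [label_of_target.get(tgt, tgt) for tgt in target_chunks]
    return source_pattern, target_pattern

def apply_rule(rule, source_chunks, lexicon):
    """Translate chunk by chunk, then reorder according to the rule."""
    source_pattern, target_pattern = rule
    if len(source_chunks) != len(source_pattern):
        raise ValueError("sentence does not match the rule's source pattern")
    binding = dict(zip(source_pattern, source_chunks))
    words = [lexicon[binding[sym]] if sym in binding else sym
             for sym in target_pattern]
    return " ".join(words)

# Example pair from the slides.
rule = learn_rule(
    ["The", "boy", "ate", "apples"],
    ["Kumain", "ng", "mga mansanas", "ang", "batang lalaki"],
    {"The": "ang", "boy": "batang lalaki",
     "ate": "Kumain", "apples": "mga mansanas"},
)

# Hand-made lexicon for the second example sentence (an assumption).
lexicon = {"The": "ang", "mother": "nanay", "cooked": "Nagluto", "fish": "isda"}
translation = apply_rule(rule, ["The", "mother", "cooked", "fish"], lexicon)
print(rule[1])        # ['C', 'ng', 'D', 'A', 'B']
print(translation)    # Nagluto ng isda ang nanay
```

The rule only fires when a new sentence has the same number of chunks as the source pattern, so a five-word sentence has to be chunked into four units first or handled by another rule.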
• Qualities of a Good Translation
– Clarity – 3.3
– Accuracy – 3.2
– Naturalness – 2.8
• highest score of 5
• 100 respondents (5 linguists)
• Language resources
– Translation quality depends on them.
– Built from almost non-existent digital forms
– manual vs. automatic construction
• Diksyunaryo ng Wikang Filipino
• automatic construction (AeFLEX):
– accuracy rate - 57%
• Currently contains more than 30,000 entries
• Challenge: Lexical resources
– translation documents
– part-of-speech tagger
Morphological Analyzer and
Generator
• Dictionary is incomplete
• Create software that:
– analyzes – determines the root word
– generates – produces the inflected word
Given: eating -> eat -> kain -> kumakain
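The chain above (analyze, look up, generate) can be sketched as follows. This is a toy illustration under strong assumptions, not the project's analyzer: the function names are hypothetical, the English analysis only strips a plain "-ing" suffix, the one-entry lexicon is hand-made, and the generator covers only the progressive of -um- verbs (reduplicate the first syllable, then insert "um"), without handling roots that begin with a consonant cluster.

```python
# Toy sketch of: eating -> eat -> kain -> kumakain
# All names and the one-entry lexicon are illustrative assumptions.

VOWELS = "aeiou"
ROOT_LEXICON = {"eat": "kain"}  # English root -> Filipino root

def analyze_english(word):
    """Return (root, aspect); handles only a plain '-ing' suffix."""
    if word.endswith("ing") and len(word) > 4:
        return word[:-3], "progressive"
    return word, "base"

def generate_um_progressive(root):
    """Progressive form of an -um- verb, e.g. kain -> kumakain."""
    if root[0] in VOWELS:
        # Vowel-initial root: prefix "um" and repeat the first vowel.
        return "um" + root[0] + root
    # Consonant-initial root: reduplicate the first consonant-vowel pair,
    # then place "um" after the first consonant.
    reduplicated = root[:2] + root
    return reduplicated[0] + "um" + reduplicated[1:]

def translate_word(word):
    root, aspect = analyze_english(word)
    filipino_root = ROOT_LEXICON[root]
    if aspect == "progressive":
        return generate_um_progressive(filipino_root)
    return filipino_root

print(translate_word("eating"))  # kumakain
```

Because only roots need to be stored, the dictionary can stay small while the analyzer and generator cover the inflected forms.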
• Challenge: Lexical resources
– lexicon
– part-of-speech tagger
• automatic association of parts-of-speech to words in a document
– Can? – kaya (modal, "be able to") vs. lata (noun, "tin can")
– Baba? – chin (noun) or go down (verb)
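To make the "can" ambiguity concrete, here is a minimal sketch of how a part-of-speech tag can select the right dictionary entry. It is an assumption-laden toy, not the project's tagger: the single context rule (a preceding determiner signals a noun), the tag names, and the function names are all hypothetical.

```python
# Toy sketch: the POS tag decides which translation of "can" is used.
# The context rule and all names are illustrative assumptions.

DETERMINERS = {"a", "the", "this", "that"}

# Lexicon keyed by (word, part of speech), using the pair from the slide.
TAGGED_LEXICON = {
    ("can", "MODAL"): "kaya",
    ("can", "NOUN"): "lata",
}

def tag_can(tokens, index):
    """Guess the tag of 'can' at tokens[index] from the previous word."""
    previous = tokens[index - 1].lower() if index > 0 else ""
    return "NOUN" if previous in DETERMINERS else "MODAL"

def translate_can(sentence):
    tokens = sentence.split()
    lowered = [token.lower() for token in tokens]
    index = lowered.index("can")
    return TAGGED_LEXICON[("can", tag_can(tokens, index))]

print(translate_can("I can swim"))        # kaya
print(translate_can("The can is empty"))  # lata
```

A tagger trained on a tagged corpus would replace the hand-written context rule, which is one reason the corpus appears among the needed resources.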
• Challenge: Lexical resources
– corpora
– lexicon
– morphological analyzer
– grammar
• collection of translation-pair documents
• used by the lexicon extractor, the part-of-speech tagger, and example-based MT
• came from translation works of DLSU English majors, verified by linguists
• consists of 207,000 words
Lexicon Resource Dependency
Corpus
• 171 Philippine Languages (SIL)
• No Philippine Corpora
• Unfortunately, today, the Philippines has one of the highest rates of dying languages (Solfed Foundation Inc.)
• “Without our language, we have no culture, we have no identity, we are nothing.” (Thorsson)
eWika: Digitalization of
Philippine Languages
• Build the Philippine Corpus
• Build software tools to study or use the corpus
– Across Regions
– Across Forms and Genres
– Across Languages
• Web-based application: GLOBALIZATION
– upload, download, tools
• Contributors (Main players)
• Verifiers
• Server: DLSU-M commits to host the server for the next three years.
• Terms of Use: Research purposes.
• 171 Philippine Languages (SIL list)
• start with 8 major languages
– Tagalog, Cebuano, Ilocano, Hiligaynon, Bikol,
Waray, Kapampangan, Boholano
• Filipino Sign Language
• In various forms:
– Text
– Speech
– Video: Filipino sign language
• In various Genres:
– Text – literary & creative, essays, news articles, religious, etc.
– Speech – scripted, conversations, etc.
– Video – common signs, regional signs, signs for specific purposes (legal, IT, etc.)
• The dream of building electronic, online
Philippine language resources and tools
• Many many many major hurdles to overcome
• NEEDED : Language Resources, Tools, &
Peopleware