Towards the Development of a Hybrid Machine Translation System involving Philippine Languages (ppt)

eWika: Digitalization of Philippine
Languages
Translate
Isalin
Charibeth K. Cheng
March 19, 2008
Machine Translation
• Automate translation
• A study under Natural
Language Processing
Sentence in SOURCE LANGUAGE → MT System → Sentence in TARGET LANGUAGE
ENG-FIL MT System Project
• 3-year project
• started last year
• funded by DOST-PCASTRD
• composition:
– 6 faculty members of College of
Computer Studies
– 15 computer science majors
– assisted by the Filipino Dept and
Dept in English & Applied
Linguistics of DLSU-M
Agenda
• Architecture of the MT System
• Linguistic resources
• Demo of the Translation Engine
• Results for English to Japanese translation
Architectural Design of the Program
Source Text
User Interface
Target Text
MT: Example-based
Output Modeller
MT: Rule-based
Translator Engine
Language Resources:
• Lexicon (electronic dictionary),
• Morphological Analyzer & Generator
• Part-of-Speech tagger
• Grammar,
• Corpus (Tagged)
Challenge!
• Language resources
– Quality of translation is dependent on it.
– Built from almost non-existent digital forms
– manual vs. automatic construction
Lexicon Builder
• Used IsaWika! database as initial lexicon
• Created a lexicon extraction program to
automatically determine candidate translation
pairs from corpora
• Currently contains about 23,000 entries
• Words that co-occur across aligned sentences are likely translations
• Challenge: Lexical resources
– parallel corpora
– part-of-speech tagger
Database
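The co-occurrence idea on this slide can be illustrated with a minimal sketch. It is not the project's lexicon extraction program: the toy sentence pairs, the function names, and the choice of the Dice coefficient as the association score are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical sentence-aligned English-Filipino pairs (toy data, not from the project corpus).
PARALLEL = [
    ("the man ate", "kumain ang lalaki"),
    ("the man slept", "natulog ang lalaki"),
    ("the dog ate", "kumain ang aso"),
]

def dice_scores(parallel):
    """Score every (source word, target word) pair by how often they share an aligned sentence."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in parallel:
        s_words = set(src.lower().split())
        t_words = set(tgt.lower().split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        pair_count.update(product(s_words, t_words))
    # Dice coefficient: 2 * joint count / (source count + target count)
    return {
        (s, t): 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in pair_count.items()
    }

def best_candidate(scores, word):
    """Return the highest-scoring target word for one source word."""
    candidates = {t: v for (s, t), v in scores.items() if s == word}
    return max(candidates, key=candidates.get)

scores = dice_scores(PARALLEL)
print(best_candidate(scores, "man"))  # lalaki
print(best_candidate(scores, "ate"))  # kumain
```

On real corpora, function words such as "the" and "ang" co-occur with almost everything, which is one reason a part-of-speech tagger is listed among the needed resources.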
Morphological Analyzer
• Initially collected morphological rules from
grammar books
• Developed an example-based morphological
phenomenon learner
– learn from <inflected word, root-word>
– example: <kumakain, kain>
• Challenge: Lexical resources
– lexicon
– part-of-speech tagger
– morphological rules
Generator
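To make the <inflected word, root-word> idea concrete, here is a minimal sketch of learning from one example pair. It is an illustration only, not the project's learner: it assumes just two hand-written operations (first-syllable reduplication and the infix "um"), and the names `learn_rule` and `apply_rule` are hypothetical. It searches for a sequence of operations that turns the root into the inflected form, then reuses that sequence on other roots.

```python
from collections import deque

VOWELS = "aeiou"

def reduplicate(word):
    """Copy everything up to and including the first vowel: kain -> kakain."""
    for i, ch in enumerate(word):
        if ch in VOWELS:
            return word[: i + 1] + word
    return word

def infix_um(word):
    """Insert 'um' before the first vowel: kakain -> kumakain."""
    for i, ch in enumerate(word):
        if ch in VOWELS:
            return word[:i] + "um" + word[i:]
    return word

# Assumed inventory of operations; a real system would need many more.
OPERATIONS = {"reduplicate": reduplicate, "infix_um": infix_um}

def learn_rule(inflected, root, max_ops=3):
    """Breadth-first search for the shortest operation sequence mapping root to inflected."""
    queue = deque([(root, ())])
    while queue:
        form, ops = queue.popleft()
        if form == inflected:
            return ops
        if len(ops) < max_ops:
            for name, fn in OPERATIONS.items():
                queue.append((fn(form), ops + (name,)))
    return None  # no sequence within max_ops explains the pair

def apply_rule(rule, root):
    """Apply a learned operation sequence to a new root."""
    for name in rule:
        root = OPERATIONS[name](root)
    return root

rule = learn_rule("kumakain", "kain")
print(rule)                      # ('reduplicate', 'infix_um')
print(apply_rule(rule, "bili"))  # bumibili
```

The sketch covers generation only; analysis (recovering "kain" from "kumakain") would need the inverse operations.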
Part-Of-Speech Tagger
• automatic association of parts-of-speech to
words in a document
• existing Filipino tagger achieves < 80%
accuracy
• Challenge: Lexical resources
– tagged parallel corpora
– lexicon
– morphological analyzer
– grammar
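As a rough illustration of "automatic association of parts-of-speech to words", the sketch below implements a most-frequent-tag baseline. It is not the existing Filipino tagger mentioned above; the tag names (VERB, NOUN, MARKER, UNK) and the tiny training set are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged sentences; the tag set is assumed for illustration.
TAGGED = [
    [("Bumili", "VERB"), ("ang", "MARKER"), ("lalaki", "NOUN"),
     ("ng", "MARKER"), ("payong", "NOUN")],
    [("Bumili", "VERB"), ("sa", "MARKER"), ("tindahan", "NOUN"),
     ("ang", "MARKER"), ("lalaki", "NOUN")],
]

def train(tagged_sentences):
    """Remember, for each word, the tag it received most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag(model, words, default="UNK"):
    """Tag each word with its most frequent training tag, or `default` if unseen."""
    return [(w, model.get(w.lower(), default)) for w in words]

model = train(TAGGED)
print(tag(model, ["Bumili", "ang", "aso"]))
```

The unseen word "aso" falls back to UNK, which shows why the tagger depends on a lexicon, a morphological analyzer, and a larger tagged corpus.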
Grammar
• Derived manually
• Challenge: Free word order in sentence
formation.
The man bought an umbrella from the store.
• Bumili ang lalaki ng payong sa tindahan.
• Bumili sa tindahan ng payong ang lalaki.
• Ang lalaki ay bumili ng payong sa tindahan.
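One way to see why free word order is manageable is that the markers ang, ng, and sa identify each phrase regardless of position. The sketch below is a simplified assumption, not the project's grammar: it treats every marker as introducing a one-word phrase, drops the inversion particle "ay", and takes the remaining word as the verb. Under those assumptions the three sentences above reduce to the same frame.

```python
MARKERS = {"ang", "ng", "sa"}

def frame(sentence):
    """Map each marker to the word that follows it; the leftover word is taken as the verb."""
    tokens = [t.strip(".,!?").lower() for t in sentence.split()]
    tokens = [t for t in tokens if t and t != "ay"]  # drop the inversion particle
    result = {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in MARKERS and i + 1 < len(tokens):
            result[tok] = tokens[i + 1]  # one-word phrase assumed
            i += 2
        else:
            result["verb"] = tok
            i += 1
    return result

SENTENCES = [
    "Bumili ang lalaki ng payong sa tindahan.",
    "Bumili sa tindahan ng payong ang lalaki.",
    "Ang lalaki ay bumili ng payong sa tindahan.",
]

frames = [frame(s) for s in SENTENCES]
print(frames[0])
print(frames[0] == frames[1] == frames[2])  # True
```

Real sentences have multi-word phrases, modifiers, and other verb focuses, so this only illustrates the word-order point.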
Corpora
• used by the lexicon extractor, the part-of-speech tagger, and example-based MT
• came from translation works of DLSU English
majors, verified by linguists
• consists of 207,000 words, 5,000 of which are
tagged
Translation Rules
• currently learned from the corpora
• disadvantages
– garbage-in-garbage-out
– comprehensiveness
• need for linguist-verified rules
Bringing it home …
• 171 Philippine Languages (SIL)
• No Philippine Corpora
• Unfortunately, today, the Philippines has one of
the highest rates of dying languages (Solfed
Foundation Inc)
• “Without our language, we have no culture, we
have no identity, we are nothing.” (Thorrson)
eWika: Digitalization of
Philippine Languages
• Build the Philippine Corpus
• Build software tools to study or
use the corpus
– Across Languages
– Across Regions
– Across Forms and Genres
– Across Land and Sea
Across Languages
• 171 Philippine Languages (SIL List)
• Summer Institute of Linguistics
http://www.ethnologue.com/
• Major languages
• Near-extinction languages
• How about the languages in-between?
Filipino Sign Language
• The History of Sign Language in the
Philippines: Piecing Together the Puzzle (Abat
& Martinez, 9th Phil Linguistics Congress, 2006)
• Deaf individuals: handicapped vs members of a
linguistic minority
• Sign languages as true languages
Across Boundaries
• Across Languages
• Across Regions
• Across Forms and Genres
• Across Land and Sea
Across Regions
• e-Wika: Connecting the Philippine Islands through Language
• 17 Regions: Ilocos Region (Region I),
Cagayan Valley (Region II), Central Luzon (Region III),
CALABARZON (Region IV-A), MIMAROPA (Region IV-B),
Bicol Region (Region V), Western Visayas (Region VI), Central
Visayas (Region VII), Eastern Visayas (Region VIII),
Zamboanga Peninsula (Region IX), Northern Mindanao (Region
X), Davao Region (Region XI), SOCCSKSARGEN (Region XII),
Caraga (Region XIII), Autonomous Region in Muslim Mindanao
(ARMM), Cordillera Administrative Region (CAR), National
Capital Region (NCR) (Metro Manila)
Across Boundaries
• Across Time: historical, contemporary
• Across Languages
• Across Regions
• Across Forms and Genres
• Across Land and Sea
Across Forms and Genres
• In various forms:
• Text
• Speech: speech to text system (ongoing
project)
• Video: Filipino sign language
• In various Genres: categories of entries in the
corpus
Across Boundaries
• Across Time: historical, contemporary
• Across Languages
• Across Regions
• Across Forms and Genres
• Across Land and Sea
Across Land and Sea
• Web-based application: c/o Solomon See
(upload, download, tools)
• Contributors (Main players)
• Verifiers
• Facilitators
• Server: DLSU-M commits to host the server for
the next three years.
• Terms of Use: Research purposes.
• The dream of building Philippine language
resources and tools
• Many many many major hurdles to overcome
• Language Resources, Tools, & Peopleware:
Needed