Translation Support System (English to Hindi) - AU

Efforts in Language & Speech Technology
Natural Language Processing Lab
Centre for Development of Advanced Computing
(Ministry of Communications & Information Technology)
‘Anusandhan Bhawan’,
C 56/1 Sector 62, Noida – 201 307, India
karunesharora@cdacnoida.com
C-DAC Noida’2004
Translation Support System
Technology: Angla Bharati (rule-based), developed by IIT Kanpur
System developed jointly by IIT Kanpur and C-DAC Noida
Operating system support: Linux / Windows
Performance: 85% correct parsing, 60% correct translation
Embedded text editor, pre-processor and post-editor
Lexicon: 25,000 root words
Translation Support System (English to Hindi)
[Block diagram. Components: English Sentence, Morphological Analyzer, Pattern Directed Parsing, Lexical Dictionary, Corpus, Rule Base, Pseudo Language Output, Hindi Text Generator, Post Editor]
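The stages named in the diagram can be illustrated with a toy sketch. This is not the Angla Bharati implementation: the three-word lexicon, the single noun-verb-noun pattern and the function names are all hypothetical, and the pseudo-language stage is skipped. It only shows the general idea of morphological analysis, pattern-directed parsing and reordering English subject-verb-object into Hindi subject-object-verb.

```python
# Toy lexicon (hypothetical): English root -> (part of speech, Hindi equivalent).
LEXICON = {
    "ram": ("NOUN", "राम"),
    "mango": ("NOUN", "आम"),
    "eat": ("VERB", "खाता है"),
}

def analyze(word):
    """Crude morphological analysis: lowercase, then strip a trailing 's'
    if that yields a known root."""
    w = word.lower()
    if w not in LEXICON and w.endswith("s") and w[:-1] in LEXICON:
        w = w[:-1]
    return w

def parse_svo(roots):
    """Match the single pattern NOUN VERB NOUN; return (subject, verb, object)."""
    tags = [LEXICON[r][0] for r in roots]
    if tags != ["NOUN", "VERB", "NOUN"]:
        raise ValueError(f"no pattern matches {tags}")
    return roots[0], roots[1], roots[2]

def generate_hindi(subject, verb, obj):
    """Reorder SVO to SOV and substitute the Hindi equivalents."""
    return " ".join(LEXICON[r][1] for r in (subject, obj, verb))

def translate(sentence):
    roots = [analyze(w) for w in sentence.strip(".").split()]
    return generate_hindi(*parse_svo(roots))

print(translate("Ram eats mango."))
```

A real rule base would also handle gender and number agreement, unknown words and many more sentence patterns. In the system described here, the post editor is the stage where a human corrects the remaining errors.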
Global Concept for Translation System: BhashaSetu
[Flow diagram with numbered steps 1 to 29. Components: Document to Translate; User / Translator; Pre Process Filter to Universal Format; Tokenizer; Dictionary Setup; Source Language Parser; Translation Memory Setup; Example Memory Setup; Reviewer; Project Editor; Machine Translation Export; Machine Translation Software; Machine Translation Import; Dictionary, Parsing, Translation Memory and Example Memory; Dictionary, Translation Memory and Translation Process Editors; Dictionary Merge; Translation Memory Merge; Example Memory Merge; Post Process Filter; Translated Document. Resources: Dictionaries, Translation Memories, Example Memories]
Test suite for Translation Support Systems
Knowledge Management
Parallel Corpus & Tools
Gyan Nidhi: Parallel Corpus
‘GyanNidhi’, which stands for ‘Knowledge Resource’, is a parallel corpus in 12 languages (English and 11 Indian languages), a project sponsored by TDIL, DIT, MC&IT, Govt. of India
Gyan Nidhi: Multi-Lingual Aligned Parallel Corpus
What is it?
A multilingual parallel text corpus contains the same text translated into more than one language.
What does Gyan Nidhi contain?
The GyanNidhi corpus consists of text in English and 11 Indian languages (Hindi, Punjabi, Marathi, Bengali, Oriya, Gujarati, Telugu, Tamil, Kannada, Malayalam, Assamese). It aims to digitize 1 million pages altogether, with at least 50,000 pages in each Indian language and in English.
Sources for Parallel Corpus
• National Book Trust India
• Sahitya Akademi
• Navjivan Publishing House
• Publications Division
• SABDA, Pondicherry
GyanNidhi Block Diagram
Gyan Nidhi: Multi-Lingual Aligned Parallel Corpus
Platform : Windows
Data Encoding : XML, UNICODE
Portability of Data : Data in XML format supports various platforms
Applications of GyanNidhi
• Automatic dictionary extraction
• Creation of translation memory
• Example Based Machine Translation (EBMT)
• Language research study and analysis
• Language modeling
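As an illustration of how paragraph-aligned XML data could feed the "creation of translation memory" application, here is a minimal sketch. The XML layout (`document`, `para`, `text` with a `lang` attribute) is an assumption made for this example; the actual GyanNidhi schema is not given in these slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical paragraph-aligned parallel document (not the real GyanNidhi schema).
SAMPLE = """<document id="d1">
  <para n="1">
    <text lang="en">Knowledge is power.</text>
    <text lang="hi">ज्ञान शक्ति है।</text>
  </para>
  <para n="2">
    <text lang="en">Books are friends.</text>
    <text lang="hi">किताबें मित्र हैं।</text>
  </para>
</document>"""

def build_translation_memory(xml_text, src="en", tgt="hi"):
    """Map each source-language paragraph to its aligned target paragraph."""
    root = ET.fromstring(xml_text)
    memory = {}
    for para in root.iter("para"):
        # Collect every language version of this aligned paragraph.
        versions = {t.get("lang"): (t.text or "").strip()
                    for t in para.findall("text")}
        if src in versions and tgt in versions:
            memory[versions[src]] = versions[tgt]
    return memory

tm = build_translation_memory(SAMPLE)
for source, target in tm.items():
    print(source, "->", target)
```

Because the alignment unit is the paragraph, the resulting memory is coarse. Sentence-level or word-level alignment would need a further alignment step.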
Tools:
Prabandhika: Corpus Manager
• Categorisation of corpus data in various user-defined domains
• Addition/deletion/modification of any Indian language data files in HTML / RTF / TXT / XML format
• Selection of languages for viewing the parallel corpus, with data aligned up to paragraph level
• Automatic selection and viewing of parallel paragraphs in multiple languages
– Abstract and metadata
– Printing and saving parallel data in Unicode format
Sample screenshot: Prabandhika
Tools:
Vishleshika: Statistical Text Analyzer
• Vishleshika is a tool for statistical text analysis of Hindi, extensible to text in other Indian languages
• It examines input text and generates various statistics, e.g.:
– Sentence statistics
– Word statistics
– Character statistics
• The Text Analyzer presents its analysis in textual as well as graphical form.
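The kinds of statistics listed above can be sketched in a few lines. This is an illustrative approximation, not Vishleshika's code. The function name is hypothetical, sentences are split only on the danda and common Latin terminators, and characters are counted per Unicode code point, so vowel signs (matras) are counted separately from consonants.

```python
import re
from collections import Counter

def text_statistics(text):
    """Return simple sentence, word and character statistics for a text."""
    # Split on the Devanagari danda and on ., ? and !
    sentences = [s.strip() for s in re.split(r"[।.?!]", text) if s.strip()]
    words = [w for s in sentences for w in s.split()]
    chars = Counter(ch for w in words for ch in w)
    return {
        "sentences": len(sentences),
        "words": len(words),
        "avg_words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "word_freq": Counter(words),
        "char_freq": chars,
    }

stats = text_statistics("राम घर गया। सीता घर आई।")
print(stats["sentences"], stats["words"], stats["word_freq"].most_common(1))
```

Punctuation such as commas stays attached to words in this sketch. A production analyzer would need proper tokenization.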
Sample output: Character statistics
[Bar chart: percentage of occurrence (0 to 14%) of each consonant in Hindi and Nepali. Consonants on the axis: क ख ग घ ङ च छ ज झ ञ ट ठ ड ढ ण त थ द ध न प फ ब भ म य र ल ळ व श ष स ह]
The graph above shows that the consonant distribution is almost the same in Hindi and Nepali in the sample text.
Most frequent consonants in Hindi
Most frequent consonants in Nepali
The results also show that these six consonants account for more than 50% of consonant usage.
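A claim of the form "the top six consonants account for more than half of consonant usage" can be checked on any text with a short calculation. The sketch below is illustrative: the function name is hypothetical, the consonant set is the one listed on the chart axis, and no result for the slide's sample text is implied.

```python
from collections import Counter

# Devanagari consonants as listed on the chart axis.
CONSONANTS = set("कखगघङचछजझञटठडढणतथदधनपफबभमयरलळवशषसह")

def top_consonant_share(text, k=6):
    """Return the k most frequent consonants and their combined share (%)
    of all consonant occurrences in the text."""
    counts = Counter(ch for ch in text if ch in CONSONANTS)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    top = counts.most_common(k)
    share = 100.0 * sum(n for _, n in top) / total
    return top, share

top, share = top_consonant_share("राम घर गया। सीता घर आई।")
print(top, round(share, 1))
```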
Vishleshika: Word and sentence Statistics
Speech Technology and tools
Annotated Speech Corpora for Hindi, Punjabi and Marathi languages
[Block diagram. Components: Gyan Nidhi Corpus; Vishleshika Statistical Analysis Tool; Phonetically Rich Sentence Set; Studio Recording by Professionals; Segmentation and Labeling using Praat / Emulabel; Manual Verification and Editing; XML Meta Data Creation]
Modules under TTS
Module: Description
TTS Shell: A multi-threaded interface that calls the different TTS modules and returns messages that the user can process to generate different events.
Voice Builder: A utility that helps in building the syllable database. It reduces space utilization and helps in performing fast searches.
Query Tool for Voice Builder: A tool for reading a voice file and retrieving information about a “UNIT” from the file, i.e. its wave data.
Text Parser: Breaks the normalized text into logical units such as sentences, words and syllables.
Prosody Matching & Syllable Concatenation: The “PSOLA” technique is used for smooth joining of speech samples.
Synthesizer: Writes wave data directly to a sound card or to a wave file.
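The last two modules, joining speech units smoothly and writing wave data to a file, can be illustrated with a simplified sketch. Note that this uses a plain linear crossfade, not PSOLA, which works pitch-synchronously and is considerably more involved. The sine tones stand in for stored syllable units, and the sample rate, overlap length and function names are assumptions made for the example.

```python
import math
import struct
import wave

RATE = 16000  # samples per second (assumed)

def tone(freq, seconds=0.2):
    """Stand-in for a stored syllable unit: a sine tone in [-1, 1]."""
    n = int(RATE * seconds)
    return [math.sin(2 * math.pi * freq * i / RATE) for i in range(n)]

def crossfade_concat(units, overlap=160):
    """Join units, linearly fading each one into the next over `overlap` samples."""
    out = list(units[0])
    for unit in units[1:]:
        n = min(overlap, len(out), len(unit))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming unit
            out[start + i] = (1 - w) * out[start + i] + w * unit[i]
        out.extend(unit[n:])
    return out

def write_wav(path, samples):
    """Write mono 16-bit PCM samples to a wave file."""
    frames = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(RATE)
        wf.writeframes(frames)

if __name__ == "__main__":
    write_wav("joined.wav", crossfade_concat([tone(220), tone(330)]))
```

The crossfade removes the click that a hard cut between units would produce, but it does not adjust pitch or duration, which is the part a PSOLA-style method addresses.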
Other Areas of expertise
• OCR for Devanagari script
• Digital Library for Indian languages
• Word processing tools such as Spell Checker, Transliteration, Terminology Development, Document Analysis and Font Converters
• Indian Language eContent Creation
Areas for future work
• Machine Translation
– Standardization of Lexware database design
– Work on the global approach ‘BhashaSetu’, an amalgamation of different approaches that draws on the best of each
– Development of a translation system test bed
• Knowledge Management
– Automatic text summarization tool for Hindi and other Indian languages
– Standardization of a Parts of Speech tag set for Hindi, extensible to other Indian languages
– Parts of Speech tagger development for Indian languages
– Automated terminology development tools
– Sentence alignment tool for Indian languages
– Development of a manually tagged parallel corpus up to word level
• Speech Technology
– Speech to Speech Translation System
– Development of semi-automated speech annotation tools
Thank You