Efforts in Language & Speech Technology

Natural Language Processing Lab
Centre for Development of Advanced Computing
(Ministry of Communications & Information Technology)
‘Anusandhan Bhawan’, C 56/1 Sector 62, Noida – 201 307, India
karunesharora@cdacnoida.com
C-DAC Noida’2004

Translation Support System
• Technology: Angla Bharati (rule base), developed by IIT Kanpur
• System developed jointly by IIT Kanpur and C-DAC Noida
• Operating system support: Linux / Windows
• Performance: 85% correct parsing, 60% correct translation
• Embedded text editor, pre-processor and post-editor
• Lexicon: 25,000 root words

Translation Support System (English to Hindi)
[Block diagram. Components shown: English Sentence, Morphological Analyzer, Pattern Directed Parsing, Lexical Dictionary, Corpus, Rule Base, Pseudo Language Output, Hindi Text Generator, Post Editor.]

Global Concept for Translation System: BhashaSetu
[Workflow diagram with numbered steps 1 to 29; the step-to-label mapping is not recoverable from the extraction. Labels shown: Document to Translate, User, Translator, Reviewer, Project, Editor, Pre Process Filter to Universal Format, Tokenizer, Source Language Parser, Dictionary Setup, Translation Memory Setup, Example Memory Setup, Machine Translation Export, Machine Translation Software, Machine Translation Import, Dictionary, Parsing, Translation, Translation Memory, Example Memory, Dictionary Merge, Translation Memory Merge, Example Memory Merge, Post Process Filter, Translated Document, Dictionaries, Translation Memories, Example Memories.]

Test suite for Translation Support Systems

Knowledge Management: Parallel Corpus & Tools

Gyan Nidhi: Parallel Corpus
‘GyanNidhi’, which stands for ‘Knowledge Resource’, is a parallel corpus in 12 languages. It is a project sponsored by TDIL, DIT, MC&IT, Govt. of India.

Gyan Nidhi: Multi-Lingual Aligned Parallel Corpus
What is it?
A multilingual parallel text corpus contains the same text translated into more than one language.
What does Gyan Nidhi contain?
The GyanNidhi corpus consists of text in English and 11 Indian languages (Hindi, Punjabi, Marathi, Bengali, Oriya, Gujarati, Telugu, Tamil, Kannada, Malayalam, Assamese). It aims to digitize 1 million pages altogether, with at least 50,000 pages in each Indian language and in English.
Source for Parallel Corpus
• National Book Trust India
• Sahitya Akademi
• Navjivan Publishing House
• Publications Division
• SABDA, Pondicherry

GyanNidhi Block Diagram
[Figure: block diagram.]

Gyan Nidhi: Multi-Lingual Aligned Parallel Corpus
Platform: Windows
Data Encoding: XML, Unicode
Portability of Data: data in XML format supports various platforms
Applications of GyanNidhi
• Automatic dictionary extraction
• Creation of translation memory
• Example Based Machine Translation (EBMT)
• Language research study and analysis
• Language modeling

Tools: Prabandhika: Corpus Manager
• Categorisation of corpus data in various user-defined domains
• Addition, deletion and modification of any Indian language data files in HTML / RTF / TXT / XML format
• Selection of languages for viewing the parallel corpus, with data aligned up to paragraph level
• Automatic selection and viewing of parallel paragraphs in multiple languages
  – Abstract and metadata
  – Printing and saving parallel data in Unicode format

Sample Screen Shot: Prabandhika
[Figure: screenshot.]

Tools: Vishleshika: Statistical Text Analyzer
• Vishleshika is a tool for statistical text analysis of Hindi text, extendible to text in other Indian languages.
• It examines input text and generates various statistics, e.g.:
  – Sentence statistics
  – Word statistics
  – Character statistics
• The Text Analyzer presents its analysis in textual as well as graphical form.
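The kinds of statistics listed for Vishleshika can be illustrated with a minimal Python sketch. This is not Vishleshika's implementation, which the slides do not describe. The function name `text_statistics`, the choice of sentence delimiters (danda, question mark, exclamation mark) and the definition of a word as a run of Devanagari letters and signs are all assumptions made for illustration.

```python
import re
from collections import Counter

# Devanagari consonants क (U+0915) through ह (U+0939).
CONSONANTS = {chr(cp) for cp in range(0x0915, 0x093A)}


def text_statistics(text):
    """Return sentence, word and consonant statistics for Devanagari text.

    Hypothetical helper: the delimiters and word definition are assumptions.
    """
    # Assumed sentence delimiters: danda (।), question mark, exclamation mark.
    sentences = [s.strip() for s in re.split(r"[।?!]+", text) if s.strip()]
    # Assumed word definition: a run of Devanagari letters and signs
    # (U+0900..U+0963), which excludes the danda and the Devanagari digits.
    words = re.findall(r"[\u0900-\u0963]+", text)
    counts = Counter(ch for ch in text if ch in CONSONANTS)
    total = sum(counts.values())
    # Share of each consonant among all consonants, most frequent first.
    percent = {ch: 100.0 * n / total for ch, n in counts.most_common()}
    return {
        "sentence_count": len(sentences),
        "word_count": len(words),
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "consonant_percent": percent,
    }


sample = "राम घर गया। सीता घर आई।"
stats = text_statistics(sample)
print(stats["sentence_count"], stats["word_count"])
for consonant, share in stats["consonant_percent"].items():
    print(consonant, round(share, 1))
```

The per-consonant percentages are the kind of figure a character-statistics chart would plot; running the same function over texts in two languages gives two series that can be compared.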
Sample output: Character statistics
[Chart: percentage of occurrence (y-axis, 0.0 to 14.0) of each consonant क ख ग घ ङ च छ ज झ ञ ट ठ ड ढ ण त थ द ध न प फ ब भ म य र ल ळ व श ष स ह (x-axis), with one series each for Hindi and Nepali.]
The graph above shows that the consonant distribution is almost the same in Hindi and Nepali in the sample text.
Most frequent consonants in Hindi: [list not recoverable from the extraction]
Most frequent consonants in Nepali: [list not recoverable from the extraction]
The results also show that these six consonants account for more than 50% of consonant usage.

Vishleshika: Word and Sentence Statistics
[Figure: sample output.]

Speech Technology and Tools

Annotated Speech Corpora for Hindi, Punjabi and Marathi languages
[Diagram. Components shown: Gyan Nidhi Corpus, Vishleshika Statistical Analysis Tool, Phonetically Rich Sentence Set, Studio Recording by Professionals, Segmentation and Labeling using Praat / Emulabel, Manual Verification and Editing, XML Meta Data Creation.]

Modules under TTS
Module: Description
• TTS Shell: A multi-threaded interface that calls the different TTS modules and returns messages that the user can process to generate different events.
• Voice Builder: A utility that helps in building the syllable database. It reduces space utilization and helps in performing fast searches.
• Query Tool for Voice Builder: A tool for reading the voice file and retrieving information about a “UNIT” from the file, i.e. its wave data.
• Text Parser: Breaks the normalized text into logical units such as sentences, words and syllables.
• Prosody Matching & Syllable Concatenation: The “PSOLA” technique is used for smooth joining of speech samples.
• Synthesizer: Writes wave data directly to a sound card or a wave file.
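The Text Parser step (normalized text into sentences, words and syllables) can be sketched as follows. The slides do not give the actual syllabification rules, so this is only an approximation under stated assumptions: it splits words into orthographic syllables (aksharas) with a regular expression and ignores pronunciation effects such as schwa deletion. The names `AKSHARA` and `parse_text` are hypothetical.

```python
import re

# Assumed akshara pattern (orthographic syllable), not the TTS system's rule:
#   zero or more consonant+virama links, a base consonant (optional nukta),
#   an optional vowel sign or final virama, then optional modifiers;
#   or an independent vowel followed by optional modifiers.
AKSHARA = re.compile(
    "(?:[\u0915-\u0939\u0958-\u095F]\u093C?\u094D)*"  # conjunct links: consonant + virama
    "[\u0915-\u0939\u0958-\u095F]\u093C?"             # base consonant, optional nukta
    "(?:[\u093E-\u094C]|\u094D)?"                     # vowel sign (matra) or final virama
    "[\u0901-\u0903]*"                                # chandrabindu, anusvara, visarga
    "|[\u0904-\u0914][\u0901-\u0903]*"                # independent vowel + modifiers
)


def parse_text(text):
    """Split normalized Devanagari text into sentences, words and aksharas.

    Returns a list of sentences; each sentence is a list of words;
    each word is a list of akshara strings.
    """
    parsed = []
    # Assumed sentence delimiters: danda (।), question mark, exclamation mark.
    for sentence in re.split(r"[।?!]+", text):
        # Assumed word definition: a run of Devanagari letters and signs.
        words = re.findall(r"[\u0900-\u0963]+", sentence)
        if words:
            parsed.append([AKSHARA.findall(word) for word in words])
    return parsed


for sentence in parse_text("भारत महान है। हिन्दी सरल है।"):
    print(sentence)
```

Units produced this way could serve as lookup keys into a syllable database, but whether the real system keys its units like this is not stated in the slides.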
Other Areas of Expertise
• OCR for Devanagari script
• Digital library for Indian languages
• Word processing tools such as spell checker, transliteration, terminology development, document analysis and font converters
• Indian language eContent creation

Areas for Future Work
• Machine Translation
  – Standardization of Lexware database design
  – Working on the global approach ‘BhashaSetu’, an amalgamation of different approaches that takes the best of each
  – Development of a translation system test bed
• Knowledge Management
  – Automatic text summarization tool for Hindi and other Indian languages
  – Standardization of a Parts of Speech tagset for Hindi, extendible to other Indian languages
  – Parts of Speech tagger development for Indian languages
  – Automated terminology development tools
  – Sentence alignment tool for Indian languages
  – Development of a manually tagged parallel corpus up to word level
• Speech Technology
  – Speech to Speech Translation System
  – Development of semi-automated speech annotation tools

Thank You