सुस्वागतम्
Welcome

Technology Development for Indian Languages
Presentation at the LREC 2010 Conference, Malta
Swaran Lata & Somnath Chandra
Human Centred Computing Division, Department of Information Technology
slata@mit.gov.in, schandra@mit.gov.in
Presented by: Shyam S Agrawal, Executive Director KIIT, Advisor CDAC

Technology Development for Indian Languages (TDIL) Programme
• Promotes research and development of technology, software tools and applications for Indian languages
• Catalyzes proliferation of language technology products and solutions
• Promotes standardization

Complexity for Language Technology Development
Complexity:
• A very challenging area for computer scientists because human languages are voluminous, informal and ambiguous.
• Involves interdisciplinary research: advanced computer processing with Artificial Intelligence and Machine Learning on the one hand, and linguistic knowledge for incorporating human communication techniques on the other.
• Still at the research stage in many areas despite huge efforts by academia and scientists in India and abroad.
More intense challenge for Indian languages:
• Large linguistic diversity, with 22 officially recognized languages and 12 scripts.
• One language, many scripts; many languages, one script.
• The specifics of each language and script are unique and cannot be easily replicated.
• Differing perceptions of usage among user groups, e.g. State Governments, academia and industry.

Official Indian Languages & Scripts
Sl. No.  Language    Script
1.       Hindi       Devanagari
2.       Sanskrit    Devanagari
3.       Marathi     Devanagari
4.       Konkani     Devanagari
5.       Nepali      Devanagari
6.       Maithili    Devanagari
7.       Sindhi      Devanagari
8.       Bodo        Devanagari
9.       Dogri       Devanagari, Sharda
10.      Bengali     Bengali
11.      Assamese    Bengali
12.      Manipuri    Bengali, Meitei (Mayak)
13.      Gujarati    Gujarati
14.      Kannada     Kannada
15.      Malayalam   Malayalam
16.      Oriya       Oriya
17.      Punjabi     Gurmukhi
18.      Tamil       Tamil
19.      Telugu      Telugu
20.      Urdu        Arabic
21.      Santhali    Ol-Chiki, Devanagari
22.      Kashmiri    Arabic, Sharda

Genesis of Language Technology Development in India: Early Initiatives
Pioneering effort by DIT in collaboration with IIT Kanpur in 1983:
• The Department of Electronics (now DIT) entrusted a sponsored project to IIT Kanpur to build an integrated Devanagari terminal (GIST).
• A standalone system was developed with a computer keyboard for inputting characters in Devanagari, a monitor for display, a dot matrix printer for printing, and a serial communication link for sending characters to another terminal.
• Developed the Indian Script Code for Information Interchange (ISCII standard).
• C-DAC Pune adopted GIST technology to develop products and licensed it to manufacturers.
• The Technology Development for Indian Languages (TDIL) Programme started in 1991 as a separate entity.

Phases of Language Technology Developments
Seeding Phase: 1991-1995
• TDIL programme established in 1991
• Some linguistic resources such as corpora developed
• NLP training programme for computer scientists and linguists
• Some stand-alone language learning tools developed
• Exploratory work in the area of NLP
Exploratory Phase: 1995-2000
• Proof-of-concept Machine Translation systems developed for English to Indian languages (Angla-Bharti) and Indian languages to Indian languages (Anusaraka)
• Laboratory model of font-dependent Optical Character Recognition in Hindi
• Text-to-Speech for Hindi

Catch-up Phase: 2000-2004
The TDIL programme gathered momentum by establishing 13 Resource Centres for Indian Languages Technology Solutions (RCILTS) and 10 CoIL-Net Centres.
Resource Centres for Indian Languages Technology Solutions (RCILTS):
• The objective was to extend this activity to a large number of institutions across the country, each with a specific mandate for a language or a group of languages.
• Under this project, these centres have developed several important tools, linguistic resources and technologies for Indian language support.
• Many of these tools are now being modified and upgraded for release in the public domain under the National Roll-Out Project.
Some of the important language technology tools and resources developed under the Resource Centres project are:
• Bilingual dictionaries between Indian languages, with over 30,000 words [Resource Centres]
• Spell-checkers in Indian languages [Resource Centres]
• Ontology & Word-Net: 9,000 syn-sets with morphological analyzer and front end for Hindi; Word-Net with 1,100 lexical entries and X-window interface for Oriya
• Proof-of-concept Optical Character Recognition (OCR) systems in other Indian languages
• Proof-of-concept TTS in other Indian languages
In addition, several other tools, operating systems and resources have been developed under various sponsored projects. Some of the notable ones are:
• INDIX-2 (localized Linux in 12 Indian languages)
• Phrasal dictionaries in Tamil and Kannada [IIIT, Hyderabad]
• Online VishwaKosha with 9162 topics [CDAC]
• Parallel corpora: one million pages of parallel corpora in 11 languages [CDAC]

CoIL-Net Centres
• The objective was to develop localized content in Hindi-speaking states to enhance IT proliferation.
• E-content of approximately 16,000 HTML and dynamic pages has been developed in the domains of health, education, tourism and agri-business.
• Content on the eminent personalities, tourist places, classical works and cultural heritage of these regions has been developed.
• The developed content is available on the internet at http://tdil.mit.gov.in.
• National Train Enquiry (http://www.trainenquiry.com) and TDIL website localized in Hindi by CDAC.
Product Development and Proliferation Phase: 2005 onwards
• A "Roadmap for Language Technology Development in India" was evolved to formulate short-term and long-term mission plans and a strategy for development of language technologies in India.
• The focus is to synergize development efforts and develop deployable products.
• A National Roll-Out Programme and six Mission Mode Projects have been initiated to facilitate speedy development and availability of language technologies.

Proliferation of Indian Language Technology Products: National Roll-Out Plan
Objective of the initiative:
• To facilitate speedy development and availability of language technologies.
Broad contents of the CD:
• Common user's toolkit: content creation tools, DTP, office automation, code converters
• Productivity tools: spellchecker, domain-based dictionaries, transliteration
• Power user: OCR, Text-to-Speech, MAT, etc.
Distribution channels for the CD:
• Registered users of www.ildc.in, the web site of TDIL, DIT, through the postal department
• IT magazines, publications, etc.
• Schools, Government departments, etc.
Software tools and fonts for 12 Indian languages, namely Hindi, Tamil, Telugu, Assamese, Kannada, Malayalam, Marathi, Oriya, Punjabi, Urdu, Gujarati and Sanskrit, have been released in the public domain.
CDs containing 4 Indian languages, namely Bodo, Dogri, Maithili and Nepali, are being released on Feb 21, 2009, UNESCO International Mother Language Day.

Software Tools and Fonts: CD Contents
1. Language TrueType fonts with keyboard driver (more than 200): supports INSCRIPT, Typewriter and Phonetic keyboard layouts; allows content creation in Indian languages using applications running under Microsoft Windows.
2. Language multi-font keyboard engine for TrueType fonts: allows content creation in Indian languages using applications running under Microsoft Windows in a variety of font encodings.
3. Language Unicode-compliant OpenType fonts (more than 200): allow rendering of Indian language Unicode data.
4. Unicode-compliant keyboard driver: supports INSCRIPT, Typewriter and Phonetic keyboard layouts; allows Unicode-compliant data input.
5. Generic fonts and storage code converter: allows the user to convert existing data in different encodings to ISCII / Unicode.
6. Localized version of Bharateeya OO (office suite): consists of a word processor, presentation tool, spreadsheet and drawing tool.
7. Firefox browser: localized version of the Firefox browser.
8. Colombo: email client for Windows and Linux operating systems. The user can send and receive emails in Indian languages; the menus are also in the local language.
9. GAIM: multiprotocol messenger. Enables the user to use various messenger clients for communication.
10. Optical Character Recognition: with OCR, one can scan printed text and convert it into editable form for further processing.
11. Typing Tutor: teaches the user to type in Indian languages.
12. Spellchecker: allows the end user to correct spelling mistakes in a document.
13. Dictionaries: English to Indian language and Indian language to English dictionaries in general, administrative and technical domains.
14. Transliteration Tool: transliterates a given Indian language text into Roman script and vice versa. Useful for users who are not familiar with the script.
15. Text-to-Speech system: reads out the text.

Product Development Efforts: Mission Mode Projects Phase-I
In consortium mode, 26 premier institutes and R&D organizations are working together on six projects to develop advanced technologies and applications.
• Development of English to Indian Languages Machine Translation (MT) System: 10 institutions are participating to build a deployable MT system.
  Consortium Leader: CDAC, Pune
  Domains: Tourism and Health
  Six language pairs: English to Hindi / Marathi / Bengali / Oriya / Tamil / Urdu
• Development of English to Indian Languages Machine Translation (MT) System with Angla-Bharti Technology: 4 institutions are participating to build a deployable MT system.
  Consortium Leader: IIT Kanpur
  Domains: Tourism and Health
  Six language pairs: English to Hindi / Marathi / Bengali / Oriya / Tamil / Urdu
• Development of Indian Language to Indian Language Machine Translation System: 11 institutions are participating to build a deployable bi-directional MT system.
  Consortium Leader: IIIT, Hyderabad
  Domains: Tourism and Health
  Nine language pairs: Tamil-Hindi, Telugu-Hindi, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Malayalam-Tamil

Development Efforts: Mission Mode Projects Phase-I (continued)
• Development of Robust Document Analysis & Recognition System for Indian Languages: 11 institutions are participating to build an OCR system with improved accuracy and font- and point-size-independent recognition capability.
  Consortium Leader: IIT, Delhi
  10 scripts: Bengali, Devanagari, Malayalam, Gujarati, Telugu, Tamil, Oriya, Tibetan/Nepali, Gurmukhi, Kannada
• Development of On-line Handwriting Recognition System: seven institutions are participating to build an on-line handwriting recognition system.
  Consortium Leader: IISc, Bangalore
  6 scripts: Devanagari, Bengali, Tamil, Telugu, Kannada and Malayalam
• Development of Cross-lingual Information Access: 11 institutions are participating to develop a portal where a user can give a query in one Indian language and access documents available in (a) the language of the query, (b) Hindi (if the query language is not Hindi) and (c) English.
  Consortium Leader: IIT, Bombay
  Domains: Tourism and Health
  Six languages: Bengali, Hindi, Marathi, Punjabi, Tamil and Telugu
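
The language-selection rule described for the Cross-lingual Information Access portal can be sketched in a few lines of Python. This is only a minimal illustration of rules (a) to (c) above; the function name `target_languages` and the use of plain language names are assumptions made for the example, not part of the consortium's system.

```python
# Hypothetical sketch of the CLIA language-selection rule described above.
# The six query languages are taken from the slide; everything else
# (names, data types) is an assumption made for illustration.
CLIA_QUERY_LANGUAGES = ("Bengali", "Hindi", "Marathi", "Punjabi", "Tamil", "Telugu")


def target_languages(query_language: str) -> list[str]:
    """Return the document languages to search for a given query language."""
    if query_language not in CLIA_QUERY_LANGUAGES:
        raise ValueError(f"unsupported query language: {query_language}")
    targets = [query_language]       # (a) the language of the query
    if query_language != "Hindi":
        targets.append("Hindi")      # (b) Hindi, if the query is not in Hindi
    targets.append("English")        # (c) English
    return targets


if __name__ == "__main__":
    for language in ("Tamil", "Hindi"):
        print(language, "->", target_languages(language))
```

A Tamil query is searched against Tamil, Hindi and English documents, while a Hindi query is searched against Hindi and English only.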
Status of the Readiness of the Consortium Mode Projects
(Columns on the slide: Sl No, Name of the product / system, Language Pairs, Domains, Version, Possible Date)
1. English to Indian Languages Machine Translation System (E-IL). Domain: Tourism. Version: α.
2. English to Indian Languages Machine Translation System (E-IL) with Angla-Bharti approach. Domain: Tourism. Version: α.
3. Indian Language to Indian Language Machine Translation (IL-IL). Domain: Tourism. Version: α. Possible date: March 31, 2009.
4. Cross-lingual Information Access (CLIA). Domain: Tourism. Version: α. Possible date: March 31, 2009.
5. Printed Text OCR. Domain: --. Version: α. Possible date: March 31, 2009.
6. On-line Handwriting Recognition System (OHWR). Domain: --. Version: α. Possible date: March 31, 2009.
Language pairs: Marathi, Tamil and Bengali

Development Efforts: Speech Processing
Speech corpora:
• Annotated speech corpora of approximately 50 hours developed for Hindi, Marathi, Punjabi, Bengali, Assamese and Manipuri. [CDAC]
• Speech corpora for Tamil, Malayalam, Telugu and Kannada under development. [CDAC]
Speech recognition:
• A phonetic engine for speech recognition systems for Hindi and Telugu is being developed. [IIIT Hyderabad]
Text-to-Speech (TTS) and Automatic Speech Recognition in Indian languages:
• A consortium mode project has been initiated to develop Text-to-Speech systems for visually challenged persons in six Indian languages, namely Hindi, Tamil, Telugu, Marathi, Malayalam and Bengali.
• Development of Automatic Speech Processing in Indian languages is also being initiated.

Development Efforts
Sanskrit computing:
• A consortium mode project has been initiated to develop a Sanskrit computational toolkit and a Sanskrit-Hindi Machine Translation system. [Univ. of Hyderabad]
Corpora:
• A consortium mode project is being initiated to develop annotated corpora in 11 Indian languages.
• The project will evolve the standards for natural language processing.

Development Efforts for North-Eastern Languages
• Consortium mode projects have been initiated to develop linguistic resources and basic information processing tools for the North-Eastern languages Assamese, Bodo, Manipuri and Nepali. [C-DAC Pune]
• A consortium mode project has also been initiated to develop Word-Net in North-Eastern languages. [IIT Bombay]
• Speech corpora and standardization of the International Phonetic Alphabet (IPA) for the Bodo language have been initiated. [Univ. of Guwahati]

Standardization

ISCII: Indian Script Code for Information Interchange
• Since the 1970s, efforts were made to evolve codes for the characters and symbols of the 10 Brahmi-based Indian scripts, which share a common phonetic structure.
• These efforts culminated in the Indian standard for the Indian Script Code for Information Interchange (ISCII) in December 1991.
• The ISCII code standard specifies a 7-bit code table which can be used in a 7-bit or 8-bit ISO-compatible environment.
• It allows English and Indian script alphabets to be used simultaneously.

INSCRIPT Keyboard Layout
• Standardized by the Bureau of Indian Standards (IS 13194:1991).
• Keys are placed so that a user well versed in one language can type in another without effort.
• The layout is overlaid on the existing QWERTY keyboard.
• Language selection is done with the Caps Lock, Scroll Lock or Num Lock key.
• Since it is based on the phonetic nature of Indian languages, it is very easy to learn.
• Efforts have been initiated to incorporate the additional characters of the Unicode 5.1 standard in a modified layout.

UNICODE
• Unicode was originally designed as a 16-bit encoding providing 65,536 code points. The standard has since been extended beyond 16 bits, and the major Indian scripts are encoded within the original 16-bit range (the Basic Multilingual Plane).
• Corresponds to ISO/IEC 10646-1, Universal Multiple-Octet Coded Character Set (UCS).
• In sync with other international standards, such as those of W3C.
• The Unicode Standard assigns each character a unique numeric value and name.
• Encodes the characters used for the written languages of the world.
• Unicode is increasingly being accepted as a standard for information interchange worldwide, as most major IT companies have declared their support for it.
• The Department of Information Technology has been a voting member of the Unicode Consortium since 2000, to ensure adequate representation of Indic scripts in the Unicode Standard. http://tdil.mit.gov.in/pchangeuni.htm
• DIT finalized its proposed changes to the Unicode Standard; the majority have been accepted and incorporated in Unicode Standard version 5.0.
• Initiatives have been taken to incorporate additional languages / scripts and additional characters and symbols of Vedic Sanskrit in Unicode. http://tdil.mit.gov.in/prop_uni/Vedic.pdf

UNICODE: Examples
[Figure: code chart examples. Legend: proposed shape changes to characters / symbols / signs in the existing standard; changes in the annotation / explanation of a particular code point; proposed additions of characters / symbols / signs to the existing standard.]

W3C
• The project "Web Internationalization Initiative" (WII) has been initiated with the objective of adequate representation of Indic scripts in the web technology standards being evolved by the World Wide Web Consortium (W3C).
• An initiative has been taken to incorporate key findings of the WII projects in the W3C standards / guidelines.

W3C: A Large Amount of Work Needs to Be Done
• In Phase-I of the WII projects, only a little exploratory work has been carried out.
• W3C Internationalization has 115 recommendations covering web internationalization, XML, Cascading Style Sheets and Speech Synthesis.
• These recommendations need to be carefully studied from the Indian language perspective, and specific recommendations need to be proposed to W3C.
• A few of specific interest are Internationalization Tag Set (ITS) Version 1.0, Voice Extensible Mark-up Language (VoiceXML) 2.1, Web Content Accessibility Guidelines 1.0, Cascading Style Sheets (CSS1) Level 1 Specification, and Speech Synthesis Mark-up Language 1.0.
• There is a need for consultation with all stakeholders, such as academia, industry and the various state governments.
• Industry and web service providers need to be sensitized to adopt W3C standards.
• The W3C India Office has been established at DIT under the aegis of the TDIL programme.

International Phonetic Alphabet (IPA)
• Since phonetic representation of symbols is required for present-day speech mark-up languages such as the W3C Speech Synthesis Mark-up Language (SSML), standardization of IPA symbols is necessary.
• India being a multilingual country, a standardized phonetic alphabet has to be developed for the scientific study of phonetics and for SSML for Indian languages.
• IPA standardization for all Indian languages, and its acceptance by the International Phonetic Association, is thus required for the development of speech technology and associated products.
• Efforts have been initiated to standardize IPA symbols in Indian languages.

Common Locale Data Repository
• The Common Locale Data Repository (CLDR) is an initiative of the Unicode Consortium to develop locale data for the world's languages.
• The Unicode CLDR provides key building blocks for software to support the world's languages.
• CLDR is by far the largest and most extensive standard repository of locale data. This data is used by a wide spectrum of companies for their software internationalization and localization.
• The Department of Information Technology has already become a TC (Team Coordinator) to incorporate / modify Indian languages in CLDR.
• Modification / development of Common Locale Data Repository data in Indian languages has been initiated in consultation with state governments and other stakeholders.
• CLDR data for 6 Indian languages has been incorporated in the Unicode CLDR.
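
The point made in the Unicode part of this section, that the standard assigns each character a unique numeric value and name, can be checked with Python's standard `unicodedata` module. The sketch below is illustrative only: the helper names `describe` and `is_devanagari` are made up for this example, and the Devanagari block range U+0900 to U+097F is taken from the published Unicode code charts.

```python
import unicodedata

# Devanagari block range from the Unicode code charts (U+0900..U+097F).
DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F

# The word "Hindi" written in Devanagari, spelled out as code points.
WORD = "\u0939\u093f\u0902\u0926\u0940"


def describe(text: str) -> list[tuple[str, str]]:
    """Return (code point, character name) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


def is_devanagari(ch: str) -> bool:
    """Return True if the character lies in the Devanagari block."""
    return DEVANAGARI_START <= ord(ch) <= DEVANAGARI_END


if __name__ == "__main__":
    for code_point, name in describe(WORD):
        print(code_point, name)
    print("All Devanagari:", all(is_devanagari(ch) for ch in WORD))
```

The first line printed is `U+0939 DEVANAGARI LETTER HA`; each following character of the word likewise has its own code point and name.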
Language Tags
• Language tags are used in most multilingual applications, such as web development, multilingual internet data exchange, language negotiation and web services.
• The nomenclature of language tags is standardized under the ISO 639 standard.
• The ISO 639-x language codes (x denotes the part of the standard) are used in many other international standards and best practices, such as IETF (Internet Engineering Task Force) RFC 4646 and RFC 4647 and the W3C web standards.
• They are also related to ISO 3166 (region codes) and ISO 15924 (script codes).
• The present forms of ISO 639-2 and ISO 639-3, and the newer ISO 639-5 and ISO 639-6, have many ambiguous entries for Indian languages, which need to be corrected urgently to prevent the propagation of incorrect nomenclature for language sets.
• Modification / addition of language tags for Indian languages has been taken up in consultation with the state governments and all stakeholders.

Information Dissemination
• TDIL portal: http://tdil.mit.gov.in
• ILDC portal: http://ildc.gov.in, http://ildc.in
On the ILDC portal, a user can:
• Request a Language Tools CD
• Register on the ILDC website
• Provide feedback and access FAQs (Frequently Asked Questions)
• Get free downloads and software for Indian language tools
TDIL half-yearly journal VishwaBharat@tdil: 16 issues published; accessible through the TDIL web site.

Future Activities
• All the ongoing consortium mode projects, the National Roll-Out Plan project and the Specialized Manpower Development in Language Technology project would be continued.
• Phase-II of Consortium Mode Projects: Phase-II consortium mode projects in the areas of Machine Translation, Cross-lingual Information Access, Optical Character Recognition and OHWR.
  The systems developed in Phase-I would be improved and expanded to other domains.
• Technology Development:
  (a) Speech Technology: development of Automatic Speech Processing engines to be initiated for major Indian languages.
  (b) Basic Research: basic research in the area of Semantic Web technology would be initiated.
• Web Internationalization Initiative: Phase-II of the project "Web Internationalization Initiative (WII)" to be initiated with the objective of adequate representation of Indic scripts in the web technology standards being evolved by the World Wide Web Consortium (W3C).

Future Activities (continued)
• Establishment of Data Centre at TDIL: setting up of an Indian Language Data Centre and a Language Technology Demonstration Facility at DIT, upgradation of the ILDC and TDIL websites, and upgradation of the Language CDs and their distribution support would be undertaken.
• Web Internationalization Initiative: Phase-II of the Web Internationalization Initiative (WII) programme would be initiated to finalize Indian language specific inputs / recommendations for W3C web technology standards.
• National Localization Research and Resource Centre (NLRRC): seeding activity for the National Localization Research and Resource Centre would be initiated.

Government, academia and industry together, to play globally and to serve locally, for making India a global multilingual computing hub.

धन्यवाद ਧਨ੍ਯਵਾਦ ધન્યવાદ
Thank You