Electronics and Information Technology Exposition - ELITEX 2005
India Habitat Centre, Lodhi Road, New Delhi. 25 - 26 April 2005

Nurturing Living Languages
Mahesh D. Kulkarni
C-DAC GIST Group
mdk@cdac.in
© C-DAC Nurturing Living Languages

Multimodal System (Human Computer Interface) for Indian languages
Issues - Solutions

Multimodal System
• Enables users to communicate with computers via several modes such as keyboard, OCR, speech, gesture, gaze and visual input.
• The major challenge for computer system designers lies in simplifying the human-machine interface.
• Researchers all over the world are inventing different modes of interaction, some of them with little or no success.
• No single mode is sufficient for effective communication with the machine.
• Some of the popular interaction mechanisms are:
  • Keyboard
  • Unistroke
  • Graffiti
  • Predictive writing
  • OCR
  • Speech (limited vocabulary)

Multimodal System for Indian languages
Challenges
• Multilingual: 22 scheduled languages.
• Complex scripts compared to English, which especially poses problems for OCR.
• Input involves many-to-one and many-to-many relationships, unlike English.
• Limited availability of linguistic resources.
• Layman terminology versus pure linguistics terminology.
• Various dialects pose a challenge for speech input.
Impact
• The lack of an efficient Indian-language multimodal system has restricted content creation.
Possible solution
• Development of expert / smart writing systems backed by multimodal inputs and linguistic resources such as spellcheckers, grammar checkers, synonyms, antonyms, thesauri, domain-based dictionaries, phrases and references.

English Language
• Base characters: 26 (A B C D ………)

Hindi Language
• Base characters: 80
• Vowel characters: 12
• Half characters: 43
• Matra characters: 12

• Because Indian language keyboard layouts and processing power were unavailable, mechanical typewriter layouts were devised, based on the principle “the way you see is the way you write”.
• INSCRIPT: popular, widely used, and now the de facto standard.
• Based on the phonetic structure of Indian languages: “the way you speak is the way you write”.
• Phonetic English for urban users.
Limitations in mobile world
• It is very bulky, difficult to carry and large compared to the target device itself.
• Requires both hands, so it is not suitable for portable, mobile devices.
• Cannot be used without training.
• More than 80 keys required with UNSHIFT / SHIFT operations.

Virtual / LASER keyboard
• PDAs, cellular telephones.
• Tablet PCs, laptops.
• Industrial, sterile and medical environments.
• Test equipment.
• Transport (air, rail, automotive).
Limitations
• Needs a proper surface on which to display the image.
• Typing is cumbersome, since finger positions and movements are restricted.
• Speed limitations.

KITTY – Keyboard Independent Touch Typing
• KITTY, a finger-mounted keyboard for data entry into PDAs, Pocket PCs and wearable computers, has been developed at the University of California in Irvine.
• Two hand-mounted devices connect to the target computing device using Bluetooth wireless networking technology.
• The user can type on a hard surface like a desk or table, or into the air.
© University of California and Senseboard respectively.

Unistroke Inputting
• Each character is represented by a single stroke, so there is no segmentation problem.
• The system does not need to spend resources working out where one character ends and another begins.
• Characters need not be written within bounding boxes; they can be recognized even when written one on top of the other.
• Can even be used by blind persons.
Limitations
• Requires the user to spend some time learning the characters.
• Complex to implement for Asian languages.
• More oriented towards English.

Graffiti inputting
• Requires minimal time for learning the alphabet.
• Graffiti is easy to learn, while Unistrokes is comparatively harder.
• As a result, although Unistrokes is the faster mode of text input, it is hardly used.

Non-Predictive & Predictive Inputting mechanism for Handheld / Mobile Devices
By C-DAC GIST Group

Multitap text entry mechanism
English
• English has only 26 letters.
• These are spread over 9 keys, i.e. 3 to 4 characters on a single key.
• To get the desired character, the user needs to press the key up to 4 times.
• Since several key presses are needed per character, typing longer text becomes tedious.
Hindi / Indian languages
• Hindi has around 80 basic characters, 43 half characters, 12 vowels and 12 matras, making roughly 147 characters in total.
• Spreading these basic characters, half characters, vowels and matras over the 10 keys gives around 9 to 10 characters on one key.
• Multi-tap input therefore becomes very cumbersome.
• Entering a longer message with this mechanism is next to impossible for Indian languages.

Comparative study of English & Hindi
English
• Single-character combinations: 26
• Two-character combinations: 52
• Three-character combinations: 4056
Hindi
• Single-character combinations: 80
• Two-character combinations: 6889
• Three-character combinations: 571787

Multitap
• 12 keys are required to input the characters.
• If the desired character is missed, the user has to start all over again.
• Ideally suited to fewer than 3-4 characters per key.
• Not suitable for Indian language input, since almost 7-8 characters have to be placed on each key (basic characters 80, half characters 43, vowels 12, matras 12).

Two key non-predictive
• 4 keys are required to input the characters.
• Any character can be entered in just two key presses.
• Key mapping is done on the basis of vargas and is hence easy to remember.
• Very short learning time (3-5 minutes).
• No need to remember the keys.
• Guidance reduces mistakes.
• The same keyboard layout can be used for all Indian languages, so there is no need to relearn it for another language.

Two key non-predictive
• 13 keys are required to input the characters.
• The * key is the mode key used for selecting the halant / half character.
• Technology given to MNCs.

Predictive writing
• Should address the need for fast input using limited keys.
• Should not take more than one key press per character.
• Should help auto-complete the word, so that fewer key presses are needed than the length of the word.
• Fast searching, backed by a dictionary of the most commonly used words.
• Can also manage user-defined words.
• C-DAC GIST has developed predictive writing for Hindi; work is in progress for other languages.

• Because of the nature of the script, it is more complex to implement than for any other language.
• “Accuracy increase” is a function of a continuous development process.
• Stepwise approach to achieve a good level of prediction.
• Approaches for predictive inputting for devices:
  • Pure dictionary based.
  • Dictionary plus rule-based approach.
  • Addition of domain-specific dictionaries.
  • Increase in accuracy by analyzing live data and enhancing the built-in dictionaries accordingly.

Predictive writing Demo
• 5 keys required to complete the word Dhanyavad.

Features
• Highly efficient algorithm with automatic prediction of the words the user uses frequently.
• Automatic tracking of the user's frequently used words, which are given priority.
• Currently 25,000 common “spoken Hindi” words.
• Words not available in the dictionary can be added by the user with the help of the non-predictive mechanism.
• Current memory requirements:
  • 180 KB for 25,000 words (uncompressed)
  • 8 KB for code
  • 3 KB scratch memory

Conclusions
• Urgent need for development of expert / smart writing systems backed by multimodal inputs and linguistic resources such as spellcheckers, grammar checkers, synonyms, antonyms, thesauri, domain-based dictionaries, phrases and references.
• Standardization of Indian language input through limited keys.

THANK YOU
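
Illustrative sketch: dictionary-based predictive input

The "pure dictionary based" approach with frequency tracking and user-added words, as listed on the Predictive writing and Features slides, can be sketched roughly as follows. This is not the C-DAC GIST algorithm, which the slides do not describe. It is a minimal illustration under stated assumptions: a Latin-letter phone keypad stands in for an Indian-language key mapping, words are romanized for readability, matching is a simple linear scan, and every name (`KEYPAD`, `key_sequence`, `PredictiveDictionary`) is hypothetical.

```python
from collections import defaultdict

# Hypothetical keypad: each digit key carries a group of Latin letters.
# (Assumption: a real Indian-language layout would group characters differently.)
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_KEY = {ch: key for key, letters in KEYPAD.items() for ch in letters}


def key_sequence(word):
    """Return the digit-key sequence for a word, one key press per character."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())


class PredictiveDictionary:
    """Toy dictionary-based predictor with per-user frequency tracking."""

    def __init__(self, words):
        self.freq = defaultdict(int)  # how often the user accepted each word
        self.words = set()
        for word in words:
            self.add_word(word)

    def add_word(self, word):
        """Add a user-defined word (e.g. typed via a non-predictive mode)."""
        self.words.add(word.lower())

    def suggest(self, keys):
        """Words whose key sequence starts with `keys`, most used first."""
        matches = [w for w in self.words if key_sequence(w).startswith(keys)]
        # Sort by descending usage count, then alphabetically for stability.
        return sorted(matches, key=lambda w: (-self.freq[w], w))

    def select(self, word):
        """Record that the user accepted `word`, raising its priority."""
        self.freq[word] += 1


# Usage with a tiny, made-up word list.
d = PredictiveDictionary(["dhanyavad", "dhan", "dhanush", "ghar"])
print(d.suggest("3426"))    # candidates after four key presses
print(d.suggest("34269"))   # candidates after five key presses
d.select("dhanyavad")       # accepted words are ranked higher next time
print(d.suggest("3426"))
```

With this made-up word list, five key presses are enough to narrow the candidates to a single word, which mirrors the idea of completing a word in fewer key presses than its length. A production system would likely replace the linear scan with an indexed structure such as a trie to stay within the small memory budgets the slides mention.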