Towards Universalisation of Creativity
Dr. Om Vikas
Department of Information Technology
Ministry of Communications and Information Technology
Government of India
E-mail: omvikas@mit.gov.in
Dr.Om vikas ICDL-2004

Is there gain in knowledge or loss of knowledge?
• From an estimated 10,000 world languages in 1900, about 6,700 languages survived in 2000. Two percent of the world's languages are becoming extinct every year.
• There is a worldwide, unquantifiable erosion of cultural participation, knowledge and innovation.
• With the loss of a language, we lose art and ideas, scientific information and technological innovation capacity.
• World-level literacy is improving. More people can read than ever before, but fewer people create stories.
• There is a shift from being creators to being consumers at a time when technology could have amplified our creative capacities.
• UNESCO study (1999) of 65 languages: 49 of the languages (75%) had experienced a real decline in the number of works translated from these languages into other languages.
• The proportion for English rose from 43 percent in 1980 to over 57 percent in 1994.
• The share held by the top four translated languages (English, Spanish, French and German) rose from 65 percent in 1980 to 81 percent in 1994.
• According to a UNESCO study of the world's 140 most published authors, 90 out of 140 were English writers in 1994, compared to 64 out of 140 in 1980.
• There is a collapse in authorship, translation and quality in other languages.
Erosion of Language and Culture!

Is the technology to divide or to unite?
• Latin alphabet users, 39% of the global population, enjoy 84% of access to the Internet.
• Hanzi users (CJK), 22% of the global population, enjoy 13% of Internet access.
• Arabic script users, 9% of the population, have 1.2% of Internet access.
• Users of Brahmi-origin scripts in South-east Asia and of Indic scripts, 22% of the world population, have just 0.3% of Internet access.
• More than 80% of the content on the Internet is in English.
• ICT penetration in India and other developing countries is lower.

ICT Indicators
Indicator | Advanced Nations | Developing Nations
Teledensity | 50-70 % | 20-30 %
Cellphone density | 30-75 % | 4-7 %
PC penetration | 30-60 % | 0.5-2 %
Underdeveloped Nations: sprawling Digital Divide

Digital Divide as They Behold
Perception | Developed Countries | Developing Countries
Why discussed? | Desire to capture larger markets | Fear of lagging behind in the economic race
Policy | Information explosion | Localization
Results | Increasing use of English and thrust of western culture | Preservation of local language and culture
Consumer nature | "Substitute the old" [consumerism-centric] | "Upgrade the old"
Technology development | IPR-centric | Open source technology
Low cost PC | $400 | less than $40
Focus | Digital divide: access to information, wider control | Digital Unite: share the knowledge, small is beautiful
Reason for the low-cost PC gap: PPP (15:1), GNP (75:1); 34260 (USA), 24260, 2400 (India), 460
Low affordability means low ICT penetration and a sprawling Digital Divide.

e-Content & Universal Access
UNESCO identifies challenges in multilingualism and universal access to information:
• General affordable worldwide access
• Hardware and software, Web and Internet features
• Availability of accessible websites and Internet access devices
• Accessibility of multiple languages
• Development of content in native languages, and its placement on the Internet
• Appropriate design of software for users

Potential
Use of non-English languages on the Internet will increase drastically by 2010.
[Chart: users in 2003 and 2010, on a scale of 0 to 500 Mn, for English, Japanese, Chinese, French, Spanish, German and Indian languages]
65% of the information on the Internet is in English.
Source: IBM's Web Fountain

New Order of Knowledge-based Society:
• Universalization of Creativity
• Rise, Raise & Race

Raise to Rise & Race to Limits
Liberalisation is the advice of advanced nations to the rest, for creating a conducive environment for technology acquisition and absorption and thus expanding their market. The mindset needs to change to help the underdeveloped nations catch up in technology absorption and participate in knowledge generation.
The following is an example of providing a high-tech solution in a low-tech environment. A group of volunteer engineers in the USA designed and built a rugged, low-cost, bicycle-powered computer and wireless network for the villagers of Phon Kham in Laos, which had no electricity or phone service. There was no way to call relatives living abroad or even in the next town. This is a project to bridge the digital divide.
Innovation follows from stretching our imagination to its limits. As we noticed, the constrained environment of a village in Laos led to the development of a new operating system, a cycle-powered PC, etc. Heterogeneity of communities opens up new opportunities for innovation and integration skills. Time is a critical factor in the context of ICT. Let all communities the world over catch up to the basic technology absorption capability and use it to improve the quality of life of the people at large.

Digital Knowledge Resources:
• Electronic information is being created in many forms and formats and stored in many repositories.
• Ever-improving information technology makes sharing knowledge resources economical and universally accessible.

World Scenario of Digital Library Initiatives
Digital libraries are a form of information technology in which social impact matters as much as technological advancement.

DLI in USA
Six major projects were launched during 1994-1998 under the DLI (Digital Library Initiative), funded by the NSF, DARPA and NASA in the USA.
Digital Libraries Initiative Phase 2 (DLI-2) is an NSF-led initiative that builds on the successes of DLI-1. DLI-2 is supported by many funding agencies, such as NSF, DARPA, the National Library of Medicine, the Library of Congress and the National Endowment for the Humanities. DLI-2 will investigate digital libraries as human-centered systems.

DARPA's Information Management program (www.dapra.mil/ito/research/in) addresses core digital library issues requiring revolutionary research technology:
• Federated repositories. The organisation of distributed repositories into a coherent virtual collection is fundamental.
• Scalability. Managing billions of digital objects and millions of sources poses challenges in identifying, categorizing, indexing, summarizing and extracting content.
• Interoperability. Digital libraries require semantic interoperability among heterogeneous repositories distributed across the network.
• Collaboration. Analysts work in distributed teams, building on each other's knowledge, experience and resources.
• Communication. Timely dissemination of research results is the focus of D-Lib.

The Illinois D-Lib project (http://dli.grainger.uiuc.edu) takes SGML directly from the publishers' collections, converts it into a canonical format for federated searching and transforms tags into a standard set.
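The tag-standardization step described for the Illinois project can be illustrated with a minimal sketch. It is not the project's actual pipeline: the tag names, the mapping table and the helper functions below are hypothetical, and the regex-based renaming assumes simple, well-formed markup rather than full SGML.

```python
import re

# Hypothetical mapping from publisher-specific tag names to one canonical set.
CANONICAL_TAGS = {
    "atl": "title",
    "arttitle": "title",
    "au": "author",
    "aut": "author",
    "abs": "abstract",
}

# Matches an opening or closing tag: optional slash, tag name, remaining attributes.
TAG_PATTERN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>")


def normalize_tags(markup: str) -> str:
    """Rename every known publisher tag to its canonical name (lowercased)."""
    def rename(match: re.Match) -> str:
        slash, name, rest = match.groups()
        canonical = CANONICAL_TAGS.get(name.lower(), name.lower())
        return f"<{slash}{canonical}{rest}>"

    return TAG_PATTERN.sub(rename, markup)


def search(records: list[str], field: str, term: str) -> list[int]:
    """Return indexes of normalized records whose `field` element contains `term`."""
    pattern = re.compile(
        rf"<{field}[^>]*>(.*?)</{field}>", re.IGNORECASE | re.DOTALL
    )
    hits = []
    for index, record in enumerate(records):
        if any(term.lower() in m.group(1).lower() for m in pattern.finditer(record)):
            hits.append(index)
    return hits


# Two records from two (hypothetical) publishers using different tag sets.
publisher_a = "<ATL>Digital Libraries</ATL><AU>Smith</AU>"
publisher_b = "<arttitle>Libraries of India</arttitle><aut>Rao</aut>"
normalized = [normalize_tags(publisher_a), normalize_tags(publisher_b)]
```

Once both records use the same tag set, a single query such as `search(normalized, "title", "libraries")` can run across collections from different publishers, which is the point of converting to a canonical format before federating the search.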
Federating the search at a semantic level is an area of active research in the digital library community. Statistical approaches lead toward scalable semantics: indexing deeper than text word search that is computable on large real collections.

The Journal Storage project (JSTOR) started at the University of Michigan with a grant from the Andrew W. Mellon Foundation. The JSTOR database totals 450,000 articles and 2.7 million pages, created via a combination of page images and full text at a rate of 100,000 pages. The www.jstor.org URL links to three server machines: two at the University of Michigan and a third at Princeton University. Distributed mirrors offer increased reliability, accessibility and capacity.

The Informedia Project at Carnegie Mellon University has created a terabyte digital video library in which automatically derived descriptors for the video are used for indexing, segmenting and accessing the library contents.
Artificial intelligence techniques have been used to create metadata, the data that describes video content. Powerful browsing capabilities are essential in a multimedia information retrieval system.
The Carnegie Mellon DLI project searched multimedia, particularly video segments, by generating text indexes using speech understanding.
The Stanford DLI project searched across different engines using multiprotocol gateways.
Other, even harder issues remain untouched, such as multicultural search across context and meaning.

DLI in Europe
The importance of D-Lib research is spreading beyond the US. European research in digital libraries is funded by the European Union as well as national sources. DL projects have been supported by the Information Engineering (www.echo.lu/ie), Language Engineering (www.echo.lu/langeng/en/lehome.html) and Esprit (www.cordis.lu/esprit) programs in Europe.
Under NSF-EU collaboration, five working groups have been formed in the key technical areas of interoperability, metadata, IPR, resource indexing and discovery, and multilingual information access.

DLI in Asia
Since 1995, D-Lib research has become a national grand challenge in several countries in Asia. Most projects can be classified into the following categories:
• Nationwide D-Lib initiatives and special-purpose digital libraries, for example the Library 2000 Project in Singapore (to link all library resources) and the Financial Digital Library at the University of Hong Kong (to serve the needs of the HK stock market and users).
• Digital museums and historical document digitization, for example the Digital Museum Project of the National Taiwan University and the digitization of the art collection of the Palace Museum in Taipei by IBM.
Local language processing and historical cultural content could be the most immediate Asian contribution to the international DL community. An Asia Digital Library consortium is fostering long-term collaboration and projects in DL-related topics in Asia (www.cyberlib.net/adl).

• Local language and multilingual information retrieval, for example the Net Compass Project of Tsinghua University in China, Chinese Information Retrieval at the Academia Sinica, Taiwan, and New Zealand's multilingual project.
The New Zealand D-Lib (http://www.nzdl.org) currently offers about 20 collections, varying in size from a few documents up to 10 million documents and several gigabytes of text. The documents are written in many different languages, including English, French, German, Arabic, Maori, Portuguese and Swahili. The D-Lib provides interfaces to the collections in several languages. To accommodate blind users (with speech synthesizers) and partially sighted users (with large-font displays), the NZ D-Lib provides a text-only version of the interface for each language.

iv.
Digital Library of India Initiative
Broad objectives:
• To digitize and index the heritage knowledge.
• To promote lifelong learning in society (a necessity of the knowledge-based society).
• To promote collaborative creativity and build up knowledge teams across borders.
• To participate in world initiatives on digital libraries such as UDL.
[Note that India has multiple languages, multiple scripts, manuscripts in different forms, books using various fonts, a vast tacit knowledge resource of vanishing scholars, and multiple commentaries on a text. This forms a vast treasure of heritage knowledge.]

• Mobile Digital Library: knowledge at doorsteps. To facilitate surfing, accessing, printing and taking away a book of choice anywhere and anytime.
• 20 DL Centers with 106 high-resolution scanners
• 4 Megacenters (to be set up)

Issues pertaining to digitization
Multilingual issues:
• Character sets (UNICODE?)
• Representations
• Multilingual navigation
• Translation assistance
Policy challenges:
• Convenient quality displays
• What to digitize first?
• Use of copyrighted material
• Economics (Who pays? Who gets?)
• Privacy
• Reliability of information
• Authentication of text from multiple versions
• Digital Library Act

Need for Indian Digital Library Act
Issues to tackle may include:
• compulsory licensing;
• digital pack book (incentive: 10% tax deduction on lifetime revenue);
• deemed out of print (donate electronic rights);
• concept shift in royalty from per copy to per preview;
• public lending rights (as in Japan);
• 4Cs (Consortium for Compensation for Creative Content), a formula to respect the content creator and pay compensation (min. Rs. 100/- to max. Rs. 1 lakh);
• inclusion of books, music and movies with higher and higher privacy value.

Linguistic Scenario in India
• The eighteen constitutional Indian languages are listed below with their scripts in parentheses: Hindi (Devanagari), Konkani (Devanagari), Marathi (Devanagari), Nepali (Devanagari), Sanskrit (Devanagari), Sindhi (Devanagari/Urdu), Kashmiri (Devanagari/Urdu), Assamese (Assamese), Manipuri (Manipuri), Bangla (Bengali), Oriya (Oriya), Gujarati (Gujarati), Punjabi (Gurumukhi), Telugu (Telugu), Kannada (Kannada), Tamil (Tamil), Malayalam (Malayalam) and Urdu (Urdu). There are 10 Indic scripts in vogue.
• Interestingly, Indian languages owe their origin to Sanskrit, hence they have in common a rich cultural heritage and treasure of knowledge. Indic scripts originated from the Brahmi script. Less than 5 percent of people can read or write English. Over 95 percent of the population is normally deprived of the benefits of English-based information technology.
Characteristics of Indian languages:
• What You Speak Is What You Write (WYSIWYW)
• Script grammar: transformation rules
• Relatively free word order
• Common phonetic-based alphabet
• Common concept terms (from Sanskrit)

Indian Language Technology Map
[Map: CoILTech centres, including IETE, New Delhi and G.G. Univ., Bilaspur]

Major Achievements in ILT
• Information Dissemination
• Localization of LINUX
• Translation Support Systems
• Human Machine Interface Systems
• Standardization
• Knowledge Tools
• Knowledge Resources

Translation Support Systems (MAT)
• English to Hindi (Angla-Bharati), http://anglahindi.iitk.ac.in (very satisfactory; above 85% consistently okay)
• Indian languages to Hindi (under development)
• Hindi to English (under development)
Human Machine Interface Systems
• Optical Character Recognition (OCR): accuracy above 97% for 7 ILs, viz. Hindi, Marathi, Bangla, Tamil, Telugu, Gurumukhi and Malayalam. OCRs in other ILs are under development.
• Text to Speech system (TTS): Hindi, Bangla
• Continuous Speech Recognition (CSR): Hindi

Major Achievements in ILT (continued)
Knowledge Resources
Bilingual dictionaries (over 30,000 words):
• English - Hindi
• English - Telugu - Hindi
• English - Tamil - Hindi
• English - Kannada - Hindi
• English - Bangla - Hindi
• English - Punjabi - Hindi
• English - Oriya - Hindi
• English - Malayalam - Hindi
• English - Sanskrit - Hindi
Parallel corpora: a one-million-page parallel corpus is under development. The development of the parallel corpora is one of the unique achievements of the TDIL programme and is appreciated worldwide. [600 thousand pages ready.]

Major Achievements in ILT (continued)
Standardization
• UNICODE: DIT is a voting member of the Unicode Consortium. Proposed changes to the Unicode Standard were finalized in consultation with the respective State Governments and the Indian IT industry and presented at the Unicode Technical Committee meeting. Some of the proposed changes have been incorporated in Unicode version 4.0.
• INdian Scripts FOnt Code (INSFOC) standards have been developed.
• Indian Script to Romanization Tables (INSROT) are ready.
Knowledge Tools
• Morph analyzer, syntactic analyzer, spell checker, messaging system, authoring systems, word processors and code conversion utilities have been developed.

Major Achievements in ILT (continued)
Localization of LINUX systems
• INDIX system: the localized INDIX-2 supports 5 ILs, viz. Hindi, Marathi, Gujarati, Tamil and Bangla. A LINUX operating system with support for other Indian languages is under development.
Information Dissemination
• TDIL website: http://tdil.mit.gov.in. This website contains information on various TDIL activities and achievements, and provides access to a variety of content and downloadables in Hindi and other Indian languages.
• Free downloads: Indian language keyboard driver and fonts, and other tools, corpus, content, conversion utilities and machine-aided translation systems.
• Quarterly Language Technology Flash: Vishwabharat@tdil

Language Technology HRD
• Postgraduate programmes in the domains of Computational Linguistics and Knowledge Engineering.
• All Bachelors and Masters programmes in Computer Science Engineering will also cover multilingual computing.
• School curricula include the basics of multilingual computing.

Typical illustration of Indian Language OCRs
[Figure: Hindi OCR input and OCR output]
Efficiency 96.8%, working for font sizes from 12 to 36.

Gurmukhi to Shahmukhi Transliteration
[Figure: sample text in Gurmukhi and its Shahmukhi transliteration]

Machine Translation (MAT): English to Hindi
http://anglahindi.iitk.ac.in
Illustration of the online MAT system:
• Simple Sentences.
  sarala vaakya .
  सरल वाक्य।
• Welcome to London.
  landana men aapakaa svaagata hai.
  लंदन में आपका स्वागत है।
• There are some cases which are still pending.
  vahaan kuc'ha kesa hain jo abhii bhii nilamibata hain .
  वहाँ कुछ केस हैं जो अभी भी निलम्बित हैं।

Machine Translation (MAT): Hindi to English

Innovating to Innovate
"Researchers always want to go for that last 2% of performance. But it's better to get a sufficient solution out fast and then continue to enhance it."
Mark Dean, IBM (Source: Harvard Business Review, Aug 2002)
Hence the TDIL Programme emphasizes:
• collaborative development of language technology, and
• taking language technology products out to market rapidly for feedback and refinement.

Media Lab Asia: another initiative
• World Computer (Low-cost PC): Rural Operating Systems; Speech Interfaces for Local Dialects; Visual Language; Interfaces for All; Interlingua Web; Multi-Literate Interface; Literacy Learning Through Pictures
• Bits for All (Universal Connectivity): Rural WiFi, DakNet, Digital Gangetic Plain, Off-Line Internet Access, Rural VoIP
• Tomorrow's Tools (Language Interfaces): Mapping for the Masses, Community Access to Sustainable Health (Ca:sh), Building Robots Creating Science (BRICS), Digital Craft Revival, Digital Human Body, Digital Music, InfoSculpture, Suchik, Polysensors, Complex RF Impedance Analyzers, UV-VIS Spectrometer, Power Sensors, Think Cycle
• Digital Village (Consolidation in delivering value to the masses): Sustainable Access in Rural India, Community Connection, Digital Mandi, InfoThela

Trends in Language Technology
• Intelligent Human Computer Interaction
To support more sophisticated and natural input and output that promise knowledge-based or agent-based dialogue, in which the interface gracefully handles errors and interruptions and dynamically adapts to the current context.
Typical properties:
- Multimodal input: they process potentially ambiguous, imprecise combinations of mixed input such as written text, spoken language, gestures (e.g., mouse, pen, dataglove) and gaze.
- Multimodal output: they design coordinated presentations of, e.g., text, speech, graphics and gestures, which may be presented via conventional displays or animated, life-like agents.
- Interaction management: mixed-initiative interactions that are context-dependent, based on system models of the discourse, user and task.

• Machine Translation
1970s: Narrow domain, rule-based approach.
1980s: Practical MT systems, example-based approach, Interlingua and Transfer methods.
1990s: Multilingual MT, simultaneous interpretation, example-based approach revisited, corpus-based and statistics-based approaches.
2000s: MT through NL understanding and language resources.

• Speech Technology Development
- Speech technology is a field of interactive technologies. There is an ongoing shift from speech component research to research on integrated speech systems. Together with speech, the other modalities that constitute full natural human-human communication (e.g., gesture, lip movements, facial expression, gaze, bodily posture) lead towards multimodal interactive systems.
- 1970s: Speech synthesis systems used rule-based formant synthesis. (Formants are the resonant frequencies of the vocal tract.)
- 1990s: Concatenative speech synthesis systems use small pieces of pre-recorded speech.
- There is a trend towards cross-project collaboration, synergy, critical mass, and deployable and scalable technologies.

• Trends in Digital Library Technologies
- Multi-modal Input: scanning, smartizing (value addition), content, multi-lingual, multi-media
- Standardization: character code, font code, semantic indexing, DOI, XML, SCORM
- Navigation: browsing, finding, searching, zooming, hyperbolic tree, virtual reality, aboutness, searching mathematics, multilingual navigation, translation assistance
- Architecture: interoperability, multi-lingual information access, metadata, resource indexing and discovery in a globally distributed digital library
- IPR Issues: 4Cs (Consortium for Compensation for Creative Content)
- Knowledge Generation Capacity:
  Focus in the 20th century: capitalistic and monopolistic trend in publication and dissemination.
  Focus in the 21st century: Universalization of Creativity.

Future Knowledge Networks
The Interspace represents the third wave in the ongoing evolution of the Global Information Infrastructure, driven by rapid advances in computing and information technology during [...].
The wave pattern roughly describes four distinct phases of functionality: fundamental research (trough), development of prototype systems (ascent), emergence of commercial systems (crest), and mass propagation (descent).

Scalable Semantics
Future knowledge networks will rely on scalable semantics, on automatically indexing the community collections so [...]. The knowledge networks of the Interspace will be connected via switching machines that switch concepts.
Connectivity and training continue to be the principal barriers to integrating the global network of libraries.
Interspace focuses on scalable technologies for semantic indexing that work generally across all subject domains.
We can use concept spaces, collections of abstract concepts generated from concrete objects, to boost searches by interactively suggesting alternative terms. We can use category maps to boost navigation by interactively browsing clusters of related documents.
Scalable semantics is used to index the semantics of document contents on large collections. Concept spaces use text documents as the objects and noun phrases as the concepts.

Summing up the Challenges Ahead
• ML Open Source Software
- Shareable software
- Standards database and updating
- Support service and help line
- Consortium approach
- GPL with performance, else Garbage In, Garbage Out
• Benchmarking & Standards
- Testing against international standards
- Active participation in evolving standards
• Information Technology Culture
- Awareness: IT clinics, workshops, media
- BIPK (Basic Information Processing Kit) with user-friendly, easy-to-use, affordable, scalable, interoperable and re-usable tools. BIPK may consist of a StarOffice-like processing facility, fonts, KB driver, spell checker, dictionary and conversion utility.
- Entrepreneurship: Gyanaudyog workshops

Challenges Ahead (continued)
• Cross-lingual Information Access
- Search engine, web crawler, on-line machine translation
• Localization
- Localization of software and content into local languages
- Enlarging the share in localization outsourcing ($8 Bn by 2006: IDC)
• International Collaboration in Language Informatics
- Industry-academia cooperation in joint research and technology development projects
- Exchange of faculty and students
- HRD programmes in Knowledge Engineering and Computational Linguistics
• Rise, Raise & Race
- Possess basic language technologies
- Promote collectivistic culture
- Think globally and act locally
- Collaborate for innovation

Digital Library is a means to meet the end: the objective of Universalization of Creativity.

Nothing is so pious as knowledge.
न हि ज्ञानेन सदृशं पवित्रमिह विद्यते।
(Bhagwadgita: 4.38)
शांति: (Shaantih)