Towards Universalisation of Creativity
Dr. Om Vikas
Department of Information Technology
Ministry of Communications and Information Technology
Government of India
E-mail: omvikas@mit.gov.in
Dr.Om vikas
ICDL-2004

Is there gain in knowledge or loss of Knowledge?
• From an estimated 10,000 world languages in 1900, about 6,700 languages survived in
2000. Two percent of the world's languages are becoming extinct every year.
• There is worldwide, unquantifiable erosion of cultural participation, knowledge and
innovation.
• With the loss of a language, we lose art and ideas, scientific information and
technological innovation capacity.
• World-level literacy is improving. More people can read than ever before, but fewer
people create stories.
• There is a shift from being creators to being consumers at a time when technology
could have amplified our creative capacities.
• UNESCO study (1999) of 65 languages: 49 of the languages (75%) had experienced
a real decline in the number of works translated from these languages into other languages.
• The proportion for English rose from 43 percent in 1980 to over 57 percent in 1994.
• The share held by top four translated languages (English, Spanish, French and
German) rose from 65 percent in 1980 to 81 percent in 1994.
• According to a UNESCO study of the world’s 140 most published authors, 90 out
of 140 were English writers in 1994, compared to 64 out of 140 in 1980.
• There is a collapse in authorship, translation and quality in other languages.
➔ Erosion of Language and Culture!

Is the technology to divide or to unite?
• Latin alphabet users, 39% of the global population, enjoy 84% of
access to the Internet
• Hanzi users (CJK), 22% of the global population, enjoy 13% of
Internet access
• Arabic script users, 9% of the population, have 1.2% of Internet
access
• Users of Brahmi-origin scripts in South-east Asia and of Indic scripts,
22% of the world population, have just 0.3% of
Internet access.
• More than 80% content on Internet is in English.
• ICT penetration in India and other developing countries is lower.
ICT Indicators
Teledensity
Advanced Nations
Developing Nations
Cellphone Density
50-70 %
30-75 %
20-30 %
04-7 %
PC penetration
30-60 %
0.5-2 %
Underdeveloped Nations
Sprawling Digital Divide
Digital Divide as They Behold
Perception | Developed Countries | Developing Countries
Why discussed? | Desire to capture larger markets | Fear of lagging behind in economic race
Policy | Information explosion | Localization
Results | Increasing use of English and thrust of western culture | Preservation of local language and culture
Consumer nature | “Substitute the old” [consumerism-centric] | “Upgrade the old”
Technology development | IPR-centric | Open source technology
Low cost PC | $400 | less than $40
Reason: PPP (15:1) | 34260 (USA) | 2400 (India)
Reason: GNP (75:1) | 24260 | 460
Focus | Digital divide; access to information; wider control | Digital unite; share the knowledge; small is beautiful
Low affordability means low ICT penetration & sprawling Digital Divide
➢ e-Content & Universal Access
❖ UNESCO identifies challenges in multilingualism and universal
access to information
• General affordable worldwide access
• Hardware and Software, Web and Internet Features.
• Availability of Accessible websites and Internet Access
devices.
• Accessibility of multiple languages
• Development of content in Native languages, and its
placement on Internet.
• Appropriate design of software for users
Potential use of non-English languages on the Internet
will increase drastically by 2010:
[Bar chart: Internet users (0 to 500 Mn) in 2003 and 2010 for English, Japanese, Chinese, French, Spanish, German and Indian languages]
65% of information on the Internet is in English
Source: IBM’s Web Fountain
➢ New Order of Knowledge-based Society:
• Universalization of Creativity
• Rise, Raise & Race
Raise to Rise & Race to Limits
Liberalisation is the advice of advanced nations to the rest for creating a
conducive environment for technology acquisition and absorption, and
thus expanding their market. The mindset needs to change to help the
underdeveloped nations catch up in technology absorption and
participation in knowledge generation.
The following is an example of providing a high-tech solution in a low-tech
environment. A group of volunteer engineers in the USA designed and
built a rugged, low-cost, bicycle-powered computer and wireless
network for the villagers of Phon Kham in Laos, which had no electricity or
phone service. There was no way to call relatives living abroad or even
in the next town. This is a project to bridge the digital divide.
Innovation follows from stretching our imagination to its limits. As we
noticed, the constrained environment of a village in Laos led to the
development of a new operating system, a cycle-powered PC, etc.
Heterogeneity of communities opens up new opportunities for
innovation and integration skills. Time is a critical factor in the context
of ICT. Let all communities the world over catch up to the basic
technology absorption capability and use it to improve the quality of life
of the people at large.
Digital Knowledge Resources:
• Electronic Information is being created in many forms and formats
and stored in many repositories
• Ever-improving Information Technology makes sharing of
Knowledge Resources economical and universally accessible
➢ World Scenario of Digital Library Initiatives
Digital libraries are a form of information technology in which social
impact matters as much as technological advancement.
DLI in USA
Six major projects were launched during 1994-1998 under the DLI (Digital
Library Initiative), funded by the NSF, DARPA and NASA in the USA.
Digital Libraries Initiative Phase 2 (DLI-2) is an NSF-led initiative that
builds on the successes of DLI-1. DLI-2 is supported by many funding
agencies, including NSF, DARPA, the National Library of Medicine, the Library of
Congress and the National Endowment for the Humanities. DLI-2 will
investigate digital libraries as human-centered systems.
DARPA's Information Management program addresses
(www.dapra.mil/ito/research/in) core digital library issues requiring
revolutionary research technology:
✓ Federated repositories. The organisation of distributed
repositories into a coherent virtual collection is fundamental.
✓ Scalability. Managing billions of digital objects and millions of
sources poses challenges in identifying, categorizing, indexing,
summarizing and extracting content.
✓ Interoperability. Digital libraries require semantic interoperability
among heterogeneous repositories distributed across the network.
✓ Collaboration. Analysts work in distributed teams, building on
each other's knowledge, experience and resources.
✓ Communication. Timely dissemination of research results is the
focus of D-Lib.
The Illinois D-Lib project (http://dli.grainger.uiuc.edu) takes SGML
directly from the publishers' collections, converts it into a canonical
format for federated searching and transforms tags into a standard set.
Federating the search at a semantic level is an area of active research in
the digital library community. Statistical approaches lead toward scalable
semantics: indexing deeper than text word search that is computable on
large real collections.
The Journal Storage (JSTOR) project started at the University of
Michigan with a grant from the Andrew W. Mellon Foundation. The JSTOR
database totals 450,000 articles and 2.7 million pages, created via a
combination of page images and full text at a rate of 100,000 pages. The
www.jstor.org URL links to three server machines: two at the University of
Michigan and a third at Princeton University. Distributed mirrors offer
increased reliability, accessibility, and capacity.
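The tag-normalization step attributed to the Illinois project above (publisher markup converted to a canonical form, then searched as one collection) can be sketched as follows. This is only an illustration under simplifying assumptions: the two publishers, their tag names, the `TAG_MAP` table and the sample records are all invented, and the records are assumed to be flat elements without attributes, which real SGML is not.

```python
import re

# Hypothetical publisher-specific tag names mapped to one standard set.
TAG_MAP = {
    "pubA": {"atl": "title", "au": "author", "abs": "abstract"},
    "pubB": {"ti": "title", "aut": "author", "summary": "abstract"},
}

# Matches a flat element such as <au>text</au> (no attributes, no nesting).
TAG_RE = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)

def to_canonical(publisher, sgml):
    """Convert one publisher record into a dict keyed by standard tags."""
    mapping = TAG_MAP[publisher]
    record = {}
    for tag, text in TAG_RE.findall(sgml):
        standard = mapping.get(tag.lower())
        if standard:  # tags outside the standard set are dropped
            record[standard] = text.strip()
    return record

def federated_search(records, field, term):
    """Search one standard field across records from all publishers."""
    term = term.lower()
    return [r for r in records if term in r.get(field, "").lower()]

records = [
    to_canonical("pubA", "<atl>Digital Libraries</atl><au>A. Author</au>"),
    to_canonical("pubB", "<ti>Semantic Indexing</ti><aut>B. Writer</aut>"),
]
hits = federated_search(records, "title", "digital")
```

Because every record ends up with the same field names, one query function can serve all sources; this is plain text matching, not the semantic-level federation the slide describes as open research.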
The Informedia Project at Carnegie Mellon University has created
a terabyte digital video library in which automatically derived
descriptors for the video are used for indexing, segmenting, and
accessing the library contents. Artificial Intelligence techniques
have been used to create metadata - the data that describes video
content.
Powerful browsing capabilities are essential in a
multimedia information retrieval system.
The Carnegie Mellon DLI project searched multimedia,
particularly video segments, by generating text indexes using
speech understanding. The Stanford DLI project searched across
different engines using multiprotocol gateways. Other even
harder issues remain untouched, such as multicultural search
across context and meaning.
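Searching video through text indexes, as described for the Carnegie Mellon project, can be sketched with an inverted index. The sketch assumes a speech recognizer has already produced a transcript for each video segment; the segment identifiers and transcripts are made up, and nothing here reproduces the Informedia system itself.

```python
from collections import defaultdict

def build_index(transcripts):
    """Map each word to the set of video segments whose transcript contains it."""
    index = defaultdict(set)
    for segment_id, text in transcripts.items():
        for word in text.lower().split():
            index[word.strip(".,;:!?")].add(segment_id)
    return index

def search(index, query):
    """Return segments containing every query word, sorted for stable output."""
    words = query.lower().split()
    if not words:
        return []
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return sorted(result)

# Hypothetical transcripts keyed by (video name, start time in seconds).
transcripts = {
    ("news01", 0): "Flood waters rise in the river valley.",
    ("news01", 45): "Rescue teams reach the valley by boat.",
    ("doc07", 120): "The river delta supports rice farming.",
}
index = build_index(transcripts)
```

A query such as `search(index, "river valley")` returns the segments whose transcripts contain both words, so a viewer can jump straight to that point in the video.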
DLI in Europe
The importance of D-Lib research is spreading beyond the US.
European research in Digital Libraries is funded by the European
Union as well as national sources. DL projects have been supported by
the Information Engineering (www.echo.lu/ie), Language
Engineering (www.echo.lu/langeng/en/lehome.html), and Esprit
(www.cordis.lu/esprit) programs in Europe.
Under NSF-EU collaboration, five working groups have been formed
in the key technical areas of Interoperability, Metadata, IPR,
Resource indexing and discovery, and multilingual information
access.
DLI in Asia
Since 1995, D-Lib research has become a national grand challenge in
several countries in Asia. Most projects can be classified into the
following categories:
✓ Nationwide D-Lib initiatives and special-purpose digital libraries: for
example, the Library 2000 Project in Singapore (to link all
library resources) and the Financial Digital Library at the University
of Hong Kong (to serve the needs of the HK stock market and users)
✓ Digital museums and historical document digitization: for
example, the Digital Museum Project of the National Taiwan
University and the digitization of the art collection of the Palace
Museum in Taipei by IBM.
✓ Local language processing and historical cultural content could be
the most immediate Asian contribution to the international DL
community. An Asia Digital Library consortium is fostering long-term
collaboration and projects in DL-related topics in Asia
(www.cyberlib.net/adl).
✓ Local language and multilingual information retrieval: for
example, the Net Compass Project of Tsinghua University in
China, Chinese Information Retrieval at the Academia Sinica,
Taiwan, and New Zealand's multilingual project.
The New Zealand D-Lib (http://www.nzdl.org) currently offers
about 20 collections, varying in size from a few documents up to 10
million documents and several gigabytes of text. The documents are
written in many different languages, including English, French,
German, Arabic, Maori, Portuguese and Swahili. The D-Lib
provides interfaces to the collections in several languages. To
accommodate blind users (with speech synthesizers) and partially
sighted users (with large-font displays), the NZ D-Lib provides a
text-only version of the interface for each language.
iv. Digital Library of India Initiative
Broad Objectives:
• To digitize and index the heritage knowledge.
• To promote lifelong learning in society (a necessity of the
knowledge-based society).
• To promote collaborative creativity and build up knowledge teams
across borders.
• To participate in world initiatives on digital libraries, such as UDL.
[Note that India has multiple languages, multiple scripts, manuscripts in different forms,
books using various fonts, a vast tacit knowledge resource of
vanishing scholars, and multiple commentaries on a text. This forms
a vast treasure of heritage knowledge.]
• Mobile Digital Library – Knowledge at doorsteps
To let users surf, access, print, and take away a
book of choice anywhere and anytime
• 20 DL Centers with 106 high-resolution scanners
• 4 Megacenters (to be set up)
• Issues pertaining to digitization
Multilingual Issues
• Character Sets (UNICODE?)
• Representations
• Multilingual Navigation
• Translation Assistance
Policy Challenges
• Convenient quality displays
• What to digitize first?
• Use of copyrighted material
• Economics (Who pays? Who gets?)
• Privacy
• Reliability of information
• Authentication of text from multiple versions
• Digital Library Act.
Need for Indian Digital Library Act.
Issues to tackle may include compulsory Licensing, digital pack
book (incentive: 10% tax deduction on lifetime revenue); deemed
out of print (donate electronic rights); concept shift in Royalty
per copy to per preview; public lending rights (as in Japan); 4Cs
(Consortium for Compensation for Creative Content), formula to
respect content creator and pay compensation, (min. Rs. 100/- to
max Rs. 1 lakh), inclusion of books, music and movie with
higher & higher privacy value.
• Linguistic Scenario in India
• Eighteen constitutional Indian Languages are mentioned as follows with
their scripts within parentheses: Hindi (Devanagari), Konkani
(Devanagari), Marathi (Devanagari), Nepali (Devanagari), Sanskrit
(Devanagari), Sindhi (Devanagari/Urdu), Kashmiri (Devanagari/Urdu);
Assamese (Assamese), Manipuri (Manipuri), Bangla (Bengali), Oriya
(Oriya), Gujarati (Gujarati), Punjabi (Gurumukhi), Telugu (Telugu),
Kannada (Kannada), Tamil (Tamil), Malayalam (Malayalam) and Urdu
(Urdu). There are 10 Indic Scripts in vogue.
• Interestingly, Indian languages owe their origin to Sanskrit, hence they
have in common a rich cultural heritage and treasure of knowledge. Indic
scripts originated from the Brahmi script. Less than 5 percent of people
can read or write English. Over 95 percent of the population is normally
deprived of the benefits of English-based Information Technology.
Characteristics of Indian Languages
• What You Speak Is What You Write (WYSIWYW)
• Script grammar - transformation rules
• Relatively word order free
• Common phonetic based alphabet
• Common concept terms (from Sanskrit)
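The common phonetic alphabet is visible in Unicode, where the main Indic script blocks follow a shared ISCII-derived layout, so corresponding letters sit at the same offset in each block. A minimal sketch of script-to-script conversion that relies on this, assuming the parallel layout is good enough for a first pass; the function name is illustrative, and letters that exist in one script but not another still need script-specific rules.

```python
# Start of each script's 128-code-point Unicode block.
BLOCK_START = {
    "devanagari": 0x0900, "bengali": 0x0980, "gurmukhi": 0x0A00,
    "gujarati": 0x0A80, "oriya": 0x0B00, "tamil": 0x0B80,
    "telugu": 0x0C00, "kannada": 0x0C80, "malayalam": 0x0D00,
}

# Danda and double danda live in the Devanagari block but are shared
# punctuation, so they are passed through unchanged.
SHARED_PUNCTUATION = {"\u0964", "\u0965"}

def convert_script(text, source, target):
    """Shift each character from the source block to the same offset in the target block."""
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out = []
    for ch in text:
        offset = ord(ch) - src
        if ch not in SHARED_PUNCTUATION and 0 <= offset < 0x80:
            out.append(chr(dst + offset))
        else:
            out.append(ch)  # spaces, digits, Latin text, etc.
    return "".join(out)

# Devanagari "kamala" (U+0915 U+092E U+0932) converted to Gujarati.
gujarati = convert_script("\u0915\u092e\u0932", "devanagari", "gujarati")
```

The offset trick does not check whether the target code point is actually assigned, so a production converter would add a per-script exception table on top of it.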
Indian Language Technology Map
[Map: CoILTech locations, including IETE (New Delhi) and G.G. Univ. (Bilaspur)]

Major Achievements in ILT
• Information Dissemination
• Localization of LINUX
• Translation Support Systems
• Human Machine Interface Systems
• Standardization
• Knowledge Tools
• Knowledge Resources
➢ Translation Support Systems (MAT)
• English to Hindi (Angla-Bharati) http://anglahindi.iitk.ac.in
(very satisfactory; above 85% consistently okay)
• Indian Languages to Hindi (in the process of development)
• Hindi to English (in the process of development)
➢ Human Machine Interface Systems
Optical Character Recognition (OCR)
(accuracy above 97% for 7 ILs, viz. Hindi, Marathi, Bangla, Tamil, Telugu,
Gurumukhi and Malayalam. OCRs in other ILs are in the process
of development)
Text to Speech system (TTS) (Hindi, Bangla)
Continuous Speech Recognition (CSR) (Hindi)
Major Achievements in ILT…..
➢ Knowledge Resources
Bilingual dictionaries (over 30,000 words):
• English - Hindi
• English - Telugu - Hindi
• English - Tamil - Hindi
• English - Kannada - Hindi
• English - Bangla - Hindi
• English - Punjabi - Hindi
• English - Oriya - Hindi
• English - Malayalam - Hindi
• English - Sanskrit - Hindi
❖ Parallel Corpora: a one-million-page parallel corpus is under
development. The development of the parallel corpus is one of the
unique achievements of the TDIL programme and is appreciated
worldwide [600 thousand pages ready].
Major Achievements in ILT…..
➢ Standardization
UNICODE
DIT is a voting member of the Unicode Consortium.
Proposed changes to the Unicode Standard were finalized in consultation
with the respective State Governments and the Indian IT industry, and presented
at the Unicode Technical Committee meeting. Some of the proposed
changes have been incorporated in Unicode version 4.0.
INdian Scripts FOnt Code (INSFOC) standards have been developed.
Indian Script to Romanization Tables (INSROT) are ready.
➢ Knowledge Tools
Morph analyzer, syntactic analyzer, spell checker, messaging
system, authoring systems, word processors and code conversion
utilities have been developed.
Major Achievements in ILT…..
➢ Localization of LINUX systems
INDIX system: Localized INDIX-2 supports 5 ILs, viz. Hindi, Marathi, Gujarati, Tamil
and Bangla. A LINUX operating system with support for other Indian languages is in the
process of development.
➢ Information Dissemination:
TDIL website: http://tdil.mit.gov.in
This website contains information on various TDIL activities and achievements, and
provides access to a variety of content and downloadables in Hindi and other
Indian languages.
– Free Downloads
Indian language keyboard drivers and fonts and other tools, corpus, content,
conversion utilities, machine-aided translation systems.
❖ Quarterly Language Technology Flash: Vishwabharat@tdil
• Language Technology HRD
• Post Graduate Programs in the domains of
Computational Linguistics & Knowledge Engineering.
• All Bachelors and Masters programmes in Computer Science
Engineering will also cover multilingual computing.
• School curricula include the basics of multilingual computing.
Typical illustration of Indian Language OCRs
[Figure: Hindi OCR input image and the corresponding OCR output]
Efficiency 96.8%, working for font sizes from 12 to 36
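The slide does not say how the 96.8% figure was measured. One common convention, assumed here purely for illustration, is character accuracy: one minus the edit distance between the OCR output and the ground-truth text, divided by the ground-truth length. The function names and sample strings are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, using a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def character_accuracy(truth, ocr):
    """Fraction of ground-truth characters recovered (assumed metric)."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return 1.0 - edit_distance(truth, ocr) / len(truth)

# One wrong character out of four gives 75% accuracy.
accuracy = character_accuracy("abcd", "abed")
```

For Indic scripts the result depends on whether the unit is a Unicode code point, as here, or a full written syllable, so reported figures are only comparable when the unit is stated.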
Gurmukhi to Shahmukhi Transliteration
[Figure: sample text in Gurmukhi and its Shahmukhi transliteration]
• Machine Translation (MAT) – English to Hindi
http://anglahindi.iitk.ac.in
Illustration of online MAT system
Simple Sentences.
sarala vaakya .
सरल वाक्य।
Welcome to London.
landana men aapakaa svaagata hai.
लन्दन में आपका स्वागत है।
There are some cases which are still pending.
vahaan kuc'ha kesa hain jo abhii bhii nilamibata hain .
वहाँ कुछ केस हैं जो अभी भी निलम्बित हैं।
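Romanized lines such as "sarala vaakya" can be produced from Devanagari text with a table-driven mapping. The sketch below is an assumption-laden illustration: the tables cover only a handful of letters, the spellings are chosen to match the romanization seen in the slide (doubled letters for long vowels), and this is not the INSROT table or the AnglaHindi system's own code.

```python
# Small, hypothetical subset of a Devanagari-to-Roman table.
CONSONANTS = {
    "क": "k", "ग": "g", "त": "t", "द": "d", "न": "n", "प": "p", "म": "m",
    "य": "y", "र": "r", "ल": "l", "व": "v", "स": "s", "ह": "h",
}
VOWELS = {"अ": "a", "आ": "aa", "इ": "i", "ई": "ii", "उ": "u",
          "ऊ": "uu", "ए": "e", "ऐ": "ai", "ओ": "o", "औ": "au"}
# Dependent vowel signs (matras), written as escapes for clarity.
MATRAS = {"\u093e": "aa", "\u093f": "i", "\u0940": "ii", "\u0941": "u",
          "\u0942": "uu", "\u0947": "e", "\u0948": "ai",
          "\u094b": "o", "\u094c": "au"}
VIRAMA = "\u094d"    # suppresses the inherent vowel
ANUSVARA = "\u0902"  # rendered as "n", as in the slide's "men"

def romanize(text):
    """Romanize Devanagari, adding the inherent 'a' after bare consonants."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            nxt = text[i + 1] if i + 1 < len(text) else ""
            if nxt in MATRAS:
                out.append(MATRAS[nxt])
                i += 1  # consume the vowel sign
            elif nxt == VIRAMA:
                i += 1  # consume the virama, no vowel emitted
            else:
                out.append("a")
        elif ch in VOWELS:
            out.append(VOWELS[ch])
        elif ch == ANUSVARA:
            out.append("n")
        else:
            out.append(ch)  # spaces, punctuation, unmapped letters
        i += 1
    return "".join(out)

# Devanagari for "simple sentence".
roman = romanize("\u0938\u0930\u0932 \u0935\u093e\u0915\u094d\u092f")
```

The inherent-vowel rule is what turns three bare consonants into "sarala"; a fuller table would also need conjuncts, nasalization marks and the remaining letters.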
• Machine Translation (MAT) – Hindi to English
Innovating to Innovate
Researchers always want to go for that last 2% of performance. But it’s
better to get a sufficient solution out fast and then continue to enhance it.
... Mark Dean, IBM
(Source: Harvard Business Review, Aug 2002)
Hence the TDIL Programme emphasizes
➢ Collaborative development of language technology, and
➢ Taking language technology products out to market rapidly for
feedback and refinement
➢ Media Lab Asia: another initiative
World Computer (Low-cost PC)
Rural Operating Systems; Speech Interfaces For Local Dialects; Visual
Language; Interfaces for All; Interlingua Web; Multi-Literate Interface;
Literacy Learning Through Pictures
Bits for All (Universal Connectivity)
Rural WiFi, DakNet, Digital Gangetic Plain, Off-Line Internet Access,
Rural VoIP
Tomorrow's Tools (Language Interfaces)
Mapping For the Masses, Community Access to Sustainable Health
(Ca:sh), Building Robots Creating Science (BRICS), Digital Craft Revival,
Digital Human Body, Digital Music, InfoSculpture, Suchik, Polysensors,
Complex RF Impedance Analyzers, UV-VIS Spectrometer, Power
Sensors, Think Cycle
Digital Village (Consolidation in delivering value to the masses)
Sustainable Access in Rural India, Community Connection, Digital Mandi,
InfoThela
➢ Trends in Language Technology
• Intelligent Human Computer Interaction
To support more sophisticated and natural input and output that promise
knowledge or agent-based dialogue in which the interface gracefully
handles errors and interruptions and dynamically adapts to the current
context.
Typical properties :
Multimodal input - They process potentially ambiguous, imprecise
combinations of mixed input such as written text, spoken language,
gestures (e.g., mouse, pen, dataglove) and gaze.
Multimodal output - They design coordinated presentation of, e.g., text,
speech, graphics, and gestures, which may be presented via
conventional displays or animated, life-like agents.
Interaction management - mixed-initiative interactions that are context-dependent,
based on system models of the discourse, user, and task
• Machine Translation
1970s: Narrow domain, rule-based approach
1980s: Practical MT systems, example-based approach,
Interlingua and Transfer methods
1990s: Multilingual MT, simultaneous interpretation, example-based
revisited, corpus-based and statistics-based approaches
2000s: MT through NL understanding, language resources
• Speech Technology Development:
• Speech technology is a field of Interactive Technologies.
There is an ongoing shift from speech component research to
research on integrated speech systems. Together with speech,
other modalities (e.g. gesture, lip movements, facial expression,
gaze, bodily posture) constitute full natural human-human
communication, leading towards multimodal interactive
systems.
• 1970s: Speech synthesis systems used rule-based formant
synthesis. (Formants are the resonant frequencies of the vocal
tract transfer function.)
• 1990s: Concatenative speech synthesis systems use small
pieces of pre-recorded speech.
• There is a trend towards cross-project collaboration, synergy,
critical mass, and deployable & scalable technologies
• Trends in Digital Library Technologies
Multi-modal Input
Scanning, Smartizing (Value Addition), Content, Multi-lingual,
Multi-media
Standardization
Character Code, Font Code, Semantic Indexing, DOI, XML,
SCORM
Navigation
Browsing, Finding, Searching, Zooming, Hyperbolic Tree, Virtual
Reality, Aboutness, Searching Mathematics, Multilingual
Navigation, Translation Assistance.
Architecture
Interoperability, Multi-lingual Information Access, Metadata,
Resource Indexing & Discovery In Globally Distributed Digital
Library
IPR Issues
4Cs (Consortium For Compensation For Creative Content)
Knowledge Generation Capacity
Focus In 20th Century
Capitalistic & Monopolistic Trend In Publication & Dissemination.
Focus In 21st Century
Universalization Of Creativity.
Future Knowledge Networks
The Interspace represents the third wave in the ongoing evolution of the Global
Information Infrastructure, driven by rapid advances in computing and
Information Technology during .
The wave pattern roughly describes four distinct phases of functionality:
fundamental research (trough), development of prototype systems (ascent),
emergence of commercial systems (crest), and mass propagation (descent)
Scalable Semantics
Future knowledge networks will rely on scalable semantics, on
automatically indexing the community collections. The knowledge
networks of the Interspace will be connected via switching machines that
switch concepts. Connectivity and training continue to be the principal
barriers to integrating the global network of libraries.
Interspace focuses on scalable technologies for semantic indexing that work
generally across all subject domains. We can use concept spaces (collections
of abstract concepts generated from concrete objects) to boost
searches by interactively suggesting alternative terms. We can use category
maps to boost navigation by interactively browsing clusters of related
documents. Scalable semantics is used to index the semantics of document
contents on large collections. Concept spaces use text documents as the
objects and noun phrases as the concepts.
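One way to read the concept-space idea is as a co-occurrence table: concepts that often appear in the same documents are offered as alternative search terms. The sketch below assumes noun phrases have already been extracted from each document; the phrase lists are invented, and raw co-occurrence counts stand in for whatever statistical weighting the Interspace work actually uses.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_concept_space(docs):
    """Count how often each pair of noun phrases shares a document."""
    cooc = defaultdict(Counter)
    for phrases in docs:
        for a, b in combinations(sorted(set(phrases)), 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return cooc

def suggest(cooc, term, k=3):
    """Return up to k related concepts, most frequent first, ties alphabetical."""
    ranked = sorted(cooc.get(term, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]

# Hypothetical noun phrases, one list per document.
docs = [
    ["digital library", "metadata", "semantic indexing"],
    ["digital library", "metadata", "interoperability"],
    ["semantic indexing", "concept space"],
]
space = build_concept_space(docs)
alternatives = suggest(space, "digital library")
```

A search interface could show `alternatives` next to the query box so the user can widen or redirect the search without knowing the collection's vocabulary in advance.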
➢ Summing up the Challenges Ahead
• ML Open Source Software
- Shareable Software
- Standards database and updating
- Support service & Help line
- Consortium approach
- GPL with performance else Garbage In Garbage out
• Benchmarking & Standards
- testing against international standards
- active participation in evolving standards
• Information Technology Culture
- Awareness : IT Clinic, Workshops, media
- BIPK (Basic Information Processing Kit) with user-friendly,
easy-to-use, affordable, scalable, interoperable
and re-usable tools. BIPK may consist of a StarOffice-like
processing facility, fonts, KB driver, spell checker,
dictionary and conversion utility.
- Entrepreneurship: Gyanaudyog workshops.
.... Challenges Ahead
• Cross–lingual Information Access
- Search engine, Web Crawler, on-line machine translation.
• Localization
- Localization of software and content into local languages
- Enlarging share in localization outsourcing ( $ 8 Bn By 2006:IDC)
• International Collaboration in Language Informatics.
- Industry - academia cooperation in joint research & technology
development projects.
- Exchange of faculty and students
- HRD programs in knowledge Engineering & Computational
Linguistics
• Rise, Raise & Race
- Possess basic language technologies
- Promote Collectivistic Culture
- Think globally & act locally
- Collaborate for innovation
Digital Library is a means to meet the end :
Objective of Universalization of Creativity
Nothing is so pious as knowledge.
न हि ज्ञानेन सदृशं पवित्रमिह विद्यते।
(Bhagwadgita: 4.38)
शांति:
(Shaantih)