Development of Linguistic Resources and Tools for Providing

सुस्वागतम्
Welcome
Technology Development for Indian Languages
Presentation at the LREC 2010 Conference, Malta
Swaran Lata & Somnath Chandra
Human Centred Computing Division
Department of Information Technology
slata@mit.gov.in, schandra@mit.gov.in
Presented by : Shyam S Agrawal
Executive Director KIIT, Advisor CDAC
TDIL
Technology Development for Indian Languages (TDIL) Programme
• Promotes Research & Development of Technology, Software Tools and Applications for Indian Languages
• Catalyzes proliferation of Language Technology products and solutions
• Promotes Standardization
Complexity for Language Technology Development
Complexity:
• A very challenging area for computer scientists because human languages are voluminous, informal and ambiguous.
• Involves interdisciplinary research: advanced computer processing using Artificial Intelligence and Machine Learning on the one hand, and linguistic knowledge for incorporating human communication techniques on the other.
• Still at the research stage in many areas, despite huge efforts by academia and scientists in India and abroad.
More intense challenge for Indian Languages:
• Large linguistic diversity, with 22 officially recognized languages and 12 scripts.
• One language, many scripts; many languages, one script.
• Each language and script has unique specifics that cannot be easily replicated.
• Differences in perceived usage among user groups, e.g. State Governments, academia and industry.
Official Indian Languages & Scripts
Sl. No.  Language    Script
1.       Hindi       Devanagari
2.       Sanskrit    Devanagari
3.       Marathi     Devanagari
4.       Konkani     Devanagari
5.       Nepali      Devanagari
6.       Maithili    Devanagari
7.       Sindhi      Devanagari
8.       Bodo        Devanagari
9.       Dogri       Devanagari, Sharda
10.      Bengali     Bengali
11.      Assamese    Bengali
12.      Manipuri    Bengali, Meitei (Mayak)
13.      Gujarati    Gujarati
14.      Kannada     Kannada
15.      Malayalam   Malayalam
16.      Oriya       Oriya
17.      Punjabi     Gurmukhi
18.      Tamil       Tamil
19.      Telugu      Telugu
20.      Urdu        Arabic
21.      Santhali    Ol-Chiki, Devanagari
22.      Kashmiri    Arabic, Sharda
Genesis of Language Technology Development in India: Early Initiatives
• Pioneering effort by DIT in collaboration with IIT Kanpur in 1983:
• The Department of Electronics (now DIT) entrusted a sponsored project to IIT Kanpur to build an integrated Devanagari terminal (GIST).
• A standalone system was developed, with a computer keyboard for inputting characters in Devanagari, a monitor for display, a dot-matrix printer for printing and a serial link for sending characters to another terminal.
• Developed the Indian Script Code for Information Interchange (ISCII standard).
• C-DAC Pune adopted GIST technology to develop products and licensed it to manufacturers.
• The Technology Development for Indian Languages (TDIL) Programme started in 1991 as a separate entity.
Phases of Language Technology Development
• Seeding Phase: 1991-1995
  • TDIL programme established in 1991
  • Some linguistic resources, such as corpora, developed
  • NLP training programme for computer scientists and linguists
  • Some stand-alone language learning tools developed
  • Exploratory work in the area of NLP
• Exploratory Phase: 1995-2000
  • Proof-of-concept Machine Translation systems developed for English to Indian Languages (Angla-Bharti) and Indian Languages to Indian Languages (Anusaraka)
  • Laboratory model of font-dependent Optical Character Recognition in Hindi
  • Text-to-Speech for Hindi
• Catch-up Phase: 2000-2004
The TDIL programme gathered momentum by establishing 13 Resource Centres for Indian Languages Technology Solutions (RCILTS) and 10 CoIL-Net Centres.
Resource Centres for Indian Languages Technology Solutions (RCILTS)
• The objective was to spread this activity to a large number of institutions across the country, each with a specific mandate for a language or a group of languages.
• Under this project, the centres have developed several important tools, linguistic resources and technologies for Indian language support.
• Many of these tools are now being modified and upgraded for release in the public domain under the National Roll-Out Project.
• Some of the important language technology tools and resources developed under the Resource Centres project are:
• Bi-lingual Dictionaries: between Indian languages, with over 30,000 words [Resource Centres]
• Spell-Checkers in Indian languages [Resource Centres]
• Ontology & Word-Net: Hindi Word-net of 9000 syn-sets with morphological analyzer and front end; Oriya Word-net of 1100 lexical entries with X-window interface.
• Proof-of-concept technologies for Optical Character Recognition (OCR) in other Indian languages
• Proof-of-concept TTS in other Indian languages
• In addition, several other tools, operating systems and resources have been developed under various sponsored projects. Some of the notable ones are:
• INDIX-2 (localized Linux in 12 Indian languages)
• Phrasal Dictionaries: in Tamil and Kannada [IIIT, Hyderabad]
• Online VishwaKosha: with 9162 topics [CDAC]
• Parallel Corpora: one million pages of parallel corpora in 11 languages [CDAC]
CoIL-Net Centres:
• The objective was to develop localized content in Hindi-speaking states to enhance IT proliferation.
• E-content of approximately 16,000 HTML and dynamic pages in the domains of health, education, tourism and agri-business has been developed, along with content on the eminent personalities, tourist places, classical works and cultural heritage of these regions.
• The developed content is uploaded on the internet at the website http://tdil.mit.gov.in.
• National Train Enquiry website (http://www.trainenquiry.com) localized in Hindi by CDAC.
• Product Development and Proliferation Phase: 2005 onwards
• A ‘Roadmap for Language Technology Development in India’ was evolved to formulate a short-term and long-term mission plan and strategy for the development of Language Technologies in India.
• The focus is to synergize development efforts and develop deployable products.
• A National Roll-Out Programme and six Mission Mode Projects have been initiated to facilitate speedy development and availability of Language Technologies.
Proliferation of Indian Language Technology Products: National Roll-Out Plan
Objectives of the initiative
To facilitate speedy development and availability of Language Technologies.
Broad contents of the CD
• Common user’s toolkit – Content Creation Tools, DTP, Office Automation, Code Converters
• Productivity tools – Spellchecker, Domain-based Dictionaries, Transliteration
• Power user – OCR, Text to Speech, MAT, etc.
Distribution channel for the CD
• Registered users of the www.ildc.in web site of TDIL, DIT – through the postal department
• IT magazines, publications, etc.
• Schools, Government departments, etc.
Software tools and fonts for 12 Indian languages, namely Hindi, Tamil, Telugu, Assamese, Kannada, Malayalam, Marathi, Oriya, Punjabi, Urdu, Gujarati and Sanskrit, have been released in the public domain.
CDs for 4 more Indian languages, namely Bodo, Dogri, Maithili and Nepali, are being released on Feb 21, 2009, UNESCO International Mother Language Day.
Software tools and fonts CD contents
1. Language TrueType Fonts with Keyboard Driver (more than 200)
   Supports INSCRIPT, Typewriter and Phonetic keyboard layouts. Allows content creation in Indian languages using applications running under Microsoft Windows.
2. Language Multi-font Keyboard Engine for TrueType Fonts
   Allows content creation in Indian languages, in a variety of font encodings, using applications running under Microsoft Windows.
3. Language Unicode-compliant OpenType Fonts (more than 200)
   Allows rendering of Indian language Unicode data.
4. Unicode-compliant Keyboard Driver
   Supports INSCRIPT, Typewriter and Phonetic keyboard layouts. Allows input of Unicode-compliant data.
5. Generic fonts and storage code converter
   Allows the user to convert existing data in different encodings to ISCII / UNICODE.
6. Localized version of Bharateeya OO (Office Suite)
   Consists of a word processor, presentation tool, spreadsheet and drawing tool.
7. Firefox browser
   Localized version of the Firefox browser.
Software tools and fonts CD contents (continued)
8. Colombo
   Email client for Windows and Linux operating systems. The user can send and receive emails in Indian languages. The menus are also in the local language.
9. GAIM
   Multiprotocol messenger. Enables the user to use various messenger clients for communication.
10. Optical Character Recognition
   With the help of OCR, one can scan printed text and convert it into editable form for further processing.
11. Typing Tutor
   Teaches the user to type in Indian languages.
12. Spellchecker
   Allows the end user to rectify spelling mistakes in a document.
13. Dictionaries
   English to Indian language dictionaries, and vice versa, in general, administrative and technical domains.
14. Transliteration Tool
   Transliterates a given Indian language text into Roman and vice versa. Useful for a user who is not familiar with the script.
15. Text to Speech system
   Reads the text aloud.
Product Development Efforts: Mission Mode Projects Phase-I
In consortium mode, 26 premier institutes and R&D organizations are working together on six projects to develop advanced technologies and applications.
• Development of English to Indian Languages Machine Translation (MT) System:
  • 10 institutions are participating to build a deployable MT system.
  • Consortium Leader: CDAC, Pune
  • Domains: Tourism and Health
  • Six language pairs: English to Hindi/ Marathi/ Bengali/ Oriya/ Tamil/ Urdu.
• Development of English to Indian Languages Machine Translation (MT) System with Angla-Bharti Technology:
  • 4 institutions are participating to build a deployable MT system.
  • Consortium Leader: IIT Kanpur
  • Domains: Tourism and Health
  • Six language pairs: English to Hindi/ Marathi/ Bengali/ Oriya/ Tamil/ Urdu.
• Development of Indian Language to Indian Language Machine Translation System:
  • 11 institutions are participating to build a deployable bi-directional MT system.
  • Consortium Leader: IIIT, Hyderabad
  • Domains: Tourism and Health
  • Nine language pairs: Tamil-Hindi, Telugu-Hindi, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Malayalam-Tamil
Development Efforts: Mission Mode Projects Phase-I (continued)
• Development of Robust Document Analysis & Recognition System for Indian Languages:
  • 11 institutions are participating to build an OCR system with improved accuracy and font- and point-size-independent recognition.
  • Consortium Leader: IIT, Delhi
  • 10 scripts: Bengali, Devanagari, Malayalam, Gujarati, Telugu, Tamil, Oriya, Tibetan/Nepali, Gurmukhi, Kannada
• Development of On-line Handwriting Recognition System:
  • Seven institutions are participating to build an on-line handwriting recognition system.
  • Consortium Leader: IISc, Bangalore
  • 6 scripts: Devanagari, Bengali, Tamil, Telugu, Kannada and Malayalam
• Development of Cross-lingual Information Access:
  • 11 institutions are participating to develop a portal where a user can give a query in one Indian language and access documents available in (a) the language of the query, (b) Hindi (if the query language is not Hindi) and (c) English.
  • Consortium Leader: IIT, Bombay
  • Domains: Tourism and Health
  • Six languages: Bengali, Hindi, Marathi, Punjabi, Tamil and Telugu.
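The retrieval rule described for the Cross-lingual Information Access portal can be sketched in a few lines. This is only an illustration of the (a)/(b)/(c) rule as stated on the slide, not the consortium's implementation; the function name `retrieval_languages` and the use of ISO 639-1 codes for the six project languages are assumptions made for the example.

```python
# Hypothetical sketch: ISO 639-1 codes for the six CLIA languages
# (Bengali, Hindi, Marathi, Punjabi, Tamil, Telugu).
SUPPORTED_QUERY_LANGUAGES = {"bn", "hi", "mr", "pa", "ta", "te"}


def retrieval_languages(query_lang: str) -> list:
    """Return the languages in which documents are searched for a query."""
    if query_lang not in SUPPORTED_QUERY_LANGUAGES:
        raise ValueError(f"unsupported query language: {query_lang}")
    languages = [query_lang]      # (a) the language of the query
    if query_lang != "hi":
        languages.append("hi")    # (b) Hindi, if the query is not in Hindi
    languages.append("en")        # (c) English
    return languages


print(retrieval_languages("ta"))
print(retrieval_languages("hi"))
```

A Tamil query is therefore searched against Tamil, Hindi and English documents, while a Hindi query is searched against Hindi and English only.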
Status of the readiness of the consortium mode projects
Sl No  Name of the product/system                                                        Domains  Version  Possible Date
1      English to Indian Languages Machine Translation System (E-IL)                     Tourism  α
2      English to Indian Languages Machine Translation System (E-IL) with Angla-Bharti   Tourism  α
       approach
3      Indian Language to Indian Language Machine Translation (IL-IL)                    Tourism  α        March 31, 2009
4      Cross-lingual Information Access (CLIA)                                           Tourism  α        March 31, 2009
5      Printed Text OCR                                                                  --       α        March 31, 2009
6      On-line Handwriting recognition system (OHWR)                                     --       α        March 31, 2009
Language Pairs: Marathi, Tamil and Bengali
Development Efforts: Speech Processing
Speech Corpora:
• Annotated speech corpora of approximately 50 hours developed for Hindi, Marathi, Punjabi, Bengali, Assamese and Manipuri. [CDAC]
• Speech corpora for Tamil, Malayalam, Telugu and Kannada under development. [CDAC]
Speech Recognition:
• A phonetic engine for speech recognition in Hindi and Telugu is being developed. [IIIT Hyderabad]
Text-to-Speech (TTS) and Automatic Speech Recognition in Indian Languages:
• A Consortium Mode project for development of a Text-to-Speech system for visually challenged persons in six Indian languages, namely Hindi, Tamil, Telugu, Marathi, Malayalam and Bengali, has been initiated.
• Development of Automatic Speech Processing in Indian languages is also being initiated.
Development Efforts (continued)
Sanskrit Computing:
A Consortium Mode project has been initiated for development of a Sanskrit computational toolkit and a Sanskrit-Hindi Machine Translation System. [Univ. of Hyderabad]
Corpora:
A Consortium Mode project is being initiated for development of annotated corpora in 11 Indian languages. The project will evolve the standards for natural language processing.
Development Efforts for North-Eastern Languages
• Consortium Mode projects to develop linguistic resources and basic information processing tools for the North-Eastern languages Assamese, Bodo, Manipuri and Nepali have been initiated. [C-DAC Pune]
• A Consortium Mode project has also been initiated for development of Word-net in North-Eastern languages. [IIT Bombay]
• Speech corpora and standardization of the International Phonetic Alphabet (IPA) for the Bodo language have been initiated. [Univ. of Guwahati]
Standardization
ISCII – Indian Script Code for Information Interchange
Since the 1970s, efforts were made to evolve different codes for characters and symbols of the 10 Brahmi-based Indian scripts, owing to their common phonetic structure.
These efforts culminated in the Indian standard for the Indian Script Code for Information Interchange (ISCII) in December 1991.
The ISCII code standard specifies a 7-bit code table which can be used in a 7- or 8-bit ISO-compatible environment. It allows English and Indian script alphabets to be used simultaneously.
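The common phonetic structure mentioned above is still visible in Unicode, whose Indic blocks were laid out following the ISCII ordering: a given letter sits at the same offset in each script's 128-code-point block. The sketch below uses that property to move text between scripts. It works on Unicode code points rather than ISCII bytes, the function name `shift_script` is hypothetical, and it is a deliberately naive illustration: not every offset is assigned in every script (Tamil, for instance, lacks the aspirated consonants), so it is not a real transliterator.

```python
# Start of each script's Unicode block; each block is 0x80 code points wide.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def shift_script(text: str, source: str, target: str) -> str:
    """Map each character of the source script to the same offset in the target block."""
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + BLOCK_SIZE:
            out.append(chr(cp - src + dst))
        else:
            out.append(ch)  # leave Latin letters, spaces, punctuation untouched
    return "".join(out)


# Devanagari KA (U+0915) maps to Bengali KA (U+0995) and Tamil KA (U+0B95).
print(shift_script("क", "devanagari", "bengali"))
print(shift_script("क", "devanagari", "tamil"))
```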
INSCRIPT Keyboard Layout
• Standardized by the Bureau of Indian Standards (IS 13194:1991).
• Keys are placed so that a user well versed in typing one language can type in another without effort.
• The layout is overlaid on the existing QWERTY keyboard.
• Language selection is done with the Caps Lock, Scroll Lock or Num Lock key.
• Since it is based on the phonetic nature of Indian languages, it is very easy to learn.
• Efforts have been initiated to incorporate the additional characters of the latest UNICODE 5.1 standard in a modified layout.
UNICODE
• Unicode was originally designed as a 16-bit encoding providing 65,536 code points; the code space has since been extended to 1,114,112 code points (U+0000 to U+10FFFF).
• Corresponds to ISO/IEC 10646-1 Universal Multiple-Octet Coded Character Set (UCS).
• In sync with other international standards such as those of W3C.
• The Unicode Standard assigns each character a unique numeric value and name.
• Encodes the characters used for the written languages of the world.
• Unicode is increasingly being accepted as a standard for information interchange worldwide, as most of the major IT companies have declared their support for it.
• The Department of Information Technology has been a voting member of the Unicode Consortium since 2000, to ensure adequate representation of Indic scripts in the Unicode Standard. http://tdil.mit.gov.in/pchangeuni.htm
• DIT finalized its proposed changes to the Unicode Standard, and the majority of them have been accepted and incorporated in Unicode Standard version 5.0.
• Initiatives have been taken to incorporate additional languages/scripts and additional characters and symbols of Vedic Sanskrit in UNICODE. http://tdil.mit.gov.in/prop_uni/Vedic.pdf
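The point that every character has a unique numeric value and name can be checked with Python's standard `unicodedata` module. The helper `describe` below is a hypothetical name introduced only for this illustration; the sample letters are the KA consonant in Devanagari, Bengali and Tamil.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return the code point (as U+XXXX) and the official Unicode name of a character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch))


for letter in "कকக":
    code_point, name = describe(letter)
    # Characters in these Indic blocks take three bytes each in UTF-8.
    print(code_point, name, len(letter.encode("utf-8")), "bytes in UTF-8")
```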
UNICODE: Examples
[Figure: examples of proposed changes to Unicode code charts. Legend: (1) proposed shape change of characters/symbols/signs in the existing standard; (2) change in the annotation/explanation of a particular code point; (3) proposed addition of characters/symbols/signs to the existing standard.]
W3C
• The project “Web Internationalization Initiative” (WII) has been initiated with the objective of adequate representation of Indic scripts in the Web Technology Standards being evolved by the World Wide Web Consortium (W3C).
• An initiative has been taken to incorporate the key findings of the WII project in the W3C standards/guidelines.
W3C: A large amount of work remains to be done
• In Phase-I of the WII project, only a little exploratory work has been carried out.
• W3C Internationalization has 115 recommendations covering web internationalization, XML, Cascading Style Sheets and Speech Synthesis.
• These recommendations need to be carefully studied from the Indian language perspective, and specific recommendations need to be projected to W3C.
• A few of specific interest are Internationalization Tag Set (ITS) Version 1.0, Voice Extensible Markup Language (VoiceXML) 2.1, Web Content Accessibility Guidelines 1.0, Cascading Style Sheets (CSS1) Level 1 Specification and Speech Synthesis Markup Language 1.0.
• Consultation is needed with all stakeholders, such as academia, industry and the various state governments.
• Industry and web service providers need to be sensitized to adopt W3C standards.
The W3C India Office has been established at DIT under the aegis of the TDIL programme.
International Phonetic Alphabet (IPA)
• Since present-day speech mark-up languages such as the W3C Speech Synthesis Markup Language (SSML) require a phonetic representation of symbols, standardization of IPA symbols is necessary.
• India being a multilingual country, a standardized phonetic alphabet has to be developed for the scientific study of phonetics and for SSML in Indian languages.
• IPA standardization for all Indian languages, and its acceptance by the International Phonetic Association, is thus required for the development of speech technology and associated products.
• Efforts have been initiated to standardize IPA symbols for Indian languages.
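To show where IPA symbols enter SSML, the sketch below builds a minimal SSML 1.0 document using the `phoneme` element with `alphabet="ipa"`. The function name `ssml_with_ipa` is hypothetical, and the IPA string for the Hindi word is an illustrative transcription, not a standardized one; agreeing on such transcriptions is exactly what the standardization effort above is about.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def ssml_with_ipa(text: str, ipa: str, lang: str) -> str:
    """Wrap a word in an SSML <phoneme> element that carries its IPA pronunciation."""
    speak = ET.Element(
        "speak",
        {"version": "1.0", "xmlns": SSML_NS, "xml:lang": lang},
    )
    phoneme = ET.SubElement(speak, "phoneme", {"alphabet": "ipa", "ph": ipa})
    phoneme.text = text
    return ET.tostring(speak, encoding="unicode")


# Illustrative (assumed) IPA transcription of the Hindi greeting.
print(ssml_with_ipa("नमस्ते", "nəməsteː", "hi"))
```

A synthesizer that supports the IPA alphabet would use the `ph` attribute for pronunciation and fall back to the element text otherwise.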
Common Locale Data Repository
• The Common Locale Data Repository (CLDR) is an initiative of the UNICODE Consortium to develop locale data for the world's languages.
• The Unicode CLDR provides key building blocks for software to support the world's languages. CLDR is by far the largest and most extensive standard repository of locale data.
• This data is used by a wide spectrum of companies for their software internationalization and localization.
• The Department of Information Technology has already become TC (Team Coordinator) to incorporate / modify Indian languages in CLDR.
• Modification / development of Common Locale Data Repository data for Indian languages has been initiated in consultation with state governments and other stakeholders. CLDR data for 6 Indian languages has been incorporated in UNICODE CLDR.






Language Tags
• Language tags are used in most multilingual applications, such as web development, multilingual internet data exchange, language negotiation and web services.
• The nomenclature of language tags is standardized under the ISO 639 standard.
• The language tag standards ISO 639-x (x stands for the different parts of the standard) are used in many other international standards and best practices, such as IETF (Internet Engineering Task Force) RFC 4646 and RFC 4647 and W3C web standards. They are also related to ISO 3166 (region codes) and ISO 15924 (script codes).
• The present forms of ISO 639-2 and ISO 639-3, and the forthcoming ISO 639-5 and ISO 639-6, have many ambiguous entries for Indian languages, which need to be corrected urgently to prevent propagation of incorrect nomenclature for language sets.
• Modification / addition of language tags for Indian languages has been taken up in consultation with the state governments and all stakeholders.
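How the three standards combine in a tag can be seen by splitting a tag into its ISO 639 language, ISO 15924 script and ISO 3166 region subtags. The sketch below handles only this simple language-script-region subset of RFC 4646 (no variants, extensions or private-use subtags); the function name `parse_tag` is hypothetical. Script subtags matter for Indian languages written in more than one script, e.g. `sat-Olck` versus `sat-Deva` for Santhali.

```python
import re

# language (2-3 letters), optional script (4 letters), optional region (2 letters or 3 digits)
TAG_PATTERN = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|\d{3}))?$"
)


def parse_tag(tag: str) -> dict:
    """Split a simple language tag into language, script and region subtags."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"not a simple language-script-region tag: {tag!r}")
    script, region = match["script"], match["region"]
    return {
        "language": match["language"].lower(),          # ISO 639
        "script": script.title() if script else None,   # ISO 15924
        "region": region.upper() if region else None,   # ISO 3166
    }


print(parse_tag("pa-Guru-IN"))  # Punjabi in Gurmukhi script, India
print(parse_tag("sat-Olck"))    # Santhali in Ol Chiki script
print(parse_tag("hi-IN"))       # Hindi, India, script left implicit
```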
Information Dissemination
• TDIL portal: http://tdil.mit.gov.in
• ILDC portal: http://ildc.gov.in, http://ildc.in
On the ILDC portal, a user can:
• Request a Language Tools CD
• Register on the ILDC website
• Provide feedback and access the FAQ (Frequently Asked Questions)
• Download free software for Indian language tools
TDIL half-yearly journal: VishwaBharat@tdil; 16 issues published, accessible through the TDIL website.
Future Activities:
All the on-going consortium mode projects, the National Roll-Out Plan project and the Specialized Manpower Development in Language Technology project would be continued.
Phase-II of Consortium Mode Projects: Phase-II would cover the areas of Machine Translation, Cross-lingual Information Access, Optical Character Recognition and OHWR. The systems developed in Phase-I would be improved and extended to other domains.
Technology Development:
(a) Speech Technology: Development of Automatic Speech Processing engines to be initiated for major Indian languages.
(b) Basic Research: Basic research in the area of semantic Web technology would be initiated.
Web Internationalization Initiative:
Phase-II of the project “Web Internationalization Initiative (WII)” is to be initiated with the objective of adequate representation of Indic scripts in the Web Technology Standards being evolved by the World Wide Web Consortium (W3C).
Future Activities (continued):
Establishment of Data Centre at TDIL
Setting up of an Indian Language Data Centre and a Language Technology Demonstration Facility at DIT, upgrading of the ILDC and TDIL websites, and upgrading of the Language CDs and their distribution support would be undertaken.
Web Internationalization Initiative:
Phase-II of the Web Internationalization Initiative (WII) programme would be initiated to finalize Indian language specific inputs / recommendations for W3C web technology standards.
National Localization Research and Resource Centre (NLRRC)
Seeding activity for the National Localization Research and Resource Centre would be initiated.
Government, Academia and Industry together, to play globally and to serve locally, for making India a Global Multilingual Computing Hub
धन्यवाद
ਧਨ੍ਯਵਾਦ
ધન્યવાદ
Thank You