HLT in Estonia

eVikings II
Establishment of the
Virtual Centre of Excellence for IST RTD
in Estonia
FP5 IST accompanying measures project IST-37592
Deliverable D3.3:
Estonian Language Technology Roadmap
Institute of Cybernetics at Tallinn Technical University
University of Tartu
Institute of the Estonian Language
Archimedes Foundation
Edited by Einar Meister
Tallinn
October 2003
Updated February 2004
Table of contents
Introduction
Role of HLT in the information society
HLT in Estonia
  HLT research in Estonia: research groups and topics
  HLT market in Estonia
  Financing of HLT in Estonia
Estonian HLT Roadmap for 2004-2011
Conclusion
References
Introduction
Human Language Technology (HLT) is one of the key technologies in the Information
Society. It deals with the application of knowledge about language to the development of
computer systems which can recognise, understand, interpret, and generate human language
in all its forms.
HLT comprises two main components: a set of techniques and language resources.
The techniques include methods, algorithms and computer programs for:

• automated document processing (speller, hyphenation, machine translation, etc.),
• optical character recognition (OCR),
• natural language understanding,
• speech synthesis and recognition,
• speaker identification and verification, etc.
Language resources represent the knowledge of language in a computer-accessible form,
including:

• lexicons (electronic dictionaries and databases, thesaurus),
• formal grammars (descriptions of the structure of a language at different levels),
• corpora (text and speech).
Compiling the Estonian HLT Roadmap serves several aims, as it:
1. describes the current status of HLT research and development in Estonia;
2. lists available financial and human resources in the field of HLT;
3. presents the most important research and development tasks for the next decade;
4. indicates to industrial customers the potential of upcoming technological development
and resulting applications;
5. forms a basis for decisions on funding R&D in the field;
6. enables benchmarking against other countries.
Role of HLT in the information society
In the information society a personal computer is a necessary instrument of our daily life both
at work and at home. We need computers to process, maintain and exchange information,
which in most cases exists in textual form. Most of the information available on the Internet
is in English, as is most of the computer software in daily use.
The development efforts of human-computer interaction (HCI) during the past few decades
have been directed towards natural communication using spoken language input and output.
For several languages, especially the "big" ones, progress in HLT has been impressive: research
results have been successfully exploited in real products and services, and the speech
technology market is growing. According to the latest Euromap report [1] on HLT
progress in EU countries, the leading positions are held by the UK, Germany and France;
surprisingly enough, the Netherlands and Finland also belong to the leading group. In the case
of the first three countries the result can be explained mainly by large market demand,
whereas in the latter two the leading position is due to several simultaneous factors:
a healthy environment for technological development, a relatively large and strong research
community, and significant national-level support for HLT.
What about languages with a small number of speakers, like Estonian? Will they survive in
the information society?
According to one forward-looking view, small languages will survive only through the
development of human language technology that includes tools for both written and
spoken language processing, and machine translation.
Therefore, development of HLT for Estonian plays a crucial role in the future of the language.
HLT in Estonia
Estonian is a Finno-Ugric language (non-Indo-European) spoken by almost one million
people. Estonian is related to Finnish and Hungarian. About 90% of Estonians live in Estonia
while a considerable number of speakers are located in Canada, Sweden, the USA, and
Russia. The Estonian language has the status of the state language of Estonia where 67.3% of
the inhabitants (ca. 921,000 persons) speak it as their mother tongue (source: Statistical Office
of Estonia).
The situation of HLT in Estonia compares poorly with the leading EU countries in the field.
Estonia has few native speakers, there are only a few small research groups in the field,
resources for research and technological development are modest, and national-level support
for HLT has been marginal.
It should be stressed that HLT is to a great extent language-specific: methods or
tools developed for one language do not always work with other languages. For example, the
state of the art in automatic speech recognition (ASR) is based on stochastic models that use
hidden Markov model (HMM) techniques for word and language modelling. This approach has been
successfully implemented for non-agglutinative languages, but according to experts it cannot
easily be adapted to agglutinative languages, including Estonian. The problem lies in the
enormous (almost infinite) number of different word forms in these languages. Several experts
in the field have stressed the need for alternative approaches.
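The growth in the number of word forms can be illustrated with a minimal sketch. The slot inventory below is a toy assumption made for illustration only: Estonian does distinguish 14 cases and two numbers, but the clitic slot is simplified, and the stem count and the compounding rule are hypothetical. It is not a description of any recogniser discussed in this report.

```python
from math import prod

# Toy slot inventory for a nominal word form (illustrative assumption):
# each combination of one choice per slot counts as a distinct surface form.
SLOTS = {
    "number": 2,   # singular, plural
    "case": 14,    # Estonian distinguishes 14 cases
    "clitic": 2,   # with or without an emphatic clitic (simplified)
}


def forms_per_stem(slots):
    """Number of inflected forms a single stem can take."""
    return prod(slots.values())


def vocabulary_size(n_stems, slots, max_compound_parts=1):
    """Count distinct word forms when up to `max_compound_parts` stems
    may be joined into a compound and only the last part is inflected
    (a hypothetical simplification of compounding)."""
    per_stem = forms_per_stem(slots)
    # There are n_stems ** k ways to pick the k parts of a compound.
    return sum(n_stems ** k * per_stem
               for k in range(1, max_compound_parts + 1))


if __name__ == "__main__":
    stems = 10_000  # hypothetical size of the stem lexicon
    print("forms per stem:", forms_per_stem(SLOTS))
    print("no compounding:", vocabulary_size(stems, SLOTS))
    print("two-part compounds allowed:", vocabulary_size(stems, SLOTS, 2))
```

Under these toy assumptions a full-form word list grows much faster than the stem lexicon itself, which is one way to read the experts' concern about word-based language models for agglutinative languages.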
HLT research in Estonia: research groups and topics
There are only three groups in Estonia working in the field of HLT:
1. At the University of Tartu the HLT research area is covered by the research group of
computer linguistics (CL) which consists of the employees and research fellows of the UT
department of language technology and department of general linguistics.
The CL research group was established in 1993. The researchers in the CL group are
involved in education and in research. Computer linguistics has been an independent
subject in the curriculum since 1997 and language technology was added to the
curriculum in 2002. The main research areas of the CL group are:
• Creating formal descriptions of morphology, syntax and semantics for the Estonian
language;
• Creating a semantic database (thesaurus, Estonian WordNet);
• Creating Estonian language resources (electronic corpora of written and spoken
language, dialogue corpora, parallel corpora, lexical databases);
• Developing software for morphologic, syntactic and semantic analysis and
synthesis.
Number of employees: 1 Ph.D. (Philology), 1 M.Sc. (Technical sciences), 1 M.Sc.
(Physics, Mathematics), 1 Ph.D. (Linguistics), 11 Masters (6 M.A. + 5 M.Sc. in
Linguistics, Estonian language and Informatics).
2. The Institute of the Estonian Language (EKI) was founded in 1947 (in 1947–1993 it was
called the Institute of Language and Literature). The main research areas cover:
• Research into the Estonian language and affiliated languages (grammar,
lexicology, dialects);
• Compilation and editing of fundamental original dictionaries and source
publications;
• Development of language regulation theories and regulation in practice;
• Development of language technology tools for the Estonian language.
The sector of computer linguistics in EKI was established in 1977. A virtual work group
has been formed in the field of language technology. Its main areas of activity include:
• Language resources: electronic versions of traditional dictionaries, linguistic
databases (names, terms, language assistance, compound words, etc.), text-based
dictionaries, lexicons for machine translation, www-applications;
• Rule-based morphological systems: formal grammars and software (morphological
synthesis and analysis, morphological disambiguation);
• Phonetics and speech technology: speech synthesis software and linguistic
problems (modelling of speech prosody, relations between syntax and prosody)
and databases (diphones).
Six people are employed in language technology, incl. three Ph.D.’s (two in philology and
one in informatics) and two M.Sc.’s.
3. The Institute of Cybernetics has been active in the research of spoken language for more
than 30 years. The Laboratory of Phonetics and Speech Technology was founded in 1990
and has two main research areas:
• phonetics of the Estonian language: experimental research of the Estonian sound
system and prosody;
• speech technology: speech analysis and synthesis, speech and speaker
recognition, speech databases.
The above-mentioned research fields are closely connected in that the results of phonetic
research are implemented in creating speech synthesis and recognition models.
Number of employees: six, incl. two Ph.D.’s (philology, general linguistics), one M.Sc.
and one M.A.
HLT market in Estonia
There are very few HLT products available on the market. An Estonian speller (developed by
Filosoft) for Microsoft Office is the most widely used product. There is also some OCR
software available for Estonian (by Nekstom). Other product categories include mainly
electronic dictionaries and lexicons (on CD-ROM or accessible via the Internet) and
computer-assisted language learning (CALL) programs, mostly of foreign origin.
Microsoft Office (in 2001) and Windows XP (in 2002) were partly localised and released on
the market. Several software packages in Estonian, developed mainly by local companies,
are also available.
According to the study on the usage of personal computers in Estonia [2], 70-80% of users
would prefer computer software in Estonian.
Financing of HLT in Estonia
Research and development of HLT has been funded from different sources:
• support from the governmental budget for basic scientific research,
• grants from the Estonian Science Foundation,
• the national programme “Estonian Language and Cultural Heritage” (1999-2003),
• the Estonian Language Technology programme initiated by the Estonian Informatics
Centre (1998-2000),
• the project “Language technology and the dictionaries of the Institute of the Estonian
Language” (2002-2003) at the Ministry of Education and Research.
The total amount of funding for HLT has been approximately 2 million Estonian crowns per
year in the period of 1998-2001, and about 4 million Estonian crowns per year in the period of
2002-2003.
In 2004 a new national programme entitled “Estonian language and national memory”
(2004-2008) was launched, which also includes a sub-programme “Language technology”
(funding for 2004: 1.9 million Estonian crowns).
The governmental support to the Competence Centre for Estonian Language Technology, to
be established in 2004 under the National Competence Centre Programme, will be 4.2 million
Estonian crowns for 2004.
Estonian HLT Roadmap for 2004-2011
Action Line 1: Spoken Language Technology
Action Line 2: Written Language Technology
Action Line 3: Language Resources

Milestones by target year, as recovered from the roadmap chart:

2011
• Advanced Spoken Dialogue System
• Prototype for audio-visual TTS

2010
• Speech recognition, 100,000 words
• English<>Estonian translation system
• Transfer from semantics to pragmatics
• Database for audio-visual speech synthesis

2009
• High quality TTS
• Semantic analysis and disambiguation
• Tree bank, 100,000 words

2008
• Prosody model based on syntactic analysis
• Database of emotional speech
• Transfer from syntax to semantics
• Morpho-syntactic language model for large vocabulary ASR
• Thesaurus
• Dialogue corpus of 1 million words

2007
• Prototype of automated recognition of dialogue acts
• Language-specific speech recognition engine
• Prototype of automatic e-mail reading
• English<>Estonian phraseology translation aid
• Grammar checker
• Estonian-English database
• Lexico-semantic database
• Thoroughly transcribed general corpus of spoken Estonian (0.1 million words)

2006
• Advanced Estonian TTS
• Analysis of compound phrases
• Prototype of a simple spoken dialogue system
• Deep syntactic analysis
• Descriptions of dialogue acts
• Morphologic analysis and disambiguation
• Tree bank, 50,000 words
• Lexico-grammatical database
• Superficially transcribed general corpus of spoken Estonian (0.1 million words)
• Dialogue corpus (0.5 million words)
• General corpus of spoken Estonian (1 million words)

2005
• ASR with limited vocabulary, 1,000 words
• Parallel corpus: 10 (Estonian) + 10 (English) million words
• Dialogue corpus (100,000 words)
• Surface syntactic marking: 50,000 words

2004 (resources and tools developed before 2004)
• Prototype of Estonian TTS
• Morphologic analysis
• Prototype for small vocabulary ASR
• Spelling checker
• Surface syntactic analysis
• Formal syntax grammar of Estonian
• Rule-based morphologic analysis and synthesis
• General corpus of written Estonian (ca 80 million words)
• Semantic database (Estonian WordNet, 15,000 word meanings)
• Disambiguated corpus of word meanings (100,000 textual words)
• Estonian-English parallel corpus (2 million words)
• Estonian BABEL Database
• Estonian SpeechDat-like Database
• Electronic dictionaries: Russian-Estonian, Finnish-Estonian, English-Estonian, etc.
The Roadmap shows the baseline (the resources and tools developed in Estonia in the
years before 2004) and presents future developments in three major action lines:
Action Line 1: Spoken Language Technology including:

• speech synthesis: creating Estonian TTS software for several applications and
development of audio-visual synthesis;
• speech recognition: creating a limited-vocabulary speech recognition system
prototype and development of unlimited-vocabulary speech recognition methods;
• dialogue systems: creating intelligent user application services capable of
replacing routine human work.
Action Line 2: Written Language Technology including:

• language processing methods: developing formalisms for automated
processing at the different language levels (phonetics, morphology, syntax,
semantics, pragmatics), modelling and creating the corresponding prototypes;
• machine translation: creating methods for translating to and from Estonian,
compiling multilingual vocabularies and mechanisms for transforming syntactic
structures; developing a prototype for Estonian <-> English machine translation.
Action Line 3: Language Resources including:

• developing an infrastructure for language resource creation and utilization;
• creating and annotating different types of language resources:
  o speech corpora: for development of speech synthesis and recognition, and
    dialogue systems
  o text corpora: for development of written language software modules and
    machine translation
  o electronic dictionaries: essential in almost every language technology
    software development activity.
Conclusion
The Roadmap is not as detailed and refined as the ELSNET Roadmap for Human Language
Technologies [3], but it is specific enough to understand the development trends in Estonian
HLT.
The Roadmap is based on contributions from three research groups (Computer Linguistics at
the University of Tartu, the HLT-group at the Institute of the Estonian Language, and the
Laboratory of Phonetics and Speech Technology at the Institute of Cybernetics) and therefore
is focused mainly on research. Several areas, like HLT standards and evaluation criteria,
market applications, software localization, etc. have not been covered.
The Roadmap has been used in defining research projects of the Competence Centre for
Estonian Language Technology to be established in 2004 under the National Competence
Centre Programme.
References
1. A. Joscelyne, R. Lockwood, Benchmarking HLT progress in Europe. The EUROMAP
Study. Copenhagen 2003.
2. Usage of personal computers in Estonia. Report of the study by EMOR. Tallinn,
September 2002.
3. The ELSNET Roadmap for Human Language Technologies. http://elsnet.dfki.de