eVikings II
Establishment of the Virtual Centre of Excellence for IST RTD in Estonia
FP5 IST accompanying measures project IST-37592

Deliverable D3.3: Estonian Language Technology Roadmap

Institute of Cybernetics at Tallinn Technical University
University of Tartu
Institute of the Estonian Language
Archimedes Foundation

Edited by Einar Meister

Tallinn, October 2003
Updated February 2004

Table of contents

Introduction
Role of HLT in the information society
HLT in Estonia
    HLT research in Estonia: research groups and topics
    HLT market in Estonia
    Financing of HLT in Estonia
Estonian HLT Roadmap for 2004-2011
Conclusion
References

Introduction

Human Language Technology (HLT) is one of the key technologies in the Information Society. It deals with the application of knowledge about language to the development of computer systems which can recognise, understand, interpret, and generate human language in all its forms. HLT comprises two main components: a set of techniques and language resources.
The techniques include methods, algorithms and computer programs for automated document processing (spell checking, hyphenation, machine translation, etc.), optical character recognition (OCR), natural language understanding, speech synthesis and recognition, speaker identification and verification, etc. Language resources represent the knowledge of language in a computer-accessible form, including lexicons (electronic dictionaries and databases, thesauri), formal grammars (descriptions of the structure of a language at different levels), and corpora (text and speech).

Compiling the Estonian HLT Roadmap serves several aims, as it:
1. describes the current status of HLT research and development in Estonia;
2. lists available financial and human resources in the field of HLT;
3. presents the most important research and development tasks for the next decade;
4. indicates to industrial customers the potential of upcoming technological development and resulting applications;
5. forms a basis for decisions on funding R&D in the field;
6. enables benchmarking against other countries.

Role of HLT in the information society

In the information society a personal computer is a necessary instrument of daily life, both at work and at home. We need computers to process, maintain and exchange information, which in most cases exists in textual form. Most of the information available on the Internet is in English, as is most of the computer software in daily use. Over the past few decades, development efforts in human-computer interaction (HCI) have been directed towards natural communication using spoken language input and output. For several languages, especially the "big" ones, progress in HLT has been impressive: research results have been successfully exploited in real products and services, and the speech technology market is growing.
According to the latest Euromap report [1] on HLT progress in EU countries, the leading positions are held by the UK, Germany and France; perhaps surprisingly, the Netherlands and Finland also belong to the leading group. For the first three countries the result can be explained mainly by large market demand, whereas for the latter two the leading position results from several simultaneous factors: a healthy environment for technological development, a relatively large and strong research community, and significant national-level support for HLT.

What about languages with a small number of speakers, like Estonian? Will they survive in the information society? According to one view of the future, small languages will survive only through the development of human language technology that includes tools for both written and spoken language processing, and machine translation. Therefore, development of HLT for Estonian plays a crucial role in the future of the language.

HLT in Estonia

Estonian is a Finno-Ugric (non-Indo-European) language spoken by almost one million people. It is related to Finnish and Hungarian. About 90% of Estonians live in Estonia, while a considerable number of speakers are located in Canada, Sweden, the USA, and Russia. Estonian has the status of the state language of Estonia, where 67.3% of the inhabitants (ca. 921,000 persons) speak it as their mother tongue (source: Statistical Office of Estonia).

The situation of HLT in Estonia compares poorly with the leading EU countries in the field. Estonian has few native speakers, there are only a few small research groups in the field, resources for research and technological development are modest, and national-level support for HLT has been marginal. It should be stressed that HLT is to a great extent language-specific: methods or tools developed for one language do not always work with other languages.
For example, the state of the art in automatic speech recognition (ASR) is based on stochastic models that use hidden Markov model (HMM) techniques for word and language modelling. This approach has been successfully implemented for non-agglutinative languages, but according to experts it cannot easily be adapted to agglutinative languages, including Estonian. The problem lies in the enormous (almost infinite) number of different word forms in these languages. Several experts in the field have stressed the need for alternative approaches.

HLT research in Estonia: research groups and topics

There are only three groups in Estonia working in the field of HLT:

1. At the University of Tartu, HLT research is covered by the computer linguistics (CL) research group, which consists of employees and research fellows of the UT department of language technology and department of general linguistics. The CL research group was established in 1993. Its researchers are involved in both education and research. Computer linguistics has been an independent subject in the curriculum since 1997, and language technology was added to the curriculum in 2002.

The main research areas of the CL group are:
- Creating formal descriptions of morphology, syntax and semantics for the Estonian language;
- Creating a semantic database (thesaurus, Estonian WordNet);
- Creating Estonian language resources (electronic corpora of written and spoken language, dialogue corpora, parallel corpora, lexical databases);
- Developing software for morphological, syntactic and semantic analysis and synthesis.

Number of employees: 1 Ph.D. (Philology), 1 M.Sc. (Technical sciences), 1 M.Sc. (Physics, Mathematics), 1 Ph.D. (Linguistics), 11 Masters (6 M.A. + 5 M.Sc. in Linguistics, Estonian language and Informatics).

2. The Institute of the Estonian Language (EKI) was founded in 1947 (1947–1993 Institute of Language and Literature).
The main research areas cover:
- Research into the Estonian language and affiliated languages (grammar, lexicology, dialects);
- Compilation and editing of fundamental original dictionaries and source publications;
- Development of language regulation theories and regulation in practice;
- Development of language technology tools for the Estonian language.

The sector of computer linguistics in EKI was established in 1977. A virtual work group has been formed in the field of language technology. Its main areas of activity include:
- Language resources: electronic versions of traditional dictionaries, linguistic databases (names, terms, language assistance, compound words, etc.), text-based dictionaries, lexicons for machine translation, www-applications;
- Rule-based morphological systems: formal grammars and software (morphological synthesis and analysis, morphological disambiguation);
- Phonetics and speech technology: speech synthesis software and linguistic problems (modelling of speech prosody, relations between syntax and prosody) and databases (diphones).

Six people are employed in language technology, incl. three Ph.D.’s (two in philology and one in informatics) and two M.Sc.’s.

3. The Institute of Cybernetics has been active in the research of spoken language for more than 30 years. The Laboratory of Phonetics and Speech Technology was founded in 1990 and has two main research areas:
- phonetics of the Estonian language: experimental research of the Estonian sound system and prosody;
- speech technology: speech analysis and synthesis, speech and speaker recognition, speech databases.

The above-mentioned research fields are closely connected, in that the results of phonetic research are implemented in creating speech synthesis and recognition models.

Number of employees: six, incl. two Ph.D.’s (philology, general linguistics), one M.Sc. and one M.A.

HLT market in Estonia

There are very few HLT products available on the market.
An Estonian speller (developed by Filosoft) for Microsoft Office is the most widely used product. There is also some OCR software available for Estonian (by Nekstom). Other product categories include mainly electronic dictionaries and lexicons (on CD-ROM or accessible via the Internet) and CALL programs, mostly of foreign origin. Microsoft Office was (partly) localised and appeared on the market in 2001, followed by Windows XP in 2002. Several software packages in Estonian, developed mainly by local companies, are also available. According to the study on the usage of personal computers in Estonia [2], 70-80% of users would prefer computer software in Estonian.

Financing of HLT in Estonia

Research and development of HLT has been funded from different sources:
- support from the governmental budget for basic scientific research;
- grants from the Estonian Science Foundation;
- the national programme “Estonian Language and Cultural Heritage” (1999-2003);
- the Estonian Language Technology programme initiated by the Estonian Informatics Centre (1998-2000);
- the project “Language technology and the dictionaries of the Institute of the Estonian Language” (2002-2003) at the Ministry of Education and Research.

The total amount of funding for HLT was approximately 2 million Estonian crowns per year in the period 1998-2001, and about 4 million Estonian crowns per year in the period 2002-2003.

In 2004 a new national programme entitled “Estonian language and national memory” (2004-2008) was launched; it includes a sub-programme “Language technology” (funding for 2004: 1.9 million Estonian crowns). Governmental support for the Competence Centre for Estonian Language Technology, to be established in 2004 under the National Competence Centre Programme, will be 4.2 million Estonian crowns for 2004.
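The funding figures quoted above can be tallied in a few lines. The following is a minimal sketch and not part of the original roadmap: it assumes that each period includes its end year, that the "approximately" figures can be treated as round numbers, and that the two 2004 amounts are separate budget lines that may be added together. All names in the code are illustrative.

```python
# Illustrative tally of the HLT funding figures quoted in the text.
# Amounts are in thousands of Estonian crowns (EEK) so that the
# arithmetic stays in integers.
# Assumption: periods are inclusive, e.g. 1998-2001 covers four years.

ANNUAL_FUNDING = [
    # (first_year, last_year, approximate funding per year)
    (1998, 2001, 2_000),
    (2002, 2003, 4_000),
]

# Assumption: the two 2004 figures are independent budget lines.
FUNDING_2004 = {
    "Language technology sub-programme": 1_900,
    "Competence Centre support": 4_200,
}


def period_total(first_year, last_year, per_year):
    """Total funding over an inclusive range of years."""
    return (last_year - first_year + 1) * per_year


total_1998_2003 = sum(period_total(*period) for period in ANNUAL_FUNDING)
total_2004 = sum(FUNDING_2004.values())

print(f"1998-2003: about {total_1998_2003 / 1000:.1f} million EEK")
print(f"2004: {total_2004 / 1000:.1f} million EEK")
```

Under these assumptions the sketch gives roughly 16 million crowns for 1998-2003 as a whole and 6.1 million crowns for 2004.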
Estonian HLT Roadmap for 2004-2011

[Roadmap chart: milestones per year in three columns, Action Line 1: Spoken Language Technology, Action Line 2: Written Language Technology, Action Line 3: Language Resources]

2011: Advanced Spoken Dialogue System; Prototype for audio-visual TTS
2010: Speech recognition, 100,000 words; English<>Estonian translation system; Transfer from semantics to pragmatics; Database for audio-visual speech synthesis
2009: High quality TTS; Semantic analysis and disambiguation; Tree bank, 100,000 words
2008: Prosody model based on syntactic analysis; Database of emotional speech; Transfer from syntax to semantics; Morpho-syntactic language model for large vocabulary ASR; Thesaurus; Dialogue corpus of 1 million words
2007: Prototype of automated recognition of dialogue acts; Language-specific speech recognition engine; Prototype of automatic e-mail reading; English<>Estonian phraseology translation aid; Grammar checker; Estonian-English database; Lexico-semantic database; Thoroughly transcribed general corpus of spoken Estonian, 0.1 million words
2006: Advanced Estonian TTS; Analysis of compound phrases; Prototype of a simple spoken dialogue system; Deep syntactic analysis; Descriptions of dialogue acts; Morphologic analysis and disambiguation; Tree bank, 50,000 words; Lexico-grammatical database; Superficially transcribed general corpus of spoken Estonian, 0.1 million words; Dialogue corpus (0.5 million words); General corpus of spoken Estonian (1 million words)
2005: ASR with limited vocabulary, 1,000 words; Parallel corpus: 10 (Estonian) + 10 (English) million words; Dialogue corpus (100,000 words); Surface syntactic marking: 50,000 words
Resources and tools developed before 2004: Prototype of Estonian TTS; Morphologic analysis; Prototype for small vocabulary ASR; Spelling checker; Surface syntactic analysis; Formal syntax grammar of Estonian; Rule-based morphologic analysis and synthesis; General corpus of written Estonian (ca. 80 million words); Semantic database (Estonian WordNet, 15,000 word meanings); Disambiguated corpus of word meanings (100,000 textual words); Estonian-English parallel corpus (2 million words); Estonian BABEL Database; Estonian SpeechDat-like Database; Electronic dictionaries: Russian-Estonian, Finnish-Estonian, English-Estonian, etc.

The Roadmap shows the baseline (the resources and tools developed in Estonia during the years before 2004) and presents future developments in three major action lines:

Action Line 1: Spoken Language Technology, including:
- speech synthesis: creating Estonian TTS software for several applications and development of audio-visual synthesis;
- speech recognition: creating a limited-vocabulary speech recognition system prototype and development of unlimited-vocabulary speech recognition methods;
- dialogue systems: creating intelligent user application services capable of replacing routine human work.

Action Line 2: Written Language Technology, including:
- language processing methods: working out formalisms for automated processing of the different language levels (phonetics, morphology, syntax, semantics, pragmatics), modelling, and creating the corresponding prototypes;
- machine translation: creating methods for translating to and from Estonian, compiling multilingual vocabularies and mechanisms for transforming syntactic structures; developing a prototype for Estonian <-> English machine translation.

Action Line 3: Language Resources, including:
- developing an infrastructure for language resource creation and utilization;
- creating and annotating different types of language resources:
    o speech corpora: for development of speech synthesis and recognition, and dialogue systems
    o text corpora: for development of written language software modules and machine translation
    o electronic dictionaries: essential in almost every language technology software development activity.
Conclusion

The Roadmap is not as detailed and refined as the ELSNET Roadmap for Human Language Technologies [3], but it is specific enough to understand the development trends in Estonian HLT. The Roadmap is based on contributions from three research groups (Computer Linguistics at the University of Tartu, the HLT group at the Institute of the Estonian Language, and the Laboratory of Phonetics and Speech Technology at the Institute of Cybernetics) and is therefore focused mainly on research. Several areas, such as HLT standards and evaluation criteria, market applications, and software localization, have not been covered. The Roadmap has been used in defining research projects of the Competence Centre for Estonian Language Technology, to be established in 2004 under the National Competence Centre Programme.

References

1. A. Joscelyne, R. Lockwood. Benchmarking HLT progress in Europe. The EUROMAP Study. Copenhagen, 2003.
2. Usage of personal computers in Estonia. Report of the study by EMOR. Tallinn, September 2002.
3. The ELSNET Roadmap for Human Language Technologies. http://elsnet.dfki.de