Where do we stand? MT development, research, and deployment in Asia
Key-Sun Choi (KAIST)
AAMT
http://www.asianlp.org/
http://www.afnlp.org/
http://korterm.org/

Contents
- China
- Japan
- India
- Malaysia
- Thailand
- Taiwan
- Korea
- UNL
- Associations related to MT

MT in China – 1980-1990's
- Goal: to translate scientific documents from Russian and Western countries' languages
- Supported by government
- No private companies in the early stage
- TRANS-STAR: 30,000 words/hour on a 386 PC
  - Basic dictionary: 40,000 entries
  - 10 specialized technical dictionaries with 350,000 entries
  - Subject fields: computer, economics, telecommunication, ceramics, thermal power industry, printing machine industry, automobile/tractor industry, petroleum prospecting, geology, chemical industry

MT in China – Present: English-to-Chinese
- GAOLI: developed jointly by Beijing GAOLI Computer Co. Ltd. and the Linguistics Institute of CASS
  - Basic lexical dictionary: 60,000 entries, in which the usage and grammatical function of every word are described in detail
  - Translation accuracy: 80%
  - Readability of translated text: 80%-90%
- 863-IMT/EC: by the Institute of Computer Technology, Academia Sinica
  - Commercialized, with very good economic benefits

MT in China – Present: Chinese-to-English
- SINO-TRANS: by CS&S (China National Software & Technology Service Co.) in 1993
  - Basic dictionary: 40,000 entries
  - Two special subject technical dictionaries: naval ships and boats (9,312 entries), rocket-gun (33,773 entries)
  - Linguistic rules: 1,000 rules

MT in China – Present: English-to-Chinese + terminology
- TONGYI system: by the Tianjin DATONG computer software company
  - WINDOWS platform
  - Different special subject dictionaries:
    a. Commonly used scientific terms: 200,000 entries
    b. Terms covering 22 different subjects (e.g. machine building, telecommunication, aviation, medicine, etc.): 3,000,000 entries
  - Good market strategy and service
  - Cooperation with enterprises

MT in China – Present: English-to-Chinese + internet browsing + more user interface
- YIWANG: by SUNSHINE company of Shenzhen
  - Highest translation speed: 100 sentences per second
  - Internet browsing
- YIBA: by YAXINCHENG software technical company
  - Three kinds of translation: online, automatic, interface
  - Open to users, who can revise the dictionary and rules
  - Rich special subject dictionaries: 30 subjects (e.g. computer, telecommunication, medicine)

MT in China – Present: English-to-Japanese
- E-to-J: by JEC company in Beijing
  - Technique: transformation from phrase tree (P-tree) to dependency tree (D-tree)
  - Closely integrated with a word processor

MT in China – Present: Example-based MT (experimental systems)
- Japanese-Chinese EBMT: computer department of Qinghua University, 1996
  - Corpus of aligned Japanese and Chinese sentences
  - The example unit is the sentence
  - The similarity rate is calculated on the basis of words
- DAYA EBMT: Harbin Polytechnic University
  - Machine-aided translation system, in which the human factor is very important
  - Corpus is aligned at sentence level

MT in China: Government Funding (1990's)
- Hi-Tech 863 funding:
  - 863-IMT/EC system (English-Chinese)
  - SUNSHINE YIWANG system
- 905 Chinese Project: Language Processing, completed in 1998

MT in China: User's English Level
- Proportion of users of the TONGYI MT software by English level:
  - Higher level: 16.5%
  - Middle level: 49.5%
  - Lower level: 34.1%
- So MT software must be oriented to ordinary users

MT in China: Potential Users
- Proportion of enterprise users of the TONGYI MT software:
  - Small enterprises: 31.3%
  - Medium-scale and large-scale enterprises: 68.7%
- So MT software must be oriented to large-scale and medium-scale enterprises, but small enterprises, which also have translation demand, should not be ignored.

MT in China: Regional Distribution
- Regional distribution of MT software users: translation demand is concentrated in the big cities and developing regions.
  - Beijing: 18.7%
  - Liaoning: 7.9%, Jiangsu: 7.5%
  - Zhejiang: 6.5%, Hubei: 6.5%, Shanghai: 6.1%
  - Sichuan: 4.7%, Guangdong: 4.7%
  - Henan: 3.3%, Heilongjiang: 3.3%
  - Hebei: 2.8%, Shanxi: 2.3%, Jilin: 2.3%
  - Yunnan: 1.9%, Neimeng (Inner Mongolia): 1.5%, Gansu: 1.4%
  - Guizhou: 0.5%, Anhui: 0.5%

MT in China - Future and Strategies (1): Terminology Data Bank
- MT software combined with a terminology data bank
- 1990: the sub-committee on computer-aided terminology of China was set up. This sub-committee is attached to the State Language Commission (SLC) of China.
- A series of national standards for terminology data banks
- Terminology data bank creation
  - Chinese-English: since 1995, by ISTIC (Institute of Scientific and Technical Information of China)
  - Remarkable databanks…

MT in China - Future and Strategies (2): Language Corpus Processing
- Corpus construction: on the scale of 25 million Chinese characters (1999)
- Automatic segmentation of Chinese written text in the corpus (97.68%, closed test)
- Automatic phrase bracketing and syntactic annotation for the Chinese corpus

MT in China - Future and Strategies (3): speech-to-speech translation
- Chinese speech into Chinese text
- The "SIDA-863A" system can recognize 398 basic Chinese syllables; the recognition rate reaches 93%, the response time is less than 0.1 second, and the input speed reaches 80 Chinese characters per minute

MT in China - Future and Strategies (4): combined with OCR and Internet
- Internet MT: SUNSHINE YIWANG, YAXIN YIBA, TONGYI, etc.
- The advantages of MT software on the Internet are:
  - Higher translation speed, real-time translation
  - Low price
  - Large machine dictionary
  - Possibility of adding new words

MT in China: New National Project
- 973 project: from 2001, supported by the Chinese government
- For creative research in natural language processing, including machine translation
- Automatic speech-to-speech translation system (English-Chinese) under development at the Institute of Automation of Academia Sinica

MT in China – Survey Source
- Prof. Feng, Zhiwei: Secretary-general and deputy chairman of the sub-committee on computer-aided terminology of China under the State Language Commission (SLC) of China. Invited professor, KAIST (Sep/2001 – Aug/2002)
- Dr. Liu, Qun: Institute of Computer Technology, Academia Sinica, Beijing

MT in Japan - 1
- More than 10 companies
- For English, Chinese, Korean
- Waiting for the new breakthrough
  - Internet
  - eLearning
  - Co-work with special-domain related companies
  - Technology transfer
- Collaboration tools are ready for the market
  - For a translators' collaboration workbench through the network
- User interface: well organized

MT in Japan - 2: Leading Systems
- Cross-lingual patent retrieval
- Prime
- NTT/ALT: Japanese-to-English, Japanese-to-Malay, Japanese-to-Chinese
- Speech translation: ATR (C-Star)

UNL in UN University
- Through Universal Networking Language
- In the Asian region: Hindi, Japanese, Persian, Indonesian/Malay, Thai, Chinese, Mongolian, Korean
- Other regions: major European languages and English
- Possible users: ITU mail translation

MT in Malaysia
- No commercial product yet, but there is work in academic sectors
- For application to:
  - Internet
  - eLearning
  - eCommerce
- Universiti Sains Malaysia, Computer Aided Translation Unit
  - Prof. Tang Enya Kong and Prof. Yusoff Zaharin

MT in India
- 18 constitutional languages with 10 different scripts:
  - Their script grammars and language grammars are quite similar
  - They have 40 to 80 percent of their vocabularies in common
- Fewer than 5 percent of people can work in English

MT in India: 1990-2001, government effort for IT
- TDIL (Technology Development of Indian Languages):
  - 1990-1991: development of corpora, OCR, text-to-speech, machine translation; standards for keyboard and internal code for information interchange
  - 2000-2001: seven major initiatives: Knowledge Resources, Knowledge Tools, Translation Support Systems, Human Machine Interface Systems, Localisation, Standardization, and Language Technology Human Resource Development
- Thirteen Resource Centres for Indian Language Technology Solutions (RC-ILTS) were supported, covering all 18 Indian languages.

MT in India: Future (Digital Unite and Knowledge for All)
- Indian Language Technology Vision 2010 has been prepared with the vision statement "Digital Unite and Knowledge for All".
- With the growing popularity of the Internet, content creation, localisation, on-line gisting and summarisation, e-learning, and Cross-Lingual Information Retrieval are being promoted to ensure information access in cyberspace in Indian languages.
- Source: Dr. Om Vikas, Senior Director and Head, Computer Development Division, Ministry of Information Technology

MT in Thailand: Government
- 1996: IT-2000
  - To build a national information infrastructure (NII)
  - To invest in people, concentrating on transferring IT knowledge to their children
  - To build a Government Information Network (GINET)
- Internet users in Thailand (2000): 2.3M out of 66M

  Age      Freq   Percent
  <10        18       0.7
  10-14     124       5.0
  15-19     261      10.6
  20-29   1,238      50.3
  30-39     572      23.2
  40-49     187       7.6
  50-59      32       1.3
  60-69      27       1.1
  70+         2       0.1
  Total   2,461     100.0

- Most Thai Internet users know English and other Internet languages at a basic or low intermediate level

MT in Thailand: PARSIT
- Web-based Thai-English machine translation since 1998, in cooperation with NEC (Japan)
- Very popular among Thai users for translating English to Thai, with an accuracy of 60%
- 20 percent of the mistranslation might be due to differences in expressions, slang, and sentence structures
- http://www.suparsit.com/
  - 300,000 hits/month
  - 25,000 users/month

MT in Thailand: Dictionary
- A web-based dictionary: Lexitron
  - Thai-English and English-Thai dictionary

MT in Thailand: Future
- To develop the PARSIT translation system for Thai-to-English and for other target languages
- Other language programs, such as OCR research, speech research, and language research
- Thai full-text search engine

MT in Thailand: eASEAN
- eASEAN Plan: Multilingual Machine Translation Proposal
  - Thailand, Cambodia, Laos, Vietnam, Japan, Korea, English
- Source: Dr. Virach Sornlertlamvanich [virach@nectec.or.th]; Dr. Prayong THITITHANANON (Rajabhat Institute Ubon Ratchathani, Thailand)

MT in Taiwan
- Prof. Su, Keh-Ih
- Machine translation localization

MT in Korea: Commercial Product
- English-to-Korean (Korean-to-English)
  - Products: Enguide, E-Tran2001, EZ Reader, ClickWorld, Transmate, …
  - Companies and labs: LNI Soft, NLP Lab (Seoul National University), Language and Computer, ClickQ, IBM Korea
- Japanese-to/from-Korean
  - Unisoft, Changmyung, …
- Translation memory
  - Localization companies develop for their own use: ITI, …

MT in Korea: Test suite for E-to-K
- KAIST (http://korterm.kaist.ac.kr/ksurimal)
  - Supported by the Ministry of Science and Technology
- Exhaustive evaluation
  - A variety of sentences (5,000 from high school textbooks, 10,000 from an Internet e-business site)
  - To identify the R&D direction

Problematic Part of System A
[Table: error categories of System A, each rated "average" or "serious"; the row and column layout did not survive extraction. Categories as extracted: Article, Part of Speech, Structural Part, Noun Mark, Semantic Part, Ellipsis, Multiple part of speech, Phrase, Collocation, V+N, Gerund, Special Construction, Insertion, Speech, Adv.+N, VP, Different meaning between singular and plural, Tense, Idioms, Number, Comparative, Inversion, Subjunctive mood, Lists, Negation, N+N, Adv.+Prep, N, NP, Relatives, N+V, Ambiguous word, Natural Expression, Verb, Relation and Scope of modification, V+Prep., N+Prep., Adjective, Conjunction, Participle, Sentence type, Sentence Structure, Adverb, Preposition, Infinitive, Partial Structure, Pronoun, Idioms, PP, V, Etc., AP (adjective phrase), Sentence]

MT in Korea
- Caption/EK and KE: ETRI
  - Real-time translation of captions in TV news
  - CNN for English-Korean
  - KBS for Korean-English
- Chinese-Korean MT
  - Pohang University of Science & Tech.
  - KAIST
  - ETRI (Korean-to-Chinese)
  - Companies: Konan tech.
- Japanese-Korean MT (technology transfer)
  - Pohang University of Science & Tech.

Online language populations (2001 June)
- English 45%, Japanese 9.8%, Chinese 8.4%
- German 6.2%, Korean 4.7%, Spanish 4.5%
- Italian 3.6%, French 3.4%, Portuguese 2.5%
- Dutch 2%, Russian 1.9%
- Source: GlobalReach, Global Internet Statistics (by Language). http://www.glreach.com/globstats/index.php3

Organizations in Asia
- AAMT
- AFNLP (Asia Federation of NLP Associations)
  - http://asianlp.org/
  - http://afnlp.org/
- Eafterm (East Asia Terminology Forum)
  - http://eafterm.org/
- Language Resource Sharing and Management
  - Jan/2001: workshop in Tokyo, invited by Japan
    - Prof. Tanaka, Hozumi (Chair; GSK)
  - Nov/2001: workshop in NLPRS-2001, Tokyo
  - ISO TC37/SC4 (Language Resource Management) under organization

MT Status in Asia
Thank you.