Key-Sun Choi

Where do we stand? MT development,
research, and deployment in Asia
Key-Sun Choi (KAIST)
AAMT
http://www.asianlp.org/
http://www.afnlp.org/
http://korterm.org/
Contents
China
Japan
India
Malaysia
Thailand
Taiwan
Korea
UNL
Associations related to MT
2
MT in China – 1980-1990’s
 To translate scientific documents
 From Russian and Western countries' languages
 Supported by government
 No private company in early stage
 TRANS-STAR:
 30,000 words/hour on a 386 PC.
 Basic dictionary includes 40,000 entries,
 10 specialized technical dictionaries
 including 350,000 entries.
 subject fields: computer, economics, telecommunication,
ceramics, thermal power industry, printing machine industry,
automobile/tractor industry, Petroleum prospecting,
geology, Chemical industry.
3
MT in China – Present
English-to-Chinese
GAOLI:
 developed jointly by Beijing GAOLI Computer Co. Ltd. & Linguistics
Institute of CASS.
 Basic lexical dictionary: 60,000 entries in which the usage and
grammatical function of every word are described in detail.
 Translation accuracy: 80%
 Readability of translated text: 80%-90%
863-IMT/EC:
 by the Institute of Computer Technology, Academia Sinica.
 commercialized, with very good economic returns.
4
MT in China – Present
Chinese-to-English
SINO-TRANS
 by CS&S (China National
Software & Technology Service Co.) in
1993.
 Basic dictionary: 40,000 entries
 Two special subject technical dictionaries:
Naval ships and boats (9312 entries),
rocket-gun (33,773 entries)
 Linguistic rules: 1,000 rules
5
MT in China – Present
English-to-Chinese + terminology
TONGYI system:
 by the Tianjin DATONG computer software
company
 WINDOWS platform
 Different special subject dictionaries:
a. commonly used scientific terms: 200,000 entries
b. terms covering 22 different subjects (e.g. machine
building, telecommunication, aviation, medicine, etc.):
3,000,000 entries
 Good market strategy and service
 Cooperation with enterprises
6
MT in China – Present
English-to-Chinese + internet browsing + more
user interface
YIWANG:
 by SUNSHINE company of Shenzhen.
 Highest translation speed: 100 sentences per
second.
 Internet browsing
YIBA:
 by YAXINCHENG software technical company.
 Three translation modes: on-line, automatic, interface.
 Open to users: they can revise the dictionary and rules
 Rich special subject dictionaries: 30 subjects (e.g.
computer, telecommunication, medicine)
7
MT in China – Present
English-to-Japanese
E-to-J
 by JEC company in Beijing.
 Technique of transformation from phrase
tree (P-tree) to dependency tree (D-tree).
 Closely integrated with word processor
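A phrase-tree to dependency-tree transformation can be sketched with head rules: each phrase picks one child as its head, and the other children become dependents of that head's word. The sketch below is only an illustration under assumed conventions. The tree encoding, the category labels, and the HEAD_CHILD table are hypothetical and are not taken from the JEC system.

```python
# Hypothetical head table: for each phrase label, the child category that heads it.
HEAD_CHILD = {"S": "VP", "VP": "V", "NP": "N", "PP": "P"}

# A phrase tree (P-tree) as nested tuples: (label, [children]) or (POS tag, word).
P_TREE = (
    "S",
    [
        ("NP", [("DET", "the"), ("N", "system")]),
        ("VP", [("V", "translates"), ("NP", [("N", "text")])]),
    ],
)


def to_dependencies(node, deps):
    """Return the head word of `node`, appending (dependent, head) pairs to `deps`."""
    label, rest = node
    if isinstance(rest, str):  # leaf node: (POS tag, word)
        return rest
    # Convert every child first, collecting the head word of each child.
    child_heads = [to_dependencies(child, deps) for child in rest]
    wanted = HEAD_CHILD.get(label)
    # Pick the first child whose label matches the head rule; fall back to the last child.
    head_idx = next(
        (i for i, child in enumerate(rest) if child[0] == wanted),
        len(rest) - 1,
    )
    # Every non-head child depends on the head child's word.
    for i, word in enumerate(child_heads):
        if i != head_idx:
            deps.append((word, child_heads[head_idx]))
    return child_heads[head_idx]


if __name__ == "__main__":
    dependencies = []
    root = to_dependencies(P_TREE, dependencies)
    print("root:", root)
    for dependent, head in dependencies:
        print(f"{dependent} -> {head}")
```

The resulting D-tree is the list of (dependent, head) pairs plus the root word. A real system would need a much richer head table and would track word positions rather than word strings.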
8
MT in China – Present
Example-based MT: experimental systems
Japanese-Chinese EBMT:
 computer department of Qinghua University in
1996.
 corpus of aligned Japanese and Chinese
sentences
 The example unit is the sentence
 The similarity calculation is based on words
DAYA EBMT:
 Harbin Polytechnic University.
 machine-aided translation system; the human factor is
very important
 corpus is aligned at the sentence level
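Sentence-level example retrieval with a word-based similarity, as described for the Japanese-Chinese EBMT system, can be sketched as follows. This is a minimal illustration, not the Qinghua implementation: the Dice coefficient over word sets is an assumed similarity measure, and the romanized, pre-segmented example pairs are invented toy data.

```python
def word_similarity(a, b):
    """Dice coefficient over the word sets of two tokenized sentences."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))


# Toy aligned corpus: (segmented source sentence, target sentence).
EXAMPLES = [
    (["watashi", "wa", "hon", "o", "yomu"], "我读书"),
    (["watashi", "wa", "tegami", "o", "kaku"], "我写信"),
]


def best_example(query):
    """Return the aligned pair whose source side is most similar to `query`."""
    return max(EXAMPLES, key=lambda pair: word_similarity(query, pair[0]))


if __name__ == "__main__":
    query = ["kare", "wa", "hon", "o", "yomu"]
    source, target = best_example(query)
    print(source, target, word_similarity(query, source))
```

The retrieved target sentence is only a starting point. In a machine-aided setting such as the one described for DAYA, a human translator would adapt it to the new input.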
9
MT in China
Government Funding: 1990’s
Hi-Tech 863 funding:
 863-IMT/EC system (English-Chinese)
 SUNSHINE YIWANG system.
905 Project: Chinese Language Processing
 completed in 1998.
10
MT in China
User’s English Level
Distribution of users' English level
for TONGYI MT software:
 Higher level: 16.5%
 Middle level: 49.5%
 Lower level: 34.1%
So MT software must be oriented to
ordinary users
11
MT in China
Potential Users
Distribution of enterprise users for TONGYI
MT software:
 Small enterprises: 31.3%
 Medium-scale & large-scale enterprises: 68.7%
So MT software must be oriented to
 large-scale & medium-scale enterprises,
 without ignoring the small enterprises, which also
have translation demand.
12
MT in China
Regional Distribution
Regional distribution of MT software users:
 translation demand is concentrated in the big
cities and developing regions.
 Beijing: 18.7%
 Liaoning: 7.9%, Jiangsu: 7.5%
 Zhejiang: 6.5%, Hubei: 6.5%, Shanghai: 6.1%
 Sichuan: 4.7%, Guangdong: 4.7%
 Henan: 3.3%, Heilongjiang: 3.3%
 Hebei: 2.8%, Shanxi: 2.3%, Jilin: 2.3%
 Yunnan: 1.9%, Inner Mongolia (Neimenggu): 1.5%, Gansu: 1.4%
 Guizhou: 0.5%, Anhui: 0.5%
13
MT in China - Future and Strategies (1)
Terminology Data Bank
MT software combines with terminology data
bank
 1990: the sub-committee on computer-aided
terminology of China was set up.
 This sub-committee is attached to the State Language
Commission (SLC) of China
 A series of national standards for terminology
data-bank
 Terminology Databank creation
 Chinese-English: Since 1995, by ISTIC (Institute of
Scientific and Technical Information of China)
 Remarkable databanks…
14
MT in China - Future and Strategies (2)
Language Corpus Processing
Corpus construction:
 the scale of 25 million Chinese characters
(1999)
 Automatic segmentation of written Chinese
text in the corpus (97.68%, closed test)
 Automatic phrase bracketing and syntactic
annotation for Chinese Corpus
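For readers unfamiliar with Chinese word segmentation, a minimal sketch of forward maximum matching is given here. It is a common dictionary-based baseline, not the method behind the 97.68% figure above. The toy lexicon and the max_len parameter are assumptions made for illustration.

```python
# Toy lexicon (assumed); a real system would load a large machine dictionary.
LEXICON = {"机器", "翻译", "机器翻译", "系统", "研究"}


def segment(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest lexicon word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so that unknown text still advances.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


if __name__ == "__main__":
    print(segment("机器翻译系统", LEXICON))
    print(segment("研究机器", LEXICON))
```

A closed test scores such a segmenter on text drawn from the same material used to build it, so closed-test figures are usually higher than results on unseen text.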
15
MT in China - Future and Strategies (3)
speech-to-speech translation
Chinese speech into Chinese text.
 The "SIDA-863A" system can recognize
 398 basic Chinese syllables,
 recognition rate can reach 93%,
 response time is less than 0.1 second,
 input speed can reach 80 Chinese
characters per minute
16
MT in China - Future and Strategies (4)
combined with OCR and Internet
Internet MT:
 SUNSHINE YIWANG, YAXIN YIBA, TONGYI, etc.
The advantages of MT software on the Internet
are:
 Higher translation speed, real-time translation
 Low price
 Large machine dictionary
 Possibility to add new words
17
MT in China: New National
Project
973 project: from 2001
 supported by Chinese government.
 For creative research in
 Natural Language processing including
machine translation.
 automatic speech-to-speech translation system
(English-Chinese)
 under development at the Institute of Automation of Academia
Sinica.
18
MT in China – Survey Source
Prof. Feng, Zhiwei:
 Secretary-general and deputy chairman of the
 sub-committee on computer-aided terminology of China
 under the State Language Commission (SLC) of China.
 Invited professor, KAIST (Sep/2001 – Aug/2002)
Dr. Liu, Qun
 Institute of Computer Technology, Academia
Sinica, Beijing
19
MT in Japan - 1
More than 10 companies
 For English, Chinese, Korean
Waiting for the new breakthrough
 Internet
 eLearning
 Co-work with special-domain related companies
Technology transfer
 Collaboration tools are ready for the market
 For translators' collaboration workbench over the network
 User interface: well-organized.
20
MT in Japan - 2
Leading Systems
 Cross-lingual patent retrieval
 Prime
 NTT/ALT
 Japanese-to-English
 Japanese-to-Malay
 Japanese-to-Chinese
 Speech Translation
 ATR: C-Star
21
UNL in UN University
Through Universal Networking
Language
 With Hindi, Japanese, Persian, Indonesia-Malay, Thai, Chinese, Mongolian, Korean
in the Asian region
 Other region: Major European languages
and English
Possible Users:
 ITU mail translation
22
MT in Malaysia
No commercial product yet.
 But in academic sectors
For application to
 Internet
 eLearning
 eCommerce
Universiti Sains Malaysia
 Computer Aided Translation Unit
 Prof. Tang Enya Kong and Prof. Yusoff Zaharin
23
MT in India
18 constitutional languages with 10
different scripts:
 their script grammars and language
grammars are quite similar
 they have 40 to 80 percent of their vocabulary in
common
 less than 5 percent of people can work in
English
24
MT in India: 1990-2001
government effort for IT
TDIL (Technology Development of Indian Languages):
 1990-1991
 development of corpora, OCR, Text-to-Speech, machine
translation; Standards for keyboard and internal code for
information interchange
 2000-2001
 seven major initiatives:
 Knowledge Resources, Knowledge Tools, Translation
Support Systems, Human Machine Interface Systems,
Localisation, Standardization and Language Technology
Human Resource Development.
 Thirteen Resource centres for Indian Language
Technology Solutions (RC-ILTS)
 were supported covering all 18 Indian languages.
25
MT in India: Future
Digital Unite and Knowledge for All
Indian Language Technology Vision 2010 has
been prepared
 with the Vision statement “ Digital Unite and
Knowledge for All”.
 Growing popularity of Internet
 content creation, localisation, on-line gisting and
summarisation, e-learning, Cross-Lingual Information
Retrieval are being promoted to ensure information access
in cyberspace in Indian languages
Source: Dr. Om Vikas
 Senior Director and Head, Computer Development
Division, Ministry of Information Technology
26
MT in Thailand
Government 1996
IT-2000
 To build a national information infrastructure (NII)
 To invest in people, concentrating on transferring IT
knowledge to their children.
 To build a Government Information Network (GINET)
Internet Users in Thailand (2000): 2.3M/66M
 Age         <10  10-14  15-19  20-29  30-39  40-49  50-59  60-69    70+  Total
 Freq         18    124    261  1,238    572    187     32     27      2  2,461
 Percent     0.7      5   10.6   50.3   23.2    7.6    1.3    1.1    0.1    100
Most of the Thai Internet users know English and other Internet
languages at a basic or low intermediate level
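The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below recomputes the percentage row from the frequency row and the share of Internet users in the population from the 2.3M/66M figure. The variable names are illustrative only.

```python
AGE_GROUPS = ["<10", "10-14", "15-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70+"]
FREQ = [18, 124, 261, 1238, 572, 187, 32, 27, 2]

# Total number of respondents in the age breakdown.
total = sum(FREQ)

# Share of each age group, rounded to one decimal place as on the slide.
percent = [round(100 * n / total, 1) for n in FREQ]

# Internet users as a share of the population (2.3M out of 66M).
penetration = round(100 * 2.3 / 66, 1)

if __name__ == "__main__":
    for group, n, p in zip(AGE_GROUPS, FREQ, percent):
        print(f"{group:>6} {n:>6} {p:>6}")
    print("total:", total)
    print("penetration (%):", penetration)
```

The recomputed shares match the slide, and users aged 20-39 account for nearly three quarters of the respondents.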
27
MT in Thailand
PARSIT
web-based Thai-English Machine Translation
 since 1998 in cooperation with NEC (Japan).
 very popular among Thai users
 translates English to Thai with an accuracy of
60%.
 20 percent mistranslating might be due to
differences in expressions, slang, and sentence
structures
http://www.suparsit.com/
300,000 hits/month
25,000 users/month
28
MT in Thailand: Dictionary
a web-based dictionary: Lexitron
 Thai-English and English-Thai dictionary
29
MT in Thailand: Future
to develop PARSIT translating system
 Thai-to-English
 and to other target languages.
Other language programs, such as
OCR research, speech research, and
language research
Thai full-text search engine
30
MT in Thailand: eASEAN
eASEAN Plan:
 Multilingual Machine Translation Proposal
 Thailand, Cambodia, Laos, Vietnam, Japan,
Korea, English
source:
 Dr. Virach Sornlertlamvanich
[virach@nectec.or.th]
 Dr. Prayong THITITHANANON (Rajabhat
Institute Ubon Ratchathani, Thailand)
31
MT in Taiwan
Prof. Su, Keh-Yih
 Machine translation
 localization
32
MT in Korea
Commercial Product
English-to-Korean (Korean-to-English)
 Enguide: LNI Soft
 E-Tran2001: NLP Lab (Seoul National University)
 EZ Reader: Language and Computer
 ClickWorld: ClickQ
 Transmate: IBM Korea
 …
Japanese-to/from-Korean
 Unisoft
 Changmyung
 …
Translation Memory
 Localization companies develop for their own use:
 ITI …
33
MT in Korea
Test suite for E-to-K
KAIST
(http://korterm.kaist.ac.kr/ksurimal)
 Supported by Ministry of Science and
Technology
Exhaustive Evaluation
 A variety of sentences (5,000 from high
school textbooks, 10,000 from Internet e-business sites)
 To identify the R&D direction
34
Problematic Part of System
[Figure: chart of problematic parts of the system, with a severity legend (average, serious). Groups shown: Part of Speech, Structural Part, Semantic Part, Sentence Structure. Labels shown: Article, Noun, Verb, Adjective, Adverb, Preposition, Pronoun, Conjunction, Participle, Infinitive, Gerund, Number, Tense, Comparative, Negation, Inversion, Subjunctive mood, Relatives, Ellipsis, Insertion, Lists, Idioms, Collocation, Mark, Speech, Phrase, Sentence, Sentence type, Multiple part of speech, Ambiguous word, Different meaning between singular and plural, Natural Expression, Special Construction, Partial Structure, Relation and Scope of modification, N, V, NP, VP, PP, AP (adjective phrase), N+N, N+V, V+N, V+Prep., N+Prep., Adv.+N, Adv.+Prep, Etc.]
35
MT in Korea
Caption/EK and KE - ETRI
 Real-time translation of caption in the TV news
 CNN for English-Korean
 KBS for Korean-English
Chinese-Korean MT
 Pohang University of Science & Tech.
 KAIST
 ETRI (Korean-to-Chinese)
 Companies: Konan Tech.
Japanese-Korean MT (technology transfer)
 Pohang University of Science & Tech.
36
Online language populations
(2001 June)
English 45%, Japanese 9.8%, Chinese 8.4%
German 6.2%, Korean 4.7%, Spanish 4.5%
Italian 3.6%, French 3.4%, Portuguese 2.5%
Dutch 2%, Russian 1.9%
Source: GlobalReach, Global Internet Statistics (by Language).
 http://www.glreach.com/globstats/index.php3
37
Organizations in Asia
AAMT
AFNLP (Asia Federation of NLP Associations)
 http://asianlp.org/
 http://afnlp.org/
Eafterm (East Asia Terminology Forum)
 http://eafterm.org/
Language Resource Sharing and Management
 Jan/2001 – workshop in Tokyo, invited by Japan
 Prof. Tanaka, Hozumi (Chair; GSK)
 Nov/2001 – workshop in NLPRS-2001, Tokyo
 ISO TC37/SC4 (Language Resource Management)
under organization
38
MT Status in Asia
Thank you.