Statistical Machine Translation

CS 188: Artificial Intelligence
Spring 2007
Lecture 25: Machine Translation
4/24/2007
Srini Narayanan – ICSI and UC Berkeley
Announcements
 Assignment 7 is up.
 Grid-world and robot crawler.
 Due 5/3.
 Extra Office Hours first two weeks of May
 This week as usual Thursday 11-1 PM
 5/2 extra (Tuesday 11-1 PM)
 5/3 usual 11-1 PM
 Next assignment (not graded) will be a
final exam review.
Reinforcement Learning
 What you should know
 MDPs
Basics, discounted reward
Policy Evaluation
Bellman’s equation
Value iteration
Policy iteration
 Reinforcement Learning
 Adaptive Dynamic Programming
 TD learning (Model-free)
 Q Learning
Where we are
 Past:
 Basic Techniques of AI
 Search, Representation, Uncertainty and
Inference, Learning
 Next
 Applications
 MT, NLU (this week)
 Neural Computation, Perception (next week).
 Today: Machine Translation (MT)
 (Semi) Automatically translating text/speech from
one language to another.
Translation is hard
• In a Bucharest hotel lobby:
  • The lift is being fixed for the next day. During that time we regret that you will be unbearable.
• In a Paris hotel elevator:
  • Please leave your values at the front desk.
• In a hotel in Athens:
  • Visitors are expected to complain at the office between the hours of 9 and 11 a.m. daily.
• In a Japanese hotel:
  • You are invited to take advantage of the chambermaid.
• In the lobby of a Moscow hotel across from a Russian Orthodox monastery:
  • You are welcome to visit the cemetery where famous Russian and Soviet composers, artists, and writers are buried daily except Thursday.
MT History
 1946 (Pre-AI) Booth and Weaver discuss MT at
the Rockefeller Foundation in New York;
 1947-48 idea of dictionary-based direct
translation
 1949 Weaver memorandum popularized idea
 1952 all 18 MT researchers in world meet at MIT
 1954 IBM/Georgetown Demo Russian-English
MT
 1955-65 lots of labs take up MT
Early translation problems
 English to Russian to English
 The spirit is willing but the flesh is weak.
 The vodka is good but the meat is rotten.
History of MT: Pessimism
 1959/1960: Bar-Hillel “Report on the state of MT
in US and GB”
 Argued FAHQT (fully automatic high-quality translation) is too hard (semantic ambiguity, etc.)
 Should work on semi-automatic instead of automatic
 His argument
Little John was looking for his toy box. Finally, he
found it. The box was in the pen. John was very
happy.
 Only human knowledge lets us know that ‘playpens’
are bigger than boxes, but ‘writing pens’ are smaller
 His claim: we would have to encode all of human
knowledge
History of MT
 Systran (Babelfish) has been used for 30 years
 1970’s:
 European focus in MT; mainly ignored in US
 1980’s
 ideas of using AI techniques in MT (KBMT, CMU)
 1990’s
 Commercial MT systems
 Statistical MT (SMT), Speech-to-speech translation
 2000’s
 SMT matures to be an exciting AI technology
 Well funded, high-payoff, can make a real difference.
Levels of Transfer
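(Vauquois triangle)
[Figure: the Vauquois triangle.
Analysis side, bottom to top: Source Text, Morphological Analysis, Word Structure, Syntactic Analysis, Syntactic Structure, Semantic Analysis, Semantic Structure, Semantic Composition, Interlingua.
Generation side, top to bottom: Interlingua, Semantic Decomposition, Semantic Structure, Semantic Generation, Syntactic Structure, Syntactic Generation, Word Structure, Morphological Generation, Target Text.
Transfer levels, lowest to highest: Direct (between word structures), Syntactic Transfer (between syntactic structures), Semantic Transfer (between semantic structures).]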
What makes a good translation
 Translators often talk about two factors we
want to maximize:
 Faithfulness or fidelity
 How close is the meaning of the translation to
the meaning of the original
 (Even better: does the translation cause the
reader to draw the same inferences as the
original would have)
 Fluency or naturalness
 How natural the translation is, just considering
its fluency in the target language
The Coding View
 “One naturally wonders if the problem of
translation could conceivably be treated as a
problem in cryptography. When I look at an article
in Russian, I say: ‘This is really written in English,
but it has been coded in some strange symbols. I
will now proceed to decode.’ ”
 Warren Weaver (1955:18, quoting a letter he wrote in 1947)
MT System Components
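[Figure: noisy-channel model. The source (Language Model) generates English e with probability P(e). The channel (Translation Model) turns e into the observed French f with probability P(f|e). The decoder recovers the best e from the observed f.]
argmax_e P(e|f) = argmax_e P(f|e) P(e)
Finds an English translation which is both fluent
and semantically faithful to the French source f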
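The argmax on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the candidate list and every probability in the `LM` and `TM` tables are made-up toy values, and the names `LM`, `TM`, `score` and `decode` are hypothetical, not taken from the lecture.

```python
import math

# Hypothetical toy language model P(e): fluent English scores higher.
LM = {
    "the blue house": 1e-3,
    "the house blue": 1e-5,
    "a blue flower": 1e-3,
}

# Hypothetical toy translation model P(f | e): faithful candidates score higher.
TM = {
    ("la maison bleue", "the blue house"): 0.2,
    ("la maison bleue", "the house blue"): 0.2,
    ("la maison bleue", "a blue flower"): 1e-4,
}

def score(f, e):
    """log P(f | e) + log P(e), the quantity maximised by the decoder."""
    return math.log(TM[(f, e)]) + math.log(LM[e])

def decode(f, candidates):
    """Return the candidate e with the highest noisy-channel score."""
    return max(candidates, key=lambda e: score(f, e))

best = decode("la maison bleue", list(LM))
print(best)
```

With these toy numbers the translation model cannot tell "the blue house" from "the house blue", so the language model breaks the tie in favour of the fluent word order.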
The Classic Language Model
Word N-Grams
Generative approach:
w1 = START
repeat until END is generated:
produce word w2 according to a big table P(w2 | w1)
w1 := w2
P(I saw water on the table) =
P(I | START) *
P(saw | I) *
P(water | saw) *
P(on | water) *
P(the | on) *
P(table | the) *
P(END | table)
Probabilities can be learned from online English text.
[Figure: chain START → w1 → w2 → … → wn-1 → END]
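The bigram model above can be estimated by simple counting. The sketch below assumes a three-sentence toy corpus invented for illustration and uses unsmoothed relative frequencies, so any unseen bigram gets probability zero; a practical model would add smoothing. The names `train_bigram` and `sentence_prob` are hypothetical.

```python
from collections import Counter

START, END = "<s>", "</s>"

def train_bigram(sentences):
    """Estimate P(w2 | w1) by relative frequency (no smoothing)."""
    context, bigram = Counter(), Counter()
    for sentence in sentences:
        words = [START] + sentence.split() + [END]
        for w1, w2 in zip(words, words[1:]):
            context[w1] += 1
            bigram[(w1, w2)] += 1

    def p(w2, w1):
        # Counter returns 0 for unseen keys, so unseen bigrams get 0.0.
        return bigram[(w1, w2)] / context[w1] if context[w1] else 0.0

    return p

def sentence_prob(p, sentence):
    """Multiply P(w2 | w1) over the sentence, from START through END."""
    words = [START] + sentence.split() + [END]
    prob = 1.0
    for w1, w2 in zip(words, words[1:]):
        prob *= p(w2, w1)
    return prob

# Toy corpus, made up for illustration.
corpus = ["I saw water on the table", "I saw the table", "water on the floor"]
p = train_bigram(corpus)
prob = sentence_prob(p, "I saw water on the table")
print(prob)
```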
Parallel Corpora
 Parallel corpora (or
bitexts)
 Collection of source-target translation pairs
 Main resource for
learning a translation
model
 Either naturally occurring
(e.g. parliamentary
proceedings, news
translation services) or
commissioned
Building a Translation Model
 Steps in building a
simple statistical
translation model
 Match up words in
training sentence
pairs (word
alignment)
 Learn a lexicon from
these alignments
 Learn larger phrases
[Figure: word alignment between the following sentence pair]
What is the anticipated cost of collecting fees under the new proposal ?
En vertu de les nouvelles propositions , quel est le coût prévu de perception de les droits ?
Centauri/Arcturan [Knight, 1997]
Your assignment, translate this to Arcturan:
farok crrrok hihok yorok clok kantok ok-yurp
1a. ok-voon ororok sprok .
1b. at-voon bichat dat .
2a. ok-drubel ok-voon anok plok sprok .
2b. at-drubel at-voon pippat rrat dat .
3a. erok sprok izok hihok ghirok .
3b. totat dat arrat vat hilat .
4a. ok-voon anok drok brok jok .
4b. at-voon krat pippat sat lat .
5a. wiwok farok izok stok .
5b. totat jjat quat cat .
6a. lalok sprok izok jok stok .
6b. wat dat krat quat cat .
7a. lalok farok ororok lalok sprok izok enemok .
7b. wat jjat bichat wat dat vat eneat .
8a. lalok brok anok plok nok .
8b. iat lat pippat rrat nnat .
9a. wiwok nok izok kantok ok-yurp .
9b. totat nnat quat oloat at-yurp .
10a. lalok mok nok yorok ghirok clok .
10b. wat nnat gat mat bat hilat .
11a. lalok nok crrrok hihok yorok zanzanok .
11b. wat nnat arrat mat zanzanat .
12a. lalok rarok nok izok hihok mok .
12b. wat nnat forat arrat vat gat .
[The slide above is repeated while the example is worked through word by word. The successive annotations on the repeats are: "???", "process of elimination", "cognate?"]
Centauri/Arcturan [Knight, 1997]
Your assignment, put these words in order:
{ jjat, arrat, mat, bat, oloat, at-yurp }
(Same twelve sentence pairs as above.)
Annotation: zero fertility (a word that produces no word in the translation)
It’s Really Spanish/English
Clients do not sell pharmaceuticals in Europe => Clientes no venden medicinas en Europa
1a. Garcia and associates .
1b. Garcia y asociados .
2a. Carlos Garcia has three associates .
2b. Carlos Garcia tiene tres asociados .
3a. his associates are not strong .
3b. sus asociados no son fuertes .
4a. Garcia has a company also .
4b. Garcia tambien tiene una empresa .
5a. its clients are angry .
5b. sus clientes estan enfadados .
6a. the associates are also angry .
6b. los asociados tambien estan enfadados .
7a. the clients and the associates are enemies .
7b. los clientes y los asociados son enemigos .
8a. the company has three groups .
8b. la empresa tiene tres grupos .
9a. its groups are in Europe .
9b. sus grupos estan en Europa .
10a. the modern groups sell strong pharmaceuticals .
10b. los grupos modernos venden medicinas fuertes .
11a. the groups do not sell zanzanine .
11b. los grupos no venden zanzanina .
12a. the small groups are not modern .
12b. los grupos pequenos no son modernos .
Statistical Machine Translation
… la maison … la maison bleue … la fleur …
… the house … the blue house … the flower …
All word alignments equally likely
All P(french-word | english-word) equally likely
“la” and “the” observed to co-occur frequently, so P(la | the) is increased.
“house” co-occurs with both “la” and “maison”, but P(maison | house) can be raised without limit, to 1.0, while P(la | house) is limited because of “the” (pigeonhole principle).
Settling down after another iteration.
Inherent hidden structure revealed by EM training!
For details, see:
“A Statistical MT Tutorial Workbook” (Knight, 1999).
“The Mathematics of Statistical Machine Translation” (Brown et al, 1993)
Software: GIZA++
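The EM procedure walked through on these slides can be sketched as follows. This is a simplified version in the style of IBM Model 1, written for illustration rather than taken from the lecture or from GIZA++: it omits the NULL word, uses only the three phrase pairs from the slides as its corpus, and the function name `train_ibm_model1` is hypothetical.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """pairs: list of (french_words, english_words). Returns t[(f, e)] = P(f | e)."""
    french_vocab = {f for fs, _ in pairs for f in fs}
    # Start with all P(french-word | english-word) equally likely.
    t = defaultdict(lambda: 1.0 / len(french_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected number of times e generates f
        total = defaultdict(float)  # expected number of words generated by e
        for fs, es in pairs:
            for f in fs:
                # E-step: split one unit of f among the English words
                # in proportion to the current P(f | e).
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise expected counts into probabilities.
        new_t = defaultdict(float)
        for (f, e), c in count.items():
            new_t[(f, e)] = c / total[e]
        t = new_t
    return t

corpus = [
    ("la maison".split(), "the house".split()),
    ("la maison bleue".split(), "the blue house".split()),
    ("la fleur".split(), "the flower".split()),
]
t = train_ibm_model1(corpus)
for pair in [("la", "the"), ("maison", "house"), ("la", "house")]:
    print(pair, round(t[pair], 3))
```

Because "la" is already explained by "the", the probability mass for "house" drifts toward "maison" over the iterations, which is the effect the slides describe.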
Decoding
 Now we have a phrase table:
 A huge list of translation phrases (e.g. 1M
phrases)
 Each phrase has a probability P(f|e)
 When we see a new input sentence:
 Grow a translation left to right
 Extend translation using known phrases
 Also multiply by language model score
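The left-to-right procedure above can be sketched as a small stack decoder. This is a simplified illustration under stated assumptions, not the decoder used in the lecture: it covers the source strictly left to right with no reordering of phrases, and the phrase table, the bigram probabilities, the `FLOOR` value for unseen bigrams and all function names are hypothetical toy choices.

```python
import math

# Hypothetical phrase table: French phrase -> [(English phrase, P(f | e))].
PHRASE_TABLE = {
    ("la",): [("the", 0.7)],
    ("maison",): [("house", 0.8), ("home", 0.2)],
    ("bleue",): [("blue", 0.9)],
    ("maison", "bleue"): [("blue house", 0.6)],
}

# Hypothetical bigram language model; unseen bigrams get a small floor value.
BIGRAM = {
    ("<s>", "the"): 0.5, ("the", "blue"): 0.1, ("blue", "house"): 0.3,
    ("the", "house"): 0.2, ("house", "blue"): 0.001,
    ("house", "</s>"): 0.4, ("blue", "</s>"): 0.05,
}
FLOOR = 1e-4

def lm_logprob(prev, words):
    """Log-probability of words following prev; also returns the new last word."""
    total = 0.0
    for w in words:
        total += math.log(BIGRAM.get((prev, w), FLOOR))
        prev = w
    return total, prev

def decode(source, beam_size=5):
    n = len(source)
    # stacks[i] holds hypotheses (log score, last word, output) covering source[:i].
    stacks = [[] for _ in range(n + 1)]
    stacks[0].append((0.0, "<s>", []))
    for i in range(n):
        # Keep only the best few hypotheses before extending them.
        stacks[i] = sorted(stacks[i], key=lambda h: h[0], reverse=True)[:beam_size]
        for score, prev, out in stacks[i]:
            for j in range(i + 1, n + 1):
                for target, p in PHRASE_TABLE.get(tuple(source[i:j]), []):
                    words = target.split()
                    lm, last = lm_logprob(prev, words)
                    # Multiply (add logs of) the phrase and language model scores.
                    stacks[j].append((score + math.log(p) + lm, last, out + words))
    finished = [(s + lm_logprob(prev, ["</s>"])[0], out) for s, prev, out in stacks[n]]
    if not finished:
        return None  # no sequence of known phrases covers the input
    return " ".join(max(finished, key=lambda h: h[0])[1])

best = decode("la maison bleue".split())
print(best)
```

The two-word phrase entry lets the sketch produce the English adjective-noun order even though phrases themselves are never reordered.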
The Pharaoh Decoder
 Probabilities at each step include LM and TM
Recent Progress in Statistical MT
slide from C. Wayne, DARPA
2002
insistent Wednesday may recurred her trips to Libya tomorrow for flying
Cairo 6-4 ( AFP ) - an official announced today in the Egyptian lines company for flying Tuesday is a company " insistent for flying " may resumed a consideration of a day Wednesday tomorrow her trips to Libya of Security Council decision trace international the imposed ban comment .
And said the official " the institution sent a speech to Ministry of Foreign Affairs of lifting on Libya air , a situation her receiving replying are so a trip will pull to Libya a morning Wednesday " .
2003
Egyptair Has Tomorrow to Resume Its Flights to Libya
Cairo 4-6 (AFP) - said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya.
" The official said that the company had sent a letter to the Ministry of Foreign Affairs, information on the lifting of the air embargo on Libya, where it had received a response, the first take off a trip to Libya on Wednesday morning ".
Statistical Machine Translation
Training data:
… la maison … la maison bleue … la fleur …
… the house … the blue house … the flower …
Learned translation probabilities:
P(juste | fair) = 0.411
P(juste | correct) = 0.027
P(juste | right) = 0.020
…
New French sentence → possible English translations, to be rescored by language model
What is MT not (yet) good for?
 Really hard stuff
 Literature
 Natural spoken speech (meetings, court
reporting)
 Really important stuff
 Medical translation in hospitals, 911
What is MT good for?
 Tasks for which a rough translation is fine
 Web pages, email
 Multilingual Speech-based queries
 Tasks for which MT can be post-edited
 MT as first pass
 “Computer-aided human translation”
 Tasks in sublanguage domains where
high-quality MT is possible
The next five years
 Bootstrapping Resources
 Trying to design better learning methods to work from
scarce data (see Knight 2003, Plauche et al 2007)
 Germann and the ISI experiment in Tamil
 MT in a month
 100K tokens achieved tolerable performance in 2002
 Including Syntactic/Semantic Information in SMT
 Markup on the Web
 Multi-lingual Lexical resources
 WordNet PropBank FrameNet
 Combining MT methods
Pos | Language | Family | Script(s) Used | Speakers (millions) | Where Spoken (Major)
1 | Mandarin | Sino-Tibetan | Chinese Characters | 1051 | China, Malaysia, Taiwan
2 | English | Indo-European | Latin | 510 | USA, UK, Australia, Canada, New Zealand
3 | Hindi | Indo-European | Devanagari | 490 | North and Central India
4 | Spanish | Indo-European | Latin | 425 | The Americas, Spain
5 | Arabic | Afro-Asiatic | Arabic | 255 | Middle East, Arabia, North Africa
6 | Russian | Indo-European | Cyrillic | 254 | Russia, Central Asia
7 | Portuguese | Indo-European | Latin | 218 | Brazil, Portugal, Southern Africa
8 | Bengali | Indo-European | Bengali | 215 | Bangladesh, Eastern India
9 | Indonesian | Malayo-Polynesian | Latin | 175 | Indonesia, Malaysia, Singapore
10 | French | Indo-European | Latin | 130 | France, Canada, West Africa, Central Africa
11 | Japanese | Altaic | Chinese Characters and 2 Japanese Alphabets | 127 | Japan
12 | German | Indo-European | Latin | 123 | Germany, Austria, Central Europe
13 | Farsi (Persian) | Indo-European | Nastaliq | 110 | Iran, Afghanistan, Central Asia
14 | Urdu | Indo-European | Nastaliq | 104 | Pakistan, India
15 | Punjabi | Indo-European | Gurumukhi | 103 | Pakistan, India
16 | Vietnamese | Austroasiatic | Based on Latin | 86 | Vietnam, China
17 | Tamil | Dravidian | Tamil | 78 | Southern India, Sri Lanka, Malaysia
18 | Wu | Sino-Tibetan | Chinese Characters | 77 | China
19 | Javanese | Malayo-Polynesian | Javanese | 76 | Indonesia
20 | Turkish | Altaic | Latin | 75 | Turkey, Central Asia
21 | Telugu | Dravidian | Telugu | 74 | Southern India
22 | Korean | Altaic | Hangul | 72 | Korean Peninsula
23 | Marathi | Indo-European | Devanagari | 71 | Western India
24 | Italian | Indo-European | Latin | 61 | Italy, Central Europe
25 | Thai | Sino-Tibetan | Thai | 60 | Thailand, Laos
26 | Cantonese | Sino-Tibetan | Chinese Characters | 55 | Southern China
27 | Gujarati | Indo-European | Gujarati | 47 | Western India, Kenya
28 | Polish | Indo-European | Latin | 46 | Poland, Central Europe
29 | Kannada | Dravidian | Kannada | 44 | Southern India
30 | Burmese | Sino-Tibetan | Burmese | 42 | Myanmar
Top Ten Internet Languages
MT in Developing Countries
[Figure labels: Community Rec, Traditional Rec]
Related Berkeley work at TIER
Kiosks / Livelihood
  • Cellphones for pricing in rural Rwandan coffee markets
  • Computers and livelihood development in urban slums in Brazil
  • E-literacy / Entrepreneurship in rural Kerala
Education
  • Studies of social impacts of Computer Aided Learning in rural areas
  • Observations of shared computer usage among children in resource strapped areas
Telemedicine
  • Long-distance diagnosis using 802.11b
Teaching
  • ‘Technology and Development’ graduate class design (see reader/syllabus)
Conference
  • First peer-reviewed IEEE/ACM conference in series
URL bibliography
http://www.cicc.or.jp—CICC website.
http://nespole.itc.it—NESPOLE! website.
http://www.umiacs.umd.edu—UMIACS website.
http://www.isi.edu.
http://www-2.cs.cmu.edu.
http://www.lti.cs.cmu.edu.
http://blombos.isi.edu—DINO browser.
http://www-2.cs.cmu.edu—Enthusiast.
http://www.ll.mit.edu—CCLINC.
http://www-2.cs.cmu.edu—Speechalator.
http://isl.ira.uka.de—FAME.
http://www.cogsci.princeton.edu—WordNet.
http://www.globalwordnet.org—Global WordNet Association.
http://www.illc.uva.nl—EuroWordNet.
http://www.sfs.nphil.uni-tuebingen.de—GermaNet.
http://www.ceid.upatras.gr—BalkaNet.
http://www.keenage.com—Chinese HowNet.
http://www.gittens.nl—Mimida multilingual semantic network.
http://www.icsi.berkeley.edu—FrameNet project.
http://www.coli.uni-sb.de—SALSA project.
http://www.nak.ics.keio.ac.jp—FrameNet project for Japanese.
http://gemini.uab.es—FrameNet project for Spanish.
http://www.cis.upenn.edu—PropBank project.
http://www.cis.upenn.edu—VerbNet.
http://www.cis.upenn.edu—combination of VerbNet and FrameNet.
http://nlp.cs.nyu.edu—The NomBank