Machine Translation (Level 2)

Anna Sågvall Hein
GSLT Course, September 2004
Translation
”substitute the text material of one language (SL) by the equivalent
text material of another language (TL)” (Catford 1965: 20)
”Translation consists in producing in the target language the closest
natural equivalent of the text material of the source language, in
the first hand concerning meaning, in the second hand concerning
style” (Nida 1975: 32)
”Translation is in theory impossible, but in practice fairly possible”
(Mounin 1967)
Catford, J. C. (1965), A Linguistic Theory of Translation, Oxford Press, England.
Mounin, G. (1967), Les problèmes théoriques de la traduction, Paris.
Nida, E. (1975), A Framework for the Analysis and Evaluation of Theories of
Translation, in Brislin, R. W. (ed) (1975), Translation Application and Research,
Gardner Press, New York.
Anna Sågvall Hein, GSLT, September
2004
Equivalence
• form
• meaning
• style
• effect
Formal and dynamic equivalence
• Formal equivalence focuses attention on the message itself,
in both form and content. It aims to allow the reader to
understand as much of the SL context as possible.
• Dynamic equivalence is based on the principle of
equivalent effect, i.e. that the relationship between receiver
and message should aim at being the same as that between
the original receivers and the SL message.
(Nida 75)
Can computers translate?
• Not a simple yes or no; it depends on the
purpose of the translation and the required
quality.
Classical problems with MT
• unrealistic expectations
• bad translations
• difficulties in integrating MT into the workflow
– the Ericsson case
Feasibility of machine translation
• quality in relation to purpose
• control of the source language
• human-machine interaction
• re-use of translations
• evaluation
Quality
• publishing quality
• editing quality
• browsing quality
Translation related tasks
• translation
• browsing
• gisting
• drafting
• message dissemination
• cross-language information searches
• cross-language interchanges
MT as a cross-language communication tool
MT is used not only for pure translation
purposes but also for writing in a foreign
language and for browsing (Hutchins 2001)
Hutchins, J., 2001, Towards a new vision for MT,
Introductory speech at MT Summit VIII conference, 18-22
September 2001
(http://ourworld.compuserve.com/homepages/WJHutchins/MTS-2001.pdf)
Control of the source language
• spell checked and grammar checked SL
• sublanguage
– Domain
– Text type
• controlled language
Spell checking and grammar checking
• If there are spelling errors or typos in the SL text,
dictionary search will fail
• If there are grammatical errors in the SL text,
grammatical analysis will fail
• Where and how should spell checking and grammar
checking be handled: before translation or as part of
the process?
Controlled language
• consistent authoring of source texts
– reduction of ambiguity
– full linguistic coverage
• controlled vocabulary
– full lexical coverage
• controlled grammar
– full grammatical coverage
• controlled language checking
– e.g. Scania Checker
Examples of controlled languages
• Simplified English
• KANT controlled English
• Scania Swedish
– Scania Checker
Human intervention
• before
– language checking
• during
– e.g. ambiguity resolution
• after
– post-editing
Re-use of translations
• translation memories
• translation dictionaries incl. terminologies
• lexicalistic translation
• statistical machine translation
• example-based translation
Evaluation of MT
• human
• automatic
– using a gold standard
• coverage (recall)
• quality (precision)
• global similarity measures
– merge of recall and precision
– BLEU, NIST
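The recall and precision notions above can be made concrete with a small sketch. It compares one candidate translation with a single gold-standard reference at the word level, using clipped word counts, and merges the two scores into an F-measure. The function name `overlap_scores` and the whitespace tokenisation are assumptions made for illustration; BLEU and NIST are more elaborate (n-grams, several references, length penalties) and are not reproduced here.

```python
from collections import Counter


def overlap_scores(candidate: str, reference: str) -> tuple[float, float, float]:
    """Return (precision, recall, f1) over word tokens with clipped counts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per word (clipping).
    matched = sum((cand & ref).values())
    precision = matched / sum(cand.values()) if cand else 0.0
    recall = matched / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


# Candidate misses the article: every candidate word is correct (precision),
# but one reference word is not covered (recall).
p, r, f = overlap_scores("tilt cab", "tilt the cab")
```

A single merged score is convenient for comparing systems, but it hides whether a system fails on coverage or on quality, which is why the two components are kept in the return value.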
Why machine translation?
• cheaper
• faster
• more consistent
– when it succeeds …
What is MT proper?
To be considered as MT, a system should
• provide minimally correct morphology
• perform minimal syntactic processing
• perform minimal semantic processing
• handle and produce full sentences
Hutchins, J., 2000, The IAMT Certification initiative and defining
translation system categories (http://nl.ijs.si/eamt00/proc/Hutchins.pdf)
Examples of MT products
• Systran (http://babelfish.altavista.com/)
• Comprendium (based on Metal)
• ProMT (http://www.translate.ru/eng)
• ESTeam
See further:
http://ourworld.compuserve.com/homepages/WJHutchins/Compendium-4.pdf,
http://www.foreignword.com/Technology/mt/mt.htm
Basic strategies
• direct translation
• rule-based translation
– transfer
– interlingua
• example-based translation
• statistical translation
• hybrids
Direct translation
• no complete intermediary sentence structure
• translation proceeds in a number of steps, each
step dedicated to a specific task
• the most important component is the bilingual
dictionary
• typically general language
• problems with
– ambiguity
– inflection
– word order and other structural shifts
Simplistic approach
• sentence splitting
• tokenisation
• handling capital letters
• dictionary look-up and lexical substitution incl.
some heuristics for handling ambiguities
• copying unknown words, digits, signs of
punctuation etc.
• formal editing
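The steps of the simplistic approach can be sketched as a small pipeline. This is a minimal illustration, not a description of any existing system: the dictionary `TOY_DICT`, its entries and all function names are assumptions, and the ambiguity heuristics mentioned above are left out.

```python
import re

# Toy Swedish-English dictionary; the entries are illustrative assumptions.
TOY_DICT = {"tippa": "tilt", "hytten": "the cab", "fyll": "fill", "olja": "oil"}


def split_sentences(text: str) -> list[str]:
    # Sentence splitting: break after sentence-final punctuation plus whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenise(sentence: str) -> list[str]:
    # Tokenisation: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)


def translate_sentence(sentence: str) -> str:
    out = []
    for token in tokenise(sentence):
        # Capital letters are neutralised for look-up; unknown words,
        # digits and punctuation are copied unchanged.
        out.append(TOY_DICT.get(token.lower(), token))
    text = " ".join(out)
    # Formal editing: no space before punctuation, capital first letter.
    text = re.sub(r"\s+([.,!?;:])", r"\1", text)
    return text[:1].upper() + text[1:]


def translate(text: str) -> str:
    return " ".join(translate_sentence(s) for s in split_sentences(text))


result = translate("Tippa hytten. Fyll olja 2 gånger.")
```

The untranslated word in the second sentence shows the typical behaviour of this strategy: anything missing from the dictionary passes through as-is, and no word order changes are made.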
Advanced classical approach (Tucker 1987)
• source text dictionary look-up and
morphological analysis
• identification of homographs
• identification of compound nouns
• identification of noun and verb phrases
• processing of idioms
Advanced approach, cont.
• processing of prepositions
• subject-predicate identification
• syntactic ambiguity identification
• synthesis and morphological processing of
target text
• rearrangement of words and phrases in
target text
Feasibility of the direct translation strategy
Is it possible to carry out the direct
translation steps as suggested by Tucker
with sufficient precision without relying on
a complete sentence structure?
Assignment 1: manual direct translation
Sv. Ytterst handlar kampen för sysselsättning om att hålla
samman Sverige.
En. Ultimately, the fight for full employment concerns the
cohesion of Swedish society.
(from Statement of Government Policy 1996)
• Define an algorithm and a dictionary (based on
Norstedts) for simplistic translation of the
example.
• Present the model and the result.
Assignment 1, cont.
• Improve the result stepwise in accordance with the
advanced direct translation strategy
– Specify each step carefully and demonstrate its effect
on the translation.
• Evaluate and discuss the final result.
• Translate the example using Systran
(http://kwic.systran.fr/systran/svdemo) and discuss
the differences in an evaluative way.
• Report on the assignment and upload the report to the web
(041001).
Current trends in direct translation
• re-use of translations
– translation memories of sentences and sub-sentence
units such as words, phrases and larger units
– lexicalistic translation
– example-based translation
– statistical translation
Will re-use of translations overcome the problems with the
direct translation approach that were discussed above?
If so, how can they be handled?
Systran
• System Translation
• developed in the US by Peter Toma
• first version 1969 (Ru-En)
• EC bought the rights of Systran in 1976
• currently 18 language pairs
• demo version sv-en in 2003
(http://kwic.systran.fr/systran/svdemo)
• http://babelfish.altavista.com/
Systran, cont.
• more than 1,600,000 dictionary units
• 20 domain dictionaries
• daily use by EC translators and administrators
of the European institutions
• originally a direct translation strategy
– see H&S
• today more of a transfer-based strategy
Ex. 1: fairly good translation / Systran sv-en
• "Enskilda företagare som inte bildat bolag
klassificeras hit."
• "Individual entrepreneurs that have not formed
companies are classified here."
• The system has recognised bildat as a perfect form
and translates the tense correctly as have formed,
with the negation not in the right place.
Ex. 2: word order problem / Systran sv-en
• "När byarna kontaktades hade de inte ens
utsatts för influensa."
• "When the villages were contacted had they
not even been exposed to flu."
• The system has not identified the subject and the
predicate and therefore produces the wrong word order.
Ex. 3: ambiguity problem / Systran sv-en
• "Vad kan vi lära av Arrawetestammen?"
• "What can we faith of the Arawete?"
• The system does not find the connection between kan
and lära and therefore does not see that lära is a
verb.
Ex. 4: ambiguity problem / Systran sv-en
• "Extrapoleringen går till så här."
• "The extrapolation goes to so here."
• The system does not know the particle verb
gå till and therefore translates incorrectly, word
for word.
Systran Linguistic Resources
• Dictionaries
– POS Definitions
– Inflection Tables
– Decomposition Tables
– Segmentation Dictionaries
• Disambiguation Rules
• Analysis Rules
Systran Processing Steps
• Analysis
– Lookup
– Compound Decomposition
– Disambiguation
– Syntactic Analysis
– Compound Expansion
• Sentence Transfer
– Initial Target Structure
– Lookup
– Default Transfer of Attributes
– Structure Transformation
Systran Processing Steps (cont)
• Sentence Synthesis
– Structure Transformation
– Inflection lookup
– Surface Transformation
Motivations for transfer-based translation
• lexical ambiguity
• structural differences
See further Ingo 91
Example 1
Sv. Fyll på olja i växellådan. →
En. Fill gearbox with oil.
(from the Scania corpus)
• fyll på → fill
• obj → adv
• adv → obj
Example 2
Sv. I oljefilterhållaren sitter en överströmningsventil. →
En. The oil filter retainer has an overflow valve.
(from the Scania corpus)
• sitter → has
• adv → subj
• subj → obj
Transfer-based translation
• intermediary sentence structure
• basic processes
– analysis
– transfer
– generation (synthesis)
• language modules
– dictionary and grammar of SL
– transfer dictionary and transfer rules
– dictionary and grammar of TL
Levels of intermediary structure
[Figure: diagram leading from SL to TL, with the labels Direct translation, Transfer, Metal, Multra and Interlingua]
• cf. J&M, Chapter 21
• word order
Metal
• See H&S
MULTRA
Multilingual Support for Translation and Writing
• translation engine
• transfer-based
– shake-and-bake
• modular
• unification-based
• preference machinery
• traceable
Analysis
• chart parser (Lisp → C)
– procedural formalism
• unification and other kinds of operations
• sentence structure
– feature structure
– grammatical relations
– surface order implicit via grammatical relations
See further Sågvall Hein & Starbäck (99), Weijnitz (02), Dahllöf (89)
Transfer
• unification-based
• declarative formalism
– Multra transfer formalism (Beskow 93)
• lexical and structural rules
• rules are partially ordered
• a more specific rule takes precedence over a less
specific one
– specificity in terms of number of transfer equations
• all applicable rules are applied
• written in Prolog
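The ordering of rules by specificity can be illustrated with a short sketch. It is written in Python rather than Prolog and does not use real unification: a rule is reduced to a flat set of source-side equations, and the number of equations serves as the specificity measure. The `Rule` class, the flattened feature paths and the label `tänka-think` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    label: str
    source: dict  # flattened source-side equations: feature path -> required value


def applicable(rule: Rule, features: dict) -> bool:
    # A rule applies when every one of its equations holds in the input.
    return all(features.get(path) == value for path, value in rule.source.items())


def ranked_rules(rules: list[Rule], features: dict) -> list[Rule]:
    """Applicable rules, most specific (largest number of equations) first."""
    hits = [r for r in rules if applicable(r, features)]
    return sorted(hits, key=lambda r: len(r.source), reverse=True)


rules = [
    Rule("tänka-think", {"verb.lex": "tänka.vb.1"}),
    Rule("tänka.på-think.about",
         {"verb.lex": "tänka.vb.1", "obj.prep.prep.lex": "på.pp.1"}),
    Rule("trycka.in-press", {"verb.lex": "trycka.vb+in.ab.1"}),
]
features = {"verb.lex": "tänka.vb.1", "obj.prep.prep.lex": "på.pp.1"}
labels = [r.label for r in ranked_rules(rules, features)]
```

Both matching rules stay in the result, in line with the point that all applicable rules are applied; the ordering only decides which one is preferred.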
Generation
• syntactic generation
– Multra syntactic generation formalism (Beskow 97a)
– PATR-like style
• unification
• concatenation
• typed features
• morphological generation (Beskow 97b)
– lexical insertion rules
– morphological realisation and phonological finish in
Prolog
• written in Prolog
An example: Tippa hytten.
Tippa hytten. :
(* = (PHR.CAT = CL
      MODE = IMP
      SUBJ = 2ND
      VERB = (WORD.CAT = VERB
              INFF = IMP
              DIAT = ACT
              LEX = TIPPA.VB.1
              VSURF = +)
      OBJ.DIR = (PHR.CAT = NP
                 NUMB = SING
                 GENDER = UTR
                 CASE = BASIC
                 DEF = DEF
                 HEAD = (LEX = HYTT.NN.1
                         WORD.CAT = NOUN))
      REG = (V1.LEM = TIPPA.VB)
      SEP = (WORD.CAT = SEP
             LEX = STOP.SR.0)))
Transfer structure
[VERB : [WORD.CAT : VERB
         LEX : TILT.VB.0
         DIAT : ACT
         INFF : IMP]
 OBJ.DIR : [PHR.CAT : NP
            DEF : DEF
            NUMB : SING
            HEAD : [WORD.CAT : NOUN
                    LEX : CAB.NN.0]]
 MODE : IMP
 SUBJ : 2ND
 VSURF : +
 SEP : [WORD.CAT : SEP
        LEX : STOP.SR.0]
 PHR.CAT : CL]
Generation
Tilt the cab.
A grammar rule
defrule legal.obj {
<?1 phr.cat> = 'np,
not <?1 case> = 'gen,
not <?1 case> = 'subj
}
Transfer rules
• copy feature
• delete feature
• transfer feature
• assign feature
Copy feature
LABEL
mode
SOURCE
<* mode> = ?x1
TARGET
<* mode> = ?x2
TRANSFER
Delete feature
LABEL
REG
SOURCE
<* REG> = ANY
TARGET
<*> = <*>
TRANSFER
Transfer feature
LABEL
OBJ.DIR
SOURCE
<* OBJ.DIR> = ?x1
TARGET
<* OBJ.DIR> = ?x2
TRANSFER
?x1 <=> ?x2
Define feature
LABEL
trycka.in-press
SOURCE
<* lex sym>=trycka.vb+in.ab.1
<* word.cat>=VERB
TARGET
<* lex>=press.vb.1
<* word.cat>=VERB
TRANSFER
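The four rule types shown on the preceding slides (copy, delete, transfer and define/assign) can be sketched as one recursive walk over a feature structure held in nested Python dictionaries. This is a simplification under stated assumptions, not the Multra transfer formalism: the tables `LEX`, `COPY` and `DELETE` are hypothetical, there is no unification, and the result is not claimed to match the system's actual transfer structure in every detail.

```python
# Hypothetical lexical transfer table (source lexeme -> target lexeme).
LEX = {"TIPPA.VB.1": "TILT.VB.0", "HYTT.NN.1": "CAB.NN.0"}

# Hypothetical sets of atomic features that are copied or dropped.
COPY = {"PHR.CAT", "WORD.CAT", "MODE", "SUBJ", "DIAT", "INFF", "DEF", "NUMB", "VSURF"}
DELETE = {"REG", "GENDER", "CASE"}


def transfer(source: dict) -> dict:
    target = {}
    for feature, value in source.items():
        if feature in DELETE:
            continue                                 # delete feature
        if feature == "LEX":
            target[feature] = LEX.get(value, value)  # define/assign target lexeme
        elif isinstance(value, dict):
            target[feature] = transfer(value)        # transfer feature (recursive)
        elif feature in COPY:
            target[feature] = value                  # copy feature
        # Atomic features listed nowhere are silently dropped in this sketch.
    return target


source = {
    "PHR.CAT": "CL",
    "MODE": "IMP",
    "VERB": {"WORD.CAT": "VERB", "LEX": "TIPPA.VB.1"},
    "OBJ.DIR": {"PHR.CAT": "NP", "GENDER": "UTR", "CASE": "BASIC", "DEF": "DEF",
                "HEAD": {"LEX": "HYTT.NN.1", "WORD.CAT": "NOUN"}},
    "REG": {"V1.LEM": "TIPPA.VB"},
}
target = transfer(source)
```

Source-language features such as gender and case disappear on the way, while the lexical features are replaced by their target counterparts.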
A generation rule
LABEL CL.IMP
X1 ---> X2 X3 X4 :
<X1 PHR.CAT> = CL
<X1 VERB> = <X2>
<X1 TYPE> = IMP
<X1 OBJ.DIR> = <X3>
<X1 SEP> = <X4>
A contextual lexical rule
LABEL
tänka.på-think.about
SOURCE
<* verb lex sym> = tänka.vb.1
<* obj.prep phr.cat> = pp
<* obj.prep prep> = ?prep
<* obj.prep prep lex sym> = på.pp.1
<* obj.prep rect> = ?rect1
TARGET
<* obj.prep phr.cat> = pp
<* obj.prep prep word.cat> = PREP
<* obj.prep prep lex> = about.pp.1
<* obj.prep rect> = ?rect2
TRANSFER
?rect1<=>?rect2
A generation trace
1-Applying Rule cl-sep
1- Applying Rule cl.imp
1- Applying Rule subj2nd-verb-obj.dir
1Applying Rule verb.main.act
1Applying Rule np.the-df
1Applying Rule ng.noun-def
1-Success!
Language resources in the MATS system
• dictionary in a database with different views
• analysis grammar
• transfer grammar
– incl. contextually defined lexical rules
• generation grammar
The MATS system
Frozen demo…
sv-en_LinkLexicon
en-Inflections
en_LemmaLexicon
en_LexemeLexicon
en_Lexicon
en_StemLexicon
sv_Inflections
sv_LemmaLexicon
sv_LexemeLexicon
sv_Lexicon
sv_StemLexicon
Assignment 2: Working with MATS
http://stp.ling.uu.se/~evapet/mt04/assignment2.html
Lexicalistic translation
• Identify (lexical) translation units in the
source sentence
• Translate each unit separately (considering
the context)
• Order the result in agreement with a model
of the target language
Formulation due to Lars Ahrenberg; see further AH (reading
list); see also Beaven, John L., Shake-and-Bake Machine
Translation. COLING-92, Nantes, 23-28 August 1992.
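The three steps of lexicalistic translation can be sketched on the Scania example "Fyll på olja i växellådan." The sketch is a toy under stated assumptions: the unit lexicon `UNITS`, the bigram scores in `BIGRAMS` and all function names are hypothetical, the context-dependent choice of a translation is reduced to a fixed table, and the target model is a handful of hand-set bigram scores rather than a trained model.

```python
from itertools import permutations

# Hypothetical unit lexicon (multi-word source units allowed) and bigram scores.
UNITS = {("fyll", "på"): "fill", ("olja",): "oil",
         ("i",): "with", ("växellådan",): "gearbox"}
BIGRAMS = {("<s>", "fill"): 3, ("fill", "gearbox"): 2,
           ("gearbox", "with"): 2, ("with", "oil"): 3}


def identify_units(tokens):
    """Step 1: greedy longest match of source tokens against the unit lexicon."""
    units, i = [], 0
    while i < len(tokens):
        for n in range(len(tokens) - i, 0, -1):
            key = tuple(tokens[i:i + n])
            if key in UNITS or n == 1:  # unknown single tokens become units too
                units.append(key)
                i += n
                break
    return units


def score(words):
    # Sum of bigram scores, with a sentence-start marker in front.
    return sum(BIGRAMS.get(pair, 0) for pair in zip(["<s>"] + words, words))


def translate(tokens):
    # Step 2: translate each unit on its own (unknown units are copied).
    bag = [UNITS.get(u, " ".join(u)) for u in identify_units(tokens)]
    # Step 3: order the bag with the target model; exhaustive search over
    # permutations is only feasible for very short sentences.
    return list(max(permutations(bag), key=lambda p: score(list(p))))


tokens = ["fyll", "på", "olja", "i", "växellådan"]
ordered = translate(tokens)
```

No structural transfer rule is involved: the change of word order comes entirely from the target-side ordering step.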
T4F – a lexicalistic system
• processes in T4F
– tokenisation
– tagging
– transfer
– transposition
– filtering
See further AH (in the reading list)
Interlingua translation
• See SN
Applications of alignment
• translation memories
• translation dictionaries
• lexicalistic translation
• statistical machine translation
• example-based translation
Translation memories
• based on sentence links
• optionally, sub-sentence links
See further Macklovitch, E. (2000)
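A translation memory built on sentence links can be sketched as a look-up table with exact and fuzzy matching. The sketch below is an assumption-laden illustration: the two stored pairs are taken from examples elsewhere in these slides, the name `tm_lookup` and the threshold of 0.8 are arbitrary, and character-based similarity from the standard library module `difflib` stands in for whatever matching a real product uses.

```python
from difflib import SequenceMatcher

# Toy memory of sentence links (source sentence -> stored translation).
MEMORY = {
    "Tippa hytten.": "Tilt the cab.",
    "Fyll på olja i växellådan.": "Fill gearbox with oil.",
}


def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()


def tm_lookup(sentence: str, threshold: float = 0.8):
    """Return (stored translation, similarity) for the best match, or None."""
    best_source = max(MEMORY, key=lambda s: similarity(sentence, s))
    best = similarity(sentence, best_source)
    if best >= threshold:
        return MEMORY[best_source], best
    return None


exact = tm_lookup("Tippa hytten.")
fuzzy = tm_lookup("Tippa hytten!")
miss = tm_lookup("Byt luftfilter.")
```

A fuzzy hit returns the stored translation unchanged, so it still needs checking by a translator; the similarity value is returned to support that decision.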
Translation dictionaries
• based on word links
• refinement of word links
Refinement of word alignment data
• neutralise capital letters where appropriate
• lemmatise or tag source and target units
• identify ambiguities
– search for criteria to resolve them
• identify partial links
– compounds?
– remove or complete them
• manual revision?
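Two of the refinement steps, neutralising capital letters and identifying ambiguities, can be sketched on a list of raw word links. The function name `refine_links` and the sample links are assumptions; lower-casing is applied blindly here, whereas the slide asks for it only where appropriate, and lemmatisation and the handling of partial links are left out.

```python
from collections import Counter, defaultdict


def refine_links(links):
    """Lower-case word links, count them, and list ambiguous source words."""
    counts = Counter((s.lower(), t.lower()) for s, t in links)
    targets = defaultdict(set)
    for s, t in counts:
        targets[s].add(t)
    # A source word linked to more than one target is flagged as ambiguous.
    ambiguous = {s: sorted(ts) for s, ts in targets.items() if len(ts) > 1}
    return counts, ambiguous


links = [("Motorn", "The engine"), ("motorn", "the engine"),
         ("motorn", "the motor"), ("olja", "oil")]
counts, ambiguous = refine_links(links)
```

The flagged entries are the ones for which resolution criteria have to be found, for instance from the context of the link.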
Informally about statistical MT
• build a translation dictionary based on word
alignment
• aim for as big fragments as possible
• keep information on link frequency
• build an n-gram model of the target language
• implement a direct translation strategy
– including alternatives ordered by length and frequency
• process the output by the n-gram model filtering
out the best alternatives and adjust the translation
accordingly
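The informal recipe above can be sketched end to end on toy data. Everything in the example is an assumption made for illustration: the word links, the two-sentence target corpus and the function names. Alternatives are ordered by frequency only (not by length), the n-gram model is a plain bigram count, and all combinations of alternatives are scored exhaustively, which only works for very short inputs.

```python
from collections import Counter
from itertools import product


def build_dictionary(links):
    """Map each source word to its target alternatives, most frequent first."""
    table = {}
    for (src, tgt), _freq in Counter(links).most_common():
        table.setdefault(src, []).append(tgt)
    return table


def bigram_model(corpus):
    # Bigram counts over target sentences, with a sentence-start marker.
    return Counter(pair for sent in corpus for pair in zip(["<s>"] + sent, sent))


def translate(tokens, table, model):
    # Direct translation: unknown words are copied, known words contribute
    # all their alternatives; the bigram model picks the best combination.
    options = [table.get(tok, [tok]) for tok in tokens]

    def score(words):
        return sum(model[pair] for pair in zip(("<s>",) + words, words))

    return list(max(product(*options), key=score))


links = [("motorn", "engine"), ("motorn", "engine"),
         ("motorn", "motor"), ("torkar", "wiper")]
table = build_dictionary(links)
model = bigram_model([["wiper", "motor"], ["engine", "oil"]])
best = translate(["torkar", "motorn"], table, model)
```

In the example the target language model overrides the most frequent dictionary alternative, which is the filtering effect described in the last bullet.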
Example-based MT
• See HS (in the reading list)
Some current research topics
• intersentential dependencies
• hybrid systems: data-driven and rule-driven
• improved alignment techniques
• improved language modeling in ST
• automatic learning from post-editing
• translation by structural correspondences
• translation of spoken language
• improved preference strategies
• ambiguity preserving translation
Intersentential dependencies
• pronoun resolution
• lexical ambiguity resolution, such as
– (torkar)motorn → the motor
– (förbrännings)motorn → the engine
• fluency
Preserving the information structure
• information structure is expressed in
different ways in the source and the target
• syntactic clues are exploited in the analysis
to compute the information structure (topic-focus articulation)
• information structure is used to guide the
generation
An example
Sv. Torkarmotorn M2 är sammankopplad med omkopplare S24 och
intervallrelä R22. För att inte motorn skall överbelastas, t.ex. om
torkarbladen fastnat, finns en inbyggd termovakt som bryter
strömmen till motorn när …
En. Wiper motor M2 is connected to switch S24 and intermittent relay
R22. To prevent motor overload, e.g. if the wiper blade gets stuck,
there is an integral thermal sensor which breaks the current to the
motor when …
Preferences
• syntactic preferences
– the principle of right association
– the principle of minimal attachment
– two-stage processing
• semantic preferences
– lexical selectional restrictions
– lexical contextual rules
– conceptual taxonomies
– likelihood of occurrence
See further Bennett, P. & Paggio, P., 1993, Preference in Eurotra.
Preferences in Multra
• parsing
– a formalism for expressing syntactic
preferences in the parse
• not fully developed
• transfer
– contextual lexical rules
– rule specificity
• generation
– rule specificity
Hybrid systems
• aims
• components
• problems
• architecture
• scores
Aims of a hybrid system
• simple techniques for simple tasks
• complex techniques for complex tasks
Components of a hybrid system
• component strategies
– translation memory
• full sentences
• fragments
• direct translation
– statistical translation
– EBMT
Component strategies, cont’d
• rule-based translation
– simplistic analysis (cf. direct translation)
• word by word (S → sequence of words)
• phrase by phrase (S → sequence of phrases)
– partial parsing
– full parsing
Problems of a hybrid system
• how does the system know when a simple
technique is appropriate?
– does the source tell?
– does the target tell?
Architecture and scores
• simple first?
• concerting results?
• scoring?
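One possible answer to the "simple first?" question is a cascade in which each component either returns a translation or declines, so that the next, more complex component takes over. The sketch below shows only that control flow; the two components, their toy data and the name `PIPELINE` are assumptions, and the alternatives raised above (combining results from several components, scoring them) are not covered.

```python
from typing import Callable, Optional

Component = Callable[[str], Optional[str]]


def memory_lookup(sentence: str) -> Optional[str]:
    memory = {"Tippa hytten.": "Tilt the cab."}  # toy translation memory
    return memory.get(sentence)


def word_by_word(sentence: str) -> Optional[str]:
    lexicon = {"olja": "oil", "hytten": "the cab"}  # toy dictionary
    words = sentence.rstrip(".").split()
    if not all(w.lower() in lexicon for w in words):
        return None  # decline, so that a more complex component can take over
    return " ".join(lexicon[w.lower()] for w in words)


def translate(sentence: str, components: list[tuple[str, Component]]):
    """Simple first: (component name, output) of the first component that succeeds."""
    for name, component in components:
        result = component(sentence)
        if result is not None:
            return name, result
    return None


PIPELINE = [("translation memory", memory_lookup), ("word by word", word_by_word)]
first = translate("Tippa hytten.", PIPELINE)
second = translate("Olja.", PIPELINE)
none = translate("Byt filter.", PIPELINE)
```

In this design the source side decides: a component declines when it cannot cover the input, which addresses "does the source tell?" but not "does the target tell?".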
Improved techniques for re-use of translation
• combining clues for word alignment
(Tiedemann 2003)
• interactive word alignment (Ahrenberg
et al. 2003)
• parallel treebanks
Translation by structural correspondences
• LFG
• HPSG
Translation of spoken language
See Krauwer, Steven (ed.), 2000, Machine Translation, June 2000,
Volume 15, Issue 1-2, Special issue on Spoken Language Translation.