Language Divergences and Solutions
Advanced Machine Translation Seminar
Alison Alvarez
Overview

 Introduction
 Morphology Primer
 Translation Mismatches
   Types
   Solutions
 Translation Divergences
   Types
   Solutions
   Different MT Systems
   Generation-Heavy Machine Translation
   DUSTer
Source ≠ Target

 Languages don't encode the same information in the same way
   Makes MT complicated
   Keeps all of us employed
Morphology in a Nutshell

 Morphemes are word parts
   Work +er
   Iki +ta +ku +na +ku +na +ri +ma +shi +ta
 Types of Morphemes
   Derivational: makes a new word
   Inflectional: adds information to an existing word
Morphology in a Nutshell

 Analytic/Isolating
   Little or no inflectional morphology
   Vietnamese, Chinese
   "I was made to go": separate words
 Synthetic
   Lots of inflectional morphology
   Fusional vs. Agglutinating
   Romance languages, Finnish, Japanese, Mapudungun
   Ika (to go) +se (to make/let) +rare (passive) +ta (past tense)
   He need +s (3rd person singular) it.
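The agglutinating pattern above can be made concrete with a small sketch: a greedy, longest-match-first segmenter over a toy morpheme lexicon built from the slide's Japanese example "ika +se +rare +ta". The lexicon and function names are illustrative, not a real morphological analyzer.

```python
# Toy morpheme lexicon from the slide's example; a real analyzer would
# need allomorphy, phonology, and a much larger lexicon.
GLOSSES = {
    "ika": "go (stem)",
    "se": "causative (make/let)",
    "rare": "passive",
    "ta": "past tense",
}

def segment(word, lexicon):
    """Split `word` left to right into the longest known morphemes."""
    morphemes = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest match first
            if word[i:j] in lexicon:
                morphemes.append(word[i:j])
                i = j
                break
        else:
            return None  # unanalyzable residue
    return morphemes

parts = segment("ikaserareta", GLOSSES)
# parts == ["ika", "se", "rare", "ta"], i.e. "was made to go"
```

Greedy longest-match is enough for this toy case; ambiguity between competing segmentations is exactly what makes real morphological analysis hard.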
Translation Differences

 Types
   Translation Mismatches: different information from source to target
   Translation Divergences: same information from source to target, but the meaning is distributed differently in each language
Translation Mismatches

 "…the information that is conveyed is different in the source and target languages"
 Types:
   Lexical level
   Typological level
Lexical Mismatches

 A lexical item in one language may have more distinctions than in another
 Brother
   兄 (ani): older brother
   弟 (otouto): younger brother
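The "brother" example can be sketched as code: the target lexicon keys on a feature (relative age) that the English source does not supply, so an underspecified source word yields a set of candidates rather than one translation. The lexicon and function are a toy illustration.

```python
# Toy English->Japanese lexicon: the target side requires an
# "older"/"younger" feature that English "brother" does not carry.
TARGET = {
    ("brother", "older"): "ani",      # 兄
    ("brother", "younger"): "otouto", # 弟
}

def translate(word, age=None):
    """Return the target word, or all candidates if the feature is missing."""
    if age is not None:
        return TARGET[(word, age)]
    # Underspecified source: every candidate survives until context
    # or a language model disambiguates.
    return [t for (w, _), t in TARGET.items() if w == word]

print(translate("brother", "older"))  # "ani"
print(translate("brother"))           # ["ani", "otouto"]
```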
Typological Mismatches

 Mismatch between languages with different levels of grammaticalization
 One language may be more structurally complex
   Source marking, obligatory subjects
Typological Mismatches

 Source marking: Quechua vs. English
   (they say) s/he was singing --> takisharansi
   taki (sing) +sha (progressive) +ra (past) +n (3rd sg) +si (reportative)
 Obligatory arguments: English vs. Japanese
   Kusuri wo nonda --> (I, you, etc.) took medicine.
   Makasemasu! --> (I'll) leave (it) to (you)
Translation Mismatch Solutions

 More information --> less information (easy)
 Less information --> more information (hard)
   Context clues
   Language models
   Generalization
   Formal representations
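The hard direction, less information to more, can be sketched as generate-and-rank: for the Japanese pro-drop example "Kusuri wo nonda", an English target must choose a subject, and a language model can rank the candidates. The counts below are invented for illustration; a real system would use a trained LM and context.

```python
# Sketch: fill an argument the source left unexpressed (pro-drop).
CANDIDATE_SUBJECTS = ["I", "you", "he", "she", "they"]

# Hypothetical counts standing in for a real language model / context clues.
TOY_LM_COUNTS = {"I": 50, "you": 10, "he": 8, "she": 8, "they": 4}

def expand(candidates, counts):
    """Rank candidate fillers for the dropped subject slot."""
    ranked = sorted(candidates, key=lambda w: -counts[w])
    return [f"{w} took medicine." for w in ranked]

best = expand(CANDIDATE_SUBJECTS, TOY_LM_COUNTS)[0]
# best == "I took medicine."
```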
Translation Divergences

 "…the same information is conveyed in source and target texts"
 Divergences are quite common
   Occur in about one out of every three sentences in the TREC El Norte newspaper corpus (Spanish-English)
 Sentences can have multiple kinds of divergences
Translation Divergence Types

 Categorial Divergence
 Conflational Divergence
 Structural Divergence
 Head Swapping Divergence
 Thematic Divergence
Categorial Divergence

 Translation that uses different parts of speech
   Tener hambre (have hunger) --> be hungry
   Noun --> adjective
Conflational Divergence

 The translation of two words using a single word that combines their meaning
 Can also be called a lexical gap
   X stab Z --> X dar puñaladas a Z (X give stabs to Z)
   glastuinbouw --> cultivation under glass
Structural Divergence

 A difference in the realization of incorporated arguments
   PP to object
   X entrar en Y (X enter in Y) --> X enter Y
   X ask for a referendum --> X pedir un referendum (ask-for a referendum)
Head Swapping Divergence

 Involves the demotion of a head verb and the promotion of a modifier verb to head position
 [Parse trees comparing "Yo entro en el cuarto corriendo" with "I ran into the room."]
Thematic Divergence

 This divergence occurs when sentence arguments switch argument roles from one language to another
   X gustar a Y (X please to Y) --> Y like X
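The divergence types above can be written as rewrites over a minimal (head, subject, object) triple. The sketch below encodes the slides' thematic and categorial examples; the rule format is a toy for illustration, not the representation used by any of the cited systems.

```python
# Toy transfer rules over (head, subject, object) triples.
def transfer(pred):
    head, subj, obj = pred
    if head == "gustar":
        # Thematic divergence: X gustar a Y --> Y like X (roles swap).
        return ("like", obj, subj)
    if (head, obj) == ("tener", "hambre"):
        # Categorial divergence: tener hambre (have hunger) --> be hungry.
        return ("be hungry", subj, None)
    return pred  # no divergence rule applies

print(transfer(("gustar", "la música", "Juan")))  # ('like', 'Juan', 'la música')
print(transfer(("tener", "Juan", "hambre")))      # ('be hungry', 'Juan', None)
```

Note how the thematic rule permutes arguments while the categorial rule changes the head's part of speech; a sentence exhibiting both would need the rules composed, which is exactly the "multiple divergences" problem mentioned above.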
Divergence Solutions and Statistical/EBMT Systems

 Not really addressed explicitly in SMT
 Covered in EBMT only if it is covered extensively in the data
Divergence Solutions and Transfer Systems

 Hand-written transfer rules
 Automatic extraction of transfer rules from bi-texts
 Problematic with multiple divergences
Divergence Solutions and Interlingua Systems

 Mel'čuk's Deep Syntactic Structure
 Jackendoff's Lexical Semantic Structure
 Both require "explicit symmetric knowledge" from both source and target language
 Expensive
Divergence Solutions and Interlingua Systems

 John swam across a river
   [event CAUSE JOHN
     [event GO JOHN [path ACROSS JOHN
       [position AT JOHN RIVER]]]
     [manner SWIM+INGLY]]
 Juan cruza el río nadando
Generation-Heavy MT

 Built to address language divergences
 Designed for source-poor/target-rich translation
 Non-interlingual
 Non-transfer
 Uses symbolic overgeneration to account for different translation divergences
Generation-Heavy MT

 Source language
   Syntactic parser
   Translation lexicon
 Target language
   Lexical semantics, categorial variations & subcategorization frames for overgeneration
   Statistical language model
GHMT System

 Analysis Stage
   Independent of target language
   Creates a deep syntactic dependency
   Only argument structure, top-level conceptual nodes & thematic-role information
   Should normalize over syntactic & morphological phenomena
 Translation Stage
   Converts SL lexemes to TL lexemes
   Maintains dependency structure
Analysis/Translation Stage

 [Dependency diagram: GIVE (v) [cause go] with arguments I (agent), STAB (n) (theme), JOHN (goal)]
Generation Stage

 Lexical & Structural Selection
   Conversion to a thematic dependency
     Uses syntactic-thematic linking map
     "Loose" linking
   Structural expansion
     Addresses conflation & head-swapped divergences
   Turn thematic dependency into TL syntactic dependency
     Addresses categorial divergence

Generation Stage: Structural Expansion
Generation Stage

 Linearization step
   Creates a word lattice to encode different possible realizations
   Implemented using the oxyGen engine
 Sentences ranked & extracted
   Nitrogen's statistical extractor
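The overgenerate-and-rank idea can be sketched in a few lines: symbolic generation proposes several realizations (here, the conflated English verb versus a literal rendering of "dar puñaladas a"), and a statistical language model picks the best one. The candidate list and bigram scores are invented for illustration; the real system uses oxyGen word lattices and Nitrogen's statistical extractor.

```python
import math

# Overgenerated realizations of the same thematic dependency.
CANDIDATES = [
    "I stabbed John",        # conflated English realization
    "I gave stabs to John",  # literal rendering of "dar puñaladas a"
]

# Invented bigram probabilities standing in for a trained LM.
TOY_BIGRAMS = {("I", "stabbed"): 0.4, ("stabbed", "John"): 0.5,
               ("I", "gave"): 0.3, ("gave", "stabs"): 0.01,
               ("stabs", "to"): 0.05, ("to", "John"): 0.4}

def score(sentence, bigrams, floor=1e-6):
    """Log-probability of a sentence under the toy bigram model."""
    words = sentence.split()
    return sum(math.log(bigrams.get(pair, floor))
               for pair in zip(words, words[1:]))

best = max(CANDIDATES, key=lambda s: score(s, TOY_BIGRAMS))
# best == "I stabbed John"
```

The symbolic side guarantees that divergent realizations are in the lattice at all; the statistical side only has to choose among them.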
GHMT Results

 4 of 5 Spanish-English divergences "can be generated using structural expansion & categorial variations"
 The remaining 1 of 5 needed more world knowledge or idiom handling
 An SL syntactic parser can still be hard to come by
Divergences and DUSTer

 Helps to overcome divergences for word alignment & improve coder agreement
 Changes an English sentence's structure to resemble another language
 More accurate alignment and projection of dependency trees without training on dependency tree data
DUSTer

 Motivation for the development of automatic correction of divergences:
 1. "Every language pair has translation divergences that are easy to recognize"
 2. "Knowing what they are and how to accommodate them provides the basis for refined word level alignment"
 3. "Refined word-level" alignment results in improved projection of structural information from English to another language
DUSTer

 Bi-text parsed on English side only
 "Linguistically motivated" & common search terms
 Conducted on Spanish & Arabic (and later Chinese & Hindi)
 Uses all of the divergences mentioned before, plus a "light verb" divergence
   try --> put to trying --> poner a prueba
DUSTer Rule Development Methods

 Identify canonical transformations for each divergence type
 Categorize English sentences into divergence type or "none"
 Apply appropriate transformations
 Humans align E --> E' --> foreign language
DUSTer Rules

# "kill" => "LightVB kill(N)" (LightVB = light verb)
# Presumably, this will work for "kill" => "give death to"
# "borrow" => "take lent (thing) to"
# "hurt" => "make harm to"
# "fear" => "have fear of"
# "desire" => "have interest in"
# "rest" => "have repose on"
# "envy" => "have envy of"

type1.B.X [English{2 1 3} Spanish{2 1 3 4 5}]
[ Verb<1,i,CatVar:V_N> [ Noun<2,j,Subj> ] [ Noun<3,k,Obj> ] ] <-->
[ LightVB<1,Verb> [ Noun<2,j,Subj> ] [ Noun<3,i,Obj> ]
  [ Oblique<4,Pred,Prep> [ Noun<5,k,PObj> ] ] ]
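The effect of this light-verb rule can be sketched as code: the English verb is demoted to its nominal categorial variant (CatVar:V_N) under a light verb, and the original object moves into an oblique PP. The small lexicons below are toy stand-ins for DUSTer's actual rule resources.

```python
# Toy resources standing in for DUSTer's light-verb and CatVar tables.
LIGHT_VERB = {"kill": "give", "fear": "have"}
CATVAR_NOUN = {"kill": "death", "fear": "fear"}  # V -> N categorial variant
PREP = {"kill": "to", "fear": "of"}

def to_light_verb(verb, subj, obj):
    """V Subj Obj --> LightVB Subj N Prep PObj, per the rule above."""
    return f"{subj} {LIGHT_VERB[verb]} {CATVAR_NOUN[verb]} {PREP[verb]} {obj}"

print(to_light_verb("kill", "John", "Bill"))  # "John give death to Bill"
print(to_light_verb("fear", "John", "dogs"))  # "John have fear of dogs"
```

The transformed English (E') now mirrors the Spanish structure word for word, which is what makes the subsequent word-level alignment straightforward.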
Conclusion

 Divergences are common
 They are not handled well by most MT systems
 GHMT can account for divergences, but still needs development
 DUSTer can handle divergences through structure transformations, but requires a great deal of linguistic knowledge
The End

Questions?
References
Dorr, Bonnie J., "Machine Translation Divergences: A Formal Description and Proposed Solution," Computational Linguistics, 20:4, pp. 597--633, 1994.
Dorr, Bonnie J. and Nizar Habash, "Interlingua Approximation: A Generation-Heavy Approach," Proceedings of the Workshop on Interlingua Reliability, Fifth Conference of the Association for Machine Translation in the Americas, AMTA-2002, Tiburon, CA, pp. 1--6, 2002.
Dorr, Bonnie J., Clare R. Voss, Eric Peterson, and Michael Kiker, "Concept Based Lexical Selection," Proceedings of the AAAI-94 Fall Symposium on Knowledge Representation for Natural Language Processing in Implemented Systems, New Orleans, LA, pp. 21--30, 1994.
Dorr, Bonnie J., Lisa Pearl, Rebecca Hwa, and Nizar Habash, "DUSTer: A Method for Unraveling Cross-Language Divergences for Statistical Word-Level Alignment," Proceedings of the Fifth Conference of the Association for Machine Translation in the Americas, AMTA-2002, Tiburon, CA, pp. 31--43, 2002.
Habash, Nizar and Bonnie J. Dorr, "Handling Translation Divergences: Combining Statistical and Symbolic Techniques in Generation-Heavy Machine Translation," Proceedings of the Fifth Conference of the Association for Machine Translation in the Americas, AMTA-2002, Tiburon, CA, pp. 84--93, 2002.
Haspelmath, Martin, Understanding Morphology, Oxford University Press, 2002.
Kameyama, Megumi, Ryo Ochitani, and Stanley Peters, "Resolving Translation Mismatches With Information Flow," Annual Meeting of the Association for Computational Linguistics, 1991.