DECOMPOUNDING GERMAN WORDS

Introduction
Costanza Conforti
05.02.2015
Overview
§ Bird’s eye view
§ Preliminary steps:
– Identification
– Structure Analysis
§ German Compounds
§ Decompounding German words
– Decompounding Pipeline
§ Tasks
– State-of-the-art model (Alfonseca 2008)
BIRD'S EYE VIEW
Bird’s eye view
§ “There are probably no languages without either compounding,
affixing, or both. In other words, there are probably no purely
isolating languages. There are a considerable number of languages
without inflection, perhaps none without compounding and
derivation”
Greenberg 1963. In Scalise 2010
§ [X r Y]X
  – [N+A]N camposanto (Sp., 'field+holy', graveyard)
  – [N+[P+N]]N arc-en-ciel (Fr., 'arch in sky', rainbow)
§ [X r Y]Y
  – [A+N]N green card (Engl.), [N+N]N Taxifahrer (Germ.)
§ [X r Y]Z
  – [V+N]A 缺德 quēdé (Ch., 'lack+morals', immoral)
Scalise 2010
Bird's eye view
§ Hybrid status:
  syntactically complex units which behave in a syntactically simple way
  Matthews 1947, in Scalise 2010
§ Phonetics + syntax-morphology interface + semantics:
  complex interface domains of language competence
  Baroni 2010
§ Orthography:
  distinct words
  – unusual POS sequence
  – simplifies parsing
  vs. one-word compounds
  – lexical recognition
  – segmentation
PRELIMINARY STEPS
IDENTIFICATION
STRUCTURE ANALYSIS
Preliminary steps - Identification
§ High type frequency / low token frequency
§ Productive compounding:
  a sizeable number of new orthographic words will constantly be added to the language
  Baroni 2002
§ Training corpus: data sparseness
§ Decompounding: treat a compound as the concatenation of its units
Preliminary steps - Structure analysis
§ Bracketing/Labelling internal structure

Table 2. Frequency of compound types with different head positions in different language groups (Scalise 2010):

Head position   General %   Rom. %   Germ. %   Slav. %   East A. %
Right           66.7        40.7     87.0      61.9      57.5
No head         16.3        31.4     8.9       12.2      17.7
Left            6.8         20.3     1.9       6.0       6.8
Both            5.9         6.8      1.3       3.1       15.0
As can be seen, the Slavic group follows the general pattern more closely than the other groups. The Romance languages exhibit a relatively high percentage of exocentric constructions, while Germanic languages are more consistently right-headed. The East Asian languages are different in allowing a relatively high percentage of compounds with two heads, most of which are coordinate compounds.
In languages such as Italian, the exocentric pattern V+N is one of the most productive processes in compound formation. In some languages we also find a pattern that has been called 'absolute exocentricity' (Scalise, Fabregas & Forza 2009), from both a categorial and semantic point of view. In such compounds, the output category …
Preliminary steps - Structure analysis
§ Recursion:
  [child language]
  [[child language] acquisition]
  [[[child language] acquisition] research]
  [[[[child language] acquisition] research] group]
  [[[[[child language] acquisition] research] group] member]

This allows listeners, upon hearing the i-th compound constituent, to reuse the compound structure built up on the sequence w1, …, wi-1. Compounds that are easier to understand are more frequent overall. Lauer (1995) states the problem in probabilistic terms. The two structures in (8) and (9) presuppose the dependency chains illustrated in (11) below, where each pointed arc represents a 'dependant → head' relationship. Given two relations such as child → language and language → acquisition, the three words can give rise to one compound only: child language acquisition. In contrast, the pair of dependencies child → acquisition and language → acquisition can in principle produce two equally likely compounds: child language acquisition and language child acquisition. This means that child → acquisition and language → acquisition can generate twice as many compounds as child → language and language → acquisition do. If this is true in general, it implies, by straightforward application of Bayes' rules, that, given a compound ABC, the probability that it has a left-branching structure is twice the probability that it has a right-branching structure. This is what corpus data tell us.

(11) [[childN languageN] acquisitionN]N vs. [childN [languageN acquisitionN]N]N

How can a parser choose between the different bracketing structures in (11)? Work in the 80's and early 90's (Finin 1980, McDonald 1982, Gay & Croft 1990, Vanderwende 1994) proposes to deal with the problem through knowledge-intensive algorithms.
Baroni 2010
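To make the dependency reasoning above concrete, here is a small Python sketch in the spirit of Lauer's dependency model as summarised by Baroni 2010; the association probabilities, the fallback value and the function name are invented for illustration, not taken from Lauer (1995).

```python
# Toy comparison of left- vs right-branching analyses of a three-word compound.
# Dependency probabilities are made up for the example; a real system would
# estimate them from corpus co-occurrence counts.
P_DEP = {
    ("child", "language"): 0.020,         # child -> language
    ("language", "acquisition"): 0.030,   # language -> acquisition
    ("child", "acquisition"): 0.008,      # child -> acquisition
}

def branching(a, b, c, p=P_DEP, floor=1e-6):
    """Return 'left' for [[a b] c] or 'right' for [a [b c]]."""
    left = p.get((a, b), floor) * p.get((b, c), floor)    # dependencies a->b, b->c
    right = p.get((a, c), floor) * p.get((b, c), floor)   # dependencies a->c, b->c
    return "left" if left >= right else "right"

print(branching("child", "language", "acquisition"))  # -> 'left'
```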
Preliminary steps - Structure analysis
§ L/R branching: not dichotomous!

From Henrich 2011 (on Kraftfahrzeugsteuer): the immediate constituents of this compound are Kraftfahrzeug and steuer, with the first constituent then splitting further into Kraft and fahrzeug, etc. (see Figure 1). In order to be able to systematically link compounds in GermaNet to their constituent parts, compound splitting needs to be applied recursively and has to identify only the immediate constituents at each level of analysis.

[Figure 1. Compounds in GermaNet.]

GermaNet 8.0
Henrich 2011
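A minimal sketch of the recursive, immediate-constituent view described in the excerpt: at each level only the top-level split is identified, and the procedure recurses into the parts. The `binary_split` lookup below is a hypothetical stand-in for any of the splitters discussed later.

```python
# Recursive immediate-constituent analysis (toy example for Kraftfahrzeugsteuer).
def binary_split(word):
    """Hypothetical top-level splitter; returns (modifier, head) or None."""
    toy = {
        "kraftfahrzeugsteuer": ("kraftfahrzeug", "steuer"),
        "kraftfahrzeug": ("kraft", "fahrzeug"),
    }
    return toy.get(word)

def constituent_tree(word):
    split = binary_split(word)
    if split is None:
        return word                      # simplex word: leaf of the tree
    modifier, head = split
    return (constituent_tree(modifier), constituent_tree(head))

print(constituent_tree("kraftfahrzeugsteuer"))
# -> (('kraft', 'fahrzeug'), 'steuer'), mirroring the analysis in Figure 1
```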
GERMAN COMPOUNDS
German compounds
§ Statistics: APA newswire corpus (28 million words)
  – 47% of types, 7% of tokens
  – hapax/rare (frequency < 5)
  Baroni 2002
§ NN – AN – VN – PN – BN: 62% - 95%
  Baroni 2002 - Henrich 2011
German compounds
§ [[Modifier + (Linking morpheme) + (-)] + Head]
§ Bound forms:
  – Hände-druck
  – Farb-fernseher
  – Doktor-/Doktoren-
§ Paradigmatic morphemes == primitive words
  Baroni 2002
§ Semantic-morphologic properties
  Alfonseca 2008

Linking morphemes (Alfonseca 2008, Table 2):

Morpheme   Example
∅          Halbblutprinz (halb + Blutprinz)
-e         Wundfieber (Wunde + Fieber)
+s         Rechtsstellung (Recht + Stellung)
+e         Mauseloch (Maus + Loch)
+en        Armaturenbrett (Armatur + Brett)
+nen
+ens       Herzensangst (Herz + Angst)
+es        Tagesanbruch (Tag + Anbruch)
+ns        Willensbildung (Wille + Bildung)
+er        Bilderrahmen (Bild + Rahmen)
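As a rough illustration of how a splitter can use the morphemes in the table above, the Python sketch below maps a candidate modifier back to a lexicon entry by undoing a linking morpheme or a dropped final -e; the lexicon and the function name are invented for the example.

```python
# Undo linking morphemes before a dictionary lookup (toy lexicon).
LEXICON = {"wunde", "recht", "maus", "armatur", "herz", "tag", "wille", "bild"}
LINKING = ["", "s", "es", "e", "en", "nen", "ens", "ns", "er"]  # cf. the table above

def lemma_of_modifier(part):
    """Return a lexicon entry that `part` could realise, or None."""
    part = part.lower()
    for m in LINKING:
        stem = part[:-len(m)] if m else part
        if stem in LEXICON:
            return stem
        if stem + "e" in LEXICON:   # elided final -e, e.g. Wund- from Wunde
            return stem + "e"
    return None

print(lemma_of_modifier("tages"))    # -> 'tag'    (Tagesanbruch)
print(lemma_of_modifier("wund"))     # -> 'wunde'  (Wundfieber)
print(lemma_of_modifier("willens"))  # -> 'wille'  (Willensbildung)
```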
German compounds
Langer 1998
DECOMPOUNDING
GERMAN WORDS
Guidelines
Decompounding German words
§ Decompounding strategies: LINGUISTIC – CONSERVATIVE – AGGRESSIVE
  Braschler 2003
§ Over-recognition (Schweiner-ei) + lexicalized compounds (Frühstück, Freitag)
  Goldsmith 2002
Decompounding Pipeline (Koehn & Knight 2003, Empirical Methods for Compound Splitting)

1. [Prior step: compound list]
   e.g. Bispo Santos 2014; GermaNet 8.0: semi-automatically generated
2. Calculate every possible way of splitting a word into one or more parts:
   – each part must be a known word (corpus), allowing for linking morphemes
   – only binary splits, applied recursively
   [Figure 1 (Koehn 2003): splitting options for the German word Aktionsplan, e.g. aktionsplan ('action plan'); aktion + plan ('action plan'); akt + ion + plan ('act ion plan')]
3. Score those parts. Weighting function:
   – frequency-based methods
   – probability-based methods
4. Take the highest-scoring decomposition
   (if it contains just one part, the word is not a compound)

Koehn 2003
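A compact Python sketch of steps 2-4, in the spirit of Koehn & Knight's frequency-based (geometric mean) method; the corpus counts, minimum part length and linking-morpheme list are illustrative assumptions, not the authors' actual resources.

```python
from math import prod

# Illustrative unigram counts; a real system would use counts from a large corpus.
FREQ = {"aktion": 852, "plan": 710, "akt": 224, "ion": 180, "aktionsplan": 7}
LINKING = ["s", "es", "en", "n", "e", "er", "ens", "ns", "nen"]

def candidate_splits(word, min_part=3):
    """Step 2: enumerate splits into known parts, allowing a linking morpheme
    at each internal boundary (binary split applied recursively)."""
    results = [[word]] if word in FREQ else []
    for i in range(min_part, len(word) - min_part + 1):
        left = word[:i]
        if left not in FREQ:
            # try to strip a linking morpheme from the left part
            left = next((left[:-len(m)] for m in LINKING
                         if left.endswith(m) and left[:-len(m)] in FREQ), None)
            if left is None:
                continue
        for rest in candidate_splits(word[i:], min_part):
            results.append([left] + rest)
    return results

def score(parts):
    """Step 3: geometric mean of the parts' corpus frequencies."""
    return prod(FREQ.get(p, 1) for p in parts) ** (1.0 / len(parts))

def split_compound(word):
    """Step 4: take the highest-scoring decomposition (one part = not a compound)."""
    candidates = candidate_splits(word) or [[word]]
    return max(candidates, key=score)

print(split_compound("aktionsplan"))  # -> ['aktion', 'plan'] under these toy counts
```

A probability-based variant would simply swap the weighting function, e.g. replacing the geometric mean of raw frequencies with normalised unigram probabilities.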
TASKS
Tasks
§ Prediction in AAC systems: Baroni et al. 2002
§ Speech understanding systems: Adda et al. 2009
§ Machine translation: Brown 2002 - Koehn 2003 - Stymne 2008 - etc.
§ Information retrieval: Braschler 2003 - etc.
§ Query keywords: Alfonseca 2008
State-of-the-art model: Alfonseca 2008, 'German Decompounding in a Difficult Corpus'
§ Query keywords: noisy data

Table 1 (Alfonseca 2008). Some examples of words found in German user queries that can be difficult to handle by a German decompounder:

Word           Probable interpretation
achzigerjahre  achtzig+jahre (misspelled)
activelor      unknown word
adminslots     admin slots (English)
agilitydog     agility dog (English)
akkuwarner     unknown word
alabams        alabama (misspelled)
almyghtyzeus   almyghty zeus (English and misspelled)
amaryllo       amarillo (Spanish and misspelled)
ampihbien      amphibian (misspelled)

"… German, contain small fragments in other languages. This problem exists to a larger degree when handling user queries: they are written quickly, not paying attention to mistakes, and the language used for querying can change during the same query session. Language identification of search queries is a difficult problem …"

§ Unknown compounds: useful information
  Turingmaschine, Blumenstrause, Bismarkausseenpolitik
State-of-the-art model (Alfonseca 2008): evaluation

Correct splits: no. of words that should be split and are split correctly.
Correct non-splits: no. of words that should not be split and are not.
Wrong non-splits: no. of words that should be split but are not.
Wrong faulty splits: no. of words that should be split and are, but incorrectly.
Wrong splits: no. of words that should not be split but are.

Precision = correct splits / (correct splits + wrong faulty splits + wrong splits)
Recall    = correct splits / (correct splits + wrong faulty splits + wrong non-splits)
Accuracy  = (correct splits + correct non-splits) / (total no. of evaluated words)
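A short Python sketch of these measures; the counts are hypothetical, and the accuracy denominator (all evaluated words) follows the reconstruction above.

```python
def evaluate(correct_splits, correct_non_splits, wrong_non_splits,
             wrong_faulty_splits, wrong_splits):
    """Precision, recall and accuracy from the five counts defined above."""
    precision = correct_splits / (correct_splits + wrong_faulty_splits + wrong_splits)
    recall = correct_splits / (correct_splits + wrong_faulty_splits + wrong_non_splits)
    total = (correct_splits + correct_non_splits + wrong_non_splits
             + wrong_faulty_splits + wrong_splits)
    accuracy = (correct_splits + correct_non_splits) / total
    return precision, recall, accuracy

# Hypothetical counts over a 1000-word test set:
print(evaluate(correct_splits=300, correct_non_splits=550,
               wrong_non_splits=60, wrong_faulty_splits=40, wrong_splits=50))
# -> roughly 0.77 precision, 0.75 recall, 0.85 accuracy
```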
State-of-the-art model (Alfonseca 2008)
Approaches (Koehn 2003): approaches for German compound splitting can be considered as having a general structure: given a word w, calculate every possible way of splitting w into one or more parts.

Table 3: Results of the several configurations (German).

Method                          Precision   Recall    Accuracy
Never split                     –           0.00%     64.09%
Geometric mean of frequencies   39.77%      54.06%    65.58%
Compound probability            60.41%      80.68%    76.23%
Mutual Information              82.00%      48.29%    80.52%
Support-Vector Machine          83.56%      79.48%    87.21%

Table 4: Results in all the different languages.

Language    P        R        A
German      83.56%   79.48%   87.21%
Dutch       78.99%   76.18%   83.45%
Danish      81.97%   87.12%   85.36%
Norwegian   88.13%   93.05%   90.40%
Swedish     83.34%   92.98%   87.79%
Finnish     90.79%   91.21%   91.62%

Concerning the second step, there is some work that uses, for scoring, additional information such as …
References
§ S. Langer. 1998. Zur Morphologie und Semantik von Nominalkomposita [On the morphology and semantics of nominal compounds]. Tagungsband der 4. Konferenz zur Verarbeitung natürlicher Sprache (KONVENS).
§ M. Baroni and J. Matiasek. 2002. Wordform- and class-based prediction of the components of German compounds in an AAC system. OeFAI Technical Report.
§ R. D. Brown. 2002. Corpus-driven splitting of compound words. In TMI-2002.
§ M. Braschler and B. Ripplinger. 2004. How effective is stemming and decompounding for German text retrieval? Information Retrieval, 7:291-316.
§ P. Koehn and K. Knight. 2003. Empirical methods for compound splitting. In ACL-2003.
§ E. Alfonseca, S. Bilac, and S. Pharies. 2008. German decompounding in a difficult corpus. In CICLing.
§ V. Pirrelli, E. Guevara, and M. Baroni. 2010. Computational issues in compound processing. In Scalise 2010.
§ S. Scalise and I. Vogel. 2010. Why Compounding? Cross-Disciplinary Issues in Compounding.
§ V. Henrich and E. Hinrichs. 2011. Determining Immediate Constituents of Compounds in GermaNet. In Proceedings of Recent Advances in NLP.
Thank you - Danke - Grazie