Linking Etymological Database: A case study in Germanic

LDL – 2014, LREC
Reykjavik, Iceland
27th May 2014
Christian Chiarcos, Maria Sukhareva
Goethe University Frankfurt am Main
Overview
1. Background
2. Linked Etymological Dictionaries
3. Enriching Linked Etymological Dictionaries
4. Application
5. Conclusion
Background
Processing of Old Germanic languages at Goethe University Frankfurt, in collaboration between:
1. Empirical Linguistics: Thesaurus of Indo-European Text and Language Materials (TITUS)
2. ACoLi Lab (Applied Computational Linguistics)
3. LOEWE Cluster “Digital Humanities”
4. DFG-funded Old German Reference Corpus (DDD, Referenzkorpus Althochdeutsch)
Linked Etymological Data
Conversion of etymological dictionaries to RDF
• Linkability: representation of relations within and
beyond lexicons
• Interoperability: (meta)data representation
through community-maintained vocabularies
(lexvo, Glottolog, OLiA, lemon)
• Inference: filling the logical gaps of the original XML
representation
– Symmetric closure of cross-references
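The symmetric closure mentioned above can be illustrated with a minimal sketch. It assumes cross-references are held as plain (source, target) pairs; the entry identifiers below are hypothetical and are not taken from the converted dictionaries.

```python
def symmetric_closure(links):
    """Return the given links plus the inverse of every link."""
    return set(links) | {(target, source) for source, target in links}


# Hypothetical cross-references between entries (identifiers are illustrative only).
links = {
    ("got:dags", "ang:daeg"),
    ("ang:daeg", "goh:tag"),
}

closed = symmetric_closure(links)
for source, target in sorted(closed):
    print(source, "<->", target)
```

After the closure, every entry that is referenced by another entry also points back to it, even where the original XML recorded the link in one direction only.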
Linked Etymological Data
All language identifiers were mapped from the original abbreviations and assigned ISO 639-3 codes wherever possible.
• lemonet:translates: a relation between lemon:LexicalEntrys
• lemonet:etym: links between languages, transitive and symmetric; subproperty of lemon:lexicalVariant
Linked Etymological Data
[Figure: original XML (lemma) and the corresponding RDF triples]
• Symmetric closure of etymological relations generated by SPARQL pattern
• Links to external resources
Enriching Etymological Dictionaries
[Table: Germanic parallel Bible corpus; parentheses indicate marginal fragments with fewer than 50,000 tokens]
Enriching Etymological Dictionaries
1. Statistical word alignment of parallel texts (GIZA++)
2. Lexical translation tables as basis for the extracted word lists:
• Unidirectional: maximum of P(w_t | w_s)
• Bidirectional: maximum of P(w_t | w_s) ∗ P(w_s | w_t)
3. Pruning by frequency
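The extraction steps above can be sketched roughly as follows. The sketch assumes the lexical translation tables have already been loaded into nested dictionaries. It does not reproduce the GIZA++ file format. The function name, the `min_freq` threshold, and the toy probabilities are all hypothetical. Pruning is assumed here to mean dropping source words below a frequency threshold, since the slides do not specify the criterion.

```python
def extract_pairs(p_t_given_s, p_s_given_t=None, freq=None, min_freq=1):
    """Pick the best target word for each source word.

    p_t_given_s[ws][wt] holds P(wt|ws); p_s_given_t[wt][ws] holds P(ws|wt).
    If p_s_given_t is None the unidirectional score is used,
    otherwise the bidirectional product of both probabilities.
    """
    pairs = {}
    for ws, candidates in p_t_given_s.items():
        # Assumed pruning criterion: skip rare source words.
        if freq is not None and freq.get(ws, 0) < min_freq:
            continue
        scores = {}
        for wt, forward in candidates.items():
            if p_s_given_t is None:
                scores[wt] = forward
            else:
                scores[wt] = forward * p_s_given_t.get(wt, {}).get(ws, 0.0)
        pairs[ws] = max(scores, key=scores.get)
    return pairs


# Toy tables with made-up probabilities.
forward = {"dag": {"tag": 0.4, "zit": 0.6}, "rare": {"x": 1.0}}
backward = {"tag": {"dag": 0.9}, "zit": {"dag": 0.1}, "x": {"rare": 1.0}}
freq = {"dag": 12, "rare": 1}

unidirectional = extract_pairs(forward, freq=freq, min_freq=2)
bidirectional = extract_pairs(forward, backward, freq=freq, min_freq=2)
print(unidirectional, bidirectional)
```

In this toy example the two scores disagree: the unidirectional maximum follows the forward probability alone, while the bidirectional product prefers the candidate that also translates back to the source word.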
Application
Thematic alignment of Bible paraphrases
– E.g., cross-references within the Bible and between the Bible and gospel harmonies
• An interlinked index of thematically similar sections in the gospels and OS/OHG gospel harmonies
– The section-level alignment of the OS Heliand and the OHG Tatian (Sievers, 1872) has been digitized
– 4560 inter-text groups based on the Eusebian canon
• Basis for a more fine-grained level of alignment
Application
Similarity metrics δ(w_OS; w_OHG) for every OS word w_OS and its potential OHG cognate w_OHG

Character-based similarity measures:
– GEOMETRY: δ = difference between the relative positions of w_OS and w_OHG
– IDENTITY: δ(w_OS; w_OHG) = 1 iff w_OHG = w_OS (0 otherwise)
– ORTHOGRAPHY: relative Levenshtein distance & statistical character replacement probability (Neubig et al., 2012)
– NORMALIZATION: norm(w_OS; w_OHG) = δ(w’_OS; w_OHG), with w’_OS being the OHG ‘normalization’ of w_OS (Bollmann et al., 2011)
– COOCCURRENCES: δ(w_OS; w_OHG) = P(w_OS | w_OHG) P(w_OHG | w_OS)
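Two of the character-based measures can be illustrated with a minimal sketch. It covers IDENTITY and the Levenshtein part of ORTHOGRAPHY only; the statistical character replacement model is not reproduced. Normalizing the edit distance by the length of the longer word is an assumption, since the slides do not say how the relative distance is defined.

```python
def levenshtein(a, b):
    """Classic edit distance with unit costs for insert, delete, substitute."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def identity(w_os, w_ohg):
    """IDENTITY: 1 iff both word forms are the same string, 0 otherwise."""
    return 1 if w_os == w_ohg else 0


def relative_levenshtein(w_os, w_ohg):
    """Edit distance divided by the longer word's length (assumed normalization)."""
    longest = max(len(w_os), len(w_ohg))
    return levenshtein(w_os, w_ohg) / longest if longest else 0.0


print(identity("dag", "tag"), relative_levenshtein("dag", "tag"))
```

Under this normalization the value is a distance in [0, 1], so smaller means more similar; a similarity score could be obtained as one minus this value.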
Lexicon-based similarity measures:
δ_lex(w_OS; w_OHG) = 1 iff w_OHG ∈ W (0 otherwise), where W is a set of possible OHG translations for w_OS suggested by a lexicon, i.e., either:
• ETYM: etymological link in (the symmetric closure of) the etymological dictionaries,
• ETYM-INDIRECT: shared German gloss in the etymological dictionaries,
• TRANSLATIONAL DIRECT: link in the translational dictionaries,
• TRANSLATIONAL INDIRECT: indirectly linked in the translational dictionaries through a third language.
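The lexicon-based measure can be sketched as a set-membership test. The example below builds the candidate set W for the ETYM-INDIRECT case from shared German glosses. The gloss dictionaries and function names are hypothetical stand-ins, and the real dictionaries would be queried from their RDF representation rather than from Python dictionaries.

```python
def delta_lex(w_os, w_ohg, lexicon):
    """Return 1 iff w_ohg is among the OHG candidates listed for w_os, else 0."""
    return 1 if w_ohg in lexicon.get(w_os, set()) else 0


def shared_gloss_candidates(os_glosses, ohg_glosses):
    """Build W per OS word: all OHG words sharing at least one German gloss."""
    by_gloss = {}
    for w_ohg, glosses in ohg_glosses.items():
        for gloss in glosses:
            by_gloss.setdefault(gloss, set()).add(w_ohg)
    candidates = {}
    for w_os, glosses in os_glosses.items():
        found = set()
        for gloss in glosses:
            found |= by_gloss.get(gloss, set())
        candidates[w_os] = found
    return candidates


# Toy gloss tables (illustrative only).
os_glosses = {"dag": {"Tag"}, "hus": {"Haus"}}
ohg_glosses = {"tag": {"Tag"}, "zit": {"Zeit"}}

etym_indirect = shared_gloss_candidates(os_glosses, ohg_glosses)
print(delta_lex("dag", "tag", etym_indirect), delta_lex("dag", "zit", etym_indirect))
```

The same `delta_lex` test would apply to the other three cases; only the way the candidate sets are built differs.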
Conclusion & Discussion
Conclusion
1. Application of the Linked Data paradigm to the modeling of etymological dictionaries
2. Adoption of the lemon core model
3. Representation of Köbler’s dictionary in a machine-readable format
4. Enrichment of etymological dictionaries with automatically obtained translation pairs
5. Initial experiment on using the dictionaries for quasi-parallel alignment
lemon & etymology:
A square peg for a round hole?
lemon has gained a lot of popularity as a shared vocabulary for lexical resources in the LLOD.
… but many of these resources are
created by (or for) linguists rather
than ontologists.
The original motivation for lemon was to lexicalize ontologies, quite a different problem from the interoperability issues that linguists are trying to solve by using it.
But obviously, our usage of lemon is slightly abusive.
1. Etymological and translational links between word forms?
2. No external ontology to ground senses?
3. No word senses at all?
But that is symptomatic of linguistic resources in a strict sense.
4. Similar problems were observed by Cysouw & Moran on multilingual dictionaries for South American indigenous languages.
What can we do about this state of affairs?
• Would there have been alternative ways to model our data?
• Shall we extend/abandon/replace/adjust lemon?
Takk fyrir! (Thank you!)