Beyond the Transfer and Merge Wordnet Construction

Beyond the Transfer and Merge
Wordnet Construction:
plWordNet
and a Comparison with WordNet
Marek Maziarz, Maciej Piasecki,
Ewa Rudnicka, Stanisław Szpakowicz
G4.19 Research Group
Wrocław University of Technology
nlp.pwr.wroc.pl
plwordnet.pwr.wroc.pl
Wordnet
{samochód 1, pojazd samochodowy 1, auto 1, wóz 1 `car, automobile’} is linked to:
– {samochodzik 2 `small car’} by diminutiveness
– {bagażnik 1 `boot’} by meronymy
– {pogotowie 3, karetka 1, sanitarka 1, karetka pogotowia 1 `ambulance’} by hypernymy/hyponymy
plWordNet 2.0
Independent vs. Translation-based
Wordnet Construction
• Transfer and merge.
Examples:
– EuroWordNet – most component wordnets built by the
transfer method (Vossen 2002)
– MultiWordNet – semi-automatic acquisition method
from the Princeton WordNet (Bentivogli et al. 2000)
– IndoWordNet – expansion from Hindi Wordnet (Sinha
et al. 2006, Bhattacharyya 2010)
– FinWordNet – directly translated from the Princeton
WordNet
• From scratch.
Examples:
– GermaNet – the core built
independently
– plWordNet – a unique, corpus-based
method; largely independent of the
Princeton WordNet
Synonymy and synsets
• “A wordnet is a collection of synsets
linked by semantic relations.”
• A synset is a set of synonyms which
represent the same lexicalised concept
• Synonyms are members of the same synset
Wordnet development deserves better: an
operational theory with precise guidelines
for wordnet editors.
Basic building block: synset vs lexical unit?
• Synset relations link lexicalised concepts
• But are named after linguistic lexico-semantic
relations
• Substitution tests are defined for lexical units
• Synsets group lexical units
• Every wordnet includes relations between lexical
units (lexical relations), e.g., antonymy
• Lexical units can be observed in text, concepts
cannot
Constitutive relations
• Synset = a group of lexical units which share all
constitutive relations
• Constitutive relation = a lexico-semantic relation
which
– is frequent enough
– and frequently shared by groups
Also
– is established in linguistics
– and accepted in the wordnet tradition
• Examples: hypernymy, meronymy, cause
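The definition above can be illustrated with a minimal sketch: lexical units are grouped whenever their sets of constitutive links are identical. This is not the plWordNet editors' procedure (which is manual and also considers stylistic register and aspect); the function name, the data layout, and the antonymy link are assumptions made for illustration only.

```python
from collections import defaultdict

# Relations treated as constitutive in this toy example (names from the slide).
CONSTITUTIVE = {"hypernymy", "meronymy", "cause"}


def group_into_synsets(unit_links):
    """Group lexical units that share exactly the same constitutive links.

    unit_links maps a lexical unit to a set of (relation, target) pairs.
    Units without any constitutive link all fall into one group here,
    which a real editor would of course not accept.
    """
    groups = defaultdict(list)
    for unit, links in unit_links.items():
        # Only constitutive relations take part in the signature.
        signature = frozenset(link for link in links if link[0] in CONSTITUTIVE)
        groups[signature].append(unit)
    return sorted(sorted(members) for members in groups.values())


# Toy data: hypernymy targets follow the examples on the slides; the
# antonymy link is hypothetical and only shows that lexical relations
# outside the constitutive set are ignored.
toy_links = {
    "karetka 1": {("hypernymy", "samochód 1")},
    "sanitarka 1": {("hypernymy", "samochód 1")},
    "miłość 1": {("hypernymy", "uczucie 2"), ("antonymy", "nienawiść 1")},
    "umiłowanie 1": {("hypernymy", "uczucie 2")},
    "kochanie 1": {("hypernymy", "uczucie 2")},
}

synsets = group_into_synsets(toy_links)
print(synsets)
```

Using a frozenset as the grouping key makes "shares all constitutive relations" an exact set comparison, independent of the order in which links were recorded.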
Synset as an abbreviation
Synset as a notational convention for a group of lexical units sharing certain relations; it represents synonyms.
Example: {afekt 1 `passion’, uczucie 2 `feeling’} is a hypernym of {miłość 1 `love’, umiłowanie 1 `affection’, kochanie 1 `loving’}
This is based on constitutive relations.
Additional distinctions: stylistic register and aspect
Minimal commitment principle: make as few assumptions as possible
Relations in plWordNet
• Starting point: relations in Princeton
WordNet, EuroWordNet and GermaNet
e.g., hyponymy, meronymy, antonymy,
cause, instance for proper names
• Additional constitutive relations
– e.g., verb meronymy, preceding,
presupposition,
– gradation for adjectives
• Specific: derivationally based lexico-semantic relations, e.g.,
– inhabitant (góral `highlander’ – góry `highlands’)
– inchoativity (zapalić się (perfective) `light, start burning’ – palić się (imperfective) `burn, produce light’)
– process (chamieć (imperfective) `to become a boor’ – cham `boor’)
Construction process
1. Data collection: a corpus of 1.8 billion words
2. Data selection phase
– corpus browsing
– WSD-based word usage example extraction
– WordnetWeaver: semi-automatic expansion
3. Data analysis – questions
• is it a correct Polish lemma?
• how many lexical units does it have?
• how to describe them with relations?
• Other knowledge sources: available Polish dictionaries, thesauri, encyclopaedias, lexicons, the Web, and intuition.
The result – size matters
plWordNet (www.plwordnet.pwr.wroc.pl) compared with Princeton WordNet:
• General statistics
• Lexical coverage
• Polysemy
• Synset size
• Relation density
• Hypernymy depth
General statistics
[Figure: LUs per synset in PWN, plWN and GermaNet]
[Figure: Number of synsets, lemmas and LUs in the largest wordnets (PWN, plWN, GermaNet)]
Lexical coverage
[Figure: Proportion of lemmas from PWN/plWN found among vocabulary with a given corpus frequency. x-axis: corpus frequency (≥1000, ≥500, ≥200, ≥100, ≥50); y-axis: proportion (0% to 70%); series: PWN, plWN]
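One possible reading of the coverage measure is sketched below: for each frequency threshold, take the corpus lemmas at or above it and compute the share that the wordnet contains. The function name, the toy frequencies, and the placeholder lemmas are assumptions; the slides do not specify how the figures were computed.

```python
def coverage(corpus_freq, wordnet_lemmas, threshold):
    """Share of corpus lemmas with frequency >= threshold found in the wordnet."""
    frequent = [lemma for lemma, count in corpus_freq.items() if count >= threshold]
    if not frequent:
        return 0.0  # no vocabulary at this threshold
    covered = sum(1 for lemma in frequent if lemma in wordnet_lemmas)
    return covered / len(frequent)


# Hypothetical toy corpus frequencies and wordnet lemma list.
corpus_freq = {"a": 1200, "b": 600, "c": 250, "d": 60}
wordnet_lemmas = {"a", "c"}

for threshold in (1000, 500, 200, 100, 50):
    print(threshold, round(coverage(corpus_freq, wordnet_lemmas, threshold), 3))
```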
Polysemy
[Figure: Proportion of polysemous lemmas with regard to POS. x-axis: nouns, verbs, adjectives; y-axis: proportion (0% to 70%); series: PWN, plWN]
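A minimal sketch of how such a proportion could be computed, assuming a lemma counts as polysemous when it has more than one lexical unit within a given part of speech. The data layout and the toy sense inventory are assumptions, not the plWordNet data.

```python
from collections import Counter


def polysemy_share(lexical_units):
    """Return, per POS, the share of lemmas with more than one lexical unit.

    lexical_units is an iterable of (lemma, pos, sense_number) triples.
    """
    senses = Counter((lemma, pos) for lemma, pos, _ in lexical_units)
    lemmas_per_pos = Counter(pos for (_, pos) in senses)
    polysemous = Counter(pos for (_, pos), n in senses.items() if n > 1)
    return {pos: polysemous[pos] / total for pos, total in lemmas_per_pos.items()}


# Hypothetical toy inventory; sense counts are illustrative only.
toy_units = [
    ("pogotowie", "noun", 1),
    ("pogotowie", "noun", 2),
    ("pogotowie", "noun", 3),
    ("karetka", "noun", 1),
    ("chamieć", "verb", 1),
]

shares = polysemy_share(toy_units)
print(shares)
```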
Relation density
[Figure: Synset relation density in PWN 3.1 and in plWordNet 2.0. x-axis: nouns, verbs, adjectives, total; y-axis: density (0 to 4.5); series: PWN, plWN]
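The slides do not define relation density. The sketch below assumes it means the average number of relation instances per synset, with each instance counted at its source synset. Identifiers and toy data are hypothetical.

```python
from collections import Counter


def density_by_pos(synset_pos, relation_instances):
    """Average number of outgoing relation instances per synset, for each POS.

    synset_pos maps a synset id to its part of speech;
    relation_instances is a list of (source, relation, target) triples.
    """
    synsets_per_pos = Counter(synset_pos.values())
    links_per_pos = Counter(synset_pos[src] for src, _rel, _tgt in relation_instances)
    return {pos: links_per_pos[pos] / n for pos, n in synsets_per_pos.items()}


# Hypothetical toy network with three synsets and four relation instances.
synset_pos = {"s1": "noun", "s2": "noun", "s3": "verb"}
relation_instances = [
    ("s1", "hyponymy", "s2"),
    ("s1", "meronymy", "s2"),
    ("s2", "hypernymy", "s1"),
    ("s3", "cause", "s1"),
]

density = density_by_pos(synset_pos, relation_instances)
print(density)
```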
Hypernymy depth
[Figure: Hypernymy path length for nouns in PWN 3.1 and plWordNet 2.0. Bar groups: up, down; series: PWN, plWN]
Hypernymy depth
[Figure: hypernymy hierarchies of Princeton WordNet and Polish WordNet]
[Figure: Princeton WordNet and Polish WordNet set against SUMO; labels shown: Entity, Physical, Object, Artifact, Device, ElectricDevice, Computer]
Mapping procedure:
plWordNet onto Princeton WordNet
1. Recognise the sense of the source synset:
• the position in the network structure
• existing relations, commentaries; other synsets
containing the given lemma
2. Search for the target synset
• candidates for the target synset: intuitions,
automatic prompting and dictionaries
• verifying candidates:
• comparing hypernymy and hyponymy structures
• existing inter-lingual relations;
• definitions, commentaries; dictionaries
3. Link the source synset with the target synset
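The mapping itself is carried out by editors. The sketch below only illustrates how the "comparing hypernymy and hyponymy structures" check in step 2 might be supported automatically: a candidate scores higher when the already-mapped neighbours of the source synset point to neighbours of the candidate. The scoring function and all target identifiers are assumptions made for illustration.

```python
def structural_score(source_neighbours, candidate_neighbours, links):
    """Fraction of already-mapped source neighbours whose target
    is also a neighbour of the candidate target synset.

    links maps source synsets to target synsets (existing inter-lingual links).
    """
    mapped = [links[n] for n in source_neighbours if n in links]
    if not mapped:
        return 0.0  # nothing mapped yet, so structure gives no evidence
    hits = sum(1 for target in mapped if target in candidate_neighbours)
    return hits / len(mapped)


# Hypothetical data: one existing link and two candidate targets for `ambulance'.
links = {"samochód 1": "car#1"}
source_hypernyms = {"samochód 1"}
candidates = {
    "ambulance#1": {"car#1"},      # hypernyms of the first candidate
    "candidate#2": {"service#1"},  # hypernyms of an unrelated candidate
}

scores = {
    name: structural_score(source_hypernyms, neighbours, links)
    for name, neighbours in candidates.items()
}
best = max(scores, key=scores.get)
print(best, scores)
```

Such a score could at most rank candidates for an editor; definitions, commentaries and dictionaries remain part of the verification.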
Hierarchy of inter-lingual relations
• Inter-lingual synonymy (only one per synset)
• Inter-lingual inter-register synonymy
• I-partial synonymy
• I-hyponymy
• I-hypernymy
• I-meronymy: for parts, elements or materials of bigger wholes
• I-holonymy: for a whole made of smaller parts, elements or materials
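Assuming the hierarchy is a preference order (try the first relation, fall back to the next), the choice of relation and the "only one per synset" constraint on synonymy can be sketched as follows. The relation labels are shortened forms of the list above; the function names and sample links are hypothetical.

```python
from collections import Counter

# Relations in the order given on the slide, most preferred first.
I_RELATIONS = [
    "I-synonymy",
    "I-inter-register synonymy",
    "I-partial synonymy",
    "I-hyponymy",
    "I-hypernymy",
    "I-meronymy",
    "I-holonymy",
]


def choose_relation(applicable):
    """Return the most preferred relation among those judged applicable."""
    for relation in I_RELATIONS:
        if relation in applicable:
            return relation
    return None  # no inter-lingual relation applies


def has_single_synonymy(links):
    """Check that no source synset has more than one I-synonymy link.

    links is a list of (source_synset, relation, target_synset) triples.
    """
    counts = Counter(src for src, rel, _tgt in links if rel == "I-synonymy")
    return all(n <= 1 for n in counts.values())


chosen = choose_relation({"I-meronymy", "I-hyponymy"})
valid = has_single_synonymy([("pl-1", "I-synonymy", "en-1"), ("pl-1", "I-hyponymy", "en-2")])
invalid = has_single_synonymy([("pl-1", "I-synonymy", "en-1"), ("pl-1", "I-synonymy", "en-3")])
print(chosen, valid, invalid)
```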
Results of inter-lingual mapping
• Mapping direction: plWordNet – Princeton WordNet
• Bottom-up – from the lowest levels in the hierarchy up
• ~48 300 synsets mapped (~64 400 lexical units/senses)
– Synonymy: 15268
– Partial synonymy: 971
– Inter-register synonymy: 676
– Hyponymy: 23677
– Hypernymy: 3526
– Meronymy: 1898
– Holonymy: 555
• Mapped branches
– people, artefacts, places, food, time units: all
– communication, states and processes, body parts, group names: partially
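As a quick worked example, the listed relation counts can be turned into shares. The percentages are relative to the sum of the seven listed counts, not to the approximate number of mapped synsets; the variable names are illustrative.

```python
# Inter-lingual relation counts as listed on the slide.
counts = {
    "Synonymy": 15268,
    "Partial synonymy": 971,
    "Inter-register synonymy": 676,
    "Hyponymy": 23677,
    "Hypernymy": 3526,
    "Meronymy": 1898,
    "Holonymy": 555,
}

total = sum(counts.values())
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

# Print relations from most to least frequent.
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name}: {share}%")
```

On these counts, inter-lingual hyponymy accounts for about half of the listed links and full synonymy for about a third.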
Different relations for coding the
same conceptual dependencies
Applications
A free WordNet-type licence facilitates applications. Examples:
• Semantic annotation in a corpus of referential gestures (Lis, 2012)
• Lexicon of semantic valency frames (Hajnicz, 2011; Hajnicz, 2012)
• Features for text mining from Web pages (Maciolek and Dobrowolski, 2013)
• Mapping between a lexicon and an ontology (Wróblewska et al., 2013)
• Word-to-word similarity in ontologies (Lula and Paliwoda-Pękosz, 2009)
• Text similarity for Information Retrieval (Siemiński, 2012)
• Text classification (Maciołek, 2010)
• Terminology extraction and clustering (Mykowiecka and Marciniak, 2012)
• Automated extraction of Opinion Attribute Lexicons (Wawer and
Gołuchowski, 2012)
• Named Entity Recognition
• Word Sense Disambiguation (Gołuchowski and Przepiórkowski, 2012)
• Anaphora resolution
More than 500 registered users, ~70 declared commercial applications
Conclusions
• plWordNet 2.0 – a national wordnet not adapted
from Princeton WordNet
• plWordNet 2.0 is comparable to WordNet 3.1
in size, as well as in lexical coverage, hypernymy depth
and relation density
• Synset membership depends only on constitutive
relations between lexical units.
• A unique mapping strategy and a unique
opportunity to compare the two lexical systems
• plWordNet 3.0 (2015):
– a comprehensive wordnet of Polish
– 200k lemmas and 260k LUs, mapped to PWN 3.?
www.plwordnet.pwr.wroc.pl
Thank you!
Differences between plWN and PWN
• Inter-lingual lexico-grammatical
differences:
– marked forms (diminutives, augmentatives)
– lexicalised gender
– lexical gaps
• Differences in the definition of synonymy
and synset:
– 'Mixed' PWN synsets: marked and unmarked forms, feminine and masculine, countable and uncountable, hypernym and hyponym
– hypernymy: and (plWN) vs. and/or (PWN)
• Other differences:
– synset definitions incompatible with relations
(PWN)
– different relations used for coding the same
conceptual dependencies
– more fine-grained meaning differentiation
– differences boiling down to the content and
size of resource
Differences in lexicalisation
Relation density
[Figure: Synset relation density in PWN 3.1 and in plWordNet 2.0 in selected semantic domains. x-axis: semantic domain; y-axis: density (0 to 16); series: PWN, plWN]