Beyond the Transfer and Merge Wordnet Construction: plWordNet and a Comparison with WordNet
Marek Maziarz, Maciej Piasecki, Ewa Rudnicka, Stanisław Szpakowicz
G4.19 Research Group, Wrocław University of Technology
nlp.pwr.wroc.pl
plwordnet.pwr.wroc.pl

Wordnet
Example from plWordNet 2.0:
• {samochodzik 2 ‘small car’} is linked by diminutiveness to {samochód 1, pojazd samochodowy 1, auto 1, wóz 1 ‘car, automobile’}
• {bagażnik 1 ‘boot’} is linked by meronymy to {samochód 1, pojazd samochodowy 1, auto 1, wóz 1}
• {pogotowie 3, karetka 1, sanitarka 1, karetka pogotowia 1 ‘ambulance’} is linked by hypernymy/hyponymy to {samochód 1, pojazd samochodowy 1, auto 1, wóz 1}

Independent vs. Translation-based Wordnet Construction
• Transfer and merge. Examples:
  – EuroWordNet: most component wordnets built by the transfer method (Vossen 2002)
  – MultiWordNet: semi-automatic acquisition method from the Princeton WordNet (Bentivogli et al. 2000)
  – IndoWordNet: expansion from Hindi Wordnet (Sinha et al. 2006, Bhattacharyya 2010)
  – FinWordNet: directly translated from the Princeton WordNet
• From scratch. Examples:
  – GermaNet: the core built independently
  – plWordNet: a unique, corpus-based method; largely independent of the Princeton WordNet

Synonymy and synsets
• “A wordnet is a collection of synsets linked by semantic relations.”
• A synset is a set of synonyms which represent the same lexicalised concept
• Synonyms are members of the same synset
Wordnet development deserves better: an operational theory with precise guidelines for wordnet editors.

Basic building block: synset vs lexical unit?
• Synset relations link lexicalised concepts
• But they are named after linguistic lexico-semantic relations
• Substitution tests are defined for lexical units
• Synsets group lexical units
• Every wordnet includes relations between lexical units (lexical relations), e.g., antonymy
• Lexical units can be observed in text, concepts cannot

Constitutive relations
• Synset = a group of lexical units which share all constitutive relations
• Constitutive relation = a lexico-semantic relation which
  – is frequent enough
  – and is frequently shared by groups of lexical units
  Also
  – is established in linguistics
  – and is accepted in the wordnet tradition
• Examples: hypernymy, meronymy, cause

Synset as an abbreviation
• Synset as a notational convention for a group of lexical units sharing certain relations
• A synset represents synonyms
• Example: {afekt 1 ‘passion’, uczucie 2 ‘feeling’} is the hypernym of {miłość 1 ‘love’, umiłowanie 1 ‘affection’, kochanie 1 ‘loving’}
• This is based on constitutive relations
• Additional distinctions: stylistic register and aspect
• Minimal commitment principle: make as few assumptions as possible

Relations in plWordNet
• Starting point: relations in Princeton WordNet, EuroWordNet and GermaNet, e.g., hyponymy, meronymy, antonymy, cause, instance for proper names
• Additional constitutive relations, e.g.,
  – verb meronymy, preceding, presupposition
  – gradation for adjectives
• Specific: derivationally based lexico-semantic relations, e.g.,
  – inhabitant (góral ‘highlander’, góry ‘highlands’)
  – inchoativity (zapalić się [perfective] ‘light, start burning’, palić się [imperfective] ‘burn, produce light’)
  – process (chamieć [imperfective] ‘to become a boor’, cham ‘boor’)

Construction process
1. Data collection: a corpus of 1.8 billion words
2. Data selection phase
  – corpus browsing
  – WSD-based extraction of word usage examples
  – WordnetWeaver: semi-automatic expansion
3. Data analysis: questions
  • is it a correct Polish lemma?
  • how many lexical units does it have?
  • how to describe them with relations?
Other knowledge sources: available Polish dictionaries, thesauri, encyclopaedias, lexicons, the Web, and intuition.

The result – size matters
www.plwordnet.pwr.wroc.pl
Compared with Princeton WordNet:
• General statistics
• Lexical coverage
• Polysemy
• Synset size
• Relation density
• Hypernymy depth

General statistics
[Chart: LUs per synset in PWN, plWN and GermaNet; vertical axis from 0 to 2.]
[Chart: Number of synsets, lemmas and LUs in the largest wordnets (PWN, plWN, GermaNet); vertical axis from 0 to 250,000.]

Lexical coverage
[Chart: Proportion of lemmas from PWN/plWN found among vocabulary with a given corpus frequency; horizontal axis: corpus frequency (≥1000, ≥500, ≥200, ≥100, ≥50); series: PWN, plWN. Bar labels as extracted, series assignment not recoverable: 45.6%, 58.3%, 38.3%, 35.0%, 28.0%, 27.7%, 17.0%, 21.0%, 10.7%, 6.4%.]

Polysemy
[Chart: Proportion of polysemous lemmas with regard to POS (nouns, verbs, adjectives); series: PWN, plWN. Bar labels as extracted, series assignment not recoverable: 60%, 41%, 32%, 26%, 38%, 18%.]

Relation density
[Chart: Synset relation density in PWN 3.1 and in plWordNet 2.0 for nouns, verbs, adjectives and in total. Bar labels as extracted, series assignment not recoverable: 3.99, 3.51, 3.54, 3.11, 3.06, 2.21, 2.43, 1.56.]

Hypernymy depth
[Chart: Hypernymy path length for nouns in PWN 3.1 and plWordNet 2.0, up and down. Bar labels as extracted, series assignment not recoverable: 7.76, 5.71, 0.57, 0.6.]
[Figure: hypernymy hierarchies of Princeton WordNet and Polish WordNet side by side, aligned with the SUMO chain Entity, Physical, Object, Artifact, Device, ElectricDevice, Computer.]

Mapping procedure: plWordNet onto Princeton WordNet
1. Recognise the sense of the source synset:
  • the position in the network structure
  • existing relations, commentaries; other synsets containing the given lemma
2. Search for the target synset:
  • candidates for the target synset: intuitions, automatic prompting and dictionaries
  • verifying candidates:
    – comparing hypernymy and hyponymy structures
    – existing inter-lingual relations
    – definitions, commentaries; dictionaries
3. Link the source synset with the target synset

Hierarchy of inter-lingual relations
• Inter-lingual synonymy (only one per synset)
• Inter-lingual inter-register synonymy
• I-partial synonymy
• I-hyponymy
• I-hypernymy
• I-meronymy for parts, elements or materials of bigger wholes
• I-holonymy for a whole made of smaller parts, elements or materials

Results of inter-lingual mapping
• Mapping direction: plWordNet to Princeton WordNet
• Bottom-up: from the lowest levels in the hierarchy up
• ~48 300 synsets mapped (~64 400 lexical units/senses)
  – Synonymy: 15268
  – Partial synonymy: 971
  – Inter-register synonymy: 676
  – Hyponymy: 23677
  – Hypernymy: 3526
  – Meronymy: 1898
  – Holonymy: 555
• Mapped branches
  – people, artefacts, places, food, time units: all
  – communication, states and processes, body parts, group names: partially

Different relations for coding the same conceptual dependencies

Applications
A free WordNet-type licence facilitates applications. Examples:
• Semantic annotation in a corpus of referential gestures (Lis, 2012)
• Lexicon of semantic valency frames (Hajnicz, 2011; Hajnicz, 2012)
• Features for text mining from Web pages (Maciolek and Dobrowolski, 2013)
• Mapping between a lexicon and an ontology (Wróblewska et al., 2013)
• Word-to-word similarity in ontologies (Lula and Paliwoda-Pękosz, 2009)
• Text similarity for Information Retrieval (Siemiński, 2012)
• Text classification (Maciołek, 2010)
• Terminology extraction and clustering (Mykowiecka and Marciniak, 2012)
• Automated extraction of Opinion Attribute Lexicons (Wawer and Gołuchowski, 2012)
• Named Entity Recognition
• Word Sense Disambiguation (Gołuchowski and Przepiórkowski, 2012)
• Anaphora resolution
More than 500 registered users, ~70 declared commercial applications

Conclusions
• plWordNet 2.0: a national wordnet not adapted from Princeton WordNet
• plWordNet 2.0 is comparable to WordNet 3.1 in size, as well as in lexical coverage, hypernymy depth and relation density
• Synset membership depends only on constitutive relations between lexical units
• A unique mapping strategy and a unique opportunity to compare the two lexical systems
• plWordNet 3.0 (2015):
  – a comprehensive wordnet of Polish
  – 200k lemmas and 260k LUs, mapped to PWN 3.?

Thank you!
www.plwordnet.pwr.wroc.pl

Differences between plWN and PWN
• Inter-lingual lexico-grammatical differences:
  – marked forms (diminutives, augmentatives)
  – lexicalised gender
  – lexical gaps
• Differences in the definition of synonymy and synset:
  – ‘mixed’ PWN synsets: marked and unmarked forms, feminine and masculine, countable and uncountable, hypernym and hyponym
  – hypernymy: and (plWN) vs. and/or (PWN)
• Other differences:
  – synset definitions incompatible with relations (PWN)
  – different relations used for coding the same conceptual dependencies
  – more fine-grained meaning differentiation
  – differences boiling down to the content and size of the resource

Differences in lexicalisation

Relation density
[Chart: Synset relation density in PWN 3.1 and in plWordNet 2.0 in selected semantic domains; horizontal axis: semantic domain; vertical axis from 0 to 16; series: PWN, plWN.]
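
Sketch: synsets from constitutive relations
The definition from the "Constitutive relations" slide (a synset is a group of lexical units which share all constitutive relations) can be illustrated with a small Python sketch. It is a toy illustration under stated assumptions, not the plWordNet editing procedure: the function name `group_into_synsets`, the data layout (each lexical unit mapped to a set of relation/target pairs) and the reduced relation sets are all hypothetical, and the additional distinctions of stylistic register and aspect are not modelled. The lexical units and their links are taken from the examples on the slides.

```python
from collections import defaultdict

# Toy data (assumed layout): lexical unit -> set of (relation, related synset).
# "hypernym" means "has this hypernym", "meronym" means "has this meronym".
# The relation sets are deliberately reduced to what the slides show.
LEXICAL_UNITS = {
    "miłość 1": {("hypernym", "{afekt 1, uczucie 2}")},
    "umiłowanie 1": {("hypernym", "{afekt 1, uczucie 2}")},
    "kochanie 1": {("hypernym", "{afekt 1, uczucie 2}")},
    "samochód 1": {("meronym", "{bagażnik 1}")},
    "auto 1": {("meronym", "{bagażnik 1}")},
    "karetka 1": {("hypernym", "{samochód 1, auto 1}")},
}


def group_into_synsets(units):
    """Group lexical units whose sets of constitutive relations are identical."""
    groups = defaultdict(list)
    for unit, relations in units.items():
        # frozenset makes the relation set hashable, so it can serve as a key;
        # two units land in one group only if they share ALL relations.
        groups[frozenset(relations)].append(unit)
    return [sorted(members) for members in groups.values()]


if __name__ == "__main__":
    for synset in group_into_synsets(LEXICAL_UNITS):
        print("{" + ", ".join(synset) + "}")
```

Keying the groups on the whole relation set, rather than on any single relation, mirrors the word "all" in the definition: a lexical unit that differs in even one constitutive relation ends up in a separate synset.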