TUNING HIERARCHIES IN PRINCETON WORDNET
AHTI LOHK | CHRISTIANE D. FELLBAUM | LEO VÕHANDU
THE 8TH MEETING OF THE GLOBAL WORDNET CONFERENCE IN BUCHAREST
JANUARY 27-30, 2016

What kinds of methods have different developers used?

Group of methods       Use of corpus data,   Use the contents   Popularity
                       lexical resources     of a synset
Corpus-based methods   +                     +                  High
Rule-based methods     –                     +                  Medium
Graph-based methods    –                     –                  Low (yet!)

Corpus-based methods

Different techniques for extracting the relevant information have been applied. Some of the well-known approaches include:
• Lexico-syntactic patterns (Hearst, 1992; Nadig et al., 2008)
• Similarity measurements (Sagot and Fišer, 2012)
• Mapping and comparing to wordnet (Pedersen et al., 2013)
• Applying wordnet in NLP tasks (Saito et al., 2002)

Rule-based methods

These methods for validating hierarchies rely on lexical relations (word-word), semantic relations (concept-concept) and the rules among them. This includes the rules applied to the construction of WordNet (Fellbaum, 1998), and additional rules, such as the following:
• Metaproperties (rigidity, identity, unity and dependence) described in ontology construction (Guarino and Welty, 2002)
• Top Ontology concepts or “unique beginners” (Atserias et al., 2005; Miller, 1998)
• Specific rules for particular error detections (Gupta, 2002; Nadig et al., 2008).
For instance, a rule proposed by Nadig et al. (2008): “If one term of a synset X is a proper suffix of a term in a synset Y, X is a hypernym of Y”

The advantages of graph-based methods

• Test patterns are applicable to wordnets in every language
• Test patterns highlight substructures that point to possible errors, and they simplify the work of the expert lexicographer (Lohk et al., 2012a; Lohk et al., 2012b; Lohk et al., 2014b)
• Using a test is always quicker than “[doing] a full revision in top-down or alphabetical order” (Čapek, 2012).

Graph-based methods

These methods are purely formal and do not take into account the semantics among word forms. Specific substructures of a wordnet’s hierarchies are checked and validated. Target substructures include:
• Cycles (Smrž, 2004; Kubis, 2012)
• Shortcuts (Fischer, 1997)
• Rings (Liu et al., 2004; Richens, 2008)
• Dangling uplinks (Koeva et al., 2004; Smrž, 2004)
• Orphan nodes (null graphs) (Čapek, 2012)
[Figure panels: Cycle, Shortcut, Ring, Dangling uplink]

An artificial hierarchy and specific substructures

[Figure: an artificial hierarchy with the substructures labelled 1 to 6]
Specific substructures = test patterns
1 Shortcut
2 Heart-shaped substructure
3 Ring
4 Closed subset
5 Dense component
6 Connected roots
+ 4 substructures

Dense component

{facial} care for the face that usually involves cleansing and massage and the application of cosmetic creams
{makeover} an overall beauty treatment (involving a person's hair style and cosmetics and clothing)
{manicure} professional care for the hands and fingernails
{beauty treatment} (2|4) enhancement of someone's personal beauty
{pedicure} professional care for the feet and toenails
{aid, attention, care, ...} (2|20) the work of providing treatment for or attending to someone or something
{hair care, ...} care for the hair: the activity of washing or cutting or curling or
arranging the hair ...

Heart-shaped substructure

{narcotic} a drug that produces numbness or stupor; often taken for pleasure or to reduce pain;
{soft drug} a drug of abuse that is considered relatively mild and not likely to cause addiction
{controlled substance} a drug or chemical substance whose possession and use are controlled by law;
{hard drug} a narcotic that is considered relatively strong and likely to cause addiction
{cannabis, marijuana, ...} the most commonly used illicit drug

“Compound” pattern

1. {baseball} a ball used in playing baseball
   {baseball equipment} equipment used in playing baseball
2. {basketball} an inflated ball used in playing basketball
   {basketball equipment} sports equipment used in playing basketball
3. {cricket ball} the ball used in playing cricket
   {cricket equipment} sports equipment used in playing cricket
4. {croquet ball} a wooden ball used in playing croquet
   {croquet equipment} sports equipment used in playing croquet
5. {golf ball} a small hard ball used in playing golf
   {golf equipment} sports equipment used in playing golf
…
9. {football} the inflated oblong ball used in playing American football
...
24. {volleyball} an inflated ball used in playing volleyball
All of these ball synsets share the hypernym {ball} round object that is hit or thrown or kicked in games

Connected roots

1/2 - {South_1}
1* - 1|8 -> {Alabama_1, ...}
19/74,023 - {entity_1}
1* - 1|9 -> {Epimetheus_1}
1/2 - {Spain_1, ...}

Wordnets in comparison

Wordnets compared: Princeton WordNet Version 3.0, Finnish Wordnet Version 2.0, Cornetto Version 2.0, Polish Wordnet Version 2.0, Estonian Wordnet Version 70

Test pattern                  Princeton WordNet 3.0   Finnish Wordnet 2.0
Noun roots                    12                      12
Verb roots                    334                     334
Multiple inheritance cases    1,453                   1,453
Shortcuts                     40                      40
Rings                         2,991                   2,991
Dense component               18                      18
Heart-shaped substructure     155                     155
Synset with many roots        115                     115
“Compound” pattern            358                     394
The largest closed subsets    1,333×167               1,334×167

[Values for Cornetto, Polish Wordnet and Estonian Wordnet, in extraction order; column alignment not recoverable:]
351 5,309 62 1,226 217 549 11,032×589
553 57,887 205,254 5,037 778 541 30,794×4,683
0 3 2 2 2,438 637 42 10,942 118 4 51 7 21 70 7 123×4
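
To make the idea of a graph-based test pattern more concrete, the shortcut pattern listed among the target substructures can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' tooling: the hypernymy hierarchy is assumed to be a plain dictionary mapping each synset to its direct hypernyms, a shortcut is taken to mean a direct hypernym link that duplicates a longer hypernym path, and the names `ancestors`, `find_shortcuts` and the toy synsets are hypothetical.

```python
def ancestors(graph, node, seen=None):
    """Collect every synset reachable from `node` by following hypernym links."""
    if seen is None:
        seen = set()
    for parent in graph.get(node, ()):
        if parent not in seen:  # the `seen` set also guards against cycles
            seen.add(parent)
            ancestors(graph, parent, seen)
    return seen


def find_shortcuts(graph):
    """Return (synset, hypernym) links that duplicate a longer hypernym path."""
    shortcuts = []
    for child, parents in graph.items():
        for parent in parents:
            # The link child -> parent is a shortcut if `parent` can also be
            # reached through one of the other direct hypernyms of `child`.
            if any(other != parent and parent in ancestors(graph, other)
                   for other in parents):
                shortcuts.append((child, parent))
    return sorted(shortcuts)


# Toy hierarchy (hypothetical data): synset -> direct hypernyms.
hypernyms = {
    "poodle": ["dog", "animal"],  # poodle -> animal duplicates poodle -> dog -> animal
    "dog": ["animal"],
    "animal": ["entity"],
    "entity": [],
}

print(find_shortcuts(hypernyms))
```

Because the check only follows links and never inspects words or glosses, the same function would run unchanged on a hierarchy exported from a wordnet in any language. Each reported pair is a candidate for the lexicographer to review, not an automatic correction.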