Tuning Hierarchies in Princeton WordNet

Ahti Lohk | Christiane D. Fellbaum | Leo Võhandu
The 8th Meeting of the Global WordNet Conference in Bucharest
January 27-30, 2016
What kinds of methods have different developers used?
Group of methods     | Use of corpus data, lexical resources | Use the contents of a synset | Popularity
Corpus-based methods | +                                     | +                            | High
Rule-based methods   | –                                     | +                            | Medium
Graph-based methods  | –                                     | –                            | Low (yet!)
Corpus-based methods
Different techniques for extracting the relevant information have been applied.
Some of the well-known approaches include:
• Lexico-syntactic patterns (Hearst, 1992; Nadig et al., 2008)
• Similarity measurements (Sagot and Fišer, 2012)
• Mapping and comparing to wordnet (Pedersen et al., 2013)
• Applying wordnet in NLP tasks (Saito et al., 2002)
Rule-based methods
These methods for validating hierarchies rely on lexical relations (word-word),
semantic relations (concept-concept) and the rules among them. This includes the
rules applied to the construction of WordNet (Fellbaum, 1998), and additional
rules, such as the following:
• Metaproperties (rigidity, identity, unity and dependence) described in ontology
construction (Guarino and Welty, 2002)
• Top Ontology concepts or “unique beginners” (Atserias et al., 2005; Miller, 1998)
• Specific rules for particular error detections (Gupta, 2002; Nadig et al., 2008). For
instance, a rule proposed by Nadig et al. (2008): “If one term of a synset X is a proper
suffix of a term in a synset Y, X is a hypernym of Y”
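The suffix rule quoted above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of Nadig et al.: synsets are assumed to be plain lists of terms, and "proper suffix" is assumed to be judged on whole words, so that "ball" matches "golf ball" but not "football". Both function names are hypothetical.

```python
def is_proper_suffix(short_term, long_term):
    """True if the words of short_term end long_term without covering all of it.

    Assumption: the comparison is word-based and case-insensitive.
    """
    short_words = short_term.lower().split()
    long_words = long_term.lower().split()
    if not short_words or len(short_words) >= len(long_words):
        return False
    return long_words[-len(short_words):] == short_words


def suffix_rule_suggests_hypernym(synset_x, synset_y):
    """True if some term of synset_x is a proper suffix of some term of synset_y.

    Under the quoted rule, this suggests that X is a hypernym of Y.
    """
    return any(is_proper_suffix(x, y) for x in synset_x for y in synset_y)


print(suffix_rule_suggests_hypernym(["ball"], ["golf ball"]))
print(suffix_rule_suggests_hypernym(["ball"], ["football"]))
```

A character-based comparison would also accept "football", at the cost of false matches such as "all" in "ball"; which reading the original rule intends is not stated on the slide.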
The advantages of graph-based methods
• Test patterns are applicable to wordnets in every language
• Test patterns highlight substructures that refer to possible errors
and they simplify the work of the expert lexicographer (Lohk et al.,
2012a; Lohk et al., 2012b; Lohk et al., 2014b)
• Using a test is always quicker than “[doing] a full revision in top-down or alphabetical order” (Čapek, 2012).
Graph-based methods
These methods are purely formal and do not take into account the
semantics among word forms. Specific substructures of a wordnet’s
hierarchies are checked and validated.
Target substructures include:
• Cycles (Smrž, 2004; Kubis, 2012)
• Shortcuts (Fischer, 1997)
• Rings (Liu et al., 2004; Richens, 2008)
• Dangling uplinks (Koeva et al., 2004; Smrž, 2004)
• Orphan nodes (null graphs) (Čapek, 2012)
[Figure: examples of a cycle, a shortcut, a ring and a dangling uplink]
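Because these checks are purely structural, several of them are short graph routines. The sketch below assumes the hierarchy is stored as a dictionary that maps each synset identifier to the set of its hypernym identifiers; that representation, the function names and the sample data are illustrative assumptions, not taken from the cited papers. An orphan is taken here to mean a synset with no hypernym and no hyponym.

```python
def find_dangling_uplinks(hypernyms):
    """Return (child, parent) links whose parent is not a known synset."""
    return sorted(
        (child, parent)
        for child, parents in hypernyms.items()
        for parent in parents
        if parent not in hypernyms
    )


def find_orphans(hypernyms):
    """Return synsets that have no hypernym and are nobody's hypernym."""
    used_as_parent = {p for parents in hypernyms.values() for p in parents}
    return sorted(
        node for node, parents in hypernyms.items()
        if not parents and node not in used_as_parent
    )


def find_cycle(hypernyms):
    """Return one cycle as a list of nodes (first node repeated last), or None.

    Depth-first search; recursion is acceptable for a sketch because
    hypernym chains are short, but a very deep graph would need a stack.
    """
    unvisited, in_progress, done = 0, 1, 2
    state = {node: unvisited for node in hypernyms}
    path = []

    def visit(node):
        state[node] = in_progress
        path.append(node)
        for parent in sorted(hypernyms[node]):
            if parent not in state:      # dangling uplink, reported elsewhere
                continue
            if state[parent] == in_progress:
                return path[path.index(parent):] + [parent]
            if state[parent] == unvisited:
                found = visit(parent)
                if found:
                    return found
        path.pop()
        state[node] = done
        return None

    for node in sorted(hypernyms):
        if state[node] == unvisited:
            found = visit(node)
            if found:
                return found
    return None


# Toy data, invented for illustration only.
toy = {"a": {"b"}, "b": {"c"}, "c": {"a"}, "d": {"missing"}, "e": set()}
print(find_cycle(toy))
print(find_dangling_uplinks(toy))
print(find_orphans(toy))
```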
An artificial hierarchy and specific substructures
[Figure: an artificial hierarchy with six numbered substructures]
Specific substructures = test patterns:
1. Shortcut
2. Heart-shaped substructure
3. Ring
4. Closed subset
5. Dense component
6. Connected roots
+ 4 further substructures
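As an illustration of how one of these test patterns could be located automatically, the sketch below looks for shortcuts and lists multiple inheritance cases. It assumes a shortcut is a direct hypernym link that duplicates a longer upward path, and that the hierarchy is a dictionary from synset identifier to a set of hypernym identifiers. The names and the toy data are hypothetical.

```python
def ancestors(hypernyms, node):
    """All synsets reachable by following hypernym links upward from node."""
    seen = set()
    stack = list(hypernyms.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(hypernyms.get(current, ()))
    return seen


def find_shortcuts(hypernyms):
    """Return (child, parent) links that are also implied by a longer path."""
    shortcuts = []
    for child, parents in hypernyms.items():
        for parent in parents:
            other_parents = parents - {parent}
            if any(parent in ancestors(hypernyms, o) for o in other_parents):
                shortcuts.append((child, parent))
    return sorted(shortcuts)


def multiple_inheritance_cases(hypernyms):
    """Return synsets with more than one hypernym, with their sorted parents."""
    return {
        node: sorted(parents)
        for node, parents in hypernyms.items()
        if len(parents) > 1
    }


# Toy data, invented for illustration only.
toy = {
    "poodle": {"dog", "animal"},
    "dog": {"animal"},
    "cat": {"animal"},
    "animal": set(),
}
print(find_shortcuts(toy))
print(multiple_inheritance_cases(toy))
```

A flagged link is only a candidate: whether the direct link or the longer path is the intended one remains a decision for the lexicographer.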
Dense component
{facial}: care for the face that usually involves cleansing and massage and the application of cosmetic creams
{makeover}: an overall beauty treatment (involving a person's hair style and cosmetics and clothing)
{manicure}: professional care for the hands and fingernails
{beauty treatment} (2|4): enhancement of someone's personal beauty
{pedicure}: professional care for the feet and toenails
{aid, attention, care, ...} (2|20): the work of providing treatment for or attending to someone or something
{hair care, ...}: care for the hair: the activity of washing or cutting or curling or arranging the hair
...
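The slide does not define a dense component. One possible reading, assumed here, is a group of at least two synsets that share at least two identical hypernyms. Under that assumption, candidates can be found by grouping synsets by each pair of parents they have. The function name, the threshold parameter and the hypernym links in the toy data are illustrative assumptions; the links are not read off the slide.

```python
from collections import defaultdict
from itertools import combinations


def shared_parent_pairs(hypernyms, min_children=2):
    """Map each pair of hypernyms to the synsets that have both as parents.

    Only pairs shared by at least min_children synsets are returned.
    """
    children_of_pair = defaultdict(set)
    for child, parents in hypernyms.items():
        for pair in combinations(sorted(parents), 2):
            children_of_pair[pair].add(child)
    return {
        pair: sorted(children)
        for pair, children in children_of_pair.items()
        if len(children) >= min_children
    }


# Toy data: labels echo the slide, but the links are assumed for illustration.
toy = {
    "facial": {"beauty treatment", "care"},
    "manicure": {"beauty treatment", "care"},
    "pedicure": {"beauty treatment", "care"},
    "first aid": {"care"},
}
print(shared_parent_pairs(toy))
```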
Heart-shaped substructure
{narcotic}: a drug that produces numbness or stupor; often taken for pleasure or to reduce pain
{soft drug}: a drug of abuse that is considered relatively mild and not likely to cause addiction
{controlled substance}: a drug or chemical substance whose possession and use are controlled by law
{hard drug}: a narcotic that is considered relatively strong and likely to cause addiction
{cannabis, marijuana, ...}: the most commonly used illicit drug
"Compound" pattern
1. {baseball}: a ball used in playing baseball
   {baseball equipment}: equipment used in playing baseball
2. {basketball}: an inflated ball used in playing basketball
   {basketball equipment}: sports equipment used in playing basketball
3. {cricket ball}: the ball used in playing cricket
   {cricket equipment}: sports equipment used in playing cricket
4. {croquet ball}: a wooden ball used in playing croquet
   {croquet equipment}: sports equipment used in playing croquet
5. {golf ball}: a small hard ball used in playing golf
   {golf equipment}: sports equipment used in playing golf
…
9. {football}: the inflated oblong ball used in playing American football
...
24. {volleyball}: an inflated ball used in playing volleyball
{ball}: round object that is hit or thrown or kicked in games
Connected roots
1/2 - {South_1}
1* - 1|8 -> {Alabama_1, ...}
19/74,023 - {entity_1}
1* - 1|9 -> {Epimetheus_1}
1/2 - {Spain_1, ...}
Wordnets in comparison
Wordnets: Princeton WordNet (version 3.0), Finnish Wordnet (version 2.0), Cornetto (version 2.0), Polish Wordnet (version 2.0), Estonian Wordnet (version 70)
Table columns: the largest closed subsets, "compound" pattern, synsets with many roots, heart-shaped substructures, dense components, rings, shortcuts, multiple inheritance cases, verb roots, noun roots
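[Table values; the assignment of values to cells was lost in extraction. Values in source order:]
12 334; 1,453; 40; 2,991; 18; 155; 115; 358; 1,333×167
12 334; 1,453; 40; 2,991; 18; 155; 115; 394; 1,334×167
351; 5,309; 62; 1,226; 217; 549; 11,032×589
553 57,887 205,254; 5,037; 778; 541 30,794×4,683
0; 3; 2; 2; 2,438; 637; 42; 10,942; 118; 4; 51; 7; 21; 70; 7; 123×4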