The Topology of WordNet: some metrics
Ann Devitt and Carl Vogel
Computational Linguistics Group
Trinity College Dublin, Ireland

Introduction
  Measures
    WordNet "sub-hierarchies"
    Multiple inheritance
    Branching Factor
    Depth versus Height
    Cluster coefficients
  Specificity pilot study
Ann Devitt, TCD

Terminology
  WordNet as directed acyclic graph
  Node and synset interchangeable

Dimensional distribution

Overlap between hierarchies
  2072 synsets: more than 1 top hierarchy
  35 synsets: more than 2 top hierarchies

Some overlap examples
  Abstraction and Event: 948 synsets
    Example: group action
  Entity and Group: 250 nodes
    Example: weaponry

Multiple inheritance
  2.6% of nodes
  Normal distribution throughout depth
  Significantly different in different taxonomies: χ²(8, N = 75180) = 324.27, p ≤ 0.001

Specificity examples
  Parents = 1, depth < 3 | Parents > 1, depth < 3
    person, artefact, damnation, office
  Parents = 1, depth > 8 | Parents > 1, depth > 8
    sea bass, self-condemnation, bombardon, beagle, palomino

Branching Factor
  Number of children + 1
  Including leaf nodes:
    Range: 1 to 573
    Average: 2.023
  Excluding leaf nodes:
    Average: 5.793
  97% less than 20

Branching factor
  Overall low branching factor
  Same distribution in all sub-hierarchies
  Large number of nodes in total
  Greater overall depth in paths
  Not a shallow structure despite 55,000 leaf nodes

Depth vs Height
  Depth:
    Maximum = 18
    Normal distribution
  Height:
    Maximum = 5
    93.6% are 1 or 2 nodes from a leaf node
    Zipfian distribution

Depth vs Height
  Reported distributions are the same across the different sub-hierarchies
  Depth is a more informative measure

Clustering coefficient
  Measure of graph connectivity
  Ratio: number of connections between a node's neighbours / possible number of such connections
  $C_i = \frac{2 E_i}{k_i (k_i - 1)}$, where $k_i$ is the number of neighbours of node $i$ and $E_i$ is the number of connections among them

Cluster coefficients
  First-order measure
  Not useful for WordNet
  Only 62 nodes have a coefficient > 0
  Does not form clusters readily

Cluster coefficients
  Second-order measure
  Average 0.337
  Normal distribution
  May form clusters of wider diameter

Pilot Study Aims
  1. Do people have a notion of generality/specificity for concepts?
  2. Do people agree on what is more/less general/specific?
  3. What features of WordNet do these judgments correlate with?

Sample ranking task I
  Axis, axis of rotation - (the center around which something rotates)
  River boat - (a boat used on rivers or to ply a river)
  Remains - (any object that is left unused or still extant; "I threw out the remains of my dinner")

Sample ranking task II
  rational motive - (a motive that can be defended by reasoning or logical argument)
  disapproval - (the act of disapproving or condemning)
  harmony, concord, concordance - (agreement of opinions)

Do people agree on what is more/less general/specific?
  YES
  Cochran Q statistic (Cochran 1950)
  H0: any agreement between respondents is due to chance
  Overall, for 11 respondents:
    Cochran's Q = 165.859
    44 degrees of freedom
    Asymp. Sig. = .000

What WN features correlate?
  Depth
    Less deep = more general
  Children
    Inconclusive
  Sisters
    Fewer sisters = more general
  Sub-hierarchy
    Did not seem to affect judgments
    Did increase the difficulty of the task

Conclusion
  WordNet metrics
    Inheritance: sub-hierarchy and parentage
    Branching factor
    Distance: depth and height
    Clustering
  Pilot study
    Suggests where to go with a larger study

Bibliography
  W. G. Cochran: The comparison of percentages in matched samples. Biometrika, 37:256-266, 1950
  David S. Touretzky: The Mathematics of Inheritance Systems. Los Altos, CA: Morgan Kaufmann, 1986
  D. J. Watts and S. H. Strogatz: Collective dynamics of 'small-world' networks. Nature, 393:440-442, 1998

Multiple Inheritance vs Depth
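
The structural measures named in the slides (branching factor as number of children + 1, depth, height, multiple inheritance, and the first-order clustering coefficient) can be illustrated on a small graph. The sketch below is not the authors' code and does not use WordNet data: the toy hierarchy, the function names, and the variable names are all hypothetical. It also rests on assumptions the slides do not state: depth is taken as the shortest path from a root, height as the shortest path to a leaf, and edges are treated as undirected when counting neighbours for the clustering coefficient.

```python
from collections import deque

# Hypothetical toy hierarchy (node -> list of parents); NOT real WordNet data.
PARENTS = {
    "entity": [],
    "object": ["entity"],
    "organism": ["object"],
    "causal_agent": ["entity"],
    "person": ["organism", "causal_agent"],  # two parents: multiple inheritance
    "animal": ["organism"],
    "dog": ["animal"],
    "beagle": ["dog"],
}

def children_of(parents):
    """Invert the parent map into a node -> list of children map."""
    kids = {node: [] for node in parents}
    for child, ps in parents.items():
        for p in ps:
            kids[p].append(child)
    return kids

def branching_factors(kids):
    """Branching factor as defined on the slides: number of children + 1."""
    return {node: len(cs) + 1 for node, cs in kids.items()}

def shortest_distances(starts, step):
    """Breadth-first search from `starts`, following the `step` map."""
    dist = {s: 0 for s in starts}
    queue = deque(starts)
    while queue:
        node = queue.popleft()
        for nxt in step[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def clustering_coefficient(node, parents, kids):
    """2 * E / (k * (k - 1)) over a node's neighbours, edges taken as undirected."""
    neighbours = set(parents[node]) | set(kids[node])
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(
        1
        for a in neighbours
        for b in neighbours
        if a < b and (b in parents[a] or a in parents[b])
    )
    return 2 * links / (k * (k - 1))

kids = children_of(PARENTS)
bf = branching_factors(kids)
roots = [n for n, ps in PARENTS.items() if not ps]
leaves = [n for n, cs in kids.items() if not cs]
depth = shortest_distances(roots, kids)       # assumption: shortest path from a root
height = shortest_distances(leaves, PARENTS)  # assumption: shortest path to a leaf
multi = sorted(n for n, ps in PARENTS.items() if len(ps) > 1)

mean_bf_all = sum(bf.values()) / len(bf)
inner = [n for n in bf if kids[n]]
mean_bf_inner = sum(bf[n] for n in inner) / len(inner)

print("mean branching factor (all nodes):", mean_bf_all)
print("mean branching factor (excluding leaves):", round(mean_bf_inner, 3))
print("max depth:", max(depth.values()), "max height:", max(height.values()))
print("nodes with multiple parents:", multi)
print("clustering coefficient of 'person':",
      clustering_coefficient("person", PARENTS, kids))
```

Even in this toy graph the same pattern appears as in the slides: including leaves pulls the mean branching factor down, the maximum depth exceeds the maximum height, and a tree-like hierarchy yields a first-order clustering coefficient of zero because a node's neighbours are rarely linked to each other.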