Constructing Folksonomies from User-Specified Relations on Flickr
Anon Plangprasopchok and Kristina Lerman (WWW 2009)

Motivation
[Diagram labels: Users, Produce, Annotation / Metadata, Web content, Leverage, hierarchical classification; uses: Organize, Search, Recommend, Categorize]

Motivation
• Metadata from an individual user may be too inaccurate and incomplete.
• Metadata from different users may complement each other and become meaningful in combination.
Goal: to induce category knowledge from social annotation produced by many users

Folksonomy
• Original definition: classification emerging from the use of tags by users (Thomas Vander Wal)
• In this work: hidden classification hierarchies from annotation created by many users

Hierarchical Relations in Social Web
• Appear implicitly
  – Tags: Insect, Grasshopper, Australian, Macro, Orthopteran (直翅類)
  – Goal: to induce deeper hierarchies from this metadata
• Appear explicitly
  – Folder (collection) and sub folder (set) relations

Outline
• Motivation
• Approaches
• Results
• Discussion
• Related work

Inducing Hierarchy from Tags
Existing approaches
• Graph based (Mika05)
  – build a network of associated tags (node = tag, edge = co-occurrence of tags)
  – suggest applying betweenness centrality and set theory to determine broader/narrower relations
• Hierarchical clustering (Brooks06; Heymann06+)
  – Tags that appear more frequently have higher centrality and are thus treated as more abstract.
• Probabilistic subsumption (Sanderson99+, Schmitz06)
  – x is broader than y if x subsumes y
  – x subsumes y if p(x|y) > t and p(y|x) < t
  [Diagram: nodes x and y]

Inducing Hierarchy from Tags
• Some difficulties when using tags to induce hierarchy
Notation: A → B (A is broader than B; hypernym relation)
  – Washington → United States, Car → Automobile (specificity, rarity)
  – Insect → Hongkong, Color → Brazilian (tags are from different facets)
Note: the relations above were induced using the subsumption approach on tags [Sanderson99+, Schmitz06]

Inducing Hierarchy from User-Specified Relations
• User-specified relations, e.g.,
  – Flickr's Collection-Set,
  – Delicious' Bundle-Tag,
  – Bibsonomy's Relation-Tag
• Key intuition: few people specify peculiar relations like
  – "automobile" → "car", or
  – "Washington" → "United States"

Simple Strategy
[Diagram: a collection "The Netherlands - Holanda" and its set "Blijdorp - Rotterdam" are tokenized and stemmed into concept relations from the collection terms (netherland, holanda) to the set terms (blijdorp, rotterdam); relations gathered from many users (terms such as countri, holland, netherland, blijdorp, china, ……) then go through the two steps below.]
1. Remove "noisy" relations
  – Conflict resolution
  – Significance test
2. Link concepts & select path

Remove noisy relations: 1st approach
• Conflict resolution (when both a → b and b → a appear)
  – Relation conflicts occur because of noise
  – Voting scheme: keep a → b (and discard b → a) if N_u(a → b) > 1 and N_u(a → b) > N_u(b → a), where N_u is the number of users who specify the relation
  [Example figure: insect → butterfly and butterfly → insect, with counts 2 and 10]

Remove noisy relations: 2nd approach
• Significance test
  – Use a statistical significance test to decide whether a → b is significant
  – Null hypothesis: the observed relation a → b was generated by chance, via the random, independent generation of the individual concepts a and b (according to the binomial distribution).
  [Figure: "Is "b" narrower than "a" by chance?"; accept and reject regions; axis labels: # observations, # of a → b]

Link concepts and select path
• Link concepts: assume that the same terms refer to the same concept.
  [Diagram: relation fragments over anim, bug and insect are joined on their shared terms]
• Select path: linking relations from many users can produce a spaghetti graph
  – Network bottleneck idea: "the flow bottleneck is a minimum flow capacity among all relations in the path"
  – Edge weights in the example graph: anim → bug 26, anim → insect 72, anim → moth 10, bug → insect 1, bug → moth 4, insect → moth 18
  – 4 possible paths from anim to moth (a = anim, b = bug, i = insect, m = moth):
    1) a → b → i → m [BN score = min(26, 1, 18) = 1]
    2) a → i → m [BN score = min(72, 18) = 18]
    3) a → m [BN score = min(10) = 10]
    4) a → b → m [BN score = min(26, 4) = 4]
  – Path 2 has the largest bottleneck score among the four.

Contribution #2: Learning Concept Hierarchies
Evaluation & Data Set
• Hypothesis: an approach that takes explicit relations into account can induce better hierarchies.
• "Better" means more consistent with hand-built hierarchies (ODP, ver. 10/08)
• The baseline is the subsumption approach [Schmitz06]. Collection and set terms are used instead of tags to make the two approaches comparable.
Data set:
• Data from 17 user groups devoted to wildlife and naturalist photography
• 21,792 of 39,922 users specify at least one collection
• 110,543 unique terms (cf. 166,153 unique terms in ODP), with 15,495 terms in common

Evaluation methodology
• ODP has many sub-hierarchies: comparing all of them to the induced ones is impractical.
• It is easier to compare after specifying a "root concept" and "leaf concepts", i.e., a particular subtree to compare.
[Figure panels: Relations (right after tokenization), Reference hierarchy (ODP), Induced hierarchy]

Metrics
• Taxonomic Overlap [adapted from Maedche02+]
  – measures structural similarity between two trees
  – for each node, determines how many of its ancestor and descendant nodes overlap with those in the reference tree.
• Lexical Recall
  – measures how well an approach discovers the concepts that exist in the reference hierarchy (coverage)

Quantitative Results
[figure]
• 32 root nodes were selected manually
• Taxonomic Overlap:
  – 27 of them score better than with subsumption
  – 3 of them get a zero score with both approaches
• Lexical Recall:
  – 28 of them score better than with subsumption
  – 2 of them get a similar score with both approaches
  – for the rest, subsumption induces only the root node.
• The proposed approach can induce deeper trees
The proposed approach can induce hierarchies more consistent with ODP in almost all cases.

Sport hierarchy
[figure]

Invertebrate hierarchy
[figure]

Country hierarchy
[figure]

Discussion
• A simple strategy to aggregate a large number of shallow relations specified by different users into a common, deeper hierarchy
• Induced hierarchies are more consistent with ODP
• Future work includes:
  – Term ambiguity
  – Relation types
  – Global path
  – Applying the approach to other datasets

Thank You! & Questions?
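
Backup: conflict resolution sketch
The voting scheme from "Remove noisy relations: 1st approach" could be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `resolve_conflicts`, the dictionary input format, and all user counts in `example_counts` are hypothetical. Relations with no reverse counterpart are passed through unchanged, which is one reading of "when both a->b and b->a appear".

```python
def resolve_conflicts(user_counts):
    """Apply the voting rule to conflicting relation pairs.

    user_counts maps (broader, narrower) term pairs to N_u, the number of
    distinct users who specified that relation (hypothetical input format).
    For a conflicting pair, a->b is kept only if
    N_u(a->b) > 1 and N_u(a->b) > N_u(b->a).
    """
    kept = {}
    for (a, b), n_ab in user_counts.items():
        if (b, a) not in user_counts:
            # Assumption: relations without a conflict are left untouched.
            kept[(a, b)] = n_ab
            continue
        n_ba = user_counts[(b, a)]
        if n_ab > 1 and n_ab > n_ba:
            kept[(a, b)] = n_ab
    return kept


# Hypothetical counts, chosen only to exercise each branch of the rule.
example_counts = {
    ("insect", "butterfly"): 10,  # wins the vote against the reverse relation
    ("butterfly", "insect"): 2,   # discarded
    ("anim", "bird"): 1,          # no reverse relation, passed through
    ("bird", "eagle"): 3,         # tie with the reverse: neither is kept
    ("eagle", "bird"): 3,
}

print(resolve_conflicts(example_counts))
```

On a tie, neither direction satisfies the strict inequality, so both are dropped in this sketch.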
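
Backup: bottleneck path selection sketch
The network bottleneck idea from "Link concepts and select path" can be illustrated with the anim/bug/insect/moth example and its edge weights. The sketch below is a hedged illustration, not the authors' code: the function names and the nested-dictionary graph format are assumptions, and it enumerates every simple path exhaustively, which is only practical for small graphs like this one.

```python
def all_simple_paths(graph, source, target, path=None):
    """Yield every cycle-free path from source to target (depth-first)."""
    path = (path or []) + [source]
    if source == target:
        yield path
        return
    for nxt in graph.get(source, {}):
        if nxt not in path:  # never revisit a node already on the path
            yield from all_simple_paths(graph, nxt, target, path)


def bottleneck(graph, path):
    """Bottleneck (BN) score: the minimum edge weight along the path."""
    return min(graph[u][v] for u, v in zip(path, path[1:]))


def best_path(graph, source, target):
    """Return the path with the largest BN score and that score.

    Assumes source != target; returns (None, 0) if no path exists.
    """
    paths = list(all_simple_paths(graph, source, target))
    if not paths:
        return None, 0
    best = max(paths, key=lambda p: bottleneck(graph, p))
    return best, bottleneck(graph, best)


# Edge weights taken from the slide's example graph.
graph = {
    "anim": {"bug": 26, "insect": 72, "moth": 10},
    "bug": {"insect": 1, "moth": 4},
    "insect": {"moth": 18},
}

path, score = best_path(graph, "anim", "moth")
print(path, score)  # ['anim', 'insect', 'moth'] 18
```

Taking the minimum over a path means a single weakly supported relation (such as bug → insect with weight 1) penalizes the whole path, no matter how strong its other links are.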