Constructing Folksonomies from UserSpecified Relations on Flickr Anon Plangprasopchok and Kristina Lerman 1 Motivation Annotation / Metadata Produce Users Web content hierarchical classification Organize Search Recommend Leverage Categorize 2 Motivation • Metadata from an individual user may be too inaccurate and incomplete… •The metadata from different users may complement each other, making it, in combination, meaningful. Goal: to induce category knowledge from social annotation produced by many users 3 Folksonomy • Original definition: classification emerging from the use of tags by users (Thomas Vander Wal) • In this work: hidden classification hierarchies from annotation created many users 4 Hierarchical Relations in Social Web • Appear Implicitly Tags: Insect Grasshopper Australian Macro Orthoptera • Appear Explicitly Folder (collection) Sub folder (set) Goal: to induce deeper hierarchies from this metadata Relations 5 Outline • • • • • Motivation Approaches Results Discussion Related work 6 Inducing Hierarchy from Tags Existing approaches • Graph based (Mika05) • build a network of associated tags (node = tag, edge = cooccurrence of tags) • suggest applying betweenness centrality and set theory to determine broader/narrower relations • Hierarchical Clustering (Brooks06; Heymann06+) •Tags appear more frequently would have higher centrality and thus more abstract. • Probabilistic subsumption (Sanderson99+, Schmitz06) • x is broader than y if x subsumes y x • x subsumes y if p(x|y) > t & p(y|x) < t y 7 Inducing Hierarchy from Tags • Some difficulties when using tags to induce hierarchy: Notation: A B (A is broader than B) (hypernym relation) Washington United States Car Automobile Specificity Rarity Insect Hongkong Color Brazilian Tags are from different facets* Above relations induced using subsumption approach on tags [Sanderson99+, Schmitz06] 8 Inducing Hierarchy from user-specified relations • User specified relations, e.g., – Flickr’s Collection-Set , – Delicious’ Bundle-Tag, – Bibsonomy’s Relation-Tag • Key intuition: Not so many people specify peculiar relations like – “automobile” “car”, or – “Washington” “United States” 9 Simple Strategy Collection Sets Concept relations netherland Collection The Netherlands - Holanda holanda blijdorp holanda blijdorp rotterdam netherland Set Tokenize + Stem Blijdorp - Rotterdam countri rotterdam netherland 1. Remove “noisy” relations - Conflict resolution - Significance test countri holland netherland blijdorp china …… 2. Link concepts & Select path 10 Remove noisy relations: 1st approach • Conflict Resolution (when both a->b and b->a appear) – Relation conflicts occur because of noise – Voting scheme: Keep ab (and discard ba) If Nu(ab) > 1 and Nu(ab) > Nu(ba) insect butterfly 2 10 butterfly insect 11 Remove noisy relations: 2nd approach • Significance Test - Use statistical significance test to decide if a b is significant - Null hypothesis: observed relation ab was generated by chance, via the random, independent generation of individual concepts a, b (according to the binomial distribution). Is “b” narrower than “a” by chance? accept reject # observations # of ab 12 Link concepts and select path • Link concepts: assume that same terms refer to the same concept. anim anim + bug anim insect bug insect • Select path: link relations from many users can cause a spaghetti graph 4 possible paths from anim moth: 1) abim 2) aim 3) am 4) abm anim 26 72 1 insect bug Network Bottleneck idea: “the flow bottleneck is a minimum flow capacity among all relations in the path” 10 18 4 moth 1) 2) 3) 4) abim [BN score = min(26,1,18) = 1] aim [BN score = min(72,18) = 18] am [BN score = min(10) = 10] abm [BN score = min(26,4) = 4] 13 Contribution#2:Learning Concept Hierarchies Evaluation & Data Set • Hypothesis: the approach that takes explicit relations into account can induce better hierarchies. • “Better” means more consistent with hand-built hierarchies (ODP ver. 10/08) • The baseline approach is subsumption approach [Schmitz06] Collection and set terms are used instead of tags, making it comparable. Data Set: Data from 17 user groups, devoted to wildlife and naturalist photography 21,792 of 39,922 users specify at least one collection 110,543 unique terms (c.f. 166,153 unique terms in ODP), 15,495 terms in common. 14 Contribution#2:Learning Concept Hierarchies Evaluation methodology ODP has many sub hierarchies: comparing to the induced ones are impractical! It’s easier to compare when specifying “root concept” and “leaf concepts”, i.e., specifying a certain sub tree to compare. Relations (right after tokenized) Reference hierarchy (ODP) Induced hierarchy 15 Contribution#2:Learning Concept Hierarchies Metrics • Taxonomic Overlap [adapted from Maedche02+] – measuring structure similarity between two trees – for each node, determining how many ancestor and descendant nodes overlap to those in the reference tree. • Lexical Recall – measuring how well an approach can discover concepts, existing in the reference hierarchy (coverage) 16 Quantitative Results 17 Contribution#2:Learning Concept Hierarchies Quantitative Results • Manually selecting 32 root nodes • Taxonomic Overlap : • 27 of them are better than those by subsumption • 3 of them get zero score in both approaches • Lexical Recall: • 28 of them are better than those by subsumption • 2 of them get similar score on both approaches • the rest, by subsumption, only induce the root node. • The proposed approach can induce deeper trees The proposed approach can induce hierarchies more consistent with ODP in almost all cases. 18 Sport hierarchy 19 Invertebrate hierarchy 20 Country hierarchy 21 Discussion • Simple strategy to aggregate a large number of shallow relations specified by different users into a common, deeper hierarchy • Induced hierarchies are more consistent with ODP • Future work includes: Term ambiguity Relation types Global path Apply to other datasets 22 Related Work • Learning concept hierarchy from text data • Syntactic based [Hearst92, Caraballo99, Pasca04, Cimiano+05, Snow+06] • Word clustering [e.g. Segal+02, Blei+03] • Induce concept hierarchy from tags • Graph-based & clustering based [Mika05, Brooks+06, Heymann+06, Zhou07+] • Probabilistic subsumption [Schmitz06] • Ontology alignment [Udrea+07] • Exploit user-specified hierarchy • GiveALink [Markines06+] 23 • Questions? • Is the metric used in evaluation meaningful? • How is the scalability of the system? • Wordnet, ODP is already there. Why do we need this system? • How is this work related to ontology enrichment? • Is it ethical to collect users’ data? – ….? 24 Spared slides beyond here 25 Open Problems • Term ambiguity - The current approach: similar terms refer to the similar concept …. but.. Canada “Victoria” Australia Lotus Person name - And has no explicit way to merge synonyms Spain España (There are also many acronyms & colloquial terms in Social Web) A possible solution: concept clustering 26 Open Problems • Inducing “related-to” relation – “Flora” and “Fauna”, “Pet” and “Family” – Prepositions or some connectors may give some clues, e.g., “flora & fauna” and “Pets – Family” – Tag distributions may also help Nature Flora Fauna Nature Flora Fauna 27 Open Problems • True parent selection – Tokenizing collection/set names can cause another problem Flora & Fauna Flora Fauna Insect Insect Insect A possible solution: conditional probability ratio 28 Conclusion • Propose statistical approaches for inducing concepts; inducing concept hierarchies, from social annotation These approaches perform better than existing approaches • On going work aim to improve induced hierarchies’ quality includes: • • • • Resolve term ambiguity Induce “related to” relations Select the right parent Evaluate on more data sets 29 Social Web spare User Content 30 Adapted from The Social Web: an Information Revolution (courtesy of Kristina Lerman) Social Web users content 3 Basic Entities Involved (1) User (2) Content (3) Metadata - Produce - Consume - Annotate & Organize Delicious : 5.3 million users; over 180 million unique URLs [blog.delicious.com, 2009] Flickr: 2 billion photos [techcrunch.com, 2007]/ 4000+ photos upload per min (1/21/2009 morning) 31 Motivation spare Social Annotation is potentially a good source of evidence for inducing category knowledge, which is useful in many applications, e.g., • Organizing Arranging/ Visualizing users’ content (e.g., semantic directory) • Search/Discovery Especially, binary content like photos and videos, where social annotation functions as a semantic index • Recommendation Learning users’ taste/ interest • Leveraging knowledge bases Updating lexical systems and ontologies for semantic web applications • Categorization Understanding how new content fits to existing ones 32 Motivation Although metadata from an individual user may be too inaccurate and incomplete, those from different users may complement each other, making them meaningful for the tasks. Goal: to induce category knowledge from social annotation produced by many users 33 Contribution#2:Learning Concept Hierarchies Evaluation methodology ODP has many sub hierarchies: comparing to the induced ones are impractical! It’s easier to compare when specifying “root concept” and “leaf concepts”, i.e., specifying a certain sub tree to compare. 34 <root, leaf> e.g., <anim, rat> User-specified relations Collection <root, leaf, odp path> e.g., Animal/Mammal/Rodent/Rat Find ODP rootleaf pairs that overlap w/Flickr Data preprocessing Set Relation weighting & linking Flickr relations Significance Test Conflict Resolution Hierarchy Construction Flickr-ODP root-leaf overlaps Compute Taxonomic Overlap, Lexical Recall Subsumption Evaluation 35 Why subsumption does not work so well? Countri Ideal China Reality 36 Contribution#2:Learning Concept Hierarchies Africa Hierarchy 37