Constructing Folksonomies from User-Specified Relations on Flickr
Anon Plangprasopchok and
Kristina Lerman
1
Motivation
[Figure: diagram relating Users, Web content, Annotation / Metadata (Produce), and hierarchical classification, with the uses Organize, Search, Recommend, Leverage, Categorize]
2
Motivation
• Metadata from an individual user may be too inaccurate and incomplete…
• Metadata from different users may complement each other, making the combined metadata meaningful.
Goal: to induce category knowledge from
social annotation produced by many users
3
Folksonomy
• Original definition: classification emerging
from the use of tags by users (Thomas
Vander Wal)
• In this work: hidden classification hierarchies in annotations created by many users
4
Hierarchical Relations in Social Web
• Appear Implicitly
Tags: Insect, Grasshopper, Australian, Macro, Orthoptera
• Appear Explicitly
Relations: Folder (collection), Sub folder (set)
Goal: to induce deeper hierarchies from this metadata
5
Outline
• Motivation
• Approaches
• Results
• Discussion
• Related work
6
Inducing Hierarchy from Tags
Existing approaches
• Graph based (Mika05)
• build a network of associated tags (node = tag, edge = cooccurrence of tags)
• suggest applying betweenness centrality and set theory to
determine broader/narrower relations
• Hierarchical Clustering (Brooks06; Heymann06+)
• Tags that appear more frequently have higher centrality and are thus treated as more abstract.
• Probabilistic subsumption (Sanderson99+, Schmitz06)
• x is broader than y if x subsumes y
• x subsumes y if p(x|y) > t and p(y|x) < t
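The subsumption rule above can be sketched in a few lines. This is only an illustration, assuming p(x|y) is estimated from tag co-occurrence counts over items; the function name `subsumption_relations`, the threshold value, and the sample photos are hypothetical and not taken from the cited papers.

```python
from collections import Counter
from itertools import permutations

def subsumption_relations(tagged_items, t=0.8):
    """Return sorted (x, y) pairs where x subsumes y: p(x|y) > t and p(y|x) < t."""
    tag_count = Counter()   # number of items carrying each tag
    pair_count = Counter()  # number of items carrying both tags of an ordered pair
    for tags in tagged_items:
        unique = set(tags)
        tag_count.update(unique)
        pair_count.update(permutations(unique, 2))
    relations = []
    for (x, y), both in pair_count.items():
        p_x_given_y = both / tag_count[y]
        p_y_given_x = both / tag_count[x]
        if p_x_given_y > t and p_y_given_x < t:
            relations.append((x, y))
    return sorted(relations)

# Hypothetical tag sets, one per photo.
photos = [
    {"insect", "grasshopper"},
    {"insect", "grasshopper"},
    {"insect", "butterfly"},
    {"insect", "moth"},
    {"insect"},
]
relations = subsumption_relations(photos, t=0.8)
print(relations)
```

Here "insect" co-occurs with every narrower tag, so it comes out as the broader term in each pair.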
7
Inducing Hierarchy from Tags
• Some difficulties when using tags
to induce hierarchy:
Notation: A -> B
(A is broader than B)
(hypernym relation)
Washington -> United States
Car -> Automobile
Specificity  Rarity
Insect -> Hongkong
Color -> Brazilian
Tags are from different facets*
The relations above were induced using the subsumption approach on tags [Sanderson99+, Schmitz06]
8
Inducing Hierarchy from user-specified relations
• User-specified relations, e.g.,
– Flickr’s Collection-Set,
– Delicious’ Bundle-Tag,
– Bibsonomy’s Relation-Tag
• Key intuition: few people specify peculiar relations such as
– “automobile” -> “car”, or
– “Washington” -> “United States”
9
Simple Strategy
[Figure: a Flickr collection named “The Netherlands - Holanda” containing a set named “Blijdorp - Rotterdam”. Tokenize + Stem turns the collection name into the terms netherland and holanda, and the set name into blijdorp and rotterdam; these terms form the concept relations.]
1. Remove “noisy” relations
- Conflict resolution
- Significance test
2. Link concepts & Select path
[Figure: example of linked concepts with the terms countri, holland, netherland, blijdorp, china, ……]
10
Remove noisy relations:
1st approach
• Conflict Resolution (when both a->b and b->a appear)
– Relation conflicts occur because of noise
– Voting scheme:
Keep a->b (and discard b->a)
if Nu(a->b) > 1 and Nu(a->b) > Nu(b->a)
[Figure: example conflict between insect -> butterfly and butterfly -> insect, with counts 2 and 10]
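A minimal sketch of the voting scheme, under stated assumptions: Nu is read as the number of distinct users who specify a relation, relations with no reverse counterpart are passed through unchanged, and the counts in the sample data (10 users versus 2) are assigned to directions hypothetically. The name `resolve_conflicts` is illustrative.

```python
from collections import defaultdict

def resolve_conflicts(user_relations):
    """user_relations: iterable of (user, parent, child) triples.

    Returns {(parent, child): number_of_users} after voting on conflicts.
    """
    users = defaultdict(set)
    for user, parent, child in user_relations:
        users[(parent, child)].add(user)

    kept = {}
    for (a, b), who in users.items():
        n_ab = len(who)
        n_ba = len(users.get((b, a), ()))
        if n_ba == 0:
            # Assumption: no conflict, so the relation is kept as is.
            kept[(a, b)] = n_ab
        elif n_ab > 1 and n_ab > n_ba:
            # Voting rule from the slide: keep a->b, discard b->a.
            kept[(a, b)] = n_ab
    return kept

# Hypothetical data: 10 users say insect -> butterfly, 2 say the reverse.
data = [(f"u{i}", "insect", "butterfly") for i in range(10)]
data += [(f"v{i}", "butterfly", "insect") for i in range(2)]
data += [("w0", "anim", "insect")]
kept = resolve_conflicts(data)
print(kept)
```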
11
Remove noisy relations:
2nd approach
• Significance Test
- Use a statistical significance test to decide whether a -> b is significant
- Null hypothesis: the observed relation a -> b was generated by chance, via the random, independent generation of the individual concepts a and b (according to the binomial distribution).
[Figure: “Is b narrower than a by chance?” Distribution with accept and reject regions; axis labels: # of a -> b, # observations]
12
Link concepts and select path
• Link concepts: assume that identical terms refer to the same concept.
[Figure: separate relations among anim, bug, and insect merged into one graph by linking identical terms]
• Select path: linking relations from many users can produce a spaghetti graph
Edge weights (from the figure): anim->bug 26, anim->insect 72, anim->moth 10, bug->insect 1, bug->moth 4, insect->moth 18
Network Bottleneck idea:
“the flow bottleneck is a minimum flow capacity among all relations in the path”
4 possible paths from anim -> moth (a = anim, b = bug, i = insect, m = moth):
1) a->b->i->m [BN score = min(26,1,18) = 1]
2) a->i->m [BN score = min(72,18) = 18]
3) a->m [BN score = min(10) = 10]
4) a->b->m [BN score = min(26,4) = 4]
13
Contribution#2:Learning Concept Hierarchies
Evaluation & Data Set
• Hypothesis: the approach that takes explicit relations into account can induce better hierarchies.
• “Better” means more consistent with hand-built hierarchies (ODP ver. 10/08)
• The baseline approach is the subsumption approach [Schmitz06]. Collection and set terms are used instead of tags, making it comparable.
Data Set:
• Data from 17 user groups devoted to wildlife and naturalist photography
• 21,792 of 39,922 users specify at least one collection
• 110,543 unique terms (cf. 166,153 unique terms in ODP), 15,495 terms in common.
14
Evaluation methodology
ODP has many sub-hierarchies, so comparing all of them with the induced ones is impractical!
It’s easier to compare when specifying a “root concept” and “leaf concepts”, i.e., specifying a certain subtree to compare.
[Figure: three panels: Relations (right after tokenization), Reference hierarchy (ODP), Induced hierarchy]
15
Metrics
• Taxonomic Overlap [adapted from Maedche02+]
– measures the structural similarity between two trees
– for each node, determines how many of its ancestor and descendant nodes overlap with those in the reference tree.
• Lexical Recall
– measures how well an approach discovers the concepts that exist in the reference hierarchy (coverage)
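Both metrics can be sketched on small trees. The slides only say that Taxonomic Overlap is adapted from Maedche02+, so the formula below is an assumption: for each node shared by both trees, it takes the Jaccard overlap of the node's ancestors, descendants, and itself, then averages. Lexical Recall is taken as the fraction of reference concepts that appear in the induced tree. The trees and function names are hypothetical.

```python
def nodes(tree):
    """All nodes of a {parent: [children]} tree."""
    return set(tree) | {c for children in tree.values() for c in children}

def descendants(tree, node):
    out, stack = set(), list(tree.get(node, []))
    while stack:
        n = stack.pop()
        if n not in out:
            out.add(n)
            stack.extend(tree.get(n, []))
    return out

def cotopy(tree, node):
    """The node together with its ancestors and descendants."""
    ancestors = {n for n in nodes(tree) if node in descendants(tree, n)}
    return ancestors | descendants(tree, node) | {node}

def lexical_recall(induced, reference):
    ref = nodes(reference)
    return len(nodes(induced) & ref) / len(ref)

def taxonomic_overlap(induced, reference):
    shared = nodes(induced) & nodes(reference)
    if not shared:
        return 0.0
    total = 0.0
    for n in shared:
        a, b = cotopy(induced, n), cotopy(reference, n)
        total += len(a & b) / len(a | b)
    return total / len(shared)

# Hypothetical reference and induced hierarchies.
reference = {"anim": ["mammal", "insect"], "mammal": ["rat"], "insect": ["moth"]}
induced = {"anim": ["insect", "rat"], "insect": ["moth"]}
lr = lexical_recall(induced, reference)
to = taxonomic_overlap(induced, reference)
print(lr, to)
```

In this toy case the induced tree misses "mammal", which lowers recall and also lowers the overlap for "anim" and "rat".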
16
Quantitative Results
17
Quantitative Results
• Manually selecting 32 root nodes
• Taxonomic Overlap:
• 27 of them score better than those induced by subsumption
• 3 of them get a zero score under both approaches
• Lexical Recall:
• 28 of them score better than those induced by subsumption
• 2 of them get similar scores under both approaches
• for the rest, subsumption induces only the root node.
• The proposed approach can induce deeper trees
The proposed approach induces hierarchies more consistent with ODP in almost all cases.
18
Sport hierarchy
19
Invertebrate hierarchy
20
Country hierarchy
21
Discussion
• Simple strategy to aggregate a large number of shallow
relations specified by different users into a common,
deeper hierarchy
• Induced hierarchies are more consistent with ODP
• Future work includes:
Term ambiguity
Relation types
Global path
Apply to other datasets
22
Related Work
• Learning concept hierarchy from text data
• Syntactic based [Hearst92, Caraballo99, Pasca04,
Cimiano+05, Snow+06]
• Word clustering [e.g. Segal+02, Blei+03]
• Induce concept hierarchy from tags
• Graph-based & clustering based [Mika05,
Brooks+06, Heymann+06, Zhou07+]
• Probabilistic subsumption [Schmitz06]
• Ontology alignment [Udrea+07]
• Exploit user-specified hierarchy
• GiveALink [Markines06+]
23
• Questions?
• Is the metric used in evaluation meaningful?
• How is the scalability of the system?
• WordNet and ODP already exist. Why do we need this system?
• How is this work related to ontology
enrichment?
• Is it ethical to collect users’ data?
– ….?
24
Spare slides beyond here
25
Open Problems
• Term ambiguity
- The current approach assumes that identical terms refer to the same concept …. but..
[Figure: the term “Victoria” linked to Canada, Australia, Lotus, and Person name]
- It also has no explicit way to merge synonyms, e.g., Spain and España
(There are also many acronyms & colloquial terms in Social Web)
A possible solution: concept clustering
26
Open Problems
• Inducing “related-to” relation
– “Flora” and “Fauna”, “Pet” and “Family”
– Prepositions or some connectors may give some clues, e.g.,
“flora & fauna” and
“Pets – Family”
– Tag distributions may also help
[Figure: Nature with the single child “Flora Fauna”, next to Nature with the separate children Flora and Fauna]
27
Open Problems
• True parent selection
– Tokenizing collection/set names can cause
another problem
[Figure: Insect shown as a child of “Flora & Fauna”, of Flora, and of Fauna]
A possible solution: conditional probability ratio
28
Conclusion
• Propose statistical approaches for inducing concepts and for inducing concept hierarchies from social annotation
These approaches perform better than existing approaches
• Ongoing work that aims to improve the quality of the induced hierarchies includes:
• Resolve term ambiguity
• Induce “related to” relations
• Select the right parent
• Evaluate on more data sets
29
Social Web
spare
User
Content
30
Adapted from The Social Web: an Information Revolution (courtesy of Kristina Lerman)
Social Web
users
content
3 Basic Entities Involved
(1) User
(2) Content
(3) Metadata
- Produce
- Consume
- Annotate & Organize
Delicious : 5.3 million users; over 180 million unique URLs [blog.delicious.com, 2009]
Flickr: 2 billion photos [techcrunch.com, 2007]; 4000+ photos uploaded per minute (morning of 1/21/2009)
31
Motivation
spare
Social Annotation is potentially a good source of evidence for
inducing category knowledge, which is useful in many applications, e.g.,
• Organizing
Arranging/ Visualizing users’ content (e.g., semantic directory)
• Search/Discovery
Especially for binary content such as photos and videos, where social annotation functions as a semantic index
• Recommendation
Learning users’ taste/ interest
• Leveraging knowledge bases
Updating lexical systems and ontologies for semantic web applications
• Categorization
Understanding how new content fits with existing content
32
Motivation
Although metadata from an individual user may be too
inaccurate and incomplete, those from different users
may complement each other, making them meaningful
for the tasks.
Goal: to induce category knowledge from
social annotation produced by many users
33
Evaluation methodology
ODP has many sub-hierarchies, so comparing all of them with the induced ones is impractical!
It’s easier to compare when specifying “root concept” and “leaf concepts”, i.e.,
specifying a certain sub tree to compare.
34
[Figure: evaluation pipeline flowchart. Box labels: User-specified relations (Collection, Set); Data preprocessing; Relation weighting & linking; Flickr relations; Significance Test; Conflict Resolution; Hierarchy Construction; Subsumption; Find ODP root-leaf pairs that overlap w/ Flickr; Flickr-ODP root-leaf overlaps; Evaluation: compute Taxonomic Overlap, Lexical Recall. Data labels: <root, leaf>, e.g., <anim, rat>; <root, leaf, odp path>, e.g., Animal/Mammal/Rodent/Rat]
35
Why does subsumption not work so well?
[Figure: the terms Countri and China shown in two panels, “Ideal” and “Reality”]
36
Africa Hierarchy
37