Constructing Folksonomies from User-Specified Relations on Flickr
Anon Plangprasopchok and
Kristina Lerman
(WWW 2009)
1
Motivation
[Diagram: users produce annotation / metadata for Web content. The metadata can be leveraged to categorize (hierarchical classification), organize, search, and recommend content.]
2
Motivation
• Metadata from an individual user may be too inaccurate and incomplete…
• The metadata from different users may complement each other, making it meaningful in combination.
Goal: to induce category knowledge from
social annotation produced by many users
3
Folksonomy
• Original definition: classification emerging
from the use of tags by users (Thomas
Vander Wal)
• In this work: hidden classification
hierarchies from annotation created by
many users
4
Hierarchical Relations in Social Web
• Appear Implicitly
Tags: Insect, Grasshopper, Australian, Macro, Orthopteran (直翅類)
• Appear Explicitly
Relations: Folder (collection) -> Sub folder (set)
Goal: to induce deeper hierarchies from this metadata
5
Outline
• Motivation
• Approaches
• Results
• Discussion
• Related work
6
Inducing Hierarchy from Tags
Existing approaches
• Graph based (Mika05)
• build a network of associated tags (node = tag, edge = co-occurrence of tags)
• suggest applying betweenness centrality and set theory to determine broader/narrower relations
• Hierarchical Clustering (Brooks06; Heymann06+)
• Tags that appear more frequently have higher centrality and are thus treated as more abstract.
• Probabilistic subsumption (Sanderson99+, Schmitz06)
• x is broader than y if x subsumes y
• x subsumes y if p(x|y) > t & p(y|x) < t
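The subsumption rule above can be sketched in a few lines. This is a minimal illustration, not the cited authors' implementation: the function name, the toy photo data, and the threshold value t = 0.8 are all assumptions (the slide gives t without a value), and probabilities are estimated from simple co-occurrence counts.

```python
from collections import Counter
from itertools import permutations


def subsumption_relations(tagged_items, t=0.8):
    """Return sorted (broader, narrower) pairs where x subsumes y.

    Rule from the slide: p(x|y) > t and p(y|x) < t.
    The default t is an assumed value, not taken from the slides.
    """
    tag_count = Counter()
    pair_count = Counter()
    for tags in tagged_items:
        unique = set(tags)
        tag_count.update(unique)
        # Count every ordered pair of distinct tags on the same item.
        pair_count.update(permutations(sorted(unique), 2))

    relations = []
    for (x, y), n_xy in pair_count.items():
        p_x_given_y = n_xy / tag_count[y]
        p_y_given_x = n_xy / tag_count[x]
        if p_x_given_y > t and p_y_given_x < t:
            relations.append((x, y))
    return sorted(relations)


# Hypothetical toy data: each set holds the tags of one photo.
photos = [
    {"insect", "grasshopper"},
    {"insect", "grasshopper"},
    {"insect", "butterfly"},
    {"insect", "butterfly"},
    {"insect"},
]
relations = subsumption_relations(photos, t=0.8)
print(relations)
```

Here "insect" co-occurs with every use of "grasshopper" and "butterfly" but not the other way round, so it comes out as the broader term for both.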
7
Inducing Hierarchy from Tags
• Some difficulties when using tags
to induce hierarchy:
Notation: A -> B
(A is broader than B; hypernym relation)
Washington -> United States
Car -> Automobile
Specificity  Rarity
Insect -> Hongkong
Color -> Brazilian
Tags are from different facets*
Above relations induced using subsumption approach on tags [Sanderson99+, Schmitz06]
8
Inducing Hierarchy from user-specified relations
• User specified relations, e.g.,
– Flickr’s Collection-Set,
– Delicious’ Bundle-Tag,
– Bibsonomy’s Relation-Tag
• Key intuition: Few people specify peculiar relations like
– “automobile” -> “car”, or
– “Washington” -> “United States”
9
Simple Strategy
[Figure: a collection ("The Netherlands - Holanda") and one of its sets ("Blijdorp - Rotterdam") are tokenized and stemmed into terms (netherland, holanda; blijdorp, rotterdam), which yield concept relations from collection terms to set terms. Relations pooled from many users (over terms such as countri, holland, netherland, blijdorp, china, ……) then go through two steps.]
1. Remove “noisy” relations
- Conflict resolution
- Significance test
2. Link concepts & Select path
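The tokenize-and-stem step of this strategy can be sketched as follows. Several things here are assumptions rather than details from the slides: the stop-word list, the `toy_stem` helper (a crude stand-in for a real stemmer such as Porter's), and the choice to relate every collection term to every set term.

```python
import re

# Assumed stop-word list; the slides do not give one.
STOPWORDS = {"the", "a", "an", "of", "and", "in"}


def toy_stem(token):
    """Hypothetical stand-in for a real stemmer: strips a plural 's' only."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def terms(name):
    """Lower-case, split on non-letters (ASCII only), drop stop words, stem."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]


def concept_relations(collection_name, set_name):
    """Pair each collection term (broader) with each set term (narrower)."""
    return [(c, s) for c in terms(collection_name) for s in terms(set_name)]


relations = concept_relations("The Netherlands - Holanda", "Blijdorp - Rotterdam")
for broader, narrower in relations:
    print(broader, "->", narrower)
```

With the collection and set names from the slide, this yields relations from netherland and holanda to blijdorp and rotterdam.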
10
Remove noisy relations:
1st approach
• Conflict Resolution (when both a->b and b->a appear)
– Relation conflicts occur because of noise
– Voting scheme:
Keep ab (and discard ba)
If Nu(ab) > 1 and Nu(ab) > Nu(ba)
insect
butterfly
2
10
butterfly
insect
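A minimal sketch of the voting scheme, assuming Nu counts the distinct users who specified a relation. Two choices below are assumptions: relations with no opposing direction are kept unchanged, and the example assigns the figure's counts as 10 users for insect -> butterfly and 2 for butterfly -> insect (the extracted slide does not say which count belongs to which direction). The function name and user ids are hypothetical.

```python
from collections import Counter


def resolve_conflicts(user_relations):
    """user_relations: iterable of (user, broader, narrower) triples.

    Returns {(broader, narrower): number_of_users} for the kept relations.
    """
    votes = Counter()
    # set() ensures each user votes at most once per directed relation.
    for _user, a, b in set(user_relations):
        votes[(a, b)] += 1

    kept = {}
    for (a, b), n_ab in votes.items():
        n_ba = votes.get((b, a), 0)
        if n_ba == 0:
            # No conflict: kept as-is (an assumption; the slide only
            # states the rule for conflicting pairs).
            kept[(a, b)] = n_ab
        elif n_ab > 1 and n_ab > n_ba:
            kept[(a, b)] = n_ab
    return kept


# Hypothetical votes from made-up user ids.
votes_in = (
    [(f"u{i}", "insect", "butterfly") for i in range(10)]
    + [(f"v{i}", "butterfly", "insect") for i in range(2)]
    + [("w0", "anim", "bug")]
)
kept = resolve_conflicts(votes_in)
print(kept)
```

Note that a tie between the two directions discards both, since neither side satisfies the strict inequality.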
11
Remove noisy relations:
2nd approach
• Significance Test
- Use statistical significance test to decide if a -> b is significant
- Null hypothesis: observed relation a->b was generated by chance, via the random, independent generation of individual concepts a, b (according to the binomial distribution).
[Figure: “Is b narrower than a by chance?” Labels: # observations, # of a->b; regions: accept, reject]
12
Link concepts and select path
• Link concepts: assume that same terms refer to the same concept.
[Figure: anim -> insect + anim -> bug = a single anim node with children insect and bug]
• Select path: linking relations from many users can produce a tangled “spaghetti” graph
4 possible paths from anim -> moth (a = anim, b = bug, i = insect, m = moth)
Edge weights: a->b = 26, a->i = 72, b->i = 1, a->m = 10, i->m = 18, b->m = 4
Network Bottleneck idea:
“the flow bottleneck is a minimum flow capacity among all relations in the path”
1) a->b->i->m [BN score = min(26,1,18) = 1]
2) a->i->m [BN score = min(72,18) = 18]
3) a->m [BN score = min(10) = 10]
4) a->b->m [BN score = min(26,4) = 4]
13
Contribution #2: Learning Concept Hierarchies
Evaluation & Data Set
• Hypothesis: the approach that takes explicit relations into account can induce better hierarchies.
• “Better” means more consistent with hand-built hierarchies (ODP ver. 10/08)
• The baseline approach is the subsumption approach [Schmitz06]. Collection and set terms are used instead of tags, making it comparable.
Data Set:
• Data from 17 user groups devoted to wildlife and naturalist photography
• 21,792 of 39,922 users specify at least one collection
• 110,543 unique terms (cf. 166,153 unique terms in ODP), 15,495 terms in common.
14
Evaluation methodology
ODP has many sub-hierarchies; comparing all of them with the induced ones is impractical.
It is easier to compare after specifying a “root concept” and “leaf concepts”, i.e., a particular subtree to compare.
[Figure panels: Relations (right after tokenization); Reference hierarchy (ODP); Induced hierarchy]
15
Metrics
• Taxonomic Overlap [adapted from Maedche02+]
– measures structural similarity between two trees
– for each node, determines how many of its ancestor and descendant nodes overlap with those in the reference tree.
• Lexical Recall
– measures how well an approach discovers the concepts that exist in the reference hierarchy (coverage)
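Both metrics can be sketched on small trees. This is a simplified reading, not the adaptation used in the talk: taxonomic overlap is taken here as the Jaccard overlap of each shared node's ancestors and descendants, averaged over the nodes present in both trees, and lexical recall as the share of reference concepts found in the induced tree. Trees are stored as child -> parent dictionaries, and the example hierarchies are hypothetical.

```python
def ancestors(parent, node):
    """All nodes above `node` in a child -> parent map (assumes a tree)."""
    out = set()
    while node in parent:
        node = parent[node]
        out.add(node)
    return out


def descendants(parent, node):
    """All nodes that have `node` among their ancestors."""
    return {n for n in parent if node in ancestors(parent, n)}


def nodes(parent):
    return set(parent) | set(parent.values())


def lexical_recall(induced, reference):
    """Fraction of reference concepts that also appear in the induced tree."""
    return len(nodes(induced) & nodes(reference)) / len(nodes(reference))


def taxonomic_overlap(induced, reference):
    """Mean Jaccard overlap of ancestor+descendant sets over shared nodes."""
    shared = nodes(induced) & nodes(reference)
    if not shared:
        return 0.0
    total = 0.0
    for n in shared:
        a = ancestors(induced, n) | descendants(induced, n)
        b = ancestors(reference, n) | descendants(reference, n)
        union = a | b
        total += len(a & b) / len(union) if union else 1.0
    return total / len(shared)


# Hypothetical trees, written as child -> parent.
reference = {"invertebrate": "anim", "insect": "invertebrate",
             "moth": "insect", "butterfly": "insect"}
induced = {"insect": "anim", "moth": "insect"}

recall = lexical_recall(induced, reference)
overlap = taxonomic_overlap(induced, reference)
print(round(recall, 3), round(overlap, 3))
```

In this example the induced tree finds 3 of the 5 reference concepts, and it loses taxonomic overlap because it skips the "invertebrate" level and misses "butterfly".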
16
Quantitative Results
17
Quantitative Results
• 32 root nodes were selected manually
• Taxonomic Overlap:
• for 27 of them, the proposed approach scores better than subsumption
• for 3 of them, both approaches score zero
• Lexical Recall:
• for 28 of them, the proposed approach scores better than subsumption
• for 2 of them, both approaches get similar scores
• for the rest, subsumption induces only the root node.
• The proposed approach can induce deeper trees
• The proposed approach can induce hierarchies more consistent with ODP in almost all cases.
18
Sport hierarchy
19
Invertebrate hierarchy
20
Country hierarchy
21
Discussion
• Simple strategy to aggregate a large number of shallow
relations specified by different users into a common,
deeper hierarchy
• Induced hierarchies are more consistent with ODP
• Future work includes:
– Term ambiguity
– Relation types
– Global path
– Apply to other datasets
22
Thank You!
&
Questions?
23