A soft frequent pattern mining approach for textual topic detection
Georgios Petkos, Symeon Papadopoulos, Yiannis Kompatsiaris (CERTH-ITI)
Luca Aiello (Yahoo Labs)
Ryan Skraba (Alcatel-Lucent Bell Labs)
WIMS 2014, Thessaloniki, June 2014
Overview
• Motivation: classes of topic detection methods and the degree of co-occurrence patterns they consider.
• Beyond pairwise co-occurrence analysis:
– Frequent Pattern Mining.
– Soft Frequent Pattern Mining.
• Experimental evaluation.
• Conclusions and future work.
Classes of textual topic detection methods
• Probabilistic topic models:
– Learn the joint probability distribution of topics and terms and perform inference on it (e.g. LDA).
• Document-pivot methods:
– Cluster together documents; each group of documents is a topic (e.g. incremental clustering based on cosine similarity over a tf-idf representation).
• Feature-pivot methods:
– Cluster together terms based on their co-occurrence patterns; each group of terms is a topic (e.g. graph-based feature-pivot).
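As a concrete illustration of the document-pivot family, the sketch below clusters documents incrementally by cosine similarity over tf-idf vectors. The threshold, tokenization, and centroid update are illustrative choices, not the exact method evaluated later in this talk.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build tf-idf vectors (as sparse dicts) for tokenized documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def incremental_cluster(vecs, threshold=0.3):
    """Assign each document to the best-matching cluster centroid,
    or start a new cluster if no centroid is similar enough."""
    clusters = []  # each cluster: (centroid dict, list of doc indices)
    for i, vec in enumerate(vecs):
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c[0])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append((dict(vec), [i]))
        else:
            centroid, members = best
            for t, w in vec.items():  # update centroid with new doc
                centroid[t] = centroid.get(t, 0.0) + w
            members.append(i)
    return [members for _, members in clusters]
```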
Feature-pivot methods: degree of co-occurrence (1/2)
• We focus on feature-pivot methods and examine the effect of the “degree” of examined co-occurrence patterns on the term clustering procedure.
• Let us consider the following topics:
– Romney wins Virginia
– Romney appears on TV and thanks Virginia
– Obama wins Vermont
– Romney congratulates Obama
• Key terms (e.g. Romney, Obama, Virginia, Vermont, wins) co-occur with more than one other key term.
Feature-pivot methods: degree of co-occurrence (2/2)
• Previous approaches typically examined only “pairwise” co-occurrence patterns.
• In the case of closely related topics, such as the above, examining only pairwise co-occurrence patterns may lead to mixed topics.
• We propose to examine the simultaneous co-occurrence patterns of a larger number of terms.
Beyond pairwise co-occurrence analysis: FPM
• Frequent Pattern Mining (FPM).
• A variety of algorithms (e.g. Apriori, FP-Growth) can be used to find groups of items that co-occur frequently.
• Not a new approach for textual topic detection.
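For illustration, a minimal Apriori-style miner over documents treated as sets of terms. This is a sketch of the general FPM idea; production implementations such as FP-Growth avoid the repeated support scans done here.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Find all itemsets appearing in at least `min_support`
    transactions (documents represented as sets of terms)."""
    transactions = [set(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Start from frequent single items.
    items = {i for t in transactions for i in t}
    frequent = [frozenset([i]) for i in sorted(items)
                if support(frozenset([i])) >= min_support]
    result = list(frequent)
    k = 2
    while frequent:
        # Candidate k-itemsets: unions of frequent (k-1)-itemsets.
        candidates = {a | b for a, b in combinations(frequent, 2)
                      if len(a | b) == k}
        frequent = [c for c in sorted(candidates, key=sorted)
                    if support(c) >= min_support]
        result.extend(frequent)
        k += 1
    return result
```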
Beyond pairwise co-occurrence analysis: SFPM
• Frequent Pattern Mining is strict in that it looks for sets of terms all of which co-occur frequently at the same time. As a result, it may surface only topics with a very small number of terms, i.e. very coarse topics.
• Can we formulate an algorithm that lies between the two ends of the spectrum, i.e. one that looks at co-occurrence patterns of degree higher than two but is not as strict as a typical FPM algorithm?
SFPM
1. Term selection.
2. Co-occurrence vector formation.
3. Post-processing.
SFPM–Step 1: Term selection
• Select the top K terms that will enter the clustering procedure.
• There are different options for doing this. For instance:
– Select the most frequent terms.
– Select the terms that exhibit the most “bursty” behaviour (if we are considering temporal processing).
• In our experiments we select the terms that are most “unusually frequent”.
SFPM–Step 2: Co-occurrence vector formation (1/4)
• The heart of the SFPM approach.
• Notation:
– n: the number of documents in the collection.
– S: a set of terms, representing a topic.
– DS: a vector of length n; the i-th element indicates how many of the terms in S occur in the i-th document.
– Dt: a binary vector of length n; the i-th element indicates whether the term t occurs in the i-th document.
• The vector Dt for a term t that frequently co-occurs with the terms in S will have high cosine similarity with DS.
• Idea of the algorithm: greedily expand S with the term t whose Dt best matches DS.
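The greedy expansion idea can be sketched as follows, using a fixed similarity threshold for simplicity (the actual method adapts the threshold to |S|). Here `term_vectors` maps each candidate term t to its binary vector Dt.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def expand(seed, term_vectors, threshold=0.7):
    """Greedily grow the term set S from a seed term: at each step add
    the term whose Dt has the highest cosine similarity with DS, and
    stop when no remaining term exceeds the threshold."""
    S = {seed}
    DS = list(term_vectors[seed])  # co-occurrence vector for S
    while True:
        best_t, best_sim = None, threshold
        for t, Dt in term_vectors.items():
            if t in S:
                continue
            sim = cosine(DS, Dt)
            if sim > best_sim:
                best_t, best_sim = t, sim
        if best_t is None:
            return S
        S.add(best_t)
        # DS accumulates per-document counts of terms from S.
        DS = [a + b for a, b in zip(DS, term_vectors[best_t])]
```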
SFPM–Step 2: Co-occurrence vector formation (2/4)
SFPM–Step 2: Co-occurrence vector formation (3/4)
• We need a stopping criterion for the expansion procedure.
• If it is not properly set, the set of terms may end up too small (i.e. the topic may be too coarse) or may end up being a mixture of topics.
• We use a threshold on the cosine similarity of DS and Dt and adapt the threshold dynamically to the size of S. In particular, we use a sigmoid function of |S|.
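A sigmoid threshold of this kind can be sketched as below. The constants `lo`, `hi`, `midpoint` and `slope` are illustrative, not the paper's parameterization.

```python
import math

def similarity_threshold(set_size, lo=0.3, hi=0.8, midpoint=5.0, slope=1.0):
    """Cosine-similarity threshold as a sigmoid of |S|: it rises from
    about `lo` for small sets to about `hi` for large sets, so adding
    terms becomes progressively harder as the topic grows."""
    s = 1.0 / (1.0 + math.exp(-(set_size - midpoint) / slope))
    return lo + (hi - lo) * s
```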
SFPM–Step 2: Co-occurrence vector formation (4/4)
• We run the expansion procedure many times, each time starting from a different term.
• Additionally, to avoid having less important terms dominate DS and the cosine similarity, at the end of each expansion step we zero out entries of DS that have a value smaller than |S|/2.
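The pruning step can be written directly as a sketch: documents containing fewer than half of the terms in S stop contributing to the similarity.

```python
def prune_ds(DS, set_size):
    """Zero out DS entries smaller than |S|/2, so that documents with
    few of the topic's terms do not dominate the cosine similarity."""
    return [x if x >= set_size / 2 else 0 for x in DS]
```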
SFPM–Step 3: Post-processing
• Because we run the expansion procedure many times, each time starting from a different term, we may end up with a large number of duplicate topics.
• In the final step of the algorithm, we filter out duplicate topics (by considering the Jaccard similarity between the sets of keywords of the produced topics).
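The de-duplication step might look as follows; the 0.5 similarity cutoff is an illustrative value.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of keywords."""
    return len(a & b) / len(a | b)

def deduplicate(topics, max_sim=0.5):
    """Keep a topic only if its keyword set is not too similar
    (Jaccard above `max_sim`) to any topic already kept."""
    kept = []
    for topic in topics:
        if all(jaccard(topic, k) <= max_sim for k in kept):
            kept.append(topic)
    return kept
```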
SFPM – Overview
Evaluation – Datasets and evaluation metrics
• Three datasets collected from Twitter:
– Super Tuesday dataset: 474,109 documents.
– F.A. Cup final: 148,652 documents.
– U.S. presidential elections: 1,247,483 documents.
• For each of them, a number of ground-truth topics were determined by examining the relevant stories that appeared in the mainstream media.
• Each topic is represented by a set of mandatory terms, a set of forbidden terms (so that we make sure that closely related topics are not merged) and a set of optional terms.
• We evaluate:
– Topic recall.
– Keyword recall.
– Keyword precision.
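Under the topic representation above, topic recall can be sketched as follows: a detected topic matches a ground-truth topic when it contains all mandatory terms and none of the forbidden ones (optional terms are ignored in this sketch, which may differ in detail from the exact matching rule used in the evaluation).

```python
def topic_matched(detected_keywords, mandatory, forbidden):
    """True if the detected keyword set contains every mandatory term
    and no forbidden term."""
    kw = set(detected_keywords)
    return mandatory <= kw and not (forbidden & kw)

def topic_recall(detected_topics, ground_truth):
    """Fraction of ground-truth topics matched by at least one detected
    topic. `ground_truth` is a list of (mandatory, forbidden) set pairs."""
    matched = sum(
        1 for mandatory, forbidden in ground_truth
        if any(topic_matched(d, mandatory, forbidden) for d in detected_topics)
    )
    return matched / len(ground_truth)
```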
Evaluation – Competing methods
• A classic probabilistic method: LDA.
• A graph-based feature-pivot approach that examines only pairwise co-occurrence patterns.
• FPM using the FP-Growth algorithm.
Evaluation – Results: topic recall
• Evaluated for different numbers of returned topics.
• SFPM achieves the highest topic recall in all three datasets.
Evaluation – Results: keyword recall
• SFPM achieves the highest keyword recall in all three datasets.
• SFPM not only retrieves more target topics than the other methods, but also provides a quite complete representation of the topics.
Evaluation – Results: keyword precision
• SFPM nevertheless achieves a somewhat lower keyword precision, indicating that some spurious keywords are also included in the topics.
• FPM, as the strictest method, achieves the highest keyword precision.
Evaluation – Example topics produced
Conclusions
• We started from the observation that, in order to detect closely related topics, a feature-pivot topic detection method should examine co-occurrence patterns of degree larger than 2.
• We presented an approach, SFPM, that does this, albeit in a soft and controllable manner. It is based on a greedy set expansion procedure.
• We experimentally showed that the proposed approach may indeed improve performance when dealing with corpora containing closely inter-related topics.
Future work
• Experiment with different types of documents.
• Consider the problem of synonyms.
• Examine alternative, more efficient search strategies. For example, we may index the Dt vectors, e.g. using LSH, in order to rapidly retrieve the best-matching term for a set S.
Thank you!
• Open source implementation (including a set of other topic detection methods) available at: https://github.com/socialsensor/topic-detection
• Dataset and evaluation resources available at: http://www.socialsensor.eu/results/datasets/72-twitter-tdt-dataset
• Relevant topic detection dataset on which SFPM will be tested: http://figshare.com/articles/SNOW_2014_Data_Challenge/1003755
Questions, comments, suggestions?