A soft frequent pattern mining approach for textual topic detection
Georgios Petkos, Symeon Papadopoulos, Yiannis Kompatsiaris (CERTH-ITI)
Luca Aiello (Yahoo Labs)
Ryan Skraba (Alcatel-Lucent Bell Labs)
WIMS 2014, Thessaloniki, June 2014

Overview
• Motivation: classes of topic detection methods, degree of co-occurrence patterns considered.
• Beyond pairwise co-occurrence analysis:
– Frequent Pattern Mining.
– Soft Frequent Pattern Mining.
• Experimental evaluation.
• Conclusions and future work.

Classes of textual topic detection methods
• Probabilistic topic models:
– Learn the joint probability distribution of topics and terms and perform inference on it (e.g. LDA).
• Document-pivot methods:
– Cluster documents together; each group of documents is a topic (e.g. incremental clustering based on cosine similarity over a tf-idf representation).
• Feature-pivot methods:
– Cluster terms together based on their co-occurrence patterns; each group of terms is a topic (e.g. graph-based feature-pivot).

Feature-pivot methods: degree of co-occurrence (1/2)
• We focus on feature-pivot methods and examine the effect of the "degree" of the examined co-occurrence patterns on the term clustering procedure.
• Let us consider the following topics:
– Romney wins Virginia
– Romney appears on TV and thanks Virginia
– Obama wins Vermont
– Romney congratulates Obama
• Key terms (e.g. Romney, Obama, Virginia, Vermont, wins) co-occur with more than one other key term.

Feature-pivot methods: degree of co-occurrence (2/2)
• Previous approaches typically examined only "pairwise" co-occurrence patterns.
• In the case of closely related topics, such as the above, examining only pairwise co-occurrence patterns may lead to mixed topics.
• We propose to examine the simultaneous co-occurrence patterns of a larger number of terms.
Beyond pairwise co-occurrence analysis: FPM
• Frequent Pattern Mining.
• A variety of algorithms (e.g. Apriori, FP-Growth) can be used to find groups of items that frequently co-occur.
• Not a new approach for textual topic detection.

Beyond pairwise co-occurrence analysis: SFPM
• Frequent Pattern Mining is strict in that it looks for sets of terms all of which frequently co-occur at the same time. It may therefore be able to surface only topics with a very small number of terms, i.e. very coarse topics.
• Can we formulate an algorithm that lies between the two ends of the spectrum, i.e. one that looks at co-occurrence patterns of degree higher than two but is not as strict as a typical FPM algorithm?

SFPM
1. Term selection.
2. Co-occurrence vector formation.
3. Post-processing.

SFPM – Step 1: Term selection
• Select the top K terms that will enter the clustering procedure.
• There are different options for doing this. For instance:
– Select the most frequent terms.
– Select the terms that exhibit the most "bursty" behaviour (if we are considering temporal processing).
• In our experiments we select the terms that are most "unusually frequent".

SFPM – Step 2: Co-occurrence vector formation (1/4)
• The heart of the SFPM approach.
• Notation:
– n: the number of documents in the collection.
– S: a set of terms, representing a topic.
– DS: a vector of length n whose i-th element indicates how many of the terms in S co-occur in the i-th document.
– Dt: a binary vector of length n whose i-th element indicates whether the term t occurs in the i-th document.
• The vector Dt of a term t that frequently co-occurs with the terms in S will have high cosine similarity with DS.
• Idea of the algorithm: greedily expand S with the term t whose Dt best matches DS.
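The DS/Dt construction and a single greedy selection step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy corpus mirrors the Romney/Obama example topics from the earlier slides, and `best_candidate` is a hypothetical helper name.

```python
from math import sqrt

# Toy corpus mirroring the slides' example topics (illustrative data only).
docs = [
    {"romney", "wins", "virginia"},
    {"romney", "thanks", "virginia", "tv"},
    {"obama", "wins", "vermont"},
    {"romney", "congratulates", "obama"},
]
vocab = sorted(set().union(*docs))

def D_t(term):
    # Binary vector of length n: 1 if `term` occurs in the i-th document.
    return [1 if term in doc else 0 for doc in docs]

def D_S(S):
    # Co-occurrence vector: how many terms of S appear in the i-th document.
    return [len(S & doc) for doc in docs]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def best_candidate(S, vocabulary):
    # One greedy step: the term outside S whose D_t best matches D_S.
    ds = D_S(S)
    return max((t for t in vocabulary if t not in S),
               key=lambda t: cosine(D_t(t), ds))
```

For instance, starting from S = {"romney", "virginia"}, the best-matching candidates are "thanks" and "tv", since they co-occur with both terms of S in the same document.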
SFPM – Step 2: Co-occurrence vector formation (2/4)

SFPM – Step 2: Co-occurrence vector formation (3/4)
• We need a stopping criterion for the expansion procedure.
• If it is not properly set, the set of terms may end up too small (i.e. the topic may be too coarse) or may end up being a mixture of topics.
• We use a threshold on the cosine similarity of DS and Dt, and we adapt the threshold dynamically to the size of S. In particular, we use a sigmoid function of |S|.

SFPM – Step 2: Co-occurrence vector formation (4/4)
• We run the expansion procedure many times, each time starting from a different term.
• Additionally, to avoid having less important terms dominate DS and the cosine similarity, at the end of each expansion step we zero out entries of DS with a value smaller than |S|/2.

SFPM – Step 3: Post-processing
• Because we run the expansion procedure many times, each time starting from a different term, we may end up with a large number of duplicate topics.
• In the final step of the algorithm, we filter out duplicate topics (by considering the Jaccard similarity between the sets of keywords of the produced topics).

SFPM – Overview

Evaluation – Datasets and evaluation metrics
• Three datasets collected from Twitter:
– Super Tuesday: 474,109 documents
– F.A. Cup final: 148,652 documents
– U.S. presidential elections: 1,247,483 documents
• For each of them, a number of ground-truth topics were determined by examining the relevant stories that appeared in the mainstream media.
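Putting the pieces together, the expansion loop with the sigmoid stopping threshold, the |S|/2 zero-out, and the Jaccard-based de-duplication might look as sketched below. The corpus, the sigmoid's offset, the de-duplication threshold, and all function names are illustrative assumptions, not the paper's exact choices.

```python
from math import exp, sqrt

# Toy corpus mirroring the slides' example topics (illustrative data only).
docs = [
    {"romney", "wins", "virginia"},
    {"romney", "thanks", "virginia", "tv"},
    {"obama", "wins", "vermont"},
    {"romney", "congratulates", "obama"},
]
vocab = sorted(set().union(*docs))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def theta(size):
    # Sigmoid threshold on |S|: expansion gets stricter as the set grows.
    # The offset of 3 is an illustrative assumption, not the paper's value.
    return 1.0 / (1.0 + exp(-(size - 3)))

def expand(seed):
    # Greedily grow a term set S starting from a single seed term.
    S = {seed}
    while True:
        ds = [len(S & doc) for doc in docs]
        # Zero out entries smaller than |S|/2 so that documents containing
        # only a few, less important terms do not dominate the similarity.
        ds = [x if x >= len(S) / 2 else 0 for x in ds]
        best, best_sim = None, 0.0
        for t in vocab:
            if t in S:
                continue
            sim = cosine([1 if t in doc else 0 for doc in docs], ds)
            if sim > best_sim:
                best, best_sim = t, sim
        if best is None or best_sim < theta(len(S)):
            return S
        S.add(best)

def jaccard(a, b):
    return len(a & b) / len(a | b)

def deduplicate(topics, threshold=0.5):
    # Keep a topic only if it is not too similar to an already kept one.
    kept = []
    for topic in topics:
        if all(jaccard(topic, k) < threshold for k in kept):
            kept.append(topic)
    return kept
```

On this toy corpus, running `expand` from every vocabulary term and de-duplicating the results yields distinct term sets such as {"romney", "virginia", "thanks", "tv"} and {"wins", "vermont", "obama"}, i.e. the closely related election topics stay separated rather than merging into one mixed topic.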
• Each topic is represented by a set of mandatory terms, a set of forbidden terms (to make sure that closely related topics are not merged) and a set of optional terms.
• We evaluate:
– Topic recall.
– Keyword recall.
– Keyword precision.

Evaluation – Competing methods
• A classic probabilistic method: LDA.
• A graph-based feature-pivot approach that examines only pairwise co-occurrence patterns.
• FPM using the FP-Growth algorithm.

Evaluation – Results: topic recall
• Evaluated for different numbers of returned topics.
• SFPM achieves the highest topic recall on all three datasets.

Evaluation – Results: keyword recall
• SFPM achieves the highest keyword recall on all three datasets.
• SFPM not only retrieves more target topics than the other methods, but also provides a quite complete representation of the topics.

Evaluation – Results: keyword precision
• SFPM nevertheless achieves a somewhat lower keyword precision, indicating that some spurious keywords are also included in the topics.
• FPM, as the strictest method, achieves the highest keyword precision.

Evaluation – Example topics produced

Conclusions
• We started from the observation that, in order to detect closely related topics, a feature-pivot topic detection method should examine co-occurrence patterns of degree larger than two.
• We presented an approach, SFPM, that does this, albeit in a soft and controllable manner. It is based on a greedy set expansion procedure.
• We have experimentally shown that the proposed approach can indeed improve performance when dealing with corpora containing closely inter-related topics.
Future work
• Experiment with different types of documents.
• Consider the problem of synonyms.
• Examine alternative, more efficient search strategies. For example, the Dt vectors could be indexed (e.g. using LSH) in order to rapidly retrieve the term that best matches a set S.

Thank you!
• Open source implementation (including a set of other topic detection methods) available at: https://github.com/socialsensor/topic-detection
• Dataset and evaluation resources available at: http://www.socialsensor.eu/results/datasets/72-twitter-tdt-dataset
• Relevant topic detection dataset on which SFPM will be tested: http://figshare.com/articles/SNOW_2014_Data_Challenge/1003755
Questions, comments, suggestions?