Contextual Anomaly Detection in Text Data
Authors: Amogh Mahapatra, Nisheeth Srivastava, Jaideep Srivastava
University of Minnesota, Dept. of Computer Science

Outline:
• Background and Motivation
• Problem Statement & Related Work
• Model and Methods
  • Statistical Content Analysis
  • Semantic Context Incorporation
  • Oracle: The Decision Engine
• Evaluation
• Conclusion

[1/5] Background and Motivation

Two schools of thought in textual information retrieval:
• Statistical: techniques based on the statistical frequencies of word co-occurrence in a text corpus; the famous bag-of-words model (e.g., LSA, LDA).
  Pro: focuses on the content of the given text corpus.
  Con: cannot infer cognitive atypicality, idiomatic inferences, etc.
• Semantic: techniques that discriminate correlations between linguistic elements based on domain knowledge provided by linguists (e.g., WordNet (path, gloss), SentiWordNet, NGD).
  Pro: captures semantic relations appropriately.
  Con: misses thematic abstractions in the given text corpus.

Does Context Matter in Anomaly Detection?

Anomaly detection in text:
• Applications
  • Anomalous events in airplane logs
  • Events in Twitter streams
  • Organizational security, fraudulent emails, etc.
• Problems with current techniques
  • Require evaluation by a human expert
  • The number of false positives is too high, because of the high risk associated with false negatives
  • No adaptive thresholds; domain expertise cannot be included

[2/5] Problem Statement & Related Work

Problem objectives:
• Incorporate real-world context into textual data streams and logs
• Detect previously undetected contextual anomalies
• Reduce false positives

Related Work - 1

Supervised settings: Manevitz et al. [2] used neural networks, one-class SVMs, etc. to separate negative documents from positive documents.
Unsupervised settings: Srivastava et al. have used k-means, spectral clustering, etc. to cluster and visualize text data for anomaly detection. Agovic et al. have used topic models to detect deviant, anomalous topics in airplane logs.
Styles: Guthrie et al. have defined 200 stylistic features to characterize writing and find statistical deviations.

All of the above techniques rely completely on the content of the text corpus.

Related Work - 2

Semantic relationships: Pedersen et al. have defined a series of semantic measures between words using WordNet. Cilibrasi et al. have defined the Normalized Google Distance, which uses search queries to characterize the distance between concepts. Bollegala et al. have used the web to define various semantic relationships between concepts and components.

[3/5] Models & Methods

The components of the Textual Anomaly Prescriber:
• Statistical Content Analysis (LDA based)
• Semantic Context Incorporation (WordNet based)
• Oracle: The Decision Engine

The Algorithm

• Input: documents D, number of topics n, partition parameter m, anomaly threshold k
• Output: set of anomalous topics
• B = ComputeBagOfWords(D) (tokenize)
• T = LDA(B, n) (cluster the bag-of-words into n topics)
• T1 = Rank(T) (rank topics based on document co-occurrence)
• Test = Partition(T1, m) (partition into typical and test topic sets)
• FindContext(Test) (find context from WordNet measures)
• R = Decision(Test, k) (pick the k topics with the lowest context scores)
• Output R

Statistical Content Analysis - 1

Clustering: the text log was divided into clusters (topics) using the LDA model. The 10 most likely words were extracted as a representative summary of each topic.
Why LDA: extremely useful for content analysis of textual logs and for deriving latent thematic abstractions (Blei et al., 2003 [2]).
What is LDA: represents each topic as a distribution over words and allows mixed membership of documents in several topics. Can be used on very large-scale text corpora (Newman et al., 2007).

Statistical Content Analysis - 2
(Figure slide; content not recovered.)

Statistical Content Analysis - 3

Ranking of topics. Assumption: topics are ordered based on their co-usage in documents.
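The ranking relies on a symmetric KL-divergence distance between topic-word distributions, Distance(a, b) = KL(a, b) + KL(b, a). A minimal pure-Python sketch, assuming each topic is given as a smoothed probability vector over a shared vocabulary (illustrative only, not the authors' code):

```python
import math

def kl(p, q):
    """KL divergence KL(p || q) for discrete distributions.
    Assumes q has no zero entries where p > 0 (i.e., smoothed distributions)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def topic_distance(a, b):
    """Symmetric topic distance: Distance(a, b) = KL(a, b) + KL(b, a)."""
    return kl(a, b) + kl(b, a)

# Two toy topic-word distributions over a 3-word vocabulary
a = [0.7, 0.2, 0.1]
b = [0.1, 0.2, 0.7]
print(round(topic_distance(a, b), 3))  # -> 2.335
```

In the full method, the pairwise distance matrix produced this way is fed to MDS, and the first MDS dimension orders the topics.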
Topics that are similar to each other have low relative divergence in their rankings.
Distance(a, b) = KL(a, b) + KL(b, a), where a and b are two different topics.
MDS is used for dimensionality reduction, and the first dimension is used to rank the topics.

Augmenting Semantic Context - 1

Capture word-pair relationships from an external semantic corpus. We use WordNet 2.05, a semantic relational dataset built by notable linguists.
Nouns and verbs are organized into synonym sets (synsets) linked by a hierarchical is-a relationship, e.g., dog is-a animal.
Non-hierarchical relationships such as has-part and is-made-of are also supported.

Augmenting Semantic Context - 2

Two words are considered similar if they derive from a common set of ancestors or share similar horizontal relationships.
Network measures: quantify the distance of one word from another in the set of synsets.
Gloss measure: quantify the overlap of the glossary texts of two words.
Path measure = 1/D, where D is the number of nodes on the shortest path between a and b.
Gloss measure = cosine similarity between the word vectors of the two glossary texts.
Our measure = (Path + Gloss) / 2.

Decision Engine (Oracle) - 1

Picks the bottom third of the ranked topics given by the content engine and sends them to the context engine.
Assumption: anomalies are rare.
Calculate the semantic distance of a given test topic from a normal topic: S(i, j) denotes the distance between the i-th word of the normal topic and the j-th word of the test topic.

Decision Engine (Oracle) - 2

Aggregate over all normal topics to get the score for the given test topic. Repeat for all test topics.

[4/5] Evaluation

Data sets:
• Enron Email Data Set (organizational data, made public after a legal trial): 200,399 messages from 158 users, with an average of 757 messages per user.
• NIPS full papers (Neural Information Processing Systems conference): a randomized set of 1,500 documents with 6.4 million total words and 12,419 unique words.
• Daily Kos Blogs: an American political blog that publishes political news and opinions, typically adopting a liberal stance.
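Returning to the combined semantic measure defined earlier, Our measure = (Path + Gloss)/2 can be sketched as follows. This is a pure-Python illustration: the helper functions and the toy glosses are assumptions for the example, not WordNet API calls.

```python
import math
from collections import Counter

def path_measure(num_nodes_on_shortest_path):
    # Path measure = 1 / D, where D is the node count of the shortest is-a path
    return 1.0 / num_nodes_on_shortest_path

def gloss_measure(gloss_a, gloss_b):
    # Cosine similarity between bag-of-words vectors of the two glossary texts
    va, vb = Counter(gloss_a.lower().split()), Counter(gloss_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def combined_measure(num_nodes, gloss_a, gloss_b):
    # Our measure = (Path + Gloss) / 2
    return (path_measure(num_nodes) + gloss_measure(gloss_a, gloss_b)) / 2

# Toy example: two words three nodes apart in the hierarchy, with overlapping glosses
score = combined_measure(3,
                         "a domesticated carnivorous mammal",
                         "a small domesticated carnivorous mammal")
print(round(score, 3))  # -> 0.614
```

The Oracle then aggregates such word-pair scores, via S(i, j) over normal and test topics, into a context score per test topic.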
A random set of blogs taken from their website: 3,430 documents, 6,906 unique words, and approximately 467,714 total words.
• All datasets were taken from the UCI Machine Learning Repository.

Results
• Detection accuracy: 0.9333
• Reduction of false positives by 75%
• Detection of previously undetected anomalies: a 20% increase
• Ground truth is hard to ascertain; human evaluation is biased in its own way
• Show topics?

[5/5] Conclusion

Our contributions:
• Introduced contextual semantic information into data streams to further inform existing state-of-the-art algorithms
• Devised a two-stage algorithm and provided empirical validation
• The new technique allows adaptive thresholds, can include domain expertise, and augments context as a recommendation

Future Work
• Experiment with better semantic measures such as NGD
• Experiment with larger text corpora such as Wikipedia, Google, Bing, etc.
• Extending to an online setting could significantly improve existing sentiment-extraction techniques
• Implement privacy-preserving tracking of intra-organizational communication using systems built around our basic concept
• Extend to both supervised and unsupervised settings

Thanks
The authors acknowledge the survey takers for reinstating their faith in the highly irrational and diverse nature of subjective human interpretation, which inspires their future research.