
Contextual Anomaly
Detection in Text Data
Authors: Amogh Mahapatra, Nisheeth Srivastava, Jaideep Srivastava
University of Minnesota, Dept. of Computer Science
Outline:
Background and Motivation
Problem Statement & Related Work
Model and Methods
Statistical Content Analysis
Semantic Context Incorporation
Oracle: The Decision Engine
Evaluation
Conclusion
[1/5] Background and Motivation
Two schools of thought in textual information retrieval
Statistical:
Techniques based on statistical frequencies of word co-occurrence in a text corpus, such as the well-known bag-of-words model.
e.g. LSA, LDA, etc.
Pro: Focuses on the content of the given text corpus.
Con: Cannot infer cognitive atypicality, idiomatic inferences, etc.
Semantic:
Techniques that discern correlations between linguistic elements based on domain knowledge provided by linguists.
e.g. WordNet (Path, Gloss), SentiWordNet, NGD
Pro: Captures semantic relations appropriately.
Con: Misses thematic abstractions in the given text corpus.
Context matters in Anomaly
Detection?
Anomaly Detection In Text:
• Applications
• Anomalous events in airplane logs
• Events in twitter stream
• Organizational security, fraudulent emails etc.
Problems with current techniques
• They require evaluation by a human expert
• The number of false positives is too high, because of the high risk associated with false negatives
• No adaptive thresholds; they cannot include domain expertise
[2/5] Problem Statement & Related Work
Problem Objectives?
• Incorporate real-world context into textual data streams and logs
• Detect previously undetected contextual
anomalies
• Reduce False Positives
Related Work-1
Supervised settings: Manevitz et al. [2] used neural networks, one-class SVMs, etc. to separate negative documents from positive documents.
Unsupervised settings: Srivastava et al. used K-means, spectral clustering, etc. to cluster and visualize text data to detect anomalies. Agovic et al. used topic models to detect deviant, anomalous topics in airplane logs.
Styles: Guthrie et al. defined 200 stylistic features to characterize writing and find statistical deviations.
All of the above techniques rely completely on the content of the text corpus.
Related Work-2
Semantic Relationships:
Pedersen et al. defined a series of semantic measures between words using WordNet.
Cilibrasi et al. defined the Normalized Google Distance (NGD), which uses search queries to characterize the distance between concepts.
Bollegala et al. used the web to define various semantic relationships between concepts and components.
[3/5] Models & Methods
The Components of Textual Anomaly Prescriber:
Statistical Content Analysis (LDA-based)
Semantic Context Incorporation (WordNet-based)
Oracle: The Decision Engine
The Algorithm
• Input: Documents D, number of topics n, partition parameter m, anomaly threshold k
• Output: Set of anomalous topics R
• B = ComputeBagOfWords(D) (tokenize)
• T = LDA(B, n) (cluster the bag-of-words into n topics)
• T1 = Rank(T) (rank topics based on document co-occurrence)
• Test = Partition(T1, m) (partition into typical and test topic sets)
• FindContext(Test) (find context from WordNet measures)
• R = Decision(Test, k) (pick the k topics with the lowest context scores)
• Output R
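The Partition and Decision steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not define m precisely, so the sketch treats it as the fraction of lowest-ranked topics sent for testing, and the topic names and context scores are hypothetical placeholders rather than outputs of the actual system.

```python
def partition(ranked_topics, m):
    """Split ranked topics into (typical, test).

    Assumption: m is the fraction of lowest-ranked topics treated as test topics.
    """
    n_test = max(1, round(len(ranked_topics) * m))
    return ranked_topics[:-n_test], ranked_topics[-n_test:]


def decision(context_scores, k):
    """Return the k test topics with the lowest context scores."""
    return sorted(context_scores, key=context_scores.get)[:k]


# Hypothetical ranked topic identifiers, best-ranked first.
ranked = ["t0", "t1", "t2", "t3", "t4", "t5"]
typical, test = partition(ranked, 1 / 3)

# Hypothetical context scores for the test topics (higher = closer to typical topics).
scores = {"t4": 0.62, "t5": 0.08}
anomalous = decision(scores, k=1)
print(typical, test, anomalous)
```

Keeping k as an explicit argument is what lets a domain expert tighten or loosen the threshold without touching the earlier stages.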
Statistical Content Analysis-1
Clustering: The text log was divided into n clusters (topics) using the LDA model. The top 10 most likely words were extracted as a representative summary of each topic.
Why LDA: Extremely useful for content analysis of textual logs and for deriving latent thematic abstractions (Blei et al. 2003 [2]).
What is LDA: Represents each topic as a distribution over words and allows mixed membership of documents in several topics.
Can be used to deal with very large-scale text corpora (Newman et al. 2007).
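As a small illustration of how a topic, viewed as a distribution over words, can be summarized by its most likely words, consider the sketch below. The topic and its probabilities are hypothetical and were not produced by an LDA fit; the function name top_words is likewise introduced only for this example.

```python
def top_words(topic_word_probs, n=10):
    """Return the n most probable words of a topic, most likely first."""
    ranked = sorted(topic_word_probs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


# Hypothetical topic: a probability distribution over a tiny vocabulary.
topic = {"energy": 0.30, "gas": 0.25, "power": 0.20, "meeting": 0.15, "lunch": 0.10}

summary = top_words(topic, n=3)
print(summary)
```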
Statistical Content Analysis-2
Statistical Content Analysis-3
Ranking of Topics:
Assumption: Order is based on co-usage in documents. Topics that are similar to each other have low relative divergence in their rankings.
Distance(a, b) = KL(a, b) + KL(b, a), where a and b are two different topics.
MDS is used for dimensionality reduction, and the first dimension is used to rank the topics.
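The ranking step can be sketched as follows. The slides do not say which MDS variant was used, so this sketch assumes classical (Torgerson) MDS and finds only its leading coordinate by power iteration; the smoothing constant eps and the three toy topic distributions are also assumptions. The sign of an MDS coordinate is arbitrary, so the resulting order may come out reversed.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Distance(a, b) = KL(a, b) + KL(b, a)."""
    return kl(p, q) + kl(q, p)


def first_mds_coordinate(dist, iters=500):
    """Leading coordinate of classical MDS, computed by power iteration."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    v = [float(i + 1) for i in range(n)]  # arbitrary non-constant start vector
    lam = 0.0
    for _ in range(iters):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = math.sqrt(sum(x * x for x in w))  # magnitude of the dominant eigenvalue
        if lam == 0.0:
            return [0.0] * n
        v = [x / lam for x in w]
    return [x * math.sqrt(lam) for x in v]


# Hypothetical topic-word distributions over a three-word vocabulary.
topics = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
]
dist = [[symmetric_kl(a, b) for b in topics] for a in topics]
coords = first_mds_coordinate(dist)
ranking = sorted(range(len(topics)), key=lambda i: coords[i])
print(ranking)
```

Because symmetric KL is not a true metric, the centred matrix can have small negative eigenvalues; the sketch assumes the dominant one is positive, which holds for this toy data.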
Augmenting Semantic Context-1
Capture word-pair relationships from an external semantic corpus.
We use WordNet 2.05, a semantic relational dataset built by notable linguists.
It organizes nouns and verbs into synsets (sets of synonyms) linked by hierarchical is-a relationships, e.g. dog is-a animal.
It also supports non-hierarchical relationships such as has-part, is-made-of, etc.
Augmenting Semantic Context-2
Two words are considered similar if they derive from a common set of ancestors or share similar horizontal relationships.
Network Measures: Quantify the distance of one word from another in the set of synsets.
Gloss Measure: Quantifies the overlap of the glossary text of two words.
Path Measure = 1/D, where D = number of nodes in the shortest path between a and b
Gloss Measure = cosine similarity between the word vectors of the two glossary texts
Our Measure = (Path + Gloss)/2
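The three measures above can be illustrated on a toy example. The is-a hierarchy and glossary texts below are hypothetical stand-ins for WordNet, chosen only so the arithmetic is easy to follow; the sketch makes no claim about how the original system queried WordNet.

```python
import math
from collections import Counter, deque

# Hypothetical toy is-a hierarchy (child -> parent); real WordNet is far larger.
IS_A = {"dog": "canine", "canine": "animal", "cat": "feline", "feline": "animal"}

# Hypothetical glossary texts for two words of the toy vocabulary.
GLOSS = {
    "dog": "a domesticated animal that barks",
    "cat": "a domesticated animal that purrs",
}


def neighbours(word):
    """Parents and children of a word in the toy hierarchy."""
    out = {child for child, parent in IS_A.items() if parent == word}
    if word in IS_A:
        out.add(IS_A[word])
    return out


def path_measure(a, b):
    """1 / D, where D is the number of nodes on the shortest path from a to b."""
    queue, seen = deque([(a, 1)]), {a}
    while queue:
        word, nodes = queue.popleft()
        if word == b:
            return 1.0 / nodes
        for nxt in neighbours(word):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, nodes + 1))
    return 0.0  # no connecting path


def gloss_measure(a, b):
    """Cosine similarity between word-count vectors of the two glossary texts."""
    va, vb = Counter(GLOSS[a].split()), Counter(GLOSS[b].split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def combined_measure(a, b):
    """(Path + Gloss) / 2."""
    return (path_measure(a, b) + gloss_measure(a, b)) / 2


print(path_measure("dog", "cat"), gloss_measure("dog", "cat"), combined_measure("dog", "cat"))
```

In this toy hierarchy the shortest dog-to-cat path visits five nodes (dog, canine, animal, feline, cat), and the two glosses share four of their five words.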
Decision Engine (Oracle)-1
Picks the bottom third of the ranked topics produced by the content engine and sends them to the context engine.
Assumption: Anomalies are rare.
Calculate the semantic distance between a given test topic and a normal topic. S(i, j) denotes the distance between the i-th word in the normal topic and the j-th word in the test topic.
Decision Engine (Oracle)-2
Aggregate over all normal topics to get the score for the given test topic.
Repeat for all test topics.
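One possible way to turn the word-pair values S(i, j) into a per-topic score is sketched below. The aggregation formula is not given in the text, so averaging over word pairs and then over normal topics is purely an assumption. The slides call S a distance, while the measure defined earlier behaves like a similarity; the sketch assumes higher values mean more related, which is consistent with flagging the lowest-scoring topics. The toy_sim lookup table and all topic words are hypothetical.

```python
def topic_pair_score(normal_topic, test_topic, sim):
    """Mean of S(i, j) over all word pairs (i from the normal topic, j from the test topic)."""
    pairs = [(wi, wj) for wi in normal_topic for wj in test_topic]
    return sum(sim(wi, wj) for wi, wj in pairs) / len(pairs)


def context_score(test_topic, normal_topics, sim):
    """Aggregate the pair scores over all normal topics (assumption: simple average)."""
    return sum(topic_pair_score(t, test_topic, sim) for t in normal_topics) / len(normal_topics)


# Hypothetical word-pair relatedness table standing in for the WordNet-based measure.
RELATED = {
    frozenset(("energy", "gas")): 0.8,
    frozenset(("energy", "power")): 0.9,
    frozenset(("gas", "power")): 0.7,
}


def toy_sim(a, b):
    """Hypothetical similarity: 1.0 for identical words, table lookup otherwise."""
    return 1.0 if a == b else RELATED.get(frozenset((a, b)), 0.0)


normal_topics = [["energy", "gas"], ["power", "gas"]]
test_topics = {"t1": ["energy", "power"], "t2": ["lunch", "golf"]}

scores = {name: context_score(words, normal_topics, toy_sim)
          for name, words in test_topics.items()}
print(scores)
```

A test topic whose words have no semantic connection to any normal topic ends up with the lowest score, which is what marks it as a candidate contextual anomaly.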
[4/5] Evaluation
Data Sets:
• Enron Email Data Set (organizational data, made public after a legal trial): 200,399 messages from 158 users, with an average of 757 messages per user
• NIPS full papers (Conference on Computational Neuroscience): a randomized set of 1,500 documents with 6.4 million total words and 12,419 unique words
• Daily Kos Blogs: an American political blog that publishes political news and opinions, typically adopting a liberal stance. A random set of blogs taken from their website: 3,430 documents, 6,906 unique words, and approximately 467,714 total words
• All datasets taken from the UCI ML repository
Results
• Detection Accuracy: 0.9333
• False positives reduced by 75%
• Detection of previously undetected anomalies rose by 20%
• Ground truth is hard to ascertain; human evaluation is biased in its own way
• Show Topics?
[5/5] Conclusion
Our Contributions:
• We introduced contextual semantic information into data streams to further inform the existing state-of-the-art algorithm
• Devised a two-stage algorithm and provided empirical validation
• The new technique allows for adaptive thresholds, includes domain expertise, and augments context as a recommendation
Future Work
Experiment with better semantic measures like NGD
Experiment with bigger text corpora like Wikipedia, Google, Bing, etc.
Extend to an online setting, which could significantly improve existing techniques for sentiment extraction
Implement privacy-preserving tracking of intra-organizational communication using systems built around our basic concept
Extend to both supervised and unsupervised settings
Thanks
Authors acknowledge the survey
takers for reinstating their faith in
the highly irrational and diverse
nature of subjective human
interpretations which inspires their
future research