Monitoring Message Streams: Retrospective and Prospective Event Detection
Fred S. Roberts
DIMACS, Rutgers University
DIMACS is a partnership of:
• Rutgers University
• Princeton University
• AT&T Labs
• Bell Labs
• NEC Research Institute
• Telcordia Technologies
http://dimacs.rutgers.edu
center@dimacs.rutgers.edu
732-445-5928
OBJECTIVE:
Monitor streams of textualized communication to detect pattern changes and "significant" events.
Motivation:
– monitoring of global satellite communications (though this may produce voice rather than text)
– sniffing and monitoring email traffic
TECHNICAL PROBLEM:
• Given a stream of text in any language.
• Decide whether "new events" are present in the flow of messages.
• Event: a new topic, or a topic with an unusual level of activity.
• Retrospective or “Supervised” Event Identification: classification into pre-existing classes.
More Complex Problem: Prospective Detection or “Unsupervised” Learning
– Classes change: new classes appear, or existing classes change meaning
– A difficult problem in statistics
– Recent new C.S. approaches:
1) An algorithm detects a new class
2) A human analyst labels it and determines its significance
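
To make the two-step loop concrete, here is a minimal Python sketch of step 1, assuming documents and known classes are represented as vectors; the cosine-similarity test, the threshold, and all names are illustrative assumptions, not the project's chosen method:

```python
import numpy as np

def flag_new_class_candidates(doc_vectors, class_centroids, threshold=0.3):
    """Step 1 of the loop: flag documents too dissimilar to every known class.

    doc_vectors:     (n_docs, n_features) document representations
    class_centroids: (n_classes, n_features) one row per known class
    threshold:       cosine-similarity cutoff below which we flag
    """
    docs = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    cents = class_centroids / np.linalg.norm(class_centroids, axis=1, keepdims=True)
    best_sim = (docs @ cents.T).max(axis=1)   # similarity to the closest known class
    # Step 2 happens outside the algorithm: a human analyst labels these.
    return np.where(best_sim < threshold)[0]
```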
COMPONENTS OF AUTOMATIC MESSAGE PROCESSING
(1) Compression of Text -- to meet storage and processing limitations;
(2) Representation of Text -- put in a form amenable to computation and statistical analysis;
(3) Matching Scheme -- computing similarity between documents;
(4) Learning Method -- build on judged examples to determine characteristics of a document cluster (“event”);
(5) Fusion Scheme -- combine methods (scores) to yield improved detection/clustering.
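
For illustration only, the five components can be read as a pipeline of pluggable stages. This is a structural skeleton, not a committed design; every name below, including `model.exemplars`, is a placeholder we invented:

```python
def detect_events(stream, compress, represent, match, learn, fuse):
    """Skeleton only: each argument is a pluggable component."""
    docs = [represent(compress(text)) for text in stream]  # (1) compress, (2) represent
    model = learn(docs)                                    # (4) learn from judged examples
    scores = [(match(d, model.exemplars), model.score(d))  # (3) match against exemplars
              for d in docs]
    return [fuse(pair) for pair in scores]                 # (5) fuse into one decision
```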
STATE OF THE ART:
Best results to date:
– Retrospective Detection: David Lewis (2001), using simple Support Vector Machines
– Prospective Detection: results reported by a group from Oracle (2001), using a change-of-basis representation, which builds on natural-language knowledge
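
As a hedged illustration of retrospective detection with a simple support vector machine (not Lewis's actual system), using today's off-the-shelf scikit-learn API and toy data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy judged examples: messages already assigned to pre-existing classes.
train_texts = ["quake hits coastal city", "markets rally on earnings",
               "aftershocks follow quake", "stocks slide amid earnings fears"]
train_labels = ["earthquake", "finance", "earthquake", "finance"]

vectorizer = TfidfVectorizer()                    # representation component
X = vectorizer.fit_transform(train_texts)
classifier = LinearSVC().fit(X, train_labels)     # supervised SVM learner

new_messages = ["new aftershocks shake the coastal city"]
# Overlapping vocabulary should yield the 'earthquake' label here.
print(classifier.predict(vectorizer.transform(new_messages)))
```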
WHY WE CAN DO BETTER:
• Existing methods use some or all of the 5 automatic processing components, but don’t exploit the full power of the components and/or an understanding of how to apply them to text data.
• Lewis' methods used an off-the-shelf support vector machine supervised learner, but tuned it for frequency properties of the data.
• The combination dominated competing approaches in the TREC-2001 batch filtering evaluation.
WHY WE CAN DO BETTER II:
• Existing methods aim at fitting into available computational resources without paying attention to upfront data compression.
• We hope to do better by a combination of:
– more sophisticated statistical methods
– sophisticated data compression in a preprocessing stage
Alta Vista: combining data compression with naïve statistical methods leads to some success.
COMPRESSION:
• Reduce the dimension before statistical analysis.
• Recent results: “one-pass” processing of the data can reduce volume significantly without degrading performance significantly (e.g., use random projections).
• This is unlike feature-extracting dimension reduction, which can lead to bad results.
We believe that sophisticated dimension reduction methods in a preprocessing stage, followed by sophisticated statistical tools in a detection/filtering stage, can be a very powerful approach.
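
A minimal sketch of one-pass dimension reduction by random projection; the sizes are toy values, and a real text system would likely prefer sparse projections:

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_features, k = 500, 10_000, 200       # toy sizes

X = rng.random((n_docs, n_features))           # stand-in for document vectors

# One pass over the data: a single multiply against a random Gaussian
# matrix shrinks 10,000 features to 200 while approximately preserving
# pairwise distances (Johnson-Lindenstrauss).
R = rng.standard_normal((n_features, k)) / np.sqrt(k)
X_reduced = X @ R                              # shape (500, 200)

i, j = 0, 1
print(np.linalg.norm(X[i] - X[j]),                  # original distance...
      np.linalg.norm(X_reduced[i] - X_reduced[j]))  # ...roughly preserved
```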
MORE SOPHISTICATED STATISTICAL APPROACHES:
• Representations: Boolean representations; weighting schemes
• Matching Schemes: Boolean matching; nonlinear transforms of individual feature values
• Learning Methods: new kernel-based methods (nonlinear classification); more complex Bayes classifiers to assign objects to the highest-probability class
• Fusion Methods: combining scores based on ranks, linear functions, or nonparametric schemes
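
A small sketch of the rank-based fusion idea in the last bullet; the scores are made up, and averaging ranks is only one of many possible schemes:

```python
import numpy as np

def rank_fusion(score_lists):
    """Average per-method ranks: a simple nonparametric fusion scheme
    that is insensitive to each method's score scale."""
    ranks = [np.argsort(np.argsort(s)) + 1 for s in score_lists]
    return np.mean(ranks, axis=0)

svm_scores   = [0.9, 0.2, 0.4]    # hypothetical scores from one method
bayes_scores = [0.7, 0.1, 0.8]    # another method, on a different scale
print(rank_fusion([svm_scores, bayes_scores]))   # -> [2.5 1.  2.5]
```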
THE APPROACH
• Identify the best combination of newer methods through careful exploration of a variety of tools.
• Address issues of effectiveness (how well the task is done) and efficiency (in computational time and space).
• Use a combination of new or modified algorithms and improved statistical methods built on the algorithmic primitives.
IN LATER YEARS
• Extend work to unsupervised learning.
• Still concentrate on new methods for the 5 components.
• Emphasize “semi-supervised learning”: human analysts help to focus on the features most indicative of anomaly or change; algorithms assess incoming documents as to deviation on those features.
• Develop new techniques to represent data so as to highlight significant deviation:
– through an appropriately defined metric
– with new clustering algorithms
– building on analyst-designated features
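
One illustrative way to score deviation on analyst-designated features: a simple z-score rule; the matrices, indices, and the "max over flagged features" policy are our assumptions, not the project's method:

```python
import numpy as np

def deviation_scores(doc_matrix, baseline_matrix, analyst_features):
    """Score incoming documents by deviation on analyst-designated features.

    doc_matrix:       (n_new, n_features) incoming documents
    baseline_matrix:  (n_old, n_features) documents considered normal
    analyst_features: column indices the analyst flagged as indicative
    """
    base = baseline_matrix[:, analyst_features]
    mu, sigma = base.mean(axis=0), base.std(axis=0) + 1e-9
    z = np.abs((doc_matrix[:, analyst_features] - mu) / sigma)  # per-feature z-scores
    return z.max(axis=1)   # flag if ANY designated feature deviates strongly
```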
DIMACS STRENGTHS:
Strong team: statisticians, computer scientists, experts in information retrieval & library science
DAVID MADIGAN, Rutgers Statistics:
– NSF project on text classification.
– An expert on Bayes classifiers.
– Developing extensions beyond Bayes classifiers.
– (Lewis is his co-PI and a subcontractor on his NSF grant.)
DAVID LEWIS, Private Consultant:
– Best basic batch filtering methods.
– Extensive experience in text classification.
PAUL KANTOR, Rutgers, Library and Information Science and Operations Research:
– Expert on combining multiple methods for classifying candidate documents.
– Expert on information retrieval and interactive systems -- human input to leverage filtering and processing capabilities.
ILYA MUCHNIK, Rutgers Computer Science:
– Developed a fast statistical clustering algorithm that can deal with millions of cases in reasonable time.
– Pioneered the use of kernel methods for machine learning.
MUTHU MUTHUKRISHNAN, Rutgers Computer Science:
– Developed algorithms for making one pass over text documents to gain information about them.
MARTIN STRAUSS, AT&T Labs:
– Has new methods for handling data streams whose items are read once, then discarded.
RAFAIL OSTROVSKY, Telcordia Technologies:
– Developed dimension reduction methods in the hypercube.
– Used these in powerful algorithms to detect patterns in streams of data.
ENDRE BOROS, Rutgers Operations Research:
– Developed extremely useful methods for Boolean representation and rule learning.
FRED ROBERTS, Rutgers Mathematics and DIMACS:
– Developed methods for combining scores in software and hardware testing.
– Long-standing expertise in decision making and the social sciences.
DAVID GOLDSCHMIDT, Director, Institute for Defense Analyses - Center for Communications Research:
– Advisory role.
– Long-standing partnership between IDA-CCR and DIMACS.
– Will sponsor and co-organize a tutorial and workshop on the state of the art in data mining and homeland security to kick off the project.
S.O.W.: FIRST 12 MONTHS:
• Prepare available corpora of data on which to uniformly test different combinations of methods
• Concentrate on supervised learning and detection
• Systematically explore & compare combinations of compression schemes, representations, matching schemes, learning methods, and fusion schemes
• Test combinations of methods on common data sets and exchange information among the team
• Develop and test promising dimension reduction methods
S.O.W.: YEARS 2 AND 3:
• Combine leading methods for supervised learning with promising upfront dimension reduction methods
• Develop research-quality code for the leading identified methods for supervised learning
• Develop the extension to unsupervised learning:
– Detect suspicious message clusters before an event has occurred
– Use generalized stress measures indicating that a significant group of interrelated messages doesn’t fit into the known family of clusters
– Concentrate on semi-supervised learning
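
A toy version of such a stress measure, assuming vector representations and Euclidean distance; "mean nearest-centroid distance" is our stand-in for the generalized measures meant here:

```python
import numpy as np

def cluster_stress(group, centroids):
    """Mean distance from each message in a group to its nearest known
    cluster centroid; a high value suggests the group doesn't fit the
    known family of clusters."""
    dists = np.linalg.norm(group[:, None, :] - centroids[None, :, :], axis=2)
    return dists.min(axis=1).mean()

rng = np.random.default_rng(1)
centroids = rng.random((5, 20))            # known clusters (toy data)
suspicious = rng.random((10, 20)) + 3.0    # a group far from every centroid
print(cluster_stress(suspicious, centroids))  # large relative to normal groups
```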
IMPACT:
12 MONTHS:
• We will have established a state-of-the-art scheme for classification of accumulated documents in relation to known tasks/targets/themes and for building profiles to track future relevant messages.
• We are optimistic that by end-to-end experimentation, we will discover synergies between new mathematical and statistical methods for addressing each of the component tasks, and thus achieve significant improvements in performance on accepted measures that could not be achieved by piecemeal study of one or two component tasks.
IMPACT:
3 YEARS:
• We will have produced prototype code for testing the concepts and a rigorously precise expression of the ideas for translation into a commercial or government system.
• We will have extended our analysis to semi-supervised discovery of potentially interesting clusters of documents.
• This should allow us to identify potentially threatening events in time for cognizant agencies to prevent them from occurring.