Homeland Security Research at DIMACS: Monitoring Message Streams

HOMELAND SECURITY RESEARCH AT DIMACS
Working Group on Adverse Event/Disease
Reporting, Surveillance, and Analysis
•Health surveillance a core activity
in public health
•Concerns about bioterrorism have
attracted attention to new
surveillance methods:
–OTC drug sales
–Subway worker absenteeism
–Ambulance dispatches
•Spawns need for novel statistical
methods for surveillance of
multiple data streams.
Figure labels: Vaccine Safety Surveillance; Disease Surveillance; Drug Safety Surveillance; Syndromic Surveillance
Working Group on Privacy &
Confidentiality of Health Data
•Privacy concerns are a major
stumbling block to public health
surveillance, in particular
bioterrorism surveillance.
•Challenge: produce anonymous
data specific enough for research.
•Exploring ways to remove
identifiers (Social Security number, telephone number, zip code)
from data sets.
•Exploring ways to aggregate,
remove information from data
sets.
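
The two ideas above (removing identifiers, then aggregating) can be sketched as follows. This is only an illustration under stated assumptions: the field names (`ssn`, `phone`, `zip`, `syndrome`) and the records are hypothetical, and dropping fields does not by itself guarantee anonymity.

```python
from collections import Counter

# Hypothetical field names; real health records will differ.
DIRECT_IDENTIFIERS = {"ssn", "phone"}

def deidentify(record, zip_digits=3):
    """Drop direct identifiers and coarsen the zip code to a prefix."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "zip" in cleaned:
        # Keep only the leading digits so the record maps to a wider region.
        cleaned["zip"] = str(cleaned["zip"])[:zip_digits]
    return cleaned

def aggregate(records, keys=("zip", "syndrome")):
    """Release only counts per (region, syndrome) cell instead of rows."""
    return Counter(tuple(r[k] for k in keys) for r in records)

# Made-up records for illustration only.
records = [
    {"ssn": "000-00-0001", "phone": "555-0100", "zip": "08854", "syndrome": "fever"},
    {"ssn": "000-00-0002", "phone": "555-0101", "zip": "08855", "syndrome": "fever"},
    {"ssn": "000-00-0003", "phone": "555-0102", "zip": "10001", "syndrome": "rash"},
]
cleaned = [deidentify(r) for r in records]
counts = aggregate(cleaned)
print(counts)
```

The tension named in the challenge shows up in `zip_digits`: a shorter prefix hides more but leaves less geographic detail for research.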
Working Group on Analogies between
Computer Viruses and Biological Viruses
•Can ideas for defending against biological viruses lead to
ideas for defending against computer viruses?
•Concern about large gap between initial time of attack
and implementation of defensive strategies
•“Public health” approach: Once a virus has infected a
machine, it tries to connect to as many other computers as
possible, as fast as possible. A “throttle” limits the rate at
which a computer can connect to new computers.
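
A minimal sketch of such a throttle, assuming a simple design that the slides do not specify: connections to recently contacted hosts pass at once, while requests to new hosts wait in a queue that is released at one host per time step. The class name, method names, and working-set size are illustrative assumptions.

```python
from collections import deque

class ConnectionThrottle:
    """Limit how fast a machine may connect to hosts it has not seen recently."""

    def __init__(self, working_set_size=4):
        self.recent = deque(maxlen=working_set_size)  # recently contacted hosts
        self.pending = deque()                        # delayed requests to new hosts

    def request(self, host):
        """Return True if the connection may proceed now, False if it is delayed."""
        if host in self.recent:
            return True
        if host not in self.pending:
            self.pending.append(host)
        return False

    def tick(self):
        """Call once per time step: release at most one delayed host."""
        if not self.pending:
            return None
        host = self.pending.popleft()
        self.recent.append(host)
        return host

# A worm-like burst: 100 new hosts requested in one time step.
throttle = ConnectionThrottle()
for i in range(100):
    throttle.request(f"host{i}")

# After 10 time steps only 10 of those connections have been released.
released = [throttle.tick() for _ in range(10)]
print(released, len(throttle.pending))
```

Normal traffic, which mostly revisits a few hosts, is barely slowed; a rapidly growing queue is itself a signal that something is wrong.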
Working Group on Modeling Social
Responses to Bioterrorism
•Models of the spread of
infectious disease commonly
assume passive bystanders and
rational actors who will comply
with health authorities.
•It is not clear how well this
assumption applies to situations
like a bioterrorist attack using
smallpox or plague.
•Interdisciplinary group is discussing incorporating social
behavior into models, models of public health
decisionmaking, risk communication.
Figure: 1947, NYC, smallpox outbreak
The Bioterrorism
Sensor Location
Problem
• Early warning is
critical
• This is a crucial factor
underlying
government’s plans to
place networks of
sensors/detectors to
warn of a bioterrorist
attack
Figure: The BASIS System
Two Fundamental Problems
• Sensor Location
Problem (SLP):
– Choose an
appropriate mix of
sensors
– decide where to
locate them for best
protection and early
warning
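
One way to make the location half of the SLP concrete is to treat it as a maximum-coverage problem and solve it greedily. This is an assumption made for illustration; the slides do not give a formulation. The site names, zone numbers, and budget below are hypothetical.

```python
def greedy_sensor_placement(coverage, budget):
    """Pick up to `budget` sites, each time taking the site that covers
    the most zones not yet covered.

    coverage: dict mapping candidate site -> set of zones it can monitor.
    """
    chosen, covered = [], set()
    for _ in range(budget):
        best = max(coverage, key=lambda s: len(coverage[s] - covered), default=None)
        if best is None or not (coverage[best] - covered):
            break  # no site adds new coverage
        chosen.append(best)
        covered |= coverage[best]
    return chosen, covered

# Hypothetical candidate sites and the zones each would monitor.
coverage = {
    "A": {1, 2, 3},
    "B": {3, 4},
    "C": {4, 5, 6},
    "D": {1, 6},
}
chosen, covered = greedy_sensor_placement(coverage, budget=2)
print(chosen, sorted(covered))
```

A fuller model would also weigh sensor types, costs, and detection time, which this sketch ignores.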
Two Fundamental Problems
• Pattern Interpretation
Problem (PIP): When
sensors set off an alarm,
help public health
decision makers decide
– Has an attack taken place?
– What additional
monitoring is needed?
– What was its extent and
location?
– What is an appropriate
response?
Monitoring Message Streams:
Algorithmic Methods for Automatic
Processing of Messages
Supported by
Interagency KD-D
Group
OBJECTIVE:
Monitor huge streams of textualized
communication to automatically detect pattern
changes and "significant" events
Motivation: monitoring
email traffic
TECHNICAL
PROBLEM:
• Given stream of text in any language.
• Decide whether "new events" are present in the
flow of messages.
• Event: new topic or topic with unusual level of
activity.
• Retrospective or “Supervised” Event
Identification: Classification into pre-existing
classes.
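
The event definition above (a new topic, or a topic with an unusual level of activity) can be illustrated with a simple z-score test on per-topic message counts. This is a sketch under assumptions: topics are taken as already assigned, and the threshold and counts are made up.

```python
from statistics import mean, stdev

def detect_events(history, current, threshold=3.0):
    """Flag topics that are new or whose count is far above their baseline.

    history: dict topic -> list of past per-period counts.
    current: dict topic -> count in the current period.
    """
    flagged = []
    for topic, count in current.items():
        past = history.get(topic, [])
        if len(past) < 2:
            flagged.append((topic, "new"))  # no baseline yet: treat as a new topic
            continue
        mu = mean(past)
        sigma = stdev(past) or 1.0  # avoid dividing by zero on a flat baseline
        if (count - mu) / sigma > threshold:
            flagged.append((topic, "burst"))
    return flagged

# Made-up counts for illustration.
history = {"weather": [10, 12, 11, 9, 10], "sports": [20, 22, 19, 21, 20]}
current = {"weather": 11, "sports": 60, "outbreak": 7}
print(detect_events(history, current))
```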
TECHNICAL
PROBLEM:
• Batch filtering:
Given relevant
documents up front.
• Adaptive filtering:
“pay” for
information about
relevance as
process moves
along.
MORE COMPLEX PROBLEM:
PROSPECTIVE DETECTION OR
“UNSUPERVISED” LEARNING
• Classes change: new classes appear or existing
classes change meaning
• A difficult problem in statistics
• Recent new C.S. approaches
“Semi-supervised Learning”:
• Algorithm suggests a new class
• Human analyst labels it; determines its
significance
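
The semi-supervised loop above can be sketched as follows, assuming bag-of-words vectors and cosine similarity (choices made here for illustration, not stated in the slides). A message that matches no known class is returned as `None`, which stands for "suggest a new class to the analyst". The class names, texts, and threshold are hypothetical.

```python
import math
from collections import Counter

def vec(text):
    """Bag-of-words term counts for a message."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def suggest_class(message, centroids, min_sim=0.3):
    """Return the best known class, or None to suggest a possible new class."""
    v = vec(message)
    best, best_sim = None, 0.0
    for label, centroid in centroids.items():
        s = cosine(v, centroid)
        if s > best_sim:
            best, best_sim = label, s
    return best if best_sim >= min_sim else None

# Hypothetical known classes.
centroids = {
    "finance": vec("stock market shares bank earnings"),
    "sports": vec("match team score goal league"),
}
known = suggest_class("bank shares fall as market drops", centroids)
novel = suggest_class("unexplained fever cluster reported at clinic", centroids)
print(known, novel)

# The human analyst labels the suggested class; later messages can match it.
if novel is None:
    centroids["outbreak"] = vec("unexplained fever cluster reported at clinic")
```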
COMPONENTS OF AUTOMATIC
MESSAGE PROCESSING
(1). Compression of Text -- to meet storage and
processing limitations;
(2). Representation of Text -- put in form
amenable to computation and statistical analysis;
(3). Matching Scheme -- computing similarity
between documents;
(4). Learning Method -- build on judged examples
to determine characteristics of document cluster
(“event”)
(5). Fusion Scheme -- combine methods (scores) to
yield improved detection/clustering.
COMPONENTS OF AUTOMATIC
MESSAGE PROCESSING - II
•These distinctions are somewhat arbitrary.
•Many approaches to message processing overlap
several of these components of automatic message
processing.
Existing methods don’t exploit the full power of
the 5 components, synergies among them, and/or
an understanding of how to apply them to text
data.
COMPRESSION:
• Reduce the dimension before statistical analysis.
• We often have just one shot at the data as it
comes “streaming by”
COMPRESSION II:
• Recent results: A single pass (“one-pass”) through the
data can reduce volume significantly without
significantly degrading performance.
We believe that sophisticated dimension reduction
methods in a preprocessing stage followed by
sophisticated statistical tools in a detection/filtering
stage can be a very powerful approach. Our
methods so far give us some confidence that we are
right.
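
As an illustration of one-pass dimension reduction, the sketch below uses feature hashing: each term is mapped to one of a fixed number of buckets as the message streams by, so no vocabulary is stored. Feature hashing is a stand-in chosen here; the slides do not say which one-pass methods the project uses. The dimension and sample messages are arbitrary.

```python
import zlib

def hashed_vector(tokens, dim=16):
    """Compress a token stream into a fixed-length count vector in one pass."""
    v = [0] * dim
    for tok in tokens:
        # crc32 gives the same bucket on every run, unlike Python's built-in hash().
        v[zlib.crc32(tok.encode("utf-8")) % dim] += 1
    return v

def compress_stream(messages, dim=16):
    """Yield one compressed vector per message; each message is seen only once."""
    for text in messages:
        yield hashed_vector(text.lower().split(), dim)

# Made-up messages for illustration.
stream = iter(["Fever cases rising in the city", "Stock market closes higher"])
vectors = list(compress_stream(stream))
print(vectors)
```

Different terms can collide in one bucket; a larger `dim` trades storage for fewer collisions before the detection/filtering stage.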
COMPRESSION III:
•Three directions of work involving adaptation of
nearest neighbor (NN) algorithms from theoretical
computer science:
*Use of random projections into real subspaces.
(Still promising, though not competitive for our
data.)
*Random projections into Hamming cubes
*Efficient discovery of “deviant” cases in stream
of vectorized entities
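
To show what a random projection into a Hamming cube can look like, here is a sketch using sign random projections: each bit records which side of a random hyperplane the vector falls on, and the Hamming distance between bit strings then tracks the angle between the original vectors. This is a commonly used scheme named here as an assumption; it is not claimed to be the project's algorithm. The seed, bit count, and vectors are arbitrary.

```python
import random

def make_planes(dim, bits, seed=0):
    """Draw `bits` random Gaussian hyperplanes in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]

def to_hamming(x, planes):
    """Map a real vector to a bit string: one sign bit per hyperplane."""
    return [1 if sum(xi * ri for xi, ri in zip(x, r)) >= 0 else 0 for r in planes]

def hamming(a, b):
    """Number of positions where two bit strings differ."""
    return sum(ai != bi for ai, bi in zip(a, b))

planes = make_planes(dim=4, bits=32)
x = [1.0, 2.0, 0.0, 1.0]
same_direction = [2.0 * xi for xi in x]   # same direction, different length
opposite = [-xi for xi in x]              # opposite direction

hx = to_hamming(x, planes)
print(hamming(hx, to_hamming(same_direction, planes)))
print(hamming(hx, to_hamming(opposite, planes)))
```

Nearest-neighbor search can then compare short bit strings instead of long term vectors.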
MORE SOPHISTICATED STATISTICAL
APPROACHES BEING STUDIED:
• Representations: Boolean representations;
weighting schemes
• Matching Schemes: Boolean matching; nonlinear
transforms of individual feature values
• Learning Methods: new kernel-based methods;
more complex Bayes classifiers; boosting;
• Fusion Methods: combining scores based on
ranks, linear functions, or nonparametric
schemes
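
Two of the fusion options named above, ranks and linear functions, can be sketched as follows. The scores, weights, and document names are made up, and the code assumes every method scored the same set of documents.

```python
def ranks(scores):
    """Map document -> rank, where rank 1 is the highest score."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {doc: i + 1 for i, doc in enumerate(order)}

def fuse_by_rank(*score_dicts):
    """Average each document's rank across methods (lower is better)."""
    all_ranks = [ranks(s) for s in score_dicts]
    return {d: sum(r[d] for r in all_ranks) / len(all_ranks) for d in all_ranks[0]}

def fuse_linear(weights, *score_dicts):
    """Weighted sum of raw scores (higher is better)."""
    return {
        d: sum(w * s[d] for w, s in zip(weights, score_dicts))
        for d in score_dicts[0]
    }

# Hypothetical scores from two methods on different scales.
method_a = {"d1": 0.9, "d2": 0.4, "d3": 0.1}
method_b = {"d1": 2.0, "d2": 7.5, "d3": 1.0}

by_rank = fuse_by_rank(method_a, method_b)
by_sum = fuse_linear([0.5, 0.5], method_a, method_b)
print(by_rank, by_sum)
```

Rank fusion ignores the scale of each method's scores, while the linear form lets a method with larger raw scores dominate unless the weights compensate.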
DATA SETS USED:
• No readily available data set has all the
characteristics of data on which we expect our
methods to be used
• However: Many of our methods depend
essentially only on term frequencies by
document.
• Thus, many available data sets can be used for
experimentation.
DATA SETS USED II:
TREC (Text Retrieval Conference) data: time-stamped subsets of the data (order 10^5 to 10^6 messages)
Reuters Corpus Vol. 1 (8 x 10^5 messages)
Medline Abstracts (order 10^7 with human indexing)
THE MONITORING MESSAGE
STREAMS PROJECT TEAM:
Endre Boros, RUTCOR
Paul Kantor, SCILS
Dave Lewis, Consultant
Ilya Muchnik, DIMACS/CS
S. Muthukrishnan, CS
David Madigan, Statistics
Rafail Ostrovsky, Telcordia Technologies
Fred Roberts, Rutgers
Martin Strauss, AT&T Labs
Wen-Hua Ju, Avaya Labs (collaborator)