Homeland Security Research at DIMACS

PROGRAMS IN HOMELAND SECURITY AT DIMACS
Fred S. Roberts
DIMACS Director
THE FOUNDING OF DIMACS
THE NSF SCIENCE AND
TECHNOLOGY CENTERS PROGRAM
The STC program was launched by the White House
and the National Academy of Sciences in 1988 in
order to increase the economic competitiveness of the
U.S.
NSF ran a nationwide competition. The rules:
*cutting edge research
*education and knowledge transfer
*university-industry partnerships
THE FOUNDING OF DIMACS
Because of the increasing importance of discrete
mathematics and theoretical computer science, especially
in telecommunications and computing, four institutions
(Rutgers and Princeton Universities, AT&T Bell Labs, and
Bell Communications Research (Bellcore)) each developed
strong research groups in these fields.
Under the leadership of Rutgers, they came together to
found DIMACS and entered the STC competition.
There were more than 800 preproposals; more than 300
proposals, in all fields of science; 11 winners.
The DIMACS Partners Today
Rutgers University
Princeton University
AT&T Labs
Bell Labs (Lucent Technologies)
NEC Laboratories America
Telcordia Technologies
Affiliates:
Avaya Labs
HP Labs
IBM Research
Microsoft Research
Stevens Institute of Technology
WHO IS DIMACS?
•There are about 250 scientists affiliated
with DIMACS and called permanent
members.
•Most are from the partner and affiliated
organizations.
•They include many of the world’s leaders
in discrete mathematics and theoretical
computer science and their applications.
•They also include statisticians, biologists,
psychologists, chemists, epidemiologists,
and engineers.
•None are paid by DIMACS, but they join
in DIMACS projects.
Outline: A Selection of DIMACS Projects
•Bioterrorism Sensor Location
•Port of Entry Inspection Algorithms
•Monitoring Message Streams
•Author Identification
•Computational and Mathematical Epidemiology
•Adverse Event/Disease
Reporting/Surveillance/Analysis
•Bioterrorism Working Group
•Modeling Social Responses to Bioterrorism
•Predicting Disease Outbreaks from Remote
Sensing and Media Data
•Communication Security and Information Privacy
The Bioterrorism Sensor
Location Problem
• Early warning is
critical in defense
against terrorism
• This is a crucial factor
underlying the
government’s plans to
place networks of
sensors/detectors to
warn of a bioterrorist
attack
The BASIS System – Salt Lake City
Locating Sensors is not Easy
• Sensors are expensive
• How do we select them and where do we place
them to maximize “coverage,” expedite an
alarm, and keep the cost down?
• Approaches that improve upon existing, ad hoc
location methods could save countless lives in
the case of an attack and also money in capital
and operational costs.
Two Fundamental Problems
• Sensor Location
Problem
– Choose an
appropriate mix of
sensors
– decide where to
locate them for
best protection and
early warning
Two Fundamental Problems
• Pattern Interpretation
Problem: When sensors set
off an alarm, help public
health decision makers
decide
– Has an attack taken place?
– What additional
monitoring is needed?
– What was its extent and
location?
– What is an appropriate
response?
The SLP: What is a Measure of
Success of a Solution?
• A modeling problem.
• Needs to be made precise.
• Many possible formulations.
The SLP: What is a Measure of
Success of a Solution?
• Identify and ameliorate false alarms.
• Defending against a “worst case” attack or an
“average case” attack.
• Minimize time to first alarm? (Worst case?
Average case?)
• Maximize “coverage” of the area.
– Minimize geographical area not covered
– Minimize size of population not covered
– Minimize probability of missing an attack
The SLP: What is a Measure of
Success of a Solution?
•Cost: Given a mix of available sensors and a
fixed budget, what mix will best accomplish our
other goals?
The SLP: What is a Measure of
Success of a Solution?
•It’s hard to separate the goals.
•Even a small number of sensors might detect
an attack if there is no constraint on time to
alarm.
•Without budgetary restrictions, a lot more can
be accomplished.
The Sensor Location Problem
•Approach is to develop new algorithmic
methods.
•We are building on approaches to other modeling
problems, seeing if they can be modified in the
sensor location context.
•This is a multi-criteria modeling problem and it
seems hopeless to try to find “optimal solutions”
•We will be happy with “efficient” algorithms that
find “good” solutions
Algorithmic Approaches I : Greedy
Algorithms
Greedy Algorithms
• Find the most important location first and locate a
sensor there.
• Find second-most important location.
• Etc.
• Builds on earlier mathematical work at Institute for
Defense Analyses (Grotte, Platt)
• “Steepest ascent approach.”
• No guarantee of “optimal” or best solution.
• In practice, gets pretty close to optimal solution.
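A minimal sketch of the greedy idea above, assuming a hypothetical coverage model in which each candidate site covers a known set of demand points (for example, population cells). The function name, the `coverage` table, and the toy data are illustrative assumptions and are not taken from the Institute for Defense Analyses work.

```python
def greedy_placement(coverage, k):
    """Pick up to k sites; each step takes the site that covers the most
    points not yet covered (one steepest-ascent step)."""
    chosen, covered = [], set()
    for _ in range(k):
        # Site with the largest marginal gain given current coverage.
        best = max(
            (s for s in coverage if s not in chosen),
            key=lambda s: len(coverage[s] - covered),
            default=None,
        )
        if best is None or not (coverage[best] - covered):
            break  # no remaining site adds coverage
        chosen.append(best)
        covered |= coverage[best]
    return chosen, covered


# Hypothetical candidate sites and the points each one would cover.
coverage = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5},
    "C": {5, 6},
    "D": {1, 6},
}
sites, covered = greedy_placement(coverage, 2)
```

On this toy instance the first pick is the site with the largest coverage and the second pick is the one that adds the most new points, not the one with the second-largest coverage overall.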
Algorithmic Approaches II :
Variants of Classic Location and
Clustering Methods
• Location theory: locate facilities (sensors) to
be used by users located in a region.
• Cluster analysis: Given points in a metric
space, partition them into groups or clusters so
points within clusters are relatively close.
• Clusters correspond to points covered by a
facility (sensor).
Variants of Classic Location and
Clustering Methods
• k-median clustering: Given k sensors, place
them so each point in the city is within x feet
of a sensor.
• Complications: More dimensions: location
affects sensitivity, wind strength enters,
sensors have different characteristics, etc.
• This higher-dimensional k-median clustering
problem is hard! Best-known algorithms are
due to Rafail Ostrovsky.
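A toy sketch of the k-median idea on a handful of points. It assumes the standard k-median objective (minimize the sum of distances from each point to its nearest chosen site) and also reports the covering radius, i.e. the “x feet” in the bullet above. The brute-force search, the function names, and the coordinates are illustrative assumptions; this is not one of the algorithms cited on the slide and it only works at toy sizes.

```python
from itertools import combinations
from math import dist


def k_median_brute_force(points, candidates, k):
    """Return (best_sites, total_distance) minimizing the sum over points of
    the distance to the nearest chosen site. Exponential in k."""
    def cost(sites):
        return sum(min(dist(p, s) for s in sites) for p in points)

    best = min(combinations(candidates, k), key=cost)
    return best, cost(best)


def covering_radius(points, sites):
    """Largest distance from any point to its nearest site."""
    return max(min(dist(p, s) for s in sites) for p in points)


# Hypothetical city points and candidate sensor sites.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
candidates = [(0, 1), (10, 1), (5, 1)]
sites, total = k_median_brute_force(points, candidates, 2)
radius = covering_radius(points, sites)
```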
Variants of Classic Location and
Clustering Methods
• Further complications make this even more
challenging:
– Different costs of different sensors
– Restrictions on where we can place different
sensors
– Is it better to have every point within x feet of
some sensor or every point within y feet of at least
three sensors (y > x)?
• Approximation methods due to Chuzhoy,
Ostrovsky, and Rabani and to Guha, Tardos, and
Shmoys are relevant.
Algorithmic Approaches III :
Variants of Highway Sensor
Network Algorithms
Variants of Highway Sensor
Network Algorithms
• Sensors located along highways and nearby
pathways measure atmospheric and road
conditions.
• Muthukrishnan, et al. have developed very
efficient algorithms for sensor location.
• Based on “bichromatic clustering” and
“bichromatic facility location” (color nodes
corresponding to sensors red, nodes
corresponding to sensor messages blue)
Variants of Highway Sensor
Network Algorithms
• These algorithms apply to situations with
many more sensors than the bioterrorism
sensor location problem.
• As BT sensor technology changes, we can
envision a myriad of miniature sensors
distributed around a city, making this work all
the more relevant.
Algorithmic Approaches IV :
Building on Equipment Placing
Algorithms
Building on Equipment Placing
Algorithms
• The “Node Placement Problem” is the problem of
determining locations or nodes to install
certain types of networking equipment.
• “Coverage” and cost are major considerations.
• Researchers at Telcordia Technologies have
studied variations of this problem arising from
broadband access technologies.
The Broadband Access Node
Placement Problem
• There are inherent range limitations that
drive placement.
• E.g.: customer for DSL service must be
within xx feet of an assigned
multiplexer.
• Multiplexer = sensor.
• Problem solved using dynamic
programming algorithms.
(Tamra Carpenter, Martin Eiger, David
Shallcross, Paul Seymour)
The Broadband Access Node
Placement Problem: Complications
• Restrictions on types of equipment that can be
placed at a given node.
• Constraints on how far a signal from a given
piece of equipment can travel.
• Cost and profit maximization considerations.
• Relevance of work on general integer
programming, the knapsack cover problem,
and local access network expansion problems.
The Pattern Interpretation Problem
• It will be up to the
Decision Maker to
decide how to
respond to an alarm
from the sensor
network.
The Pattern Interpretation Problem
• Little has been done to develop analytical models
for rapid evaluation of a positive alarm or pattern
of alarms from a sensor network.
• How can this pattern be used to minimize false
alarms?
• Given an alarm, what other surveillance measures
can be used to confirm an attack, locate areas of
major threat, and guide public health
interventions?
The Pattern Interpretation Problem
(PIP)
• Close connection to the SLP.
• How we interpret a pattern of alarms will affect
how we place the sensors.
• The same simulation models used to place the
sensors can help us in tracing back from an alarm
to a triggering attack.
Approaching the PIP: Minimizing
False Alarms
• One approach:
Redundancy. Require
two or more sensors
to make a detection
before an alarm is
considered
confirmed.
Approaching the PIP: Minimizing
False Alarms
• Portal Shield: requires two positives for the
same agent during a specific time period.
• Redundancy II: Place two or more sensors at
or near the same location. Require two
proximate sensors to give off an alarm before
we consider it confirmed.
• Redundancy drawbacks: cost, delay in
confirming an alarm.
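The redundancy rule above can be sketched as a time-window check. This is a hedged illustration only: it assumes positives arrive as (time, sensor, agent) records and that the two positives must come from different sensors, which the slide does not state for Portal Shield. The function name and data are hypothetical.

```python
def confirmed_alarms(positives, window):
    """positives: list of (time, sensor_id, agent).
    Return the agents with positives from two different sensors
    no more than `window` time units apart."""
    confirmed = set()
    events = sorted(positives)  # order by time
    for i, (t1, s1, a1) in enumerate(events):
        for t2, s2, a2 in events[i + 1:]:
            if t2 - t1 > window:
                break  # later events are even further away
            if a1 == a2 and s1 != s2:
                confirmed.add(a1)
    return confirmed


# Hypothetical detections: two close positives for one agent, two distant for another.
positives = [
    (0, "s1", "anthrax"),
    (5, "s2", "anthrax"),
    (3, "s1", "plague"),
    (50, "s3", "plague"),
]
alarms = confirmed_alarms(positives, window=10)
```

The cost of this rule is visible in the sketch: a confirmation cannot be issued before the second positive arrives, which is the delay drawback noted above.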
Approaching the PIP: Using Decision
Rules
• Existing sensors
come with a
sensitivity level
specified and sound
an alarm when the
number of particles
collected is
sufficiently high –
above threshold.
Approaching the PIP: Using
Decision Rules
• Alternative decision rule: alarm if two sensors
reach 90% of threshold, three reach 75% of
threshold, etc.
• One approach: use clustering algorithms for
sounding an alarm based on a given
distribution of clusters of sensors reaching a
percentage of threshold.
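A small sketch of the alternative decision rule, assuming all sensors share one threshold and that a single sensor at 100% of threshold still triggers an alarm (that first rule is an assumption added for completeness). The function name and default rule table are illustrative.

```python
def network_alarm(readings, threshold,
                  rules=((1, 1.00), (2, 0.90), (3, 0.75))):
    """Alarm if, for any (count, fraction) rule, at least `count` sensors
    read at or above `fraction * threshold`."""
    return any(
        sum(r >= frac * threshold for r in readings) >= count
        for count, frac in rules
    )
```

For example, with a threshold of 100, readings of 95, 92, and 10 trigger the two-sensor rule, while 95, 80, and 10 trigger none of the rules.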
Approaching the PIP: Using
Decision Rules
• When sensors are to be used jointly, the rules
for “tuning” each sensor should be optimized
to take advantage of the fact that each is part of
a network.
• The optimal tuning depends on the decision
rule applied to reach an overall decision given
the sensor inputs.
Approaching the PIP: Using
Decision Rules
• Prior work along
these lines in missile
detection (Cherikh
and Kantor)
Approaching the PIP: Using
Decision Rules
• Most work has concentrated on the case of
stochastic independence of information
available at two sensors – clearly violated in
BT sensor location problems.
• Even with stochastic independence, finding
“optimal” decision rules is nontrivial.
• Recent promising approaches of Paul Kantor:
study fusion of multiple methods for
monitoring message streams.
Approaching the PIP: Spatio-Temporal Mining of Sensor Data
• Sensors provide observations of the state of the
world localized in space and time.
• Finding trends in data from individual sensors: time
series data mining.
• PIP: detecting general correlations in multiple time
series of observations.
• This has been studied in statistics, database theory,
knowledge discovery, data mining.
• Complications: proximity relationships based on
geography; complex chronological effects.
Approaching the PIP: Spatio-Temporal Mining of Sensor Data
• Sensor technology is evolving rapidly.
• It makes sense to consider idealized settings where
data are collected continuously and communicated
instantly.
• Then, modern methods of spatio-temporal data
mining due to Muthukrishnan and others are
relevant.
Approaching the PIP: Triggering
Other Methods of Surveillance
• One type of BT surveillance cannot be considered in
isolation.
• Question: How can the pattern of sensor warnings
guide other biosurveillance methods?
• Increased syndromic surveillance?
• Change threshold for alarm in syndromic
surveillance?
• Increased attention to E.R. visits in a certain region?
Approaching the PIP: Triggering
Other Methods of Surveillance
• Decreased threshold for alarm from subway
worker absenteeism levels?
Approaching the PIP: Triggering
Other Methods of Surveillance
• If there is an initial alarm, each sensor may be
read more often.
• How do we pick the sensors to read more
frequently?
• This is “adaptive biosensor engagement.”
• Methods of bichromatic combinatorial
optimization may be relevant.
• As for the SLP, sensors get one color, sensor
messages another.
• Relevance of work of Muthukrishnan.
Port of Entry Inspection Algorithms
In collaboration with
Los Alamos National Laboratory
Port of Entry Inspection Algorithms
•Goal: Find ways to intercept illicit nuclear
materials and weapons destined for the U.S. via
the maritime transportation system
•Aim: Develop decision support algorithms that
will help us to “optimally” intercept illicit
materials and weapons
•Find inspection schemes that minimize total
“cost” including “cost” of false positives and
false negatives
Sequential Decision Making Problem
•Stream of entities arrives at a port
•Decision Maker needs to decide which to inspect,
which to subject to increasingly stringent
inspection based on outcomes of previous
inspections
•Our approach: “decision logics” and
combinatorial optimization methods
•Builds on approach of Stroud
and Saeger and large literature
in sequential decision making.
Sequential Decision Making Problem
•Entities arrive and must be classified into categories.
•Simple case: 0 = “ok”, 1 = “suspicious”
•Observations are made.
•Inspection scheme: specifies which observations are
to be made based on previous observations
•Entities have attributes a0, a1, …, an, each in a
number of states
•Sample attributes:
Does ship’s manifest set off an “alarm”?
Does container give off neutron or Gamma
emission above threshold?
Does a radiograph image come up positive?
Does an induced fission test come up positive?
Sequential Decision Making Problem
•Simplest Case: Attributes are in state 0 or 1
•Then: Entity is a binary string like 011001
•Then: Classification is a decision function F that
assigns each binary string to a category.
•If there are two categories, 0 and 1, F is a
boolean function.
F(000) = F(111) = 1, F(abc) = 0 otherwise
This classifies an entity as positive iff it has none
of the attributes or all of them.
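The example decision function can be written out directly. This sketch simply enumerates all three-bit entities and applies the F given above; the variable names are illustrative.

```python
from itertools import product


def F(bits):
    """Example from the slide: category 1 iff the entity has none of the
    three attributes or all of them."""
    return 1 if bits in ("000", "111") else 0


# Full truth table over the 8 possible entities.
truth_table = {"".join(b): F("".join(b)) for b in product("01", repeat=3)}
positives = sorted(s for s, v in truth_table.items() if v == 1)
```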
Sequential Decision Making Problem
•Different problems depending on whether or not F
is known. Assume first that F is known.
•Given an entity, test its attributes until we know
enough to calculate the value of F.
•An inspection scheme tells us in which order to
test the attributes to minimize cost.
•Even this simplified problem is hard
computationally.
Binary Decision Tree Approach
•We assume we have sensors to measure presence
or absence of attributes.
•Build a tree:
•Nodes are sensors or categories (0 or 1)
•Label nodes with the attribute the sensor measures
or the number of the category
•Category nodes are “leaves” of the tree – nodes
with only one neighbor
•Two arcs exit from each sensor node, labeled
left and right.
•Take the right arc when sensor says the
attribute is present, left arc otherwise
Binary Decision Tree Approach
•We reach category 1
from the root only
through the path a0 to a1
to 1.
•Thus, an entity is
classified in category 1
iff it has both attributes.
•The binary decision tree
corresponds to the
boolean function F(11) =
1, F(10) = F(01) = F(00)
= 0.
Figure 1
Binary Decision Tree Approach
•We reach category 1
from the root by:
a0 L to a1 R to a2 R to 1, or
a0 R to a2 R to 1
•An entity is classified in
category 1 iff it has
a1 and a2 and not a0, or
a0 and a2 and possibly a1.
•Corresponding boolean
function F(111) = F(101)
= F(011) = 1, F(abc) = 0
otherwise.
Figure 2
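The tree described in the text for Figure 2 can be encoded and checked against the stated boolean function. This is a sketch under one assumption: every branch not mentioned on the slide ends in category 0. The tuple encoding and function names are illustrative.

```python
from itertools import product

# A sensor node is (attribute_index, left_subtree, right_subtree);
# a leaf is the category 0 or 1. Right arc = attribute present.
# Assumption: branches not described in the text end in category 0.
TREE = (0,
        (1, 0, (2, 0, 1)),   # a0 absent: a1 R, then a2 R, reaches 1
        (2, 0, 1))           # a0 present: a2 R reaches 1


def classify(tree, bits):
    """Walk the tree for an entity such as '011'.
    Return (category, number_of_sensors_used)."""
    used = 0
    while not isinstance(tree, int):
        attr, left, right = tree
        tree = right if bits[attr] == "1" else left
        used += 1
    return tree, used


positives = sorted(
    "".join(b) for b in product("01", repeat=3)
    if classify(TREE, "".join(b))[0] == 1
)
```

The entities classified as category 1 are exactly those listed for the corresponding boolean function, and the number of sensors consulted varies with the entity.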
Binary Decision Tree Approach
•This binary decision
tree corresponds to the
same boolean function
F(111) = F(101) =
F(011) = 1, F(abc) = 0
otherwise.
However, it has one fewer observation node. So, it
observation node. So, it
is more efficient if all
observations are equally
costly and equally likely.
Figure 3
Binary Decision Tree Approach
•Even if the boolean function F is fixed, the problem of
finding the “optimal” binary decision tree for it is NP-complete.
•For small n, can try to solve it by brute force
enumeration.
•But even for n = 4, not practical. (n = 4 at Port of Long
Beach-Los Angeles)
•Seeking heuristic algorithms, approximations to optimal.
•Making special assumptions about the boolean function
F.
•Example: For so-called “monotone” boolean functions,
integer programming formulations give promising
heuristics.
Cost Functions
•Above analysis: Only uses number of sensors
•Using a sensor has a cost:
Unit cost of inspecting one item with it
Fixed cost of purchasing and deploying it
Delay cost from queuing up at the sensor
station
•How many nodes of the decision tree are
actually visited during average inspection?
Depends on “distribution” of entities.
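To illustrate how the average number of visited nodes depends on the distribution of entities, the sketch below computes the expected number of sensor nodes visited for two trees that realize the same function F(111) = F(101) = F(011) = 1. It assumes attributes are present independently with given probabilities. The first tree follows the paths described for Figure 2; the second tree's shape is a hypothetical equivalent chosen for illustration, since the figures are not reproduced here.

```python
from itertools import product


def visits(tree, bits):
    """Number of sensor nodes visited when classifying `bits`."""
    n = 0
    while not isinstance(tree, int):
        attr, left, right = tree
        tree = right if bits[attr] else left
        n += 1
    return n


def expected_visits(tree, p):
    """p[i] = probability that attribute i is present (assumed independent)."""
    total = 0.0
    for bits in product((0, 1), repeat=len(p)):
        prob = 1.0
        for b, pi in zip(bits, p):
            prob *= pi if b else 1 - pi
        total += prob * visits(tree, bits)
    return total


# Node = (attribute_index, left_subtree, right_subtree); leaf = category.
tree_a = (0, (1, 0, (2, 0, 1)), (2, 0, 1))   # four sensor nodes
tree_b = (2, 0, (0, (1, 0, 1), 1))           # three sensor nodes (assumed shape)
p = (0.5, 0.5, 0.5)
cost_a = expected_visits(tree_a, p)
cost_b = expected_visits(tree_b, p)
```

Changing `p` changes which tree is cheaper on average, which is the point of the bullet above; per-sensor unit costs could be added by weighting each visit.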
Cost Functions
•Cost of false positive: Cost of additional tests.
•If it means opening the container, it’s very
expensive.
•Cost of false negative: Complex issue.
Complications
•Sensor errors – probabilistic approach
•More than two values of an attribute (present,
absent, present with 75% probability, …)
•Partially defined boolean functions (inferring
the boolean function from observations)
•In this case, machine learning approaches are
promising:
Bayesian binary regression
Splitting strategies
Pruning learned decision trees
Monitoring Message Streams:
Algorithmic Methods for Automatic
Processing of Messages
OBJECTIVE:
Monitor huge communication streams, in particular,
streams of textualized communication to
automatically detect pattern changes and
"significant" events
Motivation: monitoring
email traffic, news,
communiques, faxes,
voice intercepts (with
speech recognition)
TECHNICAL APPROACHES:
• Given stream of text in any language.
• Decide whether "new events" are present in the
flow of messages.
• Event: new topic or topic with unusual level of
activity.
• Initial Problem: Retrospective or “Supervised”
Event Identification: Classification into pre-existing classes. Given example messages on
events/topics of interest, algorithm detects
instances in the stream.
TECHNICAL APPROACHES:
SUPERVISED FILTERING
• Batch filtering: Given
examples of relevant
documents up front.
• Adaptive filtering:
Examples accumulate over
time; the system must
decide whether to bother
the analyst for guidance,
and so “pays” for
information about
relevance as the process
moves along.
MORE COMPLEX PROBLEM:
PROSPECTIVE DETECTION OR
“UNSUPERVISED” FILTERING
• Classes change: new classes appear or existing
classes change meaning
• A difficult problem in statistics
• Recent new C.S. approaches
“Semi-supervised Learning”:
• Algorithm suggests a possible new
event/topic
• Human analyst labels it; determines its
significance
COMPONENTS OF AUTOMATIC
MESSAGE PROCESSING
(1). Compression of Text – increase speed, reduce
memory/disk use
(2). Representation of Text – convert text to form
amenable to computation and statistical analysis;
(3). Matching Scheme – compute similarity
between texts;
(4). Learning Method – create profiles of
events/topics from known examples.
(5). Fusion Scheme -- combine multiple filtering
techniques to increase accuracy.
COMPONENTS OF AUTOMATIC
MESSAGE PROCESSING - II
•These distinctions are somewhat arbitrary.
•Many approaches to message processing overlap
several of these components of automatic message
processing; our techniques usually address more
than one component.
Project Premise: Existing methods don’t exploit
the full power of the 5 components, synergies
among them, and/or an understanding of how to
apply them to text data.
COMPONENTS OF AUTOMATIC
MESSAGE PROCESSING - III
•Our approach is to develop/explore methods for
each component and then to combine them.
•In the first phase of the project, we did over 5000
complete experiments with different combinations
of methods.
Nearest Neighbor (kNN) Classifiers
• Route message by
– Finding k most similar training messages
(neighbors)
– Assigning it to the classes that are most common among
neighbors (using weighting by distance)
• kNN classifiers studied since 1958, for text since early
90’s
– Moderately effective for text; has been considered
inefficient; finding neighbors is slow
• But, finding neighbors only needs to be done once
– No matter how many classes (even if huge)
– So: for large number of topics, maybe more efficient
than one-classifier-per-topic approaches
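A minimal sketch of kNN routing with similarity-weighted voting, assuming a bag-of-words representation and cosine similarity (both are assumptions; the slides do not fix the representation). Note that the neighbor search runs once per message, and every class receives its score from that single search. Function names and the toy training set are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two term-count dictionaries."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_scores(message, training, k):
    """training: list of (text, set_of_class_labels).
    Return class -> score, where each of the k nearest neighbors adds
    its similarity to every class it carries."""
    q = Counter(message.lower().split())
    sims = [(cosine(q, Counter(t.lower().split())), labels)
            for t, labels in training]
    sims.sort(key=lambda pair: pair[0], reverse=True)
    scores = Counter()
    for sim, labels in sims[:k]:
        for c in labels:
            scores[c] += sim
    return scores


# Hypothetical training messages with one or more topic labels each.
training = [
    ("wheat prices rise", {"grain"}),
    ("corn and wheat exports", {"grain", "trade"}),
    ("oil prices fall", {"energy"}),
    ("bank raises interest rates", {"finance"}),
]
scores = knn_scores("wheat exports rise", training, k=2)
best = scores.most_common(1)[0][0]
```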
Speeding up kNN
• Can finding neighbors be made fast enough to make kNN
practical?
• Worked on fast implementation
• Store text and classes sparsely (Representation)
– Store class labels sparsely
– Arrange computations to do work proportional only to
number of class labels in neighbors, not total number of
classes
• Search engine heuristics use the in-memory inverted file
(Matching)
– Use inverted file (group by word, not by document)
– Retain only high impact terms within each document, or
within each inverted list
– Compute similarities using only inverted lists for the
few words occurring in test document
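The inverted-file idea can be sketched as follows. This assumes documents are already represented as term-weight dictionaries and uses "keep the top few terms per document" as a simple stand-in for high-impact pruning; the actual pruning criteria used in the project are not given on the slide. All names and data are illustrative.

```python
from collections import defaultdict


def build_inverted_index(docs, keep_top=None):
    """docs: {doc_id: {term: weight}}. Return term -> [(doc_id, weight)].
    If keep_top is set, keep only that many highest-weight terms per
    document (a simple stand-in for 'high impact' pruning)."""
    index = defaultdict(list)
    for doc_id, terms in docs.items():
        ranked = sorted(terms.items(), key=lambda kv: kv[1], reverse=True)
        for term, weight in ranked[:keep_top]:
            index[term].append((doc_id, weight))
    return index


def similarities(query, index):
    """Dot-product scores, touching only the inverted lists of the
    terms that occur in the query."""
    scores = defaultdict(float)
    for term, q_weight in query.items():
        for doc_id, weight in index.get(term, ()):
            scores[doc_id] += q_weight * weight
    return dict(scores)


# Hypothetical weighted training documents.
docs = {
    "d1": {"wheat": 3.0, "price": 1.0, "the": 0.1},
    "d2": {"oil": 2.0, "price": 2.0, "the": 0.1},
}
index = build_inverted_index(docs, keep_top=2)
sims = similarities({"price": 1.0, "the": 1.0}, index)
```

Pruning drops the low-weight term from every list, so the index shrinks and the query touches fewer entries, at the price of slightly approximate similarities.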
kNN: Results
• Great reduction in size of inverted index and
speed of classification
• Slight additional cost in effectiveness
• Effectiveness slightly below our best methods
(Bayesian probit and logistic classifiers)
• Compressed index 90% smaller than original
index w/ only 7-12% loss in effectiveness (macro-F1)
• Approximate matching is 10 to 100 times faster
w/ only 2-10% loss in effectiveness (macro-F1)
• Ours are first large scale experiments on search
engine heuristic for neighbor lookup in kNN
• Partnership between theoreticians and
practitioners.
Bayesian Methods
•Bayesian statistical methods place
“prior” probability distributions on all
unknowns, and then compute
“posterior” distribution for the
unknowns conditional on the knowns.
Thomas Bayes
Bayesian Methods
•Zhang and Oles (2001): developed an efficient optimization
algorithm for logistic regression (10,000 dimensions) and
achieved excellent predictive performance.
•The Bayesian approach explicitly incorporates prior
knowledge about model complexity (“regularization”)
•We extended the Bayesian approach to incorporate a
prior requirement for sparsity.
•Logistic regression has one parameter per dimension; our
sparse model sets many of these to zero; handles hundreds of
thousands of parameters efficiently.
•Resulting sparse models produce outstanding accuracy
and ultra-fast predictions with no ad-hoc feature selection
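One standard way to express a sparsity-favoring prior, shown here as an assumption about the general form rather than the project's exact model, is a Laplace (double-exponential) prior on each coefficient. Its posterior mode coincides with L1-penalized logistic regression, which sets many coefficients exactly to zero. The symbols below are introduced for illustration, with labels y_i in {-1, +1}.

```latex
% Logistic likelihood for document x_i with label y_i \in \{-1,+1\}
p(y_i \mid x_i, \beta) = \frac{1}{1 + \exp(-y_i\, \beta^{\top} x_i)}

% Independent Laplace prior on each coefficient (assumed form)
p(\beta_j \mid \lambda) = \frac{\lambda}{2} \exp(-\lambda\, |\beta_j|)

% Posterior mode: L1-penalized logistic regression
\hat{\beta} = \arg\max_{\beta}
  \Big[ -\sum_{i} \log\big(1 + \exp(-y_i\, \beta^{\top} x_i)\big)
        \;-\; \lambda \sum_{j} |\beta_j| \Big]
```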
Bayesian Methods: Sample Results
•We have implemented several efficient variants, e.g.,
probit, informative priors.
•Publicly released software; over 1000 downloads
•Compared to Zhang & Oles, our implementation:
–Eliminates ad hoc feature selection
–Often uses less than 1% of the features at
prediction time
–Is publicly available
•Accuracy: as good as the best results ever published.
•In sum, we have a sparseness-inducing Bayesian
approach that produces dramatically simpler models
with no loss in accuracy
Streaming Data Analysis
•Motivated by need to
make decisions about
data during an initial
scan as data “stream
by”
•Recent development of
theoretical CS algorithms
•Algorithms motivated by
intrusion detection,
transaction applications,
time series transactions
Streaming Text Data: “Historic” Data
Analysis
• The accumulation of text messages is
massive over time
• A lot of streaming research is focused on
on-going or current analyses
• It is a great challenge to use only
summarized historic data and see if a
currently emerging phenomenon had
precursors occurring in the past
• We are working on a novel architecture for
historic and posterior analyses via small
summaries - “sketches”
Streaming Analysis Tool: CM Sketch
• Theoretical: We have developed the CM Sketch that
uses (1/ε) log(1/δ) space to approximate the data distribution
with error at most ε and probability of success at least
1-δ.
– All other previously known sample or sketch
methods use space at least (1/ε²).
– CM Sketch is an order of magnitude better.
• Practical: A few tens of KB give an accurate summary of
large data: Create summaries of data that allow historic
queries to find
– Heavy Hitters (Most Frequent Items)
– Quantiles of a Distribution (Median, Percentiles etc.)
– Finding items with large changes
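A compact sketch of a Count-Min style summary, assuming the usual construction with width ceil(e/ε) and depth ceil(ln(1/δ)). The hash family, class name, and toy stream are illustrative assumptions and not the project's implementation. Estimates never fall below the true count; with probability at least 1-δ they exceed it by at most ε times the total stream count.

```python
import math
import random


class CountMinSketch:
    def __init__(self, eps, delta, seed=0):
        self.width = math.ceil(math.e / eps)         # columns per row
        self.depth = math.ceil(math.log(1 / delta))  # number of rows
        self.prime = (1 << 61) - 1                   # modulus for hashing
        rng = random.Random(seed)
        # One (a, b) pair per row: h(x) = ((a*x + b) mod prime) mod width.
        self.hashes = [(rng.randrange(1, self.prime), rng.randrange(self.prime))
                       for _ in range(self.depth)]
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _columns(self, item):
        x = hash(item) % self.prime
        return [((a * x + b) % self.prime) % self.width
                for a, b in self.hashes]

    def add(self, item, count=1):
        for row, col in enumerate(self._columns(item)):
            self.table[row][col] += count

    def estimate(self, item):
        # Minimum over rows: collisions only ever inflate a counter.
        return min(self.table[row][col]
                   for row, col in enumerate(self._columns(item)))


sketch = CountMinSketch(eps=0.01, delta=0.01)
stream = ["alpha"] * 100 + ["beta"] * 50 + ["gamma"] * 10
for item in stream:
    sketch.add(item)
```

Heavy hitters can then be found by querying `estimate` for candidate items and keeping those above a chosen fraction of the stream length.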
Large-scale Automated Author
Identification
Statistical Analysis of Text
•Statistical text
analysis has a long
history in literary
analysis and in
solving disputed
authorship problems
•First (?) is Thomas
C. Mendenhall in
1887
•Hamilton versus Madison: the Federalist Papers
•Mosteller and Wallace (1963) used Naïve Bayes
with a Poisson and Negative Binomial model
•Good predictive performance
Some Background
• Identification technologies important
for homeland security and in the legal
system
• Author attribution for textual artifacts
using “topic independent” stylometric
features has a long history
• Historical focus on small numbers of
authors and low-dimensional
representations via function words
Author ID Project Objectives
• Application of state-of-the-art
statistical and computing
technologies to authorship
attribution
• Work with very high-dimensional document
representations
• Focus on providing working
solutions to particular
problems
Author ID Project Focus
Goal: Identification of Authors From Large
Collection of Objects
•traditional disputed authorship (choose among k known
authors)
•clustering of “putative” authors (e.g., internet handles:
termin8r, heyr, KaMaKaZie)
•document pair analysis: Were two documents written by
the same author?
•odd-man-out: Were these documents written by one of
this set of authors or by someone else?
Representation
•Long tradition in stylometry that seeks a small
number of textual characteristics that distinguish the
texts of authors from one another (Burrows, Holmes,
Binongo, Hoover, Mosteller & Wallace, McMenamin,
Tweedie, etc.)
•Typically use “function words” (a, with, as, were, all,
would, etc.) followed by PCA & cluster analysis
•Function words aim to be “topic-independent”
•Hoover (2003) shows that using all high-frequency
words does a better job than function words alone
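A minimal sketch of a function-word representation, assuming a simple regular-expression tokenizer and rates per 1000 tokens. The word list is the short sample given above, not a full stylometric vocabulary; the function name and sample sentence are illustrative.

```python
import re
from collections import Counter

# Sample function words listed on the slide (a real list is much longer).
FUNCTION_WORDS = ("a", "with", "as", "were", "all", "would")


def function_word_profile(text, vocab=FUNCTION_WORDS):
    """Rate of each function word per 1000 tokens, in vocab order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    n = len(tokens) or 1  # avoid division by zero on empty text
    return [1000.0 * counts[w] / n for w in vocab]


profile = function_word_profile("All were as happy as a king would be")
```

Vectors like this one, computed per document, are what would then be passed to PCA or cluster analysis; extending `vocab` to all high-frequency words gives the richer representation mentioned in the last bullet.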
Idiosyncratic Usage
•Idiosyncratic usage less formalized in the literature (misspellings,
repeated neologisms, etc.) but apparently useful. For example,
Foster’s unmasking of Klein as the author of “Primary Colors”:
“Klein and Anonymous loved unusual adjectives ending in -y
and –inous: cartoony, chunky, crackly, dorky, snarly,…,
slimetudinous, vertiginous, …”
“Both Klein and Anonymous added letters to their interjections:
ahh, aww, naww.”
“Both Klein and Anonymous loved to coin words beginning in
hyper-, mega-, post-, quasi-, and semi-, more than all others put
together”
“Klein and Anonymous use “riffle” to mean rifle or rustle, a
usage for which the OED provides no instance in the past
thousand years”
Odd-Man Out
Were these documents written by one of this set of
authors or by someone else?
•Training data contains documents by given set of authors
•Test data contains documents by some set of authors
including some not in original set
•Bayesian hierarchical model incorporates prior
knowledge that model parameters for different authors
differ from each other
•Initial success on small-scale simulated examples
•Generalizations for more than one new author
Some Results
• Created largest-ever (?) feature set including
function words, suffixes, POS tags, lengths,
spelling errors, common English errors,
grammatical errors, phrases, idiosyncratic usage,
ngrams, etc.
• Extensive experiments for 1-of-K and “odd-man-out”
• New 1.2 million message Listserv corpus,
82,000 authors
Some Results - II
• Developed general purpose feature
extraction software for author attribution
• Bayesian Multinomial Regression Software
extends our highly scalable, sparse, BBR
software (MMS Project) to the multi-class
case
“Special Focus” on
Computational and
Mathematical Epidemiology
smallpox
Components of a Special Focus
•Working Groups
•Tutorials
•Workshops
•Visitor Programs
•Graduate Student Programs
•Postdoc Programs
•Dissemination
A Sampling of Working Groups
WG’s on Large Data Sets:
•Adverse Event/Disease Reporting, Surveillance &
Analysis
•Data Mining and Epidemiology
WG’s on Analogies between Computers and
Humans:
•Analogies between Computer Viruses/Immune
Systems and Human Viruses/Immune Systems
•Distributed Computing, Social Networks, and
Disease Spread Processes
WG’s on Methods/Tools of Theoretical CS
•Phylogenetic Trees and Rapidly Evolving
Diseases
•Order-Theoretic Aspects of Epidemiology
WG’s on Computational Methods for Analyzing
Large Models for Spread/Control of Disease
•Spatio-temporal and Network Modeling of
Diseases
•Methodologies for Comparing Vaccination
Strategies
WG’s on Mathematical Sciences Methodologies
•Mathematical Models and Defense Against
Bioterrorism
•Predictive Methodologies for Infectious Diseases
•Statistical, Mathematical, and Modeling Issues in
the Analysis of Marine Diseases
A Sampling of Workshops
Workshops on Modeling of Infectious
Diseases
•The Pathogenesis of Infectious Diseases
•Models/Methodological Problems of Botanical
Epidemiology
WS on Modeling of Non-Infectious Diseases
•Disease Clusters
Workshops on Evolution and Epidemiology
•Genetics and Evolution of Pathogens
•The Epidemiology and Evolution of Influenza
•The Evolution and Control of Drug Resistance
•Models of Co-Evolution of Hosts and Pathogens
Workshops on Methodological Issues
•Capture-recapture Models in Epidemiology
•Spatial Epidemiology and Geographic Information
Systems
• Ecologic Inference
•Combinatorial Group Testing
The DIMACS Working Group on
Adverse Event/Disease Reporting,
Surveillance, and Analysis
Working Group on Adverse Event/Disease
Reporting, Surveillance, and Analysis
•Health surveillance is a core activity in public
health
•Concerns about bioterrorism have attracted
attention to new surveillance methods:
–OTC drug sales
–Subway worker absenteeism
–Ambulance dispatches
•Spawns need for novel statistical methods for
surveillance of multiple data streams.
•WG coordinated closely with National
Syndromic Surveillance Conferences
104
New Data Types for Public Health
Surveillance
• Managed care patient encounter data
• Pre-diagnostic/chief complaint (text data)
• Over-the-counter sales transactions
– Drug store
– Grocery store
• 911-emergency calls
• Ambulance dispatch data
• Absenteeism data
• ED discharge summaries
• Prescription/pharmaceuticals
• Adverse event reports
105
Farzad Mostashari
106
New Analytic Methods and Approaches
• Spatial-temporal scan statistics
• Statistical process control (SPC)
• Bayesian applications
• Market-basket association analysis
• Text mining
• Rule-based surveillance
• Change-point techniques
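As one illustration of the SPC and change-point ideas listed above, here is a minimal sketch of a one-sided CUSUM detector applied to a daily count stream. The data, the function name `cusum_alarms`, and the parameters `baseline_days`, `k`, and `h` are all assumptions made for this example; they are not taken from the working group's methods.

```python
from statistics import mean, stdev

def cusum_alarms(counts, baseline_days=7, k=0.5, h=4.0):
    """Return the indices of days on which a one-sided CUSUM exceeds h.

    counts: daily counts (e.g. OTC drug sales); the first `baseline_days`
    values are assumed to be outbreak-free and define "normal".
    k: slack, in standard deviations, tolerated before the sum grows.
    h: alarm threshold, in standard deviations.
    """
    baseline = counts[:baseline_days]
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has no variation; cannot standardize")
    s, alarms = 0.0, []
    for day in range(baseline_days, len(counts)):
        z = (counts[day] - mu) / sigma      # standardized excess over baseline
        s = max(0.0, s + z - k)             # accumulate only upward drift
        if s > h:
            alarms.append(day)
            s = 0.0                         # restart accumulation after an alarm
    return alarms

# Hypothetical daily sales: stable for nine days, then a sustained jump.
otc_sales = [20, 22, 19, 21, 20, 23, 21, 22, 20, 30, 34, 36, 35]
alarms = cusum_alarms(otc_sales)
```

With several data streams, one such detector per stream is the simplest starting point; combining streams raises the multiple-testing issues that motivate the newer methods.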
107
SubGroup on Privacy &
Confidentiality of Health Data
•Privacy concerns are a major
stumbling block to public health
surveillance, in particular
bioterrorism surveillance.
•Challenge: produce anonymous data
specific enough for research.
•Exploring ways to remove
identifiers (Social Security number,
telephone number, ZIP code)
from data sets.
•Exploring ways to aggregate or
remove information from data sets.
•Partnerships with cryptographers
•Exploring methods of combinatorial
optimization
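The removal and aggregation steps above can be sketched as follows. This is only an illustrative example under stated assumptions: the field names (`ssn`, `phone`, `zip`), the three-digit ZIP coarsening, and the minimum group size `k` are hypothetical choices, not the subgroup's actual procedure.

```python
from collections import Counter

# Hypothetical names for fields treated as direct identifiers.
DIRECT_IDENTIFIERS = ("ssn", "phone")

def deidentify(records, k=2):
    """Drop direct identifiers, coarsen ZIP codes, suppress rare groups.

    records: list of dicts, each with a string "zip" field.
    k: minimum number of records that must share a coarsened ZIP code
       for those records to be released.
    """
    coarse = []
    for rec in records:
        # Remove: copy every field except the direct identifiers.
        out = {key: val for key, val in rec.items()
               if key not in DIRECT_IDENTIFIERS}
        # Aggregate: keep only the first three ZIP digits.
        out["zip"] = rec["zip"][:3] + "**"
        coarse.append(out)
    # Suppress records whose coarsened ZIP is shared by fewer than k records.
    group_sizes = Counter(r["zip"] for r in coarse)
    return [r for r in coarse if group_sizes[r["zip"]] >= k]

patients = [
    {"ssn": "000-00-0001", "phone": "555-0101", "zip": "08854", "dx": "rash"},
    {"ssn": "000-00-0002", "phone": "555-0102", "zip": "08855", "dx": "fever"},
    {"ssn": "000-00-0003", "phone": "555-0103", "zip": "10001", "dx": "fever"},
]
released = deidentify(patients, k=2)
```

The tension named on the slide shows up directly: a larger `k` or a coarser ZIP code gives more anonymity but less geographic detail for research.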
108
Bioterrorism Working Group
anthrax
110
Bioterrorism Working Group
•Biosurveillance
•Evolution
•Modeling Bioterror Response Logistics
•Computer Science Challenges
•Agroterrorism
111
Modeling Bioterror Response Logistics
Exploring Discrete Optimization/Queueing
•size of stockpiles of vaccines
•allocation of medications
•analysis of bottlenecks in treatment facilities
•transportation schedules
1947 smallpox vaccination queue, NYC
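To make the queueing side concrete, here is a minimal sketch that treats a vaccination clinic as an M/M/c queue and computes the mean wait using the standard Erlang C formula. The M/M/c assumption (random arrivals, exponential service times, identical stations) and all the numbers are illustrative only; the source does not say which models the group uses.

```python
from math import factorial

def erlang_c_wait(arrival_rate, service_rate, servers):
    """Mean time spent waiting in line for an M/M/c queue.

    arrival_rate: people arriving per unit time.
    service_rate: people one station can vaccinate per unit time.
    servers: number of parallel vaccination stations.
    The result is in the same time unit as the rates.
    """
    load = arrival_rate / service_rate       # offered load in Erlangs
    rho = load / servers                     # utilization per station
    if rho >= 1:
        raise ValueError("unstable queue: arrivals exceed total capacity")
    below = sum(load**n / factorial(n) for n in range(servers))
    tail = load**servers / (factorial(servers) * (1 - rho))
    p_wait = tail / (below + tail)           # Erlang C: probability of waiting
    return p_wait / (servers * service_rate - arrival_rate)

# Hypothetical clinic: 100 arrivals per hour, 12 people per hour per station.
wait_10 = erlang_c_wait(100, 12, servers=10)
wait_12 = erlang_c_wait(100, 12, servers=12)
```

Comparing `wait_10` with `wait_12` is the kind of bottleneck question the slide refers to: how much does each additional station shorten the line?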
112
Agroterrorism
•Subgroup just starting
•Interest in plant diseases
•Partnership with the National Plant
Diagnostic Network
•Emphasis on Data Mining and
Epidemiology
113
Working Group on Modeling Social
Responses to Bioterrorism
•Models of the spread of
infectious disease
commonly assume passive
bystanders and rational
actors who will comply
with health authorities.
•It is not clear how well this
assumption applies to
situations like a bioterrorist
attack using smallpox or
plague.
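A small sketch can show why the compliance assumption matters. The discrete-time SIR model below scales the transmission rate by an assumed `compliance` fraction; the model form, the parameter values, and the way compliance enters are all simplifying assumptions for illustration, not a model from the working group.

```python
def simulate_sir(beta, gamma, compliance, days, n=1_000_000, i0=10):
    """Discrete-time SIR model with a compliance-adjusted contact rate.

    beta: transmission rate per day with no public response.
    gamma: recovery rate per day.
    compliance: fraction (0 to 1) of risky contacts the public avoids
                because it follows health-authority guidance (assumed).
    """
    s, i, r = float(n - i0), float(i0), 0.0
    effective_beta = beta * (1.0 - compliance)
    peak = i
    for _ in range(days):
        new_infections = min(s, effective_beta * s * i / n)
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        peak = max(peak, i)
    return {"peak_infected": peak, "total_infected": n - s}

# Same disease, two assumed levels of public compliance.
high = simulate_sir(beta=0.5, gamma=0.2, compliance=0.9, days=200)
low = simulate_sir(beta=0.5, gamma=0.2, compliance=0.5, days=200)
```

In this toy setting the outbreak dies out under high compliance and grows under low compliance, so a model that simply assumes compliance can badly misjudge an attack scenario.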
115
Working Group on Modeling Social
Responses to Bioterrorism
An interdisciplinary group is discussing how to
incorporate social behavior into models, how to
model public health decision-making, and risk
communication.
Some Issues
•Movement
•Compliance
•Rumor
•Subcultural differences
•Indirect economic effects
•Social stigmata
•Panic
How do you
measure the indirect
cost of an attack?
116
Predicting Disease Outbreaks from
Remote Sensing and Media Data
Outbreaks of disease in other parts of the world
have the capacity to affect the security of the US.
Joint project with the Imaging
Science and Information
Systems Center at Georgetown
University Medical School
(ISIS Center)
118
Predicting Disease Outbreaks from
Remote Sensing and Media Data
•Recent work has shown that it’s possible to
predict disease outbreaks in distant parts of the
world using remotely sensed satellite data.
•SARS and heightened avian flu in the Pacific
Rim appeared following temperature anomalies in
China.
•Could we have anticipated this
given enviro-climatic information?
119
Predicting Disease Outbreaks from
Remote Sensing and Media Data
•Rift Valley Fever epidemic in 1997/8 in East
Africa occurred following heavy flooding related
to El Niño.
•Flooding in Venezuela in 1995 resulted in a
multi-pathogen outbreak.
120
Predicting Disease Outbreaks from
Remote Sensing and Media Data
•Indications and warnings
can alert US responders to
bioevents in faraway
places.
•Disease that can cause
social disruption can be
detected in open-source
media reports even when
there is no official
reporting.
121
Predicting Disease Outbreaks from
Remote Sensing and Media Data
•A model developed at the ISIS Center at
Georgetown predicts social disruptions due to
disease based on keyword “hit counts” from
text-based sources (media reports).
•DIMACS Project goal: Use the media model to
develop ways to predict disease-related social
disruptions from remotely sensed enviro-climatic data.
•We will be using remote sensing data indicating
increased Normalized Difference Vegetation
Index (NDVI).
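For reference, NDVI is computed from near-infrared (NIR) and red reflectance as (NIR - Red) / (NIR + Red). The sketch below computes it and a simple anomaly against a historical mean; the reflectance values and the `ndvi_anomaly` helper are hypothetical, and the project's actual definition of "increased" NDVI is not given in the source.

```python
from statistics import mean

def ndvi(nir, red):
    """NDVI = (NIR - Red) / (NIR + Red).

    For non-negative reflectances the result lies in [-1, 1]; higher
    values indicate denser green vegetation.
    """
    total = nir + red
    if total == 0:
        return 0.0  # no signal in either band; treat as neutral
    return (nir - red) / total

def ndvi_anomaly(current, history):
    """Difference between the current NDVI and its historical mean."""
    return current - mean(history)

# Hypothetical reflectances for one location in the same month of each year.
current = ndvi(nir=0.50, red=0.10)
past = [ndvi(0.40, 0.20), ndvi(0.42, 0.18), ndvi(0.38, 0.22)]
anomaly = ndvi_anomaly(current, past)
```

A positive `anomaly` means the location is greener than usual for the season, for example after unusually heavy rainfall.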
122
Predicting Disease Outbreaks from
Remote Sensing and Media Data
•Project Premise: Enviro-climatic indices such as
NDVI can be paired with disease-related social
disruption predictors from media data, lagged by
several months, to validate the enviro-climatic
indicators as predictors.
•Approach: Machine Learning
•Project waiting to get started
123
Predicting Disease Outbreaks from
Remote Sensing and Media Data
•The approach is similar to
ones used by members of
the DIMACS team to
estimate probability of a
match between remotely
sensed signals and a
signature that has been
observed before. This work
has been applied to face
recognition and explosive
detection.
124
“Special Focus” on Communication
Security and Information Privacy
126
“Special Focus” on Communication
Security and Information Privacy
Working Groups
•Privacy-Preserving Data Mining
•Usable Privacy and Security Software
•Data De-Identification, Combinatorial
Optimization, Graph Theory, and the Stat-OR
Interface
•Intrusion Detection and Network Security
Management Systems
127
“Special Focus” on Communication
Security and Information Privacy
A Selection of Workshops
•Software Security
•Applied Cryptography and Network Security
•Large-scale Internet Attacks
•Mobile and Wireless Security
•Security of Web Services and E-Commerce
•Database Security: Query Authorization and
Information Inference
128
Working Group on Analogies between
Computer Viruses and Biological
Viruses
•Can ideas for defending against biological viruses lead to
ideas for defending against computer viruses?
•Concern about large gap between initial time of attack
and implementation of defensive strategies
•“Public health” approach: Once a virus has infected a
machine, it tries to connect it to as many computers as
possible, as fast as possible. A “throttle” limits the rate at
which a computer can connect to new computers.
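One simple way to picture such a throttle is sketched below: connections to recently contacted hosts pass immediately, while connections to new hosts are queued and released at most one per time step. The class name, the working-set size, and the one-per-tick release rate are assumptions for illustration; the source does not specify the mechanism.

```python
from collections import deque

class ConnectionThrottle:
    """Rate-limit connections to hosts not contacted recently."""

    def __init__(self, working_set_size=4):
        # Recently contacted hosts; the oldest entry is evicted when full.
        self.recent = deque(maxlen=working_set_size)
        # New hosts waiting for permission to connect.
        self.pending = deque()

    def request(self, host):
        """Return True if the connection may proceed immediately."""
        if host in self.recent:
            return True
        if host not in self.pending:
            self.pending.append(host)
        return False

    def tick(self):
        """Call once per time step: release at most one queued host."""
        if not self.pending:
            return None
        host = self.pending.popleft()
        self.recent.append(host)
        return host

# A worm-like burst: ten new hosts requested within one time step.
throttle = ConnectionThrottle()
results = [throttle.request(f"h{n}") for n in range(10)]
released = throttle.tick()
```

Normal traffic mostly revisits a few hosts and is barely delayed, while a fast-spreading virus fills the queue. A long queue is itself a useful sign of infection.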
129