Homeland Security: What Can Mathematics Do?

Fred Roberts
Professor of Mathematics, Rutgers University
Chair, RU Homeland Security Research Initiative
Director, DIMACS Center
Mathematical methods have become
important tools in preparing plans for defense
against terrorist attacks, especially when
combined with powerful, modern computer
methods for analysis and simulation.
Are you Serious?? What Can
Mathematics Do For Us?
After Pearl Harbor: Mathematics and
mathematicians played a vitally important role
in the US World War II effort.
Critical War-Effort Contributions Included:
• Code breaking (Enigma machine).
• Creation of the mathematics-based field of
Operations Research:
– logistics
– optimal scheduling
– inventory
– strategic planning
But: Terrorism is Different.
Can Mathematics Really Help?
5+2=?
1, 2, 3, …
I’ll Illustrate with Mathematics
Projects I’m Involved in.
There are Many Others
• Bioterrorism Sensor Location
• Monitoring Message Streams
• Identification of Authors
• Detecting a Bioterrorist Attack through
“Syndromic Surveillance”
OUTLINE
• Bioterrorism Sensor Location
• Monitoring Message Streams
• Identification of Authors
• Detecting a Bioterrorist Attack through
“Syndromic Surveillance”
The Bioterrorism Sensor
Location Problem
• Early warning is
critical in defense
against terrorism
• This is a crucial
factor underlying the
government’s plans
to place networks of
sensors/detectors to
warn of a bioterrorist
attack
The BASIS System – Salt Lake City
Locating Sensors is not Easy
• Sensors are expensive
• How do we select them and where do we
place them to maximize “coverage,”
expedite an alarm, and keep the cost
down?
• Approaches that improve upon existing, ad
hoc location methods could save
countless lives in the case of an attack
and also money in capital and operational
costs.
Two Fundamental Problems
• Sensor Location
Problem
– Choose an
appropriate mix of
sensors
– decide where to
locate them for
best protection and
early warning
Two Fundamental Problems
• Pattern Interpretation
Problem: When sensors
set off an alarm, help public
health decision makers
decide
– Has an attack taken
place?
– What additional
monitoring is needed?
– What was its extent and
location?
– What is an appropriate
response?
The Sensor Location Problem
• Approach is to develop new algorithmic methods.
• Developing new algorithms involves fundamental
mathematical analysis.
• Analyzing how efficient algorithms are involves
fundamental mathematical methods.
• Implementing the algorithms on a computer is often
a separate problem – which needs to go hand in
hand with the basic mathematics of algorithm
development.
Algorithmic Approaches I :
Greedy Algorithms
Greedy Algorithms
• Find the most important location first and locate
a sensor there.
• Find second-most important location.
• Etc.
• Builds on earlier mathematical work at Institute
for Defense Analyses (Grotte, Platt)
• “Steepest ascent approach.”
• No guarantee of “optimal” or best solution.
• In practice, gets pretty close to optimal solution.
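A minimal sketch of the greedy idea, under a simplifying assumption that is not from the slides: each candidate site covers a known set of locations, and the "most important" site is the one that covers the most locations not yet covered. The site names and coverage sets are made up for illustration, and `greedy_placement` is a hypothetical helper, not the IDA method.

```python
def greedy_placement(coverage, num_sensors):
    """Pick sensor sites one at a time, each time taking the site that
    adds the most newly covered locations (assumed coverage model)."""
    chosen = []
    covered = set()
    for _ in range(num_sensors):
        best_site = None
        best_gain = 0
        # Sorted iteration makes tie-breaking deterministic.
        for site in sorted(coverage):
            if site in chosen:
                continue
            gain = len(coverage[site] - covered)
            if gain > best_gain:
                best_site, best_gain = site, gain
        if best_site is None:  # no remaining site adds coverage
            break
        chosen.append(best_site)
        covered |= coverage[best_site]
    return chosen, covered


# Made-up candidate sites and the locations each one would cover.
example_coverage = {
    "site_A": {1, 2, 3, 4},
    "site_B": {3, 4, 5},
    "site_C": {5, 6},
    "site_D": {1, 6},
}
sites, covered = greedy_placement(example_coverage, 2)
print(sites, sorted(covered))
```

Each step is locally best given the earlier picks, which is why the method carries no guarantee of a globally optimal placement.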
Algorithmic Approaches II :
Variants of Classic Facility
Location Theory Methods
Location Theory
• Old problem in Operations Research: Where
to locate facilities (fire houses, garbage
dumps, etc.) to best serve “users”
• Often deal with a network with nodes,
edges, and distances along edges
• Users u1, u2, …, un are located at nodes
• One approach: locate the facility at node x
chosen so that the sum of distances to users is
minimized.
• Minimize: $\sum_{i=1}^{n} d(x, u_i)$
Location Theory: A Network
[Figure: a network on six nodes a, b, c, d, e, f, with each edge labeled 1. Nodes are places for users or facilities; the 1’s represent distances along edges.]
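[Figure: the same network, with users u1, u2, u3 placed at nodes.]
x=a: $\sum_i d(x,u_i) = 1+1+2 = 4$
x=b: $\sum_i d(x,u_i) = 2+0+1 = 3$
x=c: $\sum_i d(x,u_i) = 3+1+0 = 4$
x=d: $\sum_i d(x,u_i) = 2+2+1 = 5$
x=e: $\sum_i d(x,u_i) = 1+3+2 = 6$
x=f: $\sum_i d(x,u_i) = 0+2+3 = 5$
x=b is optimal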
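The sums above can be checked with a short Python sketch. Two things here are inferred rather than stated on the slide: the six nodes are assumed to form a ring a-b-c-d-e-f-a with unit edges, and the users are assumed to sit at f, b and c (the nodes where the first, second and third terms of the sums are 0). Breadth-first search is enough because every edge has length 1.

```python
from collections import deque

# Assumed network: a ring of six nodes, every edge of length 1.
EDGES = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f"), ("f", "a")]
# Assumed user positions, inferred from the zero terms in the sums.
USERS = {"u1": "f", "u2": "b", "u3": "c"}


def build_adjacency(edges):
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    return adjacency


def distances_from(source, adjacency):
    """Breadth-first search; valid because all edges have length 1."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


adjacency = build_adjacency(EDGES)
totals = {}
for x in sorted(adjacency):
    dist = distances_from(x, adjacency)
    totals[x] = sum(dist[node] for node in USERS.values())

best = min(totals, key=totals.get)
print(totals, best)
```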
Variants of Classic Facility
Location Theory Methods:
Complications
• We don’t have a network with nodes and edges;
we have points in a city
• Sensors can only be at certain locations (size,
weight, power source, hiding place)
• We need to place more than one sensor
• Instead of “users,” we have places where potential
attacks take place.
• Potential attacks take place with certain
probabilities.
• Wind, buildings, mountains, etc. add
complications.
Variants of Classic Facility
Location Theory Methods:
Complications
• These more complex problems are hard!
• The best-known algorithms for solving
these “higher-dimensional” variants of the
classic location problem are due to Rafail
Ostrovsky -- a partner on our project.
• The mathematics-based approximation
methods due to Ostrovsky and his
colleagues are promising.
Algorithmic Approaches III :
Variants of Air Pollution
Monitoring Models
Variants of Air Pollution
Monitoring Models
• Long history of using
mathematical models
to locate air pollution
monitors.
• Use fluid dynamics
• Use plume models.
• Large computer
simulations needed.
• Long used in nuclear
weapons defense.
Variants of Air Pollution
Monitoring Models
• Mathematical challenge: Modify air pollution
monitor placement modeling tools for
complex biological agents.
• E.g.: Complications arise when applying the
models to cities: Buildings make it hard!
The Pattern Interpretation
Problem
The Pattern Interpretation
Problem (PIP)
• It will be up to the
Decision Maker to
decide how to
respond to an
alarm from the
sensor network.
Approaching the PIP: Minimizing
False Alarms
One approach: Redundancy.
• Could require two or more
sensors to make a detection
before an alarm is
considered confirmed
• Could require same sensor
to register two alarms:
Portal Shield requires two
positives for the same agent
during a specific time
period.
Approaching the PIP: Minimizing
False Alarms
• Could place two or more sensors at or
near the same location. Require two
proximate sensors to give off an alarm
before we consider it confirmed.
Redundancy has drawbacks: cost, delay in
confirming an alarm.
We need mathematical methods to
analyze the tradeoff between lowered
false alarm rate and extra cost/delay
Approaching the PIP: Using
Decision Rules
• Existing sensors
come with a
sensitivity level
specified and
sound an alarm
when the number
of particles
collected is
sufficiently high –
above threshold.
Approaching the PIP: Using
Decision Rules
• Let f(x) = number of particles collected at
sensor x in the past 24 hours. Sound an
alarm if f(x) > T.
• Alternative decision rule: alarm if two sensors
reach 90% of threshold, three reach 75% of
threshold, etc.
Alarm if:
f(x) > T for some x,
or if f(x1) > .9T and f(x2) > .9T for some x1,x2,
or if f(x1) > .75T and f(x2) > .75T and f(x3) > .75T
for some x1,x2,x3.
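A minimal Python sketch of this tiered decision rule. It covers only the three tiers written out above (the "etc." is not filled in), and it assumes x1, x2, x3 are distinct sensors. The function name `should_alarm` and the sample counts are illustrative.

```python
# (fraction of threshold T, number of distinct sensors that must exceed it)
TIERS = [(1.0, 1), (0.9, 2), (0.75, 3)]


def should_alarm(counts, threshold, tiers=TIERS):
    """counts holds f(x) for each sensor x over the past 24 hours."""
    for fraction, sensors_needed in tiers:
        exceeding = sum(1 for count in counts if count > fraction * threshold)
        if exceeding >= sensors_needed:
            return True
    return False


T = 100
print(should_alarm([101, 0, 0], T))    # one sensor above T
print(should_alarm([95, 92, 10], T))   # two sensors above .9T
print(should_alarm([95, 80, 10], T))   # only one above .9T, only two above .75T
print(should_alarm([80, 78, 76], T))   # three sensors above .75T
```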
Approaching the PIP: Using
Decision Rules
• Prior work along
these lines in
missile detection
(Cherikh and
Kantor)
Bioterrorism Sensor Location:
Partner Agencies/Institutions
• Defense Threat Reduction Agency
• MITRE Corporation
• Los Alamos National Laboratory
• Institute for Defense Analyses
• New York City Dept. of Health
Monitoring Message Streams:
Algorithmic Methods for
Automatic Processing of Messages
Objective:
Monitor huge communication streams, in
particular, streams of textualized communication, to
automatically detect pattern changes and
"significant" events
Motivation: monitoring
email traffic, news,
communiques, faxes
Technical Approaches:
• Given stream of text in any language.
• Decide whether "events" are present in the
flow of messages.
• Event: new topic or topic with unusual level of
activity.
• Suppose events have been classified into
classes or groups: group 1, group 2, …
• A new message comes in. Does it fit into
group 1? Into group 2? Or does it (and related
messages) define a new group of interest?
One Approach: “Bag of Words”
• List all the words of interest
that may arise in the
messages being studied:
w1, w2,…,wn
• Bag of words vector b has
k as the ith entry if word wi
appears k times in the
message.
• Sometimes, use “bag of
bits”: Vector of 0’s and 1’s;
count 1 if word wi appears in
the message, 0 otherwise.
“Bag of Words” Example
Words:
w1 = bomb, w2 = attack, w3 = strike
w4 = train, w5 = plane, w6 = subway
w7 = New York, w8 = Los Angeles, w9 =
Madrid, w10 = Tokyo, w11 = London
w12 = January, w13 = March
“Bag of Words”
Message 1:
Strike Madrid trains on March 1.
Strike Tokyo subway on March 2.
Strike New York trains on March 11.
Bag of words b1 = (0,0,3,2,0,1,1,0,1,1,0,0,3)
w1 = bomb, w2 = attack, w3 = strike
w4 = train, w5 = plane, w6 = subway
w7 = New York, w8 = Los Angeles, w9 =
Madrid, w10 = Tokyo, w11 = London
w12 = January, w13 = March
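A short Python sketch of how such a vector could be computed. The matching rule is an assumption made for this example: matching is case-insensitive and allows an optional trailing "s", so that "trains" counts toward "train". The slides do not say how word forms are matched.

```python
import re

# w1 ... w13 from the example, in order.
VOCABULARY = [
    "bomb", "attack", "strike", "train", "plane", "subway",
    "New York", "Los Angeles", "Madrid", "Tokyo", "London",
    "January", "March",
]


def bag_of_words(message, vocabulary=VOCABULARY):
    """Return the count of each vocabulary term in the message."""
    text = message.lower()
    vector = []
    for term in vocabulary:
        # Whole-word match with an optional plural "s" (assumed rule).
        pattern = r"\b" + re.escape(term.lower()) + r"s?\b"
        vector.append(len(re.findall(pattern, text)))
    return tuple(vector)


message_1 = (
    "Strike Madrid trains on March 1. "
    "Strike Tokyo subway on March 2. "
    "Strike New York trains on March 11."
)
b1 = bag_of_words(message_1)
print(b1)
```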
The Approach: “Bag of Words”
• Key idea: how close are two such vectors?
• Suppose known messages have been
classified into different groups: group 1,
group 2, …
• A message comes in. Which group should
we put it in? Or is it “new”?
• You look at the bag of words vector
associated with the incoming message
and see if it “fits” closely to typical vectors
associated with a given group.
The Approach: “Bag of Words”
• Your performance can improve over time.
• You “learn” how to classify better.
• Typically you do this “automatically” and
try to develop mathematical methods that
will allow a machine to “learn” from past
data.
“Bag of Words”
Message 2:
Bomb Madrid trains on March 1.
Attack Tokyo subway on March 2.
Strike New York trains on March 11.
Bag of words b2 = (1,1,1,2,0,1,1,0,1,1,0,0,3)
w1 = bomb, w2 = attack, w3 = strike
w4 = train, w5 = plane, w6 = subway
w7 = New York, w8 = Los Angeles, w9 =
Madrid, w10 = Tokyo, w11 = London
w12 = January, w13 = March
“Bag of Words”
Note that b1 and b2 are “close”
b1 = (0,0,3,2,0,1,1,0,1,1,0,0,3)
b2 = (1,1,1,2,0,1,1,0,1,1,0,0,3)
Close could be measured using distance d(b1,b2)
= number of places where b1,b2 differ
(“Hamming distance” between vectors).
Here: d(b1,b2) = 3
The messages are “similar” – could belong to the
same group or class of messages.
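The distance computation can be written directly in Python; the two vectors are copied from the example above, and `hamming_distance` is just an illustrative name.

```python
def hamming_distance(u, v):
    """Number of positions at which two equal-length vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(1 for a, b in zip(u, v) if a != b)


b1 = (0, 0, 3, 2, 0, 1, 1, 0, 1, 1, 0, 0, 3)
b2 = (1, 1, 1, 2, 0, 1, 1, 0, 1, 1, 0, 0, 3)
print(hamming_distance(b1, b2))
```

The vectors differ only in the entries for bomb, attack and strike. Hamming distance counts positions that differ and ignores by how much they differ.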
“Bag of Words”
Message 3:
Go on strike against Madrid trains on March 1.
Go on strike against Tokyo subway on March 2.
Go on strike against New York trains on March 11.
Bag of words b3 = same as b1.
BUT: message 3 is quite different from message 1.
Shows complexity of problem. Maybe missing some
key words like “go” or maybe we should use pairs
of words like “on strike” (“bigrams”)
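A small Python sketch of the bigram idea: counting adjacent word pairs separates message 3 from message 1 even though their single-word counts agree. The tokenization (lowercase, letters only, pairs allowed to span sentence boundaries) is a simplification assumed for this example.

```python
import re
from collections import Counter


def bigram_counts(message):
    """Count adjacent word pairs, using a crude letters-only tokenizer."""
    words = re.findall(r"[a-z]+", message.lower())
    return Counter(zip(words, words[1:]))


message_1 = (
    "Strike Madrid trains on March 1. "
    "Strike Tokyo subway on March 2. "
    "Strike New York trains on March 11."
)
message_3 = (
    "Go on strike against Madrid trains on March 1. "
    "Go on strike against Tokyo subway on March 2. "
    "Go on strike against New York trains on March 11."
)

on_strike_1 = bigram_counts(message_1)[("on", "strike")]
on_strike_3 = bigram_counts(message_3)[("on", "strike")]
print(on_strike_1, on_strike_3)
```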
One Approach: k-Nearest Neighbor
(kNN) Classifiers
• How kNN Classifiers Work:
– Find k most similar “training” messages
(neighbors)
– Assign a message to those groups that
are most common among neighbors
(using weighting by distance)
• kNN classifiers had been considered
inefficient since finding neighbors is slow
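A minimal kNN sketch in Python, under stated assumptions: similarity is measured with Hamming distance (as in the bag-of-words slides), each neighbor votes with weight 1/(1 + distance), and only the single top-scoring group is returned, whereas the description above allows a message to go to several groups. The training vectors and group labels are made up. This is the plain brute-force version, the slow neighbor search that the last bullet refers to.

```python
from collections import defaultdict


def hamming(u, v):
    return sum(1 for a, b in zip(u, v) if a != b)


def knn_classify(vector, training, k=3):
    """training is a list of (vector, group) pairs.

    Brute force: every training vector is compared with the query.
    """
    neighbors = sorted(training, key=lambda item: hamming(vector, item[0]))[:k]
    votes = defaultdict(float)
    for neighbor_vector, group in neighbors:
        # Closer neighbors count more (assumed weighting scheme).
        votes[group] += 1.0 / (1.0 + hamming(vector, neighbor_vector))
    return max(votes, key=votes.get)


# Made-up training data: four-word bag-of-words vectors with group labels.
training = [
    ((3, 2, 0, 0), "group 1"),
    ((2, 2, 1, 0), "group 1"),
    ((0, 0, 3, 2), "group 2"),
    ((0, 1, 2, 2), "group 2"),
    ((0, 0, 2, 3), "group 2"),
]
label = knn_classify((3, 1, 0, 0), training, k=3)
print(label)
```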
Speeding up kNN
• Can finding neighbors be made fast enough to
make kNN practical?
• Mathematics can help.
• Store text and classes “sparsely”
• Use “inverted file” heuristics that group input by
word, not by “document” and compute similarities
using only the few words occurring in the
document
• Result: New methods are 10 to 100 times
faster with only a 2-10% loss in
“effectiveness” (according to some standard
measures)
• Software delivered to sponsors.
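A sketch of the sparse storage and inverted-file ideas in Python. It is not the delivered software: the dot-product similarity, the function names, and the tiny document collection are assumptions for illustration. Scoring only visits the posting lists of words that occur in the incoming message, so documents sharing no words with it are never touched.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """documents maps doc_id -> {word: count}, storing nonzero counts only."""
    index = defaultdict(list)
    for doc_id, counts in documents.items():
        for word, count in counts.items():
            index[word].append((doc_id, count))
    return index


def similarities(query_counts, index):
    """Dot-product similarity (assumed measure) via the inverted index."""
    scores = defaultdict(int)
    for word, query_count in query_counts.items():
        for doc_id, doc_count in index.get(word, []):
            scores[doc_id] += query_count * doc_count
    return dict(scores)


# Made-up sparse training messages.
documents = {
    "m1": {"strike": 3, "train": 2, "march": 3},
    "m2": {"bomb": 1, "train": 2},
    "m3": {"plane": 1, "london": 1},
}
index = build_inverted_index(documents)
scores = similarities({"train": 1, "bomb": 1}, index)
print(scores)
```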
Streaming Data
• We often have just one shot at the data as it
comes “streaming by” because there is so much
of it. This calls for powerful new algorithms.
Research Challenge: “Historic”
Data Analysis
• The accumulation of text messages is
massive over time
• We can only save summaries of the data.
• It is a great challenge to use only
summarized historic data and see if a
currently emerging phenomenon had
precursors occurring in the past – since
you don’t have the original data.
• We have had some success with a novel
architecture for historic and posterior
analyses via small summaries (“sketches”).
Related Project: Author
Identification
Develop and evaluate techniques for identifying
authors in large collections of textual artifacts (emails, communiques, transcribed speech, etc.).
Questions Addressed:
Which of a set of authors
wrote a particular
document/message?
Were two documents
written by the same
author?
Author Identification
• We are using methods developed in the
Monitoring Message Streams Project
• Building on classical work in Statistics: Who
wrote the Federalist Papers, Hamilton or
Madison?
• More complicated than conventional text
classification:
– Large number of possible authors
– Not much “training data”
– Authors write on multiple topics
– Authors write in different styles in different
“genres”
One Approach: In “Bag of
Words”: Use “Function Words”
a, about, above, according, accordingly, actual, actually, after, afterward, afterwards, again, against,
ago, ah, ain't, all, almost, along, already, also, although, always, am, among, an, and,
another, any, anybody, anyone, anything, anywhere, are, aren't, around, art, as, aside, at, away
Partner Agencies: Monitoring
Message Streams and Author
Identification Projects
• Research sponsored by ITIC: Intelligence
Technology Innovation Center
• Administratively under the CIA
• Through the interagency Knowledge Discovery and
Dissemination (KDD) program.
Bioterrorist Event Detection
Great concern about the deliberate
introduction of diseases such as smallpox by
bioterrorists has led to new challenges for
mathematical scientists.
smallpox
Bioterrorist Event Detection
• Mathematical models of infectious diseases
go back to Daniel Bernoulli’s mathematical
analysis of smallpox in 1760.
• However, modern data-gathering methods
bring with them new challenges for
mathematicians.
• Methods used in the Monitoring Message Streams
and Author ID projects carry over to using large
data sets to detect “bioterrorist events” or
“emerging diseases” (SARS) through
“syndromic surveillance.”
New Data Types for Public Health
Surveillance
• Managed care patient encounter data
• Pre-diagnostic/chief complaint (ED data)
• Over-the-counter sales transactions
– Drug store
– Grocery store
• 911-emergency calls
• Ambulance dispatch data
• Absenteeism data
• ED discharge summaries
• Prescription/pharmaceuticals
• Adverse event reports
Syndromic Surveillance: NYC Dept.
of Health Data
Approach:
• As with Monitoring Message Streams and
Author Identification, represent data by
using a vector.
• For example, use “bag of bits” (0 or 1 only
in each entry).
• If we use symptoms, then 1 or 0 represents
presence or absence of symptoms such
as coughing, fever over 102 degrees, achy
legs, disoriented, etc.
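A tiny Python sketch of a symptom “bag of bits.” The symptom list is taken from the examples above; the patient record and the helper name `bag_of_bits` are made up for illustration.

```python
SYMPTOMS = ["coughing", "fever over 102 degrees", "achy legs", "disoriented"]


def bag_of_bits(observed, symptoms=SYMPTOMS):
    """1 if the symptom is present in the record, 0 otherwise."""
    return tuple(1 if symptom in observed else 0 for symptom in symptoms)


# Hypothetical patient record: a set of observed symptoms.
record = bag_of_bits({"coughing", "disoriented"})
print(record)
```

Vectors like this can then be compared with the same distance measures used for message vectors.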
Many New Mathematical Methods
and Approaches under Development
• Spatial-temporal “scan statistics”
• Statistical process control (SPC)
• Bayesian applications
• “Market-basket” association analysis
• Text mining
• Rule-based surveillance
• Change-point techniques
Project a Collaboration between a
Math/CS Research Center and a
Government Agency
DIMACS: Center for
Discrete Mathematics
and Theoretical
Computer Science
CDC: Centers for
Disease Control
and Prevention
Would Mathematics help Protect
our Bridges and Tunnels?
George Washington Bridge
Lincoln Tunnel
Would Mathematics Help Protect
our Borders?
Would it help with a Deliberate
Outbreak of Anthrax?
Similar approaches, using mathematical
models, have proven useful in many other
fields, to:
• make policy
• plan operations
• analyze risk
• compare interventions
• identify the cause of observed events
Why not in homeland security?