Data Science Challenges for Homeland Security

Fred S. Roberts
CCICADA Center
Rutgers University
Piscataway, NJ USA
froberts@dimacs.rutgers.edu
Founded 2009 as a DHS University Center of Excellence
The Data Challenge
• Virtually all of the activities in the homeland security enterprise require the ability to reach conclusions from massive flows of data
– Arriving fast
– From disparate sources
– Subject to error/uncertainty
• Data Science:
– Incomparably positioned to address this massive flow of data
– Gain rapid situational awareness
– Assist in identifying new threats
– Make rapid risk assessments
– Mitigate the effects of natural or man-made disasters
– Allow the homeland security community to work more efficiently
– Protect the privacy of individuals
Data Science Application Areas
• Data Science (DS) methods are applicable to a wide variety of homeland security applications
• CCICADA researchers already heavily involved in these with homeland security practitioner partners, e.g.:
– Intelligence analysis of text
– Disease event detection
– Port of entry inspection
– Author identification
– Rapid summarization of crime data
– Bioterrorism sensor location (photo: B-T sensor, Salt Lake City Olympics)
– Nuclear detection using moving detectors
– Stadium evacuation
– Privacy-preserving data sharing
– Predicting paths of plumes
– Analyzing open source data
– Thwarting attacks on cyberinfrastructure by detecting abnormalities in network traffic
– Assessing risks in waterways
– Planning for evacuations during heat events from climate change
– Understanding economic impact of terrorist events and natural disasters
Outline
• Port Security: Risk Scoring from Ships’
Manifest Data
• Visualization of Manifest Data
• Making use of Uncertainty: Placing Nuclear
Detectors in Randomly Moving Vehicles
• Crime Information Management: Enhancing
Force Deployment in Law Enforcement and
Counter-terrorism: Work with Port Authority
of New York/New Jersey
• Biosurveillance: Early Warning from Entropy
• Privacy-preserving Data Analysis
• Data Analysis Challenges Arising from
Climate Change
Port Security: Risk Scoring from Ships’
Manifest Data
Xueying Chen, Jerry Cheng
and Minge Xie
Rutgers University
Project supported by
Domestic Nuclear
Detection Office
Risk Scoring: Manifest Data
• Manifest/bill of lading
• Data either text or numerical/categorical
• Increased emphasis by US Customs and Border Protection (CBP) on documents submitted prior to a shipping container reaching the US
• Data screened before ship's arrival in US
• Identifying mislabeled or anomalous shipments may prove useful in finding nuclear materials
Data Description
• We obtained from CBP one month's data of all cargo shipments to all US ports
• Jan 30, 2009 – Feb 28, 2009
• Description (with aggregation issues and inconsistencies):
– Foreign port (origin)
– Domestic port (destination)
– Item description
– Item count
Manifest Data: Sample
(Figure: raw manifest data sample.)
Clean/Readable Manifest Data
(Figure: cleaned manifest data sample.)
Data Description
• Data has errors and ambiguities
• Does 150 waters mean 150 bottles of water or
150 cases of bottles of water?
• What does “household goods” mean?
• Still, there are things we can do with the data.
Manifest Data
• Manifest descriptions of products such as…
– Soft drink concentrates
– Ten knockdown empty cartons
– Ikea home furnishing products
• …should match classifications of container types,
ship types, or port of departure types.
• Anomalies may be discoverable when product
descriptions are closely associated with container,
ship, or port classifications.
• E.g., a shipment of IKEA products may have more in
common with specific container, ship, or port than a
shipment containing airplane parts.
Some Stats of Manifests (1/30/09–2/28/09)
• # of manifests per day: 26,577
• # of ships per day: 307
Among the cargos coming to Long Beach, CA:
• # of manifests per day: 3,648
• # of ships per day: 54
• # of originating countries: 16
Mining of Manifest Data
• Goal: Predict risk score for each container
– Quantify the likelihood of need for inspection
– Based on covariates/characteristics of a
container’s manifest data.
• Methods:
– We are developing machine learning algorithms to
detect anomalies in manifest data.
– Text mining on verbiage fields leads to useful
characteristics.
– Then regression based on the useful characteristics
or “covariates”
– “Penalized regression” using LASSO and Bayesian
Binary Regression software developed by our group.
Mining of Manifest Data
• A complication is that we do not know the risk
status of the containers in our data.
• We got around this by simulation and assuming
risk scores of some containers in order to test
our methods.
• Our approach used “penalized regression” with
explanatory variables chosen from the
covariates contained in the manifest data, e.g.,
voyage number, foreign port, contents of
containers, etc.
• A first step was to do a correlation analysis to
group covariates.
Correlation Analysis of Variables
(Figure: correlation analysis of variables, preliminary to "risk scoring" of containers.)
Penalized Regression
Model: y = Xβ + ε, subject to ρ(β) ≤ λ
• y = vector of risk scores;
• X = matrix of characteristics/contents of shipping containers;
• ε = random error;
• β = vector of model parameters;
• ρ = penalty function; λ = tuning parameter.
• Learning the correlation between risk score and the covariates can help determine the importance of different covariates.
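The penalized-regression idea above can be sketched with a minimal coordinate-descent LASSO. This is a toy stand-in, not the group's Bayesian Binary Regression software, and the synthetic data, dimensions, and λ below are illustrative, not the CBP data:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent LASSO: minimize ||y - Xb||^2 / (2n) + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]        # partial residual with feature j removed
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]  # soft-threshold
    return b

# Synthetic manifest-like covariates: only the first two influence the risk score.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=500)
beta = lasso_cd(X, y, lam=0.1)
```

The L1 penalty drives the coefficients of uninfluential covariates exactly to zero, which is what makes the method useful for picking out the important manifest variables.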
Examples
• Example 1
risk score=β1×(voyage number)+β2×(inbound entry type) + ε
• Example 2
risk score=β1×(voyage number)+β2×(dp of
unlading)+β3×(foreign port)+β4×contents+ε
• More sophisticated analysis replaces individual
variables by groups of closely correlated variables.
Real Data Analysis
• Simulation setting
– Use the simulation to select potential covariates and simulate the risk scores for each shipment
– Run penalized regression over p = 216 variables created from vessel country code, voyage number, dp of unlading, foreign port, inbound entry type, contents information, etc.; n = 24,000 samples (manifests)
– The method chooses the "model" (the explanatory covariates) that best predicts the risk score y.
• Goals:
– Model selection accuracy: pick up the true influential covariates
– Prediction: predict the risk score for future shipments
Real Data Analysis
(Figure: sample result for 373 predicted risk scores.)
(Figure: box plot of E(y) – estimated y = error; the middle 50% of errors are in the box.)
Conclusion
• This method performs effectively in picking up important variables and removing unimportant ones.
• The selected model can identify the shipments with high risk scores while maintaining a low rate of false alarms.
Visualization of Manifest Data
Work of James Abello, Tsvetan Asamov
Rutgers University
Sponsored by Domestic Nuclear Detection
Office
Visualization of Manifest Data
• Visualizing data can give us insight into
interconnections, patterns, and what is “normal” or
“abnormal”
• Our visual analysis methods are based on tools
originally developed at AT&T for detection of
anomalies in telephone calling patterns – e.g., quick
detection that someone has stolen your AT&T calling
card.
• The visualizations are interactive so you can “zoom”
in on areas of interest, get different ways to present
the data, etc.
Visualization of Manifest Data
• For port p, a vector contents[p] gives the number of items of each kind of commodity shipped out of port p in a given time period.
• We devise similarity measures between ports p and q as a function of the dot product of their contents vectors.
• contents[p,q] gives the number of items of each kind of commodity shipped from port p to port q in a given time period.
• We represent such vectors using edge-weighted, labeled graphs that can be visualized using our software.
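A dot-product similarity of the kind described can be sketched as a normalized (cosine) dot product; the port names and commodity counts below are hypothetical, not the CBP data:

```python
import math

def similarity(u, v):
    """Cosine similarity of two port contents vectors: their dot product,
    normalized by the vectors' lengths, so the result lies in [0, 1] for counts."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical item counts per commodity type [furniture, electronics, toys]
contents = {
    "Shanghai": [120, 900, 450],
    "Singapore": [100, 850, 400],
    "Newark": [700, 30, 10],
}
similarity(contents["Shanghai"], contents["Singapore"])   # near 1: similar export mix
similarity(contents["Shanghai"], contents["Newark"])      # near 0: dissimilar mix
```

Ports whose contents vectors point in nearly the same direction get a similarity near 1, which is what lets anomalous port/commodity pairings stand out in the graph visualizations.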
General View of Port to Port Traffic
(Figure: color-coded connections represent number of shipments.)
Shanghai, LA, Newark, Singapore
(Figure: vertex size encodes number of shipments.)
Zooming into Shanghai (gray)
(Figure: zooming into a vertex gives more data about traffic.)
Contents To Port Pairs
(Figure: vertices are keywords and port pairs, color coded by degree; edges encode the number of containers (or shipments) with that keyword for the corresponding port pair.)
Contents To Port Pairs (cont.)
(Figure: vertices color coded by WeightRatio.)
Temporal Evolution of Manifest Data
Fix a commodity. Each vertex represents all shipments from foreign to US ports on a given day. Cluster by similarity. Notice how all Tuesdays and Wednesdays are well clustered.
Can also Cluster by Ports
(Figure: note similarity, e.g., Cincinnati, OH and Brunswick, GA.)
Making use of Uncertainty/Randomness in Detection/Prevention Protocols:
Placing Nuclear Detectors in Randomly Moving Vehicles
Work of Jerry Cheng, Fred Roberts, Minge Xie, Rutgers University
Supported by Domestic Nuclear Detection Office
• Goal of uncertainty/randomness: keep the adversary guessing and increase their potential cost.
Nuclear Detection using Taxicabs and/or Police Cars
Nuclear Detection Using Vehicles
• Distribute GPS tracking and nuclear detection
devices to taxicabs or police cars in a metropolitan
area.
– Feasibility: New technologies are making devices
portable, powerful, and cheaper.
– Some police departments are already
experimenting with nuclear detectors.
• Taxicabs are a good example because their
movements are subject to considerable uncertainty.
• Send out signals if the vehicles are getting close to
nuclear sources.
• Analyze the information (both locations and nuclear
signals) to detect potential location of a source.
Nuclear Detection Using Vehicles
Issues of Concern in our Project:
• Our discussions with law enforcement suggest reluctance to depend on the private sector (e.g., taxicab drivers) in surveillance
• However, are there enough police cars to get sufficient "coverage" in a region?
• How many vehicles are needed for sufficient coverage?
• How does the answer depend upon:
– Routes vehicles take?
– Range of the detectors?
– False positive and false negative rates of detectors?
Detectors in Vehicles – Model Components
• Source Signal Model
– Definition: random variable S, the indicator of a nuclear signal from a source
– Values: 1 (existence of source) or 0
– The closer to the source, the higher the probability P(S=1)
– Key parameter: maximum detection range
• Source Detection Model
– Random variable D
– Values: 1 (the sensor detects the source) or 0
– Model parameter: false positive rate P(D=1|S=0), the probability of detecting a nonexistent signal
– Model parameter: false negative rate P(D=0|S=1), the probability of not detecting a true signal
Detectors in Vehicles – Model
Components
• In our early work, we did not have a
specific model of vehicle movement.
• We assumed that vehicles are
randomly moved to new locations in
the region being monitored each time
period.
• If there are many vehicles with
sufficiently random movements, this is
a reasonable first approximation.
• It is probably ok for taxicabs, less so
for police cars.
Vehicles – Clustering of Events
• Definition of Clusters:
– Unusually large number of events/patterns clumping within a small region of time, space, or location in a sequence
– A cluster of alarms suggests there is a source
• Use statistical methodology:
– Formal tests provide statistical significance against random chance.
• The traditional statistical method is via Scan Statistics
– Scan the entire study area and seek to locate region(s) with unusually high likelihood of incidence
– E.g., use: the maximum number of cases in a fixed-size moving window, or the diameter of the smallest window that contains a fixed number of cases
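The first scan statistic mentioned (maximum count in a fixed-size moving window) can be sketched in one dimension; the alarm timestamps below are hypothetical:

```python
def scan_statistic(event_times, window):
    """Maximum number of events falling in any time window of fixed length
    (a 1-D scan statistic; windows are anchored at the observed events)."""
    best = 0
    for t in event_times:
        count = sum(1 for s in event_times if t <= s < t + window)
        best = max(best, count)
    return best

# Hypothetical alarm timestamps: a clump near t=50 stands out against background.
alarms = [3, 17, 29, 50, 51, 52, 53, 71, 88]
scan_statistic(alarms, window=5)   # → 4
```

In practice the observed maximum is compared against its distribution under a purely random (e.g., Poisson) background to decide whether the clump of alarms is statistically significant.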
Nuclear Detection using Taxicabs
Manhattan, New York City
(Figure: a simulation of taxicab locations at morning rush hour. Each taxicab carries a GPS tracking device and a nuclear sensor device; a dirty bomb may be hidden somewhere in the region.)
Taxicabs – Simulation
• First stage of work
• Generated data in Manhattan and did a simulation, applying the clustering approach with success
• Used the spatclus package in R: a software package to detect clusters
• In the simulations, we have considered both moving and stationary sources.
• Our emphasis then turned to comparing taxicab "coverage" to police car "coverage" using our simulations.
Number of Vehicles Needed
• The required number of vehicles in the surveillance network can be determined by statistical power analysis
– The larger the # of vehicles, the higher the power of detection
• An illustrative example:
– A surveillance network covers an area 4,000 ft by 10,000 ft
 Roughly equal to the area of the roads and sidewalks of Mid/Downtown Manhattan
– N vehicles are randomly moving around in the area
 Fix key parameters:
– Effective range of a working detector
– False positive & false negative rates for detectors
– The ranges and rates we used are not realistic, but we wanted to test general methods and not be tied to today's technology
– A fixed nuclear source randomly placed in the area
Number of Vehicles Needed: First Model
• Effective range of detector: 150 ft.
• False positive rate: 2%
• False negative rate: 5%
• Varied the number of vehicles (= number of sensors) and ran at least 50 simulations for each number of vehicles.
• For each, measure the power = P(D=1|S=1) = probability of detection of a source.
Number of Vehicles (Sensors) Needed
• Sensor range = 150 feet, false positive = 2%, false negative = 5%.
(Chart: detection power vs. number of sensors, 1,500–4,000.)
Conclusion: Need 4000 vehicles to even get 75% power.
Number of Vehicles Needed
• NYPD has 3000+ vehicles in 76 precincts in 5 boroughs. Perhaps 500 to 750 are in the streets of Mid/Downtown Manhattan at one time.
• Preliminary conclusion: The number of police cars in Manhattan would not be sufficient to even give 30% power.
• So, if we want to use vehicles, we need a larger fleet, as in taxicabs.
Modified Model
• What if we have a better detector, say with an effective range of 250 ft.?
• Don't change assumptions about false positive and false negative rates.
Number of Vehicles (Sensors) Needed
• Sensor range = 250 feet, false positive = 2%, false negative = 5%.
(Chart: detection power vs. number of sensors, 1,500–4,000.)
Conclusion: 2000 vehicles already give 93% power.
Number of Vehicles Needed
• There are not enough police cars to
accomplish this kind of coverage.
• Taxicabs could do it.
• There are other problems with our
model as it relates to police cars:
– Police cars tend to remain in their own
region/precinct.
– Police cars don’t move around as
randomly or as frequently as taxicabs
Police Car Coverage
• There are roughly 20 police precincts in
Manhattan (actual number = 22)
• Suppose we divide the region into 20 equal-sized
and shaped subregions.
• We place k police cars in each subregion and at
each time period move them to a random spot in
the subregion.
• This may be a bit more realistic than placing
them randomly in the entire region.
• We place a fixed nuclear source randomly in the
whole region.
Police Car Coverage
• Assume number of police vehicles in use in each
subregion is 25.
• A total of 500 police vehicles.
• Assume each has a detector with effective range
250 ft. and false positive, false negative rates of
2% and 5%, respectively.
• We again ran simulations.
• The power is 35%.
• Not very good.
Police Car Coverage
• Note: using 500 taxicabs and allowing them to
range through the whole region gives about the
same power.
• It is not yet clear whether the power will
generally differ significantly if we have a fixed
number of vehicles, but in one case allow them
to range only through subregions, and in another
allow them to range through the whole region.
• This is a research issue.
Hybrid Model: Police Cars +
Taxicabs
• Keeping detectors with effective range of 250 ft.,
false positive and false negative rates of 2% and
5%, respectively.
• Use 500 police cars split into 25 in each of 20
regions.
• In addition, use 2000 taxicabs ranging through
the whole region.
• Now get 98% power.
Effect of False Positive, False Negative Rates: Modified Model
• We also experimented with changes of the false positive and false negative rates.
Different Error Rates
• Compare (false positive = 5%, false negative = 10%) vs. (2%, 5%)
• Sensor range = 250 feet
(Chart: detection power vs. number of sensors, 1,500–4,000, for error rates (5%, 10%) and (2%, 5%).)
Next Step: Add a Random
Movement Model
• Adding a movement model makes the analysis
more realistic.
• We take a street network.
• We assume that vehicles move along until they
hit an intersection.
• At each intersection, they continue straight or
turn left or right according to a random
process.
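The random-turn rule above can be sketched as a walk on a block grid. The slides do not say what happens at the boundary; bouncing back at the grid edge is an assumption made here for illustration:

```python
import random

HEADINGS = [(0, 1), (1, 0), (0, -1), (-1, 0)]   # N, E, S, W on a block grid

def simulate(n_steps=100, size=25, seed=0):
    """Random-turn walk on a size x size street grid: at each intersection the
    vehicle turns left, goes straight, or turns right, uniformly at random."""
    rng = random.Random(seed)
    x, y, h = size // 2, size // 2, 0
    path = [(x, y)]
    for _ in range(n_steps):
        h = (h + rng.choice([-1, 0, 1])) % 4     # left, straight, or right
        dx, dy = HEADINGS[h]
        nx, ny = x + dx, y + dy
        if not (0 <= nx < size and 0 <= ny < size):   # assumed: bounce back at edge
            h = (h + 2) % 4
            dx, dy = HEADINGS[h]
            nx, ny = x + dx, y + dy
        x, y = nx, ny
        path.append((x, y))
    return path

path = simulate()   # one vehicle's route; run many in parallel for a fleet
```

Running many such walks at once gives the fleet positions fed into the power simulations that follow.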
Screen Shot of Simulation Tool for Street Grids & Traffic Movements
(Figure: "Arena" simulation software screenshot.)
Detectors in Vehicles – Simulation
• Take a 25 by 25 block region
– Roughly the lower Manhattan/Wall Street area
– Use our simulator tool with vehicle movements
Detectors in Vehicles – Simulation
• Sensor range: 75 ft to 250 ft (0.75 units to 2.50 units)
• False positive (FPR) and false negative rates (FNR): (2%, 2%) and (5%, 5%)
• One stationary nuclear source in the study region.
• Vary the # of sensors (vehicles) from 500 to 3,000 in increments of 500.
– Repeat each setting 500 times and compute the percentage of correctly detecting the true source within a certain period of time.
• Criteria:
– The detected cluster covers the true source location.
– The cluster is statistically significant.
Detection Power: error rates (5%, 5%)
Power = 1 – Type II error = 1 – probability of missing the detection = P(D=1|S=1)
Note: power = 70% when the number of sensors (vehicles) is 1500 with mid-range sensors.
(Chart: detection power vs. number of sensors, 500–3,000, for ranges r = 0.75, 1.00, 1.25.)
Detection Power: error rates (2%, 2%)
Note: power = 91% when the number of sensors is 1500 with mid-range sensors (and almost 100% with longer-range sensors).
(Chart: detection power vs. number of sensors, 500–3,000, for ranges r = 0.75, 1.00, 1.25.)
Comparing to Earlier Results
• Note: We saw that power = 91% when the number of sensors is 1500 with mid-range sensors (range 100 ft.), and almost 100% with longer-range sensors (range 125 ft.), with error rates (2%, 2%).
• Is this inconsistent with our earlier work (without a movement model), where we concluded that with a 150 ft. range and error rates (2%, 5%), 4000 vehicles were needed to even get 75% power?
• No: the earlier work had a much larger area being covered.
Detection Power: error rates (2%, 2%) (Police Car Scenario: Higher Range)
• Power < 30% with 300–400 sensors (200 ft or 250 ft range, 2% error rates)
• You would need 500 sensors to reach 70% power with a 250 ft. range
• The # of police patrol cars in each precinct is about 25 to 50
• The study region is roughly a precinct (25 by 25 blocks)
• Therefore police cars alone are not enough for detection, even with more powerful detectors.
(Chart: detection power vs. number of sensors, 300–1,500, for r = 200 ft and r = 250 ft.)
Additional Work on Vehicles
• There are many more explorations we are working on:
– Modifying the key parameters of effective range and
false positive and false negative rates
– Exploring hybrid models with some taxis and some
police cars
– Exploring a hybrid model that includes some stationary
detectors
– Inserting more than one source
– Bringing in moving sources
– Exploring reports over x successive time periods for
varying x
– Exploring different decision rules for reports over x time
periods.
Detectors in Cell Phones
• Similar ideas for placing sensors in cell phones have been
proposed and tested by the Radiation Laboratory at Purdue
University and at Lawrence Livermore.
• At a meeting with the NYC Police Department, where we
presented our taxicab and police car work, we were
encouraged to explore applying our methods to the cell
phone idea.
Crime Information Management: Enhancing Force Deployment in Law Enforcement and Counter-Terrorism Using Data & Visual Analytics
Work of Bill Pottenger and colleagues at CCICADA, PNNL, START, INTUIDEX
Sponsored by DHS
In collaboration with the Port Authority of NY and NJ
A Joint Project Of:
DHS Center for the Study of Terrorism and Responses to Terrorism, University of Maryland
PANYNJ Project
• We have been working with the Port Authority of New York and New Jersey to bring them enhanced analytics to support risk and threat assessment from their existing data.
• Goal: an integrated, real-time, crime information management system.
Motivation
• Information is critical in Law Enforcement
today
– Potential: better informed decisions on threat and
risk assessment, force deployment, officer safety,
crime and terrorism prevention, ...
– Challenges: volume, quality, access, distribution,
formats
• Marry law enforcement and counter-terrorism
initiatives to aid in decision making
– Data Collection
– Secure Sharing
– Data Analysis
• Build on the "CompStat" system in use by the PANYNJ police & other PDs.
CompStat Next Generation
• CompStat NG™ (Next Generation)
• Provides crime and counter-terrorism information in an enterprise architecture (web-based)
• Employs advanced visual and data analytics for crime analysis and counter-terrorism, for use in force deployment
• Enables real-time situational awareness on any networked device (Assessment Wall, PC, smart phone)
– Integrates key S&T technologies
– Aimed at enabling dramatic improvements in fighting crime and in counter-terrorism
Step 1: Existing IT and Data Access
• Database on old legacy VAX, separate network
• Migrate data from the old database to the workstation network
• Data cleansing
– Recognize anomalies and notify PANYNJ staff
– Cleaned on future extractions
• Host in central Data Warehouse
– Parallel with production system
CompStat NG™ Data Warehouse
• Incorporate information and support real-time updating from
– Global Terrorism Database @ START
– PAPD Access/SQL/Oracle DBs, etc.
• Integrate multiple sources of information
• Integrate information extraction capability from police incident data and the GTDB
Step 2: Determine R&D Technology to Deploy
• Most agencies have enormous quantities of unprocessed, inaccessible textual data
• Information extraction technology is needed in the law enforcement community
• Data is distributed in a "System of Systems" (David Boyd, DHS OIC)
– Continuously growing amount of data
– 75% untapped
– Network sophistication of offenders is growing (globalization → inter-jurisdictional crime/terrorism)
– Consolidation & coordination of data is needed
Entity Extraction Technology
• Intuidex's IxEEE™ engine
• Extracts textual named entities from virtually any data source and puts them to work as useful data
• Bridges the gap between unstructured & structured data
• Puts formerly "buried" data to work
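The general idea of entity extraction can be sketched with a toy pattern matcher. This is not the IxEEE™ engine, and the patterns and the sample narrative are hypothetical; real engines use trained models rather than a handful of regexes:

```python
import re

# Toy sketch: pull pattern-matchable entities out of a police-style narrative.
PATTERNS = {
    "plate": r"\b[A-Z]{3}-\d{4}\b",
    "date": r"\b\d{2}/\d{2}/\d{4}\b",
    "phone": r"\b\d{3}-\d{3}-\d{4}\b",
}

def extract_entities(text):
    """Return {entity_type: [matches]} for each pattern found in the text."""
    return {name: re.findall(rx, text) for name, rx in PATTERNS.items()}

narrative = "On 03/14/2009 a vehicle with plate ABC-1234 fled; witness at 555-867-5309."
extract_entities(narrative)
# → {'plate': ['ABC-1234'], 'date': ['03/14/2009'], 'phone': ['555-867-5309']}
```

The extracted fields are what turn free-text narratives into structured records that can be loaded into the data warehouse and queried.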
Entity Extraction from Law Enforcement Data
• Data entry validation for the PAPD Incident Report data entry application
• Automatic database population
• Extraction from police narratives and GTDB news sources
CompStat NG™ Visual Analytics
• PNNL Law Enforcement Information Framework (LEIF)
• PNNL IN-SPIRE Assessment Wall
(Figure: Assessment Wall.)
Step 3: Develop On- and Off-Site
• On-Site
– Deployments and integration of the system
• Off-Site
– Commanding Officer's Application (COs App)
 Integration of vital information
 Summary crime statistics
 Recap reports on demand
(Figure: CompStat NG™ COs App.)
Selected Project Results
• Deployments at PANYNJ CompStat NG™
– Data Warehouse
– Commanding Officer's Application
– Law Enforcement Information Framework
– IN-SPIRE Assessment Wall
• Ongoing research at Rutgers
– Knowledge discovery using Higher Order Naïve Bayes machine learning methods
• Have developed new capabilities
– On-demand statistics reports
– Real-time Web-based analytics tools
– Automatic data entry validation
R&D Next Steps: Example: MO Search
• Leverage technology for more helpful modus operandi (MO) search/matching in LEIF
– Modus operandi: a particular way or method of doing something (e.g., perpetrating a crime)
– A very common law enforcement activity
– Criminals tend to follow the same MO
• Based on the prior NIJ-funded BPD_MO project
• Able to be invoked from LEIF (browser; service)
Closing Comment
• PANYNJ has adopted this vision as part of their strategic long-term goals
• PA Executive Director Tony Shorris' goal: a "national model" for counter-terrorism and crime prevention and analysis!
• Similar methods are under discussion with police departments nationwide
Biosurveillance: Early Warning from Entropy
Initiated as a DHS MSI Funded Summer Program through DyDAn
Continuing under DHS-DyDAn-CCICADA support
Work of Nina Fefferman, Abdul-Aziz Yakubu, Asamoah Nkwanta, Ashley Crump, Devroy McFarlane, Anthony Ogbuka, Nakeya Williams
A New Possibility For Biosurveillance: Entropy
• Early detection of disease outbreaks is critical for public health.
• Entropy quantifies the amount of information communicated within a signal
• Signal strength may change when an outbreak starts
• We are hoping to detect changes in signal strength early into the onset of an outbreak
(Image: smallpox.)
Our Ultimate Goal: Effective Biosurveillance
Shannon's Entropy Formula
H(X) = E(I(X)) = −Σᵢ₌₁ⁿ p(xᵢ) log₂ p(xᵢ)
• I(X) is the information content of X
• p(xᵢ) = Pr(X = xᵢ) is the probability mass function of X
We want to be able to take incoming disease data and, as early as possible, notice when an outbreak is starting.
(Chart: weekly disease incidence, 0–120 cases, over roughly 950 weeks.)
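Shannon's formula above is straightforward to compute from a sequence of observed symbols; the coin-flip strings below are illustrative only:

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Estimate H(X) = -sum p(x) log2 p(x) from an observed sequence,
    using empirical frequencies as the probability mass function."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())

shannon_entropy("HTHT")   # → 1.0 (a fair coin carries one bit per symbol)
shannon_entropy("AAAA")   # zero bits: a constant signal carries no information
```

Because the entropy of the incidence stream shifts when the distribution of case counts changes, this quantity is the raw signal the method watches for outbreak onset.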
Biosurveillance Methods
• Current methods of outbreak detection are often hit or miss.
• A frequently used method: CuSum
– Compares the current cumulative summed incidence to the average
– Needs a lot of historical "non-outbreak" data (bad for newly emerging threats)
– Has to be manually "reset"
• Other methods have similar problems
Biosurveillance Using Entropy
Reported Incidence Data
(Chart: reported weekly incidence.) We apply 3 preprocessing steps.
Entropy Outcome
(Chart: the resulting entropy values over iterations.) We stream the processed data through our entropy calculation.
Biosurveillance Using Entropy: The 3 Preprocessing Steps
1. Binning the incidence data: number of categories
2. Analyzing within a temporal window: number of time points lumped into one observation
3. Moving the temporal window according to different step sizes
Binning
• Assign each "count" to a bin or category.
• Binning lets us try to focus on biologically meaningful differences.
Weekly Disease Incidence (Number of Cases)
Data: 3, 2, 4, 5, 8, 10, 12, 40, 35, 17, 37, 20, 23, 25, 4, …
Binned Data (4 bins): 1, 1, 1, 1, 2, 2, 2, 4, 4, 3, 4, 3, 3, 3, 1
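The binning step above can be sketched as a threshold lookup; the slide does not give the bin boundaries, so the edges below are hypothetical values chosen to reproduce its binned sequence:

```python
def bin_counts(data, edges):
    """Map each weekly count to a bin index: bin 1 below edges[0],
    bin k+1 between edges[k-1] and edges[k], last bin above the top edge."""
    def which(x):
        for k, e in enumerate(edges):
            if x < e:
                return k + 1
        return len(edges) + 1
    return [which(x) for x in data]

data = [3, 2, 4, 5, 8, 10, 12, 40, 35, 17, 37, 20, 23, 25, 4]
bin_counts(data, edges=[6, 15, 30])
# → [1, 1, 1, 1, 2, 2, 2, 4, 4, 3, 4, 3, 3, 3, 1]
```

Where the edges are placed is exactly the "method of binning" choice that, as the next slide shows, can really change the entropy outcome.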
Method of Binning can Really Change the Outcome
(Charts: entropy over iterations for two binning methods; number of bins = 2 vs. number of bins = 14.)
Window Size
The window for the number of data points to look at at one time should be large enough to detect when a change has happened (some data from "before" and some from "after" the outbreak starts), but small enough that it can't entirely contain a rapid peak.
Window Size = 7, Step Size = 1
Incidence Data: 3, 2, 4, 5, 8, 10, 12, 40, 35, 17, 37, 20, 23, 25, 4, … → Calculate Entropy → E(1)
(Chart: disease incidence, 0–120 cases, over roughly 1,000 weeks.)
Window Size can Also Have a Huge Impact
(Charts: entropy over iterations for window size = 3 vs. window size = 14.)
Step Size
We allow windows to overlap. The window might need to "walk along" the data, not just expand to always include more and more history. Step size tells us how continuous the process is (e.g., how much overlap there is with the last window).
Window Size = 7, Step Size = 1
Incidence Data: 3, 2, 4, 5, 8, 10, 12, 40, 35, 17, 37, 20, 23, 25, 4, … → Calculate Entropy → E(1), E(2), etc. for all the data
Step Size
Adjusting the step size can eliminate glitches like weekends and holidays in daily datasets.
(Charts: entropy over iterations for step size = 1 vs. step size = 5.)
Computing an Entropy Output
We produce a new data stream by doing this over and over again, walking the window along the binned data, using our step size.
Window Size = 6, Step Size = 1
Incidence Data: 3, 2, 4, 5, 8, 10, 12, 40, 35, 17, 37, 20, 23, 25, 4, … → Calculate Entropy → E(1), E(2), E(3), E(4), E(5), E(6), E(7), E(8), E(9), … (the 9th window, since the step size is 1)
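The three preprocessing steps combine into one short loop; this is a minimal sketch of the windowed entropy stream, shown on the binned sequence from the earlier slide:

```python
import math
from collections import Counter

def entropy(window):
    """Shannon entropy of one window of binned values, in bits."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())

def entropy_stream(binned, window_size=6, step_size=1):
    """Walk a window of the given size along the binned data, advancing by
    step_size each iteration, and emit one entropy value E(i) per window."""
    return [entropy(binned[i:i + window_size])
            for i in range(0, len(binned) - window_size + 1, step_size)]

binned = [1, 1, 1, 1, 2, 2, 2, 4, 4, 3, 4, 3, 3, 3, 1]
stream = entropy_stream(binned, window_size=6, step_size=1)
```

The output `stream` is the E(1), E(2), … sequence plotted on the entropy charts: a sustained shift in these values is the early-warning signal the method looks for.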
Biosurveillance using Entropy
• Our preliminary results show this method can work.
• Favorable when compared to CuSum and other methods.
(Charts: weekly incidence data and the resulting entropy stream by window count.)
We need more work to test it, to make sure it's sensitive and specific enough.
Biosurveillance using Entropy
Next step: Make selection of preprocessing parameters automatic
• Right now, all parameters (number of bins, how to bin, how large the window size, how long the step size) are determined manually by trial and error
• To make this useful for actual surveillance, we are working to design algorithms to select optimal parameters for these three preprocessing steps, based on small samples of training data and known outbreak definitions
Privacy Preserving Data Analysis
Work of Rebecca Wright and colleagues
at Rutgers and elsewhere.
Partner Institutions:
Rutgers, Yale, Stanford, University of
New Mexico, NYU
Sponsors:
National Science Foundation
DHS through DyDAn/CCICADA
Data Privacy
• Privacy-preserving methods of data handling seek to provide sufficient privacy as well as sufficient utility.
• Complications:
– Multiple data sources
– Ability to infer information by combining multiple sources
– Difficult to understand the extent to which limited personal identifiers still allow identification
 Sweeney: 87% of the US population can be uniquely identified by their date of birth, 5-digit zip code, and gender.
– Privacy means different things to different people, in different societies, and in different situations
Advantages of Privacy Protection
• Protection of personal information: protects individuals and helps maintain their trust
• Protection of proprietary or government-sensitive information
• Enables collaboration between different data holders (since they may be more willing or able to collaborate if they need not reveal their information)
• Compliance with legislative policies (e.g., HIPAA, EU privacy directives)
Secrecy vs. Privacy
Encryption works reasonably well to protect
secrecy of data in transit and in storage.
Alice encrypts message m as c = EK(m) and sends it to Bob,
who decrypts c to obtain m.
In contrast, privacy is about what Bob can and will do
with m.
103
Some of Our Privacy Preserving Data
Analysis Work
Our work uses a distributed cryptographic approach.
• Privacy-preserving construction of Bayesian networks
from vertically partitioned data.
• Privacy-preserving frequency mining in the fully
distributed model (enables naïve Bayes classification,
decision trees, and association rule mining).
• An experimental platform for PPDA.
• Privacy-preserving clustering: k-means clustering for
arbitrarily partitioned data and a divide-and-merge
clustering algorithm for horizontally partitioned data.
• Privacy-preserving reinforcement learning, partitioned
by observation or by time.
104
Sample Future Direction
• Incorporate differential privacy [Dwork, et al.]
into graph analysis
• Differential privacy is a measure of the
increased risk to an individual’s privacy from
participating in a statistical database
• The literature provides methods for achieving
any desired degree of privacy under this measure.
105
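As a toy illustration of the differential privacy idea (not the authors' graph-analysis work), a count query can be made ε-differentially private with the standard Laplace mechanism. The function names and parameters below are hypothetical:

```python
import math
import random

def laplace_noise(scale):
    """Sample from a Laplace(0, scale) distribution via inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def dp_count(values, predicate, epsilon):
    """Epsilon-differentially-private count query (Laplace mechanism).

    A count has sensitivity 1: adding or removing one record changes
    the true answer by at most 1, so noise with scale 1/epsilon gives
    epsilon-differential privacy.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon)
```

Smaller ε means more noise and stronger privacy; the design trade-off is exactly the privacy-versus-utility tension described earlier.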
Climate Change
An Area with Vast Homeland
Security Challenges
Work of Nina Fefferman, Endre Boros, Melike Gursoy
Sponsored by Dept. of Defense
106
Climate Change
Concerns about global warming.
Resulting impact on health
–Of people
–Of animals
–Of plants
–Of ecosystems
107
Climate Change
•Some early warning signs:
–1995 extreme heat event in Chicago
514 heat-related deaths
3300 excess emergency admissions
–2003 heat wave in Europe
35,000 deaths
–Food spoilage on Antarctica
expeditions
Not cold enough to store
food in the ice
108
Climate Change
•Some early warning signs:
–Malaria in the African Highlands
–Dengue epidemics
–Floods, hurricanes
109
Climate Change is a Homeland Security
Problem
•Potential for future conflicts over shortages of
–Potable water and water for agriculture
–Land due to flooding
–Food due to different growing seasons or
unavailability of land
•Potential for spread of
disease to new places
•Potential for
civil unrest
110
Dengue fever
Habitat change for mosquitoes
SEES: Science, Engineering, and
Education for Sustainability
• New US National Science Foundation Initiative
• SEES integrates issues of environment, energy,
and economics, with an emphasis on climate
change.
• SEES is concerned with the 2-way interaction of
human activity with environmental processes.
• Combining existing and new monies, SEES
budget is $660M in FY10 and est. $765M in
FY11.
• Major data science challenges in SEES
111
SEES: Science, Engineering, and
Education for Sustainability
“The two-way interaction of human activity
with environmental processes now defines
the challenges to human survival and
wellbeing. Human activity is changing the
climate and the ecosystems that support
human life and livelihoods.”
Dust storm in Mali
112
SEES: Science, Engineering, and
Education for Sustainability
“Reliable and affordable energy is essential to
meet basic human needs and to fuel
economic growth, but many environmental
problems arise from the harvesting,
generation, transport, processing, conversion,
and storage of energy.”
113
SEES: Science, Engineering, and
Education for Sustainability
“Climate change is a pressing anthropogenic
stressor, but it is not the only one. The
growing challenges associated with
climate change, water and energy availability,
emerging infectious diseases, invasive
species, and other effects linked
to anthropogenic change are causing
increasing hardship and instability in
natural and social ecologies throughout
the world.”
114
Sustainability
• A key is to define “sustainability” and to
develop metrics of our progress toward a
sustainable lifestyle
• There are a great many data science
challenges in climate change science and in
sustainability science
115
Climate Change
• Some of the key areas where data science
challenges arise:
 Dimensions of Biodiversity
 Water Sustainability and Climate
 Ocean Acidification
 Decadal and Regional Climate
Prediction Using Earth System
Models
116
Climate Change
• Data science is relevant to all of these areas.
117
Sample Research Area: Extreme
Events due to Global Warming
Similar interests at CDC’s new math modeling
program
We anticipate an increase in number and severity of
extreme events due to global warming.
More heat waves.
More floods, hurricanes.
118
Problem 1: Evacuations during
Extreme Heat Events
One response to such events: evacuation of most
vulnerable individuals to climate controlled
environments.
Modeling challenges:
Where to locate the evacuation centers?
Whom to send where?
Goals include minimizing travel time and keeping
facilities within their maximum capacity
119
Problem 2: Rolling Blackouts during
Extreme Heat Events
A side effect of such events: Extremes in energy use lead
to need for rolling blackouts.
Modeling challenges:
Understanding health impacts of blackouts and bringing
them into models
Design efficient rolling blackouts while minimizing
impact on health
Lack of air conditioning
Elevators not working: over-exertion by
vulnerable people
Food spoilage
Minimizing impact on the most
vulnerable populations
120
Problem 3: Pesticide Applications
after a Flood
•Pesticide applications often called for after a flood
•Chemicals used for pesticidal and larvicidal control
of mosquito disease vectors are themselves harmful
to humans
•Maximize insect control while minimizing health
effects
121
Problem 4: Emergency Rescue Vehicle
Routing to Avoid Rising Flood Waters
Emergency rescue vehicle routing to avoid
rising flood waters while still minimizing
delay in providing medical attention and
still getting afflicted people to available
hospital facilities
122
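A minimal sketch of the routing idea: shortest-path search over the road network that skips edges predicted to be flooded. The data shapes and names are assumed for illustration; a real system must model waters rising over time, not a static flooded set.

```python
import heapq

def safe_route(graph, flooded, start, goal):
    """Dijkstra over the road network, skipping flooded edges.

    graph: {node: [(neighbor, travel_minutes), ...]}
    flooded: set of directed edges (u, v) predicted to be impassable
    Returns the shortest safe path as a list of nodes, or None.
    """
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            if (u, v) in flooded:
                continue  # edge under water: do not route through it
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if goal not in dist:
        return None  # no safe route exists
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return list(reversed(path))
```

Marking an edge flooded simply diverts the search onto the best remaining route, which is the trade-off the slide describes: avoid the water while minimizing added delay.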
Optimal Locations for Shelters in
Extreme Heat Events
• Work based in Newark NJ
• Data includes locations of potential shelters, travel
distance from each city block to potential shelters, and
population size and demographic distribution on each
city block.
• Determined “at risk” age groups and their likely levels
of healthcare needed to avoid serious problems
• Computing optimal routing plans for at-risk
population to minimize adverse health outcomes and
travel time
• Using techniques of probabilistic mixed integer
programming and aspects of location theory constrained
by shelter capacity (based on predictions of duration,
onset time, and severity of heat events)
123
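The block-to-shelter assignment sub-problem can be illustrated with a simple greedy heuristic. The data shapes below are hypothetical, and this greedy rule is only a stand-in: the project itself uses probabilistic mixed integer programming, which this sketch does not reproduce.

```python
def assign_blocks(blocks, shelters, dist):
    """Greedy capacitated assignment of city blocks to shelters.

    blocks:   {block_id: at-risk population}
    shelters: {shelter_id: capacity}
    dist:     {(block_id, shelter_id): travel distance}
    Sends each block to the nearest shelter with remaining capacity.
    """
    remaining = dict(shelters)
    assignment = {}
    # Serve the largest blocks first so they get first pick of nearby capacity.
    for b in sorted(blocks, key=blocks.get, reverse=True):
        feasible = [s for s in remaining if remaining[s] >= blocks[b]]
        if not feasible:
            assignment[b] = None  # unserved: capacity exhausted
            continue
        s = min(feasible, key=lambda s: dist[(b, s)])
        assignment[b] = s
        remaining[s] -= blocks[b]
    return assignment
```

An exact MIP formulation would instead minimize total travel time (or expected adverse health outcomes) over all feasible assignments subject to the same capacity constraints.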
Pesticides after Floods: Maximal
Insect Control,
Minimal Chemical Exposure
• We are working to compute optimal application
patterns for control chemicals
• Minimizing:
– Disease risk from vector-borne diseases
– Adverse health outcomes from human exposure to
control chemicals
• Using techniques of:
– Control Theory
– Stochastic Simulation
124
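The trade-off can be illustrated with a toy stochastic simulation of a mosquito population under a spraying schedule. All names, parameters, and dynamics here are invented for illustration; they are not the project's model.

```python
import random

def simulate(spray_weeks, weeks=20, growth=1.4, kill=0.7, seed=0):
    """Toy stochastic simulation of a mosquito population under spraying.

    Returns (total vector-weeks, number of sprayings): rough proxies for
    disease risk and for human exposure to control chemicals.
    """
    rng = random.Random(seed)
    pop = 100.0
    vector_weeks = 0.0
    for week in range(weeks):
        pop *= growth * rng.uniform(0.8, 1.2)  # noisy weekly growth
        if week in spray_weeks:
            pop *= 1 - kill                    # spraying knocks the population down
        vector_weeks += pop
    return vector_weeks, len(spray_weeks)
```

Comparing schedules makes the two objectives explicit: more sprayings lower the vector-week total (disease risk) but raise chemical exposure, and the optimization problem is to choose the schedule balancing the two.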
Thank you!
125