Homeland Security
What Can Mathematics Do?

Fred Roberts
Professor of Mathematics, Rutgers University
Chair, RU Homeland Security Research Initiative
Director, DIMACS Center

Mathematical methods have become important tools in preparing plans for defense against terrorist attacks, especially when combined with powerful, modern computer methods for analysis and simulation.

Are you Serious?? What Can Mathematics Do For Us?

After Pearl Harbor: Mathematics and mathematicians played a vitally important role in the US World War II effort.

Critical War-Effort Contributions Included:
• Code breaking (Enigma machine)
• Creation of the mathematics-based field of Operations Research:
– logistics
– optimal scheduling
– inventory
– strategic planning

But: Terrorism is Different. Can Mathematics Really Help?
5+2=? 1, 2, 3, …

I’ll Illustrate with Mathematics Projects I’m Involved in. There are Many Others
• Bioterrorism Sensor Location
• Monitoring Message Streams
• Identification of Authors
• Detecting a Bioterrorist Attack through “Syndromic Surveillance”

OUTLINE
• Bioterrorism Sensor Location
• Monitoring Message Streams
• Identification of Authors
• Detecting a Bioterrorist Attack through “Syndromic Surveillance”

The Bioterrorism Sensor Location Problem

• Early warning is critical in defense against terrorism
• This is a crucial factor underlying the government’s plans to place networks of sensors/detectors to warn of a bioterrorist attack

The BASIS System – Salt Lake City

Locating Sensors is not Easy
• Sensors are expensive
• How do we select them and where do we place them to maximize “coverage,” expedite an alarm, and keep the cost down?
• Approaches that improve upon existing, ad hoc location methods could save countless lives in the case of an attack and also money in capital and operational costs.

Two Fundamental Problems
• Sensor Location Problem
– Choose an appropriate mix of sensors
– Decide where to locate them for best protection and early warning

Two Fundamental Problems
• Pattern Interpretation Problem: When sensors set off an alarm, help public health decision makers decide
– Has an attack taken place?
– What additional monitoring is needed?
– What was its extent and location?
– What is an appropriate response?

The Sensor Location Problem
• Approach is to develop new algorithmic methods.
• Developing new algorithms involves fundamental mathematical analysis.
• Analyzing how efficient algorithms are involves fundamental mathematical methods.
• Implementing the algorithms on a computer is often a separate problem – which needs to go hand in hand with the basic mathematics of algorithm development.

Algorithmic Approaches I: Greedy Algorithms

Greedy Algorithms
• Find the most important location first and locate a sensor there.
• Find second-most important location.
• Etc.
• Builds on earlier mathematical work at Institute for Defense Analyses (Grotte, Platt)
• “Steepest ascent approach.”
• No guarantee of “optimal” or best solution.
• In practice, gets pretty close to optimal solution.

Algorithmic Approaches II: Variants of Classic Facility Location Theory Methods

Location Theory
• Old problem in Operations Research: Where to locate facilities (fire houses, garbage dumps, etc.) to best serve “users”
• Often deal with a network with nodes, edges, and distances along edges
• Users u1, u2, …, un are located at nodes
• One approach: locate the facility at node x chosen so that sum of distances to users is minimized.
• Minimize: $\sum_{i=1}^{n} d(x, u_i)$

Location Theory: A Network
[Figure: network with nodes a, b, c, d, e, f]
Nodes are places for users or facilities
1’s represent distances along edges

[Figure: the same network with users u1, u2, u3 located at nodes]
x = a: $\sum_i d(x, u_i) = 1 + 1 + 2 = 4$
x = b: $\sum_i d(x, u_i) = 2 + 0 + 1 = 3$
x = c: $\sum_i d(x, u_i) = 3 + 1 + 0 = 4$
x = d: $\sum_i d(x, u_i) = 2 + 2 + 1 = 5$
x = e: $\sum_i d(x, u_i) = 1 + 3 + 2 = 6$
x = f: $\sum_i d(x, u_i) = 0 + 2 + 3 = 5$
x = b is optimal

Variants of Classic Facility Location Theory Methods: Complications
• We don’t have a network with nodes and edges; we have points in a city
• Sensors can only be at certain locations (size, weight, power source, hiding place)
• We need to place more than one sensor
• Instead of “users,” we have places where potential attacks take place.
• Potential attacks take place with certain probabilities.
• Wind, buildings, mountains, etc. add complications.

Variants of Classic Facility Location Theory Methods: Complications
• These more complex problems are hard!
• The best-known algorithms for solving these “higher-dimensional” variants of the classic location problem are due to Rafail Ostrovsky, a partner on our project.
• The mathematics-based approximation methods due to Ostrovsky and his colleagues are promising.

Algorithmic Approaches III: Variants of Air Pollution Monitoring Models

Variants of Air Pollution Monitoring Models
• Long history of using mathematical models to locate air pollution monitors.
• Use fluid dynamics
• Use plume models.
• Large computer simulations needed.
• Long used in nuclear weapons defense.

Variants of Air Pollution Monitoring Models
• Mathematical challenge: Modify air pollution monitor placement modeling tools for complex biological agents.
• E.g.: Complications arise when applying the models to cities: Buildings make it hard!

The Pattern Interpretation Problem

The Pattern Interpretation Problem (PIP)
• It will be up to the Decision Maker to decide how to respond to an alarm from the sensor network.

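Aside: checking the Location Theory example
The six-node network example from the Location Theory slides can be checked in a few lines of Python. This is only an illustrative sketch, not part of any project software. It assumes the network is the ring a-b-c-d-e-f-a with unit edge lengths and that users u1, u2, u3 sit at nodes f, b, c; both assumptions are inferred from the distance sums on the slide. The names EDGES, USERS, hop_distances, and one_median are hypothetical.

```python
from collections import deque

# Assumed network: a ring a-b-c-d-e-f-a, every edge of length 1
# (inferred from the distance sums on the slide, not stated there).
EDGES = [("a", "b"), ("b", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("f", "a")]

# Assumed user positions, again inferred from the slide's sums.
USERS = {"u1": "f", "u2": "b", "u3": "c"}


def build_adjacency(edges):
    """Turn an undirected edge list into an adjacency dictionary."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    return adjacency


def hop_distances(adjacency, source):
    """Breadth-first search; with unit edges, hop count equals distance."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


def one_median(adjacency, user_nodes):
    """Return the node minimizing the sum of distances to all users,
    together with the total for every candidate node."""
    user_nodes = list(user_nodes)
    totals = {}
    for x in sorted(adjacency):
        distance = hop_distances(adjacency, x)
        totals[x] = sum(distance[u] for u in user_nodes)
    best = min(totals, key=totals.get)
    return best, totals


adjacency = build_adjacency(EDGES)
best, totals = one_median(adjacency, USERS.values())
for node, total in totals.items():
    print(f"x={node}: sum of distances = {total}")
print("optimal node:", best)
```

Under these assumptions the totals agree with the slide (a: 4, b: 3, c: 4, d: 5, e: 6, f: 5) and the chosen node is b. Exhaustive search like this only works for a single facility on a small network; the complications listed above are what make the real sensor problem hard.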
Approaching the PIP: Minimizing False Alarms
One approach: Redundancy.
• Could require two or more sensors to make a detection before an alarm is considered confirmed
• Could require same sensor to register two alarms: Portal Shield requires two positives for the same agent during a specific time period.

Approaching the PIP: Minimizing False Alarms
• Could place two or more sensors at or near the same location. Require two proximate sensors to give off an alarm before we consider it confirmed.
Redundancy has drawbacks: cost, delay in confirming an alarm.
We need mathematical methods to analyze the tradeoff between lowered false alarm rate and extra cost/delay

Approaching the PIP: Using Decision Rules
• Existing sensors come with a sensitivity level specified and sound an alarm when the number of particles collected is sufficiently high, that is, above threshold.

Approaching the PIP: Using Decision Rules
• Let f(x) = number of particles collected at sensor x in the past 24 hours. Sound an alarm if f(x) > T.
• Alternative decision rule: alarm if two sensors reach 90% of threshold, three reach 75% of threshold, etc.
Alarm if:
f(x) > T for some x, or
f(x1) > .9T and f(x2) > .9T for some x1, x2, or
f(x1) > .75T and f(x2) > .75T and f(x3) > .75T for some x1, x2, x3.

Approaching the PIP: Using Decision Rules
• Prior work along these lines in missile detection (Cherikh and Kantor)

Bioterrorism Sensor Location: Partner Agencies/Institutions
• Defense Threat Reduction Agency
• MITRE Corporation
• Los Alamos National Laboratory
• Institute for Defense Analyses
• New York City Dept.
of Health

Monitoring Message Streams: Algorithmic Methods for Automatic Processing of Messages

Objective: Monitor huge communication streams, in particular, streams of textualized communication, to automatically detect pattern changes and "significant" events
Motivation: monitoring email traffic, news, communiques, faxes

Technical Approaches:
• Given stream of text in any language.
• Decide whether "events" are present in the flow of messages.
• Event: new topic or topic with unusual level of activity.
• Suppose events have been classified into classes or groups: group 1, group 2, …
• A new message comes in. Does it fit into group 1? Into group 2? Or does it (and related messages) define a new group of interest?

One Approach: “Bag of Words”
• List all the words of interest that may arise in the messages being studied: w1, w2, …, wn
• Bag of words vector b has k as the ith entry if word wi appears k times in the message.
• Sometimes, use “bag of bits”: Vector of 0’s and 1’s; count 1 if word wi appears in the message, 0 otherwise.

“Bag of Words” Example
Words:
w1 = bomb, w2 = attack, w3 = strike
w4 = train, w5 = plane, w6 = subway
w7 = New York, w8 = Los Angeles, w9 = Madrid, w10 = Tokyo, w11 = London
w12 = January, w13 = March

“Bag of Words”
Message 1:
Strike Madrid trains on March 1.
Strike Tokyo subway on March 2.
Strike New York trains on March 11.
Bag of words b1 = (0,0,3,2,0,1,1,0,1,1,0,0,3)

The Approach: “Bag of Words”
• Key idea: how close are two such vectors?
• Suppose known messages have been classified into different groups: group 1, group 2, …
• A message comes in. Which group should we put it in?
Or is it “new”?
• You look at the bag of words vector associated with the incoming message and see if it “fits” closely to typical vectors associated with a given group.

The Approach: “Bag of Words”
• Your performance can improve over time.
• You “learn” how to classify better.
• Typically you do this “automatically” and try to develop mathematical methods that will allow a machine to “learn” from past data.

“Bag of Words”
Message 2:
Bomb Madrid trains on March 1.
Attack Tokyo subway on March 2.
Strike New York trains on March 11.
Bag of words b2 = (1,1,1,2,0,1,1,0,1,1,0,0,3)

“Bag of Words”
Note that b1 and b2 are “close”
b1 = (0,0,3,2,0,1,1,0,1,1,0,0,3)
b2 = (1,1,1,2,0,1,1,0,1,1,0,0,3)
Close could be measured using distance d(b1,b2) = number of places where b1, b2 differ (“Hamming distance” between vectors).
Here: d(b1,b2) = 3
The messages are “similar” and could belong to the same group or class of messages.

“Bag of Words”
Message 3:
Go on strike against Madrid trains on March 1.
Go on strike against Tokyo subway on March 2.
Go on strike against New York trains on March 11.
Bag of words b3 = same as b1.
BUT: message 3 is quite different from message 1. Shows complexity of problem. Maybe we are missing some key words like “go,” or maybe we should use pairs of words like “on strike” (“bigrams”)

One Approach: k-Nearest Neighbor (kNN) Classifiers
• How kNN Classifiers Work:
– Find k most similar “training” messages (neighbors)
– Assign a message to those groups that are most common among neighbors (using weighting by distance)
• kNN classifiers had been considered inefficient since finding neighbors is slow

Speeding up kNN
• Can finding neighbors be made fast enough to make kNN practical?
• Mathematics can help.
• Store text and classes “sparsely”
• Use “inverted file” heuristics that group input by word, not by “document,” and compute similarities using only the few words occurring in the document
• Result: New methods are 10 to 100 times faster with only a 2-10% loss in “effectiveness” (according to some standard measures)
• Software delivered to sponsors.

Streaming Data
• We often have just one shot at the data as it comes “streaming by” because there is so much of it. This calls for powerful new algorithms.

Research Challenge: “Historic” Data Analysis
• The accumulation of text messages is massive over time
• We can only save summaries of the data.
• It is a great challenge to use only summarized historic data and see if a currently emerging phenomenon had precursors occurring in the past, since you don’t have the original data.
• We have had some success with a novel architecture for historic and posterior analyses via small summaries (“sketches”)

Related Project: Author Identification
Develop and evaluate techniques for identifying authors in large collections of textual artifacts (emails, communiques, transcribed speech, etc.).
Questions Addressed:
Which of a set of authors wrote a particular document/message?
Were two documents written by the same author?

Author Identification
• We are using methods developed in the Monitoring Message Streams Project
• Building on classical work in Statistics: Who wrote the Federalist papers, Hamilton or Madison?
• More complicated than conventional text classification:
– Large number of possible authors
– Not much “training data”
– Authors write on multiple topics
– Authors write in different styles in different “genres”

One Approach: In “Bag of Words”: Use “Function Words”
a, about, above, according, accordingly, actual, actually, after, afterward, afterwards, again, against,
ago, ah, ain't, all, almost, along, already, also, although, always, am, among, an, and,
another, any, anybody, anyone, anything, anywhere, are, aren't, around, art, as, aside, at, away

Partner Agencies: Monitoring Message Streams and Author Identification Projects
• Research sponsored by ITIC: Intelligence Technology Innovation Center
• Administratively under the CIA
• Through interagency Knowledge, Discovery, and Dissemination (KDD) program.

Bioterrorist Event Detection
Great concern about the deliberate introduction of diseases such as smallpox by bioterrorists has led to new challenges for mathematical scientists.
[Image: smallpox]

Bioterrorist Event Detection
• Mathematical models of infectious diseases go back to Daniel Bernoulli’s mathematical analysis of smallpox in 1760.
• However, modern data-gathering methods bring with them new challenges for mathematicians.
• Methods used in Monitoring Message Streams and Author ID projects enter into using large data sets to detect “bioterrorist events” or “emerging diseases” (SARS) through “syndromic surveillance”

New Data Types for Public Health Surveillance
• Managed care patient encounter data
• Pre-diagnostic/chief complaint (ED data)
• Over-the-counter sales transactions
– Drug store
– Grocery store
• 911-emergency calls
• Ambulance dispatch data
• Absenteeism data
• ED discharge summaries
• Prescription/pharmaceuticals
• Adverse event reports

Syndromic Surveillance: NYC Dept. of Health Data

Approach:
• As with Monitoring Message Streams and Author Identification, represent data by using a vector.
• For example, use “bag of bits” (0 or 1 only in each entry).
• If we use symptoms, then 1 or 0 represents presence or absence of symptoms such as coughing, fever over 102 degrees, achy legs, disorientation, etc.

Many New Mathematical Methods and Approaches under Development
• Spatial-temporal “scan statistics”
• Statistical process control (SPC)
• Bayesian applications
• “Market-basket” association analysis
• Text mining
• Rule-based surveillance
• Change-point techniques

Project a Collaboration between a Math/CS Research Center and a Government Agency
DIMACS: Center for Discrete Mathematics and Theoretical Computer Science
CDC: Centers for Disease Control and Prevention

Would Mathematics Help Protect our Bridges and Tunnels?
George Washington Bridge
Lincoln Tunnel

Would Mathematics Help Protect our Borders?

Would it Help with a Deliberate Outbreak of Anthrax?

Similar approaches, using mathematical models, have proven useful in many other fields, to:
• make policy
• plan operations
• analyze risk
• compare interventions
• identify the cause of observed events

Why not in homeland security?