HOMELAND SECURITY RESEARCH AT DIMACS

Working Group on Adverse Event/Disease Reporting, Surveillance, and Analysis
• Health surveillance a core activity in public health
• Concerns about bioterrorism have attracted attention to new surveillance methods:
  – OTC drug sales
  – Subway worker absenteeism
  – Ambulance dispatches
• Spawns need for novel statistical methods for surveillance of multiple data streams.
[Image panels: Vaccine Safety Surveillance; Disease Surveillance; Drug Safety Surveillance; Syndromic Surveillance]

Working Group on Privacy & Confidentiality of Health Data
• Privacy concerns are a major stumbling block to public health surveillance, in particular bioterrorism surveillance.
• Challenge: produce anonymous data specific enough for research.
• Exploring ways to remove identifiers (s.s. #, tel. #, zip code) from data sets.
• Exploring ways to aggregate, remove information from data sets.

Working Group on Analogies between Computer Viruses and Biological Viruses
• Can ideas for defending against biological viruses lead to ideas for defending against computer viruses?
• Concern about large gap between initial time of attack and implementation of defensive strategies
• “Public health” approach: Once a virus has infected a machine, it tries to connect it to as many computers as possible, as fast as possible. A “throttle” limits rate at which a computer can connect to new computers.

Working Group on Modeling Social Responses to Bioterrorism
• Models of the spread of infectious disease commonly assume passive bystanders and rational actors who will comply with health authorities.
• It is not clear how well this assumption applies to situations like a bioterrorist attack using smallpox or plague.
• Interdisciplinary group is discussing incorporating social behavior into models, models of public health decisionmaking, risk communication.
[Image: 1947, NYC, smallpox outbreak]
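
Illustration of the “throttle” idea from the Working Group on Analogies between Computer Viruses and Biological Viruses. This is only a minimal sketch under stated assumptions, not the working group's design: the class name `ConnectionThrottle`, the fixed minimum interval between connections to unfamiliar hosts, and the small working set of recently contacted hosts are all hypothetical choices made here for illustration.

```python
from collections import deque


class ConnectionThrottle:
    """Limit how fast a machine may contact hosts it has not contacted recently."""

    def __init__(self, min_interval=10, working_set_size=4):
        # Assumed parameters: minimum time between connections to NEW hosts,
        # and how many recently contacted hosts count as "familiar".
        self.min_interval = min_interval
        self.recent = deque(maxlen=working_set_size)
        self.last_new = None  # time of the last permitted new-host connection

    def allow(self, host, now):
        """Return True if a connection to `host` at time `now` may proceed."""
        if host in self.recent:
            return True  # familiar host: never delayed
        if self.last_new is not None and now - self.last_new < self.min_interval:
            return False  # too soon after the previous new host: delay or drop
        self.last_new = now
        self.recent.append(host)
        return True


# An infected machine tries one new host per time unit for 20 time units.
throttle = ConnectionThrottle(min_interval=10, working_set_size=4)
attempts = [throttle.allow(f"host{i}", now=i) for i in range(20)]
allowed = sum(attempts)  # only 2 of the 20 attempts get through
```

Normal traffic, which mostly revisits a few familiar hosts, is barely affected, while rapid scanning of new hosts is slowed. That slowdown is what narrows the gap between the start of an attack and the implementation of a defense.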

The Bioterrorism Sensor Location Problem
• Early warning is critical
• This is a crucial factor underlying government’s plans to place networks of sensors/detectors to warn of a bioterrorist attack
The BASIS System

Two Fundamental Problems
• Sensor Location Problem (SLP):
  – Choose an appropriate mix of sensors
  – decide where to locate them for best protection and early warning
• Pattern Interpretation Problem (PIP): When sensors set off an alarm, help public health decision makers decide
  – Has an attack taken place?
  – What additional monitoring is needed?
  – What was its extent and location?
  – What is an appropriate response?

Monitoring Message Streams: Algorithmic Methods for Automatic Processing of Messages
Supported by Interagency KD-D Group

OBJECTIVE:
Monitor huge streams of textualized communication to automatically detect pattern changes and "significant" events
Motivation: monitoring email traffic

TECHNICAL PROBLEM:
• Given stream of text in any language.
• Decide whether "new events" are present in the flow of messages.
• Event: new topic or topic with unusual level of activity.
• Retrospective or “Supervised” Event Identification: Classification into pre-existing classes.
• Batch filtering: Given relevant documents up front.
• Adaptive filtering: “pay” for information about relevance as process moves along.

MORE COMPLEX PROBLEM: PROSPECTIVE DETECTION OR “UNSUPERVISED” LEARNING
• Classes change - new classes or change meaning
• A difficult problem in statistics
• Recent new C.S. approaches
“Semi-supervised Learning”:
• Algorithm suggests a new class
• Human analyst labels it; determines its significance

COMPONENTS OF AUTOMATIC MESSAGE PROCESSING
(1). Compression of Text -- to meet storage and processing limitations;
(2). Representation of Text -- put in form amenable to computation and statistical analysis;
(3). Matching Scheme -- computing similarity between documents;
(4). Learning Method -- build on judged examples to determine characteristics of document cluster (“event”)
(5). Fusion Scheme -- combine methods (scores) to yield improved detection/clustering.

COMPONENTS OF AUTOMATIC MESSAGE PROCESSING - II
• These distinctions are somewhat arbitrary.
• Many approaches to message processing overlap several of these components of automatic message processing.
Existing methods don’t exploit the full power of the 5 components, synergies among them, and/or an understanding of how to apply them to text data.

COMPRESSION:
• Reduce the dimension before statistical analysis.
• We often have just one shot at the data as it comes “streaming by”

COMPRESSION II:
• Recent results: “One-pass” through data can reduce volume significantly w/o degrading performance significantly.
We believe that sophisticated dimension reduction methods in a preprocessing stage followed by sophisticated statistical tools in a detection/filtering stage can be a very powerful approach. Our methods so far give us some confidence that we are right.

COMPRESSION III:
• Three directions of work involving adaptation of nearest neighbor (NN) algorithms from theoretical computer science:
  * Use of random projections into real subspaces. (Still promising, though not competitive for our data.)
  * Random projections into Hamming cubes
  * Efficient discovery of “deviant” cases in stream of vectorized entities

MORE SOPHISTICATED STATISTICAL APPROACHES BEING STUDIED:
• Representations: Boolean representations; weighting schemes
• Matching Schemes: Boolean matching; nonlinear transforms of individual feature values
• Learning Methods: new kernel-based methods; more complex Bayes classifiers; boosting
• Fusion Methods: combining scores based on ranks, linear functions, or nonparametric schemes

DATA SETS USED:
• No readily available data set has all the characteristics of data on which we expect our methods to be used
• However: Many of our methods depend essentially only on term frequencies by document.
• Thus, many available data sets can be used for experimentation.

DATA SETS USED II:
TREC (Text Retrieval Conference) data: time-stamped subsets of the data (order 10^5 to 10^6 messages)
Reuters Corpus Vol. 1 (8 x 10^5 messages)
Medline Abstracts (order 10^7 with human indexing)

THE MONITORING MESSAGE STREAMS PROJECT TEAM:
Endre Boros, RUTCOR
Paul Kantor, SCILS
Dave Lewis, Consultant
Ilya Muchnik, DIMACS/CS
S. Muthukrishnan, CS
David Madigan, Statistics
Rafail Ostrovsky, Telcordia Technologies
Fred Roberts, Rutgers
Martin Strauss, AT&T Labs
Wen-Hua Ju, Avaya Labs (collaborator)
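
Backup illustration for COMPRESSION III (random projections into Hamming cubes). The slides do not say which construction the project uses, so this sketch swaps in one well-known option, sign random projections: each random hyperplane contributes one bit, and Hamming distance between bit codes stands in for the angle between the original term-frequency vectors. The function names, the Gaussian hyperplanes, and the toy vectors are all assumptions made here for illustration.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes in a dim-dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def to_hamming_cube(vec, hyperplanes):
    """Map a real-valued vector to a tuple of bits, one bit per hyperplane."""
    return tuple(
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0 else 0
        for h in hyperplanes
    )


def hamming(a, b):
    """Count the positions where two bit codes differ."""
    return sum(x != y for x, y in zip(a, b))


# Toy term-frequency vectors (hypothetical data).
planes = make_hyperplanes(dim=4, n_bits=16, seed=0)
doc = [3, 0, 1, 2]
doc_twice_as_long = [6, 0, 2, 4]   # same direction, different length
code_a = to_hamming_cube(doc, planes)
code_b = to_hamming_cube(doc_twice_as_long, planes)
same_direction_distance = hamming(code_a, code_b)  # 0: only direction matters
```

Because each document is reduced to a short bit string in a single pass, the codes can be stored and compared cheaply, in line with the “one shot at the data” constraint on the COMPRESSION slide.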