Monitoring Message Streams: Retrospective and Prospective Event Detection
Paul Kantor, Dave Lewis, David Madigan, Fred Roberts
DIMACS, Rutgers University

DIMACS is a partnership of:
• Rutgers University
• Princeton University
• AT&T Labs
• Bell Labs
• NEC Research Institute
• Telcordia Technologies
http://dimacs.rutgers.edu
center@dimacs.rutgers.edu
732-445-5928

OBJECTIVE:
Monitor streams of textualized communication to detect pattern changes and "significant" events.
Motivation: sniffing and monitoring email traffic.

TECHNICAL PROBLEM:
• Given a stream of text in any language.
• Decide whether "new events" are present in the flow of messages.
• Event: a new topic, or a topic with an unusual level of activity.
• Retrospective or “Supervised” Event Identification: classification into pre-existing classes.

More Complex Problem: Prospective Detection or “Unsupervised” Learning
Classes change: new classes arise, or existing classes change meaning
A difficult problem in statistics
Recent new C.S. approaches:
1) Algorithm suggests a new class
2) Human analyst labels it; determines its significance

COMPONENTS OF AUTOMATIC MESSAGE PROCESSING
(1) Compression of Text -- to meet storage and processing limitations;
(2) Representation of Text -- put in a form amenable to computation and statistical analysis;
(3) Matching Scheme -- computing similarity between documents;
(4) Learning Method -- build on judged examples to determine characteristics of a document cluster (“event”);
(5) Fusion Scheme -- combine methods (scores) to yield improved detection/clustering.

OUR APPROACH: WHY WE CAN DO BETTER THAN STATE OF THE ART:
• Existing methods use some or all of the 5 automatic processing components, but don’t exploit the full power of the components and/or an understanding of how to apply them to text data.
• Dave Lewis' method at TREC filtering used an off-the-shelf support vector machine supervised learner, but tuned it for frequency properties of the data.
• The combination still dominated competing approaches in the TREC-2001 batch filtering evaluation.

OUR APPROACH: WHY WE CAN DO BETTER II:
• Existing methods aim at fitting into available computational resources without paying attention to upfront data compression.
• We hope to do better by a combination of:
  more sophisticated statistical methods
  sophisticated data compression in a preprocessing stage
  optimization of component combinations

COMPRESSION:
• Reduce the dimension before statistical analysis.
• Recent results: a “one-pass” through the data can reduce volume significantly without degrading performance significantly (e.g., by using random projections).
• This is unlike feature-extracting dimension reduction, which can lead to bad results.
We believe that sophisticated dimension reduction methods in a preprocessing stage followed by sophisticated statistical tools in a detection/filtering stage can be a very powerful approach.

MORE SOPHISTICATED STATISTICAL APPROACHES:
• Representations: Boolean representations; weighting schemes
• Matching Schemes: Boolean matching; nonlinear transforms of individual feature values
• Learning Methods: new kernel-based methods; more complex Bayes classifiers; boosting
• Fusion Methods: combining scores based on ranks, linear functions, or nonparametric schemes

OUTLINE OF THE APPROACH
• Identify the best combination of newer methods through careful exploration of a variety of tools.
• Address issues of effectiveness (how well the task is done) and efficiency (in computational time and space).
• Use a combination of new or modified algorithms and improved statistical methods built on the algorithmic primitives.

IN LATER YEARS
• Extend work to unsupervised learning.
• Still concentrate on new methods for the 5 components.
• Emphasize “semi-supervised learning”: human analysts help to focus on features most indicative of anomaly or change; algorithms assess incoming documents as to deviation on those features.
• Develop new techniques to represent data to highlight significant deviation:
  Through an appropriately defined metric
  With new clustering algorithms
  Building on analyst-designated features

THE PROJECT TEAM:
Strong team:
Statisticians: David Madigan, Rutgers Statistics; Ilya Muchnik, Rutgers CS
Experts in Info. Retrieval & Library Science & Text Classification: Paul Kantor, Rutgers Info. and Library Science; David Lewis, Private Consultant

THE PROJECT TEAM:
Learning Theorists/Operations Researchers: Endre Boros, Rutgers Operations Research
Computer Scientists: Muthu Muthukrishnan, Rutgers CS; Martin Strauss, AT&T Labs; Rafail Ostrovsky, Telcordia Technologies
Decision Theorists/Mathematical Modelers: Fred Roberts, Rutgers Math/DIMACS
Homeland Security Consultants: David Goldschmidt, IDA-CCR

IMPACT: 12 MONTHS:
• We will have established a state-of-the-art scheme for classification of accumulated documents in relation to known tasks/targets/themes and for building profiles to track future relevant messages.
• We are optimistic that by end-to-end experimentation, we will discover synergies between new mathematical and statistical methods for addressing each of the component tasks and thus achieve significant improvements in performance on accepted measures that could not be achieved by piecemeal study of one or two component tasks.

IMPACT: 3 YEARS:
• We will have prototype code for testing the concepts and a precise system specification for commercial or government development.
• We will have extended our analysis to semi-supervised discovery of potentially interesting clusters of documents.
• This should allow us to identify potentially threatening events in time for cognizant agencies to prevent them from occurring.

RISKS
• Data will not be realistic enough.
• We will find it harder than expected to combine good approaches to the 5 components.
• Multidisciplinary cooperation won’t work as well as we think.

TOP ACCOMPLISHMENTS TO DATE
Infrastructure Work to Date (1 of 2)
--Built platform for text filtering experiments
  *Modified CMU Lemur retrieval toolkit to support filtering
  *Created newswire test set with test information needs (250 topics, 240K documents)
  *Wrote evaluation and adaptive thresholding software

TOP ACCOMPLISHMENTS TO DATE II
Infrastructure Work to Date (2 of 2):
--Implemented fundamental adaptive linear classifier (Rocchio)
--Benchmarked them using our data sets and submitted to NIST TREC evaluation

TOP ACCOMPLISHMENTS TO DATE III
Developed a Formal Framework for Monitoring Message Streams:
• Cast Monitoring Message Streams as a multistage decision problem
• For each message, decide whether or not to send it to an analyst
• Positive utility for sending an “interesting” message; else negative…but

A Formal Framework for Monitoring Message Streams Continued
• …positive “value of information” even for negative documents
• Use Influence Diagrams as a modeling framework
• Key input is the learning curve
• Building simple learning curve models
• BinWorld – discrete model of feature space

TOP ACCOMPLISHMENTS TO DATE IV
In June, held a “data mining in homeland security” tutorial and workshop at IDA-CCR Princeton.
Organized Algorithmic Approach to Compression/Dimension Reduction
Beginning Work on Nearest Neighbor Search Methods

S.O.W: FIRST 12 MONTHS:
• Prepare available corpora of data on which to uniformly test different combinations of methods
• Concentrate on supervised learning and detection
• Systematically explore & compare combinations of compression schemes, representations, matching schemes, learning methods, and fusion schemes
• Test combinations of methods on common data sets and exchange information among the team
• Develop and test promising dimension reduction (compression) methods

S.O.W: FIRST 12 MONTHS: Midterm Exam (by end of November):
• Reports on Algorithms: draft writeups
• Research Quality Code: under development
• Reports on Experimental Evaluation: interim project report
• Dissemination: draft writeups, interim report plus website, workshop in June 2002 just prior to beginning of project

S.O.W: FIRST 12 MONTHS: Final Exam (by end of First 12 Months):
• Reports on Algorithms: formal writeups as technical reports and research papers
• Research Quality Code: made available to sponsors and mission agencies on a web site
• Reports on Experimental Evaluation: project report summarizing end-to-end studies on the effectiveness of different components of our approach and their effectiveness in combination
• Dissemination: technical reports, conference papers, journal submissions, final reports on algorithms and experimental evaluation, refinement of websites, meetings with sponsors and mission agencies. End of Year 1 Workshop for Sponsors/Practitioners.
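
ILLUSTRATION: ONE-PASS COMPRESSION BY RANDOM PROJECTION
The compression work item in this statement of work can be illustrated with a small Python sketch. It maps sparse term-count vectors to a short dense vector in a single pass over each document. Everything specific here is an assumption made for illustration, not the project's method: the function names, the +1/-1 sign projection, the per-term seeding, and the target dimension k.

```python
import math
import random


def term_signs(term, k, seed=0):
    """Reproducible pseudo-random row of k signs (+1.0 or -1.0) for one term.

    Seeding by term (an illustrative choice) means no projection matrix
    has to be stored, whatever the vocabulary size.
    """
    rng = random.Random(f"{seed}:{term}")
    return [1.0 if rng.random() < 0.5 else -1.0 for _ in range(k)]


def project(doc_counts, k=64, seed=0):
    """Map a sparse {term: count} dict to a dense k-dimensional vector."""
    out = [0.0] * k
    scale = 1.0 / math.sqrt(k)
    for term, count in doc_counts.items():  # one pass over the document
        signs = term_signs(term, k, seed)
        for j in range(k):
            out[j] += scale * count * signs[j]
    return out


def cosine(u, v):
    """Cosine similarity between two dense vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Under these assumptions, documents that share many terms should stay close after projection, so a matching scheme such as cosine similarity can be run on the k-dimensional vectors instead of the full vocabulary. How large k must be for a given corpus is an empirical question that this sketch does not answer.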

S.O.W: YEARS 2 AND 3:
• Combine leading methods for supervised learning with promising upfront dimension reduction methods
• Develop research quality code for the leading identified methods for supervised learning
• Develop the extension to unsupervised learning:
  Detect suspicious message clusters before an event has occurred
  Use generalized stress measures indicating that a significant group of interrelated messages does not fit into the known family of clusters
  Concentrate on semi-supervised learning.

WE ARE OFF TO A GOOD START
The task we face is of great value in forensic activities.
We are bringing to bear on this task a multidisciplinary approach with a large, enthusiastic, and experienced team.
Preliminary results are very encouraging.
Work is needed to make sure that our ideas are of use to analysts.
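
ILLUSTRATION: FLAGGING MESSAGES THAT FIT NO KNOWN CLUSTER
The Years 2 and 3 plan speaks of measures indicating that a group of messages does not fit the known family of clusters. The slides do not define those measures, so the Python sketch below uses a simple stand-in: the distance from each message vector to the nearest known cluster centroid, with an alarm when a large enough fraction of a window of messages lies beyond a chosen radius. The function names, the Euclidean distance, the radius, and min_fraction are all hypothetical choices for illustration.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def misfit(vector, centroids):
    """Distance from a message vector to the nearest known cluster centroid."""
    return min(distance(vector, c) for c in centroids)


def flag_window(window, centroids, radius, min_fraction=0.3):
    """Return (alarm, outliers) for a window of message vectors.

    outliers are the messages farther than `radius` from every known
    centroid; alarm is True when they make up at least `min_fraction`
    of a non-empty window.
    """
    outliers = [v for v in window if misfit(v, centroids) > radius]
    alarm = bool(window) and len(outliers) >= min_fraction * len(window)
    return alarm, outliers
```

In a semi-supervised setting, the outliers returned here would go to a human analyst, who decides whether they form a new class worth tracking.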