Monitoring Message Streams: Retrospective and Prospective Event Detection
Fred S. Roberts, DIMACS, Rutgers University

DIMACS is a partnership of:
• Rutgers University
• Princeton University
• AT&T Labs
• Bell Labs
• NEC Research Institute
• Telcordia Technologies
http://dimacs.rutgers.edu
center@dimacs.rutgers.edu
732-445-5928

OBJECTIVE:
Monitor streams of textualized communication to detect pattern changes and "significant" events.
Motivation:
• monitoring of global satellite communications (though this may produce voice rather than text)
• sniffing and monitoring of email traffic

TECHNICAL PROBLEM:
• Given a stream of text in any language.
• Decide whether "new events" are present in the flow of messages.
• Event: a new topic, or a topic with an unusual level of activity.
• Retrospective or "supervised" event identification: classification into pre-existing classes.

MORE COMPLEX PROBLEM: PROSPECTIVE DETECTION, OR "UNSUPERVISED" LEARNING:
• Classes change: new classes appear, or existing classes change meaning.
• A difficult problem in statistics.
• Recent new computer science approaches:
1) An algorithm detects a new class.
2) A human analyst labels it and determines its significance.

COMPONENTS OF AUTOMATIC MESSAGE PROCESSING:
(1) Compression of text -- to meet storage and processing limitations.
(2) Representation of text -- put it in a form amenable to computation and statistical analysis.
(3) Matching scheme -- compute the similarity between documents.
(4) Learning method -- build on judged examples to determine the characteristics of a document cluster (an "event").
(5) Fusion scheme -- combine methods (scores) to yield improved detection/clustering.

STATE OF THE ART:
Best results to date:
• Retrospective detection: David Lewis (2001), using simple support vector machines.
• Prospective detection: results reported by a group from Oracle (2001), using a change-of-basis representation that builds on natural language knowledge.

WHY WE CAN DO BETTER:
• Existing methods use some or all of the 5 automatic processing components, but don't exploit the full power of the components and/or an understanding of how to apply them to text data.
• Lewis' methods used an off-the-shelf support vector machine supervised learner, but tuned it for the frequency properties of the data.
• The combination dominated competing approaches in the TREC-2001 batch filtering evaluation.

WHY WE CAN DO BETTER II:
• Existing methods aim at fitting into available computational resources without paying attention to upfront data compression.
• We hope to do better through a combination of:
  - more sophisticated statistical methods
  - sophisticated data compression in a preprocessing stage
• AltaVista: combining data compression with naïve statistical methods already leads to some success.

COMPRESSION:
• Reduce the dimension of the data before statistical analysis.
• Recent results: "one-pass" methods can reduce volume significantly without degrading performance significantly (e.g., using random projections).
• This is unlike feature-extracting dimension reduction, which can lead to bad results.
• We believe that sophisticated dimension reduction in a preprocessing stage, followed by sophisticated statistical tools in a detection/filtering stage, can be a very powerful approach. (Two illustrative sketches follow.)
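To make the compression idea concrete, here is a minimal sketch of one-pass dimension reduction by random projection. It is our illustration, not the project's code: the messages, the target dimension, and the use of scikit-learn are all assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.random_projection import SparseRandomProjection

# Hypothetical message stream; in practice this would be monitored traffic.
messages = [
    "routine status report for station alpha",
    "meeting scheduled to discuss the shipment",
    "unusual activity observed near the border crossing",
]

# Represent text as high-dimensional term-count vectors.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

# One pass over the data: every document is multiplied by the same random
# matrix, shrinking the dimension while approximately preserving pairwise
# distances (Johnson-Lindenstrauss). No features are inspected or "extracted".
projector = SparseRandomProjection(n_components=4, random_state=0)
X_small = projector.fit_transform(X)

print(X.shape, "->", X_small.shape)  # vocabulary-sized columns -> 4 columns
```

The projected vectors, rather than the raw term vectors, would then feed the detection/filtering stage.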
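In the same spirit, the retrospective (supervised) baseline cited under STATE OF THE ART can be approximated with an off-the-shelf linear support vector machine. This sketch is not Lewis's TREC-2001 system; the toy documents and labels are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Invented judged examples: label 1 = on-topic for a known event class.
train_docs = [
    "explosion reported at the chemical plant",
    "plant explosion injures three workers",
    "quarterly earnings call scheduled for friday",
    "team lunch moved to noon on thursday",
]
train_labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Off-the-shelf linear SVM: the kind of simple supervised learner the
# baseline uses, before any tuning to the frequency properties of the data.
clf = LinearSVC().fit(X_train, train_labels)

new_doc = ["another explosion at an industrial plant overnight"]
print(clf.predict(vectorizer.transform(new_doc)))  # [1]: assigned to the known event
```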
MORE SOPHISTICATED STATISTICAL APPROACHES:
• Representations: Boolean representations; weighting schemes.
• Matching schemes: Boolean matching; nonlinear transforms of individual feature values.
• Learning methods: new kernel-based methods (nonlinear classification); more complex Bayes classifiers to assign objects to the highest-probability class.
• Fusion methods: combining scores based on ranks, linear functions, or nonparametric schemes. (A rank-fusion sketch follows the team roster below.)

THE APPROACH:
• Identify the best combination of newer methods through careful exploration of a variety of tools.
• Address issues of effectiveness (how well the task is done) and efficiency (in computational time and space).
• Use a combination of new or modified algorithms and improved statistical methods built on the algorithmic primitives.

IN LATER YEARS:
• Extend the work to unsupervised learning.
• Still concentrate on new methods for the 5 components.
• Emphasize "semi-supervised learning": human analysts help to focus on the features most indicative of anomaly or change; algorithms assess incoming documents for deviation on those features.
• Develop new techniques to represent data so as to highlight significant deviation:
  - through an appropriately defined metric
  - with new clustering algorithms
  - building on analyst-designated features

DIMACS STRENGTHS:
Strong team: statisticians, computer scientists, and experts in information retrieval and library science.

DAVID MADIGAN, Rutgers Statistics:
NSF project on text classification. An expert on Bayes classifiers; developing extensions beyond Bayes classifiers. (Lewis is his co-PI and a subcontractor on his NSF grant.)

DAVID LEWIS, private consultant:
Best basic batch filtering methods. Extensive experience in text classification.

PAUL KANTOR, Rutgers, Library and Information Science and Operations Research:
Expert on combining multiple methods for classifying candidate documents. Expert on information retrieval and interactive systems -- human input to leverage filtering and processing capabilities.

ILYA MUCHNIK, Rutgers Computer Science:
Developed a fast statistical clustering algorithm that can deal with millions of cases in reasonable time. Pioneered the use of kernel methods for machine learning.

MUTHU MUTHUKRISHNAN, Rutgers Computer Science:
Developed algorithms for making one pass over text documents to gain information about them.

MARTIN STRAUSS, AT&T Labs:
Has new methods for handling data streams whose items are read once, then discarded.

RAFAIL OSTROVSKY, Telcordia Technologies:
Developed dimension reduction methods in the hypercube. Used these in powerful algorithms to detect patterns in streams of data.

ENDRE BOROS, Rutgers Operations Research:
Developed extremely useful methods for Boolean representation and rule learning.

FRED ROBERTS, Rutgers Mathematics and DIMACS:
Developed methods for combining scores in software and hardware testing. Long-standing expertise in decision making and the social sciences.
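Since several of the team's strengths converge on score combination, here is the rank-based fusion sketch promised under MORE SOPHISTICATED STATISTICAL APPROACHES. The two score lists are invented and stand in for any pair of methods whose raw scales are not directly comparable.

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical relevance scores for 5 documents from two different
# matching/learning methods (note the incompatible scales).
svm_scores = np.array([0.9, 0.1, 0.4, 0.8, 0.2])
bayes_scores = np.array([120.0, 5.0, 80.0, 30.0, 10.0])

# Rank-based fusion: convert each method's scores to ranks, then average
# the ranks, so neither method's scale can dominate the combination.
fused = (rankdata(svm_scores) + rankdata(bayes_scores)) / 2.0

# Higher fused rank = stronger combined evidence of relevance.
print(np.argsort(fused)[::-1])  # documents ordered best-first
```

Linear or nonparametric combining schemes would slot into the same place as the rank average.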
DAVID GOLDSCHMIDT, Director, Institute for Defense Analyses - Center for Communications Research:
Advisory role. Long-standing partnership between IDA-CCR and DIMACS. Will sponsor and co-organize a tutorial and workshop on the state of the art in data mining and homeland security to kick off the project.

S.O.W.: FIRST 12 MONTHS:
• Prepare available corpora of data on which to uniformly test different combinations of methods.
• Concentrate on supervised learning and detection.
• Systematically explore and compare combinations of compression schemes, representations, matching schemes, learning methods, and fusion schemes.
• Test combinations of methods on common data sets and exchange information among the team.
• Develop and test promising dimension reduction methods.

S.O.W.: YEARS 2 AND 3:
• Combine leading methods for supervised learning with promising upfront dimension reduction methods.
• Develop research-quality code for the leading identified methods for supervised learning.
• Develop the extension to unsupervised learning (see the sketch at the end of this document):
  - Detect suspicious message clusters before an event has occurred.
  - Use generalized stress measures indicating that a significant group of interrelated messages doesn't fit into the known family of clusters.
  - Concentrate on semi-supervised learning.

IMPACT: 12 MONTHS:
• We will have established a state-of-the-art scheme for classifying accumulated documents in relation to known tasks/targets/themes and for building profiles to track future relevant messages.
• We are optimistic that through end-to-end experimentation we will discover synergies between new mathematical and statistical methods for each of the component tasks, and thus achieve significant improvements in performance on accepted measures that could not be achieved by piecemeal study of one or two component tasks.

IMPACT: 3 YEARS:
• We will have produced prototype code for testing the concepts and a rigorously precise expression of the ideas for translation into a commercial or government system.
• We will have extended our analysis to semi-supervised discovery of potentially interesting clusters of documents.
• This should allow us to identify potentially threatening events in time for cognizant agencies to prevent them from occurring.
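Finally, a minimal sketch of the unsupervised direction described under S.O.W.: YEARS 2 AND 3: fit clusters to past traffic, then treat distance to the nearest known centroid as a crude stand-in for a "stress" measure. The clustering method, the scoring, and all data here are our assumptions, not the proposal's algorithm.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Invented historical traffic forming the "known family of clusters".
history = [
    "weekly logistics summary for depot two",
    "logistics summary for depot five",
    "birthday party invitation for the office",
    "office party rescheduled to friday",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(history)

# Two known clusters (logistics reports, office chatter).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

def stress(doc):
    """Crude stress score: distance to the nearest known centroid."""
    v = vectorizer.transform([doc])
    return km.transform(v).min()

# A new message that fits neither cluster gets a higher stress score,
# flagging it for a human analyst rather than auto-classifying it.
for doc in ["logistics summary for depot nine",
            "acquire materials and await further instructions"]:
    print(f"{stress(doc):.2f}  {doc}")
```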