Monitoring Message Streams: Algorithmic Methods for Automatic Processing of Messages
Fred Roberts, Rutgers University

MMS: Goal
Monitor huge communication streams, in particular streams of textualized communication, to automatically detect pattern changes and "significant" events
Motivation: monitoring email traffic, news, communiques, faxes, voice intercepts (with speech recognition)

MMS: Overall Objectives
• Synergistic improvements in
- Performance in terms of space, time, effectiveness, and/or “insight”
- Understanding the tradeoffs among these types of improvements
- Compression for efficient resource use
- Representation that aids fitting models
- Efficient matching of text to text and model to text
- Learning models from data and prior knowledge
- Reduction in need for large amounts of training data or labor-intensive input
- Fusion of complementary filtering approaches

MMS: Approaches
• Emphasis on “Supervised” Filtering:
- Given example documents, textbook descriptions, etc., find documents on this topic in incoming stream or past data
• Less Emphasis on “Unsupervised” Event Identification:
- Detect emergent characteristics, anomalous patterns, etc. in incoming stream of text or historical statistics on the stream

MMS Approaches: Supervised Filtering
• Batch filtering: All training texts processed before any texts of active interest to user
• Adaptive filtering: User trains system during use
- Value of examples for both information and training must be considered

MMS Approaches: Dealing with Massive Data
• Creating summary statistics on massive data streams
- Detect outliers, heavy hitters (most frequent items), etc.
- Allow us to return to past without keeping raw data
• Reducing need for labeled training examples in supervised classification
- Bayesian priors from domain knowledge
- Tuning on unlabeled data

Accomplishments Phase II (Jan ‘04 – Sep ‘04)
Bayesian Logistic Regression
• Using sparseness-favoring priors, our methods have produced outstanding accuracy and fast predictions with no ad-hoc feature selection
• State of art text classification effectiveness
- Recently: Highest score on TREC 2004 triage task
• Public release of our Bayesian Binary Regression (BBR) software (500 downloads)
[Image: Thomas Bayes]

Accomplishments Phase II (Jan ‘04 – Sep ‘04)
Bayesian Logistic Regression (cont’d)
• Ability to use domain knowledge to set prior distributions led to large improvements in effectiveness when little training data is available
• New online algorithms: online updating of Bayesian models as new data become available

Accomplishments Phase II (Jan ‘04 – Sep ‘04)
Streaming Algorithms
• New sketch-based algorithms for detecting word frequency changes and other patterns in massive text streams
• Rapid methods for finding changing trends, outliers and deviants, rare events, heavy hitters
• Initial results using summarized data to search for meaningful answers to queries about the past
• Initial work on textual and structural patterns in informal communication networks

Accomplishments Phase II (Jan ‘04 – Sep ‘04)
Nearest neighbor classification: Fast implementation
• Continued development of heuristics for approximate neighbor finding with an in-memory inverted index
• Our results have reduced memory by 90% and time by 90 to 99% with minimal impact on effectiveness.
• Packaged and delivered kNN software
• Developing algorithms for speeding up slow but potentially highly effective “local learning” approach
- Based on training a separate logistic regression on the neighbors of each test document!
- Slow, but with many avenues to large speedups

Accomplishments Phase II (Jan ‘04 – Sep ‘04)
Adaptive Filtering
• Models to Aid in Learning: When to act greedily (“exploit”: submit documents we believe relevant) and when to take risks (“explore”: submit documents that may be irrelevant)
• Seek approximate solutions to the intractable optimal exploration/exploitation tradeoff
• Experiments show slight improvements in filtering effectiveness compared to greedy (exploit-only) approach

Some MMS Work in Depth: Bayesian Priors from Domain Knowledge
• Bayesian methods assume prior beliefs about parameters before data is seen
- Project Phase I: generic, vague priors
- Project Phase II: Reference materials or intuitions about words may help predict class. Use these to set priors. (Material very unlike training examples)
• Goal: reduce need for training examples
- Replace 1000’s of randomly sampled examples with few, possibly biased examples

Knowledge-Driven Priors: Issues
• Reference texts have some non-topical words
- Use words that discriminate among topics (use Inverse Document Frequency (IDF) weighting within reference collection)
• Small training sets increase problems with thresholding and text representation
- Use unlabeled data to aid thresholding and to learn IDF weights
- Use separate prior for intercept term of model

Knowledge-Driven Priors: Results
• Topics: 27 Reuters Region categories
• Knowledge: CIA World Factbook (WFB) entries
• Examples: 10/topic
• Baseline results (F1 measure):
- WFB: 0.234, no WFB: 0.052
• Better small training sets, improved algorithms:
- WFB: 0.591, no WFB: 0.395

Knowledge-Driven Priors: Summary
• Reference materials of text type very different from documents to be classified can aid supervised filtering
- In combination with tuning on unlabeled data, this technique can provide immediate practical benefits
• Current methods are crude and ad hoc; substantial improvements should be possible

Some MMS Work in Depth:
Streaming Analysis
• Problem: Monitor fast, massive text streams and support both online tracking and historic analysis for events.
• Multidimensional data: source, destination, time sent or received, metadata (reply, language), text labels (words, phrases), links.
• Goal: To use highly compact summaries that are computed at stream speed and perform accurate analyses.

Streaming Analysis Tool: CM Sketch
• Theoretical: We have developed the CM Sketch, which uses (1/ε) log(1/δ) space to approximate the data distribution with error at most ε and probability of success at least 1-δ.
- All other previously known sample or sketch methods use space at least 1/ε².
- CM Sketch is an order of magnitude better.
• Practical: A few 10's of KBs gives an accurate summary of large data. Create summaries of data that allow historic queries to find
- Heavy Hitters (Most Frequent Items)
- Quantiles of a Distribution (Median, Percentiles, etc.)
- Items with large changes

Streaming Analysis: Using Web Logs
• Web logs (blogs), or regularly updated online journals, provide informal, opinionated, candid data that is more like email than is the web.
• We have begun to automatically collect blogs, stripping formatting and tags, ads, etc., and outputting corresponding "bag of words" into streaming algorithms for analysis, archiving. 10’s to 100 GB scale.
• 3000:1 compression using CM Sketch methods.
• Allows accurate analysis of popular words, new emergent words, etc., including multilingual occurrences.
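
Streaming Analysis: CM Sketch Illustration
The count-min idea behind the CM Sketch slides can be illustrated with a minimal Python sketch. This is not the project's C library. The class name, the from_error helper, and the use of salted BLAKE2 hashes (standing in for the pairwise-independent hash family of the formal analysis) are assumptions made for this example. The sizing rule, width = ceil(e/ε) and depth = ceil(ln(1/δ)), is the standard count-min choice.

```python
import hashlib
import math


class CountMinSketch:
    """Illustrative count-min sketch (hypothetical class, not the MMS C library)."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        # depth rows of width counters; memory does not grow with the stream.
        self.table = [[0] * width for _ in range(depth)]

    @classmethod
    def from_error(cls, eps, delta):
        # Standard sizing: overestimate is at most eps * (stream length)
        # with probability at least 1 - delta, assuming ideal hash functions.
        return cls(math.ceil(math.e / eps), math.ceil(math.log(1.0 / delta)))

    def _index(self, row, item):
        # One salted hash per row (an assumption of this example).
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Update one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum over rows
        # is the tightest upper bound on the true count.
        return min(
            self.table[row][self._index(row, item)]
            for row in range(self.depth)
        )


# Toy "bag of words" stream.
words = "the cat sat on the mat the end".split()
sketch = CountMinSketch.from_error(eps=0.01, delta=0.01)
for word in words:
    sketch.add(word)
print(sketch.estimate("the"))
```

The estimate never falls below the true count. Heavy-hitter, quantile, and change queries of the kind listed on the CM Sketch slide are typically built on top of such point estimates.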

Deliverables: Phase I
• Classic method: Rocchio
• Classic method: Centroid
• kNN with IFH (inverted file heuristic)
• Sparse Bayesian (Bayesian with Laplace priors)
• Combinatorial PCA
• Homotopic Linking of Widely Varying Rocchio Methods
• aiSVM
• Fusion

Deliverables: Phase II
• Revised and extended version of kNN code, including scripts for running local learning experiments
• Substantially extended version of BBR, including use of domain knowledge to set priors
• CM Sketch (C library for count-min sketching)
• Code to use CM Sketch to find heavy hitters, quantiles, and large changes in streams

MMS: Future Directions
Bayesian
• Expand types of domain knowledge usable
- For instance, making use of the taxonomies available in many subject areas
• Improve self-tuning of BBR software
- Make it more effective for novice users
- Surprisingly subtle questions: Crossvalidation, calibration, scaling (e.g., when multiple features)
• Incorporate previous work on online Bayesian methods into BBR

MMS: Future Directions
Streaming
• Systematically explore summarization methods such as sampling, bitmaps, sketches
- Develop warehousing techniques for large scale sketch-based historical analyses
• Massiveness of data implies linear algorithms too inefficient. Seek sublinear methods.
• Develop sketch-based methods for link analysis in temporally changing multigraphs
- From and To addresses in email, links between blogs, etc.
• Add modeling component to the sketch-based analysis: Exploit knowledge of distribution of the data.

MMS: Future Directions
kNN
• kNN with small training sample for each of massive number of topics
- maybe only 5 to 10 known relevant/irrelevant documents
- Since small samples have little overlap, extend kNN approach to deal with partially labeled datasets
• Bayesian kNN
- Incorporate methods developed in our Bayesian work for dealing with small training sets (e.g., tuning thresholds on unlabeled data).
- More fundamental combinations of Bayesian and kNN methods (e.g., tunable distance metrics)

MMS: Future Directions
Greedy Round Robin Feature Selection
• In phase I work: Explored greedy heuristic to choose subset of original set of terms as features
- Did extremely well in TREC2002 “topic intersection tasks”
• Will develop a Greedy Round Robin (GRR) method
- Applies if features fall into two or more “conceptually distinct” sets (e.g., metadata such as source/destination, genre or medium of the message)
- Each list of features is consulted in turn.
- Plan experimental analysis of GRR
- Plan theoretical analysis of GRR using simulation

MMS: Future Directions
Adaptive filtering
• Experiment with new adaptive thresholding methods (synergy with Bayesian thresholding work)
- Scoring threshold is adjusted downward if judging too many irrelevant documents; upward if judging too few relevant documents
• Aim for algorithm with state-of-art effectiveness and provable theoretical properties
• Compare rate of convergence of various algorithms on real data.

MMS PROJECT TEAM:
Paul Kantor, Rutgers Communic., Info. & Library Studies
Dave Lewis, Consultant
Michael Littman, Rutgers CS
David Madigan, Rutgers Statistics
S. Muthukrishnan, Rutgers CS
Rafail Ostrovsky, Telcordia/UCLA
Fred Roberts, Rutgers DIMACS/Math
Martin Strauss, AT&T Labs/U. Michigan
Wen-Hua Ju, Avaya Labs (collaborator)
Andrei Anghelescu, Graduate Student
Suhrid Balakrishnan, Graduate Student
Aynur Dayanik, Graduate Student
Dmitry Fradkin, Graduate Student
Peng Song, Graduate Student
Graham Cormode, postdoc
Alex Genkin, software developer
Vladimir Menkov, software developer