September 2004 PI's Meeting Overview Presentation

Monitoring Message Streams: Algorithmic Methods for Automatic Processing of Messages
Fred Roberts, Rutgers University
1
MMS: Goal
Monitor huge communication streams, in particular streams of textualized communication, to automatically detect pattern changes and "significant" events.
Motivation: monitoring email traffic, news, communiqués, faxes, voice intercepts (with speech recognition)
2
MMS: Overall Objectives
• Synergistic improvements in
- Performance in terms of space, time, effectiveness, and/or “insight”
- Understanding the tradeoffs among these types of improvements
- Compression for efficient resource use
- Representation that aids fitting models
- Efficient matching of text to text and model to text
- Learning models from data and prior knowledge
- Reduction in need for large amounts of training data or labor-intensive input
- Fusion of complementary filtering approaches
3
MMS: Approaches
• Emphasis on “Supervised” Filtering:
- Given example documents, textbook
descriptions, etc., find documents on this topic
in incoming stream or past data
• Less Emphasis on “Unsupervised” Event
Identification:
- Detect emergent characteristics, anomalous
patterns, etc. in incoming stream of text or
historical statistics on the stream
4
MMS Approaches:
Supervised Filtering
• Batch filtering: All
training texts processed
before any texts of
active interest to user
• Adaptive filtering: User
trains system during use
- Value of examples for
both information and
training must be
considered
5
MMS Approaches: Dealing with
Massive Data
• Creating summary statistics on massive data
streams
- Detect outliers, heavy hitters (most frequent items), etc.
- Allow us to return to past without
keeping raw data
• Reducing need for labeled training examples
in supervised classification
- Bayesian priors from domain knowledge
- Tuning on unlabeled data
6
Accomplishments Phase II (Jan ‘04 – Sep ‘04)
Bayesian Logistic Regression
• Using sparseness-favoring priors, our methods
have produced outstanding accuracy and fast
predictions with no ad hoc feature selection
• State-of-the-art text classification effectiveness
- Recently: Highest score on TREC 2004 triage
task
• Public release of our Bayesian Binary Regression
(BBR) software (500 downloads)
7
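The following is an illustrative sketch only, not the project's BBR implementation: it shows the standard correspondence between a Laplace (double-exponential) prior on logistic regression weights and L1-regularized fitting, using scikit-learn on toy documents (the documents, labels, and settings here are hypothetical).

```python
# Illustrative sketch only: a Laplace prior on logistic regression weights
# corresponds to L1-regularized maximum a posteriori estimation.
# Uses scikit-learn, not the project's BBR package.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["oil prices rise in asia", "new vaccine trial results", "markets rally on oil news"]
labels = [1, 0, 1]  # hypothetical topic labels (1 = on-topic)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# penalty='l1' plays the role of a Laplace prior; C controls its spread.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
clf.fit(X, labels)

# Sparseness: most learned weights are exactly zero, so no separate
# ad hoc feature selection step is needed.
print((clf.coef_ != 0).sum(), "non-zero features of", X.shape[1])
```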
Accomplishments Phase II (Jan ‘04 – Sep ‘04)
Bayesian Logistic Regression (cont’d)
• Ability to use domain knowledge to set prior
distributions led to large improvements in
effectiveness when little training data is available
• New online algorithms: online updating of Bayesian
models as new data become available
8
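The slide above does not spell out the online algorithms, so the following is only a minimal sketch of the general idea: update a logistic model one example at a time as the stream arrives, with a Gaussian prior acting as shrinkage. The learning rate, prior precision, and simulated stream are illustrative assumptions.

```python
# Minimal sketch (not the project's algorithm): online logistic regression
# updated one example at a time, with a Gaussian prior acting as L2 shrinkage.
import numpy as np

def online_update(w, x, y, lr=0.1, prior_precision=0.01):
    """One stochastic gradient step on the negative log posterior.
    x: feature vector, y: label in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))   # predicted probability
    grad = (p - y) * x + prior_precision * w   # likelihood + prior gradients
    return w - lr * grad

rng = np.random.default_rng(0)
w = np.zeros(5)
for _ in range(1000):                          # simulated stream
    x = rng.normal(size=5)
    y = int(x[0] + 0.5 * x[1] > 0)             # hypothetical true rule
    w = online_update(w, x, y)
print(np.round(w, 2))
```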
Accomplishments Phase II (Jan ‘04 – Sep ‘04)
Streaming Algorithms
• New sketch-based algorithms for detecting word frequency changes and other patterns in massive text streams
• Rapid methods for finding changing trends, outliers and deviants, rare events, heavy hitters
• Initial results using summarized data to search for meaningful answers to queries about the past
• Initial work on textual and structural patterns in informal communication networks
9
Accomplishments
Phase II (Jan ‘04 – Sep ‘04)
Nearest neighbor classification: Fast
implementation
• Continued development of heuristics for approximate
neighbor finding with an in-memory inverted index
• Our results have reduced memory by 90% and time by
90 to 99% with minimal impact on effectiveness.
• Packaged and delivered kNN software
• Developing algorithms for speeding up slow but
potentially highly effective “local learning” approach
- Based on training a separate logistic regression on
the neighbors of each test document!
- Slow, but with many avenues to large speedups
10
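The specific heuristics in the delivered kNN code are not described on this slide; the sketch below illustrates only the basic idea of approximate neighbor finding through an in-memory inverted index, scoring just those training documents that share a term with the test document. The toy documents and labels are hypothetical.

```python
# Illustrative sketch of kNN scoring via an in-memory inverted index:
# only training documents sharing at least one term with the query are scored.
from collections import defaultdict
from heapq import nlargest

# training docs as sparse term->weight dicts, with class labels (toy data)
train = [({"oil": 0.8, "price": 0.5}, "econ"),
         ({"vaccine": 0.9, "trial": 0.4}, "health"),
         ({"oil": 0.6, "market": 0.7}, "econ")]

index = defaultdict(list)                      # term -> [(doc_id, weight), ...]
for doc_id, (vec, _) in enumerate(train):
    for term, w in vec.items():
        index[term].append((doc_id, w))

def knn_label(query, k=2):
    scores = defaultdict(float)                # accumulate dot products
    for term, qw in query.items():
        for doc_id, w in index.get(term, []):
            scores[doc_id] += qw * w
    top = nlargest(k, scores.items(), key=lambda kv: kv[1])
    votes = defaultdict(float)
    for doc_id, s in top:                      # weighted vote among neighbors
        votes[train[doc_id][1]] += s
    return max(votes, key=votes.get) if votes else None

print(knn_label({"oil": 0.7, "market": 0.3}))  # -> 'econ'
```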
Accomplishments
Phase II (Jan ‘04 – Sep ‘04)
Adaptive Filtering
• Models to Aid in Learning: When to act
greedily (“exploit” -- submit documents
we believe relevant) and when to take risks
(“explore” -- submit documents that may be irrelevant)
• Seek approximate solutions to the
intractable optimal exploration/exploitation
tradeoff
• Experiments show slight improvements in
filtering effectiveness compared to greedy
(exploit-only) approach
11
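The project's approximate solution to this tradeoff is not detailed on the slide; as a simple stand-in, the sketch below uses an epsilon-greedy rule: usually submit only documents scoring above the threshold (exploit), and occasionally submit a borderline document to obtain a training judgment (explore). The threshold, epsilon, and margin values are illustrative assumptions.

```python
# Simple epsilon-greedy stand-in (not the project's algorithm) for the
# explore/exploit decision in adaptive filtering.
import random

def decide_submit(score, threshold=0.5, epsilon=0.1):
    """Exploit: submit documents scoring above threshold.
    Explore: occasionally submit a borderline document to gain training data."""
    if score >= threshold:
        return True                            # exploit: likely relevant
    if random.random() < epsilon and score >= threshold - 0.2:
        return True                            # explore: risk a near-miss
    return False

random.seed(0)
stream_scores = [0.9, 0.42, 0.1, 0.55, 0.38]
print([decide_submit(s) for s in stream_scores])
```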
Some MMS Work in Depth: Bayesian
Priors from Domain Knowledge
• Bayesian methods assume prior beliefs
about parameters before data is seen
- Project Phase I: generic, vague priors
- Project Phase II: Reference materials or
intuitions about words may help predict class.
Use these to set priors. (Material very unlike
training examples)
• Goal: reduce need for training examples
- Replace 1000s of randomly sampled examples with a few, possibly biased examples
12
Knowledge-Driven Priors: Issues
• Reference texts have some non-topical words
- Use words that discriminate among topics (use Inverse Document Frequency (IDF) weighting within the reference collection)
• Small training sets increase problems with thresholding and text representation
- Use unlabeled data to aid thresholding and to learn IDF weights
- Use a separate prior for the intercept term of the model
13
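One plausible reading of the scheme above, sketched with hypothetical details rather than the project's exact method: words that are distinctive for a topic's reference entry, as measured by IDF within the reference collection itself, receive informative prior means, while all other words keep a vague zero-mean prior.

```python
# Hypothetical sketch of turning reference texts into prior means:
# words distinctive to a topic's reference entry (high IDF within the
# reference collection) get an informative positive prior mean.
import math
from collections import Counter

reference = {                              # topic -> reference text (toy data)
    "france":  "paris france wine europe",
    "japan":   "tokyo japan yen asia",
    "brazil":  "brasilia brazil coffee america",
}

# IDF computed within the reference collection itself
doc_freq = Counter()
for text in reference.values():
    doc_freq.update(set(text.split()))
n_docs = len(reference)
idf = {t: math.log(n_docs / df) for t, df in doc_freq.items()}

def prior_means(topic, scale=1.0):
    """Prior mean for each word: IDF-weighted if it appears in the topic's
    reference entry, zero (vague prior) otherwise."""
    words = set(reference[topic].split())
    return {t: scale * idf[t] for t in words}

print(prior_means("france"))
```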
Knowledge-Driven Priors: Results
• Topics: 27 Reuters Region categories
• Knowledge: CIA World Factbook (WFB) entries
• Examples: 10/topic
• Baseline results (F1 measure):
- With WFB priors: 0.234; without WFB: 0.052
• With better small training sets and improved algorithms:
- With WFB priors: 0.591; without WFB: 0.395
14
Knowledge-Driven Priors: Summary
• Reference materials of text type very different
from documents to be classified can aid
supervised filtering
- In combination with tuning on unlabeled data,
this technique can provide immediate practical
benefits
• Current methods are crude and ad hoc –
substantial improvements should be possible
15
Some MMS Work in Depth:
Streaming Analysis
• Problem: Monitor fast, massive text streams and support both online tracking and historic analysis of events.
• Multidimensional data: source,
destination, time sent or received,
metadata (reply, language), text
labels (words, phrases), links.
• Goal: Use highly compact summaries that are computed at stream speed and support accurate analyses.
16
Streaming Analysis Tool: CM Sketch
• Theoretical: We have developed the CM Sketch, which uses (1/ε)·log(1/δ) space to approximate a data distribution with error at most ε and probability of success at least 1−δ.
– All other previously known sample or sketch methods use space at least 1/ε².
– CM Sketch is an order of magnitude better.
• Practical: A few tens of KBs give an accurate summary of large data. Create summaries of data that allow historic queries to find
– Heavy hitters (most frequent items)
– Quantiles of a distribution (median, percentiles, etc.)
– Items with large changes
17
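The project deliverable is a C library; the compact Python sketch below only illustrates how the count-min idea gives the stated bounds: roughly e/ε counters per row and ln(1/δ) rows, with a point query returning the minimum counter, which overestimates a true count by at most ε times the total stream size with probability at least 1−δ. The hashing here (Python's built-in hash with a per-row seed) is a simplification.

```python
# Compact Python sketch of the count-min idea (the project deliverable is a
# C library). Width ~ e/epsilon counters per row, depth ~ ln(1/delta) rows;
# a point query returns the minimum counter over the rows.
import math

class CountMinSketch:
    def __init__(self, epsilon=0.001, delta=0.01):
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1.0 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _cells(self, item):
        for row in range(self.depth):
            yield row, hash((row, item)) % self.width   # simple per-row hash

    def update(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        return min(self.table[row][col] for row, col in self._cells(item))

cms = CountMinSketch(epsilon=0.01, delta=0.01)
for word in ["attack", "attack", "weather", "attack", "market"]:
    cms.update(word)
print(cms.estimate("attack"))   # 3 (estimates never fall below the true count)
```

A heavy-hitter or large-change query can then be answered by comparing estimates against a fraction of the running total, or against a second sketch built over an earlier time window.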
Streaming Analysis: Using Web Logs
• Web logs (blogs), or regularly updated online journals, provide informal, opinionated, candid data that is more like email than the web is.
• We have begun to automatically collect blogs, stripping formatting, tags, ads, etc., and feeding the corresponding "bag of words" into streaming algorithms for analysis and archiving. 10s of GB to 100 GB scale.
• 3000:1 compression using CM Sketch
methods.
• Allows accurate analysis of popular
words, new emergent words, etc.,
including multilingual occurrences.
18
Deliverables: Phase I
• Classic method: Rocchio
• Classic method: Centroid
• kNN with IFH (inverted file
heuristic)
• Sparse Bayesian (Bayesian with
Laplace priors)
• Combinatorial PCA
• Homotopic Linking of Widely
Varying Rocchio Methods
• aiSVM
• Fusion
19
Deliverables: Phase II
• Revised and extended version of kNN code,
including scripts for running local learning
experiments
• Substantially extended version of BBR, including
use of domain knowledge to set priors
• CM Sketch (C library for count-min sketching)
• Code to use CM Sketch to find heavy hitters,
quantiles, and large changes in streams
20
MMS: Future Directions
Bayesian
• Expand types of domain knowledge usable
- For instance, making use of the taxonomies
available in many subject areas
• Improve self-tuning of BBR software
- Make it more effective for novice users
- Surprisingly subtle questions: cross-validation, calibration, scaling (e.g., when there are multiple features)
• Incorporate previous work on online Bayesian methods into BBR
21
MMS: Future Directions
Streaming
• Systematically explore summarization methods such
as sampling, bitmaps, sketches
- Develop warehousing techniques for large scale
sketch-based historical analyses
• The massiveness of the data means even linear algorithms are too inefficient; we seek sublinear methods.
• Develop sketch-based methods for link analysis in
temporally changing multigraphs
- From and To addresses in email, links between
blogs, etc.
• Add a modeling component to the sketch-based analysis: exploit knowledge of the distribution of the data.
22
MMS: Future Directions
kNN
• kNN with small training sample for each of massive
number of topics
- maybe only 5 to 10 known
relevant/irrelevant documents
- Since small samples have little overlap, extend kNN
approach to deal with partially labeled datasets
• Bayesian kNN
- Incorporate methods developed in our Bayesian work
for dealing with small training sets (e.g., tuning
thresholds on unlabeled data).
- More fundamental combinations of Bayesian and kNN methods (e.g., tunable distance metrics)
23
MMS: Future Directions
Greedy Round Robin Feature Selection
• In Phase I work: explored a greedy heuristic to choose a subset of the original set of terms as features
- Did extremely well in TREC 2002 “topic intersection tasks”
• Will develop a Greedy Round Robin (GRR) method
- Applies if features fall into two or more
“conceptually distinct” sets (e.g., metadata such as
source/destination, genre or medium of the message)
- Each list of features is consulted in turn.
- Plan experimental analysis of GRR
- Plan theoretical analysis of GRR using simulation
24
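GRR is planned work, so the sketch below is only a hypothetical illustration of the round-robin idea described above: the feature lists are consulted in turn, and each turn greedily adds the feature from the current list that most improves an evaluation function. The evaluate callback and toy data are assumptions.

```python
# Hypothetical sketch of a Greedy Round Robin (GRR) selection loop: feature
# lists (e.g., word features vs. metadata features) are consulted in turn,
# and each turn greedily adds the feature from that list that most improves
# an evaluation function supplied by the caller.
def greedy_round_robin(feature_lists, evaluate, budget):
    """feature_lists: list of lists of candidate features.
    evaluate(selected): returns a score for a candidate feature set."""
    selected = []
    remaining = [list(fl) for fl in feature_lists]
    turn = 0
    while len(selected) < budget and any(remaining):
        candidates = remaining[turn % len(remaining)]
        if candidates:
            best = max(candidates, key=lambda f: evaluate(selected + [f]))
            if evaluate(selected + [best]) > evaluate(selected):
                selected.append(best)          # keep only improving features
            candidates.remove(best)
        turn += 1
    return selected

# Toy demonstration: the score is simply how many "useful" features are chosen.
useful = {"oil", "sender_domain"}
score = lambda feats: len(useful & set(feats))
print(greedy_round_robin([["oil", "price"], ["sender_domain", "length"]], score, budget=3))
```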
MMS: Future Directions
Adaptive filtering
• Experiment with new adaptive thresholding
methods (synergy with Bayesian thresholding
work)
- Scoring threshold is adjusted upward if too many of the submitted documents are judged irrelevant, and downward if too few relevant documents are being found
• Aim for algorithm with state-of-art effectiveness
and provable theoretical properties
• Compare rate of convergence of various
algorithms on real data.
25
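As a toy illustration of the thresholding intuition above (not the planned algorithm), the rule below raises the threshold when recently submitted documents are mostly judged irrelevant and lowers it otherwise; the target precision and step size are arbitrary choices.

```python
# Toy illustration (not the project's method) of adaptive thresholding:
# raise the threshold when submitted documents are mostly irrelevant,
# lower it when enough relevant documents are being submitted.
def adapt_threshold(threshold, recent_judgments, target_precision=0.5, step=0.02):
    """recent_judgments: list of True/False relevance judgments on
    recently submitted documents."""
    if not recent_judgments:
        return threshold - step           # submitting nothing: loosen
    precision = sum(recent_judgments) / len(recent_judgments)
    if precision < target_precision:
        return threshold + step           # too many irrelevant: tighten
    return threshold - step               # enough relevant: loosen slightly

t = 0.5
t = adapt_threshold(t, [True, False, False, False])   # mostly irrelevant
print(round(t, 2))                                     # 0.52
```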
MMS PROJECT TEAM:
Paul Kantor, Rutgers Communication, Information & Library Studies
Dave Lewis, Consultant
Michael Littman, Rutgers CS
David Madigan, Rutgers Statistics
S. Muthukrishnan, Rutgers CS
Rafail Ostrovsky, Telcordia/UCLA
Fred Roberts, Rutgers DIMACS/Math
Martin Strauss, AT&T Labs/U. Michigan
Wen-Hua Ju, Avaya Labs (collaborator)
Andrei Anghelescu, Graduate Student
Suhrid Balakrishnan, Graduate Student
Aynur Dayanik, Graduate Student
Dmitry Fradkin, Graduate Student
Peng Song, Graduate Student
Graham Cormode, postdoc
Alex Genkin, software developer
Vladimir Menkov, software developer
26