Using Text Mining Techniques to Help Bring Electronic Discovery Under Control

Text Mining 2010
Hello, my Name Is...
Bruce
...Ask me about CLUSTERING
http://tinyurl.com/cat-siam2010
1. Background of eDiscovery environment
2. Search analytics through result side clustering
3. Unsupervised feature extraction through NMF
4. Catalyst clustering engine
5. Shortcuts we have taken
Agenda
1. Working in eDiscovery
Time is an elusive commodity
http://www.foxnews.com/ucat/images/276600_time-management-clock.jpg
Documents are plentiful and messy
(and mostly electronic and multi-language unlike this photo)
http://carianddoug.com/blog.htm/wp-content/uploads/2010/01/bigstockphoto_Messy_Desk_30391.jpg
Pressure is stressful
http://www.americanairworks.com/images/dial_a_pressure.gif
Gamesmanship is always in play
http://pauillac.inria.fr/~doligez/go/Kitani_Go_284.jpg
Deadlines are real and costly
http://depot.gdnet.org/cms/gallery//44-iStock_000006154366Medium.jpg
• Time is not a lawyer's friend
• Documents are numerous
• Pressure is high
• Gamesmanship
• Real deadlines
eDiscovery Environment
2. Search Analytics: Result Side Clustering
Getting Information Without Drowning
http://employers-rx.com/storage/WomanDrowning%20in%20Paperwork.jpg
How to find the key concepts?
http://www.geronto.at/Links/hauptteil_links.html
documents + vectorizer
"doc vector" or DV
http://www.ultimatespecialtiesllc.com/Pictures/IMG_0247.JPG
• Vectorizer counts term frequency over phrases (n=2 to 6)
• Boosts can be applied for proper nouns, dictionary validation, entity validation, and customer dictionary validation
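A minimal sketch of the vectorizer idea, assuming a naive whitespace tokenizer and a hypothetical `boosts` dictionary (phrase to multiplier). The actual Catalyst tokenizer, boost values, and validation dictionaries are not shown in these slides. The sketch counts phrase frequencies for n from 2 to 6, applies any boosts, and rescales so the top phrase scores 1.

```python
from collections import Counter

def vectorize(text, n_min=2, n_max=6, boosts=None, top_k=10):
    """Return [(phrase, weight)] sorted by weight, with the top weight scaled to 1."""
    boosts = boosts or {}              # hypothetical: phrase -> multiplier
    tokens = text.lower().split()      # assumption: naive whitespace tokenizer
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    # Apply boosts (1.0 when a phrase has none), then normalize by the top score.
    scored = {p: c * boosts.get(p, 1.0) for p, c in counts.items()}
    if not scored:
        return []
    top = max(scored.values())
    ranked = sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(p, w / top) for p, w in ranked[:top_k]]

doc_vector = vectorize(
    "client products and licensed products and client products",
    n_max=2,
    boosts={"licensed products": 1.5},
)
```

Here "client products" occurs twice and becomes the top phrase, while the boosted "licensed products" lands above the unboosted single-occurrence phrases.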
[netscape, 1][client products, 0.915179][aol, 0.73283][agreement, 0.451923]
[licensed products, 0.420673][section, 0.37294][aol affiliates, 0.338255]
[standard client, 0.305632][client product, 0.293613][integrated client,
0.281593][customized integrated, 0.276442][licensed product, 0.276442]
[customized client, 0.235234][source code, 0.233516][netscape standard, 0.21978]
[var, 0.195742][attachment, 0.181319][source documentation, 0.163118][premium
client, 0.161401][netscape client, 0.149382][aol affiliate, 0.147665][aol
service, 0.145948][electronic distribution, 0.142514][effective date, 0.137363]
[products, 0.136676][bundled distribution, 0.135646][distribution, 0.135302]
[license fee, 0.130495][set forth, 0.128777][joint marketing, 0.12706][marketing
agreement, 0.12706][intellectual property, 0.125343][customized standard,
0.123626][golden master, 0.113324][art player, 0.106456][player plug-in,
0.103022][netscape premium, 0.103022][online information, 0.103022][aol
services, 0.101305][property rights, 0.101305][netscape products, 0.0995879]
[development services, 0.0961538][netscape communications, 0.0961538]
[confidential, 0.0899725][communications corporation, 0.0892857][america online,
0.0875687][provided, 0.0872253][end-user license, 0.0858516][distributor price,
0.0858516][distributors, 0.0851648][party, 0.0831044][customized premium,
0.0824176][netscape server, 0.0824176][support services, 0.0772665][license
agreements, 0.0772665][netscape control, 0.0738324][standard distributor,
0.0738324][right, 0.0728022][specification nonconformity, 0.0721154][release
number, 0.0721154][speculative distribution, 0.0721154][parties, 0.0721154]
[initial term, 0.0703984][license, 0.0693681][beta version, 0.0686813]
[confidential information, 0.0686813][netscape trademark, 0.0686813][respect,
0.0686813][classic service, 0.0652473][united states, 0.0652473][aol classic,
0.0652473][development, 0.0645604][customized products, 0.0635302][software,
0.0625][license fees, 0.0618132][specification nonconformities, 0.0618132]
This is the vector summary of the document.
[netscape, 1][client products, 0.915179][aol, 0.73283][agreement, 0.451923][licensed products, 0.420673][section, 0.37294][aol affiliates,
0.338255][standard client, 0.305632][client product, 0.293613][integrated client, 0.281593][customized integrated, 0.276442][licensed product,
0.276442][customized client, 0.235234][source code, 0.233516][netscape standard, 0.21978][var, 0.195742][attachment, 0.181319][source
documentation, 0.163118][premium client, 0.161401][netscape client, 0.149382][aol affiliate, 0.147665][aol service, 0.145948][electronic
distribution, 0.142514][effective date, 0.137363][products, 0.136676][bundled distribution, 0.135646][distribution, 0.135302][license fee,
0.130495][set forth, 0.128777][joint marketing, 0.12706][marketing agreement, 0.12706][intellectual property, 0.125343][customized standard,
0.123626][golden master, 0.113324][art player, 0.106456][player plug-in, 0.103022][netscape premium, 0.103022][online information, 0.103022][aol
services, 0.101305][property rights, 0.101305][netscape products, 0.0995879][development services, 0.0961538][netscape communications, 0.0961538]
[confidential, 0.0899725][communications corporation, 0.0892857][america online, 0.0875687][provided, 0.0872253][end-user license, 0.0858516]
[distributor price, 0.0858516][distributors, 0.0851648][party, 0.0831044][customized premium, 0.0824176][netscape server, 0.0824176][support
services, 0.0772665][license agreements, 0.0772665][netscape control, 0.0738324][standard distributor, 0.0738324][right, 0.0728022][specification
nonconformity, 0.0721154][release number, 0.0721154][speculative distribution, 0.0721154][parties, 0.0721154][initial term, 0.0703984][license,
0.0693681][beta version, 0.0686813][confidential information, 0.0686813][netscape trademark, 0.0686813][respect, 0.0686813][classic service,
0.0652473][united states, 0.0652473][aol classic, 0.0652473][development, 0.0645604][customized products, 0.0635302][software, 0.0625][license
fees, 0.0618132][specification nonconformities, 0.0618132][server products, 0.0600962][unbundled distribution, 0.0600962][localized versions,
0.0583791][custom development, 0.0583791][information services, 0.0583791][parties agree, 0.0583791][event, 0.0583791][terms, 0.0576923][netscape
website, 0.053228][beta versions, 0.053228][netscape navigator, 0.053228][mean, 0.051511][promotion, 0.0501374][aol personnel, 0.049794]
[information service, 0.0480769][netscape trademarks, 0.0480769][netscape iapps, 0.0480769][favored price, 0.0480769][end-users, 0.0467033]
[registration server, 0.0463599][license agreement, 0.0463599][netscape registration, 0.0446429][subject, 0.0446429][iapps products, 0.0446429]
Information Reduction
http://www.ctlawtribune.com/images/articleimages/document_fan.jpg
/// DESCRIPTION: Results-side clustering of DocVectors
///
/// DATE: 20-Sep-2007
///
/// AUTHOR: Reed Esau
///
/// The basic theory that lies behind this implementation...
///
/// w(s, d) = [TF(s, d)]**alpha * [IDF(s, D)]**beta
///
/// where w(s,d) is the tf-idf weight of a term s in a document d.
/// Note that the document is represented by an array of tf-weighted
/// terms, normalized between 0..1. The D is the corpus of all terms -
/// all terms in the result set. Alpha and beta are configurable
/// constants to skew in favor of TF or IDF. Salt to taste.
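The weighting formula in the comment above can be sketched as follows. The slide does not define IDF, so this sketch assumes the common form log(N / df) computed over the result set; the function names and the sample result set are hypothetical.

```python
import math

def idf(term, doc_vectors):
    """Assumed IDF: log(N / df) over the result set D (not specified on the slide)."""
    df = sum(1 for dv in doc_vectors if term in dv)
    return math.log(len(doc_vectors) / df) if df else 0.0

def weight(term, dv, doc_vectors, alpha=1.0, beta=1.0):
    """w(s, d) = TF(s, d)**alpha * IDF(s, D)**beta, with TF normalized to 0..1."""
    tf = dv.get(term, 0.0)
    return (tf ** alpha) * (idf(term, doc_vectors) ** beta)

# Hypothetical result set: each document is a dict of tf-weighted terms.
result_set = [
    {"netscape": 1.0, "aol": 0.73},
    {"netscape": 0.5, "source code": 0.9},
    {"netscape": 0.2},
]
w_common = weight("netscape", result_set[0], result_set)
w_rare = weight("aol", result_set[0], result_set)
```

Under this assumed IDF, a term that appears in every document of the result set ("netscape" here) gets weight 0, while a rarer term keeps most of its TF weight. Raising alpha above 1 shrinks weights for terms whose normalized TF is below 1.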
Exploring Concepts
RSC in Product
3. Feature Extraction using Non-Negative Matrix Factorization
This may be review for many of you...
"(They'll) want to know what's hot in machine learning (nonnegative matrix factorization is so 1998! Latent
Dirichlet Allocation is all the rage now)" - Michael C. Mozer
http://depressionetal.files.wordpress.com/2009/05/004.jpg
The collection of documents
n+1
We try to approximate this
Reduction & Sampling
[Diagram labels: All documents (DVs); N-Grams (netscape, client products, aol, agreement, licensed products, section, aol affiliates, standard client, client product); Representative Sample (DVs)]
                  DV   DV   DV   DV   DV   DV
netscape          33   33   44   42   37   18
client products   44   34  110   38   23   45
aol               42   44   38   61   51   12
agreement         37   54   23   51   91   12
Matrix (N-Gram frequency in Reduced Docs)
Math is the new black
(photo courtesy of pioneering work with Wolfram|Alpha)
Factorization
converges on a set of features
                  DV   DV   DV   DV   DV   DV
netscape          33   33   44   42   37   18
client products   44   34  110   38   23   45
aol               42   44   38   61   51   12
agreement         37   54   23   51   91   12
Matrix (N-Gram freq)

                  C1   C2   C3   C4
netscape           4    5    6    1
client products    3    2    4    9
aol                2    0    3    3
agreement          2    9    0    0
Features

      C1   C2   C3   C4
DV     4    3    2    2
DV     3    5    2    1
DV     5    2    0    9
DV     6    4    3    0
DV     1    9    3    0
DV     0    0    4    5
Weights
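A sketch of the factorization step, assuming NumPy and the multiplicative update rules from Lee and Seung's "Algorithms for Non-negative Matrix Factorization" (listed under People & Papers). The slides do not say which update rule or cost function the Catalyst factorizer uses, so this is a stand-in, and `nmf`, `k`, and `iters` are hypothetical names. The input is the illustrative N-gram frequency matrix from the slide.

```python
import numpy as np

def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Approximate V (n-grams x docs) as W @ H with W, H >= 0.

    W (n-grams x k) holds the features; H (k x docs) holds per-document weights.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k)) + eps
    H = rng.random((k, V.shape[1])) + eps
    costs = [float(np.linalg.norm(V - W @ H))]   # reconstruction error before any update
    for _ in range(iters):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        costs.append(float(np.linalg.norm(V - W @ H)))
    return W, H, costs

V = np.array([
    [33, 33, 44, 42, 37, 18],   # netscape
    [44, 34, 110, 38, 23, 45],  # client products
    [42, 44, 38, 61, 51, 12],   # aol
    [37, 54, 23, 51, 91, 12],   # agreement
], dtype=float)
W, H, costs = nmf(V, k=2)
```

The columns of `W` play the role of the Features table and `H.T` plays the role of the Weights table. The `costs` list shows how the reconstruction error changes from one loop to the next, which is one way to judge when further iterations stop paying off.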
We call these features ‘Centroids’
      aol  agreement  licensed  section  aol affiliates
C1      4          3         2        2               0
C2      5          2         0        9               3
C3      6          4         3        0               1
C4      1          9         3        0               1
Features
Centroids: C1, C2, C3, C4
What is a centroid?
C1 = aol browser search james clark navigator netscape netscape navigator time warner
A list of weighted N-Grams
Each centroid will be used to ‘attract’ documents into a cluster
0| msn, 100; anthony bay, 63; platform, 62; chris jones, 60; richard wolf, 58; jim barksdale ...
1| e mail message, 100; chris montgomery, 92; information contained, 63; tony adams, 56 ...
2| 外観 チェック, 100; 酸化 膜, 80; コンタクト pr, 66; 膜 成長, 49; 寸法 チェック, 49; マク...
3| control data, 100; commercial credit, 31; data corporation, 16; share, 14; financial services ...
4| microsoft smtpsvc, 100; microsoft exchange, 99; esmtp id, 82; mxlogic, 57; caseshare, 49...
5| microsoft corporation, 100; boot process, 48; microsoft licensing, 47; initial boot, 46 ...
6| north america, 100; enron north, 87; america corp, 79; smith street, 54; ect, 41; sara ...
7| active desktop, 100; section, 20; confidential information, 14; channel guide, 14; active ...
8| java, 99; java runtime, 17; ben slivka, 17; trident, 16; java rad, 16; apple, 16; mac, 15; sun ...
9| intel, 100; paul maritz, 34; brad silverberg, 27; frank gill, 26; exchange, 25; highly confidential ...
10| bill gates, 100; paul maritz, 96; kempin cc, 92; steve ballmer, 91; exchange, 85; bill veghte, 85 ...
11| maintenance window, 100; viawest maintenance, 55; window notifications, 38; viawest ...
12| edward mccreight, 100; 日本 アジア, 22; アジア 証券, 22; 武田 秀樹, 21; attachments ...
Actual Centroids
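A sketch of how one feature column could be turned into a centroid line in the style of the list above. The rescaling (top n-gram set to 100, others rounded proportionally) is an assumption based on how the listed weights look; the slides do not describe how the displayed weights are derived. Both function names are hypothetical.

```python
def centroid_from_feature(ngrams, feature_column, top_k=5):
    """Rank n-grams by feature weight and rescale so the top n-gram scores 100."""
    peak = max(feature_column)
    if peak <= 0:
        return []
    ranked = sorted(zip(ngrams, feature_column), key=lambda p: -p[1])
    # Drop zero-weight n-grams; they contribute nothing to the centroid.
    return [(g, round(100 * w / peak)) for g, w in ranked[:top_k] if w > 0]

def format_centroid(index, centroid):
    """Render a centroid in the 'index| n-gram, weight; ...' style of the list above."""
    return f"{index}| " + "; ".join(f"{g}, {w}" for g, w in centroid)

ngrams = ["netscape", "client products", "aol", "agreement"]
c2 = [5, 2, 0, 9]  # the C2 column of the illustrative Features table
line = format_centroid(1, centroid_from_feature(ngrams, c2))
```

With the illustrative C2 column, "agreement" becomes the anchor n-gram at 100 and "aol" drops out because its weight is zero.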
4. Catalyst Clustering Engine, aka Putting Centroids to Work
• The “engine” reference is simply a metaphor
• Catalyst cannot provide any mechanical work for your automobile, motorcycle, lawn mower, or scooter
Hyperbole Alert
• Search derived
• Highly structured query
• Custom rank profiles
• Overlap and tiebreaker resolution
• Optional hand tuning and refinement
• Any external tool can define a centroid
What Is the Clustering Engine?
Centroids attract documents
http://jtownsendcgu.files.wordpress.com/2008/10/non-linear-22.jpg
[Diagram labels: C1, C2, C3, C4]
Documents and centroids
Note that the centroids are not really part of the collection
[Diagram labels: C1, C2, C3, C4, and n+1 (no centroid, therefore no affinity)]
One of these things is not like the other:
n+1 cluster
http://corianne.files.wordpress.com/2010/01/818.jpg
• Not all documents will match a centroid
• Clustering engine allows scoring threshold to determine membership
• Put all the noise into one cluster
• Multiple NMF processes can be run against different filtered sets
N+1 Cluster
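A sketch of threshold-based membership with an n+1 cluster. It assumes a simple dot-product affinity between a doc vector and a centroid; the deck describes the real engine as search derived with custom rank profiles and tiebreaker resolution, none of which is modeled here. The names `score`, `assign`, and `threshold` are hypothetical.

```python
def score(doc_vector, centroid):
    """Assumed affinity: dot product over the n-grams the two share."""
    return sum(w * centroid[g] for g, w in doc_vector.items() if g in centroid)

def assign(doc_vectors, centroids, threshold):
    """Map each doc id to its best centroid index, or to the n+1 cluster.

    Assumes at least one centroid. Ties go to the lowest centroid index.
    """
    noise = len(centroids)  # index of the n+1 cluster
    clusters = {}
    for doc_id, dv in doc_vectors.items():
        scores = [score(dv, c) for c in centroids]
        best = max(range(len(centroids)), key=lambda i: scores[i])
        clusters[doc_id] = best if scores[best] >= threshold else noise
    return clusters

centroids = [
    {"netscape": 1.0, "aol": 0.7},
    {"java": 1.0, "java runtime": 0.2},
]
docs = {
    "d1": {"netscape": 0.9, "agreement": 0.4},
    "d2": {"java": 0.8, "apple": 0.1},
    "d3": {"maintenance window": 1.0},
}
clusters = assign(docs, centroids, threshold=0.5)
```

Document "d3" shares no n-grams with either centroid, so it falls below the threshold and lands in the n+1 cluster instead of being forced into a poor match.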
5. Shortcuts
An Exploratory Analysis of Phrases in Text Retrieval, Jeremy Pickens and W. Bruce Croft
A Formal Derivation of Heaps’ Law, D.C. van Leijenhorst, Th.P. van der Weide
Algorithms for Non-negative Matrix Factorization, Daniel D. Lee, H. Sebastian Seung
Andrey A Puretskiy
Dr Michael W Berry
People & Papers
tori.l.wells@enron.com
pershing
ramsay
ministerial
klay@enron.com
aspen
joannie.williamson@enron.com
jeff.skilling@enron.com
eharris@insightparnters.com
murdock
LSI
dinners
fox
sada
Validation was quantitative
Acceptance was qualitative
Easy to quantify different, hard to quantify better
http://thundafunda.com/33/animals-pictures-nature/dare-to-be-different-pictures.jpg
• Vectorizer (long documents, short documents, OCR)
• Factorizer loop count (diminishing returns, diff-cost may be local minima)
• Feature limit (many clusters may be too confusing)
• N+1 (don't aim for noisy completeness)
• Query expression can determine corpus (custodian=jct)
Choices, Optimizations, Shortcuts
numbers@catalystsecure.com
Email Questions, Leads, Hints, Resumes