Using Text Mining Techniques to Help Bring Electronic Discovery Under Control
Text Mining 2010

Hello, my name is... Bruce
...Ask me about CLUSTERING
http://tinyurl.com/cat-siam2010

Agenda
1. Background of eDiscovery environment
2. Search analytics through result side clustering
3. Unsupervised feature extraction through NMF
4. Catalyst clustering engine
5. Shortcuts we have taken

1 Working in eDiscovery

Time is an elusive commodity
http://www.foxnews.com/ucat/images/276600_time-management-clock.jpg

Documents are plentiful and messy (and mostly electronic and multi-language, unlike this photo)
http://carianddoug.com/blog.htm/wp-content/uploads/2010/01/bigstockphoto_Messy_Desk_30391.jpg

Pressure is stressful
http://www.americanairworks.com/images/dial_a_pressure.gif

Gamesmanship is always in play
http://pauillac.inria.fr/~doligez/go/Kitani_Go_284.jpg

Deadlines are real and costly
http://depot.gdnet.org/cms/gallery//44-iStock_000006154366Medium.jpg

eDiscovery Environment
• Time is not a lawyer's friend
• Documents are numerous
• Pressure is high
• Gamesmanship
• Real deadlines

2 Search Analytics: Result Side Clustering

Getting Information Without Drowning
http://employers-rx.com/storage/WomanDrowning%20in%20Paperwork.jpg

How to find the key concepts?
http://www.geronto.at/Links/hauptteil_links.html

documents + vectorizer = "doc vector" or DV
http://www.ultimatespecialtiesllc.com/Pictures/IMG_0247.JPG

• Vectorizer counts term frequency over phrases (n=2 to 6)
• Boosts can be applied in the vectorizer for proper nouns, dictionary validation, entity validation, and customer dictionary validation

This is the vector summary of the document.

[netscape, 1][client products, 0.915179][aol, 0.73283][agreement, 0.451923]
[licensed products, 0.420673][section, 0.37294][aol affiliates, 0.338255]
[standard client, 0.305632][client product, 0.293613][integrated client, 0.281593]
[customized integrated, 0.276442][licensed product, 0.276442][customized client, 0.235234]
[source code, 0.233516][netscape standard, 0.21978][var, 0.195742][attachment, 0.181319]
[source documentation, 0.163118][premium client, 0.161401][netscape client, 0.149382]
[aol affiliate, 0.147665][aol service, 0.145948][electronic distribution, 0.142514]
[effective date, 0.137363][products, 0.136676][bundled distribution, 0.135646]
[distribution, 0.135302][license fee, 0.130495][set forth, 0.128777][joint marketing, 0.12706]
[marketing agreement, 0.12706][intellectual property, 0.125343][customized standard, 0.123626]
[golden master, 0.113324][art player, 0.106456][player plug-in, 0.103022]
[netscape premium, 0.103022][online information, 0.103022][aol services, 0.101305]
[property rights, 0.101305][netscape products, 0.0995879][development services, 0.0961538]
[netscape communications, 0.0961538][confidential, 0.0899725][communications corporation, 0.0892857]
[america online, 0.0875687][provided, 0.0872253][end-user license, 0.0858516]
[distributor price, 0.0858516][distributors, 0.0851648][party, 0.0831044]
[customized premium, 0.0824176][netscape server, 0.0824176][support services, 0.0772665]
[license agreements, 0.0772665][netscape control, 0.0738324][standard distributor, 0.0738324]
[right, 0.0728022][specification nonconformity, 0.0721154][release number, 0.0721154]
[speculative distribution, 0.0721154][parties, 0.0721154][initial term, 0.0703984]
[license, 0.0693681][beta version, 0.0686813][confidential information, 0.0686813]
[netscape trademark, 0.0686813][respect, 0.0686813][classic service, 0.0652473]
[united states, 0.0652473][aol classic, 0.0652473][development, 0.0645604]
[customized products, 0.0635302][software, 0.0625][license fees, 0.0618132]
[specification nonconformities, 0.0618132][server products, 0.0600962]
[unbundled distribution, 0.0600962][localized versions, 0.0583791][custom development, 0.0583791]
[information services, 0.0583791][parties agree, 0.0583791][event, 0.0583791][terms, 0.0576923]
[netscape website, 0.053228][beta versions, 0.053228][netscape navigator, 0.053228]
[mean, 0.051511][promotion, 0.0501374][aol personnel, 0.049794][information service, 0.0480769]
[netscape trademarks, 0.0480769][netscape iapps, 0.0480769][favored price, 0.0480769]
[end-users, 0.0467033][registration server, 0.0463599][license agreement, 0.0463599]
[netscape registration, 0.0446429][subject, 0.0446429][iapps products, 0.0446429]

Information Reduction
http://www.ctlawtribune.com/images/articleimages/document_fan.jpg

/// DESCRIPTION: Results-side clustering of DocVectors
///
/// DATE: 20-Sep-2007
///
/// AUTHOR: Reed Esau
///
/// The basic theory that lays behind this implementation...
///
/// w(s, d) = [TF(s, d)]**alpha * [IDF(s, D)]**beta
///
/// where w(s,d) is the tf-idf weight of a term s in a document d.
/// Note that the document is represented by an array of tf-weighted
/// terms, normalized between 0..1. The D is the corpus of all terms -
/// all terms in the result set. Alpha and beta are configurable
/// constants to skew in favor of TF or IDF. Salt to taste.

Exploring Concepts

RSC in Product

3 Feature Extraction using Non Negative Matrix Factorization

This may be review for many of you...
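Backup note for the result-side clustering weight in section 2: the formula w(s, d) = [TF(s, d)]**alpha * [IDF(s, D)]**beta can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Catalyst implementation: the smoothed IDF form, the function names `idf` and `weight`, and the three tiny doc vectors are all made up for the example. Each document is a mapping from n-gram to a TF weight already normalized to 0..1, and D is the result set.

```python
import math

def idf(term, docs):
    """Smoothed inverse document frequency over the result set (assumed form)."""
    df = sum(1 for d in docs if term in d)
    return math.log((1 + len(docs)) / (1 + df)) + 1.0

def weight(term, doc, docs, alpha=1.0, beta=1.0):
    """w(s, d) = TF(s, d)**alpha * IDF(s, D)**beta; alpha and beta skew toward TF or IDF."""
    tf = doc.get(term, 0.0)
    if tf == 0.0:
        return 0.0  # a term absent from the document carries no weight
    return (tf ** alpha) * (idf(term, docs) ** beta)

# Hypothetical result set: three doc vectors with TF weights in 0..1.
result_set = [
    {"netscape": 1.0, "aol": 0.73},
    {"netscape": 0.5, "java": 1.0},
    {"netscape": 0.2, "intel": 1.0},
]

for term in ("netscape", "aol"):
    print(term, round(weight(term, result_set[0], result_set), 3))
```

In this toy result set "netscape" appears in every document, so its IDF factor contributes nothing extra, while the rarer "aol" is boosted. Setting beta=0 reduces the weight to plain TF.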
"(They'll) want to know what's hot in machine learning (nonnegative matrix factorization is so 1998! Latent Dirichlet Allocation is all the rage now)" - Michael C. Mozer http://depressionetal.files.wordpress.com/2009/05/004.jpg The collection of documents n+1 We try to approximate this DV DV DV DV DV DV All documents Reduction & Sampling netscape client products aol agreement licensed products section aol affiliates standard client client product N-Grams netscape client products aol agreement licensed products section aol affiliates standard client client product N-Grams DV DV DV DV DV DV Representative Sample netscape client products aol agreement DV 33 44 42 37 DV 33 34 44 54 DV 44 110 38 23 DV 42 38 61 51 DV 37 23 51 91 DV 18 45 12 12 Matrix (N-Gram frequency in Reduced Docs) Math is the new black (photo courtesy of pioneering work with Wolfram|Alpha) Factorization converges on a set of features netscape client products aol agreement DV 33 44 42 37 DV 33 34 44 54 DV 44 110 38 23 DV DV DV 42 37 18 38 23 45 61 51 12 51 91 12 Matrix (N-Gram freq) netscape client products aol agreement C1 4 3 2 2 C2 5 2 0 9 C3 6 4 3 0 C4 1 9 3 0 Features C1 C2 C3 C4 DV 4 3 2 2 DV 3 5 2 1 DV 5 2 0 9 DV 6 4 3 0 DV 1 9 3 0 DV 0 0 4 5 Weights We call these features ‘Centroids’ aol agreement licensed section aol affiliates C1 4 3 2 2 0 C2 5 2 0 9 3 C3 6 4 3 0 1 C4 1 9 3 0 1 Features Centroids C2 C2 C3 C4 What is a centroid? C1 = aol browser search james clark navigator netscape netscape navigator time warner A list of weighted N-Grams Each centroid will be used to ‘attract’ documents into a cluster 0| msn, 100; anthony bay, 63; platform, 62; chris jones, 60; richard wolf, 58; jim barksdale ... 1| e mail message, 100; chris montgomery, 92; information contained, 63; tony adams, 56 ... 2| 外観 チェック, 100; 酸化 膜, 80; コンタクト pr, 66; 膜 成長, 49; 寸法 チェック, 49; マク... 3| control data, 100; commercial credit, 31; data corporation, 16; share, 14; financial services ... 
4| microsoft smtpsvc, 100; microsoft exchange, 99; esmtp id, 82; mxlogic, 57; caseshare, 49... 5| microsoft corporation, 100; boot process, 48; microsoft licensing, 47; initial boot, 46 ... 6| north america, 100; enron north, 87; america corp, 79; smith street, 54; ect, 41; sara ... 7| active desktop, 100; section, 20; confidential information, 14; channel guide, 14; active ... 8| java, 99; java runtime, 17; ben slivka, 17; trident, 16; java rad, 16; apple, 16; mac, 15; sun ... 9| intel, 100; paul maritz, 34; brad silverberg, 27; frank gill, 26; exchange, 25; highly confidential ... 10| bill gates, 100; paul maritz, 96; kempin cc, 92; steve ballmer, 91; exchange, 85; bill veghte, 85 ... 11| maintenance window, 100; viawest maintenance, 55; window notifications, 38; viawest ... 12| edward mccreight, 100; 日本 アジア, 22; アジア 証券, 22; 武田 秀樹, 21; attachments ... Actual Centroids 4 Catalyst Clustering Engine aka Putting Centroids to Work • The “engine” reference is simply a metaphor • Catalyst cannot provide any mechanical work for your automobile, motorcycle, lawn mower, or scooter Hyperbole Alert • Search derived • Highly structured query • Custom rank profiles • Overlap and tiebreaker resolution • Optional hand tuning and refinement • Any external tool can define a centroid What Is the Clustering Engine? Centroids attract documents http://jtownsendcgu.files.wordpress.com/2008/10/non-linear-22.jpg C1 C4 C3 C2 Documents and centroids Note that the centroids are not really part of the collection C4 C2 C3 C1 n+1 (no centroid, therefore no affinity) One of these things is not like the other: n+1cluster http://corianne.files.wordpress.com/2010/01/818.jpg • Not all documents will match a centroid • Clustering engine allows scoring threshold to determine membership • Put all the noise into one cluster • Multiple NMF processes can be run against different filtered sets N+1 Cluster 5 Shortcuts An Exploratory Analysis of Phrases in Text Retrieval, Jeremy Pickens and W. 
Bruce Croft A Formal Derivation of Heaps’ Law, D.C. van Leijenhorst, Th.P. van der Weide Algorithms for Non-negative Matrix Factorization, Daniel D. Lee, H. Sebastian Seung Andrey A Puretskiy Dr Michael W Berry People & Papers tori.l.wells@enron.com pershing ramsay ministerial klay@enron.com aspen joannie.williamson@enron.com jeff.skilling@enron.com eharris@insightparnters.com murdock LSI dinners fox sada Validation was quantitative Acceptance was qualitative Easy to quantify different, hard to quantify better http://thundafunda.com/33/animals-pictures-nature/dare-to-be-different-pictures.jpg • Vectorizer (long documents, short documents, OCR) • Factorizer loop count (diminishing returns, diff-cost may be local minima) • Feature limit (many clusters may be too confusing) • N+1 (don't aim for noisy completeness) • Query expression can determine corpus (custodian=jct) Choices, Optimizations, Shortcuts numbers@catalystsecure.com Email Questions, Leads, Hints, Resumes
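
Backup: a sketch of sections 3 and 4

For readers who want to see the moving parts, here is a minimal Python sketch of factoring the N-Gram frequency matrix into Weights and Features (centroids), then letting centroids attract documents with a scoring threshold. It uses the multiplicative update rules from the Lee and Seung paper listed under People & Papers. Everything else is an assumption for illustration: the function names `nmf` and `assign`, the iteration count, and especially the cosine-similarity scoring, which stands in for the search-derived Catalyst engine and is not how that engine works. Documents scoring below the threshold get the label -1, playing the role of the N+1 cluster.

```python
import numpy as np

def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Approximate V (docs x n-grams) as W @ H with non-negative factors."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps   # Weights: docs x features
    H = rng.random((k, m)) + eps   # Features (centroids): features x n-grams
    for _ in range(iters):         # fixed loop count; returns diminish quickly
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def assign(V, H, threshold=0.5):
    """Label each doc with its best-scoring centroid, or -1 (N+1) below threshold."""
    Vn = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)
    Hn = H / (np.linalg.norm(H, axis=1, keepdims=True) + 1e-12)
    sims = Vn @ Hn.T               # cosine similarity, docs x centroids (assumed scoring)
    best = sims.argmax(axis=1)
    return np.where(sims.max(axis=1) >= threshold, best, -1)

# The illustrative N-Gram frequency matrix from the section 3 slides.
V = np.array([
    [33,  44, 42, 37],
    [33,  34, 44, 54],
    [44, 110, 38, 23],
    [42,  38, 61, 51],
    [37,  23, 51, 91],
    [18,  45, 12, 12],
], dtype=float)

W, H = nmf(V, k=4)
print("reconstruction error:", np.linalg.norm(V - W @ H))
print("cluster labels:", assign(V, H))
```

The loop count and the threshold are the knobs named on the Choices, Optimizations, Shortcuts slide: more iterations buy little after a point, and a higher threshold sends more documents to N+1 rather than forcing noisy completeness.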