The Effects of Tabular-based Content Extraction on Patent Document Clustering
Denise R. Koessler, Benjamin W. Martin, Bruce E. Kiefer, Michael W. Berry
SIAM Text Mining Workshop
April 27th, 2012

Motivation
e-DISCOVERY and PATENT TROLLS

eDiscovery
[Figure: Annual Number of Sanctions Cases and Sanctions Awards. Series: Total Cases Sanctioned, Total Cases Awarded. Source: Duke Law]

Patent Trolls
Source: Beta Beat: The Low Down on High Tech

Patent Trolls
[Figure: Patents Granted, Patent Cases Filed. 2011 US Patent Litigation Study]

Why Trolls and Text Mining?
1. There is an explosion of data
2. Current methods need to be revamped
3. Verification…?

Project Description
A DEEP DIVE INTO AUTOMATING PATENT PROCESSING

Project Overview

IARPA Project Overview
• Metadata Extraction and Ground Truth Summary
• Content Understanding with Visual Gaze Tracking

Data Set
• Created by Catalyst
• Source: USPTO
• 200 GB in size
• 2.5 million files
  – HTML
  – PDF
  – TIFF
• Very unstructured and irregular!

Project Overview
• Data Reduction
• Metadata Extraction
• Content Analysis

Data Reduction
• Number of irrelevant files: 158,000 (6.4%)
  – HTML without full text
  – Blank PDFs
  – Duplicate TIFFs

Metadata Extraction
• HTML → XML
• Extracts data from all 22 patent fields and stores the data as XML
• Provides count of objects in patent:
  – Figures
  – Tables
  – Equations
  – Excluded objects
• Bonus: 14 GB (21.2%) size reduction in text data

HTML → XML Tool
2) Metadata Extraction
→ Extraction scripts are available upon request

Putting it all together
ANALYSIS and METHODS

Content Analysis
• 13% of all documents contain tables

Content Analysis
1. Do the tables contribute to the understanding of the patent’s contents?
File with a table
• Clean
• Tables Removed
• Tables Only

Content Analysis
2. Do “numerical words” in patents affect a document’s similarity?
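The file variants used in the two Content Analysis questions (clean, tables removed, tables only, each with or without numerical words) could be produced along the following lines. This is only a sketch using Python's standard library, not the authors' extraction scripts. The names `TableSplitter`, `make_variants`, and `drop_numerical_words` are hypothetical, and the rule "a numerical word is any token containing a digit" is an assumption inferred from the slide examples.

```python
from html.parser import HTMLParser


class TableSplitter(HTMLParser):
    """Collect text chunks in document order, flagging those inside a <table>."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth of currently open <table> elements
        self.chunks = []  # list of (in_table, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text:
            self.chunks.append((self.depth > 0, text))


def make_variants(html):
    """Return the clean, tables-removed, and tables-only text of one file."""
    parser = TableSplitter()
    parser.feed(html)
    parser.close()
    chunks = parser.chunks
    return {
        "clean": " ".join(t for _, t in chunks),
        "tables_removed": " ".join(t for in_table, t in chunks if not in_table),
        "tables_only": " ".join(t for in_table, t in chunks if in_table),
    }


def drop_numerical_words(text):
    """Assumed rule: discard every token that contains at least one digit."""
    return " ".join(
        tok for tok in text.split() if not any(ch.isdigit() for ch in tok)
    )


# Hypothetical patent fragment, for illustration only.
SAMPLE = (
    "<p>A valve assembly</p>"
    "<table><tr><td>a2</td><td>12 mm</td></tr></table>"
    "<p>filed 1988Watson</p>"
)

if __name__ == "__main__":
    for name, text in make_variants(SAMPLE).items():
        print(name, "| with #s:", text, "| no #s:", drop_numerical_words(text))
```

Collecting the text chunks once, in document order, keeps the three variants consistent with each other: the clean text is exactly the interleaving of the tables-removed and tables-only text.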
(Examples: 2545866march, 1988Watson, a2, a3, a4)
File with a table
• Clean: With #s, No #s
• Tables Removed: With #s, No #s
• Tables Only: With #s, No #s

Tool: Text to Matrix Generator
D. Zeimpekis and E. Gallopoulos: University of Patras, Greece

Analysis
[Figure: Effect of Maximum Local Term Frequency on Dictionary Size. x-axis: Maximum Local Term Frequency; y-axis: Number of Words in Dictionary (shown in logarithmic scale). Series: With Numerical Data, Without Numerical Data. Labeled words: 2545866march, 1988Watson, a2, a3, a4]

Analysis
Parameter     Implemented Value
Local Max     Infinity
Local Min     2
Global Max    N–1
Global Min    2
Stop words    It depends… ftp://ftp.cs.cornell.edu/pub/smart/english.stop

Text Mining Models
Model                 Local Document Indexing    Global Corpus Indexing
Term Frequency        Term Frequency             None
Binary IDF            Binary                     IDF
Term Frequency IDF    Term Frequency             IDF

$\mathrm{IDF}(t_i) = \log(N / n_i)$

RESULTS

Results
Models without Numerical Words
Model             Min. % Change    Max % Change    Avg. % Change    Standard Deviation
Term Frequency    0.0              81.8            4.1              6.8
Binary IDF        0.9 x 10^-4      74.7            5.3              7.2
Term Freq. IDF    0.5 x 10^-3      93.7            6.1              8.3
Percent changes in document content shown between patents with tables and patents without tables

Results
[Figure: Clustered Patent Classifications. x-axis: Document Number, without numerical words; y-axis: Cluster Assignment. Series: Clean Files, Tables Removed]

Results
Models with Numerical Words
Model             Min. % Change    Max % Change    Avg. % Change    Standard Deviation
Term Frequency    0.0              81.8            4.1              6.8
Binary IDF        0.2 x 10^-2      74.9            5.5              7.6
Term Freq. IDF    0.4 x 10^-2      93.7            5.9              8.3
Percent changes in document content shown between patents with tables and patents without tables

Results
[Figure: Clustered Patent Classifications. x-axis: Document Number, with numerical words; y-axis: Cluster Assignment. Series: Clean Files, Tables Removed]

Results
Comparison of Numerical Words
Model             Avg. % Difference    % Difference in Standard Deviation
Term Frequency    0.0                  0.0
Binary IDF        3.7                  5.4
Term Freq. IDF    3.3                  0.0
Percent difference shown between models with numerical words and models without numerical words.

Results
Comparison between Models With “Numerical Words” vs. without “Numerical Words”

Final Thoughts…
CONCLUSIONS

Research Conclusions
Simple models are insightful!

Future Work
• Data Reduction
• Metadata Extraction
• Content Analysis
• Verification

Summary
• eDiscovery Cases
• Patent Cases

Why Trolls and Text Mining?
1. There is an explosion of data
2. Current methods need to be revamped
3. Verification…?

Thank you!
Dr. Michael W. Berry, CISML
Dr. Songhua Xu, ORNL
Bruce Kiefer, Catalyst
Dr. D. Zeimpekis, TMG
Dr. E. Gallopoulos, TMG
Benjamin Martin

Questions?

Analysis
[Figure: Effect of Minimum Local Term Frequency on Dictionary Size. x-axis: Minimum Local Term Frequency; y-axis: Number of Words in Dictionary. Series: With Numerical Data, Without Numerical Data. Labeled words: data, end, found, low, small, time, valu]

3) Analysis
HTML Table Analysis:
Directory    Total Files    Files with Tables
00           16,386         2,155
01           16,382         2,148
02           16,302         2,157
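
As a quick check on the HTML Table Analysis counts, the share of files containing tables can be worked out per directory and over the three directories together. The sketch below only redoes the arithmetic on the numbers shown in the table. Whether the "13% of all documents contain tables" figure from the Content Analysis slide was computed this way, or over more directories, is an assumption, and the name `table_share` is hypothetical.

```python
# (directory, total files, files with tables), copied from the HTML Table Analysis table
ROWS = [
    ("00", 16386, 2155),
    ("01", 16382, 2148),
    ("02", 16302, 2157),
]


def table_share(total_files, files_with_tables):
    """Percentage of files that contain at least one table."""
    return 100.0 * files_with_tables / total_files


# Per-directory shares
per_directory = {d: table_share(total, tables) for d, total, tables in ROWS}

# Pooled share over the three directories listed on the slide
overall = table_share(
    sum(total for _, total, _ in ROWS),
    sum(tables for _, _, tables in ROWS),
)

if __name__ == "__main__":
    for directory, share in per_directory.items():
        print(f"directory {directory}: {share:.2f}% of files contain tables")
    print(f"all three directories: {overall:.2f}%")
```

Each directory comes out slightly above 13%, which is in line with the rounded figure quoted earlier in the talk.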