The Effects of Tabular-based Content Extraction on Patent Document Clustering

Denise R. Koessler, Benjamin W. Martin,
Bruce E. Kiefer, Michael W. Berry
SIAM Text Mining Workshop
April 27th, 2012
Motivation
e-DISCOVERY and
PATENT TROLLS
eDiscovery
[Figure: Annual Number of Sanctions Cases and Sanctions Awards; series: Total Cases Sanctioned, Total Cases Awarded]
Source: Duke Law
Patent Trolls
Source: Beta Beat: The Low Down on High Tech
Patent Trolls
[Figure: Patents Granted and Patent Cases Filed]
2011 US Patent Litigation Study
Why Trolls and Text Mining?
1.  There is an explosion of data
2.  Current methods need to be revamped
3.  Verification…?
Project Description
A DEEP DIVE INTO AUTOMATING
PATENT PROCESSING
Project Overview
IARPA Project Overview
Metadata Extraction and Ground Truth Summary
Content Understanding with Visual Gaze Tracking
Data Set
•  Created by Catalyst
•  Source: USPTO
•  200 GB in size
•  2.5 million files
–  HTML
–  PDF
–  TIFF
•  Very unstructured and irregular!
Project Overview
1) Data Reduction
2) Metadata Extraction
3) Content Analysis
Data Reduction
•  Number of irrelevant files: 158,000 (6.4%)
–  HTML without full text
–  Blank PDFs
–  Duplicate TIFFs
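The slides do not say how irrelevant files were detected. As a minimal sketch, assuming duplicates are byte-for-byte copies and "blank" means empty or whitespace-only content, a content hash is enough to flag both. The function name `find_irrelevant` and the input shape (file name mapped to raw bytes) are hypothetical. A genuinely blank PDF still contains PDF structure, so real blank-PDF detection would need a PDF parser, which this sketch does not attempt.

```python
import hashlib

def find_irrelevant(files):
    """Return names of files that are blank or exact duplicates.

    `files` maps a file name to its raw bytes (hypothetical input shape).
    The first file seen with a given digest is kept; later copies are flagged.
    """
    seen = {}          # digest -> first file name with that content
    irrelevant = []
    for name in sorted(files):
        data = files[name]
        if not data.strip():              # zero-length or whitespace only
            irrelevant.append(name)
            continue
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            irrelevant.append(name)       # exact duplicate of seen[digest]
        else:
            seen[digest] = name
    return irrelevant

sample = {
    "a.tiff": b"\x49\x49\x2a\x00page-one",
    "b.tiff": b"\x49\x49\x2a\x00page-one",   # duplicate of a.tiff
    "c.html": b"   \n",                      # blank
    "d.html": b"<html>full text</html>",
}
print(find_irrelevant(sample))
```

Hashing keeps memory use proportional to the number of distinct files rather than their size, which matters for a 200 GB collection.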
Metadata Extraction
•  HTML → XML
•  Extracts data from all 22 patent fields and stores the data as XML
•  Provides count of objects in patent:
–  Figures
–  Tables
–  Equations
–  Excluded objects
•  Bonus: 14 GB (21.2%) size reduction in text data
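The authors' extraction scripts are not reproduced in the slides. The following is only an illustrative sketch of the HTML → XML idea using the Python standard library: it extracts one field (the title) and counts two object types. The element names `patent`, `objects`, `tables` and `figures`, and the choice of `<img>` as a stand-in for figures, are assumptions, not the authors' schema.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class ObjectCounter(HTMLParser):
    """Collect the <title> text and count <table> and <img> tags."""
    def __init__(self):
        super().__init__()
        self.counts = {"tables": 0, "figures": 0}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.counts["tables"] += 1
        elif tag == "img":                 # assumption: one <img> per figure
            self.counts["figures"] += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def html_to_xml(html):
    """Return an XML string holding the title and the object counts."""
    parser = ObjectCounter()
    parser.feed(html)
    parser.close()
    root = ET.Element("patent")            # hypothetical element names
    ET.SubElement(root, "title").text = parser.title.strip()
    objects = ET.SubElement(root, "objects")
    for name, n in parser.counts.items():
        ET.SubElement(objects, name).text = str(n)
    return ET.tostring(root, encoding="unicode")

doc = ("<html><head><title>Widget</title></head>"
       "<body><table></table><img src='f1.png'></body></html>")
print(html_to_xml(doc))
```

Dropping presentation markup and keeping only fields and counts is consistent with the size reduction the slide reports, though the exact savings depend on the real schema.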
HTML → XML Tool
2) Metadata Extraction
→ Extraction scripts are available upon request
Putting it all together
ANALYSIS and METHODS
Content Analysis
•  13% of all documents contain tables
Content Analysis
1.  Do the tables contribute to the understanding of the patent’s contents?
[Diagram: File with a table → Clean, Tables Removed, Tables Only]
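To make the three versions concrete, here is a small sketch that derives them from one HTML file. It assumes "Clean" means the full text with tables retained, "Tables Removed" means text outside any `<table>` element, and "Tables Only" means text inside one. The class and function names are hypothetical and this is not the authors' tool.

```python
from html.parser import HTMLParser

class TableSplitter(HTMLParser):
    """Route text to `inside` or `outside` depending on <table> nesting."""
    def __init__(self):
        super().__init__()
        self.depth = 0                      # current <table> nesting depth
        self.inside, self.outside, self.everything = [], [], []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        self.everything.append(text)
        (self.inside if self.depth else self.outside).append(text)

def make_variants(html):
    """Return the three text versions of one document."""
    p = TableSplitter()
    p.feed(html)
    p.close()
    return {
        "clean": " ".join(p.everything),
        "tables_removed": " ".join(p.outside),
        "tables_only": " ".join(p.inside),
    }

html = ("<p>A pump assembly.</p>"
        "<table><tr><td>flow</td><td>12</td></tr></table>"
        "<p>See claims.</p>")
variants = make_variants(html)
for name, text in variants.items():
    print(name, "->", text)
```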
Content Analysis
2.  Do “numerical words” in patents affect a document’s similarity? (2545866march, 1988Watson, a2, a3, a4)
[Diagram: File with a table → Clean, Tables Removed, Tables Only; each version prepared with #s and with no #s]
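The slides give examples of "numerical words" but no formal definition. Judging only from those examples, a workable assumption is "any token that contains at least one digit". The sketch below builds the "with #s" and "no #s" token lists under that assumption; the tokenizer pattern and function names are illustrative.

```python
import re

TOKEN = re.compile(r"[A-Za-z0-9]+")

def is_numerical_word(token):
    """True if the token contains at least one digit (assumed definition)."""
    return any(ch.isdigit() for ch in token)

def tokenize(text, keep_numerical=True):
    """Split text into alphanumeric tokens, optionally dropping numerical words."""
    tokens = TOKEN.findall(text)
    if keep_numerical:
        return tokens
    return [t for t in tokens if not is_numerical_word(t)]

text = "2545866march 1988Watson a2 a3 a4 valve assembly"
with_numbers = tokenize(text, keep_numerical=True)
without_numbers = tokenize(text, keep_numerical=False)
print(with_numbers)
print(without_numbers)
```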
Tool: Text to Matrix Generator
D. Zeimpekis and E. Gallopoulos: University of Patras, Greece
Analysis
[Figure: Effect of Maximum Local Term Frequency on Dictionary Size. x-axis: Maximum Local Term Frequency; y-axis: Number of Words in Dictionary, shown in logarithmic scale; series: Without Numerical Data, With Numerical Data; example numerical words: 2545866march, 1988Watson, a2, a3, a4]
Analysis
Parameter     Implemented Value
Local Max     Infinity
Local Min     2
Global Max    N–1
Global Min    2
Stop words    It depends…
ftp://ftp.cs.cornell.edu/pub/smart/english.stop
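As a rough illustration of how such thresholds prune a dictionary, the sketch below applies them to tokenized documents. It is not TMG code. It assumes that "local" frequency is a term's count within one document and that "global" frequency is the number of documents in which the term survives the local filter; TMG's own definitions may differ. Stop-word removal is left out.

```python
import math
from collections import Counter

def build_dictionary(docs, local_min=2, local_max=math.inf,
                     global_min=2, global_max=None):
    """docs: list of token lists. Returns the sorted list of retained terms."""
    n_docs = len(docs)
    if global_max is None:
        global_max = n_docs - 1             # the "N-1" setting from the table
    doc_freq = Counter()
    for tokens in docs:
        local = Counter(tokens)
        # A term counts for this document only if its local frequency
        # falls inside [local_min, local_max].
        kept = {t for t, f in local.items() if local_min <= f <= local_max}
        doc_freq.update(kept)
    # Keep terms whose document count falls inside [global_min, global_max].
    return sorted(t for t, df in doc_freq.items()
                  if global_min <= df <= global_max)

docs = [
    ["valve", "valve", "pump", "pump", "seal"],
    ["valve", "valve", "pump", "pump", "pump"],
    ["valve", "valve", "gear", "gear"],
]
print(build_dictionary(docs))
```

In this toy corpus "seal" fails the local minimum, "gear" appears in only one document, and "valve" appears in all N documents, so only "pump" is retained. With a global maximum of N–1, a term present in every document carries no discriminating power and is dropped.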
Text Mining Models
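Model                 Local (Document Indexing)   Global (Corpus Indexing)
Term Frequency        Term Frequency              None
Binary IDF            Binary                      IDF
Term Frequency IDF    Term Frequency              IDF
$\mathrm{IDF}(t_i) = \log\left(\frac{N}{n_i}\right)$, where $N$ is the number of documents in the corpus and $n_i$ is the number of documents containing term $t_i$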
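The three models differ only in which local and global weights are multiplied together. The sketch below computes all three for a toy corpus. It assumes the natural logarithm (the slide writes only "log") and no normalization; `weight_documents` is a hypothetical helper, not part of TMG.

```python
import math
from collections import Counter

def weight_documents(docs, local="tf", use_idf=False):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    local   -- "tf" for raw term frequency, "binary" for presence (1.0)
    use_idf -- multiply by log(N / n_i) when True, by 1.0 otherwise
    """
    n_docs = len(docs)
    # n_i: number of documents that contain each term
    doc_freq = Counter(t for tokens in docs for t in set(tokens))
    weighted = []
    for tokens in docs:
        row = {}
        for term, freq in Counter(tokens).items():
            local_w = 1.0 if local == "binary" else float(freq)
            global_w = math.log(n_docs / doc_freq[term]) if use_idf else 1.0
            row[term] = local_w * global_w
        weighted.append(row)
    return weighted

docs = [["valve", "valve", "pump"], ["valve", "gear"]]
tf = weight_documents(docs, local="tf", use_idf=False)          # Term Frequency
bin_idf = weight_documents(docs, local="binary", use_idf=True)  # Binary IDF
tf_idf = weight_documents(docs, local="tf", use_idf=True)       # Term Frequency IDF
print(tf[0], bin_idf[0], tf_idf[0])
```

Note that "valve" occurs in both documents, so its IDF is log(2/2) = 0 and it vanishes from both IDF-weighted models.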
RESULTS
Results
Models without Numerical Words
Model            Min. % Change   Max % Change   Avg. % Change   Standard Deviation
Term Frequency   0.0             81.8           4.1             6.8
Binary IDF       0.9 x 10^-4     74.7           5.3             7.2
Term Freq. IDF   0.5 x 10^-3     93.7           6.1             8.3
Percent changes in document content shown between patents with tables and patents without tables
Results
[Figure: Clustered Patent Classifications. x-axis: Document Number, without numerical words; y-axis: Cluster Assignment; series: Clean Files, Tables Removed]
Results
Models with Numerical Words
Model            Min. % Change   Max % Change   Avg. % Change   Standard Deviation
Term Frequency   0.0             81.8           4.1             6.8
Binary IDF       0.2 x 10^-2     74.9           5.5             7.6
Term Freq. IDF   0.4 x 10^-2     93.7           5.9             8.3
Percent changes in document content shown between patents with tables and patents without tables
Results
[Figure: Clustered Patent Classifications. x-axis: Document Number, with numerical words; y-axis: Cluster Assignment; series: Clean Files, Tables Removed]
Results
Comparison of Numerical Words
Model            Avg. % Difference   % Difference in Standard Deviation
Term Frequency   0.0                 0.0
Binary IDF       3.7                 5.4
Term Freq. IDF   3.3                 0.0
Percent difference shown between models with numerical words and models without numerical words.
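The slides do not state which formula produced these percent differences. One candidate that reproduces every value in the comparison table from the two Results tables is the symmetric percent difference, |a − b| divided by the mean of a and b. The sketch below checks this; treat the formula as an inference from the reported numbers, not a confirmed description of the authors' method.

```python
def percent_difference(a, b):
    """Symmetric percent difference: |a - b| relative to the mean of a and b."""
    if a == b:
        return 0.0
    return abs(a - b) / ((a + b) / 2) * 100

# (without numerical words, with numerical words), copied from the Results tables
avg_change = {
    "Term Frequency": (4.1, 4.1),
    "Binary IDF": (5.3, 5.5),
    "Term Freq. IDF": (6.1, 5.9),
}
std_dev = {
    "Term Frequency": (6.8, 6.8),
    "Binary IDF": (7.2, 7.6),
    "Term Freq. IDF": (8.3, 8.3),
}

for model in avg_change:
    avg_diff = round(percent_difference(*avg_change[model]), 1)
    std_diff = round(percent_difference(*std_dev[model]), 1)
    print(f"{model}: avg {avg_diff}, std {std_diff}")
```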
Results
Comparison between Models
With “Numerical Words” vs. without “Numerical Words”
Final Thoughts…
CONCLUSIONS
Research Conclusions
Simple models are insightful!
Future Work
1) Data Reduction
2) Metadata Extraction
3) Content Analysis
4) Verification
Summary
eDiscovery Cases
Patent Cases
Why Trolls and Text Mining?
1.  There is an explosion of data
2.  Current methods need to be revamped
3.  Verification…?
Thank you!
Dr. Michael W. Berry, CISML
Dr. Songhua Xu, ORNL
Bruce Kiefer, Catalyst
Dr. D. Zeimpekis, TMG
Dr. E. Gallopoulos, TMG
Benjamin Martin
Questions?
Analysis
[Figure: Effect of Minimum Local Term Frequency on Dictionary Size. x-axis: Minimum Local Term Frequency; y-axis: Number of Words in Dictionary; series: Without Numerical Data, With Numerical Data; labeled terms: data, end, found, low, small, time, valu]
3) Analysis
HTML Table Analysis:
Directory   Total Files   Files with Tables
00          16,386        2,155
01          16,382        2,148
02          16,302        2,157
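As a quick arithmetic check, the share of files with tables in these three directories can be computed directly from the counts in the table above. The result is close to the 13% figure quoted earlier for the whole collection, although three directories are only a sample of it.

```python
# (directory, total files, files with tables), copied from the table above
rows = [("00", 16386, 2155), ("01", 16382, 2148), ("02", 16302, 2157)]

def table_share(rows):
    """Percentage of files that contain tables across all listed directories."""
    total = sum(t for _, t, _ in rows)
    with_tables = sum(w for _, _, w in rows)
    return 100 * with_tables / total

for name, total, with_tables in rows:
    print(name, round(100 * with_tables / total, 1))
print("overall", round(table_share(rows), 1))
```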