Disambiguation of USPTO Inventors - APE-INV

DISAMBIGUATION OF USPTO INVENTORS
Name Game Workshop – Madrid, 9-10 December 2010
Presenter: Amy Yu (ayu@hbs.edu)
Coauthors: Ronald Lai (rolai@hbs.edu), Alex D’Amour (adamour@iq.harvard.edu), Lee Fleming (lfleming@hbs.edu)
Technical Collaborator: Edward Sun (ysun@iq.harvard.edu)
We would like to thank the NSF for supporting this research. Errors and omissions remain ours (though we
ask that you bring them to our attention).
The Institute for Quantitative Social Science at Harvard University
Agenda
• Introduction
• Methodology
  • Torvik-Smalheiser Algorithm (PubMed)
• Results and Analysis
  • Descriptive Statistics
  • DVN platform
Introduction
Background
• Patent data made available by the USPTO enable further research into technology and innovation
• The NBER database includes authorship, firm, and state-level data but has not completed the effort to disambiguate unique inventors (Hall, Jaffe, and Trajtenberg, 2001)
• Inventor disambiguation is non-trivial
  • The USPTO does not require consistent and unique identifiers for inventors
Motivation
• Inventor disambiguation allows for construction of inventor collaboration networks
• Opens new avenues of study:
  • Which inventors are most central in their field?
  • How does connectedness affect inventor productivity?
  • What corporate structures are conducive to innovation?
  • How do legal changes impact idea flow?
• Build a scalable, automated system for tracking and analyzing developments in the inventor community
Methodology
Overview
• Previous methodology (2008)
  • Linear, unsupervised – more intuitive
  • Similarity between records is a weighted average of element-wise similarity scores
  • Weights are not optimized
  • Strong results for US: Lai, D’Amour, and Fleming (2008) showed recall of 97.3% and precision of 96.1%
• Current methodology (2010)
  • Variation of the Torvik-Smalheiser algorithm (Torvik et al., 2005; Torvik and Smalheiser, 2009)
  • Multi-dimensional similarity profiles
  • Semi-supervised with automatically generated training sets
  • Optimal weighting, non-linear interactions
  • Easier to scale

Disambiguation Process
Flow, in order:
• Public databases: weekly USPTO patent data (1998 – 2010)
• Data preparation (HBS scripts)
  • load and validate
  • clean and format
  • generate datasets
• Primary datasets: Assignee, Inventors, Classes, Patents
• Consolidated inventor dataset
• Inventor disambiguation algorithm (HBS scripts)
• Consolidated inventor matched dataset
Data preparation
• Create inventor, assignee, patent, and classification datasets from primary and secondary data sources
  • USPTO: weekly patent data in XML files
  • NBER Patent Data Project: assignee data
  • National Geospatial-Intelligence Agency: location data
• Standardize and reformat
  • Removal of excess whitespace, removal of tags, and translation of Unicode characters
• Construct the inventor-patent database
  • Consolidate the inventor, assignee, patent, and classification datasets
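The "standardize and reformat" step above can be illustrated with a minimal sketch. It is not the HBS scripts themselves: the function name `clean_field`, the choice to fold accented characters to ASCII via Unicode decomposition, and the upper-casing (suggested only by the upper-case sample records) are all assumptions made for illustration.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")   # any markup tag, e.g. <b> or </i>
WS_RE = re.compile(r"\s+")        # runs of whitespace


def clean_field(raw):
    """Hypothetical cleaner for one text field of a patent record."""
    text = html.unescape(raw)                    # decode entities such as &eacute;
    text = TAG_RE.sub(" ", text)                 # remove tags
    text = unicodedata.normalize("NFKD", text)   # split letters from their accents
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = WS_RE.sub(" ", text).strip()          # remove excess whitespace
    return text.upper()                          # assumption: upper-case output
```

For example, `clean_field("  <b>Jos&eacute;</b>   M&uuml;ller ")` would yield a single-spaced, unaccented, upper-case name.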
Patent Data: Base Datasets
The four base datasets below feed the consolidated inventor dataset.

INVENTOR
  Invnum_N    Disambiguated inventor number
  Invnum      Initial inventor number: Patent + InvSeq
  Firstname   Inventor first name
  Lastname    Inventor last name
  InvSeq      Inventor number on patent
  Street      Inventor’s street address
  City        Inventor’s city
  State       (US only) State
  Zipcode     (US only) Zipcode
  Lat         Latitude
  Long        Longitude

PATENT
  Patent      USPTO assigned patent number
  AppDate     Patent application date
  GDate       Patent grant date
  AppYear     Patent application year

ASSIGNEE
  Assignee    Primary firm associated with patent
  Asgnum      Generated assignee number

CLASSES
  Class       Main patent classification
  Subclass    Patent subclassification

*HBS algorithm generated variables.
Consolidated Dataset
The consolidated inventor dataset combines inventor first and last name, location data, patent number, patent application and grant dates, assignee data, and patent class.

Inventor name, location, and patent number:
Firstname     Lastname  City             State  Country  Zipcode  Lat    Lng      InvSeq  Patent
GAROLD LEE    FLEMING   NEWTON           KS     US       67117    38.13  -97.32   3       4091724
LEE           FLEMING   FREMONT          CA     US       94555    37.57  -122.05  2       5029133
LEE O         FLEMING   FREMONT          CA     US       94555    37.57  -122.05  1       5136185
CATHLEEN M    FLEMING   FOREST HILL      MD     US       21050    39.57  -76.40   2       5799675
EILEEN        FLEMING   SANDIA           TX     US       78383    28.09  -97.94   1       7066218
EILEEN        FLEMING   SANDIA           TX     US       78382    30.09  -100.94  1       7540433
ELENA         FLEMING   NORTH VANCOUVER         CA                49.32  -123.07  5       7521240
ELIZABETH A   FLEMING   HOUSTON          TX     US       77299    29.77  -95.41   1       5164591
ELIZABETH S   FLEMING   BELMONT          MA     US       2478     42.39  -71.18   3       6335339
ELLEN L       FLEMING   BIRMINGHAM       MI     US       48012    42.54  -83.21   2       5683090
ERIC MICHAEL  FLEMING   SELKIRK                 CA                50.15  -96.88   2       5906799

Application and grant dates, and assignee data:
Patent   AppYear  GYear  AppDate     Assignee                           AsgNum
4091724  1977     1978   3/16/1977   HESSTON CORPORATION                H000000002441
5029133  1990     1991   8/30/1990   HEWLETT PACKARD COMPANY            A000010088678
5136185  1991     1992   9/20/1991   HEWLETT PACKARD COMPANY            A000010088678
5799675  1997     1998   3/3/1997    COLOR PRELUDE INC                  A000011790130
7066218  2003     2006   10/29/2003  TMC SYSTEMS L P                    H000000134163
7540433  2006     2009   4/27/2006   TMC SYSTEMS L P                    H000000134163
7521240  2004     2009   12/6/2004   SMITHKLINE BEECHAM CORPORATION     A000011538118
5164591  1991     1992   9/9/1991    SHELL OIL COMPANY                  A000010266734
6335339  1999     2002   1/13/1999   SCRIPTGEN PHARMACEUTICALS INC      H000000014253
5683090  1995     1997   6/7/1995    MARKON ENGINEERING
5906799  1992     1999   6/1/1992    HEMLOCK SEMICONDUCTOR CORPORATION  A000011501872

Patent class and inventor numbers:
Patent   Class                               Invnum      Invnum_N
4091724  100-21                              04091724-3  04091724-3
5029133  365-189.02/365-201/714-703/714-731  05029133-2  05029133-2
5136185  326-16/326-31/326-56/326-82         05136185-1  05136185-1
5799675  132-333/132-317                     05799675-2  05799675-2
7066218  141-198/239-63/239-64/239-67        07066218-1  07066218-1
7540433  239-69/141-198/239-63/239-64        07540433-1  07540433-1
7521240  435                                 07521240-5  07521240-5
5164591  250                                 05164591-1  05164591-1
6335339  514                                 06335339-3  06335339-3
5683090  273/463                             05683090-2  05683090-2
5906799  422/501                             05906799-2  05906799-2
Disambiguation Algorithm
Inventor disambiguation algorithm steps, in order: Blocking, Training Sets, Ratios, Disambiguation, Consolidation
Blocking
Run #  Type          Block1                              Block2
1      Consolidated  First name                          Last name
2      Consolidated  First 5 characters of first name    First 8 characters of last name
3      Consolidated  First 3 characters of first name    First 5 characters of last name
4      Consolidated  Initials of first and middle names  First 5 characters of last name
5      Consolidated  First initial                       First 5 characters of last name
6      Consolidated  Initials of first and middle names  Last 5 characters of last name, reversed
7      Consolidated  First initial                       Last 5 characters of last name, reversed

Example records:
Firstname   Lastname  City         State  Country  Zipcode
GAROLD LEE  FLEMING   NEWTON       KS     US       67117    …
LEE         FLEMING   FREMONT      CA     US       94555    …
LEE O       FLEMING   FREMONT      CA     US       94555    …
CATHLEEN M  FLEMING   FOREST HILL  MD     US       21050    …
EILEEN      FLEMING   SANDIA       TX     US       78383    …
EILEEN      FLEMING   SANDIA       TX     US       78382    …
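A minimal sketch of how the seven blocking criteria in the table above could be turned into block keys. This is an illustration only: the names `BLOCKING_RULES`, `block_key`, and `build_blocks` are hypothetical, and it assumes that the Firstname field holds first and middle names together (as in "GAROLD LEE") and that "First name" in run 1 means the whole field.

```python
from collections import defaultdict


def initials(firstname):
    """Initials of first and middle names, e.g. 'GAROLD LEE' -> 'GL'."""
    return "".join(token[0] for token in firstname.split())


# Run number -> (rule for Block1 on the first name, rule for Block2 on the last name).
BLOCKING_RULES = {
    1: (lambda f: f, lambda l: l),
    2: (lambda f: f[:5], lambda l: l[:8]),
    3: (lambda f: f[:3], lambda l: l[:5]),
    4: (initials, lambda l: l[:5]),
    5: (lambda f: f[:1], lambda l: l[:5]),
    6: (initials, lambda l: l[-5:][::-1]),       # last 5 characters, reversed
    7: (lambda f: f[:1], lambda l: l[-5:][::-1]),
}


def block_key(run, firstname, lastname):
    first_rule, last_rule = BLOCKING_RULES[run]
    return (first_rule(firstname), last_rule(lastname))


def build_blocks(records, run):
    """Partition records (dicts with 'firstname' and 'lastname') by block key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(run, rec["firstname"], rec["lastname"])].append(rec)
    return dict(blocks)
```

Under these assumptions, "LEE" and "LEE O" fall into different blocks in run 1 but share a block from run 5 onward, which is how later, looser runs give such records a chance to be compared.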
Training Sets
• Similarity profile: x = [x1, x2, x3, x4, x5, x6, x7], made up of name attributes (α) and patent attributes (β)
• Match: P(α|M) * P(β|M) = P(x|M) = probability of seeing similarity profile x given a match
• Nonmatch: P(α|N) * P(β|N) = P(x|N) = probability of seeing similarity profile x given a nonmatch
Ratios
Similarity Profile     Match Probability P(M|x)
[2, 4, 3, 4, 2, 1, 4]  0.3439485
[3, 4, 3, 5, 3, 2, 5]  0.5872638
[4, 5, 3, 7, 3, 4, 6]  0.7936452
[6, 6, 4, 8, 3, 8, 7]  0.9828447
…                      …
* approximated probabilities for demonstration

• Likelihood ratio: r = P(x|M)/P(x|N), generated from training sets
• Probability of match given similarity profile x: P(M|x) = 1 / (1 + (1 - P(M)) / (P(M) * r))
• P(M) is empirically determined
• Smoothing: enforce monotonicity
• r is interpolated/extrapolated for unobserved x
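A short sketch of the step from likelihood ratio to match probability. It applies Bayes' rule to r = P(x|M)/P(x|N) and the prior P(M); the function name `match_probability` and its argument names are chosen here for illustration and are not taken from the HBS code.

```python
def match_probability(r, prior_match):
    """P(M|x) from the likelihood ratio r and the prior P(M), via Bayes' rule."""
    if not 0.0 < prior_match < 1.0:
        raise ValueError("prior_match must lie strictly between 0 and 1")
    if r < 0.0:
        raise ValueError("a likelihood ratio cannot be negative")
    if r == 0.0:
        return 0.0  # the profile was never seen among matches
    return 1.0 / (1.0 + (1.0 - prior_match) / (prior_match * r))
```

With a prior of 0.5 the result reduces to r / (1 + r), so r = 1 gives 0.5 and r = 9 gives 0.9; a smaller prior pulls the same ratio toward a lower match probability.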
Disambiguation
Records within a block are compared pairwise. In the example, the two EILEEN FLEMING records have similarity profile [6, 6, 4, 8, 3, 8, 7], which maps to a match probability of 0.9828447 > 0.95, so the Invnum_N of the second record is updated from 07540433-1 to 07066218-1.

Firstname   Lastname  City         State  Country  Zipcode     Invnum      Invnum_N
GAROLD LEE  FLEMING   NEWTON       KS     US       67117    …
LEE         FLEMING   FREMONT      CA     US       94555    …
LEE O       FLEMING   FREMONT      CA     US       94555    …
CATHLEEN M  FLEMING   FOREST HILL  MD     US       21050    …
EILEEN      FLEMING   SANDIA       TX     US       78383    …  07066218-1  07066218-1
EILEEN      FLEMING   SANDIA       TX     US       78382    …  07540433-1  07066218-1
Consolidation
Records that share an Invnum_N are consolidated into one record. In the example, the first four records are unchanged and the two EILEEN FLEMING records are merged; in the merged record each value carries a ~n suffix, which appears to be the number of source records with that value.

Before consolidation:
Firstname   Lastname  City         State  Country  Zipcode     Invnum      Invnum_N
GAROLD LEE  FLEMING   NEWTON       KS     US       67117    …  04091724-3  04091724-3
LEE         FLEMING   FREMONT      CA     US       94555    …  05029133-2  05029133-2
LEE O       FLEMING   FREMONT      CA     US       94555    …  05136185-1  05136185-1
CATHLEEN M  FLEMING   FOREST HILL  MD     US       21050    …  05799675-2  05799675-2
EILEEN      FLEMING   SANDIA       TX     US       78383    …  07066218-1  07066218-1
EILEEN      FLEMING   SANDIA       TX     US       78382    …  07540433-1  07066218-1

After consolidation:
Firstname   Lastname   City         State  Country  Zipcode             Invnum      Invnum_N
GAROLD LEE  FLEMING    NEWTON       KS     US       67117            …  04091724-3  04091724-3
LEE         FLEMING    FREMONT      CA     US       94555            …  05029133-2  05029133-2
LEE O       FLEMING    FREMONT      CA     US       94555            …  05136185-1  05136185-1
CATHLEEN M  FLEMING    FOREST HILL  MD     US       21050            …  05799675-2  05799675-2
EILEEN~2    FLEMING~2  SANDIA~2     TX~2   US~2     78383~1/78382~1  …  07066218-1  07066218-1
Process Map
Consolidated steps:
• For each run 1 through 7: training set (tsetC1 … tsetC7) → ratio database (ratio1 … ratio7) → blocking and disambiguation → consolidation
• The output of the consolidated runs is the lower bound result
Final step: Splitting
• Blocking on invnum_N, disambiguation using ratio7
• The output is the upper bound result
Results and Analysis
[Figure: Patents and Inventors* 1975 – 2010. Series: Unique Inventors (upper bound), Unique Inventors (lower bound), Total Patents. x-axis: Year, 1975 to 2010; y-axis: Count, 0 to 250000.]
* excluding East Asian inventors
Patents Per Inventor
Patents per inventor  Share of inventors
1                     66.78%
2 thru 5              26.86%
6 thru 10             3.94%
11 thru 50            2.29%
50+                   0.14%
* based on lower bound disambiguation
Top 10 Inventors
Firstname  Lastname     Country  Assignee                             Number of Patents
KIA        SILVERBROOK  AU       SILVERBROOK RESEARCH PTY LTD         3382
DONALD E   WEDER        US       WANDA M WEDER AND WILLIAM F STAETER  1001
LEONARD    FORBES       US       MICRON TECHNOLOGY INC                925
GURTEJ S   SANDHU       US       MICRON TECHNOLOGY INC                832
PAUL       LAPSTUN      AU       SILVERBROOK RESEARCH PTY LTD         803
WARREN M   FARNWORTH    US       MICRON TECHNOLOGY INC                729
GEORGE     SPECTOR      US       THE RUIZ LAW FIRM                    715
SALMAN     AKRAM        US       MICRON TECHNOLOGY INC                670
WILLIAM I  WOOD         US       GENENTECH INC                        646
AUSTIN L   GURNEY       US       GENENTECH INC                        618
* based on lower bound disambiguation, excluding East Asian inventors
[Figure: Unique Coauthors by Patent Grant Year. Series: Average # Coauthors LB, Average # Coauthors UB. x-axis: Grant Year, 1975 to 2010; y-axis: Count, 0 to 2.]
[Figure: Largest Component per Year. Series: Lower bound disambig, Upper bound disambig. x-axis: Grant Year, 1975 to 2010; y-axis: Component Size (Number of vertices), 0 to 45000.]
Analysis
• Benchmark dataset from Jerry Marschke, NBER
  • manually edited, data derived from inventor CVs
  • patent history of ~100 US inventors, mainly research scientists in university engineering and biochemistry depts
Verification Measures:
Verification statistics
Run #  Type          # of records  Underclumping  Overclumping  Recall  Precision
0      Base Dataset  9.17 million  n/a
1      Consolidated  4.61 million  74.6%          1.7%          25.40%  93.73%
2      Consolidated  2.20 million  12.3%          4.8%          87.70%  94.81%
3      Consolidated  2.08 million  6.8%           10.1%         93.20%  90.22%
4      Consolidated  2.05 million  4.6%           10.3%         95.40%  90.26%
5      Consolidated  2.02 million  4.1%           10.3%         95.90%  90.30%
6      Consolidated  2.01 million  2.8%           19.2%**       97.20%  83.51%
7      Consolidated  1.99 million  2.7%           19.2%**       97.30%  83.52%
8      Splitting     2.26 million  15.9%          15.3%         84.10%  84.61%
** due to “blackhole” names
Encouraging results
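The formulas for the verification measures did not survive in this copy of the slides. The sketch below is therefore an inference, not the authors' stated definition: every row of the verification statistics is consistent with recall = 100% minus underclumping and precision = recall / (recall + overclumping), with all rates expressed as percentages. The function names are hypothetical.

```python
def recall_pct(underclumping_pct):
    """Recall implied by the underclumping rate (both in percent)."""
    return 100.0 - underclumping_pct


def precision_pct(underclumping_pct, overclumping_pct):
    """Precision implied by the underclumping and overclumping rates (percent)."""
    recall = recall_pct(underclumping_pct)
    return 100.0 * recall / (recall + overclumping_pct)
```

For run 1, underclumping of 74.6% and overclumping of 1.7% give a recall of 25.4% and a precision of about 93.73%, matching the reported row.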
Challenges and Improvements
• Disambiguation of East Asian names is difficult
  • Current algorithm is well-suited for European names
  • Systematic improvements required to handle correlations between fields
• Overclumping for common names: frequency adjustment using stop listing
  • Removing David Johnson, Eric Anderson, and Stephen Smith from our analysis improves the overclumping metric from 19.2% to 5.1% for the last two consolidated runs
• Computation time v. algorithmic accuracy
• Benchmark datasets for results analysis
Research applications
• Origin of breakthroughs
• Impact of legislation on innovation
• Organizational influence on innovation
• Inventor careers and collaboration networks
Dataverse Network Platform
Questions?
Appendix
Patent Data
Consolidated inventor dataset

Invnum_N  Name          Patent   Assignee  City       State  …
12345     Fleming, Lee  5029133  HP        Fremont    CA     …
12345     Fleming, Lee  9999999  Harvard   Cambridge  MA     …
45678     Yu, Amy       9999999  Harvard   Boston     MA     …
67890     Lai, Ronald   9999999  Harvard   Randolph   MA     …

Prof Fleming, Amy, and Ron collaborate on patent 9999999.
• Data are organized in unique inventor-patent pairs.
• Unique inventor number (HBS disambiguation algorithm), constant between patents.
  • Invnum = Patent Num + inventor sequence
  • Invnum_N = disambiguated inventor identifier
• Patent is assigned to one entity (usually the inventors’ employer, or self if blank), constant over a patent.
• Location data are personal addresses (at the city level) of inventors; they vary over a patent.
Disambiguation Algorithm
Blocking
• Partition the inventor-patent dataset
• Based on seven different criteria
Training Sets
• Build a training set for each set of block criteria
• Each set is a database that contains four different tables, each with ~10 million pairs of record ids.
Ratios
• One ratio database is created for each training set
• Similarity profiles are paired with match probabilities.
Disambiguation
• Starts from invpat or a previously disambiguated and consolidated database
• Within each block, we compare each record
• Output is invnum_N
Consolidation
• Based on the disambiguated invnum_N, update invnum_N within invpat
• Consolidates records with the same invnum_N
Summary of Data Passes
Run #  Type          Block1                              Block2
1      Consolidated  First name                          Last name
2      Consolidated  First 5 characters of first name    First 8 characters of last name
3      Consolidated  First 3 characters of first name    First 5 characters of last name
4      Consolidated  Initials of first and middle names  First 5 characters of last name
5      Consolidated  First initial                       First 5 characters of last name
6      Consolidated  Initials of first and middle names  Last 5 characters of last name, reversed
7      Consolidated  First initial                       Last 5 characters of last name, reversed
8      Splitting     Invnum_N from step 7
Patent Similarity Profiles
• Seven-dimensional
  • Fields used: name attributes (first name, middle initials, and last name) and patent attributes (author address, assignee, technology class, and coauthors)
• Each element is a discrete similarity score determined by a fieldwise comparison between two records
  • Jaro-Winkler string comparison
• Monotonicity assumption: if one profile dominates another profile (that is, each of its elements is greater than or equal to the corresponding element of the other profile), then it must map to a higher match probability.
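The monotonicity assumption can be made concrete with a small sketch. It only checks for violations; how the actual smoothing repairs them is not described on the slides. The names `dominates` and `monotonicity_violations` are hypothetical, and profiles are assumed to be tuples of equal length.

```python
def dominates(a, b):
    """True if every element of profile a is >= the matching element of b."""
    return len(a) == len(b) and all(x >= y for x, y in zip(a, b))


def monotonicity_violations(scores):
    """Pairs (higher, lower) where the dominating profile has the smaller score.

    `scores` maps a profile tuple to its ratio or match probability.
    """
    violations = []
    for high, score_high in scores.items():
        for low, score_low in scores.items():
            if high != low and dominates(high, low) and score_high < score_low:
                violations.append((high, low))
    return violations
```

Applied to the four demonstration profiles shown earlier, where each profile dominates the one before it and the probabilities rise, the check returns an empty list.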
Similarity Scores
Comparison function scoring: LEFT/RIGHT = LEFT VS RIGHT

1) Firstname: 0-6. Factors: # of tokens and similarity between tokens
  0: Totally different: THOMAS ERIC/RICHARD JACK EVAN
  1: ONE NAME MISSING: THOMAS ERIC/(NONE)
  2: THOMAS ERIC/THOMAS JOHN ALEX
  3: LEE RON ERIC/LEE ALEX ERIC
  4: No-space match but raw names don't: JOHNERIC/JOHN ERIC. Short name vs long name: ERIC/ERIC THOMAS
  5: ALEX NICHOLAS/ALEX NICHOLAS TAKASHI
  6: ALEX NICHOLAS/ALEX NICHOLA (might not be exactly the same but identified as the same by Jaro-Winkler)

2) Lastname: 0-6. Factors: # of tokens and similarity between tokens
  0: Totally different: ANDERSON/DAVIDSON
  1: ONE NAME MISSING: ANDERSON/(NONE)
  2: First part non-match: DE AMOUR/DA AMOUR
  3: VAN DE WAALS/VAN DES WAALS
  4: DE AMOUR/DEAMOUR
  5: JOHNSTON/JOHNSON
  6: DE AMOUR/DE AMOURS

3) Midname: 0-4 (THE FOLLOWING EXAMPLES ARE FROM THE COLUMN FIRSTNAME, SO FIRSTNAME IS INCLUDED)
  0: THOMAS ERIC/JOHN THOMAS
  1: JOHN ERIC/JOHN (MISSING)
  2: THOMAS ERIC ALEX/JACK ERIC RONALD
  3: THOMAS ERIC RON ALEX EDWARD/JACK ERIC RON ALEX LEE
  4: THOMAS ERIC/THOMAS ERIC LEE

4) Assignee: 0-8
  0: DIFFERENT ASGNUM, TOTALLY DIFFERENT NAMES (no single common word)
  1: DIFFERENT ASGNUM, one name missing
  2: DIFFERENT ASGNUM, Harvard University Longwood Medical School / Dartmouth Hitchcock Medical Center
  3: DIFFERENT ASGNUM, Harvard University President and Fellows / Presidents and Fellow of Harvard
  4: DIFFERENT ASGNUM, Harvard University / Harvard University Medical School
  5: DIFFERENT ASGNUM, Microsoft Corporation/Microsoft Corporated
  6: SAME ASGNUM, COMPANY SIZE>1000
  7: SAME ASGNUM, 1000>SIZE>100
  8: SAME ASGNUM, SIZE<100

5) CLASS: 0-4
  # OF COMMON CLASSES. MISSING=1

6) COAUTHORS: 0-10
  # OF COMMON COAUTHORS

7) DISTANCE: 0-7. FACTORS: LONGITUDE/LATITUDE, STREET ADDRESS
  0: TOTALLY DIFFERENT
  1: ONE IS MISSING
  2: 75 < DISTANCE < 100 KM
  3: 50 < DISTANCE < 75
  4: 10 < DISTANCE < 50
  5: DISTANCE < 10
  6: DISTANCE < 10 AND STREET MATCH BUT NOT IN US, OR DISTANCE < 10 AND IN US BUT STREET NOT MATCH
  7: STREET MATCH AND IN US
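As an illustration of one comparison function, here is a hedged sketch of the DISTANCE score. Several details are assumptions, since the slide does not specify them: distance is computed with the haversine formula, bin edges are treated as strict upper bounds, anything at 100 km or more scores 0, and the street-match-in-US rule is checked before distance. The names `haversine_km` and `distance_score` are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km, using a mean Earth radius of 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def distance_score(loc_a, loc_b, street_match=False, in_us=False):
    """Score 0-7 for two (lat, lon) tuples; None means the location is missing."""
    if loc_a is None or loc_b is None:
        return 1                                   # one is missing
    if street_match and in_us:
        return 7                                   # assumption: checked first
    d = haversine_km(loc_a[0], loc_a[1], loc_b[0], loc_b[1])
    if d < 10:
        return 6 if (street_match or in_us) else 5
    if d < 50:
        return 4
    if d < 75:
        return 3
    if d < 100:
        return 2
    return 0                                       # totally different
```

Two records at the same coordinates in the US with no street match would score 6 under these assumptions, while Fremont, CA against Newton, KS would score 0.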
Probabilistic Matching Model
• Name and patent attributes are assumed to be independent
• Unbiased training sets are created by conditioning on one set of features to create a sample of obvious matches or non-matches, in order to learn about the other set of features without bias
• Count the frequency of each similarity profile x in the match and nonmatch sets to calculate P(x|M) and P(x|N)
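The counting step can be sketched as follows. This is a simplified illustration with hypothetical names (`profile_frequencies`, `likelihood_ratios`): it estimates P(x|M) and P(x|N) as relative frequencies and returns their ratio only for profiles seen in both training sets, leaving unobserved profiles to the interpolation step. Under the independence assumption, the same functions could be applied separately to the name part and the patent part of the profile and the results multiplied.

```python
from collections import Counter


def profile_frequencies(profiles):
    """Relative frequency of each similarity profile in one training set."""
    counts = Counter(tuple(p) for p in profiles)
    total = sum(counts.values())
    return {profile: n / total for profile, n in counts.items()}


def likelihood_ratios(match_profiles, nonmatch_profiles):
    """r = P(x|M) / P(x|N) for every profile observed in both sets."""
    p_match = profile_frequencies(match_profiles)
    p_nonmatch = profile_frequencies(nonmatch_profiles)
    return {x: p_match[x] / p_nonmatch[x] for x in p_match if x in p_nonmatch}
```

A profile that makes up half of the match set but only a quarter of the nonmatch set gets r = 2.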
Training Set Criteria
Name attributes (condition on patent attributes to train name attributes)
• Match: choose all the record pairs that have at least two common coauthors within each predefined block.
• Nonmatch: choose all record pairs that have the same appyear, different assignee, no common coauthors, and no common classes within each predefined block.
Patent attributes (condition on name attributes to train patent attributes)
• Match: choose all the record pairs that share the same rare name (calculate statistics on unique full names; choose those whose first or last name appears only once). Not necessary to check each block.
• Nonmatch: choose all record pairs that have different last names, from a subset of the whole database in which the number of records is proportional to the original one in terms of grant year.
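A minimal sketch of the name-attribute training sets, built by conditioning on patent attributes within one block. The record layout (dicts with `id`, `appyear`, `assignee`, and sets `coauthors` and `classes`) and the function name `name_training_pairs` are assumptions made for illustration.

```python
from itertools import combinations


def name_training_pairs(block):
    """Split record pairs of one block into obvious matches and nonmatches."""
    matches, nonmatches = [], []
    for a, b in combinations(block, 2):
        shared_coauthors = a["coauthors"] & b["coauthors"]
        shared_classes = a["classes"] & b["classes"]
        if len(shared_coauthors) >= 2:
            # At least two common coauthors: treated as an obvious match.
            matches.append((a["id"], b["id"]))
        elif (a["appyear"] == b["appyear"]
              and a["assignee"] != b["assignee"]
              and not shared_coauthors
              and not shared_classes):
            # Same application year, nothing else in common: obvious nonmatch.
            nonmatches.append((a["id"], b["id"]))
    return matches, nonmatches
```

Pairs that satisfy neither rule are simply left out of both training sets.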
Probabilistic Matching Model
• Likelihood ratio: r = P(x|M)/P(x|N)
• Probability of match given similarity profile x: P(M|x) = 1 / (1 + (1 - P(M)) / (P(M) * r)), where
  • P(M) is empirically determined
  • Smoothing: enforce monotonicity
  • r is interpolated/extrapolated for unobserved x
Disambiguation & Consolidation
• Generate a similarity profile for each pair of records within each block
• Look up the similarity profile in the ratio database to find the match probability
• Based on a given probability threshold, we determine if invnum_N (algorithmically generated unique inventor identifier) should be updated
• Records with the same invnum_N are consolidated
  • Improves algorithm efficiency for subsequent runs
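The steps above can be sketched for a single block. This is an illustration under stated assumptions rather than the HBS implementation: `profile_fn` stands in for the seven comparison functions, `match_prob` is a plain dict playing the role of the ratio database, unobserved profiles default to probability 0 instead of being interpolated, and the 0.95 threshold is taken from the worked example on the Disambiguation slide.

```python
from itertools import combinations


def disambiguate_block(block, profile_fn, match_prob, threshold=0.95):
    """Return {invnum: invnum_N} after merging record pairs above the threshold."""
    label = {rec["invnum"]: rec["invnum_n"] for rec in block}
    for a, b in combinations(block, 2):
        profile = profile_fn(a, b)
        if match_prob.get(profile, 0.0) > threshold:
            new, old = label[a["invnum"]], label[b["invnum"]]
            if new != old:
                # Relabel every record that carried the old identifier.
                for invnum in list(label):
                    if label[invnum] == old:
                        label[invnum] = new
    return label


def consolidate(label):
    """Group record ids by their disambiguated identifier."""
    groups = {}
    for invnum, invnum_n in label.items():
        groups.setdefault(invnum_n, []).append(invnum)
    return groups
```

With the two EILEEN FLEMING records and a profile that maps to 0.9828447, both records end up under 07066218-1 and are consolidated into one group.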
Verification Measures
References
Hall, B. H., A. B. Jaffe, and M. Trajtenberg (2001). The NBER Patent Citations Data File: Lessons, Insights and Methodological Tools. NBER.
Torvik, V., M. Weeber, D. Swanson, and N. Smalheiser (2005). “A Probabilistic Similarity Metric for Medline Records: A Model for Author Name Disambiguation.” Journal of the American Society for Information Science and Technology, 56(2):140–158.
Torvik, V. and N. Smalheiser (2009). “Author Name Disambiguation in MEDLINE.” ACM Transactions on Knowledge Discovery from Data, Vol. 3, No. 3, Article 11.