DISAMBIGUATION OF USPTO INVENTORS
Name Game Workshop – Madrid, 9-10 December 2010

Presenter: Amy Yu (ayu@hbs.edu)
Coauthors: Ronald Lai (rolai@hbs.edu), Alex D'Amour (adamour@iq.harvard.edu), Lee Fleming (lfleming@hbs.edu)
Technical Collaborator: Edward Sun (ysun@iq.harvard.edu)

We would like to thank the NSF for supporting this research. Errors and omissions remain ours (though we ask that you bring them to our attention).

The Institute for Quantitative Social Science at Harvard University

Agenda
• Introduction
• Methodology: Torvik-Smalheiser algorithm (PubMed)
• Results and Analysis: descriptive statistics
• DVN platform

Introduction

Background
• Patent data made available by the USPTO enables further research into technology and innovation.
• The NBER database includes authorship, firm, and state-level data but has not completed the effort to disambiguate unique inventors (Hall, Jaffe, and Trajtenberg, 2001).
• Inventor disambiguation is non-trivial: the USPTO does not require consistent and unique identifiers for inventors.

Motivation
• Inventor disambiguation allows for the construction of inventor collaboration networks.
• It opens new avenues of study:
  – Which inventors are most central in their field?
  – How does connectedness affect inventor productivity?
  – What corporate structures are conducive to innovation?
  – How do legal changes impact idea flow?
• Build a scalable, automated system for tracking and analyzing developments in the inventor community.

Methodology

Overview
• Previous methodology (2008)
  – Linear, unsupervised – more intuitive.
  – Similarity between records is a weighted average of element-wise similarity scores; the weights are not optimized.
  – Strong results for US inventors: Lai, D'Amour, and Fleming (2008) report recall of 97.3% and precision of 96.1%.
• Current methodology (2010)
  – Variation of the Torvik-Smalheiser algorithm (Torvik et al., 2005; Torvik and Smalheiser, 2009).
  – Multi-dimensional similarity profiles.
  – Semi-supervised, with automatically generated training sets.
  – Optimal weighting and non-linear interactions between fields.
  – Easier to scale.

Disambiguation Process
[Process diagram: weekly USPTO patent data (1998–2010) and public databases feed the HBS data-preparation scripts (load and validate, clean and format, generate datasets), which produce the primary datasets (Patents, Inventors, Assignees, Classes); these feed the inventor disambiguation algorithm, which outputs the consolidated inventor dataset.]

Data Preparation
• Create inventor, assignee, patent, and classification datasets from primary and secondary data sources:
  – USPTO: weekly patent data in XML files
  – NBER Patent Data Project: assignee data
  – National Geospatial-Intelligence Agency: location data
• Standardize and reformat: removal of excess whitespace, removal of tags, and translation of Unicode characters (a sketch follows the base-dataset table below).
• Construct the inventor-patent database by consolidating the inventor, assignee, patent, and classification datasets.

Patent Data: Base Datasets
PATENT
  Patent     USPTO-assigned patent number
  AppDate    Patent application date
  GDate      Patent grant date
  AppYear    Patent application year
INVENTOR (consolidated inventor dataset)
  Invnum_N*  Disambiguated inventor number
  Invnum     Initial inventor number: Patent + InvSeq
  Firstname  Inventor first name
  Lastname   Inventor last name
  InvSeq     Inventor sequence number on the patent
  Street     Inventor's street address
  City       Inventor's city
  State      State (US only)
  Zipcode    Zipcode (US only)
  Lat        Latitude
  Long       Longitude
ASSIGNEE
  Assignee   Primary firm associated with the patent
  Asgnum*    Generated assignee number
CLASSES
  Class      Main patent classification
  Subclass   Patent subclassification
* HBS algorithm-generated variables.
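To make the data-preparation step concrete, here is a minimal sketch of the cleaning described above (tag removal, whitespace collapsing, Unicode translation) applied to one raw inventor record. The field names mirror the base-dataset table; the exact cleaning rules, the uppercasing, and the unicodedata-based transliteration are illustrative assumptions, not the HBS scripts.

  # Minimal sketch of the "standardize and reformat" step (illustrative only):
  # strip XML tags, collapse excess whitespace, and translate Unicode
  # characters to plain ASCII before building inventor-patent records.

  import re
  import unicodedata

  def clean_field(value: str) -> str:
      value = re.sub(r"<[^>]+>", " ", value)                 # remove tags
      value = unicodedata.normalize("NFKD", value)            # translate Unicode characters
      value = value.encode("ascii", "ignore").decode("ascii")
      return re.sub(r"\s+", " ", value).strip().upper()       # collapse excess whitespace

  def make_record(raw: dict) -> dict:
      """Assemble one inventor-patent row using the base-dataset field names."""
      rec = {k: clean_field(str(v)) for k, v in raw.items()}
      rec["Invnum"] = f"{rec['Patent']}-{rec['InvSeq']}"       # initial inventor number: Patent + InvSeq
      return rec

  print(make_record({"Firstname": "  Garold   Lee ", "Lastname": "Fleming",
                     "City": "Newton", "State": "KS", "Patent": "04091724", "InvSeq": "3"}))
  # {'Firstname': 'GAROLD LEE', 'Lastname': 'FLEMING', 'City': 'NEWTON', 'State': 'KS',
  #  'Patent': '04091724', 'InvSeq': '3', 'Invnum': '04091724-3'}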
Consolidated Dataset
One row per inventor-patent pair; the excerpt below shows inventors with last name FLEMING. Each record carries name, location data (City, State, Country, Zipcode, Lat/Lng), InvSeq, patent number, application and grant dates, assignee data (Assignee, AsgNum), patent Class, Invnum, and the disambiguated Invnum_N (additional columns are omitted from the excerpt).

  Firstname    Lastname  City         State  Country  Zipcode  Patent    GYear  Assignee                  Invnum
  GAROLD LEE   FLEMING   NEWTON       KS     US       67117    4091724   1978   HESSTON CORPORATION       04091724-3
  LEE          FLEMING   FREMONT      CA     US       94555    5029133   1991   HEWLETT PACKARD COMPANY   05029133-2
  LEE O        FLEMING   FREMONT      CA     US       94555    5136185   1992   HEWLETT PACKARD COMPANY   05136185-1
  CATHLEEN M   FLEMING   FOREST HILL  MD     US       21050    5799675   1998   COLOR PRELUDE INC         05799675-2
  EILEEN       FLEMING   SANDIA       TX     US       78383    7066218   2006   TMC SYSTEMS L P           07066218-1
  EILEEN       FLEMING   SANDIA       TX     US       78382    7540433   2009   TMC SYSTEMS L P           07540433-1
  ...          (further FLEMING records: ELENA, ELIZABETH A, ELIZABETH S, ELLEN L, ERIC MICHAEL, ...)

Disambiguation Algorithm
Blocking → Training Sets → Ratios → Disambiguation → Consolidation

Blocking
Each consolidated pass compares records only within blocks defined by a pair of name keys; the keys become progressively looser from run to run (see the sketch after this slide).

  Run 1  Consolidated  Block 1: First name                          Block 2: Last name
  Run 2  Consolidated  Block 1: First 5 characters of first name    Block 2: First 8 characters of last name
  Run 3  Consolidated  Block 1: First 3 characters of first name    Block 2: First 5 characters of last name
  Run 4  Consolidated  Block 1: Initials of first and middle names  Block 2: First 5 characters of last name
  Run 5  Consolidated  Block 1: First initial                       Block 2: First 5 characters of last name
  Run 6  Consolidated  Block 1: Initials of first and middle names  Block 2: Last 5 characters of last name, reversed
  Run 7  Consolidated  Block 1: First initial                       Block 2: Last 5 characters of last name, reversed

Example records falling into common blocks include GAROLD LEE FLEMING (NEWTON, KS), LEE FLEMING and LEE O FLEMING (FREMONT, CA), CATHLEEN M FLEMING (FOREST HILL, MD), and the two EILEEN FLEMING records (SANDIA, TX).
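A minimal sketch of the seven (Block 1, Block 2) key pairs listed above, applied to a single name record. The function name and the whitespace handling are illustrative assumptions; only records sharing both keys for a given run would be compared.

  # Sketch: the seven (Block 1, Block 2) key pairs used in the consolidated runs.

  def blocking_keys(firstname: str, lastname: str, run: int) -> tuple:
      first = firstname.strip().upper()
      last = lastname.strip().upper().replace(" ", "")
      initials = "".join(tok[0] for tok in first.split())      # first + middle initials
      keys = {
          1: (first, last),
          2: (first[:5], last[:8]),
          3: (first[:3], last[:5]),
          4: (initials, last[:5]),
          5: (first[:1], last[:5]),
          6: (initials, last[-5:][::-1]),                      # last 5 characters, reversed
          7: (first[:1], last[-5:][::-1]),
      }
      return keys[run]

  # Example: GAROLD LEE FLEMING under each run.
  for run in range(1, 8):
      print(run, blocking_keys("GAROLD LEE", "FLEMING", run))
  # 1 ('GAROLD LEE', 'FLEMING')   2 ('GAROL', 'FLEMING')   3 ('GAR', 'FLEMI')
  # 4 ('GL', 'FLEMI')   5 ('G', 'FLEMI')   6 ('GL', 'GNIME')   7 ('G', 'GNIME')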
Training Sets
• Each similarity profile x = [x1, x2, x3, x4, x5, x6, x7] splits into name attributes (α) and patent attributes (β).
• P(x|M) = P(α|M) · P(β|M): probability of seeing similarity profile x given a match.
• P(x|N) = P(α|N) · P(β|N): probability of seeing similarity profile x given a nonmatch.
• Match and nonmatch training sets are generated automatically (see Training Set Criteria in the appendix).

Ratios
• Likelihood ratio r = P(x|M) / P(x|N), generated from the training sets.
• Probability of a match given similarity profile x: P(M|x) = r·P(M) / (r·P(M) + 1 - P(M)), where the prior P(M) is empirically determined.
• Smoothing: enforce monotonicity; r is interpolated/extrapolated for unobserved profiles x.

  Similarity Profile       Match Probability P(M|x)*
  [2, 4, 3, 4, 2, 1, 4]    0.3439485
  [3, 4, 3, 5, 3, 2, 5]    0.5872638
  [4, 5, 3, 7, 3, 4, 6]    0.7936452
  [6, 6, 4, 8, 3, 8, 7]    0.9828447
  ...                      ...
  * approximated probabilities for demonstration

Disambiguation
• Within each block, pairs of records are compared and their similarity profile is looked up in the ratio database.
• Example: the two EILEEN FLEMING records (SANDIA, TX; zipcodes 78383 and 78382; Invnums 07066218-1 and 07540433-1) produce the similarity profile [6, 6, 4, 8, 3, 8, 7], with match probability 0.9828447 > 0.95, so both records are assigned the same disambiguated identifier Invnum_N = 07066218-1.

Consolidation
• Records sharing an Invnum_N are merged into a single consolidated record; fields that differ across the merged records keep the observed values with their counts (e.g., EILEEN~2, zipcode 78383~1/78382~1).
• Example: after consolidation, GAROLD LEE FLEMING (04091724-3), LEE FLEMING (05029133-2), LEE O FLEMING (05136185-1), and CATHLEEN M FLEMING (05799675-2) remain separate, while the two EILEEN FLEMING records collapse into one.

Process Map: Consolidated Steps
• Seven consolidated passes: training sets tsetC1 ... tsetC7 feed ratio databases ratio1 ... ratio7; the seventh pass yields the lower-bound result.

Final Step: Splitting
• A final splitting pass, blocking on the invnum_N from step 7 (and patent) and using ratio7, yields the upper-bound result. A sketch of the match-probability lookup, thresholding, and consolidation appears below.
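A minimal sketch of one disambiguation/consolidation pass as described above, assuming the ratio table has already been built. The record structure, the prior P(M) = 0.01, the 0.95 threshold, and the helper names are illustrative, not the HBS implementation; the match probability follows the formula on the Ratios slide.

  # Minimal sketch of one disambiguation pass (illustrative only).
  # Assumes a precomputed ratio table mapping similarity profiles to
  # likelihood ratios r = P(x|M) / P(x|N), as on the Ratios slide.

  from itertools import combinations

  def match_probability(profile, ratio_table, prior_m):
      """P(M|x) = r*P(M) / (r*P(M) + 1 - P(M)); unseen profiles would be
      interpolated/extrapolated in the real system (here we default to r = 1)."""
      r = ratio_table.get(tuple(profile), 1.0)
      return r * prior_m / (r * prior_m + 1.0 - prior_m)

  class UnionFind:
      """Tracks which records have been assigned to the same inventor."""
      def __init__(self):
          self.parent = {}
      def find(self, x):
          self.parent.setdefault(x, x)
          while self.parent[x] != x:
              self.parent[x] = self.parent[self.parent[x]]
              x = self.parent[x]
          return x
      def union(self, a, b):
          ra, rb = self.find(a), self.find(b)
          if ra != rb:
              self.parent[max(ra, rb)] = min(ra, rb)   # keep the smaller Invnum as Invnum_N

  def disambiguate_block(records, similarity_profile, ratio_table,
                         prior_m=0.01, threshold=0.95):
      """records: dict Invnum -> record fields; similarity_profile: callable that
      returns the seven-dimensional profile for a pair; returns Invnum -> Invnum_N."""
      uf = UnionFind()
      for a, b in combinations(records, 2):
          x = similarity_profile(records[a], records[b])
          if match_probability(x, ratio_table, prior_m) > threshold:
              uf.union(a, b)
      return {invnum: uf.find(invnum) for invnum in records}

Records mapped to the same Invnum_N would then be consolidated into a single row, as in the EILEEN FLEMING example above.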
Results and Analysis

Patents and Inventors, 1975–2010*
[Line chart: annual counts of total patents, unique inventors (upper bound), and unique inventors (lower bound). Y-axis: Count, 0–250,000; X-axis: Year, 1975–2010.]
* excluding East Asian inventors

Patents Per Inventor* (share of inventors)
  1 patent      66.78%
  2 thru 5      26.86%
  6 thru 10      3.94%
  11 thru 50     2.29%
  50+            0.14%
* based on lower bound disambiguation

Top 10 Inventors*
  Firstname   Lastname     Country  Assignee                               Number of Patents
  KIA         SILVERBROOK  AU       SILVERBROOK RESEARCH PTY LTD           3382
  DONALD E    WEDER        US       WANDA M WEDER AND WILLIAM F STAETER    1001
  LEONARD     FORBES       US       MICRON TECHNOLOGY INC                   925
  GURTEJ S    SANDHU       US       MICRON TECHNOLOGY INC                   832
  PAUL        LAPSTUN      AU       SILVERBROOK RESEARCH PTY LTD            803
  WARREN M    FARNWORTH    US       MICRON TECHNOLOGY INC                   729
  GEORGE      SPECTOR      US       THE RUIZ LAW FIRM                       715
  SALMAN      AKRAM        US       MICRON TECHNOLOGY INC                   670
  WILLIAM I   WOOD         US       GENENTECH INC                           646
  AUSTIN L    GURNEY       US       GENENTECH INC                           618
* based on lower bound disambiguation, excluding East Asian inventors

Unique Coauthors by Patent Grant Year
[Line chart: average number of coauthors by grant year, 1975–2010, for the lower-bound (LB) and upper-bound (UB) disambiguations. Y-axis: Count, 0–2; X-axis: Grant Year.]

Largest Component per Year
[Line chart: size of the largest connected component (number of vertices) of the coauthorship network by grant year, 1975–2010, for the lower-bound and upper-bound disambiguations. Y-axis: 0–45,000; X-axis: Grant Year.]

Analysis
• Benchmark dataset from Jerry Marschke (NBER): manually edited data derived from inventor CVs, covering the patent histories of ~100 US inventors – mainly research scientists in university engineering and biochemistry departments.

Verification Statistics
  Run  Type          # of records   Underclumping  Overclumping  Recall   Precision
  0    Base dataset  9.17 million   n/a            n/a           n/a      n/a
  1    Consolidated  4.61 million   74.6%           1.7%         25.40%   93.73%
  2    Consolidated  2.20 million   12.3%           4.8%         87.70%   94.81%
  3    Consolidated  2.08 million    6.8%          10.1%         93.20%   90.22%
  4    Consolidated  2.05 million    4.6%          10.3%         95.40%   90.26%
  5    Consolidated  2.02 million    4.1%          10.3%         95.90%   90.30%
  6    Consolidated  2.01 million    2.8%          19.2%**       97.20%   83.51%
  7    Consolidated  1.99 million    2.7%          19.2%**       97.30%   83.52%
  8    Splitting     2.26 million   15.9%          15.3%         84.10%   84.61%
  ** due to "blackhole" names
• Encouraging results.

Challenges and Improvements
• Disambiguation of East Asian names is difficult; the current algorithm is well suited to European names.
• Systematic improvements are required to handle correlations between fields.
• Overclumping for common names – frequency adjustment using stop listing: removing David Johnson, Eric Anderson, and Stephen Smith from the analysis improves the overclumping metric from 19.2% to 5.1% for the last two consolidated runs.
• Computation time vs. algorithmic accuracy.
• Benchmark datasets for results analysis.

Research Applications
• Origin of breakthroughs
• Impact of legislation on innovation
• Organizational influence on innovation
• Inventor careers and collaboration networks
• Dataverse Network platform

Questions?

Appendix

Patent Data
Consolidated inventor dataset (example):
  Invnum_N  Name           Patent   Assignee  City       State  ...
  12345     Fleming, Lee   5029133  HP        Fremont    CA     ...
  12345     Fleming, Lee   9999999  Harvard   Cambridge  MA     ...
  45678     Yu, Amy        9999999  Harvard   Boston     MA     ...
  67890     Lai, Ronald    9999999  Harvard   Randolph   MA     ...
• Prof. Fleming, Amy, and Ron collaborate on patent 9999999.
• Data are organized as unique inventor-patent pairs; a coauthorship network can be built directly from these pairs (see the sketch below).
• Invnum_N is the unique inventor number generated by the HBS disambiguation algorithm and is constant across an inventor's patents.
• Invnum = patent number + inventor sequence; Invnum_N = disambiguated inventor identifier.
• Each patent is assigned to one entity (usually the inventors' employer, or the inventors themselves if the assignee field is blank); the assignee is constant over a patent.
• Location data are inventors' personal addresses (at the city level) and vary over a patent.
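To illustrate how the coauthor counts and largest-component sizes reported above can be derived from the consolidated inventor-patent pairs, here is a minimal sketch. The field names follow the appendix table; the networkx dependency, the function names, and the toy data are assumptions, not part of the HBS pipeline.

  # Sketch: building a coauthorship network from (Invnum_N, Patent) pairs and
  # reporting average coauthor counts and the largest connected component.

  from collections import defaultdict
  from itertools import combinations
  import networkx as nx

  def coauthor_graph(inventor_patent_pairs):
      """inventor_patent_pairs: iterable of (invnum_n, patent) tuples."""
      inventors_by_patent = defaultdict(set)
      for invnum_n, patent in inventor_patent_pairs:
          inventors_by_patent[patent].add(invnum_n)

      g = nx.Graph()
      for inventors in inventors_by_patent.values():
          g.add_nodes_from(inventors)
          g.add_edges_from(combinations(sorted(inventors), 2))  # link every coauthor pair
      return g

  def summarize(g):
      avg_coauthors = sum(dict(g.degree()).values()) / g.number_of_nodes()
      largest = max(nx.connected_components(g), key=len)
      return avg_coauthors, len(largest)

  # Toy example from the appendix slide: Fleming, Yu, and Lai share patent 9999999.
  pairs = [("12345", "5029133"), ("12345", "9999999"),
           ("45678", "9999999"), ("67890", "9999999")]
  print(summarize(coauthor_graph(pairs)))
  # (2.0, 3): each inventor has 2 coauthors on average; the largest component has 3 inventors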
Disambiguation Algorithm: Step Summary
• Blocking: partition the inventor-patent dataset based on seven different blocking criteria.
• Training Sets: build a training set database for each set of blocking criteria; each is a database containing four tables, each with ~10 million pairs of record ids.
• Ratios: one ratio database is created for each training set; similarity profiles are paired with match probabilities.
• Disambiguation: starts from invpat (the inventor-patent database) or from the previously disambiguated and consolidated database; within each block, every pair of records is compared; the output is invnum_N.
• Consolidation: based on the disambiguated invnum_N, update invnum_N within invpat and consolidate records sharing the same invnum_N.

Summary of Data Passes
  Run  Type          Block 1                              Block 2
  1    Consolidated  First name                           Last name
  2    Consolidated  First 5 characters of first name     First 8 characters of last name
  3    Consolidated  First 3 characters of first name     First 5 characters of last name
  4    Consolidated  Initials of first and middle names   First 5 characters of last name
  5    Consolidated  First initial                        First 5 characters of last name
  6    Consolidated  Initials of first and middle names   Last 5 characters of last name, reversed
  7    Consolidated  First initial                        Last 5 characters of last name, reversed
  8    Splitting     Invnum_N from step 7                 Patent

Similarity Profiles
• Seven-dimensional.
• Fields used: name attributes (first name, middle initials, and last name) and patent attributes (inventor address, assignee, technology class, and coauthors).
• Each element is a discrete similarity score determined by a field-wise comparison between two records; name fields use Jaro-Winkler string comparison.
• Monotonicity assumption: if one profile dominates another (that is, each of its elements is greater than or equal to the corresponding element of the other profile), it must map to a higher match probability.

Similarity Scores
Comparison function scoring, written LEFT/RIGHT for LEFT vs. RIGHT. A sketch of two of these scoring functions follows this slide.

1) Firstname: 0-6. Factors: number of tokens and similarity between tokens.
   0: totally different – THOMAS ERIC / RICHARD JACK EVAN
   1: one name missing – THOMAS ERIC / (none)
   2: THOMAS ERIC / THOMAS JOHN ALEX
   3: LEE RON ERIC / LEE ALEX ERIC
   4: names match once spaces are removed but the raw names don't (JOHNERIC / JOHN ERIC), or short name vs. long name (ERIC / ERIC THOMAS)
   5: ALEX NICHOLAS / ALEX NICHOLAS TAKASHI
   6: ALEX NICHOLAS / ALEX NICHOLA (may not be exactly the same but identified as the same by Jaro-Winkler)

2) Lastname: 0-6. Factors: number of tokens and similarity between tokens.
   0: totally different – ANDERSON / DAVIDSON
   1: one name missing – ANDERSON / (none)
   2: first part does not match – DE AMOUR / DA AMOUR
   3: VAN DE WAALS / VAN DES WAALS
   4: DE AMOUR / DEAMOUR
   5: JOHNSTON / JOHNSON
   6: DE AMOUR / DE AMOURS

3) Midname: 0-4 (the following examples are drawn from the Firstname column, so the first name is included).
   0: THOMAS ERIC / JOHN THOMAS
   1: JOHN ERIC / JOHN (missing)
   2: THOMAS ERIC ALEX / JACK ERIC RONALD
   3: THOMAS ERIC RON ALEX EDWARD / JACK ERIC RON ALEX LEE
   4: THOMAS ERIC / THOMAS ERIC LEE

4) Assignee: 0-8.
   0: different Asgnum, totally different names (no single common word)
   1: different Asgnum, one name missing
   2: different Asgnum – HARVARD UNIVERSITY LONGWOOD MEDICAL SCHOOL / DARTMOUTH HITCHCOCK MEDICAL CENTER
   3: different Asgnum – HARVARD UNIVERSITY PRESIDENT AND FELLOWS / PRESIDENTS AND FELLOW OF HARVARD
   4: different Asgnum – HARVARD UNIVERSITY / HARVARD UNIVERSITY MEDICAL SCHOOL
   5: different Asgnum – MICROSOFT CORPORATION / MICROSOFT CORPORATED
   6: same Asgnum, company size > 1000
   7: same Asgnum, 1000 > size > 100
   8: same Asgnum, size < 100

5) Class: 0-4. Number of common classes; missing = 1.

6) Coauthors: 0-10. Number of common coauthors.

7) Distance: 0-7. Factors: longitude/latitude and street address.
   0: totally different
   1: one is missing
   2: 75 km < distance < 100 km
   3: 50 km < distance < 75 km
   4: 10 km < distance < 50 km
   5: distance < 10 km
   6: distance < 10 km and street matches but not in the US, or distance < 10 km and in the US but street does not match
   7: street matches and in the US
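A minimal sketch of two of the field-wise scores above: the coauthor score and the distance score. The haversine helper, the record field names, and the handling of boundary cases (for example, distances of 100 km or more scoring 0) are assumptions, since the slide only gives the interior ranges.

  # Sketch of two field-wise similarity scores (illustrative, not the HBS code).
  # Coauthor score: 0-10, number of common coauthors (capped at 10).
  # Distance score: 0-7, from lat/long and street address.

  from math import radians, sin, cos, asin, sqrt

  def coauthor_score(coauthors_a, coauthors_b):
      return min(len(set(coauthors_a) & set(coauthors_b)), 10)

  def haversine_km(lat1, lon1, lat2, lon2):
      """Great-circle distance between two lat/long points, in kilometres."""
      lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
      h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
      return 2 * 6371.0 * asin(sqrt(h))

  def distance_score(rec_a, rec_b):
      """rec_*: dicts with 'lat', 'long', 'street', 'country' (field names assumed)."""
      if rec_a.get("lat") is None or rec_b.get("lat") is None:
          return 1                                     # one location is missing
      d = haversine_km(rec_a["lat"], rec_a["long"], rec_b["lat"], rec_b["long"])
      street_match = bool(rec_a.get("street")) and rec_a.get("street") == rec_b.get("street")
      in_us = rec_a.get("country") == rec_b.get("country") == "US"
      if street_match and in_us:
          return 7
      if d < 10:
          return 6 if (street_match or in_us) else 5   # street match outside US, or in US without street match
      if d < 50:
          return 4
      if d < 75:
          return 3
      if d < 100:
          return 2
      return 0                                         # assumed: 100 km or more counts as "totally different"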
Probabilistic Matching Model
• Name and patent attributes are assumed to be independent.
• Unbiased training sets are created by conditioning on one set of features to build a sample of obvious matches or nonmatches, which is then used to learn about the other set of features without bias.
• Count the frequency of each similarity profile x in the match and nonmatch sets to calculate P(x|M) and P(x|N).

Training Set Criteria
Condition on patent attributes to train name attributes:
• Match: choose all record pairs that have at least two common coauthors within each predefined block.
• Nonmatch: choose all record pairs that have the same application year, different assignees, no common coauthors, and no common classes within each predefined block.
Condition on name attributes to train patent attributes:
• Match: choose all record pairs that share the same rare name (compute statistics on unique full names and choose those whose first or last name appears only once); it is not necessary to check each block.
• Nonmatch: choose all record pairs that have different last names, drawn from a subset of the whole database whose record counts are proportional to the original by grant year.

Probabilistic Matching Model (continued)
• Likelihood ratio: r = P(x|M) / P(x|N).
• Probability of a match given similarity profile x: P(M|x) = r·P(M) / (r·P(M) + 1 - P(M)), where P(M) is empirically determined.
• Smoothing: enforce monotonicity; r is interpolated/extrapolated for unobserved profiles x.

Disambiguation & Consolidation
• Generate a similarity profile for each pair of records within each block.
• Look up the similarity profile in the ratio database to find the match probability.
• Based on a given probability threshold, determine whether invnum_N (the algorithmically generated unique inventor identifier) should be updated.
• Records with the same invnum_N are consolidated, which improves algorithm efficiency for subsequent runs.

References
Hall, B. H., A. B. Jaffe, and M. Trajtenberg (2001). "The NBER Patent Citations Data File: Lessons, Insights and Methodological Tools." NBER.
Torvik, V., M. Weeber, D. Swanson, and N. Smalheiser (2005). "A Probabilistic Similarity Metric for Medline Records: A Model for Author Name Disambiguation." Journal of the American Society for Information Science and Technology, 56(2): 140-158.
Torvik, V., and N. Smalheiser (2009). "Author Name Disambiguation in MEDLINE." ACM Transactions on Knowledge Discovery from Data, 3(3), Article 11.