Mapping Academic Patents to Papers
Hyun-Woo Kim,1 Zhen Lei,1 Brian Wright,2 John Yen1
1 Penn State University
2 UC Berkeley
NAS SciSIP PI Conference, September 20-21, 2012

NSF SciSIP Project
Collaborative Research: The Impacts of University Research and Funding Sources in Chemical Sciences: Publishing, Patenting, Commercialization
PIs: Brian Wright (UC Berkeley) and Zhen Lei (Penn State)
• Role of research sponsor type (government or industry) on university research, patenting, technology transfer
• Publishing, patenting, licensing/MTAs, and diffusion and follow-on research of university inventions
• Interplay between government and industry funding in university research

Datasets
Data 1: Access to University of California Office of Technology Transfer:
1) Invention disclosures, patenting and licensing history
2) Sponsor information, technology information
Data 2: All scientific publications in chemical sciences by UC researchers in 1975-2005, and the associated citation profile of these publications

Mapping Patents to Papers

Patent/Paper Correspondence: One-to-One in Theory
[Diagram labels: A paper, An invention, A patent; Same researchers; Close dates]

Not So Clean in Practice
[Diagram labels: Patent filing, Papers, Patent filing, Continuation, Papers, Grant]

Features of a Patent-Paper Pair
Feature Group 1 (paper coauthors’ names):
– Does first co-inventor’s last name appear in the co-author list?
– Does first co-inventor’s “first initial and last name” appear in the co-author list?
– Does first co-author’s last name appear in the co-inventor list?
– Does first co-author’s “first initial and last name” appear in the co-inventor list?
– Does last co-author’s last name appear in the co-inventor list?
– Does last co-author’s “first initial and last name” appear in the co-inventor list?
– Fraction of patent inventors whose first initial and last name appear in the coauthor list of the paper
– Fraction of patent inventors whose last names appear in the coauthor list of the paper

Features of a Patent-Paper Pair
Feature Group 2 (paper primary affiliation):
– String similarity score (Levenshtein Distance) between patent assignee and paper primary affiliation
– Percentage of the common words between patent assignee and paper primary affiliation
– Is the patent assignee’s country the same as the paper primary affiliation’s?
– Is the patent assignee’s (city or state) + country the same as the paper primary affiliation’s?
– Does first co-inventor’s country appear in the paper primary affiliation?
– Does first co-inventor’s city/state and country appear in the paper primary affiliation?
– Fraction of the inventors whose countries are the same as the paper primary affiliation’s
– Fraction of the inventors whose city/state and country are the same as the paper primary affiliation’s

Features of a Patent-Paper Pair
Feature Group 3 (content similarity):
– Fraction of the common words in patent and paper titles
– Fraction of the common words in patent and paper abstracts
– Fraction of the paper’s chemical substances that appear in patent title
– Fraction of the paper’s chemical substances that appear in patent abstract

Features of a Patent-Paper Pair
Feature Group 4 (timing):
– Abs (Paper publication year – Patent filing year)
– Abs (Paper publication year – Earliest patent filing year)

Data
Murray/Stern Data
• 165 pairs of Nature Biotech paper / US patent
Our Experiment
• 165 patents: 162 with one GT (ground truth) paper, 3 with 2 GTs
• Retrieve papers from PubMed that share at least one last name
• Filtering:
  – Exclude review articles
  – Keep papers published from (Earliest patent filing year – 2) to (Patent filing year + 5)
• A total of 247322 patent-article pairs
• 1498.92 articles/patent on average

Experiment 1
• 10-fold Cross Validation
• Algorithms to Build Models
  – Logistic Regression
  – Normal-Identity Regression
  – Binomial-LogLog Regression
  – Binomial-Probit Regression
  – An ensemble method averaging all above

Model Comparison (rank of GT)
• Use all features

Model          Upper    Lower
Logistic       3.1647   3.5393
Nor-Identity   3.5276   3.5276
Bin-LogLog     3.0640   3.4449
Bin-Probit     3.1018   3.4765
Ensemble       2.8788   2.8788

[Figure: Average GT1 Position (Upper Position and Lower Position) by model: Logistic, Normal-Identity, Binomial-LogLog, Binomial-Probit, Ensemble]

Tagging
• Evaluate the top-ranked papers for each patent to see whether they are GTs as well.
• 1120 patent-paper pairs have been evaluated and tagged.
  – Not GTs: 566 pairs
  – Uncertain: 4 pairs
  – GTs: 550 pairs

Histograms: (# of GTs per Patent)
[Figure: two histograms, Before Tagging and After Tagging; x-axis: # of GTs (1 to 11); y-axis: # of Patents]

Experiment 2
• Updated GT papers for each patent
• 10-fold Cross Validation
• Algorithms to Build Models
  – Logistic Regression
  – Normal-Identity Regression
  – Binomial-LogLog Regression
  – Binomial-Probit Regression
  – An ensemble method averaging all above

Model Comparison (rank of 1st GT)
• Use all features

Model          Upper    Lower
Logistic       1.0739   1.0739
Nor-Identity   1.1923   1.1923
Bin-LogLog     1.0680   1.0680
Bin-Probit     1.0739   1.0739
Ensemble       1.0870   1.0870

[Figure: Average GT1 Position (Upper Position and Lower Position) by model: Logistic, Normal-Identity, Binomial-LogLog, Binomial-Probit, Ensemble]

Model Comparison (fraction of GTs in Top k)
• Use all features
[Figure: fraction of GTs in Top1, Top2, Top3, and Top5 by model: Logistic, Normal-Identity, Binomial-LogLog, Binomial-Probit, Ensemble; y-axis from 0 to 1]

Summary
• An algorithm to link patents to papers
• Useful tool for studying dynamics and interaction in utilization of university inventions by both academia and industry, and impacts of university patenting and licensing
• Useful tool for evaluating impacts of government funding

Thank you!
zlei@psu.edu

Fraction of patent inventors whose last names appear in GT papers
[Figure: Feature 8 Value (y-axis, 0 to 1) by Patent ID (x-axis, 1 to 10)]
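
As a rough illustration of how a few of the pair features listed in the slides might be computed, here is a minimal Python sketch. It covers the two inventor-name fractions from Feature Group 1 (the last-name fraction is the "Feature 8" quantity plotted above), a title word-overlap score in the spirit of Feature Group 3, and the year gap from Feature Group 4. All function names and the sample data are hypothetical. The name parsing assumes simple "First ... Last" strings, and the word-overlap denominator (the union of both word sets) is an assumption, since the slides do not define it.

```python
import re


def last_name(name):
    """Lower-cased last token of a 'First ... Last' name (assumed format)."""
    return name.strip().split()[-1].lower()


def initial_and_last(name):
    """(first initial, last name) pair, both lower-cased (assumed format)."""
    parts = name.strip().split()
    return (parts[0][0].lower(), parts[-1].lower())


def fraction_inventors_matched(inventors, coauthors, key=last_name):
    """Fraction of patent inventors whose key also occurs among the coauthors.

    key=last_name gives the last-name fraction; key=initial_and_last gives
    the 'first initial and last name' fraction.
    """
    if not inventors:
        return 0.0
    author_keys = {key(author) for author in coauthors}
    hits = sum(1 for inventor in inventors if key(inventor) in author_keys)
    return hits / len(inventors)


def word_overlap(text_a, text_b):
    """Shared words divided by all distinct words in both texts.

    The denominator is an assumption; the slides only say
    'fraction of the common words'.
    """
    words_a = set(re.findall(r"[a-z0-9]+", text_a.lower()))
    words_b = set(re.findall(r"[a-z0-9]+", text_b.lower()))
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def year_gap(paper_year, filing_year):
    """Absolute difference between publication year and a patent filing year."""
    return abs(paper_year - filing_year)


if __name__ == "__main__":
    # Made-up example pair, for illustration only.
    inventors = ["Jane Q. Doe", "Raj Patel", "Li Wang"]
    coauthors = ["J. Doe", "Li Wang", "Maria Garcia"]
    print(fraction_inventors_matched(inventors, coauthors))
    print(fraction_inventors_matched(inventors, coauthors, key=initial_and_last))
    print(word_overlap("Method for protein folding", "Protein folding in yeast"))
    print(year_gap(2001, 1999))
```

In practice name matching would likely need more care (suffixes, hyphenated or multi-word surnames, transliteration), which this sketch deliberately ignores.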