Mapping Academic Patents to Papers

Hyun-Woo Kim,1 Zhen Lei,1 Brian Wright,2 John Yen1
1 Penn State University
2 UC Berkeley
NAS SciSIP PI Conference
September 20-21, 2012
NSF SciSIP Project
Collaborative Research:
The Impacts of University Research and Funding Sources in
Chemical Sciences: Publishing, Patenting, Commercialization
PIs: Brian Wright (UC Berkeley) and Zhen Lei (Penn State)
• Role of research sponsor type (government or industry) in university research, patenting, and technology transfer
• Publishing, patenting, licensing/MTAs, and diffusion and follow-on research of university inventions
• Interplay between government and industry funding in university research
Datasets
Data 1:
Access to University of California Office of Technology Transfer:
1) Invention disclosures, patenting and licensing history
2) Sponsor information, technology information
Data 2:
All scientific publications in chemical sciences by UC researchers
in 1975-2005, and the associated citation profile of these
publications
Mapping Patents to Papers
Patent/Paper Correspondence:
One-to-One in Theory
A paper
An invention
Same researchers
Close dates
A patent
Not So Clean in Practice
Patent filing
Papers
Patent filing
Continuation
Papers
Grant
Features of a Patent-Paper Pair
Feature Group 1 (paper coauthors’ names):
– Does first co-inventor’s last name appear in the co-author list?
– Does first co-inventor’s “first initial and last name” appear in the co-author list?
– Does first co-author’s last name appear in the co-inventor list?
– Does first co-author’s “first initial and last name” appear in the co-inventor list?
– Does last co-author’s last name appear in the co-inventor list?
– Does last co-author’s “first initial and last name” appear in the co-inventor list?
– Fraction of patent inventors whose first initial and last name appear in the co-author list of the paper
– Fraction of patent inventors whose last names appear in the co-author list of the paper
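The eight name features above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes names arrive as "First Last" strings and that matching is case-insensitive, and the function and key names (`name_features` and so on) are hypothetical.

```python
def last_name(full_name):
    """Lowercased last token of a 'First Last' string (assumed name format)."""
    return full_name.strip().split()[-1].lower()


def initial_and_last(full_name):
    """(first initial, last name), e.g. 'Jane Doe' -> ('j', 'doe')."""
    parts = full_name.strip().split()
    return (parts[0][0].lower(), parts[-1].lower())


def name_features(inventors, authors):
    """Feature Group 1 for one patent-paper pair (both lists non-empty)."""
    inv_last = [last_name(n) for n in inventors]
    inv_full = [initial_and_last(n) for n in inventors]
    auth_last = [last_name(n) for n in authors]
    auth_full = [initial_and_last(n) for n in authors]
    n = len(inventors)
    return {
        "first_inventor_last_in_authors": inv_last[0] in auth_last,
        "first_inventor_initial_last_in_authors": inv_full[0] in auth_full,
        "first_author_last_in_inventors": auth_last[0] in inv_last,
        "first_author_initial_last_in_inventors": auth_full[0] in inv_full,
        "last_author_last_in_inventors": auth_last[-1] in inv_last,
        "last_author_initial_last_in_inventors": auth_full[-1] in inv_full,
        "frac_inventors_initial_last_in_authors":
            sum(x in auth_full for x in inv_full) / n,
        "frac_inventors_last_in_authors":
            sum(x in auth_last for x in inv_last) / n,
    }


# Made-up example names, for illustration only.
features = name_features(["Jane Doe", "John Smith"],
                         ["J. Doe", "Alice Wong", "Robert Smith"])
```

In this made-up example both inventors' last names appear among the authors, but only one inventor also matches on first initial, so the two fraction features differ.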
Features of a Patent-Paper Pair
Feature Group 2 (paper primary affiliation):
– String similarity score (Levenshtein distance) between patent assignee and paper primary affiliation
– Percentage of common words between patent assignee and paper primary affiliation
– Is the patent assignee’s country the same as the paper primary affiliation’s?
– Is the patent assignee’s (city or state) + country the same as the paper primary affiliation’s?
– Does first co-inventor’s country appear in the paper primary affiliation?
– Does first co-inventor’s city/state and country appear in the paper primary affiliation?
– Fraction of the inventors whose countries are the same as the paper primary affiliation’s
– Fraction of the inventors whose city/state and country are the same as the paper primary affiliation’s
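As an illustration of the string similarity feature, here is a minimal pure-Python Levenshtein distance. The slides do not say how the distance is turned into a score, so the normalization in `levenshtein_similarity` (a hypothetical helper) is an assumption.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Assumed normalization: 1 - distance / longer length, in [0, 1]."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


score = levenshtein_similarity("Penn State University",
                               "Pennsylvania State University")
```

Raw edit distance grows with string length, which is why this sketch scales it by the longer string before using it as a feature.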
Features of a Patent-Paper Pair
Feature Group 3 (content similarity):
– Fraction of the common words in patent and paper titles
– Fraction of the common words in patent and paper abstracts
– Fraction of the paper’s chemical substances that appear in patent title
– Fraction of the paper’s chemical substances that appear in patent abstract
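A minimal sketch of the content-similarity features. The slides do not define the denominator of "fraction of the common words", so the shared-over-union (Jaccard) form below is an assumption, as are the simple tokenizer and the case-insensitive substring test for chemical substances. All function names are hypothetical.

```python
import re


def tokens(text):
    """Set of lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def common_word_fraction(text_a, text_b):
    """Shared distinct words over all distinct words (assumed definition)."""
    a, b = tokens(text_a), tokens(text_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def substance_fraction(substances, patent_text):
    """Fraction of the paper's chemical substances named in the patent text."""
    if not substances:
        return 0.0
    text = patent_text.lower()
    hits = sum(1 for s in substances if s.lower() in text)
    return hits / len(substances)


# Made-up titles and substances, for illustration only.
paper_title = "Synthesis of gold nanoparticles"
patent_title = "Gold nanoparticles and methods of synthesis"
title_overlap = common_word_fraction(paper_title, patent_title)
substance_hit = substance_fraction(["Gold", "Citric acid"], patent_title)
```

The same two functions apply unchanged to abstracts; only the input text differs.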
Features of a Patent-Paper Pair
Feature Group 4 (Timing):
− Abs (Paper publication year – Patent filing year)
− Abs (Paper publication year – Earliest patent filing year)
Data
Murray/Stern Data
165 pairs of Nature Biotech papers and US patents
Our Experiment
165 patents: 162 with one GT (ground truth) paper, 3 with 2 GTs
Retrieve papers from PubMed that share at least one last name
Filtering:
Exclude Review Articles
Keep papers published from (earliest patent filing year − 2) to (patent filing year + 5)
A total of 247322 patent-article pairs
1498.92 articles/patent on average
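The filtering step and the timing features can be sketched as below. This is an illustration under stated assumptions: the window is taken to be inclusive at both ends, and the record fields `year` and `is_review` are hypothetical names, not PubMed field names.

```python
def in_window(pub_year, earliest_filing_year, filing_year):
    """Publication-year window from the slide, assumed inclusive."""
    return earliest_filing_year - 2 <= pub_year <= filing_year + 5


def filter_candidates(papers, earliest_filing_year, filing_year):
    """Drop review articles and papers published outside the window."""
    return [p for p in papers
            if not p["is_review"]
            and in_window(p["year"], earliest_filing_year, filing_year)]


def timing_features(pub_year, earliest_filing_year, filing_year):
    """Feature Group 4: absolute year gaps to filing and earliest filing."""
    return (abs(pub_year - filing_year),
            abs(pub_year - earliest_filing_year))


# Made-up candidate papers for a patent first filed in 2000 and filed in 2002.
papers = [
    {"year": 1997, "is_review": False},
    {"year": 1998, "is_review": False},
    {"year": 2005, "is_review": True},
    {"year": 2007, "is_review": False},
    {"year": 2008, "is_review": False},
]
kept = filter_candidates(papers, earliest_filing_year=2000, filing_year=2002)

# Consistency check on the reported totals: 247322 pairs over 165 patents.
average_articles = round(247322 / 165, 2)
```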
Experiment 1
• 10-fold Cross Validation
• Algorithms to Build Models
Logistic Regression
Normal-Identity Regression
Binomial-LogLog Regression
Binomial-Probit Regression
An ensemble method averaging all above
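The four models differ in how a linear predictor is turned into a score. The sketch below shows the inverse link functions and a simple score average. Two assumptions: the log-log link follows the convention eta = -ln(-ln(p)), which the slides do not specify, and the ensemble averages the four models' predicted scores for each pair. All function names are hypothetical.

```python
import math


def inv_logit(eta):
    """Logistic regression: p = 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))


def inv_probit(eta):
    """Binomial-probit: p = standard normal CDF of eta."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))


def inv_loglog(eta):
    """Binomial-log-log (assumed convention): p = exp(-exp(-eta))."""
    return math.exp(-math.exp(-eta))


def identity(eta):
    """Normal-identity: the linear predictor is the score itself."""
    return eta


def ensemble_score(model_scores):
    """Unweighted mean of the individual models' scores for one pair."""
    return sum(model_scores) / len(model_scores)


# Made-up linear predictor values for one patent-paper pair.
scores = [inv_logit(0.8), identity(0.6), inv_loglog(1.2), inv_probit(0.5)]
combined = ensemble_score(scores)
```

Candidate papers for a patent would then be ranked by descending score, either per model or by the combined score.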
Model Comparison
(rank of GT)
• Use all features
Average GT1 position by model:
Model          Upper position   Lower position
Logistic       3.1647           3.5393
Nor-Identity   3.5276           3.5276
Bin-LogLog     3.0640           3.4449
Bin-Probit     3.1018           3.4765
Ensemble       2.8788           2.8788
[Figure: average GT1 position (upper and lower position) for the Logistic, Normal-Identity, Binomial-LogLog, Binomial-Probit, and Ensemble models]
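The slides do not define the upper and lower positions. One plausible reading, assumed here, is that they bound the GT paper's rank when other candidates tie with it on score: the upper position places the GT first among the ties, the lower position places it last. A minimal sketch of that reading, with hypothetical names:

```python
def gt_rank_bounds(scores, gt_index):
    """Best-case and worst-case rank of the GT paper under score ties.

    scores: model scores for all candidate papers of one patent.
    gt_index: position of the ground-truth paper in that list.
    """
    gt_score = scores[gt_index]
    higher = sum(1 for s in scores if s > gt_score)
    tied = sum(1 for i, s in enumerate(scores)
               if s == gt_score and i != gt_index)
    upper = higher + 1         # GT ranked ahead of every tied candidate
    lower = higher + tied + 1  # GT ranked behind every tied candidate
    return upper, lower


# Made-up scores: the GT paper (index 2) ties with one other candidate.
bounds = gt_rank_bounds([0.9, 1.0, 1.0, 0.3], gt_index=2)
```

Under this reading, a model with no tied scores has identical upper and lower positions, and averaging each bound over patents gives one pair of numbers per model.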
Tagging
• Evaluate top-ranked papers for each patent to
see whether they are GTs as well
• 1120 patent-paper pairs have been evaluated
and tagged.
– Not GTs: 566 pairs
– Uncertain: 4 pairs
– GTs: 550 pairs
Histograms:
(# of GTs per Patent)
[Figure: two histograms, "Before Tagging" and "After Tagging"; x-axis: # of GTs (0 to 11), y-axis: # of Patents (0 to 160)]
Experiment 2
• Updated GT papers for each patent
• 10-fold Cross Validation
• Algorithms to Build Models
Logistic Regression
Normal-Identity Regression
Binomial-LogLog Regression
Binomial-Probit Regression
An ensemble method averaging all above
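A minimal sketch of a 10-fold split for this setup. It assumes folds are formed over patents rather than over individual patent-paper pairs, so that all candidate papers of a patent are scored by a model that never saw that patent. The slides do not say how folds were drawn, and `k_fold_by_patent` is a hypothetical name.

```python
import random


def k_fold_by_patent(patent_ids, k=10, seed=0):
    """Yield (train_ids, test_ids) for k folds over distinct patents."""
    ids = sorted(set(patent_ids))
    random.Random(seed).shuffle(ids)          # reproducible shuffle
    folds = [ids[i::k] for i in range(k)]     # near-equal fold sizes
    for held_out in folds:
        test = set(held_out)
        train = [p for p in ids if p not in test]
        yield train, sorted(test)


# 165 patents, as in the experiment; the integer IDs are placeholders.
splits = list(k_fold_by_patent(range(165), k=10))
```

Each patent is held out exactly once, so the ranks collected across the ten test folds cover all patents.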
Model Comparison
(rank of 1st GT)
• Use all features
Average GT1 position by model:
Model          Upper position   Lower position
Logistic       1.0739           1.0739
Nor-Identity   1.1923           1.1923
Bin-LogLog     1.0680           1.0680
Bin-Probit     1.0739           1.0739
Ensemble       1.0870           1.0870
[Figure: average GT1 position (upper and lower position) for the Logistic, Normal-Identity, Binomial-LogLog, Binomial-Probit, and Ensemble models]
Model Comparison
(fraction of GTs in Top k)
• Use all features
[Figure: fraction of GTs in Top1, Top2, Top3, and Top5 (y-axis 0 to 1) for the Logistic, Normal-Identity, Binomial-LogLog, Binomial-Probit, and Ensemble models]
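A minimal sketch of a top-k metric. The slides do not define it precisely; the reading assumed here is the share of all GT papers that are ranked within the top k candidates of their patent. Another reading (the share of patents with at least one GT in the top k) would need a different numerator and denominator. The function name is hypothetical.

```python
def fraction_in_top_k(gt_ranks_per_patent, k):
    """Share of GT papers ranked at position k or better.

    gt_ranks_per_patent: one list per patent holding the 1-based ranks
    of that patent's GT papers among its candidates.
    """
    total = sum(len(ranks) for ranks in gt_ranks_per_patent)
    if total == 0:
        return 0.0
    hits = sum(1 for ranks in gt_ranks_per_patent
               for r in ranks if r <= k)
    return hits / total


# Made-up ranks for three patents (the first patent has two GT papers).
ranks = [[1, 3], [2], [6]]
curve = {k: fraction_in_top_k(ranks, k) for k in (1, 2, 3, 5)}
```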
Summary
• An algorithm to link patents to papers
• Useful tool for studying dynamics and interaction in
utilization of university inventions by both academia
and industry, and impacts of university patenting and
licensing
• Useful tool for evaluating impacts of government
funding
Thank you!
zlei@psu.edu
Fraction of patent inventors whose last names appear in GT papers
[Figure: Feature 8 value (y-axis, 0 to 1) by Patent ID (x-axis, 1 to 10)]