Yusuke Naito & Naotoshi Tsukada - APE-INV

Disambiguating Japanese Inventors
Yusuke Naito,Naotoshi Tsukada
ESF-APE-INV 3rd "Name Game" workshop
Yusuke Naito,Naotoshi Tsukada
2011/9/4
Agenda
Motivation
Issues and Topics
Outline of Program
Data
Details
Data acquisition via advanced questionnaire system
Result
Future work








Motivation
Innovation research using patents



Trace the individual inventions
Calculate statistics for inventors
“Name Game” is a common theme among international researchers in scientometrics
Solutions to the same-name problem have historically used a scoring style [Aizawa2005]
The NBER research papers are the pioneers [Trajtenberg 2006] [Kim2006]
The items and methodology depend on the research purpose or the country
Ex. When measuring the mobility of inventors, one cannot use the affiliation item. [Kim2006] In Korea, a few common names cover most of the population.
There exists no published list of disambiguated Japanese inventors.
Issues
The items usable for identifying an inventor are limited.
Both type 1 and type 2 errors occur when identifying each inventor.
If only we could use birthdays or social insurance numbers …
Big companies produce many employee inventions, so the same-name problem cannot be ignored.
The inventor's address is not restricted to either home or company. Even a company address is difficult to identify: there is no common rule on whether it is the HQ or a divisional address.
Even for a single inventor, the expression of the address varies across multiple inventions.
It is hard to identify the same person when his/her affiliation changes.

Topics
Algorithms depend on language characteristics


Natural language processing
Use all usable data




GIS
Phone book
Patent database
Data acquisition from NEDO project


Questionnaire System Development from scratch
Outline of Program
Classify the detailed attributes described in the patent document
Score each item as a normalized value and weight it
Compare the total score between two inventors with a threshold
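The three steps above can be sketched as follows. This is a minimal illustration, not the authors' program: the item names, the weights, and the threshold value are all hypothetical, and each item score is assumed to be already normalized to the range 0 to 1.

```python
from typing import Dict

# Hypothetical items and weights, chosen only for illustration.
WEIGHTS: Dict[str, float] = {
    "name": 0.4,
    "address": 0.3,
    "ipc": 0.2,
    "coinventor": 0.1,
}
THRESHOLD = 0.6  # hypothetical cut-off for "same inventor"


def total_score(item_scores: Dict[str, float],
                weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted average of per-item scores (each assumed to be in [0, 1])."""
    weight_sum = sum(weights.values())
    weighted = sum(w * item_scores.get(item, 0.0) for item, w in weights.items())
    return weighted / weight_sum


def same_inventor(item_scores: Dict[str, float],
                  threshold: float = THRESHOLD) -> bool:
    """Compare the total score of a candidate pair with the threshold."""
    return total_score(item_scores) >= threshold
```

For example, a pair that matches fully on name and address but on nothing else gets a total of about 0.7 under these assumed weights and is accepted.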



Item                    Trajtenberg 2006                My Program
Name string             Soundex                         Compare Kanji and Yomi by Levenshtein distance
Applicant               Use under median of frequency   Use in all cases
Population density      Use under median of frequency   Use under median of frequency
Middle name, surname    Use in all cases                NA
Address information     Full matching                   Measure the distance using GIS
Tech. fields            Use in all cases                Use in all cases
Citation, co-invention  Use in all cases                Use in all cases
Data
Input as target
Description as inventor in patent
Except one-time inventors
where no similar string and similar Yomi exist
Except same name and same address in near application dates
Where the inventor's company is no larger than a mid-size company and the technical field is different
Output as result
For public inventor identification
Grouping the same persons
For investigating the proper inventor(s)
The group with the most members that contains the target, or the group with the highest average score
As related data, output the other groups that differ from the target's group
Details
Items
Issues in items and their solutions
Scoring
Evaluate Functions
Machine Learning





Items (1): name and address
Name
Levenshtein distance in Kanji
Rewrite ambiguous Kanji strings so that variants are treated as the same: current and old forms that can be used interchangeably (斉藤 and 齊藤, 嶋田 and 島田, etc.) and wrong characters with the same Yomi, the Japanese Kana pronunciation (二郎 and 次郎, 祐介 and 裕介, etc.), using the Name Yomi Dictionary. This produces rough candidate pairs; scoring is then done between these candidate pairs.
Contributing to recall rate
Deciding specificity according to patent frequency 3.5
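The name comparison can be illustrated with a short sketch. It assumes a tiny hand-written variant table in place of the Name Yomi Dictionary, and the candidate rule (same Yomi, or Kanji similarity above a cut-off) and the cut-off value are assumptions made for this example only.

```python
# Hypothetical old-form to current-form table; a real system would load a dictionary.
VARIANTS = {"齊": "斉", "嶋": "島"}


def normalize(name: str) -> str:
    """Replace variant Kanji with one canonical form."""
    return "".join(VARIANTS.get(ch, ch) for ch in name)


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus the distance over the longer length."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def is_candidate(kanji_a: str, yomi_a: str, kanji_b: str, yomi_b: str,
                 min_similarity: float = 0.5) -> bool:
    """Rough candidate rule (assumed): same Yomi or similar Kanji string."""
    return yomi_a == yomi_b or name_similarity(kanji_a, kanji_b) >= min_similarity
```

With this table, 齊藤 and 斉藤 normalize to the same string, while 二郎 and 次郎 differ in one of two characters and are still kept as a candidate pair because their Yomi (ジロウ) is the same.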
Address
Resolve address changes from political history, such as mergers of municipalities, using the Address History Dictionary
Transform to geographical latitude and longitude data at the smallest area level, and calculate the distance
Contributing to recall rate
Deciding specificity according to patent frequency 4046
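Once two addresses are geocoded, their distance can be computed from latitude and longitude. The sketch below uses the haversine great-circle formula; the slides do not say which distance formula the GIS step uses, and the linear `address_score` with a 50 km cut-off is a hypothetical normalization.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def address_score(distance_km: float, max_km: float = 50.0) -> float:
    """Hypothetical normalized score: 1 at distance 0, falling linearly to 0 at max_km."""
    return max(0.0, 1.0 - distance_km / max_km)
```

Scoring by distance rather than by exact string match lets two differently written addresses of the same site still score high.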
Name distributions
Geographic distributions
Items (2): Network
Co-inventor
Candidate pairs are regarded as identical when connected by a path of length 2 or 3 in the co-inventor network
Citation
Network of citations written by the inventor (not the examiner)
Candidate pairs are regarded as identical when connected by a path of length less than 4 in the network
Network patterns in citation (figure): paths of length 1 (1 pattern), length 2 (3 patterns), and length 3 (4 patterns), where each edge is either citing or cited
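The path-length test on either network can be sketched with a bounded breadth-first search. The graph representation (a dictionary of neighbor sets) and the node names are hypothetical, and citing and cited links are treated here as undirected edges, which is an assumption.

```python
from collections import deque
from typing import Dict, Optional, Set


def path_length(graph: Dict[str, Set[str]], start: str, goal: str,
                max_len: int = 3) -> Optional[int]:
    """Shortest path length from start to goal, or None if it exceeds max_len."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_len:
            continue  # do not expand beyond the allowed path length
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Hypothetical co-inventor network: records A1 and A2 share co-inventor B.
GRAPH = {
    "A1": {"B"},
    "A2": {"B"},
    "B": {"A1", "A2", "C"},
    "C": {"B", "D"},
    "D": {"C"},
}
```

Here `path_length(GRAPH, "A1", "A2")` is 2, so the two records would be linked through their shared co-inventor.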
Items (3): affiliation and applicant
Affiliation
Score depends on the inverse of the size of the organization whose name is described in the inventor's address
Distinguish division names from company names (referring to the applicant name)
Applicant
Used when no organization is described in the inventor's address and the candidate pair has the same applicant
Score depends on the inverse of the size of the applicant
Items (4): application date and IPC
Application date
Score depends on the inverse of the period between the candidate pair's application dates.
1000 days as maximum period
IPC
Score depends on the matching rate of publication IPCs
$$\mathrm{Score}_{\mathrm{normalized}} = \frac{\text{Number of matched IPC} \times 2}{\text{Summation of number of IPC in pair}}$$
FI (search IPC)
Common to all patents and expanded from IPC ver. 4
Easy to compare
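Both scores on this slide can be sketched in a few lines. The IPC score reads the formula as a Dice coefficient (twice the matched count over the total count in the pair). The slide only says the date score depends on the inverse of the period with a 1000-day maximum, so the linear decay used here is an assumption, and the IPC symbols in the usage note are arbitrary examples.

```python
from datetime import date
from typing import Iterable

MAX_DAYS = 1000  # maximum period from the slide


def date_score(d1: date, d2: date, max_days: int = MAX_DAYS) -> float:
    """Assumed linear decay: 1 for the same day, 0 at max_days or more apart."""
    gap = abs((d1 - d2).days)
    return max(0.0, 1.0 - gap / max_days)


def ipc_score(ipcs_a: Iterable[str], ipcs_b: Iterable[str]) -> float:
    """Matching rate: 2 * matched IPCs / (IPCs in A + IPCs in B)."""
    a, b = set(ipcs_a), set(ipcs_b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))
```

For instance, `ipc_score({"G06F", "H04L"}, {"G06F", "G06Q"})` gives 0.5, since one of the four listed classes in the pair is shared on both sides.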
Issues in items and their solutions
Issue: time cost
Method 1: gain speed by indexing tables
Use exactly indexed tables in every relational (join) process
Take sufficient time to create the indexes once and reuse them afterwards
Method 2: exploit the sparseness of the matrix
Derive candidates from the matrix over all inventors
Text calculation
Pairwise calculation
Method 3: embed user-defined functions
Use programs compiled from C code for processes other than joins
Numeric calculation of distance or similarity
Enhancement: 100 times faster
30 targets in 20 days → 300 targets in 2 days
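The idea behind Methods 1 and 2 (build an index once, then score only the few pairs that share a key instead of the full inventor-by-inventor matrix) can be sketched in memory as below. The original work uses indexed database tables; the dictionary index, the record layout, and the choice of Yomi as the key are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations
from typing import Callable, Dict, Hashable, Set, Tuple


def candidate_pairs(records: Dict[int, dict],
                    key: Callable[[dict], Hashable]) -> Set[Tuple[int, int]]:
    """Return only the record pairs that share the same key value."""
    index = defaultdict(list)  # built once, like a database index
    for rec_id, rec in records.items():
        index[key(rec)].append(rec_id)
    pairs = set()
    for ids in index.values():
        # Pairs are generated within a block only, so the matrix stays sparse.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical inventor records keyed by record id.
RECORDS = {
    1: {"yomi": "サイトウ"},
    2: {"yomi": "サイトウ"},
    3: {"yomi": "ヤマモト"},
    4: {"yomi": "サイトウ"},
}
```

With four records the full matrix has six pairs, while `candidate_pairs(RECORDS, key=lambda r: r["yomi"])` returns only the three pairs among records 1, 2, and 4.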
Scoring

Simple method
Summation of each item's score
Weighted method
Parameters: item allotment
Normalized allotment within items
Parameters: weight of items
Tuning parameters
Manually
Machine learning
Machine Learning (not completed)
Teaching data
Manually disambiguated the top 30 inventors in NEDO patents
Showed that the result depends on name frequency
Reinforcement learning (Q-learning)
Weighting depends on sensitivity
Support Vector Machine (SVM)
Maximize margins in the evaluation of the kernel function (polynomial)
Genetic Algorithm (Classifier System)
Genotype: weights; Phenotype: total score
Converges to near-optimal
Tentative result
High weights on high-performance items
Comparable with conventional methods (like “hill climb”)
Evaluating Function

For items




N: true positive (no error)
V: false positive (type 1 error)
M: false negative (type 2 error)
Recall rate $R = \frac{N}{N + M}$
Precision rate $P = \frac{N}{N + V}$
F measure $F = \frac{2N}{2N + M + V}$
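The three item-level measures follow directly from the counts N, V, and M. The sketch below is a plain transcription of the formulas; returning 0 when a denominator is zero is a convention chosen here, and the counts in the usage note are made-up round numbers, not results from the study.

```python
from typing import Tuple


def evaluate(n: int, v: int, m: int) -> Tuple[float, float, float]:
    """Return (recall, precision, F measure).

    n: true positives, v: false positives (type 1), m: false negatives (type 2).
    """
    recall = n / (n + m) if (n + m) else 0.0
    precision = n / (n + v) if (n + v) else 0.0
    f_measure = 2 * n / (2 * n + m + v) if (2 * n + m + v) else 0.0
    return recall, precision, f_measure
```

With 80 true positives, 20 false positives, and 20 false negatives, all three measures come out at 0.8.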
For group




Divide wrong set
A: sum of right set size
B: dividing count
D as penalty
Dividing rate $D = \frac{B}{A} \le 1$
Final result from clustering


The above process results in a matrix of score values between candidate pairs
Clustering derives the disambiguated sets from the matrix
Transitive rule
A score of 0.9 between candidates A and B, 0.9 between B and C, and 0.1 between A and C means that the target's situation changed at B
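One way to apply the transitive rule is to merge every pair whose score reaches a threshold and then list the low-scoring pairs that ended up in the same group. The slides do not name the clustering method, so the union-find (single-linkage) approach and both cut-off values below are assumptions.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def cluster(scores: Dict[Pair, float], threshold: float = 0.5) -> List[List[str]]:
    """Group records transitively: A~B and B~C put A, B, C in one group."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        root_a, root_b = find(a), find(b)
        if score >= threshold and root_a != root_b:
            parent[root_a] = root_b

    groups: Dict[str, set] = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted(sorted(g) for g in groups.values())


def conflicts(scores: Dict[Pair, float], clusters: List[List[str]],
              low: float = 0.3) -> List[Pair]:
    """Pairs merged only through an intermediate record despite a low direct score."""
    member = {x: i for i, group in enumerate(clusters) for x in group}
    return sorted(pair for pair, s in scores.items()
                  if s <= low and member[pair[0]] == member[pair[1]])
```

For the example on the slide, A, B, and C fall into one group, and the pair (A, C) is reported as a conflict, which marks B as the record where the inventor's situation changed.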
Data acquisition via advanced questionnaire system
Questionnaire system connected to a database
Generates questions when the answering person matches an inventor in the patent database
Answered by the inventor himself/herself
Issues
When an inventor has many patents, many patents must be selected
A probability of wrong answers remains because of answerer restriction
Record answers to the database
Targeting by address
Easy to calculate statistics
Generating request and reminder e-mails
Detecting mistaken skips
Automatic enable/disable by notating dependencies in the questionnaire
Proof by NEDO questionnaire

Evaluation



Number of inventors corresponding to NEDO patents in the database: W=848
Number of inventors corresponding to questionnaire answers: Q=854
Result from program execution (manual → ML → manual improvement)
Right result N=412→532→654
Type 1 error V=128→168→305
Type 2 error M=442→314→200
Singleton inventors who have no candidate L=36
Evaluated value
Recall rate R=0.50→0.61→0.75
Precision rate P=0.75→0.75→0.67
Dividing rate D=0.61→0.31→0.28
Remaining works

Utilizing name frequency
Phone book vs. inventors
Frequency of affiliation or applicant
Maintaining the name (Yomi) dictionary
There exist hard-to-read names (1%)
Low frequency and easy to miswrite
山本示 ヤマモトシメス
前田維 マエダユイ
高橋召 タカハシミコト
Performance tuning
Speed up 10 times more
Prevent small program changes from increasing the time cost
Comparing ML variants (parameters or kernel function)
Use all inventor attribute items
Attorney
Feature words
Items in near future

Attorney


The same inventor may apply via the same attorney
Feature words
Hypothesis
An inventor uses the same words in multiple patents
Word vector calculated from TF・IDF
TF: Term Frequency (of word)
IDF: Inverse Document Frequency (of word)
Conventional way in retrieval systems
Similarity by inner product between texts
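The feature-word idea can be sketched with a small TF・IDF implementation. The tokenized documents are hypothetical, the IDF uses the plain log(N / df) form, and the inner product is taken after length normalization (cosine similarity), which is one common choice rather than something the slides specify.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """One sparse TF-IDF vector (word -> weight) per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        })
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Inner product of the two vectors after normalizing each to unit length."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical word lists extracted from three patents.
DOCS = [
    ["laser", "diode", "laser"],
    ["laser", "diode", "cavity"],
    ["polymer", "resin"],
]
```

Under the hypothesis on the slide, the first two patents (shared vocabulary) score higher than the first and third (no shared words, similarity 0), which would count as weak evidence for a common inventor.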
Reference



[Trajtenberg 2006] Manuel Trajtenberg, Gil Shiff, Ran Melamed, The “Names Game”: Harnessing Inventors’ Patent Data for Economic Research, NBER Working Paper 12479, 2006
[Kim2006] Jinyoung Kim, Sangjoon John Lee, Gerald Marschke, International Knowledge Flows: Evidence from an Inventor-Firm Matched Data Set, NBER Working Paper 12692, 2006
[Aizawa2005] Akiko Aizawa, Keizo Oyama, Atsuhiro Takasu, Jun Adachi, Research Issues and Current Solution for Identification of Records, IEICE Journal, Vol. J88-D-I, No. 3, 2005