Disambiguating Japanese Inventors
Yusuke Naito, Naotoshi Tsukada
ESF-APE-INV 3rd "Name Game" workshop, 2011/9/4

Agenda
- Motivation
- Issues and Topics
- Outline of Program
- Data
- Details
- Data acquisition via advanced questionnaire system
- Result
- Future work

Motivation
- Innovation research using patents
  - Trace individual inventions
  - Calculate statistics for inventors
- The "Name Game" is a common theme among international researchers in scientometrics
  - Solutions to the same-name problem have generally taken a scoring approach [Aizawa2005]
  - The NBER working paper is the pioneering work [Trajtenberg 2006] [Kim2006]
- The items and methodology depend on the research purpose and the country
  - Ex. When measuring the mobility of inventors, the affiliation item cannot be used. [Kim2006]
  - In Korea, a small set of identical names covers most of the population.
- No published list of disambiguated Japanese inventors exists.

Issues
- The items usable for identifying an inventor are limited.
  - Identifying each inventor is subject to both type 1 and type 2 errors.
  - If only we could use the date of birth or social insurance number …
  - Large companies file many employee inventions, so the same-name problem cannot be ignored.
- The inventor's address is not restricted to either a home or a company address.
  - Even a company address is difficult to identify: there is no common rule, so it may be the HQ or a divisional address.
  - Even for a single inventor, the address is written differently across inventions.
- It is hard to identify the same person after a change of affiliation.

Topics
- Algorithms depend on language characteristics
  - Natural language processing
- Use all usable data
  - GIS
  - Phone book
  - Patent database
- Data acquisition from the NEDO project
  - Questionnaire system
  - Developed from scratch

Outline of Program
- Classify the detailed attributes described in the patent document
- Score each item as a normalized value and apply a weight
- Compare the total score for a pair of inventors against a threshold

Item | Trajtenberg 2006 | Our program
Name string | Soundex | Compare Kanji and Yomi by Levenshtein distance
Applicant | Use under median of frequency | Use in all cases
Population density | Use under median of frequency | Use under median of frequency
Middle name, surname | Use in all cases | NA
Address information | Full matching | Measure the distance using GIS
Tech. fields | Use in all cases | Use in all cases
Citation, co-invention | Use in all cases | Use in all cases

Data
- Input (target)
  - Inventor descriptions in patents
    - Excluding one-time inventors for whom no similar name string or similar Yomi exists
    - Excluding records with the same name and the same address in nearby application dates
      - Where the inventor's company is no larger than a mid-sized company and the technical field differs
- Output (result)
  - For public identification of inventors
    - Grouping of records that belong to the same person
  - For investigating specific inventor(s)
    - The group with the most members that contains the target, or the group with the highest average score
    - As related data, also output the other groups that differ from the target's group

Details
- Items
  - Issues in items and their solutions
- Scoring
- Evaluation functions
- Machine learning

Items(1) name and address
- Name
  - Levenshtein distance in Kanji
  - Rewrite Kanji variants to the same Yomi (Japanese Kana pronunciation) using the Name Yomi Dictionary: current and old forms that can be used interchangeably (斉藤 and 齊藤, 嶋田 and 島田, etc.) and wrong characters (二郎 and 次郎, 祐介 and 裕介, etc.)
  - Use the result as rough candidate pairs (scoring is then performed between these candidate pairs)
    - Contributes to the recall rate
    - Specificity is decided according to patent frequency (3.5)
- Address
  - Resolve changes from political history, such as mergers of city-level governments, using the Address History Dictionary
  - Transform to geographical latitude and longitude data at the smallest area level and calculate the distance
    - Contributes to the recall rate
    - Specificity is decided according to patent frequency (4046)

Name distributions
[Figure: name distributions]

Geographic distributions
[Figure: geographic distributions]

Items(2) Network
- Co-inventor
  - Candidate pairs are regarded as identical when connected by a path of length 2 or 3 in the co-inventor network
- Citation
  - Network of citations written by the inventor (not the examiner)
  - Candidate pairs are regarded as identical when connected by a path of length less than 4 in the network
- Network patterns in citation
  - Path of length 1 (1 pattern)
  - Path of length 2 (3 patterns)
  - Path of length 3 (4 patterns)
  [Figure: citing/cited diagrams for each path pattern]

Items(3) affiliation and applicant
- Affiliation
  - Score depends on the inverse of the size of the organization named in the inventor's address
  - Distinguish division names from company names (by referring to the applicant name)
- Applicant
  - Used when the inventor address names no organization and the candidate pair has the same applicant
  - Score depends on the inverse of the size
    of the applicant

Items(4) application date and IPC
- Application date
  - Score depends on the inverse of the period between the candidate pair's application dates.
  - 1000 days as the maximum period
- IPC
  - Score depends on the matching rate of publication IPCs
    $Score_{normalized} = \frac{Number\_of\_Matched\_IPC \times 2}{Summation\_of\_number\_of\_IPC\_in\_pair}$
- FI (Search IPC)
  - Assigned to all patents; common with and expanded from IPC ver. 4
  - Easy to compare

Issues in items and their solutions
- Issue: time cost
- Method 1: gain speed by indexing tables
  - Use exactly indexed tables in every relational (join) process
- Method 2: make the matrix sparse
  - Derive candidates from the matrix over all inventors
    - Text calculation
    - Pairwise calculation
- Method 3: embed user-defined functions
  - Use programs compiled from C code for processes other than joins
    - Numeric calculation of distance or similarity
- Enhancement: 100 times faster
  - 30 targets in 20 days → 300 targets in 2 days
  - Creating the indexes takes a fair amount of time once; they are reused afterwards

Scoring
- Simple method
  - Summation of each item's score
- Weighted method
  - Parameters: item allotment
    - Normalized allotment within items
  - Parameters: item weights
- Tuning parameters
  - Manual
  - Machine learning

Machine Learning (not completed)
- Teaching data
  - Disambiguated manually for the top 30 inventors in NEDO patents
  - Confirmed that the outcome depends on name frequency
- Reinforcement learning (Q-learning)
  - Weighting depends on sensitivity
- Genetic algorithm (classifier system)
  - Genotype: weights
  - Phenotype: total score
  - Converges to a near-optimal solution
- Support Vector Machine (SVM)
  - Maximize margins in evaluating the kernel function (polynomial)
- Tentative result
  - High weights on high-performing items
  - Comparable with conventional methods (like "hill climb")
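The application-date score, the IPC matching score, and the weighted total compared against a threshold can be sketched as follows. This is a minimal illustration, not the authors' program: the linear decay of the date score to zero at 1000 days is an assumption (the slides only say the score depends on the inverse of the period), and the function names, the example weights, and the threshold of 0.5 are all hypothetical.

```python
from __future__ import annotations

from datetime import date

MAX_PERIOD_DAYS = 1000  # maximum period stated on the slide


def date_score(d1: date, d2: date) -> float:
    """Closer application dates score higher.

    Assumption: linear decay from 1.0 (same day) to 0.0 (>= 1000 days apart).
    """
    gap = abs((d1 - d2).days)
    return max(0.0, 1.0 - gap / MAX_PERIOD_DAYS)


def ipc_score(ipcs_a: set[str], ipcs_b: set[str]) -> float:
    """2 * (matched IPCs) / (total number of IPCs in the pair)."""
    total = len(ipcs_a) + len(ipcs_b)
    if total == 0:
        return 0.0
    return 2 * len(ipcs_a & ipcs_b) / total


def total_score(item_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of normalized item scores, divided by the sum of weights."""
    weight_sum = sum(weights[name] for name in item_scores)
    return sum(weights[name] * s for name, s in item_scores.items()) / weight_sum


def same_inventor(item_scores: dict[str, float], weights: dict[str, float],
                  threshold: float = 0.5) -> bool:
    """Hypothetical decision rule: total score at or above the threshold."""
    return total_score(item_scores, weights) >= threshold


# Example pair of patent records (illustrative values only).
scores = {
    "date": date_score(date(2010, 1, 1), date(2010, 4, 11)),
    "ipc": ipc_score({"H01L", "G02B"}, {"G02B", "H04N"}),
}
weights = {"date": 1.0, "ipc": 3.0}  # hypothetical item weights
print(total_score(scores, weights), same_inventor(scores, weights))
```

Dividing by the sum of weights keeps the total in the range 0 to 1, so a single threshold stays meaningful when weights are retuned manually or by machine learning.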

Evaluation Function
- For items
  - N: true positive (no error)
  - V: false positive (type 1 error)
  - M: false negative (type 2 error)
  - Recall rate: $R = \frac{N}{N + M}$
  - Precision rate: $P = \frac{N}{N + V}$
  - F measure: $F = \frac{2N}{2N + M + V}$
- For groups
  - Penalize a correct set that is wrongly divided
  - A: sum of the sizes of the correct sets
  - B: dividing count
  - Dividing rate (D as penalty): $D = 1 - \frac{B}{A}$

Final result from clustering
- The process above yields a matrix of scores between candidate pairs
- Clustering derives the disambiguated sets from the matrix
  - Transitive rule
    - Scores of 0.9 between candidates A and B, 0.9 between B and C, and 0.1 between A and C mean that the target's situation changed at B

Data acquisition via advanced questionnaire system
- Questionnaire system connected to the database
  - Generates questions when the respondent matches an inventor in the patent database
  - Answered by the inventor himself/herself
  - Issues
    - When an inventor has many patents, many patents must be selected
    - A probability of wrong answers remains because of restrictions on the respondent
- Records answers to the database
  - Targeting by address
  - Easy to calculate statistics
- Generates request and reminder e-mails
- Detects questions skipped by mistake
  - Automatically enables/disables questions from dependencies declared in the questionnaire

Proof by NEDO questionnaire
- Evaluation
  - Number of inventors corresponding to NEDO patents in the database: W=848
  - Number of inventors corresponding to questionnaire answers: Q=854
- Result from program execution (manual → ML → manual improvement)
  - Correct results: N=412→532→654
  - Type 1 errors: V=128→168→305
  - Type 2 errors: M=442→314→200
  - Singleton inventors with no candidate: L=36
- Evaluated values
  - Recall rate R=0.50→0.61→0.75
  - Precision rate P=0.75→0.75→0.67
  - Dividing rate D=0.61→0.31→0.28

Remaining works
- Utilizing
  name frequency
  - Phone book vs. inventors
  - Frequency of affiliation or applicant
- Maintaining the name (Yomi) dictionary
  - Some names are hard to read (1%)
    - Low frequency and easy to miswrite
    - 山本示 ヤマモトシメス, 前田維 マエダユイ, 高橋召 タカハシミコト
- Performance tuning
  - Speed up a further 10 times
  - Prevent small program changes from increasing the time cost
- Comparing ML varieties (parameters or kernel function)
- Use all inventor attribute items
  - Attorney
  - Feature words

Items in near future
- Attorney
  - The same inventor may apply via the same attorney
- Feature words
  - Hypothesis
    - An inventor uses the same words across multiple patents
  - Word vectors calculated from TF・IDF
    - TF: Term Frequency (of word)
    - IDF: Inverse Document Frequency (of word)
    - Conventional approach in retrieval systems
  - Similarity by inner product between texts

Reference
[Trajtenberg 2006] Manuel Trajtenberg, Gil Shiff, Ran Melamed, The "Names Game": Harnessing Inventors' Patent Data for Economic Research, NBER Working Paper 12479, 2006
[Kim2006] Jinyoung Kim, Sangjoon John Lee, Gerald Marschke, International Knowledge Flows: Evidence from an Inventor-Firm Matched Data Set, NBER Working Paper 12692, 2006
[Aizawa2005] Akiko Aizawa, Keizo Oyama, Atsuhiro Takasu, Jun Adachi, Research Issues and Current Solution for Identification of Records, IEICE Journal, Vol. J88-D-I, No. 3, 2005
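The feature-word item listed under "Items in near future" can be sketched as follows. This is a hedged illustration rather than the authors' implementation: it assumes each patent text has already been tokenized into a list of words (Japanese text would need a morphological analyzer, which is not shown), it uses the plain log(N / df) form of IDF, and it normalizes the inner product to a cosine so that long and short texts are comparable. The function names `tfidf_vectors` and `cosine` are hypothetical.

```python
from __future__ import annotations

import math
from collections import Counter


def tfidf_vectors(docs: list[list[str]]) -> list[dict[str, float]]:
    """Build one sparse TF-IDF vector (word -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return vectors


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Inner product of two sparse vectors, normalized by their lengths."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Illustrative, pre-tokenized patent texts (made-up words).
docs = [
    ["laser", "diode", "laser"],
    ["laser", "diode", "cavity"],
    ["engine", "piston", "valve"],
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Under the hypothesis on the slide, two records with the same inventor name whose texts share rare words would receive a higher similarity than records from unrelated fields, and that similarity could enter the total score as one more weighted item.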