Disambiguating inventor names using deep neural networks Steve Petrie T’Mir Julius Swinburne Project goal: match inventor names • Inventors in patent apps do not have unique IDs: identical names same inventor? / different inventors? different names same inventor? / different inventors? • Goal: disambiguate inventor names assign unique inventor ID 2 Swinburne Options to tackle the problem • Program a hand-crafted algorithm based on, eg: same/similar last name (account for spelling variations) same/similar first name (account for spelling variations) similar application dates (investigate different windows) similar co-authors (account for spelling variations) • Machine learning (algorithm learns important discriminating features from data), eg: neural networks 3 Swinburne Options to tackle the problem • Program a hand-crafted algorithm: + - inner workings are explicit/transparent requires a lot of programmer effort & time brittle: a lot of special cases (exceptions) may go unseen/unimplemented (analogous to overfitting) • Machine learning: + + + - automatically learns discriminating features from data learns features fast does not require as much expert knowledge of dataset inner workings usually not explicit/transparent overfitting may be a problem 4 Swinburne Neural networks • Biological neuron: [credit: en.wikipedia.org/wiki/Neuron] • Artificial neuron (“perceptron”): [credit: neuralnetworksanddeeplearning.com] Swinburne Neural networks • Deep neural networks (DNNs) multiple hidden layers enables abstraction of concepts [credit: neuralnetworksanddeeplearning.com] 6 Swinburne DNN example: AlexNet2012 • AlexNet2012 architecture: [credit: Krizhevsky et al (2012)] input layer hidden layers output layer Swinburne DNN example: AlexNet2012 • AlexNet2012 architecture: dog [credit: Krizhevsky et al (2012)] input layer hidden layers output layer Swinburne DNN example: AlexNet2012 • AlexNet2012 architecture: [credit: Krizhevsky et al (2012)] • Train (labelled): ship flower elephant Swinburne DNN example: AlexNet2012 • AlexNet2012 architecture: [credit: Krizhevsky et al (2012)] • Train (labelled): ship • Deploy (unlabelled): flower elephant Swinburne DNN example: AlexNet2012 • AlexNet2012 architecture: [credit: Krizhevsky et al (2012)] • Train (labelled): ship • Deploy (unlabelled): flower elephant Swinburne DNN example: AlexNet2012 • AlexNet2012 architecture: [ credit: devblogs.nvidia.com/parallelforall/accelerate-machine-learning-cudnn-deep-neural-network-library/ ] Swinburne Can we use a DNN…? • Perhaps a DNN designed to classify: o o 1.2 Mn training images 1,000 classes o o dog human ship etc… will perform well when classifying: 430k training comparisons 2 classes • But patent app data is text, not images! o match o non-match Swinburne Transforming text to 2D images • Need to represent text as numbers • Convert to vector? • Convert to image (2D bitmap)? works with previous DNNs designed for image analysis accounts for spelling errors, translations (different string positions within word/s) Swinburne Transforming text to 2D images • Clean text: remove whitespace remove punctuation convert to uppercase Swinburne Transforming text to 2D images • 2D map structure: L M N H W D C K/Q B P Z S A E F R T I O U J G Y X V Swinburne Transforming text to 2D images • 2D map structure: L M N H W PETRIE D C K/Q B P Z S A E F R T I O U J G Y X V Swinburne Transforming text to 2D images • 2D map structure: L M N H W PETRIE D C K/Q B P Z S A E F R T I O U J G Y X V Swinburne Transforming text to 2D images • 2D map structure: L M N H W PETRIE D C K/Q B P Z S A E F R T I O U J G Y X V Swinburne Transforming text to 2D images • 2D map structure: L M N H W PETRIE D C K/Q B P Z S A E F R T I O U J G Y X V Swinburne Transforming text to 2D images • 2D map structure: L M N H W PETRIE D C K/Q B P Z S A E F R T I O U J G Y X V Swinburne Transforming text to 2D images • 2D map structure: L M N H W PETRIE D C K/Q B P Z S A E F R T I O U J G Y X V Swinburne Transforming text to 2D images • 2D map structure: L M N H W PETRIE D C K/Q B P Z S A E F R T I O U J G Y X V Swinburne Transforming text to 2D images • 2D map structure: L M N H W PETRIE D C K/Q B P Z S A E F R T I O U J G Y X V Swinburne Transforming text to 2D images • 2D map structure: L M N H W D C K/Q B P WANG Z S A E F R T I O U J G Y X V Swinburne Transforming text to 2D images • 2D map structure: L M N H W D C K/Q B P WANG Z S A E F R T I O U J G Y X V Swinburne Transforming text to 2D images • 2D map structure: L M N H W PETRIE D C K/Q B P WANG Z S A E F R T I O U J G Y X V Swinburne Transforming text to 2D images L M N H W D C K/Q B P Z S A E F R T I O U J G Y X V firstname Swinburne Transforming text to 2D images L M N H W D C K/Q B P Z S A E F R T I O U J G Y X V firstname L M N H W D C K/Q B P Z S A E F R T I O U J G Y X V lastname Swinburne Transforming text to 2D images L M N H W D C K/Q B P Z S A E F R T I O U J G Y X V firstname L M N H W L M N H W D C K/Q B P D C K/Q B P Z S A E F Z S A E F R T I O U R T I O U J G Y X V J G Y X V lastname city Swinburne Transforming text to 2D images L M N H W 8 D C K/Q B P H Z S A E F 9 R T I O U G J G Y X V 5 firstname A B 3 4 0 C 2 1 F 7 D E 6 international patent classification (IPC) L M N H W L M N H W D C K/Q B P D C K/Q B P Z S A E F Z S A E F R T I O U R T I O U J G Y X V J G Y X V lastname city Swinburne Transforming text to 2D images match non-match Swinburne How many comparisons? • Process all data within 5 days: n = 12.4Mn inventor names Swinburne How many comparisons? • Process all data within 5 days: n = 12.4Mn inventor names n – 1 comparisons Swinburne How many comparisons? • Process all data within 5 days: n = 12.4Mn inventor names n – 2 comparisons Swinburne How many comparisons? • Process all data within 5 days: n = 12.4Mn inventor names 𝑛(𝑛−1) 2 = 7.7x1013 (77 Tn) comparisons computation time ~ years Swinburne Binning (blocking) • Sort patent apps into “bins” by lastname: Swinburne Binning (blocking) • Sort patent apps into “bins” by lastname: PETRIE Swinburne Binning (blocking) • Sort patent apps into “bins” by lastname: PETRIE PET Swinburne Binning (blocking) • Sort patent apps into “bins” by lastname: WANG PET Swinburne Binning (blocking) • Sort patent apps into “bins” by lastname: WANG PET WAN Swinburne Binning (blocking) • Sort patent apps into “bins” by lastname: PETERSON PET WAN Swinburne Binning (blocking) • Sort patent apps into “bins” by lastname: PETERSON PET WAN Swinburne Binning (blocking) # comparisons matches retained • No binning: 77 Tn 100% • PETrie: 154 Bn 99.85% • PETRie: 68 Bn 99.68% • PETRIe: 38 Bn 99.37% Swinburne Binning (blocking) # comparisons matches retained • No binning: 77 Tn 100% • PETrie: 154 Bn 99.85% • PETRie: 68 Bn 99.68% • PETRIe: 38 Bn 99.37% • Still a problem: LI: 27k inventors; 360 Mn comparisons LEE: 98k inventors; 5 Bn comparisons Swinburne Binning (blocking) # comparisons matches retained • No binning: 77 Tn 100% • PETrie: 154 Bn 99.85% many “problem” comparisons are within a small no. of bins • only consider bins containing > 100 inventors (> 10k comparisons) if bin key ≤ 2 letters, re-bin with 1st letter of 1st name: Jet Li LI,J Jing Li LI,J Wei Li LI,W Swinburne Binning (blocking) # comparisons matches retained • No binning: 77 Tn 100% • PETrie: 154 Bn 99.85% if bin key ≤ 2 letters, re-bin with 1st letter of 1st name • PETRie: 64 Bn 99.71% only increase 3 4 letters if bin contains > 100 inventors if bin key ≤ 3 letters, re-bin with 1st letter of 1st name Swinburne Binning (blocking) # comparisons matches retained • No binning: 77 Tn 100% • PETrie: 154 Bn 99.85% if bin key ≤ 2 letters, re-bin with 1st letter of 1st name • PETRie: 64 Bn 99.71% only increase 3 4 letters if bin contains > 100 inventors if bin key ≤ 3 letters, re-bin with 1st letter of 1st name • …[15 letters]: 441 Mn 99.54% only increase 14 15 letters if bin contains > 100 invtrs if bin key ≤ 14 letters, re-bin with letters from 1st name Swinburne Preliminary results (labelled data only) • Precision = • Recall = 𝑡𝑟𝑢𝑒𝑝𝑜𝑠 𝑝𝑜𝑠 = 99.54% 𝑡𝑟𝑢𝑒𝑝𝑜𝑠 𝑡𝑜𝑡𝑎𝑙 𝑚𝑎𝑡𝑐ℎ𝑒𝑠 = 98.78% • Splitting = 𝑓𝑎𝑙𝑠𝑒𝑛𝑒𝑔 𝑡𝑜𝑡𝑎𝑙 𝑚𝑎𝑡𝑐ℎ𝑒𝑠 = 1.22% • Lumping = 𝑓𝑎𝑙𝑠𝑒𝑝𝑜𝑠 𝑡𝑜𝑡𝑎𝑙 𝑚𝑎𝑡𝑐ℎ𝑒𝑠 = 0.46% Swinburne Preliminary results (labelled data only) • Precision = • Recall = 𝑡𝑟𝑢𝑒𝑝𝑜𝑠 𝑝𝑜𝑠 = 99.54% 𝑡𝑟𝑢𝑒𝑝𝑜𝑠 𝑡𝑜𝑡𝑎𝑙 𝑚𝑎𝑡𝑐ℎ𝑒𝑠 • match:non-match ratio in labelled data is 48:52 • Probably more non-matches in bulk data (falsepos ↑) = 98.78% • Splitting = 𝑓𝑎𝑙𝑠𝑒𝑛𝑒𝑔 𝑡𝑜𝑡𝑎𝑙 𝑚𝑎𝑡𝑐ℎ𝑒𝑠 = 1.22% • Lumping = 𝑓𝑎𝑙𝑠𝑒𝑝𝑜𝑠 𝑡𝑜𝑡𝑎𝑙 𝑚𝑎𝑡𝑐ℎ𝑒𝑠 = 0.46% Swinburne Preliminary results (labelled data only) Swinburne Preliminary results (labelled data only) Swinburne Run-times for bulk data processing • Total: ~ 70h [not including assigning unique IDs] • Break-down: 1. 2. 3. 4. Binning: ~ 1h Generating comparison-map images: ~ 36h Image classification (deploying DNN for inference): ~ 33h Obtaining linked groups (with unique IDs): [not run yet] Swinburne Obtaining unique inventor IDs • DNN outputs probability of match/non-match for any given invtr-invtr comparison • However, obtaining unique IDs is not straightforward: B A C Swinburne Obtaining unique inventor IDs • DNN outputs probability of match/non-match for any given invtr-invtr comparison • However, obtaining unique IDs is not straightforward: B A • Should we give D: same ID…? different ID…? C D Swinburne Obtaining unique inventor IDs • DNN outputs probability of match/non-match for any given invtr-invtr comparison • However, obtaining unique IDs is not straightforward: B A • Should we give D: same ID…? different ID…? C D Swinburne Obtaining unique inventor IDs • DNN outputs probability of match/non-match for any given invtr-invtr comparison • However, obtaining unique IDs is not straightforward: B A C D • Should we give D: same ID…? bad for precision (more false pos) different ID…? bad for recall (fewer true pos) Swinburne Obtaining unique inventor IDs • DNN outputs probability of match/non-match for any given invtr-invtr comparison • However, obtaining unique IDs is not straightforward: B C A • Should we give D: same ID…? if # links ≥ n/2 different ID…? if # links < n/2 D Swinburne Obtaining unique inventor IDs • DNN outputs probability of match/non-match for any given invtr-invtr comparison • However, obtaining unique IDs is not straightforward: B C A • Should we give D: same ID…? if # links ≥ n/2 different ID…? if # links < n/2 D