at Swinburne University of Technology - Stephen

Disambiguating inventor names using deep neural networks Steve Petrie T’Mir Julius Swinburne Project goal: match inventor names • Inventors in patent apps do not have unique IDs:  identical names  same inventor? / different inventors?  different names  same inventor? / different inventors? • Goal: disambiguate inventor names  assign unique inventor ID 2 Swinburne Options to tackle the problem • Program a hand-crafted algorithm based on, eg:     same/similar last name (account for spelling variations) same/similar first name (account for spelling variations) similar application dates (investigate different windows) similar co-authors (account for spelling variations) • Machine learning (algorithm learns important discriminating features from data), eg:  neural networks 3 Swinburne Options to tackle the problem • Program a hand-crafted algorithm: + - inner workings are explicit/transparent requires a lot of programmer effort & time brittle: a lot of special cases (exceptions) may go unseen/unimplemented (analogous to overfitting) • Machine learning: + + + - automatically learns discriminating features from data learns features fast does not require as much expert knowledge of dataset inner workings usually not explicit/transparent overfitting may be a problem 4 Swinburne Neural networks • Biological neuron: [credit: en.wikipedia.org/wiki/Neuron] • Artificial neuron (“perceptron”): [credit: neuralnetworksanddeeplearning.com] Swinburne Neural networks • Deep neural networks (DNNs)  multiple hidden layers  enables abstraction of concepts [credit: neuralnetworksanddeeplearning.com] 6 Swinburne DNN example: AlexNet2012 • AlexNet2012 architecture: [credit: Krizhevsky et al (2012)] input layer hidden layers output layer Swinburne DNN example: AlexNet2012 • AlexNet2012 architecture: dog [credit: Krizhevsky et al (2012)] input layer hidden layers output layer Swinburne DNN example: AlexNet2012 • AlexNet2012 architecture: [credit: Krizhevsky et al (2012)] • Train (labelled): ship flower elephant Swinburne DNN example: AlexNet2012 • AlexNet2012 architecture: [credit: Krizhevsky et al (2012)] • Train (labelled): ship • Deploy (unlabelled): flower elephant Swinburne DNN example: AlexNet2012 • AlexNet2012 architecture: [credit: Krizhevsky et al (2012)] • Train (labelled): ship • Deploy (unlabelled): flower elephant Swinburne DNN example: AlexNet2012 • AlexNet2012 architecture: [ credit: devblogs.nvidia.com/parallelforall/accelerate-machine-learning-cudnn-deep-neural-network-library/ ] Swinburne Can we use a DNN…? • Perhaps a DNN designed to classify: o o 1.2 Mn training images  1,000 classes o o dog human ship etc… will perform well when classifying: 430k training comparisons  2 classes • But patent app data is text, not images! o match o non-match Swinburne Transforming text to 2D images • Need to represent text as numbers • Convert to vector? • Convert to image (2D bitmap)?  works with previous DNNs designed for image analysis  accounts for spelling errors, translations (different string positions within word/s) Swinburne Transforming text to 2D images • Clean text:  remove whitespace  remove punctuation  convert to uppercase Swinburne Transforming text to 2D images • 2D map structure: L M N H W D C K/Q B P Z S A E F R T I O U J G Y X V Swinburne Transforming text to 2D images • 2D map structure: L M N H W PETRIE D C K/Q B P Z S A E F R T I O U J G Y X V Swinburne Transforming text to 2D images • 2D map structure: L M N H W PETRIE D C K/Q B P Z S A E F R T I O U J G Y X V Swinburne Transforming text to 2D images • 2D map structure: L M N H W PETRIE D C K/Q B P Z S A E F R T I O U J G Y X V Swinburne Transforming text to 2D images • 2D map structure: L M N H W PETRIE D C K/Q B P Z S A E F R T I O U J G Y X V Swinburne Transforming text to 2D images • 2D map structure: L M N H W PETRIE D C K/Q B P Z S A E F R T I O U J G Y X V Swinburne Transforming text to 2D images • 2D map structure: L M N H W PETRIE D C K/Q B P Z S A E F R T I O U J G Y X V Swinburne Transforming text to 2D images • 2D map structure: L M N H W PETRIE D C K/Q B P Z S A E F R T I O U J G Y X V Swinburne Transforming text to 2D images • 2D map structure: L M N H W PETRIE D C K/Q B P Z S A E F R T I O U J G Y X V Swinburne Transforming text to 2D images • 2D map structure: L M N H W D C K/Q B P WANG Z S A E F R T I O U J G Y X V Swinburne Transforming text to 2D images • 2D map structure: L M N H W D C K/Q B P WANG Z S A E F R T I O U J G Y X V Swinburne Transforming text to 2D images • 2D map structure: L M N H W PETRIE D C K/Q B P WANG Z S A E F R T I O U J G Y X V Swinburne Transforming text to 2D images L M N H W D C K/Q B P Z S A E F R T I O U J G Y X V firstname Swinburne Transforming text to 2D images L M N H W D C K/Q B P Z S A E F R T I O U J G Y X V firstname L M N H W D C K/Q B P Z S A E F R T I O U J G Y X V lastname Swinburne Transforming text to 2D images L M N H W D C K/Q B P Z S A E F R T I O U J G Y X V firstname L M N H W L M N H W D C K/Q B P D C K/Q B P Z S A E F Z S A E F R T I O U R T I O U J G Y X V J G Y X V lastname city Swinburne Transforming text to 2D images L M N H W 8 D C K/Q B P H Z S A E F 9 R T I O U G J G Y X V 5 firstname A B 3 4 0 C 2 1 F 7 D E 6 international patent classification (IPC) L M N H W L M N H W D C K/Q B P D C K/Q B P Z S A E F Z S A E F R T I O U R T I O U J G Y X V J G Y X V lastname city Swinburne Transforming text to 2D images match non-match Swinburne How many comparisons? • Process all data within 5 days: n = 12.4Mn inventor names Swinburne How many comparisons? • Process all data within 5 days: n = 12.4Mn inventor names n – 1 comparisons Swinburne How many comparisons? • Process all data within 5 days: n = 12.4Mn inventor names n – 2 comparisons Swinburne How many comparisons? • Process all data within 5 days: n = 12.4Mn inventor names 𝑛(𝑛−1) 2 = 7.7x1013 (77 Tn) comparisons computation time ~ years Swinburne Binning (blocking) • Sort patent apps into “bins” by lastname: Swinburne Binning (blocking) • Sort patent apps into “bins” by lastname: PETRIE Swinburne Binning (blocking) • Sort patent apps into “bins” by lastname: PETRIE PET Swinburne Binning (blocking) • Sort patent apps into “bins” by lastname: WANG PET Swinburne Binning (blocking) • Sort patent apps into “bins” by lastname: WANG PET WAN Swinburne Binning (blocking) • Sort patent apps into “bins” by lastname: PETERSON PET WAN Swinburne Binning (blocking) • Sort patent apps into “bins” by lastname: PETERSON PET WAN Swinburne Binning (blocking) # comparisons matches retained • No binning: 77 Tn 100% • PETrie: 154 Bn 99.85% • PETRie: 68 Bn 99.68% • PETRIe: 38 Bn 99.37% Swinburne Binning (blocking) # comparisons matches retained • No binning: 77 Tn 100% • PETrie: 154 Bn 99.85% • PETRie: 68 Bn 99.68% • PETRIe: 38 Bn 99.37% • Still a problem:  LI: 27k inventors; 360 Mn comparisons  LEE: 98k inventors; 5 Bn comparisons Swinburne Binning (blocking) # comparisons matches retained • No binning: 77 Tn 100% • PETrie: 154 Bn 99.85%  many “problem” comparisons are within a small no. of bins • only consider bins containing > 100 inventors (> 10k comparisons)  if bin key ≤ 2 letters, re-bin with 1st letter of 1st name: Jet Li  LI,J Jing Li  LI,J Wei Li  LI,W Swinburne Binning (blocking) # comparisons matches retained • No binning: 77 Tn 100% • PETrie: 154 Bn 99.85%  if bin key ≤ 2 letters, re-bin with 1st letter of 1st name • PETRie: 64 Bn 99.71%  only increase 3  4 letters if bin contains > 100 inventors  if bin key ≤ 3 letters, re-bin with 1st letter of 1st name Swinburne Binning (blocking) # comparisons matches retained • No binning: 77 Tn 100% • PETrie: 154 Bn 99.85%  if bin key ≤ 2 letters, re-bin with 1st letter of 1st name • PETRie: 64 Bn 99.71%  only increase 3  4 letters if bin contains > 100 inventors  if bin key ≤ 3 letters, re-bin with 1st letter of 1st name • …[15 letters]: 441 Mn 99.54%  only increase 14  15 letters if bin contains > 100 invtrs  if bin key ≤ 14 letters, re-bin with letters from 1st name Swinburne Preliminary results (labelled data only) • Precision = • Recall = 𝑡𝑟𝑢𝑒𝑝𝑜𝑠 𝑝𝑜𝑠 = 99.54% 𝑡𝑟𝑢𝑒𝑝𝑜𝑠 𝑡𝑜𝑡𝑎𝑙 𝑚𝑎𝑡𝑐ℎ𝑒𝑠 = 98.78% • Splitting = 𝑓𝑎𝑙𝑠𝑒𝑛𝑒𝑔 𝑡𝑜𝑡𝑎𝑙 𝑚𝑎𝑡𝑐ℎ𝑒𝑠 = 1.22% • Lumping = 𝑓𝑎𝑙𝑠𝑒𝑝𝑜𝑠 𝑡𝑜𝑡𝑎𝑙 𝑚𝑎𝑡𝑐ℎ𝑒𝑠 = 0.46% Swinburne Preliminary results (labelled data only) • Precision = • Recall = 𝑡𝑟𝑢𝑒𝑝𝑜𝑠 𝑝𝑜𝑠 = 99.54% 𝑡𝑟𝑢𝑒𝑝𝑜𝑠 𝑡𝑜𝑡𝑎𝑙 𝑚𝑎𝑡𝑐ℎ𝑒𝑠 • match:non-match ratio in labelled data is 48:52 • Probably more non-matches in bulk data (falsepos ↑) = 98.78% • Splitting = 𝑓𝑎𝑙𝑠𝑒𝑛𝑒𝑔 𝑡𝑜𝑡𝑎𝑙 𝑚𝑎𝑡𝑐ℎ𝑒𝑠 = 1.22% • Lumping = 𝑓𝑎𝑙𝑠𝑒𝑝𝑜𝑠 𝑡𝑜𝑡𝑎𝑙 𝑚𝑎𝑡𝑐ℎ𝑒𝑠 = 0.46% Swinburne Preliminary results (labelled data only) Swinburne Preliminary results (labelled data only) Swinburne Run-times for bulk data processing • Total: ~ 70h [not including assigning unique IDs] • Break-down: 1. 2. 3. 4. Binning: ~ 1h Generating comparison-map images: ~ 36h Image classification (deploying DNN for inference): ~ 33h Obtaining linked groups (with unique IDs): [not run yet] Swinburne Obtaining unique inventor IDs • DNN outputs probability of match/non-match for any given invtr-invtr comparison • However, obtaining unique IDs is not straightforward: B A C Swinburne Obtaining unique inventor IDs • DNN outputs probability of match/non-match for any given invtr-invtr comparison • However, obtaining unique IDs is not straightforward: B A • Should we give D:  same ID…?  different ID…? C D Swinburne Obtaining unique inventor IDs • DNN outputs probability of match/non-match for any given invtr-invtr comparison • However, obtaining unique IDs is not straightforward: B A • Should we give D:  same ID…?  different ID…? C D Swinburne Obtaining unique inventor IDs • DNN outputs probability of match/non-match for any given invtr-invtr comparison • However, obtaining unique IDs is not straightforward: B A C D • Should we give D:  same ID…?  bad for precision (more false pos)  different ID…?  bad for recall (fewer true pos) Swinburne Obtaining unique inventor IDs • DNN outputs probability of match/non-match for any given invtr-invtr comparison • However, obtaining unique IDs is not straightforward: B C A • Should we give D:  same ID…?  if # links ≥ n/2  different ID…?  if # links < n/2 D Swinburne Obtaining unique inventor IDs • DNN outputs probability of match/non-match for any given invtr-invtr comparison • However, obtaining unique IDs is not straightforward: B C A • Should we give D:  same ID…?  if # links ≥ n/2  different ID…?  if # links < n/2 D

at Swinburne University of Technology - Stephen

Related documents

Products

Support

at Swinburne University of Technology - Stephen

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib