at Swinburne University of Technology - Stephen

advertisement
Disambiguating inventor
names using deep
neural networks
Steve Petrie
T’Mir Julius
Swinburne
Project goal: match inventor names
• Inventors in patent apps do not have unique IDs:
 identical names  same inventor? / different inventors?
 different names  same inventor? / different inventors?
• Goal: disambiguate inventor names
 assign unique inventor ID
2
Swinburne
Options to tackle the problem
• Program a hand-crafted algorithm based on, eg:




same/similar last name (account for spelling variations)
same/similar first name (account for spelling variations)
similar application dates (investigate different windows)
similar co-authors (account for spelling variations)
• Machine learning (algorithm learns important
discriminating features from data), eg:
 neural networks
3
Swinburne
Options to tackle the problem
• Program a hand-crafted algorithm:
+
-
inner workings are explicit/transparent
requires a lot of programmer effort & time
brittle: a lot of special cases (exceptions) may go
unseen/unimplemented (analogous to overfitting)
• Machine learning:
+
+
+
-
automatically learns discriminating features from data
learns features fast
does not require as much expert knowledge of dataset
inner workings usually not explicit/transparent
overfitting may be a problem
4
Swinburne
Neural networks
• Biological neuron:
[credit: en.wikipedia.org/wiki/Neuron]
• Artificial neuron (“perceptron”):
[credit: neuralnetworksanddeeplearning.com]
Swinburne
Neural networks
• Deep neural networks (DNNs)
 multiple hidden layers
 enables abstraction of concepts
[credit: neuralnetworksanddeeplearning.com]
6
Swinburne
DNN example: AlexNet2012
• AlexNet2012 architecture:
[credit: Krizhevsky et al (2012)]
input
layer
hidden layers
output
layer
Swinburne
DNN example: AlexNet2012
• AlexNet2012 architecture:
dog
[credit: Krizhevsky et al (2012)]
input
layer
hidden layers
output
layer
Swinburne
DNN example: AlexNet2012
• AlexNet2012 architecture:
[credit: Krizhevsky et al (2012)]
• Train (labelled):
ship
flower
elephant
Swinburne
DNN example: AlexNet2012
• AlexNet2012 architecture:
[credit: Krizhevsky et al (2012)]
• Train (labelled):
ship
• Deploy (unlabelled):
flower
elephant
Swinburne
DNN example: AlexNet2012
• AlexNet2012 architecture:
[credit: Krizhevsky et al (2012)]
• Train (labelled):
ship
• Deploy (unlabelled):
flower
elephant
Swinburne
DNN example: AlexNet2012
• AlexNet2012 architecture:
[ credit: devblogs.nvidia.com/parallelforall/accelerate-machine-learning-cudnn-deep-neural-network-library/ ]
Swinburne
Can we use a DNN…?
• Perhaps a DNN designed to classify:
o
o
1.2 Mn training images  1,000 classes o
o
dog
human
ship
etc…
will perform well when classifying:
430k training comparisons  2 classes
• But patent app data is text, not images!
o match
o non-match
Swinburne
Transforming text to 2D images
• Need to represent text as numbers
• Convert to vector?
• Convert to image (2D bitmap)?
 works with previous DNNs designed for image analysis
 accounts for spelling errors, translations (different string
positions within word/s)
Swinburne
Transforming text to 2D images
• Clean text:
 remove whitespace
 remove punctuation
 convert to uppercase
Swinburne
Transforming text to 2D images
• 2D map structure:
L
M
N
H
W
D
C
K/Q
B
P
Z
S
A
E
F
R
T
I
O
U
J
G
Y
X
V
Swinburne
Transforming text to 2D images
• 2D map structure:
L
M
N
H
W
PETRIE
D
C
K/Q
B
P
Z
S
A
E
F
R
T
I
O
U
J
G
Y
X
V
Swinburne
Transforming text to 2D images
• 2D map structure:
L
M
N
H
W
PETRIE
D
C
K/Q
B
P
Z
S
A
E
F
R
T
I
O
U
J
G
Y
X
V
Swinburne
Transforming text to 2D images
• 2D map structure:
L
M
N
H
W
PETRIE
D
C
K/Q
B
P
Z
S
A
E
F
R
T
I
O
U
J
G
Y
X
V
Swinburne
Transforming text to 2D images
• 2D map structure:
L
M
N
H
W
PETRIE
D
C
K/Q
B
P
Z
S
A
E
F
R
T
I
O
U
J
G
Y
X
V
Swinburne
Transforming text to 2D images
• 2D map structure:
L
M
N
H
W
PETRIE
D
C
K/Q
B
P
Z
S
A
E
F
R
T
I
O
U
J
G
Y
X
V
Swinburne
Transforming text to 2D images
• 2D map structure:
L
M
N
H
W
PETRIE
D
C
K/Q
B
P
Z
S
A
E
F
R
T
I
O
U
J
G
Y
X
V
Swinburne
Transforming text to 2D images
• 2D map structure:
L
M
N
H
W
PETRIE
D
C
K/Q
B
P
Z
S
A
E
F
R
T
I
O
U
J
G
Y
X
V
Swinburne
Transforming text to 2D images
• 2D map structure:
L
M
N
H
W
PETRIE
D
C
K/Q
B
P
Z
S
A
E
F
R
T
I
O
U
J
G
Y
X
V
Swinburne
Transforming text to 2D images
• 2D map structure:
L
M
N
H
W
D
C
K/Q
B
P
WANG
Z
S
A
E
F
R
T
I
O
U
J
G
Y
X
V
Swinburne
Transforming text to 2D images
• 2D map structure:
L
M
N
H
W
D
C
K/Q
B
P
WANG
Z
S
A
E
F
R
T
I
O
U
J
G
Y
X
V
Swinburne
Transforming text to 2D images
• 2D map structure:
L
M
N
H
W
PETRIE
D
C
K/Q
B
P
WANG
Z
S
A
E
F
R
T
I
O
U
J
G
Y
X
V
Swinburne
Transforming text to 2D images
L
M
N
H
W
D
C
K/Q
B
P
Z
S
A
E
F
R
T
I
O
U
J
G
Y
X
V
firstname
Swinburne
Transforming text to 2D images
L
M
N
H
W
D
C
K/Q
B
P
Z
S
A
E
F
R
T
I
O
U
J
G
Y
X
V
firstname
L
M
N
H
W
D
C
K/Q
B
P
Z
S
A
E
F
R
T
I
O
U
J
G
Y
X
V
lastname
Swinburne
Transforming text to 2D images
L
M
N
H
W
D
C
K/Q
B
P
Z
S
A
E
F
R
T
I
O
U
J
G
Y
X
V
firstname
L
M
N
H
W
L
M
N
H
W
D
C
K/Q
B
P
D
C
K/Q
B
P
Z
S
A
E
F
Z
S
A
E
F
R
T
I
O
U
R
T
I
O
U
J
G
Y
X
V
J
G
Y
X
V
lastname
city
Swinburne
Transforming text to 2D images
L
M
N
H
W
8
D
C
K/Q
B
P
H
Z
S
A
E
F
9
R
T
I
O
U
G
J
G
Y
X
V
5
firstname
A
B
3
4
0
C
2
1
F
7
D
E
6
international patent
classification (IPC)
L
M
N
H
W
L
M
N
H
W
D
C
K/Q
B
P
D
C
K/Q
B
P
Z
S
A
E
F
Z
S
A
E
F
R
T
I
O
U
R
T
I
O
U
J
G
Y
X
V
J
G
Y
X
V
lastname
city
Swinburne
Transforming text to 2D images
match
non-match
Swinburne
How many comparisons?
• Process all data within 5 days:
n = 12.4Mn inventor names
Swinburne
How many comparisons?
• Process all data within 5 days:
n = 12.4Mn inventor names
n – 1 comparisons
Swinburne
How many comparisons?
• Process all data within 5 days:
n = 12.4Mn inventor names
n – 2 comparisons
Swinburne
How many comparisons?
• Process all data within 5 days:
n = 12.4Mn inventor names
𝑛(𝑛−1)
2
= 7.7x1013 (77 Tn) comparisons
computation time ~ years
Swinburne
Binning (blocking)
• Sort patent apps into “bins” by lastname:
Swinburne
Binning (blocking)
• Sort patent apps into “bins” by lastname:
PETRIE
Swinburne
Binning (blocking)
• Sort patent apps into “bins” by lastname:
PETRIE
PET
Swinburne
Binning (blocking)
• Sort patent apps into “bins” by lastname:
WANG
PET
Swinburne
Binning (blocking)
• Sort patent apps into “bins” by lastname:
WANG
PET
WAN
Swinburne
Binning (blocking)
• Sort patent apps into “bins” by lastname:
PETERSON
PET
WAN
Swinburne
Binning (blocking)
• Sort patent apps into “bins” by lastname:
PETERSON
PET
WAN
Swinburne
Binning (blocking)
# comparisons
matches retained
• No binning:
77 Tn
100%
• PETrie:
154 Bn
99.85%
• PETRie:
68 Bn
99.68%
• PETRIe:
38 Bn
99.37%
Swinburne
Binning (blocking)
# comparisons
matches retained
• No binning:
77 Tn
100%
• PETrie:
154 Bn
99.85%
• PETRie:
68 Bn
99.68%
• PETRIe:
38 Bn
99.37%
• Still a problem:
 LI: 27k inventors; 360 Mn comparisons
 LEE: 98k inventors; 5 Bn comparisons
Swinburne
Binning (blocking)
# comparisons
matches retained
• No binning:
77 Tn
100%
• PETrie:
154 Bn
99.85%
 many “problem” comparisons are within a small no. of bins
•
only consider bins containing > 100 inventors (> 10k comparisons)
 if bin key ≤ 2 letters, re-bin with 1st letter of 1st name:
Jet Li  LI,J
Jing Li  LI,J
Wei Li  LI,W
Swinburne
Binning (blocking)
# comparisons
matches retained
• No binning:
77 Tn
100%
• PETrie:
154 Bn
99.85%
 if bin key ≤ 2 letters, re-bin with 1st letter of 1st name
• PETRie:
64 Bn
99.71%
 only increase 3  4 letters if bin contains > 100 inventors
 if bin key ≤ 3 letters, re-bin with 1st letter of 1st name
Swinburne
Binning (blocking)
# comparisons
matches retained
• No binning:
77 Tn
100%
• PETrie:
154 Bn
99.85%
 if bin key ≤ 2 letters, re-bin with 1st letter of 1st name
• PETRie:
64 Bn
99.71%
 only increase 3  4 letters if bin contains > 100 inventors
 if bin key ≤ 3 letters, re-bin with 1st letter of 1st name
• …[15 letters]:
441 Mn
99.54%
 only increase 14  15 letters if bin contains > 100 invtrs
 if bin key ≤ 14 letters, re-bin with letters from 1st name
Swinburne
Preliminary results (labelled data only)
• Precision =
• Recall =
𝑡𝑟𝑢𝑒𝑝𝑜𝑠
𝑝𝑜𝑠
= 99.54%
𝑡𝑟𝑢𝑒𝑝𝑜𝑠
𝑡𝑜𝑡𝑎𝑙 𝑚𝑎𝑡𝑐ℎ𝑒𝑠
= 98.78%
• Splitting =
𝑓𝑎𝑙𝑠𝑒𝑛𝑒𝑔
𝑡𝑜𝑡𝑎𝑙 𝑚𝑎𝑡𝑐ℎ𝑒𝑠
= 1.22%
• Lumping =
𝑓𝑎𝑙𝑠𝑒𝑝𝑜𝑠
𝑡𝑜𝑡𝑎𝑙 𝑚𝑎𝑡𝑐ℎ𝑒𝑠
= 0.46%
Swinburne
Preliminary results (labelled data only)
• Precision =
• Recall =
𝑡𝑟𝑢𝑒𝑝𝑜𝑠
𝑝𝑜𝑠
= 99.54%
𝑡𝑟𝑢𝑒𝑝𝑜𝑠
𝑡𝑜𝑡𝑎𝑙 𝑚𝑎𝑡𝑐ℎ𝑒𝑠
• match:non-match ratio in
labelled data is 48:52
• Probably more non-matches
in bulk data (falsepos ↑)
= 98.78%
• Splitting =
𝑓𝑎𝑙𝑠𝑒𝑛𝑒𝑔
𝑡𝑜𝑡𝑎𝑙 𝑚𝑎𝑡𝑐ℎ𝑒𝑠
= 1.22%
• Lumping =
𝑓𝑎𝑙𝑠𝑒𝑝𝑜𝑠
𝑡𝑜𝑡𝑎𝑙 𝑚𝑎𝑡𝑐ℎ𝑒𝑠
= 0.46%
Swinburne
Preliminary results (labelled data only)
Swinburne
Preliminary results (labelled data only)
Swinburne
Run-times for bulk data processing
• Total: ~ 70h [not including assigning unique IDs]
• Break-down:
1.
2.
3.
4.
Binning:
~ 1h
Generating comparison-map images:
~ 36h
Image classification (deploying DNN for inference): ~ 33h
Obtaining linked groups (with unique IDs): [not run yet]
Swinburne
Obtaining unique inventor IDs
• DNN outputs probability of match/non-match for any
given invtr-invtr comparison
• However, obtaining unique IDs is not straightforward:
B
A
C
Swinburne
Obtaining unique inventor IDs
• DNN outputs probability of match/non-match for any
given invtr-invtr comparison
• However, obtaining unique IDs is not straightforward:
B
A
• Should we give D:
 same ID…?
 different ID…?
C
D
Swinburne
Obtaining unique inventor IDs
• DNN outputs probability of match/non-match for any
given invtr-invtr comparison
• However, obtaining unique IDs is not straightforward:
B
A
• Should we give D:
 same ID…?
 different ID…?
C
D
Swinburne
Obtaining unique inventor IDs
• DNN outputs probability of match/non-match for any
given invtr-invtr comparison
• However, obtaining unique IDs is not straightforward:
B
A
C
D
• Should we give D:
 same ID…?
 bad for precision (more false pos)
 different ID…?  bad for recall (fewer true pos)
Swinburne
Obtaining unique inventor IDs
• DNN outputs probability of match/non-match for any
given invtr-invtr comparison
• However, obtaining unique IDs is not straightforward:
B
C
A
• Should we give D:
 same ID…?
 if # links ≥ n/2
 different ID…?  if # links < n/2
D
Swinburne
Obtaining unique inventor IDs
• DNN outputs probability of match/non-match for any
given invtr-invtr comparison
• However, obtaining unique IDs is not straightforward:
B
C
A
• Should we give D:
 same ID…?
 if # links ≥ n/2
 different ID…?  if # links < n/2
D
Download