Large scale harvesting of variants of proper names

Gerrit Bloothooft, UiL‐OTS, Utrecht University, g.bloothooft@uu.nl
Marijn Schraagen, LIACS, Leiden University, m.p.schraagen@liacs.leidenuniv.nl
The Netherlands
ICOS 2014 Glasgow | Utrecht | Leiden | Links
9‐10‐2014

name variants
• different versions of a name that can denote the same object
• requires a proof that the same object is involved (in at least one example)
– not always easy
– rarely explicitly provided

proper names in historical sources
Lots of variation!
– spelling variation
– suffix variation
– abbreviation
– translation
– typos (digitization)
– …
Dirk ‐ Dirck
Willem ‐ Willempje
Willem ‐ Wim
Willem ‐ Guillaume
Willem ‐ Wilhelmus
Willem ‐ Aillem

Guljelmus
Wllhelmus
Wlhelmus
WIllem
(Willem)
Wiellem
Wlllem
Gujlelnius
Wllem
WiIllem
Wijllem
Wihelmus
Willemj
Wikllem
Wwillem
Willlem
Guilleam
Willeam
Willem
Wil.lem
Wilem
Guileam
Willelmini
willem
Wiilem
Guillem
Weillem
Guilelmis
Wil;helmus
Wilhlem
Welhelmus
Wiillem
Wiehelmus
Wulhelmus
Willem)
Wilehelmus
Woillem
Wihhelmus
Weijlem
Willelmus
Wi;;em
Wilehlmus
Wuhelm
Guilelmus
Wilhlelmus
Willem(se)
Wilalem
Wullem
Willem.
W#ilhelmus
Guillelmus
Wliiem
Wlihelmus
Wilelmus
Willemm
Wileem
Wìllem
Willemem
Wolhelmus
Wechelmus
Guilllelmus
Wilemm
W.ilhelmus
Willem]
Willemh
\Willem
Wïllem
w8illem
Wilhellmus
Wilhelm.
Wilmhelmus
Wilhelmuns
Wilhelmua
Wilhelmos
wilhelmnus
Wilhelmnus
Wilhelmues
Guilleaumme
Wilhelmum
Guilhelmus
Willeml
Wilhelmanus
Wilhelmjus
Wilhelmes
Guilliaumme
Wilhelmas
Willemn
Wilhelmus
Wilhelmns
Willhelmus
Guiliaume
Willlen
Guiilleaume
Guilliaume
Willenis
Guiliermo
Wilempjen
Willempjen
Willepjen
Guilliermo
Wittem
Willen!
Wilhlenn
Wijlen
Wielen
Willen
wilhem
Willempke
Guilleaume
Wilhellemus
Wilhekmus
Guiileaume
Willeaume
Wilhelmuus
Guylleaume
Guileaumme
Guileaume
Wilhelemus
Guilleauma
Willewm
Guillesmus
Guïllermo
Guilermo
Guiilermo
Guillermo
Guillerlmus
Guijlleaume
Wilheminus
Wilhelhmus
Guillaum
guillaum
Gueillaum
Wilhemus
Guilhemus
Wielhemus
Wilhehmus
Wilhelminus
Wilhelmienus
Wilherlmus
Wilhermus
Weilhim
Wilhiem
Wilheim
Wilhein
Willaum
Guillaim
wilhemus
Wilhelnmus
Woalter
Willhem
Guillhem
Wilheem
Wilhem
Wölhelm
Wilhelimus
Wilhelus
Willaim
Willemerman
Wiechem
Wiloem
Wilhelmius
Wilhelmijs
Guilhelmis
Wilhelmjs
Wilhelmis
Willemhelmus
Willoem
Wilhelnus
Weilhelmus
Wwilhelmus
Wylhelmus
wWilhelmus
(Wilhelmus)
(Wilhelmus
Wilhelmüs
Wilhelmus\
Guiljame
Wilhelmus?
WilhelmusHubertus
Wilheelmus
Wilhelmmus
Wielhelmus
Wilhhelmus
Wiilhelmus
WEilhelmus
wilhelmus
Wilhelmus)
Guillieaume
Wilhelmuss
Wilhwlmus
WilhelmusStephanus
Wilhwelmus
WIlhelmus
Willum
Willkem
Guillum
Wilkhelmus
William
Wilhelmiem
Wilhlemus
Wilhelmigs
Wilielmus
Willme
Willielmus
Wilme
Güilielmus
WilhelmusHenricus
Guililmus
WilhelmusTheodorus
Guileilmus
Wilhelmushenricus
Guïllielmus
Wilhelmusn
Guilielmus
Wilhelmuszn
Guillijaam
Eilhelmus
Willemus
ilhelmus
Wiiliam
Ilhelmus
Guilemus
Willemcus
Guillemus
WilhelmusJohannes
Willemmus
Wilhelmushubertus
Wilehmus
Wilhelmuw
Wilemus
Wwilhwlmus
Willliam
Guilliaam
Wieliam
Guiliam
Guillielmus
Guiliaam
Wilhlmus
Guillieam
Guillmus
Guilliam
Wiliaam
Wiliam
Wilhmus
Wilnelmus
Guiilmus
Willwm
Wilmus
Guilmus
Aillem
JohannesWilhelmus
Johanneswilhelmus
CornelisWilhelmus
Gulliëlmus
Guliëlmus
Gijlliaume
Güliëlmus
Guli?lmus
Guijelmus
Gulielmus
Guiëlmus
Giliaume
Gilliaume
Gilliaumme
Guihelmus
Guikelmus
Gullielmus
Guielmus
Jannwillem
Janwillem
JanWillem
JanWilhelmus
MartinusWilhelmus
Qwillem

challenge
• learn variation in person names from use of names in real life (let data speak for itself)
• automatically from big data

required
name variation is difficult to model, therefore:
• big data
– with many references to individuals
• true person resolution
– proof that the same individual is concerned
– even with data that contain name variants

big data
• Dutch vital registration (who‐was‐who 2011), 1811 to early 20th century
– 4.1 million birth certificates (~30%)
– 3.1 million marriage certificates (~90%)
– 7.6 million death certificates (~65%)
55 million name references to persons

source names
1,052,000 different full first names (composite): Jan, Johanna Maria Cornelia
111,900 different female first names (singular, Maria)
82,700 different male first names (singular, Jan)
681,000 different surnames (prefixes included): Bakker, de Vries
600,000 different surnames (prefixes excluded): Vries

information per person
• first name person (child, bride or groom, deceased)
• first name father
• surname father
• first name mother
• surname mother (always maiden name in The Netherlands)
• age person

person resolution
• assumption: the available information identifies a person uniquely (if there is exact matching)
• relaxed assumption: one of the first names and surnames of the mother or father is not needed for true person resolution

test of assumption (of true person resolution)
• consider all matches between birth and death certificates with exact matching of all information
• leave out one name per match
• count number of multiple matches
result: only 85 out of 1,107,162 matches are not unique

example
Johanna Endt
• marries in 1858 as the 29-year-old daughter of Gerrit Endt and Dorothea Kerbert
• dies in 1882 as the 54-year-old daughter of Gerrit Endt and Doortje Kerbert
~1829, Johanna, Gerrit, Endt, Kerbert, Dorothea
~1828, Johanna, Gerrit, Endt, Kerbert, Doortje
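
The relaxed matching in the Johanna Endt example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Reference` class, the `same_person` function, and the one-year tolerance on the birth year are hypothetical choices made here. The birth year is estimated as event year minus stated age.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    """One mention of a person on a certificate (hypothetical structure)."""
    first_name: str
    birth_year: int  # event year minus stated age, so only approximate
    father_first: str
    father_surname: str
    mother_first: str
    mother_surname: str


def parent_names(ref):
    """The four parent names used for matching."""
    return (ref.father_first, ref.father_surname,
            ref.mother_first, ref.mother_surname)


def same_person(a, b, year_tolerance=1):
    """Relaxed match: same first name, birth years within the tolerance,
    and at most one of the four parent names differs.
    The tolerance of one year is an assumption of this sketch."""
    if a.first_name != b.first_name:
        return False
    if abs(a.birth_year - b.birth_year) > year_tolerance:
        return False
    differing = sum(x != y for x, y in zip(parent_names(a), parent_names(b)))
    return differing <= 1


# The two references from the example slide.
marriage = Reference("Johanna", 1858 - 29, "Gerrit", "Endt", "Dorothea", "Kerbert")
death = Reference("Johanna", 1882 - 54, "Gerrit", "Endt", "Doortje", "Kerbert")

print(same_person(marriage, death))
```

The two references agree on everything except the mother's first name, so they are linked, and Dorothea ‐ Doortje becomes a candidate variant pair.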

harvesting name variant pairs (procedure)
• identify all record pairs of individuals (over birth, marriage and death certificates) that exactly share
– first name of the individual
– approximate year of birth
– three out of four names of parents (first names and surnames)
• collect pairs of the remaining name, if different
Christiena ‐ Christina
Bloothooft ‐ Bloothoofd

harvesting name variant pairs (results)
female first names: 48,600 pairs, 246,500 tokens
male first names: 31,900 pairs, 183,000 tokens
surnames: 177,000 pairs, 374,900 tokens
average:
first names: 5 to 6 tokens per variant pair
surnames: 2 tokens per variant pair
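
The harvesting procedure can be sketched as follows. This is a toy version under assumptions: records are plain dicts with hypothetical field names, the birth year may differ by at most one year, and "tokens" is read as the number of record pairs in which a variant pair is observed. Comparing all pairs of records is quadratic and would not scale to 55 million references; a real run would need blocking or indexing, which is left out here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical field names for the four parent names.
PARENT_FIELDS = ("father_first", "father_surname", "mother_first", "mother_surname")


def harvest_pairs(records, year_tolerance=1):
    """Count name variant pairs: records must share the first name and the
    approximate birth year, and exactly three of the four parent names.
    The two values of the remaining parent name form the variant pair."""
    pairs = Counter()
    for a, b in combinations(records, 2):
        if a["first_name"] != b["first_name"]:
            continue
        if abs(a["birth_year"] - b["birth_year"]) > year_tolerance:
            continue
        differing = [f for f in PARENT_FIELDS if a[f] != b[f]]
        if len(differing) == 1:
            field = differing[0]
            # Store the pair in sorted order so A-B and B-A count together.
            pairs[tuple(sorted((a[field], b[field])))] += 1
    return pairs


records = [
    {"first_name": "Johanna", "birth_year": 1829, "father_first": "Gerrit",
     "father_surname": "Endt", "mother_first": "Dorothea", "mother_surname": "Kerbert"},
    {"first_name": "Johanna", "birth_year": 1828, "father_first": "Gerrit",
     "father_surname": "Endt", "mother_first": "Doortje", "mother_surname": "Kerbert"},
]

for pair, tokens in harvest_pairs(records).items():
    print(pair, tokens)
```

Under this reading of "tokens", the reported totals give 246,500 / 48,600 ≈ 5.1 tokens per female first name pair, 183,000 / 31,900 ≈ 5.7 per male first name pair, and 374,900 / 177,000 ≈ 2.1 per surname pair, in line with the averages on the slide.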

so far so good, but
• found variants can be due to errors in the source, during transcription or to typos
• theoretical issue: what is a name variant, and what is an error?

example
• the original certificates are not error‐free
in the source documents:
Pieter
born as son of Jacob Houtlosser and Aafje Spruit,
died as son of Jacob Houtlosser and Grietje Spruit
variant Aafje ‐ Grietje ?

variants and errors
• variants share the same lemma and errors do not
• Variants
Willem ‐ Wilhelm
Willem ‐ Guillaume
Willem ‐ W8llem (no indication of different lemma)
• Errors
Grietje ‐ Aafje
Fijtje ‐ Sijtje (understandable reading error but different lemma)

variants and errors
distinction is difficult to make
requires onomastic expertise
(which we would like to avoid, let the data speak for itself)

methods for cleaning
• using name dictionaries with lemmas
– to accept name pairs
• using known non‐variants
– to reject name pairs
• rules
– to accept name pairs
all with manual intervention (< 2%)

cleaning | name dictionaries
• dictionary of Dutch first names (20,000), but
– lemmas too detailed
– names with multiple lemmas
– only 8% of all first name pairs share lemma in dictionary (43% of tokens)
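
A minimal sketch of dictionary-based cleaning, assuming a lemma lookup in which a name may carry several lemmas. The entries in `LEMMAS`, the lemma labels, and the `classify` function are made up for illustration and are not taken from the actual dictionary of Dutch first names. Pairs that the dictionary cannot decide are sent to review rather than rejected, since the slides only use the dictionary to accept pairs.

```python
# Toy lemma dictionary: name -> set of lemma labels (illustrative entries only).
LEMMAS = {
    "Willem": {"WILLEM"},
    "Wilhelm": {"WILLEM"},
    "Guillaume": {"WILLEM"},
    "Grietje": {"GRIET"},
    "Aafje": {"AAF"},
}

# Pairs known not to be variants, stored order-independently.
KNOWN_NON_VARIANTS = {frozenset(("Fijtje", "Sijtje"))}


def classify(name_a, name_b):
    """Return 'reject' for a known non-variant pair, 'accept' when both
    names are in the dictionary and share at least one lemma, and
    'review' otherwise (rules or manual inspection would follow)."""
    if frozenset((name_a, name_b)) in KNOWN_NON_VARIANTS:
        return "reject"
    lemmas_a = LEMMAS.get(name_a, set())
    lemmas_b = LEMMAS.get(name_b, set())
    if lemmas_a & lemmas_b:
        return "accept"
    return "review"


for pair in [("Willem", "Guillaume"), ("Fijtje", "Sijtje"),
             ("Willem", "W8llem"), ("Grietje", "Aafje")]:
    print(pair, classify(*pair))
```

Willem ‐ W8llem ends up in review because W8llem is not in the dictionary, which matches the slide's remark that such a pair gives no indication of a different lemma.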

results, in variant pairs
• female first name pairs
34,800 accepted
13,900 errors (29%)
• male first name pairs
22,500 accepted
9,400 errors (29%)
• surname pairs
120,100 accepted
57,100 errors (32%)

very many variant pairs (Willemina)
(slide shows harvested pairs of the form X ‐ Y; distinct name forms shown:)
WILMINA, WILLEMJE, WELLEMTJE, WILMTJE, WILLEMTJE, WILHELMINA, WILLEPMJE, WILLEMPIE, WILLEMIJNTJE, WLLEMIJNTJE, WILLEMIJN, WILLEMTIEN, WILEHELMINA, WILLEMKE, WILLEMKEN, WILLEMINA, WILLEMIENA, WIHELMINA, WILHEMINA, WILLEMPJE, WILLEMIJNTE, WILLEMPTJE, WILLEMYNA, WILEMPJE, WILLEMIINTJE, WILMIJNA, WILEMTJE, WELLIMTJE, WOLLEMTJE, WILLEMIJNA, WILMTIEN, WILHELMINE, WILLEKEN, WILLEMINE, WILLIMINA, WILLENKE, WILEMIJNTJE, WILMKEN, WILLEMYNTJE, WILSJE, WILLEMEINTJE, WILLEMINTJE, WILELMINA, WILLEMEIJNTJE, WILHELMIMA, WILHELMIJNA, WILHELMA, WILLEINTJE, WILHELMIA, WILLEKE, WILLEMIEN, WILLEM, WILTIEN, WILMKE, WELHELMINA, GUILLIELMINE, WILHELMIENA, WILLEMIN, WILLELMIN, WILEMIJNA, WILLEMTIJN, WILLEMIJNE, WILLEMS, WILLEMA, WILHELINA, WULLEMPJE, WILHELM, GUILLEMINE, WILLEMIENTJE, WILLMINA, GUILLELMINE, WILMPJE, WILLEMTJEN, WILLEMPIEN, WILMINE, WILKENS, WIILEMINA, WILLIMPJE, WELLEMINA, WILMPTJE, WILHELMI, WILEMINA, WILMIENA, WILLEMPKE, WILLEMIJNTIE, WILMIJNTJE, WILLMEPJE, GUILIELMINE, WILKES, WILLEMDINA, WILHLEMINA, WILLLEMINA, WILHELMUS, WILHLMINA, WELMTJE, WILLEMMINA, WILLELMINA, WILLEMIMA, WILLEMPJEN, WILHELHERMINA, GUILLEMINA, GUILHELMINE, WILHELLEMINA, WILLEMJEN, WILHELMIN, WILLEMPJ, WILLEMIENE, WILLEMSEN, GUILLELMINA, WILEMKE, WIMPKE, WILKELINA
and many more

name clusters
• variant pairs (are interconnected)
Jan ‐ Johannes
Jan ‐ Joannes
Jan ‐ Johan
Johannes ‐ Johan, etc
• create cluster Jan {Jan, Johannes, Johan}

name clusters
• male first names
1,221 clusters (16,487 names, 20%)
• female first names
1,530 clusters (23,816 names, 21%)
compares to number of lemmas in Dutch dictionary of first names, vd Schaar 1964
• surnames
11,686 clusters (93,839 names, 17%)
compares to number in Dutch surnames overview (without many variants), Winkler 1885
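
One way to turn interconnected variant pairs into clusters is to take the connected components of the graph whose edges are the accepted pairs. The slides do not say which clustering method was used, so the union-find sketch below is an assumption, and the `cluster` function is a hypothetical name.

```python
from collections import defaultdict


def cluster(pairs):
    """Group names into clusters: two names end up in the same cluster
    when a chain of variant pairs connects them (connected components,
    computed with a small union-find)."""
    parent = {}

    def find(name):
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for name in list(parent):
        groups[find(name)].add(name)
    return list(groups.values())


pairs = [
    ("Jan", "Johannes"),
    ("Jan", "Joannes"),
    ("Jan", "Johan"),
    ("Johannes", "Johan"),
    ("Dirk", "Dirck"),
]

for names in cluster(pairs):
    print(sorted(names))
```

With this approach a single wrongly accepted pair is enough to merge two otherwise separate clusters, which is one reason to clean the pairs before clustering.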

conclusions
• person name variants need proof from true person links
• expert knowledge necessary because errors cannot be distinguished fully automatically from true variants (but < 2%)
• final results are promising as a starting point to create a national repository of proven name variants