9-10-2014

Large scale harvesting of variants of proper names

Gerrit Bloothooft, UiL-OTS, Utrecht University, g.bloothooft@uu.nl
Marijn Schraagen, LIACS, Leiden University, m.p.schraagen@liacs.leidenuniv.nl
The Netherlands

ICOS 2014 Glasgow Utrecht Leiden Links

name variants
• different versions of a name that can denote the same object
• requires a proof that the same object is involved (in at least one example)
  – not always easy
  – rarely explicitly provided

proper names in historical sources
Lots of variation
  – spelling variation
  – suffix variation
  – abbreviation
  – translation
  – typos (digitization)
  – …
Examples: Dirk - Dirck, Willem - Willempje, Willem - Wim, Willem - Guillaume, Willem - Wilhelmus, Willem - Aillem

challenge
variation!
Guljelmus Wllhelmus Wlhelmus WIllem (Willem) Wiellem Wlllem Gujlelnius Wllem WiIllem Wijllem Wihelmus Willemj Wikllem Wwillem Willlem Guilleam Willeam Willem Wil.lem Wilem Guileam Willelmini willem Wiilem Guillem Weillem Guilelmis Wil;helmus Wilhlem Welhelmus
Wiillem Wiehelmus Wulhelmus Willem) Wilehelmus Woillem Wihhelmus Weijlem Willelmus Wi;;em Wilehlmus Wuhelm Guilelmus Wilhlelmus Willem(se) Wilalem Wullem Willem. W#ilhelmus Guillelmus Wliiem Wlihelmus Wilelmus Willemm Wileem Wìllem Willemem Wolhelmus Wechelmus Guilllelmus Wilemm
W.ilhelmus Willem] Willemh \Willem Wïllem w8illem Wilhellmus Wilhelm. Wilmhelmus Wilhelmuns Wilhelmua Wilhelmos wilhelmnus Wilhelmnus Wilhelmues Guilleaumme Wilhelmum Guilhelmus Willeml Wilhelmanus Wilhelmjus Wilhelmes Guilliaumme Wilhelmas Willemn Wilhelmus Wilhelmns Willhelmus Guiliaume Willlen Guiilleaume Guilliaume Willenis Guiliermo Wilempjen Willempjen Willepjen Guilliermo Wittem Willen!
Wilhlenn Wijlen Wielen Willen wilhem Willempke Guilleaume Wilhellemus Wilhekmus Guiileaume Willeaume Wilhelmuus Guylleaume Guileaumme Guileaume Wilhelemus Guilleauma Willewm Guillesmus Guïllermo Guilermo Guiilermo Guillermo Guillerlmus Guijlleaume Wilheminus Wilhelhmus Guillaum guillaum Gueillaum Wilhemus Guilhemus Wielhemus Wilhehmus Wilhelminus Wilhelmienus Wilherlmus Wilhermus Weilhim Wilhiem Wilheim Wilhein Willaum Guillaim wilhemus Wilhelnmus Woalter Willhem Guillhem Wilheem Wilhem Wölhelm Wilhelimus Wilhelus Willaim Willemerman Wiechem Wiloem Wilhelmius Wilhelmijs Guilhelmis Wilhelmjs Wilhelmis Willemhelmus Willoem Wilhelnus Weilhelmus Wwilhelmus Wylhelmus wWilhelmus (Wilhelmus) (Wilhelmus Wilhelmüs Wilhelmus\ Guiljame Wilhelmus? WilhelmusHubertus Wilheelmus Wilhelmmus Wielhelmus Wilhhelmus Wiilhelmus WEilhelmus wilhelmus Wilhelmus) Guillieaume Wilhelmuss Wilhwlmus WilhelmusStephanus Wilhwelmus WIlhelmus Willum Willkem Guillum Wilkhelmus William Wilhelmiem Wilhlemus Wilhelmigs Wilielmus Willme Willielmus Wilme Güilielmus WilhelmusHenricus Guililmus WilhelmusTheodorus Guileilmus Wilhelmushenricus Guïllielmus Wilhelmusn Guilielmus Wilhelmuszn Guillijaam Eilhelmus Willemus ilhelmus Wiiliam Ilhelmus Guilemus Willemcus Guillemus WilhelmusJohannes Willemmus Wilhelmushubertus Wilehmus Wilhelmuw Wilemus Wwilhwlmus Willliam Guilliaam Wieliam Guiliam Guillielmus Guiliaam Wilhlmus Guillieam Guillmus Guilliam Wiliaam Wiliam Wilhmus Wilnelmus Guiilmus Willwm Wilmus Guilmus Aillem JohannesWilhelmus Johanneswilhelmus CornelisWilhelmus Gulliëlmus Guliëlmus Gijlliaume Güliëlmus Guli?lmus Guijelmus Gulielmus Guiëlmus Giliaume Gilliaume Gilliaumme Guihelmus Guikelmus Gullielmus Guielmus Jannwillem Janwillem JanWillem JanWilhelmus MartinusWilhelmus Qwillem

name variation is difficult to model, therefore:
• learn variation in person names from use of names in real life (let data speak for itself)
• automatically from big data

required
• big data
  – with many references to individuals
• true person resolution
  – proof that the same individual is concerned
  – even with data that contain name variants

big data source
• Dutch vital registration (who-was-who 2011), 1811 - early 20th century
  – 4.1 million birth certificates (~30%)
  – 3.1 million marriage certificates (~90%)
  – 7.6 million death certificates (~65%)
55 million name references to persons

names
1,052,000 different full first names (composite): Jan, Johanna Maria Cornelia
111,900 different female first names (singular): Maria
82,700 different male first names (singular): Jan
681,000 different surnames (prefixes included): Bakker, de Vries
600,000 different surnames (prefixes excluded): Vries

information per person
• first name person (child, bride or groom, deceased)
• first name father
• surname father
• first name mother
• surname mother (always maiden name in The Netherlands)
• age person

person resolution
• assumption: the available information identifies a person uniquely (if there is exact matching)
• relaxed assumption: one of the first names and surnames of the mother or father is not needed for true person resolution

example (of true person resolution)
Johanna Endt
• marries in 1858 as 29-year-old daughter of Gerrit Endt and Dorothea Kerbert
• dies in 1882 as 54-year-old daughter of Gerrit Endt and Doortje Kerbert
~1829, Johanna, Gerrit, Endt, Kerbert, Dorothea
~1828, Johanna, Gerrit, Endt, Kerbert, Doortje

test of assumption
• consider all matches between birth and death certificates with exact matching of all information
• leave out one name per match
• count number of multiple matches
result: only 85 out of 1,107,162 matches are not unique

harvesting name variant pairs (procedure)
• identify all record pairs of individuals (over birth, marriage and death certificates) that exactly share
  – first name of the individual
  – approximate year of birth
  – three out of four names of parents (first names and surnames)
• collect pairs of the remaining name, if different
  Christiena - Christina
  Bloothooft - Bloothoofd

harvesting name variant pairs (results)
                      pairs     tokens
female first names   48,600    246,500
male first names     31,900    183,000
surnames            177,000    374,900
average:
first names: 5 to 6 tokens per variant pair
surnames: 2 tokens per variant pair

so far so good, but
> found variants can be due to errors in the source, during transcription or to typos
• theoretical issue: what is a name variant, and what is an error?

example
in the source documents:
Pieter
born as son of Jacob Houtlosser and Aafje Spruit,
died as son of Jacob Houtlosser and Grietje Spruit
variant Aafje - Grietje?
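
The harvesting procedure on the slides (exact agreement on the individual's first name, approximate year of birth and three of the four parent names, then collecting the one remaining name if it differs) can be sketched as below. This is a minimal illustration, not the authors' implementation: the record layout, the field names, the function `harvest_variant_pairs` and the one-year tolerance for "approximate" are all assumptions. The first two records are the Johanna Endt example from the slides; the third is a hypothetical non-matching record.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical field names; the slides do not specify a record layout.
PARENT_FIELDS = ("father_first", "father_sur", "mother_first", "mother_sur")


def harvest_variant_pairs(records, year_tolerance=1):
    """Collect (field, name_a, name_b) from record pairs that share the
    person's first name, an approximate birth year, and exactly three of
    the four parent names. year_tolerance is an assumed parameter."""
    # Group on the individual's first name, which must match exactly.
    by_first_name = defaultdict(list)
    for rec in records:
        by_first_name[rec["first"]].append(rec)

    pairs = []
    for group in by_first_name.values():
        for a, b in combinations(group, 2):
            if abs(a["birth_year"] - b["birth_year"]) > year_tolerance:
                continue
            differing = [f for f in PARENT_FIELDS if a[f] != b[f]]
            # Exactly one differing parent name: a candidate variant pair.
            if len(differing) == 1:
                field = differing[0]
                pairs.append((field, a[field], b[field]))
    return pairs


records = [
    # Marriage 1858, age 29 (from the slides)
    {"first": "Johanna", "birth_year": 1829, "father_first": "Gerrit",
     "father_sur": "Endt", "mother_first": "Dorothea", "mother_sur": "Kerbert"},
    # Death 1882, age 54 (from the slides)
    {"first": "Johanna", "birth_year": 1828, "father_first": "Gerrit",
     "father_sur": "Endt", "mother_first": "Doortje", "mother_sur": "Kerbert"},
    # Hypothetical unrelated Johanna: all four parent names differ
    {"first": "Johanna", "birth_year": 1829, "father_first": "Jan",
     "father_sur": "Bakker", "mother_first": "Maria", "mother_sur": "de Vries"},
]

pairs = harvest_variant_pairs(records)
```

Under these assumptions the only harvested pair is Dorothea - Doortje, taken from the mother's first name. The same logic would also harvest Aafje - Grietje from the Pieter example, which is why the harvested pairs still need cleaning.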

variants and errors
• the original certificates are not error-free
• variants share the same lemma and errors do not
• distinction is difficult to make
• requires onomastic expertise (which we would like to avoid, let the data speak for itself)

variants and errors
• Variants
  Willem - Wilhelm
  Willem - Guillaume
  Willem - W8llem (no indication of different lemma)
• Errors
  Grietje - Aafje
  Fijtje - Sijtje (understandable reading error but different lemma)

methods for cleaning
• using name dictionaries with lemmas
  • to accept name pairs
• using known non-variants
  • to reject name pairs
• rules
  • to accept name pairs
all with manual intervention (< 2%)

cleaning | name dictionaries
• dictionary of Dutch first names (20,000), but
  – lemmas too detailed
  – names with multiple lemmas
  – only 8% of all first name pairs share lemma in dictionary (43% of tokens)

results, in variant pairs
• female first name pairs: 34,800 accepted, 13,900 errors (29%)
• male first name pairs: 22,500 accepted, 9,400 errors (29%)
• surname pairs: 120,100 accepted, 57,100 errors (32%)

very many variant pairs (Willemina)
[slide: columns of harvested variant pairs for Willemina, with forms such as WILMINA, WILLEMPJE, WILLEMTJE, WILHELMINA, WILLEMIJNTJE, WILLEMKE, GUILLIELMINE]
and many more

name clusters
• variant pairs (are interconnected)
  Jan - Johannes
  Jan - Joannes
  Jan - Johan
  Johannes - Johan, etc.
• create cluster Jan {Jan, Johannes, Johan}

name clusters
• male first names: 1,221 clusters (16,487 names, 20%)
• female first names: 1,530 clusters (23,816 names, 21%)
  compares to number of lemmas in Dutch dictionary of first names, vd Schaar 1964
• surnames: 11,686 clusters (93,839 names, 17%)
  compares to number in Dutch surnames overview (without many variants), Winkler 1885

conclusions
• person name variants need proof from true person links
• expert knowledge is necessary because errors cannot be distinguished fully automatically from true variants (but < 2%)
• final results are promising as a starting point to create a national repository of proven name variants
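
The name clusters built from interconnected variant pairs can be read as connected components of a graph whose edges are the accepted pairs. The sketch below uses a union-find structure for this; it is one possible way to do it, and the slides do not say how the authors computed their clusters. The function name `cluster_names` is hypothetical. The Jan pairs are the ones listed on the slides, and Dirk - Dirck is taken from the earlier spelling examples.

```python
def cluster_names(variant_pairs):
    """Group names into clusters: two names share a cluster when a chain
    of variant pairs connects them (connected components via union-find)."""
    parent = {}

    def find(name):
        # Register unseen names as their own root, then walk to the root.
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for a, b in variant_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    clusters = {}
    for name in list(parent):
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())


accepted_pairs = [
    ("Jan", "Johannes"),
    ("Jan", "Joannes"),
    ("Jan", "Johan"),
    ("Johannes", "Johan"),
    ("Dirk", "Dirck"),
]

clusters = cluster_names(accepted_pairs)
```

With these pairs, all Jan forms fall into one cluster and Dirk and Dirck into another. Because merging is transitive, a single erroneous pair such as Aafje - Grietje would join two unrelated clusters, so it seems reasonable to cluster only after the cleaning step.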