pnas.2119281119.sapp

1 4 Revealing the recent demographic history of Europe via haplotype sharing in the UK Biobank. 5 Supplementary Information Appendix 2 3 6 7 8 Authors Edmund Gilbert1,2*, Ashwini Shanmugam1,2,3, Gianpiero L. Cavalleri1,2,3*. 9 10 11 12 Affiliations 13 * Corresponding Author 14 15 Table of Contents 16 Affiliations ............................................................................................................................................... 1 17 Supplementary Data 1 - UK Biobank Self-Reported Ethnicity ................................................................ 3 18 Supplementary Data 2 – Individual Country Population Structure ........................................................ 4 19 Supplementary Data 2.1 - Per Country PCA Distributions .................................................................. 4 20 Supplementary Data 2.2 - Principal Component Analysis of UK Biobank Europeans ...................... 12 21 Supplementary Data 2.3 - Per ADMIXTURE Component Proportions .............................................. 13 22 Supplementary Data 2.3 - Comparison to Human Origins References............................................. 14 23 Supplementary Data 3 – Comparison on Linked and Unlinked Methods............................................. 15 24 Supplementary Data 4 - Genetic Structure of European Leiden Clusters ............................................ 18 25 Supplementary Data 5 – Detailed European Genetic Landscape ......................................................... 22 26 North-Western Europe ..................................................................................................................... 22 27 Central-Eastern Europe ..................................................................................................................... 23 28 Southern Europe ............................................................................................................................... 24 29 Supplementary Data 6 - Malta .............................................................................................................. 25 30 Supplementary Data 7 - Mixed Europeans ........................................................................................... 26 31 Supplementary Data 8 - IBD Ne Curves ................................................................................................ 31 32 Supplementary Data 9 - Spain .............................................................................................................. 36 1. School of Pharmacy and Biomolecular Sciences, Royal College of Surgeons in Ireland, Dublin. 2. FutureNeuro SFI Research Centre, Royal College of Surgeons in Ireland, Dublin. 3. The SFI Centre for Research Training in Genomics Data Science, Ireland. Authors.................................................................................................................................................... 1 33 34 35 References ............................................................................................................................................ 39 36 37 38 39 40 41 42 43 44 Supplementary Data 1 - UK Biobank Self-Reported Ethnicity Whilst sampling European ancestry within the UK Biobank1, we selected based on European birthplace and self-reported ethnicity, selecting the background which fell under the parent description “White” (F.21000 = 1/1001/1002/1003) or “other” (F.21000 = 6) (see Methods). With comparison to West Eurasian references from the Human Origins2 by projection principal component analysis (PCA) this reveals a sample of European ancestry across the continent (SI Appendix Figure S2.11). We further recorded the proportions of the selected self-reported ethnicity categorises within each sampled country/region of birth (Supplementary Figure 1.1), as well as each Leiden3 cluster subsequently detected (Supplementary Figure 1.2). Brit. & Ire. W Europe N Europe C Europe E Europe F.21000.Label 0.75 White White − British 0.50 White − Irish White − Other 0.25 46 Channel Islands England Isle of Man Northern Ireland Orkney Republic of Ireland Scotland Wales Belgium France Netherlands Denmark Faroe Islands Finland Iceland Norway Sweden Austria Czech Republic Germany Hungary Poland Slovakia Switzerland Belarus Estonia Latvia Lithuania Russia Ukraine Gibraltar Italy Malta Portugal Spain Albania Bosnia and Herzegovina Bulgaria Croatia Greece Kosovo North Macedonia Romania Serbia Serbia/Montenegro Slovenia Other 0.00 47 48 49 S Europe 1.00 Cyprus Turkey Proportion Membership Near East SE Europe 45 Supplementary Figure 1.1 – Per-country/region of birth proportions for five self-reported ethnicity categories from the UK Biobank. The ethnicity category corresponds to UK Biobank phenotype code F.21000. 50 White White − British 0.50 52 53 54 White − Irish White − Other Other Malta Mixed European Spain France & Switz. Cyprus Portugal Turkey Albania & Greece Italy Greece Finland Mixed Scand. Russia Latvia & Lithuania Poland NW Balkans NE Balkans CE Europe 1 CE Europe 2 Wales 2 Channel Islands Wales 1 France England 2 England 1 Isle of Man Ireland 2 Ireland & N.Ireland Orkney Ireland 1 N.Ireland & Scotland 4 N.Ireland & Scotland 2 N.Ireland & Scotland 3 N.Ireland & Scotland 1 Norway Sweden 1 Denmark Belgium Switzerland Netherlands 0.25 0.00 51 F.21000.Label 0.75 Belgium & France Proportion Membership 1.00 Supplementary Figure 1.2 – Per-Leiden-cluster proportions for five self-reported ethnicity categories from the UK Biobank. The ethnicity category corresponds to UK Biobank phenotype code F.21000. 55 Supplementary Data 2 – Individual Country Population Structure 56 57 58 59 60 61 62 Supplementary Data 2.1 - Per Country PCA Distributions In sampling of UK Biobank1 participants with European ancestry and birthplace we performed an initial principal component analysis (PCA), investigating the genetic structure in the sample of 5,500 individuals, and the genetic structure sampled in each individual country/region of birth. Below we record the distribution of individuals for each such region, grouped by geographic proximity. Each plot shows the coordinates from the PCA shown in Figure 1, highlighting individuals from each individual region in red and every other individual as a small grey circular point. 63 64 65 66 Supplementary Figure 2.1 – Principal Component coordinates of UK Biobank individuals with Cypriot, or Turkish birthplace. 67 68 69 70 Supplementary Figure 2.2 – Principal Component coordinates of UK Biobank individuals with Greek, Kosovan, North Macedonian, Romanian, Serbian/Montenegro, or Slovenian birthplace. 71 72 73 Supplementary Figure 2.3 – Principal Component coordinates of UK Biobank individuals with Gibraltarian, Italian, Maltese, Portuguese, or Spanish birthplace. 74 75 76 Supplementary Figure 2.4 – Principal Component coordinates of UK Biobank individuals with Belarusian, Estonian, Latvian, Lithuanian, Russian, or Ukrainian birthplace. 77 78 79 Supplementary Figure 2.5 – Principal Component coordinates of UK Biobank individuals with Austrian, Czech, German, Hungarian, Polish, Slovakian, or Swiss birthplace. 80 81 82 83 Supplementary Figure 2.6 – Principal Component coordinates of UK Biobank individuals with Danish, Faroese, Finnish, Icelandic, Norwegian, or Swedish birthplace. 84 85 86 Supplementary Figure 2.7 – Principal Component coordinates of UK Biobank individuals with Belgian, French, or Dutch birthplace. 87 88 89 90 Supplementary Figure 2.8 – Principal Component coordinates of UK Biobank individuals with British or Irish birthplace. 91 92 93 94 95 96 Supplementary Data 2.2 - Principal Component Analysis of UK Biobank Europeans To supplement Figure 1, we show below the additional principal components three through to ten. These were calculated using the same methodology as those principal components shown in Figure 1, with the same legend and codes for country/region of birth. For each of the four panels, the x-axis principal component is given first and then the y-axis principal component (PC), i.e., for “PC3-PC4”, PC three forms the x-axis and PC four forms the y. 97 98 99 100 101 Supplementary Figure 2.9 - Principal component (PC) analysis of 5,500 Europeans from the UK Biobank1 using PLINK4,5 across PCs three through ten. 102 103 104 105 106 107 108 Supplementary Data 2.3 - Per ADMIXTURE Component Proportions Further, we performed ADMIXTURE6 analysis on the same individuals and common markers there were used for the PCA using PLINK4,5 above (see Methods). ADMIXTURE v1.3 is a fast modelled-based maximum likelihood estimation of an individuals’ ancestry, assuming k ancestral “populations” or components. To explore the genetic structure in the UKBB 5,500 sample we ran ADMIXTURE over k values from two to seven, run ten replications for each k value and choosing the replicate with the high log-likelihood and lowest cross-error validation score. We show the results below: 109 110 111 112 113 114 Supplementary Figure 2.10 - ADMIXTURE ancestry components for each sampled European country/region of birth over k values 2 to 7. Populations are ordered into the same groups of geographically adjacent regions as in Supplementary Figures 2.1-8. 115 116 Supplementary Data 2.3 - Comparison to Human Origins References A B HO Population 0.00 −0.05 117 118 119 120 121 122 123 124 125 0.00 0.03 Principal Component 1 0.06 Jew_Turkish Adygei Jew_Yemenite Balkar Albanian Chechen Bulgarian Georgian Croatian Iranian Greek Iranian_Bandari Romanian Kumyk Basque Lezgin Italian_North Nogai Italian_South North_Ossetian Maltese Armenian Sardinian Assyrian Sicilian Cypriot Spanish Druze Spanish_North Jordanian English Lebanese French Lebanese_Christian Orcadian Lebanese_Muslim Scottish Palestinian Belarusian Syrian Czech Turkish Estonian BedouinA Hungarian BedouinB Ukrainian Saudi Chuvash Yemeni Finnish Libyan Icelandic Moroccan Lithuanian Jew_Ashkenazi Mordovian Jew_Georgian Norwegian Jew_Iranian Russian Jew_iraqi Saami_WGA DK RU 0.05 RU TR RU RU Principal Component 2 Principal Component 2 0.05 Abkhasian 0.00 RU TRTR GR NO RU TR TR TR RUNL RU UA TRTR TR FI NO TR TRTR TR CZ UA TR SE LT RU NLFIDE TR TR TR TR TR TR TR TR TR TR TR RO TR TR FI FI FI RU RU LT TR TR TR FI FIFI TR TR TR FI TR FI FI FI FI FI TR TR RU RU TR TR RU FI FI RU TR FIRU RURU RU TRTR FI TRTR TR TR TR FI TR FI FI RU NO FI FI FI FI RU FI TR FI LT TR TR FI FI FI TR HU TR FI NO FI FI RU TR TR TR FI EE LV TR RURU FI FI RU NO FI FI TR FI TR FI RU RU RU TR FI NO FI CZ RU FI RU TR FI LT FI RU FI UA FI RU LV RU LT RU FI RU EE FI HU TR FI BG TR TR FI LT RU TR GR LT TR SE RU FI FI FI RU FI RU LV RU FI TR RU LT RU FI RU FI RU RU FI FI FI FI RU TR FI RU TR EE FI RU TR DE LV LT FI LV FI UA LV UA LV FI FI FI EE LV LT RU FI CZ LV RU FI RU PL FI LT LT LT RU UA FI RU TR RU RU TR FI LT RU LV LV LV FI LT RU FI LT RU FI FI TR TR TR TR RU RU LT FI EE UA RU SE RU TR LT LV UA EE TR LT RU PL RU RU PL RU LT TR FI FI EE TR FI BG LT RU PL FI LT FI RU LT LV RU RU RU NO LT HU SE LV LV PL LT RU LV LV RU EE UA GR RU LV LT FI LV TR RU FI LT PL LT LT PL RU TR TR RO LV LV FI RU BY LT PL LT LV LV LT LT LT UA FI BY TR RU LT DK UA UA PL UA PL CY TR FI LV FI NO RU TR EE UA FI RU RU LV LV RU RU TR LV PL DE RU RU RU LV EE RU RU UA LV FI LT FI RO UA LV FI UA PL LT PL UAUA R U PL LT FI FI PL RU LT PL LT RU MT LT SE LV LV PL LV UA PL PL LV EE LV NO PL LT CZ CY AT UA PL HU FI UA EE PL RU BGTR LV LTLT RU PL PL PL PL RU LT PL LV TR GR LT FI PL BA LV LV LT RU PL UA NO FI LV RU UA SE RU RU LT EE BY UA UA TR PL PL LT CZ PL UA PL RU LV LT RU HR PL NO BY DE RU LV PL RU PL RU UA LT RO AT TR PL GR UA PL PL HU GR LV LV HU HU RU TR HU UA PL UA PL CZ PL UA PL RU CZ PL DE PL CZ HR FI PL LT LT PL FI UA PL SE NO UA GR PL PL TRTR LT LT HU RO UA LT FI UA LTLV UA LT PL SE SE PL FI SK HU SE PL PL DE EN CZ LV FI LV SK PL HU PL HU RO MT RU CZ NL SE PL PL SK BG PL PL UA HU BA RU BG AT TR DE CZ RO TR UA PL CZ DE PL PL CZ PL SK PL SE EN HU PL SE PL UA UA HU CZ AT GR SE RU SE HU PL PL HU BA SE PL SE CH SK SC NL LT PL DK NO PL BG UA PL DE UA PL SK SE RO PL SK CZ LV HU FI AT PL SE PL CZ HR PL GR LT SE NL SE PL HU SE DE RO RU RO SE NO SE SC PL SK HU PL BG NI RS BG SE BA HU HU PL RO PL SK AT GR IE SE NL NO AT PL NI SE PL FI SE HR RO CZ CZ CZ MK PL HU BG SE FR DK BA SE SE GRGRCY LT NO PL BA NO SE CZ MK BA BA RO UA RO SK NO HU RO BG GR IE BG IS DE NL PL FR SE AT DK DK NO CZ AT SK CZ FR DE SK AT NL RO FI PL RS PL DK PL RU HU PL NI AT RU HU CY HU BA CZ HU RS NO HU BE HU RS DE CZ RU NO RS BG NO PL PL NO CZ CZ IM AT RO CZ CZ SE PL PL SK AT HU HR DE NO NL RS RO HU CZ DE RU NO PL NI PL SE AT DE AT CY SC DK BA CY CZ SE CZ SK DK HR HU RO BG SE AT SE PL DE HU NL CZ SK PL BG GR CZ CZ SE RU GR PL PL HU BG DK FI SC MT BA NO DK PL BA TR SE SE RO CY DK PL IE PL HR NL SK HR BA RS SE NL DK AT IE DK PL CZ SC BG CZ IM HR HR BG NO SE BG BG AT NI IE SK IE AT HU SE RS DE AT HU NO NO NO GR GR GR NO DK NL CZ PL PL SE DE SE CZ RO HU RO BG RU BG NO SK RO BA TR HU AT RO NO NO NO CZ IT TR SE NL SK CZ SK AT MK SE CH HU MT NI NO IS RO SI MK HU HR PL IE SK HU RO GR BG IE SI SE BG RO LV BG HR NO SE CZ AT CZ DK PL CH GI CY SE NO SE IE SC DE DK SE DE BG GR GR NO EN NI NO CI PL RS DE HU BA RO GR DK DK EN EN RO CZ AT RO SE IM DE SI NL DE CZ SC CY PL HR BA RO BG CY NL SE SE EN CZ HU RO GR NO NL RS EN BE IE HU RS CY NO DK SI RS SE DK SC HU NO AT HU SE DK CZ AT IE RO NO DK RS HR BG DE SE HU HR RO CZ SE CY OK HU SE AT NL NI DK HR IM SC HU AT RU NI IM PL DE RU FO IS NO WA SC IE SI BA PL AT BA SE DE BG CY PL SK BG BG RO GR TR SE IT IS CZ BG IS PL DK CZ AT SI RS CY WA PL IE NI AT PL AT RS RS DK AT DE WA DE SE NL DE DK SE NL CZ EN BG BG DK GI WA SE IE IM DE IT CY CZ AT BA NL BA NO MT BG SE NL SK RS BG IE EN AT RO DK WA IM DK HU HR RO IM NI MT CY EN BA HR NI BA RO RO OK AT HU NO NI CI MT AT AT DK BG BG CY EN DE RU NO SE NI AT RS AT MK WA DK SK RU SE CZ HU HR PL BG RO IE NI CY IE SE AT IS EN NL BE SC GI PL BE CZ CY IS CZ PL HU NI IS DK NL FI NO CZ CZ BE SE RO FR NO DE NI EN RO HR PL DK DK SE SC NL NO DE SC HR MT CZ IE CZ HU RS SE DK CI SI NL DK NO IE BA SE OK CH WA WA SC HU RS BG DK OK EN NL BG SE SE EN WA CZ WA NI EN DE SK RO DK AT MT HU RS GR SE SC WA MT NI WA GR SC NO SC MK IE SE CI IM NI EN HR BG SE AT RO WA EN PL WA AT RS IE AT DK GI DE GR DK NL RS GR DK NO FR SE DK SC BG GR EN MT BE CY DE DE IT AT HR HU IS WA BE DE RU GR GR SC SC AT NO NI SE EN IM TR AT GR DK SK CZ EN IE WA CH RS BG MT CZ BG SE SE NI MT NI OK NL RO DE HU TR SE NL NI TR GR OK DK IE DK NI OK NO AT IE WA DE PL PL HU RS RO IE SE IE MT BA DK NO SC SE IM DK IE BA CH DE DK SC EN DE NL SC IM WA AT SE BG NI HR CY SE IS NI AT EN HR CY NO SE DK BE NO NI BA DK NO NI NL SC RS MK BG GR NL NL DK PL RS DK NL RS BG GR IS CI NL AT CY CH NO DE DE RO BG IWA E CY WA DE HU CH CY NI CI DE GR GR SE CI NO CZ NL FR BG HU NI NI FI GR RO EN OK EN BG EN NL AT DK DE WA AT RO NO EN NO SC CH KO SE NI SC WA AT CY AT BG FR MT CY DK NI NL NL DE BA CZ WA DK IM NL CZ RS BG DK PL WA AT MT BA RO WA DK BE CY WA CY SC SC AL SC NI IM SE IE MT IE SC NO GR DE NI HU DE WA AT IE NI RS HR CY BE BA CI IE SC OK IE EN UA GR SE NO SE IE DK MT NI RS RS MK WA EN IE FR CH MT DK IS EN DE SE NO CY IE EN CZ AT IM CY CY EN HR CYCY CY WA SC CZ NL AT CZ SE DE DK MT CZ RS AT OK OK NL NI HR BA BG DK DE NI HU SE IM DK TR FO NL GI CI RO DK GI NI FR NL SI CY DK SC KO GR GR GR AT AT HU IEIE CH CH BG GR CI UA IE IE PT BE NI DK CY DE NL NL CY IE FR CY CI CY CH EN MK GI NI GI WA MT NI SC HR RS BG IE NO MT IE SC GR CY SE MT DE NL AT DK DK DK SC DK FO NL CZ EN DE NL NI NO NL DK GI DK CH CY NO GR NI CI EN WA WA AT HR DK WA SC IS DK SE PL CZ BG DK EN MT SE TR CI DE CY RO IE NO WA DE MT CY NO IE NI IE NI DE IE HU SE SE EN DE SC CZ HR NI BE IE BE AT RS GR SE BG CH NL AT AT IS DK IE DK SE SC BG NO EN OK NL EN BE IM SC BG CZ IS OK DK SE WA IE IE DKGI CI AT CH BA DE NI CI BG CY TR MT MT CI CY IS FR WA SC SC HR CH DE ENDK CY MT SE NI CI WA EN NI NL AT CY WA DK GI DE UA DK MT IE DE GR GR CY SE NI SC DE CH AT CY NL GI DE NL WA SE SC CY AT CH AT NL CI EN MT AT AT NO SE DK SC IE NI NL CZ CZ SC NL CY GI AT RS GR EN BE MT BE CH OK MT NI DE EN NO MT BE NL OK OK CH TR GR DK ES IE GR GR NL IM CY WA CH NL NI DE EN AT HU NO CY DK SC BE GR DE IE PT EN WA BE BE AT BE AT GR DK BE BE WA HU DE NI MT NL BG CY CY IE NL CY BE DK GI AT BE IM IM EN SC DK SC NI IT NO SE IE SC AT SC CI AT MT AT WA EN WA CI EN EN NL HR AT DE BE DK WA IM CI AT NI DE CH IS WA MT CI CI DK IM IE EN GR DE CY WA NI NO EN DE SE CZ HU OK PL MT NI CY GR NL NI CI MK MT DK NI EN DE ES DE AT DK SE NL FR CH MT IT SE IE CH NI MT IE CY NI NL CY SE WA NO IE DK CI SC HU FR SC SE IE FR BE IE EN CI GI EN SC NI SC CI AT CY MT SE SC NI IE BE BE MT NL IE NO NL DE NO CH SC SC DE EN CH IT GR NO NO NL GI IM AT CI CI C Y AT HR IE WA CH GR DE DE RS NL IM MT CY FR N L RU AT HU NI WACI BE BA OK FR BE EN CI HR IM NI PT IM MT HR DE ES AL MT CH CI AT AT NL BE NL CH BE BE SE SC CY SC AT BE WA GI PL CY DK SC BE BE MT BE NL NO DE NL MT MT CZ NL GR NI SE WA CY IE GR GR EN CI CY DE CH CH WA NI BG IE NL AT RU SC EN IM IM AT AT AT NL IT GR IE DK IM F NI DE R IT CI AT KO DE WA CZ MT EN CI OK BE BA DE NO NL SC CI NI MT AT AT NO IE EN SC CY NL BE IE OK WA MT NI DK DE WA DE AT AT NL NI IE TR IE CH CH OK SE SC FR NI CY MT BE IE DE SC NI EN HU NI MT CY IE SC SC CH WA IM WA GI EN FR DE IE WA NI IE EN CH FI EN WA WA MT DK SC MT GI FR BE SE FR OK IE GI CI NI CI WA DE AL IE CI DK EN CY BG CY NI EN DE AT MT N IDE BE HR MT NL AT FR CI DK FO NL NI BE AT IT CY EN MT NL CH AT PL NI IE NI SC EN WA CI HR IT GR NO EN GI BE CI GR GR WA SC IT RS DK GR DK IE NL DE IM FR IT NO OK IM BE NO EN CH AT RO MK BE SC CY FR CH IT BE FR DE AT CI IT SC EN CH WA IT CH SC AT FR MT AT RS DK EN MT FO OK EN IM SC NL BE CI FR GR IM GI SC CY CY DE SC NI GR WA IM DE FR WA RO IT CY CY IE SC DK BE NL AT EN DK GI DE BE IT AL KO HU IE IM SC CH FR NL NL DE MT IE GI CY CY WA CI CY AT SC BE EN BE CY GI NL CY MT EN BE AT GR CH EN GI DE HR CH IT GR SE WA IE EN DE HU SC FR FI GR GR GR GR NO BE FR DE DE BE AT PT WA NL CI NI PT BE IT WA RS GR DE NO DK SE IE OK CY DE GR IT EN NL BE CYCY IE SE NI SC NI NL NL NI AT CH GR GR WA BE CY WA CI SC AT E FR N FR EN CI SE NI CY GI CY IE CI IE EN IT CI NL AT FR NL SC CH PT IM BE BE CH CZ DE NO EN EN CI DK NL CH EN DE IE FR CH DK NL EN DE CH NL CH IT BG MK DE BE IM TR CZ BY IM AT BE NLBE DE SC DE CH EN CY SC WA WA SC WA SC EN IE BE EN MT BE CY NI NO WA SE SC CI AT NL NO CY MT NL CH EN AL IE MT RS KO SC EN AT CY WA WA WA NO WA OK NI CH NL NI EN GR MT NI AT AT IE FR MT MT HU WA BE IT DE NL GI GI FR FR AT IT RS IT DK IE WA NI FR AT EN NL EN BE GR IT NI CH GR NI MT CY WA NI NL FR CH GR NL MT SC CY NI WA AT TR CH WA WA DE GR GR CY FR MT GR BY IT DK WA AT FR EN BE IT WA BE DE FR CY CY GI BE EN CI AT IT DE NL NL AT CH CI NL MT IT CY CI CI FR CH SC MK GR HU CY FR DE TR KO GR GR CI IT AT WA SC CY NL MT DK FR IT DK NI CH FR CY FO BE NL GR GR AT CY SE MT IE OK IE NI FR IT BE AT CY NL MT IT RU CY MT IE IE CI SC NI AT DE HU PL SC FR CH CH IT IT EN CH CI FR IT IE NI CI NI MT CH CI IT CY DKCI OK IE SC EN DE WA DE IT NL SC FR IT DE WA NI BE CH NL W WA A IT GR MT IM WA WA CI MK HU NI EN SC DK EN GR BE IT CY SC NO CY BE CH BE CH AL DK GR NL OK SC CH CY IM WA SC NI WA CH IS WA EN NL GI ITFR GR OK FR SE DE SC FR MT CH RU CY MT NL DE BE OK WA DE AT AT IT DK GI FR CH MT MK GR RO NL NI BE CY MT CI CI RU IM FR BE IT IT CY EN DE SE CY CH CY CZ WA FR TR CH AT CY SC CY BE IM CI IE AL RU NL IE CI DE SC FR FR FR AT FR BE MT IT MK GI CY IT EN NI MT WA CH GR GR DE AT IE RS AL IT NI CI CI CI CY CY DE IT FR CH MT PT MT FR IT MT BE CH IT UA FR WA NI CH BE FR DE EN HR BE IE NI CH MT WA DE PT CH FR WA NL CH CY BE AT NL SC IT CY DE DE FR IM WA CI IM SE GR GR GRCY CY CI CH FR IT SC CY IT BE IM EN DE DK IE BE CH IT IT CY NI IE MT WA SC WA IM GR SC BE NI CH DE FR EN DE IT MT CY SC CY DE IT IT EN AT CH NL CY SC CI BE WA CI FR BE CI FR DE MT FR IT WA FR CH FR FR IT IM EN IT CY FR AT BE IT IT GR UA IE EN BE BE CH IT NL EN IT HU NL BE IT MT GI BE CH FR CH BE CY IM EN GR NI SC FR IT NI BE BE TR PT IT ES GR HU WA OK MT CH CY CY FR IT WA FR BE IT CI FR FR IT IT CH EN CH BE IT CI EN CI EN CH MT CY CH DE ITIT IE CI CH CI MT FR IT NL MT IE WA IM IT IT IT MT FR NL CH MT AT HU EN SC CH AT CY SC FR ITPT HU SC BE IT MT IT BG AT CY CY CY GI IT CH IT RU MT RO IT EN PT SC BE TR RU IM FR WA CH IT CI IT MK CI FR BE CI CH WA GR CY PT SC CH CH HU WA BE BE IT NL NO FR CH FR GR CI IT GI GR FR PL IT CH FR IT FR CY IT GR BE IT BE CZ EN NL IT AT FR IT MT CH CY FR IT FR IT IT CY FR CH IT BE BE IT CH CH FR MT IT GI FR FR CH DE CH IT CH IT GR IT NL SC IT GR CY GI FR FR OK MT EN MT CZ GR FR IT IT EN IT IT SK IT MT CI IT CH FR DK AT CH BE FR IT MT GR FR IT CY FR CH BE FR IT IT MT IT BE RS MT MT GI IT FR FR IT CH FR IT FR CH FR TR MT BE IT WA FR HU FR MT FR FR CZ CZ IT IT FR FR IE PT GR IT FR MT SK EN GR FR CH PT FR IT MT MT MT FR IT BE CH FR MT FR FR IT IT FR MT TR CY IT IT CH FR MT BE MT RO ITCZ CH FR FR IT RO IT BE IT MT IT DE IT HU FR AT MT AL FR IT FR HU SC FR FR ITPT FR FR IT ES IT IT MT GI GI MT MT FR PT IT FR CH PT MT ES IT PT MT FR IT IT IT BE MT MT MT IT MT ITITCY MT FR ES MT BEFR IT MT IT MT ES ES PT ES ES ES ES FR GI GI IT IT MT IT IT MT ES FR FR MT ES FR ES GI GI FR MT ES ES ITPT FR PT GI HR ES MT GI GI PT PT IT MT ES PT ES MT IT PT MT MT MT IT GI MT ES FR PT FR GI ES ES PT FR PT ES PT FR ES FR PT ES FR PT ES ES ES GI ES ES ES PT FR PT ES PT GR MT PT PT PT IT ES ES PT ES PT GI PT PT ES PT ES IT FR PT PT MT MT ES PT FR FR PT PT ES FR ES PT ES MT ES CH ES ES ES ES PT ES ES GI PT PT PT PT PT PT GI ES PT PT PT PT PT ES FR ES GI GI ES ES ES ES ES ES ES PT GI PT GI GI GI ES PT PT FR PT ES PT ES PT PT ES PT GI PT ES PT GI PT PT PT ES PT ES ES ES ES ES PT ES GI PT PT ES ES ES ES ES GI PT ES ES PT ES ES PT GI DE PT ES PT FR ES PT ES PT ES PT GI ES ES PT PT ES ES ES PT ES PT PT ES PT ES PT ES PT ES PT PT FR ES PT PT PT ES ES ES ES PT ES PT PT PT ES ES PT ES PT ES ES GI PT PT ESPT GI ES ES ES PT PT PT PT FR ES ES ESES PT ES PT ES PT ES PT PT PT ES ES ES ES ES ES PT ES ES ES ES ES PT PT ES PT PT PT FR ES PT ES ES ES PT ES ES ES PT PTES ES CH ES ES ES ES ES ES PT PT ES ES ES ESPT ES ESES IT FR ES UKB EUR Group a Near East a SE Europe a S Europe a E Europe a C Europe a N Europe a W Europe a Brit. & Ire. IT IT IT IT IT −0.05 0.00 0.03 0.06 Principal Component 1 Supplementary Figure 2.11 – Comparison to 5,500 UK Biobank1 European individuals to 905 West Eurasians from the Human Origins2 dataset with principal component analysis. UK Biobank individuals were projected onto the genetic variation of the Human Origin references using PLINK4,5 (see Methods). (A) The principal component (PC) coordinates of the Human origins reference individuals for PC1 and PC2, with label shown by point colour and shape coding, and UK Biobank plotted as grey points. (B) The PC coordinates of the UK Biobank individuals, colour coded according to European meta-region and lettering to show country of birth. 126 127 128 129 130 131 132 Supplementary Data 3 – Comparison on Linked and Unlinked Methods Multiple authors have noted the increased power to detect fine-scale genetic structure in populations when utilising “linked” methods (as discussed by Lawson et al7, Leslie et al8, and others using European populations9-12). To elaborate, “linked” methods are methods which consider the linkage (in the linkage disequilibrium sense) between markers and do not assume that genetic markers are independent with respect to linkage disequilibrium (LD) – i.e., so-called “unlinked” methods. 133 134 135 136 137 138 139 140 141 142 143 144 Indeed, we observe a similar effect comparing the principal components of an unlinked PCA13 from PLINK4,5 (Figure 1) and principal components calculated from a co-ancestry matrix from pbwt paint14 (Figure 2), our equivalent of a linked analysis. Genetic regions of Europe appear to separate out more in this pbwt-based PCA than compared to the PCA calculated from unlinked allele-frequency data. As mentioned in the main text Finland appears to be more strongly differentiated in PCA, possibly as a result of the increased haplotype sharing within that population (see Figures 3 and 5). Other populations such as Malta, Turkey and Cyprus, and differentiation between central mainland Europe and Scandinavia to the north appear greater. We attribute this to PCA decomposition of the coancestry or genetic-relationship matrices detecting differing relationships along different time frames as haplotype information would be expected to “up-weight” more recent relationships. To further explore this we perform additional analysis, comparing unlinked and linked PCA methodologies on to a well described dataset of European ancestry. 145 146 147 148 149 150 This dataset, the POPRES dataset15 was used previously by Novembre and colleagues16 using unlinked methods to strikingly describe the overall European genetic landscape. We investigate whether our results from pbwt-PCA are consistent with previous characterisations we analysed the POPRES dataset15 utilised by Novembre et al16. We compared an unlinked method, performing PCA on a PLINK generated genetic-relationship-matrix (GRM), to two linked methods - performing PCA on the coancestry chunkcounts matrices estimated by pbwt and CHROMOPAINTER7. 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 We selected individuals of European ancestry from versions 1 and 2 of the Population Reference Sample (POPRES) dataset15 (n=5,917), a DNA resource from multiple studies across the world. We filtered individuals based on the observed country of birth data of their grandparents, selecting individuals whose grandparents were from the same European country. Due to the dataset containing >1000 Swiss samples, we randomly down sampled this label, choosing 100 individuals who were SwissFrench. 168 169 170 We divided the unpruned dataset by chromosome and converted these subsets to VCF file format using PLINK (ver. 1.9). These genotypes were then phased using SHAPEIT (ver. 4.2.1)18 with effective population size set to 11,418 and using genome map build GRCh37. The phased genotype VCF files Using PLINK (ver. 1.9)4,5, we excluded any strand-ambiguous SNPs (A/T, G/C) and kept autosomal SNPs. Further, we removed SNPs with minor allele frequency (MAF) <2%, missingness >5%, and a HardyWeinberg Equilibrium (HWE) p-value <1e-6. Additionally, we removed individuals with SNP missingness >5%. Pairs of related individuals (up to 3rd degree relations) were identified using KING (ver. 2.2)17 and one random individual from the pair was removed. To calculate an unlinked PCA, we utilised a set of SNPs pruned with respect to linkage disequilibrium using the PLINK command --indeppairwise 1000 50 0.2, performing PCA using the --pca PLINK command on these samples and pruned markers (nsamples = 954, nmarkers= 103,925). 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 were converted to the hap/sample format using bcftools (ver. 1.4.1)19 which were subsequently painted using (a), CHROMOPAINTER, which is a part of the fs (ver. 4.1.1) toolset7, and (b) pbwt paint from the pbwt package14. To generate the co-ancestry matrixes and perform PCA we carried out the following pipelines. For CHROMOPAINTER, we converted the files hap/sample file set to the CHROMOPAINTER input format using scripts provided by the authors of the fs utility. Next, the inferred mutation rate and effective population size parameters of CHROMOPAINTER Hidden Markov Model were estimated on the autosomal chromosomes (fs “stage 1”). Haplotype sharing in the full dataset was then estimated with these parameters (fs “stage 2”) generating co-ancestry matrices of “chunk counts” and “chunk lengths”, using default parameters. We calculated principal components on the co-ancestry matrix generated by CHROMOPAINTER using R scripts provided by the authors of fs. Alternatively, for pbwt, we used the phased VCF files to generate a co-ancestry matrix using the pbwt -paint command. We calculated principal components on the chunkcounts output from the analysis using the same scripts to perform the CHROMOPAINTER PCA, setting the diagonals of the pbwt paint co-ancestry matrix to 0. Additionally, we split the countries in the dataset into ten regional labels depending on their geographic position within Europe (see below). 190 191 192 193 194 195 Supplementary Figure 3.1 - Principal component analysis13 of 954 Europeans selected from the POPRES15 dataset, calculated from genetic relationship matrix output from PLINK4,5 (A), pbwt14 coancestry chunkcounts matrix (B), and CHROMOPAINTER7 co-ancestry chunkcounts matrix. Individual genotypes are shown by letters which encode the alpha-2 ISO 3166 international standard codes, and 196 197 are colour coded according to geographic region. The median position for each country label is shown as a label encapsulated by a circle. 198 199 200 201 202 203 204 205 206 207 In our analysis of comparing linked versus unlinked analyses, we show that European individuals do indeed exhibit great overall separation in principal component space (panel A versus B or C in Figure S3.1). All three PCAs show the defined north to south, and east to west gradients in European genetics16,20,21, and agree with our own findings using an expanded sample of European genetics (Figure 1). Our results from pbwt and CHROMOPAINTER show greater separation in south-east Europe, particularly individuals with Kosovan or Yugoslavian grandparental birthplaces. As these analyses are haplotype based this may be due to the relative enrichment of haplotype sharing in this region compared to other populations sampled in the POPRES dataset, consistent with our findings of individuals born from that European region (Figure 4). 208 209 210 211 Unfortunately, we were unable to analyse the genotypes of the exact individuals reported by Novembre et al16, leading to a reduction of sample sizes notably from the UK. This may explain that whilst we capture the general structure reported by Novembre et al, there are some topographic differences compared to our Figure S3.1A. 212 213 214 215 216 217 218 219 220 221 222 223 224 Our comparative results show that each method captures the broad genetic structure in the same dataset, though it appears exactly relationships are weighted differently, hence the differing topologies in PC space. Ultimately, all are informative, though linked methods were able to separate out sub-regions within the same dataset better than unlinked methods in both our analysis of the UK Biobank1 and POPRES15 datasets. These results are in agreement with the initial report of the CHROMOPAINTER method7, and findings from groups applying these methods to specific populations8,22. This performance is thought as a reflection of haplotypes to better capture information about more recent relatedness between individuals. Therefore, in this analysis of the POPRES dataset, and the comparison between unlinked and linked methodologies, we are confident both are sample capture the major genetic structure within Europe, and our findings from eigen decomposition of the pbwt paint co-ancestry matrix is consistent with previous efforts in Europe16 and findings from using the CHROMOPAINTER method7,8, which unfortunately doesn’t scale to this sample size well. 225 226 227 228 229 230 231 232 233 234 235 236 Supplementary Data 4 - Genetic Structure of European Leiden Clusters Investigating the population structure captured in our sampled of UK Biobank1 participants with a European place of birth, we performed Leiden3 clustering of a network of nodes made up of individuals, and edges of the per-individual pair pbwt paint14 co-ancestry estimate (see Figure 2, Methods). To further characterise these clusters, we perform several additional analyses, shown below. The ADMIXTURE6 ancestry component estimates are the same calculated in Supplementary Data 2.3, but individuals are instead grouped by Leiden cluster membership (Supplementary Figure 4.1). We relabelled the principal component coordinates for the 5,500 UK Biobank individuals projected onto the genetic variation of west Eurasians from the Human Origins dataset2, labelling the UK Biobank individuals by Leiden cluster (Supplementary Figure 4.2). 237 238 239 240 241 Supplementary Figure 4.1 - ADMIXTURE6 ancestry component estimates of 41 Leiden3 clusters of 5,500 UK Biobank1 participants over k values of two through to seven. A B HO Population Jew_Turkish Adygei Jew_Yemenite Balkar Albanian Chechen Bulgarian Georgian Croatian Iranian Greek Iranian_Bandari Romanian Kumyk Basque Lezgin Italian_North Nogai Italian_South Principal Component 2 North_Ossetian 0.00 −0.05 UKBB EUR Cluster 0.05 Belgium & France Channel Islands Netherlands CE Europe 1 Switzerland CE Europe 2 Belgium NE Balkans Denmark NW Balkans Norway Poland Sweden 1 Russia N.Ireland & Scotland 1 Latvia & Lithuania N.Ireland & Scotland 2 Finland N.Ireland & Scotland 3 Mixed Scand. N.Ireland & Scotland 4 Italy Orkney Greece Ireland 1 Albania & Greece Ireland & N.Ireland Turkey Ireland 2 Cyprus Isle of Man Portugal England 1 Spain France France & Switz. England 2 Mixed European Wales 1 Malta Maltese Armenian Sardinian Assyrian Sicilian Cypriot Spanish Druze Spanish_North Jordanian English Lebanese French Lebanese_Christian Orcadian Lebanese_Muslim Scottish Palestinian Belarusian Syrian Czech Turkish Estonian BedouinA Hungarian BedouinB Ukrainian Saudi Chuvash Yemeni Finnish Libyan Icelandic Moroccan Lithuanian Jew_Ashkenazi Mordovian Jew_Georgian Norwegian Jew_Iranian Russian Jew_iraqi Saami_WGA Principal Component 2 0.05 Abkhasian 0.00 −0.05 Wales 2 242 243 244 245 246 247 248 0.00 0.03 Principal Component 1 0.06 0.00 0.03 0.06 Principal Component 1 Supplementary Figure 4.2 – Principal component analysis of 5,500 UK Biobank participants with European birth places, projecting onto West Eurasian references from the Human Origins dataset. (A) Each coloured point represents the genotype of one Human Origins reference individual, with UK Biobank individuals represented by grey points. (B). Each colour point represents the genotype of one UK Biobank individual, with Human Origin references shown as grey points. 249 250 251 252 253 Supplementary Figure 4.3 - Heatmap of average per-cluster pair of haplotype copying profiles between every Leiden cluster as a “target”, using every other Leiden cluster as a potential “source” cluster. Average contribution was estimated using an adaptation of a nnls-based method8, using pbwt paint co-ancestry estimates as copy vectors. 254 255 256 257 258 259 260 Supplementary Figure 4.4 - Plot of t-SNE23 decomposition of the top 10 principal components calculated from the pbwt paint chunkcounts co-ancestry matrix. t-SNE analysis summarised haplotypic relationships between 5,500 UK Biobank individuals with a European birthplace into two dimensions, and individuals are colour and shape coded according to Leiden cluster membership. 261 262 263 264 Supplementary Data 5 – Detailed European Genetic Landscape 265 266 267 268 269 North-Western Europe 270 271 272 273 274 275 276 277 278 279 280 281 282 283 Individuals from Scotland and Northern Ireland are still grouped together at this level of clustering, agreeing with previous observations of gene flow between Scotland and the north of Ireland24. This is further supported in our nnls analysis (SI Appendix Fig. S4.3) which models the N.Ireland & Scotland clusters with a contribution from the N.Ireland & Ireland cluster. We continue observations that individuals with an Orcadian birthplace exhibit high haplotype sharing8,24 (Figure 3), and have a historically low effective population size (Figure 4) – which also agrees with estimates and time of lowest population size from an analysis of Orcadian individuals with extended recent ancestry from those isles11. Interestingly, we identify a discrete cluster of individuals (n=41) predominantly from the Isle of Man (Figure 2). These individuals present a similar profile of modest isolation as Orcadian individuals, with a reduced population size from 30 to 10 generations (Figure 4), modest increase in length of IBD segments (Figure 3), and elevated levels of ROH (Figure 5) - though the latter is not as elevated as Orkney. Previous analysis of genotypes from the Isle of Man agrees with genetic structure24, though we estimate elevated ROH in this sample compared to that separate of Manx genotypes. 284 285 286 287 288 289 290 291 292 293 294 295 296 297 Clusters with English, Welsh, French and Channel Island membership are placed in the eastern British Isles sub-branch. Wales, especially Wales 2, shows elevated within-cluster sharing of IBD-segment count compared to other clusters on this sub-branch suggestive of modest isolation. This agrees with previous analyses of samples with extended Welsh ancestry from the Peoples of the British Isles Study8. As discussed in the main manuscript, individuals placed within the Channel Islands cluster exhibit elevated sharing more of and longer IBD-segments than other Eng. & Wales clusters. These IBD results reflect the sustained lower effective population size than clusters on the same sub-branch, though the Channel Islands exhibit evidence of a more modest population contraction than Orkney. The 117 individuals with a Channel Island birthplace are not just placed in this cluster of 83 individuals, but also within English (n=11), Welsh (n=4), French (n=5), or Irish/Scottish (n=19) clusters. Compared to Orkney, whose cluster shows equivalent IBD sharing to Channel Islands, individuals with a Channel Islands birthplace are more distributed in other clusters – suggesting that whilst there may be an island population of interest to haplotype mapping, the islands are less isolated which reflects their geographic position in-between France and England. 298 299 300 301 302 303 304 Within the continental WE branch of clusters, Belgian, French, and German individuals are clustered together (Belgium & France) and are grouped with clusters of separately Swiss (Switzerland) and Dutch (Netherlands) predominant membership. A smaller cluster, Belgium (n=53), is detected. This cluster does not appear to present high haplotype sharing or evidence of population contraction, and in nnls analysis shares a haplotype profile predominantly donated from Netherlands. Further profiling of this cluster is difficult, though there appears to be a genetic continuity between Belgium and Netherlands, which is consistent with their geographic proximity and lack of large-scale geographic barriers. For the purposes of brevity, in the main manuscript we highlight the primary results from each geographic region sampled from the UK Biobank (UKBB). In this supplementary discussion we describe in greater detail the results of each geographic region. This group of clusters group individuals whose birthplace includes Scandinavia, the Low Countries, France, Switzerland, and the British Isles and Ireland. Three main branches of clusters are detected, one grouping Scandinavian individuals and other continental Europeans, one of individuals from the eastern British Isles, and one of Scotland and Ireland. 305 306 307 Genetically, Swiss individuals project between France, Germany, and Italy in PCA, reflecting their geographic location. We estimate Switzerland to have a recent effective population size to be equivalent to Belgium or the Netherlands, but no evidence of substantial isolation. 308 309 310 311 312 313 314 315 316 317 318 319 320 321 Within the Scandinavian sub-branch, we detect three clusters which group individuals from Denmark, Norway, and Sweden. These individuals project in-between northern continental Europe and Finnish individuals, showing within-cluster IBD profiles equivalent to north-western Britain or Ireland. Our small sample of Icelandic individuals (n=19) are grouped with Norwegians in Norway, and as described in the main manuscript show evidence of isolation consistent with the population history of that island25. Five out of six sampled Faroese individuals are grouped into the Sweden cluster. Similarly to the Icelandic individuals in Norway, these Faroe Islanders show substantially higher IBD-segment sharing than Swedish individuals placed in the same cluster. The average pair of Faroe-Faroe individuals share 90 cM of IBD segments > 1cM, sharing 28 segments each of which are on average 3.2 cM long. The average Swedish-Swedish individual pair share a total of 21 cM over 14 segments, each of which are 1.5 cM long. The low sample size of the Faroe Islands means it is unclear just how representative these results are, though a recent study Faroese genotypes indicates the population does indeed exhibit unsurprising hallmarks of isolation26 – though we are unable to extend and refine this picture as has been possible in Malta. 322 323 324 325 326 327 Central-Eastern Europe 328 329 330 331 332 333 334 335 336 337 338 339 Within the CE European group of clusters there are two sub-groups, each with two clusters. The first, CE Europe 1 and 2 cluster individuals predominantly from Germany, Austria, Poland, the Czech Republic, and Hungary. CE Europe 1 appears to group individuals of a more easterly birthplace, and in nnls analysis its haplotype profile includes more of a contribution from the Poland cluster, whereas CE Europe 2 contains more individuals from Austria and Germany and a slightly higher contribution from Switzerland in nnls analysis. In analysis of within-cluster haplotype sharing CE Europe 1 shows modest elevation of IBD-segment sharing (Figure 3) consistent with a slightly lower historical population size, which our Ne estimates from 30 generations ago agree with. Within the north of the Balkans, we observe a general east-west divide in our clustering and PCA results (Figure 2). With IBD-segment sharing as well as ROH information, it appears NW Balkans shows modestly elevated levels of haplotype sharing consistent with a lower effective population size, this is supported by a consistently lower effective population size in NW Balkans compared to NE Balkans. 340 341 342 343 344 345 346 347 348 349 In our main manuscript we discuss the results of the Finnish, E Europe, and northern Balkans clusters in greater detail, noting that our results in Finland agree with existing literature on this well studied population. In NE Europe the UK Biobank sample largely agrees with the Human Origins references, though our Russian samples more project towards Baltic/Ukrainian genetic space as opposed to the Human Origins Russians, who project more towards Chuvash or Saami – though there is overlap in the UKBB and Human Origins Russians. We suspect that the UKBB sample of Russian ancestry is therefore biased towards European ancestry rather than the central Eurasian or Caucus/related ancestry that is represented by the Chuvash or Saami in the Human Origins West Eurasian sample. This heterogeneity in sampled Russian ancestry can be explained by the variation in sampled ancestries within the Russian Federation, with ancestry in samples closer to Europe exhibiting more European-modal ancestry27. The second branch of Leiden clusters groups individuals with a birthplace from the centre and east of Europe, from Germany in the west to Russia in the east. This branch contains several sub-branches which group individuals from geographically adjacent locations in Europe; NE Europe (with Baltic, Polish, and Russian membership), CE Europe (with membership from the north of the Balkans and the centre and east of Europe), and Finland as an outgroup. 350 351 352 353 354 355 356 357 358 359 Southern Europe 360 361 362 363 364 365 Geographically neighbouring Malta, individuals from Italy are grouped into one Leiden cluster, though PCA divides individuals into predominantly two clusters, which appear to reflect the general northsouth divide previously detected in Italian genetics10. Italy haplotype contributions come from the mixed France & Switz. cluster, as well as Turkey and Greece - which agrees with its position between western Europe and the south-east of Europe and the Near East10. This is associated with a consistently high historical effective population size compared to other European regions (Figure 4). 366 367 368 369 370 371 372 373 374 375 376 Within Iberia, the predominant divide is between Spanish and Portuguese samples, in agreement with recent analyses12. Spain and Portugal show evidence of reciprocal admixture in nnls analysis. Portugal appears to have modestly elevated haplotype sharing compared to Spain, consistent with its geographic position at the end of the Iberian Peninsula. This haplotype sharing is observed in both IBD and ROH segment sharing, where elevation of the latter appears to be driven by predominantly longer ROH than greater numbers of ROH (Figure 5). As discussed in the main manuscript we detect small group of Spanish individuals within Spain which is genetically distinct from other Spanish clusters and exhibits elevated IBD-segment sharing. With evidence from co-analysis of references from the Human Origins dataset, we suspect that this cluster represents ancestry from the north-east of Spain and the Basque population. Further sample annotation is not available for these samples, so we are unable to definitively prove this. 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 Within SE Europe, we cluster individuals from Albania, Greece, North Macedonia, Turkey. Italian clusters from an outgroup to this sub-branch along with Cyprus. We observe some signal of haplotype sharing between Italy and Greece, as well as sharing between Italy and Turkey. As discussed in the main manuscript, we detect a small cluster of Greek, Albanian, and North Macedonian membership, Albania & Greece. It is unclear if this small cluster is largely represented of this geographic region, though there is no obvious reason that the UKBB would have sampled a specific community from this area of Europe with history of isolation. Turkey and Cyprus exhibit evidence of isolation that is consistent to some degree of consanguinity in inbreeding coefficient analysis. In comparison to Turkish references from the Human Origins dataset Turkey co-segregates with other Turkish references – forming a tight cluster in PCA in the centre of the distribution of Turkish genotypes. We see a similar relationship between Cyprus and Cypriot Human Origins references. This appears to be a relatively isolated Turkish community, captured within the UK Biobank. Lastly, as further expanded upon in the main manuscript as well as in SI Appendix Data 7 we detect a connected community of individuals with a mixture of birthplaces across Europe. In analysis of ancestry, population structure, and characterisation of demographic history we propose this Mixed European cluster represents a community of Ashkenazi Jewish participants in the UK Biobank. For more information and analysis, see SI Appendix Data 7. 394 The third and final group of Leiden clusters contains individuals with a birth-place from regions within or north of the Mediterranean. The major sub-groups include the islands of Malta and Cyprus (grouped for convenience, but genetically distinct), Italy, the Iberian Peninsula, SE Europe – which includes Greece and Turkey, and a cluster of individuals with a heterogenous mixture of birthplaces (Mixed European). We discuss the full results of the large Malta cluster in our main manuscript – but briefly our results show that Malta is genetically distinct and presents evidence of genetic isolation with high haplotype sharing and genetic distances to non-Maltese clusters. In PCA with UKBB individuals, or projected onto Human Origins genetic variation, Maltese individuals project close with Cyprus 2 and Italy 2. 395 396 397 398 399 400 401 402 Supplementary Data 6 - Malta 403 404 405 406 407 408 409 Investigating the population structure in this sample of European genotypes we observe population structure within our Maltese sample (see Figure 1, and Figure S2.2), where three broad clusters emerge in principal component space. Further investigating this, we find that principal component (PC) seven of the PLINK4,5 PCA, of which PC one and two are shown in Figure 1, also separates these individuals, see Figure S2.2. When we code the sampled Maltese individuals instead by Leiden cluster membership of the Malta cluster or not, the Leiden clustering clearly separates out these two groups, see below: In analysis of European ancestry sampled within the UK Biobank, we report to our knowledge the largest dataset of Maltese genotypes publicly available. As part of the sampling scheme, we randomly down-sampled individuals from countries/regions of birth which had more than 200 individuals to a randomly selected 200 individuals. This was to both reduce sample size and to control for uneven sampling rates across European regions, for example there were over 1000 individuals with a German place-of-birth compared to 180 Swiss. We randomly sampled 200 Maltese individuals out of a possible 362. 410 411 412 413 414 415 416 417 418 Where the Malta cluster is comprised of the Maltese individuals at the terminus of the cline away from other Europeans, as well as the smaller cluster of Maltese individuals which project in-between that group of Maltese and other Europeans. These two groups of Maltese individuals within the Malta Leiden cluster also show two patterns of IBD sharing within the cluster (Figure 3), with the intermediate Maltese individuals in PC seven also intermediate in within-cluster IBD-sharing estimates. 419 420 421 422 423 424 425 426 427 428 429 430 Supplementary Data 7 - Mixed Europeans 431 432 433 434 435 436 437 438 In addition to this genetic structure, individuals within this cluster exhibit elevated within-cluster IBDsegment sharing (Figure 3), a historically low effective population size with recent population expansion (Figure 4) and elevation of ROH sharing. This is consistent with a small isolate population that has experienced a population size bottleneck around 30 generations ago. These demographic results, with analysis of population structure in the context of Europe and the Middle East is suggestive that this is a community of Ashkenazi Jews. This would be consistent with previous analyses of the UK Biobank which detected such a community28, as well as the population history of European Ashkenazi Jews as previous modelled in more detail with individuals recruited from that community29. 439 440 441 442 443 444 445 446 447 448 To further confirm this, we performed an additional analysis with references from the Human Origins dataset. Using the same methodology to project the UK Biobank individuals onto western Eurasian genetic variation, we selected Human Origin West Eurasian individuals, as well as Yoruban references form that dataset, and merged this genotype data with the UK Biobank dataset, leaving 6,476 individual genotypes of 61,217 common SNPs after the same quality control thresholds as in the projected PC analysis. Analysing a subset of these markers which were in approximate linkage (filtering SNPs with the PLINK4,5 option --indep-pairwise 1000 50 0.2) we tested allelic sharing using f-statistics implemented in the R package ADMIXTOOLS2 (manuscript in prep). For each Leiden cluster we tested allelic sharing between the closest population reference in the Human Origins dataset and Human Origin Ashkenazi Jew references in the form: 449 f4(Yoruban, Y; Ashkenazi Jew, European Reference) 450 451 452 453 454 455 456 457 The f4 estimates for each Leiden cluster are shown below. Along the y-axis is each Leiden cluster tested in “quotation marks” and the paired population reference chosen from the Human Origins dataset. “Mixed A” and “Mixed B” refer to two sub-groups of the Mixed European cluster which are expanded upon below. With the exception of Malta, Turkey, and Cyprus (i.e., the UK Biobank clusters closest to Near Eastern populations in PCA) all European clusters are significantly differentiated from the Human Origin Ashkenazi Jew references. Both Mixed European groups, Mixed A and Mixed B shared significantly more alleles with Ashkenazi Jew than Turkish references as well as sharing more allelic drift with Ashkenazi Jew references than Turkish Jews (see below). In our analysis of European genotypes sampled in the UK Biobank1 we have identified a genetically connected community of individuals with birthplaces across Europe, though with a slight bias towards central/eastern European countries of birth. Despite being geographically dispersed, as represented by birthplace information, these individuals project along a single principal component (principal component nine in Figure S2.9) as well as are grouped as a single cluster by application of the Leiden algorithm (Figure 2). Compared to the rest of the European sample, a core group of these Mixed European individuals project towards individuals with a Near East birthplace (Figure 2), and project towards Ashkenazi or Turkish Jewish references from the Human Origins2 dataset. Furthermore, in our nnls based analysis (see Methods, Figure S4.3), the haplotype sharing of these individuals can be modelled as a mixture of Near Eastern (Turkey: 0.12), southern European, (Italy: 0.10, Spain: 0.10), and central or Eastern European sources (Poland: 0.14, CE European 1: 0.13, Russia: 0.07). 458 459 460 461 462 463 464 In PCA of the pbwt paint co-ancestry matrix, we observed this Mixed European cluster separate along PC three, see below. Similarly to individuals placed in the Malta cluster, there appears to be two groups of individuals along this PC, which we divided into two groups, Mixed A (Mixed European individuals to the left of the red line below) and Mixed B (Mixed European individuals to the right of the red line). 465 466 467 468 469 470 471 We next tested if these two groups within Mixed European presented different haplotype sharing profiles with the rest of our European UK Biobank dataset. We performed a modification of our nnls analysis (see Methods), this time modelling each Mixed A and Mixed B cluster as a mixture of any other Leiden cluster (with the exception of the other Mixed European sub-group). We present the results below, finding that the Mixed B cluster presents a more heterogenous sharing profile consistent with admixture (which would be consistent with ADMIXTURE estimates (see SI Appendix Data 4)). 472 473 474 475 476 477 478 Further testing evidence of Mixed A grouping individuals more genetic isolated we recorded levels Runs of Homozygosity (ROH) between the groups, finding that on average individuals placed in Mixed A carry more, and longer, ROH than individuals. These differences are significant comparing either the average length of ROH an individual carries between the groups (Mann-Whitney U p-value: <0.0001) or the number of ROH (Mann-Whitney U p-value: <0.0001). 479 480 481 482 483 484 485 486 We further explored the degree of haplotype sharing between and within these groups by recording the total length and number of IBD-segments > 3 cM and < 30 cM29 shared within and between Mixed A and Mixed B and a comparative cluster, England 2 using the same methodology as shown in Figure 3 and described in Methods. We find that IBD segments are slightly longer within and between Mixed A and Mixed B (below), and that the average Mixed A pair of individuals (i.e., individuals both placed in the Mixed A cluster) share a number higher number of IBD segments than the average pair of Mixed B individuals. Interestingly the average Mixed A - Mixed B pair of individuals share more IBD segments than the average Mixed B pair, suggesting the Mixed B are heterogenous within that group. 487 488 489 490 491 492 Lastly, we estimated the effective population size of each of the clusters Mixed A and Mixed B using IBD-segments and IBDNe30, characterising the haplotype sharing and population structure in terms of historical population size. We find that whilst both present evidence of historical population contraction, we record a much minimal historical population size for Mixed A and Mixed B, consistent with greater isolation. 493 494 495 Together these results suggest these two groups represent structure within the sampled community from the UK Biobank, one more isolated than the other. The latter which has experience greater 496 497 498 admixture with European regions, and whose members are less genetically connected (as measure by IBD segment sharing) than the former. 499 Supplementary Data 8 - IBD Ne Curves 500 501 502 503 504 Supplementary Figure 8.1 – Historical effective population size (Ne) of Leiden clusters grouping individuals with a north-western European birthplace. Each panel shows the estimate for one cluster, with grey curves showing the estimates for all other north-west European clusters. Shading indicates the 95% confidence intervals for the estimates. 505 506 507 508 509 510 Supplementary Figure 8.2 – Historical effective population size (Ne) of Leiden clusters grouping individuals with a northern Britain or Irish birthplace. Each panel shows the estimate for one cluster, with grey curves showing the estimates for all other northern British or Irish clusters. Shading indicates the 95% confidence intervals for the estimates. 511 512 513 514 515 516 Supplementary Figure 8.3 – Historical effective population size (Ne) of Leiden clusters grouping individuals with an English or Welsh birthplace. Each panel shows the estimate for one cluster, with grey curves showing the estimates for all other southern British clusters. Shading indicates the 95% confidence intervals for the estimates. 517 518 519 520 521 522 Supplementary Figure 8.4 – Historical effective population size (Ne) of Leiden clusters grouping individuals with a central, northern, or eastern European birthplace. Each panel shows the estimate for one cluster, with grey curves showing the estimates for all other central, northern, or eastern European clusters. Shading indicates the 95% confidence intervals for the estimates. 523 524 525 526 527 528 529 Supplementary Figure 8.5 – Historical effective population size (Ne) of Leiden clusters grouping individuals with a southern European birthplace. Each panel shows the estimate for one cluster, with grey curves showing the estimates for all other south European clusters. Shading indicates the 95% confidence intervals for the estimates. 530 531 532 533 534 535 536 537 Supplementary Data 9 - Spain In analysis of within-cluster Identity by Descent (IBD) segment analysis we detect a number of individuals placed in the Spain cluster who present high levels of within-cluster IBD sharing compared to other individuals with a birthplace in Spain (Figure 3). In principal component analysis (PCA) of the pbwt paint co-ancestry matrix we observe a group of 12 Spain individuals who project away from the rest of the cluster on principal component (PC) one, as well as PC six. This is shown below, with the fifth and sixth PCs plotted and individuals labelled by Leiden cluster, and a horizontal black line indicating the cut-off threshold for identifying “Spain Outliers” along PC six. 538 539 540 541 542 When we plot PCs one and two of the PCA decomposition of the pbwt paint co-ancestry matrix, highlighting these 12 individuals, we find that these individuals are the Spain outliers along PC one as well, see below. The Spain outliers are highlighted in black. 543 544 545 546 547 548 Due to the high levels of IBD segment sharing amongst these 12 individuals (Figure 3), we recorded the levels of IBD segment sharing between these Spain outlier individuals and the rest of the Spain cluster, as well as the levels of sharing between the two Spanish groups. We recorded total length of IBD (below left), and the number of IBD segments (below right), finding that the Spain outliers indeed share on average more IBD segments within, as well as between the outliers and the general Spain cluster, than the average within Spain pair of individuals. 549 550 551 This increase in haplotype sharing is also mirrored in levels of Runs of Homozygosity (ROH), see below, where the average total length of ROH > 1.5 Mb in length is elevated in the Spain outliers. 552 553 554 555 Furthermore, when projected onto the genetic variation of west Eurasian population references from the Human Origins dataset2 the Spain outlier individuals project towards the variation of Basque and northern Spanish references, see below. 556 557 558 559 560 561 562 563 Given the genetic distinctiveness of these outliers, their elevated haplotype sharing consistent with a degree of isolation relative to the other UK Biobank individuals with a Spanish place of birth, and their projection towards Basque genetic space in PCA, we infer these individuals to be of northern Spanish or Basque ancestry. Previous genetic analyse of the Iberian Peninsula12 has shown this area of Spain to be genetically distinct and would be consistent with the profile presented by these twelve individuals sampled from the UK Biobank. 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 References 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203-209 (2018). Lazaridis, I. et al. Genomic insights into the origin of farming in the ancient Near East. Nature 536, 419-24 (2016). Traag, V.A., Waltman, L. & van Eck, N.J. From Louvain to Leiden: guaranteeing wellconnected communities. Sci Rep 9, 5233 (2019). Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81, 559-75 (2007). Chang, C.C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015). Alexander, D.H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res 19, 1655-64 (2009). Lawson, D.J., Hellenthal, G., Myers, S. & Falush, D. Inference of population structure using dense haplotype data. PLoS Genet 8, e1002453 (2012). Leslie, S. et al. The fine-scale genetic structure of the British population. Nature 519, 309-314 (2015). Karakachoff, M. et al. Fine-scale human genetic structure in Western France. Eur J Hum Genet 23, 831-6 (2015). Raveane, A. et al. Population structure of modern-day Italians reveals patterns of ancient and archaic ancestries in Southern Europe. Sci Adv 5, eaaw3492 (2019). Pankratov, V. et al. Differences in local population history at the finest level: the case of the Estonian population. Eur J Hum Genet 28, 1580-1591 (2020). Bycroft, C. et al. Patterns of genetic differentiation and the footprints of historical migrations in the Iberian Peninsula. Nat Commun 10, 551 (2019). Patterson, N., Price, A.L. & Reich, D. Population structure and eigenanalysis. PLoS Genet 2, e190 (2006). Durbin, R. Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT). Bioinformatics 30, 1266-72 (2014). Nelson, M.R. et al. The Population Reference Sample, POPRES: a resource for population, disease, and pharmacological genetics research. Am J Hum Genet 83, 347-58 (2008). Novembre, J. et al. Genes mirror geography within Europe. Nature 456, 98-101 (2008). Manichaikul, A. et al. Robust relationship inference in genome-wide association studies. Bioinformatics 26, 2867-73 (2010). Delaneau, O., Zagury, J.F., Robinson, M.R., Marchini, J.L. & Dermitzakis, E.T. Accurate, scalable and integrative haplotype estimation. Nat Commun 10, 5436 (2019). Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10(2021). Lazaridis, I. et al. Ancient human genomes suggest three ancestral populations for presentday Europeans. Nature 513, 409-13 (2014). Haak, W. et al. Massive migration from the steppe was a source for Indo-European languages in Europe. Nature 522, 207-11 (2015). Byrne, R.P. et al. Insular Celtic population structure and genomic footprints of migration. PLoS Genet 14, e1007152 (2018). van der Maaten, L.J.P. & Hinton, G.E. Visualizing high-dimensional data using t-SNE. J. Mach. Learn. Res 9, 2579–605 (2008). Gilbert, E. et al. The genetic landscape of Scotland and the Isles. Proc Natl Acad Sci U S A 116, 19064-19070 (2019). Ebenesersdottir, S.S. et al. Ancient genomes from Iceland reveal the making of a human population. Science 360, 1028-1032 (2018). Leblond, C.S. et al. Both rare and common genetic variants contribute to autism in the Faroe Islands. NPJ Genom Med 4, 1 (2019). 615 616 617 618 619 620 621 622 623 27. 28. 29. 30. Khrunin, A.V. et al. A genome-wide analysis of populations from European Russia reveals a new pole of genetic diversity in northern Europe. PLoS One 8, e58552 (2013). Naseri, A. et al. Personalized genealogical history of UK individuals inferred from biobankscale IBD segments. BMC Biol 19, 32 (2021). Xue, J., Lencz, T., Darvasi, A., Pe'er, I. & Carmi, S. The time and place of European admixture in Ashkenazi Jewish history. PLoS Genet 13, e1006644 (2017). Browning, S.R. & Browning, B.L. Accurate Non-parametric Estimation of Recent Effective Population Size from Segments of Identity by Descent. Am J Hum Genet 97, 404-18 (2015).

pnas.2119281119.sapp

Related documents

Products

Support

pnas.2119281119.sapp

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib