The more the merrier? How a few SNPs predict pigmentation phenotypes in the Northern German population

Amke Caliebe1,*, Melanie Harder2,*, Rebecca Schuett2,4, Michael Krawczak1, Almut Nebel3, Nicole von Wurmb-Schwark2

1 Institute of Medical Informatics and Statistics, 2 Institute of Legal Medicine, 3 Institute of Clinical Molecular Biology, all at Christian-Albrechts University Kiel, Germany; 4 current address: State Criminal Investigation Department of Lower Saxony

Correspondence to: PD Dr. rer. nat. Nicole von Wurmb-Schwark, Institute of Legal Medicine, Christian-Albrechts University Kiel, University Hospital Schleswig-Holstein, Arnold-Heller Str. 12, 24105 Kiel, tel.: +49 431 597 3633, fax: +49 431 597 3612, mail: nvonwurmb@rechtsmedizin.uni-kiel.de

* These authors contributed equally to this work.

Material and Methods

Study population

A total of 400 unrelated individuals (197 male, 203 female) from Northern Germany were recruited for our study between 2010 and 2011. The median age was 27 years (inter-quartile range: 24-33 years). All individuals were born in Germany and had German parents and grandparents (self-report). All 400 participants were recruited and investigated in the same way. Whilst the first 300 participants were included in the ‘modelling sample’ of stage 1 (used for SNP selection), the remaining 100 individuals constituted the ‘estimation sample’ of stage 2 (prediction evaluation). All participants gave written informed consent prior to the study. Genotype and phenotype data were de-identified for analysis purposes according to the Declaration of Helsinki. The project was approved by the Ethics Committee of the Medical Faculty of Christian-Albrechts University Kiel.

Phenotyping

Pigmentation phenotypes were documented by photographs taken under daylight conditions and from a distance of 30 cm, using a Canon EOS 400D (18-55 mm focal length). For each participant, one photograph was taken of each eye, the scalp hair and the inner arm.
Photographs were normalised using the standard functions in Photoshop 4.0, and consensus phenotype calls were made by two raters through discussion. In the rare cases where no agreement was reached, a third party was involved. Eye colour was divided into three categories, namely blue, green and brown (Fig. 1a). Individual skin type was classified by applying the Fitzpatrick scheme (1988) to the inner arm.

Hair colour type was defined in a multi-tiered fashion. To this end, a collection of coloured hair strands obtained from a hairdresser was categorised into nine evenly graded types of shading, ranging from light blond (type I) to black (type IX) (Fig. 1b). Strands of red hair or with red tint were omitted from this classification because it was intended to address the light-dark component only. Then, hair colour was divided into two sub-phenotypes, namely the red tint component (yes/no) and the light-dark component (I-IX). For each individual, the presence of red tint was ascertained (by questioning) in head hair, facial hair (beard), axillary hair or pubic hair. If an individual had red head hair, this was noted separately to enable a separate analysis for this special phenotype.

The light-dark component was determined by reference to the hair strand collection mentioned above. Here, individuals with recognizable red tint were classified according to their basic hair colour type. For example, strawberry blonds were deemed class I (blond) whereas people with auburn hair were classified as one of IV, V or VI (brown). While this was possible for 22 red-haired individuals, four had no definable basic hair colour. These four were excluded from the analysis of the light-dark component. No participants with exclusively white hair were included in the study. When a participant had dyed hair, the original hair colour was determined from the hairline.

Genotyping

Buccal swabs (COPAN) were taken from all 400 participants and DNA was extracted using Chelex 100 (Walsh et al. 1991).
In a comprehensive PubMed search, 12 SNPs were identified as promising candidates for further analysis using the following criteria (Table 1, Supplementary Table S1): large odds ratio, validation in several independent studies, large sample sizes and adequate population backgrounds, low to no linkage disequilibrium with other candidate markers and suitability for genotyping in a single assay. In addition to the 12 SNPs, participants were genotyped for rs1426654 (SLC24A5), rs1129038 (HERC2) and rs1667394 (OCA2). SNP rs1426654 is a European ancestry marker (Giardina et al. 2008) used to control population background. SNPs rs1129038 and rs1667394 served as genotyping quality markers because they are in perfect LD with candidate SNPs rs12913832 and rs916977 respectively (Mengel-From et al. 2010; Sturm et al. 2008). Primers were designed and checked for possible dimer and hairpin structures using the DNAstar Lasergene v8.1.2 software and BLAST. PCR fragments had to be shorter than 200 bp in order to meet the standards of reliable forensic or ancient DNA analysis. For DNA amplification, a Multiplex PCR Master Mix (Qiagen) was used in a total reaction volume of 12.5 µl, with 0.2-0.5 ng template DNA. PCR was performed with a thermal cycler 2700 (Life Technologies) under the following conditions: (1) 95° C 15 min, (2) 35 cycles of 94° C 30 sec, 64° C (SNPs nos. 2-5, 7-8, 10 in Table 1) or 58° C (SNPs nos. 1, 6, 9, 11-12) 90 sec, 72° C 1 min and (3) 60° C 30 min. PCR products were purified using ExoSAP-IT (Affymetrix) according to the manufacturer’s protocol. Single-base extension (SBE) was carried out in a total reaction volume of 7 µl, including 0.5 µl of cleaned PCR products, using the SNaPshot Multiplex Kit (Life Technologies) on the same PCR cycler as before. The SBE cycling conditions were as follows: 25 cycles of 96° C 10 sec, 55° C 5 sec, 60° C 30 sec. Fragment analysis was performed with the ABI Prism 3130 Genetic Analyzer (Life Technologies) using GeneMapper v3.2. 
For more information on primer sequences and concentrations, see Table S2a.

The model proposed by Walsh et al. (2011) for the prediction of eye colour is based upon six SNPs. Five of these had been genotyped in all our study participants before (Table 1, SNPs nos. 1, 3, 10-12). The 100 individuals of stage 2 were additionally genotyped for rs16891982 (SLC45A2) as described above, with an annealing temperature of 58° C in the first PCR. For more information on primer sequences and concentrations, see Table S2b.

The model devised by Branicki et al. (2011) for predicting hair colour is based upon 13 single or compound markers. Three of these were also included in our set of candidate SNPs (Table 1, SNPs nos. 1, 3, 10) and were genotyped in all participants. The 100 stage 2 individuals were also genotyped for the remaining 10 markers, namely two compound markers in MC1R and rs1042602 (TYR), rs4959270 (EXOC2), rs28777 (SLC45A2), rs683 (TYRP1), rs2402130 (SLC24A4), rs12821256 (KITLG), rs16891982 (SLC45A2) and rs2378249 (ASIP). SNPs were analyzed as described above, with an annealing temperature of 58° C for the first PCR. The MC1R markers were analyzed by sequencing the whole locus. To this end, a 1080 bp fragment was amplified and sequenced with an ABI Prism 3130xl Genetic Analyzer using the Big Dye Terminator v3.1 Cycle Sequencing Kit (both Life Technologies), following the manufacturer’s protocols. See Table S2c for more information on primers used in this study.

Genotype and phenotype data of this study were submitted to the European Genome-phenome Archive (EGA, https://ega.crg.eu) with study accession number EGAS00001001174 (sample/proband ids EGAN00001268626-EGAN00001269025).

Statistical analysis

Genotypes for all markers of the two previously published models (or marker sets) (Branicki et al. 2011; Walsh et al. 2011) were only available for 100 individuals in our study (stage 2).
To compare the predictive capability of the two marker sets to a model derived specifically for our target population, the 300 individuals of stage 1 were used to detect significant genotype-phenotype associations and to create an appropriate prediction model. Data from stage 2 then served for estimation and comparison of the predictive capability (sensitivity, specificity, predictive accuracy, area under the receiver operating characteristic curve AUC) for each new model and the two previously published models (Branicki et al. 2011; Walsh et al. 2011). For illustration, we also performed model selection and prediction evaluation on the whole data set (i.e. stages 1 and 2 combined), using cross-validation to estimate sensitivity and specificity. Sample size calculations indicated that approximately 100 individuals per group would suffice to detect an OR of 3 as nominally significant, depending upon the minor allele frequency of the SNP of interest, and 150 individuals per group after Bonferroni adjustment (12 SNPs tested, 80% power, 5% significance level). Stage 1 therefore comprised 300 individuals. The association between a given trait (i.e. eye colour, hair colour/red tint, hair colour/light-dark component, skin colour) and a candidate SNP was tested for statistical significance using regression models. To allow for scarce genotypes, we performed permutation tests (100,000 permutations) in addition to standard asymptotic tests. Since p values were not found to be notably different, only p values from permutation tests will be given. Each SNP was analyzed both individually (simple regression) and in combination with other candidate SNPs (multiple regression with backward selection), also allowing for possible SNP-SNP interactions. Genotypic, additive allelic, dominant and recessive models were considered for each SNP. 
Results, however, will be presented for the additive model only because this model required the fewest parameters but yielded consistently large effects.

To derive robust prediction models, phenotypes were categorised in various ways. Depending on the scaling of the outcome, we performed logistic, linear, ordinal (proportional odds) and/or multinomial logistic regression. Since blue was by far the most frequent eye colour in our study, the analysis of eye colour was confined to the discrimination between blue and non-blue. For hair colour, red tint was treated as a dichotomous outcome whereas the light-dark component was treated in three different ways, either as dichotomous (blond vs. non-blond), ordinal, or quantitative (nine types of increasing darkness). Skin colour was treated either as dichotomous (types I-II vs. types III-IV) or as ordinal.

Model selection was performed differently for the four traits. For eye colour and red tint, SNPs that remained significant in the multiple logistic regression analysis after backward selection and adjustment for multiple testing were included in the final model. For the light-dark component of hair colour, and for skin colour, a SNP had to be significant in all or in all but one of the multiple regression analyses after backward selection using different outcome definitions (i.e. at least two of three analyses for the light-dark component, at least one of two analyses for skin colour).

The relationships between traits were analyzed using logistic regression analysis, treating one trait as the dependent variable and the other traits as independent variables, both with and without the additional inclusion of SNP genotypes. All four traits were encoded as dichotomous variables in these analyses. Model selection was again performed by backward selection. Multidimensional scaling (MDS) was used to detect and visualise patterns in the phenotype data.
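The additive allelic coding and the Bonferroni adjustment used in the association analyses can be illustrated with a minimal Python sketch. It is not the authors' code (the analyses were run in R); the function names `additive_code` and `bonferroni` are hypothetical, and the choice of which allele to count is an assumption made for illustration only.

```python
# Minimal sketch, not the authors' R code. Function names are hypothetical.

def additive_code(genotype, counted_allele):
    """Additive allelic coding: number of copies (0, 1 or 2) of
    counted_allele in a two-letter genotype such as 'GA'."""
    return sum(1 for allele in genotype if allele == counted_allele)


def bonferroni(p_value, n_tests=12):
    """Bonferroni-adjusted p value for n_tests tests, capped at 1.
    The default of 12 matches the 12 candidate SNPs tested."""
    return min(1.0, p_value * n_tests)


if __name__ == "__main__":
    # Coding of the three rs12913832 genotypes when counting allele G
    # (counting G is an assumption made here for illustration).
    for genotype in ("AA", "GA", "GG"):
        print(genotype, additive_code(genotype, "G"))

    # p values taken from Supplementary Table 3a (multiple regression).
    print(bonferroni(0.0014))   # reported as 0.017 after adjustment
    print(bonferroni(1.0e-5))   # reported as 1.2 x 10^-4 after adjustment
```

Under the additive model each SNP enters the regression as a single numeric covariate, which is why it needs fewer parameters than the genotypic model.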
The predictive capability of a derived model was evaluated by means of the phenotype probability from logistic regression analysis. This was done only for dichotomous outcomes (e.g. blue vs. non-blue eye colour). If the predicted probability exceeded 0.5, the corresponding phenotype was assumed to be present. Predictive capability was quantified by the sensitivity, specificity, predictive accuracy and AUC of the model in question.

All statistical analyses were performed with R v2.10.1 (R Development Core Team 2009) unless indicated otherwise. Hardy-Weinberg equilibrium was assessed by means of the exact test implemented in R package genetics (Warnes et al. 2008). Package MASS was used for ordinal and multinomial regression (Venables and Ripley 2002). Permutation tests of the linear and logistic regression models were performed with package glmperm (Werft et al. 2013). For ordinal regression models, permutation tests were programmed in house. The predictive capabilities of different models were evaluated with packages DiagnosisMed (Brasil 2010) and pROC (Robin et al. 2011).

The proportion of phenotype heritability explained by a given marker was calculated according to So et al. (2011). Note that these estimates apply to single markers and do not take into account the characteristics of the respective regression models. Furthermore, these estimates refer to the liability scale and therefore tend to be higher than on the observation scale. Sample size calculations were performed with the GPower software v3.0.8.

All tests were two-sided and a p value smaller than 0.05 was considered nominally statistically significant. P values were adjusted for multiple testing using the Bonferroni method.
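The evaluation step described above (classification at a probability threshold of 0.5, followed by sensitivity, specificity, predictive accuracy and AUC) can be sketched in plain Python. This is a stand-in for illustration, not the DiagnosisMed or pROC implementation used in the study; the function name `evaluate` and the toy data are hypothetical, and the AUC is computed here as the proportion of positive/negative pairs in which the positive individual has the higher predicted probability (ties count one half).

```python
# Minimal sketch, not the R packages used in the study.
# 'evaluate' and the toy data below are hypothetical.

def evaluate(probabilities, observed, threshold=0.5):
    """Evaluate a dichotomous prediction.

    probabilities: predicted phenotype probabilities (e.g. blue eyes)
    observed:      1 if the phenotype is present, 0 otherwise
    Assumes that both classes occur at least once in 'observed'.
    """
    predicted = [1 if p > threshold else 0 for p in probabilities]
    pairs = list(zip(observed, predicted))
    tp = sum(1 for y, yhat in pairs if y == 1 and yhat == 1)
    tn = sum(1 for y, yhat in pairs if y == 0 and yhat == 0)
    fp = sum(1 for y, yhat in pairs if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in pairs if y == 1 and yhat == 0)

    # AUC via pairwise comparison of positives and negatives.
    pos = [p for p, y in zip(probabilities, observed) if y == 1]
    neg = [p for p, y in zip(probabilities, observed) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)

    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(observed),
        "auc": wins / (len(pos) * len(neg)),
    }


if __name__ == "__main__":
    # Toy data for illustration only, not taken from the study.
    probs = [0.90, 0.90, 0.52, 0.18, 0.18, 0.0055]
    truth = [1, 1, 0, 1, 0, 0]
    print(evaluate(probs, truth))
```

Because genotype-based models yield only a handful of distinct predicted probabilities, ties between individuals are common, so the half-weight for ties matters for the AUC.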
Supplementary Table 1: Terms for PubMed search

Category 1: human
Category 2: pigmentation; eye colour/color; skin colour/color; hair colour/color; red hair/red tint; curly hair
Category 3: genotype/genotyping; single nucleotide polymorphism/SNP/SNPs
Category 4: prediction; determination
Category 5: origin/ancestry/ancestral; forensic; forensic phenotyping/forensic DNA phenotyping; Irisplex

The PubMed search used the following combinations:
Always: one word each of categories 1-3
Often: one word of category 4
Sometimes: one word of category 5
Furthermore, articles that cited already retrieved articles were also investigated.

Supplementary Table 2: Genotyping information for all analyzed SNPs

Supplementary Table 2a: Genotyping information for the 12 analyzed candidate SNPs and 3 control SNPs

Multiplex I gene size (bp) SNP-ID MC1R 174 NC_000016.10:g.89919709C>T rs1805007 rs1805008 NC_000016.10:g.89919736C>T OCA2 167 HERC 181 OCA2 64 OCA2 254 IRF4 121 HERC2 75 TYR 73 SLC24A5 137 MC1R 143 HERC2 176 SLC24A4 114 OCA2 66 HERC2 163 rs4778138 NC_000015.10:g.28090674A>G rs1667394 NC_000015.10:g.28285036C>T rs7495174 NC_000015.10:g.28099092A>G rs1800407 NC_000015.10:g.27985172C>T rs12203592 NC_000006.11:g.396321C>T rs916977 NC_000015.10:g.28268218T>C rs1393350 NC_000011.10:g.89277878G>A rs1426654 NC_000015.10:g.48134287A>G rs1805009 NC_000016.10:g.89920138G>C rs1129038 NC_000015.10:g.28111713C>T rs12896399 NC_000014.9:g.92307319G>T rs4778241 NC_000015.10:g.28093567A>C rs12913832 NC_000015.10:g.28120472A>G primer sequence SNP allele SBE-primer sequence tail total length (bp) C/G/T TCTCCATCTTCTACGCACTG (GACT)5 40 C/T AGCATCGTGACCCTGCCG (A)14 32 0.8 G/A GTGAAAATATAACATATCAAAATTG (GACT)6 49 0.2 G/A CATTGTTTCTTTGTTTGTTTGGT (A)7 30 0.2 G/A (C/T) AGGCAAGTTCCCCTAAAGGT (A)5 25 0.4 G/A AGGCATACCGGCTCTCCC (GACT)9 54 0.4 G/A (C/T) ACTTTGGTGGGTAAAAGAAGG (GACT)8 53 0.1 G/A GTGCAGCCTTGGCCAGCCTTCT (GACT)6 46 0.05 G/A CCTCAGTCCCTTCTCTGCAAC (GACT)9 57 0.2 G/A (C/T) CTGAACTGCCCGCTGCCATGAAAGTTG (GACT)4 43 0.2 G/C
CTCATCATCTGCAATGCCATCATC (GACT)7 52 0.1 G/A TGAGCCAGGCAGCAGAGC (A)9 27 0.1 G/T CTTTAGGTCAGTATATTTTGGG 0.1 C/A (G/T) TGTTGGCTGGTAGTTGCAATT (GACT)6 45 0.05 G/A GCCAGTTTCATTTGAGCATTAA (A)13 35 [primer] in µM forward:GCCGTGGACCGCTACATC 0.6 reverse: GAAGAAGACCACGAGGCACAG forward: GGAAAATCTGCACACTTAGAAA reverse: GCTGTAAATTTCCTCCCATCA forward: TTGGCAGCTTTTCTGTCTTCT reverse: AAAATGAGAACTTGGTCAATCC forward: GGCTCCGTCGCACCCGTCTG reverse: GCGGCTTAGGAAGCAAGGCAAG forward: CAGAGGTGCTTTGCGTACCTTATGGT reverse: GGGGTAATGTTAGTTTGGCTCCCTGTTCTTA forward: TCATGTGAAACCACAGGGCA reverse: CTGGCACCAAAAGTACCACA forward: CACAGTGGGGATGCAGTTTGAGTA reverse: TTGGCCTTTCTGTTCTTCTTGACC Multiplex II forward: TATCCACCAACTCCTACTCTT reverse:TTATCATTTGTAAAAGACCACAC forward: GAAGAAAATAAAAATCACACTGAGTAAGC reverse:CCTTGGATTGTCTCAGGATGTTGC forward: CACCGCGCTCACCAGGAGC reverse:GGCTGCATCTTCAAGAACTTCAACC forward: GCCGACGACAGCAGCGACGAT reverse:CAGACACACCAGGCAGCCTACAGTCT forward: ATTGAGTATCCTATATTTTATCTG reverse:TCTTGATGTTGTATTGATGAGG forward: CTGGAAAGCAGTTTGACAGTT reverse:GTGCAATTGTTGGCTGGTAG forward: AAGAGGCGAGGCCAGTTTC reverse:AGAAACGACAAGTAGACCATTT reverse SBE-primer 22 Supplementary Table 2b: Genotyping information for additional SNP rs16891982 (SLC45A2) from Walsh et al. 30 Singelplex-PCR gene size (bp) SLC45A2 107 SNP-ID rs16891982 NC_000005.10:g.33951588C>G primer sequence forward: AGAAACTTTTAGAAGACATCCTTAGGAGAGAGAAA reverse: AAGAGGAGTCGAGGTTGGATGTTGG [primer] in µM SNP allele SBE-primer sequence tail total length (bp) 0.2 C/G (G/C) AGGTTGGATGTTGGGGCTT (GACT)7 47 reverse SBE-primer Supplementary Table 2c: Genotyping information for additional single or compound markers from Branicki et al. 
29 Multiplex-PCR gene size (bp) TYR 120 EXOC2 113 SLC45A2 109 SNP-ID rs1042602 NC_000011.10:g.89178528C>A rs4959270 NC_000006.12:g.457748C>A rs28777 NC_000005.10:g.33958854C>A primer sequence forward: ATGACCTCTTTGTCTGGATGC reverse: CTATGCCAAGGCAGAAAAGC forward: CTGGGGTTTACGATTCAACA reverse: AGGATGGAAAAGAACCACCA forward: GTGGGAGTTCCATGCCTTT reverse: TCCAAGAGTCGCATAGGACA [primer] in µM SNP allele SBE-primer sequence tail total length (bp) 0.1 A/C (T/G) CAATGTCTCTCCAGATTTCA (GACT)9 32 0.2 A/C CCAAACTATGACACTATG (GACT) 22 0.2 A/C CATGTGATCCTCACAGCAG (GACT)2 29 0.2 A/G CATACGGAGCCCGTG (GACT)7 43 0.1 C/T (G/A) GGGCATGTTACTACGGCAC (GACT)5 39 0.2 A/G CTAGGAACTACTTTGCACAGTA (GACT)3 34 A/C (T/G) GCCTAGAACTTTAAT (GACT)3 27 Duplex-PCR SLC24A4 152 KITLG 106 ASIP 135 TYRP1 120 gene size (bp) MC1R 1080 rs2402130 NC_000014.9:g.92334859G>A rs12821256 NC_000012.12:g.88934558T>C rs2378249 NC_000020.11:g.34630286G>A rs683 NC_000009.12:g.12709305C>A forward: ACCTGTCTCACAGTGCTGCT reverse: TTCACCTCGATGACGATGAT forward: TTAAGCTCTGTGTTTAGGGTTTTT reverse: TGAGTCATGAGTGCTTTGTTCC Singleplex-PCR forward: GACCTCAGTTCTGGAGAAAGC reverse: AAGGTGGCTGGTTTCAGTCT Singleplex-PCR forward: TTTTCTTTCACTTTATTACCTTCTTTC 0.2 reverse: AAAGATTCTGAAAGGGTCTTCC Singleplex-PCR MC1R Exon primer sequence forward: GCAGCACCATGAACTAAGCA reverse: TGCCCAGCACACTTAAAGC concentration of each primer (µM) 0.4, 3.2 in the sequencing reaction Supplementary Table 3: Genotype-phenotype association analysis of eye colour Supplementary Table 3a: Association between SNP genotype and blue eye colour (stage 1) Alleles ref. fav. A G A G G A G A G A A C - SNP-ID (gene) rs12913832 (HERC2) rs916977(HERC2) rs1800407(OCA2) rs7495174(OCA2) rs4778138(OCA2) rs4778241(OCA2) rs1805007(MC1R) rs1805008(MC1R) rs1805009 (MC1R) rs12203592(IRF4) rs12896399 (SLC24A4) rs1393350 (TYR) Regression p value simple multiplea <1.010-5 <1.010-5 n.s. <1.010-5 <1.010-5 <1.010-5 n.s. n.s. n.s. n.s. n.s. n.s. <1.010-5 (1.210-4) n.s. 0.0014 (0.017) n.s. n.s. n.s. 
n.s. n.s. n.s. n.s. n.s. n.s.

ref.: reference allele, fav.: favourable allele, n.s.: not significant. a Figures in brackets are p values after Bonferroni adjustment (12 SNPs tested). SNPs included in the final prediction model are printed in bold (see Table 2). All SNPs were analysed using an additive allelic model on the logit scale.

Supplementary Table 3b: Combined rs12913832/rs1800407 genotype and blue eye colour (stage 2)

rs12913832  rs1800407         Probability     Prevalence (blue eyes)
(HERC2)     (OCA2)      N     (blue eyes)a    observed    expectedb
GG          AA          0     1.00            0           0
GG          GA          6     0.98            5           5.9
GG          GG          70    0.90            61          63.0
GA          AA          0     0.84            0           0
GA          GA          3     0.52            2           1.6
GA          GG          19    0.18            5           3.4
AA          AA          0     0.12            0           0
AA          GA          0     0.027           0           0
AA          GG          2     0.0055          0           0.0

a Conditional probability of blue eye colour, given the respective genotype, as calculated from the prediction model derived in stage 1 (see Table 2). b The expected prevalence equals the product of the genotype prevalence (N) and the probability of blue eyes.

Supplementary Table 4: Genotype distribution, by eye colour, of the 12 candidate SNPs (all 400 individuals of stage 1 and 2)

SNP-ID gene rs12913832 HERC2 rs916977 HERC2 rs1800407 OCA2 rs7495174 OCA2 rs4778138 OCA2 rs4778241 OCA2 rs1805007 MC1R rs1805008 MC1R rs1805009 MC1R rs12203592 IRF4 rs12896399 SLC24A4 rs1393350 TYR G1 GG 258 0.90 GG 270 0.94 GG 261 0.91 AA 284 0.99 AA 247 0.86 CC 239 0.84 CC 249 0.87 CC 232 0.81 GG 279 0.97 CC 240 0.84 GG 93 0.33 GG 158 0.55 blue (n=287) G2 G3 A1 GA AA G 29 0 545 0.10 0 0.95 GA AA G 17 0 557 0.059 0 0.97 GA AA G 25 1 547 0.087 0.0035 0.95 AG GG A 3 0 571 0.010 0 0.99 AG GG A 39 1 533 0.14 0.0035 0.93 CA AA C 46 1 524 0.16 0.0035 0.92 CT TT C 33 5 531 0.11 0.017 0.93 CT TT C 50 5 514 0.17 0.017 0.90 GC CC G 8 0 566 0.028 0 0.99 CT TT C 45 2 525 0.16 0.0070 0.91 GT TT G 134 59 320 0.47 0.21 0.56 GA AA G 118 11 434 0.41 0.038 0.76 A2 A 29 0.051 A 17 0.030 A 27 0.047 G 3 0.0052 G 41 0.071 A 48 0.084 T 43 0.075 T 60 0.10 C 8 0.014 T 49 0.085 T 252 0.44 A 140 0.24 G1 GG 26 0.5 GG 39 0.75 GG 47 0.90 AA 45
0.87 AA 37 0.71 CC 32 0.62 CC 42 0.81 CC 45 0.87 GG 51 0.98 CC 46 0.88 GG 20 0.38 GG 31 0.60 green (n=52) G2 G3 A1 GA AA G 26 0 78 0.5 0 0.75 GA AA G 13 0 91 0.25 0 0.88 GA AA G 5 0 99 0.096 0 0.95 AG GG A 7 0 97 0.13 0 0.93 AG GG A 15 0 89 0.29 0 0.86 CA AA C 19 1 83 0.37 0.019 0.80 CT TT C 9 1 93 0.17 0.019 0.89 CT TT C 7 0 97 0.13 0 0.93 GC CC G 1 0 103 0.019 0 0.99 CT TT C 6 0 98 0.12 0 0.94 GT TT G 22 10 62 0.42 0.19 0.60 GA AA G 18 3 80 0.35 0.058 0.77 13 A2 A 26 0.25 A 13 0.13 A 5 0.048 G 7 0.067 G 15 0.14 A 21 0.20 T 11 0.11 T 7 0.067 C 1 0.0096 T 6 0.058 T 42 0.40 A 24 0.23 G1 GG 5 0.082 GG 27 0.44 GG 52 0.85 AA 39 0.64 AA 28 0.46 CC 27 0.44 CC 51 0.84 CC 45 0.74 GG 61 1 CC 55 0.90 GG 28 0.46 GG 34 0.56 brown (n=61) G2 G3 A1 GA AA G 46 10 56 0.75 0.16 0.46 GA AA G 30 4 84 0.49 0.066 0.69 GA AA G 9 0 113 0.15 0 0.93 AG GG A 22 0 100 0.36 0 0.82 AG GG A 30 3 86 0.49 0.049 0.70 CA AA C 30 4 84 0.49 0.066 0.69 CT TT C 9 1 111 0.15 0.016 0.91 CT TT C 14 2 104 0.23 0.033 0.85 GC CC G 0 0 122 0 0 1 CT TT C 5 1 115 0.082 0.016 0.94 GT TT G 25 8 81 0.41 0.13 0.66 GA AA G 21 6 89 0.34 0.098 0.73 A2 A 66 0.54 A 38 0.31 A 9 0.074 G 22 0.18 G 36 0.30 A 38 0.31 T 11 0.090 T 18 0.15 C 0 0 T 7 0.057 T 41 0.34 A 33 0.27 Supplementary Table 5: Genotype-phenotype association analysis of red tint in hair colour Supplementary Table 5a: Association between SNP genotype and red tint (stage 1) SNP-ID (gene) rs12913832 (HERC2) rs916977 (HERC2) rs1800407 (OCA2) rs7495174 (OCA2) rs4778138 (OCA2) rs4778241 (OCA2) rs1805007 (MC1R) rs1805008 (MC1R) rs1805009 (MC1R) rs12203592 (IRF4) rs12896399 (SLC24A4) rs1393350 (TYR) Alleles ref. fav. C T C T - Regression p value simple multiple§ n.s. n.s. n.s. n.s. n.s. n.s. n.s. n.s. n.s. n.s. n.s. n.s. <1.010-5 6.010-4 n.s. n.s. n.s. n.s. <1.010-5 (1.210-4) 1.010-4 (0.0012) n.s. n.s. n.s. n.s. ref.: reference allele, fav.: favourable allele, n.s.: not significant. §Figures in brackets are p values after Bonferroni adjustment (12 SNPs tested). 
SNPs included in the final prediction model are printed in bold (see Table 2). The p value for the interaction between rs1805007 and rs1805008 equalled 0.020 (0.24). All SNPs were analyzed using an additive allelic model on the logit scale.

Supplementary Table 5b: Combined rs1805007/rs1805008 genotype and red tint in hair colour (stage 2)

rs1805007   rs1805008         Probability     Prevalence (red tint)
(MC1R)      (MC1R)      N     (red tint)$     observed    expected&
CC          CC          62    0.14            10          8.68
CC          CT          22    0.36            9           7.92
CC          TT          4     0.67            3           2.68
CT          CC          9     0.47            6           4.23
CT          CT          1     0.76            1           0.76
CT          TT          0     0.92            0           0
TT          CC          2     0.82            2           1.64
TT          CT          0     0.94            0           0
TT          TT          0     0.98            0           0

$ Conditional probability of red tint, given the respective genotype, as calculated from the prediction model derived in stage 1 (see Table 2). & The expected prevalence equals the product of the genotype prevalence (N) and the probability of red tint.

Supplementary Table 6: Genotype distribution, by red tint, of the 12 candidate SNPs (all 400 individuals of stage 1 and 2)

SNP-ID gene rs12913832 HERC2 rs916977 HERC2 rs1800407 OCA2 rs7495174 OCA2 rs4778138 OCA2 rs4778241 OCA2 rs1805007 MC1R rs1805008 MC1R rs1805009 MC1R rs12203592 IRF4 rs12896399 SLC24A4 rs1393350 TYR G1 GG 76 0.75 GG 88 0.86 GG 92 0.90 AA 96 0.94 AA 81 0.79 CC 77 0.75 CC 68 0.67 CC 68 0.67 GG 97 0.95 CC 85 0.83 GG 37 0.36 GG 57 0.56 red tint (n=102) G2 G3 A1 GA AA G 25 1 177 0.25 0.0098 0.87 GA AA G 14 0 190 0.14 0 0.93 GA AA G 10 0 194 0.098 0 0.95 AG GG A 6 0 198 0.059 0 0.97 AG GG A 21 0 183 0.21 0 0.90 CA AA C 25 0 179 0.25 0 0.88 CT TT C 28 6 164 0.27 0.059 0.80 CT TT C 28 6 164 0.27 0.059 0.80 GC CC G 5 0 199 0.049 0 0.98 CT TT C 15 2 185 0.15 0.020 0.91 GT TT G 45 20 119 0.44 0.20 0.58 GA AA G 42 3 156 0.41 0.029 0.76 A2 A 27 0.13 A 14 0.069 A 10 0.049 G 6 0.029 G 21 0.10 A 25 0.12 T 40 0.20 T 40 0.20 C 5 0.025 T 19 0.093 T 85 0.42 A 48 0.24 G1 GG 213 0.71 GG 248 0.83 GG 268 0.90 AA 272 0.91 AA 231 0.78 CC 221 0.74 CC 274 0.92 CC 254 0.85 GG 294 0.99 CC 256 0.86 GG 104 0.35 GG 166 0.56 no red tint
(n=298) G2 G3 A1 GA AA G 76 9 502 0.26 0.030 0.84 GA AA G 46 4 542 0.15 0.013 0.91 GA AA G 29 1 565 0.097 0.0034 0.95 AG GG A 26 0 570 0.087 0 0.96 AG GG A 63 4 525 0.21 0.013 0.88 CA AA C 70 6 512 0.24 0.020 0.86 CT TT C 23 1 571 0.077 0.0034 0.96 CT TT C 43 1 551 0.14 0.0034 0.92 GC CC G 4 0 592 0.013 0 0.99 CT TT C 41 1 553 0.14 0.0034 0.93 GT TT G 136 57 344 0.46 0.19 0.58 GA AA G 115 17 447 0.39 0.057 0.75 A2 A 94 0.16 A 54 0.091 A 31 0.052 G 26 0.044 G 71 0.12 A 82 0.14 T 25 0.042 T 45 0.076 C 4 0.0067 T 43 0.072 T 250 0.42 A 149 0.25 Supplementary Table 7: Genotype distribution, by hair colour (light-dark component), of the 12 candidate SNPs (all 400 individuals of stage 1 and 2) SNP-ID gene rs12913832 HERC2 rs916977 HERC2 rs1800407 OCA2 rs7495174 OCA2 rs4778138 OCA2 rs4778241 OCA2 rs1805007 MC1R rs1805008 MC1R rs1805009 MC1R rs12203592 IRF4 rs12896399 SLC24A4 rs1393350 TYR G1 GG 188 0.82 GG 206 0.90 GG 209 0.91 AA 221 0.96 AA 193 0.84 CC 184 0.80 CC 196 0.85 CC 179 0.78 GG 223 0.97 CC 210 0.91 GG 68 0.30 GG 127 0.55 blond(n=230) G2 G3 A1 GA AA G 41 1 417 0.18 0.0043 0.91 GA AA G 24 0 436 0.10 0 0.95 GA AA G 20 1 438 0.087 0.0043 0.95 AG GG A 9 0 451 0.039 0 0.98 AG GG A 36 1 422 0.16 0.0043 0.92 CA AA C 45 1 413 0.20 0.0043 0.90 CT TT C 30 4 422 0.13 0.017 0.92 CT TT C 44 7 402 0.19 0.030 0.87 GC CC G 7 0 453 0.030 0 0.98 CT TT C 20 0 440 0.087 0 0.96 GT TT G 107 55 243 0.47 0.24 0.53 GA AA G 90 13 344 0.39 0.057 0.75 A2 A 43 0.093 A 24 0.052 A 22 0.048 G 9 0.020 G 38 0.083 A 47 0.10 T 38 0.083 T 58 0.13 C 7 0.015 T 20 0.043 T 217 0.47 A 116 0.25 G1 GG 95 0.64 GG 119 0.80 GG 131 0.89 AA 132 0.89 AA 108 0.73 CC 108 0.73 CC 129 0.87 CC 123 0.83 GG 147 0.99 CC 118 0.80 GG 62 0.42 GG 82 0.55 brown (n=148) G2 G3 A1 GA AA G 44 9 234 0.30 0.061 0.79 GA AA G 25 4 263 0.17 0.027 0.89 GA AA G 17 0 279 0.11 0 0.94 AG GG A 16 0 280 0.11 0 0.95 AG GG A 38 2 254 0.26 0.014 0.86 CA AA C 34 5 250 0.23 0.034 0.85 CT TT C 17 2 275 0.11 0.014 0.93 CT TT C 25 0 271 0.17 0 0.92 
GC CC G 1 0 295 0.0068 0 1.00 CT TT C 27 3 263 0.18 0.020 0.89 GT TT G 63 22 187 0.43 0.15 0.64 GA AA G 60 6 224 0.41 0.041 0.76 Two individuals with pure red hair colour each were removed from stage 1 and 2. 16 A2 A 62 0.21 A 33 0.11 A 17 0.057 G 16 0.054 G 42 0.14 A 44 0.15 T 21 0.071 T 25 0.084 C 1 0.0034 T 33 0.11 T 107 0.36 A 72 0.24 G1 GG 4 0.22 GG 8 0.44 GG 17 0.94 AA 11 0.61 AA 7 0.39 CC 4 0.22 CC 15 0.83 CC 16 0.89 GG 18 1 CC 10 0.56 GG 11 0.61 GG 12 0.67 black (n=18) G2 G3 A1 GA AA G 14 0 22 0.78 0 0.61 GA AA G 10 0 26 0.56 0 0.72 GA AA G 1 0 35 0.056 0 0.97 AG GG A 7 0 29 0.39 0 0.81 AG GG A 10 1 24 0.56 0.056 0.67 CA AA C 14 0 22 0.78 0 0.61 CT TT C 3 0 33 0.17 0 0.92 CT TT C 2 0 34 0.11 0 0.94 GC CC G 0 0 36 0 0 1 CT TT C 8 0 28 0.44 0 0.78 GT TT G 7 0 29 0.39 0 0.81 GA AA G 5 1 29 0.28 0.056 0.81 A2 A 14 0.39 A 10 0.28 A 1 0.028 G 7 0.19 G 12 0.33 A 14 0.39 T 3 0.083 T 2 0.056 C 0 0 T 8 0.22 T 7 0.19 A 7 0.19 Supplementary Table 8: Genotype-phenotype association analysis of hair colour (light-dark component) Supplementary Table 8a: Association between SNP genotype and hair colour (stage 1) Alleles SNP-ID (gene) ref. fav. rs12913832 (HERC2) A G rs916977 (HERC2) rs1800407 (OCA2) rs7495174 (OCA2) rs4778138 (OCA2) rs4778241 (OCA2) rs1805007 (MC1R) rs1805008 (MC1R) rs1805009 (MC1R) A G G A C G G A A C T C rs12203592 (IRF4) T C rs12896399 (SLC24A4) G T rs1393350 (TYR) - - blond vs. non-blond simple multiple§ 6.010-4 0.0026 (0.0072) 0.0076 n.s. n.s. n.s. 0.032 n.s. 0.031 n.s. n.s. n.s. n.s. n.s. n.s. n.s. n.s. n.s. 0.0025 0.0032 (0.030) 0.0044 0.0022 (0.053) n.s. n.s. Regression p value ordinal regression simple multiple§ <1.010-5 1.010-5 (1.210-4) -5 n.s. 4.010 n.s. n.s. 0.0047 n.s. -4 n.s. 7.810 0.0071 n.s. n.s. 0.040 (0.48) n.s. 0.031 (0.37) n.s. 0.040 (0.48) <1.010-5 8.010-5 (1.210-4) 0.0032 0.0094 (0.11) n.s. n.s. linear regression simple multiple§ <1.010-5 <1.010-5 (1.210-4) -5 n.s. <1.010 n.s. n.s. 0.0024 n.s. -4 n.s. 3.010 0.0016 n.s. n.s. n.s. n.s. 
n.s. n.s. n.s. -5 <1.010 1.010-4 -4 (1.210 ) 0.0050 0.0025 (0.060) n.s. n.s.

ref.: reference allele, fav.: favourable allele for light hair colour, n.s.: not significant. § Figures in brackets are p values after Bonferroni adjustment (12 SNPs tested). SNPs included in the final prediction model are printed in bold (see Table 2). Shown are the p values for the analysis of blond vs. non-blond, ordinal regression analysis of the nine hair types and linear regression analysis treating the nine hair types as a quantitative variable. All SNPs were analyzed using an additive allelic model on the logit scale. Two individuals with pure red hair colour were removed from stage 1.

Supplementary Table 8b: Combined rs12913832/rs12203592 genotype and hair colour (stage 2)

rs12913832  rs12203592        Prevalence (blond) n=52    Prevalence (brown) n=36    Prevalence (black) n=10
(HERC2)     (IRF4)      N     prob.   obs.   exp.        prob.   obs.   exp.        prob.   obs.   exp.
GG          CC          58    0.69    43     40.02       0.30    14     17.40       0.014   1      0.81
GG          CT          7     0.42    3      2.94        0.54    4      3.78        0.036   0      0.25
GG          TT          1     0.19    0      0.19        0.73    1      0.73        0.073   0      0.073
GA          CC          20    0.50    5      10.00       0.46    12     9.20        0.043   3      0.86
GA          CT          8     0.24    1      1.92        0.67    1      5.36        0.093   6      0.74
GA          TT          1     0.091   0      0.091       0.75    1      0.75        0.16    0      0.16
AA          CC          3     0.29    0      0.87        0.59    3      1.77        0.12    0      0.36
AA          CT          0     0.11    0      0           0.69    0      0           0.20    0      0
AA          TT          0     0.038   0      0           0.67    0      0           0.29    0      0

prob.: probability of a particular hair colour, given the respective genotype, as calculated from the prediction model derived in stage 1 (see Table 2) by multinomial regression, obs.: observed prevalence of the respective hair colour, exp.: expected prevalence, equal to the product of the genotype prevalence (N) and the probability of the respective hair colour. Two individuals with pure red hair colour were removed from stage 2.
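The expected prevalence reported in Supplementary Table 8b is defined as the product of the genotype prevalence (N) and the model-based phenotype probability. A minimal Python sketch of that arithmetic follows, using the blond column for the genotype combinations with N > 0; the function name `expected_prevalence` and the dictionary layout are hypothetical and not part of the original analysis.

```python
# Minimal sketch of the "expected prevalence" arithmetic.
# Function name and data layout are hypothetical.

def expected_prevalence(n_by_genotype, prob_by_genotype):
    """Expected number of individuals with the phenotype per genotype:
    genotype prevalence (N) times phenotype probability."""
    return {genotype: n * prob_by_genotype[genotype]
            for genotype, n in n_by_genotype.items()}


# Keys are (rs12913832, rs12203592) genotype combinations; values are
# taken from the N and blond "prob." columns of Supplementary Table 8b.
n_stage2 = {
    ("GG", "CC"): 58, ("GG", "CT"): 7, ("GG", "TT"): 1,
    ("GA", "CC"): 20, ("GA", "CT"): 8, ("GA", "TT"): 1,
    ("AA", "CC"): 3,
}
prob_blond = {
    ("GG", "CC"): 0.69, ("GG", "CT"): 0.42, ("GG", "TT"): 0.19,
    ("GA", "CC"): 0.50, ("GA", "CT"): 0.24, ("GA", "TT"): 0.091,
    ("AA", "CC"): 0.29,
}

if __name__ == "__main__":
    expected = expected_prevalence(n_stage2, prob_blond)
    for genotype, value in expected.items():
        print(genotype, round(value, 3))
```

Comparing these expected counts with the observed counts in the same row gives a quick impression of how well the stage 1 model carries over to the stage 2 sample.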
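Supplementary Table 9: Genotype distribution, by skin colour, of the 12 candidate SNPs (all 400 individuals of stage 1 and 2)

Cells give the genotype (G1-G3) or allele (A1, A2), its count and, in brackets, its relative frequency.

Skin type I (n=19)

| SNP-ID | gene | G1 | G2 | G3 | A1 | A2 |
|---|---|---|---|---|---|---|
| rs12913832 | HERC2 | GG 17 (0.89) | GA 2 (0.11) | AA 0 (0) | G 36 (0.95) | A 2 (0.053) |
| rs916977 | HERC2 | GG 17 (0.89) | GA 2 (0.11) | AA 0 (0) | G 36 (0.95) | A 2 (0.053) |
| rs1800407 | OCA2 | GG 16 (0.84) | GA 3 (0.16) | AA 0 (0) | G 35 (0.92) | A 3 (0.079) |
| rs7495174 | OCA2 | AA 19 (1) | AG 0 (0) | GG 0 (0) | A 38 (1) | G 0 (0) |
| rs4778138 | OCA2 | AA 18 (0.95) | AG 1 (0.053) | GG 0 (0) | A 37 (0.97) | G 1 (0.026) |
| rs4778241 | OCA2 | CC 16 (0.84) | CA 3 (0.16) | AA 0 (0) | C 35 (0.92) | A 3 (0.079) |
| rs1805007 | MC1R | CC 14 (0.74) | CT 4 (0.21) | TT 1 (0.053) | C 32 (0.84) | T 6 (0.16) |
| rs1805008 | MC1R | CC 8 (0.42) | CT 10 (0.53) | TT 1 (0.053) | C 26 (0.68) | T 12 (0.32) |
| rs1805009 | MC1R | GG 17 (0.89) | GC 2 (0.11) | CC 0 (0) | G 36 (0.95) | C 2 (0.053) |
| rs12203592 | IRF4 | CC 14 (0.74) | CT 4 (0.21) | TT 1 (0.053) | C 32 (0.84) | T 6 (0.16) |
| rs12896399 | SLC24A4 | GG 3 (0.16) | GT 12 (0.63) | TT 4 (0.21) | G 18 (0.47) | T 20 (0.53) |
| rs1393350 | TYR | GG 8 (0.42) | GA 10 (0.53) | AA 1 (0.053) | G 26 (0.68) | A 12 (0.32) |

Skin type II (n=126)

| SNP-ID | gene | G1 | G2 | G3 | A1 | A2 |
|---|---|---|---|---|---|---|
| rs12913832 | HERC2 | GG 101 (0.80) | GA 23 (0.18) | AA 2 (0.016) | G 225 (0.89) | A 27 (0.11) |
| rs916977 | HERC2 | GG 113 (0.90) | GA 13 (0.10) | AA 0 (0) | G 239 (0.95) | A 13 (0.052) |
| rs1800407 | OCA2 | GG 114 (0.90) | GA 12 (0.095) | AA 0 (0) | G 240 (0.95) | A 12 (0.048) |
| rs7495174 | OCA2 | AA 122 (0.97) | AG 4 (0.032) | GG 0 (0) | A 248 (0.98) | G 4 (0.016) |
| rs4778138 | OCA2 | AA 107 (0.85) | AG 19 (0.15) | GG 0 (0) | A 233 (0.92) | G 19 (0.075) |
| rs4778241 | OCA2 | CC 106 (0.84) | CA 19 (0.15) | AA 1 (0.0079) | C 231 (0.92) | A 21 (0.083) |
| rs1805007 | MC1R | CC 99 (0.79) | CT 22 (0.17) | TT 5 (0.040) | C 220 (0.87) | T 32 (0.13) |
| rs1805008 | MC1R | CC 96 (0.76) | CT 25 (0.20) | TT 5 (0.040) | C 217 (0.86) | T 35 (0.14) |
| rs1805009 | MC1R | GG 124 (0.98) | GC 2 (0.016) | CC 0 (0) | G 250 (0.99) | C 2 (0.0079) |
| rs12203592 | IRF4 | CC 103 (0.82) | CT 21 (0.17) | TT 2 (0.016) | C 227 (0.90) | T 25 (0.099) |
| rs12896399 | SLC24A4 | GG 49 (0.39) | GT 50 (0.40) | TT 26 (0.21) | G 148 (0.59) | T 102 (0.41) |
| rs1393350 | TYR | GG 78 (0.62) | GA 42 (0.33) | AA 6 (0.048) | G 198 (0.79) | A 54 (0.21) |

Skin type III (n=241)

| SNP-ID | gene | G1 | G2 | G3 | A1 | A2 |
|---|---|---|---|---|---|---|
| rs12913832 | HERC2 | GG 168 (0.70) | GA 66 (0.27) | AA 7 (0.029) | G 402 (0.83) | A 80 (0.17) |
| rs916977 | HERC2 | GG 198 (0.82) | GA 39 (0.16) | AA 4 (0.017) | G 435 (0.90) | A 47 (0.098) |
| rs1800407 | OCA2 | GG 217 (0.90) | GA 23 (0.095) | AA 1 (0.0041) | G 457 (0.95) | A 25 (0.052) |
| rs7495174 | OCA2 | AA 216 (0.90) | AG 25 (0.10) | GG 0 (0) | A 457 (0.95) | G 25 (0.052) |
| rs4778138 | OCA2 | AA 184 (0.76) | AG 53 (0.22) | GG 4 (0.017) | A 421 (0.87) | G 61 (0.13) |
| rs4778241 | OCA2 | CC 171 (0.71) | CA 64 (0.27) | AA 5 (0.021) | C 406 (0.85) | A 74 (0.15) |
| rs1805007 | MC1R | CC 216 (0.90) | CT 24 (0.10) | TT 1 (0.0041) | C 456 (0.95) | T 26 (0.054) |
| rs1805008 | MC1R | CC 205 (0.85) | CT 35 (0.15) | TT 1 (0.0041) | C 445 (0.92) | T 37 (0.077) |
| rs1805009 | MC1R | GG 236 (0.98) | GC 5 (0.021) | CC 0 (0) | G 477 (0.99) | C 5 (0.010) |
| rs12203592 | IRF4 | CC 214 (0.89) | CT 27 (0.11) | TT 0 (0) | C 455 (0.94) | T 27 (0.056) |
| rs12896399 | SLC24A4 | GG 80 (0.33) | GT 116 (0.48) | TT 45 (0.19) | G 276 (0.57) | T 206 (0.43) |
| rs1393350 | TYR | GG 131 (0.54) | GA 98 (0.41) | AA 12 (0.050) | G 360 (0.75) | A 122 (0.25) |

Skin type IV (n=14)

| SNP-ID | gene | G1 | G2 | G3 | A1 | A2 |
|---|---|---|---|---|---|---|
| rs12913832 | HERC2 | GG 3 (0.21) | GA 10 (0.71) | AA 1 (0.071) | G 16 (0.57) | A 12 (0.43) |
| rs916977 | HERC2 | GG 8 (0.57) | GA 6 (0.43) | AA 0 (0) | G 22 (0.79) | A 6 (0.21) |
| rs1800407 | OCA2 | GG 13 (0.93) | GA 1 (0.071) | AA 0 (0) | G 27 (0.96) | A 1 (0.036) |
| rs7495174 | OCA2 | AA 11 (0.79) | AG 3 (0.21) | GG 0 (0) | A 25 (0.89) | G 3 (0.11) |
| rs4778138 | OCA2 | AA 3 (0.21) | AG 11 (0.79) | GG 0 (0) | A 17 (0.61) | G 11 (0.39) |
| rs4778241 | OCA2 | CC 5 (0.36) | CA 9 (0.64) | AA 0 (0) | C 19 (0.68) | A 9 (0.32) |
| rs1805007 | MC1R | CC 13 (0.93) | CT 1 (0.071) | TT 0 (0) | C 27 (0.96) | T 1 (0.036) |
| rs1805008 | MC1R | CC 13 (0.93) | CT 1 (0.071) | TT 0 (0) | C 27 (0.96) | T 1 (0.036) |
| rs1805009 | MC1R | GG 14 (1) | GC 0 (0) | CC 0 (0) | G 28 (1) | C 0 (0) |
| rs12203592 | IRF4 | CC 10 (0.71) | CT 4 (0.29) | TT 0 (0) | C 24 (0.86) | T 4 (0.14) |
| rs12896399 | SLC24A4 | GG 9 (0.64) | GT 3 (0.21) | TT 2 (0.14) | G 21 (0.75) | T 7 (0.25) |
| rs1393350 | TYR | GG 6 (0.43) | GA 7 (0.50) | AA 1 (0.071) | G 19 (0.68) | A 9 (0.32) |

Supplementary Table 10: Genotype-phenotype association analysis of skin colour

Supplementary Table 10a: Association between SNP genotype and skin colour (stage 1)

| SNP-ID (gene) | ref. | fav. | types I-II vs. types III-IV: simple | multiple§ | ordinal regression: simple | multiple§ |
|---|---|---|---|---|---|---|
| rs12913832 (HERC2) | A | G | n.s. | n.s. | 0.0046 | n.s. |
| rs916977 (HERC2) | A | G | n.s. | n.s. | 0.039 | n.s. |
| rs1800407 (OCA2) | - | - | n.s. | n.s. | n.s. | n.s. |
| rs7495174 (OCA2) | G | A | 0.031 | 0.0067 (0.080) | 0.010 | n.s. |
| rs4778138 (OCA2) | G | A | 0.024 | n.s. | 2.8×10⁻⁴ | 4.9×10⁻⁴ (0.0059) |
| rs4778241 (OCA2) | A | C | 0.044 | n.s. | 0.0060 | n.s. |
| rs1805007 (MC1R) | C | T | 0.0041 | 0.0023 (0.028) | 0.0060 | 0.0035 (0.042) |
| rs1805008 (MC1R) | C | T | 6.0×10⁻⁴ | 1.0×10⁻⁴ (0.0012) | 1.2×10⁻⁴ | 6.0×10⁻⁵ (7.2×10⁻⁴) |
| rs1805009 (MC1R) | - | - | n.s. | n.s. | n.s. | n.s. |
| rs12203592 (IRF4) | C | T | 0.034 | 0.027 (0.32) | n.s. | n.s. |
| rs12896399 (SLC24A4) | - | - | n.s. | n.s. | n.s. | n.s. |
| rs1393350 (TYR) | - | - | n.s. | n.s. | n.s. | n.s. |

ref.: reference allele, fav.: favourable allele for fair skin type, n.s.: not significant. §Figures in brackets are p values after Bonferroni adjustment (12 SNPs tested).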
SNPs included in the final prediction model are printed in bold (see Table 2). All SNPs were analyzed using an additive allelic model on the logit scale.

Supplementary Table 10b: Combined rs1805007/rs1805008/rs4778138 genotype and skin colour (stage 2)

| rs1805007 (MC1R) | rs1805008 (MC1R) | rs4778138 (OCA2) | N | Probability (types I-II)$ | Prevalence (types I-II): observed | expected& |
|---|---|---|---|---|---|---|
| CC | CC | AA | 48 | 0.25 | 26 | 12.00 |
| CC | CC | AG | 10 | 0.14 | 3 | 1.40 |
| CC | CC | GG | 3 | 0.074 | 0 | 0.22 |
| CC | CT | AA | 13 | 0.48 | 8 | 6.24 |
| CC | CT | AG | 8 | 0.31 | 3 | 2.48 |
| CC | CT | GG | 0 | 0.18 | 0 | 0 |
| CC | TT | AA | 2 | 0.73 | 2 | 1.46 |
| CC | TT | AG | 0 | 0.56 | 0 | 0 |
| CC | TT | GG | 0 | 0.39 | 0 | 0 |
| CT | CC | AA | 9 | 0.46 | 7 | 4.14 |
| CT | CC | AG | 4 | 0.29 | 2 | 1.16 |
| CT | CC | GG | 0 | 0.17 | 0 | 0 |
| CT | CT | AA | 1 | 0.71 | 1 | 0.71 |
| CT | CT | AG | 0 | 0.54 | 0 | 0 |
| CT | CT | GG | 0 | 0.36 | 0 | 0 |
| CT | TT | AA | 0 | 0.87 | 0 | 0 |
| CT | TT | AG | 0 | 0.77 | 0 | 0 |
| CT | TT | GG | 0 | 0.62 | 0 | 0 |
| TT | CC | AA | 2 | 0.69 | 2 | 1.38 |
| TT | CC | AG | 0 | 0.52 | 0 | 0 |
| TT | CC | GG | 0 | 0.34 | 0 | 0 |
| TT | CT | AA | 0 | 0.86 | 0 | 0 |
| TT | CT | AG | 0 | 0.75 | 0 | 0 |
| TT | CT | GG | 0 | 0.60 | 0 | 0 |
| TT | TT | AA | 0 | 0.95 | 0 | 0 |
| TT | TT | AG | 0 | 0.90 | 0 | 0 |
| TT | TT | GG | 0 | 0.81 | 0 | 0 |

$ Conditional probability of types I-II, given the respective genotype, as calculated from the prediction model derived in stage 1 (see Table 2). & The expected prevalence equals the product of the genotype prevalence (N) and the probability of types I-II.

Supplementary Table 11: Composition of the four phenotype clusters identified by multidimensional scaling analysis (all 400 individuals of stage 1 and 2)

| Trait | Category | blue eye colour, red tint | blue eye colour, no red tint | no blue eye colour, red tint | no blue eye colour, no red tint |
|---|---|---|---|---|---|
| eye colour | blue | 71 (100%) | 214 (100%) | 0 (0%) | 0 (0%) |
| eye colour | green | 0 (0%) | 0 (0%) | 14 (52%) | 46 (55%) |
| eye colour | brown | 0 (0%) | 0 (0%) | 13 (48%) | 38 (45%) |
| hair colour | blond | 46 (65%) | 145 (68%) | 11 (41%) | 28 (33%) |
| hair colour | brown | 23 (32%) | 64 (30%) | 16 (59%) | 45 (54%) |
| hair colour | black | 2 (3%) | 5 (2%) | 0 (0%) | 11 (13%) |
| skin type | type I | 8 (11%) | 6 (3%) | 2 (7%) | 1 (1%) |
| skin type | type II | 35 (49%) | 69 (32%) | 13 (48%) | 9 (11%) |
| skin type | type III | 27 (38%) | 135 (63%) | 10 (37%) | 67 (80%) |
| skin type | type IV | 1 (1%) | 4 (2%) | 2 (7%) | 7 (8%) |

Results were obtained from all 400 individuals from stages 1 and 2, except four individuals with pure red hair colour who were removed from the analysis.
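The phrase "additive allelic model on the logit scale" means that each copy of an allele shifts the log-odds of the phenotype by a fixed amount. The following Python sketch illustrates this with Supplementary Table 10b. The intercept and per-allele effects are assumptions back-calculated here from the rounded probabilities in the table (the 0.25, 0.46, 0.48 and 0.14 entries); they are not the fitted coefficients of the original model, and the function names are hypothetical.

```python
import math


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))


# Illustrative values derived from rounded table entries, NOT the authors' estimates.
intercept = logit(0.25)                      # rs1805007 CC, rs1805008 CC, rs4778138 AA
beta_rs1805007_T = logit(0.46) - intercept   # one T allele at rs1805007
beta_rs1805008_T = logit(0.48) - intercept   # one T allele at rs1805008
beta_rs4778138_G = logit(0.14) - intercept   # one G allele at rs4778138


def prob_fair_skin(t_rs1805007, t_rs1805008, g_rs4778138):
    """Probability of skin types I-II given allele counts (0, 1 or 2) at the three SNPs."""
    eta = (intercept
           + beta_rs1805007_T * t_rs1805007
           + beta_rs1805008_T * t_rs1805008
           + beta_rs4778138_G * g_rs4778138)
    return inv_logit(eta)


# Genotype CC / CT / AG: the table lists 0.31.
print(round(prob_fair_skin(0, 1, 1), 2))
# Genotype CC / CC / GG: the table lists 0.074.
print(round(prob_fair_skin(0, 0, 2), 3))
```

Because the inputs are rounded, other cells of the table are reproduced only approximately (typically to within about 0.01).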
Supplementary Table 12: Association between SNP genotype and blue eye colour (all 400 individuals of stage 1 and 2)

| SNP-ID (gene) | ref. | fav. | Regression p value: simple | multipleᵃ |
|---|---|---|---|---|
| rs12913832 (HERC2) | A | G | <1.0×10⁻⁵ | <1.0×10⁻⁵ (1.2×10⁻⁴) |
| rs916977 (HERC2) | A | G | <1.0×10⁻⁵ | n.s. |
| rs1800407 (OCA2) | G | A | n.s. | 0.0072 (0.086) |
| rs7495174 (OCA2) | G | A | <1.0×10⁻⁵ | 0.0047 (0.056) |
| rs4778138 (OCA2) | G | A | <1.0×10⁻⁵ | n.s. |
| rs4778241 (OCA2) | A | C | <1.0×10⁻⁵ | n.s. |
| rs1805007 (MC1R) | - | - | n.s. | n.s. |
| rs1805008 (MC1R) | - | - | n.s. | n.s. |
| rs1805009 (MC1R) | - | - | n.s. | n.s. |
| rs12203592 (IRF4) | C | T | n.s. | 0.030 (0.36) |
| rs12896399 (SLC24A4) | - | - | n.s. | n.s. |
| rs1393350 (TYR) | - | - | n.s. | n.s. |

ref.: reference allele, fav.: favourable allele, n.s.: not significant. ᵃFigures in brackets are p values after Bonferroni adjustment (12 SNPs tested). All SNPs were analysed using an additive allelic model on the logit scale. SNPs included in the final prediction model are printed in bold (see Table S17).

Supplementary Table 13: Association between SNP genotype and red tint (all 400 individuals of stage 1 and 2)

| SNP-ID (gene) | ref. | fav. | Regression p value: simple | multiple§ |
|---|---|---|---|---|
| rs12913832 (HERC2) | - | - | n.s. | n.s. |
| rs916977 (HERC2) | - | - | n.s. | n.s. |
| rs1800407 (OCA2) | - | - | n.s. | n.s. |
| rs7495174 (OCA2) | - | - | n.s. | n.s. |
| rs4778138 (OCA2) | - | - | n.s. | n.s. |
| rs4778241 (OCA2) | - | - | n.s. | n.s. |
| rs1805007 (MC1R) | C | T | <1.0×10⁻⁵ | <1.0×10⁻⁵ (1.2×10⁻⁴) |
| rs1805008 (MC1R) | C | T | <1.0×10⁻⁵ | <1.0×10⁻⁵ (1.2×10⁻⁴) |
| rs1805009 (MC1R) | - | - | n.s. | n.s. |
| rs12203592 (IRF4) | - | - | n.s. | n.s. |
| rs12896399 (SLC24A4) | - | - | n.s. | n.s. |
| rs1393350 (TYR) | - | - | n.s. | n.s. |

ref.: reference allele, fav.: favourable allele, n.s.: not significant. §Figures in brackets are p values after Bonferroni adjustment (12 SNPs tested). The p value for the interaction between rs1805007 and rs1805008 equalled 0.023 (0.28). All SNPs were analyzed using an additive allelic model on the logit scale. SNPs included in the final prediction model are printed in bold (see Table S17).
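The bracketed figures in these tables follow from the Bonferroni adjustment for 12 tested SNPs: the raw p value is multiplied by the number of tests and capped at 1. A minimal Python sketch, with a hypothetical function name, reproduces some of the bracketed values in Supplementary Tables 12 and 13:

```python
def bonferroni(p_value, n_tests=12):
    """Bonferroni-adjusted p value: raw p times the number of tests, capped at 1."""
    return min(1.0, p_value * n_tests)


# Raw p values from the multiple regression columns of Tables 12 and 13.
for raw in (0.0072, 0.0047, 0.030, 0.023):
    print(f"p = {raw}: adjusted p = {bonferroni(raw):.2g}")
```

With 12 tests, a raw p value must fall below roughly 0.0042 for the adjusted value to stay under 0.05.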
Supplementary Table 14: Association between SNP genotype and the light-dark component of hair colour (all 400 individuals of stage 1 and 2)

| SNP-ID (gene) | ref. | fav. | blond vs. non-blond: simple | multiple§ | ordinal regression: simple | multiple§ | linear regression: simple | multiple§ |
|---|---|---|---|---|---|---|---|---|
| rs12913832 (HERC2) | A | G | <1.0×10⁻⁵ | <1.0×10⁻⁵ (1.2×10⁻⁴) | <1.0×10⁻⁵ | <1.0×10⁻⁵ (1.2×10⁻⁴) | <1.0×10⁻⁵ | <1.0×10⁻⁵ (1.2×10⁻⁴) |
| rs916977 (HERC2) | A | G | 1.0×10⁻⁴ | n.s. | <1.0×10⁻⁵ | n.s. | <1.0×10⁻⁵ | n.s. |
| rs1800407 (OCA2) | - | - | n.s. | n.s. | n.s. | n.s. | n.s. | n.s. |
| rs7495174 (OCA2) | G | A | 6.0×10⁻⁴ | n.s. | <1.0×10⁻⁵ | 0.035 (0.42) | <1.0×10⁻⁵ | n.s. |
| rs4778138 (OCA2) | G | A | 5.0×10⁻⁴ | n.s. | <1.0×10⁻⁵ | n.s. | <1.0×10⁻⁵ | 0.024 (0.29) |
| rs4778241 (OCA2) | A | C | 0.0033 | n.s. | 2.0×10⁻⁵ | n.s. | <1.0×10⁻⁵ | n.s. |
| rs1805007 (MC1R) | - | - | n.s. | n.s. | n.s. | n.s. | n.s. | n.s. |
| rs1805008 (MC1R) | C | T | n.s. | n.s. | 1.0×10⁻⁴ | 3.0×10⁻⁵ (3.6×10⁻⁴) | 4.0×10⁻⁴ | 2.0×10⁻⁴ (0.0024) |
| rs1805009 (MC1R) | G | C | n.s. | n.s. | 0.034 | 0.037 (0.44) | 0.028 | 0.049 (0.59) |
| rs12203592 (IRF4) | T | C | <1.0×10⁻⁵ | <1.0×10⁻⁵ (1.2×10⁻⁴) | <1.0×10⁻⁵ | <1.0×10⁻⁵ (1.2×10⁻⁴) | <1.0×10⁻⁵ | <1.0×10⁻⁵ (1.2×10⁻⁴) |
| rs12896399 (SLC24A4) | G | T | 6.0×10⁻⁴ | 0.0030 (0.036) | 5.0×10⁻⁴ | 0.0041 (0.49) | 2.0×10⁻⁴ | 0.0032 (0.038) |
| rs1393350 (TYR) | - | - | n.s. | n.s. | n.s. | n.s. | n.s. | n.s. |

ref.: reference allele, fav.: favourable allele for light hair colour, n.s.: not significant. §Figures in brackets are p values after Bonferroni adjustment (12 SNPs tested). Shown are the p values for the analysis of blond vs. non-blond, ordinal regression analysis of the nine hair types and linear regression analysis treating the nine hair types as a quantitative variable. All SNPs were analyzed using an additive allelic model on the logit scale. Two individuals with pure red hair colour were removed from each of stages 1 and 2. SNPs included in the final prediction model are printed in bold (see Table S17).
Supplementary Table 15: Association between SNP genotype and skin colour (all 400 individuals of stage 1 and 2)

| SNP-ID (gene) | ref. | fav. | types I-II vs. types III-IV: simple | multiple§ | ordinal regression: simple | multiple§ |
|---|---|---|---|---|---|---|
| rs12913832 (HERC2) | A | G | 0.0022 | n.s. | 5.0×10⁻⁵ | 0.0062 (0.074) |
| rs916977 (HERC2) | A | G | 0.014 | n.s. | 0.0031 | n.s. |
| rs1800407 (OCA2) | - | - | n.s. | n.s. | n.s. | n.s. |
| rs7495174 (OCA2) | G | A | 0.0034 | 2.0×10⁻⁴ (0.0024) | 7.8×10⁻⁴ | n.s. |
| rs4778138 (OCA2) | G | A | 0.0016 | n.s. | <1.0×10⁻⁵ | 0.0013 (0.016) |
| rs4778241 (OCA2) | A | C | 8.0×10⁻⁴ | n.s. | 1.9×10⁻⁴ | n.s. |
| rs1805007 (MC1R) | C | T | 6.0×10⁻⁴ | 1.0×10⁻⁴ (0.0012) | 2.4×10⁻⁴ | 9.0×10⁻⁵ (0.0011) |
| rs1805008 (MC1R) | C | T | <1.0×10⁻⁵ | 1.0×10⁻⁵ (1.2×10⁻⁴) | 2.0×10⁻⁴ | 1.0×10⁻⁵ (1.2×10⁻⁴) |
| rs1805009 (MC1R) | - | - | n.s. | n.s. | n.s. | n.s. |
| rs12203592 (IRF4) | C | T | 0.027 | 0.014 (0.17) | 0.040 | 0.024 (0.29) |
| rs12896399 (SLC24A4) | - | - | n.s. | n.s. | n.s. | n.s. |
| rs1393350 (TYR) | - | - | n.s. | n.s. | n.s. | n.s. |

ref.: reference allele, fav.: favourable allele for fair skin type, n.s.: not significant. §Figures in brackets are p values after Bonferroni adjustment (12 SNPs tested). All SNPs were analyzed using an additive allelic model on the logit scale. SNPs included in the final prediction model are printed in bold (see Table S17).

Supplementary Table 16: Associations between dichotomous pigmentation traits (all 400 individuals of stage 1 and 2)

Pigmentation trait p (padj) eye colour (blue) p (padj)ᵃ hair colour (blond) <1.0×10⁻⁵ (3.0×10⁻⁵) hair colour (red tint) 8.0×10⁻⁴ (0.0024) skin type (fair, I-II) <1.0×10⁻⁵ (3.0×10⁻⁵) hair colour (blond)ᵇ 4.0×10⁻⁴ (1.2×10⁻³) eye colour (blue) skin type (fair, I-II) <1.0×10⁻⁵ (3.0×10⁻⁵) 0.0066 (0.020) skin type (fair, I-II)ᵈ 0.00527 (0.016) 0.0025 (0.0075)ᶜ eye colour (blue) hair colour (red) hair colour (blond) <1.0×10⁻⁵ (3.0×10⁻⁵) 0.00417 (0.013) 9.4×10⁻⁴ (0.0028)ᶜ 0.00169 (0.0051)ᵈ 2.5×10⁻⁴ (7.5×10⁻⁴)ᵈ 0.015 (0.045)ᵈ

padj: Bonferroni-adjusted p value (three tests).
All associations were evaluated in a multiple logistic regression analysis treating the remaining three traits as influential variables, and by backward selection. ᵃP values and ORs are from a multiple logistic regression analysis including the selected SNPs from Tables S12-S15 (cf. also Table S17) as influential variables. ᵇIn the analysis of the light-dark component of hair colour, two individuals with pure red hair colour were removed from each of stages 1 and 2. ᶜMultiple logistic regression analysis eventually disregarded rs1805008 (MC1R) after backward selection. ᵈMultiple logistic regression analysis eventually disregarded rs7495174 (OCA2) and rs4778138 (OCA2) after backward selection.

Supplementary Table 17: Capability of selected models to predict dichotomous pigmentation phenotypes (all 400 individuals of stage 1 and 2)

| Phenotype | Prev. | Predictors | Statistic | Sensitivity | Specificity | Accuracy | AUC |
|---|---|---|---|---|---|---|---|
| eye colour (blue) | 72 | rs12913832 (HERC2) | median (range) | 91 (80-93) | 80 (29-91) | 86 (78-93) | 84 (60-92) |
| | | | all (95% CI) | 90 (86-93) | 90 (86-93) | 85 (82-89) | 82 (77-86) |
| eye colour (blue) | 72 | rs12913832 (HERC2), hair colour (blond) | median (range) | 91 (80-93) | 80 (29-91) | 86 (78-93) | 84 (60-92) |
| | | | all (95% CI) | 90 (86-93) | 72 (64-80) | 85 (82-88) | 85 (81-90) |
| hair colour (red tint) | 26 | rs1805007 (MC1R), rs1805008 (MC1R) | median (range) | 23 (9-88) | 97 (83-100) | 73 (70-95) | 71 (52-90) |
| | | | all (95% CI) | 39 (29-49) | 92 (89-95) | 78 (75-82) | 72 (67-78) |
| hair colour (red tint) | 26 | rs1805007 (MC1R), rs1805008 (MC1R), skin type (fair, I-II) | median (range) | 27 (0-63) | 93 (81-100) | 75 (73-85) | 76 (60-88) |
| | | | all (95% CI) | 28 (21-37) | 95 (93-97) | 76 (71-82) | 77 (71-82) |
| hair colour (blond) | 58 | rs12913832 (HERC2), rs1805008 (MC1R), rs12203592 (IRF4), rs12896399 (SLC24A4) | median (range) | 78 (71-91) | 47 (21-65) | 68 (55-75) | 69 (54-82) |
| | | | all (95% CI) | 82 (77-87) | 47 (39-54) | 67 (63-71) | 70 (65-76) |
| hair colour (blond) | 58 | rs12913832 (HERC2), rs12203592 (IRF4), rs12896399 (SLC24A4), eye colour (blue), skin type (fair, I-II) | median (range) | 79 (30-94) | 60 (33-69) | 69 (40-75) | 73 (44-88) |
| | | | all (95% CI) | 80 (74-85) | 54 (47-61) | 69 (65-73) | 74 (70-79) |
| skin type (fair, I-II) | 36 | rs1805008 (MC1R), rs1805007 (MC1R), rs7495174 (OCA2), rs4778138 (OCA2) | median (range) | 86 (55-93) | 45 (19-61) | 71 (53-85) | 66 (47-84) |
| | | | all (95% CI) | 84 (80-89) | 42 (34-50) | 69 (65-73) | 67 (62-73) |
| skin type (fair, I-II) | 36 | rs1805008 (MC1R), rs1805007 (MC1R), eye colour (blue), hair colour (red), hair colour (blond) | median (range) | 88 (72-100) | 44 (22-67) | 73 (58-80) | 68 (59-88) |
| | | | all (95% CI) | 87 (83-91) | 48 (39-55) | 73 (68-78) | 73 (68-78) |

Prev.: prevalence, AUC: area under the receiver operating characteristic curve, median (range): median and range of ten cross validation samples, all (95% CI): values for all 400 individuals with 95% confidence interval in brackets. Two individuals with pure red hair colour were removed from each of stages 1 and 2 for the analysis of blond hair colour. All numerical figures are percentages.
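The performance measures in Supplementary Table 17 are linked by a standard identity: accuracy is the prevalence-weighted average of sensitivity and specificity. The Python sketch below checks this for two rows of the table, using the rounded "all" percentages as inputs; the function name is an illustrative assumption, and because the inputs are rounded the identity holds only approximately.

```python
def accuracy(sensitivity, specificity, prevalence):
    """Overall accuracy as the prevalence-weighted mean of sensitivity and specificity.

    All arguments are proportions between 0 and 1.
    """
    return prevalence * sensitivity + (1.0 - prevalence) * specificity


# Blond hair, SNP-only model: sensitivity 82%, specificity 47%, prevalence 58%.
blond_accuracy = accuracy(0.82, 0.47, 0.58)
print(f"blond hair: {100 * blond_accuracy:.0f}%")   # table lists 67

# Blue eye colour, model with rs12913832 and blond hair:
# sensitivity 90%, specificity 72%, prevalence 72%.
eye_accuracy = accuracy(0.90, 0.72, 0.72)
print(f"blue eyes: {100 * eye_accuracy:.0f}%")      # table lists 85
```

The identity also explains why a model for a rare phenotype such as red tint can reach a high accuracy despite a low sensitivity: the large share of unaffected individuals is dominated by the specificity term.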