Citation: Van Geystelen A., Decorte R., Larmuseau M.H.D. (2013), Updating the Y-chromosomal phylogenetic tree for forensic applications based on whole genome SNPs. Forensic Science International: Genetics, 7 (6), 573-580.
Archived version: Author manuscript; the content is identical to the content of the published paper, but without the final typesetting by the publisher.
Published version: http://www.sciencedirect.com/science/article/pii/S1872497313000896
Journal homepage: http://www.fsigenetics.com/home
Author contact: maarten.larmuseau@bio.kuleuven.be
IR: https://lirias.kuleuven.be/handle/123456789/395607

Abstract

The Y-chromosomal phylogenetic tree has a wide variety of important forensic applications and therefore needs to be state-of-the-art. However, since the last 'official' published tree, many publications have reported additional Y-chromosomal lineages and alternative phylogenetic topologies. It is therefore difficult for forensic scientists to interpret those reports and to use an up-to-date tree and corresponding nomenclature in their daily work. Whole genome sequencing (WGS) data is useful to verify and optimize the current phylogenetic tree for haploid markers. The AMY-tree software is the first open access program which analyses WGS data for Y-chromosomal phylogenetic applications. Here, all published information is collected in one phylogenetic tree and the correctness of this tree is checked based on the first large analysis of 747 WGS samples with AMY-tree. The result is a single phylogenetic tree containing all peer-reviewed reported Y-SNPs, without the observed recurrent and ambiguous mutations.
Nevertheless, the results showed that genomes are currently available for only a limited set of Y-chromosomal (sub-)haplogroups and that many newly reported Y-SNPs based on WGS projects are false positives, even with high sequencing coverage methods. This study demonstrates the usefulness of AMY-tree in checking the quality of the present Y-chromosomal tree and it accentuates the difficulties of enlarging this tree based on WGS methods alone.

Keywords

Haploid markers, Y-chromosome, phylogenetic tree, bio-informatics, Whole-genome SNP calling, Y-SNPs

Introduction

A state-of-the-art phylogeny of the human Y-chromosome based on bi-allelic polymorphisms is an essential tool for forensic genetics. Forensic scientists take advantage of the Y-chromosomal phylogenetic tree in their daily work, e.g. by checking the quality of datasets or by assigning geographical landscapes to specific lineages [1, 2]. Y-chromosomal single nucleotide polymorphisms (Y-SNPs) have a great capacity for detecting geographical origins, as many lineages defined by Y-SNPs show a strong continent-specific [3, 4] and even intra-continent-specific distribution [5-7]. Their usefulness is illustrated by the fact that Y-SNP data are now also included in Y-chromosomal forensic databases, such as the YHRD database [8]. Forensic applications therefore require an up-to-date, extended Y-chromosomal phylogeny based on these bi-allelic markers, which are preferably unambiguous and non-recurrent but have a high discrimination power. Since the publication of the latest 'official' Y-chromosomal phylogenetic tree by Karafet et al. [4], a continuous wave of new peer-reviewed articles reporting changes to this tree has been published.
These changes include a new root and new basal clades [9, 10], modifications of the global backbone [3, 11], different phylogenetic topologies within a haplogroup [12-14], newly described sub-haplogroups [15-17], or other phylogenetic positions for a certain mutation [18]. As these publications are not coordinated, different names are given to Y-chromosomal lineages whose phylogenetic position is reported in different topologies. As a result, the Y-chromosomal tree as currently reported across the literature is not clear, which makes it difficult for forensic researchers to use a uniform phylogenetic tree. Hence, new initiatives are needed to ensure more continuity in the reporting of the most recent Y-chromosomal phylogenetic tree.

Large whole genome sequencing (WGS) projects such as the 1000 Genomes project [19, 20] bring an opportunity to introduce the required uniformity in the reporting of the haploid Y-chromosomal tree. The analysis of whole Y-chromosomes within male genomes allows verification and optimization of the currently used phylogenetic tree. WGS data has already proved to be useful in verifying and optimizing the phylogeny of the other haploid marker in the human genome, i.e. the mitochondrial DNA (mtDNA). Relevant ambiguous markers and back-mutations which influence the interpretation of previous forensic and evolutionary genetic studies were detected based on these data [21]. Recently, a new Y-chromosomal phylogenetic tree was built after a tabula rasa of the present Y-chromosomal tree by using only Y-SNPs from available WGS male samples [22]. By comparing this new phylogenetic tree with the currently used one, that study confirmed the backbone of the currently used phylogenetic tree. However, this new tree is not useful for forensic research because there is no link between currently used and newly reported lineages. Furthermore, the set of genomes used is not a good representation of all existing Y-chromosomal (sub-)haplogroups and geographical regions.
There are also still too many false positive SNP calls in this WGS dataset. Alternatively, the AMY-tree software is the first open access program which academics and forensic professionals can use to verify and optimize the currently used Y-chromosomal tree by using WGS data [23]. The first AMY-tree analysis was based on 118 WGS samples and already proved its usefulness for verifying and optimizing the present Y-chromosomal phylogenetic tree [23].

The aim of this study is to perform the largest reported screening of male genomes for Y-chromosomal phylogenetic applications based on the AMY-tree software. Firstly, we want to merge all new Y-SNPs from peer-reviewed publications that appeared since the latest 'official' Y-chromosomal phylogeny [4] into one single tree which is useful for forensic applications. Secondly, this updated Y-chromosomal phylogenetic tree needs to be checked for recurrent mutations, ambiguous SNPs and other difficulties for the (forensic) application of the tree. Thirdly, this study wants to find out for which Y-chromosomal (sub-)haplogroups WGS data is already available. Finally, the study investigates the possibilities of enlarging the Y-chromosomal phylogenetic tree based on the current Y-SNP detections in WGS data.

Materials and methods

Updated phylogenetic tree

The latest updated phylogenetic tree of the Y-chromosome as published by Van Geystelen et al. [23] was manually updated based on recent descriptions of new Y-SNPs in academic research papers such as Pamjav et al. [16] and Scozzari et al. [9]. As the exact phylogenetic position of a few new Y-SNPs was not given, their position had to be determined from the AMY-tree results of all WGS samples. Next, recurrent mutations, ambiguous SNP loci and wrongly defined mutation conversions within the newly updated Y-chromosomal tree were also ascertained based on those AMY-tree results.

WGS Y-SNPs dataset

In order to check the manually updated phylogenetic tree and to optimise the AMY-tree software, a large dataset of whole genome Y-SNP calls was assembled. This dataset consists of 747 samples which represent 660 males, as several genomes were analysed in different projects. The dataset includes the genomes of eight males whose father's genome was also sequenced. The SNP calls were collected from four large WGS projects and several individual genome projects (Supplementary Materials Table S1). These projects differ in the next-generation sequencing (NGS) platforms used and in sequencing coverage. First, Complete Genomics made the SNP calls of 35 whole genomes of males available (http://www.completegenomics.com/public-data/69-Genomes/, 04 Jan 2013); those genomes were sequenced with a high sequencing coverage on the Complete Genomics Analysis (CGA) Platform [24]. Second, the Personal Genome Project (PGP) and the Singapore Sequencing Malay Project (SSMP) also used this CGA platform. PGP is a project started to obtain and openly share human genome sequences in combination with health information. At the time of access, 40 male genomes were available (www.personalgenomes.org, 04 Jan 2013). The SSMP, on the other hand, aims to characterize the polymorphic variants in the population of Malays, an Austronesian group present in Southeast Asia and Oceania. Recently, the Y-SNP calls of 46 Malays were made publicly available [25]. Next, the 1000 Genomes Project aims to provide a comprehensive resource on human genetic variation by sequencing more than 1000 human genomes. In 2010, SNP calls of 77 males were made available in the pilot phase [20] and two years later a set of 526 SNP profiles was published as a result of phase 1 of the project [19]. As the 1000 Genomes project aims to sequence a large number of people, the sequencing coverage was lower than in the other projects.
Finally, 23 additional samples were collected from several single genome projects [26-35, 36 and unpublished genomes of Guy Froyen].

AMY-tree modifications

Several modifications to the AMY-tree software version 1.0 [23] were made for the assessment of the SNP calling quality in WGS data. This was necessary as the quality of the SNP calling influences the AMY-tree analysis of a sample and therefore also the interpretation of the result of the analysis [23]. The extra quality assessment is based on the results of the first AMY-tree run of a given sample. This assessment assumes that the phylogenetic tree used is correct and that the haplogroup assigned after the first AMY-tree run is the actual haplogroup of that sample. The algorithm for the extra quality assessment is simple and comprehensible, as shown in Figure 1. First, all Y-SNPs of the phylogenetic tree are selected, unless the haplogroup determined in the first run is a paragroup (e.g. R1b1b2*). In the case of a paragroup, all Y-SNPs which are in sub-nodes of the main group of this paragroup are excluded from the selection. This is done to limit the influence of false positive SNP calls: when the haplogroup is not assigned at the correct phylogenetic level, too many false positive and/or false negative mutant Y-SNPs would be detected. Next, the expected state of all selected Y-SNPs is determined, whereby all Y-SNPs on the path from the assigned sub-haplogroup to the root are expected to be mutant and all others are expected to be ancestral. The state of each Y-SNP is also determined in the reference genome. In the next step, these expected states are converted into expectations of 'called' or 'not called', based on the expected state and the state in the reference genome. For a Y-SNP on the path from the sub-haplogroup to the root: if its state in the reference sequence is mutant, the Y-SNP is expected to be 'not called'; otherwise it is expected to be 'called'.
For a Y-SNP not on the path from the sub-haplogroup to the root: if its state in the reference sequence is mutant, the Y-SNP is expected to be 'called'; otherwise it is expected to be 'not called'. Then, these expectations are compared to the observed state in the sample, which yields the number of true positive, false positive, true negative and false negative SNP calls. Finally, the quality of a sample is expressed in several measures: Matthews correlation coefficient (MCC), accuracy, sensitivity, specificity, precision, recall and F1-score (Supplementary method). When the MCC is greater than or equal to 0.95 the SNP calling quality is called excellent, otherwise it is called low. The quality is given in the output file of AMY-tree. When the SNP calling quality is low, the result of the AMY-tree analysis has to be treated with caution due to the high occurrence of false negative and false positive SNPs. When the quality is excellent, the results of AMY-tree are considered to be valuable for the control of the currently used phylogenetic tree of the Y-chromosome and for the increase of its resolution. As such, a better Y-SNP call quality assessment is implemented in AMY-tree version 1.1 compared to the earlier version [23]. Next, when a sample was run in AMY-tree in the 'sufficient' mode, so that the reference genome was taken into account, but the determined haplogroup belongs to R-M269 and the MCC is lower than 0.95, a second AMY-tree run is executed in the 'insufficient' mode. This is important as an MCC lower than 0.95 indicates that the result of the first run is influenced too strongly by the reference genome. Finally, another small modification to AMY-tree is based on the fact that Z381, L2 and L20 are all mutant in the reference genome. Although AMY-tree version 1.0 already had a fairly complex system to filter out the influence of the reference genome on samples belonging to haplogroup R, it was not yet efficient enough.
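The quality assessment described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the AMY-tree implementation: the function names (`expected_call`, `confusion_counts`, `mcc`, `quality_label`), the input data structures and the SNP names in the example are hypothetical; only the called/not-called rules and the 0.95 cut-off are taken from the text.

```python
from math import sqrt


def expected_call(on_path: bool, mutant_in_reference: bool) -> bool:
    """Return True if a SNP call is expected for this Y-SNP.

    on_path: the Y-SNP lies on the path from the assigned sub-haplogroup
             to the root, so the sample is expected to be mutant.
    mutant_in_reference: the reference genome carries the mutant allele.
    A call is expected exactly when the expected state of the sample
    differs from the state in the reference genome.
    """
    return on_path != mutant_in_reference


def confusion_counts(snps, called):
    """Count (TP, FP, TN, FN) for one sample.

    snps:   dict mapping SNP name -> (on_path, mutant_in_reference)
    called: set of SNP names that were actually called in the sample
    """
    tp = fp = tn = fn = 0
    for name, (on_path, mutant_in_reference) in snps.items():
        expected = expected_call(on_path, mutant_in_reference)
        observed = name in called
        if expected and observed:
            tp += 1
        elif observed:
            fp += 1
        elif expected:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient.

    Returning 0.0 for a zero denominator is a common convention and an
    assumption of this sketch.
    """
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def quality_label(value, cutoff=0.95):
    """'excellent' when MCC >= cutoff, otherwise 'low'."""
    return "excellent" if value >= cutoff else "low"


# Toy example with hypothetical SNP names.
toy_snps = {
    "snp_a": (True, False),   # on path, reference ancestral -> call expected
    "snp_b": (True, True),    # on path, reference mutant    -> no call expected
    "snp_c": (False, True),   # off path, reference mutant   -> call expected
    "snp_d": (False, False),  # off path, reference ancestral -> no call expected
}
toy_called = {"snp_a", "snp_c"}
print(quality_label(mcc(*confusion_counts(toy_snps, toy_called))))
```

Expressing the rule as a single inequality (`on_path != mutant_in_reference`) keeps the four cases from the text in one place, which makes it easy to check them against Figure 1.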
In version 1.1, when both Z381 and L2, or both Z381 and L20, appear mutant in the first AMY-tree run (i.e. the ancestral SNP was not called), the sample is handled as insufficient in a second AMY-tree run, so that the Y-SNPs of the reference genome are no longer used when determining the haplogroup. With these modifications, even more certainty is built into AMY-tree version 1.1 in comparison with the previous version [23]: for samples belonging to an R-M269 sub-haplogroup, the reference genome is only taken into account when the SNP calling quality is excellent after the first run in the 'sufficient' mode. The cut-off used to assess the Y-SNP calling quality was optimised on all 747 genomes by performing several test runs and manual analyses of the genomes, and by checking the outcome against the results in Van Geystelen et al. [23] and against the publications of those genomes for which a Y-chromosomal analysis had already been performed.

Y-SNP detecting

The Y-SNPs which are present in the WGS samples but which are not yet included in the updated phylogenetic tree were detected by AMY-tree. Only Y-SNPs from WGS samples with an excellent Y-SNP calling quality were used. That was not the only constraint for the potentially relevant Y-SNPs; they must also be positioned in the unique regions of the Y-chromosome, as read mapping and variant detection difficulties are expected due to the high frequency of repeated sequences on the Y-chromosome. Therefore, Y-SNPs in the pseudoautosomal, heterochromatic, X-transposed and ampliconic segments [37] of the male-specific part of the genome as reported by Wei et al. [22] were excluded.

Results

In total, 747 samples are analysed by the updated version of the AMY-tree software with the updated phylogenetic Y-chromosomal tree. There are 131 samples of 126 individuals with an excellent Y-SNP calling quality, i.e. MCC ≥ 0.95, which are mostly obtained from Complete Genomics and the Personal Genome Project.
The remaining 616 samples have a low calling quality and are mostly obtained from the 1000 Genomes Project pilot and phase 1 (Table S1).

Updated tree for forensic applications

The state-of-the-art Y-chromosomal tree is manually updated based on all published Y-SNPs from academic studies. After the AMY-tree runs of all 747 samples, all Y-SNPs for which no exact phylogenetic position was given in the publications could be included in the updated phylogenetic tree. For example, two new Y-SNPs were reported in [16] within sub-haplogroup R-M198 without clear phylogenetic positions (Figure 2). The results with an excellent SNP calling quality also showed several recurrent Y-SNPs in the phylogenetic tree, which cause multiple haplogroups to be determined for some samples. After ruling out that the recurrent SNPs are sample- or project-specific, these SNPs are removed from the phylogenetic tree; an overview of the three observed recurrent SNPs for which enough evidence was available is given in Table S2. These modifications led to the final updated tree version 1.1, which includes 359 Y-chromosomal lineages and 721 Y-SNP markers. The final tree and the corresponding mutation conversions for all the Y-SNPs in the tree can be found in Tables S3 and S4.

Sub-haplogroup determining

The haplogroups of all 747 samples, as determined by the AMY-tree analysis based on the final updated tree version 1.1, are given in Table S1. For only 14 samples no haplogroup can be determined, and for three other samples multiple haplogroups are determined by AMY-tree. The distribution of the MCC of the 730 samples for which an unambiguous haplogroup is determined is given in Figure 3. Only a minority of the samples has an excellent SNP calling quality with an MCC ≥ 0.95; an MCC lower than 0.95 means that less than 97.5% of the negative and positive predictions are correct. Overall, 17 different haplogroups and 106 sub-haplogroups are present in the dataset.
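One way to read the correspondence between an MCC of 0.95 and 97.5% correct predictions is the following short derivation. It rests on a simplifying assumption that is not stated in the text: balanced classes with symmetric errors, i.e. TP = TN = a and FP = FN = b. For unbalanced data the relation between MCC and accuracy is not this simple.

```latex
\begin{align*}
\mathrm{MCC}
  &= \frac{TP \cdot TN - FP \cdot FN}
          {\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
   = \frac{a^2 - b^2}{(a+b)^2}
   = \frac{a-b}{a+b}, \\
\mathrm{Acc}
  &= \frac{TP + TN}{TP + TN + FP + FN}
   = \frac{a}{a+b}, \\
\mathrm{MCC}
  &= 2\,\mathrm{Acc} - 1
  \quad\Longrightarrow\quad
  \mathrm{MCC} = 0.95 \iff \mathrm{Acc} = 0.975 .
\end{align*}
```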
When considering only the samples of excellent quality, 10 different haplogroups and 47 sub-haplogroups remain. Figure 4 and Table S5 give an overview of those (sub-)haplogroups and their frequencies. The samples of paternally related individuals present in the dataset are of particular interest because they are considered to represent the same Y-chromosome. All the samples of one family with eight members sequenced by Complete Genomics are determined to belong to R-P312*. The paternal grandfather of that family is also analysed in the 1000 Genomes project and there he is determined as P* [P-92R7*]. The first attempt of AMY-tree to determine the sub-haplogroup of that 1000 Genomes sample led to the sub-haplogroup R-L2, which has a higher phylogenetic level than the haplogroups of the Complete Genomics samples. However, as the MCC value of that sample is smaller than 0.95, the sub-haplogroup determination is done again without the influence of the reference genome. This led to the final sub-haplogroup P-92R7*, which has a less accurate phylogenetic level than that of Complete Genomics. Thus, the influence of the reference genome could sometimes cause a phylogenetic level that is too high or too low if the new modifications made in AMY-tree version 1.1 had not been applied.

Detecting new Y-SNPs

The large number of available samples leads to a huge number of newly reported Y-SNPs, i.e. Y-SNPs that are not yet present in the updated phylogenetic tree version 1.1. In total, 108,681 new Y-SNPs are reported in all 660 male genomes; when an individual is analysed in more than one project, only the sample with the highest MCC value is used. The majority of the SNPs appear in only a few samples: 62% appear in only one sample and 16% are present in two samples, as shown in Figure 5A. In the 126 male genomes with an excellent Y-SNP calling quality, 50,430 new Y-SNPs are reported.
These SNPs also mostly occur in very few of the excellent genomes: 57% are unique and 11% appear in two samples. When only the regions within the Y-chromosome which are identified as unique are taken into account, a much lower number of new Y-SNPs is detected. In total, 35,503 new Y-SNPs are reported in the 660 male genomes and 15,208 new Y-SNPs in the genomes with excellent Y-SNP calling quality. The same patterns of occurrence are observed for these Y-SNPs as for the non-filtered Y-SNPs: the majority of the newly reported Y-SNPs appear in only one or two samples of the WGS dataset, as shown in Figure 5B.

The genomes of eight males whose biological father's genome is also sequenced are present in the dataset: one family of eight males including the father, the son and the six grandsons, next to one father-son pair. In the family of eight paternal relatives, 5,155 new SNPs are reported on the whole Y-chromosome and a large number of these SNPs is found in only one of the eight individuals, as shown in Figure 6A. The number of Y-SNPs decreases every time the number of samples in which the Y-SNP occurs increases, except for the occurrence in all-except-one and in all samples of the family. When all Y-SNPs which also occur in any other sample with excellent SNP calling quality are removed, less than 20% of the Y-SNPs remain, but the distribution of the number of Y-SNPs per occurrence stays the same, as the black bars in Figure 6A show. The same comparison is made when only the SNPs in the unique part of the Y-chromosome are selected: the same pattern as with all SNPs is visible in Figure 6B. However, for each number of occurrences the number of truly unique SNPs, i.e. SNPs that do not occur in other genomes outside the family, is much higher.
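The occurrence counts for paternally related genomes can be reproduced with a small script along the following lines. This is a sketch under assumptions, not the analysis code behind Figure 6: it assumes that the new Y-SNP calls of each sample are available as a set of chromosome positions, the function names are hypothetical, and the intervals in `UNIQUE_REGIONS` are placeholder values for illustration only, not the region boundaries reported in [22, 37].

```python
from collections import Counter

# Placeholder (start, end) intervals, inclusive. These are NOT the real
# coordinates of the unique regions; substitute the published boundaries.
UNIQUE_REGIONS = [(100, 200), (500, 900)]


def in_unique_region(pos, regions=UNIQUE_REGIONS):
    """True if a chromosome position falls inside one of the intervals."""
    return any(start <= pos <= end for start, end in regions)


def occurrence_histogram(family_calls, outside_calls=(), unique_only=False):
    """Histogram of new Y-SNPs by number of relatives carrying them.

    family_calls:  dict mapping sample name -> set of SNP positions
    outside_calls: iterable of position sets from unrelated samples;
                   SNPs seen in any of them are dropped
    unique_only:   keep only SNPs located in the unique regions
    Returns a Counter mapping k -> number of SNPs found in exactly k relatives.
    """
    per_snp = Counter()
    for positions in family_calls.values():
        per_snp.update(positions)  # a set contributes each SNP once per sample
    seen_outside = set().union(*outside_calls)
    hist = Counter()
    for pos, k in per_snp.items():
        if unique_only and not in_unique_region(pos):
            continue
        if pos in seen_outside:
            continue
        hist[k] += 1
    return hist


def not_shared_by_all(hist, n_members):
    """Number of SNPs present in at least one but not all relatives."""
    return sum(count for k, count in hist.items() if k < n_members)


# Toy family of three relatives with made-up positions.
toy_family = {
    "father": {110, 150, 300, 600},
    "son_1": {110, 150, 600, 700},
    "son_2": {110, 300, 600},
}
print(dict(occurrence_histogram(toy_family)))
print(dict(occurrence_histogram(toy_family, unique_only=True)))
```

Because paternal relatives are considered to carry the same Y-chromosome, the value returned by `not_shared_by_all` is the quantity of interest when judging how many calls are inconsistent within a family.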
Within the father-son pair, 2,181 new SNPs are reported on the whole Y-chromosome. The difference between the number occurring in only one genome and the number occurring in both genomes is relatively small, also for the proportion of SNPs which do not occur in the other genomes with an excellent Y-SNP quality (Figure 6A). Remarkably, there are more new SNPs present in both samples than in one sample when the Y-SNPs which are not located in the unique region of the Y-chromosome are removed (Figure 6B). The increase in the number of unique Y-SNPs that do not occur in other genomes, as seen with the previous family, is also present in the father-son pair.

Discussion

The present study realises the first large screening of male genomes for phylogenetic applications of the Y-chromosome. Based on this screening of 747 male samples, the AMY-tree software was updated and a state-of-the-art Y-chromosomal phylogenetic tree was established for forensic scientists. Furthermore, recommendations are made for future sequencing projects dealing with a broader selection panel of Y-chromosomal haplogroup samples and for the validation of newly detected Y-SNPs.

First, the large screening of the 747 male Y-chromosome samples revealed that the SNP calling quality of a few samples was overestimated in AMY-tree version 1.0. As these samples belong to sub-haplogroup R-M269, they are very similar to the reference genome which is used to estimate the SNP calling quality of a sample [23]. To remove this overestimation of the SNP calling quality, the influence of the reference genome is excluded for all samples belonging to R-M269 with a low SNP call quality. That is why several modifications are made and implemented in AMY-tree version 1.1.

Second, an update of the currently used Y-chromosomal phylogenetic tree was realised based on the large database of 747 available samples.
At the moment this tree is the most up-to-date tree applicable for forensic geneticists: all Y-SNPs which have been reported in academic publications to date are included and all ambiguous markers are excluded to avoid wrong Y-SNP interpretations (Table S3). As is often the case in the literature, the phylogenetic relationship between new Y-SNPs and earlier reported Y-SNPs in the same (sub-)haplogroup is not given. By using the AMY-tree results it is possible to find out the concrete phylogenetic level of each Y-SNP relative to the other already known SNPs. For example, Pamjav et al. [16] described and genotyped two new Y-SNPs, Z93 and Z280, within R-M198, although the exact phylogenetic positions in relation to the other lineages within R-M198 were not given. The presence of both Z93 and Z280 is checked in all samples belonging to the sub-haplogroup R-M198: Z93 occurred in three R-M198* samples and in none of the other samples. Likewise, Z280 does not occur in any sample except in one R-M198* sample. Therefore, both Z93 and Z280 are placed in the phylogenetic tree as sub-haplogroups of R-M198, as shown in Figure 2. A phylogenetic tree in table format, as described by Van Geystelen et al. [23], is chosen instead of a branching diagram because the tree is very large and will become larger in the future. The table format can be adapted more easily than the diagram and it is also more manageable.

Not only are newly reported Y-SNPs included in the new updated tree, but ambiguous Y-SNPs are also excluded as they can complicate the Y-chromosomal applications for forensic studies. As previously described [38], the most relevant ambiguous Y-SNPs are recurrent SNPs which have a paralogous distribution along the phylogenetic tree and which thus have mutant alleles in at least two independent Y-chromosomal lineages. Based on the screening of the 747 analysed samples, three Y-SNPs are recognized for the first time as recurrent (Table S2).
There are no other indications for recurrent mutations based on the present WGS dataset. Therefore we may assume, in most cases, that males who both have the mutant allele for a Y-SNP used in the updated tree (Table S3) have received this mutant allele from one common ancestor and not by convergent evolution. Next, all Y-SNPs which could not be analysed by WGS for one reason or another are also excluded from the updated tree, although it may be possible to genotype these SNPs correctly with Sanger sequencing methods. For example, none of the one hundred E-M2 samples in the dataset revealed the mutant allele for Y-SNP V95 that is expected based on earlier publications [14, 39]. The reason for this remarkable result may be that V95 is not well validated, but more likely it is the result of poor SNP calling in the Y-chromosomal region around V95 by the current WGS methods. For some Y-SNPs a mutation conversion is found which is different from the reported one. For example, the ancestral and mutant alleles observed for Y-SNP M426 differ from those reported by Rootsi et al. [40]. Since the Y-chromosome has a very complex origin, it also has a lot of non-unique regions which complicate the analysis of WGS data [37]. The current reference genome GRCh37 shows the evolution of the Y-chromosome and its numerous resulting non-unique regions. So the reason for this wrong analysis of M426 may be the position of this SNP in one of the non-unique regions of the Y-chromosome as defined earlier [22, 37]. To have an unambiguous phylogenetic tree, we excluded all the Y-SNPs for which no reliable signal with the WGS methods could be found. In the end, an updated Y-chromosomal tree which includes 359 Y-chromosomal lineages and more than 721 Y-SNP markers (Table S3) is obtained, and this tree is the basic tool to develop and optimize Y-SNP-multiplexes for forensic applications [41, 42].
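The check for recurrent Y-SNPs discussed above can be illustrated with a small sketch. It is not the procedure used in AMY-tree, and all names in it (`ancestors`, `off_path_mutants`, the toy lineages and SNPs) are hypothetical. The assumption is that the tree is given as a child-to-parent map, that each Y-SNP defines exactly one lineage, and that each sample comes with its assigned haplogroup and the set of Y-SNPs observed as mutant. A Y-SNP observed as mutant in a sample whose haplogroup does not descend from the lineage that the SNP defines is counted as a candidate.

```python
def ancestors(node, parent):
    """Return the node itself followed by all its ancestors up to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def off_path_mutants(parent, snp_lineage, samples):
    """Count, per Y-SNP, the samples in which it is mutant off the expected path.

    parent:      dict mapping child lineage -> parent lineage
    snp_lineage: dict mapping SNP name -> lineage that the SNP defines
    samples:     iterable of (assigned_haplogroup, set_of_mutant_snps)
    Returns a dict mapping SNP name -> number of samples supporting recurrence.
    """
    counts = {}
    for haplogroup, mutant_snps in samples:
        path = set(ancestors(haplogroup, parent))
        for snp in mutant_snps:
            lineage = snp_lineage.get(snp)
            # SNPs unknown to the tree are ignored here.
            if lineage is not None and lineage not in path:
                counts[snp] = counts.get(snp, 0) + 1
    return counts


# Toy tree: root -> X, Y; X -> X1, X2 (all names are made up).
toy_parent = {"X": "root", "Y": "root", "X1": "X", "X2": "X"}
toy_snp_lineage = {"s_x": "X", "s_x1": "X1", "s_y": "Y"}
toy_samples = [
    ("X1", {"s_x", "s_x1"}),
    ("X2", {"s_x"}),
    ("Y", {"s_y", "s_x1"}),  # s_x1 is mutant outside lineage X1
    ("Y", {"s_y", "s_x1"}),
]
print(off_path_mutants(toy_parent, toy_snp_lineage, toy_samples))
```

A count of one may simply reflect a SNP calling error, which is why the counts should be inspected across samples and sequencing projects before a Y-SNP is treated as recurrent.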

Third, the distribution of the analysed haplogroups per ancestral continent in the current dataset corresponds with the known distributions [3, 4]. Nevertheless, there is not yet a representative set of all phylogenetic sub-haplogroups available. In total, 17 haplogroups and 106 sub-haplogroups are reported in the analysis of the whole dataset. However, the SNP calling quality is low (MCC < 0.95) for most of the analysed samples, as most data are obtained from the 1000 Genomes project, and therefore the sub-haplogroup determination of these data is not completely reliable. When considering only the samples of excellent quality (MCC ≥ 0.95), 10 haplogroups and 47 sub-haplogroups remain, which corresponds to only 13% of the total number of lineages described so far. Most of the Y-chromosomes are assigned to haplogroups E, O and R. Therefore, once the set of WGS Y-chromosomes is enlarged, the current phylogenetic tree as well as the analysis of Wei et al. [22] to calibrate the tree can be optimized.

Finally, the screening of the dataset also revealed that new in silico and in vitro methods are required to verify new Y-SNPs based on WGS methods. As mentioned earlier [23], a huge number of new Y-SNPs are false positives when the genomes are sequenced with low coverage, and consequently the called SNPs have a low quality. These false positive SNP calls disturb the determination of the correct (sub-)haplogroup of the sample; consequently the AMY-tree software has to correct for them by applying several additional methods in the analysis [23]. The high number of false positive Y-SNPs, detected even in WGS samples with an excellent SNP calling quality, is observed by comparing genomes of paternal relatives.
Within the eight paternally related samples belonging to one family, which are genotyped by Complete Genomics, 949 newly reported Y-SNPs in the full Y-chromosome and 88 in the unique regions of the Y-chromosome are found in at least one but not all family members (Figure 6). Despite the high coverage of these genomes and the excellent SNP calling quality, this is still a very high number of newly reported Y-SNPs, most of which are likely to be false given the mutation rate of the Y-chromosome as calculated from a deep-rooting pedigree [43] and from human-chimpanzee comparisons [44]. A similar conclusion can be drawn from the father-son pair in the dataset, which is also sequenced with a high coverage by Complete Genomics (Figure 6).

The Y-SNP results show that adding new lineages to the Y-chromosomal phylogenetic tree based only on WGS is not straightforward. Therefore, making tabula rasa and building a new tree based on all WGS Y-chromosomes, as done by Wei et al. [22], is not an option when the tree is going to be used in forensics. Each new Y-SNP, and consequently each Y-chromosomal lineage, has to be validated independently from the WGS data before it is added to the updated phylogenetic tree. The validation status of many potential polymorphic Y-SNPs (> 1% in population) is often unclear, but this can be resolved by the sharing of genomic data among genetic genealogists who are interested in finding new Y-SNPs to resolve their particular paternal ancestry [45]. Therefore, this is an area in which closer collaborations between amateurs and forensic academics could prove to be particularly useful [13]. Nevertheless, new in silico methods need to be designed to select good and relevant candidate Y-SNPs for validation.
For example, an interesting criterion is the position of the Y-SNPs: it is more interesting to validate only the SNPs located in the unique regions of the Y-chromosome, as there are many non-unique regions due to the evolutionary history of this chromosome [37]. The lower number of false positive SNP calls in these unique regions in comparison with those in the full Y-chromosome is clearly demonstrated in the father-son pair. Based on the whole Y-chromosome, the number of new Y-SNPs reported in only one of the two samples is higher than the number reported in both samples (Figure 6A), in contrast with the situation based on the unique regions of the Y-chromosome (Figure 6B).

Conclusions

Based on the largest screening of male genomes, with 747 samples in total, the most up-to-date Y-chromosomal phylogenetic tree for forensic applications is compiled. Future publications which report new Y-SNPs have to situate their phylogenetic positions in this tree to guarantee the continuity between old and new publications. At this moment, forensic scientists as well as evolutionary biologists and genetic genealogists are lost in the many reports of newly described Y-chromosomal lineages [46]. Therefore, initiatives such as AMY-tree which optimize the phylogeny based on peer-reviewed publications are required [23]. This is already the case for the mitochondrial genome with the Phylotree initiative of van Oven and Kayser [47]. Nevertheless, to optimize the current updated phylogenetic tree for the human Y-chromosome, more high quality genomes of a broader set of (sub-)haplogroups than the frequent haplogroups E, O and R are required. A greater effort in the validation of reported Y-SNPs by new in silico and in vitro methods is also required.

Acknowledgements

The authors want to thank Tom Wenseleers, Manfred Kayser, Jean-Jacques Cassiman, Tom Havenith, Hendrik Larmuseau and Lucrece Lernout for useful discussions and comments.
Thanks also to Guy Froyen (VIB, KU Leuven), Richard Rocca (independent researcher), Cuiping Pan (Stanford University) and Andreas Keller (Saarland University) for providing us with published and as yet unpublished SNP calls from whole genome sequencing projects. Maarten H.D. Larmuseau is a postdoctoral fellow of the FWO-Vlaanderen (Research Foundation Flanders). This study was funded by the KU Leuven BOF Centre of Excellence Financing on 'Eco and socio evolutionary dynamics' (Project number PF/2010/07) and on 'Centre for Archaeological Sciences 2 (CAS 2) New methods for research in demography and interregional exchange'.

Authors' contributions

Research design & supervision: MHDL; Programming: AVG; Writing: MHDL & AVG; Commenting on manuscript: AVG, RD & MHDL.

Conflict of interest

The authors declare no conflict of interest.

References

[1] M. Kayser, Y-chromosomal markers in forensic genetics, in: R. Rapley and D. Whitehouse, Editors, Molecular Forensics, John Wiley & Sons Ltd: Chichester. 2007, pp. 141-161.
[2] J.M. Butler, Chapter 13 Y-Chromosomal DNA Testing, in: J.M. Butler, Editor, Advanced Topics in Forensic DNA Typing: Methodology, Academic Press: London. 2011, pp. 371-403.
[3] J. Chiaroni, P.A. Underhill, and L.L. Cavalli-Sforza, Y chromosome diversity, human expansion, drift, and cultural evolution, Proceedings of the National Academy of Sciences of the United States of America. 106 (2009) 20174-20179.
[4] T.M. Karafet, F.L. Mendez, M.B. Meilerman, P.A. Underhill, S.L. Zegura, and M.F. Hammer, New binary polymorphisms reshape and increase resolution of the human Y chromosomal haplogroup tree, Genome Research. 18 (2008) 830-838.
[5] M.H.D. Larmuseau, N. Vanderheyden, M. Jacobs, M. Coomans, L. Larno, and R. Decorte, Micro-geographic distribution of Y-chromosomal variation in the central-western European region Brabant, Forensic Science International: Genetics. 5 (2011) 95-99.
[6] F. Cruciani, B. Trombetta, C. Antonelli, R. Pascone, G. Valesini, V. Scalzi, G. Vona, B. Melegh, B. Zagradisnik, G. Assum, et al., Strong intra- and inter-continental differentiation revealed by Y chromosome SNPs M269, U106 and U152, Forensic Science International: Genetics. 5 (2011) E49-E52.
[7] M.H.D. Larmuseau, J. Vanoverbeke, G. Gielis, N. Vanderheyden, H.F.M. Larmuseau, and R. Decorte, In the name of the migrant father - Analysis of surname origin identifies historic admixture events undetectable from genealogical records, Heredity. 109 (2012) 90-95.
[8] S. Willuweit and L. Roewer, Y chromosome haplotype reference database (YHRD): Update, Forensic Science International: Genetics. 1 (2007) 83-87.
[9] R. Scozzari, A. Massaia, E. D'Atanasio, N.M. Myres, U.A. Perego, B. Trombetta, and F. Cruciani, Molecular dissection of the basal clades in the human Y chromosome phylogenetic tree, PLoS One. 7 (2012) e49170.
[10] F. Cruciani, B. Trombetta, A. Massaia, G. Destro-Bisol, D. Sellitto, and R. Scozzari, A revised root for the human Y chromosomal phylogenetic tree: The origin of patrilineal diversity in Africa, American Journal of Human Genetics. 88 (2011) 814-818.
[11] S. Fornarino, M. Pala, V. Battaglia, R. Maranta, A. Achilli, G. Modiano, A. Torroni, O. Semino, and S.A. Santachiara-Benerecetti, Mitochondrial and Y-chromosome diversity of the Tharus (Nepal): a reservoir of genetic variation, BMC Evolutionary Biology. 9 (2009) 154.
[12] F.L. Mendez, T.M. Karafet, T. Krahn, H. Ostrer, H. Soodyall, and M.F. Hammer, Increased resolution of Y chromosome haplogroup T defines relationships among populations of the Near East, Europe, and Africa, Human Biology. 83 (2011) 39-53.
[13] L.M. Sims, D. Garvey, and J. Ballantyne, Improved resolution haplogroup G phylogeny in the Y-chromosome, revealed by a set of newly characterized SNPs, PLoS One. 4 (2009) e5792.
[14] B. Trombetta, F. Cruciani, D. Sellitto, and R. Scozzari, A new topology of the human Y chromosome haplogroup E1b1 (E-P2) revealed through the use of newly characterized binary polymorphisms, PLoS One. 6 (2011) e16073.
[15] M.S. Jota, D.R. Lacerda, J.R. Sandoval, P.P.R. Vieira, S.S. Santos-Lopes, R. Bisso-Machado, V.R. Paixao-Cortes, S. Revollo, C. Paz-Y-Mino, R. Fujita, et al., A new subhaplogroup of native American Y-chromosomes from the Andes, American Journal of Physical Anthropology. 146 (2011) 553-559.
[16] H. Pamjav, T. Feher, E. Nemeth, and Z. Padar, Brief communication: New Y-chromosome binary markers improve phylogenetic resolution within haplogroup R1a1, American Journal of Physical Anthropology. 149 (2012) 611-615.
[17] N.M. Myres, S. Rootsi, A.A. Lin, M. Järve, R.J. King, I. Kutuev, V.M. Cabrera, E.K. Khusnutdinova, A. Pshenichnov, B. Yunusbayev, et al., A major Y-chromosome haplogroup R1b Holocene era founder effect in Central and Western Europe, European Journal of Human Genetics. 19 (2011) 95-101.
[18] S. Yan, C.C. Wang, H. Li, S.L. Li, L. Jin, and the Genographic Consortium, An updated tree of Y-chromosome Haplogroup O and revised phylogenetic positions of mutations P164 and PK4, European Journal of Human Genetics. 19 (2011) 1013-1015.
[19] The 1000 Genomes Project Consortium, An integrated map of genetic variation from 1,092 human genomes, Nature. 491 (2012) 56-65.
[20] D.L. Altshuler, R.M. Durbin, G.R. Abecasis, D.R. Bentley, A. Chakravarti, A.G. Clark, F.S. Collins, F.M. De la Vega, P. Donnelly, M. Egholm, et al., A map of human genome variation from population-scale sequencing, Nature. 467 (2010) 1061-1073.
[21] A.T. Duggan and M. Stoneking, A highly unstable recent mutation in human mtDNA, American Journal of Human Genetics. 92 (2013) 279-284.
[22] W. Wei, Q. Ayub, Y. Chen, S. McCarthy, Y. Hou, I. Carbone, Y. Xue, and C. Tyler-Smith, A calibrated human Y-chromosomal phylogeny based on resequencing, Genome Research. 23 (2013) 388-395.
[23] A. Van Geystelen, R. Decorte, and M.H.D. Larmuseau, AMY-tree: an algorithm to use whole genome SNP calling for Y chromosomal phylogenetic applications, BMC Genomics. 14 (2013) 101.
[24] R. Drmanac, A.B. Sparks, M.J. Callow, A.L. Halpern, N.L. Burns, B.G. Kermani, P. Carnevali, I. Nazarenko, G.B. Nilsen, G. Yeung, et al., Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays, Science. 327 (2010) 78-81.
[25] L.-P. Wong, R.T.-H. Ong, W.-T. Poh, X. Liu, P. Chen, R.Q. Li, K.K.-Y. Lam, N.E. Pillai, K.-S. Sim, H. Xu, et al., Deep whole-genome sequencing of 100 Southeast Asian Malays, American Journal of Human Genetics. 92 (2013) 1-15.
[26] S.C. Schuster, W. Miller, A. Ratan, L.P. Tomsho, B. Giardine, L.R. Kasson, R.S. Harris, D.C. Petersen, F.Q. Zhao, J. Qi, et al., Complete Khoisan and Bantu genomes from southern Africa, Nature. 463 (2010) 943-947.
[27] P. Tong, J.G.D. Prendergast, A.J. Lohan, S.M. Farrington, S. Cronin, N. Friel, D.G. Bradley, O. Hardiman, A. Evans, J.F. Wilson, et al., Sequencing and analysis of an Irish human genome, Genome Biology. 11 (2010) R91.
[28] A. Keller, A. Graefen, M. Ball, M. Matzas, V. Boisguerin, F. Maixner, P. Leidinger, C. Backes, R. Khairat, M. Forster, et al., New insights into the Tyrolean Iceman's origin and phenotype as inferred by whole-genome sequencing, Nature Communications. 3 (2012) 698.
[29] R. Chen, G.I. Mias, J. Li-Pook-Than, L.H. Jiang, H.Y.K. Lam, R. Chen, E. Miriami, K.J. Karczewski, M. Hariharan, F.E. Dewey, et al., Personal omics profiling reveals dynamic molecular and medical phenotypes, Cell. 148 (2012) 1293-1307.
[30] J.M. Rothberg, W. Hinz, T.M. Rearick, J. Schultz, W. Mileski, M. Davey, J.H. Leamon, K. Johnson, M.J. Milgrew, M. Edwards, et al., An integrated semiconductor device enabling nonoptical genome sequencing, Nature. 475 (2011) 348-352.
[31] D. Pushkarev, N.F. Neff, and S.R. Quake, Single-molecule sequencing of an individual human genome, Nature Biotechnology. 27 (2009) 847-850.
[32] M. Rasmussen, Y.R. Li, S. Lindgreen, J.S. Pedersen, A. Albrechtsen, I. Moltke, M. Metspalu, E. Metspalu, T. Kivisild, R. Gupta, et al., Ancient human genome sequence of an extinct Palaeo-Eskimo, Nature. 463 (2010) 757-762.
[33] S.M. Ahn, T.H. Kim, S. Lee, D. Kim, H. Ghang, D.S. Kim, B.C. Kim, S.Y. Kim, W.Y. Kim, C. Kim, et al., The first Korean genome sequence and analysis: Full genome sequencing for a socio-ethnic group, Genome Research. 19 (2009) 1622-1629.
[34] D.A. Wheeler, M. Srinivasan, M. Egholm, Y. Shen, L. Chen, A. McGuire, W. He, Y.J. Chen, V. Makhijani, G.T. Roth, et al., The complete genome of an individual by massively parallel DNA sequencing, Nature. 452 (2008) 872-U5.
[35] J. Wang, W. Wang, R.Q. Li, Y.R. Li, G. Tian, L. Goodman, W. Fan, J.Q. Zhang, J. Li, J.B. Zhang, et al., The diploid genome sequence of an Asian individual, Nature. 456 (2008) 60-U1.
[36] B.A. Peters, B.G. Kermani, A.B. Sparks, O. Alferov, P. Hong, A. Alexeev, Y. Jiang, F. Dahl, Y.T. Tang, J. Haas, et al., Accurate whole-genome sequencing and haplotyping from 10 to 20 human cells, Nature. 487 (2012) 190-195.
[37] H. Skaletsky, T. Kuroda-Kawaguchi, P.J. Minx, H.S. Cordum, L. Hillier, L.G. Brown, S. Repping, T. Pyntikova, J. Ali, T. Bieri, et al., The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes, Nature. 423 (2003) 825-U2.
[38] S.M. Adams, T.E. King, E. Bosch, and M.A. Jobling, The case of the unreliable SNP: recurrent back-mutation of Y-chromosomal marker P25 through gene conversion, Forensic Science International. 159 (2006) 14-20.
[39] B. Trombetta, F. Cruciani, P.A. Underhill, D. Sellitto, and R. Scozzari, Footprints of X-to-Y gene conversion in recent human evolution, Molecular Biology and Evolution. 27 (2010) 714-725.
[40] S. Rootsi, N.M. Myres, A.A. Lin, M. Järve, R.J. King, I. Kutuev, V.M. Cabrera, E.K. Khusnutdinova, K. Varendi, H. Sahakyan, et al., Distinguishing the co-ancestries of haplogroup G Y-chromosomes in the populations of Europe and the Caucasus, European Journal of Human Genetics. 20 (2012) 1275-1282.
[41] S. Caratti, S. Gino, C. Torre, and C. Robino, Subtyping of Y-chromosomal haplogroup E-M78 (E1b1b1a) by SNP assay and its forensic application, International Journal of Legal Medicine. 123 (2009) 357-360.
[42] C. Bouakaze, C. Keyser, S. Amory, E. Crubézy, and B. Ludes, First successful assay of Y-SNP typing by SNaPshot minisequencing on ancient DNA, International Journal of Legal Medicine. 121 (2007) 493-499.
[43] Y.L. Xue, Q.J. Wang, Q. Long, B.L. Ng, H. Swerdlow, J. Burton, C. Skuce, R. Taylor, Z. Abdellah, Y.L. Zhao, et al., Human Y chromosome base-substitution mutation rate measured by direct sequencing in a deep-rooting pedigree, Current Biology. 19 (2009) 1453-1457.
[44] Y. Kuroki, A. Toyoda, H. Noguchi, T.D. Taylor, T. Itoh, D.S. Kim, D.W. Kim, S.H. Choi, I.C. Kim, and H.H. Choi, Comparative analysis of chimpanzee and human Y chromosomes unveils complex evolutionary pathway, Nature Genetics. 38 (2006) 158-167.
[45] T.E. King and M.A. Jobling, What's in a name? Y chromosomes, surnames and the genetic genealogy revolution, Trends in Genetics. 25 (2009) 351-360.
[46] M.H.D. Larmuseau, A. Van Geystelen, M. van Oven, and R. Decorte, Genetic genealogy comes of age - Perspectives on the use of deep-rooted pedigrees in human population genetics, American Journal of Physical Anthropology. (In press).
[47] M. van Oven and M. Kayser, Updated comprehensive phylogenetic tree of global human mitochondrial DNA variation, Human Mutation. 30 (2008) E386-E394.

Figures

[Flowchart: Select Y-SNPs based on determined haplogroup; Expected state (ancestral/mutant) and State in reference genome; Expected situation in sample (called/not called) versus Actual situation in sample (called/not called); TP, FP, TN, FN; Measures]
Figure 1. Workflow of the quality assessment algorithm of AMY-tree, version 1.1.
First, certain Y-SNPs of the Y-chromosomal phylogenetic tree are selected based on the haplogroup determined in the first run, in order to avoid too many false positive Y-SNPs. Next, the expected state (ancestral or mutant) of the selected SNPs is determined based on the haplogroup and the phylogenetic tree. These expected states are converted to expectations of called or not called based on the SNP state in the reference genome. These expectations are then compared with the SNPs actually called in the sample, which yields the number of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). Finally, several different measures are calculated.

[Tree diagram: lineages within R-M198 with their Y-SNP labels: M198, M417, M512, M514, M515, Page7 (R-M198*); M56 (R-M56); M157.1 (R-M157.1); M87, M204 (R-M87); P98 (R-P98); PK5 (R-PK5); M434 (R-M434); M458 (R-M458*); M334 (R-M334); Page68 (R-Page68); Z280 (R-Z280); Z93 (R-Z93)]
Figure 2. Overview of the position of the newly added sub-haplogroups R-Z280 and R-Z93 (given in bold) within R-M198.

[Histogram: x-axis, Matthews correlation coefficient (0.0 to 1.0); y-axis, number of samples]
Figure 3. Distribution of the Matthews correlation coefficient of the 730 samples for which an unambiguous haplogroup could be determined.

[Bar chart: x-axis, haplogroups A to T; y-axis, number of available WGS genomes; series: all genomes, good quality genomes]
Figure 4. Frequency of whole genome sequencing (WGS) genomes per haplogroup in the dataset of all samples (black) and in the dataset of samples with an excellent quality, i.e. MCC ≥ 0.95 (grey).

[Bar charts, panel A: whole Y-chromosome; panel B: unique regions of the Y-chromosome. x-axis, number of occurrences of Y-SNP (1 to 10, more); y-axis, number of Y-SNPs; series: all genomes, excellent quality genomes]
Figure 5. Number of new Y-SNPs per number of occurrences in the full WGS dataset: (A) SNPs in the whole Y-chromosome and (B) SNPs in only the identified unique regions of the Y-chromosome. The grey bars indicate the SNPs in all genomes and the black bars indicate the SNPs from samples with an excellent quality, i.e. MCC ≥ 0.95.

[Bar charts, panel A: whole Y-chromosome; panel B: unique regions of the Y-chromosome. Groups: 8-member family (1 to 8 occurrences), father-son pair (1 to 2 occurrences). x-axis, number of occurrences of Y-SNP; y-axis, number of Y-SNPs; series: all SNPs, family-unique SNPs]
Figure 6. Number of new Y-SNPs per occurrence in the eight paternally related samples and the father-son pair of Complete Genomics: (A) SNPs in the whole Y-chromosome and (B) SNPs in only the identified unique regions of the Y-chromosome. The grey bars indicate the SNPs found in the family. The black bars indicate the SNPs which are 'family-unique': they do not occur in any of the other samples with an excellent quality, i.e. MCC ≥ 0.95.
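
As an illustration of the quality assessment workflow summarised in the caption of Figure 1, the following Python sketch derives TP, FP, TN and FN for one sample and computes the Matthews correlation coefficient (MCC) from them. It is a simplified reading of that workflow, not the AMY-tree code. The sketch assumes that a SNP is expected to be called exactly when the expected state of the sample differs from the state in the reference genome; the function names, the SNP labels and the toy data are hypothetical.

```python
import math
from typing import Dict, Set, Tuple


def confusion_counts(
    expected_state: Dict[str, str],
    reference_state: Dict[str, str],
    called: Set[str],
) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) for the selected Y-SNPs of one sample.

    Assumption: a SNP is expected to be called when the sample's expected
    state ('ancestral' or 'mutant') differs from the reference state.
    """
    tp = fp = tn = fn = 0
    for snp, state in expected_state.items():
        expect_called = state != reference_state[snp]
        is_called = snp in called
        if expect_called and is_called:
            tp += 1
        elif not expect_called and is_called:
            fp += 1
        elif not expect_called and not is_called:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy data with hypothetical SNP labels, for illustration only.
expected = {"snp_a": "mutant", "snp_b": "mutant", "snp_c": "ancestral",
            "snp_d": "ancestral", "snp_e": "mutant", "snp_f": "ancestral"}
reference = {"snp_a": "ancestral", "snp_b": "mutant", "snp_c": "ancestral",
             "snp_d": "mutant", "snp_e": "ancestral", "snp_f": "ancestral"}
called_in_sample = {"snp_a", "snp_c", "snp_d"}

counts = confusion_counts(expected, reference, called_in_sample)
score = mcc(*counts)
is_excellent = score >= 0.95  # threshold for 'excellent quality' in the captions

print(counts)           # (2, 1, 2, 1)
print(round(score, 3))  # 0.333
print(is_excellent)     # False
```

The MCC is used here because it is the measure named in the figure captions; the workflow mentions several different measures, which could be computed from the same four counts.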