Short Tandem Repeat (STR) Typing from Short Read Sequencing Data: STRTyper

Daniel Bornman, Seth Faith, Jared Schuetter, Aaron Sander, Joe Regensburger, Nancy McMillan, Angela Minard-Smith, Manjula Kasoji, and Brian Young
Battelle Memorial Institute, Columbus, OH

ABSTRACT

To enable the analysis of STRs from next-generation sequencing (NGS) data, we modified the standard approach to reference genome alignment while preserving the ability to use open-source short read alignment software. We performed multiplexed sequencing on an Illumina GAIIx of five individuals, plus an equal-parts mixture of two of the five samples. The sequencing run generated 150-nucleotide reads that were aligned to a modified reference genome specifically constructed to enable analysis of diploid STR polymorphic patterns. Analysis of the final sequence alignment data enabled us to estimate STR allele status for the entire set of core CODIS loci in each sample, including the single 1:1 mixture sample, in concordance with CE apart from a few exceptions. These exceptions included two instances where the repeat region of the D21S11 locus was longer than the current maximum read length of 150 bp could cover. Data are also presented that demonstrate a novel filtering method for preprocessing raw sequencing data to enrich for short reads containing STR patterns. In addition, we applied this method to recently generated whole genome sequencing data from the Illumina HiSeq 2500 platform.

RESULTS

Comparison to Capillary Electrophoresis (CE) Method

For each sample, allele calls were compared to results obtained by the standard CE method of STR genotyping. Allele calls were made from alignment data using a simple heuristic decision model similar to the technique for determining over-representation of genes mapping to molecular pathways in gene expression studies. Alleles were called positive if the calculated probability score was within a predefined threshold.
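The poster does not give the exact form of the decision model, so the following is only a minimal sketch of an over-representation style test, under stated assumptions: reads at a locus are assumed to fall uniformly across all alleles in the reference under the null, and a binomial upper-tail probability stands in for the "probability score". The function names, the threshold `alpha`, and the read counts are hypothetical and are not taken from STRTyper.

```python
from math import exp, lgamma, log


def binom_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), summing terms in log space."""
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log(1 - p))
        total += exp(log_pmf)
    return min(total, 1.0)


def call_alleles(read_counts, alpha=1e-6):
    """Call alleles whose read count is over-represented at one locus.

    read_counts maps every reference allele of the locus (including alleles
    with zero reads) to the number of reads aligned to it.
    Assumption: under the null, reads are spread uniformly over the alleles.
    """
    n = sum(read_counts.values())
    if n == 0 or len(read_counts) < 2:
        return []
    p0 = 1.0 / len(read_counts)
    return [allele for allele, k in read_counts.items()
            if binom_upper_tail(k, n, p0) <= alpha]


# Hypothetical read counts for a heterozygous 8/10 sample at TPOX.
example_counts = {
    "TPOX_6": 0, "TPOX_7": 3, "TPOX_8": 850, "TPOX_9": 5,
    "TPOX_10": 740, "TPOX_11": 2, "TPOX_12": 0, "TPOX_13": 0,
}
example_calls = call_alleles(example_counts)
```

A uniform null is a deliberately simple choice. A production model would likely need to account for stutter, unequal mixture proportions, and locus-specific coverage.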
Using this technique, all 13 core CODIS STR markers were successfully identified, with some exceptions attributed to likely technical artifacts.

Locus    Method  Sample 1  Sample 2  Sample 3  Sample 4  Sample 5   Sample 6
TPOX     CE      8         10/12     8         8/10      8/11       NA
         NGS     8         10/12     8         8/10      8/11       8/10
D3S1358  CE      16        14        17/19     16        15/18      NA
         NGS     16        14        17/19     16        15/18      16
FGA      CE      22/23     22/25     21        22/26     21/25      NA
         NGS     22/23     22/25     21        22/26     21/25      22/23/26
CSF1PO   CE      11/12     12        10/12     10/12     10/11      NA
         NGS     11/12     12        10/12     10/12     10/11      10/11/13
D5S818   CE      11/13     11/12     13        11/13     10/13      NA
         NGS     11/13     11/12     13        11/13     10/13      11/13
D7S820   CE      10/11     10/11     10/11     11        12/13      NA
         NGS     10/11     10/11     10/11     11        12/13      10/11
D8S1179  CE      13        13/14     13/15     13/15     13/14      NA
         NGS     13        13/14     13/15     13/15     13/14      13/15
TH01     CE      6/9.3     6/9.3     7/9.3     6/9       6/9.3      NA
         NGS     6/9.3     6/9.3     7/9.3     6/9       6/9.3      6/9/9.3
VWA      CE      16/17     15/19     15/16     14/18     15         NA
         NGS     16/17     15/19     15/16     14/18     15         14/16/17/18
D13S317  CE      11/13     8/12      11        11/12     11/12      NA
         NGS     11/13     8/12      11        11/12     11/12      11/12/13
D16S539  CE      12/13     11/12     9/11      11/14     12/13      NA
         NGS     12/13     11/12     9/11      11/14     12/13      11/12/13/14
D18S51   CE      13/17     13/17     14/15     13/15     14/17      NA
         NGS     13/17     13/17     *15       13/15     *17        13/15/17
D21S11   CE      30/34.2   28/30     30        28/30.2   31.2/32.2  NA
         NGS     *30       28/30     30        28/30.2   *31.2      28/30/30.2
AMEL     CE      X         XY        XY        XY        XY         NA
         NGS     X         XY        XY        XY        XY         XXY

TECHNICAL APPROACH

Reference Alignment with a Modified Genome

A reference sequence consisting of common CODIS STR alleles was used for alignment of short read sequences. This approach allowed the use of off-the-shelf open-source reference alignment software with robust capability for accurately aligning sequencing data.

[Figure 1 diagram labels: TPOX_13, TPOX_6]

FUTURE DEVELOPMENTS

These results indicate that short read sequencing technology may be applicable to genotyping CODIS STR loci. The NGS method was concordant with the CE results across the 13 CODIS loci plus AMEL, apart from the differences marked (*) in Table 1.
While further validation is needed before this technology can be used for routine forensic analysis, several technical considerations may be critical for widespread use. Analysis of sequencing data typically requires large computing resources and may take considerable time to complete. In addition to the significant decrease in time and resources achieved with the reference genome presented here, we developed a read filtering algorithm based on signal processing that detects reads containing short tandem repeats. This filter further reduces the time and computing resources needed to process sequence data for making STR calls (Figure 3). Using this prefiltering step, we were able to process an entire lane of GAIIx data in less than 20 min on a single computing core, which is comparable to a modern desktop computer. The current study used PCR enrichment of each STR location to maximize coverage at each locus. We recently demonstrated the ability to make STR calls directly from deep sequencing data from the Illumina HiSeq 2500 platform. We are currently developing a method for applying match probabilities in cases such as these, where STR coverage may be relatively low.

Table 1: The STR genotype reported for each locus in the 5 individual samples assayed and the 1:1 mixture sample (Sample 6). Sample 6 was generated using equal amounts of starting DNA from Sample 1 and Sample 4 prior to sequencing. Near-complete concordance with CE was attained. (*) Detected differences between the NGS and CE methods.

[Figure 1 diagram labels: loci TPOX, D3S1358, FGA, CSF1PO; intermediate alleles (7,8,9,10,11,12) and (7,8,9,10,11,12,13,14); CSF1PO_6]

Figure 3: Computing time of the modified reference alignment approach with and without read filtering. Read filtering greatly reduced the total runtime for processing short read sequencing data while retaining the sensitivity of correctly calling alleles.
All stringency settings reduced the time and resources required compared with performing reference alignment on all sequence reads.

[Figure 1 diagram label: CSF1PO_15]

Figure 1: The common alleles for each STR locus were concatenated into a reference sequence along with short flanking sequences. This concatenated reference sequence was used for reference alignment.

[Figure 2 image: IGV tracks for Sample 1, Sample 4, and Sample 6 (1:1 mixture) at TPOX_8 and TPOX_10; reads mapping to alleles: 1682, 852, 747, 2645, 631]

Figure 2: Reference alignment results visualized in IGV for two individuals and a 1:1 mixture of the two samples. The viewer is focused on the TPOX “chromosome”.

Figure 4: Circular plot of the reference sequence used for alignment with the Bowtie short read sequence aligner. Each of the 13 STR loci examined is arranged around the circular plot as an individual “chromosome” (A). STR genotyping results obtained from the NGS and CE methods for two individual samples and their mixture are plotted against the STR modified genome. Blue = Sample 1; Red = Sample 4; Black = Sample 6. (B) Individual alleles represented at each locus are labeled with their defined repeat number. (C) Coverage data are plotted at each 100 bp interval. Coverage data for the mixture sample are separated from the individual samples for visual comparison. (D) A heatmap is provided adjacent to the coverage plot to indicate the read alignment signal strength. (E) Data obtained from the CE method are plotted as small circles above the heatmap for comparison between methods. Note that allele calls by CE cannot distinguish between variant alleles with the same repeat number. (F) Data from the Illumina HiSeq 2500 platform, from a single individual sequenced to an average depth of 30X with 150 bp reads, were analyzed by the present method. Allele calls made from these data are plotted in green. The sample sequenced on the HiSeq 2500 was a Yoruban HapMap sample, NA18507.
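The concatenated reference described in the Figure 1 caption can be sketched as follows. This is an illustration under assumptions, not the reference used in the study: the flanking sequences, the `N` spacer between allele blocks, the function names, and the coordinate bookkeeping are all hypothetical. The motif is passed in as a parameter, and only whole-number repeat alleles are generated, so microvariants such as 9.3 would need explicit sequences.

```python
def build_locus_reference(locus, motif, repeat_counts,
                          left_flank, right_flank, spacer="N" * 50):
    """Concatenate one block per allele into a single locus "chromosome".

    Each block is left_flank + motif * n + right_flank. Returns the full
    sequence and a dict mapping names such as "TPOX_8" to 0-based,
    half-open (start, end) coordinates of that allele block.
    """
    parts, blocks, pos = [], {}, 0
    for n in repeat_counts:
        if parts:
            # Assumed spacer so that reads cannot span two allele blocks.
            parts.append(spacer)
            pos += len(spacer)
        block = left_flank + motif * n + right_flank
        blocks[f"{locus}_{n}"] = (pos, pos + len(block))
        parts.append(block)
        pos += len(block)
    return "".join(parts), blocks


def to_fasta(name, seq, width=60):
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [f">{name}"]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines) + "\n"


# Placeholder flanks; real flanking sequence would come from the genome.
example_seq, example_blocks = build_locus_reference(
    "TPOX", "AATG", range(6, 14), "ACGTACGTAC", "TGCATGCATG")
example_fasta = to_fasta("TPOX", example_seq)
```

The resulting FASTA file could then be indexed by an aligner such as Bowtie, and the block coordinates used to assign each aligned read to an allele.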
This work was performed under internal research funding at Battelle Memorial Institute.

www.battelle.org
bornmand@battelle.org

Data published in Bornman DM, et al. BioTechniques Rapid Dispatches. April 2012.