Master Thesis 9-2-2016
Computational methods for the detection of structural variation in the human genome.
Erik Hoogendoorn
Student Number: 3620557
Master's programme: Cancer Genomics and Developmental Biology
Utrecht Graduate School of Life Sciences
Utrecht University
Supervisor: Dr. W.P. Kloosterman
Department of Medical Genetics
University Medical Center Utrecht

1 Abstract
Structural variations are genomic rearrangements that contribute significantly to evolution and natural variation between humans, and are often involved in genetic disorders. Cellular stresses and errors in repair mechanisms can lead to a large variety of structural variation events throughout the genome. Traditional microscopy- and array-based methods are used for the detection of larger events or copy number variations. Next generation sequencing has in theory enabled the detection of all types of structural variants in the human genome at unprecedented accuracy. In practice, a significant challenge lies in the development of computational methods that are able to identify these structural variants based on the generated data. In the last several years, many tools have been developed based on four different categories of information that can be obtained from sequencing experiments: read pairs, read depths, split reads and assembled sequences. In this thesis, I first introduce the topic of structural variation by discussing its impact in various areas, the mechanisms that can lead to its formation, and the types of structural variation that can occur. Subsequently, I describe the array-based and sequencing-based methods that can be used to detect structural variation. Finally, I give an overview of the tools that are currently available to detect signatures of structural variants in NGS data and their properties, and conclude by discussing the current capabilities of these tools, possible future directions and expectations for the future.

Keywords: Structural variation; Copy Number Variation; Next-Generation Sequencing; Detection algorithms; Read pair; Read depth; Split read; De novo assembly.

2 Contents
1 Abstract
2 Contents
1 Introduction
2 Structural variation
  2.1 The importance of structural variation
  2.2 Causes for structural variation
  2.3 Types of structural variation
3 Detecting structural variation
  3.1 Array-based methods
    ArrayCGH
    SNP arrays
    Advantages and limitations
  3.2 Sequencing-based methods
    Read pair
    Read depth
    Split read
    De novo assembly
    Advantages and limitations
4 Computational methods
  4.1 Read mapping
  4.2 Read pair
    Clustering-based methods
    Distribution-based methods
  4.3 Read depth
  4.4 Split read
  4.5 De novo assembly
    Genome assembly
    Identification of structural variation
  4.6 Combined approaches
5 Discussion
  The status quo
  Possible improvements: integration of recent advances
  Future perspectives
6 References

1 Introduction
Structural variation describes genetic variation that affects the genomic structure. Although human genomic variation was first thought to be mostly due to SNPs (Single Nucleotide Polymorphisms), it has become clear that human genomic and probably phenotypic differences are related more to structural variation than SNPs1,2. Structural variation can range in size from several bp (base pairs) to entire chromosomes. Structural variation contributes significantly to human diversity and disease occurrence, and is an important consideration in any genetic study3,4.
Structural variation studies used to be limited to the detection of larger variants like aneuploidies and chromosomal rearrangements by using microscopic methods. The development of array-based and, more recently, sequencing-based methods has enabled the detection of smaller, submicroscopic structural variants (SVs) at greater resolution. Next generation sequencing-based (NGS) methods are theoretically able to identify SVs of all types at previously unattainable speeds and resolution, and several different methods have been developed to detect signals in the data that indicate structural variants, each with their own advantages and disadvantages. However, these methods require extensive computational analysis and the development of various types of algorithms to filter the data, compare it to reference or other samples and detect the signals associated with structural variation. Here, I will introduce the effects structural variation can have in humans and other species, the mechanisms that can result in the formation of SVs and the different types of structural variation that can occur. Subsequently, I will give an overview of the methods that can be used to detect structural variation, and provide an overview of the currently available computational tools used for the detection of SVs in the human genome based on next-generation sequencing.

2 Structural variation

2.1 The importance of structural variation
Structural variation is now known to cover more nucleotide variation in the human population than SNPs, and thousands of SVs are likely to be present in each genome1,2,5. Many SVs span, relocate or break coding as well as regulatory elements in the genome. This may often have no observable effect, but can also induce dosage effects, gene disruption, new fusion genes, new regulatory cascades, the formation of new SNPs and differences in epigenetic regulation due to relocation5–7. Thus, although many SVs may be neutral, they still introduce a large source of genetic and phenotypic variation, not just between humans but in all species8,9. Considering the effects of SVs on phenotypic variation, the occurrence of structural variation is also expected to significantly affect natural selection and thus evolution5,8. Indeed, structural variation has been suggested to be related to the evolution of new species as well as the evolution within various species9–11. Examples exist in plants12 as well as primates13–15, including the emergence of human-specific genes16. Several papers have shown recent human evolution in genes related to diet, reproduction and disease due to structural variation17–19.

Structural variation has been characterized extensively in relation to disease. Variants affecting gene regulation or coding sequences may result in a wide variety of genomic disorders8,20,21. Two models for the relationship between structural variation and disease have been proposed, based on rare and common structural variation22. The first model describes how rare and often de novo SVs in the population can cause various disorders, collectively accounting for a large fraction of these disorders22. Examples are found for various birth defects23–25, neurological disorders26–30 and predisposition to cancer31,32. The second model concerns SVs common in the population, especially copy number variable gene families, thought to collectively contribute to susceptibility to complex diseases, especially those related to the immune system22.
Examples for this model are HIV33, malaria34 and various immune disorders35,36. Although examples can be found for both models, these are probably not comprehensive for all human disease in relation to structural variation. For example, a simple division between rare and common variants may be too simplistic37. However, it is clear that the detection of structural variation can have a large impact on the investigation of human disease, both in diagnosis and treatment38,39.

In addition to their role in disease, SVs are also essential for the normal functioning of human life. Class Switch Recombination (CSR)40 and V(D)J recombination41 are processes that rely on structural variation that is stimulated by our body itself. These processes are essential for the generation of diverse mature B cells in response to antigen stimulation, and thus for the human immune system. The study of SVs may also tell us more about genetic mechanisms that shape genome structure as well as genome evolution. Over the last years, the need to take structural variation and its roles into account has become apparent4. However, essential for each of these research areas remains the accurate and unbiased identification of structural variants.

2.2 Causes for structural variation
Although first considered to occur randomly42, structural variants form in specific situations, in response to specific environmental and cellular triggers. Various stressors like replication, transcription and genotoxic or oxidative stress, or combinations of these, can be the trigger for structural variation43. These stresses can result in DNA breaks and stalled DNA replication forks sensitive to the formation of structural variants. Specific sequences are more sensitive to structural variation due to their structure, associated proteins or epigenetic modifications43. Furthermore, the proteins involved in the generation of functional recombination in the immune system may have off-target effects, leading to double-strand breaks. Subsequent errors in DNA repair or recombination then cause the structural variant to be implemented locally or between two loci in physical proximity. For example, non-homologous end joining (NHEJ) is an error-prone repair mechanism for DNA double-strand breaks. Individual double-strand breaks are efficiently repaired by classical NHEJ; however, the presence of two double-strand breaks can result in chromosomal translocations. Alternative end joining (A-NHEJ) is a different pathway that is associated with genomic rearrangements; however, its precise mechanisms are currently unknown44. Allelic homologous recombination repairs double-strand breaks using a template sequence and is relatively error-free. However, defects in homologous recombination could result in non-allelic homologous recombination (NAHR). In this case, non-allelic sequences, often LCRs, LINE-1 and Alu repeat elements or pseudogenes, are used as a template for repair, resulting in structural variation8. Additionally, repetitive and transposable elements like those involved in NAHR are considered to contribute to structural variation through the effects of retrotransposition and microhomology, which can result in Complex Chromosomal Rearrangements (CCRs). Several models exist to explain these CCRs. The MMBIR model (microhomology-mediated break-induced replication) posits that single DNA strands of collapsed replication forks anneal to any single-stranded DNA in proximity. Subsequent polymerization and template switches then result in CCRs45.
A similar model, FoSTeS (Fork Stalling and Template Switching), suggests replication fork template switching, but without breaks15,46. Finally, intra- and interchromosomal CCRs may result from random non-homologous end joining of fragments after an event termed chromothripsis. In this model, one or multiple chromosomes locally shatter, then fuse again randomly, possibly due to radiation or other events resulting in widespread chromosomal breakage23,47. For more information on this topic, please see the comprehensive review by Mani et al.43.

2.3 Types of structural variation
Structural variation can occur in many types, among which a distinction can be made between copy number variant (CNV) and copy number balanced variants. Copy number balanced SVs include inversions and translocations. Copy number variant SVs include deletions, insertions and duplications. Insertions may involve a novel sequence or a mobile element. Mobile element insertions can result from translocations or duplications. Duplications can occur as tandem duplications, where the duplicated segment remains adjacent to the source DNA, or interspersed, where the duplicated DNA is incorporated elsewhere in the genome. These events may occur intrachromosomally, but also between different chromosomes, leading to interchromosomal variants.

The term structural variant was traditionally used to refer to variants larger than 50 bp or 1 kb (kilobase)22. However, any variant other than a SNV (Single Nucleotide Variant) may be considered to alter the structure of the chromosome. As some of the methods discussed here are able to identify events of sizes from 1 to 50 bp at base pair resolution, the term structural variant is used here for any non-SNV genetic variant. Of course, one event may include a combination of multiple types of SVs, resulting in more complex patterns or CCRs, where for example an inverted fragment may contain a deletion or an insertion, or any other combination. Detecting CCRs is more problematic for most methods. Additionally, an insertion may correspond to a deletion elsewhere in the genome, resulting in what is essentially a translocation. However, not all methods may detect both events and may thus infer CNVs erroneously. Accurate identification of a certain SV may thus require comprehensive identification of all structural variation in the studied genome48. The ability to detect these types of variants differs between the various methods used, as is discussed below.

3 Detecting structural variation
As mentioned above, structural variants can differ greatly in terms of size. Larger structural variants are considered microscopic variants, as these can be detected using traditional microscopy-based cytogenetic techniques. These include genome-wide techniques like karyotyping, chromosome painting and FISH-based methods. Still commonly used, these methods can identify aneuploidies and most types of structural variants beyond several Mb (megabases). Improvements based on these techniques are still being developed, providing higher resolution and sensitivity49. For the detection of smaller, submicroscopic SVs with higher resolution and sensitivity, more recent molecular methods are required. These methods can be classified as either array-based or sequencing-based. Common to these methods is that SVs are identified by comparing the experimental genome to a reference or other sample genome, inferring variants from the differences. I will briefly introduce these array- and sequencing-based methods below.
3.1 Array-based methods
Microarrays were originally developed for RNA expression profiling, but now have a wide range of applications, including the detection of structural variation. Microarray-based methods rely on the design of microarray chips on glass slides, using immobilized cDNA oligonucleotides as targets for hybridization by experimental DNA. Although sequencing-based methods for the detection of CNVs are becoming more cost-effective and popular, clinical diagnostics still mainly use microarray screening50. Detection of CNVs with array-based methods is possible using two types of microarrays: ArrayCGH (Comparative Genomic Hybridization) and SNP arrays. Recent platforms, marketed by companies like Agilent, Illumina, Roche and Affymetrix, enable the detection of millions of probes on one chip, and new arrays are still being developed that increase the sensitivity and resolution even more.

ArrayCGH
ArrayCGH platforms can be used to detect relative CNVs by competitive hybridization of two fluorescently labeled samples to the target DNA. Experimental DNA is fragmented and fluorescently labeled prior to hybridization. By using different fluorescent dyes for each sample, for example Cy3 (green) and Cy5 (red), the measured fluorescence for each color can give an indication of the abundance of experimental DNA from each sample. It is important to use known reference samples, as a gain in one sample cannot be distinguished from a loss in the other without further information. For accurate identification of SVs, normalization is often needed due to experimental biases for GC content in the DNA and dye imbalance. The first ArrayCGH experiments used large inserts, for example BACs (Bacterial Artificial Chromosomes), as targets, and were able to detect CNVs in the range of 100 kb and longer51. The current use of oligonucleotides allows the detection of CNVs with a resolution of only several kilobases52,53. An advantage of ArrayCGH is the availability of custom arrays, allowing its use as a diagnostic platform50,54. ArrayCGH platforms can reach high resolutions, especially using custom solutions2, but cannot match NGS-based methods.

SNP arrays
SNP arrays were originally designed to detect single nucleotide polymorphisms, but have been adapted for the detection of CNVs. Similarly to ArrayCGH, SNP arrays rely on hybridization to target DNA. However, in SNP arrays only the test sample is hybridized, and no competitively hybridizing reference sample is used. The intensity of the fluorescence upon binding is used as a measure for the matching sequences in the sample. For the detection of CNVs, intensities measured across many spots on the slide are clustered. CNVs are detected by comparing these sample values to (a set of) reference values from a database or from a different experiment. Several algorithms have been developed for this analysis, and an overview of these can be found in a review by Winchester et al.55. Similar to ArrayCGH, SNP array resolution has increased significantly in the years since its first use56. Currently, millions of SNPs can be interrogated on one chip. In addition to improvements in resolution, the design of arrays has focused on incorporating more informative SNPs in regions with known CNVs, increasing the number of variants detected in one experiment57. However, this does have an important negative side-effect, as it introduces a large bias towards known CNVs. SNP arrays generally tend to have lower sensitivity in the detection of CNVs compared to ArrayCGH.
However, SNP arrays provide advantages like additional information for genotyping and the parental origin of CNVs, are more accurate in the determination of copy numbers and allow detection of LOH (Loss Of Heterozygosity)49.

Advantages and limitations
A major disadvantage of array-based versus sequencing-based methods is that only gains and losses compared to a reference can be identified. Thus, balanced variants like translocations and large inversions cannot be identified, meaning that other experiments are needed to identify the location and type of the SV events in the test sample. Array-based methods are also unable to detect smaller variants and have a lower resolution, and thus miss a wide range of SVs that are potentially of interest. However, array-based methods are less costly and have a higher throughput than sequencing-based methods, so it is possible to genotype a larger number of individuals in less time and for a lower cost. Analysis of the data also requires fewer computational resources than sequencing-based methods. In addition to predesigned genome-wide solutions, it is often possible to order custom designs, allowing studies to focus on regions of interest, or to increase overall resolution.

Combinations of these two types of arrays have been used to detect CNVs, either by integration of results58, by using SNP arrays for fine-mapping regions identified by ArrayCGH59, or by using hybrid CGH+SNP arrays49,60. These methods could provide more robust identification of structural variation as well as additional information versus existing approaches. This seems prudent, as a recent assessment has shown relatively low (<70%) reproducibility for repeated experiments as well as poor (<50%) concordance between platforms61.

3.2 Sequencing-based methods
Detection of multiple different types of structural variation based on sequencing methods was first performed using paired-end mapping by Tuzun et al.62. This study was based on capillary Sanger sequencing using fosmid-end sequences. Throughput and resolution based on this type of data are not optimal, but the longer read lengths allow the reliable identification of large variants. The development of high-throughput next-generation sequencing technologies has enabled sequencing of a full human genome within a week. Since 2005, several companies including 454 Life Sciences, Illumina, and Life Technologies have marketed platforms with ever increasing throughput and base-calling accuracy, longer read lengths and lower costs versus traditional capillary methods. More recently, Single Molecule Sequencing (SMS) has become a possibility with Helicos' HeliScope platform, and non-optical sequencing was introduced with Life Technologies' Ion Torrent sequencer. Among other applications, this development in sequencing technology has enabled the genome-wide detection of structural variation at unprecedented resolution and speed.

Several methods have been employed for the identification of SVs using NGS data. The most self-evident method would be de novo assembly of a genome, with subsequent alignment to a reference to determine the structural differences. However, de novo assembly of a human genome remains challenging due to the relatively short read lengths generated by NGS platforms63. As a result, other methods were developed that use direct alignment of reads to one of the human genome reference assemblies. These methods are the read pair, read depth and split read approaches, and are based on the identification of discordant patterns in sequencing data.
I will describe the basic principles of each of these approaches below.

Read pair
As mentioned earlier, the first sequencing-based identification of SVs used a read pair method, which was applied to data from capillary sequencing62. The first NGS-based study on the genome-wide identification of SVs applied a similar method, using the same algorithms as in the earlier study but without any optimizations for the new type of data64. Most of the current sequencing technologies, excluding SMS platforms, are capable of generating paired-end or mate-paired reads. In paired-end sequencing, both ends of a linear fragment with an insert sequence are sequenced, whereas in mate-pair sequencing a circularized fragment is used. Although the method of generating the read pairs differs, the detection of SVs based on the generated data is essentially the same. An important consideration in the detection of SVs is that the insert size for mate-pair libraries (1.5-40 kb) is often larger than for paired-end libraries (300-500 bp)65.

Read pair methods detect SVs by mapping read pairs with a predetermined insert size back to the reference genome. When assessing the mapping locations of the reads on the reference genome, a discordant span or orientation of the read pair indicates the occurrence of a genomic rearrangement (Figure 1). If read pairs map further apart than the insert size, this suggests a deletion, whereas read pairs mapping closer together, or with one read that cannot be mapped, suggest a (novel) insertion. Furthermore, insertions of mobile elements or other genomic regions map to the locations in which these are present on the reference genome. Inversion breakpoints are detected by a changed orientation of one of the reads inside the inversion, as well as varying spans for the pairs. Interspersed duplications or translocations can be detected by complex patterns where in several pairs one of the reads maps to a different location or chromosome. Finally, tandem duplications can be detected by read pairs that have a correct orientation, but are reversed in their order and have differences in their span.

Figure 1: The four sequencing-based methods used to identify structural variation, and the signatures that can be detected for each type of SV. The top line indicates reference DNA. Red arrows indicate breakpoints. MEI = Mobile Element Insertion. RP = Read Pair. For a full legend see Alkan et al.22 (Copied from Nature Reviews Genetics, Alkan et al. 201122.)

As single read pairs are not reliable on their own due to possible mismapping or ambiguous mapping, multiple read pairs belonging to the same variant are clustered to increase the reliability of detection, as well as to identify the breakpoint locations of the variant more accurately. Libraries with larger insert sizes (several kilobases) are better at detecting larger variants, but are often not able to reliably detect smaller variants due to the distribution of insert sizes66. In contrast, libraries with smaller insert sizes cannot reliably detect the larger events, but have higher resolution and are able to detect smaller variants. A major disadvantage of the read pair methods is that insertions larger than the insert size cannot be detected conventionally. Although with lower power, algorithmic detection of these insertions is possible when considering a linked signature, as described by Medvedev et al.67.
For example, for a large insertion originating from a distant genomic region (a translocation or duplication), the read pair will be detected as spanning a huge range in the reference genome, as regions that were originally far from each other are now relatively close and are used to generate the read pair. By finding this signature for both break-ends (the sequences newly formed by the colocation of the flanking sequences and the insertion) and linking these, it is possible to determine the origin and possibly the size of the insertion. For novel sequence insertions this is more difficult, as one of the reads from the read pair will not map to the genome. In this case, additional steps like assembly or targeted sequencing of the insertion sequence would be required.

Read depth
Analysis of read depth, also called depth of coverage (DOC), can identify structural variants by evaluating the depth of reads mapped to the reference genome. This approach was first used in combination with NGS data to detect CNVs in healthy and tumor samples from the same individuals68. For this method, a uniform distribution of reads is assumed, often according to a Poisson distribution. Sufficient deviation from this distribution is expected to be due to copy number differences in the sequenced genome. Alternatively, the expected copy number of a region can be derived from a comparison of read depth to reads of a control genome. In both cases, a loss region will have fewer reads mapped to it than expected, whereas a gain region will have more reads mapped (Figure 1).

The major disadvantage of read depth versus the other sequencing-based methods is that only CNVs can be detected. The precise breakpoint locations of events cannot be retrieved, and copy number balanced events like inversions or translocations cannot be detected. However, it is the only sequencing-based method that can accurately predict copy numbers69. Larger events are more reliably detected than smaller ones, as the statistical support increases with the size. The reliability of the SVs detected is directly related to the sequencing coverage. As a result, the sequencing biases in the different platforms affect SV detection as well. For example, GC-rich or GC-poor regions as well as repeat regions are sequenced less reliably, introducing biases70. These biases, as well as mismapped reads, influence the data more than in other sequencing-based methods. Algorithms based on case versus control data suffer less from sequencing biases, as these are assumed to cancel out. However, they are more costly, as additional genomes have to be sequenced.

Split read
Split read mapping detects structural variation by using unmappable or only partially mappable reads. The breakpoint of a SV is found based on a read which can only be mapped to the genome in two parts. Detection of SVs is similar to read pair-based methods, but instead of two paired reads, two parts of one read are used (Figure 1). A deletion will show a read mapping with alignment gaps in the reference genome, whereas insertions will show alignment gaps in the test genome. As with read pairs, part of a read not mapping may indicate a novel sequence insertion, and partial mapping to a known mobile element in the reference genome indicates a MEI (Mobile Element Insertion). Reads spanning tandem duplications will have the split read mapping in reverse order. Interspersed duplications or interchromosomal translocations will show part of a read mapping to the duplicated region or another chromosome.
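To make the deletion signature concrete, the following is a minimal sketch (in Python) of the split read logic, under simplifying assumptions: exact matching stands in for a real aligner, only deletions are handled, and all function names and sequences are illustrative rather than taken from any published tool.

```python
# Minimal sketch (not any published tool's implementation): inferring a
# deletion from a split read alignment.

def find_exact(reference: str, segment: str, start: int = 0) -> int:
    """First exact match position of `segment` in `reference`, or -1
    if absent (a naive stand-in for a real aligner)."""
    return reference.find(segment, start)

def split_read_deletion(reference: str, read: str, min_part: int = 5):
    """Try to explain an unmappable read as a prefix and a suffix that map
    with a reference gap in between; the gap is the implied deletion size."""
    for split in range(min_part, len(read) - min_part + 1):
        prefix, suffix = read[:split], read[split:]
        p = find_exact(reference, prefix)
        if p < 0:
            continue
        # The suffix must map downstream of the prefix's end.
        q = find_exact(reference, suffix, p + len(prefix))
        if q >= 0:
            gap = q - (p + len(prefix))
            if gap > 0:  # a positive reference gap suggests a deletion
                return {"breakpoint": p + len(prefix), "deletion_size": gap}
    return None

# Toy example: the read spans a 10 bp deletion relative to the reference.
ref = "ACGTACGTAAGGTTCCAAGGACGTTGCACGGATTACA"
read = ref[:12] + ref[22:30]   # 12 bp prefix, skip 10 bp, 8 bp suffix
print(split_read_deletion(ref, read))
# -> {'breakpoint': 12, 'deletion_size': 10}
```

A real implementation would additionally allow mismatches, consider both strands, and require multiple supporting reads before calling the event, as discussed next.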
Like read pair methods, split read mapping may use clustering of reads to increase the reliability of the findings. Split read mapping was originally used in combination with Sanger sequencing, which produces longer reads than current NGS platforms71. The shorter reads currently generated by NGS platforms significantly reduce the power of SV detection using this method, as the parts of a split NGS read are rarely uniquely mappable to the genome. This results in strongly ambiguous, and often impossible, mapping of reads, especially in regions with repeats or duplications. However, it is currently still possible to map breakpoints for smaller deletions (max. ~10 kb) as well as very small insertions (1-20 bp) at base pair resolution by using an algorithm called Pindel72. Using this method, also called anchored split mapping67, read pairs are used to select reads where one read maps uniquely to the genome and the other cannot be mapped. Knowing the location and orientation of the first read, the second read can be split-mapped using local alignment based on the known insert size, reducing the search space for possible mappings as well as ambiguous mapping significantly. However, this does require that one of the reads is mapped uniquely, which is still not always possible. The advantage of this method is that it can map breakpoints of SVs at base pair resolution. However, for larger events or those involving distant genomic regions this is still problematic. Using anchored split mapping to reduce the search space for split reads is an important step towards making split mapping useful in combination with NGS platforms, but may be hampered by insertions or deletions in between the reads, affecting the mapping distance. Longer read lengths will make split read mapping even more powerful, as unique mapping of at least one end may no longer be required.

De novo assembly
Ideally, full alignment of de novo sequenced genomes against one or multiple reference genome(s) would be used to identify all structural variation in the genome. Depending on the algorithms and reference genome(s) used, this would enable unbiased detection of all types and lengths of SVs. Although studies have described the assembly of human genomes based on short-read data, these and other approaches still rely on alignment to the reference genome. Two human genome assemblies have recently been used to identify structural variation73. However, this study was still limited in the identification of SVs in repetitive regions and was only able to identify homozygous SVs. Local de novo assembly is possible in more reliable genomic regions74. This allows alignment to the reference genome and subsequent identification of structural variants using these generated contigs. Identification of SVs is then possible using the same principles as in split read mapping, with differences only in the identification of MEIs and tandem duplications (Figure 1). As these fragments are typically much larger than read fragments, this method is much more reliable in the identification of breakpoints and larger SVs. Although de novo assembly of genomes and subsequent pairwise comparison is expected to become the standard method of SV detection, this is currently still problematic due to the limited read lengths and assembly collapse over regions with repeats and duplications63. As these regions are especially susceptible to the formation of structural variation, this further decreases the reliability of SV detection due to false positives as well as false negatives.
Additionally, differences in coverage between genomic regions due to biases affect assembly, inducing gaps and complicating the statistics used in assembly. Finally, de novo assembly requires extensive computational resources. In algorithms that reduce the computational requirements, tradeoffs are often necessary in terms of sensitivity to overlaps. Although improvements in these areas have been made with newer tools, the problems are still unsolved74. Further development of algorithms and sequencing platforms will be required before this method is able to detect all structural variation reliably.

Advantages and limitations
A major advantage of sequencing-based methods over array-based methods is the possibility to detect all types of variants in a single experiment, both copy-balanced and copy-variant. Additionally, SVs of a broader range of lengths can be detected with significantly less bias, as the genomic regions measured are not predetermined as is necessary for microarray probes. The resolution of sequencing enables breakpoint detection at base pair level given high enough coverage, allowing detailed investigation of CCRs as well. NGS-based methods are expected to replace microarrays for SV discovery and genotyping. Although the costs of whole genome sequencing have declined significantly, they are currently still a large factor. This is especially true for genome-wide detection of structural variation, as the reliability of the findings depends in large part on the sequencing coverage attained in the experiment. However, the decline in costs is expected to continue quickly over the coming years, in concert with the further development of single-molecule and third-generation sequencing platforms65. A problem common to all methods is the limited read length of current generation sequencing platforms, causing significant ambiguity in the mapping of reads, especially in repetitive regions. Third-generation sequencing technologies with increased read lengths and insert sizes are expected to alleviate these problems at least partially, but the development of new algorithms and the integration of information will also be an important factor.

The different sequencing-based methods each have their own strengths and weaknesses in the detection of SVs. Read pair-based methods are efficient at detecting most types of structural variation and are extensively used; however, the insert size significantly affects the length of the SVs that can be detected. Approaches based on read depth are able to identify sequence copy numbers, but are only able to detect CNVs, and at poor breakpoint resolution. Although split read mapping can identify breakpoints at base pair resolution, its sensitivity is currently much lower than that of other methods due to unreliability outside of unique genomic regions. Finally, de novo assembly of genomes promises to be the method that solves most of these problems, but is currently not yet feasible and depends on the further development of sequencing techniques and algorithms. Several tools have been developed recently to integrate the information from the various methodologies. By combining algorithms, several biases or deficiencies of some of these methods may be alleviated. Furthermore, several strategies seem more suitable for the detection of certain classes or properties of structural variants. For example, read depth information is more suitable for copy number detection than other methods, and split read information may indicate the breakpoints most reliably.
Any combination of methodologies will need to take these factors into account.

4 Computational methods
Various tools have been developed for NGS-based detection of structural variation. Here, I will give an overview of the currently available tools for read pair-, read depth-, split read- and assembly-based methods of genome-wide SV detection in the human genome with NGS data. Tools combining the information from several detection methods to improve the results are discussed separately. As read mapping is an important first step for the read pair-, read depth- and split read-based methods, and assembly algorithms are similarly important in the assembly-based identification of SVs, approaches and tools used for these steps are discussed as well.

An important distinction between the tools is the strategy that is used for alignment of the reads and how the SV identification algorithms process those alignments. The alignment processing strategies can be classified as either 'hard clustering' or 'soft clustering'75. Most approaches use hard clustering, considering only the best mapping of each read to the genome for the identification of SVs. This works well for unique regions of the genome, but has lower sensitivity in tandem duplication and repeat regions. Some newer approaches use soft clustering, where reads are mapped to all possible locations, and all these mappings are considered in the detection of putative SVs. Although this increases sensitivity, soft clustering may lead to more false positives and often requires careful filtering of input reads. In sample-reference analyses, these false positives are offset by an increase in true positives, as more SVs are present in total. However, in related samples the false positives may constitute a higher percentage of the total due to the low number of SVs between the samples. Thus, it is important that the clustering strategy is appropriate for the study, and that the parameters in tools using the soft clustering strategy are well understood and set carefully. Table 1 summarizes the tools used for SV identification as discussed here, showing which clustering approach is used and what types of SVs can be detected, as well as their defining characteristics or advantages over other approaches.

4.1 Read mapping
Except for de novo assembly, all computational methods described here require mapping of the reads to the reference genome as a first step. Many tools have been developed for this purpose, based on several different approaches. These tools mainly differ in how they find the possible mapping locations on the genome, whereas a final alignment step on these possible mapping locations to determine the scoring is generally performed using the traditional Smith-Waterman76 alignment algorithm. The first development was the classical "seed and extend" approach77. Here, a seed DNA sequence is found based on a "hash table" containing all DNA words of a certain length (k-mers) present in the first DNA sequence (this can be either the reads or the reference genome). The hash table is then used to locate the k-mer sequence in the other DNA sequence. Subsequently, this seed is extended on both sides to complete the alignment. This approach is used in several tools, like BLAT78, SOAP79, SeqMap80, mrFAST69, mrsFAST81 and PASS82. This implementation is simple and quick for shorter word lengths, but becomes exponentially more memory-intensive with longer word lengths.
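As an illustration of the hash table lookup described above, the following is a minimal, hypothetical sketch of the "seed and extend" idea in Python; exact k-mer seeds and ungapped extension with a simple mismatch budget are simplifying assumptions, and real tools such as BLAT or mrFAST are considerably more sophisticated.

```python
# Minimal sketch of "seed and extend" read mapping, assuming exact k-mer
# seeds and ungapped extension. All names and parameters are illustrative.
from collections import defaultdict

def build_kmer_index(reference: str, k: int) -> dict:
    """Hash table from every k-mer in the reference to its positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def seed_and_extend(read: str, reference: str, index: dict, k: int,
                    max_mismatches: int = 2):
    """Use the read's first k-mer as a seed, then extend, counting mismatches."""
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits

ref = "TTGACCGTAGGCTAACCGTAGGATCCA"
index = build_kmer_index(ref, k=4)
print(seed_and_extend("CGTAGGCT", ref, index, k=4))   # -> [(5, 0), (16, 1)]
```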
An improvement on this approach was introduced with PatternHunter83, which uses "spaced seeds". This approach is similar to the "seed and extend" approach, but requires only some of the seed sequence's positions to match. Thus, if a 5-mer is used, it may be that only the 1st, 3rd and 5th positions need to match the other sequence. This approach is more sensitive and allows for mutations in the seed sequence, but may introduce false matches that slow the mapping process down, and does not allow indels in the sequence. Many tools were developed based on this approach, including the Corona Lite mapping tool84, ZOOM85, BFAST86 and MAQ87. Newer tools like SHRiMP88 and RazerS89 improve on this approach by requiring multiple seed hits and allowing indels.

Other "trie-based" approaches are aimed at reducing the memory requirements for alignment and use the "Burrows-Wheeler Transform" (BWT), a technique that was first used for data compression90. The term trie comes from retrieval, as it can be used to retrieve entire sequences based on their position in a list. Different data structures can be used with this approach, based on prefix trees, suffix trees, FM-indices or suffix arrays, but the search method is essentially the same91. In trie-based approaches, the various k-mers are compressed into one string based on their position relative to the start of the string. These can be used to directly search the reference genome, even allowing simultaneous search of similar strings, as these are compressed together. This further decreases the memory requirements and search times, but does require more computing time for the construction of the compressed strings. Several very fast tools like SSAHA292, BWA-SW93, SOAP294, YOABS95 and BowTie96 have been created based on this approach. Even faster alignment tools like SOAP397, BarraCUDA98 and CUSHAW99 combine trie-based approaches with GPGPU computing, taking advantage of parallel GPU cores to accelerate the process.

Most of the newer mapping tools are specifically designed to take into account the properties of NGS platforms: shorter reads, more data and sequencing errors. However, some tools like BLAT, SSAHA2, YOABS and BWA-SW are useful for mapping longer reads. Additionally, some mapping tools are developed specifically for certain platforms. For example, SHRiMP, BFAST and drFast map color-space reads associated with SOLiD platforms, and the SOAP and BowTie tools were designed for use with data from Illumina platforms. For more extensive information on this topic, a good review was written by Li et al.91. The selection of the mapping tool is an important consideration, also when selecting one specifically for certain SV detection methods. Split read mapping requires specific strategies, and BWA-SW and MOSAIK100 are examples of the few mapping tools that provide split mapping information. Finally, instead of alignment as a first step, some assembly-based algorithms require (whole genome) alignment as one of the later steps in SV identification, as will be discussed in the section on de novo assembly below.
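Returning to the index-based lookup underlying the trie- and BWT-based mappers discussed above, the sketch below uses a plain suffix array with binary search as a simplified stand-in for the compressed FM-index structures these tools actually use; all names and sequences are illustrative assumptions.

```python
# Minimal illustration of index-based exact search via a suffix array.
# Real mappers use compressed structures (BWT/FM-index) instead.
import bisect

def build_suffix_array(text: str) -> list:
    """All suffix start positions, sorted by the suffix they start."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def search(text: str, sa: list, pattern: str) -> list:
    """All start positions of `pattern`, via binary search on the sorted
    suffixes (materialized here for clarity; real implementations compare
    against `text` in place rather than building all suffixes)."""
    suffixes = [text[i:] for i in sa]
    lo = bisect.bisect_left(suffixes, pattern)
    hi = bisect.bisect_right(suffixes, pattern + "\xff")
    return sorted(sa[lo:hi])

ref = "GATTACAGATTACA"
sa = build_suffix_array(ref)
print(search(ref, sa, "ATTA"))   # -> [1, 8]
```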
4.2 Read pair
Many tools have been developed for SV identification based on read pair data. These use algorithms that can be grouped into two categories: those based on clustering, and those based on distributions. Algorithms from both categories can identify discordant read pairs by differences in span and orientation, and may group read pairs for increased reliability. The difference is that clustering-based algorithms identify discordant read pair mapping distances using a fixed cutoff, such as a certain number of standard deviations or a threshold based on simulations, whereas distribution-based algorithms test the mapping span distribution of a cluster of read pairs and calculate the probability of these being discordant by comparison to the genome-wide distribution.

Clustering-based methods
The first read pair-based approaches, using capillary sequencing by Tuzun et al.62 and using NGS by Korbel et al.64, both employed a clustering-based approach where a cluster is formed based on a minimum of two read pairs. These approaches used hard clustering of the reads. The standard clustering strategy used here detects SV signatures based on read pairs with discordant span and orientation, as described above in the introduction of NGS-based methods. The span is considered discordant if it deviates four or more standard deviations (SDs) from the mean. The limitations of these studies lie in the reduced sensitivity due to the use of hard clustering, as well as the fixed cutoff for the read pair distance and the fixed number of required read pairs for a cluster, which can affect the specificity strongly depending on the coverage attained66.

The VariationHunter101,102 tool improves on the previous approaches by using soft clustering, thus increasing sensitivity. The same read pair distance cutoff (four SDs) as in earlier approaches is used. After mapping of all reads, a read is removed from consideration if it has at least one concordant mapping. If a read has only discordant mappings, it is classified as discordant. Each possible mapping is then assigned to each possible cluster of reads indicating a SV. Then, two algorithms may be used for the identification of SVs based on the clusters: VariationHunter-SC (Set Cover) or VariationHunter-Pr (Probabilistic). The first algorithm identifies SVs based on maximum parsimony, selecting clusters so that the number of implied SVs introduced is minimal. The second algorithm calculates the probability of a cluster representing a true SV based on the read mappings, with clusters above a certain probability (90% was used in the paper) identified as SV clusters. Evaluation by the authors showed significant improvement in detecting SVs over previous methods, especially in repeat regions. However, sensitivity was still lacking due to GC content affecting the distribution of reads. Additionally, the fixed read pair distance cutoff used means that smaller differences in span with possibly good support are still ignored.

PEMer103 is a tool combining various functions in an analysis pipeline, with the purpose of SV identification. Reads are first pre-processed based on the sequencing platforms used, and optimally aligned to the reference genome (using hard clustering). Subsequently, discordant read pairs are identified based on the clustering approach. It is possible to merge clusters obtained from different experiments and with different cutoffs in a 'cluster-merging' step. This is a significant improvement over other tools, as it allows the use of multiple cutoffs for cluster formation and a custom cutoff for the calling of discordant read pairs. Furthermore, PEMer is modular and offers extensive customization, allowing improvements to certain modules without having to design an entirely new pipeline.
Another advantage is that PEMer can detect linked insertions as described by Medvedev et al.67, allowing the detection of insertions longer than the library insert size. Although the customizability is a large advantage, the parameters need to be set carefully to ensure good results. Implementation of a soft clustering mapping algorithm may further increase the sensitivity of this tool.

Another tool using a read pair clustering-based approach is HYDRA104. It uses soft clustering, taking into account multiple possible mappings to specifically improve the identification of SVs arising from multi-copy sequences. Multiple mappings of the same read are considered to support the same SV if they span the same interval. Based on the support for each mapping, a variant call is generated for those with the highest support. Subsequently, SV types are identified as in a standard clustering-based approach which, in addition to the standard signatures, is able to detect several other signatures for tandem duplications and inversions that increase the sensitivity for these types of SVs. Although developed for the identification of structural variant breakpoints in the mouse genome, this approach should also be applicable to the human genome. This approach may be very useful if applied to the specific identification of SVs in repeat and duplication regions. However, a significant risk in using this approach is that many false positives may be introduced if the mappings are not screened properly before the HYDRA tool is used, as mapping quality is not taken into account.

SVM2105 is a recently introduced tool that uses a read pair-based approach, including non-standard signatures typically found flanking a SV event to increase the reliability of SV detection. SV flanking regions have defining properties for insertions larger and smaller than the insert size, as well as for deletions. In addition to the default read pair span changes, OEA read pairs (One-End Anchored, read pairs of which only one read maps) are used. For deletions, there will be a sharp peak of OEA pairs on each strand about as long as the read length, as these reads cannot be mapped in their entirety. For insertions, this peak will become broader with the size of the insertion until the insert size is reached. Thus, the boundaries of an insertion larger than the insert size can be detected even though no spanning read pairs are available. Statistics on the characteristics of read pairs found around insertion and deletion regions are used in a machine-learning algorithm that detects SVs. A Support Vector Machine (SVM) is trained to recognize each of these statistics so that SVs can be classified into their respective classes. Finally, a post-processing step combines clusters of these sites and identifies the types and lengths of insertions and deletions by standard comparison of the span of read pairs to the global mean insert size. Although the boundaries of insertions larger than the insert size of the library are recognized, the size of these events cannot be determined. A comparison by the authors showed an increased specificity in detecting smaller (1-30 bp) insertions and deletions versus BreakDancer. However, the detection of SVs other than insertions and deletions was not implemented. Adapting this method to also consider read pairs that map at great distances may also increase the sensitivity for detecting translocations or MEIs.

Distribution-based methods
Distribution-based detection of discordant read pairs was introduced with the MoDIL tool106.
Using discordant as well as concordant read pairs, this tool compares the distribution of mapping distances for read pairs in a specific genomic locus to the genome-wide distribution to identify SVs. A shift in the distribution towards shorter spans indicates an insertion, whereas a shift towards longer spans indicates a deletion. This enables the identification of insertions and deletions in the range of 10-50 bp using paired-end data. An advantage of this tool is that heterozygous variants may be identified by observing a shift in half of the read pairs, which is not possible in clustering-based methods. As this tool only detects a very specific length range of insertions and deletions, it is far from comprehensive. However, it is useful for detecting smaller insertions and deletions, possibly as part of a larger pipeline.

MoGUL107 was developed based on MoDIL, but uses sequencing data from multiple genomes to enable the detection of common SVs from low-coverage genomes. After a soft clustering step, read pairs from multiple individuals are clustered. Based on these clusters, SV calls are generated from the span distribution in a manner similar to MoDIL. Based on this data, indels of 20-100 bp can be detected if the minor allele frequency (MAF) is at least 0.06. Although rare variants cannot be detected using this method, several variants that were not detected by MoDIL could be detected due to the increased power for common variants in MoGUL. Although this tool is not useful for studying a single genome, it is effective in situations where a group of individuals is studied, allowing sequencing at low coverage and thus lower costs to identify common variants. This may be useful in situations where, for example, a familial disease or population differences are studied.

BreakDancer108 combines clustering-based and distribution-based read pair SV detection by using two different algorithms. BreakDancerMax is used to detect all types of structural variation using the standard clustering strategy. BreakDancerMini is distribution-based and used to detect smaller insertions and deletions that are not found by BreakDancerMax, typically in the range of 10-100 bp. In addition to the insertions, deletions and inversions detected by previous methods, BreakDancerMax is able to identify inter- and intrachromosomal translocations. A comparison of BreakDancer to VariationHunter and MoDIL by the authors showed increased sensitivity and specificity due to the combination of the two methods, as well as the algorithmic improvements enabling the detection of other SV types. However, the detection of variant zygosity as in MoDIL is not possible using BreakDancerMini. Another possible limitation of the BreakDancer tool lies in the detection of SVs in repeat regions, as it relies on hard clustering.

Table 1: Overview of computational tools used for the detection of SVs based on NGS data. RP: Read Pair, RD: Read Depth, SR: Split Read, BP: Breakpoint, CN: Copy Number, TD: Tandem Duplications, MEI: Mobile Element Insertion, VH-SC: VariationHunter-Set Cover, VH-PR: VariationHunter-Probability, BDMax: BreakDancerMax, BDMini: BreakDancerMini, EWT: Event-Wise Testing, CBS: Circular Binary Segmentation, MSB: Mean-Shift Based, HMM: Hidden Markov Model, SV: Structural Variant, OEA: One-End Anchored, beRD: Break-end Read Depth. *Considers ambiguously mapping reads, but maps these randomly and subsequently uses only that mapping.
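To make the read pair criteria used by the clustering-based tools above concrete, the following is a minimal sketch of the discordance test; the four-SD span cutoff follows the approaches discussed above, while the record layout, orientation encoding and example values are illustrative assumptions.

```python
# Minimal sketch of discordant read pair classification: a span more than
# four standard deviations from the mean insert size, or an unexpected
# orientation, marks a pair as discordant. Layout and values are illustrative.
from dataclasses import dataclass

@dataclass
class ReadPair:
    span: int              # mapped distance between the two reads
    orientation: str       # "+-" is the expected paired-end orientation

def classify(pair: ReadPair, mean_insert: float, sd: float, n_sd: int = 4) -> str:
    if pair.orientation != "+-":
        return "discordant orientation (possible inversion/duplication)"
    if pair.span > mean_insert + n_sd * sd:
        return "span too long (suggests deletion)"
    if pair.span < mean_insert - n_sd * sd:
        return "span too short (suggests insertion)"
    return "concordant"

# With a 400 bp mean insert size and 25 bp SD, spans outside 300-500 bp are
# flagged; a cluster of two or more supporting pairs would then be required.
for rp in (ReadPair(410, "+-"), ReadPair(620, "+-"),
           ReadPair(260, "+-"), ReadPair(405, "++")):
    print(rp.span, rp.orientation, "->", classify(rp, 400.0, 25.0))
```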
4.3 Read depth
Read depth methods can be grouped into two categories: those based on differences in read depth across a single genome, and those based on case versus control data. Using a single sample, reads are mapped to the reference genome and CNVs are identified based on the average read depth or the read depth in other regions. Using case versus control data, differences in copy number ratios after mapping to a reference genome are used to identify copy number differences between the two genomes. In both categories, the algorithms use genomic 'windows' in which the read depths are measured; the window size determines the resolution at which copy number ratios are determined. Windows with similar read depths or copy number ratios are then merged to find CNV regions. Most read depth algorithms discussed here use hard clustering alignment methods, evaluating only the best mapping of each read.

The first algorithm used to detect copy number variants from NGS read depth data was an adapted circular binary segmentation (CBS) algorithm68, originally developed for use with arrayCGH data109. This was applied to a case versus control (cancer) dataset to identify somatically acquired rearrangements. The copy number ratio of the two samples was determined in genomic windows. The size of the genomic windows used was non-uniform, requiring 425 reads per window. This allows the resolution to become higher with higher sequence coverage. After mapping the reads to the reference genome, copy number change points were found by using the CBS algorithm for the segmentation of windows with differing copy numbers. The CBS algorithm segments the genome by recursively testing whether observations differ significantly from the mean of the current segment, splitting at the detected change points until no more changes can be found.

The readDepth110 tool uses a CBS-based approach similar to those used in the first read depth-based studies. A major difference is that readDepth does not require the sequencing of a control sample, but calls CNVs based on a single sample. readDepth employs the CBS-based read depth strategy, in which the genome is divided into windows and segmented by the CBS algorithm until no more differences in copy number can be detected, to identify CNV regions. However, several improvements over earlier methods are introduced. Genomic windows are calculated based on a desired FDR (False Discovery Rate) that can be input by the user, based on the number of reads. Heuristic thresholds for the detection of copy number gain and loss events are calculated based on the desired FDR and the number of reads as well. Furthermore, the readDepth tool is able to process bisulfite-treated reads in addition to regular sequencing reads, and can thus also be used to study epigenetic alterations. Several corrections for biases are introduced as well: the mapability of reads is corrected for by multiplying the number of reads in a window by the inverse of the percent mapability determined in a mapability simulation, and regions with extremely low mapability are filtered out. Read counts in each window are also normalized by using a LOESS method to fit a regression line to the data.

RDXplorer111 is a tool that detects CNVs based on the EWT (Event-Wise Testing) algorithm. This algorithm uses 100 bp windows to identify CNV regions based on the differences in read depth in a single sample. As a first step, all read counts mapped to each window are corrected for GC content.
This is done by multiplying the read count for each window by the average deviation from the read count for all windows with the same GC percentage. This manner of GC content correction has been adopted by many other read depth-based tools. The number of reads in each window is then converted into a Z-score in a two-tailed normal distribution. Based on the desired FPR (False-Positive Rate), the upper- and lower-tail probabilities identify gains and losses respectively. Afterwards, adjacent windows with a copy number change in the same direction are merged to identify the range of the CNV. The correction for GC content is a positive addition, as this is a significant bias in read depth methods. The authors state that the read counts of 100 bp windows approximate the normal distribution well at 30x coverage, but more flexible settings would be preferable, as these windows may be too small or too large in experiments with better or worse overall coverage.

JointSLM112 is an algorithm that is also based on EWT, but was developed to detect common CNVs present in multiple individuals using multivariate testing. Due to the increased statistical power gained by including multiple genomes, JointSLM is able to determine smaller CNVs than the EWT algorithm alone. Although it was designed for multivariate testing, this tool may also be used to study a single genome in a manner similar to EWT. Like other population- or group-based algorithms, this may be useful in the detection of CNVs between populations.

CNVnator113 uses a mean shift-based (MSB) approach to identify CNVs in single genomes. This approach is also derived from an algorithm designed for the identification of copy number shifts in ArrayCGH data114. The optimal window size is determined as the one at which the ratio of the average read depth to its standard deviation is roughly 4:5. In the MSB approach, copy number variant regions are identified by merging each window with flanking windows with a similar read depth. If a window with a read depth significantly different from that of the merged windows is encountered, a break is detected. Subsequently, CNVs are called based on the probability in a t-test that the read depth of that segment is significantly different from the global read depth. In addition to mapping of unique reads, CNVnator maps ambiguously mapping reads randomly to clustered read placements. Thus, it is not limited to unique regions by using best mappings only, but does not consider all possible mappings either. Read counts are corrected for GC content in a method similar to the one used in RDXplorer.

CNVeM115 uses read depth in single samples to determine CNVs by assigning ambiguously mapping reads to genomic windows fractionally. It is the only read depth-based tool that explicitly uses soft clustering. After mapping, the genome is divided into windows of 300 bp and an initial estimation of copy numbers is made based on an EM (Expectation Maximization) algorithm. A second step then evaluates all possible mappings of reads to calculate the posterior probability of each mapping, and then assigns reads fractionally to windows based on this probability. This algorithm distinguishes between read assignments with differences in sequence as small as one nucleotide, and predicts the copy numbers of each position. Instead of classifying CNVs as gains or losses, the copy number of each base is then determined based on these assigned reads, and the CNVs are determined from this copy number. In a comparison by the authors, this approach was found to have higher accuracy in detecting CNVs than CNVnator. It is also able to detect whether paralog regions are copied or deleted.
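The single-sample tools above share a common core: normalize per-window read counts, test each window's depth against a genome-wide expectation, and merge adjacent significant windows. The following is a minimal sketch of that core; the bin-based GC correction, the robust (median/MAD) z-score and the synthetic data are illustrative assumptions, not the exact normalizations or tests of any specific tool.

```python
# Minimal sketch of single-sample read depth CNV calling: GC correction,
# a robust z-score per window, and merging of adjacent significant windows.
import statistics

def gc_correct(counts, gc, bins=10):
    """Scale each window's count by global mean / mean of its GC bin."""
    global_mean = statistics.mean(counts)
    by_bin = {}
    for c, g in zip(counts, gc):
        by_bin.setdefault(int(g * bins), []).append(c)
    bin_mean = {b: statistics.mean(v) for b, v in by_bin.items()}
    return [c * global_mean / bin_mean[int(g * bins)] for c, g in zip(counts, gc)]

def call_cnvs(counts, z_cut=3.0):
    """Merge runs of adjacent windows whose |robust z| exceeds the cutoff."""
    med = statistics.median(counts)
    scale = 1.4826 * statistics.median([abs(c - med) for c in counts]) or 1.0
    calls, run = [], None
    for i, c in enumerate(counts):
        z = (c - med) / scale
        state = "gain" if z > z_cut else "loss" if z < -z_cut else None
        if run and state == run[2]:
            run = (run[0], i, state)            # extend the current run
        else:
            if run:
                calls.append(run)
            run = (i, i, state) if state else None
    if run:
        calls.append(run)
    return calls  # [(first_window, last_window, "gain"/"loss"), ...]

# Synthetic 100 bp windows at ~100x coverage with a deletion-like dip.
counts = [100, 98, 103, 101, 99, 102, 97, 100, 104, 96,
          52, 49, 51, 50, 48, 101, 99, 100, 102, 98]
gc = [0.45] * len(counts)   # uniform GC here, so the correction is a no-op
print(call_cnvs(gc_correct(counts, gc)))   # -> [(10, 14, 'loss')]
```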
In a comparison by the authors, this approach was found to have higher accuracy in detecting CNVs than CNVnator. It is also able to detect whether paralogous regions are copied or deleted.

BICseq116 is a tool that uses the MSB approach for the identification of CNVs, but is designed for use with case versus control data instead of single samples. The definition of windows, merging of windows, and calling of CNVs are done similarly to the process in CNVnator. However, BICseq uses the Bayesian Information Criterion (BIC) as the merging and stopping criterion. By using the BIC, no bias is introduced by assuming a Poisson distribution of reads along the chromosome, increasing the reliability of the results. Furthermore, the case versus control approach is used to correct for the GC content bias.

CNV-seq117 is a tool for CNV detection based on the case versus control approach. This tool contains a module for calculating the best window size based on the desired significance level, a copy ratio threshold and the attained coverage level. After mapping of the reads to the genome, genomic regions with potential CNVs are identified by sliding non-overlapping windows across the genome and measuring the copy number ratio in each window. The probability of these ratios occurring by chance is calculated with a t-statistic, under the null hypothesis that no copy number variation is present. The hypothesis is rejected if the probability of a CNV exceeds the user-defined threshold, and a difference in copy number between the two genomes is inferred.

Segseq118 uses a strategy that focuses on the CBS-based identification of CNV breakpoints from copy number ratios in case versus control data. Similar to CNV-seq, sliding windows are used to compare copy number ratios. However, Segseq has a variable window size based on a user-specified number of required reads. Segseq identifies breakpoints by comparing the copy number ratio in each window to those in the adjacent windows. A significant change in the ratio relative to either neighboring window identifies a breakpoint and a copy number change. Subsequently, all windows with the same copy number ratio are merged to identify copy number variant and copy number balanced regions.

rSW-seq119 is a tool that, similar to Segseq, uses case versus control read depths to identify changes in copy number ratio. However, rSW-seq directly identifies CNV regions by registering cumulative changes in the ratio as breakpoints of CNVs. Reads from both samples are sorted according to their mapping position on the genome, and the read depths of the two samples are subtracted from each other. Local positive or negative sums indicate copy number gains or losses. Regions with equal read depths are ignored, and regions where read depth differences are detected are defined as CNVs. This gives a very intuitive overview of where CNVs are found, and can also identify CNV regions within other CNVs. rSW-seq's resolution depends on the sequencing depth, but seems limited, as CNVs smaller than 10 kb were not reported. It is the only read depth-based tool discussed here that does not require the specification of genomic windows.
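The cumulative-difference scan that this approach builds on can be illustrated in a few lines. In this toy version, which assumes equal library sizes and ignores the read weighting used by the actual algorithm, case reads contribute a positive increment and control reads a negative one, and a Kadane-style scan reports the interval with the largest local rise as a candidate gain.

```python
def cumulative_difference(case_positions, control_positions):
    """Merge sorted read positions; +1 for case reads, -1 for control reads."""
    events = sorted([(p, +1) for p in case_positions] +
                    [(p, -1) for p in control_positions])
    positions, sums, total = [], [], 0
    for pos, step in events:
        total += step
        positions.append(pos)
        sums.append(total)
    return positions, sums

def best_gain(positions, sums):
    """Kadane-style scan for the interval with the largest local rise (candidate gain)."""
    best = (0, None, None)                 # (score, start, end)
    low, low_pos = 0, positions[0]
    for pos, s in zip(positions, sums):
        if s - low > best[0]:
            best = (s - low, low_pos, pos)
        if s <= low:
            low, low_pos = s, pos
    return best

# Control is flat; the case sample has extra reads between 5000 and 6000.
case = list(range(0, 10000, 100)) + list(range(5000, 6000, 10))
control = list(range(0, 10000, 100))
print(best_gain(*cumulative_difference(case, control)))   # (101, 5000, 5990)
```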
CNAseg120 is another tool that uses genomic windows to identify differences in copy number between case and control data. In addition to LOWESS regression normalization for GC content, CNAseg uses the Discrete Wavelet Transform (DWT) to de-noise the data, smoothing out regions with low mapability. This is necessary because a novel HMM-based (Hidden Markov Model) segmentation step is introduced to segment the windows based on the read depth. An additional algorithm then uses Pearson's χ2 test to merge segments with a similar copy number ratio, and the copy number state is estimated by comparing the log ratio of read depths. This identifies segments of contiguous windows with similar read depth, which are then defined as CNVs. This tool was shown by the authors to increase the specificity and lower the number of false positives compared to CNV-seq without affecting sensitivity.

Unless specified otherwise, the single-sample read depth-based tools discussed here assume a uniform Poisson distribution of reads over the whole genome, thus attributing any aberration in read depth to copy number. As read depths do in fact vary over the genome due to various biases70, more accurate models like the BIC better approximate the distribution of reads over the genome. Although all tools described here are able to detect differences in copy number within or between genomes, the actual copy number of these regions is not always determined automatically. In most studies, the copy number is estimated by normalizing the median read depth in a copy number variant region to that of copy number 2 and rounding to the nearest integer68,111,112. This has been shown to work well for most platforms by comparing to regions with known copy numbers; however, the copy numbers did not correlate well for the SOLiD platform121.

In a recent review of read depth approaches121, it was found that the EWT-based tools provide the best results in terms of both sensitivity and specificity. CBS- and MSB-based tools are better at detecting CNVs spanning a large number of windows (50-100), but worse at detecting those spanning a smaller number of windows (5-10). CNAseg performs better on high coverage data, but worse on low coverage data. CNV-seq seems to perform more poorly overall. In combination with high coverage data, the EWT-based tools detect CNVs as small as 500 bp, while the CBS- and MSB-based tools identify CNVs with a size of 2-5 kb. Thus, there seems to be a great deal of variation between the performance of different tools, partly depending on the type of data that is used.

4.4 Split read

Few tools have yet been developed for the identification of SVs using split read methods on NGS data. Most of these rely on specific alignment strategies to identify breakpoints.

Pindel72 uses a pattern growth algorithm to identify the breakpoints of deletions and insertions. As described above, this tool uses anchored split mapping. Read pairs are selected in which one read maps uniquely while the other cannot be mapped under a certain threshold. With the uniquely mapping read as the anchor point, the direction of the read as well as the user-specified maximum deletion size are used to define a region in which Pindel will attempt to map the other read. This is done using the pattern growth algorithm, which searches for minimum (to find the 5' end) and maximum (to find the 3' end) unique substrings to map both sides of the read. The read is then broken into either two fragments (deletion) or three fragments (insertion, with the short inserted fragment in the middle). At least two supporting reads are required for each event. Pindel is able to identify breakpoints at base pair accuracy, even for deletions as large as 10 kb.
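The anchored split-mapping idea can be demonstrated with a toy deletion call. The sketch below is a simplification of the strategy described above rather than Pindel itself: it matches the longest prefix and suffix of the unmapped mate inside the search window defined by the anchor, and reports the skipped reference bases as a deletion. Sequences and function names are invented for the example.

```python
def split_map(reference, read, window_start, window_end):
    """Match the read's longest prefix and suffix within the search window;
    the reference gap between the two matches is a candidate deletion."""
    region = reference[window_start:window_end]
    # longest prefix of the read occurring in the region
    p = len(read)
    while p > 0 and read[:p] not in region:
        p -= 1
    prefix_pos = region.find(read[:p])
    # the remainder of the read, matched downstream of the prefix match
    s = len(read) - p
    while s > 0 and read[-s:] not in region[prefix_pos + p:]:
        s -= 1
    suffix_pos = region.find(read[-s:], prefix_pos + p)
    if p + s == len(read) and suffix_pos > prefix_pos + p:
        del_start = window_start + prefix_pos + p
        del_len = suffix_pos - (prefix_pos + p)
        return del_start, del_len
    return None

# Reference with a 12 bp segment (positions 12-24) deleted in the donor genome.
ref = "ACGTACGGTTAGCATCGATTGGCCAATCGGATTACA"
donor = ref[:12] + ref[24:]
read = donor[6:18]                            # donor read spanning the breakpoint
print(split_map(ref, read, 0, len(ref)))      # -> (12, 12)
```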
Although the sensitivity of this approach is still problematic in repeat regions, allowing mismatches in the mapping of the anchor read may increase the sensitivity in the future. By reducing the search space, the chance of mapping partial reads to the human genome is significantly increased, and split read mapping is made possible for NGS platforms. However, the search space may be affected by insertions or deletions in between the reads. By combining this approach with information on the mapping distance of surrounding read pairs, the accuracy may be increased.

The AGE122 (Alignment with Gap Excision) tool adopts a strictly alignment-based approach to split read mapping. Given two sequences in the approximate location of SV breakpoints, it simultaneously aligns the 5' and 3' ends of both sequences in a manner similar to Smith-Waterman local alignment. The final alignment is then constructed by tracing back from the maximum-scoring position in the matrix of each alignment and then joining the 5' and 3' alignments. The SV region is the unaligned region in between. This approach is able to identify SV breakpoints with base pair accuracy, and also the exact SV length and sequence if the whole sequence is supplied. However, it does require external identification of the SV region as well as two sequences as input. These sequences need to be unique enough for proper alignment, which means that either the putative SV region needs to be small enough or the provided sequences long enough, which is often difficult to obtain with current NGS platforms. The SV type needs to be determined by additional processing of the results. Considering the input and additional processing needed, the alignment algorithm is mainly useful for SV identification as part of a larger pipeline.

ClipCrop123 detects SVs by using soft-clipping information. Soft-clipped sequences are defined as partially unmatched fragments in a mapped read. Unmapped parts of partially mapped sequences are used, with a minimum length of 10 bp. Subsequently, these clipped fragments are remapped to the reference genome within a maximum of 1000 bp on either side of the mapped part. Sequences mapping further ahead indicate deletions, inversely mapping sequences indicate inversions, sequences mapping before the mapped read indicate tandem duplications, and a cluster of unmapped reads from both sides indicates insertions. Similarly to read pair methods, tandem duplications additional to those already present in the reference genome cannot be detected. Remapping of unmapped reads is used to differentiate between novel insertions and mobile element insertions/translocations, with novel insertions not expected to map to the reference genome. Clipped reads are clustered if they support the same event, and a reliability score based on this support is used to determine the most likely event. ClipCrop is able to detect a larger variety of signatures, and is not limited by the direction of the search space like Pindel. Furthermore, ClipCrop was shown to detect short duplications (<170 bp) more efficiently than CNVnator, BreakDancer and Pindel based on simulated data. However, the detection of larger events was worse than with other methods.
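The positional rules that ClipCrop applies to remapped clipped fragments can be summarized in a small decision function. This is an illustrative rendering of the signatures described above, not ClipCrop's code; orientation handling is reduced to a single flag, and the 1000 bp limit is taken from the description.

```python
def classify_clip(read_end, clip_pos, clip_strand_flipped, max_dist=1000):
    """Classify a soft-clipped segment by where its remapped copy lands
    relative to the anchored part of the read (toy version of the rules above)."""
    if clip_pos is None:
        return "unmapped (possible novel insertion if clustered)"
    if abs(clip_pos - read_end) > max_dist:
        return "outside search space"
    if clip_strand_flipped:
        return "inversion"
    if clip_pos > read_end:
        return "deletion"          # clip maps further ahead on the reference
    return "tandem duplication"    # clip maps before the anchored part

print(classify_clip(read_end=5000, clip_pos=5400, clip_strand_flipped=False))  # deletion
print(classify_clip(read_end=5000, clip_pos=4700, clip_strand_flipped=False))  # tandem duplication
print(classify_clip(read_end=5000, clip_pos=5200, clip_strand_flipped=True))   # inversion
```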
4.5 De novo assembly

Assembly-based identification of structural variation requires two steps: the assembly of the sequence, and the alignment of this sequence against a reference genome for detection of the variants. Assembly can be performed either completely de novo, or by using varying degrees of information from a reference assembly. Assembly can currently be used to identify SVs in two ways: local sequence assembly allows the reconstruction of loci with possible variants, while whole genome assembly would provide the most comprehensive identification of structural variation in a genome by aligning (large parts of) whole genomes. Alignment to the reference genome may then identify all types of SVs as well as CCRs, using methods similar to split read mapping.

Genome assembly

The first step, genome assembly, is not a trivial task. Several recent reviews have been published on this topic, explaining it in more detail63,74,124. Repeat sequences, read errors and heterozygosity present the greatest challenges here, and the short read length of NGS platforms compounds these challenges further. Previous assemblers used for the assembly of Sanger sequencing reads were insufficient for use with NGS data, so several new assemblers have been developed to deal with these problems. NGS assemblers can be divided into four categories: greedy algorithms, Overlap/Layout/Consensus (OLC) methods, de Bruijn graph (DBG) methods and string graphs74,124.

Most early assemblers used greedy algorithms. These operate by simply extending a seed sequence with the next highest-scoring overlapping read until this is no longer possible. The score is calculated based on the amount of overlapping sequence. A problem with these algorithms is that false positives are easily added to a contig, especially with shorter reads. Two identical overlapping sequences in the genome may lead to the incorporation of unrelated sequences, producing a chimera. Examples of assemblers using greedy algorithms are SSAKE125, SHARCGS126 and VCAKE127. This category of assemblers is generally not used for NGS platforms, except when assembly is performed in combination with Sanger sequencing data.

Overlap/Layout/Consensus assembly was used extensively for Sanger data, and some assemblers have been adapted for use with NGS data. OLC assembly involves three steps: first, all reads are aligned to each other in a pair-wise comparison using a seed-and-extend algorithm. Then, an overlap graph is constructed and manipulated to obtain an approximate read layout. Finally, multiple sequence alignment (MSA) determines the consensus sequence. Examples of assemblers that use this approach are Newbler128, which is distributed by 454 Life Sciences, and the Celera Assembler129, which was first used for Sanger data and subsequently revised for 454 data, now called CABOG. Edena130 and Shorty131 use the OLC approach for the assembly of shorter reads from Solexa and SOLiD platforms.

The de Bruijn graph approach has been widely adopted and is mostly applied to shorter reads from Solexa and SOLiD platforms. Instead of calculating all alignments and overlaps, this approach relies on k-mers of a certain length that are present in any of the reads. k-mers must be shorter than the read length, and are represented by nodes in the graph. These nodes are connected by edges to the nodes of k-mers they are found adjacent to in the same read. Ideally, the k-mers would form one path that can be traversed to reconstruct the entire genome. However, this method is more sensitive to repeats and sequencing errors than OLC, and many branches can be found in these graphs. Disadvantages of DBG assembly are that information from reads longer than the k-mers is lost, and that the choice of k-mer size has a large effect on the results.
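The k-mer decomposition underlying DBG assembly can be made concrete with a few lines of code. The sketch below builds a toy graph in which, as is conventional, nodes are (k-1)-mers and every k-mer contributes an edge; coverage tracking, error correction and graph simplification are all omitted.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a toy de Bruijn graph: nodes are (k-1)-mers, k-mers define edges."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])   # duplicate edges reflect coverage
    return graph

reads = ["ATGGCGT", "GCGTACA"]                  # overlapping reads from a toy genome
for node, succs in sorted(de_bruijn(reads, 4).items()):
    print(node, "->", succs)
# Here the graph is a single unbranched path spelling ATGGCGTACA; in real data,
# repeats and sequencing errors introduce the branches discussed above.
```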
Some assemblers use approaches that retain read information during assembly, but these require more computational power. Euler132 was the first assembler to use the DBG approach. Velvet133 and ALLPATHS134 were introduced later, offering improved assembly speed and contig length and allowing the use of read-pair data. These assemblers are able to assemble entire bacterial genomes. ABySS135 was the first assembler used to assemble a human genome from short reads. SOAPdenovo136 was introduced later and is also able to assemble larger (and human) genomes.

Finally, string graphs can be used to compress read and overlap data in assembly124. The primary advantages of string graphs over DBGs are that the data is compressed further, so assembly can be performed more efficiently, and that full reads can be used instead of k-mers. String graphs are based on the overlap between reads or k-mers. As in DBG assembly, each sequence is represented by a node, and nodes with overlapping sequence are connected by edges. In this case, an edge is labeled with the non-overlapping sequence between the two nodes. Thus, all possible paths are constructed while the entire sequence remains retrievable by following the edges. This approach is used by the String Graph Assembler (SGA)137, which is able to assemble an entire human genome on a single machine, and corrects single-base sequencing errors.

Several updated assemblers like ALLPATHS-LG138, Velvet 1.1 and Euler-USR139 show significant improvements over their predecessors. For example, they allow the incorporation of longer reads and mate-paired reads to enhance the assembly of shorter reads, are able to assemble larger genomes, and allow the input of data from a wider range of NGS platforms. Although de novo assembly of human genomes using shorter reads is now possible, several limitations remain. In addition to significant sequence contamination, it was found that de novo assemblies are significantly shorter than the reference genome, and that large parts of repeated (420.2 Mb) and duplicated sequence (99.1% of total duplicated sequence in the reference genome) are missing from genomes assembled from NGS data63. Until the introduction of more reliable 'third-generation' sequencing with longer read lengths, it remains important to include data from established large-molecule sequencing methods to inform and control the data generated with NGS platforms. Using information from alignment to the reference genome may also help to increase the reliability of assembly. For example, the Cortex assembler140,141 used in the 1000 Genomes Project can use varying degrees of information from a reference genome for assembly. However, using a reference genome may bias the data, and the problems in repeat and duplication regions will remain due to alignment problems in these regions.

Identification of structural variation

Although much work has been done to improve assembly algorithms, the identification of structural variation using these data has been studied far less. This is partially due to the problems and costs still involved in de novo assembly, prohibiting the use of assembly methods to detect SVs142. Ideally, a fully accurate sample genome could simply be compared to a reference genome by alignment, with differences in the alignment indicating SVs, as indicated in Figure 1. However, in addition to full de novo assembly currently not being possible, proper alignment of genomes and detection of these signatures are still significant challenges.
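The idea of reading SVs directly from a genome-to-genome alignment can be sketched under a strong simplifying assumption: that the alignment has already been summarized as ordered, colinear blocks of matched reference and sample intervals, which is precisely the hard part in practice. Given such blocks, a gap on the sample side implies a deletion and a gap on the reference side an insertion.

```python
def svs_from_alignment(blocks):
    """blocks: ordered aligned segments [(ref_start, ref_end, smp_start, smp_end), ...].
    Gaps between consecutive blocks on one genome but not the other imply SVs."""
    calls = []
    for (r1s, r1e, s1s, s1e), (r2s, r2e, s2s, s2e) in zip(blocks, blocks[1:]):
        ref_gap, smp_gap = r2s - r1e, s2s - s1e
        if ref_gap > 0 and smp_gap == 0:
            calls.append(("deletion", r1e, ref_gap))        # sample lacks ref bases
        elif smp_gap > 0 and ref_gap == 0:
            calls.append(("insertion", r1e, smp_gap))       # sample has extra bases
        elif ref_gap > 0 and smp_gap > 0:
            calls.append(("complex/substitution", r1e, (ref_gap, smp_gap)))
    return calls

blocks = [(0, 1000, 0, 1000),        # colinear block
          (1500, 2000, 1000, 1500),  # 500 bp missing from the sample: deletion
          (2000, 2600, 1800, 2400)]  # 300 bp extra in the sample: insertion
print(svs_from_alignment(blocks))
```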
Currently, the assemblers discussed here may also be used to assemble smaller genomic regions, so that structural variation can be identified by alignment of those regions. The AGE tool122 that was discussed for split read mapping is able to align large contigs, even those containing multiple SVs, enabling it to potentially identify SV regions based on de novo assembled contigs as well. As the methodology for the identification of SVs using de novo assembly data is similar, other split read-based methods may also be adapted for use with assembly data. Another tool called LASTZ143, based on BLASTZ144, was optimized specifically for aligning whole genomes. This was recently used in the detection of structural variation in two de novo assembled human genomes73,136. After whole genome alignment, the gaps and rearrangements in the alignment were extracted as putative SVs. Subsequently, over 130,000 SVs of several types (inversions, insertions, deletions, and complex rearrangements) were identified in each genome. However, the methodology for the identification of specific variants was not discussed.

A tool called NovelSeq145 was designed specifically for the detection of novel sequence insertions in the genome. The first step in this process is the mapping of all read pairs to the reference genome using mrFAST. Read pairs of which neither read can be aligned are classified as orphan reads, and if only one read can be aligned the read pair is classified as OEA (one-end anchored). The hypothesis is that these orphan and OEA reads belong to novel sequence insertions. Subsequently, all orphan reads are assembled into longer contigs using available assembly tools such as EULER-USR and ABySS. The OEA reads are then clustered on the reference genome to find reads supporting the same insertion. Clustering is performed with a maximum parsimony objective, implying as few insertions as possible while explaining all OEA reads. Finally, these OEA clusters are assembled into longer contigs as well, which are used to anchor the orphan (insertion) contigs by aligning overlapping sequences. The identification of novel sequence insertions is an important step in the characterization of all structural variation in the human genome. Several insertion breakpoints could not be identified conclusively or at base pair resolution, as multiple insertion breakpoints may be reported when OEA clusters map ambiguously to the genome. However, the information provided could significantly reduce the search space for these breakpoints, allowing other methods (e.g., split read mapping) to validate them reliably.

The Cortex assembler141 introduces a novel way to detect SVs based on DBG assembly. Colored de Bruijn graphs (CDBGs) are an extension of the classical DBGs. In CDBGs, multiple samples are displayed in one graph, and the nodes and edges in the graph are colored based on the sample they derive from. These samples may be different sequenced genomes, reference sequences, known variant sequences or a combination of those. The alignment of these samples will show 'bubbles' in one sequence where the sequences differ, and different types of bubbles indicate different variants. The simplest bubbles to detect are those for SNPs, which can be detected using the Bubble-Calling (BC) algorithm. Deletions and insertions, where either the reference (deletion) or the sample (insertion) shows a bubble, are also detectable using the Path-Divergence (PD) algorithm.
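A minimal bubble caller over a toy colored de Bruijn graph illustrates the principle. This sketch is not Cortex's BC algorithm: it only handles simple two-path bubbles in repeat-free sequence, and all sequences are invented. A SNP between a 'reference' and a 'sample' color shows up as a divergence node whose two branches reconverge.

```python
from collections import defaultdict

def build_cdbg(samples, k):
    """Toy colored de Bruijn graph: nodes are (k-1)-mers; edges carry sample colors."""
    graph, colors = defaultdict(set), defaultdict(set)
    for name, seq in samples.items():
        for i in range(len(seq) - k + 1):
            u, v = seq[i:i + k - 1], seq[i + 1:i + k]
            graph[u].add(v)
            colors[(u, v)].add(name)
    return graph, colors

def walk(graph, start, limit=100):
    """Follow unambiguous successors from a branch node."""
    path, node = [start], start
    while len(graph.get(node, ())) == 1 and len(path) < limit:
        node = next(iter(graph[node]))
        path.append(node)
    return path

def find_bubbles(graph):
    """Report (divergence node, reconvergence node) for simple two-path bubbles."""
    bubbles = []
    for node, succs in graph.items():
        if len(succs) == 2:
            p1, p2 = (walk(graph, s) for s in succs)
            shared = set(p1) & set(p2)
            if shared:
                meet = next(n for n in p1 if n in shared)
                bubbles.append((node, meet))
    return bubbles

# Reference and sample differ by one SNP; the variant shows up as a bubble.
samples = {"ref":    "ATGGACTTGCCGTTAGA",
           "sample": "ATGGACTTCCCGTTAGA"}
graph, colors = build_cdbg(samples, k=5)
print(find_bubbles(graph))                                       # [('ACTT', 'CCGT')]
print(colors[("ACTT", "CTTG")], colors[("ACTT", "CTTC")])        # {'ref'} {'sample'}
```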
Although other types of SVs can theoretically be detected as well, these signatures are more complicated and are confounded by branching paths in the assembly due to repeats or duplications. Thus, the identification of SVs currently seems reliable only in non-repetitive regions. SVs defined as complex have been reported, but these were not classified further. Cortex also allows population-based investigation by aligning multiple genomes, and can identify novel insertions based on this information. Assemblies could still be improved by using read pair information, and SV classification does not yet seem to be fully implemented. Although the reliability of this method has not yet been investigated thoroughly, this tool is an important step towards complete assembly-based identification of SVs.

4.6 Combined approaches

Genome STRiP146 uses read pair and read depth information to identify SVs in populations, and identifies breakpoints by using assemblies of unmapped reads from another study140 to span potential breakpoints. This tool was designed for use with 1000 Genomes Project147 data, specifically to reduce false positives in SV identification, especially in population studies. After read pair-based detection of discordant read pairs, those in the same genomic region sharing a similar difference in insert size are clustered across different samples to increase the power of SV detection. Furthermore, heterogeneity in a population is used to filter out possible false positives that appear weakly across many genomes, while keeping variants with a strong signal in one or more genomes. The correlation between read depth and read pair information is also used to filter out false positive SVs: if read pairs indicate a possible deletion, it should be supported by a lower read depth in the samples with the detected deletion, but not in general. The approximate breakpoints based on read pair and read depth data could be resolved by assembly of unmapped reads, allowing the identification of breakpoint locations at base pair resolution. Compared to other methods, Genome STRiP was found to detect fewer false positives and more deletions in total in a comparison by the authors. The detection of rare alleles in the population with a sensitivity comparable to single-sample methods required higher than average coverage (8x versus an average of 4x). For the detection of smaller deletions (<300 bp), Pindel was more effective. This approach is currently only able to identify deletions in large populations, but the identification of other types of SVs is being worked on. Further development of these methods may allow reliable population-based identification of structural variation by integrating many different signals.

CNVer148 is a tool that combines read depth information with read pair mapping for the accurate identification of CNVs. Without typing the SVs, discordant read pair mapping information is used to identify regions that differ between the reference and the donor genome. Independently of this, the read depth signal is used to identify regions with losses or gains. These signals are considered jointly in a framework termed a 'donor graph'. Reads that map to several locations are considered for each location, and are connected if they are adjacent in the reference genome or linked by read pairs. Based on the differences in read depth and the presence of discordant mate pairs, CNV calls are made. These data are also used to predict the copy count of each region.
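The core idea of demanding agreement between the two signals can be reduced to a simple joint filter. The sketch below illustrates the principle rather than CNVer's donor-graph machinery: a candidate deletion from read pair clustering is kept only if the region's read depth also drops below a threshold fraction of the genome-wide mean. All thresholds and data are invented for the example.

```python
import numpy as np

def joint_deletion_calls(candidates, depth, min_pairs=3, max_ratio=0.6):
    """candidates: [(start, end, n_discordant_pairs), ...] from read-pair clustering.
    Keep a call only if the region's mean depth also drops below max_ratio
    of the genome-wide mean (toy combination of the two signals)."""
    genome_mean = depth.mean()
    calls = []
    for start, end, pairs in candidates:
        if pairs < min_pairs:
            continue
        region_ratio = depth[start:end].mean() / genome_mean
        if region_ratio < max_ratio:
            calls.append((start, end, pairs, round(region_ratio, 2)))
    return calls

rng = np.random.default_rng(1)
depth = rng.poisson(30, 10_000).astype(float)
depth[4000:4500] = rng.poisson(15, 500)       # heterozygous deletion: roughly half depth
candidates = [(4000, 4500, 8),                # supported by pairs and depth: kept
              (7000, 7300, 5)]                # pairs only, normal depth: filtered out
print(joint_deletion_calls(candidates, depth))
```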
This method has several advantages over read depth- or read pair-only methods. The location of deletions detected by read depth can be determined by using the read pair signature. The tool uses soft clustering, which increases the sensitivity in repeat and duplication regions, and requires information from both read depth and read pair signals to reduce false positives. Furthermore, it is possible to detect regions with tandem duplications additional to those already present in the reference genome, as well as insertions larger than the insert size, which is not possible using traditional read pair methods. However, SVs other than deletions can only be called as CNVs without a specific location. A comparison to other read depth- and read pair-based approaches by the authors shows that the method is less sensitive to noise and false positives; however, many confirmed SVs are still detected by either read pair or read depth methods alone but not by CNVer, indicating that its sensitivity is not maximized.

Another tool that uses both read depth and read pair information is GASVPro149, which integrates both signals into a probabilistic model. In read mapping, GASVPro is able to consider all possible alignments by using a Markov Chain Monte Carlo (MCMC) approach that calculates the posterior probability of each variant over all possible alignments of each read (soft clustering), but a hard clustering approach (GASVPro-HQ) is also available. In addition to the standard read pair signatures, the read depth is used in the form of localized coverage drops that occur at the breakpoints of both copy number-variant and copy number-invariant SVs. This signal is called breakend read depth (beRD), and is also used to predict the zygosity of variants. GASVPro uses both the number of discordant read pairs and the beRD signatures at each breakpoint to determine the probability of each potential SV and to remove false positives. A comparison to other SV detection methods on low coverage data, including BreakDancer, Hydra and CNVer, showed comparable sensitivity but much higher specificity in detecting deletions from lower quality data when using GASVPro, as far fewer false positives were predicted. For insertions, GASVPro was the most sensitive method, but at the cost of many possible false positives. On higher coverage data, tools that use a hard clustering approach (BreakDancer, HydraHQ, GASVPro-HQ) performed better. The increased specificity obtained by considering both read pair and read depth signals is effective for detecting large deletions reliably, and with further implementation may be useful in the detection of other types of variants. However, the detection of inversions was not significantly improved by using the beRD signal, and the detection of SV types other than insertions and deletions has not been implemented.
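The breakend read depth signal can also be illustrated directly. Assuming a per-base depth array, the toy scan below flags positions where coverage immediately downstream falls sharply relative to the local upstream background, the kind of localized drop described for GASVPro; the flank size and drop threshold are arbitrary choices for this example.

```python
import numpy as np

def breakend_candidates(depth, flank=200, drop=0.5):
    """Flag positions where mean depth just after the position falls below
    `drop` times the mean depth just before it (toy breakend-read-depth scan)."""
    hits = []
    for pos in range(flank, len(depth) - flank):
        before = depth[pos - flank:pos].mean()
        after = depth[pos:pos + flank].mean()
        if before > 0 and after / before < drop:
            hits.append(pos)
    return hits

rng = np.random.default_rng(2)
depth = rng.poisson(40, 5000).astype(float)
depth[2000:2600] = 0.0                   # homozygous deletion: coverage falls to zero
hits = breakend_candidates(depth)
print(hits[0] if hits else None)         # first flagged position, just upstream of 2000
```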
inGAP-SV150 uses read depth and read pair data to detect and visualize large and complex SVs. After alignment of reads to the genome, the read depth signature is used to detect SV 'hotspots' through gap signatures, drops in read depth that are also called beRD signatures in GASVPro. In these hotspots, SVs are called and classified based on a heuristic cutoff for discordantly mapping read pairs. The called variants are then evaluated based on information on mapping quality and read depth. inGAP-SV can identify different types of SVs, including large insertions and deletions, translocations and (tandem) duplications. The beRD signal is also used to determine the zygosity of variants, as the regions flanking homozygous SVs are expected to have a read depth of zero, whereas for heterozygous events the read depth is reduced to about half the local level. Novel insertions larger than the insert size are also detected by looking for OEA reads. Finally, the results are visualized in a genome browser-like display, with different representations for different signatures. The user can then inspect this information and annotate the putative events. The authors compared the detection of deletions in a confirmed reference set against other tools, including BreakDancer and VariationHunter, and found that inGAP-SV's combined approach was more sensitive. A comprehensive comparison of the detection of other types of SVs was not performed. The visualization supplied by inGAP-SV is unique among the investigated tools, and may be very useful for researchers who want to investigate regions of interest in more detail.

A recently introduced tool called CREST151 uses hanging soft-clipped reads in a method similar to the one used by ClipCrop123. However, CREST uses a case versus control approach that first filters out any soft-clipped reads that also occur in the control genome. This removes artifacts, and allows the detection of somatic variants in, for example, cancer samples. After mapping to the reference genome, all soft-clipped reads mapping to the same location are first assembled into a contig. Thus, CREST uses a combination of split read and assembly methods. The contigs are then remapped to the genome iteratively using BLAT to identify candidate partner breakpoints on the genome. For each partner breakpoint, the reads are similarly assembled into a contig. Based on the alignment between the two assembled breakpoint contigs and their mapping locations, a putative SV is called. The SV type can then be identified in a method similar to the one used by ClipCrop. CREST is able to detect all signatures except tandem duplications. Unlike other split read-based methods, CREST is designed for the detection of larger events. In a comparative analysis using simulated data by the authors, CREST outperformed both BreakDancer108 and Pindel72 in terms of sensitivity and specificity. This may be due to the nature of the data, as CREST is designed to detect somatic events and the other tools are not.

Finally, SVMerge152 attempts to integrate the results of several SV calling tools into one pipeline to maximize the number of SV calls in one run. BreakDancerMax is used to call deletions, insertions, inversions and translocations based on read pair mapping. Pindel is used to call insertions of 1-20 bp and deletions smaller than 1 Mb. RDXplorer is used to detect CNVs based on read depth information and to determine the zygosity of events. SECluster and RetroSeq are two tools that were developed specifically for implementation in this pipeline to detect insertions. SECluster detects large insertions by identifying clusters of OEA reads, similarly to NovelSeq. RetroSeq detects MEIs by looking for read pairs of which one read maps to the reference and the other can be aligned to a known mobile element in Repbase153. After all calls have been made, they are validated by de novo assembly of the reads at the predicted SV breakpoints. The resulting contigs are aligned to the reference genome, and the results of this alignment are used to confirm breakpoints and increase the resolution, and to filter out false positives if the alignment does not match the predicted SV.
As heterozygous events cannot be validated by de novo assembly, read depth information is also used in this step. The pipeline is able to determine CNVs, the location of deletions detected by read depth analysis, and insertions and inversions that are confirmed by assembly. This pipeline was found to decrease the false discovery rate of the individual tools significantly. SVMerge is the first meta SV caller and introduces important validation steps after the merging of results from different tools. However, a 50% overlap is used as the requirement for merging calls from different tools, which is a rather lenient cutoff and may result in the merging of distinct events. Although SVMerge still only detects a subset of structural variants, the pipeline is modular, which means that the sensitivity may be raised even further by integrating other tools. Actual integration of the signals before calling the SVs would enhance the specificity of detection even more, and would be the next logical step in the combination of the NGS-based signals that can be used for the detection of SVs.

5 Discussion

The status quo

Here, I have given an overview of the currently available tools for the detection of structural variation with NGS platforms. As discussed in the introduction of the four NGS-based methods of SV detection, each method has its own advantages and limitations. An evaluation of the performance of each of the discussed tools is beyond the scope of this thesis, and a comprehensive comparison would in any case be difficult, as most tools focus on the detection of a different or new class of structural variation by introducing a new method or algorithm. Most papers accompanying the introduction of a new tool provide a comparison against other tools, but mostly focus on the tool's own abilities as a proof of principle, without considering the full spectrum of structural variation. Read depth approaches alone seem to have matured enough that most tools aim at detecting the same range of SVs. This is possibly due to the fact that many of them are based on algorithms first applied to microarray data, and that only a limited spectrum of variation is detectable from read depth information. A good review of the performance of read depth-based tools was recently published by Magi et al.121.

From the information gathered here, we can see trends in the development of tools for each of the four NGS-based methods of SV detection. Most new tools for each of the methods have focused on strengthening the advantages and minimizing the disadvantages inherent to the information that is used. Read pair methods are able to detect the broadest range of SV types and sizes, but the detection of larger insertions is limited by the insert size, and the variance in library insert size limits the resolution. Newer read pair-based tools have focused on removing these limitations by using both clustering- and distribution-based algorithms to increase the range of detectable SV sizes, and have developed algorithmic strategies to detect additional signatures associated with structural variation. Read depth methods are able to detect CNVs efficiently and can determine the local copy number, but are unable to identify copy number balanced variants or the precise location of the detected CNVs.
Most read depth-based tools focus on minimizing experimental biases like GC content and read mapability, and on using more advanced statistical models to increase the accuracy and the resolution. Split read methods are able to determine breakpoints at base pair resolution, but are currently only effective in unique regions of the genome due to the ambiguous mapping of short reads. Several tools have now been developed that use split read mapping signatures for the identification of SVs. Algorithms that use split read mappings are able to detect most SV types at high resolution. However, larger and more complex events cannot yet be detected, and will require longer read lengths than those available from current generation sequencing platforms. Finally, reliable de novo assembly of a full human genome is still not possible due to technical limitations in repetitive regions. Due to these limitations, current assembly-based approaches are significantly biased against the detection of SVs in these regions63. Current tools for assembly-based SV identification rely on the assembly of shorter contigs or focus on non-repetitive regions, and are only able to detect a limited range of structural variation. However, as the technical limitations are expected to be reduced significantly with the third generation of sequencing platforms, the new algorithms and improvements introduced in these tools provide an important first step towards comprehensive identification of structural variation based on de novo assembly of genomes.

Clearly, there have been great advances in the development of computational methods for NGS-based SV detection in recent years. However, none of the four NGS-based methods is comprehensive, and strong biases are still present in each of them. Application of read pair, split read and read depth methods to the same dataset has shown that a significant fraction of the detected SVs remains unique to a single method22. Thus, the sources of information need to be combined in order to maximize the identification of structural variation in a human genome. This is true at least until complete de novo genome assembly becomes a viable option, and probably even afterwards, as assembly-based methods alone are not able to identify the zygosity of a structural variant or the copy number of a sequence. Several approaches have been developed to incorporate signals from various methods. These combined approaches have been especially successful in increasing the resolution and reliability of existing methods by requiring confirmation by other signals. Some tools, like inGAP-SV, SVMerge and Genome STRiP, already combine several algorithms so that a wider range of structural variation can be detected in one experiment. For multiple methods, population-based approaches have been developed. These approaches increase the statistical power for the detection of common structural variants by pooling data, while filtering out experimental artifacts at the same time. These tools are less powerful for the identification of personal structural variation, but may be extremely useful in a clinical setting with familial or larger case-control studies. Read pair or combined methods seem most suited to this strategy, as these have the potential to detect the widest range of SVs and will thus profit most from the increased statistical power.
Possible improvements: integration of recent advances

Most tools described here have introduced the detection of a new type of signature or a new way to increase the reliability of the findings. However, a comprehensive solution that incorporates all of the recent knowledge with the aim of identifying all structural variation in a human genome is currently not available. As one sequencing experiment can generate all the required information, methods using only one or two of the signals do not make optimal use of the data. The SVMerge pipeline combines signals by using various tools that are able to detect different ranges of SVs, and implements a filtering step based on local de novo assembly. However, SVMerge does not integrate the signals, but merges the results of each approach. This represents a significant step towards an integrated solution, but a comprehensive algorithm would ideally combine signals from each of the four NGS-based methods into one model. Much of the knowledge gained in the development of previous approaches could be used in the development of such an algorithm, taking into account the advantages and limitations of each method, as well as newly discovered signatures that can be used to enhance the detection of SVs.

The use of soft clustering methods will allow maximal sensitivity for the detection of SVs in repetitive regions, but extensive confirmation and filtering will be required to reduce false positives. This could be achieved by integrating all signals before the calling of SVs, preferably in a probabilistic model, as these have been found to be more accurate than heuristics in most cases66,101,121. From read pair data, discordantly mapping reads should be used in clustering- and distribution-based models as in BreakDancer to maximize the information obtained from this step, but concordantly mapping reads should also be included, as in MoDIL and MoGUL, as these provide additional information on events and enable the determination of zygosity. The read depth signal may be used to inform the detection of deletions, duplications and insertions through the traditional differences in read depth across the genome, but beRD signals also provide important information that should be considered in the detection of any variant that creates a new sequence at the breakend, as evidenced by GASVPro. Furthermore, NovelSeq, inGAP-SV and SVMerge have shown that OEA and orphan reads should also be considered, especially in the prediction of insertions, and OEA reads can also be used to confirm the location of other events. Split read information can be used to detect the breakpoints of various types of SVs by using both anchored split mapping (Pindel) and soft clipping information (ClipCrop), as these approaches detect different classes of variants, and can also identify the predicted breakpoints at higher resolution (AGE). Local assembly may currently be used in several of these steps, for example by assembling novel insertions and the linking reads (NovelSeq), increasing the reliability of split read mapping (CREST), and confirming breakpoints by assembling unmapped reads (Genome STRiP). Finally, an example of true integration of signals would be de novo assembly of contigs that considers multiple mappings while retaining the mapping, read depth, linked read and insert size information; this may allow the use of larger sequences while still taking the traditional signatures into account, integrating all possible signals into one source of information.
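As a toy illustration of what such integration could look like, and explicitly not any published model, per-signal likelihood ratios for one candidate variant can be combined into a single posterior under a naive independence assumption. All numbers below are hypothetical.

```python
from math import prod

def posterior_odds(prior_odds, likelihood_ratios):
    """Combine independent evidence signals for one candidate SV:
    posterior odds = prior odds * product of per-signal likelihood ratios."""
    return prior_odds * prod(likelihood_ratios)

# Hypothetical likelihood ratios for one deletion candidate: discordant pairs,
# a read-depth drop, a split-read hit, and an assembled breakpoint contig.
signals = {"read pair": 12.0, "read depth": 4.0, "split read": 25.0, "assembly": 8.0}
odds = posterior_odds(prior_odds=1e-4, likelihood_ratios=signals.values())
print(odds, odds / (1 + odds))   # posterior odds ~0.96, probability ~0.49
```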
The Cortex assembler's CDBG may be a good starting point for such integrated assembly, as it allows the integration of several tracks of information. However, this approach would probably require significant computational power. Algorithmic improvement through the integration of all four SV detection methods may significantly increase both the sensitivity and the specificity of detection. However, the library insert size has also been found to play a large role: whereas large insert sizes are better for the detection of structural variation, smaller insert sizes allow a higher resolution66,154. Thus, a combination of two insert sizes, with the data integrated to retain statistical power in detection, may be the best solution155. However, the root of the major problems common to all four NGS-based SV detection methods will remain: technical limitations.

Future perspectives

Although NGS-based methods can theoretically identify all types and lengths of structural variation, this is currently not possible with any algorithm due to the technical limitations of the current sequencing platforms. It is estimated that about 50% of all SVs in the human genome are currently missed due to these limitations142. The short reads generated by the current generation of sequencing platforms, and the relatively high error rates of those with longer reads, significantly reduce the usefulness of any method for the detection of SVs in repeat and duplication regions156. As the human genome contains many such regions, and SVs are predicted to be strongly enriched in these regions, this is a significant gap in the data157,158. The use of read pairs and soft clustering are good ways to minimize these effects as much as possible, but do not provide a solution. The only way to truly address these problems is to use sequencing platforms with longer read lengths and fewer biases and errors due to the PCR steps. Third generation sequencing platforms promise read lengths in the kilobases, decreased error rates and real-time SMS as fast as the nucleotides are processed, thus increasing throughput159,160. Currently available platforms like the Ion Torrent and HeliScope have several improvements over earlier platforms, but are still positioned between second and third generation platforms. Further developments in the coming years will allow significant improvements in both read mapping and de novo assembly, while reducing computational requirements, as these processes will become less complex and thus require less processing time. This will probably enable the detection of a whole range of new SVs, possibly requiring new algorithms to evaluate these more complex regions. However, whether this will solve all of the problems remains to be seen. Estimates indicate that more than 1.5% of the genome cannot be mapped uniquely even with read lengths of 1 kb, which means that some repetitive regions may remain elusive154.

It is clear that sequencing-based methods will replace other methods for the detection of structural variation in the human genome. With the potential to detect a much broader variety of SVs with more power, the declining costs, and the significant recent algorithmic developments, it is only a matter of time. Still, until the technical requirements for the development of one unbiased solution can be met, the development and integration of algorithms will remain important for the detection of structural variation.
Even after complete de novo assembly of a full human genome has become a possibility, the development of computational methods used for alignment and the detection of signatures associated with structural variation will still be of great importance and will influence the results significantly.

6 References

1. Check, E. Human genome: patchwork people. Nature 437, 1084–6 (2005).
2. Conrad, D. F. et al. Origins and functional impact of copy number variation in the human genome. Nature 464, 704–12 (2010).
3. Fanciulli, M., Petretto, E. & Aitman, T. J. Gene copy number variation and common human disease. Clinical genetics 77, 201–13 (2010).
4. Feuk, L., Marshall, C. R., Wintle, R. F. & Scherer, S. W. Structural variants: changing the landscape of chromosomes and design of disease studies. Human molecular genetics 15 Spec No, R57–66 (2006).
5. Hurles, M. E., Dermitzakis, E. T. & Tyler-Smith, C. The functional impact of structural variation in humans. Trends in genetics : TIG 24, 238–45 (2008).
6. Buchanan, J. A. & Scherer, S. W. Contemplating effects of genomic structural variation. Genetics in medicine : official journal of the American College of Medical Genetics 10, 639–47 (2008).
7. Lupski, J. R. & Stankiewicz, P. Genomic disorders: molecular mechanisms for rearrangements and conveyed phenotypes. PLoS genetics 1, e49 (2005).
8. Stankiewicz, P. & Lupski, J. R. Genome architecture, rearrangements and genomic disorders. Trends in genetics : TIG 18, 74–82 (2002).
9. De, S. & Babu, M. M. A time-invariant principle of genome evolution. Proceedings of the National Academy of Sciences of the United States of America 107, 13004–9 (2010).
10. Schmitz, J. SINEs as Driving Forces in Genome Evolution. Genome dynamics 7, 92–107 (2012).
11. Ball, S., Colleoni, C., Cenci, U., Raj, J. N. & Tirtiaux, C. The evolution of glycogen and starch metabolism in eukaryotes gives molecular clues to understand the establishment of plastid endosymbiosis. Journal of experimental botany 62, 1775–801 (2011).
12. McHale, L. K. et al. Structural variants in the soybean genome localize to clusters of biotic stress response genes. Plant physiology (2012). doi:10.1104/pp.112.194605
13. Samuelson, L. C., Phillips, R. S. & Swanberg, L. J. Amylase gene structures in primates: retroposon insertions and promoter evolution. Molecular biology and evolution 13, 767–79 (1996).
14. Bailey, J. A. & Eichler, E. E. Primate segmental duplications: crucibles of evolution, diversity and disease. Nature reviews. Genetics 7, 552–64 (2006).
15. Xing, J. et al. Mobile elements create structural variation: analysis of a complete human genome. Genome research 19, 1516–26 (2009).
16. Nahon, J.-L. Birth of "human-specific" genes during primate evolution. Genetica 118, 193–208 (2003).
17. Perry, G. H. et al. Diet and the evolution of human amylase gene copy number variation. Nature genetics 39, 1256–60 (2007).
18. Coyne, J. A. & Hoekstra, H. E. Evolution of protein expression: new genes for a new diet. Current biology : CB 17, R1014–6 (2007).
19. Beck, C. R., Garcia-Perez, J. L., Badge, R. M. & Moran, J. V. LINE-1 elements in structural variation and disease. Annual review of genomics and human genetics 12, 187–215 (2011).
20. Stankiewicz, P. & Lupski, J. R. Structural variation in the human genome and its role in disease. Annual review of medicine 61, 437–55 (2010).
21. Zhang, F., Gu, W., Hurles, M. E. & Lupski, J. R. Copy number variation in human health, disease, and evolution. Annual review of genomics and human genetics 10, 451–81 (2009).
22. Alkan, C., Coe, B. P. & Eichler, E. E. Genome structural variation discovery and genotyping. Nature reviews. Genetics 12, 363–76 (2011).
23. Kloosterman, W. P. et al. Chromothripsis as a mechanism driving complex de novo structural rearrangements in the germline. Human molecular genetics 20, 1916–24 (2011).
24. Hochstenbach, R. et al. Discovery of variants unmasked by hemizygous deletions. European journal of human genetics : EJHG 20, 748–53 (2012).
25. Southard, A. E., Edelmann, L. J. & Gelb, B. D. Role of copy number variants in structural birth defects. Pediatrics 129, 755–63 (2012).
26. Poduri, A. & Lowenstein, D. Epilepsy genetics--past, present, and future. Current opinion in genetics & development 21, 325–32 (2011).
27. Garofalo, S., Cornacchione, M. & Di Costanzo, A. From genetics to genomics of epilepsy. Neurology research international 2012, 876234 (2012).
28. Pfundt, R. & Veltman, J. A. Structural genomic variation in intellectual disability. Methods in molecular biology (Clifton, N.J.) 838, 77–95 (2012).
29. Sebat, J. et al. Strong association of de novo copy number mutations with autism. Science (New York, N.Y.) 316, 445–9 (2007).
30. Stefansson, H. et al. Large recurrent microdeletions associated with schizophrenia. Nature 455, 232–6 (2008).
31. Kuiper, R. P., Ligtenberg, M. J. L., Hoogerbrugge, N. & Geurts van Kessel, A. Germline copy number variation and cancer risk. Current opinion in genetics & development 20, 282–9 (2010).
32. Shlien, A. et al. Excessive genomic DNA copy number variation in the Li-Fraumeni cancer predisposition syndrome. Proceedings of the National Academy of Sciences of the United States of America 105, 11264–9 (2008).
33. Gonzalez, E. et al. The influence of CCL3L1 gene-containing segmental duplications on HIV-1/AIDS susceptibility. Science (New York, N.Y.) 307, 1434–40 (2005).
34. Hedrick, P. W. Population genetics of malaria resistance in humans. Heredity 107, 283–304 (2011).
35. Janssens, W. et al. Genomic copy number determines functional expression of {beta}-defensin 2 in airway epithelial cells and associates with chronic obstructive pulmonary disease. American journal of respiratory and critical care medicine 182, 163–9 (2010).
36. Bentley, R. W. et al. Association of higher DEFB4 genomic copy number with Crohn's disease. The American journal of gastroenterology 105, 354–9 (2010).
37. Hindorff, L. A., Gillanders, E. M. & Manolio, T. A. Genetic architecture of cancer and other complex diseases: lessons learned and future directions. Carcinogenesis 32, 945–54 (2011).
38. Rodriguez-Revenga, L., Mila, M., Rosenberg, C., Lamb, A. & Lee, C. Structural variation in the human genome: the impact of copy number variants on clinical diagnosis. Genetics in medicine : official journal of the American College of Medical Genetics 9, 600–6 (2007).
39. Rasmussen, H. B. & Dahmcke, C. M. Genome-wide identification of structural variants in genes encoding drug targets: possible implications for individualized drug therapy. Pharmacogenetics and genomics 22, 471–83 (2012).
40. Stavnezer, J., Guikema, J. E. J. & Schrader, C. E. Mechanism and regulation of class switch recombination. Annual review of immunology 26, 261–92 (2008).
41. Bassing, C. H., Swat, W. & Alt, F. W. The mechanism and regulation of chromosomal V(D)J recombination. Cell 109 Suppl, S45–55 (2002).
42. Savage, J. R. Interchange and intra-nuclear architecture. Environmental and molecular mutagenesis 22, 234–44 (1993).
43. Mani, R.-S. & Chinnaiyan, A. M. Triggers for genomic rearrangements: insights into genomic, cellular and environmental influences. Nature reviews. Genetics 11, 819–29 (2010).
44. Lieber, M. R. The mechanism of double-strand DNA break repair by the nonhomologous DNA end-joining pathway. Annual review of biochemistry 79, 181–211 (2010).
45. Hastings, P. J., Ira, G. & Lupski, J. R. A microhomology-mediated break-induced replication model for the origin of human copy number variation. PLoS genetics 5, e1000327 (2009).
46. Burns, K. H. & Boeke, J. D. Human Transposon Tectonics. Cell 149, 740–752 (2012).
47. Stephens, P. J. et al. Massive genomic rearrangement acquired in a single catastrophic event during cancer development. Cell 144, 27–40 (2011).
48. Quinlan, A. R. & Hall, I. M. Characterizing complex structural variation in germline and somatic genomes. Trends in genetics : TIG 28, 43–53 (2012).
49. Le Scouarnec, S. & Gribble, S. M. Characterising chromosome rearrangements: recent technical advances in molecular cytogenetics. Heredity 108, 75–85 (2012).
50. Miller, D. T. et al. Consensus statement: chromosomal microarray is a first-tier clinical diagnostic test for individuals with developmental disabilities or congenital anomalies. American journal of human genetics 86, 749–64 (2010).
51. Pinkel, D. et al. High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nature genetics 20, 207–11 (1998).
52. Carvalho, B. High resolution microarray comparative genomic hybridisation analysis using spotted oligonucleotides. Journal of Clinical Pathology 57, 644–646 (2004).
53. Brennan, C. et al. High-resolution global profiling of genomic alterations with long oligonucleotide microarray. Cancer research 64, 4744–8 (2004).
54. Armengol, L. et al. Clinical utility of chromosomal microarray analysis in invasive prenatal diagnosis. Human genetics 131, 513–23 (2012).
55. Winchester, L., Yau, C. & Ragoussis, J. Comparing CNV detection methods for SNP arrays. Briefings in functional genomics & proteomics 8, 353–66 (2009).
56. Wang, D. G. Large-Scale Identification, Mapping, and Genotyping of Single-Nucleotide Polymorphisms in the Human Genome. Science 280, 1077–1082 (1998).
57. LaFramboise, T. Single nucleotide polymorphism arrays: a decade of biological, computational and technological advances. Nucleic acids research 37, 4181–93 (2009).
58. Kloth, J. N. et al. Combined array-comparative genomic hybridization and single-nucleotide polymorphism-loss of heterozygosity analysis reveals complex genetic alterations in cervical cancer. BMC genomics 8, 53 (2007).
59. Redon, R. et al. Global variation in copy number in the human genome. Nature 444, 444–54 (2006).
60. Abbey, D., Hickman, M., Gresham, D. & Berman, J. High-Resolution SNP/CGH Microarrays Reveal the Accumulation of Loss of Heterozygosity in Commonly Used Candida albicans Strains. G3 (Bethesda, Md.) 1, 523–30 (2011).
61. Pinto, D. et al. Comprehensive assessment of array-based platforms and calling algorithms for detection of copy number variants. Nature biotechnology 29, 512–20 (2011).
62. Tuzun, E. et al. Fine-scale structural variation of the human genome. Nature genetics 37, 727–32 (2005).
63. Alkan, C., Sajjadian, S. & Eichler, E. E. Limitations of next-generation genome sequence assembly. Nature methods 8, 61–5 (2011).
64. Korbel, J. O. et al. Paired-end mapping reveals extensive structural variation in the human genome. Science (New York, N.Y.) 318, 420–6 (2007).
65. Mardis, E. R. A decade's perspective on DNA sequencing technology. Nature 470, 198–203 (2011).
66. Bashir, A., Volik, S., Collins, C., Bafna, V. & Raphael, B. J. Evaluation of paired-end sequencing strategies for detection of genome rearrangements in cancer. PLoS computational biology 4, e1000051 (2008).
67. Medvedev, P., Stanciu, M. & Brudno, M. Computational methods for discovering structural variation with next-generation sequencing. Nature methods 6, S13–20 (2009).
68. Campbell, P. J. et al. Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nature genetics 40, 722–9 (2008).
69. Alkan, C. et al. Personalized copy number and segmental duplication maps using next-generation sequencing. Nature genetics 41, 1061–7 (2009).
70. Harismendy, O. et al. Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome biology 10, R32 (2009).
71. Mills, R. E. et al. An initial map of insertion and deletion (INDEL) variation in the human genome. Genome research 16, 1182–90 (2006).
72. Ye, K., Schulz, M. H., Long, Q., Apweiler, R. & Ning, Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics (Oxford, England) 25, 2865–71 (2009).
73. Li, Y. et al. Structural variation in two human genomes mapped at single-nucleotide resolution by whole genome de novo assembly. Nature biotechnology 29, 723–30 (2011).
74. Miller, J. R., Koren, S. & Sutton, G. Assembly algorithms for next-generation sequencing data. Genomics 95, 315–27 (2010).
75. Hormozdiari, F., Hajirasouliha, I., McPherson, A., Eichler, E. E. & Sahinalp, S. C. Simultaneous structural variation discovery among multiple paired-end sequenced genomes. Genome research 21, 2203–12 (2011).
76. Smith, T. F. & Waterman, M. S. Identification of common molecular subsequences. Journal of molecular biology 147, 195–7 (1981).
77. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. Journal of molecular biology 215, 403–10 (1990).
78. Kent, W. J. BLAT--the BLAST-like alignment tool. Genome research 12, 656–64 (2002).
79. Li, R., Li, Y., Kristiansen, K. & Wang, J. SOAP: short oligonucleotide alignment program. Bioinformatics (Oxford, England) 24, 713–4 (2008).
80. Jiang, H. & Wong, W. H. SeqMap: mapping massive amount of oligonucleotides to the genome. Bioinformatics (Oxford, England) 24, 2395–6 (2008).
81. Hach, F. et al. mrsFAST: a cache-oblivious algorithm for short-read mapping. Nature methods 7, 576–7 (2010).
82. Campagna, D. et al. PASS: a program to align short sequences. Bioinformatics (Oxford, England) 25, 967–8 (2009).
83. Ma, B., Tromp, J. & Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics (Oxford, England) 18, 440–5 (2002).
84. McKernan, K. J. et al. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome research 19, 1527–41 (2009).
85. Lin, H., Zhang, Z., Zhang, M. Q., Ma, B. & Li, M. ZOOM! Zillions of oligos mapped. Bioinformatics (Oxford, England) 24, 2431–7 (2008).
86. Homer, N., Merriman, B. & Nelson, S. F. BFAST: an alignment tool for large scale genome resequencing. PloS one 4, e7767 (2009).
87. Li, H., Ruan, J. & Durbin, R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome research 18, 1851–8 (2008).
88. Rumble, S. M. et al. SHRiMP: accurate mapping of short color-space reads. PLoS computational biology 5, e1000386 (2009).
89. Weese, D., Emde, A.-K., Rausch, T., Döring, A. & Reinert, K. RazerS--fast read mapping with sensitivity control. Genome research 19, 1646–54 (2009).
90. Burrows, M. & Wheeler, D. J. A block-sorting lossless data compression algorithm. Digital Equipment Corporation, Research Report 124 (1994).
91. Li, H. & Homer, N. A survey of sequence alignment algorithms for next-generation sequencing. Briefings in bioinformatics 11, 473–83 (2010).
92. Ning, Z., Cox, A. J. & Mullikin, J. C. SSAHA: a fast search method for large DNA databases. Genome research 11, 1725–9 (2001).
93. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics (Oxford, England) 25, 1754–60 (2009).
94. Li, R. et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics (Oxford, England) 25, 1966–7 (2009).
95. Galinsky, V. L. YOABS: yet other aligner of biological sequences--an efficient linearly scaling nucleotide aligner. Bioinformatics (Oxford, England) 28, 1070–7 (2012).
96. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology 10, R25 (2009).
97. Liu, C.-M. et al. SOAP3: ultra-fast GPU-based parallel alignment tool for short reads. Bioinformatics (Oxford, England) 28, 878–9 (2012).
98. Klus, P. et al. BarraCUDA - a fast short read sequence aligner using graphics processing units. BMC research notes 5, 27 (2012).
99. Liu, Y., Schmidt, B. & Maskell, D. L. CUSHAW: a CUDA compatible short read aligner to large genomes based on the Burrows-Wheeler transform. Bioinformatics (Oxford, England) 28, 1830–7 (2012).
100. Stromberg, M., Lee, W. & Marth, G. MOSAIK: A next-generation reference-guided aligner. at <https://github.com/wanpinglee/MOSAIK>
101. Hormozdiari, F., Alkan, C., Eichler, E. E. & Sahinalp, S. C. Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome research 19, 1270–8 (2009).
102. Hormozdiari, F. et al. Next-generation VariationHunter: combinatorial algorithms for transposon insertion discovery. Bioinformatics (Oxford, England) 26, i350–7 (2010).
103. Korbel, J. O. et al. PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data. Genome biology 10, R23 (2009).
104. Quinlan, A. R. et al. Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. Genome research 20, 623–35 (2010).
105. Chiara, M., Pesole, G. & Horner, D. S. SVM2: an improved paired-end-based tool for the detection of small genomic structural variations using high-throughput single-genome resequencing data. Nucleic acids research (2012). doi:10.1093/nar/gks606
106. Lee, S., Hormozdiari, F., Alkan, C. & Brudno, M. MoDIL: detecting small indels from clone-end sequencing with mixtures of distributions. Nature methods 6, 473–4 (2009).
107. Lee, S. MoGUL: detecting common insertions and deletions in a population. Research in Computational Molecular Biology 1–12 (2010). at <http://www.springerlink.com/index/32W7184R7057461W.pdf>
108. Chen, K. et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nature methods 6, 677–81 (2009).
109. Venkatraman, E. S. & Olshen, A. B. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics (Oxford, England) 23, 657–63 (2007).
110. Miller, C. A., Hampton, O., Coarfa, C. & Milosavljevic, A. ReadDepth: a parallel R package for detecting copy number alterations from short sequencing reads. PloS one 6, e16327 (2011).
111. Yoon, S., Xuan, Z., Makarov, V., Ye, K. & Sebat, J. Sensitive and accurate detection of copy number variants using read depth of coverage. Genome research 19, 1586–92 (2009).
112. Magi, A., Benelli, M., Yoon, S., Roviello, F. & Torricelli, F. Detecting common copy number variants in high-throughput sequencing data by using JointSLM algorithm. Nucleic acids research 39, e65 (2011).
113. Abyzov, A., Urban, A. E., Snyder, M. & Gerstein, M. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome research 21, 974–84 (2011).
114. Wang, L.-Y., Abyzov, A., Korbel, J. O., Snyder, M. & Gerstein, M. MSB: a mean-shift-based approach for the analysis of structural variation in the genome. Genome research 19, 106–17 (2009).
115. Wang, Z., Hormozdiari, F. & Yang, W. CNVeM: copy number variation detection using uncertainty of read mapping. Research in Computational Molecular Biology 326–340 (2012). at <http://www.springerlink.com/index/P622187L42V41243.pdf>
116. Xi, R. et al. Copy number variation detection in whole-genome sequencing data using the Bayesian information criterion. Proceedings of the National Academy of Sciences of the United States of America 108, E1128–36 (2011).
117. Xie, C. & Tammi, M. T. CNV-seq, a new method to detect copy number variation using high-throughput sequencing. BMC bioinformatics 10, 80 (2009).
118. Chiang, D. Y. et al. High-resolution mapping of copy-number alterations with massively parallel sequencing. Nature methods 6, 99–103 (2009).
119. Kim, T.-M., Luquette, L. J., Xi, R. & Park, P. J. rSW-seq: algorithm for detection of copy number alterations in deep sequencing data. BMC bioinformatics 11, 432 (2010).
120. Ivakhno, S. et al. CNAseg--a novel framework for identification of copy number changes in cancer from second-generation sequencing data. Bioinformatics (Oxford, England) 26, 3051–8 (2010).
121. Magi, A., Tattini, L., Pippucci, T., Torricelli, F. & Benelli, M. Read count approach for DNA copy number variants detection. Bioinformatics (Oxford, England) 28, 470–8 (2012).
122. Abyzov, A. & Gerstein, M. AGE: defining breakpoints of genomic structural variants at single-nucleotide resolution, through optimal alignments with gap excision. Bioinformatics (Oxford, England) 27, 595–603 (2011).
123. Suzuki, S., Yasuda, T., Shiraishi, Y., Miyano, S. & Nagasaki, M. ClipCrop: a tool for detecting structural variations with single-base resolution using soft-clipping information. BMC bioinformatics 12 Suppl 1, S7 (2011).
124. Henson, J., Tischler, G. & Ning, Z. Next-generation sequencing and large genome assemblies. Pharmacogenomics 13, 901–15 (2012).
125. Warren, R. L., Sutton, G. G., Jones, S. J. M. & Holt, R. A. Assembling millions of short DNA sequences using SSAKE. Bioinformatics (Oxford, England) 23, 500–1 (2007).
126. Dohm, J. C., Lottaz, C., Borodina, T. & Himmelbauer, H. SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. Genome research 17, 1697–706 (2007).
127. Jeck, W. R. et al. Extending assembly of short DNA sequences to handle error. Bioinformatics (Oxford, England) 23, 2942–4 (2007).
128. Margulies, M. et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376–80 (2005).
129. Myers, E. W. et al. A whole-genome assembly of Drosophila. Science (New York, N.Y.) 287, 2196–204 (2000).
130. Hernandez, D., François, P., Farinelli, L., Osterås, M. & Schrenzel, J. De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome research 18, 802–9 (2008).
131. Hossain, M. S., Azimi, N. & Skiena, S. Crystallizing short-read assemblies around seeds. BMC bioinformatics 10 Suppl 1, S16 (2009).
132. Pevzner, P. A., Tang, H. & Tesler, G. De novo repeat classification and fragment assembly. Genome research 14, 1786–96 (2004).
133. Zerbino, D. R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome research 18, 821–9 (2008).
134. Butler, J. et al. ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome research 18, 810–20 (2008).
135. Simpson, J. T. et al. ABySS: a parallel assembler for short read sequence data. Genome research 19, 1117–23 (2009).
136. Li, R. et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome research 20, 265–72 (2010).
137. Simpson, J. T. & Durbin, R. Efficient de novo assembly of large genomes using compressed data structures. Genome research 22, 549–56 (2012).
138. Gnerre, S. et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences of the United States of America 108, 1513–8 (2011).
139. Chaisson, M. J., Brinza, D. & Pevzner, P. A. De novo fragment assembly with short mate-paired reads: Does the read length matter? Genome research 19, 336–46 (2009).
140. Mills, R. E. et al. Mapping copy number variation by population-scale genome sequencing. Nature 470, 59–65 (2011).
141. Iqbal, Z., Caccamo, M., Turner, I., Flicek, P. & McVean, G. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nature genetics 44, 226–32 (2012).
142. Baker, M. Structural variation: the genome’s hidden architecture. Nature methods 9, 133–7 (2012).
143. Harris, R. S. Improved pairwise alignment of genomic DNA. PhD thesis, Pennsylvania State University (2007).
144. Schwartz, S. et al. Human-mouse alignments with BLASTZ. Genome research 13, 103–7 (2003).
145. Hajirasouliha, I. et al. Detection and characterization of novel sequence insertions using paired-end next-generation sequencing. Bioinformatics (Oxford, England) 26, 1277–83 (2010).
146. Handsaker, R. E., Korn, J. M., Nemesh, J. & McCarroll, S. A. Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nature genetics 43, 269–76 (2011).
147. The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061–73 (2010).
148. Medvedev, P., Fiume, M., Dzamba, M., Smith, T. & Brudno, M. Detecting copy number variation with mated short reads. Genome research 20, 1613–22 (2010).
149. Sindi, S. S., Onal, S., Peng, L., Wu, H.-T. & Raphael, B. J. An integrative probabilistic model for identification of structural variation in sequencing data. Genome biology 13, R22 (2012).
150. Qi, J. & Zhao, F. inGAP-sv: a novel scheme to identify and visualize structural variation from paired end mapping data. Nucleic acids research 39, W567–75 (2011).
151. Wang, J. et al. CREST maps somatic structural variation in cancer genomes with base-pair resolution. Nature methods 8, 652–4 (2011).
152. Wong, K., Keane, T. M., Stalker, J. & Adams, D. J. Enhanced structural variant and breakpoint detection using SVMerge by integration of multiple detection methods and local assembly. Genome biology 11, R128 (2010).
153. Jurka, J. Repbase update: a database and an electronic journal of repetitive elements. Trends in genetics: TIG 16, 418–20 (2000).
154. Schatz, M. C., Delcher, A. L. & Salzberg, S. L. Assembly of large genomes using second-generation sequencing. Genome research 20, 1165–73 (2010).
155. Bashir, A., Bansal, V. & Bafna, V. Designing deep sequencing experiments: detecting structural variation and estimating transcript abundance. BMC genomics 11, 385 (2010).
156. Metzker, M. L. Sequencing technologies - the next generation. Nature reviews. Genetics 11, 31–46 (2010).
157. Kim, P. M. et al. Analysis of copy number variants and segmental duplications in the human genome: Evidence for a change in the process of formation in recent evolutionary history. Genome research 18, 1865–74 (2008).
158. Wong, K. K. et al. A comprehensive analysis of common copy-number variations in the human genome. American journal of human genetics 80, 91–104 (2007).
159. Pareek, C. S., Smoczynski, R. & Tretyn, A. Sequencing technologies and genome sequencing. Journal of applied genetics 52, 413–35 (2011).
160. Eid, J. et al. Real-time DNA sequencing from single polymerase molecules. Science (New York, N.Y.) 323, 133–8 (2009).