Master Thesis
9-2-2016
Computational methods for the detection of structural variation in the human genome
Erik Hoogendoorn
Student Number: 3620557
Master’s programme: Cancer Genomics and Developmental Biology
Utrecht Graduate School of Life Sciences
Utrecht University
Supervisor:
Dr. W.P. Kloosterman
Department of Medical Genetics
University Medical Center Utrecht
1 Abstract
Structural variations are genomic rearrangements that contribute significantly to evolution and natural
variation between humans, and are often involved in genetic disorders. Cellular stresses and errors in repair
mechanisms can lead to a large variety of structural variation events throughout the genome. Traditional
microscopy- and array-based methods are used for the detection of larger events or copy number variations.
Next generation sequencing has in theory enabled the detection of all types of structural variants in the human
genome at unprecedented accuracy. In practice, a significant challenge lies in the development of
computational methods that are able to identify these structural variants based on the generated data. In the
last several years, many tools have been developed based on four different categories of information that can
be obtained from sequencing experiments: read pairs, read depths, split reads and assembled sequences.
In this thesis, I first introduce the topic of structural variation by discussing its impact in various areas,
what mechanisms can lead to its formation, and the types of structural variation that can occur. Subsequently, I
describe the array-based and sequencing-based methods that can be used to detect structural variation. Finally,
I give an overview of the tools that are currently available to detect signatures of structural variants in NGS
data and their properties, and conclude by discussing the current capabilities of these tools and possible future directions.
Keywords: Structural variation; Copy Number Variation; Next-Generation Sequencing; Detection algorithms;
Read pair; Read depth; Split read; De novo assembly.
2 Contents

1 Abstract
2 Contents
1 Introduction
2 Structural variation
   2.1 The importance of structural variation
   2.2 Causes for structural variation
   2.3 Types of structural variation
3 Detecting structural variation
   3.1 Array-based methods
      ArrayCGH
      SNP arrays
      Advantages and limitations
   3.2 Sequencing-based methods
      Read pair
      Read depth
      Split read
      De novo assembly
      Advantages and limitations
4 Computational methods
   4.1 Read mapping
   4.2 Read pair
      Clustering-based methods
      Distribution-based methods
   4.3 Read depth
   4.4 Split read
   4.5 De novo assembly
      Genome assembly
      Identification of structural variation
   4.6 Combined approaches
5 Discussion
   The status quo
   Possible improvements: integration of recent advances
   Future perspectives
6 References
1 Introduction
Structural variation describes genetic variation that affects the genomic structure. Although human
genomic variation was first thought to be mostly due to SNPs (Single Nucleotide Polymorphisms), it has
become clear that human genomic and probably phenotypic differences are related more to structural variation
than SNPs1,2. Structural variation can range in size from several bp (base pairs) to entire chromosomes.
Structural variation contributes significantly to human diversity and disease occurrence, and is an important
consideration in any genetic study3,4. Structural variation studies used to be limited to the detection of larger
variants like aneuploidies and chromosomal rearrangements by using microscopic methods. The development
of array-based and, more recently, sequencing-based methods has enabled the detection of smaller,
submicroscopic structural variants (SVs) at greater resolution. Next generation sequencing-based (NGS)
methods are theoretically able to identify SVs of all types at previously unattainable speeds and resolution, and
several different methods have been developed to detect signals in the data that indicate structural variants,
each with their own advantages and disadvantages. However, these methods require extensive computational
analysis and the development of various types of algorithms to filter the data, compare it to a reference or other
samples and detect the signals associated with structural variation. Here, I will introduce the effects structural
variation can have in humans and other species, the mechanisms that can result in the formation of SVs and the
different types of structural variation that can occur. Subsequently, I will give an overview of the methods that
can be used to detect structural variation, and provide an overview of the currently available computational
tools used for the detection of SVs in the human genome based on next-generation sequencing.
2 Structural variation
2.1 The importance of structural variation
Structural variation is now known to cover more nucleotide variation in the human population than SNPs,
and thousands of SVs are likely to be present in each genome1,2,5. Many SVs span, relocate or break encoding as
well as regulating elements in the genome. This may often have no observable effect, but can also induce
dosage effects, gene disruption, new fusion genes, new regulatory cascades, the formation of new SNPs and
differences in epigenetic regulation due to relocation5–7. Thus, although many SVs may be neutral, they still
introduce a large source of genetic and phenotypic variation not just between humans but in all species8,9.
Considering the effects of SVs on phenotypic variation, the occurrence of structural variation is also
expected to significantly affect natural selection and thus evolution5,8. Indeed, structural variation has been
suggested to be related to the evolution of new species as well as the evolution within various species9–11.
Examples exist in plants12 as well as primates13–15, including the emergence of human-specific genes16. Several
papers have attributed recent human evolution in genes related to diet, reproduction and disease to structural
variation17–19.
Structural variation has been characterized extensively in relation to disease. Variants affecting gene
regulation or coding sequences may result in a wide variety of genomic disorders8,20,21. Two models for the
relationship between structural variation and disease have been proposed, based on rare and common
structural variation22. The first model describes how rare and often de novo SVs in the population can cause various
disorders, collectively accounting for a large fraction of these disorders22. Examples are found for various birth
defects23–25, neurological disorders26–30 and predisposition to cancer31,32. The second model concerns SVs
common in the population, especially copy number variable gene families, thought to collectively contribute to
susceptibility of complex diseases, especially related to the immune system22. Examples for this model are
HIV33, malaria34 and various immune disorders35,36. Although examples can be found for both models, these are
probably not comprehensive for all human disease in relation to structural variation. For example, a simple
division between rare and common variants may be too simplistic37. However, it is clear that the detection of
structural variation can have a large impact on the investigation of human disease, both in diagnosis and
treatment of diseases38,39.
In addition to their role in disease, SVs are also essential for normal functioning of human life. Class Switch
Recombination (CSR)40 and V(D)J recombination41 are processes that rely on structural variation that is
stimulated by our body itself. These processes are essential for the generation of diverse mature B cells in
response to antigen stimulation, and thus for the human immune system. The study of SVs may also tell us
more about genetic mechanisms that shape genome structure as well as genome evolution. Over the last years,
the need to take structural variation and its roles into account has become apparent4. However, essential for
each of these research areas remains the accurate and unbiased identification of structural variants.
2.2 Causes for structural variation
Although first considered to occur randomly42, structural variants form in specific situations, in response to
specific environmental and cellular triggers. Various stressors like replication, transcription and genotoxic or
oxidative stress, or combinations of these, can be the trigger for structural variation43. These stresses can result
in DNA breaks and stalled DNA replication forks sensitive to the formation of structural variants. Specific
sequences are more sensitive to structural variation due to their structure, associated proteins or epigenetic
modifications43. Furthermore, the proteins involved in generation of functional recombination in the immune
system may have off-target effects, leading to double-strand breaks. Subsequent errors in DNA repair or
recombination then cause the structural variant to be implemented locally or between two loci in physical
proximity.
For example, non-homologous end joining (NHEJ) is an error-prone repair mechanism for DNA double-strand breaks. Individual double-strand breaks are efficiently repaired by classical NHEJ; however, the presence
of two double-strand breaks can result in chromosomal translocations. Alternative end joining (A-NHEJ) is a
different pathway that is associated with genomic rearrangements. However, the precise mechanisms are
currently unknown44. Allelic homologous recombination repairs double-strand breaks using a template
sequence and is relatively error-free. However, defects in homologous recombination could result in non-allelic
homologous recombination (NAHR). In this case, non-allelic sequences, often LCRs, LINE-1 and Alu repeat
elements or pseudogenes, are used as a template for repair, resulting in structural variations8. Additionally,
repetitive and transposable elements like those involved in NAHR are considered to contribute to structural
variation through the effects of retrotransposition and microhomology, which can result in Complex
Chromosomal Rearrangements (CCRs). Several models exist for the explanation of these CCRs. The MMBIR
model (microhomology-mediated break-induced replication) posits that single DNA strands of collapsed
replication forks anneal to any single-stranded DNA in proximity. Subsequent polymerization and template
switching result in CCRs45. A similar model, FoSTeS (Fork Stalling and Template Switching), suggests replication
fork template switching, but without breaks15,46. Finally, intra- and interchromosomal CCRs may result from
random non-homologous end joining of fragments after an event termed chromothripsis. In this model, one or
multiple chromosomes locally shatter, then fuse again randomly, possibly due to radiation or other events
resulting in widespread chromosomal breakage23,47. For more information on this topic, please see a
comprehensive review by Mani et al.43.
2.3 Types of structural variation
Structural variation can occur in many types, among which a distinction can be made between copy number
variant (CNV) and copy number balanced variants. Copy number balanced SVs include inversions and
translocations. Copy number variant SVs include deletions, insertions and duplications. Insertions may involve
a novel sequence or a mobile-element. Mobile element insertions can result from translocations or
duplications. Duplications can occur as tandem duplications, where the duplicated segment remains adjacent to
the source DNA, or interspersed, where the duplicated DNA is incorporated elsewhere in the genome. These
events may occur intrachromosomally, but also between different chromosomes, leading to interchromosomal
variants. The term structural variant was traditionally used to refer to variants larger than 50 bp or 1
kb (kilobase)22. However, any variant other than a SNV (Single Nucleotide Variant) may be considered to alter
the structure of the chromosome. As some of the methods discussed here are able to identify events of sizes
from 1 to 50 bp at base pair resolution, the term structural variant is used for any non-SNV genetic variant.
Of course, one event may include a combination of multiple types of SVs, resulting in more complex patterns
or CCRs, where for example an inverted fragment may contain a deletion or an insertion, or any other
combinations. Detecting CCRs is more problematic for most methods. Additionally, an insertion may
correspond to a deletion elsewhere in the genome, resulting in what is essentially a translocation. However, not
all methods may detect both events and may thus infer CNVs erroneously. Accurate identification of a certain
SV may thus require comprehensive identification of all structural variation in the studied genome48. The
ability for detection of these types of variants differs with respect to the various methods used, as is discussed
below.
3 Detecting structural variation
As mentioned above, structural variants can differ greatly in terms of size. Larger structural variants are
considered microscopic variants, as these can be detected using traditional microscopy-based cytogenetic
techniques. These include genome-wide techniques like karyotyping, chromosome painting and FISH-based
methods. Still commonly used, these methods can identify aneuploidies and most types of structural variants
larger than several Mb (Megabases). Improvements on these techniques are still being developed, providing higher
resolution and sensitivity49.
For the detection of smaller, submicroscopic SVs with higher resolution and sensitivity, more recent
molecular methods are required. These methods can be classified as either array-based or sequencing-based.
Common to these methods is that SVs are identified by comparing the experimental genome to a reference or
other sample genome, inferring variants from the differences. I will briefly introduce these array- and
sequencing-based methods below.
3.1 Array-based methods
Microarrays were originally developed for RNA expression profiling, but now have a wide range of
applications, including the detection of structural variation. Microarray-based methods rely on the design of
microarray chips on glass slides, using immobilized cDNA oligonucleotides as targets for hybridization by
experimental DNA. Although sequencing-based methods for the detection of CNVs are becoming more costeffective and popular, clinical diagnostics still mainly use microarray screening50. Detection of CNVs with arraybased methods is possible using two types of microarrays: ArrayCGH (Comparative Genomic Hybridization)
and SNP arrays. Recent platforms, marketed by companies like Agilent, Illumina, Roche and Affymetrix, accommodate
millions of probes on one chip, and new arrays are still being developed that increase the
sensitivity and resolution even more.
ArrayCGH
ArrayCGH platforms can be used to detect relative CNVs by competitive hybridization of two fluorescently
labeled samples to the target DNA. Experimental DNA is fragmented and fluorescently labeled prior to
hybridization. By using different fluorescent dyes, for example Cy3 (green) and Cy5 (red) for each sample, the
measured fluorescence for each color can give an indication of the abundance of experimental DNA from each
sample. It is important to use known reference samples, as a gain in one sample cannot be distinguished from a
loss in the other without further information. For accurate identification of SVs, normalization is often needed
due to experimental biases for GC content in the DNA and dye imbalance.
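To make this concrete, the following is a minimal Python sketch (using NumPy) of the core arrayCGH computation; the function name cgh_log_ratios and the toy per-probe intensities are hypothetical, GC normalization is omitted, and median-centering stands in for the more elaborate normalization real platforms apply.

import numpy as np

def cgh_log_ratios(cy5_test, cy3_ref):
    """Compute dye-normalized log2 ratios for arrayCGH probes.

    Positive ratios suggest a relative gain in the test sample,
    negative ratios a relative loss (illustrative sketch only).
    """
    cy5 = np.asarray(cy5_test, dtype=float)
    cy3 = np.asarray(cy3_ref, dtype=float)
    ratios = np.log2(cy5 / cy3)
    # Median-center to correct global dye imbalance; per-probe GC
    # correction would need GC content and is omitted here.
    return ratios - np.median(ratios)

# Toy example: the third and fourth probes show a loss in the test sample.
test = [1000, 980, 510, 495, 1020]
ref = [990, 1010, 1000, 985, 1005]
print(cgh_log_ratios(test, ref).round(2))

A log2 ratio near 0 indicates equal copy number, a ratio near -1 a heterozygous loss, and a ratio near +0.58 (log2 of 3/2) a single-copy gain in a diploid test sample.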
The first ArrayCGH experiments used large inserts, for example BACs (Bacterial Artificial Chromosomes), as
targets, and were able to detect CNVs in the range of 100 kb and longer51. The current use of oligonucleotides
allows the detection of CNVs at a resolution of only several kilobases52,53. An advantage of ArrayCGH is the
availability of custom arrays, allowing its use as a diagnostic platform50,54. ArrayCGH platforms can reach high
resolutions, especially using custom solutions2, but cannot match NGS-based methods.
SNP arrays
SNP arrays were originally designed to detect single nucleotide polymorphisms, but have been adapted for
the detection of CNVs. Similar to ArrayCGH, SNP arrays rely on hybridization to target DNA. However, in SNP
arrays, only the test sample is hybridized, and no competitively hybridizing reference sample is used. The
intensity of the fluorescence upon binding is used as a measure for the matching sequences in the sample. For
the detection of CNVs, intensities measured across many spots on the slide are clustered. CNVs are detected by
comparing these sample values to (a set of) reference values from a database or from a different experiment.
Several algorithms have been developed for this analysis, and an overview of these can be found in a review by
Winchester et al.55.
Similar to ArrayCGH, SNP array resolution has increased significantly in the years since its first use56.
Currently, millions of SNPs can be interrogated on one chip. In addition to improvements in resolution,
the design of arrays has focused on incorporating more informative SNPs in regions with known CNVs,
increasing the number of variants detected in one experiment57. However, this does have an important negative
side-effect, as it introduces a large bias towards known CNVs. SNP arrays generally tend to have lower
sensitivity in the detection of CNVs compared to ArrayCGH. However, SNP arrays provide additional
information such as genotypes and the parental origin of CNVs, are more accurate in the determination of copy
numbers, and allow detection of LOH (Loss Of Heterozygosity)49.
Advantages and limitations
A major disadvantage of array-based versus sequencing-based methods is that only gains and losses
compared to a reference can be identified. Thus, balanced variants like translocations and large inversions
cannot be identified, meaning that other experiments are needed to identify the location and type of the SV
events in the test sample. Array-based methods are also unable to detect smaller variants and have a lower
resolution, and thus miss a wide range of SVs that are potentially of interest. However, array-based methods
are less costly and have a higher throughput than sequencing-based methods, so it is possible to genotype a
larger number of individuals in less time and for a lower cost. Analysis of the data also requires less
computational resources than sequencing-based methods. In addition to predesigned genome-wide solutions, it
is often possible to order custom designs, allowing studies to focus on regions of interest, or increase overall
resolution. Combinations of these two types of arrays have been used to detect CNVs, either by integration of
results58, by using SNP arrays for fine-mapping regions identified by ArrayCGH59, or by using hybrid CGH+SNP
arrays49,60. These methods could provide more robust identification of structural variation as well as additional
information versus existing approaches. This seems prudent, as a recent assessment has shown relatively low
(<70%) reproducibility for repeated experiments as well as poor (<50%) concordance between platforms61.
3.2 Sequencing-based methods
Detection of multiple different types of structural variation based on sequencing methods was first
performed using paired-end mapping by Tuzun et al.62. This study was based on capillary Sanger sequencing
using fosmid-end sequences. Throughput and resolution based on this data are not optimal, but the longer read
lengths allow the reliable identification of large variants. The development of high-throughput next-generation
sequencing technologies has enabled sequencing of a full human genome within a week. Since 2005, several
companies including 454 Life Sciences, Illumina, and Life Technologies have marketed platforms with ever
increasing throughput and base-calling accuracy, longer read lengths as well as lower costs versus traditional
capillary methods. More recently, Single Molecule Sequencing (SMS) has become a possibility with Helicos’
HeliScope platform, and non-optical sequencing was introduced with Life Technologies’ Ion Torrent sequencer.
Among other applications, this development in sequencing technology has enabled the genome-wide
detection of structural variation at unprecedented resolution and speeds. Several methods have been employed
for the identification of SVs using NGS data. The most self-evident method would be de novo assembly of a
genome, with subsequent alignment to a reference to determine the structural differences. However, de novo
assembly of a human genome remains challenging due to the relatively short read lengths generated by NGS
platforms63. As a result, other methods were developed that use direct alignment of reads to one of the human
genome reference assemblies. These methods are read pair, read depth and split read approaches, and are
based on the identification of discordant patterns in sequencing data. I will describe the basic principles of each
of these approaches below.
Read pair
As mentioned earlier, the first sequencing-based identification of SVs used a read pair method, which was
applied to data from capillary sequencing62. The first NGS-based study on the genome-wide identification of SVs
applied a similar method, using the same algorithms as in the earlier study but without any optimizations for
the new type of data64. Most of the current sequencing technologies, excluding SMS platforms, are capable of
generating paired-end or mate-paired reads. In read pair sequencing, both ends of a linear fragment with an
insert sequence are sequenced, whereas in mate-pair sequencing a circularized fragment is used. Although the
method of generating the read pairs differs, the detection of SVs based on the generated data is essentially the
same. An important consideration in the detection of SVs is that the insert size for mate-pair libraries (1.5-40
kb) is often larger than for paired-end libraries (300-500 bp)65.
Read pair methods detect SVs by mapping read pairs with a predetermined insert size back to the reference
genome. Assessing the mapping locations of the reads to the reference genome, a discordant span or
orientation of the read pair indicates the occurrence of a genomic rearrangement (Figure 1). If read pairs map
further apart than the insert size, this suggests a deletion, whereas if read pairs map closer together or one read
cannot be mapped at all, this suggests a (novel) insertion. Furthermore, reads from insertions of mobile elements or other
genomic regions map to the locations where these sequences are present in the reference genome. Inversion
breakpoints are detected by a changed orientation of one of the reads inside the inversion, as well as varying
spans for the pairs. Interspersed duplications or translocations can be detected by complex patterns where in
several pairs one of the reads maps to a different location or chromosome. Finally, tandem duplications can be
detected by read pairs that have a correct orientation, but are reversed in their order and have differences in
their span.
Figure 1 The four sequencing-based methods used to identify structural variation, and the signatures that can be detected for each
type of SV. The top line indicates reference DNA. Red arrows indicate breakpoints. MEI = Mobile Element Insertion. RP = Read Pair. For a
full legend see Alkan et al.22 (Copied from Nature Reviews Genetics, Alkan et al. 201122.)
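To illustrate how span and orientation translate into these signatures, below is a minimal Python sketch that classifies a single mapped pair. The strand encoding, the 3-SD span cutoff and the function itself are illustrative assumptions, not taken from any published tool, and real callers cluster many supporting pairs before making a call.

def classify_read_pair(chrom1, pos1, strand1, chrom2, pos2, strand2,
                       mean_insert, sd_insert, n_sd=3):
    """Classify one mapped read pair into a putative SV signature.

    Assumes a forward/reverse ('+'/'-') paired-end library where the
    leftmost read normally maps to the '+' strand (toy convention).
    """
    if chrom1 != chrom2:
        return "interchromosomal event (translocation/duplication)"
    if strand1 == strand2:
        return "inversion breakpoint (same-strand orientation)"
    if pos2 < pos1 and strand1 == '+':
        return "tandem duplication (reads in reversed order)"
    span = abs(pos2 - pos1)
    if span > mean_insert + n_sd * sd_insert:
        return "deletion (span too long)"
    if span < mean_insert - n_sd * sd_insert:
        return "insertion (span too short)"
    return "concordant"

# A pair spanning 8 kb in a 400 +/- 50 bp library suggests a deletion.
print(classify_read_pair('chr1', 1000, '+', 'chr1', 9000, '-', 400, 50))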
As single read pairs are not reliable on their own due to possible mismapping or ambiguous mapping,
multiple read pairs belonging to the same variant are clustered to increase the reliability of detection, as well as
to identify the breakpoint locations for the variant more accurately. Libraries with larger insert sizes (several
kilobases) are better at detecting larger variants, but are often not able to reliably detect smaller variants due
to the distribution of insert sizes66. In contrast, libraries with smaller insert sizes can’t reliably detect the larger
events, but have higher resolution and are able to detect smaller variants. A major disadvantage of the read pair
methods is that insertions larger than the insert size cannot be detected conventionally. Although with lower
power, algorithmic detection of these insertions is possible when considering a linked signature, as described
by Medvedev et al.67. For example, for a large insertion originating from a region far away in the genome (a translocation or
duplication), the read pairs will be detected as spanning a huge range in the reference genome, as regions that
were originally far from each other are now adjacent and used to generate the read pair. By finding this
signature for both break-ends (the newly formed sequences by colocation of the flanking sequences and the
insertion) and linking these, it is possible to determine the origin and possibly the size of the insertion. For
novel sequence insertions this is more difficult, as one of the reads from the read pair will not map to the
genome. In this case, additional steps like assembly or targeted sequencing of the insertion sequence would be
required.
Read depth
Analysis of read depth, also called depth of coverage (DOC), can identify structural variants by evaluating
the depth of reads mapped to the reference genome. This approach was first used in combination with NGS
data to detect CNVs in healthy and tumor samples from the same individuals68. For this method, a uniform
distribution of reads is assumed, often according to a Poisson distribution. Sufficient deviation from this
distribution is expected to be due to copy number differences in the sequenced genome. Alternatively, the
expected copy number of a region can be derived from a comparison of read depth to reads of a control
genome. For both variants, a loss region will have fewer reads mapped to it than expected, whereas a gain region
will have more reads mapped (Figure 1).
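As a simplified illustration of the single-sample variant of this approach (assuming SciPy is available), the sketch below counts reads in fixed windows and flags windows whose counts fall outside Poisson tail thresholds; the window size, significance level and helper name are arbitrary choices rather than features of any published tool.

from scipy.stats import poisson

def call_cnv_windows(read_starts, genome_len, window=1000, alpha=1e-3):
    """Flag windows whose read counts deviate from a Poisson expectation.

    read_starts: mapped start positions of reads (toy input).
    Returns (window_index, count, 'gain'/'loss') tuples.
    """
    n_windows = genome_len // window
    counts = [0] * n_windows
    for pos in read_starts:
        if pos < n_windows * window:
            counts[pos // window] += 1
    lam = sum(counts) / n_windows       # genome-wide mean count per window
    lo = poisson.ppf(alpha, lam)        # lower tail -> putative losses
    hi = poisson.ppf(1 - alpha, lam)    # upper tail -> putative gains
    calls = []
    for i, c in enumerate(counts):
        if c > hi:
            calls.append((i, c, "gain"))
        elif c < lo:
            calls.append((i, c, "loss"))
    return calls

Real read depth tools additionally correct the counts for GC content and mapability before testing, and merge adjacent significant windows into CNV calls.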
The major disadvantage of read depth versus the other sequencing-based methods is that only CNVs can be
detected. The location at which a duplicated sequence is inserted cannot be retrieved, and copy number balanced events like inversions or
translocations cannot be detected. However, it is the only sequencing-based method that can accurately predict
copy numbers69. Larger events are more reliably detected than smaller ones, as the statistical support increases
with the size. The reliability of the SVs detected is directly related to the sequencing coverage. As a result, the
sequencing biases in the different platforms affect SV detection as well. For example, GC-rich or GC-poor regions
as well as repeat regions are sequenced less reliably, introducing biases70. These biases, as well as mismapped
reads influence the data more than in other sequencing-based methods. Algorithms based on case versus
control data suffer less from sequencing biases, as these are assumed to cancel out. However,
this approach is more costly, as it requires additional genomes to be sequenced.
Split read
Split read mapping detects structural variation by using unmappable or only partially mappable reads. The
breakpoint of a SV is found based on a read which can only be mapped to the genome in two parts. Detection of
SVs is similar to read pair-based methods, but instead of two paired reads, two parts of one read are used
(Figure 1). A deletion will show a read mapping with alignment gaps in the reference genome, whereas
insertions will show alignment gaps in the test genome. Like with read pairs, part of a read not mapping may
indicate a novel sequence insertion, and partial mapping to a known mobile element in the reference genome
indicates a MEI (Mobile Element Insertion). Reads spanning tandem duplications will have the split read
mapping in reverse order. Interspersed duplications or interchromosomal translocations will show part of a
read mapping to the duplicated region or another chromosome. Like read pair methods, split read mapping
may use clustering of reads to increase the reliability of the findings.
Split read mapping was originally used in combination with Sanger sequencing, which produces longer
reads than current NGS platforms71. The shorter reads currently generated by NGS platforms significantly
reduce the power of SV detection using this method, as the length of a split read from NGS is rarely uniquely
mappable to the genome. This results in strongly ambiguous, and often impossible mapping of reads, especially
to regions with repeats or duplications. However, it is currently still possible to map breakpoints for smaller
deletions (max. ~10 kb) as well as very small insertions (1-20 bp) at base pair resolution by using an algorithm
called Pindel72. Using this method, also called anchored split mapping67, read pairs are used to select reads
where one read maps uniquely to the genome and the other can’t be mapped. Knowing the location and
orientation of the first read, the second read can be split-mapped using local alignment based on the known
insert size, reducing the search space for possible mappings as well as ambiguous mapping significantly.
However, this does require that one of the reads is mapped uniquely, which is still not always possible.
The advantage of this method is that it can map breakpoints of SVs at base pair resolution. However, for
larger events or those involving distant genomic regions this is still problematic. Using anchored split mapping
to reduce the search space for split reads is an important step towards making split mapping useful in
combination with NGS platforms, but may be hampered by inserts or deletions in between the reads, affecting
mapping distance. Longer read lengths will make split read mapping even more powerful, as unique mapping
of at least one end may not be required.
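The sketch below conveys the anchored split-mapping idea in strongly simplified form: given a uniquely mapped anchor read, only a reference window bounded by the expected insert size is searched for a prefix/suffix split of the unmapped mate. Exact string matching stands in for the local alignment a tool like Pindel actually performs, and all names and parameters are hypothetical.

def anchored_split_map(reference, unmapped_read, anchor_pos, insert_size,
                       min_part=20):
    """Try to split-map a read whose mate is uniquely anchored.

    Searches downstream of the anchor, bounded by a multiple of the
    insert size, for an exact prefix/suffix split. Returns
    (prefix_pos, suffix_pos, split_index) or None. Real tools use
    local alignment and tolerate mismatches.
    """
    start = anchor_pos
    end = min(len(reference), anchor_pos + 10 * insert_size)
    region = reference[start:end]
    for i in range(min_part, len(unmapped_read) - min_part + 1):
        prefix, suffix = unmapped_read[:i], unmapped_read[i:]
        p = region.find(prefix)
        if p == -1:
            continue
        s = region.find(suffix, p + len(prefix))
        if s != -1:
            # Reference gap between the parts = putatively deleted bases.
            return (start + p, start + s, i)
    return None

ref = "ACGTACGTTTTTGGGGCCCCAAAATTTTACGTACGT" * 3
read = ref[40:60] + ref[70:90]   # read spanning a 10 bp 'deletion'
print(anchored_split_map(ref, read, anchor_pos=0, insert_size=30))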
De novo assembly
Ideally, full alignment of de novo sequenced genomes against one or multiple reference genome(s) would be
used to identify all structural variation in the genome. Depending on the algorithms and reference genome(s)
used, this would enable unbiased detection of all types and lengths of SVs. Although studies have described
11
assembly of human genomes based on short-read data, these and other approaches still require the
reference genome to guide assembly. Two human genome assemblies have recently been used to identify structural variation73.
However, this study was still limited in the identification of SVs in repetitive regions and was only able to
identify homozygous SVs. Local de novo assembly is possible in more reliable genomic regions74; the generated
contigs can then be aligned to the reference genome to identify structural variants.
Identification of SVs is possible using the same principles as in split read mapping, with
differences only in identification of MEIs and tandem duplications (Figure 1). As these fragments are typically
much larger than read fragments, this method is much more reliable in the identification of breakpoints and
larger SVs.
Although de novo assembly of genomes and subsequent pairwise comparison is expected to become the
standard method of SV detection, this is currently still problematic due to the limited read lengths and
assembly collapse over regions with repeats and duplications63. As these regions are especially susceptible to
the formation of structural variation, this further decreases the reliability of SV detection due to false positives
as well as false negatives. Additionally, differences in coverage between genomic regions due to biases affect
assembly, inducing gaps and complicating statistics in assembly. Finally, de novo assembly requires extensive
computational resources. In algorithms that reduce the computational requirements, tradeoffs are often
necessary in terms of sensitivity to overlaps. Although improvements in these areas have been made with
newer tools, the problems are still unsolved74. Further development of algorithms and sequencing platforms
will be required before this method will be able to detect all structural variation reliably.
Advantages and limitations
A major advantage of sequencing-based methods over array-based methods is the possibility to detect all
types of variants in a single experiment; both copy-balanced and copy-variant. Additionally, SVs of a broader
range of lengths can be detected with significantly less bias, as the genomic regions measured are not
predetermined as is necessary for microarray probes. The resolution of sequencing enables breakpoint
detection at base pair level with high enough coverage, allowing detailed investigation of CCRs as well. NGS-based methods are expected to replace microarrays for SV discovery and genotyping. Although costs of whole
genome sequencing have declined significantly, these are currently still a large factor. This is especially true for
genome-wide detection of structural variation, as the reliability of the findings depends in a large part on the
sequencing coverage attained in the experiment. However, the decline in costs is expected to continue quickly
over the coming years, in concert with the further development of single-molecule and third-generation
sequencing platforms65. A problem common to all methods is the limited read length of current generation
sequencing platforms, causing significant ambiguity in the mapping of reads, especially in repetitive regions.
Third-generation sequencing technologies with increased read length and insert sizes are expected to alleviate
these problems at least partially, but the development of new algorithms and the integration of information will
also be an important factor.
The different sequencing-based methods each have their own strengths and weaknesses in the detection of
SVs. Read pair-based methods are efficient at detecting most types of structural variation and are extensively used;
however, the insert size significantly affects the length of the SVs that can be detected. Approaches based on read depth are
able to identify sequence copy numbers, but can only detect CNVs, and only at poor breakpoint resolution.
Although split read mapping can identify breakpoints at base pair resolution, its sensitivity is currently considerably
lower than that of other methods due to unreliability outside of unique genomic regions. Finally, de novo assembly of
genomes promises to be the method to solve most of the problems, but is currently not yet possible and
dependent on the further development of sequencing techniques and algorithms. Several tools have been
developed recently to integrate the information from the various methodologies. By combining algorithms,
several biases or deficiencies of some of these methods may be alleviated. Furthermore, several strategies seem
more suitable for the detection of certain classes or properties of structural variants. For example, read depth
information is more suitable for copy number detection than other methods, and split read information may
indicate the breakpoints most reliably. Any combination of methodologies will need to take into account these
factors.
4 Computational methods
Various tools have been developed for NGS-based detection of structural variation. Here, I will give an
overview of the currently available tools for read pair-, read depth-, split read- and assembly-based methods of
genome-wide SV detection in the human genome with NGS data. Tools combining the information from several
detection methods to improve the results are discussed separately. As read mapping is an important first step
for the read pair-, read depth- and split read-based methods, and assembly algorithms are similarly important
in the assembly-based identification of SVs, approaches and tools used for these steps are discussed as well.
An important distinction between the tools is the strategy that is used for alignment of the reads and how
the SV identification algorithms process those alignments. The alignment processing strategies can be classified
as either ‘hard clustering’ or ‘soft clustering’75. Most approaches use hard clustering, considering only the best
mapping of each read to the genome for the identification of SVs. This works well for unique regions of the
genome, but has lower sensitivity in tandem duplication and repeat regions. Some newer approaches use soft
clustering, where reads are mapped to all possible locations, and all these mappings are considered in the
detection of putative SVs. Although this increases sensitivity, soft clustering may lead to more false positives
and often requires careful filtering of input reads. In sample-reference analyses, these false positives are offset
by an increase in true positives as more SVs in total are present. However, in related samples the false positives
may constitute a higher percentage of the total due to the low number of SVs between the samples. Thus, it is
important that the clustering strategy is appropriate for the study, and the parameters in tools using the soft
clustering strategy are well understood and set carefully. Table 1 summarizes the tools used for SV
identification as discussed here, showing which clustering approach is used, what types of SVs can be detected,
as well as their defining characteristics or advantages over other approaches.
4.1 Read mapping
Except for de novo assembly, all computational methods described here require mapping of the reads to the
reference genome as a first step. Many tools have been developed for this purpose, based on several different
approaches. These tools mainly differ in how they find the possible mapping locations on the genome, whereas
a final alignment step on these possible mapping locations to determine the scoring is generally performed by
using the traditional Smith-Waterman76 alignment algorithm. The first development was the classical “seed and
extend” approach77. Here, a seed DNA sequence is found based on a “hash table” containing all DNA words of a
certain length (k-mers) present in the first DNA sequence (this can be either the reads or the reference
genome). The hash table is then used to locate the k-mer sequence in the other DNA sequence. Subsequently,
this seed is extended on both sides to complete the alignment. This approach is used in several tools, like
BLAT78, SOAP79, SeqMap80, mrFAST69, mrsFAST81 and PASS82. This implementation is simple and quick for
shorter word lengths, but becomes exponentially more memory-intensive with longer word lengths.
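A minimal Python sketch of the seed step, assuming exact k-mer seeds; the extension and Smith-Waterman scoring that a real mapper would perform on each candidate position are omitted, and the helper names are hypothetical.

from collections import defaultdict

def build_kmer_index(reference, k=11):
    """Hash table from every k-mer in the reference to its positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def seed_candidates(read, index, k=11):
    """Candidate mapping positions for a read via seed lookups.

    Each hit is shifted by the seed's offset in the read, so seeds from
    the same true location vote for the same candidate position.
    """
    votes = defaultdict(int)
    for off in range(len(read) - k + 1):
        for pos in index.get(read[off:off + k], ()):
            votes[pos - off] += 1
    return sorted(votes, key=votes.get, reverse=True)

ref = "GATTACAGATTACACATGACGTTACAGATTACA"
idx = build_kmer_index(ref, k=5)
print(seed_candidates("ACGTTACAGATT", idx, k=5)[:3])  # best candidates first

The index grows with the number of distinct k-mers present, which is one way to see why longer word lengths quickly become memory-intensive.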
An improvement on this approach was introduced with PatternHunter83, which uses “spaced seeds”. This
approach is similar to the “seed and extend” approach, but requires only some of the seed sequence’s positions
to match. Thus, if a 5-mer is used, it may be that only the 1st, 3rd and 5th positions need to match the other
sequence. This approach is more sensitive and allows for mutations in the seed sequence, but may introduce
false matches that slow the mapping process down, and does not allow indels in the sequence. Many tools were
developed based on this approach, including the Corona Lite mapping tool84, ZOOM85, BFAST86 and MAQ87.
Newer tools like SHRiMP88 and RazerS89 improve on this approach by requiring multiple seed hits and allowing
indels.
Other “trie-based” approaches are aimed at reducing the memory requirements for alignment and use the
“Burrows-Wheeler Transform” (BWT), a technique that was first used for data compression90. The term trie
comes from retrieval, as it can be used to retrieve entire sequences based on their position in a list. Different
data structures can be used with this approach, based on prefix trees, suffix trees, FM-indices or suffix arrays,
but the search method is essentially the same91. In trie-based approaches, the various k-mers are compressed
into one string based on their position relative to the start of the string. These can be used to directly search the
reference genome, even allowing simultaneous search of similar strings as these are compressed together. This
further decreases the memory requirements and search times, but does require more computing time for the
construction of compressed strings. Several very fast tools like SSAHA292, BWA-SW93, SOAP294, YOABS95 and
BowTie96 have been created based on this approach. Even faster alignment tools like SOAP397, BarraCUDA98
and CUSHAW99 combine trie-based approaches with GPGPU computing, taking advantage of parallel GPU cores
to accelerate the process.
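As an illustration of the underlying machinery, the sketch below builds a BWT naively and counts occurrences of a pattern with the textbook FM-index backward search. Production aligners use compressed, sampled rank structures; the quadratic construction and linear scans here are purely didactic.

def bwt(text):
    """Burrows-Wheeler Transform via sorted rotations ('$' terminates)."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def backward_search(bwt_str, pattern):
    """Count occurrences of pattern via FM-index backward search."""
    # C[c]: number of characters in the text lexicographically smaller than c.
    sorted_bwt = sorted(bwt_str)
    C = {c: sorted_bwt.index(c) for c in set(bwt_str)}
    lo, hi = 0, len(bwt_str)            # current suffix-array interval
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + bwt_str[:lo].count(c)   # Occ(c, lo), computed naively
        hi = C[c] + bwt_str[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo

b = bwt("GATTACAGATTACA")
print(backward_search(b, "GATTACA"))    # -> 2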
Most of the newer mapping tools are specifically designed to take into account the properties of NGS
platforms; shorter reads, more data and sequencing errors. However, some tools like BLAT, SSAHA2, YOABS
and BWA-SW are useful for mapping longer reads. Additionally, some mapping tools are developed specifically
for certain platforms. For example, SHRiMP, BFAST and drFAST map color-space reads associated with SOLiD
platforms, and SOAP and BowTie tools were designed for use with data from Illumina platforms. For more
extensive information on this topic, a good review was written by Li et al.91. The selection of the mapping tool is
an important consideration, particularly when selecting one for a specific SV detection method. Split read
mapping requires specific strategies, and BWA-SW and MOSAIK100 are examples of the few mapping tools that
provide split mapping information. Finally, instead of alignment as a first step, some assembly-based
algorithms require (whole genome) alignment as one of the later steps in SV identification, as will be discussed
in the section on de novo assembly below.
4.2 Read pair
Many tools have been developed for SV identification based on read pair data. These use algorithms that
can be grouped into two categories: those based on clustering, and those based on distribution. Algorithms
from both categories can identify discordant read pairs by differences in span and orientation, and may group
read pairs for increased reliability. The difference lies in that clustering-based algorithms identify discordant
read pair mapping distance by a fixed cutoff, like a certain number of standard deviations or a value based on
simulations, whereas distribution-based algorithms test the mapping span distribution of a certain cluster of
read pairs and calculate the probability of these being discordant by comparison to the genome-wide distribution.
Clustering-based methods
The first read pair-based approaches using capillary sequencing by Tuzun et al.62, and using NGS by Korbel
et al.64, both employed a clustering-based approach where a cluster is formed based on a minimum of two read
pairs. These approaches used hard clustering of the reads. The standard clustering strategy used here detects
SV signatures based on read pairs with discordant span and orientation, as described above in the introduction
of NGS-based methods. The span is considered discordant if it deviates four or more standard deviations (SDs)
from the mean. The limitations of these studies lie in the reduced sensitivity due to the use of hard clustering,
as well as the fixed cutoff for the read pair distance and the number of required read pairs for a cluster, which
can affect the specificity strongly based on the coverage attained66.
The VariationHunter101,102 tool improves on the previous approaches by using soft clustering, thus
increasing sensitivity. The same read pair distance cutoff (four SDs) as in earlier approaches is used. After
mapping of all reads, a read is removed from consideration if it has at least one concordant mapping. If a read
has only discordant mappings, it is classified as discordant. Each possible mapping is then assigned to each
possible cluster of reads indicating a SV. Then, two algorithms may be used for the identification of SVs based
on the clusters: VariationHunter-SC (Set Cover) or VariationHunter-Pr (Probabilistic). The first algorithm
identifies SVs based on maximum parsimony, selecting clusters so the number of implied SVs introduced is
minimal. The second algorithm calculates the probability of a cluster representing a true SV based on the read
mappings, with clusters above a certain probability (90% was used in the paper) identified as SV clusters.
Evaluation by the authors showed significant improvement in detecting SVs over previous methods, especially
in the repeat regions. However, sensitivity was still lacking due to GC-content affecting the distribution of
reads. Additionally, the fixed read pair distance cutoff used means that smaller differences in span with
possibly good support are still ignored.
PEMer103 is a tool combining various functions in an analysis pipeline, with the purpose of SV identification.
Reads are first pre-processed based on the sequencing platforms used, and optimally aligned to the reference
genome (using hard clustering). Subsequently, discordant read pairs are identified based on the clustering
approach. It is possible to merge clusters obtained from different experiments and with different cutoffs in a
‘cluster-merging’ step. This is a significant improvement over other tools, as it allows the use of multiple cutoffs
for cluster formation and a custom cutoff for the calling of discordant read pairs. Furthermore, PEMer is
modular, and offers extensive customization, allowing improvements to certain modules without having to
design an entirely new pipeline. Another advantage is that PEMer can detect linked insertions as described by
Medvedev et al.67, allowing the detection of insertions longer than the library insert size. Although the
customizability is a large advantage, the parameters need to be carefully set to ensure good results.
Implementation of a soft-clustering mapping algorithm may further increase the sensitivity of this tool.
Another tool using a read pair clustering-based approach is HYDRA104. It uses soft-clustering, taking into
account multiple possible mappings to specifically improve the identification of SVs arising from multi-copy
sequences. Multiple mappings of the same read are considered to support the same SV if they span the same
interval. Based on the support for each mapping, a variant call is generated for those with the highest support.
Subsequently, SV types are identified as in a standard clustering-based approach which, in addition to the
standard signatures, is able to detect several other signatures for tandem duplications and inversions that
increase the sensitivity for these types of SVs. Although developed for the identification of structural variant
breakpoints in the mouse genome, this approach should also be applicable to the human genome. This approach
may be very useful if applied to the specific identification of SVs in repeat and duplication regions. However, a
significant risk in using this approach is that many false positives may be introduced if the mappings are not
screened properly before the HYDRA tool is used, as mapping quality is not taken into account.
SVM2105 is a recently introduced tool that uses a read pair-based approach, including non-standard
signatures typically found flanking a SV event to increase the reliability of SV detection. SV flanking regions
have defining properties for insertions larger and smaller than the insert size, as well as deletions. In addition
to the default read pair span changes, OEA read pairs (One-End Anchored, read pairs of which only one read
maps) are used. For deletions, there will be a sharp peak of OEA pairs on each strand about as long as the read
length, as these cannot be mapped in their entirety. For insertions, this peak will become broader with the size
of the insertion until the insert size is reached. Thus, the boundaries of an insertion larger than the insert size
can be detected even though no spanning read pairs are available. Statistics on the characteristics of read pairs
found around insertion and deletion regions are used in a machine-learning algorithm that detects SVs. A
Support Vector Machine (SVM) is trained to recognize each of these statistics so SVs can be classified into their
respective classes. Finally, a post-processing step combines clusters of these sites and identifies types and
lengths of insertions and deletions by standard comparison of the span of read pairs to the global mean insert
size. Although the boundaries of insertions larger than the insert size of the library are recognized, the size of
these events cannot be determined. A comparison by the authors showed an increased specificity in detecting
smaller (1-30 bp) insertions and deletions versus BreakDancer. However, the detection of SVs other than
insertions and deletions was not implemented. Adapting this method to also consider read pairs that map at
great distances may also increase the sensitivity for detecting translocations or MEIs.
Distribution-based methods
Distribution-based detection of discordant read pairs was introduced with the MoDIL tool106. Using
discordant as well as concordant read pairs, this tool compares the distribution of mapping distances for read
pairs in a specific genomic locus to the genome-wide distribution to identify SVs. A shift in the distribution
towards shorter spans indicates an insertion, whereas a shift towards longer spans indicates a deletion. This
enables the identification of insertions and deletions in the range of 10-50 bp using paired-end data. An
advantage of this tool is that heterozygous variants may be identified by observing a shift of half of the read
pairs, which is not possible in clustering-based methods. As this tool only detects a very specific length of
insertions and deletions, it is far from comprehensive. However, it is useful for detecting smaller insertions and
deletions, possibly as part of a larger pipeline.
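The sketch below conveys the distribution-shift idea in simplified form, assuming SciPy is available: a two-sample Kolmogorov-Smirnov test stands in for MoDIL's actual mixture-model fit, and the locus, spans and threshold are synthetic.

import random
from scipy.stats import ks_2samp

def test_locus_spans(locus_spans, genome_spans, alpha=0.01):
    """Compare a locus's mapped spans to the genome-wide distribution.

    A significant shift to longer spans suggests a deletion, to shorter
    spans an insertion. A heterozygous variant shifts only ~half the
    pairs, which MoDIL's mixture model (not this sketch) can resolve.
    """
    stat, p = ks_2samp(locus_spans, genome_spans)
    if p >= alpha:
        return "concordant"
    shift = (sum(locus_spans) / len(locus_spans)
             - sum(genome_spans) / len(genome_spans))
    return "deletion" if shift > 0 else "insertion"

random.seed(0)
genome = [random.gauss(400, 30) for _ in range(5000)]
locus = [s + 50 for s in random.sample(genome, 60)]  # simulated 50 bp deletion
print(test_locus_spans(locus, genome))               # -> deletion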
MoGUL107 was developed based on MoDIL, but uses sequencing data from multiple genomes to enable the
detection of common SVs from low-coverage genomes. After a soft clustering step, read pairs from multiple
individuals are clustered. SV calls are then generated from the span distribution of each cluster in a
manner similar to MoDIL. With this approach, indels of 20-100 bp can be detected if the minor allele frequency
(MAF) is at least 0.06. Although rare variants cannot be detected using this method, several variants that were
not detected by MoDIL could be detected with the increased power for common variants in MoGUL. Although
this tool is not useful for studying a single genome, it is effective in situations where a group of individuals is
studied, allowing sequencing at low coverage and thus lower costs to identify common variants. This may be
useful in situations where for example a familial disease or population differences are studied.
BreakDancer108 combines clustering-based and distribution-based read pair-based SV detection by using
two different algorithms. BreakDancerMax is used to detect all types of structural variation using the standard
clustering strategy. BreakDancerMini is distribution-based and used to detect smaller insertions and deletions
that are not found by BreakDancerMax, typically in the range of 10-100 bp. In addition to the insertions,
deletions and inversions detected by previous methods, BreakDancerMax is able to identify inter- and
intrachromosomal translocations. A comparison of BreakDancer to VariationHunter and MoDIL by the authors
showed increased sensitivity and specificity due to the combination of the two methods, as well as the
algorithmic improvements enabling the detection of other SV types. However, the detection of variant zygosity
as with MoDIL is not possible using BreakDancerMini. Another possible limitation of the BreakDancer tool lies
in the detection of SVs in repeat regions, as it relies on hard clustering.
Table 1: Overview of computational tools used for the detection of SVs based on NGS data.
RP: Read Pair, RD: Read Depth, SR: Split Read, BP: Breakpoint, CN: Copy Number, TD: Tandem Duplications, MEI: Mobile Element Insertion, VH-SC: VariationHunter-Set Cover, VH-PR: VariationHunter-Probability,
BDMax: BreakDancerMax, BDMini: BreakDancerMini, EWT: Event-Wise Testing, CBS: Circular Binary Segmentation, MSB: Mean-shift Based, HMM: Hidden Markov Model, SV: structural variant, OEA: One End
Anchored, beRD: breakend Read Depth
*Considers ambiguously mapping reads, but maps these randomly and subsequently uses only that mapping.
4.3 Read depth
Read depth methods can be grouped into two categories: those based on differences in read depth across a
single genome and those based on case versus control data. Using a single sample, reads are mapped to the
reference genome and CNVs are identified based on the average read depth or the read depth in other regions.
Using case versus control data, differences in copy number ratios after mapping to a reference genome are used
to identify copy number differences between the two genomes. In both categories, the algorithms measure
read depths in genomic ‘windows’, whose size determines the resolution at which copy
numbers are determined. Windows with similar read depths or copy number ratios are then merged to
find CNV regions. Most read depth algorithms discussed use hard clustering alignment methods, evaluating
only the best mapping of each read.
The first algorithm used to detect copy number variants from NGS read depth data was an adapted circular
binary segmentation (CBS) algorithm68 originally developed for use with arrayCGH data109. This was applied to a
case versus control (cancer) dataset to identify somatically acquired rearrangements. The copy number ratio of
the two samples was determined in genomic windows. The size of the genomic windows used was non-uniform, requiring 425 reads per window. This allows the resolution to become higher with higher sequence
coverage. After mapping the reads to the reference genome, copy number change points were found by using
the CBS algorithm for the segmentation of windows with differing copy numbers. The CBS algorithm segments
the genome by testing for change points between different parts, assessing whether an observation is significantly
different from the mean of a segment. This is done recursively, stopping when no more change points can be
found.
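A minimal sketch of the recursive change-point idea: the strongest breakpoint is kept whenever a simple two-sample t-like statistic exceeds a toy threshold, and both halves are segmented recursively. CBS proper decides significance with a circular permutation test, so this is a stand-in rather than the published algorithm.

import math
import statistics

def segment(counts, start=0, end=None, min_len=5, threshold=2.0, out=None):
    """Recursively split [start, end) at the strongest change point."""
    if out is None:
        out = []
    if end is None:
        end = len(counts)
    sd = statistics.pstdev(counts[start:end]) or 1.0
    best_i, best_t = None, threshold
    for i in range(start + min_len, end - min_len + 1):
        left, right = counts[start:i], counts[i:end]
        se = sd * math.sqrt(1 / len(left) + 1 / len(right))
        t = abs(statistics.mean(left) - statistics.mean(right)) / se
        if t > best_t:
            best_i, best_t = i, t
    if best_i is None:   # no change point left: emit one segment
        out.append((start, end, statistics.mean(counts[start:end])))
    else:
        segment(counts, start, best_i, min_len, threshold, out)
        segment(counts, best_i, end, min_len, threshold, out)
    return out

depth = [30] * 20 + [60] * 10 + [30] * 20   # a copy number gain in the middle
print(segment(depth))   # -> [(0, 20, 30), (20, 30, 60), (30, 50, 30)]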
The readDepth110 tool uses a CBS-based approach similar to those used in the first read depth-based
studies. A major difference is that readDepth does not require the sequencing of a control sample, but calls
CNVs based on a single sample. readDepth employs the CBS-based read depth strategy where the genome is
divided into windows, and the genome is segmented by the CBS algorithm until no more differences in copy
number can be detected to identify CNV regions. However, several improvements over earlier methods are
introduced. Genomic windows are calculated based on a desired FDR (False Discovery Rate) that can be input
by the user based on the number of reads. Heuristic thresholds for the detection of copy number gain and loss
events are calculated based on the desired FDR and number of reads as well. Furthermore, the readDepth tool
is able to process bisulfite-treated reads in addition to regular sequencing reads, and can thus also study
epigenetic alterations. Several corrections for biases are introduced as well: the mapability of reads is
corrected by multiplying the number of reads in a window by the inverse of the percent mapability detected in
a mapability simulation, and regions with extremely low mapability are filtered out. Read counts in each
window are also normalized by using a LOESS method to fit a regression line to the data.
RDXplorer111 is a tool that detects CNVs based on the EWT (Event-Wise Testing) algorithm. This algorithm
uses 100 bp windows to identify CNV regions based on the differences in read depth in a single sample. As a
first step, all read counts mapped to each window are corrected for the GC content. This is done by scaling
the read count for each window by the ratio of the global average read count to the average read count of all windows with the same
GC percentage. This manner of GC content correction has been adopted by many other read depth-based tools.
The number of reads in each window is then converted into a Z-score in a two-tailed normal distribution. Based
on the desired FPR (False-Positive Rate), the upper- and lower-tail probabilities identify gains and losses
respectively. Afterwards, adjacent windows with a copy number change in the same direction are merged to
identify the range of the CNV. The correction for GC content is a positive addition as this is a significant bias in
read depth methods. The authors state that the read counts of 100 bp windows approximate the normal
distribution well at 30x coverage, but more flexible window settings would be preferable, as a fixed 100 bp window may be too small or too large in experiments with higher or lower overall coverage.
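A minimal sketch of the EWT workflow described above (GC binning, Z-scores and merging of adjacent significant windows) is shown below. It omits the multiple-testing adjustment of the published procedure, and the bin count, window counts and FPR are hypothetical.

import numpy as np
from scipy.stats import norm

def gc_correct(counts, gc, n_bins=20):
    # Scale each window's count by (overall mean) / (mean of windows with similar GC)
    corrected = counts.astype(float).copy()
    gc_bin = np.digitize(gc, np.linspace(0.0, 1.0, n_bins))
    overall = counts.mean()
    for b in np.unique(gc_bin):
        mask = gc_bin == b
        bin_mean = counts[mask].mean()
        if bin_mean > 0:
            corrected[mask] = counts[mask] * overall / bin_mean
    return corrected

def ewt_calls(counts, gc, fpr=0.05):
    # Convert corrected counts to Z-scores; flag upper-tail windows as gains
    # and lower-tail windows as losses, then merge adjacent flagged windows.
    rd = gc_correct(counts, gc)
    z = (rd - rd.mean()) / rd.std()
    cut = norm.ppf(1 - fpr / 2)  # two-tailed threshold for the desired FPR
    state = np.where(z > cut, 1, np.where(z < -cut, -1, 0))
    calls, i = [], 0
    while i < len(state):
        if state[i] != 0:
            j = i
            while j < len(state) and state[j] == state[i]:
                j += 1
            calls.append((i, j, 'gain' if state[i] == 1 else 'loss'))
            i = j
        else:
            i += 1
    return calls

# toy example: 200 windows of ~100 reads with a loss from window 50 to 60
rng = np.random.default_rng(1)
counts = rng.poisson(100, 200); counts[50:60] //= 2
print(ewt_calls(counts, gc=rng.uniform(0.3, 0.6, 200), fpr=0.01))  # typically (50, 60, 'loss')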
JointSLM112 is an algorithm that is also based on EWT, but was developed to detect common CNVs present
in multiple individuals using multivariate testing. Due to the increased statistical power by including multiple
genomes, JointSLM is able to determine smaller CNVs than the EWT algorithm alone. Although it was designed
for multivariate testing, this tool may also be used to study a single genome in a manner similar to EWT. Like
other population- or group-based algorithms, this may be useful in the detection of CNVs between populations.
CNVnator113 uses a mean shift-based (MSB) approach to identify CNVs in single genomes. This approach is
also derived from an algorithm designed for the identification of copy number shifts in ArrayCGH data114. The
optimal window size is determined as the one at which the ratio of average read depth to its standard deviation
is roughly 4 to 5. In the MSB approach, copy number variant regions are identified by merging each window with
flanking windows with a similar read depth. If a window with a read depth significantly different from that of
the merged windows is encountered, a break is detected. Subsequently, CNVs are called based on the probability in a t-test that the read depth of that segment is significantly different from the global read depth. In addition to mapping of unique reads, CNVnator maps ambiguously mapping reads randomly to clustered read placements. Thus, it is not limited to unique regions by using best mappings only, but does not consider all possible mappings either. Read counts are corrected for GC content in a method similar to the one used in RDXplorer.
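The merging and calling logic can be sketched as follows. This is a strong simplification of mean shift, which iteratively shifts window values along a kernel density gradient; here a segment is simply extended greedily until a window deviates too strongly from its running mean, with a hypothetical break threshold.

import numpy as np
from scipy import stats

def merge_windows(rd, z_break=3.0):
    # Greedy left-to-right merging: a window that deviates strongly from the
    # running segment mean (relative to the global noise level) starts a new segment.
    segments, start, noise = [], 0, rd.std()
    for i in range(1, len(rd)):
        if abs(rd[i] - rd[start:i].mean()) > z_break * noise:
            segments.append((start, i))
            start = i
    segments.append((start, len(rd)))
    return segments

def call_cnvs(rd, alpha=0.01):
    # Call a segment as a CNV when a t-test finds its depth significantly
    # different from the global mean depth.
    calls = []
    for s, e in merge_windows(rd):
        if e - s < 2:
            continue
        t, p = stats.ttest_1samp(rd[s:e], rd.mean())
        if p < alpha:
            calls.append((s, e, 'gain' if rd[s:e].mean() > rd.mean() else 'loss'))
    return calls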
CNVeM115 uses read depth in single samples to determine CNVs by assigning ambiguously mapping reads to
genomic windows fractionally. It is the only read depth-based tool that explicitly uses soft clustering. After
mapping, the genome is divided into windows of 300 bp and an initial estimation of copy numbers is made
based on an EM (Expectation Maximization) algorithm. A second step then evaluates all possible mappings of
reads to calculate the posterior probability of each mapping, then assigns reads fractionally to windows based
on this probability. This algorithm discriminates between read assignments that differ in sequence by as little as one nucleotide, and predicts the copy number of each position. Instead of classifying CNVs as gains or losses,
the copy number of each base is then determined based on these assigned reads, and the CNVs are determined
from this copy number. In a comparison by the authors, this approach was found to have higher accuracy in
detecting CNVs than CNVnator. It is also able to detect whether paralogous regions are copied or deleted.
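A toy version of such an EM-style fractional assignment is sketched below. The alignment probabilities and the diploid rescaling in the M-step are simplifying assumptions; CNVeM's actual model also accounts for mismatches and sequencing errors when computing mapping posteriors.

import numpy as np

def em_fractional_assignment(read_candidates, n_windows, iters=25):
    # read_candidates: one list per read of (window, alignment probability) pairs
    copy = np.full(n_windows, 2.0)  # start from a diploid copy number estimate
    for _ in range(iters):
        frac = np.zeros(n_windows)
        for cands in read_candidates:
            # E-step: posterior over candidate mappings, proportional to the
            # current copy number estimate times the alignment probability
            w = np.array([max(copy[win], 1e-9) * p for win, p in cands])
            w /= w.sum()
            for (win, _), wt in zip(cands, w):
                frac[win] += wt
        copy = 2.0 * frac / frac.mean()  # M-step: rescale so the mean stays diploid
    return copy

# toy example: two reads unique to window 0, one unique to window 1, and one
# ambiguous read shared between both (hypothetical probabilities)
reads = [[(0, 0.9)], [(0, 0.9)], [(1, 0.9)], [(0, 0.5), (1, 0.5)]]
print(em_fractional_assignment(reads, n_windows=2))  # approximately [2.67, 1.33]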
BICseq116 is a tool that uses the MSB approach for the identification of CNVs, but is designed for use with
case vs. control data instead of single samples. The definition of windows, merging of windows, and calling of
CNVs is done similarly to the process in CNVnator. However, BICseq uses the Bayesian Information Criterion (BIC)
as the merging and stopping criterion. By using the BIC, no bias is introduced by assuming a Poisson
distribution of reads on the chromosome, increasing the reliability of the results. Furthermore, the case vs.
control approach is used to correct for the GC content bias.
CNV-seq117 is a tool for CNV detection based on the case versus control approach. This tool contains a
module for calculation of the best window size based on the desired significance level, a copy ratio threshold
and the attained coverage level. After mapping of the reads to the genome, genomic regions with potential
CNVs are identified by sliding non-overlapping windows across the genome and measuring the copy number
ratio in each window. The probability of a random occurrence of these ratios is calculated by a t-statistic, based
on the hypothesis that no copy number variation is present. The hypothesis is rejected if the probability of a
CNV exceeds the user-defined threshold, and a difference in copy number between the two genomes is inferred.
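The window test can be illustrated with the following sketch, which approximates the log ratio of two Poisson window counts with a Gaussian; CNV-seq's published derivation differs in detail, and the counts and significance level here are hypothetical.

import numpy as np
from scipy import stats

def cnv_seq_windows(case, control, alpha=0.001):
    # Test each window's copy number ratio against the null hypothesis of no
    # copy number difference; assumes all window counts are nonzero.
    case = np.asarray(case, float)
    control = np.asarray(control, float)
    ratio = (case / case.sum()) / (control / control.sum())  # normalized ratio
    # the variance of the log ratio of two Poisson counts is ~ 1/n1 + 1/n2
    z = np.log(ratio) / np.sqrt(1.0 / case + 1.0 / control)
    p = 2 * stats.norm.sf(np.abs(z))  # two-tailed probability of the ratio
    return [(i, round(ratio[i], 2), p[i]) for i in np.flatnonzero(p < alpha)]

# toy counts (hypothetical): window 3 has roughly twice the case coverage
print(cnv_seq_windows([100, 105, 98, 210, 102], [100, 100, 100, 100, 100]))  # flags window 3 as a gain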
Segseq118 uses a strategy that focuses on the CBS-based identification of CNV breakpoints for copy number
ratios in case versus control data. Similar to CNV-seq, sliding windows are used to compare copy number
ratios. However, Segseq has a variable window size based on a user-specified number of required reads. Segseq
identifies breakpoints by comparing the copy number ratio in each window to those in the adjacent windows.
Significant change in the ratio versus either window identifies a breakpoint and copy number change.
Subsequently, all windows with the same copy number ratio are merged to identify copy number variant and
copy number balanced regions.
rSW-seq119 is a tool that, similar to Segseq, uses case versus control read depths to identify changes in copy
number ratio. However, rSW-seq directly identifies CNV regions by registering cumulative changes in the ratio
as breakpoints of CNVs. Reads for each sample are sorted according to their mapping on the genome, and the
read depths for each sample are subtracted from each other. Local positive or negative sums indicate copy
number gains or losses. Regions with equal read depths are ignored, and regions where read depth differences
are detected are defined as CNVs. This gives a very intuitive overview of where CNVs are found, and can also
identify CNV regions within other CNVs. rSW-seq’s resolution is dependent on the sequencing depth, but seems
limited as CNVs smaller than 10 kb were not reported. It is the only read depth-based tool discussed here that
does not require the specification of genomic windows.
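The cumulative scan can be illustrated with the sketch below, a simplified stand-in for rSW-seq's Smith-Waterman-like recursion: case and control reads are assumed to come from equally sized libraries and are given weights of +1 and -1, and the reporting threshold is hypothetical.

def gain_regions(scores, min_score=5.0):
    # scores: per-read weights sorted by genomic position (+1 for each case
    # read, -1 for each control read). A run whose cumulative sum peaks above
    # min_score is reported as a candidate gain; scanning the negated scores
    # yields candidate losses.
    regions, cum, peak, start, peak_i = [], 0.0, 0.0, 0, 0
    for i, s in enumerate(scores):
        cum += s
        if cum <= 0:  # the run is exhausted: report it and reset
            if peak >= min_score:
                regions.append((start, peak_i))
            cum, peak, start = 0.0, 0.0, i + 1
        elif cum > peak:
            peak, peak_i = cum, i
    if peak >= min_score:
        regions.append((start, peak_i))
    return regions

# toy profile: balanced reads, an excess of case reads, then balanced again
scores = [1, -1] * 10 + [1, 1, -1, 1, 1] * 8 + [1, -1] * 10
print(gain_regions(scores))  # one candidate gain covering the middle stretch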
CNAseg120 is another tool that uses genomic windows to identify differences in copy number between case and control data. In addition to LOWESS regression normalization for GC content, CNAseg uses the Discrete Wavelet Transform (DWT) to de-noise the data, smoothing out regions with low mappability. This is
necessary as a novel HMM-based (Hidden Markov Model) segmentation step is introduced to segment the
windows based on the read depth. An additional algorithm then uses Pearson's χ² test to merge segments with
a similar copy number ratio, and the copy number state is estimated by comparing the log ratio of read depths.
This identifies segments of contiguous windows with similar read depth, which are then defined as CNVs. This
tool was shown by the authors to increase the specificity and lower the number of false positives versus CNV-seq without affecting sensitivity.
Unless specified otherwise, the single sample read depth-based tools discussed here assume a uniform
Poisson distribution of reads over the whole genome, thus considering any aberration in read depth an effect of
copy number. As read depths do in fact vary over the genome due to various biases70, more accurate models
like the BIC better approximate the distribution of reads over the genome. Although all tools described here are
able to detect differences in copy number within or between genomes, the actual copy number of these regions
is not always automatically determined. In most studies, the copy number may be estimated by normalizing the median read depth in a copy number variant region to that of copy number 2 and rounding to the nearest integer68,111,112. This has been shown to work well for most platforms by comparing to regions with known copy numbers; however, the copy numbers did not correlate well for the SOLiD platform121. In a recent
review of read depth approaches121, it was found that the EWT-based tools provide the best results in terms of
both sensitivity and specificity. CBS- and MSB-based tools are better at detecting CNVs with a large number of
windows (50-100), but worse at detecting those with a smaller number of windows (5-10). CNAseg performs
better on high coverage data, but worse on low coverage data. CNV-seq seems to perform poorer overall. In
combination with high coverage data, the EWT-based tools detect CNVs as small as 500 bp, while the CBS- and
MSB-based tools identify CNVs with a size of 2-5 kb. Thus, there seems to be a great deal of variation between
the performance of different tools, also based on the type of data that is used.
4.4 Split read
Few tools have yet been developed for the identification of SVs using split read methods with NGS data.
Most of these rely on specific alignment strategies to identify breakpoints. Pindel72 uses a pattern growth
algorithm to identify the breakpoints of deletions and insertions. As described above, this tool uses anchored
split mapping. Read pairs are selected where one read maps uniquely and the other can't be mapped under a certain threshold. With the uniquely mapping read as the anchor point, the direction of the read as
well as the user-specified maximum deletion size are used to define a region where Pindel will attempt to map
the other read. This is done using the pattern growth algorithm which searches for minimum (to find the 5’
end) and maximum (to find the 3’ end) unique substrings to map both sides of the read. The read is then
broken into either two fragments (deletion) or three fragments (with a short inserted fragment in the middle). At
least two supporting reads are required for each event. Pindel is able to identify the breakpoints at base pair
accuracy, even for deletions as large as 10 kb. Although the sensitivity of this approach is still problematic in
repeat regions, allowing mismatches in mapping of the anchor read may increase the sensitivity in the future.
By reducing the search space, the chance of mapping partial reads to the human genome is significantly
increased and split read mapping is made possible for NGS platforms. However, the search space may be
affected by insertions or deletions in between the reads. By combining this approach with information of the
mapping distance of surrounding read pairs, the accuracy may be increased.
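A toy version of anchored split mapping on exact matches is sketched below; the actual pattern growth algorithm and Pindel's handling of read orientations are more involved, and the reference, read and minimum seed length in the example are hypothetical.

def find_unique(sub, ref, lo, hi):
    # Return the position of sub in ref[lo:hi] only if it occurs exactly once
    first = ref.find(sub, lo, hi)
    if first == -1 or ref.find(sub, first + 1, hi) != -1:
        return None
    return first

def anchored_split_map(read, ref, anchor_pos, max_del=10000, min_seed=5):
    # Grow a 5' prefix seed and a 3' suffix seed of the unmapped mate, each
    # required to map uniquely inside the search window downstream of the
    # anchor; the reference gap between the seeds is a putative deletion.
    lo, hi = anchor_pos, min(len(ref), anchor_pos + max_del + len(read))
    p_len = s_len = 0
    p_pos = s_pos = None
    for k in range(min_seed, len(read)):
        pos = find_unique(read[:k], ref, lo, hi)
        if pos is None:
            break
        p_len, p_pos = k, pos
    for k in range(min_seed, len(read)):
        pos = find_unique(read[-k:], ref, lo, hi)
        if pos is None:
            break
        s_len, s_pos = k, pos
    if p_pos is None or s_pos is None or p_len + s_len < len(read):
        return None  # the two seeds do not cover the read: no clean split
    split = len(read) - s_len   # split the read so the 3' seed is fully used
    gap = s_pos - (p_pos + split)  # reference bases skipped between fragments
    return ('deletion', p_pos + split, s_pos, gap) if gap > 0 else None

# toy case: the middle of the reference is deleted in the donor genome
ref = "ATGCATTTACGGA" + "GCCGTAGTT" + "CAGGATTCCA"
read = "TTACGGA" + "CAGGATT"  # a read spanning the deletion junction
print(anchored_split_map(read, ref, anchor_pos=0))  # ('deletion', 13, 22, 9)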
The AGE122 (Alignment with Gap Excision) tool adopts a strictly alignment-based approach to split read
mapping. Based on two given sequences in the approximate location of SV breakpoints, it simultaneously aligns
the 5’ and 3’ ends of both sequences, similarly to Smith-Waterman local alignment. The final alignment is then
constructed by tracing back the maximum position in the matrix of each alignment and then aligning the 5’ and
3’ ends. The SV region is then the unaligned region in between. This approach is able to identify SV breakpoints
with base pair accuracy, and also the exact SV length and sequence if the whole sequence is supplied. However,
it does require external identification of the SV region as well as two sequences as input. These sequences need to be unique enough for proper alignment, which means that either the putative SV region needs to be small enough or the provided sequences long enough, which is often difficult to achieve with current NGS platforms. The SV type needs to be determined by additional processing of the results. Considering the input and
additional processing needed, the alignment algorithm is only useful for SV identification as part of a larger
pipeline.
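The underlying idea can nevertheless be illustrated with a heavily simplified sketch that uses exact prefix/suffix matching in place of AGE's dynamic programming matrices; the sequences are hypothetical.

def gap_excision(ref_seq, sample_seq):
    # Align the 5' ends and the 3' ends of two sequences (here by exact
    # prefix/suffix matching) and report the unaligned middle as the SV region.
    p = 0
    while p < min(len(ref_seq), len(sample_seq)) and ref_seq[p] == sample_seq[p]:
        p += 1
    s = 0
    while (s < min(len(ref_seq), len(sample_seq)) - p
           and ref_seq[-1 - s] == sample_seq[-1 - s]):
        s += 1
    return p, ref_seq[p:len(ref_seq) - s], sample_seq[p:len(sample_seq) - s]

ref = "ACGTACGT" + "GGGCCC" + "TTAGGA"  # 6 bp present only in the reference
alt = "ACGTACGT" + "TTAGGA"
print(gap_excision(ref, alt))  # (8, 'GGGCCC', ''): a 6 bp deletion at position 8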
ClipCrop123 detects SVs by using soft-clipping information. Soft-clipped sequences are defined as partially
unmatched fragments in a mapped read. Unmapped parts of partially mapped sequences are used, with a
minimum length of 10 bp. Subsequently, these clipped reads are remapped to the reference genome within at most 1,000 bp on either side of the mapped part. Sequences mapping further ahead indicate deletions, inversely
mapping sequences indicate inversions, sequences mapping before the mapped read indicate tandem
duplications, and a cluster of unmapped reads from both sides indicates insertions. Similarly to read pair
methods, additional tandem duplications over those present in the reference genome can’t be detected.
Remapping of unmapped reads is used to differentiate between novel insertions or mobile element
insertions/translocations, with novel insertions not expected to map to the reference genome. Clipped reads
are clustered if they support the same event, and a reliability score based on this support is used to determine
the most likely event. ClipCrop is able to detect a larger variety of signatures, and is not limited by the direction
of the search space like Pindel. Furthermore, ClipCrop was shown to more efficiently detect short duplications
(<170 bp) than CNVnator, BreakDancer and Pindel based on simulated data. However, the detection of larger
events was worse than with other methods.
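The signature logic for a single remapped clip can be sketched as follows; in reality calls are made from clusters of supporting reads with a reliability score, and the coordinates in the example are hypothetical.

def classify_clip(anchor_pos, anchor_strand, clip_pos, clip_strand):
    # Classify a single remapped soft-clipped fragment relative to the
    # anchored part of the read; clip_pos is None if the fragment failed
    # to remap within the local search window.
    if clip_pos is None:
        return 'possible insertion'
    if clip_strand != anchor_strand:
        return 'inversion'  # inversely mapping fragment
    if clip_pos > anchor_pos:
        return 'deletion'  # fragment maps further ahead on the reference
    if clip_pos < anchor_pos:
        return 'tandem duplication'  # fragment maps before the anchored part
    return 'unclassified'

# hypothetical coordinates
print(classify_clip(1000, '+', 2500, '+'))   # deletion
print(classify_clip(1000, '+', 400, '+'))    # tandem duplication
print(classify_clip(1000, '+', 1200, '-'))   # inversion
print(classify_clip(1000, '+', None, None))  # possible insertion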
4.5 De novo assembly
Assembly-based identification of structural variation requires two steps: the assembly of the sequence, and
the alignment of this sequence against a reference genome for detection of the variants. Assembly can be
performed either completely de novo, or by using varying degrees of information from a reference assembly.
Assembly can currently be used to identify SVs in two ways: local sequence assembly allows the reconstruction
of loci with possible variants, and whole genome assembly would provide the most comprehensive
identification of structural variation in a genome by aligning (large parts of) whole genomes. Alignment to the
reference genome may then identify all types of SVs as well as CCRs using similar methods as split read
mapping.
Genome assembly
The first step, genome assembly, is not a trivial task. Several recent reviews have been published on this
topic, explaining it in more detail63,74,124. Repeat sequences, read errors and heterozygosity present the greatest
challenges here. The short read length of NGS platforms complicates these challenges even more. Previous
assemblers used for the assembly of Sanger sequencing reads were insufficient for use with NGS data, so
several new assemblers have been developed to deal with these problems. NGS assemblers can be divided into
four categories: greedy algorithms, Overlap/Layout/Consensus (OLC) methods, de Bruijn Graph (DBG)
methods and String graphs74,124.
Most early assemblers used greedy algorithms. These operate by simply extending a seed sequence with the read that has the next highest-scoring overlap, until no further extension is possible. The score is calculated based on the length of the overlapping sequence. A problem with these algorithms is that false positives are easily added to
a contig, especially with shorter reads. Two identical overlapping sequences in the genome may lead to the
incorporation of unrelated sequences, producing a chimera. Several assemblers using greedy algorithms are
SSAKE125, SHARCGS126 and VCAKE127. This category of assemblers is generally not used for NGS platforms,
except when assembly is performed in combination with Sanger sequencing data.
Overlap/Layout/Consensus assembly was used extensively for Sanger data, but some assemblers have
been adapted for use with NGS data. OLC assembly involves three steps: first, all reads are aligned to each other
in a pair-wise comparison using the seed and extend algorithm. Then, an overlap graph can be constructed and
manipulated to get an approximate read layout. Finally, multiple sequence alignment (MSA) determines the
consensus sequence. Examples of assemblers that use this approach are Newbler128, which is distributed by
454 Life Sciences, and the Celera Assembler129, which was first used for Sanger data and subsequently revised
for 454 data, now called CABOG. Edena130 and Shorty131 use the OLC approach for the assembly of shorter reads
from Solexa and SOLiD platforms.
The de Bruijn graph approach has been widely adopted and is mostly applied to shorter reads from Solexa
and SOLiD platforms. Instead of calculating all alignments and overlaps, this approach relies on k-mers of a
certain length that are present in any of the reads. k-mers must be shorter than the read length, and are
represented by nodes in the graph. These nodes are connected by edges to other nodes whose k-mers occur adjacently (overlapping by k-1 bases) in the same read. Ideally, the k-mers would make one path that can be
traveled to form the entire genome. However, this method is more sensitive to repeats and sequencing errors
than OLC and many branches can be found in these graphs. Disadvantages of DBG assembly are that
information from reads longer than k-mers is lost and the choice of k-mer size also has a large effect on the
results. Some assemblers use approaches that still include read information during assembly, but require more
computational power. Euler132 was the first assembler to use the DBG approach. Velvet133 and ALLPATHS134
were introduced later, offering improved assembly speed and contig length, and allow the use of read pair data.
These assemblers are able to assemble entire bacterial genomes. ABySS135 was the first assembler used to
assemble a human genome from short reads. SOAPdenovo136 was introduced later and is also able to assemble
larger (and human) genomes.
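The basic DBG construction can be demonstrated in a few lines: nodes are (k-1)-mers, edges are the k-mers observed in the reads, and unambiguous paths through the graph yield contigs. The reads and the value of k are hypothetical, and real assemblers additionally handle reverse complements, sequencing errors and coverage information.

from collections import defaultdict

def build_dbg(reads, k):
    # Nodes are (k-1)-mers; each k-mer observed in a read contributes one edge
    # from its prefix node to its suffix node.
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]].add(kmer[1:])
    return edges

def walk(edges, start):
    # Follow unambiguous edges to build a contig; stop at a branch (for
    # example a repeat), a dead end, or a previously visited node (a cycle).
    contig, node, seen = start, start, set()
    while len(edges.get(node, ())) == 1 and node not in seen:
        seen.add(node)
        node = next(iter(edges[node]))
        contig += node[-1]
    return contig

# toy example with error-free, overlapping reads
g = build_dbg(["ATGGCGT", "GGCGTGC", "CGTGCAA"], k=4)
print(walk(g, "ATG"))  # ATGGCGTGCAA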
Finally, String graphs can be used to compress read and overlap data in assembly124. The primary
advantages of String graphs over DBGs are that the data is compressed further so assembly can be performed
more efficiently, and the possibility to use the full reads instead of k-mers. String graphs are based on the
overlap between reads or k-mers. Similarly to DBG assembly, each sequence is represented by a node, and these nodes have edges to other nodes with overlapping sequence. In this case, the edges are labeled with the non-overlapping sequence between the nodes. Thus, all possible paths are constructed while the entire sequence is
retrievable by following the edges. This approach is used by the String Graph Assembler (SGA)137, which is able
to assemble an entire human genome using one machine, and corrects single-base sequencing errors.
Several updated assemblers like ALLPATHS-LG138, Velvet1.1 and Euler-USR139 show significant
improvements over their predecessors. For example, they allow the incorporation of longer reads and mate-paired reads to enhance the assembly of shorter reads, are able to assemble larger genomes, and allow the input of
data from more different NGS platforms. Although de novo assembly of human genomes using shorter reads is
now possible, several limitations still remain. In addition to significant sequence contaminations, it was found
that de novo assemblies are significantly shorter than the reference genome, and large parts of repeated
(420.2 Mb) and duplicated sequence (99.1% of total duplicated sequences in the reference genome) are missing
from genomes assembled from NGS data63. Until the introduction of more reliable ‘third-generation’ sequencing
with longer read lengths, it remains important to include data from established large-molecule sequencing
methods to inform and control the data generated with NGS platforms. Using information from alignment to the
reference genome may also help to increase the reliability of assembly. For example, the Cortex assembler140,141
used in the 1000 genomes project can use varying degrees of information from a reference genome for
assembly. However, using a reference genome may bias the data, and the problems in repeat and duplication
regions will remain due to alignment problems in these regions.
Identification of structural variation
Although much work has been done to improve assembly algorithms, the identification of structural
variation using this data has been studied far less. This is partially due to the problems and costs that are still
involved with de novo assembly, prohibiting the use of assembly methods to detect SVs142. Ideally, a fully
accurate sample genome may simply be compared to a reference genome by alignment, with differences in the
alignment indicating SVs as indicated in Figure 1. However, in addition to full de novo assembly currently not
being possible, proper alignment of genomes and detection of these signatures are still significant challenges.
Currently, the assemblers discussed here may also be used to assemble smaller genomic regions to identify
structural variation by alignment of those regions. The AGE tool122 that was discussed for split read mapping is
able to align large contigs, even with multiple SVs, enabling it to potentially identify SV regions based on de
novo assembled contigs as well. As the methodology for the identification of SVs using de novo assembly data is
similar, other split read-based methods may also be adapted for use with assembly data. Another tool called
LASTZ143, based on BLASTZ144, was optimized specifically for aligning whole genomes. This was recently used
in the detection of structural variation in two de novo assembled human genomes73,136. After whole genome
alignment, the gaps and rearrangements in the alignment were extracted as putative SVs. Subsequently, over
130,000 SVs of several types (inversions, insertions, deletions, and complex rearrangements) were identified in
each genome. However, the methodology for identification of specific variants was not discussed.
A tool called NovelSeq145 was designed specifically for the detection of novel sequence insertions in the
genome. The first step in this process is mapping of all read pairs to the reference genome using mrFAST. Read
pairs of which neither read can be aligned are classified as orphan reads, and if only one read can be aligned the
read pair is classified as OEA. The hypothesis is that these orphan and OEA reads belong to novel sequence
insertions. Subsequently, all orphan reads are assembled into longer contigs using available assembly tools
such as EULER-USR and ABySS. The OEA reads are then clustered based on their mapping to the reference genome to find reads supporting the same insertion. Clustering is performed with a maximum parsimony objective, implying as few insertions as possible while explaining all OEA reads. Finally, these OEA
clusters are assembled into longer contigs as well, and are used to anchor the orphan (insertion) contigs by
aligning overlapping sequences. The identification of novel sequence insertions is an important step in the
characterization of all structural variation in the human genome. Several insertion breakpoints could not be
identified conclusively or at base pair resolution, as multiple insertion breakpoints may be identified due to
OEA clusters mapping ambiguously to the genome. However, the information provided could significantly
reduce the search space for these breakpoints, allowing other methods to validate these reliably (e.g., split read
mapping).
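In a simplified one-dimensional form, this parsimony objective corresponds to the classic minimum piercing set of intervals, which a greedy algorithm solves exactly. The sketch below assumes each OEA cluster candidate has already been converted into an interval of possible insertion positions, with hypothetical coordinates.

def min_insertion_points(intervals):
    # Greedy minimum piercing set: sort intervals by right end and place an
    # insertion point at the right end of the first interval not yet covered.
    # This yields the fewest insertion positions explaining all OEA intervals.
    points = []
    for lo, hi in sorted(intervals, key=lambda iv: iv[1]):
        if not points or points[-1] < lo:
            points.append(hi)
    return points

# hypothetical OEA intervals (anchor position plus insert size bounds)
oea = [(100, 400), (250, 500), (300, 600), (900, 1200), (950, 1100)]
print(min_insertion_points(oea))  # [400, 1100]: two putative insertion sites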
The Cortex assembler141 introduces a novel way to detect SVs based on DBG assembly. Colored de Bruijn
graphs (CDBGs) are an extension to the classical DBGs. In CDBGs multiple samples are displayed in one graph,
and the nodes and edges in the graphs are colored based on the sample they derive from. These samples may
be different sequenced genomes, reference sequences, known variant sequences or a combination of those. The
alignment of these samples will show ‘bubbles’ in one sequence when the sequences differ, where different
types of bubbles indicate various variants. The simplest bubbles to detect are for SNPs, which can be detected
using the Bubble-Calling (BC) algorithm. Deletions and insertions, where either the reference (deletion) or the
sample (insertion) shows a bubble are also detectable using the Path-Divergence (PD) algorithm. Although
other types of SVs can theoretically be detected as well, these signatures are more complicated, confounded by
branching paths in the assembly due to repeats or duplications. Thus, the identification of SVs currently seems
reliable only in non-repetitive regions. SVs defined as complex have been reported, but these were not
classified further. Cortex also allows population-based investigation by aligning multiple genomes, and can
identify novel insertions based on this information. Assemblies could still be improved by using read pair
information, and SV classification does not yet seem to be fully implemented. Although the reliability of this
method has not yet been investigated thoroughly, this tool is an important step in the direction of complete
assembly-based identification of SVs.
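A naive version of bubble detection can be sketched as follows, simplified to a plain, uncolored graph of abstract nodes; Cortex additionally tracks the colors to decide which samples support each path.

def find_bubbles(edges):
    # Naive bubble detection in a graph (dict: node -> set of successors):
    # from every node with two outgoing branches, walk each branch along
    # unambiguous edges and report the node where the branches reconverge.
    bubbles = []
    for node, succs in edges.items():
        if len(succs) != 2:
            continue
        paths = []
        for s in succs:
            path, cur = [s], s
            while len(edges.get(cur, ())) == 1 and len(path) < 50:
                cur = next(iter(edges[cur]))
                path.append(cur)
            paths.append(path)
        shared = set(paths[0]) & set(paths[1])
        if shared:
            bubbles.append((node, min(shared, key=paths[0].index)))
    return bubbles

# toy graph with one bubble: A -> (B1 | B2) -> C -> D
g = {'A': {'B1', 'B2'}, 'B1': {'C'}, 'B2': {'C'}, 'C': {'D'}}
print(find_bubbles(g))  # [('A', 'C')]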
4.6 Combined approaches
Genome STRiP146 uses read pair and read depth information to identify SVs in populations, and identifies
breakpoints by using assemblies of unmapped reads from another study140 to span potential breakpoints. This
tool was designed for use with 1000 genome project147 data, specifically to reduce false positives in SV
identification, especially in population studies. After read pair-based detection of discordant read pairs, those
in the same genomic region sharing a similar difference in insert size are clustered over different samples to
increase the power of SV detection. Furthermore, heterogeneity in a population is used to filter out possible
false positives that appear weakly in many genomes, while keeping variants with a strong signal in one or multiple
genomes. The correlation between read depth and read pair information is also used to filter out false positive
SVs: if read pairs indicate a possible deletion, it should be supported by a lower read depth in those samples
with the detected deletion, but not in general. The approximated breakpoints based on read pair and read
depth data could be resolved by assembly of unmapped reads, allowing the identification of breakpoint
locations at base pair resolution. Compared to other methods, Genome STRiP was found to detect fewer false positives and more deletions in total in a comparison by the authors. The detection of rare alleles in the population with a sensitivity comparable to single-sample methods required higher than average coverage (8x versus an average of 4x). For the detection of smaller deletions (<300 bp), Pindel was more effective. This approach is
currently only able to identify deletions in large populations, but the identification of other types of SVs is being
worked on. Further development of these methods may allow reliable population-based identification of
structural variation by integrating many different signals.
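The read pair and read depth cross-check can be illustrated with a small sketch; the per-sample depths and the ratio cutoff are hypothetical.

def depth_supports_deletion(carrier_depths, noncarrier_depths, max_ratio=0.75):
    # Keep a candidate deletion only if the samples that contributed the
    # discordant read pairs (carriers) show clearly reduced depth over the
    # region compared with the remaining samples.
    carrier = sum(carrier_depths) / len(carrier_depths)
    other = sum(noncarrier_depths) / len(noncarrier_depths)
    return carrier / other <= max_ratio

# hypothetical per-sample mean depths over the candidate deletion
print(depth_supports_deletion([12, 14], [28, 30, 31]))  # True: keep the call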
CNVer148 is a tool that combines read depth information with read pair mapping for the accurate
identification of CNVs. Without typing the SVs, discordant read pair mapping information is used to identify
regions that are different between the reference and the donor genome. Independently of this, the read depth
signal is used to identify regions with losses or gains. These signals are considered jointly in a framework
termed a ‘donor graph’. Reads that map to several locations are considered for each location, and connected if
adjacent in the reference genome or connected by read pairs. Based on traditional differences in read depth
and the presence of discordant mate pairs, CNV calls are made. This data is also used to predict the copy count
of each region. This method has several advantages over read depth- or read pair-only methods. The location of
deletions detected by read depth can be determined by using the read pair signature. This tool uses soft
clustering, which increases the sensitivity in repeat and duplication regions, and requires information from
both read depth and read pair signals to reduce false positives. Furthermore, it is possible to detect regions
with additional tandem duplications over those already present in the reference genome as well as insertions
larger than the insert size, which is not possible using traditional read pair methods. However, SVs other than
deletions can only be called as CNVs without a specific location. A comparison to other read depth and read
pair-based approaches by the authors shows that the method is less sensitive to noise and false positives; however, many confirmed SVs are still detected by read pair or read depth methods alone but not by CNVer, which indicates that the sensitivity is not maximized.
Another tool that uses both read depth and read pair information is GASVPro 149, which integrates both
signals into a probabilistic model. In read mapping, GASVPro is able to consider all possible alignments by using a
Markov Chain Monte Carlo (MCMC) approach that calculates the posterior probability of each variant over all
possible alignments of each read (soft clustering), but a hard clustering approach (GASVPro-HQ) is also
available. In addition to the standard read pair signatures, the read depth is used in the form of localized
coverage drops that occur at the breakpoints of both copy number-variant and -invariant SVs. This is called
breakend read depth (beRD), and is also used to predict zygosity of variants. GASVPro uses both the amount of
discordant read pairs, as well as the beRD signatures at each breakpoint to determine the probability of each
potential SV and remove false positives. A comparison to other SV detection methods with low coverage data,
including BreakDancer, Hydra and CNVer, showed comparable sensitivity but much higher specificity in
detecting deletions for lower quality data when using GASVPro, as far fewer false positives were predicted. For
insertions, GASVPro was the most sensitive method, but at the cost of many possible false positives. Higher
coverage data showed better performance of tools that use a hard clustering approach (BreakDancer, Hydra-HQ, GASVPro-HQ). The increased specificity when considering both read pair and read depth signals is effective
for detecting large deletions reliably, and with further implementation may be useful in the detection of other
types of variants. However, the detection of inversions was not significantly improved by using the beRD signal,
and the detection of SV types other than insertions and deletions hasn’t been implemented.
inGAP-SV150 uses read depth and read pair data to detect and visualize large and complex SVs. After
alignment of reads to the genome, the read depth signature is used to detect SV ‘hotspots’ by gap signatures,
drops in read depth that are also called beRD signatures in GASVPro. In these hotspots, SVs are called and
classified based on a heuristic cutoff for discordantly mapping read pairs. The called variants are then
evaluated based on information on mapping quality and read depth. inGAP-SV can identify different types of
SVs, including large insertions and deletions, translocations and (tandem) duplications. The beRD is also used
to determine the zygosity of variants, as the regions flanking homozygous SVs are expected to have a read
depth of zero, whereas for heterozygous events it would be reduced to about half the local read depth. Novel
insertions larger than the insert size are also detected by looking for OEA reads. Finally, the results are
visualized in a genome browser-like display, with different representations for different signatures. The user
can then inspect this information and annotate the putative events. The authors compared the detection of
deletions for a confirmed reference set against other tools, including BreakDancer and VariationHunter, and found that inGAP-SV’s combined approach was more sensitive. A comprehensive comparison of the detection
of other types of SVs was not performed. The visualization supplied by inGAP-SV is unique among the
investigated tools, and may be very useful for researchers to investigate regions of interest in more detail.
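The zygosity logic described above is simple enough to sketch directly; the tolerance used here is a hypothetical cutoff.

def zygosity_from_berd(breakend_depth, local_depth, tol=0.25):
    # Reads spanning a homozygous SV breakpoint should be nearly absent,
    # while a heterozygous event retains roughly half the local depth.
    ratio = breakend_depth / local_depth
    if ratio < tol:
        return 'homozygous'
    if abs(ratio - 0.5) < tol:
        return 'heterozygous'
    return 'no variant / unclear'

print(zygosity_from_berd(2, 40))   # homozygous
print(zygosity_from_berd(19, 40))  # heterozygous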
A recently introduced tool called CREST151 uses hanging soft-clipped reads in a method similar to the one
used by ClipCrop123. However, CREST uses a case versus control approach that first filters out any soft-clipped
reads that occur in the control genome. This filters out artifacts, and allows the detection of somatic variants in,
for example, cancer samples. After mapping to the reference genome, all soft-clipping reads mapping to the
same location are first assembled into a contig. Thus, CREST uses a combination of split read and assembly
methods. The contigs are then remapped to the genome iteratively using BLAT to identify candidate partner
breakpoints on the genome. For this breakpoint, the reads are similarly assembled into a contig. Based on the
alignment between the two assembled breakpoints and their mapping locations, a putative SV is called. The SV
type can then be identified in a method similar to the one used by ClipCrop. CREST is able to detect all
signatures but tandem duplications. Differently from other split read-based methods, CREST is designed for the
detection of larger events. In a comparative analysis using simulated data by the authors, CREST outperforms
both BreakDancer108 and Pindel72 in terms of sensitivity and specificity. This may be due to the nature of the
data, as CREST is designed to detect somatic events and the other tools aren’t.
Finally, SVMerge152 attempts the integration of data from several SV calling tools into one pipeline to
maximize the amount of SV calls in one run. BreakDancerMax is used to call deletions, insertions, inversions
and translocations based on read pair mapping. Pindel is used to call insertions of 1-20 bp and deletions
smaller than 1 Mb. RDXplorer is used to detect CNVs based on read depth information and determine the
zygosity of events. SECluster and RetroSeq are two tools that were developed specifically for implementation in
this pipeline to detect insertions. SECluster detects large insertions by identifying clusters of OEA reads
similarly to NovelSeq. RetroSeq detects MEIs by looking for read pairs of which one read maps to the reference
and the other read can be aligned to a known mobile element in Repbase153. After all calls have been made, calls
are validated by de novo assembly of the reads at predicted SV breakpoints. These contigs are aligned to the
reference genome, and the results of this alignment are used to confirm breakpoints and increase the resolution, and to filter out false positives if the alignment does not match the predicted SV. As heterozygous
events can’t be validated by de novo assembly, read depth information is also used in this step. This pipeline is
able to determine CNVs and the location of deletions detected by read depth analysis, as well as insertions and inversions that are confirmed by assembly. This pipeline was found to significantly decrease the false discovery rate of the individual tools used. SVMerge is the first meta SV caller and introduces
important validation steps after the merging of results from different tools. However, a 50% overlap is used as
a requirement for the merging of calls from different tools, which is a rather permissive cutoff and may result in the merging of distinct events. Although SVMerge still only detects a subset of structural variants, the pipeline is modular, which means that the sensitivity may be raised even more by the integration of other tools. Actual integration of the signals before calling the SVs would enhance the specificity of detection even more, and
would be the next logical step in the combination of the NGS-based signals that can be used for the detection of
SVs.
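Overlap-based merging can be sketched as follows. Reciprocal overlap and single-linkage merging are simplifying assumptions, as the exact definition of SVMerge's 50% criterion is not detailed here, and the calls in the example are hypothetical.

def reciprocal_overlap(a, b):
    # Fraction of each call (start, end) covered by the other
    inter = min(a[1], b[1]) - max(a[0], b[0])
    return 0.0 if inter <= 0 else min(inter / (a[1] - a[0]), inter / (b[1] - b[0]))

def merge_calls(calls, min_overlap=0.5):
    # Single-linkage merging of position-sorted calls whose reciprocal
    # overlap exceeds the cutoff; the merged span is the union of the calls.
    merged = []
    for call in sorted(calls):
        if merged and reciprocal_overlap(merged[-1], call) >= min_overlap:
            last = merged.pop()
            merged.append((min(last[0], call[0]), max(last[1], call[1])))
        else:
            merged.append(call)
    return merged

# hypothetical deletion calls from two different tools
print(merge_calls([(1000, 2000), (1200, 2100), (5000, 5400)]))
# -> [(1000, 2100), (5000, 5400)]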
5 Discussion
The status quo
Here, I have given an overview of the currently available tools for the detection of structural variation with
NGS platforms. As discussed in the introduction of the four NGS-based methods of SV detection, each method
has its own advantages and limitations. An evaluation of the performance of each of the discussed tools is
beyond the scope of this thesis, but a comprehensive comparison would be difficult as most tools focus on the
detection of a different or new class of structural variation by introducing a new method or algorithm. Most
papers accompanying the introduction of a new tool provide a comparison against other tools, but mostly focus
on a comparison of their own abilities for a proof of principle, without considering the full spectrum of
structural variation. Only the read depth approaches seem to have matured enough that most tools aim at
detecting the same range of SVs. This is possibly due to the fact that many of these are based on algorithms first
applied to microarray data, and that only a limited spectrum of variation is detectable from the read depth
information. A good review of the performance of read depth-based tools was recently published by Magi et
al.121.
From the information gathered here, we can see trends for the development of tools in each of the four
NGS-based methods of SV detection. Most of the new tools for each of the methods have focused on
strengthening the advantages and minimizing the disadvantages inherent to the information that is used. Read
pair methods are able to detect the broadest range of SV types and sizes, but the detection of larger insertions
is limited by the insert size, and the variance in library insert size limits the resolution. Newer read pair-based
tools have focused on removing the limitations of this method by using both clustering- and distribution-based
algorithms to increase the range of detectable SV sizes, and have developed algorithmic strategies to detect
additional signatures associated with structural variation. Read depth methods are able to detect CNVs
efficiently and can determine the local copy number, but are unable to identify copy number balanced variants
or the location of the detected CNVs. Most read depth-based tools focus on minimizing experimental biases like
the GC content and the mappability of reads, and using more advanced statistical models to increase the
accuracy and the resolution. Split read methods are able to determine breakpoints at base pair resolution, but
are currently only effective in unique regions of the genome due to ambiguous mapping of short read lengths.
Several tools have now been developed to use split read mapping signatures for the identification of SVs.
Algorithms that use split read mappings are able to detect most SV types at high resolution. However, larger
and more complex events can’t be detected yet, and will require longer read lengths than those available from
current generation sequencing platforms. Finally, reliable de novo assembly of a full human genome is still not
possible due to technical limitations in repetitive regions. Due to these limitations, significant biases towards
the detection of SVs in these regions are present in current assembly-based approaches63. Current tools for
assembly-based SV identification rely on the assembly of shorter contigs or focus on non-repetitive regions, but
are only able to detect a limited range of structural variation. However, as the technical limitations are expected
to be reduced significantly in the third generation of sequencing platforms, the new algorithms and
improvements introduced in these tools provide an important first step towards comprehensive identification
of structural variation based on de novo assembly of genomes.
Clearly there have been great advances in the development of computational methods for NGS-based SV
detection in recent years. However, none of the four NGS-based methods is comprehensive, and strong biases
are still present in each of the methods. Application of read pair, split read, and read depth methods to the same
dataset has shown that a significant fraction of the SVs detected remains unique to a single method22. Thus, the
sources of information need to be combined in order to maximize the identification of structural variation in a
human genome. This is true at least until complete de novo genome assembly becomes a viable option, but
probably even afterwards, as assembly-based methods alone are not able to identify the zygosity of a structural
variant or the copy number of a sequence. Several approaches have been developed to incorporate signals from
various methods. These combined approaches have succeeded especially well in increasing the resolution and
reliability of existing methods by requiring confirmation by other signals. Some tools like inGAP-SV, SVMerge
and Genome STRiP already combine several algorithms so that a wider range of structural variation can be
detected in one experiment.
For multiple methods, population-based approaches have been developed. These approaches increase the
statistical power for the detection of common structural variants by pooling data, while filtering out
experimental artifacts at the same time. These tools are less powerful for the identification of personal
structural variation, but may be extremely useful in a clinical setting with familial or larger case-control studies.
Read pair or combined methods seem most suited to this strategy as these have the potential to detect the
widest range of SVs and will thus profit most from the increased statistical power.
Possible improvements: integration of recent advances
Most tools described here have introduced the detection of a new type of signature or a new way to
increase the reliability of the findings. However, a comprehensive solution that incorporates all of the recent
knowledge with the aim to identify all structural variation in a human genome is currently not available. As one
sequencing experiment can generate the required information, methods using only one or two of the signals do
not optimize the use of the data. The SVMerge pipeline combines signals by using various tools that are able to
detect different ranges of SVs, and implements a filtering step based on local de novo assembly. However,
SVMerge does not integrate the signals, but merges the results from each approach. This represents a
significant step towards an integrated solution, but a comprehensive algorithm would ideally combine signals
from each of the four NGS-based methods into one model.
A lot of the knowledge gained in the development of previous approaches could be used in the development
of such an algorithm, taking into account the advantages and limitations of each method, as well as newly
discovered signatures that can be used to enhance the detection of SVs. The use of soft clustering methods will
allow maximal sensitivity for the detection of SVs in repetitive regions, but extensive confirmation and filtering
will be required to reduce false positives. This could be achieved by integrating all signals before the calling of
SVs, preferably into a probabilistic model, as these have been found to be more accurate than heuristics in most
cases66,101,121. From read pair data, discordantly mapping reads should be used in clustering- and distribution-based models as in BreakDancer to maximize the information obtained from this step, but concordantly mapping reads should also be included as in MoDIL and MoGUL, as this provides additional information on events, also
enabling the determination of zygosity. The read depth signal may be used to inform the detection of deletions,
duplications and insertions by using the traditional differences in read depth across the genome, but beRD
signals also provide important information that should be considered in the detection of any variant that forms
a new sequence at the breakend, as evidenced in GASVPro. Furthermore, NovelSeq, inGAP-SV and SVMerge
have shown that OEA and orphan reads should also be considered especially in the prediction of insertions, and
OEA reads can also be used to confirm the location of other events. Split read information can be used to detect
the breakpoints of various types of SVs by using both anchored split mapping (Pindel) and soft clipping
information (ClipCrop) as these approaches detect different classes of variants, and can also identify the
predicted breakpoints at higher resolution (AGE). Local assembly may currently be used in several of these
steps, for example by assembling novel insertions and their linking reads (NovelSeq), increasing the reliability of split read mapping (CREST), and confirming breakpoints by assembling unmapped reads (Genome STRiP).
Finally, an example of true integration of signals would be de novo assembly of contigs while considering multiple mappings; retaining the mapping, read depth, linked read and insert size information would allow the use of larger sequences while still considering the traditional signatures, integrating all possible signals
into one source of information. The Cortex assembler’s CDBG may be a good starting point for this, as it allows
the integration of several tracks of information. However, this approach would probably require significant
computational power.
Algorithmic improvement by integration of all four SV detection methods may significantly increase both
the sensitivity and the specificity for the detection. However, the library insert size has also been found to play
a large role. Whereas large insert sizes are better for detection of structural variation, smaller insert sizes allow
for a higher resolution66,154. Thus, a combination of two insert sizes, with the data integrated to retain statistical power, may be the best solution155. However, the root of the major problems common to
each of the four NGS-based SV detection methods will still remain: technical limitations.
Future perspectives
Although NGS-based methods can theoretically identify all types and lengths of structural variation, this is
currently not possible using any algorithm due to the technical limitations of the current sequencing platforms.
It’s estimated that about 50% of all SVs in the human genome are currently missed due to these limitations142.
The short reads generated by the current generation of sequencing platforms and relatively high error rates in
those with longer reads significantly reduce the usefulness of any method used for the detection of SVs in
repeat and duplication regions156. As the human genome contains many such regions, and SVs are predicted to
be strongly enriched in these regions, this is a significant gap in the data157,158. The use of read pairs and soft
clustering are good ways to minimize these effects as much as possible, but do not provide a solution. The only
way to really attempt to solve these problems is by using sequencing platforms with longer read lengths and
fewer biases and errors due to the PCR steps. Third generation sequencing platforms promise read lengths in the
kilobases, decreased error rates and real-time SMS as fast as the nucleotides are processed, thus increasing
throughput159,160. Currently available platforms like the Ion Torrent and HeliScope have several improvements
over earlier platforms, but are still between second and third generation platforms. Further developments in
the coming years will allow significant improvements in both read mapping and de novo assembly, at the same
time reducing computational requirements as these processes will become less complex and thus require less
processing time. This will probably enable the detection of a whole range of new SVs, possibly requiring new
algorithms to evaluate these more complex regions. However, whether this will solve all of the problems
remains to be seen. Estimations indicate that more than 1.5% of the genome can’t be mapped uniquely with
read lengths of 1 kb, which means that some repetitive regions may remain elusive154.
It is clear that sequencing-based methods will replace other methods for the detection of structural
variation in the human genome. With the potential to detect a much broader variety of SVs with more power,
the declining costs, and the significant recent algorithmic developments, it is only a matter of time. Still, until
the technical requirements can be met for the development of one unbiased solution, development and
integration of algorithms will remain important for the detection of structural variation. Even after complete de
novo assembly of a full human genome has become a possibility, the development of computational methods
used for alignment and the detection of signatures associated with structural variation will still be of great
importance and influence the results significantly.
6 References
1.
Check, E. Human genome: patchwork people. Nature 437, 1084–6 (2005).
2.
Conrad, D. F. et al. Origins and functional impact of copy number variation in the human genome. Nature
464, 704–12 (2010).
3.
Fanciulli, M., Petretto, E. & Aitman, T. J. Gene copy number variation and common human disease.
Clinical genetics 77, 201–13 (2010).
4.
Feuk, L., Marshall, C. R., Wintle, R. F. & Scherer, S. W. Structural variants: changing the landscape of
chromosomes and design of disease studies. Human molecular genetics 15 Spec No, R57–66 (2006).
5.
Hurles, M. E., Dermitzakis, E. T. & Tyler-Smith, C. The functional impact of structural variation in humans.
Trends in genetics : TIG 24, 238–45 (2008).
6.
Buchanan, J. A. & Scherer, S. W. Contemplating effects of genomic structural variation. Genetics in
medicine : official journal of the American College of Medical Genetics 10, 639–47 (2008).
7.
Lupski, J. R. & Stankiewicz, P. Genomic disorders: molecular mechanisms for rearrangements and
conveyed phenotypes. PLoS genetics 1, e49 (2005).
8.
Stankiewicz, P. & Lupski, J. R. Genome architecture, rearrangements and genomic disorders. Trends in
genetics : TIG 18, 74–82 (2002).
9.
De, S. & Babu, M. M. A time-invariant principle of genome evolution. Proceedings of the National Academy
of Sciences of the United States of America 107, 13004–9 (2010).
10.
Schmitz, J. SINEs as Driving Forces in Genome Evolution. Genome dynamics 7, 92–107 (2012).
11.
Ball, S., Colleoni, C., Cenci, U., Raj, J. N. & Tirtiaux, C. The evolution of glycogen and starch metabolism in
eukaryotes gives molecular clues to understand the establishment of plastid endosymbiosis. Journal of
experimental botany 62, 1775–801 (2011).
12.
McHale, L. K. et al. Structural variants in the soybean genome localize to clusters of biotic stress response
genes. Plant physiology (2012).doi:10.1104/pp.112.194605
13.
Samuelson, L. C., Phillips, R. S. & Swanberg, L. J. Amylase gene structures in primates: retroposon
insertions and promoter evolution. Molecular biology and evolution 13, 767–79 (1996).
14.
Bailey, J. A. & Eichler, E. E. Primate segmental duplications: crucibles of evolution, diversity and disease.
Nature reviews. Genetics 7, 552–64 (2006).
15.
Xing, J. et al. Mobile elements create structural variation: analysis of a complete human genome. Genome
research 19, 1516–26 (2009).
16.
Nahon, J.-L. Birth of “human-specific” genes during primate evolution. Genetica 118, 193–208 (2003).
17.
Perry, G. H. et al. Diet and the evolution of human amylase gene copy number variation. Nature genetics
39, 1256–60 (2007).
18.
Coyne, J. A. & Hoekstra, H. E. Evolution of protein expression: new genes for a new diet. Current biology :
CB 17, R1014–6 (2007).
19.
Beck, C. R., Garcia-Perez, J. L., Badge, R. M. & Moran, J. V. LINE-1 elements in structural variation and
disease. Annual review of genomics and human genetics 12, 187–215 (2011).
27
20.
Stankiewicz, P. & Lupski, J. R. Structural variation in the human genome and its role in disease. Annual
review of medicine 61, 437–55 (2010).
21.
Zhang, F., Gu, W., Hurles, M. E. & Lupski, J. R. Copy number variation in human health, disease, and
evolution. Annual review of genomics and human genetics 10, 451–81 (2009).
22.
Alkan, C., Coe, B. P. & Eichler, E. E. Genome structural variation discovery and genotyping. Nature
reviews. Genetics 12, 363–76 (2011).
23.
Kloosterman, W. P. et al. Chromothripsis as a mechanism driving complex de novo structural
rearrangements in the germline. Human molecular genetics 20, 1916–24 (2011).
24.
Hochstenbach, R. et al. Discovery of variants unmasked by hemizygous deletions. European journal of
human genetics : EJHG 20, 748–53 (2012).
25.
Southard, A. E., Edelmann, L. J. & Gelb, B. D. Role of copy number variants in structural birth defects.
Pediatrics 129, 755–63 (2012).
26.
Poduri, A. & Lowenstein, D. Epilepsy genetics--past, present, and future. Current opinion in genetics &
development 21, 325–32 (2011).
27.
Garofalo, S., Cornacchione, M. & Di Costanzo, A. From genetics to genomics of epilepsy. Neurology
research international 2012, 876234 (2012).
28.
Pfundt, R. & Veltman, J. A. Structural genomic variation in intellectual disability. Methods in molecular
biology (Clifton, N.J.) 838, 77–95 (2012).
29.
Sebat, J. et al. Strong association of de novo copy number mutations with autism. Science (New York, N.Y.)
316, 445–9 (2007).
30.
Stefansson, H. et al. Large recurrent microdeletions associated with schizophrenia. Nature 455, 232–6
(2008).
31.
Kuiper, R. P., Ligtenberg, M. J. L., Hoogerbrugge, N. & Geurts van Kessel, A. Germline copy number
variation and cancer risk. Current opinion in genetics & development 20, 282–9 (2010).
32.
Shlien, A. et al. Excessive genomic DNA copy number variation in the Li-Fraumeni cancer predisposition
syndrome. Proceedings of the National Academy of Sciences of the United States of America 105, 11264–9
(2008).
33.
Gonzalez, E. et al. The influence of CCL3L1 gene-containing segmental duplications on HIV-1/AIDS
susceptibility. Science (New York, N.Y.) 307, 1434–40 (2005).
34.
Hedrick, P. W. Population genetics of malaria resistance in humans. Heredity 107, 283–304 (2011).
35.
Janssens, W. et al. Genomic copy number determines functional expression of {beta}-defensin 2 in
airway epithelial cells and associates with chronic obstructive pulmonary disease. American journal of
respiratory and critical care medicine 182, 163–9 (2010).
36.
Bentley, R. W. et al. Association of higher DEFB4 genomic copy number with Crohn’s disease. The
American journal of gastroenterology 105, 354–9 (2010).
37.
Hindorff, L. A., Gillanders, E. M. & Manolio, T. A. Genetic architecture of cancer and other complex
diseases: lessons learned and future directions. Carcinogenesis 32, 945–54 (2011).
28
38.
Rodriguez-Revenga, L., Mila, M., Rosenberg, C., Lamb, A. & Lee, C. Structural variation in the human
genome: the impact of copy number variants on clinical diagnosis. Genetics in medicine : official journal of
the American College of Medical Genetics 9, 600–6 (2007).
39.
Rasmussen, H. B. & Dahmcke, C. M. Genome-wide identification of structural variants in genes encoding
drug targets: possible implications for individualized drug therapy. Pharmacogenetics and genomics 22,
471–83 (2012).
40.
Stavnezer, J., Guikema, J. E. J. & Schrader, C. E. Mechanism and regulation of class switch recombination.
Annual review of immunology 26, 261–92 (2008).
41.
Bassing, C. H., Swat, W. & Alt, F. W. The mechanism and regulation of chromosomal V(D)J recombination.
Cell 109 Suppl, S45–55 (2002).
42.
Savage, J. R. Interchange and intra-nuclear architecture. Environmental and molecular mutagenesis 22,
234–44 (1993).
43.
Mani, R.-S. & Chinnaiyan, A. M. Triggers for genomic rearrangements: insights into genomic, cellular and
environmental influences. Nature reviews. Genetics 11, 819–29 (2010).
44.
Lieber, M. R. The mechanism of double-strand DNA break repair by the nonhomologous DNA end-joining
pathway. Annual review of biochemistry 79, 181–211 (2010).
45.
Hastings, P. J., Ira, G. & Lupski, J. R. A microhomology-mediated break-induced replication model for the
origin of human copy number variation. PLoS genetics 5, e1000327 (2009).
46.
Burns, K. H. & Boeke, J. D. Human Transposon Tectonics. Cell 149, 740–752 (2012).
47.
Stephens, P. J. et al. Massive genomic rearrangement acquired in a single catastrophic event during
cancer development. Cell 144, 27–40 (2011).
48.
Quinlan, A. R. & Hall, I. M. Characterizing complex structural variation in germline and somatic genomes.
Trends in genetics : TIG 28, 43–53 (2012).
49.
Le Scouarnec, S. & Gribble, S. M. Characterising chromosome rearrangements: recent technical advances
in molecular cytogenetics. Heredity 108, 75–85 (2012).
50.
Miller, D. T. et al. Consensus statement: chromosomal microarray is a first-tier clinical diagnostic test for
individuals with developmental disabilities or congenital anomalies. American journal of human genetics
86, 749–64 (2010).
51.
Pinkel, D. et al. High resolution analysis of DNA copy number variation using comparative genomic
hybridization to microarrays. Nature genetics 20, 207–11 (1998).
52.
Carvalho, B. High resolution microarray comparative genomic hybridisation analysis using spotted
oligonucleotides. Journal of Clinical Pathology 57, 644–646 (2004).
53.
Brennan, C. et al. High-resolution global profiling of genomic alterations with long oligonucleotide
microarray. Cancer research 64, 4744–8 (2004).
54.
Armengol, L. et al. Clinical utility of chromosomal microarray analysis in invasive prenatal diagnosis.
Human genetics 131, 513–23 (2012).
55.
Winchester, L., Yau, C. & Ragoussis, J. Comparing CNV detection methods for SNP arrays. Briefings in
functional genomics & proteomics 8, 353–66 (2009).
56. Wang, D. G. Large-Scale Identification, Mapping, and Genotyping of Single-Nucleotide Polymorphisms in the Human Genome. Science 280, 1077–1082 (1998).
57. LaFramboise, T. Single nucleotide polymorphism arrays: a decade of biological, computational and technological advances. Nucleic acids research 37, 4181–93 (2009).
58. Kloth, J. N. et al. Combined array-comparative genomic hybridization and single-nucleotide polymorphism-loss of heterozygosity analysis reveals complex genetic alterations in cervical cancer. BMC genomics 8, 53 (2007).
59. Redon, R. et al. Global variation in copy number in the human genome. Nature 444, 444–54 (2006).
60. Abbey, D., Hickman, M., Gresham, D. & Berman, J. High-Resolution SNP/CGH Microarrays Reveal the Accumulation of Loss of Heterozygosity in Commonly Used Candida albicans Strains. G3 (Bethesda, Md.) 1, 523–30 (2011).
61. Pinto, D. et al. Comprehensive assessment of array-based platforms and calling algorithms for detection of copy number variants. Nature biotechnology 29, 512–20 (2011).
62. Tuzun, E. et al. Fine-scale structural variation of the human genome. Nature genetics 37, 727–32 (2005).
63. Alkan, C., Sajjadian, S. & Eichler, E. E. Limitations of next-generation genome sequence assembly. Nature methods 8, 61–5 (2011).
64. Korbel, J. O. et al. Paired-end mapping reveals extensive structural variation in the human genome. Science (New York, N.Y.) 318, 420–6 (2007).
65. Mardis, E. R. A decade’s perspective on DNA sequencing technology. Nature 470, 198–203 (2011).
66. Bashir, A., Volik, S., Collins, C., Bafna, V. & Raphael, B. J. Evaluation of paired-end sequencing strategies for detection of genome rearrangements in cancer. PLoS computational biology 4, e1000051 (2008).
67. Medvedev, P., Stanciu, M. & Brudno, M. Computational methods for discovering structural variation with next-generation sequencing. Nature methods 6, S13–20 (2009).
68. Campbell, P. J. et al. Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nature genetics 40, 722–9 (2008).
69. Alkan, C. et al. Personalized copy number and segmental duplication maps using next-generation sequencing. Nature genetics 41, 1061–7 (2009).
70. Harismendy, O. et al. Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome biology 10, R32 (2009).
71. Mills, R. E. et al. An initial map of insertion and deletion (INDEL) variation in the human genome. Genome research 16, 1182–90 (2006).
72. Ye, K., Schulz, M. H., Long, Q., Apweiler, R. & Ning, Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics (Oxford, England) 25, 2865–71 (2009).
73. Li, Y. et al. Structural variation in two human genomes mapped at single-nucleotide resolution by whole genome de novo assembly. Nature biotechnology 29, 723–30 (2011).
74. Miller, J. R., Koren, S. & Sutton, G. Assembly algorithms for next-generation sequencing data. Genomics 95, 315–27 (2010).
75. Hormozdiari, F., Hajirasouliha, I., McPherson, A., Eichler, E. E. & Sahinalp, S. C. Simultaneous structural variation discovery among multiple paired-end sequenced genomes. Genome research 21, 2203–12 (2011).
76. Smith, T. F. & Waterman, M. S. Identification of common molecular subsequences. Journal of molecular biology 147, 195–7 (1981).
77. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. Journal of molecular biology 215, 403–10 (1990).
78. Kent, W. J. BLAT--the BLAST-like alignment tool. Genome research 12, 656–64 (2002).
79. Li, R., Li, Y., Kristiansen, K. & Wang, J. SOAP: short oligonucleotide alignment program. Bioinformatics (Oxford, England) 24, 713–4 (2008).
80. Jiang, H. & Wong, W. H. SeqMap: mapping massive amount of oligonucleotides to the genome. Bioinformatics (Oxford, England) 24, 2395–6 (2008).
81. Hach, F. et al. mrsFAST: a cache-oblivious algorithm for short-read mapping. Nature methods 7, 576–7 (2010).
82. Campagna, D. et al. PASS: a program to align short sequences. Bioinformatics (Oxford, England) 25, 967–8 (2009).
83. Ma, B., Tromp, J. & Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics (Oxford, England) 18, 440–5 (2002).
84. McKernan, K. J. et al. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome research 19, 1527–41 (2009).
85. Lin, H., Zhang, Z., Zhang, M. Q., Ma, B. & Li, M. ZOOM! Zillions of oligos mapped. Bioinformatics (Oxford, England) 24, 2431–7 (2008).
86. Homer, N., Merriman, B. & Nelson, S. F. BFAST: an alignment tool for large scale genome resequencing. PloS one 4, e7767 (2009).
87. Li, H., Ruan, J. & Durbin, R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome research 18, 1851–8 (2008).
88. Rumble, S. M. et al. SHRiMP: accurate mapping of short color-space reads. PLoS computational biology 5, e1000386 (2009).
89. Weese, D., Emde, A.-K., Rausch, T., Döring, A. & Reinert, K. RazerS--fast read mapping with sensitivity control. Genome research 19, 1646–54 (2009).
90. Burrows, M. & Wheeler, D. J. A block-sorting lossless data compression algorithm. Digital Equipment Corporation SRC Research Report 124 (1994).
91. Li, H. & Homer, N. A survey of sequence alignment algorithms for next-generation sequencing. Briefings in bioinformatics 11, 473–83 (2010).
92. Ning, Z., Cox, A. J. & Mullikin, J. C. SSAHA: a fast search method for large DNA databases. Genome research 11, 1725–9 (2001).
93. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics (Oxford, England) 25, 1754–60 (2009).
94. Li, R. et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics (Oxford, England) 25, 1966–7 (2009).
95. Galinsky, V. L. YOABS: yet other aligner of biological sequences--an efficient linearly scaling nucleotide aligner. Bioinformatics (Oxford, England) 28, 1070–7 (2012).
96. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology 10, R25 (2009).
97. Liu, C.-M. et al. SOAP3: ultra-fast GPU-based parallel alignment tool for short reads. Bioinformatics (Oxford, England) 28, 878–9 (2012).
98. Klus, P. et al. BarraCUDA - a fast short read sequence aligner using graphics processing units. BMC research notes 5, 27 (2012).
99. Liu, Y., Schmidt, B. & Maskell, D. L. CUSHAW: a CUDA compatible short read aligner to large genomes based on the Burrows-Wheeler transform. Bioinformatics (Oxford, England) 28, 1830–7 (2012).
100. Stromberg, M., Lee, W. & Marth, G. MOSAIK: a next-generation reference-guided aligner. at <https://github.com/wanpinglee/MOSAIK>
101. Hormozdiari, F., Alkan, C., Eichler, E. E. & Sahinalp, S. C. Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome research 19, 1270–8 (2009).
102. Hormozdiari, F. et al. Next-generation VariationHunter: combinatorial algorithms for transposon insertion discovery. Bioinformatics (Oxford, England) 26, i350–7 (2010).
103. Korbel, J. O. et al. PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data. Genome biology 10, R23 (2009).
104. Quinlan, A. R. et al. Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. Genome research 20, 623–35 (2010).
105. Chiara, M., Pesole, G. & Horner, D. S. SVM2: an improved paired-end-based tool for the detection of small genomic structural variations using high-throughput single-genome resequencing data. Nucleic acids research (2012). doi:10.1093/nar/gks606
106. Lee, S., Hormozdiari, F., Alkan, C. & Brudno, M. MoDIL: detecting small indels from clone-end sequencing with mixtures of distributions. Nature methods 6, 473–4 (2009).
107. Lee, S. MoGUL: detecting common insertions and deletions in a population. Research in Computational Molecular Biology 1–12 (2010). at <http://www.springerlink.com/index/32W7184R7057461W.pdf>
108. Chen, K. et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nature methods 6, 677–81 (2009).
109. Venkatraman, E. S. & Olshen, A. B. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics (Oxford, England) 23, 657–63 (2007).
110. Miller, C. A., Hampton, O., Coarfa, C. & Milosavljevic, A. ReadDepth: a parallel R package for detecting copy number alterations from short sequencing reads. PloS one 6, e16327 (2011).
111. Yoon, S., Xuan, Z., Makarov, V., Ye, K. & Sebat, J. Sensitive and accurate detection of copy number variants using read depth of coverage. Genome research 19, 1586–92 (2009).
112. Magi, A., Benelli, M., Yoon, S., Roviello, F. & Torricelli, F. Detecting common copy number variants in high-throughput sequencing data by using JointSLM algorithm. Nucleic acids research 39, e65 (2011).
113. Abyzov, A., Urban, A. E., Snyder, M. & Gerstein, M. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome research 21, 974–84 (2011).
114. Wang, L.-Y., Abyzov, A., Korbel, J. O., Snyder, M. & Gerstein, M. MSB: a mean-shift-based approach for the analysis of structural variation in the genome. Genome research 19, 106–17 (2009).
115. Wang, Z., Hormozdiari, F. & Yang, W. CNVeM: copy number variation detection using uncertainty of read mapping. Research in Computational Molecular Biology 326–340 (2012). at <http://www.springerlink.com/index/P622187L42V41243.pdf>
116. Xi, R. et al. Copy number variation detection in whole-genome sequencing data using the Bayesian information criterion. Proceedings of the National Academy of Sciences of the United States of America 108, E1128–36 (2011).
117. Xie, C. & Tammi, M. T. CNV-seq, a new method to detect copy number variation using high-throughput sequencing. BMC bioinformatics 10, 80 (2009).
118. Chiang, D. Y. et al. High-resolution mapping of copy-number alterations with massively parallel sequencing. Nature methods 6, 99–103 (2009).
119. Kim, T.-M., Luquette, L. J., Xi, R. & Park, P. J. rSW-seq: algorithm for detection of copy number alterations in deep sequencing data. BMC bioinformatics 11, 432 (2010).
120. Ivakhno, S. et al. CNAseg--a novel framework for identification of copy number changes in cancer from second-generation sequencing data. Bioinformatics (Oxford, England) 26, 3051–8 (2010).
121. Magi, A., Tattini, L., Pippucci, T., Torricelli, F. & Benelli, M. Read count approach for DNA copy number variants detection. Bioinformatics (Oxford, England) 28, 470–8 (2012).
122. Abyzov, A. & Gerstein, M. AGE: defining breakpoints of genomic structural variants at single-nucleotide resolution, through optimal alignments with gap excision. Bioinformatics (Oxford, England) 27, 595–603 (2011).
123. Suzuki, S., Yasuda, T., Shiraishi, Y., Miyano, S. & Nagasaki, M. ClipCrop: a tool for detecting structural variations with single-base resolution using soft-clipping information. BMC bioinformatics 12 Suppl 1, S7 (2011).
124. Henson, J., Tischler, G. & Ning, Z. Next-generation sequencing and large genome assemblies. Pharmacogenomics 13, 901–15 (2012).
125. Warren, R. L., Sutton, G. G., Jones, S. J. M. & Holt, R. A. Assembling millions of short DNA sequences using SSAKE. Bioinformatics (Oxford, England) 23, 500–1 (2007).
126. Dohm, J. C., Lottaz, C., Borodina, T. & Himmelbauer, H. SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. Genome research 17, 1697–706 (2007).
127. Jeck, W. R. et al. Extending assembly of short DNA sequences to handle error. Bioinformatics (Oxford, England) 23, 2942–4 (2007).
128. Margulies, M. et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376–80 (2005).
129. Myers, E. W. et al. A whole-genome assembly of Drosophila. Science (New York, N.Y.) 287, 2196–204 (2000).
130. Hernandez, D., François, P., Farinelli, L., Osterås, M. & Schrenzel, J. De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome research 18, 802–9 (2008).
131. Hossain, M. S., Azimi, N. & Skiena, S. Crystallizing short-read assemblies around seeds. BMC bioinformatics 10 Suppl 1, S16 (2009).
132. Pevzner, P. A., Tang, H. & Tesler, G. De novo repeat classification and fragment assembly. Genome research 14, 1786–96 (2004).
133. Zerbino, D. R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome research 18, 821–9 (2008).
134. Butler, J. et al. ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome research 18, 810–20 (2008).
135. Simpson, J. T. et al. ABySS: a parallel assembler for short read sequence data. Genome research 19, 1117–23 (2009).
136. Li, R. et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome research 20, 265–72 (2010).
137. Simpson, J. T. & Durbin, R. Efficient de novo assembly of large genomes using compressed data structures. Genome research 22, 549–56 (2012).
138. Gnerre, S. et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences of the United States of America 108, 1513–8 (2011).
139. Chaisson, M. J., Brinza, D. & Pevzner, P. A. De novo fragment assembly with short mate-paired reads: Does the read length matter? Genome research 19, 336–46 (2009).
140. Mills, R. E. et al. Mapping copy number variation by population-scale genome sequencing. Nature 470, 59–65 (2011).
141. Iqbal, Z., Caccamo, M., Turner, I., Flicek, P. & McVean, G. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nature genetics 44, 226–32 (2012).
142. Baker, M. Structural variation: the genome’s hidden architecture. Nature methods 9, 133–7 (2012).
143. Harris, R. S. Improved pairwise alignment of genomic DNA. Ph.D. thesis, The Pennsylvania State University (2007).
144. Schwartz, S. et al. Human-mouse alignments with BLASTZ. Genome research 13, 103–7 (2003).
145. Hajirasouliha, I. et al. Detection and characterization of novel sequence insertions using paired-end next-generation sequencing. Bioinformatics (Oxford, England) 26, 1277–83 (2010).
146. Handsaker, R. E., Korn, J. M., Nemesh, J. & McCarroll, S. A. Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nature genetics 43, 269–76 (2011).
147. The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061–73 (2010).
148. Medvedev, P., Fiume, M., Dzamba, M., Smith, T. & Brudno, M. Detecting copy number variation with mated short reads. Genome research 20, 1613–22 (2010).
149. Sindi, S. S., Onal, S., Peng, L., Wu, H.-T. & Raphael, B. J. An integrative probabilistic model for identification of structural variation in sequencing data. Genome Biology 13, R22 (2012).
150. Qi, J. & Zhao, F. inGAP-sv: a novel scheme to identify and visualize structural variation from paired end mapping data. Nucleic acids research 39, W567–75 (2011).
151. Wang, J. et al. CREST maps somatic structural variation in cancer genomes with base-pair resolution. Nature methods 8, 652–4 (2011).
152. Wong, K., Keane, T. M., Stalker, J. & Adams, D. J. Enhanced structural variant and breakpoint detection using SVMerge by integration of multiple detection methods and local assembly. Genome biology 11, R128 (2010).
153. Jurka, J. Repbase update: a database and an electronic journal of repetitive elements. Trends in genetics: TIG 16, 418–20 (2000).
154. Schatz, M. C., Delcher, A. L. & Salzberg, S. L. Assembly of large genomes using second-generation sequencing. Genome research 20, 1165–73 (2010).
155. Bashir, A., Bansal, V. & Bafna, V. Designing deep sequencing experiments: detecting structural variation and estimating transcript abundance. BMC genomics 11, 385 (2010).
156. Metzker, M. L. Sequencing technologies - the next generation. Nature reviews. Genetics 11, 31–46 (2010).
157. Kim, P. M. et al. Analysis of copy number variants and segmental duplications in the human genome: Evidence for a change in the process of formation in recent evolutionary history. Genome research 18, 1865–74 (2008).
158. Wong, K. K. et al. A comprehensive analysis of common copy-number variations in the human genome. American journal of human genetics 80, 91–104 (2007).
159. Pareek, C. S., Smoczynski, R. & Tretyn, A. Sequencing technologies and genome sequencing. Journal of applied genetics 52, 413–35 (2011).
160. Eid, J. et al. Real-time DNA sequencing from single polymerase molecules. Science (New York, N.Y.) 323, 133–8 (2009).