Chapter 6: Structural Variation and Medical Genomics CS-6293 Bioinformatics

Instructor: Dr. Jianhua Ruan
Presented by:
Nesthor Perez
Outline
1. Introduction
2. Germline and Somatic SVs
3. Technologies for Measurement of SVs
4. Resequencing Strategies for SVs
5. Representation of SVs
6. Challenges for Cancer Genomics
7. Future Prospects
1. Introduction
• Every human has a different genome sequence.
• Genetic differences between individuals influence traits, including susceptibility to disease.
• Genome-wide association studies (GWAS) have identified common germline DNA variants associated with diabetes, heart disease, and other diseases.
• These variants explain only a fraction of the heritability of those traits.
1. Introduction
[Figure: every person has a different genome sequence, and each person's genetic makeup shapes the traits relevant to each disease.]
1. Introduction
• Cancer genome sequencing studies have identified somatic mutations associated with cancer progression.
• These mutations are very heterogeneous: few are shared between patients.
• This makes it hard to link specific mutations to the causes of cancer.
• Comprehensive studies must consider all variants, so an individual genome is required for each case.
1. Introduction
• GWAS focus on single-nucleotide polymorphisms (SNPs); every human genome is unique.
• Germline variants occur across a range of scales of DNA sequence, from SNPs to structural variants (SVs).
• Examples of SVs:
– Duplications.
– Deletions.
– Inversions.
– Translocations.
1. Introduction
• GWAS identified common single-nucleotide polymorphisms (SNPs):
– Common SNPs for common diseases (similarities).
– Common variants between diseases (differences).
• Main purpose: disease association and cancer genetics studies.
• In the last 5 years, next-generation DNA sequencing technology has become commercially available from companies such as:
– Illumina
– Life Technologies
– Complete Genomics
1. Introduction
[Figure: chromosome components.]
1. Introduction
[Figure: differences from a reference genome range from SNPs to structural variants.]
1. Introduction
[Figure: sequencing technologies developed by these companies in the last 5 years; as a consequence, the cost of DNA sequencing decreased.]
1. Introduction
• Consequently, the cost of DNA sequencing has decreased.
• With low-cost sequencing, the study of all variants becomes possible.
• All variants:
– Germline.
– Somatic.
– SNPs (single-nucleotide polymorphisms).
– SVs (structural variants).
• This paper discusses these sequencing technologies, with a focus on structural variants (SVs).
2.1 Germline Structural Variation
• A major goal of human genetics is to identify the DNA sequence variants that distinguish individuals.
• Approaches so far:
– Identifying common SNPs (HapMap project).
– Whole-genome sequencing and microarray measurements, which found common SVs of these types:
  – Duplications
  – Deletions
  – Inversions
• Common SVs have since been linked to:
– Autism
– Schizophrenia
2.1 Germline Structural Variation
[Figure: diagram of the steps in human genetics studies: identify common SNPs, then use whole-genome sequencing and microarray measurements on large DNA sequences to find SVs (duplications, deletions, inversions).]
2.2 Somatic Structural Variation
• Cancer is driven by somatic mutations that accumulate during a person's lifetime, a "microevolutionary process".
• Early studies in leukemia and lymphoma identified recurrent chromosomal rearrangements.
• These rearrangements are present in many patients with the same cancer.
• Next-generation DNA sequencing makes it possible to reconstruct how cancer genomes are organized at single-nucleotide resolution.
2.3 Mechanisms of Structural Variation
• Based on the amount of sequence similarity (homology) at the breakpoints of SVs, there are two mechanisms:
– NHEJ (non-homologous end joining): little or no sequence similarity.
– NAHR (non-allelic homologous recombination): high sequence similarity.
2.3 Mechanisms of Structural Variation
Cytogenetic techniques:
[Figure: chromosome painting.]
[Figure: fluorescence in situ hybridization (FISH).]
3. Technologies for Measurement of Structural Variation
• SVs are characterized by:
– Size.
– Complexity.
– Range: from hundreds of nucleotides to large-scale chromosomal rearrangements.
• Cytogenetic techniques:
– Chromosome painting.
– Spectral karyotyping (SKY).
– Fluorescence in situ hybridization (FISH).
3. Technologies for Measurement of Structural Variation
• Large SVs can be observed directly on chromosomes.
[Figure: large SVs visible on chromosomes.]
3.1 Microarrays
• Microarrays were used for the first genome-wide surveys in 2004.
• The technique applies array comparative genomic hybridization (aCGH).
• The test genome and the reference genome are each labeled with a different fluorescent color; the ratio of the two intensities at each probe estimates the copy number ratio.
• Arrays with hundreds of thousands of probes are now available.
• Because individual copy number ratios are subject to experimental error, computational techniques are required to analyze aCGH data.
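Why computational analysis is needed for aCGH can be illustrated with a minimal Python sketch. It assumes per-probe test and reference intensities are already available as plain lists; the function names, the window size of 3, and the ±0.3 log2 thresholds are illustrative assumptions, not the method of any particular aCGH package.

```python
import math
from statistics import median

def log2_ratios(test, reference):
    """Per-probe log2(test / reference) intensity ratios."""
    return [math.log2(t / r) for t, r in zip(test, reference)]

def median_smooth(values, window=3):
    """Running median that damps single-probe experimental noise."""
    half = window // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]

def call_probes(ratios, gain=0.3, loss=-0.3):
    """Label each probe 'gain', 'loss', or 'neutral' (thresholds are assumed)."""
    calls = []
    for r in ratios:
        if r >= gain:
            calls.append("gain")
        elif r <= loss:
            calls.append("loss")
        else:
            calls.append("neutral")
    return calls

# Hypothetical intensities: one noisy probe (index 2) and a real
# three-probe loss (indices 4 to 6).
test = [100, 105, 300, 98, 50, 48, 52, 101, 99]
ref = [100] * 9
raw = log2_ratios(test, ref)
calls = call_probes(median_smooth(raw))
```

Calling gains and losses on the raw ratios would report the single noisy probe as a gain; after smoothing, only the run of consecutive low-ratio probes is called as a loss.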
3.1 Microarrays
• aCGH can measure both germline SVs in normal genomes and somatic SVs in cancer genomes.
• aCGH was initially developed for cancer genomics applications.
• aCGH is now also used to detect copy number variants in large numbers of genomes at low cost.
• aCGH limitations:
– Detects only copy number variants.
– Requires that genomic probes from the reference genome lie in non-repetitive regions.
3.2 Next-generation DNA Sequencing Technologies
• As DNA sequencing technology has advanced substantially, the cost of sequencing has dropped sharply.
• One limitation is the length of the DNA molecule that can be sequenced.
• Reads are short, ranging from 30 to 1000 nucleotides, or base pairs (bp).
3.2 Next-generation DNA Sequencing Technologies
• Some DNA sequencing technologies use a paired-end sequencing protocol to increase the effective read length.
• In earlier Sanger sequencing protocols, the DNA fragment size depended on the cloning vector.
• In next-generation technologies, several techniques have been used to generate paired reads.
• The latest techniques produce paired reads from fragments ranging from a few hundred bp to 2-3 kb.
3.2 Next-generation DNA Sequencing Technologies
• Next-generation sequencing technologies have limited read lengths and limited insert sizes in comparison to Sanger sequencing.
• Two approaches to detect SVs using next-generation DNA sequencing technology:
– De novo assembly:
  – Sophisticated algorithms reconstruct the genome sequence from overlaps between reads.
  – The resulting human genome assemblies are highly fragmented.
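The idea of reconstructing a sequence from overlaps between reads can be sketched in a few lines of Python. This is a toy greedy merge on exact suffix-prefix overlaps, written only for illustration; the function names and the minimum overlap of 3 are assumptions, and real assemblers use far more sophisticated graph-based algorithms that tolerate sequencing errors.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no overlaps left: the assembly stays fragmented
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

contigs = greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTTA"])
```

When no pair of reads overlaps by at least `min_len` bases, the loop stops and several contigs remain, which is a small-scale picture of a fragmented assembly.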
3.2 Next-generation DNA Sequencing Technologies
• Two approaches to detect SVs using next-generation DNA sequencing technology (continued):
– Resequencing:
  – Differences are found between an individual genome and a closely related reference genome.
  – These differences are inferred from the differences between the aligned reads and the reference sequence.
3.2 Next-generation DNA Sequencing Technologies
Advantages:
[Figure: progression from earlier DNA sequencing to new sequencing technology.]
Disadvantages:
• The length of the DNA molecule that can be sequenced is limited: today's technologies produce short sequences of DNA.
• Range: 30 to 1000 nucleotides.
• To increase the effective read length, these DNA sequencing technologies use paired-end or mate-pair sequencing.
3.2 Next-generation DNA Sequencing Technologies
There are two approaches to detect SVs: de novo assembly and resequencing.
3.3 New DNA Sequencing Technologies
• Previous DNA sequencing technologies face several limitations.
• For example:
– SV breakpoints in highly repetitive sequences.
• Third-generation and single-molecule technologies offer additional advantages for SV detection:
– Longer read lengths.
– Easier sample preparation.
– Lower input DNA requirements.
– Higher throughput.
3.3 New DNA Sequencing Technologies
• Expected improvements from third-generation technologies:
– Paired reads: more than two reads from a single DNA fragment.
– Long-range sequence information with low input DNA requirements.
• Sequencing technologies continue to develop rapidly thanks to improvements in:
– Chemistry.
– Imaging.
– Manufacturing technology.
3.3 New DNA Sequencing Technologies
• Further improvements are expected in:
– Read lengths.
– Insert lengths.
– Throughput.
• A new sequencing technology is nanopore sequencing, which directly reads the nucleotides of long DNA molecules, a dramatic advance.
• Nanopore sequencing generates extremely long reads (tens of kb).
3.3 New DNA Sequencing Technologies
[Figures: new features (longer read lengths, higher throughput, easier sample preparation, lower input DNA requirements) and areas of active development (chemistry, image processing, data processing).]
4. Resequencing Strategies for Structural Variation
• Purpose: predict SVs from alignments of sequence reads to the reference genome.
• Steps:
– Align the reads to the reference genome.
– Predict SVs from the alignments.
• Resequencing is straightforward in principle, but detecting SVs in human genomes is hard in practice.
• Some types of SVs are easy to detect; others are very difficult.
4. Resequencing Strategies for Structural Variation
[Figure: Step 1, alignment of reads.]
[Figure: Step 2, prediction of SVs from the alignments (panel label: "Disease").]
4. Resequencing Strategies for Structural Variation
• Some SVs are hard to detect because of technological limitations and biological features.
• Technological limitations:
– Sequencing errors.
– Limited read lengths.
– Limited insert sizes.
• Biological features of SVs:
– Enrichment for repetitive sequences near their breakpoints.
– Overlapping variants with multiple states or complex architectures.
– Recurrent variants at the same locus.
4. Resequencing Strategies for Structural Variation
• Therefore, aligning reads and predicting SVs are not easy tasks.
• Effective algorithms are required for highly sensitive and specific predictions of SVs.
• Three approaches to identify SVs from aligned reads:
– Split reads.
– Depth of coverage analysis.
– Paired-end mapping.
4.1 Read Alignment
• Read alignment is one of the most researched problems in bioinformatics.
• The specialized task of aligning millions to billions of individual short reads is done by software such as:
– MAQ.
– BWA.
– Bowtie/Bowtie2.
– BFAST.
– mrsFAST.
4.1 Read Alignment
• An aligner can report a single alignment for each read, or all high-quality alignments for reads that map to multiple locations.
• Another option is to choose one alignment at random when several alignments have equal score.
• With unique alignments only, the ability to detect SVs with breakpoints in repetitive regions is limited.
• With ambiguous alignments, SV prediction requires an algorithm that distinguishes between the multiple possible alignments of each read.
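The three reporting options above can be made concrete with a small Python sketch. It assumes each read comes with a list of candidate alignments as `(position, score)` tuples where a higher score is better; the function name `select_alignments` and the policy names are hypothetical and do not correspond to the options of any specific aligner.

```python
import random

def select_alignments(candidates, policy="unique", rng=None):
    """Pick alignments for one read from (position, score) candidates."""
    if not candidates:
        return []
    top = max(score for _, score in candidates)
    best = [c for c in candidates if c[1] == top]
    if policy == "unique":
        # Keep the read only if its best alignment is unambiguous.
        return best if len(best) == 1 else []
    if policy == "random":
        # Break ties between equally good alignments at random.
        return [(rng or random).choice(best)]
    if policy == "all":
        # Report every best-scoring alignment; a downstream SV caller
        # must then choose among them.
        return best
    raise ValueError(f"unknown policy: {policy}")

# A read from a repetitive region: two equally good placements.
repeat_read = [(100, 60), (5000, 60), (9000, 40)]
# A read with one clearly best placement.
unique_read = [(42, 60), (77, 30)]
```

Under the `unique` policy the repetitive read is discarded entirely, which is why breakpoints inside repeats become hard to detect; under `all` both placements are kept and the ambiguity is passed on to the SV prediction step.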
4.2 Split Reads
• Split reads are a direct approach to detect SVs: a read that spans a breakpoint aligns to the reference in two parts.
• To reduce false positive predictions, multiple split reads supporting the same breakpoint are required.
• Split-read analysis is feasible only when reads are sufficiently long.
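A minimal Python sketch of the split-read idea follows, restricted to deletions and exact matches. The sequences, the function names, and the `min_anchor` and `min_support` values are all made up for illustration; real split-read callers work from gapped or clipped alignments and tolerate mismatches.

```python
from collections import Counter

def find_split(read, ref, min_anchor=8):
    """Return (start, end) of a deleted reference interval, or None."""
    if ref.find(read) != -1:
        return None  # the read aligns contiguously, no split needed
    # Try the longest prefix first so that short spurious anchors lose.
    for k in range(len(read) - min_anchor, min_anchor - 1, -1):
        p = ref.find(read[:k])
        if p == -1:
            continue
        s = ref.find(read[k:], p + k)
        if s != -1 and s > p + k:
            return (p + k, s)
    return None

def supported_breakpoints(reads, ref, min_support=2):
    """Keep only breakpoints seen in at least `min_support` split reads."""
    counts = Counter()
    for read in reads:
        split = find_split(read, ref)
        if split is not None:
            counts[split] += 1
    return {bp: n for bp, n in counts.items() if n >= min_support}

# Toy reference; the individual genome lacks the middle segment.
left = "GATTACAGGCTTAACCGGAT"
deleted = "TTTTCCCCGGGGAAAA"
right = "CAGTCATGCATCGATCGTAC"
ref = left + deleted + right
reads = [left[-10:] + right[:10],   # spans the deletion
         left[-12:] + right[:9],    # spans the deletion
         ref[5:25]]                 # ordinary read
calls = supported_breakpoints(reads, ref)
```

The `min_anchor` parameter reflects why read length matters: each part of a split read must be long enough to be placed on the reference on its own.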
4.3 Depth of Coverage
• Depth of coverage analysis detects differences in the number of reads that align to intervals in the reference genome.
• The average number of reads covering each nucleotide is:
$c = \frac{NL}{G}$, where
N is the number of reads,
L is the length of each read,
G is the length of the genome, and
c is the coverage.
• For example, "30X coverage" means that each nucleotide is covered by c = 30 reads on average.
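As a worked example of the coverage formula, the short Python sketch below computes c from N, L, and G, and inverts the formula to get the number of reads needed for a target coverage. The genome length of 3 billion bp and the read length of 100 bp are round-number assumptions chosen for easy arithmetic.

```python
import math

def coverage(num_reads, read_length, genome_length):
    """Average coverage c = N * L / G."""
    return num_reads * read_length / genome_length

def reads_needed(target_coverage, read_length, genome_length):
    """Smallest N such that N * L / G reaches the target coverage."""
    return math.ceil(target_coverage * genome_length / read_length)

G = 3_000_000_000  # assumed genome length in bp
L = 100            # assumed read length in bp
c = coverage(900_000_000, L, G)
n = reads_needed(30, L, G)
```

With these assumed values, 900 million reads of 100 bp give 30X coverage.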
4.3 Depth of Coverage
• If an individual genome has a heterozygous deletion of a segment, the coverage of that segment is reduced by half; a homozygous deletion reduces it to zero.
• If an interval of the reference genome is duplicated or amplified, the coverage increases in proportion to the number of copies.
• The coverage depth therefore indicates the number of copies of the interval in the genome.
• Coverage calculations are affected by repetitive sequences.
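The relation between coverage and copy number can be sketched as follows in Python. The sketch assumes a diploid genome, an expected depth already estimated from copy-neutral regions, and per-interval mean depths given as plain numbers; the function names and sample depths are hypothetical, and real callers also correct for GC content and mappability.

```python
def estimate_copy_number(interval_depth, expected_depth, ploidy=2):
    """Nearest integer copy number implied by the depth ratio."""
    return round(ploidy * interval_depth / expected_depth)

def label_interval(copy_number, ploidy=2):
    """Name the event relative to the normal copy number."""
    if copy_number < ploidy:
        return "deletion"
    if copy_number > ploidy:
        return "duplication"
    return "normal"

expected = 30  # assumed 30X coverage in copy-neutral regions
depths = [31, 14, 0, 45, 60]
copies = [estimate_copy_number(d, expected) for d in depths]
labels = [label_interval(c) for c in copies]
```

Half the expected depth maps to one copy (heterozygous deletion), zero depth to no copies, and 1.5 times the expected depth to three copies.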
4.4 Paired-end Sequencing and Mapping
• Paired-end mapping is the most common resequencing approach.
• It is used to identify both somatic SVs in cancer genomes and germline SVs.
• It is supported by several next-generation sequencing technologies.
• Paired reads are obtained from opposite ends of a larger DNA fragment.
• The length of any particular sequenced fragment is unknown; only the distribution of fragment lengths in the library can be estimated.
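How paired reads point to SVs can be sketched in Python by comparing each pair's mapped distance and orientation with what the library's fragment length distribution predicts. The sketch assumes a library in which a normal pair maps as forward then reverse on the same chromosome; the `ReadPair` type, the mean of 300 bp, the standard deviation of 30 bp, and the three-standard-deviation cutoff are illustrative assumptions, and each label is only a candidate signature, not a confirmed variant.

```python
from dataclasses import dataclass

@dataclass
class ReadPair:
    chrom1: str
    pos1: int
    strand1: str  # '+' or '-'
    chrom2: str
    pos2: int
    strand2: str  # '+' or '-'

def classify_pair(pair, mean_insert=300, sd_insert=30, n_sd=3):
    """Label a pair as concordant or as a candidate SV signature."""
    if pair.chrom1 != pair.chrom2:
        return "translocation"
    if pair.strand1 == pair.strand2:
        return "inversion"
    left, right = sorted([(pair.pos1, pair.strand1),
                          (pair.pos2, pair.strand2)])
    if left[1] == "-" and right[1] == "+":
        return "tandem duplication"  # reads point away from each other
    span = right[0] - left[0]
    if span > mean_insert + n_sd * sd_insert:
        return "deletion"   # reads map farther apart than expected
    if span < mean_insert - n_sd * sd_insert:
        return "insertion"  # reads map closer together than expected
    return "concordant"

pairs = [
    ReadPair("chr1", 1000, "+", "chr1", 1300, "-"),
    ReadPair("chr1", 1000, "+", "chr1", 5000, "-"),
    ReadPair("chr1", 1000, "+", "chr1", 1100, "-"),
    ReadPair("chr1", 1000, "+", "chr1", 1300, "+"),
    ReadPair("chr1", 1000, "-", "chr1", 1300, "+"),
    ReadPair("chr1", 1000, "+", "chr2", 1300, "-"),
]
labels = [classify_pair(p) for p in pairs]
```

Because only the fragment length distribution is known, a single discordant pair is weak evidence; in practice several pairs with consistent signatures are clustered before a variant is reported.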
5. Representation of Structural Variants
• The technologies described earlier have reduced the cost of surveying SVs.
• The Cancer Genome Atlas (TCGA) is performing paired-end sequencing and aCGH on several human genomes.
• Meanwhile, microarray-based techniques are being used for small or single-investigator projects.
• Therefore, an enormous number of SV measurements is expected in the future.
6. Challenges for Cancer Genomics Studies
• Most cancer genomes are aneuploid, so the copy number of genomic regions is variable.
• High-resolution reconstruction of cancer genomes is needed for rearrangements that are too small to be detected by cytogenetics.
• A tumor is a heterogeneous mixture of cells, possibly carrying different numbers of mutations.
• Heterogeneity means both admixture and subpopulations of tumor cells.
• Only some subpopulations contain a given mutation.
• Most cancer genome studies do not sequence single tumor cells; they sequence a mixture of cells.
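The effect of sequencing a mixture of cells can be illustrated with a short Python sketch of the expected depth of coverage under admixture. It assumes a simple two-population model, a fraction `purity` of tumor cells that all share one copy number at the region and a remainder of normal diploid cells, with depth expressed relative to a diploid region; the function names are hypothetical and subclonal structure is ignored.

```python
def expected_depth_ratio(tumor_copies, purity, normal_copies=2):
    """Depth at a region relative to a diploid region, for a cell mixture."""
    mixed = purity * tumor_copies + (1 - purity) * normal_copies
    return mixed / normal_copies

def infer_tumor_copies(depth_ratio, purity, normal_copies=2):
    """Invert the mixture model to recover the tumor copy number."""
    mixed = depth_ratio * normal_copies
    return (mixed - (1 - purity) * normal_copies) / purity

# A one-copy loss in the tumor cells, at two purity levels.
pure = expected_depth_ratio(tumor_copies=1, purity=1.0)
mixed = expected_depth_ratio(tumor_copies=1, purity=0.5)
```

In this model a one-copy loss halves the depth in a pure tumor sample but lowers it by only a quarter when half of the sequenced cells are normal, so the same variant gives a weaker signal in an admixed sample.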
7. Future Prospects
• It will be possible to systematically measure nearly all variants, except the most complex, in an individual genome.
• SVs between nearly identical sequences might remain inaccessible until significantly different types of DNA sequencing technologies become available.
• With a complete list of germline SVs, unexplained heritability for a trait can no longer readily be attributed to a lack of measurement of genetic information.
• Establishing the efficacy of particular treatments will require substantial additional work.
Thanks