Chapter 6: Structural Variation and Medical Genomics
CS-6293 Bioinformatics
Instructor: Dr. Jianhua Ruan
Presented by: Nesthor Perez

Outline
1. Introduction
2. Germline and Somatic SVs
3. Technologies for Measurement of SVs
4. Resequencing Strategies for SVs
5. Representation of SVs
6. Challenges for Cancer Genomics
7. Future Prospects

1. Introduction
• Every human has a different genome.
• Each genome carries variants that influence traits, including susceptibility to disease.
• Genome-wide association studies (GWAS) have identified common germline variants.
• These DNA variants are associated with diabetes, heart disease, and other diseases.
• GWAS explain only a fraction of the heritability of these traits.

[Figure: every person has a different genome sequence; each person's genetic makeup shapes their traits for each disease.]

• Cancer genome sequencing studies have identified somatic mutations associated with cancer progression.
• These mutations are very heterogeneous.
• Few mutations are shared between patients.
• It is therefore hard to associate mutations with the causes of cancer.
• Comprehensive studies must consider all variants, so an individual genome is required for each case.

• GWAS focus on single nucleotide polymorphisms (SNPs); every human genome is unique.
• Germline variants span a range of scales of DNA sequence, from SNPs to structural variants (SVs).
• Examples of SVs:
  – Duplications.
  – Deletions.
  – Inversions.
  – Translocations.

• GWAS identified common SNPs:
  – Common SNPs for common diseases (similarities).
  – Common variants between diseases (differences).
• Main purpose: disease association and cancer genetics studies.
• In the last 5 years, next-generation DNA sequencing technology has become commercially available from companies such as:
  – Illumina.
  – Life Technologies.
  – Complete Genomics.

[Figure: chromosome components.]

[Figure: variation relative to a reference genome ranges from SNPs to structural variants.]

[Figure: over the last 5 years these companies developed sequencing technology, and the cost of DNA sequencing decreased as a result.]

• Consequently, the cost of DNA sequencing has decreased.
• With low-cost sequencing, the study of all variants is possible.
• All variants:
  – Germline.
  – Somatic.
  – SNPs (single nucleotide polymorphisms).
  – SVs (structural variants).
• This chapter covers these sequencing technologies, with emphasis on structural variants (SVs).

2.1 Germline Structural Variation
• A major goal of human genetics studies is to identify the DNA sequence differences that distinguish individuals.
• Attempts:
  – Identify common SNPs (HapMap project).
  – Whole-genome sequencing and microarray measurements found common SVs, including duplications, deletions, and inversions.
• Common SVs are now linked to:
  – Autism.
  – Schizophrenia.

[Figure: steps from identifying common SNPs to finding SVs (duplications, deletions, inversions) in large DNA sequences.]

2.2 Somatic Structural Variation
• Cancer is driven by somatic mutations that accumulate during a lifetime: a “microevolutionary process”.
• Early studies focused on leukemia and lymphoma.
• These studies identified “recurrent chromosomal rearrangements”.
• The same rearrangements are present in many patients with the same cancer.
• Next-generation DNA sequencing reconstructs how cancer genomes are organized at single-nucleotide resolution.

2.3 Mechanisms of Structural Variation
• Based on the amount of sequence similarity (homology) at the breakpoints of SVs, there are two mechanisms:
  – NHEJ (non-homologous end joining): little or no sequence similarity.
  – NAHR (non-allelic homologous recombination): high sequence similarity.

[Figures: cytogenetic techniques, including chromosome painting and fluorescent in situ hybridization (FISH).]

3. Technologies for Measurement of Structural Variation
• SVs are characterized by:
  – Size.
  – Complexity.
• They range from hundreds of nucleotides to large-scale chromosomal rearrangements.
• Cytogenetic techniques:
  – Chromosome painting.
  – Spectral karyotyping (SKY).
  – Fluorescent in situ hybridization (FISH).
• Large SVs can be observed directly on chromosomes.

3.1 Microarrays
• Microarrays were used for the first genome-wide survey of SVs in 2004.
• The technique applies array comparative genomic hybridization (aCGH).
• The test and reference genomes are each labeled with a different fluorescent color.
• Current arrays carry hundreds of thousands of probes.
• Because individual copy number ratios are subject to experimental error, computational techniques are required to analyze aCGH data.
• aCGH can measure both germline SVs in normal genomes and somatic SVs in cancer genomes.
• aCGH was initially developed for cancer genomics applications.
• aCGH is now also used to detect copy number variants in large numbers of genomes at low cost.
• aCGH limitations:
  – It detects only copy number variants.
  – It requires that genomic probes from the reference genome lie in non-repetitive regions.

3.2 Next-generation DNA Sequencing Technologies
• As DNA sequencing technology has become more sophisticated, the cost of DNA analysis has dropped substantially.
• One limitation is the length of the DNA molecule that can be sequenced.
• Reads are short, ranging from 30 to 1000 nucleotides, or base pairs (bp).

• Some sequencing technologies use a paired-end sequencing protocol to increase the effective read length.
• In earlier Sanger sequencing protocols, the DNA fragment size depended on the cloning vector.
• In next-generation technologies, several techniques have been used to generate paired reads.
• Current techniques produce paired reads from fragments ranging from a few hundred bp to 2-3 kb.

• Next-generation sequencing technologies have limited read lengths and limited insert sizes compared with Sanger sequencing.
• Two approaches to detect SVs using next-generation DNA sequencing:
  – De novo assembly: sophisticated algorithms reconstruct the genome sequence from overlaps between reads. Human genome assemblies built this way are highly fragmented.
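As a minimal illustration of the overlap idea behind de novo assembly, the Python sketch below finds the longest suffix of one read that matches a prefix of another and merges the two reads. The function names, the `min_len` threshold, and the example reads are hypothetical and chosen only for illustration; real assemblers use far more sophisticated algorithms and must cope with sequencing errors and repeats.

```python
from typing import Optional


def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    # Try the longest candidate overlap first and shrink until one matches.
    for k in range(max_len, min_len - 1, -1):
        if k > 0 and a[-k:] == b[:k]:
            return k
    return 0


def merge_reads(a: str, b: str, min_len: int = 3) -> Optional[str]:
    """Merge two reads on their exact overlap, or return None if there is none."""
    k = suffix_prefix_overlap(a, b, min_len)
    if k == 0:
        return None
    return a + b[k:]


# Hypothetical reads sharing the 4-base overlap "TGCA".
read_1 = "ACGTTGCA"
read_2 = "TGCAGGTA"
contig = merge_reads(read_1, read_2)
```

Because only exact overlaps are accepted here, repeated sequence longer than a read would make the overlap ambiguous, which is one reason short-read assemblies end up fragmented.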
3.2 Next-generation DNA Sequencing Technologies
• Two approaches to detect SVs using next-generation DNA sequencing:
  – Resequencing: differences between an individual genome and a closely related reference genome are found from the differences between the aligned reads and the reference sequence.

[Figure: advantages, from earlier generations of DNA sequencing to new sequencing technology.]

• Disadvantage: the length of the DNA molecule that can be sequenced is limited. Today’s technologies produce short sequences of DNA, ranging from 30 to 1000 nucleotides.
• To increase the effective read length, these sequencing technologies use paired-end or mate-pair protocols.

[Figure: the two approaches to detect SVs.]

3.3 New DNA Sequencing Technologies
• Previous DNA sequencing technologies faced several limitations.
• For example: SV breakpoints in highly repetitive sequences.
• Third-generation and single-molecule technologies offer additional advantages for detecting SVs:
  – Longer read lengths.
  – Easier sample preparation.
  – Lower input DNA requirements.
  – Higher throughput.
• Expected improvements from third-generation technologies:
  – Paired reads that include more than two reads from a single DNA fragment.
  – Long-range sequence information with low input DNA requirements.
• Sequencing technologies continue to develop quickly thanks to improvements in:
  – Chemistry.
  – Imaging.
  – Manufacturing technology.
• Further improvements are expected in:
  – Read lengths.
  – Insert lengths.
  – Throughput.
• Nanopore sequencing is a new technology that directly reads the nucleotides of long DNA molecules, a dramatic advance.
• Nanopore sequencing generates extremely long reads (tens of kb).
[Figures: new features of these technologies: longer read lengths, higher throughput, easier sample preparation, lower input DNA requirements.]

[Figure: active development continues thanks to improvements in chemistry, image processing, and data processing.]

4. Resequencing Strategies for Structural Variation
• Purpose: predict SVs from alignments of sequence reads to the reference genome.
• Steps:
  – Step 1: align the reads to the reference genome.
  – Step 2: predict SVs from the alignments.
• Resequencing is straightforward in principle, but detecting SVs in human genomes is hard.
• Some types of SVs are easy to detect; others are very difficult.

[Figure: step 1, alignment of reads.]

[Figure: step 2, prediction of SVs from alignments.]

• Some SVs are hard to detect because of technological limitations and biological features.
• Technological limitations:
  – Sequencing errors.
  – Limited read lengths.
  – Limited insert sizes.
• Biological features of SVs:
  – Breakpoints enriched for repetitive sequences.
  – Overlapping variants, with multiple states or complex architectures.
  – Recurrent variants at the same locus.
• Therefore, aligning reads and predicting SVs are not easy tasks.
• Effective algorithms are required for highly sensitive and specific prediction of SVs.
• Three approaches to identify SVs from aligned reads:
  – Split reads.
  – Depth of coverage analysis.
  – Paired-end mapping.

4.1 Read Alignment
• Read alignment is one of the most studied problems in bioinformatics.
• The specialized task of aligning millions to billions of individual short reads is handled by software such as:
  – Maq.
  – BWA.
  – Bowtie/Bowtie2.
  – BFAST.
  – mrsFAST.
• An aligner can report a single alignment for each read, or every high-quality alignment for reads that have several.
• Another option, when several alignments have equal score, is to choose one of them at random.
• Keeping only unique alignments limits the detection of SVs with breakpoints in repetitive regions.
• Keeping ambiguous alignments requires an SV prediction algorithm that can distinguish between the multiple possible alignments of each read.

4.2 Split Reads
• This is a direct approach: SVs are detected from reads whose alignment to the reference is split into two parts.
• To reduce false positive predictions, multiple split reads are required.
• Split-read analysis is feasible only when reads are sufficiently long.

4.3 Depth of Coverage
• Depth of coverage analysis detects differences in the number of reads that align to intervals of the reference genome.
• The average number of reads covering a nucleotide is c = NL/G, where:
  – N is the number of reads.
  – L is the length of each read.
  – G is the length of the genome.
  – c is the coverage.
• For example, “30X coverage” means c = 30: each nucleotide is covered by 30 reads on average.
• If the individual genome has a deletion of a segment, the coverage of that segment is reduced; for a deletion on one of two chromosome copies, it drops to about half.
• If an interval of the reference genome is duplicated or amplified, its coverage increases in proportion to the number of copies.
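The coverage relation c = NL/G and the deletion and duplication cases can be worked through with a short Python sketch. All numbers are hypothetical (a 3 Gb genome, 100 bp reads, 900 million reads), the function names are invented for illustration, and the copy-number estimate assumes a diploid genome with uniform, unbiased coverage, which real data rarely provide.

```python
def mean_coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average number of reads covering each nucleotide: c = N * L / G."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return num_reads * read_length / genome_length


def estimated_copies(observed_depth: float, expected_depth: float, ploidy: int = 2) -> float:
    """Rough copy-number estimate for an interval from its read depth.

    Assumes depth scales linearly with copy number (an idealization).
    """
    if expected_depth <= 0:
        raise ValueError("expected_depth must be positive")
    return ploidy * observed_depth / expected_depth


# Hypothetical sequencing run.
G = 3_000_000_000   # genome length in bp
L = 100             # read length in bp
N = 900_000_000     # number of reads
c = mean_coverage(N, L, G)

# Depth observed in three hypothetical intervals.
one_copy_deleted = estimated_copies(observed_depth=15, expected_depth=c)
both_copies_deleted = estimated_copies(observed_depth=0, expected_depth=c)
one_extra_copy = estimated_copies(observed_depth=45, expected_depth=c)
```

With these inputs the run corresponds to 30X coverage, an interval at half the expected depth is consistent with one remaining copy, and an interval at 1.5 times the expected depth is consistent with three copies.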
• The depth of coverage thus indicates the number of copies of an interval in the genome.
• Coverage calculations are affected by repetitive sequences.

4.4 Paired-end Sequencing and Mapping
• Paired-end mapping is the most common resequencing approach.
• It is used to identify both somatic SVs in cancer genomes and germline SVs.
• It is supported by several next-generation sequencing technologies.
• Paired reads are obtained from opposite ends of a larger DNA fragment.
• The exact length of any particular sequenced fragment is unknown.

5. Representation of Structural Variants
• The DNA technologies described earlier have reduced the cost of surveying SVs.
• The Cancer Genome Atlas (TCGA) is performing paired-end sequencing and aCGH of several human genomes.
• Microarray-based techniques, on the other hand, are being used for small or single-investigator projects.
• An enormous number of SV measurements is therefore expected in the future.

6. Challenges for Cancer Genomics Studies
• Most cancer genomes are aneuploid, so the number of copies of a region varies.
• High-resolution reconstruction of cancer genomes reveals variants that are too small to be detected by cytogenetics.
• A tumor is a heterogeneous mixture of cells, possibly with different numbers of mutations.
• Heterogeneity includes both admixture and subpopulations of tumor cells.
• Some subpopulations contain mutations that others lack.
• Most cancer genome studies do not sequence single tumor cells; they sequence a mixture of cells.

7. Future Prospects
• It will be possible to systematically measure all but the most complex variants in an individual genome.
• SVs between nearly identical sequences might remain inaccessible until significantly different types of DNA sequencing technologies become available.
• With a complete list of germline SVs, unexplained heritability for a trait can no longer readily be attributed to a lack of measured genetic information.
• Establishing the efficacy of particular treatments will require substantial additional work.

Thanks