CZ5225: Modeling and Simulation in Biology
Lecture 10: Copy Number Variations
Prof. Chen Yu Zong
Tel: 6516-6877
Email: phacyz@nus.edu.sg
http://bidd.nus.edu.sg
Room 08-14, level 8, S16, NUS

Copy number variation (CNV): What is it?
• A form of human genetic variation: instead of 2 copies of each region of each chromosome (diploid), some people have amplifications or losses (> 1 kb) in different regions
  – this doesn’t include translocations or inversions
• We all have such regions
  – the publicly available genome NA15510 has between 5 and 240, by various estimates
  – they are only rarely harmful (but rare things do happen)

Copy-number probes are used to quantify the amount of DNA at known loci
CN locus: ...CGTAGCCATCGGTAAGTACTCAATGATAG...
PM: ATCGGTAGCCATTCATGAGTTACTA
• CN=1: PM = c
• CN=2: PM = 2c
• CN=3: PM = 3c

Copy number variation: Population genomics
The genomes of two humans differ more in a structural sense than at the nucleotide level; a recent paper estimates that on average two of us differ by
• ~4 to 24 Mb of genetic material due to Copy Number Variation
• ~2.5 Mb due to Single Nucleotide Polymorphisms

Abundance of CNVs in the human population?
Still an open question, but probably thousands, at low allelic frequency (<20%).

Abundance of deletion CNVs in the human population
Comparison of overlapping CNVs identified by Conrad et al. (2006) and McCarroll et al. (2006). Freeman et al. Genome Res 2006

Non-allelic homologous recombination events between low-copy repeats (LCR-NAHR)
Lupski & Inoue, TIG 2002
Duplications and deletions of LCRs mediated by NAHR
• LCRs in direct orientation
• LCRs in inverted orientation: inversions
Intrachromatid recombination between LCRs
• LCRs in direct orientation: deletion
• LCRs in inverted orientation: inversion

Mechanisms generating genomic deletions

Copy number variation: Relations to human disease
Responsible for a number of rare genetic conditions, for example Down syndrome (trisomy 21) and Cri du chat syndrome (a partial deletion of 5p).
Implicated in complex diseases. For example, CCL3L1 copy number is linked to HIV/AIDS susceptibility, and some sporadic (non-inherited) CN variants are strongly associated with autism. Tumors typically have a lot of chromosomal abnormalities, including recurrent CN changes.

Evolutionary and medical implications of CNVs: CCL3L1 as an example
Gonzalez et al., Science, 2005
When CCL3L1 occupies the CCR5 receptor on CD4 cells, it blocks HIV's entry.

Copy-number variation of CCL3L1 within and among human and chimp populations
Gonzalez et al., Science, 2005

CCL3L1 and HIV Infection
Individuals with a high CCL3L1 gene copy number relative to their population average are more resistant to HIV infection than those with a low copy number, presumably because there is more ligand to compete with HIV during binding to CCR5.
Gonzalez et al., Science, 2005

Trisomy 21
[Figure not recovered]

Partial deletion of chr 5p
[Figure not recovered]

A cytogeneticist’s story
“The story is about diagnosis of a 3 month old baby with macrocephaly and some heart problems. The doctors questioned a couple of syndromes which we tested for and found negative. Rather than continue this ‘shot in the dark’ approach, we put the case on an array and found a 2 Mb deletion which notably deletes the gene NSD1 on chr 5, mutations in which are known to cause Sotos syndrome. This is an overgrowth syndrome and fits with the macrocephaly. The bottom line is that we are able to diagnose quicker by this approach and delineate exactly the underlying genetic change.”
[Figure: chromosome 5, showing the 2 Mb deletion]

Many tumors have gross CN changes
A lung cancer cell line vs matched normal lymphoblast, from Nannya et al Cancer Res 2005;65:6071-6079

Research into gonad dysfunction: Human sex reversal
• 20% of 46,XY females have mutations in SRY
• 80% of 46,XY females unexplained!
• 90% of 46,XX males due to translocation of SRY
• 10% of 46,XX males unexplained!
Suggests that loss-of-function and gain-of-function mutations in other genes may cause sex reversal. We’re looking at shared deletions.

Affymetrix SNP chip terminology
Genomic DNA: TAGCCATCGGTA [A/G] GTACTCAATGAT (SNP with alleles A and G)
Perfect Match probe for Allele A: ATCGGTAGCCATTCATGAGTTACTA
Perfect Match probe for Allele B: ATCGGTAGCCATCCATGAGTTACTA
Genotyping: answering the question about the two copies of the chromosome on which the SNP is located: is a sample AA (AA), AB (AG) or BB (GG) at this SNP?

Affymetrix GeneChip
• Chip size: 1.28 cm × 1.28 cm
• Feature size: 5 µm × 5 µm
• 6.4 million features / chip
• > 1 million identical 25 bp probes / feature

GeneChip Mapping Assay Overview
250 ng genomic DNA → RE digestion (Xba) → adaptor ligation → PCR: one-primer amplification (complexity reduction) → fragmentation and labeling → hyb & wash → genotype calls (AA, BB, AB)

Principal low-level analysis steps
• Background adjustment and normalization at probe level
  – These steps are to remove lab/operator/reagent effects
• Combining probe level summaries to probe set level summary: best done robustly, on many chips at once
  – This is to remove probe affinity effects and discordant observations (gross errors, non-responding probes, etc.)
• Possibly further rounds of normalization (probe set level), as lab/cohort/batch/other effects are frequently still visible
• Derive the relevant copy-number quantities
Finally, quality assessment is an important low-level task.

Preprocessing for total CN using SNP probe pairs (250K chip)
[Figure: probe-pair signals with genotype groups TT, AT and AA]
Modification by H Bengtsson of a method due to A Wirapati, developed some years ago for microsatellite genotyping; similar to the approach used by Illumina.

Background adjustment and normalization
Outcome similar to that achieved by quantile normalization.

Low-level analysis problems remain unsolved; why?
• The feature size keeps shrinking, and so the number of features per chip keeps growing;
• Fewer and fewer features are used for a given measurement, allowing more measurements to be made using a single chip
These considerations all place more and more demands on the low-level analysis: to maintain the quality of existing measurements, and to obtain good new ones.

SNP probes can be used to estimate total copy numbers
• AA: PM = PMA + PMB = 2c
• BB: PM = PMA + PMB = 2c
• AB: PM = PMA + PMB = 2c
• AAB: PM = PMA + PMB = 3c

SNP probe tiling strategy: central probe quartet
SNP (A/G) at position 0: TAGCCATCGGTA N GTACTCAATGAT
• PM 0 Allele A: ATCGGTAGCCAT T CATGAGTTACTA
• MM 0 Allele A: ATCGGTAGCCAT A CATGAGTTACTA
• PM 0 Allele B: ATCGGTAGCCAT C CATGAGTTACTA
• MM 0 Allele B: ATCGGTAGCCAT G CATGAGTTACTA

SNP probe tiling strategy: +4 offset probe quartet
SNP (A/G), +4 position: TAGCCATCGGTA N GTA C TCAATGATCAGCT
• PM +4 Allele A: GTAGCCAT T CAT G AGTTACTAGTCG
• MM +4 Allele A: GTAGCCAT T CAT C AGTTACTAGTCG
• PM +4 Allele B: GTAGCCAT C CAT G AGTTACTAGTCG
• MM +4 Allele B: GTAGCCAT C CAT C AGTTACTAGTCG

SNP for Identifying Copy Number Variations
• Using SNP chips to identify change in total copy number (i.e. CN ≠ 2)
• Outline a new method (CRMA)
• Evaluate and compare it with other methods
• Make some closing remarks on further issues

Copy-number estimation using Robust Multichip Analysis (CRMA)
• Preprocessing (probe signals): allelic crosstalk (or quantile)
• Total CN: PM = PMA + PMB
• Summarization (SNP signals): log-additive, PM only
• Post-processing: fragment-length (GC-content)
• Raw total CNs: $M_{ij} = \log_2(\theta_{ij}/\theta_{Rj})$, where R = reference, chip i, probe j
A few details are passed over. Ask me later if you care about them.
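The last step of the pipeline above, the raw total copy-number log-ratio relative to a reference, can be sketched in a few lines. This is a minimal illustration, not the CRMA implementation: it skips probe-level summarization and treats the summed signal PM = PMA + PMB as the summarized signal directly, it assumes the reference is the per-locus median across all chips, and the function name and toy numbers are hypothetical.

```python
import numpy as np

def raw_total_cn_log_ratios(theta, reference_rows=None):
    """Return M[i, j] = log2(theta[i, j] / theta_R[j]).

    theta: array of shape (chips, loci) holding positive total-CN signals.
    reference_rows: optional indices of the chips that form the reference;
    by default all chips are used (an assumption of this sketch).
    """
    theta = np.asarray(theta, dtype=float)
    ref = theta if reference_rows is None else theta[reference_rows]
    theta_ref = np.median(ref, axis=0)  # assumed reference: per-locus median
    return np.log2(theta / theta_ref)

# Hypothetical allele-specific signals for 3 chips x 4 loci.
pm_a = np.array([[100.0, 400.0, 150.0,   0.0],
                 [200.0,   0.0, 150.0,  50.0],
                 [150.0, 200.0, 300.0, 100.0]])
pm_b = np.array([[100.0,   0.0, 150.0, 100.0],
                 [  0.0, 400.0, 150.0,  50.0],
                 [150.0, 200.0,   0.0,   0.0]])

theta = pm_a + pm_b            # total signal: PM = PMA + PMB
M = raw_total_cn_log_ratios(theta)
print(np.round(M, 3))          # chip 3, locus 1 stands out as a gain
```

A log-ratio near 0 corresponds to the reference copy number; in this toy data only the first locus of the third chip deviates, with a ratio of 3/2.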
Crosstalk between alleles adds significant artifacts to signals
Cross-hybridization:
Allele A: TCGGTAAGTACTC
Allele B: TCGGTATGTACTC
• AA: PMA >> PMB
• AB: PMA ≈ PMB
• BB: PMA << PMB

There are six possible allele pairs
• Nucleotides: {A, C, G, T}
• Ordered pairs:
  – (A,C), (A,G), (A,T), (C,G), (C,T), (G,T)
• Because different nucleotides bind differently, the crosstalk from A to C might be very different from that from A to T.

Crosstalk between alleles is easy to spot
[Figure: PMB plotted against PMA for probe pairs (PMA, PMB) with nucleotide pair (A,T), data from one array; the AA, AB and BB groups are visible, shifted from the origin by an offset]

Crosstalk between alleles can be estimated and corrected for
What is done:
• Offset is removed from SNPs and CN units.
• Crosstalk is removed from SNPs.
[Figure: the same PMB against PMA plot after correction, with no offset]

Copy-number estimation using Robust Multichip Analysis (CRMA)
• Preprocessing (probe signals): allelic crosstalk (or quantile). Already briefly described.
• Total CN: PM = PMA + PMB. That’s it!
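The offset-and-crosstalk correction described above can be illustrated with a small sketch. It assumes an affine model in which the observed pair equals an offset plus a 2×2 crosstalk matrix applied to the true allele signals, and it assumes the offset and matrix are already known for one nucleotide pair; estimating them from the data is not shown. The function name and all numbers are hypothetical.

```python
import numpy as np

def correct_crosstalk(pairs, offset, crosstalk):
    """Undo an assumed affine crosstalk model for one nucleotide pair.

    Assumed model (per probe pair): observed = offset + crosstalk @ true,
    where observed and true are (PMA, PMB) column vectors.
    pairs: array of shape (n, 2) with observed (PMA, PMB) rows.
    """
    pairs = np.asarray(pairs, dtype=float)
    w_inv = np.linalg.inv(np.asarray(crosstalk, dtype=float))
    # Row-vector form of true = W^-1 (observed - offset).
    return (pairs - np.asarray(offset, dtype=float)) @ w_inv.T

# Hypothetical parameters for the (A,T) nucleotide pair.
offset = np.array([100.0, 120.0])
crosstalk = np.array([[1.0, 0.2],    # 20% of the B signal leaks into PMA
                      [0.3, 1.0]])   # 30% of the A signal leaks into PMB

# Simulated true signals for AA, AB and BB samples, then the observed pairs.
true_signal = np.array([[1000.0, 0.0], [500.0, 500.0], [0.0, 1000.0]])
observed = offset + true_signal @ crosstalk.T

corrected = correct_crosstalk(observed, offset, crosstalk)
print(np.round(corrected, 6))
```

Since crosstalk differs between nucleotide pairs, a separate offset and matrix would be needed for each of the six pairs.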
Copy-number estimation using Robust Multichip Analysis (CRMA), continued
• Summarization (SNP signals): log-additive, PM-only. The model is $\log_2(PM_{ijk}) = \log_2\theta_{ij} + \log_2\phi_{jk} + \varepsilon_{ijk}$, fit using rlm.
• Post-processing: fragment-length (GC-content). Longer fragments get less well amplified by PCR and so give weaker SNP signals. [Figures: 100K and 500K chips]
• Raw total CNs: $M_{ij} = \log_2(\theta_{ij}/\theta_{Rj})$. Care required with the number and nature of Reference samples used.

Comparison of 4 methods
• CRMA
  – Preprocessing (probe signals): allelic crosstalk (quantile)
  – Total CN: PM = PMA + PMB
  – Summarization (SNP signals): log-additive, PM-only
  – Post-processing: fragment-length (GC-content)
  – Raw total CNs: $M_{ij} = \log_2(\theta_{ij}/\theta_{Rj})$
• dChip (Li & Wong 2001)
  – Preprocessing (probe signals): quantile
  – Total CN: PM = PMA + PMB, MM = MMA + MMB
  – Summarization (SNP signals): multiplicative, PM-MM
  – Raw total CNs: $M_{ij} = \log_2(\theta_{ij}/\theta_{Rj})$
• CNAG* (Nannya et al 2005)
  – Preprocessing (probe signals): scale
  – Total CN: PM = PMA + PMB
  – Post-processing: fragment-length (GC-content)
  – Raw total CNs: $M_{ij} = \log_2(\theta_{ij}/\theta_{Rj})$
• CNAT v4 (Affymetrix 2006)
  – Preprocessing (probe signals): quantile
  – Summarization (SNP signals): “log-additive”, PM-only
  – Total CN: $\theta = \theta_A + \theta_B$
  – Post-processing: fragment-length (GC-content)
  – Raw total CNs: $M_{ij} = \log_2(\theta_{ij}/\theta_{Rj})$

Further bioinformatic issues
• Estimating copy number: needs calibration data
• Segmentation (of chromosomes into constant copy number regions): an HMM-like algorithm
• Analyzing family CN data: a different HMM
• Incorporating non-polymorphic probes: independent HMM observations to be weighted and combined
• Dealing with mixed normal-abnormal samples
• Utilizing poor quality DNA samples
• Estimating allele-specific copy number
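The segmentation item above mentions an HMM-like algorithm without giving details. As a rough illustration only, the sketch below runs a generic three-state Gaussian HMM (loss, normal, gain) with Viterbi decoding over raw log-ratios ordered along a chromosome. The state means, standard deviation, self-transition probability and the function name are all hypothetical choices for this example, not values taken from the lecture.

```python
import numpy as np

def viterbi_segment(m, means=(-1.0, 0.0, 0.58), sd=0.25, stay=0.99):
    """Most likely state path (0 = loss, 1 = normal, 2 = gain) for log-ratios m."""
    m = np.asarray(m, dtype=float)
    means = np.asarray(means, dtype=float)
    k = len(means)
    # Gaussian log-likelihoods; the shared constant is dropped.
    log_emit = -0.5 * ((m[:, None] - means[None, :]) / sd) ** 2
    # Sticky transitions: stay with probability `stay`, else switch uniformly.
    log_trans = np.full((k, k), np.log((1.0 - stay) / (k - 1)))
    np.fill_diagonal(log_trans, np.log(stay))

    score = np.log(np.full(k, 1.0 / k)) + log_emit[0]
    back = np.zeros((len(m), k), dtype=int)
    for t in range(1, len(m)):
        cand = score[:, None] + log_trans       # cand[from_state, to_state]
        back[t] = np.argmax(cand, axis=0)       # best predecessor per state
        score = cand[back[t], np.arange(k)] + log_emit[t]

    # Trace the best path backwards from the best final state.
    path = np.empty(len(m), dtype=int)
    path[-1] = int(np.argmax(score))
    for t in range(len(m) - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# Hypothetical log-ratios along one chromosome: normal, loss, normal, gain.
log_ratios = [0.05, -0.02, 0.01, -0.9, -1.1, -1.0, -0.95,
              0.0, 0.03, -0.04, 0.02, 0.6, 0.55, 0.62, 0.58]
states = viterbi_segment(log_ratios)
print(states.tolist())
```

Runs of equal states in the output correspond to segments of constant copy number. The state means follow log2(1/2) = -1 for a single-copy loss and log2(3/2) ≈ 0.58 for a single-copy gain, under the assumption of a diploid reference.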