Copy Number Variation
Eleanor Feingold
University of Pittsburgh
March 2012

What do we mean by “copy number variation?”
  [Figure labels: GCTCATATATATTTG; kb - Mb (gene or gene region)]

Copy number variation in a gene or gene region
  [Figure panels:]
  “normal”
  duplication of one gene
  duplication of several genes
  duplication of part of a gene
  deletion

Classical copy number study types
  Cancer genetics
    What: Find chromosomal segments (usually large ones) that are duplicated and/or deleted in tumor cell lines.
    Why: Learn something about cancer biology, or implications for treatment and prognosis.
  Clinical pediatrics
    What: Detect inherited or de novo deletions in individuals.
    Why: “Diagnose” birth defects.

And now: Genetic association studies for CNVs
  1) Collect cases and controls.
  2) “Genotype” everyone at a CNV.
  3) Test genotype/phenotype association.
  [Figure: individuals labeled with their copy number, and a cases/controls table with copy-number columns 0, 1, 2+. Extracted cell values, assignment to cells unclear: 65, 133, 202, 81, 316.]

How do we assay copy number variation?

Generation 1 - Array CGH
  What
    Microarray of clones (e.g. BACs), usually on a glass slide.
    Competitive hybridization of test and reference samples.
    Measure fluorescence ratio clone by clone.
  Limitations
    Large clones.
    Sparse coverage.
    High noise due to the spotting process.

Generation 2 - SNP chips
  What
    High-throughput SNP genotyping platforms (e.g. Affymetrix, Illumina).
  Advantage
    Hundreds of thousands of points of info.
  Disadvantages
    Technology was never intended for measuring copy number.
    SNPs on the chip were selected, by design, to avoid CNV regions.

Generation 3 - SNP chips with CNV markers (Affy 6.0, Illumina 1M)
  Advantages
    SNPs in known CNV regions are now included.
  Illumina
    1M markers in 10K regions of various types and sizes.
    Also have “non-polymorphic SNPs” (SNs?).
  Affymetrix
    200K probes in 5K known large CNV regions.
    700K probes “evenly spaced along the genome.”

Generation 4 (Illumina 2.5M, 5M)
  Changes
    Got rid of the non-polymorphic markers.
    Special coverage of CNV regions???
  Are these better or worse for CNVs than the previous generation?
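Sketch for step 3 of the association-study design above (test genotype/phenotype association)
  One simple option, assuming individuals are grouped into copy-number classes 0, 1, 2+, is a Pearson chi-square test on the 2 x 3 cases/controls table. The slides do not say which test is intended, so this is only an illustration. The function name chi_square_2x3 and the counts in the example are made up; they are not the values from the slide.

```python
import math


def chi_square_2x3(cases, controls):
    """Pearson chi-square test of independence for a 2 x 3 table.

    cases, controls: counts of individuals with copy number 0, 1, 2+.
    Returns (statistic, p_value). A 2 x 3 table has 2 degrees of freedom,
    and for 2 degrees of freedom the chi-square tail probability is
    exactly exp(-x / 2), so no statistics library is needed.
    """
    if len(cases) != 3 or len(controls) != 3:
        raise ValueError("expected three copy-number classes per group")
    rows = [list(cases), list(controls)]
    row_totals = [sum(r) for r in rows]
    col_totals = [cases[j] + controls[j] for j in range(3)]
    n = sum(row_totals)
    if min(row_totals) == 0 or min(col_totals) == 0:
        raise ValueError("every row and column needs at least one individual")
    stat = 0.0
    for i in range(2):
        for j in range(3):
            expected = row_totals[i] * col_totals[j] / n
            stat += (rows[i][j] - expected) ** 2 / expected
    return stat, math.exp(-stat / 2.0)


# Made-up counts for illustration only (copy number 0, 1, 2+).
example_cases = [10, 40, 50]
example_controls = [20, 40, 40]
statistic, p_value = chi_square_2x3(example_cases, example_controls)
print(f"chi-square = {statistic:.3f}, p = {p_value:.3f}")
```

  A test like this treats the copy-number calls as exact. It does not account for calling error, or for intensity differences between cases and controls that come from sample handling.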
What data do these technologies give us, and how do we use it?

Standard genotyping
  Genotype information is in the angle (relative intensity of the two alleles).
  [Figure: genotype clusters BB, AB, AA]
  Copy number information is in the distance from the origin (total intensity).

In theory
  [Figure: expected clusters AAA, AAB, ABB, BBB; AA, AB, BB; A, B; null]

But when you look at the data …
  [Figure: all SNPs on chromosome 21, total intensity for trisomic (Down Syndrome) vs. disomic samples. Cluster labels: AAA and AA; AAB; AB; ABB; BBB and BB.]

In theory vs. in practice
  In theory: AAA, AAB, ABB, BBB; AA, AB, BB; A, B; null
  In practice: A, null, B

So how are copy numbers called?
  Look for runs of SNPs that are high or low in intensity.
  Many available algorithms, e.g. HMM, CBS, change-point.

Basic picture
  [Figure from Komura et al., Genome Research, 2006]

More complex examples (cancer genetics)
  [Figures from Peiffer et al., Genome Research, 2006. Axes: total intensity; angle (genotype info) with AA, AB, BB. Panels show an amplification and two deletions.]
  Extra copy of whole chromosome
    Total intensity high over whole chromosome.
    3 genotype groups.
  No copy number change, but a region of homozygosity (LOH)

Basic picture
  [Figure from Wang et al., Genome Research, 2007: chromosome 9]

A few statistical issues to think about … (there’s still a lot to do)
  Many run-calling algorithms are oriented towards clinical applications.
    Many CNV detection algorithms are very conservative - aim for zero false positive rate.
    Most use normalization methods that assume a large reference population is not available.
    Many use models that make assumptions about what kinds of variation are likely (e.g. cancer).
  Family data should be modeled together.
    CNV “calls” will be much more accurate if you use the whole family, but the model you use should depend on whether you are expecting de novo mutations or not.
    For some diseases you’ll expect associations with de novo changes. For others you might expect inherited variants.
  How do we group CNVs for association testing?
    [Figure: CNVs in one region, labeled deletion, deletion, deletion, deletion, duplication]
  Separate methods for deletions?
    Deletions are easier to detect than other changes.
    Deletions are likely to have simpler biological effects.

The most important one …
  The technology is still NOT intended for reliably and comparably measuring total intensity!
  Total intensity numbers are very sensitive to DNA source, sample handling, etc., so extreme measures must be taken to ensure that cases and controls are comparable.
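
Recap sketch: angle, total intensity, and runs
  The two signals described under “Standard genotyping” (angle for genotype, distance from the origin for copy number) and the idea of looking for runs of SNPs that are high or low in intensity can be illustrated in a few lines. This is a crude threshold rule, not an HMM, CBS, or change-point method. The function names polar and find_runs, the 0.75 / 1.25 cutoffs, the minimum run length, and the use of the sample's own median as the reference are all assumptions made for illustration.

```python
import math
from statistics import median


def polar(a, b):
    """Convert allele intensities (a, b) to (angle, total).

    angle is scaled to [0, 1]: 0 means pure A signal, 1 means pure B signal.
    total is the distance from the origin (total intensity).
    """
    angle = math.atan2(b, a) / (math.pi / 2)
    total = math.hypot(a, b)
    return angle, total


def find_runs(totals, low=0.75, high=1.25, min_len=3):
    """Flag runs of consecutive SNPs with low or high total intensity.

    totals: positive total intensities, ordered along the chromosome.
    Each value is divided by the median of all values (an assumed
    reference) and labeled "loss" (< low), "gain" (> high), or "normal".
    Returns (first_index, last_index, label) for each run of at least
    min_len consecutive "loss" or "gain" SNPs.
    """
    if not totals:
        return []
    ref = median(totals)
    if ref <= 0:
        raise ValueError("intensities must be positive")
    states = []
    for t in totals:
        ratio = t / ref
        if ratio < low:
            states.append("loss")
        elif ratio > high:
            states.append("gain")
        else:
            states.append("normal")
    runs = []
    start = 0
    for i in range(1, len(states) + 1):
        # Close the current run at the end of the data or when the label changes.
        if i == len(states) or states[i] != states[start]:
            if states[start] != "normal" and i - start >= min_len:
                runs.append((start, i - 1, states[start]))
            start = i
    return runs


# Made-up intensities: a stretch of low values, then a stretch of high values.
example = [1.0] * 5 + [0.5] * 4 + [1.0] * 5 + [1.5] * 3 + [1.0] * 4
print(find_runs(example))
```

  Because total intensity is so sensitive to DNA source and sample handling, fixed cutoffs like these would not transfer between batches, and any real analysis would need normalization that keeps cases and controls comparable.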