Supplementary Information (doc 194K)

SUPPLEMENTARY METHODS

CNV Calling from Affymetrix Axiom Array Data

Released in early 2011, the Affymetrix Axiom arrays used for EMA were designed and used for SNP genotyping [1]. The available CNV calling tools such as PennCNV [2], QuantiSNP [3] or Birdsuite [4] were therefore not directly applicable to our EMA dataset, so we developed our own methodology to translate SNP data into CNV calls. EMA CNV calls are available as an online file.

First, quantile normalization (QN) transforms the intensities of each array so that all arrays share the same distribution. This was performed using “apt-probeset-genotype” in Affymetrix Power Tools. Second, to estimate integer copy number, we use the COPy number Polymorphism Evaluation Routine (COPPER), which we previously developed to call CNVs from SNP array data [5]. In brief, COPPER uses genotype calls to set the median for 0, 1, and 2 copies of each allele and transforms the intensity axes accordingly to derive scaled intensity estimates. We implemented this procedure in R (http://www.R-project.org/).

After QN and COPPER, we sum the intensities on each axis, which gives a total copy number estimate at each locus for each individual. To take into account the non-independence of consecutive SNPs, we implemented a 5-SNP median: the copy number (CN) of a SNP is estimated by the median of its own CN and the CNs of the two flanking SNPs on each side.

We then apply a hidden Markov model (HMM) to make the CNV calls, similar to QuantiSNP [3], which we implemented in Python (Python Software Foundation, Python Language Reference, version 2.6.5; available at http://www.python.org). To define the HMM transition matrix, we first defined an a priori probability that a genetic event occurs between two consecutive SNPs separated by a distance d. This a priori probability is extended from QuantiSNP to use as many prior probabilities as there are possible states.
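
The normalization and smoothing steps described above can be illustrated with a short Python sketch. This is a generic illustration, not the apt-probeset-genotype or COPPER code: the function names `quantile_normalize` and `five_snp_median`, the handling of ties by sort order, and the shortened window at the ends of a chromosome are all assumptions made for the example.

```python
from statistics import median


def quantile_normalize(arrays):
    """Give every array the same distribution (the mean of the sorted arrays).

    `arrays` is a list of equal-length lists of intensities, one per array.
    Ties are broken by sort order, which is an assumption of this sketch.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean intensity at each rank across arrays.
    reference = [sum(s[r] for s in sorted_arrays) / len(arrays) for r in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


def five_snp_median(cn, flank=2):
    """Median of each SNP's CN and `flank` SNPs on each side.

    The window is simply truncated at the ends of the list (assumption).
    """
    n = len(cn)
    return [median(cn[max(0, i - flank):min(n, i + flank + 1)]) for i in range(n)]


if __name__ == "__main__":
    print(quantile_normalize([[1.0, 3.0, 2.0], [6.0, 4.0, 5.0]]))
    print(five_snp_median([2, 2, 9, 2, 2, 2, 2]))
```

In the second call, the isolated value of 9 is replaced by the median of its neighbourhood, which is the intended effect of smoothing over single-SNP noise.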
There are three possible states: deletion, normal region and duplication. The corresponding prior probabilities are given below; note that these three equations are extensions of equation 1 from QuantiSNP [3]:

\[
\rho_{Del} = \frac{1}{2}\left(1 - \exp\left(-\frac{d}{2 L_{Del}}\right)\right)
\]
\[
\rho_{Norm} = \frac{1}{2}\left(1 - \exp\left(-\frac{d}{2 L_{Norm}}\right)\right)
\]
\[
\rho_{Dupl} = \frac{1}{2}\left(1 - \exp\left(-\frac{d}{2 L_{Dupl}}\right)\right)
\]

$L_{Del}$, $L_{Norm}$ and $L_{Dupl}$ are, respectively, the average length of a deletion region, a normal region and a duplication region. $L_{Del}$ and $L_{Dupl}$ are based on information found in the Database of Genomic Variants (http://projects.tcag.ca/variation/), and we estimated them to be, respectively, 36,647 and 39,633. We arbitrarily set $L_{Norm}$ to 400,000 so that it is much greater than $L_{Del}$ and $L_{Dupl}$. Note that the estimates of $L_{Del}$ and $L_{Dupl}$ are similar to each other, and that both are much smaller than $L_{Norm}$, in order to reduce false-positive transitions to the extent possible.

We then define the transition matrix of hidden states between two consecutive SNPs i and i+1 (rows and columns ordered deletion, normal, duplication):

\[
\begin{pmatrix}
1 - \rho_{Del} & 1.5\,\frac{\rho_{Del}}{N_s - 1} & 0.5\,\frac{\rho_{Del}}{N_s - 1} \\
1.33\,\frac{\rho_{Norm}}{N_s - 1} & 1 - \rho_{Norm} & 0.67\,\frac{\rho_{Norm}}{N_s - 1} \\
0.5\,\frac{\rho_{Dupl}}{N_s - 1} & 1.5\,\frac{\rho_{Dupl}}{N_s - 1} & 1 - \rho_{Dupl}
\end{pmatrix}
\]

with $\rho_{Del}$, $\rho_{Norm}$ and $\rho_{Dupl}$ as the probability (previously defined) that there is a genetic event between SNPs i and i+1, and $N_s$ as the total number of states, three in our case.

Individual Quality Control

Several iterations of the normalization steps were run to perform QC on individual sample data. We first ran QN and COPPER on all array data (N = 1705: 854 mothers, 851 neonates), and identified outlier individuals with at least 15 SNPs (on a given chromosome) with a total copy number estimate greater than 10. In this step, 168 individuals were excluded (69 mothers, 99 neonates). We re-ran QN and COPPER on the remaining data (N = 1537: 785 mothers, 752 neonates). The second step of quality control consisted of manually assessing genome-wide variability in each individual’s copy number graph and excluding the individuals with excess variability.
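
Returning to the HMM transition matrix defined above, the following Python sketch shows how its entries could be computed for a given inter-SNP distance. It assumes the priors take the QuantiSNP form, rho = 0.5 * (1 - exp(-d / (2L))), that d is expressed in the same units as the average region lengths, and that rows and columns are ordered deletion, normal, duplication. The names `prior` and `transition_matrix` are hypothetical; this is not the authors' implementation.

```python
import math

# Average region lengths quoted in the text; the unit is assumed to match d.
MEAN_LENGTH = {"del": 36647.0, "norm": 400000.0, "dupl": 39633.0}
N_STATES = 3


def prior(d, mean_length):
    """Prior probability of a genetic event between two SNPs d apart."""
    return 0.5 * (1.0 - math.exp(-d / (2.0 * mean_length)))


def transition_matrix(d):
    """3x3 transition matrix; row = current state, column = next state.

    State order (deletion, normal, duplication) is an assumption of this sketch.
    """
    rho = {state: prior(d, length) for state, length in MEAN_LENGTH.items()}
    k = N_STATES - 1
    return [
        [1.0 - rho["del"], 1.5 * rho["del"] / k, 0.5 * rho["del"] / k],
        [1.33 * rho["norm"] / k, 1.0 - rho["norm"], 0.67 * rho["norm"] / k],
        [0.5 * rho["dupl"] / k, 1.5 * rho["dupl"] / k, 1.0 - rho["dupl"]],
    ]


if __name__ == "__main__":
    for row in transition_matrix(5000.0):
        print([round(p, 5) for p in row], "row sum =", round(sum(row), 12))
```

Because the off-diagonal weights in each row sum to N_s - 1 = 2, every row sums to 1 for any distance d, and at d = 0 the matrix reduces to the identity (no state change between coincident positions).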
Typical individuals have variance ~1, so we excluded individuals with variance across entire chromosomes >>1. Our final run of QN and COPPER additionally excluded individuals identified in the SNP analysis as contaminated or mis-identified, resulting in our final sample (N = 1389: 707 mothers, 682 neonates) [6]. After these steps of quality control at the individual array level, we retained 682 neonates (336 cases, 339 controls) and 707 mothers (351 cases, 349 controls) of good quality.

CNV quality control

After the HMM CNV calling, we smooth over small gaps: if the gap between two consecutive CNVs is less than 20 SNPs, they are called as one CNV. During our QC process, we identified a systematic difficulty in calling CNVs specific to chromosome 19. We confirmed that the reagents (Kit A, Affymetrix) used at the time of this experiment are known to generate gaps in the whole genome amplification step, which likely explains the failure of this chromosome during QC, since chromosome 19 has a disproportionate number of gaps (personal communication, Affymetrix). Thus, CNVs on chromosome 19 in the EMA samples are excluded from all analyses.

After a closer look at the CNV-length distributions, we estimated that we could not effectively detect deletions smaller than 25 SNPs or duplications smaller than 35 SNPs. Using these thresholds, the number of putative deletions was 128 for mothers and 1,259 for neonates, and the number of duplications was 864 for mothers and 4,552 for neonates. The comparatively large number of CNVs in neonates is likely explained by the use of archived blood spot DNA. However, by visual inspection (below), we eliminated most of the noisy data, resulting in high confidence calls in both maternal and neonatal samples. After applying both the SNP-count and CNV-size cutoffs, we manually inspected CNVs (blinded to affection status), resulting in 104 high confidence CNV regions (66 deletions, 38 duplications).
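
The gap smoothing and minimum-size rules above can be sketched in Python as follows. The representation of a call as a `(start, end, kind)` tuple of inclusive SNP indices on one chromosome, the requirement that merged calls be of the same type, and the function names are assumptions made for illustration; the text does not specify these details.

```python
MAX_GAP_SNPS = 20                  # gaps shorter than this are smoothed over
MIN_SNPS = {"del": 25, "dup": 35}  # smallest CNV sizes considered detectable


def merge_calls(calls, max_gap=MAX_GAP_SNPS):
    """Merge consecutive calls separated by fewer than `max_gap` SNPs.

    Only calls of the same kind are merged (assumption of this sketch).
    """
    merged = []
    for start, end, kind in sorted(calls):
        if merged:
            prev_start, prev_end, prev_kind = merged[-1]
            gap = start - prev_end - 1  # number of SNPs between the two calls
            if kind == prev_kind and gap < max_gap:
                merged[-1] = (prev_start, max(prev_end, end), prev_kind)
                continue
        merged.append((start, end, kind))
    return merged


def filter_calls(calls, min_snps=MIN_SNPS):
    """Keep calls spanning at least the minimum number of SNPs for their kind."""
    return [c for c in calls if c[1] - c[0] + 1 >= min_snps[c[2]]]


if __name__ == "__main__":
    raw = [(100, 110, "del"), (120, 140, "del"), (200, 220, "dup"), (500, 540, "dup")]
    print(filter_calls(merge_calls(raw)))
```

In this hypothetical example the two deletions are 9 SNPs apart and are merged into one 41-SNP call, while the 21-SNP duplication falls below the 35-SNP threshold and is dropped.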
Inheritance checks revealed an additional 8 deletions in mothers and 7 in neonates and an additional 5 duplications in mothers and 6 in neonates, suggesting that automated calling may have a 10-20% false negative rate, some of which we can correct by utilizing family structure. Close to 50% of maternal CNVs are transmitted to offspring, as expected under Mendelian inheritance, suggesting that we have a low false positive rate. In total, we identified 50 CNVs in mothers (32 deletions, 18 duplications) and 79 CNVs in the neonates (48 deletions, 31 duplications). However, the numbers of high confidence CNVs detected in maternal and neonatal samples are not directly comparable, because we visually inspected many more regions in neonatal samples and had the opportunity to identify additional CNVs we might have missed by automated calling in maternal samples.

Supplementary Figure 1: QQ plots for autism parent samples

The quantile-quantile (QQ) plot is shown with the expected -log10(P) distribution under the null hypothesis on the x-axis and the observed -log10(P) distribution on the y-axis. The P-values plotted correspond to an association test between mothers and fathers for the AGP (orange), AGRE (green), and SSC (purple) across genome-wide SNP data. The genomic control (λGC) values are shown, and the black line corresponds to expectations under the null hypothesis of no association (and no stratification).

Supplementary Table 1: Autism replication datasets

Dataset                       | AGRE                     | SSC                        | AGP
Publication                   | Glessner et al. 2009 [7] | Sanders et al. 2011 [8]    | Pinto et al. 2010 [9]
Number of mother/father pairs | 943                      | 1124                       | 1366
Family type                   | Multiplex                | Simplex                    | Multiplex and Simplex
Platform                      | Illumina 550k            | Illumina 1M and 1M duo     | Illumina 1M
Data source                   | .cel files               | Supplementary Table        | dbGAP
CNV calling                   | PennCNV                  | PennCNV, QuantiSNP, GNOSIS | iPattern, QuantiSNP
Smoothing                     | 20 SNPs                  | 20 SNPs                    | 20 SNPs
Deletion min SNPs             | 25                       | 25                         | 25
Deletion min kb               | 30                       | 30                         | 30
Duplication min SNPs          | 35                       | 35                         | 35
Duplication min kb            | 30                       | 30                         | 30
Frequency threshold           | 1%                       | 1%                         | 1%

The number of pairs of mothers and fathers compared per autism replication dataset is shown along with the family type(s) represented in each dataset. The microarray platform used to genotype the individuals is shown, as are the methods used for CNV calling. The quality control measures of combining nearby CNV calls (smoothing), minimum SNP and length criteria, and maximum frequency threshold are shown.

Supplementary Table 2: Control replication datasets

Dataset                 | HapMap                  | 1KGP
Publication             | Conrad et al. 2010 [10] | 1000 Genomes Project Consortium 2012 [11]
Number of females/males | 106/106                 | 600/551*
Platform                | Illumina Infinium       | Sequencing
Data source             | Supplementary Table     | DGV
CNV calling             | GADA [35]               | CNVnator [36]
Smoothing               | 20 SNPs                 | 20 kb
Deletion min SNPs       | 25                      | NA
Deletion min kb         | 30                      | 30
Duplication min SNPs    | 35                      | NA
Duplication min kb      | 30                      | NA
Frequency threshold     | 1%                      | 1%

The number of females and males compared per control replication dataset is shown. The microarray platform used to genotype the individuals is shown, as are the methods used for CNV calling and CNV thresholds. The quality control measures of combining nearby CNV calls (smoothing), minimum SNP and length criteria, and maximum frequency threshold are shown. *77/58 females/males in 1KGP are also in the HapMap sample (although detection of deletions is independent).

Supplementary Table 3: SSC sex-matched proband vs. sibling comparison

SSC, males                         | All        | Deletion   | Duplication
Number of CNVs in probands (N=967) | 1644       | 886        | 758
Number of CNVs in siblings (N=403) | 634        | 330        | 304
Test of equality (Wilcoxon-test)   | 0.11       | 0.07       | 0.80
CNV-Length distribution            | 0.33       | 0.69       | 0.41
Gene-content                       | 0.13       | 0.35       | 0.20

SSC, females                       | All        | Deletion   | Duplication
Number of CNVs in probands (N=157) | 311        | 162        | 149
Number of CNVs in siblings (N=469) | 801        | 460        | 341
Test of equality (Wilcoxon-test)   | 0.01       | 0.11       | 1.3 x 10^-4
CNV-Length distribution            | 0.69       | 0.21       | 0.76
Gene-content                       | 1.8 x 10^-2 | 2.7 x 10^-2 | 0.48

The number of CNVs in probands and siblings in the SSC dataset is shown, along with the P-value of the Wilcoxon-test comparing the number of CNVs in sex-matched probands and siblings. The P-value of the test of equality of distributions (Kolmogorov-Smirnov) is shown to compare length of CNVs. The P-value of the Wilcoxon-test for gene content is shown. P-values < 0.05 are bolded.

Supplementary Table 4: SSC proband and sibling sex comparisons

SSC, probands                     | All         | Deletion    | Duplication
Number of CNVs in females (N=157) | 311         | 162         | 149
Number of CNVs in males (N=967)   | 1644        | 886         | 758
Test of equality (Wilcoxon-test)  | 7.1 x 10^-3 | 1.6 x 10^-3 | 1.2 x 10^-4
CNV-Length distribution           | 0.48        | 0.29        | 0.65
Gene-content                      | 0.09        | 0.29        | 0.27

SSC, siblings                     | All         | Deletion    | Duplication
Number of CNVs in females (N=469) | 801         | 460         | 341
Number of CNVs in males (N=403)   | 634         | 330         | 304
Test of equality (Wilcoxon-test)  | 0.27        | 6.2 x 10^-4 | 0.78
CNV-Length distribution           | 0.13        | 0.58        | 0.73
Gene-content                      | 0.75        | 0.46        | 0.13

The number of CNVs in probands and siblings in the SSC dataset is shown, along with the P-value of the Wilcoxon-test comparing the number of CNVs in males and females within probands and within siblings. The P-value of the test of equality of distributions (Kolmogorov-Smirnov) is shown to compare length of CNVs. The P-value of the Wilcoxon-test for gene content is shown. P-values < 0.05 are bolded.
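
Supplementary Table 5, which follows, reports female expected counts that were calculated from the male counts and the observed total number of female CNVs. The Python sketch below illustrates that scaling for the SSC columns and computes a chi-square goodness-of-fit P-value. The choice of 3 degrees of freedom (four categories minus one) and the function names are assumptions; the source does not state how the test was configured.

```python
import math

# SSC "all CNVs" columns of Supplementary Table 5: individuals with 1, 2, 3, >3 CNVs.
MALE_OBSERVED = [345, 592, 435, 417]
FEMALE_OBSERVED = [341, 554, 519, 538]


def expected_from_reference(reference, target_total):
    """Scale the reference (male) counts so that they sum to the target (female) total."""
    scale = target_total / sum(reference)
    return [count * scale for count in reference]


def chi_square_statistic(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_pvalue_df3(stat):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return (math.erfc(math.sqrt(stat / 2.0))
            + math.sqrt(2.0 * stat / math.pi) * math.exp(-stat / 2.0))


female_expected = expected_from_reference(MALE_OBSERVED, sum(FEMALE_OBSERVED))
statistic = chi_square_statistic(FEMALE_OBSERVED, female_expected)
p_value = chi_square_pvalue_df3(statistic)

if __name__ == "__main__":
    print([round(e) for e in female_expected])
    print(statistic, p_value)
```

Rounded to integers, the expected counts are 376, 646, 475 and 455, matching the SSC "female expected" column, and under the stated assumptions the P-value is close to the reported 8.6 x 10^-8. The same calculation cannot be applied directly to the HapMap columns, where one expected count is zero.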
Supplementary Table 5: Excess CNVs in high-burden females

Category             | SSC all: male observed | SSC all: female observed | SSC all: female expected | HapMap all: male observed | HapMap all: female observed | HapMap all: female expected | 1KGP deletions: male observed | 1KGP deletions: female observed | 1KGP deletions: female expected
1 CNV                | 345  | 341  | 376  | 21 | 27 | 60 | 112 | 110 | 149
2 CNVs               | 592  | 554  | 646  | 8  | 24 | 23 | 142 | 184 | 189
3 CNVs               | 435  | 519  | 475  | 3  | 12 | 9  | 165 | 201 | 220
>3 CNVs              | 417  | 538  | 455  | 0  | 29 | 0  | 255 | 402 | 339
Total number of CNVs | 1789 | 1952 | 1952 | 32 | 92 | 92 | 674 | 897 | 897

Chi-square P-value: SSC all, 8.6 x 10^-8; HapMap all, 3.0 x 10^-4; 1KGP deletions, 3.2 x 10^-5.

For each dataset with a significant difference in all CNVs (SSC and HapMap) or deletions (1KGP), the number of CNVs observed in males with only a single CNV, 2 CNVs, 3 CNVs and more than 3 CNVs is shown, along with the number of CNVs observed in females in the same categories. Female expected counts were calculated based on the male counts and the observed total number of female CNVs.

Supplementary Table 6: EMA proband maternally-inherited and non-maternal CNVs

                                                             | All  | Deletion | Duplication
Maternally-inherited CNVs in cases (N complete pairs=293)    | 22   | 15       | 7
Maternally-inherited CNVs in controls (N complete pairs=299) | 9    | 5        | 4
Test of equality (Wilcoxon-test)                             | 0.01 | 0.02     | 0.34
CNV-Length distribution                                      | 0.82 | 0.77     | 0.66
Gene-content                                                 | 0.35 | 1.00     | 0.07
Non-maternal CNVs in cases (N complete pairs=293)            | 12   | 7        | 5
Non-maternal CNVs in controls (N complete pairs=299)         | 20   | 9        | 11
Test of equality (Wilcoxon-test)                             | 0.85 | 0.82     | 0.78
CNV-Length distribution                                      | 0.51 | 0.50     | 0.82
Gene-content                                                 | 0.22 | 0.20     | 0.91

The number of CNVs in probands is shown, along with the P-value of the Wilcoxon-test comparing cases to controls. The same results are shown for CNVs inherited from mothers, and non-maternal CNVs (de novo and paternal). For some CNVs, transmission status could not be determined (see Supplementary Methods) because only one member of the pair passed QC. The P-value of the test of equality of distributions (Kolmogorov-Smirnov) is shown to compare length of CNVs.
The P-value of the Wilcoxon-test comparing gene content in cases and controls is shown. P-values < 0.05 are bolded.

Supplementary Table 7: EMA maternally-transmitted and non-transmitted CNVs

                                                                   | All  | Deletion | Duplication
Transmitted CNVs in mothers of cases (N complete pairs=293)        | 22   | 15       | 7
Transmitted CNVs in mothers of controls (N complete pairs=299)     | 9    | 5        | 4
Test of equality (Wilcoxon-test)                                   | 0.01 | 0.02     | 0.34
CNV-Length distribution                                            | 0.82 | 0.77     | 0.66
Gene-content                                                       | 0.35 | 1.00     | 0.07
Non-transmitted CNVs in mothers of cases (N complete pairs=293)    | 7    | 4        | 3
Non-transmitted CNVs in mothers of controls (N complete pairs=299) | 6    | 4        | 2
Test of equality (Wilcoxon-test)                                   | 0.75 | 0.98     | 0.64
CNV-Length distribution                                            | 0.07 | 0.21     | 0.60
Gene-content                                                       | 0.13 | 0.22     | 0.37

The number of CNVs in EMA mothers is shown, along with the P-value of the Wilcoxon-test comparing mothers of cases to mothers of controls. The same results are shown for CNVs transmitted to proband/control children, and non-transmitted CNVs. For some CNVs, transmission status could not be determined (see Supplementary Methods) because only one member of the pair passed QC. The P-value of the test of equality of distributions (Kolmogorov-Smirnov) is shown to compare length of CNVs. The P-value of the Wilcoxon-test for gene content is shown. P-values < 0.05 are bolded.

Supplementary References:

1. Hoffmann TJ et al. Next generation genome-wide association tool: design and coverage of a high-throughput European-optimized SNP array. Genomics 2011; 98: 79-89.
2. Wang K et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 2007; 17: 1665-74.
3. Colella S et al. QuantiSNP: an Objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res 2007; 35: 2013-25.
4. Korn J et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 2008; 40: 1253-60.
5. Weiss LA et al. Association between microdeletion and microduplication at 16p11.2 and autism. N Engl J Med 2008; 358: 667-75.
6. Tsang KM et al. A genome-wide survey of transgenerational genetic effects in autism. PLoS One 2013; 8: e76978.
7. Glessner JT et al. Autism genome-wide copy number variation reveals ubiquitin and neuronal genes. Nature 2009; 459: 569-73.
8. Sanders SJ et al. Multiple recurrent de novo CNVs, including duplications of the 7q11.23 Williams syndrome region, are strongly associated with autism. Neuron 2011; 70: 863-85.
9. Pinto D et al. Functional impact of global rare copy number variation in autism spectrum disorders. Nature 2010; 466: 368-72.
10. Conrad DF et al. Origins and functional impact of copy number variation in the human genome. Nature 2010; 464: 704-12.
11. 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature 2012; 491: 56-65.
