Supplementary Information

SUPPLEMENTARY METHODS
CNV Calling from Affymetrix Axiom Array Data
The Affymetrix Axiom arrays used for EMA, released in early 2011, were designed and
used for SNP genotyping [1]. Consequently, the available CNV calling tools such as PennCNV [2],
QuantiSNP [3] and Birdsuite [4] were not directly applicable to our EMA dataset, so we developed our
own methodology to translate SNP data into CNV calls. EMA CNV calls are available as an
online file.
First, quantile normalization (QN) transforms the intensities of each array so that all arrays
share the same distribution. This was performed using the “apt-probeset-genotype” program in
Affymetrix Power Tools. Second, to estimate integer copy number, we used the COPy number
Polymorphism Evaluation Routine (COPPER), which we previously developed to call CNVs from SNP
array data [5]. In brief, COPPER uses genotype calls to set the median intensity for 0, 1, and 2
copies of each allele and transforms the intensity axes accordingly to derive scaled intensity
estimates. We implemented this procedure in R (http://www.R-project.org/). After QN and COPPER,
we summed the scaled intensities across the allele axes, which gives a total copy number estimate
at each locus for each individual.
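The two normalization steps can be illustrated with a minimal sketch. This is not the Affymetrix Power Tools or COPPER code: the function names, the "AA"/"AB"/"BB" genotype encoding and the piecewise-linear mapping between cluster medians are assumptions made for illustration only, and the scaling requires that all three genotype clusters are observed at the SNP.

```python
from statistics import median

def quantile_normalize(arrays):
    """Give every array the same empirical distribution (mean of sorted values)."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by rank
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        result.append(out)
    return result

def scale_allele(intensities, copies):
    """Map raw intensities of one allele at one SNP onto a copy-number axis.

    copies[i] is the number of copies of the allele implied by the genotype
    call of individual i (0, 1 or 2). The cluster medians anchor the axis
    (assumed piecewise-linear mapping, extrapolated beyond the medians).
    """
    m = [median(x for x, c in zip(intensities, copies) if c == k) for k in (0, 1, 2)]

    def to_copies(x):
        if x <= m[1]:
            return (x - m[0]) / (m[1] - m[0])
        return 1.0 + (x - m[1]) / (m[2] - m[1])

    return [to_copies(x) for x in intensities]

def total_copy_number(a_intensities, b_intensities, genotypes):
    """Sum the scaled A and B estimates for each individual at one SNP."""
    a_scaled = scale_allele(a_intensities, [g.count("A") for g in genotypes])
    b_scaled = scale_allele(b_intensities, [g.count("B") for g in genotypes])
    return [a + b for a, b in zip(a_scaled, b_scaled)]
```

With well-separated genotype clusters, individuals carrying two copies in total obtain estimates close to 2, whichever genotype they have.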
To take into account the non-independence of consecutive SNPs, we applied a 5-SNP
median: the copy number (CN) of a SNP is estimated as the median of its own CN and those of
the two flanking SNPs on each side. We then applied a hidden Markov model (HMM) to make
the CNV calls, similar to QuantiSNP [3], which we implemented in Python
(Python Software Foundation, Python Language Reference, version 2.6.5; available at
http://www.python.org). To define the HMM transition matrix, we first defined an a priori
probability that a genetic event occurs between two consecutive SNPs separated by a
distance d. This a priori probability is extended from QuantiSNP to use as many prior
probabilities as there are possible states. There are three possible states: deletion, normal
region and duplication.
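The 5-SNP median described above can be sketched as follows. The function name is hypothetical, and the handling of the first and last two SNPs (a truncated window) is an assumption, since the text does not say how chromosome ends were treated.

```python
from statistics import median

def rolling_median(cn, flank=2):
    """Median of each SNP's CN and up to `flank` neighbours on each side.

    Windows are truncated at the ends of the list (assumed edge handling).
    """
    n = len(cn)
    return [median(cn[max(0, i - flank): min(n, i + flank + 1)]) for i in range(n)]
```

A single aberrant SNP inside an otherwise normal stretch is removed by this filter, whereas a run of three or more consecutive shifted SNPs survives it.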
The corresponding equations are given below; note that all three are extensions of equation 1
from QuantiSNP [3]:

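$$\rho_{\mathrm{Del}} = \frac{1}{2}\left(1 - \exp\left(-\frac{d}{2L_{\mathrm{Del}}}\right)\right)$$

$$\rho_{\mathrm{Norm}} = \frac{1}{2}\left(1 - \exp\left(-\frac{d}{2L_{\mathrm{Norm}}}\right)\right)$$

$$\rho_{\mathrm{Dupl}} = \frac{1}{2}\left(1 - \exp\left(-\frac{d}{2L_{\mathrm{Dupl}}}\right)\right)$$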

L_Del, L_Norm and L_Dupl are, respectively, the average lengths of a deletion region, a normal
region and a duplication region. L_Del and L_Dupl are based on information in the Database of
Genomic Variants (http://projects.tcag.ca/variation/); we estimated them to be 36,647 and
39,633, respectively. We arbitrarily set L_Norm to 400,000 so that it is much greater than
L_Del and L_Dupl. Note that although the L_Del and L_Dupl estimates are similar to each other,
both are much smaller than L_Norm, which reduces false-positive transitions to the extent
possible. We then define the
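transition matrix of hidden states between two consecutive SNPs i and i+1:

$$
\begin{pmatrix}
1 - \rho_{\mathrm{Del}} & 1.5\,\dfrac{\rho_{\mathrm{Del}}}{N_s - 1} & 0.5\,\dfrac{\rho_{\mathrm{Del}}}{N_s - 1} \\
1.33\,\dfrac{\rho_{\mathrm{Norm}}}{N_s - 1} & 1 - \rho_{\mathrm{Norm}} & 0.67\,\dfrac{\rho_{\mathrm{Norm}}}{N_s - 1} \\
0.5\,\dfrac{\rho_{\mathrm{Dupl}}}{N_s - 1} & 1.5\,\dfrac{\rho_{\mathrm{Dupl}}}{N_s - 1} & 1 - \rho_{\mathrm{Dupl}}
\end{pmatrix}
$$

Here $\rho_{\mathrm{Del}}$, $\rho_{\mathrm{Norm}}$ and $\rho_{\mathrm{Dupl}}$ are the probabilities (previously defined) that
there is a genetic event between SNPs i and i+1, and $N_s$ is the total number of states (three in
our case).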
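As a rough illustration of how the a priori probabilities and the transition matrix fit together, the sketch below builds the matrix for a given inter-SNP distance. Several points are assumptions rather than statements from the text: the function names are hypothetical, d is taken to be in the same units as the average lengths (presumably base pairs), rows and columns are ordered deletion, normal, duplication, and the off-diagonal weights of the duplication row follow our reading of the matrix.

```python
import math

# Average region lengths quoted in the text (L_Norm was chosen arbitrarily).
L_DEL, L_NORM, L_DUPL = 36_647, 400_000, 39_633
N_STATES = 3  # deletion, normal, duplication

def event_prob(d, mean_length):
    """A priori probability of a genetic event between SNPs a distance d apart."""
    return 0.5 * (1.0 - math.exp(-d / (2.0 * mean_length)))

def transition_matrix(d):
    """3x3 transition matrix; rows are the state at SNP i, columns at SNP i+1."""
    rho_del = event_prob(d, L_DEL)
    rho_norm = event_prob(d, L_NORM)
    rho_dupl = event_prob(d, L_DUPL)
    k = N_STATES - 1
    return [
        [1 - rho_del, 1.5 * rho_del / k, 0.5 * rho_del / k],
        [1.33 * rho_norm / k, 1 - rho_norm, 0.67 * rho_norm / k],
        # Assumed layout of the duplication row (mirrors the deletion row).
        [0.5 * rho_dupl / k, 1.5 * rho_dupl / k, 1 - rho_dupl],
    ]
```

Because the two off-diagonal weights in each row sum to N_s - 1 = 2, every row sums to 1 for any distance d. With L_Norm much larger than L_Del and L_Dupl, the probability of remaining in the normal state is higher than the probability of remaining in a deletion or duplication over the same distance.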
Individual Quality Control
Several iterations of the normalization steps were run to perform QC on individual sample data.
We first ran QN and COPPER on all array data (N = 1705: 854 mothers, 851 neonates) and
identified outlier individuals with at least 15 SNPs (on a given chromosome) with a total copy
number estimate greater than 10. In this step, 168 individuals were excluded (69 mothers, 99
neonates).
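The outlier rule used in this first QC step can be written as a short check. The data layout (a dictionary mapping chromosome names to lists of total copy number estimates) and the function name are assumptions for illustration.

```python
def is_outlier(cn_by_chrom, cn_cutoff=10, min_snps=15):
    """True if any single chromosome has at least `min_snps` SNPs whose
    total copy number estimate exceeds `cn_cutoff`."""
    return any(
        sum(1 for cn in values if cn > cn_cutoff) >= min_snps
        for values in cn_by_chrom.values()
    )
```

Note that the count is made per chromosome: an individual with extreme SNPs scattered over several chromosomes, but fewer than 15 on any one of them, is not flagged by this rule.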
We re-ran QN and COPPER on the remaining data (N = 1537: 785 mothers, 752 neonates). The
second step of quality control consisted of manually assessing genome-wide variability in each
individual’s copy number graph and excluding the individuals with excess variability. Typical
individuals have a variance of ~1, so we excluded individuals whose variance across entire
chromosomes was much greater than 1. Our final run of QN and COPPER additionally excluded
individuals identified in the SNP analysis as contaminated or mis-identified, resulting in our
final sample (N = 1389: 707 mothers, 682 neonates) [6]. After these steps of quality control at
the individual array level, we retained 682 neonates (336 cases, 339 controls) and 707 mothers
(351 cases, 349 controls) of good quality.
CNV Quality Control
After the HMM CNV calling, we smoothed over small gaps: if the gap between two consecutive
CNVs was less than 20 SNPs, they were called as one CNV. During our QC process, we identified
a systematic difficulty in calling CNVs specific to chromosome 19. We confirmed that the
reagents (Kit A, Affymetrix) used at the time of this experiment are known to generate gaps in
the whole genome amplification step, which likely explains the failure of this chromosome during
QC, since chromosome 19 has a disproportionate number of gaps (personal communication,
Affymetrix). Thus, CNVs on chromosome 19 in the EMA samples were excluded from all
analyses.
After a closer look at the CNV-length distributions, we estimated that we could not effectively
detect deletions smaller than 25 SNPs or duplications smaller than 35 SNPs. Using these
thresholds, the number of putative deletions was 128 for mothers and 1,259 for neonates, and
the number of duplications was 864 for mothers and 4,552 for neonates. The comparatively
large number of CNVs in neonates is likely explained by the use of archived blood spot DNA.
However, by visual inspection (below), we eliminated most of the noisy data, resulting in high
confidence calls in both maternal and neonatal samples.
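The automated part of the CNV-level QC (gap smoothing, exclusion of chromosome 19 and the minimum-size thresholds) could be sketched as below. The `Call` record, the use of SNP indices along a chromosome, and the restriction of merging to calls of the same type are assumptions; the text does not specify these details.

```python
from typing import NamedTuple

MIN_SNPS = {"del": 25, "dup": 35}  # detection thresholds quoted in the text

class Call(NamedTuple):
    chrom: str
    start: int  # index of the first SNP in the call
    end: int    # index of the last SNP in the call (inclusive)
    state: str  # "del" or "dup"

def merge_calls(calls, max_gap=20, excluded_chroms=("19",)):
    """Join consecutive same-type calls separated by fewer than max_gap SNPs."""
    merged = []
    kept = sorted(c for c in calls if c.chrom not in excluded_chroms)
    for call in kept:
        prev = merged[-1] if merged else None
        if (prev is not None and prev.chrom == call.chrom
                and prev.state == call.state
                and call.start - prev.end - 1 < max_gap):
            # Extend the previous call over the gap and the current call.
            merged[-1] = prev._replace(end=max(prev.end, call.end))
        else:
            merged.append(call)
    return merged

def size_filter(calls):
    """Keep calls spanning at least the minimum number of SNPs for their type."""
    return [c for c in calls if c.end - c.start + 1 >= MIN_SNPS[c.state]]
```

Applying `size_filter(merge_calls(calls))` mirrors the order described in the text, where smoothing precedes the size thresholds; the subsequent visual inspection is a manual step and is not represented here.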
After setting both SNP-number and CNV-size cutoffs, we manually inspected CNVs
(blinded to affection status), resulting in 104 high-confidence CNV regions (66 deletions, 38
duplications). Inheritance checks revealed an additional 8 deletions in mothers and 7 in
neonates, and an additional 5 duplications in mothers and 6 in neonates, suggesting that
automated calling may have a 10-20% false-negative rate, some of which we can correct by
utilizing family structure. Close to 50% of maternal CNVs are transmitted to offspring, as
expected under Mendelian inheritance, suggesting that we have a low false-positive rate. In total,
we identified 50 CNVs in mothers (32 deletions, 18 duplications) and 79 CNVs in the neonates
(48 deletions, 31 duplications). However, the numbers of high-confidence CNVs detected in
maternal and neonatal samples are not directly comparable, because we visually inspected
many more regions in neonatal samples and had the opportunity to identify additional CNVs we
might have missed by automated calling in maternal samples.
Supplementary Figure 1: QQ plots for autism parent samples
The quantile-quantile (QQ) plot is shown with the expected –log10(P) distribution under the null
hypothesis on the x-axis and the observed –log10(P) distribution on the y-axis. The P-values plotted
correspond to an association test between mothers and fathers for the AGP (orange), AGRE (green), and
SSC (purple) across genome-wide SNP data. The genomic control (λGC) values are shown and the black
line corresponds to expectations under the null hypothesis of no association (and no stratification).
Supplementary Table 1: Autism replication datasets
| | AGRE | SSC | AGP |
|---|---|---|---|
| Publication | Glessner et al. 2009 [7] | Sanders et al. 2011 [8] | Pinto et al. 2010 [9] |
| Number of mother/father pairs | 943 | 1124 | 1366 |
| Family type | Multiplex | Simplex | Multiplex and Simplex |
| Platform | Illumina 550k | Illumina 1M and 1M duo | Illumina 1M |
| Data source | .cel files | Supplementary Table | dbGaP |
| CNV calling | PennCNV | PennCNV, QuantiSNP, GNOSIS | iPattern, QuantiSNP |
| Smoothing | 20 SNPs | 20 SNPs | 20 SNPs |
| Deletion min SNPs | 25 | 25 | 25 |
| Deletion min kb | 30 | 30 | 30 |
| Duplication min SNPs | 35 | 35 | 35 |
| Duplication min kb | 30 | 30 | 30 |
| Frequency threshold | 1% | 1% | 1% |
The number of pairs of mothers and fathers compared per autism replication dataset is shown along with
the family type(s) represented in each dataset. The microarray platform used to genotype the individuals
is shown, as are the methods used for CNV calling. The quality control measures of combining nearby
CNV calls (smoothing), minimum SNP and length criteria, and maximum frequency threshold are shown.
Supplementary Table 2: Control replication datasets
| | HapMap | 1KGP |
|---|---|---|
| Publication | Conrad et al. 2010 [10] | 1000 Genomes Project Consortium 2012 [11] |
| Number of females/males | 106/106 | 600/551* |
| Platform | Illumina Infinium | Sequencing |
| Data source | Supplementary Table | DGV |
| CNV calling | GADA [35] | CNVnator [36] |
| Smoothing | 20 SNPs | 20 kb |
| Deletion min SNPs | 25 | NA |
| Deletion min kb | 30 | 30 |
| Duplication min SNPs | 35 | NA |
| Duplication min kb | 30 | NA |
| Frequency threshold | 1% | 1% |
The number of females and males compared per control replication dataset is shown. The microarray
platform used to genotype the individuals is shown, as are the methods used for CNV calling and CNV
thresholds. The quality control measures of combining nearby CNV calls (smoothing), minimum SNP and
length criteria, and maximum frequency threshold are shown.
*77/58 females/males in 1KGP are also in the HapMap sample (although detection of deletions is
independent).
Supplementary Table 3: SSC sex-matched proband vs. sibling comparison
| SSC | All | Deletion | Duplication |
|---|---|---|---|
| Males | | | |
| Number of CNVs in probands (N=967) | 1644 | 886 | 758 |
| Number of CNVs in siblings (N=403) | 634 | 330 | 304 |
| Test of equality (Wilcoxon-test) | 0.11 | 0.07 | 0.80 |
| CNV-Length distribution | 0.33 | 0.69 | 0.41 |
| Gene-content | 0.13 | 0.35 | 0.20 |
| Females | | | |
| Number of CNVs in probands (N=157) | 311 | 162 | 149 |
| Number of CNVs in siblings (N=469) | 801 | 460 | 341 |
| Test of equality (Wilcoxon-test) | **0.01** | 0.11 | **1.3 × 10⁻⁴** |
| CNV-Length distribution | 0.69 | 0.21 | 0.76 |
| Gene-content | **1.8 × 10⁻²** | **2.7 × 10⁻²** | 0.48 |
The number of CNVs in probands and siblings in the SSC dataset is shown, along with the P-value of the
Wilcoxon-test comparing the number of CNVs in sex-matched probands and siblings. The P-value of the
test of equality of distributions (Kolmogorov-Smirnov) is shown to compare length of CNVs. The P-value of
the Wilcoxon-test for gene content is shown. P-values < 0.05 are bolded.
Supplementary Table 4: SSC proband and sibling sex comparisons
| SSC | All | Deletion | Duplication |
|---|---|---|---|
| Probands | | | |
| Number of CNVs in females (N=157) | 311 | 162 | 149 |
| Number of CNVs in males (N=967) | 1644 | 886 | 758 |
| Test of equality (Wilcoxon-test) | **7.1 × 10⁻³** | **1.6 × 10⁻³** | **1.2 × 10⁻⁴** |
| CNV-Length distribution | 0.48 | 0.29 | 0.65 |
| Gene-content | 0.09 | 0.29 | 0.27 |
| Siblings | | | |
| Number of CNVs in females (N=469) | 801 | 460 | 341 |
| Number of CNVs in males (N=403) | 634 | 330 | 304 |
| Test of equality (Wilcoxon-test) | 0.27 | **6.2 × 10⁻⁴** | 0.78 |
| CNV-Length distribution | 0.13 | 0.58 | 0.73 |
| Gene-content | 0.75 | 0.46 | 0.13 |
The number of CNVs in probands and siblings in the SSC dataset is shown, along with the P-value of the
Wilcoxon-test comparing the number of CNVs in males and females within probands and siblings. The
P-value of the test of equality of distributions (Kolmogorov-Smirnov) is shown to compare length of
CNVs. The P-value of the Wilcoxon-test for gene content is shown. P-values < 0.05 are bolded.
Supplementary Table 5: Excess CNVs in high-burden females
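| | SSC all: male observed | SSC all: female observed | SSC all: female expected | HapMap all: male observed | HapMap all: female observed | HapMap all: female expected | 1KGP deletions: male observed | 1KGP deletions: female observed | 1KGP deletions: female expected |
|---|---|---|---|---|---|---|---|---|---|
| 1 CNV | 345 | 341 | 376 | 21 | 27 | 60 | 112 | 110 | 149 |
| 2 CNVs | 592 | 554 | 646 | 8 | 24 | 23 | 142 | 184 | 189 |
| 3 CNVs | 435 | 519 | 475 | 3 | 12 | 9 | 165 | 201 | 220 |
| >3 CNVs | 417 | 538 | 455 | 0 | 29 | 0 | 255 | 402 | 339 |
| Total number of CNVs | 1789 | 1952 | 1952 | 32 | 92 | 92 | 674 | 897 | 897 |
| Chi-square p-value | 8.6 × 10⁻⁸ | | | 3.0 × 10⁻⁴ | | | 3.2 × 10⁻⁵ | | |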
For each dataset with a significant difference in all CNVs (SSC and HapMap) or deletions (1KGP),
the number of CNVs observed in males with only a single CNV, 2 CNVs, 3 CNVs and more than 3 CNVs is
shown, along with the number of CNVs observed in females in the same categories. Female expected
counts were calculated based on the male counts and the observed total number of female CNVs.
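The calculation of the female expected counts can be illustrated with the SSC column of the table. The sketch below is an assumption-laden reconstruction rather than the authors' code: the function names are hypothetical, and the use of a Pearson chi-square statistic with 3 degrees of freedom (four categories minus one) is assumed, since the caption does not state the degrees of freedom. A category with an expected count of zero, as in the HapMap column, would need separate handling that is not addressed here.

```python
import math

def expected_female_counts(male_counts, female_total):
    """Rescale the male distribution to the observed total of female CNVs."""
    scale = female_total / sum(male_counts)
    return [m * scale for m in male_counts]

def chi_square_3df(observed, expected):
    """Pearson chi-square statistic and its P-value for 3 degrees of freedom."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    half = stat / 2.0
    # Closed-form upper tail of the chi-square distribution with 3 d.f.
    p = math.erfc(math.sqrt(half)) + 2.0 * math.sqrt(half / math.pi) * math.exp(-half)
    return stat, p

# SSC counts from the table: 1 CNV, 2 CNVs, 3 CNVs, >3 CNVs.
ssc_male = [345, 592, 435, 417]
ssc_female = [341, 554, 519, 538]
ssc_expected = expected_female_counts(ssc_male, sum(ssc_female))
ssc_stat, ssc_p = chi_square_3df(ssc_female, ssc_expected)
```

Rounded to integers, `ssc_expected` matches the female expected column for SSC, and under the 3-degrees-of-freedom assumption `ssc_p` is close to the reported value of 8.6 × 10⁻⁸.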
Supplementary Table 6: EMA proband maternally-inherited and non-maternal CNVs
| | All | Deletion | Duplication |
|---|---|---|---|
| Maternally-inherited CNVs in cases (N complete pairs=293) | 22 | 15 | 7 |
| Maternally-inherited CNVs in controls (N complete pairs=299) | 9 | 5 | 4 |
| Test of equality (Wilcoxon-test) | **0.01** | **0.02** | 0.34 |
| CNV-Length distribution | 0.82 | 0.77 | 0.66 |
| Gene-content | 0.35 | 1.00 | 0.07 |
| Non-maternal CNVs in cases (N complete pairs=293) | 12 | 7 | 5 |
| Non-maternal CNVs in controls (N complete pairs=299) | 20 | 9 | 11 |
| Test of equality (Wilcoxon-test) | 0.85 | 0.82 | 0.78 |
| CNV-Length distribution | 0.51 | 0.50 | 0.82 |
| Gene-content | 0.22 | 0.20 | 0.91 |
The number of CNVs in probands is shown, along with the P-value of the Wilcoxon-test comparing cases
to controls. The same results are shown for CNVs inherited from mothers, and non-maternal CNVs (de
novo and paternal). For some CNVs, transmission status could not be determined (see Supplementary
Methods) because only one member of the pair passed QC. The P-value of the test of equality of
distributions (Kolmogorov-Smirnov) is shown to compare length of CNVs. The P-value of the Wilcoxon-test
comparing gene content in cases and controls is shown. P-values < 0.05 are bolded.
Supplementary Table 7: EMA maternally-transmitted and non-transmitted CNVs
| | All | Deletion | Duplication |
|---|---|---|---|
| Transmitted CNVs in mothers of cases (N complete pairs=293) | 22 | 15 | 7 |
| Transmitted CNVs in mothers of controls (N complete pairs=299) | 9 | 5 | 4 |
| Test of equality (Wilcoxon-test) | **0.01** | **0.02** | 0.34 |
| CNV-Length distribution | 0.82 | 0.77 | 0.66 |
| Gene-content | 0.35 | 1.00 | 0.07 |
| Non-transmitted CNVs in mothers of cases (N complete pairs=293) | 7 | 4 | 3 |
| Non-transmitted CNVs in mothers of controls (N complete pairs=299) | 6 | 4 | 2 |
| Test of equality (Wilcoxon-test) | 0.75 | 0.98 | 0.64 |
| CNV-Length distribution | 0.07 | 0.21 | 0.60 |
| Gene-content | 0.13 | 0.22 | 0.37 |
The number of CNVs in EMA mothers is shown, along with the P-value of the Wilcoxon-test comparing
mothers of cases to mothers of controls. The same results are shown for CNVs transmitted to
proband/control children, and non-transmitted CNVs. For some CNVs, transmission status could not be
determined (see Supplementary Methods) because only one member of the pair passed QC. The P-value
of the test of equality of distributions (Kolmogorov-Smirnov) is shown to compare length of CNVs. The
P-value of the Wilcoxon-test for gene content is shown. P-values < 0.05 are bolded.
Supplementary References:
1. Hoffmann TJ et al. Next generation genome-wide association tool: design and coverage of a high-throughput European-optimized SNP array. Genomics 2011; 98: 79-89.
2. Wang K et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 2007; 17: 1665-74.
3. Colella S et al. QuantiSNP: an Objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res 2007; 35: 2013-25.
4. Korn J et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 2008; 40: 1253-60.
5. Weiss LA et al. Association between microdeletion and microduplication at 16p11.2 and autism. N Engl J Med 2008; 358: 667-75.
6. Tsang KM et al. A genome-wide survey of transgenerational genetic effects in autism. PLoS One 2013; 8: e76978.
7. Glessner JT et al. Autism genome-wide copy number variation reveals ubiquitin and neuronal genes. Nature 2009; 459: 569-73.
8. Sanders SJ et al. Multiple recurrent de novo CNVs, including duplications of the 7q11.23 Williams syndrome region, are strongly associated with autism. Neuron 2011; 70: 863-85.
9. Pinto D et al. Functional impact of global rare copy number variation in autism spectrum disorders. Nature 2010; 466: 368-72.
10. Conrad DF et al. Origins and functional impact of copy number variation in the human genome. Nature 2010; 464: 704-12.
11. 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature 2012; 491: 56-65.