Supplementary Materials
Quality Control and Quality Assurance (QA/QC) Report
QA/QC Methods
Computer Software
SAS 9.3 (SAS Institute, Cary, NC, USA) and R (R Core Team, 2013) were used for
initial genotype data processing and preparation prior to QA/QC steps. Unless otherwise noted,
PLINK (Purcell et al., 2007) was used to generate the QA/QC metrics and R was used for
analysis and graphical display of QA/QC information.
Targeted Sample for Genotyping
A total of 2,020 DNA samples were genotyped, including duplicate DNA
samples from the same individual to be used for quality assurance purposes. The final number of
non-duplicate individuals with acceptable genotype quality is presented below. The targeted
sample for genotyping was the National Longitudinal Study of Adolescent Health (Add Health)
sibling pairs who provided a saliva sample and consented to participate in a genome-wide
association project during the Wave IV data collection (Harris et al., 2013). Note that this sample
includes not only sibling pairs, but also other types of putative relationship pairs who share a
common Add Health family identifier. Known MZ twin pairs were not part of the targeted
genotyping sample.
Sample Preparation and Genotyping
Saliva was collected during Wave IV using the Oragene collection methods (Oragene,
DNA Genotek, Ottawa, Ontario, Canada), and genomic DNA was isolated from the Oragene solutions
using ZymoResearch (Irvine, CA, USA) Silicon-A™ plates according to protocols
supplied by the manufacturer at the Institute for Behavioral Genetics (IBG) at the University of
Colorado Boulder. Extracted DNA was normalized to 50 ng/µl using Picogreen® fluorescence
and sent to Expression Analysis, Inc. (Durham, NC, USA) for genotyping. The genome-wide
platform used for this study is the Illumina HumanOmni1-Quad v1 (Illumina Inc., San Diego,
CA, USA), which includes a total of 1,134,514 genetic and structural variants. Clustering, calling
and scoring genotypes were performed using Illumina’s GenCall software with version 1.0 (H)
product files. The initial genotypes were called using Illumina’s “TOP” strand designation. For
further details on strand designation as it relates to “TOP”, “(fwd)” and “(+)”, see Nelson et al.
(2012). Each 96-well plate contained one randomly assigned inter-plate duplicate, one CEPH
control and one non-template control.
SNP Marker Set
We removed 9,837 markers with no reliable map location based upon the Human Genome
Reference Build 37.1. The genotyping platform includes structural variation such as
copy-number variants (CNVs) and insertion/deletion variants (INDELs). While potentially
useful and interesting for future projects, the initial focus was on biallelic single-nucleotide
polymorphisms (SNPs) to establish sample quality and quality control parameters and to
conduct the initial genome-wide association analysis. There are 126,969 “markers of interest”
identified in Illumina’s product files, with the majority of these being intensity-only probes
(123,295). There are an additional 128 markers identified as INDELs. While potentially
interesting for future association analyses, the focus of the present study is on SNP markers and
therefore, the INDELs were removed. Removal of these markers of interest and INDELs resulted
in 1,000,970 SNP markers. Further, for the purposes of this study, we chose to remove the Y
chromosome SNPs (Y; 1,209 markers), mitochondrial SNPs (MT; 25 markers) and SNPs in the
pseudo-autosomal region of X (XY; 872 markers). This step results in a total of 998,864 SNPs
(2,106 markers removed). Additionally, we selected markers that map to the 1000 Genomes
Project (phase 1, version 3) reference set. This was done to ensure that only the most validated
SNPs are used for quality checks and analyses, and also to support strand alignment and to update
Illumina marker names with a corresponding dbSNP identifier. Out of the 998,864 markers,
959,527 markers were found in the 1000 Genomes reference database (phase 1, version 3).
Illumina markers without a valid dbSNP identifier (i.e. markers with the prefix “kgp” or “GA”)
were linked to a dbSNP identifier by merging them based on map location (chromosome and
base-pair location). The final set of genotypes used for the QC/QA steps uses SNP markers from
chromosomes 1-22 and the X chromosome. This set of markers allows us to reliably perform
various genome-wide quality assessments in addition to checks of biological sex. In
Supplemental Table 1, we provide a breakdown of the number of markers per chromosome used
for this study. Further pruning of markers was conducted at various stages for QC/QA purposes
and is noted accordingly throughout the text.
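The marker counts reported in this section can be checked with a few lines of arithmetic, and the location-based linking of marker names to dbSNP identifiers can be sketched as a simple lookup. The sketch below is illustrative only: the variable names, the function link_to_dbsnp and the example marker names and positions are hypothetical and do not come from the processing scripts used for this report.

```python
# Counts reported in the text; variable names are illustrative.
snps_after_indel_removal = 1_000_970
removed_by_region = {"Y": 1_209, "MT": 25, "XY": 872}

total_removed = sum(removed_by_region.values())
snps_chr1_22_and_x = snps_after_indel_removal - total_removed
in_1000_genomes = 959_527
not_in_reference = snps_chr1_22_and_x - in_1000_genomes


def link_to_dbsnp(markers, reference):
    """Attach a dbSNP identifier to each marker by (chromosome, position).

    markers:   iterable of (marker_name, chromosome, position)
    reference: iterable of (rsid, chromosome, position)
    Markers with no reference entry at that location map to None.
    """
    by_location = {(chrom, pos): rsid for rsid, chrom, pos in reference}
    return {name: by_location.get((chrom, pos)) for name, chrom, pos in markers}


# Hypothetical marker names and positions, for illustration only.
markers = [("kgp_example_1", "1", 1000), ("GA_example_2", "X", 2000)]
reference = [("rs_example_a", "1", 1000)]
linked = link_to_dbsnp(markers, reference)

print(total_removed, snps_chr1_22_and_x, not_in_reference)  # 2106 998864 39337
print(linked)
```

The difference of 39,337 is the number of markers on chromosomes 1-22 and X that were not found in the reference set.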
QA/QC Results
Missing Data Rate and Sample Quality
To generate missing data rates for individual samples, we focused on the 959,527 SNP
markers across chromosomes 1-22 and X. The average missing data rate for the 2,020 samples is
0.0126 (SD=0.0481) with a range of 0.0009 to 0.6294. The missing data thresholds that are often
used for GWAS range from 3-5%. Further, data is often inspected for the distribution of mean
heterozygosity across the autosomes. In general, samples exhibiting excess heterozygosity may
be an indicator of sample contamination while less than expected heterozygosity is thought to be
an indicator for inbreeding. Mean heterozygosity in this context is defined as (N-O)/N, where N
is the number of non-missing genotypes and O is the observed number of homozygous
genotypes. Supplemental Figure 1 displays the proportion of missing genotypes (log-scale on the
x-axis) versus heterozygosity rate (y-axis). The two horizontal dotted lines mark 2 standard
deviations above and below the mean heterozygosity in this sample. Based upon the distribution of
missing data it could be argued that a missing data rate of anywhere from 0.03-0.05 could be
reasonably adopted. A threshold of 0.03 (vertical dotted line) would remove 101 samples while a
threshold of 0.05 would remove 76 samples (~5% and 4% of the total sample respectively),
which is well within the range of acceptable sample loss among traditional genome-wide
association studies. However, rather than removing respondents at this stage, we opted for an
iterative procedure to come to a final set of samples to be removed that includes first removing
low quality SNP markers.
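As a minimal sketch of the two per-sample metrics discussed above, the function below computes a missing data rate and a heterozygosity rate from a list of genotype calls. It assumes the conventional definition of the heterozygosity rate as the proportion of non-missing genotypes that are heterozygous, (N-O)/N. The function name, the input format and the toy genotypes are hypothetical; the report itself generated these metrics with PLINK.

```python
def sample_metrics(genotypes):
    """Return (missing_rate, heterozygosity_rate) for one sample.

    genotypes: list of two-character calls such as "AG"; None marks a missing call.
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)                                  # N: non-missing genotypes
    o = sum(1 for g in called if g[0] == g[1])       # O: homozygous genotypes
    missing_rate = (len(genotypes) - n) / len(genotypes)
    het_rate = (n - o) / n if n else float("nan")
    return missing_rate, het_rate


# Toy data: 10 markers, 2 missing calls, 3 heterozygous calls.
toy = ["AA", "AG", None, "CC", "CT", "GG", "TT", "AG", "CC", None]
missing, het = sample_metrics(toy)
print(missing, het)  # 0.2 0.375
```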
Final QA/QC Marker Set
We initially assessed marker quality by calculating the missing data rate for each SNP
using only samples that had a genotyping call rate of at least 90%. This step generated a list of
SNP markers that could be considered of low quality using a SNP marker call rate threshold of
95%. A total of 18,665 SNP markers were removed using the 95% call rate threshold leaving
940,862 SNP markers across chromosomes 1-22 and X (see Supplemental Table 1 for a
breakdown by chromosome).
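The marker filter described above can be sketched as follows: marker call rates are computed only among samples whose own call rate is at least 90%, and markers below 95% are listed for removal. This is an illustrative pure-Python version under assumed inputs (a small sample-by-marker matrix with None for missing calls); the function names and toy data are hypothetical, and the report's own metrics were produced with PLINK.

```python
def call_rate(calls):
    """Fraction of non-missing entries in a list of calls."""
    return sum(c is not None for c in calls) / len(calls)


def low_quality_markers(matrix, sample_min=0.90, marker_min=0.95):
    """Indices of markers whose call rate falls below marker_min.

    matrix[i][j] is the genotype of sample i at marker j (None = missing).
    Only samples with call rate >= sample_min contribute to marker call rates.
    """
    good = [row for row in matrix if call_rate(row) >= sample_min]
    n_markers = len(matrix[0])
    return [j for j in range(n_markers)
            if call_rate([row[j] for row in good]) < marker_min]


def with_missing(missing_idx, n_markers=10):
    """Toy sample: every call is "AA" except the listed missing markers."""
    return [None if j in missing_idx else "AA" for j in range(n_markers)]


# 18 complete samples, 2 samples missing marker 3, 2 poor samples (call rate 0.5).
matrix = [with_missing(set()) for _ in range(18)]
matrix += [with_missing({3}) for _ in range(2)]
matrix += [with_missing({0, 1, 2, 4, 5}) for _ in range(2)]

print(low_quality_markers(matrix))                  # [3]
print(low_quality_markers(matrix, sample_min=0.0))  # [0, 1, 2, 3, 4, 5]
```

The second call shows why poorly genotyped samples are set aside first: if they are included, several otherwise well-behaved markers fall below the 95% threshold.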
Final QA/QC Sample
We then recalculated the missing data rate for samples after removal of the low-quality
markers to get a more accurate estimate of the sample missing data rates. Using a marker set with
low-call markers removed, a 0.03 threshold would remove 98 samples while a 0.05 threshold
would remove 74 samples. Two different criteria were used to flag individual samples for
potential removal from the analysis data set: first, a missing data rate > 0.05; second, an
individual mean heterozygosity rate more than 2 standard deviations above or below the mean
heterozygosity rate of the entire sample. No samples were removed because of excess heterozygosity.
However, based upon the missing data rate of > 0.05, we removed 74 genotyped samples. As
noted above, this results in a sample loss of approximately 3-4%. Therefore, the sample used for
subsequent QC/QA steps is N=1,946. Note this number includes duplicate samples, as part of the
QC/QA is to assess duplicate concordance. The number of individuals (excluding duplicates)
that meet the thresholds above is 1,888 (note that this includes one known MZ twin pair). In
situations where there were duplicate samples from the same respondent, we used the set of
genotypes (DNA sample) that yielded the higher genotyping call rates.
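The sample counts above, and the rule of keeping the duplicate with the higher call rate, can be sketched as below. The counts are those reported in the text; the function best_sample_per_person and the sample and person identifiers are hypothetical and shown only to illustrate the selection rule.

```python
# Counts reported in the text.
genotyped = 2_020
removed_for_missingness = 74
retained = genotyped - removed_for_missingness
loss_pct = 100 * removed_for_missingness / genotyped

print(retained, round(loss_pct, 2))  # 1946 3.66


def best_sample_per_person(samples):
    """Keep, for each person, the DNA sample with the highest call rate.

    samples: iterable of (sample_id, person_id, call_rate).
    Returns a dict mapping person_id to the chosen sample_id.
    """
    best = {}
    for sample_id, person_id, rate in samples:
        if person_id not in best or rate > best[person_id][1]:
            best[person_id] = (sample_id, rate)
    return {person: sample_id for person, (sample_id, _) in best.items()}


# Hypothetical identifiers and call rates, for illustration only.
samples = [("s1", "p1", 0.991), ("s2", "p1", 0.998), ("s3", "p2", 0.975)]
chosen = best_sample_per_person(samples)
print(chosen)  # {'p1': 's2', 'p2': 's3'}
```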
Sex Checks
Using the X chromosome to check for biological sex can be useful to identify problems
related to sample mix-up and/or misaligned coding files. After removal of 74 samples based on
missing data rates, 138 samples were “flagged” by PLINK using an inbreeding (homozygosity;
F) estimate for the X chromosome. PLINK conservatively expects a homozygosity estimate >
0.80 for males and < 0.20 for females; essentially, any homozygosity estimate between 0.20
and 0.80 is flagged. Supplemental Figure 2 displays the X chromosome homozygosity for Add
Health respondents coded as male (Supplemental Figure 2A) and female (Supplemental Figure
2B). Out of the 1,888 individuals, 138 individuals were flagged for further
inspection. The first conclusion to be drawn from this is that there is no evidence of a widespread
issue linking samples to coded information such as biological sex. Of the 138 flagged
individuals, there were only 4 self-reported males. Of these 4 males, 3 would be considered
female based upon their X chromosome (F = 0.16, 0.10 and 0.01). One of these male respondents
exhibits a relatively ambiguous sex via genotype (F = 0.36). Of the 134 self-reported females,
nearly all of them fall within the expected null distribution and therefore, they are likely to have
been flagged unnecessarily. However, there are 2 females who also have similar homozygosity
estimates (F = 0.58 and 0.38) to the ambiguous male noted previously. Collectively, these three
individuals (coded as one male and two female) exhibit homozygosity estimates that are
consistent with chromosomal abnormalities. No samples were removed at this step, as there is no
evidence of widespread sample mix-ups or linking file issues.
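The threshold logic described in this section (F > 0.80 read as male, F < 0.20 read as female, anything in between ambiguous) can be written as a short classifier. This is a sketch of the rule as stated in the text, not PLINK's implementation; the function names are hypothetical, and the F values are those reported above for the individuals singled out.

```python
def infer_sex_from_f(f, male_min=0.80, female_max=0.20):
    """Classify an X chromosome homozygosity estimate F."""
    if f > male_min:
        return "male"
    if f < female_max:
        return "female"
    return "ambiguous"


def sex_check(reported, f):
    """Compare reported sex with the sex inferred from F."""
    inferred = infer_sex_from_f(f)
    if inferred == "ambiguous":
        return "ambiguous"
    return "ok" if inferred == reported else "mismatch"


# F values reported in the text for the flagged individuals discussed above.
flagged_males = [0.16, 0.10, 0.01, 0.36]
flagged_females = [0.58, 0.38]

male_results = [sex_check("male", f) for f in flagged_males]
female_results = [sex_check("female", f) for f in flagged_females]
print(male_results)    # ['mismatch', 'mismatch', 'mismatch', 'ambiguous']
print(female_results)  # ['ambiguous', 'ambiguous']
```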
Duplicate Concordance
There were 53 duplicate pairs that passed initial QA/QC included in the Add Health
sibling pairs file. Pairwise mean IBD (PI_HAT) values that exceed 0.90 are thought to be either
duplicate samples or MZ twins (PI_HAT values have a maximum of 1.0 indicating perfect
concordance). Equivalently, pairwise measures of Kinship above 0.354 are considered duplicate
samples or MZ twins (Manichaikul et al., 2010) (Kinship estimates have a maximum of 0.5
indicating perfect concordance). The average mean IBD (PI_HAT) for the 53 duplicate pairs is
0.9995 (minimum 0.9987, maximum 1.0). The average Kinship estimate for the
same 53 duplicate pairs is 0.4998 (minimum 0.4996, maximum 0.5). Overall, the
concordance for these duplicate pairs is quite high using both estimation methods.
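The two duplicate criteria above can be applied side by side, as in the sketch below. The thresholds (PI_HAT > 0.90, Kinship > 0.354) and the minimum values for the 53 duplicate pairs are taken from the text; the function name is hypothetical, and the full-sibling values (PI_HAT 0.5, Kinship 0.25) are theoretical expectations added here for contrast, not results from this report.

```python
PI_HAT_MIN = 0.90    # PI_HAT above this suggests duplicate sample or MZ twin
KINSHIP_MIN = 0.354  # Kinship above this suggests duplicate sample or MZ twin


def is_duplicate_or_mz(pi_hat, kinship):
    """Report, per criterion, whether a pair looks like a duplicate or MZ twin."""
    return {"pi_hat": pi_hat > PI_HAT_MIN, "kinship": kinship > KINSHIP_MIN}


# Least concordant duplicate pair values reported in the text.
worst_duplicate = is_duplicate_or_mz(pi_hat=0.9987, kinship=0.4996)

# Theoretical expectation for a full-sibling pair, shown for contrast.
full_siblings = is_duplicate_or_mz(pi_hat=0.5, kinship=0.25)

print(worst_duplicate)  # {'pi_hat': True, 'kinship': True}
print(full_siblings)    # {'pi_hat': False, 'kinship': False}
```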
Supplemental Table 1: Number of SNP markers per chromosome for the QA/QC Report

Chromosome         All        Call Rate > 95%
1                  79,267     77,907
2                  74,584     73,218
3                  60,733     59,576
4                  56,854     55,756
5                  55,042     54,014
6                  71,959     70,424
7                  50,291     49,240
8                  50,192     49,295
9                  44,306     43,535
10                 50,524     49,570
11                 47,380     46,475
12                 46,078     45,254
13                 33,572     32,933
14                 29,098     28,613
15                 28,416     27,962
16                 30,304     29,753
17                 26,876     26,391
18                 26,841     26,388
19                 21,317     20,803
20                 26,195     25,749
21                 13,503     13,257
22                 13,646     13,396
Total (1-22)       936,978    919,509
X                  22,549     21,353
Total (1-22, X)    959,527    940,862
Supplemental Figure 1: Proportion of missing genotypes versus heterozygosity rates among
the 2,020 samples. Horizontal lines correspond to 2x the standard deviation of heterozygosity rate
while the vertical line corresponds to a missing data rate of 0.03 (97% genotyping call rate).
[Figure: scatter plot. x-axis: Proportion of missing genotypes (log scale, 0.001 to 1); y-axis: Heterozygosity rate (0.15 to 0.75).]
Supplemental Figure 2: X chromosome homozygosity for individuals who are coded in Add
Health as male (A) and female (B).
[Figure: two histograms. Panel A: All Male Samples; Panel B: All Female Samples. x-axis: X Chromosome Homozygosity Estimate; y-axis: Frequency.]
Supplemental Figure 3: MDS principal coordinate (PC) estimates generated by KING.
A: PC1 vs PC2
B: PC2 vs PC3
C: PC3 vs PC4
D: PC4 vs PC5
Supplemental Figure 4: Genetic ancestry by self-identified ethnic group where CEU = Europe,
CHB = China, JPT = Japan, YRI = Africa and AMR = America.
4A: Self-Identified White
4B: Self-Identified Black
4C: Self-Identified Hispanic
4D: Self-Identified Native American
4E: Self-Identified Asian
Supplemental Figure 5: Quantile-Quantile Plot of the unweighted GWAS p-values