Supplementary Materials

Quality Control and Quality Assurance (QA/QC) Report

QA/QC Methods

Computer Software

SAS 9.3 (SAS Institute, Cary, NC, USA) and R (R Core Team, 2013) were used for initial genotype data processing and preparation prior to the QA/QC steps. Unless otherwise noted, PLINK (Purcell et al., 2007) was used to generate the QA/QC metrics and R was used for analysis and graphical display of QA/QC information.

Targeted Sample for Genotyping

In total, 2,020 DNA samples were genotyped, including duplicate DNA samples from the same individual to be used for quality assurance purposes. The final number of non-duplicate individuals with acceptable genotype quality is presented below. The targeted sample for genotyping was the National Longitudinal Study of Adolescent Health (Add Health) sibling pairs who provided a saliva sample and consented to participate in a genome-wide association project during the Wave IV data collection (Harris et al., 2013). Note that this sample includes not only sibling pairs but also other types of putative relationship pairs who share a common Add Health family identifier. Known MZ twin pairs were not part of the targeted genotyping sample.

Sample Preparation and Genotyping

Saliva was collected during Wave IV using the Oragene collection methods (Oragene, DNA Genotek, Ottawa, Ontario, Canada), and genomic DNA was isolated from the Oragene solutions using ZymoResearch (Irvine, CA, USA) Silicon-A™ plates according to protocols supplied by the manufacturer, at the Institute for Behavioral Genetics (IBG) at the University of Colorado Boulder. Extracted DNA was normalized to 50 ng/µl using Picogreen® fluorescence and sent to Expression Analysis, Inc. (Durham, NC, USA) for genotyping. The genome-wide platform used for this study is the Illumina HumanOmni1-Quad v1 (Illumina Inc., San Diego, CA, USA), which includes a total of 1,134,514 genetic and structural variants.
Clustering, calling and scoring of genotypes were performed using Illumina’s GenCall software with version 1.0 (H) product files. The initial genotypes were called using Illumina’s “TOP” strand designation. For further details on strand designation as it relates to “TOP”, “(fwd)” and “(+)”, see Nelson et al. (2012). Each 96-well plate contained one randomly assigned inter-plate duplicate as well as a CEPH control and a non-template control.

SNP Marker Set

We removed markers with no reliable map location based upon the Human Genome Reference Build 37.1 (9,837 markers). The genotyping platform includes structural variation such as copy-number variants (CNVs) and insertion/deletion variants (INDELs). While these are potentially useful and interesting for future projects, the initial focus was on biallelic single-nucleotide polymorphisms (SNPs) to establish sample quality and quality control parameters and to conduct the initial genome-wide association analysis. There are 126,969 “markers of interest” identified in Illumina’s product files, the majority of which are intensity-only probes (123,295). An additional 128 markers are identified as INDELs. Because the focus of the present study is on SNP markers, the INDELs were removed. Removal of the markers of interest and INDELs resulted in 1,000,970 SNP markers.

Further, for the purposes of this study, we chose to remove the Y chromosome SNPs (Y; 1,209 markers), mitochondrial SNPs (MT; 25 markers) and SNPs in the pseudo-autosomal region of X (XY; 872 markers). This step removed 2,106 markers, leaving a total of 998,864 SNPs.

Additionally, we selected markers that map to the 1000 Genomes Project (phase 1, version 3) reference set. This was done to ensure that only the most validated SNPs are used for quality checks and analyses, and also to support strand alignment and to update Illumina marker names with a corresponding dbSNP identifier.
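The bookkeeping for the chromosome-based removal step can be checked with a few lines of arithmetic. The sketch below is only an illustration: the counts are the ones quoted in the text, while the variable names are hypothetical and the dictionary keys simply reuse the chromosome labels given in the text.

```python
# Marker counts quoted in the text; variable names are illustrative only.
snps_after_indel_removal = 1_000_970

# Markers dropped in the chromosome-based step, keyed by the labels used in the text.
removed_by_chromosome = {"Y": 1_209, "MT": 25, "XY": 872}

total_removed = sum(removed_by_chromosome.values())
snps_remaining = snps_after_indel_removal - total_removed

print(f"removed:   {total_removed:,}")
print(f"remaining: {snps_remaining:,}")
```

Both figures agree with the totals reported for this step (2,106 markers removed, 998,864 SNPs remaining).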
Out of the 998,864 markers, 959,527 were found in the 1000 Genomes reference database (phase 1, version 3). Illumina markers without a valid dbSNP identifier (i.e., markers with the prefix “kgp” or “GA”) were linked to a dbSNP identifier by merging on map location (chromosome and base-pair position). The final set of genotypes used for the QA/QC steps uses SNP markers from chromosomes 1-22 and the X chromosome. This set of markers allows us to reliably perform various genome-wide quality assessments in addition to checks of biological sex. Supplemental Table 1 provides a breakdown of the number of markers per chromosome used for this study. Further pruning of markers was conducted at various stages for QA/QC purposes and is noted accordingly throughout the text.

QA/QC Results

Missing Data Rate and Sample Quality

To generate missing data rates for individual samples, we focused on the 959,527 SNP markers across chromosomes 1-22 and X. The average missing data rate for the 2,020 samples is 0.0126 (SD = 0.0481), with a range of 0.0009 to 0.6294. The missing data thresholds often used for GWAS range from 3% to 5%. Data are also often inspected for the distribution of mean heterozygosity across the autosomes. In general, excess heterozygosity may indicate sample contamination, while lower than expected heterozygosity is thought to indicate inbreeding. Mean heterozygosity in this context is defined as (N-O)/O, where N is the number of non-missing genotypes and O is the observed number of homozygous genotypes. Supplemental Figure 1 displays the proportion of missing genotypes (log scale on the x-axis) versus heterozygosity rate (y-axis). The two horizontal dotted lines mark 2 standard deviations above and below the mean heterozygosity in this sample. Based upon the distribution of missing data, a missing data rate threshold anywhere from 0.03 to 0.05 could reasonably be adopted.
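As a minimal sketch of the two per-sample metrics described above, the function below computes a missing data rate and a heterozygosity value for one sample. It is not the PLINK or R code used for the report. The genotype coding (0 and 2 for the two homozygotes, 1 for a heterozygote, None for a missing call) and the function name are assumptions made for illustration. The heterozygosity follows the definition as stated in the text, (N-O)/O; many published QC protocols divide by N instead, so the denominator is worth confirming before reuse.

```python
def sample_metrics(genotypes):
    """Return (missing_rate, heterozygosity) for one sample.

    genotypes: list of allele counts per marker (assumed coding):
               0 or 2 = homozygous, 1 = heterozygous, None = missing.
    """
    n_total = len(genotypes)
    called = [g for g in genotypes if g is not None]
    n = len(called)                              # N: non-missing genotypes
    o = sum(1 for g in called if g in (0, 2))    # O: observed homozygous genotypes
    missing_rate = (n_total - n) / n_total
    # Definition as stated in the text: (N - O) / O.
    heterozygosity = (n - o) / o if o else float("nan")
    return missing_rate, heterozygosity


# Toy sample: 10 markers, 1 missing call, 3 heterozygous calls, 6 homozygous calls.
toy = [0, 1, 2, None, 1, 0, 2, 0, 1, 0]
missing_rate, heterozygosity = sample_metrics(toy)
print(missing_rate, heterozygosity)   # 0.1 0.5
```

With the alternative denominator the same toy sample would give 3/9 rather than 3/6, which is why the choice matters when comparing values across studies.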
A threshold of 0.03 (vertical dotted line) would remove 101 samples, while a threshold of 0.05 would remove 76 samples (approximately 5% and 4% of the total sample, respectively), which is well within the range of acceptable sample loss among traditional genome-wide association studies. However, rather than removing respondents at this stage, we opted for an iterative procedure that first removes low-quality SNP markers and then arrives at a final set of samples to be removed.

Final QA/QC Marker Set

We initially assessed marker quality by calculating the missing data rate for each SNP using only samples that had a genotyping call rate of at least 90%. This step generated a list of SNP markers that could be considered of low quality using a SNP marker call rate threshold of 95%. A total of 18,665 SNP markers were removed using the 95% call rate threshold, leaving 940,862 SNP markers across chromosomes 1-22 and X (see Supplemental Table 1 for a breakdown by chromosome).

Final QA/QC Sample

We then recalculated the missing data rate for samples after removal of the low-quality markers to get a more accurate estimate of the sample missing data rates. Using the marker set with low-call-rate markers removed, a 0.03 threshold would remove 98 samples, while a 0.05 threshold would remove 74 samples. Two criteria were used to flag individual samples for potential removal from the analysis data set: first, a missing data rate > 0.05; second, an individual mean heterozygosity rate more than 2 SD above or below the mean heterozygosity rate of the entire sample. No samples were removed because of excess heterozygosity. However, based upon the missing data rate criterion of > 0.05, we removed 74 genotyped samples. As noted above, this results in a sample loss of approximately 3-4%. Therefore, the sample used for subsequent QA/QC steps is N = 1,946. Note that this number includes duplicate samples, because part of the QA/QC is to assess duplicate concordance.
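The iterative procedure described above (score markers using only reasonably well-called samples, drop low-call-rate markers, then re-score samples on the retained markers) can be sketched on a toy genotype matrix. This is an illustration under stated assumptions, not the PLINK workflow used for the report: the function names, the None coding for missing calls and the toy data are all hypothetical, and only the thresholds (90%, 95%, 0.05, 2 SD) come from the text.

```python
from statistics import mean, stdev


def call_rate(values):
    """Fraction of non-missing (not None) entries."""
    return sum(v is not None for v in values) / len(values)


def iterative_qc(matrix, sample_pre=0.90, marker_min=0.95, max_missing=0.05):
    """matrix[i][j] is the genotype of sample i at marker j (None = missing).

    Returns (kept_marker_indices, flagged_sample_indices).
    """
    n_markers = len(matrix[0])
    # Step 1: score markers using only samples with a call rate of at least 90%.
    good_samples = [row for row in matrix if call_rate(row) >= sample_pre]
    kept_markers = [
        j for j in range(n_markers)
        if call_rate([row[j] for row in good_samples]) >= marker_min
    ]
    # Step 2: recompute each sample's missing rate on the retained markers only.
    flagged = [
        i for i, row in enumerate(matrix)
        if 1 - call_rate([row[j] for j in kept_markers]) > max_missing
    ]
    return kept_markers, flagged


def heterozygosity_outliers(het_rates, n_sd=2.0):
    """Indices of samples more than n_sd standard deviations from the mean."""
    m, s = mean(het_rates), stdev(het_rates)
    return [i for i, h in enumerate(het_rates) if abs(h - m) > n_sd * s]


# Toy data: 4 samples x 30 markers. Markers 0 and 1 are poorly called;
# sample 3 is a low-quality sample with half of its calls missing.
full = [0] * 30
matrix = [
    list(full),
    list(full),
    [None, None] + [0] * 28,
    [None] * 15 + [0] * 15,
]
kept_markers, flagged = iterative_qc(matrix)
print(len(kept_markers), flagged)

# Second criterion on hypothetical heterozygosity values: one clear outlier.
outliers = heterozygosity_outliers([0.30] * 9 + [0.60])
print(outliers)
```

In the toy data, sample 2 has a raw missing rate of 2/30 (above 0.05) but is not flagged once the two poorly called markers are dropped, which is the motivation for cleaning markers before making the final sample decision.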
The number of individuals (excluding duplicates) that meet the thresholds above is 1,888 (note that this includes one known MZ twin pair). Where there were duplicate samples from the same respondent, we used the set of genotypes (DNA sample) that yielded the higher genotyping call rate.

Sex Checks

Using the X chromosome to check biological sex can help identify problems related to sample mix-ups and/or misaligned coding files. After removal of the 74 samples based on missing data rates, 138 samples were “flagged” by PLINK using an inbreeding (homozygosity; F) estimate for the X chromosome. PLINK conservatively expects a homozygosity estimate > 0.80 for males and < 0.20 for females; essentially, any homozygosity estimate between 0.20 and 0.80 is flagged. Supplemental Figure 2 displays the X chromosome homozygosity for Add Health respondents coded as male (Supplemental Figure 2A) and female (Supplemental Figure 2B).

Out of the 1,888 individuals, 138 were flagged for further inspection. The first conclusion to be drawn is that there is no evidence of a widespread issue linking samples to coded information such as biological sex. Of the 138 flagged individuals, only 4 were self-reported males. Of these 4 males, 3 would be considered female based upon their X chromosome (F = 0.16, 0.10 and 0.01). The remaining male respondent exhibits a relatively ambiguous sex via genotype (F = 0.36). Of the 134 flagged self-reported females, nearly all fall within the expected null distribution and are therefore likely to have been flagged unnecessarily. However, 2 females have homozygosity estimates (F = 0.58 and 0.38) similar to that of the ambiguous male noted previously. Collectively, these three individuals (coded as one male and two females) exhibit heterozygosity estimates that are consistent with chromosomal abnormalities.
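The threshold rule described for the sex check can be written as a small classifier. This is a simplified sketch and not PLINK's implementation: the function names are hypothetical, and only the 0.20 and 0.80 cut-offs and the six F values come from the text.

```python
def classify_x_homozygosity(f, female_max=0.20, male_min=0.80):
    """Infer sex from an X chromosome homozygosity (F) estimate."""
    if f > male_min:
        return "male"
    if f < female_max:
        return "female"
    return "ambiguous"


def sex_check(reported, f):
    """Return (inferred_sex, flagged); flagged when inferred and reported sex disagree."""
    inferred = classify_x_homozygosity(f)
    return inferred, inferred != reported


# The six individuals singled out in the text: (self-reported sex, F estimate).
cases = [
    ("male", 0.16), ("male", 0.10), ("male", 0.01), ("male", 0.36),
    ("female", 0.58), ("female", 0.38),
]
results = [sex_check(reported, f) for reported, f in cases]
for (reported, f), (inferred, flagged) in zip(cases, results):
    print(f"reported={reported:<6} F={f:.2f} inferred={inferred:<9} flagged={flagged}")
```

Under this rule the first three self-reported males are inferred as female, and the remaining three individuals fall in the ambiguous band, matching the description above.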
No samples were removed at this step, as there is no evidence of widespread sample mix-ups or linking file issues.

Duplicate Concordance

There were 53 duplicate pairs that passed initial QA/QC included in the Add Health sibling pairs file. Pairs with pairwise mean IBD (PI_HAT) values that exceed 0.90 are thought to be either duplicate samples or MZ twins (PI_HAT values have a maximum of 1.0, indicating perfect concordance). Equivalently, pairs with pairwise Kinship estimates above 0.354 are considered duplicate samples or MZ twins (Manichaikul et al., 2010) (Kinship estimates have a maximum of 0.5, indicating perfect concordance). The average mean IBD (PI_HAT) for the 53 duplicate pairs is 0.9995, with a minimum of 0.9987 and a maximum of 1.0. The average Kinship estimate for the same 53 duplicate pairs is 0.4998, with a minimum of 0.4996 and a maximum of 0.5. Overall, the concordance for these duplicate pairs is quite high using both estimation methods.

Supplemental Table 1: Number of SNP markers per chromosome for the QA/QC Report

Chromosome         All        Call Rate > 95%
1                  79,267     77,907
2                  74,584     73,218
3                  60,733     59,576
4                  56,854     55,756
5                  55,042     54,014
6                  71,959     70,424
7                  50,291     49,240
8                  50,192     49,295
9                  44,306     43,535
10                 50,524     49,570
11                 47,380     46,475
12                 46,078     45,254
13                 33,572     32,933
14                 29,098     28,613
15                 28,416     27,962
16                 30,304     29,753
17                 26,876     26,391
18                 26,841     26,388
19                 21,317     20,803
20                 26,195     25,749
21                 13,503     13,257
22                 13,646     13,396
Total (1-22)       936,978    919,509
X                  22,549     21,353
Total (1-22, X)    959,527    940,862

Supplemental Figure 1: Proportion of missing genotypes versus heterozygosity rates among the 2,020 samples. Horizontal lines correspond to 2x the standard deviation of heterozygosity rate, while the vertical line corresponds to a missing data rate of 0.03 (97% genotyping call rate).
[Supplemental Figure 1 plot: heterozygosity rate (y-axis) against proportion of missing genotypes (x-axis, log scale from 0.001 to 1).]

Supplemental Figure 2: X chromosome homozygosity for individuals who are coded in Add Health as male (A) and female (B).

[Supplemental Figure 2 plots: A, All Male Samples; B, All Female Samples. Each panel shows frequency (y-axis) by X chromosome homozygosity estimate (x-axis).]

Supplemental Figure 3: MDS principal coordinate (PC) estimates generated by KING. A: PC1 vs PC2. B: PC2 vs PC3. C: PC3 vs PC4. D: PC4 vs PC5.

Supplemental Figure 4: Genetic ancestry by self-identified ethnic group, where CEU = Europe, CHB = China, JPT = Japan, YRI = Africa and AMR = America. 4A: Self-Identified White. 4B: Self-Identified Black. 4C: Self-Identified Hispanic. 4D: Self-Identified Native American. 4E: Self-Identified Asian.

Supplemental Figure 5: Quantile-Quantile Plot of the unweighted GWAS p-values
