Supplemental Materials and Methods
Subjects: SAGE subjects were recruited from 8 different study sites in 7 states and the District
of Columbia; the majority of subjects (62%) were recruited in Missouri (Bierut et al, 2010).
COGA subjects were recruited from 7 centers in the United States (Edenberg et al, 2010). All
subjects were interviewed using the Semi-Structured Assessment for the Genetics of Alcoholism
(SSAGA) (Bucholz et al, 1994). Affected subjects were excluded if they had schizophrenia or
other psychotic illnesses (n=314). These alcohol-dependent subjects had comorbid marijuana
dependence (34.8%), opioid dependence (13.8%), nicotine dependence (61.0%), cocaine
dependence (48.2%) or other illicit substance (stimulants, sedatives) dependence (5.8%).
Genotyping platform: All samples were genotyped on the Illumina Human 1M beadchip at the
Center for Inherited Disease Research (CIDR) at Johns Hopkins University (Baltimore, MD
USA). Allele cluster definitions for each marker were determined using Illumina BeadStudio
Genotyping Module version 3.1.14 and the combined intensity data from the samples.
Data cleaning: Before statistical analysis, we cleaned the phenotype data first and then the
genotype data. Before being released to the public, the SAGE project had 4,324 subjects and the
COGA project had 1,989 subjects. At release to the public, the dbGaP data had already been
cleaned using general quality control exclusion criteria, e.g., poor genotypic data, questionable
connection between genotype and phenotype. The mean non-Y SNP call rate and mean sample
call rate were both 99.7% for the released dataset. The study duplicate reproducibility was 99.98%.
The genotype concordance rate in the overlapping subjects (n=1,477) between SAGE and COGA
was 99.98%. Before cleaning the data in this study (for detailed steps and numbers, see
Supplemental Table S1), we combined the SAGE and COGA samples. When a subject was
included in both the SAGE and COGA studies, the SAGE record was retained; as a result,
only 480 COGA subjects with genotype data were merged into the SAGE sample.
Then, we excluded subjects with missing or inconclusive diagnostic information that prevented
them from being reliably classified as cases or controls. The subjects with allele discordance,
duplicated IDs, potential sample misidentification, sample relatedness, other sample
misspecification, gender anomalies, chromosome anomalies (such as aneuploidy and mosaic cell
populations), missing race, non-EA and non-AA ethnicity, and population group outliers were
also screened out step-by-step. Then we used principal components analysis (PCA) and Bayesian
approaches (see below) to classify subjects on the basis of genetic ethnicity and to filter out
subjects (n=12) where there was a mismatch between self-identified and genetically-inferred
ethnicity.
After cleaning, EA and AA subjects constituted groups with high levels of ancestral
homogeneity (Supplemental Figure S3), as supported by the application of three criteria
(self-identification, principal component eigenvalues, and ancestry proportions). This differed from
previous studies using the same datasets. After filtering out the subjects with a missing genotype
call rate ≥2% across all markers, the final sample comprised 4,116 individuals: 1,409 EA
cases, 1,518 EA controls, 681 AA cases and 508 AA controls.
Originally, the Illumina Human 1M beadchip (excluding the number of intensity-only
probes) had 1,049,008 markers. At release to dbGaP, these markers had been cleaned by general
quality control. We merged SAGE and COGA samples, and then excluded the markers with
minor allele frequency (MAF) difference > 2% and missing rate difference > 2% between SAGE
and COGA, and the markers with allele discordance or chromosomal anomalies. There were no
outliers that would indicate a strong batch effect once ethnicity was taken into account. We then
filtered out the markers on all chromosomes with an overall missing genotype call rate ≥2%, the
monomorphic markers, and the SNPs with minor allele frequencies (MAFs) <0.01 in either EAs
or AAs. The SNPs that deviated from Hardy-Weinberg equilibrium (HWE; P < 10⁻⁴) within EA or
AA controls were also excluded (note: in EA and AA controls combined, there were 145,043 SNPs
in significant Hardy-Weinberg disequilibrium, HWD).
This selection process yielded 805,814 markers in EAs and 895,714 markers in AAs (see
Supplemental Table S1).
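The marker-level filters described above (missing call rate, monomorphic markers, MAF, and HWE in controls) can be illustrated with a minimal sketch for a single marker within one population. The source does not state which HWE test was used; a 1-df Pearson chi-square test is assumed here, and the function names and the within-population missing-rate calculation are illustrative simplifications rather than the authors' pipeline.

```python
import math


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """1-df Pearson chi-square HWE test from genotype counts (assumed test)."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic marker: HWE is not testable
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 df.
    return math.erfc(math.sqrt(chi2 / 2.0))


def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Minor allele frequency from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    return min(p, 1.0 - p)


def passes_marker_qc(case_counts, control_counts, n_missing,
                     max_missing=0.02, min_maf=0.01, hwe_alpha=1e-4):
    """Apply the thresholds quoted in the text to one marker.

    case_counts and control_counts are (n_AA, n_AB, n_BB) tuples for one
    population; n_missing is the number of subjects without a genotype call.
    """
    called = sum(case_counts) + sum(control_counts)
    total = called + n_missing
    if called == 0:
        return False
    if n_missing / total >= max_missing:  # missing call rate >= 2%
        return False
    combined = tuple(a + b for a, b in zip(case_counts, control_counts))
    maf = minor_allele_frequency(*combined)
    if maf == 0.0 or maf < min_maf:  # monomorphic or MAF < 0.01
        return False
    if hwe_chi2_pvalue(*control_counts) < hwe_alpha:  # HWE tested in controls only
        return False
    return True
```

In this sketch a marker with control genotype counts of (500, 0, 500) is rejected for HWE deviation, whereas one with (250, 500, 250) is retained if it also passes the other thresholds.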
The cleaning process yielded high-quality data for association analysis, as evidenced by
the following: (1) The homogeneity of the two samples was very high; that is, EAs and AAs
were well differentiated (see bar plots in Supplemental Figure S3). (2) The observed and
expected p-values for the associations agreed closely within EAs or AAs (see QQ plots in
Supplemental Figures S4a,b). (3) We also computed from these p-values a low genomic
inflation factor (GIF) of 1.07 in EAs and 1.03 in AAs. The GIF is defined as the ratio of the
median of the empirically observed distribution of the test statistic to the expected median, thus
quantifying the extent of the bulk inflation and the excess false positive rate. It is calculated for
the genomic control analysis and reflects the maximum possible inflation factor if the
associations are affected by population stratification. A GIF with a departure of less than 0.1
from a value of 1.0 is considered an indicator of very good quality data. Finally, the raw
genotype clusters and intensity plots for the top-ranked 33 SNPs (Table 1) and 28 risk markers
in the TNN-KIAA0040 region (Table 2) were visually inspected as an additional quality control
check after association analysis, which showed that all SNPs were accurately clustered and had
no significant batch effects.
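The GIF definition given above (median of the observed test statistics divided by the expected median) can be written out as a short sketch. It assumes that each association p-value comes from a two-sided test whose statistic follows a 1-df chi-square distribution under the null; the function name is illustrative and not taken from the authors' software.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()

# Median of a chi-square distribution with 1 df (about 0.455).
EXPECTED_MEDIAN_CHI2_1DF = _STD_NORMAL.inv_cdf(0.25) ** 2


def genomic_inflation_factor(p_values):
    """Ratio of the observed median 1-df chi-square statistic to its expectation.

    p_values must lie in (0, 1]. Each p-value is converted back to a
    chi-square statistic through the squared normal quantile at p / 2.
    """
    chi2_stats = [_STD_NORMAL.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2_stats) / EXPECTED_MEDIAN_CHI2_1DF
```

A value close to 1.0 indicates that the bulk of the test statistics follows the null distribution; values above 1.0 indicate inflation.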
Data analytic procedure summary: In the present study, the EA case-control sample served as
the discovery sample, and the AA case-control sample served as the replication sample.
Genome-wide association analysis was performed in the discovery sample to identify genome-wide
significant risk SNPs (p<WTCCC-defined α plus FDR<0.05). Associations of all available
SNPs in the significant risk genes were then retested in AAs, and correlation analysis was performed
on the distributions of -log(p) values between EAs and AAs to identify the consistent risk regions.
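As an illustration of the correlation step, the sketch below correlates -log10(p) values for the SNPs shared between the two populations within one region. The source does not specify the correlation coefficient or the logarithm base; Pearson's r and base 10 are assumed here, and the function names are hypothetical.

```python
import math


def neg_log10(p_values):
    """Transform p-values to the -log10 scale."""
    return [-math.log10(p) for p in p_values]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # Undefined (division by zero) if either sequence is constant.
    return sxy / math.sqrt(sxx * syy)


def region_concordance(p_ea, p_aa):
    """Correlate -log10(p) across SNPs present in both populations.

    p_ea and p_aa map SNP identifiers to association p-values in the
    EA and AA samples, respectively, for one candidate region.
    """
    shared = sorted(set(p_ea) & set(p_aa))
    x = neg_log10([p_ea[snp] for snp in shared])
    y = neg_log10([p_aa[snp] for snp in shared])
    return pearson_r(x, y)
```

A high positive coefficient for a region would indicate that the association signal rises and falls across the same SNPs in both populations.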
In addition to the replication design for the gene-disease associations between EAs and
AAs, we also performed functional expression quantitative trait locus (eQTL) analysis on the
risk SNPs as confirmation design, which included (a) cis-eQTL analysis on mRNA expression
levels in lymphoblastoid cell lines, (b) cis-eQTL analysis on exon-/transcript-level expression
changes in peripheral blood mononuclear cells (PBMCs) and cortical brain tissues, (c)
transcriptome-wide trans-acting eQTL analysis on transcript expression, and (d) alteration of
RNA secondary structure by these risk variants. Furthermore, genome-wide trans-acting
regulatory effects on the transcript expression of the risk gene and correlations of transcript
expression between the risk gene and other genes across the transcriptome were analyzed.
Estimation of population structure: Population structure was first evaluated using PCA
implemented in the software package EIGENSTRAT (Price et al, 2006) (with all autosomal
SNPs having a call rate >95%). Each individual received scores on each principal component.
Because related subjects, non-EA and non-AA subjects, and any population group outliers were
excluded, their effect on PCA was removed. The first principal component (PC1) separated the
self-identified EA and AA subjects very well, which was highly consistent with a previous report
(Bierut et al, 2010); the PCA plots are therefore not presented again in this study. PC1 was used
to measure the continuous ethnicity variance for EAs and AAs. A PC1 cut-off value of 0
separated the “genetic” EAs and AAs. A total of 12 subjects were mismatched between “genetic”
and self-identified ethnicity. The second principal component separated the self-identified
Hispanic subjects from the non-Hispanic subjects. Other principal components did not improve
separation among the major ethnic groups and accounted for very small fractions of the total
variance (<0.1%), and thus were ignored in our analysis.
Estimation of population admixture: To measure the degree of admixture in these subjects, we
estimated the ancestry proportions for each individual. A model-based clustering method was
used to examine the proportions by utilizing the ancestry information content of a set of
ancestry-informative markers (AIMs). The algorithm used for clustering is a Bayesian approach
implemented in the program STRUCTURE (Pritchard et al, 2000).
The AIMs were selected by the following procedure: (a) Among the 991,617 markers
(generally cleaned by allele discordance, chromosomal anomalies, batch effects and missing
genotype call rate ≥ 2%; Supplemental Table S1), alleles of 754,259 SNPs (76%) differed
between EAs and AAs at a genome-wide significance level (p<10⁻⁸), and thus were selected. (b)
Among the 754,259 AIMs, we excluded SNPs (n=280,276) that were nominally associated
(p<0.05) with the disease, to ensure that the AIMs reflected only ethnic differences, rather than
disease status, although p<0.05 was highly conservative for this purpose. (c) 210,691 SNPs were
then excluded because of Hardy-Weinberg Disequilibrium (HWD) (p<0.05). This ensured the
independence of different alleles within each marker, which was required for the accuracy of the
prediction of ancestry proportion by the Bayesian approach; p<0.05 was highly conservative for
this purpose as well. (d) Finally, 3,172 completely independent (r²=0) SNPs were selected from
the AIMs after linkage disequilibrium (LD) pruning. In summary, all AIMs showed a difference in allele frequency
between EAs and AAs at a genome-wide significance level, had no association with disease, and
were in HWE. The selection of AIMs from such a large marker set is a strength of the current
study as it increased the accuracy of the classification of EAs and AAs, enhanced the detection
of population structure, provided information that enabled all individuals to be assigned to
genetic ancestral populations, enabled analysis by the STRUCTURE program, and accurately
estimated ancestry proportions for each subject. Using these completely independent and highly
ancestry-informative markers, two “genetic” populations were separated very well (K=2; cut-off
value of ancestry proportions was 0.5; see Supplemental Figure S3). When we compared the
results with those from other newly developed programs, e.g., EigenSoft and ShrunkenPCA, the
consistency of the ancestry prediction was extremely high (data not shown). We excluded the
subjects (n=7) for which self-identified ethnicity and “genetic” ethnicity were discordant.
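The classification rule described above (K=2, ancestry-proportion cut-off of 0.5) and the discordance check can be sketched as follows. The sketch assumes that each subject's ancestry proportion has already been estimated, for example by STRUCTURE, and is expressed as the proportion assigned to the African-ancestry cluster; that orientation, the handling of a proportion of exactly 0.5, and the function names are assumptions made for illustration.

```python
def genetic_ancestry(african_proportion, cutoff=0.5):
    """Assign a 'genetic' population label from a K=2 ancestry proportion.

    Assumes the proportion refers to the African-ancestry cluster; a value
    exactly at the cut-off is assigned to EA here (arbitrary choice).
    """
    return "AA" if african_proportion > cutoff else "EA"


def discordant_subjects(subjects, cutoff=0.5):
    """Return IDs whose self-identified and 'genetic' labels disagree.

    subjects is an iterable of (subject_id, self_identified_label,
    african_proportion) tuples, with labels "EA" or "AA".
    """
    return [
        subject_id
        for subject_id, self_identified, proportion in subjects
        if genetic_ancestry(proportion, cutoff) != self_identified
    ]
```

Subjects returned by `discordant_subjects` correspond to those that would be excluded under the rule stated in the text.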
Detection of population heterogeneity and controlling for stratification and admixture
effects: In our analyses, we detected highly significant population heterogeneity in the samples:
(a) The allele frequency difference in 76% of markers from the Illumina Human 1M beadchip
reached the genome-wide significance level (p<10⁻⁸) between EAs and AAs, which showed that
EAs and AAs were genetically distinct populations. Population-specific allele frequency
distributions, which are usually related to evolutionary history, often lead to population-specific
LD between individual markers and disease variants. Thus, distinct populations do not
necessarily have the same risk markers associated with disease; alternatively, they could have the
same risk markers, but different phases of alleles might be associated with disease. However,
although two populations might not have common risk markers, they could have common causal
variants. For example, if one rare allele of a marker confers risk for a disease in one population but its
counterpart allele, which is also rare, confers risk for that disease in another population, this may just
indicate that the same phase of allele of the putative causal variant causes the disease in both
populations; that causal allele is in LD with opposite phases of alleles of that marker in two
populations, respectively. Therefore, in the present study, we aimed to identify the risk regions,
not individual markers, which overlapped between two populations. (b) 25,393 and 28,909 SNPs
were in significant HWD (p<10⁻⁴) in EA and AA controls, respectively. However, 145,043 SNPs
were in significant HWD in EA+AA controls, which was nearly triple (about 2.7 times) the sum of
the numbers of SNPs in HWD in EAs only and AAs only (54,302). Population heterogeneity could result in significant
population stratification effects in mixed populations that may lead to inflation of the marker
number with HWE violation.
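The inflation of HWE violations in pooled samples follows from the Wahlund effect: pooling two populations that are each in HWE but differ in allele frequency produces a deficit of heterozygotes relative to HWE expectations. The sketch below illustrates this with hypothetical allele frequencies and sample sizes (not values from this study), using a 1-df Pearson chi-square statistic as an assumed HWE test.

```python
def hwe_genotype_counts(n, p):
    """Expected genotype counts (AA, AB, BB) for n subjects in exact HWE."""
    q = 1.0 - p
    return (n * p * p, 2 * n * p * q, n * q * q)


def hwe_chi2(counts):
    """1-df Pearson chi-square statistic for deviation from HWE.

    Assumes a polymorphic marker (both alleles observed).
    """
    n = sum(counts)
    p = (2 * counts[0] + counts[1]) / (2 * n)  # frequency of allele A
    expected = hwe_genotype_counts(n, p)
    return sum((o - e) ** 2 / e for o, e in zip(counts, expected))


# Hypothetical example: two populations of 1,000 subjects each, both in
# exact HWE, with allele A frequencies of 0.1 and 0.9.
pop1 = hwe_genotype_counts(1000, 0.1)   # about (10, 180, 810)
pop2 = hwe_genotype_counts(1000, 0.9)   # about (810, 180, 10)
pooled = tuple(a + b for a, b in zip(pop1, pop2))  # about (820, 360, 820)

chi2_pop1 = hwe_chi2(pop1)      # essentially 0: no deviation
chi2_pop2 = hwe_chi2(pop2)      # essentially 0: no deviation
chi2_pooled = hwe_chi2(pooled)  # large: strong heterozygote deficit
```

In this example the pooled allele frequency is 0.5, so HWE predicts 1,000 heterozygotes, but only 360 are present; the pooled sample departs strongly from HWE even though neither population does, which is why HWE was tested within each population separately.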
The population stratification and admixture effects in association analysis were taken into
account through structured association analysis, which included two components, i.e., controlling
for the between-population stratification effects and within-population admixture effects in the
association analysis. To control for the population stratification effects, different populations
were analyzed separately; to control for the admixture effects, ancestry proportions were
included as covariates in the analysis.