Supplemental Materials and Methods

Subjects: SAGE subjects were recruited from 8 different study sites in 7 states and the District of Columbia; the majority of subjects (62%) were recruited in Missouri (Bierut et al, 2010). COGA subjects were recruited from 7 centers in the United States (Edenberg et al, 2010). All subjects were interviewed using the Semi-Structured Assessment for the Genetics of Alcoholism (SSAGA) (Bucholz et al, 1994). Affected subjects were excluded if they had schizophrenia or other psychotic illnesses (n=314). The alcohol-dependent subjects had comorbid marijuana dependence (34.8%), opioid dependence (13.8%), nicotine dependence (61.0%), cocaine dependence (48.2%), or dependence on other illicit substances (stimulants, sedatives) (5.8%).

Genotyping platform: All samples were genotyped on the Illumina Human 1M beadchip at the Center for Inherited Disease Research (CIDR) at Johns Hopkins University (Baltimore, MD, USA). Allele cluster definitions for each marker were determined using Illumina BeadStudio Genotyping Module version 3.1.14 and the combined intensity data from the samples.

Data cleaning: Before statistical analysis, we cleaned the phenotype data first and then the genotype data. Before release to the public, the SAGE project had 4,324 subjects and the COGA project had 1,989 subjects. At release, the dbGaP data had already been cleaned using general quality control exclusion criteria, e.g., poor genotypic data or a questionable connection between genotype and phenotype. The mean non-Y SNP call rate and the mean sample call rate were both 99.7% for the released dataset. The study duplicate reproducibility was 99.98%. The genotype concordance rate in the overlapping subjects (n=1,477) between SAGE and COGA was 99.98%. Before cleaning the data in this study (for detailed steps and numbers, see Supplemental Table S1), we combined the SAGE and COGA samples.
When a subject was included in both the SAGE and COGA studies, the SAGE record was retained; ultimately, only 480 COGA subjects with genotype data were merged into the SAGE sample. We then excluded subjects with missing or inconclusive diagnostic information that prevented them from being reliably classified as cases or controls. Subjects with allele discordance, duplicated IDs, potential sample misidentification, sample relatedness, other sample misspecification, gender anomalies, chromosome anomalies (such as aneuploidy and mosaic cell populations), missing race, non-EA and non-AA ethnicity, and population group outliers were also screened out step by step. We then used principal components analysis (PCA) and Bayesian approaches (see below) to classify subjects on the basis of genetic ethnicity and to filter out subjects (n=12) with a mismatch between self-identified and genetically inferred ethnicity. After cleaning, EA and AA subjects constituted groups with high levels of ancestral homogeneity (Supplemental Figure S3), as supported by the application of three criteria (self-identification, principal component eigenvalues, and ancestry proportions). This differed from previous studies using the same datasets. After filtering out the subjects with a missing genotype call rate ≥2% across all markers, the final sample included 4,116 individuals: 1,409 EA cases, 1,518 EA controls, 681 AA cases, and 508 AA controls.

Originally, the Illumina Human 1M beadchip (excluding intensity-only probes) had 1,049,008 markers. At release to dbGaP, these markers had been cleaned by general quality control. We merged the SAGE and COGA samples, and then excluded the markers with a minor allele frequency (MAF) difference > 2% and a missing rate difference > 2% between SAGE and COGA, and the markers with allele discordance or chromosomal anomalies.
There were no outliers that would indicate a strong batch effect once ethnicity was taken into account. We then filtered out the markers on all chromosomes with an overall missing genotype call rate ≥2%, the monomorphic markers, and the SNPs with MAFs <0.01 in either EAs or AAs. The SNPs that deviated from Hardy-Weinberg equilibrium (HWE) (p<10⁻⁴) within EA or AA controls were also excluded (note: in EA and AA controls combined, 145,043 SNPs were in significant Hardy-Weinberg disequilibrium, HWD). This selection process yielded 805,814 markers in EAs and 895,714 markers in AAs (see Supplemental Table S1).

The cleaning process yielded high-quality data for association analysis, as evidenced by the following: (1) The homogeneity of the two samples was very high; that is, EAs and AAs were well differentiated (see bar plots in Supplemental Figure S3). (2) The observed and expected p-values for the associations agreed closely within EAs and within AAs (see QQ plots in Supplemental Figures S4a,b). (3) From these p-values we computed a low genomic inflation factor (GIF) of 1.07 in EAs and 1.03 in AAs. The GIF is defined as the ratio of the median of the empirically observed distribution of the test statistic to the expected median, and thus quantifies the extent of the bulk inflation and the excess false positive rate. It is calculated for the genomic control analysis and reflects the maximum possible inflation factor if the associations are affected by population stratification. A GIF that departs by less than 0.1 from a value of 1.0 is considered an indicator of very good data quality. Finally, the raw genotype clusters and intensity plots for the 33 top-ranked SNPs (Table 1) and the 28 risk markers in the TNN-KIAA0040 region (Table 2) were visually inspected as an additional quality control check after association analysis; this showed that all SNPs were accurately clustered and had no significant batch effects.
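As an illustration of the GIF definition in point (3) above, the following is a minimal sketch, not the procedure actually used in this study. It assumes that each association p-value comes from a two-sided test that can be mapped to a 1-degree-of-freedom chi-square statistic; the function names and the example p-values are hypothetical.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def p_to_chi2(p):
    """Map a two-sided p-value (0 < p <= 1) to a 1-df chi-square statistic.

    A 1-df chi-square variable is the square of a standard normal variable,
    so the statistic is the squared normal quantile at 1 - p/2.
    """
    z = _STD_NORMAL.inv_cdf(1.0 - p / 2.0)
    return z * z


def genomic_inflation_factor(p_values):
    """Ratio of the observed median test statistic to the expected median."""
    observed_median = median(p_to_chi2(p) for p in p_values)
    expected_median = p_to_chi2(0.5)  # about 0.455 for 1 df
    return observed_median / expected_median


# Hypothetical p-values, for illustration only (not data from this study).
null_like = [0.1, 0.3, 0.5, 0.7, 0.9]      # median p = 0.5
inflated = [0.05, 0.2, 0.4, 0.6, 0.8]      # median p < 0.5

print(genomic_inflation_factor(null_like))  # → 1.0
print(genomic_inflation_factor(inflated))   # greater than 1.0
```

Under this definition, a set of p-values whose median is 0.5 gives a GIF of exactly 1.0, and p-values shifted toward zero give a GIF above 1.0.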
Data analytic procedure summary: In the present study, the EA case-control sample served as the discovery sample, and the AA case-control sample served as the replication sample. Genome-wide association analysis was performed in the discovery sample to identify genome-wide significant risk SNPs (p below the WTCCC-defined α and FDR<0.05). Associations of all available SNPs in the significant risk genes were retested in AAs; we then performed correlation analysis on the distributions of -log(p) values between EAs and AAs to identify the consistent risk regions. In addition to the replication design for the gene-disease associations between EAs and AAs, we also performed functional expression quantitative trait locus (eQTL) analysis on the risk SNPs as a confirmation design, which included (a) cis-eQTL analysis on mRNA expression levels in lymphoblastoid cell lines, (b) cis-eQTL analysis on exon-/transcript-level expression changes in peripheral blood mononuclear cells (PBMCs) and cortical brain tissues, (c) transcriptome-wide trans-acting eQTL analysis on transcript expression, and (d) analysis of alterations of RNA secondary structure by these risk variants. Furthermore, genome-wide trans-acting regulatory effects on the transcript expression of the risk gene and correlations of transcript expression between the risk gene and other genes across the transcriptome were analyzed.

Estimation of population structure: Population structure was first evaluated using PCA implemented in the software package EIGENSTRAT (Price et al, 2006) (with all autosomal SNPs having a call rate >95%). Each individual received scores on each principal component. Because related subjects, non-EA and non-AA subjects, and any population group outliers had been excluded, their effect on PCA was removed.
The first principal component (PC1) separated the self-identified EA and AA subjects very well, which was highly consistent with a previous report (Bierut et al, 2010); the PCA plots are therefore not presented again in this study. PC1 was used to measure the continuous ethnicity variance for EAs and AAs. The cut-off value (=0) of PC1 separated the “genetic” EAs and AAs. A total of 12 subjects were mismatched between “genetic” and self-identified ethnicity. The second principal component separated the self-identified Hispanic subjects from the non-Hispanic subjects. Other principal components did not improve separation among the major ethnic groups and accounted for very small fractions of the total variance (<0.1%), and thus were ignored in our analysis.

Estimation of population admixture: To measure the degree of admixture in these subjects, we estimated the ancestry proportions for each individual. A model-based clustering method was used to estimate the proportions by utilizing the ancestry information content of a set of ancestry-informative markers (AIMs). The clustering algorithm is a Bayesian approach implemented in the program STRUCTURE (Pritchard et al, 2000). The AIMs were selected by the following procedure: (a) Among the 991,617 markers (after general cleaning for allele discordance, chromosomal anomalies, batch effects and missing genotype call rate ≥ 2%; Supplemental Table S1), the allele frequencies of 754,259 SNPs (76%) differed between EAs and AAs at a genome-wide significance level (p<10⁻⁸), and these SNPs were thus selected. (b) Among the 754,259 AIMs, we excluded SNPs (n=280,276) that were nominally associated (p<0.05) with the disease, to ensure that the AIMs reflected only ethnic differences rather than disease status, although p<0.05 was highly conservative for this purpose. (c) 210,691 SNPs were then excluded because of Hardy-Weinberg Disequilibrium (HWD) (p<0.05).
This ensured the independence of different alleles within each marker, which was required for the accuracy of the prediction of ancestry proportion by the Bayesian approach; p<0.05 was highly conservative for this purpose as well. (d) Finally, 3,172 completely independent (r²=0) SNPs were selected from the AIMs after LD pruning. In summary, all AIMs showed a difference in allele frequency between EAs and AAs at a genome-wide significance level, had no association with disease, and were in HWE. The selection of AIMs from such a large marker set is a strength of the current study, as it increased the accuracy of the classification of EAs and AAs, enhanced the detection of population structure, provided information that enabled all individuals to be assigned to genetic ancestral populations, enabled analysis by the STRUCTURE program, and allowed accurate estimation of ancestry proportions for each subject. Using these completely independent and highly ancestry-informative markers, two “genetic” populations were separated very well (K=2; cut-off value of ancestry proportions was 0.5; see Supplemental Figure S3). When we compared the results with those from other newly developed programs, e.g., EigenSoft and ShrunkenPCA, the consistency of the ancestry prediction was extremely high (data not shown). We excluded the subjects (n=7) for whom self-identified ethnicity and “genetic” ethnicity were discordant.

Detection of population heterogeneity and controlling for stratification and admixture effects: In our analyses, we detected highly significant population heterogeneity in the samples: (a) The allele frequency difference in 76% of markers from the Illumina Human 1M beadchip reached the genome-wide significance level (p<10⁻⁸) between EAs and AAs, which showed that EAs and AAs were genetically distinct populations.
Population-specific allele frequency distributions, which are usually related to evolutionary history, often lead to population-specific LD between individual markers and disease variants. Thus, distinct populations do not necessarily have the same risk markers associated with disease; alternatively, they could have the same risk markers, but different phases of alleles might be associated with disease. However, although two populations might not have common risk markers, they could have common causal variants. For example, if one rare allele of a marker confers risk for a disease in one population but its counterpart allele, which is also rare, confers risk for that disease in another population, this may simply indicate that the same allele of the putative causal variant causes the disease in both populations, and that this causal allele is in LD with opposite alleles of the marker in the two populations. Therefore, in the present study, we aimed to identify the risk regions, not individual markers, that overlapped between the two populations. (b) 25,393 and 28,909 SNPs were in significant HWD (p<10⁻⁴) in EA and AA controls, respectively. However, 145,043 SNPs were in significant HWD in EA+AA controls, which was nearly triple the sum (54,302) of the numbers of SNPs in HWD in EAs only and AAs only. Population heterogeneity could result in significant population stratification effects in mixed populations, which may inflate the number of markers that violate HWE. The population stratification and admixture effects in association analysis were taken into account through structured association analysis, which included two components, i.e., controlling for the between-population stratification effects and the within-population admixture effects in the association analysis.
To control for the population stratification effects, different populations were analyzed separately; to control for the admixture effects, ancestry proportions were included as covariates in the analysis.
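The inflation of HWD in the combined EA+AA controls noted in point (b) above can be illustrated with a minimal sketch. It assumes a simple 1-degree-of-freedom chi-square goodness-of-fit test for HWE, which may differ from the test actually applied in this study; the genotype counts, function name, and variable names are hypothetical.

```python
from statistics import NormalDist


def hwe_chi2(n_aa, n_ab, n_bb):
    """1-df chi-square statistic for departure from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Chi-square critical value (1 df) for p = 1e-4, obtained from the normal
# quantile because a 1-df chi-square is a squared standard normal.
CRITICAL_1E4 = NormalDist().inv_cdf(1.0 - 1e-4 / 2.0) ** 2

# Hypothetical genotype counts (AA, AB, BB) for one marker in two populations.
# Each population is in exact HWE, but allele A has frequency 0.1 in one
# population and 0.9 in the other.
pop1 = (10, 180, 810)
pop2 = (810, 180, 10)
pooled = tuple(a + b for a, b in zip(pop1, pop2))

for label, counts in (("pop1", pop1), ("pop2", pop2), ("pooled", pooled)):
    stat = hwe_chi2(*counts)
    print(label, round(stat, 3), "HWD" if stat > CRITICAL_1E4 else "HWE")
```

In this constructed example each population alone shows no departure from HWE, whereas the pooled sample has a deficit of heterozygotes and a chi-square statistic of roughly 819, far above the critical value. This is consistent with the rationale for testing HWE and running the association analysis within each population separately.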