Module-based Association Analysis for Omics Data with Network Structure

Zhi Wang1, Arnab Maity2, Chuhsing Kate Hsiao3, Deepak Voora4, Rima Kaddurah-Daouk5, Jung-Ying Tzeng1,2,6

1: Bioinformatics Research Center, North Carolina State University, Raleigh NC, 27695, USA
2: Department of Statistics, North Carolina State University, Raleigh NC, 27695, USA
3: Institute of Epidemiology and Preventive Medicine, College of Public Health, National Taiwan University, Taipei, Taiwan
4: Institute for Genome Sciences and Policy, Duke University, Durham, NC, USA
5: Department of Psychiatry and Behavioral Sciences, Duke University, Durham, NC, USA
6: Department of Statistics, National Cheng-Kung University, Taiwan, R.O.C.

RUNNING TITLE: Module-based analysis for structured omics data

ADDRESS FOR CORRESPONDENCE: Jung-Ying Tzeng, Department of Statistics and Bioinformatics Research Center, North Carolina State University, Campus Box 7566, Raleigh NC, 27695, USA. Tel: 919-513-2723. Fax: 919-515-7315. E-mail: jytzeng@stat.ncsu.edu.

KEY WORDS: module structure; network structure; association analysis; metabolomics

APPENDIX

Derivation of the score test statistics and their distributions

Consider the linear mixed model representation given in model (3). As our primary interest is to test the variance components $\tau_1$, $\tau_2$, $\tau_{12}$, we propose to use the restricted maximum likelihood (REML) function to estimate the variance components $(\tau_1, \tau_2, \tau_{12}, \sigma)$. The REML log-likelihood under model (3) is

$$\ell_{REML}(\tau_1, \tau_2, \tau_{12}; \sigma) = -\{\log|V| + \log|X^T V^{-1} X| + Y^T P Y\}/2,$$

where $V = \tau_1 K_1 + \tau_2 K_2 + \tau_{12} K_{12} + \sigma I$ is the marginal variance of $Y$ and $P = V^{-1} - V^{-1} X (X^T V^{-1} X)^{-1} X^T V^{-1}$ is a projection matrix. The score functions based on the REML can be obtained as below (Harville 1977):

Under $H_0^{M_1*M_2}: \tau_{12} = 0$,

$$U_{\tau_{12}}(\hat\tau_1, \hat\tau_2, 0, \hat\sigma) = \left.\frac{\partial \ell_{REML}(\tau_1, \tau_2, \tau_{12}, \sigma)}{\partial \tau_{12}}\right|_{\tau_{12}=0,\ \tau_1=\hat\tau_1,\ \tau_2=\hat\tau_2,\ \sigma=\hat\sigma_{M_1*M_2}} = \frac{1}{2}\{Y^T P_{12} K_{12} P_{12} Y - tr(P_{12} K_{12})\}.$$
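As a brief sketch of the intermediate step behind the score function for $\tau_{12}$, the derivative can be obtained from two standard matrix-calculus identities for REML, stated here without proof. The generic symbols $\tau$ and $K$ (any one variance component and its kernel, so that $\partial V/\partial\tau = K$) are introduced only for this illustration.

```latex
% Two standard identities, with \partial V / \partial\tau = K:
\frac{\partial}{\partial\tau}\left\{\log|V| + \log|X^T V^{-1} X|\right\} = tr(PK),
\qquad
\frac{\partial P}{\partial\tau} = -PKP .

% Substituting into \ell_{REML} = -\{\log|V| + \log|X^T V^{-1} X| + Y^T P Y\}/2:
\frac{\partial \ell_{REML}}{\partial\tau}
  = -\frac{1}{2}\left\{ tr(PK) - Y^T P K P Y \right\}
  = \frac{1}{2}\left\{ Y^T P K P Y - tr(PK) \right\}.
```

Evaluating this expression with $K = K_{12}$ at $\tau_{12} = 0$ and at the estimated nuisance parameters replaces $P$ by $P_{12}$.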
Under $H_0^{M_1|M_2}: \tau_1 = 0$ with the constraint $\tau_{12} = 0$,

$$U_{\tau_1}(0, \hat\tau_2, 0, \hat\sigma) = \left.\frac{\partial \ell_{REML}(\tau_1, \tau_2, \tau_{12}, \sigma)}{\partial \tau_1}\right|_{\tau_{12}=0,\ \tau_1=0,\ \tau_2=\hat\tau_2,\ \sigma=\hat\sigma_{M_1|M_2}} = \frac{1}{2}\{Y^T P_1 K_1 P_1 Y - tr(P_1 K_1)\}.$$

Under $H_0^{M_2|M_1}: \tau_2 = 0$ with the constraint $\tau_{12} = 0$,

$$U_{\tau_2}(\hat\tau_1, 0, 0, \hat\sigma) = \left.\frac{\partial \ell_{REML}(\tau_1, \tau_2, \tau_{12}, \sigma)}{\partial \tau_2}\right|_{\tau_{12}=0,\ \tau_1=\hat\tau_1,\ \tau_2=0,\ \sigma=\hat\sigma_{M_2|M_1}} = \frac{1}{2}\{Y^T P_2 K_2 P_2 Y - tr(P_2 K_2)\},$$

where $P_t = V_t^{-1} - V_t^{-1} X (X^T V_t^{-1} X)^{-1} X^T V_t^{-1}$ for $t \in \{12, 1, 2\}$, with $V_{12} = \tau_1 K_1 + \tau_2 K_2 + \sigma I$, $V_1 = \tau_2 K_2 + \sigma I$ and $V_2 = \tau_1 K_1 + \sigma I$.

NULL DISTRIBUTION OF THE SCORE STATISTICS FOR GE TEST

Because the score statistics are not asymptotically normal (Tzeng and Zhang 2007), we use the first term of each score statistic as the test statistic. For the interaction test, the test statistic is $T_{M_1*M_2} = \frac{1}{2} Y^T P_{12} K_{12} P_{12} Y$. Define $\mu = X\beta$; then $T_{M_1*M_2} = \frac{1}{2}(Y - \mu)^T P_{12} K_{12} P_{12} (Y - \mu)$ because $X^T P_{12} = 0$. Further, we can rewrite

$$T_{M_1*M_2} = C^T \left(V^{1/2} P_{12} K_{12} P_{12} V^{1/2}/2\right) C,$$

where $C = V^{-1/2}(Y - \mu)$ follows a standard multivariate normal distribution. Define $e_i$ and $\lambda_i$ as the eigenvectors and eigenvalues of the matrix $V^{1/2} P_{12} K_{12} P_{12} V^{1/2}/2$, respectively; then

$$T_{M_1*M_2} = \sum_{i=1}^{n} \lambda_i (e_i^T C)^2 \equiv \sum_{i=1}^{L} \lambda_i \tilde{C}_i^2,$$

where the second sum runs over the $L$ non-zero eigenvalues and each $\tilde{C}_i^2$ follows a chi-square distribution with 1 degree of freedom. Therefore the distribution of $T_{M_1*M_2}$ can be approximated by the distribution of $\sum_{i=1}^{L} \hat\lambda_i \chi_1^2$, where the $\hat\lambda_i$'s are the non-zero eigenvalues of $V^{1/2} P_{12} K_{12} P_{12} V^{1/2}/2$ evaluated at $\tau_{12}=0$, $\tau_1=\hat\tau_1$, $\tau_2=\hat\tau_2$, $\sigma=\hat\sigma_{M_1*M_2}$. Hence, we can use a moment matching approach to obtain p-values (Duchesne and Lafaye De Micheaux 2010).

Above we use the interaction test as an example to derive the test statistic and its null distribution.
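The calculation of the interaction test statistic and its approximate null distribution can be sketched numerically as follows, assuming NumPy and SciPy are available. The function names (`reml_projection`, `sym_sqrt`, `score_test`), the simulated data, the element-wise product used to build `K12`, and the plugged-in nuisance values in `V0` are illustrative assumptions and are not taken from the paper. For the moment matching step, the sketch uses the simple two-moment (Satterthwaite) approximation $\kappa\chi^2_\nu$, not the Liu-Tang-Zhang method discussed in the cited reference.

```python
import numpy as np
from scipy.stats import chi2


def reml_projection(V, X):
    """P = V^{-1} - V^{-1} X (X' V^{-1} X)^{-1} X' V^{-1}."""
    Vi = np.linalg.inv(V)
    ViX = Vi @ X
    return Vi - ViX @ np.linalg.solve(X.T @ ViX, ViX.T)


def sym_sqrt(V):
    """Symmetric square root of a positive-definite matrix."""
    w, Q = np.linalg.eigh(V)
    return (Q * np.sqrt(w)) @ Q.T


def score_test(Y, X, K, V0):
    """Return (T, non-zero eigenvalues, p-value) for T = Y' P K P Y / 2.

    V0 is the marginal variance of Y under the null hypothesis, evaluated
    at the estimated nuisance variance components.
    """
    P = reml_projection(V0, X)
    PY = P @ Y
    T = 0.5 * PY @ K @ PY
    R = sym_sqrt(V0)
    B = 0.5 * R @ P @ K @ P @ R
    lam = np.linalg.eigvalsh((B + B.T) / 2)            # symmetrize for stability
    lam = lam[np.abs(lam) > 1e-10 * np.abs(lam).max()]  # keep non-zero eigenvalues
    # Two-moment (Satterthwaite) match: sum(lam_i * chi2_1) ~ kappa * chi2(nu)
    kappa = np.sum(lam**2) / np.sum(lam)
    nu = np.sum(lam) ** 2 / np.sum(lam**2)
    return T, lam, chi2.sf(T / kappa, nu)


# Simulated example (all values hypothetical).
rng = np.random.default_rng(0)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z1 = rng.normal(size=(n, 5))
Z2 = rng.normal(size=(n, 5))
K1 = Z1 @ Z1.T / 5
K2 = Z2 @ Z2.T / 5
K12 = K1 * K2                        # assumed interaction kernel (element-wise product)
V0 = 0.5 * K1 + 0.5 * K2 + np.eye(n)  # stands in for the fitted null variance
Y = rng.multivariate_normal(X @ np.array([1.0, 2.0]), V0)

T, lam, p = score_test(Y, X, K12, V0)
print(T, len(lam), p)
```

Because $P V P = P$, the eigenvalues sum to $tr(P_{12} K_{12})/2$, which is the second term of the score function; this gives a quick consistency check on the decomposition.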
By a similar argument, we can approximate the null distributions of $T_{M_1|M_2}$ and $T_{M_2|M_1}$ using the distribution of $\sum_{i=1}^{L} \hat\lambda_i \chi_1^2$, where the $\hat\lambda_i$'s are the non-zero eigenvalues of $V^{1/2} P_1 K_1 P_1 V^{1/2}/2$ evaluated at $\tau_{12}=0$, $\tau_1=0$, $\tau_2=\hat\tau_2$, $\sigma=\hat\sigma_{M_1|M_2}$, and of $V^{1/2} P_2 K_2 P_2 V^{1/2}/2$ evaluated at $\tau_{12}=0$, $\tau_1=\hat\tau_1$, $\tau_2=0$, $\sigma=\hat\sigma_{M_2|M_1}$, respectively.

EM ALGORITHM FOR THE REML ESTIMATES OF $\tau_1$ AND $\tau_2$ WHEN TESTING $H_0^{M_1*M_2}: \tau_{12} = 0$

Using the interaction test ($T_{M_1*M_2}$) as an example, we derive the EM algorithm for estimating the nuisance variance components (VCs), $\tau_1$, $\tau_2$, and $\sigma$, under $H_0^{M_1*M_2}$. The EM algorithms for estimating the nuisance VCs for the $M_1|M_2$ test and the $M_2|M_1$ test can be obtained by zeroing out the corresponding variance components. The derivation is similar to the one in Tzeng et al. (2011).

Let $u = A^T Y$ with $A^T A = I$ (the identity matrix of dimension $n - p$) and $A A^T = I - X (X^T X)^{-1} X^T$. Then $f(u \mid h_1, h_2)$ is a normal density with mean $A^T h_1 + A^T h_2$ and variance $\sigma I$, and it does not depend on the fixed effect $\beta$. Therefore, the REML estimators of $\tau_1$ and $\tau_2$ can be based on the marginal distribution

$$f(u) = \int\!\!\int f(u \mid h_1, h_2)\, f(h_1, h_2)\, dh_1\, dh_2.$$

This motivates the EM algorithm based on the observed data $u$ and the missing data $h_1$ and $h_2$. The complete data log likelihood is given by

$$
\begin{aligned}
\log f(u, h_1, h_2; \tau_1, \tau_2, \sigma) &= \log f(u \mid h_1, h_2; \tau_1, \tau_2, \sigma) + \log f(h_2; \tau_2) + \log f(h_1; \tau_1) \\
&= -\frac{n-p}{2}\log\sigma - \frac{1}{2\sigma}(u - A^T h_1 - A^T h_2)^T (u - A^T h_1 - A^T h_2) \\
&\quad - \frac{n}{2}\log\tau_1 - \frac{1}{2}\log|K_1| - \frac{1}{2\tau_1} h_1^T K_1^{-1} h_1 \\
&\quad - \frac{n}{2}\log\tau_2 - \frac{1}{2}\log|K_2| - \frac{1}{2\tau_2} h_2^T K_2^{-1} h_2 .
\end{aligned}
$$
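The transformation $u = A^T Y$ can be made concrete with a small numerical sketch, assuming NumPy is available. The helper name `reml_contrast_matrix`, the use of a complete QR decomposition to build $A$, and the simulated inputs are illustrative assumptions; any matrix with the two stated properties would serve equally well.

```python
import numpy as np


def reml_contrast_matrix(X):
    """Return A (n x (n-p)) with A'A = I and AA' = I - X (X'X)^{-1} X'.

    Assumes X has full column rank. The last n-p columns of the complete
    QR factor Q form an orthonormal basis of the orthogonal complement of
    the column space of X.
    """
    n, p = X.shape
    Q, _ = np.linalg.qr(X, mode="complete")
    return Q[:, p:]


rng = np.random.default_rng(1)
n, p = 8, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
A = reml_contrast_matrix(X)
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

# u is unchanged when only the fixed effect beta changes.
h = rng.normal(size=n)   # stands in for h1 + h2
e = rng.normal(size=n)   # residual error
u_a = A.T @ (X @ np.array([1.0, -2.0]) + h + e)
u_b = A.T @ (X @ np.array([50.0, 7.0]) + h + e)

# The REML matrix P can be written as A (A' V A)^{-1} A'.
Z = rng.normal(size=(n, n))
V = Z @ Z.T / n + np.eye(n)          # an arbitrary positive-definite V
Vi = np.linalg.inv(V)
P = Vi - Vi @ X @ np.linalg.solve(X.T @ Vi @ X, X.T @ Vi)
P_from_A = A @ np.linalg.solve(A.T @ V @ A, A.T)

print(np.allclose(u_a, u_b), np.allclose(P, P_from_A))
```

The last check shows why working with $u$ leads to the same matrix $P$ that appears in the score functions: the marginal variance of $u$ is $A^T V A$, and $A (A^T V A)^{-1} A^T = P$.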
In the expectation step, we calculate the expected value of the complete data log likelihood, $Q(\tau_1, \tau_2, \sigma \mid \hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)})$, conditional on the observed data $u$ under the current (the $t$-th iteration) estimates of the parameters, $\hat\tau_1^{(t)}$, $\hat\tau_2^{(t)}$ and $\hat\sigma^{(t)}$:

$$
\begin{aligned}
Q(\tau_1, \tau_2, \sigma \mid \hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)}) &= E[\log f(u, h_1, h_2; \tau_1, \tau_2, \sigma) \mid u; \hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)}] \\
&= -\frac{n-p}{2}\log\sigma - \frac{1}{2\sigma} E\{(u - A^T h_1 - A^T h_2)^T (u - A^T h_1 - A^T h_2) \mid u; \hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)}\} \\
&\quad - \frac{n}{2}\log\tau_1 - \frac{1}{2}\log|K_1| - \frac{1}{2\tau_1} E\{h_1^T K_1^{-1} h_1 \mid u; \hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)}\} \\
&\quad - \frac{n}{2}\log\tau_2 - \frac{1}{2}\log|K_2| - \frac{1}{2\tau_2} E\{h_2^T K_2^{-1} h_2 \mid u; \hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)}\}.
\end{aligned}
$$

In the maximization step, we maximize $Q(\tau_1, \tau_2, \sigma \mid \hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)})$ by solving $\partial Q/\partial\tau_1 = 0$, $\partial Q/\partial\tau_2 = 0$ and $\partial Q/\partial\sigma = 0$, and obtain the following estimates:

$$
\begin{aligned}
\hat\tau_1^{(t+1)} &= \frac{1}{n} E\{h_1^T K_1^{-1} h_1 \mid u; \hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)}\} = \frac{1}{n}\left\{(\hat\tau_1^{(t)})^2\, Y^T P_{12} K_1 P_{12} Y + tr\!\left(\hat\tau_1^{(t)} I - (\hat\tau_1^{(t)})^2 P_{12} K_1\right)\right\}; \\
\hat\tau_2^{(t+1)} &= \frac{1}{n} E\{h_2^T K_2^{-1} h_2 \mid u; \hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)}\} = \frac{1}{n}\left\{(\hat\tau_2^{(t)})^2\, Y^T P_{12} K_2 P_{12} Y + tr\!\left(\hat\tau_2^{(t)} I - (\hat\tau_2^{(t)})^2 P_{12} K_2\right)\right\}; \\
\hat\sigma^{(t+1)} &= \frac{1}{n-p} E\{(u - A^T h_1 - A^T h_2)^T (u - A^T h_1 - A^T h_2) \mid u; \hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)}\} = \frac{1}{n-p}\left\{(Y - \hat\mu)^T A A^T (Y - \hat\mu) + tr(A^T \hat{W} A)\right\},
\end{aligned}
$$

where $A A^T = I - X (X^T X)^{-1} X^T$,

$$\hat\mu = E(h_1 + h_2 \mid u; \hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)}) = (\tau_1 K_1 + \tau_2 K_2) P_{12} Y,$$

$$\hat{W} = var(h_1 + h_2 \mid u; \hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)}) = \tau_1 K_1 - \tau_1^2 K_1 P_{12} K_1 + \tau_2 K_2 - \tau_2^2 K_2 P_{12} K_2 - 2\tau_1 \tau_2 K_2 P_{12} K_1,$$

and $\hat\mu$ and $\hat{W}$ are obtained from the joint distribution of $(u, h_1, h_2)$. Here $\hat\mu$ denotes the conditional mean of $h_1 + h_2$, not the fixed-effect mean $X\beta$, and $\tau_1$, $\tau_2$ and $P_{12}$ on the right-hand sides are evaluated at the current estimates.

REFERENCES

Duchesne P, Lafaye De Micheaux P. 2010. Computing the distribution of quadratic forms: Further comparisons between the Liu–Tang–Zhang approximation and exact methods. Comput Stat Data Anal 54: 858-862.

Harville D. 1977. Maximum likelihood approaches to variance component estimation and related problems. J Am Stat Assoc 72: 322–340.

Tzeng JY, Zhang D. 2007. Haplotype-based association analysis via variance-components score test. Am J Hum Genet 81: 927-38.

Tzeng JY, Zhang D, Pongpanich M, Smith C, McCarthy MI, Sale MM, Worrall BB, Hsu FC, Thomas DC, Sullivan PF. 2011. Studying gene and gene-environment effects of uncommon and common variants on continuous traits: a marker-set approach using gene-trait similarity regression. Am J Hum Genet 12: 277-88.

Zhang B, Horvath S. 2005. A general framework for weighted gene co-expression network analysis. Stat Appl Genet Molec Biol 4: 1128.
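As a supplementary illustration of the EM iteration derived in the appendix for the interaction null hypothesis, the updates can be sketched in a few lines, assuming NumPy is available. The function names (`reml_projection`, `reml_loglik`, `em_reml_null`), the starting values, the stopping rule, and the simulated data are assumptions made for this sketch and do not come from the paper. The conditional variance is coded as $G - G P G$ with $G = \tau_1 K_1 + \tau_2 K_2$, which has the same trace against $A A^T$ as the expression given for the conditional variance of $h_1 + h_2$.

```python
import numpy as np


def reml_projection(V, X):
    """P = V^{-1} - V^{-1} X (X' V^{-1} X)^{-1} X' V^{-1}."""
    Vi = np.linalg.inv(V)
    ViX = Vi @ X
    return Vi - ViX @ np.linalg.solve(X.T @ ViX, ViX.T)


def reml_loglik(Y, X, V):
    """REML log-likelihood -{log|V| + log|X'V^{-1}X| + Y'PY}/2."""
    P = reml_projection(V, X)
    _, logdet_V = np.linalg.slogdet(V)
    _, logdet_XVX = np.linalg.slogdet(X.T @ np.linalg.solve(V, X))
    return -0.5 * (logdet_V + logdet_XVX + Y @ P @ Y)


def em_reml_null(Y, X, K1, K2, n_iter=200, tol=1e-8):
    """EM updates for (tau1, tau2, sigma) under H0: tau12 = 0."""
    n, p = X.shape
    I = np.eye(n)
    M = I - X @ np.linalg.solve(X.T @ X, X.T)     # equals A A'
    tau1 = tau2 = sigma = np.var(Y) / 3.0          # arbitrary positive start
    history = []
    for _ in range(n_iter):
        V = tau1 * K1 + tau2 * K2 + sigma * I
        history.append(reml_loglik(Y, X, V))
        P = reml_projection(V, X)
        PY = P @ Y
        G = tau1 * K1 + tau2 * K2
        mu = G @ PY                                # E(h1 + h2 | u)
        W = G - G @ P @ G                          # var(h1 + h2 | u)
        new_tau1 = (tau1**2 * PY @ K1 @ PY
                    + np.trace(tau1 * I - tau1**2 * P @ K1)) / n
        new_tau2 = (tau2**2 * PY @ K2 @ PY
                    + np.trace(tau2 * I - tau2**2 * P @ K2)) / n
        r = Y - mu
        new_sigma = (r @ M @ r + np.trace(M @ W)) / (n - p)
        step = max(abs(new_tau1 - tau1), abs(new_tau2 - tau2),
                   abs(new_sigma - sigma))
        tau1, tau2, sigma = new_tau1, new_tau2, new_sigma
        if step < tol:
            break
    return tau1, tau2, sigma, np.array(history)


# Simulated example (all values hypothetical); kernels are positive definite.
rng = np.random.default_rng(2)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z1 = rng.normal(size=(n, 60))
Z2 = rng.normal(size=(n, 60))
K1 = Z1 @ Z1.T / 60
K2 = Z2 @ Z2.T / 60
V_true = 1.0 * K1 + 0.5 * K2 + 1.0 * np.eye(n)
Y = rng.multivariate_normal(X @ np.array([1.0, 2.0]), V_true)

tau1_hat, tau2_hat, sigma_hat, history = em_reml_null(Y, X, K1, K2)
print(tau1_hat, tau2_hat, sigma_hat)
```

Two properties are worth checking when adapting such a sketch: each update stays positive, and the REML log-likelihood recorded in `history` should not decrease from one iteration to the next, which is the usual monotonicity property of EM. Convergence can be slow, so the iteration cap here is only a placeholder.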