S1 Appendix.

Module-based Association Analysis for Omics Data with Network Structure

Zhi Wang1, Arnab Maity2, Chuhsing Kate Hsiao3, Deepak Voora4, Rima Kaddurah-Daouk5, Jung-Ying Tzeng1,2,6

1: Bioinformatics Research Center, North Carolina State University, Raleigh NC, 27695, USA
2: Department of Statistics, North Carolina State University, Raleigh NC, 27695, USA
3: Institute of Epidemiology and Preventive Medicine, College of Public Health, National Taiwan University, Taipei, Taiwan
4: Institute for Genome Sciences and Policy, Duke University, Durham, NC, USA
5: Department of Psychiatry and Behavioral Sciences, Duke University, Durham, NC, USA
6: Department of Statistics, National Cheng-Kung University, Taiwan, R.O.C.

RUNNING TITLE: Module-based analysis for structured omics data

ADDRESS FOR CORRESPONDENCE: Jung-Ying Tzeng, Department of Statistics and Bioinformatics Research Center, North Carolina State University, Campus Box 7566, Raleigh NC, 27695, USA. Tel: 919-513-2723. Fax: 919-515-7315. E-mail: jytzeng@stat.ncsu.edu.

KEY WORDS: module structure; network structure; association analysis; metabolomics

APPENDIX

Derivation of the score test statistics and their distributions

Consider the linear mixed model representation given in model (3). Because our primary interest is in testing the variance components $\tau_1$, $\tau_2$, and $\tau_{12}$, we use the restricted maximum likelihood (REML) function to estimate the variance components $(\tau_1, \tau_2, \tau_{12}, \sigma)$. The REML log-likelihood under model (3) is

$$\ell_{REML}(\tau_1, \tau_2, \tau_{12}, \sigma; Y) = -\{\log|V| + \log|Z^T V^{-1} Z| + Y^T P Y\}/2,$$

where $V = \tau_1 K_1 + \tau_2 K_2 + \tau_{12} K_{12} + \sigma I$ is the marginal variance of $Y$ and $P = V^{-1} - V^{-1} Z (Z^T V^{-1} Z)^{-1} Z^T V^{-1}$ is a projection matrix. The score functions based on the REML log-likelihood can be obtained as below (Harville 1977).

Under $H_0^{X_1*X_2}: \tau_{12} = 0$,

$$U_{\tau_{12}}(\hat\tau_1, \hat\tau_2, 0, \hat\sigma) = \left.\frac{\partial \ell_{REML}(\tau_1, \tau_2, \tau_{12}, \sigma)}{\partial \tau_{12}}\right|_{\tau_{12}=0,\ \tau_1=\hat\tau_1,\ \tau_2=\hat\tau_2,\ \sigma=\hat\sigma_{X_1*X_2}} = \frac{1}{2}\{Y^T P_{12} K_{12} P_{12} Y - \mathrm{tr}(P_{12} K_{12})\}.$$
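Under $H_0^{X_1|X_2}: \tau_1 = 0$ with the constraint $\tau_{12} = 0$,

$$U_{\tau_1}(0, \tilde\tau_2, 0, \tilde\sigma) = \left.\frac{\partial \ell_{REML}(\tau_1, \tau_2, \tau_{12}, \sigma)}{\partial \tau_1}\right|_{\tau_{12}=0,\ \tau_1=0,\ \tau_2=\tilde\tau_2,\ \sigma=\tilde\sigma_{X_1|X_2}} = \frac{1}{2}\{Y^T P_1 K_1 P_1 Y - \mathrm{tr}(P_1 K_1)\}.$$

Under $H_0^{X_2|X_1}: \tau_2 = 0$ with the constraint $\tau_{12} = 0$,

$$U_{\tau_2}(\tilde\tau_1, 0, 0, \tilde\sigma) = \left.\frac{\partial \ell_{REML}(\tau_1, \tau_2, \tau_{12}, \sigma)}{\partial \tau_2}\right|_{\tau_{12}=0,\ \tau_1=\tilde\tau_1,\ \tau_2=0,\ \sigma=\tilde\sigma_{X_2|X_1}} = \frac{1}{2}\{Y^T P_2 K_2 P_2 Y - \mathrm{tr}(P_2 K_2)\},$$

where $P_t = V_t^{-1} - V_t^{-1} Z (Z^T V_t^{-1} Z)^{-1} Z^T V_t^{-1}$ for $t \in \{12, 1, 2\}$, with $V_{12} = \tau_1 K_1 + \tau_2 K_2 + \sigma I$, $V_1 = \tau_2 K_2 + \sigma I$, and $V_2 = \tau_1 K_1 + \sigma I$. Hats and tildes denote the REML estimates of the nuisance variance components under the corresponding null hypothesis.

NULL DISTRIBUTION OF THE SCORE STATISTICS

Because the score statistics are not asymptotically normal (Tzeng and Zhang 2007), we use the first term of each score statistic as the test statistic. For the interaction test, the test statistic is

$$T_{X_1*X_2} = \frac{1}{2} Y^T P_{12} K_{12} P_{12} Y.$$

Define $\mu = Z\beta$. Then

$$T_{X_1*X_2} = \frac{1}{2} (Y - \mu)^T P_{12} K_{12} P_{12} (Y - \mu)$$

because $\mu^T P_{12} = 0$. Further, we can rewrite

$$T_{X_1*X_2} = \frac{1}{2} C^T \left(V^{1/2} P_{12} K_{12} P_{12} V^{1/2}\right) C,$$

where $C = V^{-1/2}(Y - \mu)$ follows a standard multivariate normal distribution. Let $e_i$ and $\eta_i$ denote the eigenvectors and eigenvalues of the matrix $V^{1/2} P_{12} K_{12} P_{12} V^{1/2}/2$, respectively. Then

$$T_{X_1*X_2} = \sum_{i=1}^{c} \eta_i (e_i^T C)^2 \equiv \sum_{i=1}^{c} \eta_i \tilde{C}_i^2,$$

where $\tilde{C}_i = e_i^T C$ and the $\tilde{C}_i^2$ are independent, each following a chi-square distribution with 1 degree of freedom. Therefore the distribution of $T_{X_1*X_2}$ can be approximated by the distribution of $\sum_{i=1}^{c} \hat\eta_i \chi^2_{i1}$, where $\hat\eta_1, \ldots, \hat\eta_c$ are the non-zero eigenvalues of

$$\left. V^{1/2} P_{12} K_{12} P_{12} V^{1/2}/2 \right|_{\tau_{12}=0,\ \tau_1=\hat\tau_1,\ \tau_2=\hat\tau_2,\ \sigma=\hat\sigma_{X_1*X_2}}.$$

Hence, we can use a moment-matching approach to obtain p-values (Duchesne and Lafaye De Micheaux 2010).

Above we used the interaction test as an example to derive the test statistic and its null distribution. By a similar argument, we can approximate the null distributions of $T_{X_1|X_2}$ and $T_{X_2|X_1}$ by the distribution of $\sum_{i=1}^{c} \hat\eta_i \chi^2_{i1}$, where the $\hat\eta_i$'s are the non-zero eigenvalues of

$$\left. V^{1/2} P_1 K_1 P_1 V^{1/2}/2 \right|_{\tau_{12}=0,\ \tau_1=0,\ \tau_2=\tilde\tau_2,\ \sigma=\tilde\sigma_{X_1|X_2}} \quad\text{and}\quad \left. V^{1/2} P_2 K_2 P_2 V^{1/2}/2 \right|_{\tau_{12}=0,\ \tau_1=\tilde\tau_1,\ \tau_2=0,\ \sigma=\tilde\sigma_{X_2|X_1}},$$

respectively.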
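The eigenvalue-based approximation of the null distribution can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names `reml_projection` and `score_test` are hypothetical, the interaction kernel is taken to be the elementwise product of $K_1$ and $K_2$ purely for illustration, and the nuisance variance components are fixed at arbitrary values instead of being estimated by REML. It also swaps in a simple two-moment (Satterthwaite) match of the chi-square mixture, which is cruder than the approximations compared in Duchesne and Lafaye De Micheaux (2010).

```python
import numpy as np
from scipy.stats import chi2


def reml_projection(V, Z):
    """P = V^{-1} - V^{-1} Z (Z' V^{-1} Z)^{-1} Z' V^{-1} for symmetric V."""
    Vinv = np.linalg.inv(V)
    VinvZ = Vinv @ Z
    return Vinv - VinvZ @ np.linalg.solve(Z.T @ VinvZ, VinvZ.T)


def score_test(Y, Z, K_test, V_null):
    """Return (T, eta, p) for T = Y' P K P Y / 2 under the null covariance V_null."""
    P = reml_projection(V_null, Z)
    T = 0.5 * Y @ P @ K_test @ P @ Y
    # Symmetric square root of V_null from its eigendecomposition.
    w, U = np.linalg.eigh(V_null)
    V_half = (U * np.sqrt(w)) @ U.T
    M = 0.5 * V_half @ P @ K_test @ P @ V_half
    eta = np.linalg.eigvalsh((M + M.T) / 2)      # symmetrize against round-off
    eta = eta[eta > 1e-10 * eta.max()]           # keep the non-zero eigenvalues
    # Two-moment match: sum_i eta_i * chi2_1 is approximated by a * chi2_g.
    a = np.sum(eta**2) / np.sum(eta)
    g = np.sum(eta) ** 2 / np.sum(eta**2)
    p = chi2.sf(T / a, df=g)
    return T, eta, p


# Simulated data under the null of no interaction (all values are illustrative).
rng = np.random.default_rng(0)
n = 40
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
X1 = rng.normal(size=(n, 5))
X2 = rng.normal(size=(n, 4))
K1, K2 = X1 @ X1.T / 5, X2 @ X2.T / 4
K12 = K1 * K2                                    # hypothetical interaction kernel
V12 = 0.5 * K1 + 0.5 * K2 + 1.0 * np.eye(n)      # assumed values, not REML estimates
Y = Z @ np.array([1.0, 0.5]) + rng.multivariate_normal(np.zeros(n), V12)

T, eta, p = score_test(Y, Z, K12, V12)
print(f"T = {T:.3f}, non-zero eigenvalues = {eta.size}, p = {p:.3f}")
```

In this sketch the eigenvalues sum to $\mathrm{tr}(P_{12} K_{12})/2$, which is the second term of the score function, so the mixture has the same mean as the quadratic form it approximates.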
EM ALGORITHM FOR THE REML ESTIMATES OF $\tau_1$ AND $\tau_2$ WHEN TESTING $H_0^{X_1*X_2}: \tau_{12} = 0$

Using the interaction test ($T_{X_1*X_2}$) as an example, we derive the EM algorithm for estimating the nuisance variance components (VCs), $\tau_1$, $\tau_2$, and $\sigma$, under $H_0^{X_1*X_2}$. The EM algorithms for estimating the nuisance VCs for the $X_1|X_2$ test and the $X_2|X_1$ test can be obtained by zeroing out the corresponding variance components. The derivation is similar to the one in Tzeng et al. (2011).

Let $u = A^T Y$ with $A^T A = I_{n-d}$ and $A A^T = I - Z (Z^T Z)^{-1} Z^T$, where $d$ is the number of columns of $Z$. Then $u$ given $(h_1, h_2)$ follows a normal distribution with mean $A^T h_1 + A^T h_2$ and variance $\sigma I$, which does not depend on the fixed effect $\beta$. Therefore, the REML estimators of $\tau_1$ and $\tau_2$ can be based on the marginal distribution of $u$,

$$f(u) = \int\!\!\int f(u \mid h_1, h_2)\, f(h_1, h_2)\, dh_1\, dh_2.$$

This motivates an EM algorithm based on the observed data $u$ and the missing data $h_1$ and $h_2$. The complete-data log-likelihood is given by

$$\log f(u, h_1, h_2; \tau_1, \tau_2, \sigma) = \log f(u \mid h_1, h_2; \tau_1, \tau_2, \sigma) + \log f(h_2; \tau_2, \sigma) + \log f(h_1; \tau_1, \sigma)$$

$$= -\frac{n-d}{2} \log\sigma - \frac{1}{2\sigma} (u - A^T h_1 - A^T h_2)^T (u - A^T h_1 - A^T h_2) - \frac{n}{2}\log\tau_1 - \frac{1}{2}\log|K_1| - \frac{1}{2\tau_1} h_1^T K_1^{-1} h_1 - \frac{n}{2}\log\tau_2 - \frac{1}{2}\log|K_2| - \frac{1}{2\tau_2} h_2^T K_2^{-1} h_2.$$

In the expectation step, we calculate the expected value of the complete-data log-likelihood, $Q(\tau_1, \tau_2, \sigma \mid \hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)})$, conditional on the observed data $u$ and under the current ($t$-th iteration) estimates of the parameters $\hat\tau_1^{(t)}$, $\hat\tau_2^{(t)}$, and $\hat\sigma^{(t)}$:

$$Q(\tau_1, \tau_2, \sigma \mid \hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)}) = E[\log f(u, h_1, h_2; \tau_1, \tau_2, \sigma) \mid u; \hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)}]$$

$$= -\frac{n-d}{2}\log\sigma - \frac{1}{2\sigma} E\{(u - A^T h_1 - A^T h_2)^T (u - A^T h_1 - A^T h_2) \mid u; \hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)}\}$$

$$\quad - \frac{n}{2}\log\tau_1 - \frac{1}{2}\log|K_1| - \frac{1}{2\tau_1} E\{h_1^T K_1^{-1} h_1 \mid u; \hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)}\}$$

$$\quad - \frac{n}{2}\log\tau_2 - \frac{1}{2}\log|K_2| - \frac{1}{2\tau_2} E\{h_2^T K_2^{-1} h_2 \mid u; \hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)}\}.$$
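The conditional expectations in $Q$ have closed forms because $(u, h_1, h_2)$ are jointly normal. The block below is a sketch of that step for $h_1$, added here for illustration and not part of the original derivation. It assumes that $h_1 \sim N(0, \tau_1 K_1)$ is independent of $h_2$, that all parameters and $P_{12}$ are evaluated at the current estimates, and it uses the standard identity $A (A^T V_{12} A)^{-1} A^T = P_{12}$ without proof.

```latex
\begin{align*}
\mathrm{cov}(h_1, u) &= \tau_1 K_1 A, \qquad \mathrm{var}(u) = A^T V_{12} A, \\
E(h_1 \mid u) &= \tau_1 K_1 A (A^T V_{12} A)^{-1} A^T Y = \tau_1 K_1 P_{12} Y, \\
\mathrm{var}(h_1 \mid u) &= \tau_1 K_1 - \tau_1^2 K_1 P_{12} K_1, \\
E(h_1^T K_1^{-1} h_1 \mid u)
  &= E(h_1 \mid u)^T K_1^{-1} E(h_1 \mid u)
     + \mathrm{tr}\{K_1^{-1} \mathrm{var}(h_1 \mid u)\} \\
  &= \tau_1^2\, Y^T P_{12} K_1 P_{12} Y + \mathrm{tr}(\tau_1 I - \tau_1^2 P_{12} K_1).
\end{align*}
```

The same argument with the roles of $h_1$ and $h_2$ exchanged gives the corresponding expression for $E(h_2^T K_2^{-1} h_2 \mid u)$.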
In the maximization step, we maximize $Q(\tau_1, \tau_2, \sigma \mid \hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)})$ by solving $\partial Q/\partial\tau_1 = 0$, $\partial Q/\partial\tau_2 = 0$, and $\partial Q/\partial\sigma = 0$, and obtain the following estimates:

$$\hat\tau_1^{(t+1)} = \frac{1}{n} E\{h_1^T K_1^{-1} h_1 \mid u; \hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)}\} = \frac{1}{n}\{\tau_1^2\, Y^T P_{12} K_1 P_{12} Y + \mathrm{tr}(\tau_1 I - \tau_1^2 P_{12} K_1)\};$$

$$\hat\tau_2^{(t+1)} = \frac{1}{n} E\{h_2^T K_2^{-1} h_2 \mid u; \hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)}\} = \frac{1}{n}\{\tau_2^2\, Y^T P_{12} K_2 P_{12} Y + \mathrm{tr}(\tau_2 I - \tau_2^2 P_{12} K_2)\};$$

$$\hat\sigma^{(t+1)} = \frac{1}{n-d} E\{(u - A^T h_1 - A^T h_2)^T (u - A^T h_1 - A^T h_2) \mid u; \hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)}\} = \frac{1}{n-d}\{(Y - \tilde{M})^T A A^T (Y - \tilde{M}) + \mathrm{tr}(A^T \tilde{V} A)\},$$

where $\tau_1$, $\tau_2$, and $P_{12}$ on the right-hand sides are evaluated at the current estimates $(\hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)})$, $A A^T = I - Z (Z^T Z)^{-1} Z^T$,

$$\tilde{M} = E(h_1 + h_2 \mid u; \hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)}) = (\tau_1 K_1 + \tau_2 K_2) P_{12} Y,$$

$$\tilde{V} = \mathrm{var}(h_1 + h_2 \mid u; \hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)}) = \tau_1 K_1 - \tau_1^2 K_1 P_{12} K_1 + \tau_2 K_2 - \tau_2^2 K_2 P_{12} K_2 - \tau_1 \tau_2 (K_1 P_{12} K_2 + K_2 P_{12} K_1),$$

and $\tilde{M}$ and $\tilde{V}$ are obtained from the joint distribution of $(u, h_1, h_2)$.

REFERENCES

Duchesne P, Lafaye De Micheaux P. 2010. Computing the distribution of quadratic forms: Further comparisons between the Liu–Tang–Zhang approximation and exact methods. Comput Stat Data Anal 54: 858-862.

Harville D. 1977. Maximum likelihood approaches to variance component estimation and related problems. J Am Stat Assoc 72: 322-340.

Tzeng JY, Zhang D. 2007. Haplotype-based association analysis via variance-components score test. Am J Hum Genet 81: 927-38.

Tzeng JY, Zhang D, Pongpanich M, Smith C, McCarthy MI, Sale MM, Worrall BB, Hsu FC, Thomas DC, Sullivan PF. 2011. Studying gene and gene-environment effects of uncommon and common variants on continuous traits: a marker-set approach using gene-trait similarity regression. Am J Hum Genet 12: 277-88.

Zhang B, Horvath S. 2005. A general framework for weighted gene co-expression network analysis. Stat Appl Genet Molec Biol 4: 1128.
