Supplementary Materials for MetaComp: A Comprehensive Analysis Tool for Comparative Metagenomics

Peng Zhai, Jiangtao Guo, Xiaoqi Wang, Xiao Guo, Fumeng Wang, Fangmin Tian and Huaiqiu Zhu*

State Key Lab for Turbulence and Complex Systems and Department of Biomedical Engineering, and Center for Quantitative Biology, Peking University, Beijing 100871, China

*To whom correspondence should be addressed. Email: hqzhu@pku.edu.cn

S1. Input data

The input data of MetaComp is represented as an Abundance Profile Matrix (APM). To obtain an APM, users first align the sequences (e.g. DNA, RNA or protein sequences) from each metagenomic sample against a given database (e.g. Pfam, COG or GO). The hits of all sequences in one sample to each feature (e.g. a Pfam ID or COG ID) are then counted. The number of hits of sample j to feature i is the value $x_{ij}$ (Table S1). In other words, $x_{ij}$ is the total number of times feature i is observed in metagenomic sample j.

Table S1. Input data of MetaComp.

Feature      Sample 1    Sample 2    …    Sample n
Feature 1    $x_{11}$    $x_{12}$    …    $x_{1n}$
Feature 2    $x_{21}$    $x_{22}$    …    $x_{2n}$
Feature 3    $x_{31}$    $x_{32}$    …    $x_{3n}$
…            …           …           …    …
Feature m    $x_{m1}$    $x_{m2}$    …    $x_{mn}$

S2. Statistical test

According to the type of input data, we classify the statistical tests into three modes: two samples test mode, multiple samples test mode and two groups of samples test mode.

(1) Two samples test mode

As the sample size of a metagenome is generally large, we recommend the z-test rather than the t-test. Specifically, let $N_1$ and $N_2$ be the total counts of features in the two samples. The z-score for feature $F_i$ is then:

$$z_i = \frac{\dfrac{x_{i1}}{N_1} - \dfrac{x_{i2}}{N_2}}{\sqrt{P(1-P)\left(\dfrac{1}{N_1} + \dfrac{1}{N_2}\right)}} \tag{1}$$

where $N_1 = \sum_{i=1}^{m} x_{i1}$, $N_2 = \sum_{i=1}^{m} x_{i2}$, and $P = (x_{i1} + x_{i2})/(N_1 + N_2)$. Because the z-test is not valid in every case, its prerequisite is $\min(x_{i1}, x_{i2}) \le z_i^2$.

(2) Multiple samples test mode

In this mode, the two samples test is executed for every possible pair of samples.
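As a rough illustration of the two samples test applied to every pair of samples, the following Python sketch computes a standard pooled two-proportion z-score and a p-value for each pair. It is not MetaComp's own code: the function names, the two-sided normal p-value and the toy matrix are assumptions made for this example, and the validity prerequisite of the z-test is not checked.

```python
import math
from itertools import combinations


def two_sample_z(x_i1, x_i2, n1, n2):
    """Pooled two-proportion z-score for one feature in two samples."""
    pooled = (x_i1 + x_i2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        # Pooled proportion is 0 or 1: the two samples cannot be told apart.
        return 0.0
    return (x_i1 / n1 - x_i2 / n2) / se


def two_sided_p(z):
    """Two-sided p-value under a standard normal (assumption of this sketch)."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def pairwise_p_values(apm, feature):
    """p-value of one feature (row index) for every pair of samples (columns)."""
    n_samples = len(apm[0])
    totals = [sum(row[j] for row in apm) for j in range(n_samples)]
    result = {}
    for a, b in combinations(range(n_samples), 2):
        z = two_sample_z(apm[feature][a], apm[feature][b], totals[a], totals[b])
        result[(a, b)] = two_sided_p(z)
    return result


# Toy APM (invented numbers): rows are features, columns are samples.
toy_apm = [
    [30, 10, 30],
    [70, 90, 70],
]
for pair, p in pairwise_p_values(toy_apm, feature=0).items():
    print(pair, p)
```

The dictionary returned by `pairwise_p_values` holds one p-value per sample pair, which is the quantity the rest of this mode operates on.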
For a given feature i, the minimum of the p-values over all pairs of samples is taken as the p-value of the feature. A significant result therefore indicates that the feature differs significantly in at least one pair of samples.

(3) Two groups of samples test mode

In MetaComp, we provide four statistical test methods (t-test, paired t-test, Mann-Whitney U test and Wilcoxon signed-rank test) to assess whether a specific feature differs significantly between two groups. Users can choose a suitable method themselves. To assist users without adequate statistical knowledge, MetaComp can also select the most suitable test method automatically according to the input data, following the scheme shown in Table S2.

Moreover, an odds ratio (OR) test is implemented for evaluating the abundance of each feature. In the contingency table shown in Table S3, $x_{ij}$ denotes the count of feature i in sample j. Because the two groups may be of unequal size, an empirical continuity correction is introduced to improve the accuracy of the test (Sweeting, M.J. et al., 2004). The OR statistic for feature i is then:

$$\log_2 OR(i) = \log_2 \frac{\left(n_{11} + \dfrac{R}{R+1}\right)\left(n_{22} + \dfrac{1}{R+1}\right)}{\left(n_{12} + \dfrac{1}{R+1}\right)\left(n_{21} + \dfrac{R}{R+1}\right)} \tag{2}$$

where $R = n_{\cdot 1}/n_{\cdot 2}$ is the ratio of the Group 1 total $n_{\cdot 1} = n_{11} + n_{21}$ to the Group 2 total $n_{\cdot 2} = n_{12} + n_{22}$ in Table S3. According to this computation, features are classified as group 1 enriched (when $\log_2 OR(i) > 1$) or group 1 scarce (when $\log_2 OR(i) < 1$).

S3. Multiple Hypothesis Testing Correction

A typical metagenomic profile consists of several hundred to several thousand features (e.g. Pfam/COG functional profiles). Direct application of the statistical methods described above is therefore likely to yield a large number of false positives. For example, at a p-value threshold of 0.05, about 500 false positives (0.05 × 10000) are expected in a profile containing 10000 features if none of the features truly differs.
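The arithmetic behind this example, together with two standard adjustments that address it, can be sketched in Python as follows. This is an illustration only: the source does not state which FDR procedure MetaComp uses, so the Benjamini-Hochberg step-up procedure is an assumption here, and the function names and p-values are invented for the example.

```python
def expected_false_positives(n_features, alpha):
    """Expected number of false positives when no feature truly differs."""
    return n_features * alpha


def bonferroni(p_values):
    """Bonferroni-adjusted p-values: each p is multiplied by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg q-values (assumed FDR procedure for this sketch)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, keeping q-values monotone.
    for rank in range(m, 0, -1):
        k = order[rank - 1]
        running_min = min(running_min, p_values[k] * m / rank)
        q_values[k] = running_min
    return q_values


# Invented p-values for four features.
toy_p = [0.01, 0.04, 0.03, 0.20]
print(expected_false_positives(10000, 0.05))
print(bonferroni(toy_p))
print(benjamini_hochberg(toy_p))
```

Bonferroni adjustment is the more conservative of the two: it never yields a smaller adjusted value than the Benjamini-Hochberg q-value for the same feature.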
Two correction methods are implemented in MetaComp to address this problem: false discovery rate (FDR) correction, which is the default option, and Bonferroni correction, which is the alternative option.

Table S2. Methods for comparison of two groups of samples.

                      Independent            Correlated (c)
Parametric (a)        t-test                 Paired t-test
Non-parametric (b)    Mann-Whitney U test    Wilcoxon signed-rank test

a. Under the assumption that the variable follows a normal distribution.
b. Applicable when the sample size is small and the normality assumption is violated.
c. Consisting of a sample of matched pairs of similar units, or one group of units that has been tested twice.

Table S3. Contingency table for the odds ratio test.

                  Group 1 (G1)                                         Group 2 (G2)                                         Total
Feature i         $n_{11} = \sum_{j \in G_1} x_{ij}$                   $n_{12} = \sum_{j \in G_2} x_{ij}$                   $n_{1\cdot} = n_{11} + n_{12}$
Other features    $n_{21} = \sum_{k \ne i} \sum_{j \in G_1} x_{kj}$    $n_{22} = \sum_{k \ne i} \sum_{j \in G_2} x_{kj}$    $n_{2\cdot} = n_{21} + n_{22}$
Total             $n_{\cdot 1} = n_{11} + n_{21}$                      $n_{\cdot 2} = n_{12} + n_{22}$

S4. Environmental Factor Analysis

This analysis is implemented as a regression analysis using the lasso algorithm. MetaComp first normalizes the input data and the environmental factor data. The ith environmental factor in the jth sample ($X_{ij}$) is then treated as an independent variable, and the frequency of a given feature in the jth sample ($Y_j$) is treated as the dependent variable. The regression function is:

$$Y_j = \sum_{i} \alpha_i X_{ij} + \sum_{m \ne n} \beta_{m,n} X_{mj} \times X_{nj} \tag{3}$$

where $X_{mj} \times X_{nj}$ denotes the co-effect of environmental factors $X_{mj}$ and $X_{nj}$ on feature $Y_j$, and $\alpha_i$ and $\beta_{m,n}$ are the regression coefficients of the function. The reliability of the regression coefficients is assessed by p-value. The regression result is accepted only when all p-values (the p-values of all regression coefficients and the p-value of the regression function) meet the prescribed standard.

Table S4.
Part of the detailed results of the analysis of whale fall, Acid Mine Drainage, Sargasso Sea, and Minnesota soil metagenomic samples.

Feature    AMD    Soil    S.1 (a)    S.2    S.3    W.Bone (b)    W.Mat    W.Rib    p-value     q-value     Function
PF01036    0      1       344        354    332    0             0        0        1.60e-35    4.67e-34    Bacteriorhodopsin-like protein
PF03814    99     17      3          4      7      0             0        1        5.22e-48    2.92e-46    Ion channel KdpA Potassium-transporting ATPase A subunit
PF02705    0      87      15         30     30     10            0        1        2.27e-56    3.56e-54    APC K_trans K+ potassium transporter
PF01077    42     4870    62         51     71     57            45       37       0           0           NIR_SIR Nitrite and sulphite reductase 4Fe-4S domain

a. S = Sargasso Sea
b. W = whale fall

Table S5. Part of the detailed results of the relationship between metagenomic samples and environmental factors.

Feature    DIP (a)      Oxygen (b)    DIP&Oxygen    p-value      Correlation    Annotation
COG0378    0            0             3.303E-06     2.974E-02    9.006E-01      Ni2+-binding GTPase involved in regulation of expression and mature
COG1921    1.157E-03    0             0             8.413E-02    8.223E-01      Selenocysteine synthase [seryl-tRNASer selenium transferase]
COG0318    4.841E-03    0             0             8.972E-02    8.152E-01      Acyl-CoA synthetases (AMP-forming)/AMP-acid ligases II

a. DIP = dissolved inorganic phosphate
b. Oxygen = oxygen content

References

Sweeting, M.J. et al. (2004) What to add to nothing? Use and avoidance of continuity corrections in meta-analysis of sparse data. Statistics in Medicine, 23(9), 1351-1375.