FINDING CONSISTENT SUBNETWORKS ACROSS MICROARRAY DATASETS
Fan Qi
GS5002 Journal Club

OUTLINE
- Introduction
- Methodology
- Results & Discussions
- Conclusions

INTRODUCTION
Identify Differential Gene Expression
- Identify genes that are significant w.r.t. a phenotype
Importance:
- Testing the effectiveness of treatment
- Biological insights into diseases
- Developing new treatments
- Disease prophylaxis
- Any others?

CURRENT METHODS
Individual Genes
- Search for individual differentially expressed genes
- Fold-change, t-test, SAM
Gene Pathway Detection
- Look at a set of genes instead of individual genes
- Bayesian learning and Boolean network learning
Gene Classes
- Add existing biological insights
- Over-representation analysis (ORA), Functional Class Scoring (FCS), GSEA, NEA, ErmineJ

CHALLENGE
Different datasets of the SAME disease give different results!
Zhang M [1] demonstrated inconsistency in SAM:

Dataset          DEGs     POG   nPOG
Prostate cancer  Top 10   0.3   0.3
                 Top 50   0.14  0.14
                 Top 100  0.15  0.15
Lung cancer      Top 10   0.00  0.00
                 Top 50   0.20  0.19
                 Top 100  0.31  0.30
DMD              Top 10   0.20  0.20
                 Top 50   0.42  0.42
                 Top 100  0.54  0.54

Inconsistency among datasets (reconstructed from Table 1 in [1]).
DEGs: differentially expressed genes; POG: percentage of overlapping genes; nPOG: normalized POG.

NEW APPROACH
SNet [2]
- Proposed in 2011
- Utilizes gene-gene relationships in the analysis
Gene-gene relationship
- Activates vs. inhibits (from Fig 1 in [2])
Gene Subnetwork
- A gene is a vertex, a relationship is an edge
- [Figure: example subnetwork with genes RHOA, VAV, PIK3R2, ARHGEF1, RAC1, IQGAP1; partially adapted from Fig 2 in [2]]

METHODOLOGY
Input:
- Genes labeled with phenotype, obtained from microarray experiments
Third-party info:
- Gene pathway info
- Gene reaction info
Stages:
- Subnetwork extraction
- Subnetwork scoring
- Subnetwork significance
Attributes of a subnetwork: size, score
Output:
- A set of significant subnetworks

METHODOLOGY – STEP 1
[Figure: patients grouped by phenotype (P1, P2, P3, ...); each patient has a ranked gene list $\{a_1, a_2, \ldots, a_n\}$]
- For each patient $P_i$, keep only the top $\alpha\%$ of genes ($\alpha = 10$), giving the gene list $G_{P_i} = \{a_1, a_2, \ldots, a_m\}$
- Repeat for every phenotype group
- Select one phenotype as $d$ and treat the others as $\neg d$
- Select the genes that occur in $\geq \beta\%$ of the patients in $d$ ($\beta = 50$), giving the gene list $G_L = \{a_1, a_2, \ldots, a_k\}$
- Partition $G_L$ into multiple pathways and generate the subnetworks
- Result: a list of subnetworks $cc$ w.r.t. $d$

METHODOLOGY – STEP 2
For each subnetwork $sp$ in $cc$ and each patient $P_i$, compute the overall expression level:

  $SNet_{sp,i} = \sum_{g' \in G_{P_i} \cap sp} Sg_{sp,g'}$, where $Sg_{sp,g} = \frac{k}{n}$

- $g$: a gene in $sp$ that is highly expressed in $P_i$
- $k$: number of patients in $d$ ($\neg d$) who have $g$ highly expressed
- $n$: total number of patients in $d$ ($\neg d$)

For patients $\langle P_1, P_2, \ldots, P_n \rangle \in d$ and $\langle P_{n+1}, P_{n+2}, \ldots, P_m \rangle \in \neg d$, compute a t-test between:

  $Ssp_{sp,d} = \langle SNet_{sp,1}, SNet_{sp,2}, \ldots, SNet_{sp,n} \rangle$
  $Ssp_{sp,\neg d} = \langle SNet_{sp,n+1}, SNet_{sp,n+2}, \ldots, SNet_{sp,m} \rangle$

The resulting t-test score $Ssp_{sp,t}$ is assigned to each subnetwork.

METHODOLOGY – STEP 3
A. Randomly swap the phenotype labels of the patients, then recreate the subnetworks and t-test scores (steps 1-2)
B. Repeat [A] for 1,000 permutations
   - This forms a 2-D histogram (Size × Score)
C. Estimate the nominal p-value of each subnetwork
D. Select the subnetworks with p-value ≤ 0.05
Null hypothesis: a subnetwork with the given (size, score) is not significant
[Figure: Fig 5 in the original paper [2]]

RESULTS AND DISCUSSIONS
Datasets:
- Leukemia: Golub vs. Armstrong
- ALL: Ross vs. Yeoh
- DMD: Haslett vs. Pescatori
- Lung: Bhattacharjee vs. Garber
Performance comparison:
- Subnetwork overlap (with GSEA)
- Gene overlap (GSEA, SAM, t-test)
Other comparisons:
- Network size, gene validity with t-test

RESULTS AND DISCUSSIONS: Subnetwork Overlap

Disease   Dataset 1      Dataset 2   SNet %   GSEA %   SNet no.  GSEA no.
Leukemia  Golub          Armstrong   83.33%   0%       20        0
ALL       Ross           Yeoh        47.63%   23.1%    10        6
DMD       Haslett        Pescatori   58.33%   55.6%    7         10
Lung      Bhattacharjee  Garber      90.90%   0%       9         0

Higher is better. Synthesized from Tables 1 and 2 in [2].

RESULTS AND DISCUSSIONS: Gene Overlap

Disease   SNet    GSEA    T-Test (p < 0.05)  T-Test (top)  SAM (p < 0.05)  SAM (top)
Leukemia  91.30%  2.38%   73.01%             14.29%        49.96%          22.62%
ALL       93.01%  4.0%    60.20%             57.33%        81.25%          49.33%
DMD       69.23%  28.9%   49.60%             20.00%        76.98%          42.22%
Lung      51.18%  4.0%    65.61%             26.16%        65.61%          24.62%

Higher is better. Synthesized from Tables 3, 4 and 5 in [2].

RESULTS AND DISCUSSIONS: Size of Subnetworks

                   Size of network
                   T-Test          SNet
Disease   γ     2   3   4   5      5   6   7   >8
Leukemia  84    8   1   0   0      2   3   2   1
Subtype   75    5   1   1   1      1   0   1   6
DMD       45    3   1   0   0      1   0   0   5
Lung      65    3   2   1   0      5   3   0   1

Reconstructed from Table 6 in [2].

RESULTS AND DISCUSSIONS: Validity
- Compare the genes in EACH subnetwork with those found by the t-test
- For most subnetworks, around 70%-100% of the genes also appear in the t-test gene list
- Selected results (the full set is too large to present):

Subnetwork Name         Percentage
Leukaemia_B Cell-VAV1   81.82%
SNET_CTNNB1             100%
Leukaemia_UBC           100%
SNET_TNFSF10            60%
Leukaemia_RAC1          57.15%
SNET_PYGM               60%
DMD_RHOA                75%
DMD_ACTB                83.33%
DMD_SDC3                88.89%
Leukaemia_POU2F2        75.00%
MLLBCR_ACAA1            28.67%
BCR_T_RASA1             44.44%
MLLBCR_BLNK             72.73%
BCR_ABL1                75.00%
SNET_NOTCH3             100%
DMD_CALM1               80%

Selected from Tables 7, 8, 9 and 10 in [2].

CONCLUSIONS
- Traditional methods give inconsistent results across different datasets of the same disease
- SNet utilizes biological insights to mitigate the gap
  - Gene-to-gene relationships
  - Gene pathway knowledge
- SNet shows better results than established algorithms
  - More consistent

REFERENCES
[1] Zhang M, Zhang L, Zou J, Yao C, Xiao H, Liu Q, Wang J, Wang D, Wang C, Guo Z: Evaluating reproducibility of differential expression discoveries in microarray studies by considering correlated molecular changes.
[2] Soh D, Dong D, Guo Y, Wong L: Finding consistent disease subnetworks across microarray datasets.
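
The gene selection and scoring in Methodology Steps 1 and 2 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the SNet implementation from [2]: the function names and the toy data are hypothetical, the subnetwork is given directly instead of being derived from a pathway database, the gene weights k/n are computed with respect to phenotype d only, and Welch's form of the t statistic is an assumed choice because the slides do not say which variant is used.

```python
from math import sqrt
from statistics import mean, variance


def top_genes(expr, alpha=10):
    """Step 1: keep the top alpha% of one patient's genes, ranked by expression."""
    ranked = sorted(expr, key=expr.get, reverse=True)
    keep = max(1, len(ranked) * alpha // 100)
    return set(ranked[:keep])


def gene_list(patient_sets, beta=50):
    """Step 1: genes highly expressed in at least beta% of the patients (G_L)."""
    counts = {}
    for genes in patient_sets:
        for g in genes:
            counts[g] = counts.get(g, 0) + 1
    n = len(patient_sets)
    return {g for g, k in counts.items() if 100 * k >= beta * n}


def snet_score(subnet, patient_genes, ref_sets):
    """Step 2: sum of k/n over the subnetwork genes highly expressed in one patient.

    k and n are counted in ref_sets, the patients of phenotype d (assumption).
    """
    n = len(ref_sets)
    score = 0.0
    for g in subnet & patient_genes:
        k = sum(g in s for s in ref_sets)
        score += k / n
    return score


def t_statistic(xs, ys):
    """Welch's t statistic between two score vectors (assumed t-test variant)."""
    return (mean(xs) - mean(ys)) / sqrt(
        variance(xs) / len(xs) + variance(ys) / len(ys)
    )


# Toy data (invented): highly expressed gene sets per patient.
d_sets = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}]   # phenotype d
nd_sets = [{"C"}, {"D"}, {"A", "D"}]                 # phenotype not-d
subnet = {"A", "B"}                                   # one hypothetical subnetwork

scores_d = [snet_score(subnet, p, d_sets) for p in d_sets]
scores_nd = [snet_score(subnet, p, d_sets) for p in nd_sets]
t = t_statistic(scores_d, scores_nd)
print("G_L:", sorted(gene_list(d_sets)))
print("t statistic for the subnetwork:", round(t, 4))
```

Step 3 would wrap this computation in a loop that shuffles the phenotype labels and collects the (size, score) pairs into the null distribution.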