Scoring Pathway Activities

Eunjung Lee
Department of Biosystems, KAIST, Korea

Introduction

Diseases involve malfunctions of diverse pathways that disrupt normal processes of the human body. Cancer is a prominent example: it involves a wide range of malignant pathways for cell growth and survival signaling, cell proliferation, apoptosis, and so on.1 Understanding these abnormal circuits gives us guidance for overcoming diseases effectively.

In recent years, disease markers have been increasingly identified through analysis of genome-wide expression profiles.2-3 Marker sets are selected by scoring each individual gene for its power to discriminate between different classes of disease given an observed pattern of expression. The main difficulty is interpreting the faulty mechanisms of illness from the discriminative marker genes. Several approaches have been devised to address this challenge by utilizing gene sets from prior biological knowledge.4-6

In this research, four previous methods are evaluated for whether they can capture perturbed or activated pathways from gene expression profiles. This comparison enables us to identify an effective scoring scheme and to gain insight into devising a new method that defines pathways without a priori defined gene sets and identifies their activity levels in a diseased condition, which is the ultimate goal of this analysis.

Methods

Hypergeometric Method

The score for a pathway of size n is calculated from the pmf (probability mass function) of the hypergeometric distribution, a discrete probability distribution. It is the probability of having k differentially expressed genes in a pathway of size n, given that D genes are regarded as differentially expressed among the total N genes in the expression profiles. Owing to its simplicity, many software packages and websites, including GenMAPP4 and CHIPINFO, provide this function, mostly using Gene Ontology as the source of gene sets.
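
As a minimal sketch of the probability just described, the pmf can be evaluated directly from binomial coefficients. The function names `hypergeom_pmf` and `hypergeom_upper_tail` and the toy counts are illustrative assumptions, not taken from the tools cited above. The upper-tail sum is a commonly used enrichment p-value and is likewise an assumption here, since the text only specifies the pmf.

```python
from math import comb


def hypergeom_pmf(k: int, N: int, D: int, n: int) -> float:
    """Probability of exactly k DEGs in a pathway of size n,
    given D DEGs among N genes in total."""
    # Outside the support of the distribution the probability is zero.
    if k < max(0, n - (N - D)) or k > min(n, D):
        return 0.0
    return comb(D, k) * comb(N - D, n - k) / comb(N, n)


def hypergeom_upper_tail(k: int, N: int, D: int, n: int) -> float:
    """Probability of at least k DEGs in the pathway.
    Assumption: used as an enrichment p-value; the text only defines the pmf."""
    return sum(hypergeom_pmf(j, N, D, n) for j in range(k, min(n, D) + 1))


if __name__ == "__main__":
    # Hypothetical toy counts: 10 genes, 4 DEGs, a pathway of 3 genes, 2 of them DEGs.
    N, D, n, k = 10, 4, 3, 2
    print(hypergeom_pmf(k, N, D, n))         # 0.3
    print(hypergeom_upper_tail(k, N, D, n))  # probability of 2 or more DEGs
```

With these toy counts the pmf is 6 * 6 / 120 = 0.3. In practice N, D, n, and k would come from the expression profiles and the chosen gene-set collection.
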
This method needs a cut-off value, such as a fold-change or p-value cut-off, to define differentially expressed genes. This can lose information: because the score simply counts the differentially expressed member genes in a pathway, the expression amplitudes and ranks of the genes beyond the cut-off are not reflected in the score. On the other hand, the hypergeometric distribution intrinsically accounts for pathway size, which other scoring schemes need to consider explicitly.

With N the total number of genes, D the number of differentially expressed genes (DEGs), n the size of the pathway, and k the number of DEGs in the pathway:

$$\mathrm{pmf}(k) = \frac{\binom{D}{k}\binom{N-D}{n-k}}{\binom{N}{n}}$$

GSEA (Gene Set Enrichment Analysis)

GSEA5 evaluates whether members of a gene set tend to occur at the top of a list of genes ranked by their power to discriminate between phenotypes. It consists of the following steps:
(i) all genes are ranked by a measure such as the t-statistic or the signal-to-noise ratio (S2N);
(ii) for each gene set, an enrichment score (ES) based on a one-sided Kolmogorov-Smirnov statistic is calculated by comparing the distribution of the ranks of genes in the gene set with that of the remaining genes;
(iii) the statistical significance of the observed ES is assessed against a null distribution generated by permuting phenotype labels.

Figure 2. Overview of GSEA (Figure in reference 5)

Z-score Method

The original paper7 by Ideker et al. used the z-score to explore active subnetworks in an integrated protein-protein and protein-DNA interaction network using expression profiles, rather than to score pre-defined gene sets. However, its success critically depends on whether the z-score can detect perturbed or active pathways from gene expression data alone, without networks. Its evaluation on curated pathways is therefore necessary. The procedure of the z-score method is as follows:
(i) the significance of differential expression of each gene (p-value $p_i$) is calculated;
(ii) each p-value is converted to a z-score by the inverse normal CDF, $z_i = \Phi^{-1}(1 - p_i)$. In random data, p-values are distributed uniformly from 0 to 1 and z-scores follow a standard normal distribution, with smaller p-values corresponding to larger z-scores;
(iii) for a gene set A of k genes, an aggregated z-score $Z_A$ is calculated by summing the z-scores $z_i$ of all genes in the gene set, scaled as in Ideker et al.7:

$$Z_A = \frac{1}{\sqrt{k}} \sum_{i \in A} z_i$$

(iv) the score $Z_A$ of a gene set of size k is calibrated against a background distribution generated by randomly selecting gene sets of size k, which yields the score mean $\mu_k$ and standard deviation $\sigma_k$ for each k:

$$S_A = \frac{Z_A - \mu_k}{\sigma_k}$$

Figure 3. Overview of Z-score (Figure in reference 7)

Permutation Method

Tian et al.6 proposed a statistical framework to test two related hypotheses: (i) Q1: do the genes in a gene set show the same pattern of association with the phenotype as the rest of the genes? (ii) Q2: does the gene set contain genes correlated with the phenotype? They pointed out the need for a normalization step that accounts for the different correlation structures of gene sets before comparison. Their method includes three steps:
(i) for each gene, a test statistic such as the t-statistic is calculated to test the correlation between gene expression and phenotype;
(ii) for each gene set k, the t-statistics of its member genes are averaged, yielding $T_k$:

$$T_k = \frac{1}{m_k} \sum_{i=1}^{B} g_{ik}\, t_i, \qquad m_k = \sum_{i=1}^{B} g_{ik}$$

where $t_i$ are the t-scores of the B genes, and $g_{ik} = 1$ if gene set k includes gene i and 0 otherwise;
(iii) the averaged t-statistic $T_k$ is checked for statistical significance against a null distribution generated by permuting rows (genes or t-scores) or columns (phenotype labels).

Averaging raw t-scores can weaken the signal, that is, reduce $T_k$, when the directions of regulation of the member genes are heterogeneous.

Figure 4. Overview of Permutation Method (Figure in reference 6)

Dataset

NF-kB Pathway Expression Data

A time-course gene expression profile of TNF (Tumor Necrosis Factor) stimulation in the presence or absence of NF-kB signaling was downloaded from GEO8 (Gene Expression Omnibus; http://www.ncbi.nlm.nih.gov/geo/index.cgi). TNF is a pro-inflammatory cytokine that controls the expression of inflammatory genetic networks.
This data set gives expression profiles of genes in the NF-kB pathway while effectively controlling for other TNF-activated pathways such as JNK, through the expression of iKB (inhibitor of kB) mutants. 945 and 179 genes are differentially expressed between samples with and without NF-kB signaling at fdr<50% (p=0.0560) and at a stricter cut-off (p=0.0011), respectively, based on t-statistics.

Experimental design (figure):
Active NF-kB pathway: m-iKB absent (-), TNF stimulation for 0h, 1h, 3h, and 6h; 4 replicates per time point.
Inactive NF-kB pathway: m-iKB present (+), TNF stimulation for 0h, 1h, 3h, and 6h; 4 replicates per time point.
m-iKB: mutants of iKB.

PDGF (Platelet-Derived Growth Factor) Pathway Data

A microarray data set observing the autocrine PDGF loop and the effect of exogenous PDGF in U87 MG glioblastoma cells was downloaded from GEO9. PDGF plays a critical role in cell proliferation and development, and the presence of a PDGF autocrine loop in glioblastoma is a frequent hallmark of malignancy. The authors used dominant-negative PDGF-A to prevent the formation of active PDGF dimers, which in turn prevents both intracrine activation and the secretion of PDGF into the extracellular space (autocrine and paracrine). 195 genes are differentially expressed between samples with and without exogenous PDGF (fdr*<50%, p=0.0018) based on t-statistics.

Human Pathways

MSigDB (Molecular Signatures Database), collection C2, was downloaded and used to define gene sets. It consists of 472 gene sets of metabolic and signaling pathways and 50 sets containing genes coregulated in response to genetic and chemical perturbations.

Results

All methods captured the NF-kB-induced gene set and the TNF pathway at rank 1 or 2, and the NF-kB pathway itself at significant p-value and FDR levels (fdr*<25%). With the PDGF expression data, however, all methods captured gene sets related to cell cycle, cell proliferation, cell cycle regulation, and damage control, including the p53 pathway. It has been reported that autocrine PDGF signaling becomes more oncogenic with the accumulation of mutations in p53.
This is presumably because excessive cell cycling and proliferation require more damage control to prevent erroneous development of tumors. In contrast to the NF-kB pathway, the PDGF pathway itself was not detected at the fdr<50% level, although the GSEA and Z-score schemes ranked it at an FDR slightly above 50%. There are two possible explanations: (i) genes in the PDGF pathway are not regulated at the mRNA level, or show only very weak differential expression between samples with and without exogenous PDGF; (ii) all samples had the autocrine PDGF loop blocked by dominant-negative mutants of PDGF-A, which might not adequately represent real PDGF pathway activation.

Hypergeometric : NF-kB

Rank  Name                              Size  p         fdr<0.25
1st   CR_SIGNALLING                     182   0.000000  y
      FRASOR_ER_DOWN                    68    0.000000  y
      KRAS_TOP100_KNOCKDOWN             72    0.000000  y
      NFKB_INDUCED                      105   0.000000  y
      DOWNREG_BY_HOXA9                  28    0.000002  y
      nthiPathway                       21    0.000002  y
      ST_Tumor_Necrosis_Factor_Pathway  28    0.000002
      tnfr2Pathway                      18    0.000003  y
      HOXA9_DOWN                        31    0.000007  y
      EMT_UP                            53    0.000019  y
      hivnefPathway                     54    0.000025  y
21st  nfkbPathway                       21    0.000905  y

GSEA : NF-kB

Rank  Name                    p             q(fdr)       fdr<0.25
      KRAS_TOP100_KNOCKDOWN   0.0           0.012715259
2nd   NFKB_INDUCED            0.005444646   0.06464923
      DOWNREG_BY_HOXA9        0.0043290043  0.118684635
      metPathway              0.004008016   0.12523517
      keratinocytePathway     0.0           0.11972623
      HOXA9_DOWN              0.0044247787  0.11115593
      stressPathway           0.0062370063  0.09745527
      nthiPathway             0.006802721   0.0880979
      tnfr2Pathway            0.009578544   0.08824222
      CR_SIGNALLING           0.011131725   0.08366267
40th  nfkbPathway             0.025540275   0.22657      y

Z-score : NF-kB

Rank  Name                              Size  z         p
      NFKB_INDUCED                      105   9.669149  0.000000
      KRAS_TOP100_KNOCKDOWN             72    7.575231  0.000000
      ST_Tumor_Necrosis_Factor_Pathway  28    6.536961  0.000000
      tnfr2Pathway                      18    5.957621  0.000000
      CR_SIGNALLING                     182   5.462894  0.000000
      hivnefPathway                     54    5.347498  0.000000
      ST_Gaq_Pathway                    27    4.831110  0.000001
      deathPathway                      31    4.745057  0.000001
      nthiPathway                       21    4.635058  0.000002
      DOWNREG_BY_HOXA9                  28    4.621381  0.000002
      tnf_and_fas_network               20    4.416775  0.000005
34th  nfkbPathway                       21    3.117105  0.000913

Permutation : NF-kB
Q1
Name                              Size  p
caspasePathway                    21    0.000000
CR_SIGNALLING                     182   0.000000
deathPathway                      31    0.000000
eponfkbPathway                    11    0.000000
il10Pathway                       12    0.000000
inflamPathway                     27    0.000000
NFKB_INDUCED                      105   0.000000
nfkbPathway                       21    0.000000
tnfr2Pathway                      18    0.000000
tollPathway                       29    0.000000
42 pathways with p=0.000000

Q2
Name                              Size  p
caspasePathway                    21    0.000000
deathPathway                      31    0.000000
eponfkbPathway                    11    0.000000
il10Pathway                       12    0.000000
NFKB_INDUCED                      105   0.000000
nfkbPathway                       21    0.000000
ST_Tumor_Necrosis_Factor_Pathway  28    0.000000
tall1Pathway                      12    0.000000
tnf_and_fas_network               20    0.000000
tnfr2Pathway                      18    0.000000
39 pathways with p=0.000000

Hypergeometric : PDGF

Rank   Name                           Size  p
       CR_CELL_CYCLE                  78    0.000000
       MAP00100_Sterol_biosynthesis   10    0.000000
       s1pPathway                     7     0.000007
       LEU_DOWN                       167   0.000017
       Cell_Cycle                     73    0.000028
       p27Pathway                     12    0.000097
       cell_proliferation             199   0.000117
       SA_REG_CASCADE_OF_CYCLIN_EXPR  13    0.000137
       DNA_DAMAGE_SIGNALLING          90    0.000148
       g1Pathway                      26    0.000218
       p53Pathway                     16    0.000332
472nd  pdgfPathway                    -     not significant

GSEA : PDGF

Rank  Name                          Size  p    q(fdr)
      cdc25Pathway                  9     0.0  0.10099991
      g1Pathway                     26    0.0  0.10099991
      rbPathway                     13    0.0  0.100999966
      EMT_UP                        53    0.0  0.13869974
      CR_CELL_CYCLE                 78    0.0  0.15389395
      MAP00100_Sterol_biosynthesis  10    0.0  0.17021161
      Cell_Cycle                    73    0.0  0.17880023
      NFKB_INDUCED                  105   0.0  0.1690752
      ptc1Pathway                   11    0.0  0.16151145
      g2Pathway                     23    0.0  0.15546033
      FRASOR_ER_DOWN                68    0.0  0.2139626
      fxrPathway                    6     0.0  0.2267665
      srcRPTPPathway                11    0.0  0.225836
      P53_UP                        40    0.0  0.21691884
      cell_proliferation            199   0.0  0.21947579
      plk3Pathway                   7     0.0  0.24841224
51st  pdgfPathway                   27    0.0  0.59359473

Z-score : PDGF

Rank  Name                           Size  z         p
      Cell_Cycle                     73    6.631856  0.000000
      CR_CELL_CYCLE                  78    7.236692  0.000000
      GLUT_DOWN                      286   5.398338  0.000000
      MAP00100_Sterol_biosynthesis   10    6.689549  0.000000
      LEU_DOWN                       167   4.186298  0.000014
      rbPathway                      13    4.124814  0.000019
      s1pPathway                     7     4.102092  0.000020
      g2Pathway                      23    4.073250  0.000023
      HTERT_DOWN                     64    3.944680  0.000040
      SA_REG_CASCADE_OF_CYCLIN_EXPR  13    3.855292  0.000058
40th  pdgfPathway                    27    2.216931  0.013314

Permutation : PDGF

Q1
Name                           Size  p
atrbrcaPathway                 19    0.000000
Cell_Cycle                     73    0.000000
CR_CELL_CYCLE                  78    0.000000
CR_REPAIR                      39    0.000000
GLUT_DOWN                      286   0.000000
LEU_DOWN                       167   0.000000
RAP_DOWN                       213   0.000000
cell_proliferation             199   0.001000

Q2
Name                           Size  p
eif2Pathway                    8     0.004000
fasPathway                     28    0.004000
pentosePathway                 2     0.004000
SA_REG_CASCADE_OF_CYCLIN_EXPR  13    -
tumor_supressor                22    0.004000
atmPathway                     19    0.010000
cell_cycle_checkpointII        10    0.010000
igf1mtorPathway                20    0.010000
GLUT_DOWN                      286   0.011000

The PDGF pathway itself is not significant.

Conclusion and Discussion

This research showed that four previous approaches (the hypergeometric method, GSEA, the Z-score scheme, and the permutation-based method) can capture activated pathways based on the discriminative power of the member genes in the pathways. One statistical insight from this research is that pathways have different background distributions, depending especially on pathway size. The figure below shows background distributions with varying standard deviations for modules of size 10, 100, 213, and 326. Adjusting raw pathway scores against a well-approximated background distribution is therefore necessary to extract truly significant pathways.

The results of this research need to be analyzed in a more quantitative way. It is also, without doubt, a worthwhile challenge to devise a method that explores and defines pathways while simultaneously measuring their activities from gene expression and/or integrated multi-omics data, without pre-defined gene sets. The insights gained from this research will help to achieve that future goal.

References

1. Hanahan D, Weinberg RA (2000) The hallmarks of cancer. Cell 100(1): 57-70.
2. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439): 531-537.
3. Ramaswamy S, Ross KN, Lander ES, Golub TR (2003) A molecular signature of metastasis in primary solid tumors. Nat Genet 33(1): 49-54.
4. Doniger SW, Salomonis N, Dahlquist KD, Vranizan K, Lawlor SC, Conklin BR (2003) MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biol 4(1): R7.
5. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102(43): 15545-15550.
6. Tian L et al. (2005) Discovering statistically significant pathways in expression profiling studies. Proc Natl Acad Sci U S A 102(38): 13544-13549.
7. Ideker T, Ozier O, Schwikowski B, Siegel AF (2002) Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics 18 Suppl 1: S233-240.
8. Tian B et al. (2005) Identification of direct genomic targets downstream of the nuclear factor-kB transcription factor mediating tumor necrosis factor signaling. J Biol Chem 280(17): 17435-17448.
9. Ma D et al. (2005) Autocrine platelet-derived growth factor-dependent gene expression in glioblastoma cells is mediated largely by activation of the transcription factor sterol regulatory element binding protein and is associated with altered genotype and patient survival in human brain tumors. Cancer Res 65(15): 5523-5534.