Comparison of Pathway Activity Scoring Schemes

Scoring Pathway Activities
Eunjung Lee
Department of Biosystems, KAIST, Korea
Introduction
Diseases involve malfunctions of diverse pathways that disrupt the normal processes of the
human body. Cancer is a prominent example, involving a wide range of malignant pathways for
signaling of cell growth and survival, cell proliferation, apoptosis, and so on.1
Understanding these abnormal circuits gives us valuable guidance for overcoming diseases
effectively.
In recent years, disease markers have been increasingly identified through analysis of
genome-wide expression profiles.2-3 Marker sets are selected by scoring each individual
gene for its power to discriminate between different classes of disease given an observed
pattern of expression. The main difficulty is interpreting the faulty mechanisms of illness
from the discriminative marker genes. Several approaches have been devised to address this
challenge by utilizing gene sets from prior biological knowledge.4-6
In this research, four previous methods are evaluated for their ability to capture perturbed
or activated pathways from gene expression profiles. This comparison enables us to
identify an effective scoring scheme and to gain insight into devising a new method that
defines pathways without a priori gene sets and identifies their activity levels in a
diseased condition, which is the ultimate goal of this analysis.
Methods
Hypergeometric Method
The score for a pathway of size n is calculated from the pmf (probability mass function)
of the hypergeometric distribution, a discrete probability distribution. It is the probability of
having k differentially expressed genes in a pathway of size n given that D genes are
regarded as differentially expressed among the total N genes in the expression profiles. Due to
its simplicity, many software packages and websites, including GENMAPP4 and CHIPINFO,
provide this function, mostly using Gene Ontology as the source of gene sets. This method
needs a cut-off value, such as a fold-change or p-value cut-off, for defining differentially
expressed genes. This can lead to a loss of information: because the score simply counts the
number of differentially expressed member genes in a pathway, the expression amplitudes and
rank order of genes above the cut-off are not reflected. The hypergeometric distribution
intrinsically accounts for pathway size, which also needs to be considered in the other
scoring schemes.
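N: total # of genes
D: # of DEGs (differentially expressed genes)
n: size of pathway
k: # of DEGs in the pathway
\[ \mathrm{pmf} = P(X = k) = \frac{\binom{D}{k}\binom{N-D}{n-k}}{\binom{N}{n}} \]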
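The hypergeometric score can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in GENMAPP or CHIPINFO: the function names are hypothetical, and the upper-tail sum P(X >= k) is a commonly used enrichment p-value that the text above does not explicitly specify.

```python
from math import comb


def hypergeom_pmf(k, N, D, n):
    """P(X = k): exactly k DEGs in a pathway of size n, with D DEGs among N genes."""
    # Outside the support of the distribution the probability is zero.
    if k < max(0, n - (N - D)) or k > min(n, D):
        return 0.0
    return comb(D, k) * comb(N - D, n - k) / comb(N, n)


def hypergeom_upper_tail(k, N, D, n):
    """P(X >= k): assumed enrichment p-value built by summing the pmf (hypothetical helper)."""
    return sum(hypergeom_pmf(i, N, D, n) for i in range(k, min(n, D) + 1))


if __name__ == "__main__":
    # Toy numbers chosen only for illustration: 10 genes, 3 DEGs, pathway of size 2.
    print(hypergeom_pmf(1, N=10, D=3, n=2))
    print(hypergeom_upper_tail(1, N=10, D=3, n=2))
```

Because the pathway size n enters the formula directly, two pathways with the same count k but different sizes receive different scores.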
GSEA (Gene Set Enrichment Analysis)
GSEA5 evaluates whether members of a gene set tend to occur toward the top of a list ranked
by power to discriminate between phenotypes. It consists of the following steps: (i) all genes
are ranked by a measure such as the t-statistic or the signal-to-noise ratio (S2N); (ii) for
each gene set, an enrichment score (ES) based on a one-sided Kolmogorov-Smirnov statistic is
calculated by comparing the distribution of ranks of genes in the gene set with that of the
remaining genes; (iii) the statistical significance of the observed ES is assessed against
the null distribution generated by permuting phenotype labels.
Figure 2. Overview of GSEA (Figure in reference 5)
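The three GSEA steps can be sketched as follows. This is a simplified illustration, not the published GSEA software: it assumes an unweighted running sum for the one-sided Kolmogorov-Smirnov-style score, S2N defined as the difference of class means divided by the sum of class standard deviations, and phenotype labels coded as 1 and 0. All function names are hypothetical.

```python
import random
from statistics import mean, stdev


def signal_to_noise(a, b):
    """S2N ranking metric: difference of class means over the sum of class SDs."""
    denom = stdev(a) + stdev(b)
    return (mean(a) - mean(b)) / denom if denom > 0 else 0.0


def rank_genes(expr, labels):
    """Step (i): sort gene names by decreasing S2N between classes 1 and 0."""
    def score(gene):
        a = [v for v, lab in zip(expr[gene], labels) if lab == 1]
        b = [v for v, lab in zip(expr[gene], labels) if lab == 0]
        return signal_to_noise(a, b)
    return sorted(expr, key=score, reverse=True)


def enrichment_score(ranked, gene_set):
    """Step (ii): maximum positive deviation of an unweighted running sum."""
    members = set(gene_set) & set(ranked)
    n_hit = len(members)
    n_miss = len(ranked) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for gene in ranked:
        # Step up on a member gene, step down on a non-member.
        running += 1.0 / n_hit if gene in members else -1.0 / n_miss
        best = max(best, running)
    return best


def permutation_p(expr, labels, gene_set, n_perm=1000, seed=0):
    """Step (iii): empirical p-value of the ES under shuffled phenotype labels."""
    rng = random.Random(seed)
    observed = enrichment_score(rank_genes(expr, labels), gene_set)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if enrichment_score(rank_genes(expr, shuffled), gene_set) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_perm + 1)
```

Here `expr` is assumed to map each gene name to a list of expression values, one per sample, in the same order as `labels`. Permuting labels rather than genes preserves the correlation structure among genes, which is the reason step (iii) shuffles phenotypes.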
Z-score Method
The original paper7 by Ideker et al. used the z-score to explore active subnetworks in an
integrated protein-protein and protein-DNA interaction network using expression profiles,
rather than to score pre-defined gene sets. However, its success critically depends on whether
the z-score can detect perturbed or active pathways using gene expression data alone, without
networks. Its evaluation using curated pathways is therefore necessary. The procedure of the
z-score method is as follows: (i) the significance of differential expression of each
gene (p-value) is calculated; (ii) each p-value is converted to a z-score by the inverse normal
CDF. In random data, p-values are distributed uniformly from 0 to 1 and z-scores follow a
standard normal distribution, with smaller p-values corresponding to larger z-scores; (iii) for
a gene set, an aggregated z-score ZA is calculated by summing the z-scores of all genes in the
gene set; (iv) the score ZA of a gene set of size k is calibrated against the background
distribution generated by randomly selecting gene sets of size k, which yields the score mean
μk and standard deviation σk for each k.
Figure 3. Overview of Z-score (Figure in reference 7)
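A minimal Python sketch of steps (ii) to (iv) follows. It makes two assumptions that the text above does not state: the summed z-scores are divided by sqrt(k), following the aggregate score of Ideker et al., and the calibrated score is (ZA - μk) / σk. The function names and the number of random sets are hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev


def p_to_z(p, eps=1e-12):
    """Step (ii): inverse normal CDF, so small p-values give large z-scores."""
    p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
    return -NormalDist().inv_cdf(p)


def aggregate_z(z_scores):
    """Step (iii): sum of z-scores, scaled by sqrt(k) (assumed, after Ideker et al.)."""
    return sum(z_scores) / len(z_scores) ** 0.5


def calibrated_score(gene_set, z_by_gene, n_random=1000, seed=0):
    """Step (iv): compare Z_A with random gene sets of the same size k."""
    rng = random.Random(seed)
    members = [g for g in gene_set if g in z_by_gene]
    k = len(members)
    if k == 0:
        raise ValueError("gene set has no genes with a z-score")
    z_a = aggregate_z([z_by_gene[g] for g in members])
    all_z = list(z_by_gene.values())
    background = [aggregate_z(rng.sample(all_z, k)) for _ in range(n_random)]
    mu_k, sigma_k = mean(background), stdev(background)
    return (z_a - mu_k) / sigma_k
```

Sampling the background separately for each k is what makes scores of gene sets with different sizes comparable.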
Permutation Method
Tian et al.6 proposed a statistical framework to test two related hypotheses: (i) do the genes
in a gene set have the same pattern of association as the rest? (ii) does the gene set contain
genes correlated with the phenotype? They pointed out the necessity of a normalization step
to account for the different correlation structures of gene sets before comparison. Their
method includes three steps: (i) for each gene, a test statistic such as the t-statistic is
calculated to test the correlation between gene expression and phenotype; (ii) for each
gene set, the t-statistics of member genes are averaged, yielding Tk; (iii) the averaged
t-statistic Tk is checked for statistical significance against the null distribution generated
by permuting rows (genes or t-scores) or columns (phenotype labels). Averaging raw t-scores
can weaken the signal when the directions of expression change of member genes are
heterogeneous.
Figure annotations: t-scores for B genes; membership indicators for K gene sets (1: the gene
set includes the gene; 0: otherwise). The average of t-scores can reduce Tk when the
directions of regulation differ among member genes.
Figure 4. Overview of Permutation Method (Figure in reference 6)
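The averaging and row-permutation steps can be sketched in Python as below. This is an illustrative simplification, not the procedure of Tian et al. in full: it covers only the row (gene) permutation, omits their normalization step, and uses a two-sided comparison as an assumption. Function names are hypothetical.

```python
import random
from statistics import mean


def set_statistic(t_by_gene, gene_set):
    """T_k: mean t-statistic over the members of gene set k."""
    return mean(t_by_gene[g] for g in gene_set if g in t_by_gene)


def row_permutation_p(t_by_gene, gene_set, n_perm=1000, seed=0):
    """Empirical two-sided p-value for T_k from shuffling t-scores across genes."""
    rng = random.Random(seed)
    genes = list(t_by_gene)
    values = [t_by_gene[g] for g in genes]
    observed = set_statistic(t_by_gene, gene_set)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(values)  # reassign t-scores to genes at random
        permuted = dict(zip(genes, values))
        if abs(set_statistic(permuted, gene_set)) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Opposite-signed members cancel: T_k is 0 although both genes respond strongly.
    print(set_statistic({"up": 3.0, "down": -3.0, "flat": 0.1}, ["up", "down"]))
```

The small example at the end illustrates the weakness noted above: a gene set whose members change in opposite directions can receive an averaged statistic near zero.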
Dataset
NF-kB Pathway Expression Data
A time-course gene expression profile of TNF (Tumor Necrosis Factor) stimulation in the
presence or absence of NF-kB signaling was downloaded from GEO (Gene Expression
Omnibus; http://www.ncbi.nlm.nih.gov/geo/index.cgi). TNF is a pro-inflammatory
cytokine that controls expression of inflammatory genetic networks. Through expression of
iKB (inhibitor of kB) mutants, this dataset gives expression profiles of genes in the NF-kB
pathway while effectively controlling for other pathways activated by TNF, such as JNK.
Based on t-statistics, 945 and 179 genes are differentially expressed between samples with
absent and present NF-kB signaling at fdr<50% (p=0.0560) and at the stricter fdr<50%
(p=0.0011), respectively.
Figure: experimental design of the NF-kB dataset
Active NF-kB pathway:    m-iKB -, +TNF at 0h, 1h, 3h, 6h (4 replicates each)
Inactive NF-kB pathway:  m-iKB +, +TNF at 0h, 1h, 3h, 6h (4 replicates each)
m-iKB: iKB mutants
PDGF (Platelet-Derived Growth Factor) Pathway Data
A microarray dataset for observing the autocrine PDGF loop and the effect of exogenous PDGF
in U87 MG glioblastoma cells was downloaded from GEO.9 PDGF plays a critical role in cell
proliferation and development, and the presence of a PDGF autocrine loop in glioblastoma is
a frequent hallmark of malignancy. The authors used dominant-negative PDGF-A to prevent the
formation of active PDGF dimers, which in turn prevents both intracrine activation and the
secretion of PDGF into the extracellular space (autocrine and paracrine). Based on
t-statistics, 195 genes are differentially expressed between samples with and without
exogenous PDGF at fdr*<50% (p=0.0018).
Human Pathways
MSigDB (Molecular Signatures Database, C2) was downloaded and used to define gene sets.
It consists of 472 gene sets of metabolic and signaling pathways and 50 sets containing
genes coregulated by genetic and chemical perturbations.
Results
All methods captured the NF-kB-induced gene set and the TNF pathway at rank 1 or 2, and
the NF-kB pathway itself at significant p-value and fdr levels (fdr*<25%). With the PDGF
expression data, however, all methods captured gene sets related to the cell cycle, cell
proliferation, cell cycle regulation, and damage control, including the p53 pathway. It has
been reported that autocrine PDGF signaling becomes more oncogenic with the
accumulation of mutations in p53. A likely reason is that excessive cell cycling and
proliferation require more damage control to prevent erroneous development of tumors. In
contrast to the NF-kB pathway, the PDGF pathway itself was not detected at the fdr<50% level,
although the GSEA and Z-score schemes ranked it at an fdr slightly higher than 50%. There are
two possible explanations: (i) genes in the PDGF pathway are not regulated at the mRNA
level, or show very weak differential expression between samples with and without exogenous
PDGF; (ii) all samples had the autocrine PDGF loop blocked by dominant-negative mutants
of PDGFA, which might not adequately represent real PDGF pathway activation.
Hypergeometric : NF-kB
Rank  Name                              size  p         fdr<0.25
      CR_SIGNALLING                     182   0.000000  y
      FRASOR_ER_DOWN                    68    0.000000  y
      KRAS_TOP100_KNOCKDOWN             72    0.000000  y
1st   NFKB_INDUCED                      105   0.000000  y
      DOWNREG_BY_HOXA9                  28    0.000002  y
      nthiPathway                       21    0.000002  y
      ST_Tumor_Necrosis_Factor_Pathway  28    0.000002
      tnfr2Pathway                      18    0.000003  y
      HOXA9_DOWN                        31    0.000007  y
      EMT_UP                            53    0.000019  y
      hivnefPathway                     54    0.000025  y
21st  nfkbPathway                       21    0.000905  y
GSEA : NF-kB
Rank  Name                   p             q(fdr)
      KRAS_TOP100_KNOCKDOWN  0.0           0.012715259
2nd   NFKB_INDUCED           0.005444646   0.06464923
      DOWNREG_BY_HOXA9       0.0043290043  0.118684635
      metPathway             0.004008016   0.12523517
      keratinocytePathway    0.0           0.11972623
      HOXA9_DOWN             0.0044247787  0.11115593
      stressPathway          0.0062370063  0.09745527
      nthiPathway            0.006802721   0.0880979
      tnfr2Pathway           0.009578544   0.08824222
      CR_SIGNALLING          0.011131725   0.08366267
40th  nfkbPathway            0.025540275   0.22657      y (fdr<0.25)
Z-score : NF-kB
Rank  Name                              size  z         p
      NFKB_INDUCED                      105   9.669149  0.000000
      KRAS_TOP100_KNOCKDOWN             72    7.575231  0.000000
      ST_Tumor_Necrosis_Factor_Pathway  28    6.536961  0.000000
      tnfr2Pathway                      18    5.957621  0.000000
      CR_SIGNALLING                     182   5.462894  0.000000
      hivnefPathway                     54    5.347498  0.000000
      ST_Gaq_Pathway                    27    4.831110  0.000001
      deathPathway                      31    4.745057  0.000001
      nthiPathway                       21    4.635058  0.000002
      DOWNREG_BY_HOXA9                  28    4.621381  0.000002
      tnf_and_fas_network               20    4.416775  0.000005
34th  nfkbPathway                       21    3.117105  0.000913
Permutation : NF-kB
Q1
Name                              size  p
caspasePathway                    21    0.000000
CR_SIGNALLING                     182   0.000000
deathPathway                      31    0.000000
eponfkbPathway                    11    0.000000
il10Pathway                       12    0.000000
inflamPathway                     27    0.000000
NFKB_INDUCED                      105   0.000000
nfkbPathway                       21    0.000000
tnfr2Pathway                      18    0.000000
tollPathway                       29    0.000000
42 pathways with p=0.000000
Q2
Name                              size  p
caspasePathway                    21    0.000000
deathPathway                      31    0.000000
eponfkbPathway                    11    0.000000
il10Pathway                       12    0.000000
NFKB_INDUCED                      105   0.000000
nfkbPathway                       21    0.000000
ST_Tumor_Necrosis_Factor_Pathway  28    0.000000
tall1Pathway                      12    0.000000
tnf_and_fas_network               20    0.000000
tnfr2Pathway                      18    0.000000
39 pathways with p=0.000000
Hypergeometric : PDGF
Rank   Name                           size  p
       CR_CELL_CYCLE                  78    0.000000
       MAP00100_Sterol_biosynthesis   10    0.000000
       s1pPathway                     7     0.000007
       LEU_DOWN                       167   0.000017
       Cell_Cycle                     73    0.000028
       p27Pathway                     12    0.000097
       cell_proliferation             199   0.000117
       SA_REG_CASCADE_OF_CYCLIN_EXPR  13    0.000137
       DNA_DAMAGE_SIGNALLING          90    0.000148
       g1Pathway                      26    0.000218
       p53Pathway                     16    0.000332
472nd  pdgfPathway                          not significant
GSEA : PDGF
Rank  Name                          size  p    q(fdr)
      cdc25Pathway                  9     0.0  0.10099991
      g1Pathway                     26    0.0  0.10099991
      rbPathway                     13    0.0  0.100999966
      EMT_UP                        53    0.0  0.13869974
      CR_CELL_CYCLE                 78    0.0  0.15389395
      MAP00100_Sterol_biosynthesis  10    0.0  0.17021161
      Cell_Cycle                    73    0.0  0.17880023
      NFKB_INDUCED                  105   0.0  0.1690752
      ptc1Pathway                   11    0.0  0.16151145
      g2Pathway                     23    0.0  0.15546033
      FRASOR_ER_DOWN                68    0.0  0.2139626
      fxrPathway                    6     0.0  0.2267665
      srcRPTPPathway                11    0.0  0.225836
      P53_UP                        40    0.0  0.21691884
      cell_proliferation            199   0.0  0.21947579
      plk3Pathway                   7     0.0  0.24841224
51st  pdgfPathway                   27    0.0  0.59359473
Z-score : PDGF
Rank  Name                           size  z         p
      Cell_Cycle                     73    6.631856  0.000000
      CR_CELL_CYCLE                  78    7.236692  0.000000
      GLUT_DOWN                      286   5.398338  0.000000
      MAP00100_Sterol_biosynthesis   10    6.689549  0.000000
      LEU_DOWN                       167   4.186298  0.000014
      rbPathway                      13    4.124814  0.000019
      s1pPathway                     7     4.102092  0.000020
      g2Pathway                      23    4.073250  0.000023
      HTERT_DOWN                     64    3.944680  0.000040
      SA_REG_CASCADE_OF_CYCLIN_EXPR  13    3.855292  0.000058
40th  pdgfPathway                    27    2.216931  0.013314
Permutation : PDGF
Q1
Name                           size  p
atrbrcaPathway                 19    0.000000
Cell_Cycle                     73    0.000000
CR_CELL_CYCLE                  78    0.000000
CR_REPAIR                      39    0.000000
GLUT_DOWN                      286   0.000000
LEU_DOWN                       167   0.000000
RAP_DOWN                       213   0.000000
cell_proliferation             199   0.001000
Q2
Name                           size  p
eif2Pathway                    8     0.004000
fasPathway                     28    0.004000
pentosePathway                 2     0.004000
SA_REG_CASCADE_OF_CYCLIN_EXPR  13
tumor_supressor                22    0.004000
atmPathway                     19    0.010000
cell_cycle_checkpointII        10    0.010000
igf1mtorPathway                20    0.010000
GLUT_DOWN                      286   0.011000
PDGF pathway itself is not significant.
Conclusion and Discussion
This research showed that four previous approaches, namely the hypergeometric method,
GSEA, the Z-score scheme, and the permutation-based method, can capture activated pathways
based on the discriminative power of member genes in pathways.
One of the statistical insights from this research is that pathways have different background
distributions, depending especially on pathway size. The accompanying figure shows background
distributions with varying standard deviations for modules of size 10, 100, 213, and 326.
Raw pathway scores must therefore be adjusted against a well-approximated background
distribution to extract truly significant pathways.
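The dependence of the background distribution on pathway size can be illustrated with a small simulation. This is a sketch under assumptions: the gene-level scores are simulated standard normal values rather than the datasets above, the pathway score is taken to be the mean of member scores, and the function name is hypothetical. Only the module sizes 10, 100, 213, and 326 come from the text.

```python
import random
from statistics import mean, stdev


def background_sd(scores, size, n_random=2000, seed=0):
    """Mean and SD of the average score over random gene sets of a given size."""
    rng = random.Random(seed)
    means = [mean(rng.sample(scores, size)) for _ in range(n_random)]
    return mean(means), stdev(means)


# Simulated gene-level scores (an assumption; real t- or z-scores would be used instead).
_rng = random.Random(42)
gene_scores = [_rng.gauss(0.0, 1.0) for _ in range(5000)]

for size in (10, 100, 213, 326):
    mu, sd = background_sd(gene_scores, size)
    print(f"size={size:4d}  mean={mu:+.3f}  sd={sd:.3f}")
```

The spread of the background narrows as the module grows, so the same raw score is far more surprising for a large pathway than for a small one. This is why a size-specific background is needed before raw scores are compared.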
The results from this research need to be analyzed in a more quantitative way. It would also
undoubtedly be a valuable challenge to devise a method that explores and defines pathways
while simultaneously measuring their activities from gene expression and/or integrated
multi-omics data, without pre-defined gene sets. The insights gained from this research will
help to achieve this future goal.
References
1. Hanahan D, Weinberg RA (2000) The hallmarks of cancer. Cell 100(1): 57-70.
2. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M et al. (1999) Molecular
classification of cancer: class discovery and class prediction by gene expression monitoring.
Science 286(5439): 531-537.
3. Ramaswamy S, Ross KN, Lander ES, Golub TR (2003) A molecular signature of
metastasis in primary solid tumors. Nat Genet 33(1): 49-54.
4. Doniger SW, Salomonis N, Dahlquist KD, Vranizan K, Lawlor SC, Conklin BR (2003)
MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile
from microarray data. Genome Biol 4(1): R7.
5. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL et al. (2005) Gene set
enrichment analysis: a knowledge-based approach for interpreting genome-wide expression
profiles. Proc Natl Acad Sci U S A 102(43): 15545-15550.
6. Tian L et al. (2005) Discovering statistically significant pathways in expression profiling
studies. Proc Natl Acad Sci U S A 102(38): 13544-13549.
7. Ideker T, Ozier O, Schwikowski B, Siegel AF (2002) Discovering regulatory and
signalling circuits in molecular interaction networks. Bioinformatics 18 Suppl 1: S233-240.
8. Tian B et al. (2005) Identification of direct genomic targets downstream of the
nuclear factor-kB transcription factor mediating tumor necrosis factor signaling.
J Biol Chem 280(17): 17435-17448.
9. Ma D et al. (2005) Autocrine platelet-derived growth factor-dependent gene
expression in glioblastoma cells is mediated largely by activation of the transcription
factor sterol regulatory element binding protein and is associated with altered genotype
and patient survival in human brain tumors. Cancer Res 65(15): 5523-5534.