SNet

FINDING CONSISTENT SUBNETWORKS
ACROSS MICROARRAY DATASETS
Fan Qi
GS5002 Journal Club
OUTLINE

Introduction

Methodology

Results & Discussions

Conclusions
INTRODUCTION

Identify Differential Gene Expression
 Identify significant genes w.r.t. a phenotype

Importance:
 Testing the effectiveness of treatments
 Gaining biological insight into diseases
 Developing new treatments
 Disease prophylaxis
 Any others?
CURRENT METHODS

Individual Genes
 Search for individual differentially expressed genes
 Fold-change, t-test, SAM

Gene Pathway Detection
 Look at a set of genes instead of individual genes
 Bayesian learning and Boolean network learning

Gene Classes
 Add existing biological insights
 Over-representation analysis (ORA), Functional Class Scoring (FCS), GSEA, NEA, ErmineJ
CHALLENGE

Different results from different datasets of the SAME disease!

Zhang et al. [1] demonstrated inconsistency in SAM:

Dataset          DEGs      POG    nPOG
Prostate cancer  Top 10    0.3    0.3
                 Top 50    0.14   0.14
                 Top 100   0.15   0.15
Lung cancer      Top 10    0.00   0.00
                 Top 50    0.20   0.19
                 Top 100   0.31   0.30
DMD              Top 10    0.20   0.20
                 Top 50    0.42   0.42
                 Top 100   0.54   0.54

Reconstructed from Table 1 in [1]. The low values show the inconsistency among datasets.
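To make the POG column concrete, here is a minimal sketch of an overlap score between two top-ranked DEG lists. It assumes POG is simply the fraction of genes in one list that also appear in the other; the definition in [1] may also take regulation direction into account, and nPOG (the normalized variant) is not reproduced here. The function name `pog` and the gene names are hypothetical.

```python
def pog(list_a, list_b):
    """Fraction of distinct genes in list_a that also appear in list_b.

    Simplified assumption: regulation direction is ignored.
    """
    genes_a = set(list_a)
    if not genes_a:
        raise ValueError("list_a must not be empty")
    return len(genes_a & set(list_b)) / len(genes_a)


# Hypothetical top-5 DEG lists from two datasets of the same disease.
top_a = ["g1", "g2", "g3", "g4", "g5"]
top_b = ["g2", "g5", "g7", "g8", "g9"]
print(pog(top_a, top_b))  # 0.4 (2 shared genes out of 5)
```

A score near 0 means the two datasets select almost entirely different genes, which is the problem the table above illustrates.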
NEW APPROACH

SNet [2]
 Proposed in 2011
 Utilizes gene-gene relationships in the analysis

Gene-gene relationship
 Activates vs. inhibits (from Fig. 1 in [2])

Gene subnetwork
 Each gene is a vertex; each relationship is an edge
 Example genes: RHOA, VAV, PIK3R2, ARHGEF1, RAC1, IQGAP1 (partially adapted from Fig. 2 in [2])
METHODOLOGY

Input:
 Genes labeled with phenotype (obtained from microarray experiments)
 Third-party info: gene pathway info, gene reaction info

Steps:
 1. Subnetwork extraction
 2. Subnetwork scoring
 3. Subnetwork significance (attributes of a subnetwork: size, score)

Output:
 A set of significant subnetworks
METHODOLOGY – STEP 1

[Figure: patients grouped by phenotype (P1, P2, P3, ...); each patient has a ranked gene list $\{a_1, a_2, \ldots, a_n\}$]
For each patient $P_i$ in a phenotype group (e.g. P1):
 Only the top $\alpha$% of genes in the ranked list is kept ($\alpha = 10$)
 The ranked list $\{a_1, a_2, \ldots, a_n\}$ is reduced to $\{a_1, a_2, \ldots, a_m\}$, giving the gene list $G_{P_i}$ for patient $P_i$

Repeat for every phenotype group
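The top-$\alpha$% filter can be sketched as follows. This is an illustration only: it assumes each patient is a mapping from gene name to expression value and that genes are ranked by that value in descending order, which the slides do not spell out. The function name `top_alpha_genes` and the sample data are hypothetical.

```python
import math


def top_alpha_genes(expression, alpha=10.0):
    """Return the genes in the top alpha% of one patient's ranked list.

    Assumption: genes are ranked by expression value, highest first.
    """
    ranked = sorted(expression, key=expression.get, reverse=True)
    keep = max(1, math.ceil(len(ranked) * alpha / 100))
    return set(ranked[:keep])


# One hypothetical patient with 20 genes; gene g20 is the most expressed.
patient = {f"g{i}": float(i) for i in range(1, 21)}
print(sorted(top_alpha_genes(patient)))  # ['g19', 'g20']
```

With $\alpha = 10$ and 20 genes, two genes survive; the returned set plays the role of $G_{P_i}$.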
Select one phenotype as $d$ and treat the others as $\neg d$
 For the patients in $d$ (e.g. P1), each with a reduced gene list $\{a_1, a_2, \ldots, a_m\}$:
 Select the genes that occur in $\geq \beta$% of the patients ($\beta = 50$)
 The selected genes form the gene list $GL = \{a_1, a_2, \ldots, a_k\}$
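A minimal sketch of the $\beta$% selection, assuming each patient in $d$ is represented by the set of genes that survived the top-$\alpha$% filter. The function name `frequent_genes` and the sample sets are hypothetical.

```python
from collections import Counter


def frequent_genes(patient_gene_sets, beta=50.0):
    """Genes present in at least beta% of the given patients' gene sets."""
    counts = Counter(g for genes in patient_gene_sets for g in set(genes))
    threshold = len(patient_gene_sets) * beta / 100
    return {g for g, c in counts.items() if c >= threshold}


# Four hypothetical patients of phenotype d.
group_d = [{"a1", "a2"}, {"a1", "a3"}, {"a1", "a2"}, {"a4"}]
print(sorted(frequent_genes(group_d)))  # ['a1', 'a2']
```

Here `a1` appears in 3 of 4 patients and `a2` in 2 of 4, so both reach the 50% threshold; the result corresponds to $GL$.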
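Partition $GL$ into multiple pathways
Generate subnetworks from the genes in each pathway

Output: a list of subnetworks $cc$ w.r.t. $d$

[Figure: the genes of $GL$ ($a_1, a_2, \ldots, a_k$) grouped into pathways, then split into subnetworks such as $\{a_1, a_2, a_3\}$ and $\{a_4, a_5, a_6, a_7\}$]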
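One way to read this step is as a connected-components computation, which the name $cc$ suggests but the slides do not state. The sketch below assumes a pathway is given as a list of gene-gene edges, treats edges as undirected (ignoring activates vs. inhibits), and drops genes that have no edge to another gene in $GL$. All names and data are hypothetical.

```python
def subnetworks(gene_list, pathway_edges):
    """Connected components of one pathway restricted to genes in gene_list.

    Assumptions: undirected edges; isolated genes are not reported.
    """
    genes = set(gene_list)
    adjacency = {}
    for u, v in pathway_edges:
        if u in genes and v in genes:
            adjacency.setdefault(u, set()).add(v)
            adjacency.setdefault(v, set()).add(u)

    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # iterative depth-first search
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical pathway; a9 is not in GL, a6 has no edges.
edges = [("a1", "a2"), ("a2", "a3"), ("a4", "a5"), ("a5", "a9")]
gl = ["a1", "a2", "a3", "a4", "a5", "a6"]
print(subnetworks(gl, edges))
```

Running this per pathway and collecting the results would give a list comparable to $cc$.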
METHODOLOGY – STEP 2

For each subnetwork $sp$ in $cc$ and each patient $P_i$, compute the overall expression level:

$$SNet_{sp,i} = \sum_{g' \in G_{P_i} \cap sp} Sg_{sp,g'}, \quad \text{where } Sg_{sp,g} = \frac{k}{n}$$

 $g$: a gene in $sp$ that is highly expressed in $P_i$
 $k$: number of patients in $d$ ($\neg d$) who have $g$ highly expressed
 $n$: total number of patients in $d$ ($\neg d$)

For patients $\langle P_1, P_2, \ldots, P_n \rangle \in d$ and $\langle P_{n+1}, P_{n+2}, \ldots, P_m \rangle \in \neg d$, compute a t-test between the two score vectors:

$$Ssp_{sp,d} = \langle SNet_{sp,1}, SNet_{sp,2}, \ldots, SNet_{sp,n} \rangle$$
$$Ssp_{sp,\neg d} = \langle SNet_{sp,n+1}, SNet_{sp,n+2}, \ldots, SNet_{sp,m} \rangle$$

The resulting t-test score $Ssp_{sp,t}$ is assigned to each subnetwork.
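The scoring step can be sketched as below. Two points are assumptions rather than statements from the slides: the gene weights $k/n$ are computed from the patients in $d$ and applied to every patient, and the t-test is taken to be Welch's unequal-variance form, since the slides only say "t-test". Function names and data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def gene_weight(gene, group_gene_sets):
    """Sg = k / n: fraction of patients in the group with the gene highly expressed."""
    k = sum(1 for genes in group_gene_sets if gene in genes)
    return k / len(group_gene_sets)


def snet_score(subnetwork, patient_genes, group_gene_sets):
    """Sum of Sg over subnetwork genes that are highly expressed in this patient."""
    return sum(gene_weight(g, group_gene_sets) for g in subnetwork & patient_genes)


def welch_t(xs, ys):
    """Welch's t-statistic; fails if both samples have zero variance."""
    se = sqrt(variance(xs) / len(xs) + variance(ys) / len(ys))
    return (mean(xs) - mean(ys)) / se


sp = {"a1", "a2"}                                # one hypothetical subnetwork
d = [{"a1", "a2"}, {"a1"}, {"a1", "a2"}]         # highly expressed genes per patient in d
not_d = [{"a2"}, set(), {"a3"}]                  # same for patients in not-d

scores_d = [snet_score(sp, p, d) for p in d]
scores_not_d = [snet_score(sp, p, d) for p in not_d]
t_score = welch_t(scores_d, scores_not_d)
print(scores_d, scores_not_d, t_score)
```

A large positive t-score indicates that the subnetwork's genes are consistently highly expressed in $d$ but not in $\neg d$.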
METHODOLOGY – STEP 3

A. Randomly swap the phenotype labels of patients, then recreate the subnetworks and t-test scores (steps 1-2)
B. Repeat [A] for 1,000 permutations
   • The results form a 2-D histogram (size × score)
C. Estimate the nominal p-value of each subnetwork
D. Select subnetworks with p-value ≤ 0.05

Null hypothesis: a subnetwork with the given (size, score) is not significant
(Fig. 5 in the original paper)
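The permutation test might be sketched as follows. How the nominal p-value is read off the 2-D histogram is not given in the slides; this sketch assumes it is the fraction of null subnetworks of the same size whose score is at least the observed score. The function names and the null data are hypothetical, and the re-run of steps 1-2 per permutation is left out.

```python
import random


def permute_labels(labels, rng):
    """Return a shuffled copy of the phenotype labels (one permutation)."""
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled


def nominal_p_value(size, score, null_pairs):
    """Fraction of null subnetworks of the same size scoring >= the observed score.

    null_pairs holds (size, score) tuples collected over all permutations.
    """
    same_size = [s for n, s in null_pairs if n == size]
    if not same_size:
        return None  # no null subnetwork of this size was generated
    return sum(1 for s in same_size if s >= score) / len(same_size)


# Hypothetical (size, score) pairs gathered from permuted runs.
null_pairs = [(5, 1.2), (5, 0.4), (5, 2.9), (5, 0.8), (6, 3.5)]
print(nominal_p_value(5, 2.5, null_pairs))  # 0.25

rng = random.Random(0)
print(permute_labels(["d", "d", "not_d", "not_d"], rng))
```

In a full run, `null_pairs` would be filled by repeating steps 1-2 on 1,000 label permutations, and subnetworks with a p-value of at most 0.05 would be kept.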
RESULTS AND DISCUSSIONS

Datasets:
 Leukemia: Golub vs. Armstrong
 ALL: Ross vs. Yeoh
 DMD: Haslett vs. Pescatori
 Lung: Bhattacharjee vs. Garber

Performance comparison:
 Subnetwork overlap (with GSEA)
 Gene overlap (GSEA, SAM, t-test)

Other comparisons:
 Network size, gene validity with t-test
Subnetwork Overlap

Disease    Dataset 1      Dataset 2   SNET (overlap %)  GSEA (overlap %)  SNET (count)  GSEA (count)
Leukemia   Golub          Armstrong   83.33%            0%                20            0
ALL        Ross           Yeoh        47.63%            23.1%             10            6
DMD        Haslett        Pescatori   58.33%            55.6%             7             10
Lung       Bhattacharjee  Garber      90.90%            0%                9             0

Higher is better. Synthesized from Tables 1 and 2 in [2].
Gene Overlap

Disease    SNet     GSEA    T-Test (p < 0.05)  T-Test (top)  SAM (p < 0.05)  SAM (top)
Leukemia   91.30%   2.38%   73.01%             14.29%        49.96%          22.62%
ALL        93.01%   4.0%    60.20%             57.33%        81.25%          49.33%
DMD        69.23%   28.9%   49.60%             20.00%        76.98%          42.22%
Lung       51.18%   4.0%    65.61%             26.16%        65.61%          24.62%

Higher is better. Synthesized from Tables 3, 4 and 5 in [2].
Size of Subnetworks

                    T-Test (size of network)   SNet (size of network)
Disease    γ        2    3    4    5           5    6    7    >8
Leukemia   84       8    1    0    0           2    3    2    1
Subtype    75       5    1    1    1           1    0    1    6
DMD        45       3    1    0    0           1    0    0    5
Lung       65       3    2    1    0           5    3    0    1

Reconstructed from Table 6 in [2].
Validity
 Compare the genes in EACH subnetwork with those found by the t-test
 Around 70%-100% of the genes in each subnetwork also appear in the t-test results

Selected results (the full list is too large to present):

Subnetwork Name         Percentage   Subnetwork Name     Percentage
Leukaemia_B Cell-VAV1   81.82%       SNET_CTNNB1         100%
Leukaemia_UBC           100%         SNET_TNFSF10        60%
Leukaemia_RAC1          57.15%       SNET_PYGM           60%
DMD_RHOA                75%          DMD_ACTB            83.33%
DMD_SDC3                88.89%       Leukaemia_POU2F2    75.00%
MLLBCR_ACAA1            28.67%       BCR_T_RASA1         44.44%
MLLBCR_BLNK             72.73%       BCR_ABL1            75.00%
SNET_NOTCH3             100%         DMD_CALM1           80%

Selected from Tables 7, 8, 9 and 10 in [2].
CONCLUSIONS

Traditional methods give inconsistent results
across different datasets of the same disease

SNet uses biological insights to narrow the gap
 Gene-to-gene relationships
 Gene pathway knowledge

SNet shows better results than established algorithms
 More consistent
REFERENCES

[1] Zhang M, Zhang L, Zou J, Yao C, Xiao H, Liu Q, Wang J, Wang D, Wang C,
Guo Z. Evaluating reproducibility of differential expression discoveries in
microarray studies by considering correlated molecular changes.

[2] Soh D, Dong D, Guo Y, Wong L. Finding consistent
disease subnetworks across microarray datasets.