Guided Tutorial:
Using GSEA as an analytical tool for molecular profiling
Getting Started:
Create a GSEA workstation that works for you
When would you use GSEA?
What do you need to get started?
Basic workstaIon with Java
Expression files converted to .gct (GenePaNern)
Phenotype labels (any text editor)
Gene Sets (public repository or custom sets)
Desktop/Laptop running Mac OS X, Windows, or Linux:
Java 6 or 7 ( Desktop GUI or command line)
R
GenePa4ern ( Web, Desktop , or Server)
Getting Started:
Working with GSEA files
1. Sources of gene expression data (.gct files):
Microarrays (.cel files)
RNA-‐Seq (.fpkm_tracking file)
2. Phenotype labels (.cls files)
3. Gene Sets (MSigDB or custom):
PosiIonal
Curated
Pathway (KEGG, Biocarta, etc.)
Getting Started:
GCT file column structure
#1.2
23230 2 gene_short_name DescripIon Untreated_FPKM CisplaIn_FPKM
Mcm4 chr16:15623989-‐15637493 16.5908 33.9435
Mcm5 chr8:77633426-‐77652338 5.97553 16.5161
Mcm6 chr1:130228167-‐130256233 12.7255 28.2007
Mcm7 chr5:138605816-‐138613090 25.7281 49.4802
Getting Started:
Converting from microarray .cel expression to .gct file
Getting Started:
Converting from RNA-Seq expression genes.fpkm_tracking to .gct file
Getting Started:
Real World Example
GCT file structure:
#1.2
23230 2 gene_short_name DescripIon Untreated CisplaIn
GeneA na 16.5908 33.9435
GeneB na 5.97553 16.5161
GeneC na 12.7255 28.2007
GeneD na 25.7281 49.4802
Getting Started:
Creating a .cls file to label phenotypes
Getting Started:
Real World Example
GCT file structure:
#1.2
23230 2 gene_short_name DescripIon Untreated CisplaIn
GeneA na 16.5908 33.9435
GeneB na 5.97553 16.5161
GeneC na 12.7255 28.2007
GeneD na 25.7281 49.4802
CLS file structure:
2 2 1
# Untreated CisplaIn
Untreated CisplaIn
Running GSEA:
Analyzing cancer cell lines for p53 targets
Download and load the p53 datasets from the GSEA website: hNp://www.broadinsItute.org/gsea/datasets.jsp
P53_hgu95av2 (50 p53 MUT or WT cell lines analyzed on the HG_U95Av2 (Affy) Human
Genome U95 GeneChip:
#1.2
12625 50
NAME DescripIon 786-‐0 BT-‐549 … UACC-‐62 UO-‐31
100_g_at na 215.37 132.94 … 451.01 186.68
1000_at na 328.68 234.31 … 381.41 285.72
1001_at na 39.64
8.84 … 40.52
31.88
P53.cls (MUT vs. WT):
50 2 1
#MUT WT
MUT MUT … WT WT
Chip: HG_U95Av2.chip
Running GSEA:
Analyzing cancer cell lines for p53 targets
Running GSEA:
Analyzing cancer cell lines for p53 targets
GSEA Reports:
Analyzing cancer cell lines for p53 targets
GSEA Reports:
Analyzing cancer cell lines for p53 targets
GSEA Reports:
Analyzing cancer cell lines for p53 targets
Leading Edge Analysis:
Analyzing cancer cell lines for p53 targets
Leading Edge Analysis:
Analyzing cancer cell lines for p53 targets
Leading Edge Analysis:
Analyzing cancer cell lines for p53 targets
Browsing MSigDB:
Analyzing cancer cell lines for p53 targets
References
GSEA: hNp://www.broadinsItute.org/gsea/
GenepaNern: hNp://www.broadinsItute.org/cancer/sojware/genepaNern/
Friday, May 10, 13
•
ARACNE theory recap
•
Java GUI
•
Command line tool
Xi Zhao, PhD
Cancer Center for Systems Biology, Stanford University xi.zhao@stanford.edu
•
Genes do not function alone
Friday, May 10, 13
•
Construct gene interaction network from genomic data e.g Expression matrix sample (N) gene (p)
SIDW299104
SIDW380102
SID73161
GNAL
H.sapiensmRN
SID325394
RASGTPASE
SID207172
ESTs Pair-wise correlation matrix
SIDW469884
ESTs
SID471915
MYBPROTO
ESTsChr.1
SID377451
DNAPOLYME
SID375812
SIDW31489
SID167117
SIDW470459
SIDW487261
Homosapiens
SIDW376586
Chr
MITOCHONDR
SID47116
ESTsChr.6
SIDW296310
SID488017
SID305167
ESTsChr.3
SID127504
SID289414
PTPRC
SIDW298203
SIDW310141
SIDW376928
ESTsCh31
SID114241
SID377419
SID297117
SIDW201620
SIDW279664
SIDW510534
HLACLASSI
SIDW203464
SID239012
SIDW205716
SIDW376776
HYPOTHETIC
WASWiskott
SIDW321854
ESTsChr.15
SIDW376394
SID280066
ESTsChr.5
SIDW488221
SID46536
SIDW257915
ESTsChr.2
SIDW322806
ARACNE: Mutual Information
SID485148
SID297905
ESTs
SIDW486740
SMALLNUC
ESTs
SIDW366311
SIDW357197
SID52979
ESTs
SID43609
SIDW416621
ERLUMEN
TUPLE1TUP1
SIDW428642
SID381079
SIDW298052
SIDW417270
SIDW362471
ESTsChr.15
SIDW321925
SID380265
SIDW308182
SID381508
SID377133
SIDW365099
ESTsChr.10
SIDW325120
SID360097
SID375990
SIDW128368
SID301902
SID31984
SID42354
Post processing: e.g. correlation significance
Basso et al 2005
ARACNE: Data Processing Inequality
FIGURE 1.3.
DNA microarray data: expression ma-
Friday, May 10, 13
• detect non-linear dependencies between a pair of variables ( X, Y )
• remove weakest link (tends to be indirect interaction) in a triple
I
2
I
1
> I
2
> I
3 I
1
I
1
I
2
I
3 I
3
1. MI has low statistical power although detect non-linear dependencies
MI “has lower power than distant correlation, ... (it) is sometimes less powerful than Pearson correlation as well, the linear case being particularly worrisome.” (Simon and Tibshirani, 2011 comment on Reshef et al 2011)
2. Can easily turn into Hair balls: Difficulty in interpretation
Martin Krzywinski
Friday, May 10, 13
Friday, May 10, 13
•
Download
(http://wiki.c2b2.columbia.edu/califanolab/index.php/Software/ARACNE)
• aracne.zip (http://wiki.c2b2.columbia.edu/califanolab/download/ARACNE/aracne.zip):
•
Java JDK 1.5 or newer (compiling issues with 1.7 on Mac?)
•
Set $PATH variable
(on Mac)
• edit .bashrc file (e.g vi ~/.bashrc)
• edit launch_aracne.sh
•
Launch from terminal:
•
$cd <aracne folder>
$sh launch_aracne.sh
Friday, May 10, 13
•
Input file - expression matrix (.exp):
•
TAB-delimited text files optional
•
Output file - adjacency matrix file (.adj):
•
TAB-delimited text files
Friday, May 10, 13
•
Load “BCell_matrix.exp”
•
Name the output file ( optional )
•
Load existing adjacency file “.adj”( optional )
•
Filter by p value ( optional )
•
Filter by DPI tolerance ( optional ): 0~0.2
•
Construct network around hubs ( optional )
•
TF list ( optional )
•
Output: adjacency file “.adj”
Friday, May 10, 13
•
Network visualization
•
Network Browser -> Load -> Draw -> Cytoscape
Download ARACNE.src.tar.gz
ARACNE options: http://wiki.c2b2.columbia.edu/califanolab/images/9/92/Usage.txt
-i <file> Input gene expression profile dataset
-o <file> Output file name (optional)
-j <file> Existing adjacency matrix (.adj) file
-e <tolerance> DPI tolerance
-p <p-value> P-value for MI threshold (e.g. 1e-7)
$cd <ARACNE.src/ARACNE folder>
Construct network using all probes provided in BCell_matrix.exp
$./aracne2 -i ../../aracne/Data/ BCell_matrix.exp
Construct network around probes provided in myc_probes.txt
$./aracne2 -i ../../aracne/Data/ BCell_matrix.exp -s ../../aracne/Data/ myc_probes.txt
Load existing adjacency file .
adj with DPI & P-value filtering
$./aracne2 -i ../../aracne/Data/ BCell_matrix.exp -j ../../aracne/Data/ BCell_matrix_k0.139.adj
-p 0.001 –e 0.15
Friday, May 10, 13
Friday, May 10, 13
•
Basso, K., Margolin, A. a, Stolovitzky, G., Klein, U., Dalla-Favera, R., & Califano, A. (2005). Reverse engineering of regulatory networks in human B cells. Nature genetics, 37(4), 382-90. doi:10.1038/ ng1532
•
Margolin, A. a, Nemenman, I., Basso, K., Wiggins, C., Stolovitzky, G., Dalla Favera, R., & Califano, A.
(2006). ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC bioinformatics, 7 Suppl 1, S7. doi:10.1186/1471-2105-7-S1-S7
•
Margolin, A. a, Wang, K., Lim, W. K., Kustagi, M., Nemenman, I., & Califano, A. (2006). Reverse engineering cellular networks. Nature protocols, 1(2), 662-71. doi:10.1038/nprot.2006.106
•
Carro, M. S., Lim, W. K., Alvarez, M. J., Bollo, R. J., Zhao, X., Snyder, E. Y., Sulman, E. P., et al. (2010). The transcriptional network for mesenchymal transformation of brain tumours. Nature, 463(7279),
318-25. Nature Publishing Group. doi:10.1038/nature08712
•
SIMON, N., and TIBSHIRANI R.. "comment on “detecting novel associations in large data sets” by
Reshef et al, Science dec 16, 2011." Science (2011).
•
Reshef et al (2011) Detecting Novel Associations in Large Data Sets. Science: 1518-1524. [DOI:
10.1126/science.1205438]